Commit Graph

3407 Commits

Author SHA1 Message Date
reger
7889fc2389 Hack to prevent a Solr issue on partial update of a document containing a multivalued date field
(regardless of whether these fields are part of the update).
Switch the partial update option off in postprocessing if the schema contains *_dts (multivalued date field).
see http://mantis.tokeek.de/view.php?id=601
2015-09-13 20:23:15 +02:00
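A minimal sketch of the check described in 7889fc2389, assuming the schema is available as a plain collection of field names. The class and method names are hypothetical and are not taken from the YaCy sources; only the *_dts suffix comes from the commit message.

```java
import java.util.Collection;

class PartialUpdateGuard {
    // Hypothetical helper: true if any schema field name ends in "_dts"
    // (the suffix the commit message uses for multivalued date fields).
    static boolean hasMultiValuedDateField(Collection<String> fieldNames) {
        for (String name : fieldNames) {
            if (name.endsWith("_dts")) return true;
        }
        return false;
    }

    // Partial update stays enabled only when no *_dts field is in the schema.
    static boolean partialUpdateAllowed(Collection<String> fieldNames) {
        return !hasMultiValuedDateField(fieldNames);
    }
}
```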
reger
b4cbdea1e7 adapt SolrServerConnector.add to handle error on partial update input document.
In case of error we deleted the original document and added the new doc to the index.
This is not valid for partial update documents (which contain only a subset of the fields).
Remove the "delete" error handling step.
2015-09-13 20:19:50 +02:00
reger
98ab655917 on reindex delete index document with invalid url
if discovered
2015-09-12 23:06:13 +02:00
reger
1e8369e18b use a parsed date in Document.toString 2015-09-12 22:00:40 +02:00
luccioman
199b2ce52d Translator refactoring: to simplify writing locale files, process keys
as simple strings and no longer as regular expressions.
Updated all locale files to adapt to the refactored Translator: removed
useless escaped characters and made minor corrections.
Performed minor syntax corrections on some html source files.
Added a util to translate all html source files with all locales
without launching the full YaCy application.
Corrected main arguments parsing in other translation utils.
2015-09-11 17:20:11 +02:00
luccioman
4dd9c0d5d9 Merge from main repository 2015-09-08 08:54:48 +02:00
reger
3428b6f13b improve filtering by filetype navigator.
The used url-filter for filetype doesn't require ".ext" resulting in too many matches,
add a sort-out filter for RWI results.
2015-09-07 02:36:22 +02:00
reger
e37a4f0b3d prevent metadata records in index w/o valid url
by throwing MalformedURL exception on URIMetadataNode creation
2015-09-06 22:19:05 +02:00
reger
41c4eade51 extract modification date from vCard (vcfParser) 2015-09-06 04:28:27 +02:00
reger
8768896975 extract lastmodified from openoffice doc
set lastmod date in office document parsers
2015-09-06 00:04:54 +02:00
Michael Peter Christen
c40c302748 when many crawl queues are generated, this NPE can occur; probably
caused by a concurrency issue:
W 2015/09/05 14:09:10 ConcurrentLog java.lang.NullPointerException
java.lang.NullPointerException
	at java.util.TreeMap.rotateRight(TreeMap.java:2239)
	at java.util.TreeMap.fixAfterInsertion(TreeMap.java:2271)
	at java.util.TreeMap.put(TreeMap.java:582)
	at net.yacy.kelondro.table.Table.<init>(Table.java:235)
	at net.yacy.crawler.HostQueue.openStack(HostQueue.java:229)
	at net.yacy.crawler.HostQueue.getStack(HostQueue.java:204)
	at net.yacy.crawler.HostQueue.push(HostQueue.java:397)
	at net.yacy.crawler.HostBalancer.push(HostBalancer.java:237)
	at net.yacy.crawler.data.NoticedURL.push(NoticedURL.java:184)
	at net.yacy.crawler.CrawlStacker.stackCrawl(CrawlStacker.java:355)
	at net.yacy.crawler.CrawlStacker.job(CrawlStacker.java:134)
	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at net.yacy.kelondro.workflow.InstantBlockingThread.job(InstantBlockingThread.java:101)
	at net.yacy.kelondro.workflow.AbstractBlockingThread.run(AbstractBlockingThread.java:82)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2015-09-05 14:12:17 +02:00
reger
367fe388b9 fix exception throw after sendError in DefaultServlet
- reduce debug exception logs in crawler
2015-09-05 01:57:30 +02:00
luccioman
9752bd5f88 Added utils to help translation without launching full YaCy application:
- translate all source files with a locale
- list all non translated files with a locale
2015-09-04 13:44:44 +02:00
luccioman
2f0f0180e2 Added a function to list files recursively. 2015-09-04 13:42:57 +02:00
luccioman
7e4c1d2282 Translator refactoring :
- deleted useless new StringBuilder allocation
- use of a new reusable FileNameFilter
- added javadoc
2015-09-04 13:42:10 +02:00
reger
802ccaead6 fix init of error cache, use latest faildates => load_date_dt 2015-09-02 02:36:31 +02:00
reger
dba7f15073 apply same size constraint on result image from doc
as for linked images
see 19f1308bf0
2015-09-01 23:22:48 +02:00
reger
4cf875336c complete TODO: getFileExtension handle dot in query part
+ testcase
2015-08-31 23:28:03 +02:00
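The behaviour named in 4cf875336c (a dot in the query part must not be mistaken for a file extension) can be sketched as follows. This is an illustration only: the class name and the choice to take a path-and-query string as input are assumptions, and it is not the MultiprotocolURL implementation.

```java
import java.util.Locale;

class FileExtensionSketch {
    // Returns the lower-case extension of the last path segment, or "" if none.
    // Input is assumed to be the path plus optional query/fragment, e.g. "/a/b.html?x=1.2".
    static String getFileExtension(String pathAndQuery) {
        int end = pathAndQuery.length();
        int q = pathAndQuery.indexOf('?');
        if (q >= 0) end = q;                       // cut off the query part
        int h = pathAndQuery.indexOf('#');
        if (h >= 0 && h < end) end = h;            // cut off the fragment
        String path = pathAndQuery.substring(0, end);
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // no dot, dot in an earlier segment, or trailing dot: no extension
        if (dot < 0 || dot < slash || dot == path.length() - 1) return "";
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```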
sixcooler
87e4abe393 fight the fieldcache by using DocValues: in Solr-5.x the fieldcache has
moved and is no longer cleared. This results in a huge fieldcache.
(http://lucene.apache.org/#highlights-of-the-lucene-release-include
https://issues.apache.org/jira/browse/LUCENE-5666)
Here I try to use DocValues where it is possible.
For this I used the Api-Scheme as the new basis for the Solr-Schema.
This needs at least a complete optimization of the Solr-Index to get a
smaller FieldCache.
Everything that is indexed with these settings will not use the
FieldCache at all.
2015-08-31 20:24:41 +02:00
reger
eaf0e8ff2c start recording/indexing pixel size for image document
as for linked images
2015-08-31 01:58:36 +02:00
reger
c33229fc0c check mime prior to ext for metadata modification for images 2015-08-30 23:02:19 +02:00
reger
19f1308bf0 enforce the result images limit to > 16x16px
for linked images
http://mantis.tokeek.de/view.php?id=594
2015-08-30 02:19:52 +02:00
reger
0e4ba0360b fix NPE on .yacyh result url of disconnected peer
(cleanup yacyshare remaining)
2015-08-25 23:26:17 +02:00
reger
7ed812a2bf log missing seed.port
in favour of exception to prevent repeating throws
2015-08-25 02:19:00 +02:00
reger
206883f80d fix: Preserve protocol in url proxy
to connect to http/https. Display warning if https target is viewed over http
2015-08-25 01:16:41 +02:00
reger
f7b0b3b7b3 avoid runtime exception by earlier testing for seed.ip=null 2015-08-23 23:01:20 +02:00
Michael Peter Christen
906b5fd742 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-08-11 00:42:46 +02:00
Michael Peter Christen
8f90767889 fix for filesystem crawl 2015-08-11 00:42:26 +02:00
sixcooler
a3dd4be749 added / corrected charset to be 1.7 compatible.
@Orbiter: please check if this is ok for you
2015-08-10 20:53:20 +02:00
Michael Peter Christen
8028410ab7 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-08-10 14:27:53 +02:00
Michael Peter Christen
df3314ac1a added a new facet type based on a probabilistic classifier using
bayesian filters. This can be used to classify documents during
indexing-time using a predefined bayesian filter.

New wordings:
- a context is a class where different categories are possible. The
context name is equal to a facet name.
- a category is a facet type within a facet navigation. Each context
must have several categories, at least one custom name (things you want
to discover) and one with the exact name "negative".

To use this, you must do:
- for each context, you must create a directory within
DATA/CLASSIFICATION with the name of the context (the facet name)
- within each context directory, you must create one text file for every
category, with one document per line. One of these categories MUST
have the name 'negative.txt'.

Then, each new document is classified to match within one of the given
categories for each context.
2015-08-10 14:27:44 +02:00
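The directory layout described in df3314ac1a can be sketched as below. The class and method names are hypothetical, and the caller is assumed to pass the path of the DATA directory; only the CLASSIFICATION/<context>/<category>.txt structure and the one-document-per-line rule come from the commit message.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class ClassificationLayoutSketch {
    // Writes <dataDir>/CLASSIFICATION/<context>/<category>.txt, one document per line.
    static Path writeCategory(Path dataDir, String context, String category,
                              List<String> documents) throws IOException {
        Path contextDir = dataDir.resolve("CLASSIFICATION").resolve(context);
        Files.createDirectories(contextDir);
        Path file = contextDir.resolve(category + ".txt");
        return Files.write(file, documents, StandardCharsets.UTF_8);
    }
}
```

Each context would need at least two calls: one for a custom category and one with the category name "negative".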
reger
1409cabe8b exclude more default search fields from text copy to text_t
for metadata index documents
2015-08-09 21:01:30 +02:00
reger
e2e73258ca remove obsolete interface SearchAccumulator
and unused SRURSSConnector Thread inheritance
2015-08-08 18:35:49 +02:00
Michael Peter Christen
dbbad23e12 removed warnings 2015-08-03 05:37:34 +02:00
Michael Peter Christen
500cfa9457 enhanced logging 2015-08-03 05:17:22 +02:00
Michael Peter Christen
c14bc8d9b7 revert of fq transformation (recent fix) 2015-08-03 05:15:34 +02:00
Michael Peter Christen
203df5a750 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-08-03 05:02:26 +02:00
reger
fa08ca207e ! finish running crawls before applying !
Allow crawl urls up to 2048 characters
fix for http://mantis.tokeek.de/view.php?id=575
2015-08-03 00:49:24 +02:00
reger
ee77f24e52 use some more declared HeaderFramework constants 2015-08-02 22:56:14 +02:00
Michael Peter Christen
11a848da5a Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-08-02 14:53:36 +02:00
Michael Peter Christen
b94bd7f20a a collection of search query enhancements:
- fixed superfluous space in query field list
- fixed filter query logic
- removed look-ahead query which caused each new search page to
submit two solr queries
- fixed random solr result orders in case that the solr score was equal:
the results were then re-ordered by YaCy using the document hash which came from
the solr object and that appeared to be random. Now the hash of the url
is used and the score is additionally modified by the url length to
prevent this particular case from appearing at all.
2015-08-02 14:52:41 +02:00
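The idea behind the last point of b94bd7f20a, a deterministic order for results with equal score, can be illustrated with a comparator. This sketch is an assumption throughout: it compares the URL text instead of YaCy's url hash, and it uses the url length as a tie-breaker rather than folding it into the score as the commit does. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class StableResultOrderSketch {
    static final class Hit {
        final String url;
        final double score;
        Hit(String url, double score) { this.url = url; this.score = score; }
    }

    // Higher score first; on equal score prefer the shorter url, then compare the url text.
    static final Comparator<Hit> ORDER = Comparator
        .comparingDouble((Hit h) -> -h.score)
        .thenComparingInt(h -> h.url.length())
        .thenComparing(h -> h.url);

    // Returns a sorted copy; the input list is left untouched.
    static List<Hit> sort(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(ORDER);
        return copy;
    }
}
```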
reger
dbe2594c38 replace deprecated myPublicLocalIP() in AbstractRemoteHandler 2015-08-02 00:53:49 +02:00
reger
6d3534e725 remove unused Transmission hit counter 2015-08-02 00:20:14 +02:00
reger
cb67eb7baf use more absolute path for config file opening
as suggested in pull request 5 (https://github.com/yacy/yacy_search_server/pull/5)
2015-08-01 23:54:26 +02:00
Michael Peter Christen
1ccbf739b1 added bayes filter from Philipp Nolte, originally taken from
https://github.com/ptnplanet/Java-Naive-Bayes-Classifier
and modified inside the loklak.org project. After optimization in loklak
it was inserted into the net.yacy.cora.bayes package. It shall be used
to create custom search navigation filters.

The original copyright notice was copied from the README.md from
https://github.com/ptnplanet/Java-Naive-Bayes-Classifier/blob/master/README.md
The original package domain was
de.daslaboratorium.machinelearning.classifier
2015-07-30 14:10:31 +02:00
Michael Peter Christen
1bced1ae60 using latest enhanced (un/)gzip methods from loklak for yacy 2015-07-30 13:39:10 +02:00
Michael Peter Christen
3e6657288d Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-07-30 03:39:11 +02:00
Michael Peter Christen
de8cfbe1d7 added export option to export the fulltext of the search index text only 2015-07-30 03:21:40 +02:00
reger
2fb6ebe88a move the java environment parameter setting that disables SNI (Server Name Indication) support for https connections from code to the startup script, allowing the admin to easily and transparently alter the YaCy default FALSE setting.
Background: some users report problems with connecting to/crawling some sites via https which require SNI support (switched off by default in YaCy). On the other hand, systems not demanding SNI support are sometimes not properly configured, and due to a bug/feature in java 1.7 the connection is aborted. The latter is more often the case, so the default is still fine. With the java start parameter, expert users can now alter the start parameter to -Djsse.enableSNIExtension=true (java default) if they crawl more hosts requiring SNI support.
The alternative of letting YaCy try both during the https handshake (deep inside the httpclient) is not pursued at this time.
2015-07-29 23:30:05 +02:00
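For orientation on 2fb6ebe88a: the -Djsse.enableSNIExtension start parameter sets a JVM system property, which can be inspected at runtime roughly as below. The class name is hypothetical; the fallback value "true" mirrors the java default mentioned in the commit message.

```java
class SniSettingSketch {
    // Reads the JVM-wide SNI switch; when the property is not set, the java default (enabled) applies.
    static boolean sniEnabled() {
        return Boolean.parseBoolean(System.getProperty("jsse.enableSNIExtension", "true"));
    }
}
```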
Michael Peter Christen
fbeae20b3a try a healing of the cache if the index file is corrupted 2015-07-27 15:16:08 +02:00
Michael Peter Christen
03ea723889 added log lines for query performance profiling 2015-07-27 15:03:13 +02:00
Michael Peter Christen
0e87a99ab8 more fixes for special windows paths 2015-07-10 17:34:29 +02:00
Michael Peter Christen
e5b6424eed patch for bad windows file paths 2015-07-10 17:14:14 +02:00
Michael Peter Christen
0aa6fcf259 remove old vocabularies and synonyms before adding new 2015-07-10 16:47:19 +02:00
Michael Peter Christen
289018b559 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-07-08 17:37:03 +02:00
Michael Peter Christen
7b412e8c07 added msg (text emails) format; should be handled by html parser. 2015-07-08 17:36:37 +02:00
reger
f91298d3b6 fix one implicit Integer/Long type conversion
-> causes Java 1.8 compile error
2015-07-08 03:02:10 +02:00
reger
821262a179 add CommonPattern for multiple spaces
to eliminate empty split words on following spaces
2015-07-04 22:49:01 +02:00
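The effect described in 821262a179 can be shown with two patterns: splitting on a single space yields empty words when spaces follow each other, while a pattern for one or more spaces does not. The class and constant names are hypothetical and not the actual CommonPattern members.

```java
import java.util.regex.Pattern;

class SpaceSplitSketch {
    static final Pattern SINGLE_SPACE = Pattern.compile(" ");
    static final Pattern MULTI_SPACE = Pattern.compile(" +");

    // Splits on runs of spaces so that consecutive spaces produce no empty words.
    static String[] splitWords(String text) {
        return MULTI_SPACE.split(text.trim());
    }
}
```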
Michael Peter Christen
90f75c8c3d added enrichment of synonyms and vocabularies for imported documents
during surrogate reading: those attributes from the dump are removed
during the import process and replaced by new detected attributes
according to the setting of the YaCy peer.
This may cause all such attributes to be removed if the importing
peer has no synonyms and/or no vocabularies defined.
2015-07-02 00:23:50 +02:00
Michael Peter Christen
7829480b82 refactoring: separated condenser and tokenizer 2015-07-01 18:28:18 +02:00
Michael Peter Christen
593de05922 enhanced surrogate import process speed (dramatically!) 2015-06-29 12:28:34 +02:00
Michael Peter Christen
3c4c69adea fix for
- bad regex computation for crawl start from file (limitation on domain
did not work)
- servlet error when starting crawl from a large list of urls
2015-06-29 02:02:01 +02:00
Michael Peter Christen
1fec7fb3c1 suppress access to solr when doing search suggestions in case that the
index has more than two million documents. This protects the index from
being flooded with search requests that cannot be resolved before the
real search query has to be computed.
2015-06-24 13:02:12 +02:00
Michael Peter Christen
694b22f165 migration to Solr 5.2: huge benefits - this is a lot faster!
This is a very complex migration: many classes had been renamed or
removed, dependencies changed and the solr index type is now aligned to
be a solr cloud repository.
Together with the Solr 5.2 library update, one other dependent library
had been updated as well: httpclient 4.4->4.4.1

Older indexes are migrated from 4_10 to 5_2. However, the new index
structure is more efficient and we recommend re-indexing everything.
Before you do the update, please use the index export to write a large
surrogate xml file. After the update, start with an empty index and then
initialize this with your dump.
2015-06-24 01:55:51 +02:00
sixcooler
e427efbe54 Next try for a fix for upload-connection staying in blocked state.
This was caused by reading via GZIP from a close-wait connection and caused
high cpu- and system-loads.
Instead of implementing handling of the RedListener, I now found that a
time-limited 'get' "really" solves this problem.
2015-06-14 22:56:26 +02:00
reger
0fab445b19 Resourceobserver log warning - deleting releases files - only on actual deletes
instead of entering routine
2015-06-10 02:35:37 +02:00
sixcooler
ef6a64b2a4 Fix for upload-connection staying in blocked state.
This was caused by reading via GZIP from a close-wait connection and caused
high cpu- and system-loads.
Solved by implementing handling of the RedListener.
2015-06-09 21:26:10 +02:00
reger
c973f94936 add log entry on release file delete by ResourceObserver 2015-06-08 03:17:12 +02:00
reger
121972752c implement deleteOldDownloads in ResourceObserver on low diskspace
- direct assign sb.observer (skip redundant InitThread)
2015-06-08 02:52:13 +02:00
Michael Peter Christen
9c12555be5 added link to Snapshots in search results if the snapshot exists and
option is set in ConfigSearchPage_p
(this is a stub: we also need a visualization of pdf files!)
2015-06-07 20:37:37 +02:00
reger
72f6a0b0b2 enhance recrawl job
- allow to modify the query to select documents to  process (after job has started)
- allow to include failed urls (httpstatus <> 200)
2015-06-06 18:45:39 +02:00
reger
7478338a40 remove augmented parsing activation from frontend
experimental implementation not used and based on error prone experimental rdfaparser
2015-06-05 00:51:00 +02:00
reger
11aa2edfe1 remove RDFa parser activation from frontend
reason: experimental implementation of RDFa parser not executed (limited to special urls) but may cause error on normal html parsing due to an inputstream.reset
2015-06-05 00:15:16 +02:00
reger
49b79987c9 remove obsolete searchfl work table
was used to register urls with not complete words in snippet but is never accessed
2015-06-04 22:44:01 +02:00
Michael Peter Christen
d0aff91f23 fix for index import 2015-06-01 01:56:09 +02:00
Michael Peter Christen
34de1e8cbc gzip compression will perform more efficiently and with a better compression
level
2015-06-01 01:24:33 +02:00
Michael Peter Christen
98be59ce9c full solr xml exports will now be automatically compressed during
export. That makes it possible to export a solr xml dump even if disc
space is low.
2015-05-30 19:02:54 +02:00
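Compressing an export while it is written, as described in 98be59ce9c, can be sketched with the standard java.util.zip streams. This is a generic round-trip example with hypothetical names, working on in-memory data rather than on YaCy's export files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipExportSketch {
    // Compresses the given xml text; closing the gzip stream writes the trailer.
    static byte[] compress(String xml) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Reads the compressed bytes back into text.
    static String decompress(byte[] data) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) out.write(chunk, 0, n);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```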
Michael Peter Christen
a1a8edfc0a wrap HeapReader close() in a catch Throwable block to prevent an
exception during close from blocking the whole shutdown process
2015-05-30 17:54:02 +02:00
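The pattern named in a1a8edfc0a, swallowing anything thrown by close() so that shutdown can continue, looks roughly like this. The helper is hypothetical and generic; it is not the code from the commit.

```java
class SafeCloseSketch {
    // Closes the resource and reports success; any Throwable from close() is contained here
    // so the caller's shutdown sequence can proceed.
    static boolean closeQuietly(AutoCloseable resource) {
        try {
            resource.close();
            return true;
        } catch (Throwable t) {
            return false;
        }
    }
}
```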
Michael Peter Christen
b43811d38c added surrogate import process for exported solr dumps.
Just throw your solr dump file into DATA/SURROGATES/in/ and it will be
imported!
2015-05-30 13:19:59 +02:00
Michael Peter Christen
b77537294d prevent disc usage when showing tray animation 2015-05-30 06:57:15 +02:00
Michael Peter Christen
eec78e1b0c added intensity option to graphics 2015-05-30 06:31:08 +02:00
Michael Peter Christen
a5007f345e re-licensing some of my old visualization classes under LGPL 2.1 2015-05-30 06:12:08 +02:00
Michael Peter Christen
c99a665593 adding a 3-pixel font generator made some time ago.. 2015-05-30 06:01:52 +02:00
Michael Peter Christen
c7576d6028 added a full solr export to the IndexControlURLs_p.html servlet. The
export function is also now the default export option. The export file
format for a full solr export is very similar to a solr search result
xml, only the <lst name="responseHeader"> tag is missing.

The exported xml has a special line termination feature: all documents
will be exported into a single line without any CR in between. That
means that every document is completely inside a single line. While this
is not readable at all for humans, it is very useful for linux line
processing scripts, like grep. Using grep it will be easy to select
single documents which match for a given pattern.

Such dumps shall be importable with the DATA/SURROGATES/in import
function, but that import is not yet adapted to the new file format.
2015-05-29 15:05:52 +02:00
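Because c7576d6028 puts every document on a single line, a dump can be filtered line by line, much like grep. A minimal sketch with hypothetical names, assuming the dump is read as text:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DumpLineFilterSketch {
    // Returns every line (one exported document each) in which the regex is found.
    static List<String> matchingLines(Reader dump, String regex) throws IOException {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(dump)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (pattern.matcher(line).find()) hits.add(line);
            }
        }
        return hits;
    }
}
```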
Michael Peter Christen
197f7449e5 All entities of crawl profiles are now editable in the crawl profile
editor.
2015-05-28 16:07:40 +02:00
reger
1d8e1e4bac - Image search expand box, adjust javascript hs padtominsize parameter, to make sure expand box doesn't shrink on small images
- ensure ImageResult.imagetext has a value for the link text (use filename if no alt text given)
2015-05-27 02:31:13 +02:00
reger
8b35656007 remove hard throw exception in makeResultEntry
remove not used "share." peername.yacy url rewrite
2015-05-26 23:57:06 +02:00
reger
af57fbefad use available mime (instead null) on imageresult from metadatanode 2015-05-26 23:54:04 +02:00
reger
dd7782bac0 revert deletion of BinSearch
(accident)
2015-05-26 04:26:26 +02:00
reger
000dde9511 Eliminate duplication of values for search ResultEntry
by instantiation from URIMetadataNode, by eliminating differentiation of ResultEntry/URIMetadataNode.
- moved remaining ResultEntry functionality to URIMetadataNode
   - for 1:1 functionality added a function makeResultEntry()
- removed ResultEntry
- refactored related code

Main difference is after makeResultEntry the text_t content is removed and alternative title/url strings for display are calculated.


Main difference left is, that
2015-05-26 04:15:00 +02:00
reger
29c4aa3991 fix compiler notification of missing serialID
from last commit
2015-05-25 21:51:32 +02:00
reger
3d53da8236 refactor ResultEntry to be based on MetadataNode/SolrDocument
to share/reuse common access routines
2015-05-25 21:28:48 +02:00
reger
d882991bc5 Implement sharing of ioDispatcher for term & citation index
as proposed in ioDispatcher description
2015-05-25 19:46:26 +02:00
reger
370ba9da71 On imageSearch prefer mime to sort out non-image documents
Generalize the hack to prevent urls with just an img extension being returned

improving http://mantis.tokeek.de/view.php?id=528
2015-05-24 21:48:58 +02:00
reger
cd31633369 improve MultiprotocolURL.getFileExtension()
prevent string OOB while querypart contains a dot (return just "")
see log snippet in http://mantis.tokeek.de/view.php?id=533
2015-05-24 19:38:04 +02:00
reger
c60ccdfbcf Increase IODispatcher dumpQueue size to 2 to reduce risk of concurrent emergency dump,
skip concurrent emergency merge
dealing with/see http://mantis.tokeek.de/view.php?id=566
2015-05-24 18:03:27 +02:00
reger
8a9622c31c fix string OoB on getImagelinks with long alttext
in description calculation
2015-05-24 01:59:40 +02:00
reger
3e742d1e34 Init remote crawler on demand
If remote crawl option is not activated, skip init of remoteCrawlJob to save the resources of queue and idling thread.
Deployment of the remoteCrawlJob is deferred until activation of the option.
2015-05-23 02:06:39 +02:00
reger
13f013f64a Limit extra sleep of BusyThread on LowMemCycle 2015-05-17 06:21:12 +02:00
reger
cd7c0e0aae detail optimization of RecrawlThread 2015-05-17 00:13:00 +02:00