Commit Graph

810 Commits

Author SHA1 Message Date
Michael Peter Christen
d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	.classpath
2015-11-30 13:34:10 +01:00
reger
1160b13172 remove unused md5 from ViewFile servlet params 2015-11-28 23:09:15 +01:00
reger
b2c8bc0ae6 remove md5_s from default index fields
it is not assigned a value / not used
Due to above also excluded from transfer protocol.
2015-11-27 02:41:02 +01:00
luc
5bbb2e1730 Ensure resource is closed when reading a full file InputStream 2015-11-18 10:08:06 +01:00
reger
7d0d19cb8e avoid File.deleteOnExit() on temp files
JVM registers each file in a list regardless of already deleted and never
cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir 
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
2015-11-17 22:27:07 +01:00
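The approach described in this commit can be sketched roughly as follows. This is a minimal illustration, not YaCy's actual code: the class name, the directory name `yacy-tmp-sketch`, and the helper methods are all hypothetical. The sketch passes the subdirectory explicitly when creating temp files instead of relying on a changed `java.io.tmpdir` property, since changing that property at runtime is not guaranteed to affect JDK classes that have already been initialized.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

class TempDirSketch {

    // Create (or reuse) a private subdirectory below java.io.tmpdir.
    static Path createTmpSubdir(String name) {
        try {
            Path base = Paths.get(System.getProperty("java.io.tmpdir"));
            return Files.createDirectories(base.resolve(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Create a temp file inside the subdirectory; deleteOnExit() is not used.
    static Path newTempFile(Path dir) {
        try {
            return Files.createTempFile(dir, "sketch", ".tmp");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Remove the whole subdirectory, deepest entries first.
    static void deleteRecursively(Path dir) {
        if (!Files.exists(dir)) return;
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = createTmpSubdir("yacy-tmp-sketch"); // hypothetical name
        // One shutdown hook for the directory instead of one registry entry per file.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> deleteRecursively(dir)));
        Path file = newTempFile(dir);
        System.out.println("created " + file);
    }
}
```

The design point is that the cleanup cost no longer grows with the number of temp files created during a long crawl: a single hook removes the directory tree on shutdown.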
reger
02e4489a23 set tmpfile.deleteOnExit by default,
to make sure files are removed on shutdown.
2015-11-16 21:37:45 +01:00
reger
ca3d26a401 harmonize wordsintitle & CollectionSchema.title_words_val calculation,
remove obsolete partial init of wordreference from urimetadata
2015-11-15 06:06:37 +01:00
sixcooler
d3b9349b6f simplification / speedup of GenerationMemoryStrategy 2015-11-10 20:39:46 +01:00
luc
c38d6c1f37 Correction for mantis 535: inurl: parameter doesn't work on URLs with
upper-case letters
2015-09-23 21:01:51 +02:00
reger
3f2b8ab5e5 optionally include mime in p2p url exchange string
if doctype decodes to ambiguous mime and default conversion is not equal to original
2015-09-22 00:12:31 +02:00
reger
e37a4f0b3d prevent metadata records in index w/o valid url
by throwing MalformedURL exception on URIMetadataNode creation
2015-09-06 22:19:05 +02:00
Michael Peter Christen
c40c302748 when many crawl queues are generated, this NPE can occur; probably
caused by a concurrency issue:
W 2015/09/05 14:09:10 ConcurrentLog java.lang.NullPointerException
java.lang.NullPointerException
	at java.util.TreeMap.rotateRight(TreeMap.java:2239)
	at java.util.TreeMap.fixAfterInsertion(TreeMap.java:2271)
	at java.util.TreeMap.put(TreeMap.java:582)
	at net.yacy.kelondro.table.Table.<init>(Table.java:235)
	at net.yacy.crawler.HostQueue.openStack(HostQueue.java:229)
	at net.yacy.crawler.HostQueue.getStack(HostQueue.java:204)
	at net.yacy.crawler.HostQueue.push(HostQueue.java:397)
	at net.yacy.crawler.HostBalancer.push(HostBalancer.java:237)
	at net.yacy.crawler.data.NoticedURL.push(NoticedURL.java:184)
	at net.yacy.crawler.CrawlStacker.stackCrawl(CrawlStacker.java:355)
	at net.yacy.crawler.CrawlStacker.job(CrawlStacker.java:134)
	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at net.yacy.kelondro.workflow.InstantBlockingThread.job(InstantBlockingThread.java:101)
	at net.yacy.kelondro.workflow.AbstractBlockingThread.run(AbstractBlockingThread.java:82)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2015-09-05 14:12:17 +02:00
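The stack trace above ends inside `TreeMap.put`, and `java.util.TreeMap` is documented as not thread-safe. The commit message does not say how the problem was addressed, so the following is only a generic sketch of one common way to rule out this class of failure: routing all access to a shared `TreeMap` through one monitor. The class and method names are hypothetical.

```java
import java.util.TreeMap;

class GuardedMapSketch {
    private final TreeMap<Integer, String> map = new TreeMap<>();

    // All access goes through the same monitor, so tree rebalancing
    // in one thread never interleaves with a put() in another.
    synchronized void put(int key, String value) { map.put(key, value); }

    synchronized int size() { return map.size(); }

    // Fill the map from several threads and return the final size.
    static int fillConcurrently(int threads, int perThread) {
        GuardedMapSketch index = new GuardedMapSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int offset = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) index.put(offset + i, "v");
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return index.size();
    }

    public static void main(String[] args) {
        // Keys are distinct across threads, so 4 * 1000 entries are expected.
        System.out.println(fillConcurrently(4, 1000));
    }
}
```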
luccioman
2f0f0180e2 Added a function to list files recursively. 2015-09-04 13:42:57 +02:00
reger
0e4ba0360b fix NPE on .yacyh result url of disconnected peer
(cleanup yacyshare remaining)
2015-08-25 23:26:17 +02:00
Michael Peter Christen
dbbad23e12 removed warnings 2015-08-03 05:37:34 +02:00
Michael Peter Christen
b94bd7f20a a collection of search query enhancements:
- fixed superfluous space in query field list
- fixed filter query logic
- removed look-ahead query which caused that each new search page
submitted two solr queries
- fixed random solr result orders in case that the solr score was equal:
this was then re-ordered by YaCy using the document hash which came from
the solr object and that appeared to be random. Now the hash of the url
is used and the score is additionally modified by the url length to
prevent that this particular case appears at all.
2015-08-02 14:52:41 +02:00
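The last point, a deterministic order for results with equal scores, can be illustrated with a small comparator. This is a sketch under stated assumptions: the `Hit` class is hypothetical, `String.hashCode` stands in for YaCy's URL hash, and the way the URL length modifies the score (scaled score minus length, so shorter URLs rank slightly higher) is an assumption, since the commit does not state the formula.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TieBreakSketch {

    static final class Hit {
        final String url;
        final long solrScore;

        Hit(String url, long solrScore) {
            this.url = url;
            this.solrScore = solrScore;
        }

        // Assumption: scale the score and subtract the URL length, so that
        // equal Solr scores rarely stay equal after the adjustment.
        long adjustedScore() { return solrScore * 10_000L - url.length(); }
    }

    // Higher adjusted score first; remaining ties are broken by a hash of the
    // URL (String.hashCode as a stand-in), then by the URL text itself.
    static int compare(Hit a, Hit b) {
        int byScore = Long.compare(b.adjustedScore(), a.adjustedScore());
        if (byScore != 0) return byScore;
        int byHash = Integer.compare(a.url.hashCode(), b.url.hashCode());
        return byHash != 0 ? byHash : a.url.compareTo(b.url);
    }

    static List<String> rankedUrls(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(TieBreakSketch::compare);
        List<String> urls = new ArrayList<>();
        for (Hit h : sorted) urls.add(h.url);
        return urls;
    }

    public static void main(String[] args) {
        Hit a = new Hit("http://example.org/a", 5);
        Hit b = new Hit("http://example.org/b", 5);
        // The ranking does not depend on the order in which the hits arrive.
        System.out.println(rankedUrls(Arrays.asList(a, b)).equals(rankedUrls(Arrays.asList(b, a))));
    }
}
```

Because the tie-break depends only on the URL, the same query returns the same order on every page load, which is the property the commit describes.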
Michael Peter Christen
34de1e8cbc gzip compression will now run more efficiently and with a better
compression level
2015-06-01 01:24:33 +02:00
Michael Peter Christen
a1a8edfc0a wrap HeapReader close() in a catch Throwable block to prevent an
exception during close from blocking the whole shutdown process
2015-05-30 17:54:02 +02:00
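The pattern named in this commit can be sketched as a small helper. The class and method names are hypothetical and the sketch works on any `AutoCloseable`, not on YaCy's reader class specifically.

```java
class QuietCloseSketch {

    // Close the resource and swallow anything it throws, so that a failing
    // close() cannot abort the remaining shutdown steps.
    static boolean closeQuietly(AutoCloseable resource) {
        try {
            resource.close();
            return true;
        } catch (Throwable t) {
            System.err.println("close failed: " + t);
            return false;
        }
    }

    public static void main(String[] args) {
        AutoCloseable broken = () -> { throw new IllegalStateException("disk gone"); };
        AutoCloseable fine = () -> { };
        System.out.println(closeQuietly(broken)); // false, but no exception escapes
        System.out.println(closeQuietly(fine));   // true
    }
}
```

Catching `Throwable` also swallows `Error`s, which is usually discouraged; during shutdown the trade-off is arguably acceptable because the alternative is leaving later resources unclosed.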
reger
8b35656007 remove hard throw exception in makeResultEntry
remove not used "share." peername.yacy url rewrite
2015-05-26 23:57:06 +02:00
reger
dd7782bac0 revert deletion of BinSearch
(accident)
2015-05-26 04:26:26 +02:00
reger
000dde9511 Eliminate duplication of values for search ResultEntry
by instantiation from URIMetadataNode, by eliminating differentiation of ResultEntry/URIMetadataNode.
- moved remaining ResultEntry functionality to URIMetadataNode
   - for 1:1 functionality added a function makeResultEntry()
- removed ResultEntry 
- refactored related code

Main difference is after makeResultEntry the text_t content is removed and alternative title/url strings for display are calculated.


Main difference left is, that
2015-05-26 04:15:00 +02:00
reger
d882991bc5 Implement sharing of ioDispatcher for term & citation index
as proposed in ioDispatcher description
2015-05-25 19:46:26 +02:00
reger
c60ccdfbcf Increase IODispatcher dumpQueue size to 2 to reduce risk of concurrent emergency dump,
skip concurrent emergency merge
dealing with/see  http://mantis.tokeek.de/view.php?id=566
2015-05-24 18:03:27 +02:00
reger
13f013f64a Limit extra sleep of BusyThread on LowMemCycle 2015-05-17 06:21:12 +02:00
Michael Peter Christen
fed26f33a8 enhanced timezone management for indexed data:
to support the new time parser and search functions in YaCy a high
precision detection of date and time on the day is necessary. That
requires that the time zone of the document content and the time zone of
the user, doing a search, is detected. The time zone of the search
request is done automatically using the browsers time zone offset which
is delivered to the search request automatically and invisible to the
user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must get a time zone offset passed, so
this required the change of the parser java api. A lot of other changes
had been made which corrects the wrong handling of dates in YaCy which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are UTC/GMT time zone, a
normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
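The normalization described here, a local date plus a time zone offset in minutes converted to UTC, might look like the sketch below. The sign convention is an assumption (positive offsets are east of UTC); the commit does not state which convention the crawl start field or the browser-supplied offset uses. The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

class UtcSketch {

    // Assumption: offsetMinutes is positive east of UTC (e.g. +120 for UTC+02:00).
    static Instant toUtc(LocalDateTime localTime, int offsetMinutes) {
        ZoneOffset offset = ZoneOffset.ofTotalSeconds(offsetMinutes * 60);
        return localTime.toInstant(offset);
    }

    public static void main(String[] args) {
        LocalDateTime parsed = LocalDateTime.of(2015, 4, 15, 13, 17, 23);
        System.out.println(toUtc(parsed, 120)); // prints 2015-04-15T11:17:23Z
    }
}
```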
Michael Peter Christen
9bf0d7ecb9 added a new collection type 'dht' to all documents from the peer-to-peer
interface to distinguish rich and poor document data.
This also reverts some changes from commit
796770e070 because the firstSeen database
is the wrong method to distinguish these types of data
2015-03-24 12:32:39 +01:00
Michael Peter Christen
ee2490ab98 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-03-19 10:42:57 +01:00
reger
431311df42 fix get fresh_date_dt to allow returned value to be date in future 2015-03-18 22:04:03 +01:00
otter
74c7e8b686 Fixes hanging FlushThread (see
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5447)
by replacing the put() method with the more robust add() to
add a merge job to the queue.
2015-03-18 21:57:41 +01:00
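The difference between the two calls can be shown on a standard `java.util.concurrent.BlockingQueue`: `put()` blocks the caller while a bounded queue is full, whereas `add()` returns immediately and throws `IllegalStateException` in that case. Whether YaCy's merge queue is such a queue is an assumption; the class name, queue type, and capacity below are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueueAddSketch {

    // Enqueue without ever blocking the caller: add() fails fast on a full
    // bounded queue instead of waiting like put() would.
    static boolean tryEnqueue(BlockingQueue<String> queue, String job) {
        try {
            queue.add(job);
            return true;
        } catch (IllegalStateException full) {
            return false;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        System.out.println(tryEnqueue(queue, "merge-1")); // true
        System.out.println(tryEnqueue(queue, "merge-2")); // false, queue is full
    }
}
```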
reger
706f75ddc2 try to fix hang on index blob merge on shutdown
http://mantis.tokeek.de/view.php?id=505
It happens but could not be reproduced. This change makes sure the terminate signal is caught at the end of currently running merge jobs
2015-03-11 19:36:23 +01:00
Michael Peter Christen
fd4e2c809a Show dates in the content of a document in the search result:
- if an eventDate is given in the search result, replace the document
date with the event date and prefix it with the string "on ".
- the document date is omitted if a date from the content is shown

Added also the date as fields in the json and rss result sets.
2015-03-02 18:00:20 +01:00
reger
df83fcc4fc disable optimistic GC assumption in StandardMemoryStrategy
Several tests showed that eom is not prevented. The major reason found in testing was the assumption that a future GC will free the average of the last 5 GCs.
Disabling this check improved the eom exceptions.

Added simplest testcase used for verification
2015-02-11 01:42:01 +01:00
Michael Peter Christen
ac19690d30 refactoring with CommonPattern.COMMA 2015-01-29 01:35:28 +01:00
Michael Peter Christen
3d717b749a fix for urlmaskfilter 2015-01-28 13:40:41 +01:00
reger
24f68a4eb7 refactor opensearch heuristic
introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors,
which provide the query() functionality, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector.
The manager now enforces a minimum 15s delay between calls to external systems.
Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses an additional config file for fieldname translation.

default heuristicopensearch.conf: 
- openbdb.com removed - seems to no longer deliver results
- config via solrconnector to  datacite.org added (large technical library archive)
2015-01-19 03:30:35 +01:00
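The minimum delay between calls to external systems can be sketched as a small gate object. This is not the FederateSearchManager code: the class name and the `tryAcquire` method are hypothetical, and the clock value is passed in explicitly so the behaviour is easy to check.

```java
class MinDelayGate {
    private final long minDelayMillis;
    private long lastCallMillis;
    private boolean calledBefore = false;

    MinDelayGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Returns true and records the call if enough time has passed since the
    // last accepted call; returns false otherwise.
    synchronized boolean tryAcquire(long nowMillis) {
        if (calledBefore && nowMillis - lastCallMillis < minDelayMillis) {
            return false;
        }
        calledBefore = true;
        lastCallMillis = nowMillis;
        return true;
    }

    public static void main(String[] args) {
        MinDelayGate gate = new MinDelayGate(15_000); // 15 s, as in the commit
        System.out.println(gate.tryAcquire(0));       // true
        System.out.println(gate.tryAcquire(14_999));  // false
        System.out.println(gate.tryAcquire(15_000));  // true
    }
}
```

In real use the caller would pass `System.currentTimeMillis()` and skip the external query when the gate returns false.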
reger
8e751d754a - add javadoc to busythread with hint about the init parameter usage
- remove obsolete 10_httpd config parameter
2015-01-09 01:31:57 +01:00
Michael Peter Christen
3cd7deb3b8 do not flush non-errors to stdout because this is a concurrency issue.
the flush-call appeared very often in thread dumps with high load, so
this hopefully improves performance somewhat
2014-12-28 15:48:37 +01:00
reger
198102304b refactor size() -> filesize() of URIMetadataNode
(harmonize with ResultEntry and to not get confused with Collection.size())
2014-12-21 06:05:35 +01:00
reger
c6f634a4f2 remove redundant caching of urlhash in URIMetadataNode
(is already cached in underlying DigestURL .url)

upd pom keyword for maven-antrun-plugin
2014-12-21 03:45:54 +01:00
Michael Peter Christen
413eeefed4 added character set detection library from
http://www-archive.mozilla.org/projects/intl/chardet.html
2014-12-10 13:08:29 +01:00
Michael Peter Christen
a304058840 added Image Events as another option to generate images with a mac if no
Ghostscript is available or does not work...
2014-12-04 01:21:24 +01:00
Michael Peter Christen
321840fde3 Replaced all fixed thread pools with cached thread pools. The cached
thread pools will flush their cached (dead) threads after 60 seconds.
This will cause that YaCy now runs constantly with about 50 threads,
about 100 at peak times. Previously, about 400 threads had been cached
and kept in a hibernation state, which caused that the numproc counter
in /proc/user_beancounters (exists only in VM-hosted linux) was as high
as the cached number of threads. This caused that VM supervisors
terminated whole VM sessions if a limit was reached. Many VM providers
have limits of numproc=96 which made it virtually impossible to run YaCy
on such machines. With this change, it will be possible to run many YaCy
instances even on VM hosts.
2014-12-02 16:26:07 +01:00
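The difference between the two pool types can be inspected directly with the standard `Executors` factories. This sketch is not YaCy code; the class name is hypothetical and the pool size of 400 is only borrowed from the thread count mentioned above. The casts to `ThreadPoolExecutor` rely on what the JDK's factory methods return in practice.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolSketch {

    // Cached pool: no core threads, idle threads are discarded after 60 s.
    static ThreadPoolExecutor cached() {
        return (ThreadPoolExecutor) Executors.newCachedThreadPool();
    }

    // Fixed pool: once started, all n threads are kept alive while idle.
    static ThreadPoolExecutor fixed(int n) {
        return (ThreadPoolExecutor) Executors.newFixedThreadPool(n);
    }

    static String describe(ThreadPoolExecutor pool) {
        return "core=" + pool.getCorePoolSize()
                + " keepAlive=" + pool.getKeepAliveTime(TimeUnit.SECONDS) + "s";
    }

    public static void main(String[] args) {
        ThreadPoolExecutor cachedPool = cached();
        ThreadPoolExecutor fixedPool = fixed(400);
        System.out.println("cached: " + describe(cachedPool)); // cached: core=0 keepAlive=60s
        System.out.println("fixed:  " + describe(fixedPool));  // fixed:  core=400 keepAlive=0s
        cachedPool.shutdown();
        fixedPool.shutdown();
    }
}
```

For a fixed pool the keep-alive of 0 applies only to threads above the core size, so its core threads stay around indefinitely, which matches the roughly 400 hibernating threads described in the commit.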
Michael Peter Christen
7bfab5eb9d set Busy- and Blocking-Threads to daemon mode (they will now not prevent
YaCy from termination if still running)
2014-12-02 16:05:00 +01:00
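Daemon mode is a standard `java.lang.Thread` property: the JVM exits once only daemon threads remain. A minimal sketch of a thread factory that applies it follows; the class name and the thread naming are hypothetical and not taken from YaCy.

```java
import java.util.concurrent.ThreadFactory;

class DaemonSketch {

    // Factory whose threads never keep the JVM alive on their own.
    static ThreadFactory daemonFactory(String namePrefix) {
        return runnable -> {
            Thread t = new Thread(runnable, namePrefix + "-worker");
            t.setDaemon(true); // must be set before start()
            return t;
        };
    }

    public static void main(String[] args) {
        Thread busy = daemonFactory("busy").newThread(() -> {
            while (true) {
                try {
                    Thread.sleep(1_000);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        busy.start();
        // main returns here; the JVM terminates although 'busy' is still looping.
        System.out.println("daemon=" + busy.isDaemon());
    }
}
```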
Michael Peter Christen
ad0da5f246 added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
2014-11-29 11:56:32 +01:00
Michael Peter Christen
4920ab7b76 optimize usage of size() cache 2014-11-23 20:07:32 +01:00
Michael Peter Christen
2beb6abeb6 disabled crazy sleep loop 2014-11-21 14:38:54 +01:00
Michael Peter Christen
8aee7f940e added missing class for latest changes 2014-11-13 01:30:12 +01:00
Michael Peter Christen
97039049e4 fix in key enumeration methods for cases where the enumeration is done
in reverse order.
2014-11-13 01:15:31 +01:00
Michael Peter Christen
421ee64f33 another fix to ordering of table indexes; fixes also network stats
graphics
2014-11-11 13:57:04 +01:00
Michael Peter Christen
1db476c67e fix for bad table iteration 2014-11-10 18:52:01 +01:00