yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	70f03f7c8e	do not cache search requests to Solr if the result is used for doublechecking. If a double-check comes from cached results the doublecheck fails.	2014-11-20 18:45:27 +01:00
Michael Peter Christen	6a2a669db4	added loading of the synonyms file from addon/synonyms into the knowledge loader	2014-11-19 17:36:56 +01:00
Michael Peter Christen	0a879c98e7	added new 'firstSeen' database table and necessary data structures which hold a date for each URL to record when a url was first seen. This is then used to overwrite the modification date for urls upon recrawl in case that the first-seen date is before the latest document date. This behaviour is necessary due to the common behaviour of content management systems which attach always the current date to all documents. Using the firstSeen database it is possible to approximate a real first document creation date in case that the crawler starts frequently for the same domain. As a result the search results ordered by date have a much better quality and the usage of YaCy as search agent for latest news has a better quality.	2014-11-13 00:58:58 +01:00
sixcooler	9c6e3a6b1c	fix assertation-failure in version-string for Solr-4.10.2 by changing the assert - hope that is ok + add forgotten NB-Projekt-changes	2014-11-07 22:43:50 +01:00
sixcooler	725b206fb4	update to solr-/lucene-4.10.2	2014-11-07 18:51:31 +01:00
orbiter	0fcd8097a3	removed unused options from BusyThreads	2014-11-02 20:08:49 +01:00
Michael Peter Christen	77662e08e1	concurrently initialize the error cache; extended also the cache by factor 10 up to 1000 entries. This error cache is only used to catch up paused crawls between shutdown+startup	2014-10-17 12:45:26 +02:00
orbiter	a922b122a3	added a hack to forward solr search results from an external attached solr to the YaCy built-in solr search servlet. Its not complete and not fully correct (there is still a utf8 encoding problem) but it is a way to get easily requests forwarded through YaCy to an external Solr.	2014-09-22 15:28:54 +02:00
Michael Peter Christen	6d3d4c4ea6	changed the concurrent enumeration of query results in such a way that it is now possible to get the results in two steps: - first retrieve all IDs as given for a query - then retieve each document individually This was necessary for very large result sets where a query may run for hours and is possibly terminated by a solr-internal timeout. This occurs regulary during postprocessing and therefore this commit may fix unwanted postprocessing terminations.	2014-09-17 13:58:55 +02:00
Michael Peter Christen	81f9b34da7	increaesed ability ot search for all images on a single server within the p2p remote search	2014-09-15 20:33:22 +02:00
Michael Peter Christen	a7dd89c4de	changed method to write the citation index: do not catch up references during document parsing; instead use the same references that would also be written into the webgraph. That should cause that the webgraph and the citation index express the exact same semantic.	2014-09-02 13:22:12 +02:00
Michael Peter Christen	001e05bb80	do not store failure of loading of robots.txt into the index as a fail document	2014-08-01 12:15:14 +02:00
Michael Peter Christen	05d58e4df0	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-08-01 12:04:25 +02:00
Michael Peter Christen	98f45c9032	fix for image alt attachment to AnchorURLs in html parser.	2014-08-01 12:04:15 +02:00
orbiter	22ce4fb4dd	better error handling for remote solr queries and exists-checks	2014-08-01 11:00:10 +02:00
orbiter	738989aab7	reverted commit `f94c91315b` because the webgraph has not enough performance for that	2014-07-29 18:49:42 +02:00
Michael Peter Christen	f94c91315b	if the webgraph is used, then use it also for reference computation to avoid contradictions with references_i in the collection index.	2014-07-24 15:35:53 +02:00
Michael Peter Christen	4eec1a7452	refactoring (change Metadata name of load time data structure to avoid confusion with Node data which is also called metadata)	2014-07-21 23:54:23 +02:00
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method that the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just just toNormalform(false) instead.	2014-07-18 12:43:01 +02:00
Michael Peter Christen	d07cdd8c3b	added SolrCloud access mode and configuration	2014-07-16 14:57:51 +02:00
Michael Peter Christen	b5fc2b63ea	removed exist() retrieval functions from error cache and replaced it with metadata retrieval from connectors directly. This should cause better usage of the cache. Automatically increase the metadata cache if more memory is available.	2014-07-11 19:52:25 +02:00
Michael Peter Christen	b5d78ba156	reduced number of solr queries during crawling	2014-07-11 18:05:11 +02:00
Michael Peter Christen	fd87fa1613	removed more unnecessary exist-checks in ErrorCache	2014-07-11 16:48:08 +02:00
Michael Peter Christen	f2b476e08b	don't do a double check to solr for failed documents if they are not written to solr	2014-07-11 16:26:52 +02:00
Michael Peter Christen	09dcdb9b19	update to solr 4.9.0	2014-07-01 16:39:00 +02:00
Michael Peter Christen	5b94a257ce	no timeout for large reference collections	2014-06-29 22:26:22 +02:00
Michael Peter Christen	8ad41a882c	fixed several problems with postprocessing: - unique-postprocessing was destroying results from other postprocessings; removed cross-updates as they had been not necessary - unique-postprocessing did not restrict on same protocol - inefficient concurrent update cache was redesigned completely - increased limits for concurrent blocking queues to prevent early time-out	2014-05-29 13:24:24 +02:00
Michael Peter Christen	f0db501630	better handling of ranking parameters and new default values for date navigation which is done using ranking in solr.	2014-05-22 03:01:07 +02:00
Michael Peter Christen	53948da7d0	tried to make last_modified recognition smarter	2014-05-22 00:28:51 +02:00
Michael Peter Christen	6634b5b737	debug code for index distribution testing	2014-05-21 18:20:16 +02:00
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	2014-05-20 21:50:16 +02:00
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	2014-04-17 13:21:43 +02:00
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	2014-04-16 22:16:20 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into it's own queue. The primary rule for urls taken from any queue is, that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermorem the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such on-demand file reader shall prevent that the number of file pointers is over the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adopted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adopted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
orbiter	95780eed32	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-04-10 21:40:54 +02:00
Michael Peter Christen	6bd8c6f195	fix for wrong status codes of error pages	2014-04-10 09:08:59 +02:00
orbiter	c250fac9f4	linkstructure refactoring to get more options for clickdepth analysis	2014-04-09 17:52:51 +02:00
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	2014-04-09 12:45:04 +02:00
Michael Peter Christen	ebd44a7080	replaced solr 4.6.1 with solr 4.7.1 and added index migration to lucene_47	2014-04-06 10:45:03 +02:00
Michael Peter Christen	926d28dd3f	fixed a bug which prevented crawl starts after a network switch	2014-04-04 14:43:35 +02:00
reger	227c42bc96	eleminate obsolete URIMetaDataRow class by joining it with/into URIMetaDataNode.	2014-04-03 00:35:15 +02:00
Michael Peter Christen	63c9fcf3e0	free configuration of postprocessing clickdepth maximum depth and time	2014-04-02 02:34:39 +02:00
Michael Peter Christen	51800007c4	- added concurrency to postprocessing of webgraph document - bundeled separate webgraph postprocesing steps into one	2014-03-06 01:43:48 +01:00
Michael Peter Christen	fdaeac374a	- enhanced postprocessing speed and memory footprint (by using HashMaps instead of TreeMaps) - enhanced memory footprint of database indexes (by introduction of optimize calls) - optimize calls shrink the amount of used memory for index sets if they are not changed afterwards any more	2014-02-28 14:01:09 +01:00
Michael Peter Christen	7c1b968378	another fix for the shutdown exceptions	2014-02-28 01:53:32 +01:00
Michael Peter Christen	7640834b37	removed double concurrency to put Solr documents into the index. The writings to the solr index are also buffered in ConcurrentUpdateSolrConnector	2014-02-26 22:21:00 +01:00
Michael Peter Christen	0f6b72f24b	do not use luke requests for remote solr servers if the result is different from normal requests. This happens if the remote solr is actually a solrCloud; in such cases the luke request returns only the result of the single solr peer, not the whole cloud. also done: some refactoring.	2014-02-26 14:30:48 +01:00
orbiter	ced1a96f9c	fixed error cache	2014-02-25 02:16:22 +01:00
orbiter	cfb647db6e	- introduced a miss cache in ConcurrentUpdateSolrConnector - better usage of cache - bugfix for postprocessing	2014-02-24 23:42:50 +01:00
orbiter	a87d8e4a8e	changed caching of ConcurrentUpdateSolrConnector: it caches now also the url along with the load date. While this takes much more memory, it eliminates database lookups for getURL() requests, which happen equally often. This speeds up remote solr configurations.	2014-02-24 22:59:58 +01:00

1 2 3 4 5 ...

373 Commits