yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
reger	727dfb5875	refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex	2014-04-20 01:41:30 +02:00
Michael Peter Christen	74ab5ef9fa	increased runtime for postprocessing query job	2014-04-18 06:51:10 +02:00
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	2014-04-17 13:21:43 +02:00
Michael Peter Christen	c2f62e783f	- better subgraph handling, less overhead for crawls without the webgraph - usage of crawler crawldepth cache for the linkgraph target depth computation	2014-04-17 12:54:18 +02:00
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	2014-04-16 22:16:20 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
Michael Peter Christen	8aeef73d49	fix for virtual root nodes	2014-04-11 15:12:34 +02:00
Michael Peter Christen	7c7fbb9818	find depth-matches also for edge targets	2014-04-11 12:27:21 +02:00
Michael Peter Christen	dd12dd392f	introduction of a data structure for HyperlinkEdges which should use less memory as it does no double-storage of source links for each edge of the graph.	2014-04-11 12:09:33 +02:00
Michael Peter Christen	6ea8bb7348	using MultiProtocolURL for edge data which is faster (hash computation is now much easier) and smaller in size	2014-04-11 10:58:37 +02:00
Michael Peter Christen	a37d067692	refactoring	2014-04-10 23:46:35 +02:00
orbiter	95780eed32	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-04-10 21:40:54 +02:00
Michael Peter Christen	67beef657f	strong redesign of html parser: object recursion is now made using a stack on html tag objects, not using a recursive parse-again method which may cause bad performance and huge memory allocation. The new method also produced better parsed image objects with exact anchor text references.	2014-04-10 18:58:03 +02:00
orbiter	67501c9dda	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-04-09 19:58:54 +02:00
Michael Peter Christen	1c21b3256d	fix for robots.txt handling: delete old entry before starting a new crawl.	2014-04-09 18:33:48 +02:00
orbiter	c250fac9f4	linkstructure refactoring to get more options for clickdepth analysis	2014-04-09 17:52:51 +02:00
Michael Peter Christen	8068e68474	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-04-09 12:45:15 +02:00
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	2014-04-09 12:45:04 +02:00
reger	f326a67561	fix: typo in default charset in metadata2solr update pom and NB build to Solr 4.7.1 libs	2014-04-06 22:31:22 +02:00
Michael Peter Christen	e8ddd415a8	enhanced the new link structure graph	2014-04-04 14:43:54 +02:00
Michael Peter Christen	3ce8eff21b	another fix for inbound/outbound detection	2014-04-04 12:41:59 +02:00
orbiter	18f9c40302	moved Edge class out of linkstructure servlet as this does not work on non-eclipse driven environments (all non-dev cases)	2014-04-04 10:54:11 +02:00
Michael Peter Christen	c64c10ef00	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-04-03 01:58:06 +02:00
Michael Peter Christen	48fbfa60c1	bugfix to inbound/outbound identification	2014-04-03 01:21:43 +02:00
reger	227c42bc96	eliminate obsolete URIMetaDataRow class by joining it with/into URIMetaDataNode.	2014-04-03 00:35:15 +02:00
Michael Peter Christen	cca851a417	introduced new solr field crawldepth_i which records the crawl depth of a document. This is the upper limit for the clickdepth_i value which may be shorter in case that the crawler did not take the shortest path to the document.	2014-04-02 23:37:01 +02:00
Michael Peter Christen	63c9fcf3e0	free configuration of postprocessing clickdepth maximum depth and time	2014-04-02 02:34:39 +02:00
Michael Peter Christen	8b44fcf0f4	added missing @Override annotation	2014-03-28 13:48:37 +01:00
Michael Peter Christen	e515dd460d	added linkscount_i and linksnofollowcount_i to the default solr schema	2014-03-27 23:36:08 +01:00
Michael Peter Christen	61ad194065	fix for source and target clickdepth in webgraph index	2014-03-26 16:00:05 +01:00
Michael Peter Christen	51800007c4	- added concurrency to postprocessing of webgraph document - bundled separate webgraph postprocessing steps into one	2014-03-06 01:43:48 +01:00
Michael Peter Christen	fdaeac374a	- enhanced postprocessing speed and memory footprint (by using HashMaps instead of TreeMaps) - enhanced memory footprint of database indexes (by introduction of optimize calls) - optimize calls shrink the amount of used memory for index sets if they are not changed afterwards any more	2014-02-28 14:01:09 +01:00
Michael Peter Christen	d325cb8912	fixes and enhancements for postprocessing	2014-02-28 02:51:14 +01:00
Michael Peter Christen	1d069c5861	make sure that postprocessed documents are overwritten	2014-02-27 12:27:15 +01:00
Michael Peter Christen	e644981697	added one more postprocessing low memory check	2014-02-27 00:34:13 +01:00
Michael Peter Christen	e1bf65c892	added short memory protection during postprocessing	2014-02-26 23:02:56 +01:00
Michael Peter Christen	0f6b72f24b	do not use luke requests for remote solr servers if the result is different from normal requests. This happens if the remote solr is actually a solrCloud; in such cases the luke request returns only the result of the single solr peer, not the whole cloud. also done: some refactoring.	2014-02-26 14:30:48 +01:00
Michael Peter Christen	a2b66fe2eb	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-02-25 14:37:39 +01:00
Michael Peter Christen	9f6be762a6	- better logging for postprocessing - fixed collection bug in postprocessing	2014-02-25 14:37:30 +01:00
orbiter	cfb647db6e	- introduced a miss cache in ConcurrentUpdateSolrConnector - better usage of cache - bugfix for postprocessing	2014-02-24 23:42:50 +01:00
Michael Peter Christen	790f103f32	delete fail-docs during postprocessing to prevent that they will appear again and stay in postprocessing forever.	2014-02-18 01:38:56 +01:00
Michael Peter Christen	3d474a843e	added memory protection for postprocessing	2014-02-09 12:36:56 +01:00
Michael Peter Christen	82c0525e71	wrong logger fix	2013-12-23 10:52:02 +01:00
orbiter	937273d4e3	added parsing of metadata to surrogate reading: a dublin core record inside of surrogate input files may now contain tokens within the namespace 'md' (short for: metadata). The token names must be valid within the namespace of the solr field names. All md-tokens inside of surrogate files then overwrite values within solr documents before they are written to the solr index. This makes it possible to assign collection names to each surrogate entry and also ranking information can be added. Please see the example file.	2013-12-17 14:02:27 +01:00
Michael Peter Christen	2702d9e56b	- added a SolrQueryResponse2SolrDocumentList method which is able to work around the unfolding process in Solr's BinaryResponseWriter. This was a huge performance bottleneck in the embedded solr connector and the problem is actually on Solr side, but we have now a workaround. - This made it possible to abstract a high-performance index access method which is implemented as method getDocumentListByParams. That method is also implemented in the SolrServerConnector and provides a very efficient access to a solr index if the index is embedded. - a popular use of the document list retrieval is a result count which can now also make use of the new method, via getDocumentCountByParams. - enhanced the Error cache which now does not store error documents within the ram cache if the document is also written to solr. When documents are retrieved from the cache, they are partly read from the ram cache and if not existent there, from the Solr index.	2013-12-13 15:56:29 +01:00
Michael Peter Christen	e3c2f09de9	- reduce computation in case that specific postprocessing fields are not selected - de-select citation rank computation	2013-12-04 17:48:12 +01:00
Michael Peter Christen	a125904a1c	fixed a NPE in surrogate processing	2013-12-04 01:56:38 +01:00
Michael Peter Christen	0db8e34625	enhanced webgraph processing	2013-12-04 01:54:45 +01:00
reger	f23471c471	add check to prevent index entries containing url_file_ext_s with ";jsession=xyz" note: check could be implemented in MultiProtocolURL (but at this time could not assess the possible implications)	2013-11-25 00:14:53 +01:00
orbiter	da33ee0d77	extended also timeout for webgraph postprocessing	2013-11-16 18:30:06 +01:00

129 Commits