Commit Graph

129 Commits

Author SHA1 Message Date
reger
727dfb5875 refactore URIMetadataNode to further unify interaction with index
-  URIMetadataNode extending SolrDocument
- use language as stored (String), reducing conversion to string
- optimize debug code in transferIndex
2014-04-20 01:41:30 +02:00
Michael Peter Christen
74ab5ef9fa increased runtime for postprocessing query job 2014-04-18 06:51:10 +02:00
Michael Peter Christen
10cf8215bd added crawl depth for failed documents 2014-04-17 13:21:43 +02:00
Michael Peter Christen
c2f62e783f - better subgraph handling, less overhead for crawls without the
webgraph
- usage of crawler crawldepth cache for the linkgraph target depth
computation
2014-04-17 12:54:18 +02:00
Michael Peter Christen
9a5ab4e2c1 removed clickdepth_i field and related postprocessing. This information
is now available in the crawldepth_i field which is identical to
clickdepth_i because of a specific crawler strategy.
2014-04-16 22:16:20 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into it's own queue. The primary
rule for urls taken from any queue is, that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermorem the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create 10.000s of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such on-demand file reader shall prevent that the number of file
pointers is over the system limit, which is usually about 10.000 open
files. Some parts of YaCy had to be adopted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adopted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
Michael Peter Christen
8aeef73d49 fix for virtual root nodes 2014-04-11 15:12:34 +02:00
Michael Peter Christen
7c7fbb9818 find depth-matches also for edge targets 2014-04-11 12:27:21 +02:00
Michael Peter Christen
dd12dd392f introduction of a data structure for HyperlinkEdges which should use
less memory as it does no double-storage of source links for each edge
of the graph.
2014-04-11 12:09:33 +02:00
Michael Peter Christen
6ea8bb7348 using MultiProtocolURL for edge data which is faster (hash computation
is now much easier) and smaller in size
2014-04-11 10:58:37 +02:00
Michael Peter Christen
a37d067692 refactoring 2014-04-10 23:46:35 +02:00
orbiter
95780eed32 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-10 21:40:54 +02:00
Michael Peter Christen
67beef657f strong redesign of html parser: object recursion is now made using a
stack on html tag objects, not using a recursive parse-again method
which may cause bad performance and huge memory allocation. The new
method also produced better parsed image objects with exact anchor text
references.
2014-04-10 18:58:03 +02:00
orbiter
67501c9dda Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-09 19:58:54 +02:00
Michael Peter Christen
1c21b3256d fix for robots.txt handling: delete old entry before starting a new
crawl.
2014-04-09 18:33:48 +02:00
orbiter
c250fac9f4 linkstructure refactoring to get more options for clickdepth analysis 2014-04-09 17:52:51 +02:00
Michael Peter Christen
8068e68474 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-09 12:45:15 +02:00
Michael Peter Christen
bd886054cb new structure and enhancements for link graph computation:
- added order option to solr queries to be able to retrieve document
lists in specific order, here: link length
- added HyperlinkEdge class which manages the link structure
- integrated the HyperlinkEdge class into clickdepth computation
- extended the linkstructure.json servlet to show also the clickdepth
and other statistic information
2014-04-09 12:45:04 +02:00
reger
f326a67561 fix: typo in default charset in metadata2solr
update pom and NB build to Solr 4.7.1 libs
2014-04-06 22:31:22 +02:00
Michael Peter Christen
e8ddd415a8 enhanced the new link structure graph 2014-04-04 14:43:54 +02:00
Michael Peter Christen
3ce8eff21b another fix for inbound/outbound detection 2014-04-04 12:41:59 +02:00
orbiter
18f9c40302 moved Edge class out of linkstructure servlet as this does not work on
non-eclipse driven environments (all non-dev cases)
2014-04-04 10:54:11 +02:00
Michael Peter Christen
c64c10ef00 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-03 01:58:06 +02:00
Michael Peter Christen
48fbfa60c1 bugfix to inbound/outbound identification 2014-04-03 01:21:43 +02:00
reger
227c42bc96 eleminate obsolete URIMetaDataRow class
by joining it with/into URIMetaDataNode.
2014-04-03 00:35:15 +02:00
Michael Peter Christen
cca851a417 introduced new solr field crawldepth_i which records the crawl depth of
a document. This is the upper limit for the clickdepth_i value which may
be shorter in case that the crawler did not take the shortest path to
the document.
2014-04-02 23:37:01 +02:00
Michael Peter Christen
63c9fcf3e0 free configuration of postprocessing clickdepth maximum depth and time 2014-04-02 02:34:39 +02:00
Michael Peter Christen
8b44fcf0f4 added missing @Override annotation 2014-03-28 13:48:37 +01:00
Michael Peter Christen
e515dd460d added linkscount_i and linksnofollowcount_i to the default solr schema 2014-03-27 23:36:08 +01:00
Michael Peter Christen
61ad194065 fix for source and target clickdepth in webgraph index 2014-03-26 16:00:05 +01:00
Michael Peter Christen
51800007c4 - added concurrency to postprocessing of webgraph document
- bundeled separate webgraph postprocesing steps into one
2014-03-06 01:43:48 +01:00
Michael Peter Christen
fdaeac374a - enhanced postprocessing speed and memory footprint (by using HashMaps
instead of TreeMaps)
- enhanced memory footprint of database indexes (by introduction of
optimize calls)
- optimize calls shrink the amount of used memory for index sets if they
are not changed afterwards any more
2014-02-28 14:01:09 +01:00
Michael Peter Christen
d325cb8912 fixes and enhancements for postprocessing 2014-02-28 02:51:14 +01:00
Michael Peter Christen
1d069c5861 make sure that postprocessed documents are overwritten 2014-02-27 12:27:15 +01:00
Michael Peter Christen
e644981697 added one more postprocessing low memory check 2014-02-27 00:34:13 +01:00
Michael Peter Christen
e1bf65c892 added short memory protection during postprocessing 2014-02-26 23:02:56 +01:00
Michael Peter Christen
0f6b72f24b do not use luke requests for remote solr servers if the result is
different from normal requests. This happens if the remote solr is
actually a solrCloud; in such cases the luke request returns only the
result of the single solr peer, not the whole cloud.
also done: some refactoring.
2014-02-26 14:30:48 +01:00
Michael Peter Christen
a2b66fe2eb Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-02-25 14:37:39 +01:00
Michael Peter Christen
9f6be762a6 - better logging for postprocessing
- fixed collection bug in postprocessing
2014-02-25 14:37:30 +01:00
orbiter
cfb647db6e - introduced a miss cache in ConcurrentUpdateSolrConnector
- better usage of cache
- bugfix for postprocessing
2014-02-24 23:42:50 +01:00
Michael Peter Christen
790f103f32 delete fail-docs during postprocessing to prevent that they will appear
again and stay in postprocessing forever.
2014-02-18 01:38:56 +01:00
Michael Peter Christen
3d474a843e added memory protection for postprocessing 2014-02-09 12:36:56 +01:00
Michael Peter Christen
82c0525e71 wrong logger fix 2013-12-23 10:52:02 +01:00
orbiter
937273d4e3 added parsing of metadata to surrogate reading:
a dublin core record inside of surrogate input files may now contain
tokens within the namespace 'md' (short for: metadata). The token names
must be valid withing the namespace of the solr field names. All
md-tokens inside of surrogate files then overwrite values within solr
documents before they are written to the solr index. This makes it
possible to assign collection names to each surrogate entry and also
ranking information can be added. Please see the example file.
2013-12-17 14:02:27 +01:00
Michael Peter Christen
2702d9e56b - added a SolrQueryResponse2SolrDocumentList method which is able to
work around the unfolding process in Solr's BinaryResponseWriter.
This was a huge performance bottleneck in the embedded solr connector
and the problem is actually on Solr side, but we have now a workaround.
- This made it possible to abstract a high-performance index access
method which is implemented as method getDocumentListByParams. That
method is also implemented in the SolrServerConnector and provides a
very efficient access to a solr index if the index is embedded.
- a popular use of the document list retrieval is a result count which
can now also make use of the new method, via getDocumentCountByParams.
- enhanced the Error cache which now does not store error documents
within the ram cache if the document is also written to solr. When
documents are retrieved from the cache, they are partly read from the
ram cache and if not existent there, from the Solr index.
2013-12-13 15:56:29 +01:00
Michael Peter Christen
e3c2f09de9 - reduce computation in case that specific postprocessing fields are not
selected
- de-select citation rank computation
2013-12-04 17:48:12 +01:00
Michael Peter Christen
a125904a1c fixed a NPE in surrogat processing 2013-12-04 01:56:38 +01:00
Michael Peter Christen
0db8e34625 enhanced webgraph processing 2013-12-04 01:54:45 +01:00
reger
f23471c471 add check to prevent index entries containing url_file_ext_s with ";jsession=xyz"
note: check could be implemented in MultiProtocolURL (but at this time didn't oversee possible implication)
2013-11-25 00:14:53 +01:00
orbiter
da33ee0d77 extended also timeout fr webgraph postprocessing 2013-11-16 18:30:06 +01:00