Commit Graph

732 Commits

Author SHA1 Message Date
Michael Peter Christen
a16534cb0a tried to fix timeout and connection-lost problems when using an outside
solr.
2013-11-28 01:31:53 +01:00
Michael Peter Christen
9932c441c8 fixed a problem with Date fields parsing Solr results if a remote Solr
is attached.
2013-11-28 00:54:53 +01:00
sixcooler
94db054aff memory-leak-fix: the DocListSearcher fires an query in its constructor
and it is highly recommend to close every SolrRequest.
Every Request, which is not closed leaves a Searcher with its Chaches an
can not be garbage-collectet.
2013-11-27 19:07:36 +01:00
Michael Peter Christen
5592ea57f0 hack to remove compiler warnings about deprecated classes. It would be
better to remove the deprecated usage but to do this the Solr core must
adopt the latest apache http core changes as well .. this is not our
fault.
2013-11-25 23:30:35 +01:00
orbiter
037cd0a57c using the BinaryResponseWriter which is supported within the YaCy solr
servlet since YaCy 1.63. This is much more performant for the client
than using the XMLResponseWriter because parsing of XML data is very CPU
intensive. Older YaCy peers are still requested using the
XMLResponseWriter but the majority of YaCy peers already respond with
the binary writer. This makes remote searches much faster and less CPU
intensive.
2013-11-25 21:31:40 +01:00
reger
8da75a4b0c fix contentType definition for Solr html responswriter
from xml to html
(hint: value is currently not used, but is in SolrServlet)
2013-11-24 04:31:08 +01:00
Michael Peter Christen
1f0bfa8fec added test to Base64Order (runs successfully!) 2013-11-22 10:38:42 +01:00
Michael Peter Christen
219d5934a4 fixed termination bug in Solr Connector 2013-11-16 08:22:29 +01:00
Michael Peter Christen
9d5895f643 enhanced and fixed postprocessing 2013-11-15 15:41:12 +01:00
Michael Peter Christen
f86fe90eda enhanced mass storage speed to remote solr servers 2013-11-15 15:40:07 +01:00
Michael Peter Christen
6ed9821209 fixed several problems in solr connectors 2013-11-15 15:39:35 +01:00
Michael Peter Christen
191fd3d7e7 added an optimization option to HandleSet mass data storage structure 2013-11-15 15:38:00 +01:00
Michael Peter Christen
94b565ea0d fixed keepalive min value 2013-11-15 15:37:01 +01:00
Michael Peter Christen
24a052ecb9 removed debug code for existsByIds 2013-11-13 13:41:18 +01:00
Michael Peter Christen
1a4a69c226 set more logger to 'final static' 2013-11-13 06:18:48 +01:00
orbiter
b085cb522b replaced old existsByIds for embedded Solr with obviously much faster
new selection method (including stil existing debug code to test that
this is in fact better)
2013-11-11 11:25:01 +01:00
Michael Peter Christen
899e7e92b0 added debug code 2013-11-09 02:37:12 +01:00
Michael Peter Christen
81bb50118e found and fixed a huge memory leak in solr caching (inside Solr). The
not-flushed Solr cache is now handled in this way:
- it is smaller by default
- an Solr-internal process is started to flush the cache periodically
(this does NOT clean the cache, just removes old objects)
- a Solr-external process (the standard YaCy cleanup-process) now has
direct access to the solr internal cache and flushes them completely.
The time frame for such a flush is defined by the cleanup-process
frequency, by default 10 minutes.
2013-11-07 10:01:44 +01:00
Michael Peter Christen
b2c329929f Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-04 10:18:52 +01:00
Michael Peter Christen
60187a4ec2 fix in html parser 2013-11-04 10:16:20 +01:00
Michael Peter Christen
e1c1e57877 less overhead calling exist() with only one hash 2013-11-04 09:37:31 +01:00
reger
3d5d366f1c fix html header in Solr HTMLResponseWriter
- move 1st body content after </head> tag
- add closing <span> tag
2013-11-04 03:12:02 +01:00
Michael Peter Christen
5a02d650ee avoid cloning 2013-11-03 18:31:50 +01:00
Michael Peter Christen
cc39667399 Speed enhancements and less CPU usage during Solr searches when using
the embedded Solr (the default). This was obtained by cirumventing solrj
search encapsulation and the implementation of direct index access
methods to Solr.
The effect will not only be seen during search, but this has also a
strong effect on suggestions (much more) and less CPU power usage during
index distribution (which needs many search requests)
2013-11-01 17:24:36 +01:00
Michael Peter Christen
9bb7eab389 hacks to prevent storage of data longer than necessary during search and
some speed enhancements. This should reduce the memory usage during
heavy-load search a bit.
2013-10-25 15:05:30 +02:00
Michael Peter Christen
1a8783147b enhanced computation of number of solr documents. 2013-10-24 15:48:05 +02:00
Michael Peter Christen
4948c39e48 added concurrency for mass crawl check 2013-10-23 11:27:19 +02:00
Michael Peter Christen
1b4fa2947d - fixed a problem which ocurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together makes it now possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
Michael Peter Christen
74d0256e93 enhanced postprocessing: fixed bugs, enable proper postprocessing also
without the harvestingkey, remove crawl profiles after postprocessing,
speed-up for clickdepth computation.
2013-10-16 11:27:06 +02:00
sixcooler
d9a02ed277 NPE fix for my last commit 2013-10-11 00:44:04 +02:00
sixcooler
61f627eb85 fix for ssl-connections from proxy-usage staying in close-wait-state
+ some extra 'close' in HttpClient
2013-10-10 20:57:37 +02:00
Michael Peter Christen
1b61bd40ed - Added new solr field url_file_name_tokens_t which stores the file name
tokens. This can be used to enhance the ranking.
- Added also a rating_i field as basis for later usage.
- enhanced the tokenization process.
2013-10-08 23:48:13 +02:00
sixcooler
d536092fe4 fix false fill NAME_CACHE_MISS-DNS-Cache in case of a timeout
for eg. caused by massive requests when crawl from file
2013-10-08 18:02:42 +02:00
Michael Peter Christen
ef31d0f279 fix for rss reader, see http://bugs.yacy.net/view.php?id=294 2013-10-07 12:59:54 +02:00
Michael Peter Christen
b28d43decc added two more fields source_cr_host_norm_i,target_cr_host_norm_i in
webgraph and an addition to postprocessing to copy all cr ranking
attributes to the link edges associated to the postprocessing documents
2013-09-27 16:57:05 +02:00
Michael Peter Christen
4476dea5ba do not fail if a wrong boost key is used; instead, print only a warning
See also: http://bugs.yacy.net/view.php?id=293
2013-09-27 12:28:09 +02:00
Michael Peter Christen
1b3d26dd23 hack to remove most of the warning: deprecated messages (but not all,
one is left)
2013-09-25 21:14:52 +02:00
sixcooler
3c48fc65fd reverted RemoteInstance to deprecated methods of httpClient-4.2
this should work with current remote-Solr-Instances
2013-09-25 18:45:16 +02:00
sixcooler
0cae420d8e some dns-timing changes:
since httpclient uses the domain-cache it is useful not to clean the
domain cache until crawling is running (domains are filled into this
cache)
On huge crawl-starts (eg. from file) my DNS did not follow the high
rates - so I reduced the rate and give some more time(-out)
2013-09-25 15:01:28 +02:00
sixcooler
15b1bb2513 bump to httpClient-4.3 2013-09-25 14:48:37 +02:00
orbiter
d86d2be5c3 automatically removed Places autotagging if no location library is
wanted
2013-09-24 11:23:45 +02:00
reger
6b9a624808 remove double declaration of TLD_any_zone_filter 2013-09-23 03:01:08 +02:00
orbiter
6e8377b8ad do not check all words with synonym library if the library is empty 2013-09-21 08:56:24 +02:00
Michael Peter Christen
2602be8d1e - removed ZURL data structure; removed also the ZURL data file
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: added must-match pattern
application to feed urls to filter out such urls which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs are now temporary objects in
memory
2013-09-17 15:27:02 +02:00
Michael Peter Christen
61c5e40687 - replaced the properties object in AnchorURL with distinct variables
for anchor attributes.
- this caused that large portions of the parser code had to be adopted
as well
- added a counter target_order_i for anchor links in webgraph
computation
2013-09-15 23:27:04 +02:00
Michael Peter Christen
3ea9bb4427 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-15 00:30:41 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
all unique links! This made it necessary, that a large portion of the
parser and link processing classes must be adopted to carry a different
type of link collection which carry a property attribute which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
reger
603368fc3e remove redundant declaration of USER_AGENT 2013-09-14 18:29:44 +02:00
Michael Peter Christen
dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that
appears in the crawl url list
2013-09-05 13:22:16 +02:00
Michael Peter Christen
85456f46b2 added two new fields, exact_signature_copycount_i and
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assigns this to each document. Thus, each
document there is a number assigned which shows how many copies of this
document exists.
These fields are disabled by default.
2013-09-04 23:11:53 +02:00