Commit Graph

728 Commits

Author SHA1 Message Date
Michael Peter Christen
0db8e34625 enhanced webgraph processing 2013-12-04 01:54:45 +01:00
Michael Peter Christen
a16534cb0a tried to fix timeout and connection-lost problems when using an outside
solr.
2013-11-28 01:31:53 +01:00
Michael Peter Christen
c3dcbdc8d5 try to recover from an OOM during citation index reading and fail-over
to second solr core in case of unrecoverable OOM.
2013-11-28 01:10:25 +01:00
Michael Peter Christen
9932c441c8 fixed a problem with Date fields parsing Solr results if a remote Solr
is attached.
2013-11-28 00:54:53 +01:00
Michael Peter Christen
ae55d69ef6 include/exclude size NPE fix (recently added) 2013-11-26 11:47:04 +01:00
Michael Peter Christen
2c39b65409 fixes for searches containing stopwords. The fix was done using a
reconstruction of the search word set access method to protect that
words are deleted from the sets from the outside of the QueryGoal class.
2013-11-26 02:24:47 +01:00
orbiter
037cd0a57c using the BinaryResponseWriter which is supported within the YaCy solr
servlet since YaCy 1.63. This is much more performant for the client
than using the XMLResponseWriter because parsing of XML data is very CPU
intensive. Older YaCy peers are still requested using the
XMLResponseWriter but the majority of YaCy peers already respond with
the binary writer. This makes remote searches much faster and less CPU
intensive.
2013-11-25 21:31:40 +01:00
orbiter
61409788eb less word hash computations (removing some overhead because of MD5
calcs) using the clear word in a normalized form.
2013-11-25 15:20:54 +01:00
reger
f23471c471 add check to prevent index entries containing url_file_ext_s with ";jsession=xyz"
note: check could be implemented in MultiProtocolURL (but at this time didn't oversee possible implication)
2013-11-25 00:14:53 +01:00
orbiter
3e552550d1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-18 22:48:00 +01:00
orbiter
c2d720cdaf purge a lucene cache - possible memory leak fix 2013-11-18 22:47:35 +01:00
reger
e4f49fb175 for searchresults with empty title use filename as title
- to not store a title in index which isn't extracted from source 
  the title is empty check only added to ResultEntry class
2013-11-18 19:41:31 +01:00
orbiter
da33ee0d77 extended also timeout fr webgraph postprocessing 2013-11-16 18:30:06 +01:00
orbiter
74f9e40747 extended timeout during postprocessing of 30 minutes. 2013-11-16 18:29:08 +01:00
orbiter
19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
monitor page
2013-11-16 18:23:14 +01:00
Michael Peter Christen
9cf9727685 fix for wrong counter 2013-11-16 11:33:35 +01:00
Michael Peter Christen
fceac8cffd more monitoring for postprocessing 2013-11-16 08:23:42 +01:00
Michael Peter Christen
6842783761 fixed and enhanced postprocessing 2013-11-16 08:23:21 +01:00
Michael Peter Christen
bf1bdd52a6 prevent requesting of 0-facets (which actually exist) 2013-11-15 15:41:41 +01:00
Michael Peter Christen
9d5895f643 enhanced and fixed postprocessing 2013-11-15 15:41:12 +01:00
Michael Peter Christen
087df05e24 added option to Config_Network_p.html to enable remote search while
DHT-Receive is switched off.
2013-11-13 13:38:01 +01:00
Michael Peter Christen
1a4a69c226 set more logger to 'final static' 2013-11-13 06:18:48 +01:00
Michael Peter Christen
69b8d61c47 fix for search requests in GSA interface which contain 'funny'
characters (like ':' etc.)
2013-11-12 15:54:54 +01:00
orbiter
4234b0ed6c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-10 18:50:43 +01:00
orbiter
909bbb49d8 added (partly commented) test code for url rewrite methods .. to be
completed
2013-11-10 18:50:34 +01:00
Michael Peter Christen
acc1f8a749 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-07 12:01:37 +01:00
Michael Peter Christen
81bb50118e found and fixed a huge memory leak in solr caching (inside Solr). The
not-flushed Solr cache is now handled in this way:
- it is smaller by default
- an Solr-internal process is started to flush the cache periodically
(this does NOT clean the cache, just removes old objects)
- a Solr-external process (the standard YaCy cleanup-process) now has
direct access to the solr internal cache and flushes them completely.
The time frame for such a flush is defined by the cleanup-process
frequency, by default 10 minutes.
2013-11-07 10:01:44 +01:00
reger
7b17cdf6dd add content_type:image/* to image search
- see numerous idx entries with content_type image without url_file_ext_s (for various reason) which should be included in result
- try it yourself with following sample query
   /solr/select?q=content_type:image/* AND -url_file_ext_s:[* TO *]&defType=edismax&fl=sku,url_file_ext_s,content_type

adresses also possible url without or deviating extension.
2013-11-07 03:11:03 +01:00
sixcooler
987f410011 URL-export:add query and fix for cast-class-exception 2013-11-06 19:22:26 +01:00
Michael Peter Christen
0cf9e9580b added clickdepth and CR computation debug code to verify that the
process is complete
2013-11-06 15:01:40 +01:00
Michael Peter Christen
234a974955 load image only if their parser flag is activated 2013-11-04 11:59:28 +01:00
Michael Peter Christen
e1c1e57877 less overhead calling exist() with only one hash 2013-11-04 09:37:31 +01:00
Michael Peter Christen
5a02d650ee avoid cloning 2013-11-03 18:31:50 +01:00
Michael Peter Christen
cc39667399 Speed enhancements and less CPU usage during Solr searches when using
the embedded Solr (the default). This was obtained by cirumventing solrj
search encapsulation and the implementation of direct index access
methods to Solr.
The effect will not only be seen during search, but this has also a
strong effect on suggestions (much more) and less CPU power usage during
index distribution (which needs many search requests)
2013-11-01 17:24:36 +01:00
Michael Peter Christen
434e13b46d in host browser also show the properties of failed documents including
referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)
2013-11-01 13:30:53 +01:00
Michael Peter Christen
9bb7eab389 hacks to prevent storage of data longer than necessary during search and
some speed enhancements. This should reduce the memory usage during
heavy-load search a bit.
2013-10-25 15:05:30 +02:00
Michael Peter Christen
5afa6e3aee Automatically flush the log cache if a short memory status is reached.
For the default of 200 lines this can flush about 10MB.
2013-10-24 17:39:50 +02:00
Michael Peter Christen
030d0776ff Enhanced crawl start for very, very large crawl lists (i.e. > 5000)
which had a problem because of badly used concurrency.
This fix also caused a redesign of the whole host deletion process.
This should fix bug http://bugs.yacy.net/view.php?id=250
2013-10-24 16:20:20 +02:00
Michael Peter Christen
1b4fa2947d - fixed a problem which ocurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together makes it now possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
Michael Peter Christen
82621bead0 When doing bootstraping, always accept one seedlist-File without
checking the date of the file. This should help to start the peer in
case that the user has a completely wrong date setting.
2013-10-22 15:34:51 +02:00
Michael Peter Christen
691d7e70fa added hint to development/commit rss feed 2013-10-21 15:16:29 +02:00
Michael Peter Christen
c833d02cf5 fixed webgraph postprocessing (did nothing and repeated to do this...) 2013-10-16 11:49:04 +02:00
Michael Peter Christen
74d0256e93 enhanced postprocessing: fixed bugs, enable proper postprocessing also
without the harvestingkey, remove crawl profiles after postprocessing,
speed-up for clickdepth computation.
2013-10-16 11:27:06 +02:00
Michael Peter Christen
d328cc4a83 fix for didyoumean, added also more asian alphabets 2013-10-09 16:17:50 +02:00
Michael Peter Christen
90c8577840 enhanced ranking; patches to replace old ranking 2013-10-09 15:10:03 +02:00
Michael Peter Christen
1b61bd40ed - Added new solr field url_file_name_tokens_t which stores the file name
tokens. This can be used to enhance the ranking.
- Added also a rating_i field as basis for later usage.
- enhanced the tokenization process.
2013-10-08 23:48:13 +02:00
orbiter
5f5a97bafc added the anchor text within web pages to the searcheable entities of a
web page. This can be of benefit for the ranking if these fields are
used for boosts.
2013-10-08 18:41:07 +02:00
orbiter
705b3338ee list more fields available for search and for ranking boosts 2013-10-08 18:15:35 +02:00
Michael Peter Christen
78e7aadb26 removed unused initialization method 2013-10-07 23:51:28 +02:00
Michael Peter Christen
4fbc4740df removed warnings 2013-10-07 23:41:50 +02:00