Commit Graph

696 Commits

Author SHA1 Message Date
reger
1437c45383 merge rc1/master 2013-11-07 21:30:17 +01:00
Michael Peter Christen
cc39667399 Speed enhancements and less CPU usage during Solr searches when using
the embedded Solr (the default). This was obtained by cirumventing solrj
search encapsulation and the implementation of direct index access
methods to Solr.
The effect will not only be seen during search, but this has also a
strong effect on suggestions (much more) and less CPU power usage during
index distribution (which needs many search requests)
2013-11-01 17:24:36 +01:00
Michael Peter Christen
434e13b46d in host browser also show the properties of failed documents including
referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)
2013-11-01 13:30:53 +01:00
Michael Peter Christen
9bb7eab389 hacks to prevent storage of data longer than necessary during search and
some speed enhancements. This should reduce the memory usage during
heavy-load search a bit.
2013-10-25 15:05:30 +02:00
Michael Peter Christen
5afa6e3aee Automatically flush the log cache if a short memory status is reached.
For the default of 200 lines this can flush about 10MB.
2013-10-24 17:39:50 +02:00
Michael Peter Christen
030d0776ff Enhanced crawl start for very, very large crawl lists (i.e. > 5000)
which had a problem because of badly used concurrency.
This fix also caused a redesign of the whole host deletion process.
This should fix bug http://bugs.yacy.net/view.php?id=250
2013-10-24 16:20:20 +02:00
Michael Peter Christen
1b4fa2947d - fixed a problem which ocurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together makes it now possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
Michael Peter Christen
82621bead0 When doing bootstraping, always accept one seedlist-File without
checking the date of the file. This should help to start the peer in
case that the user has a completely wrong date setting.
2013-10-22 15:34:51 +02:00
Michael Peter Christen
691d7e70fa added hint to development/commit rss feed 2013-10-21 15:16:29 +02:00
Michael Peter Christen
c833d02cf5 fixed webgraph postprocessing (did nothing and repeated to do this...) 2013-10-16 11:49:04 +02:00
Michael Peter Christen
74d0256e93 enhanced postprocessing: fixed bugs, enable proper postprocessing also
without the harvestingkey, remove crawl profiles after postprocessing,
speed-up for clickdepth computation.
2013-10-16 11:27:06 +02:00
Michael Peter Christen
d328cc4a83 fix for didyoumean, added also more asian alphabets 2013-10-09 16:17:50 +02:00
Michael Peter Christen
90c8577840 enhanced ranking; patches to replace old ranking 2013-10-09 15:10:03 +02:00
Michael Peter Christen
1b61bd40ed - Added new solr field url_file_name_tokens_t which stores the file name
tokens. This can be used to enhance the ranking.
- Added also a rating_i field as basis for later usage.
- enhanced the tokenization process.
2013-10-08 23:48:13 +02:00
orbiter
5f5a97bafc added the anchor text within web pages to the searcheable entities of a
web page. This can be of benefit for the ranking if these fields are
used for boosts.
2013-10-08 18:41:07 +02:00
orbiter
705b3338ee list more fields available for search and for ranking boosts 2013-10-08 18:15:35 +02:00
Michael Peter Christen
78e7aadb26 removed unused initialization method 2013-10-07 23:51:28 +02:00
Michael Peter Christen
4fbc4740df removed warnings 2013-10-07 23:41:50 +02:00
Michael Peter Christen
21aa6a0321 migration to Solr 4.5.0 2013-10-07 17:09:40 +02:00
Michael Peter Christen
101a6e6e14 Patch the citation index for links with canonical tags.
This shall fulfill the following requirement:
If a document A links to B and B contains a 'canonical C', then the
citation rank computation shall consider that A links to C and B does
not link to C.
To do so, we first must collect all canonical links, find all references
to them, get the anchor list of the documents and patch the citation
reference of these links.
2013-10-07 11:15:58 +02:00
Michael Peter Christen
b28d43decc added two more fields source_cr_host_norm_i,target_cr_host_norm_i in
webgraph and an addition to postprocessing to copy all cr ranking
attributes to the link edges associated to the postprocessing documents
2013-09-27 16:57:05 +02:00
Michael Peter Christen
a52f3a597e fix for canonical-from-http-header feature 2013-09-27 15:09:04 +02:00
Michael Peter Christen
2dd7c5be44 added parsing of http-canonical tags (untested, could not find an
example page)
2013-09-27 13:17:50 +02:00
Michael Peter Christen
3bf0104199 fix for crawl domain counter limitation (limit was reached too early) 2013-09-26 13:41:52 +02:00
Michael Peter Christen
82bfd9e00a - crawl profiles shall be deleted from active and passive stacks if they
are deleted to terminate the crawl because otherwise the crawl will go
on after the load-from-passive stack policy.
- better check if a crawl is terminated using the loader queue.
2013-09-26 10:22:31 +02:00
Michael Peter Christen
91a875dff5 self-healing of mistakenly deactivated crawl profiles. This fixes a bug
which can happen in rare cases when a crawl start and a cleanup process
happen at the same time.
2013-09-25 18:27:54 +02:00
Michael Peter Christen
095053a9b4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-25 17:32:52 +02:00
sixcooler
0cae420d8e some dns-timing changes:
since httpclient uses the domain-cache it is useful not to clean the
domain cache until crawling is running (domains are filled into this
cache)
On huge crawl-starts (eg. from file) my DNS did not follow the high
rates - so I reduced the rate and give some more time(-out)
2013-09-25 15:01:28 +02:00
Michael Peter Christen
4f83d5f18c added the new field harvestkey_s to the collection index and the
webgraph index which is temporary filled with the crawl profile key.
This is used to select a set of documents for post-processing as soon as
a crawl is finished. Now the postprocessing for a specific crawl is
started when that specific crawl is finished and not at the end of all
post-processing steps.
2013-09-25 14:38:24 +02:00
orbiter
14442efa6d when profiles are cleaned, there shall be first a callback showing which
profiles are cleaned. This shall enable a profile-termination-driven
postprocessing. To do this, index writings must carry the profile key
which will be implemented in another (next) step.
2013-09-25 11:04:12 +02:00
orbiter
8ac2e8c8c9 added location navigator which causes that the image to the map search
is visible whenever a location is available in the search result.
To activate this, the search.navigation property in yacy.conf must be
modified to the new default values.
2013-09-24 11:26:51 +02:00
Michael Peter Christen
96ed0c980e - added hosthash to all documents (also fail documents which is needed
there for deletion), this fixes a problem for the deletion of old
documents for new crawl starts
- added clickdepth and citation computation for fail documents
2013-09-23 18:09:42 +02:00
orbiter
828603e4f1 fix for 100%CPU problem in error cache cleaning process 2013-09-21 10:20:13 +02:00
orbiter
c64b51134e hack to add all tokens from the url to text_t. This was working for the
RWI index (and still is working) but not for solr-only search indexes.
Maybe we should find a solution using a separate search field instead.
2013-09-21 08:57:43 +02:00
orbiter
f3be1930cb CPU problem when pusing to the error cache; wrong class,
ConcurrentHashMap needed for concurrency
2013-09-20 16:51:50 +02:00
Michael Peter Christen
e40671ddb7 better and consistent deletions for error urls 2013-09-17 15:52:57 +02:00
Michael Peter Christen
2602be8d1e - removed ZURL data structure; removed also the ZURL data file
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: added must-match pattern
application to feed urls to filter out such urls which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs are now temporary objects in
memory
2013-09-17 15:27:02 +02:00
Michael Peter Christen
31920385f7 set anchor rel attribute of all links to "nofollow" if the html meta
contains a robots:nofollow or if the http header contains a
"X-Robots-Tag: nofollow"
2013-09-16 16:14:56 +02:00
Michael Peter Christen
61c5e40687 - replaced the properties object in AnchorURL with distinct variables
for anchor attributes.
- this caused that large portions of the parser code had to be adopted
as well
- added a counter target_order_i for anchor links in webgraph
computation
2013-09-15 23:27:04 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
all unique links! This made it necessary, that a large portion of the
parser and link processing classes must be adopted to carry a different
type of link collection which carry a property attribute which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
html meta fields to get a correct (or: better) date timestamp. The
http:last-modified mostly does not work because it is set to the current
date from most CMS.
2013-09-10 10:31:57 +02:00
Michael Peter Christen
9cc8468b30 added tools to visualize image generation (i.e. during testing) 2013-09-09 12:58:26 +02:00
Michael Peter Christen
dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that
appears in the crawl url list
2013-09-05 13:22:16 +02:00
Michael Peter Christen
e137ff4171 refactoring (im preparation for new removeHost method) 2013-09-05 09:59:41 +02:00
Michael Peter Christen
7a5574cd51 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-04 23:12:04 +02:00
Michael Peter Christen
85456f46b2 added two new fields, exact_signature_copycount_i and
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assigns this to each document. Thus, each
document there is a number assigned which shows how many copies of this
document exists.
These fields are disabled by default.
2013-09-04 23:11:53 +02:00
orbiter
26366596d9 fix for a problem which ocurres when a site is crawled where the start
url is redirected.
2013-09-04 16:00:47 +02:00
Michael Peter Christen
a2511b5600 turned images_alt_txt back to images_alt_sxt because it is not necessary
to index the alt text. Indexed image Text is in images_text_t
2013-09-04 10:47:18 +02:00
Michael Peter Christen
85b1922244 activated image type navigation for image search 2013-09-03 13:34:01 +02:00
Michael Peter Christen
9e12fdff23 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-03 12:22:57 +02:00