Commit Graph

261 Commits

Author SHA1 Message Date
Michael Peter Christen
90c8577840 enhanced ranking; patches to replace old ranking 2013-10-09 15:10:03 +02:00
Michael Peter Christen
3bf0104199 fix for crawl domain counter limitation (limit was reached too early) 2013-09-26 13:41:52 +02:00
Michael Peter Christen
82bfd9e00a - crawl profiles shall be deleted from active and passive stacks if they
are deleted to terminate the crawl because otherwise the crawl will go
on after the load-from-passive stack policy.
- better check if a crawl is terminated using the loader queue.
2013-09-26 10:22:31 +02:00
Michael Peter Christen
91a875dff5 self-healing of mistakenly deactivated crawl profiles. This fixes a bug
which can happen in rare cases when a crawl start and a cleanup process
happen at the same time.
2013-09-25 18:27:54 +02:00
Michael Peter Christen
095053a9b4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-25 17:32:52 +02:00
sixcooler
0cae420d8e some dns-timing changes:
since httpclient uses the domain-cache it is useful not to clean the
domain cache until crawling is running (domains are filled into this
cache)
On huge crawl-starts (eg. from file) my DNS did not follow the high
rates - so I reduced the rate and give some more time(-out)
2013-09-25 15:01:28 +02:00
Michael Peter Christen
4f83d5f18c added the new field harvestkey_s to the collection index and the
webgraph index which is temporary filled with the crawl profile key.
This is used to select a set of documents for post-processing as soon as
a crawl is finished. Now the postprocessing for a specific crawl is
started when that specific crawl is finished and not at the end of all
post-processing steps.
2013-09-25 14:38:24 +02:00
orbiter
14442efa6d when profiles are cleaned, there shall be first a callback showing which
profiles are cleaned. This shall enable a profile-termination-driven
postprocessing. To do this, index writings must carry the profile key
which will be implemented in another (next) step.
2013-09-25 11:04:12 +02:00
Michael Peter Christen
e40671ddb7 better and consistent deletions for error urls 2013-09-17 15:52:57 +02:00
Michael Peter Christen
2602be8d1e - removed ZURL data structure; removed also the ZURL data file
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: added must-match pattern
application to feed urls to filter out such urls which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs are now temporary objects in
memory
2013-09-17 15:27:02 +02:00
Michael Peter Christen
61c5e40687 - replaced the properties object in AnchorURL with distinct variables
for anchor attributes.
- this caused that large portions of the parser code had to be adopted
as well
- added a counter target_order_i for anchor links in webgraph
computation
2013-09-15 23:27:04 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
all unique links! This made it necessary, that a large portion of the
parser and link processing classes must be adopted to carry a different
type of link collection which carry a property attribute which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
9cc8468b30 added tools to visualize image generation (i.e. during testing) 2013-09-09 12:58:26 +02:00
Michael Peter Christen
dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that
appears in the crawl url list
2013-09-05 13:22:16 +02:00
Michael Peter Christen
e137ff4171 refactoring (im preparation for new removeHost method) 2013-09-05 09:59:41 +02:00
orbiter
26366596d9 fix for a problem which ocurres when a site is crawled where the start
url is redirected.
2013-09-04 16:00:47 +02:00
Michael Peter Christen
69f85265e1 added an option to put image links to the crawl queue and handle these
like normal documents. Using this option (by default on at this moment;
this might change soon) it is possible to get the exif data into the
search index to be used in image search.
2013-09-03 11:13:45 +02:00
orbiter
f106345eef link strings should not be tokenized 2013-09-01 14:35:36 +02:00
Michael Peter Christen
a88a62f7aa added a feature to set a collection for a crawl result based on a
regular expression on th url: the collection attribut for a crawl start
may be now either a token or a list of tokens, seperated by ',' where a
token is either a string or a pair <string,pattern> where the string is
separated to the pattern with a ':' and the string is assigned to the
document as collection only if the pattern matches with the url.
2013-08-25 00:13:48 +02:00
Michael Peter Christen
765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
in intranets and the internet can now choose to appear as Googlebot.
This is an essential necessity to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no other search platform any more. All
commercial search engine providers have a built-in fake-Google User
Agent to be able to get the same search index as Google can do. Without
the resistance against obeying to robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on per-crawl-start basis. Every crawl start
can have a different user agent.
2013-08-22 14:23:47 +02:00
Michael Peter Christen
58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-07-30 12:49:14 +02:00
Michael Peter Christen
cf12835f20 replaced the single-text description solr field with a multi-value
description_txt text field
2013-07-30 12:48:57 +02:00
orbiter
d05e0c5368 wait a bit longer before doing the first peer ping 2013-07-27 11:00:35 +02:00
orbiter
b8f57f7703 don't be noisy when doing background tasks that may be allowed to fail 2013-07-27 10:51:58 +02:00
Roland Haeder
7263bb82fb Fix for NPE on shutdown:
java.lang.NullPointerException
        at net.yacy.search.Switchboard.storeDocumentIndex(Switchboard.java:2732)
        at net.yacy.search.Switchboard.access00(Switchboard.java:207)
        at net.yacy.search.Switchboard.run(Switchboard.java:3049)
2013-07-27 09:55:43 +02:00
Michael Peter Christen
61e015268b fix in forced deletion: forced commit needed 2013-07-25 09:53:19 +02:00
orbiter
3e901dcb06 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-07-23 19:33:07 +02:00
orbiter
f50b596e0b do not run dht ditribution if system load is over 2.5 2013-07-23 19:32:32 +02:00
orbiter
6fb2811e68 fixes for problems with remote solr and non-activated webgraph index 2013-07-23 16:46:44 +02:00
sixcooler
af740f3058 changed optimization to a segment-size of index-size/5.000.000
+ one if not idle
+ one (and force) if postprocessing
2013-07-23 14:21:12 +02:00
orbiter
5364c4dcc9 delayed first peer-ping to send the first ping out after the http got
up; if the ping comes before the http is up, it cannot be recognized as
senior peer (if at all). See also: http://bugs.yacy.net/view.php?id=266
2013-07-22 18:21:37 +02:00
orbiter
e24016e30a added the property federated.service.solr.indexing.timeout to yacy.init
to provide a configurable time-out for solr; see also:
http://bugs.yacy.net/view.php?id=254
2013-07-22 17:45:12 +02:00
orbiter
c124037f19 removed forced non-soft commits to prevent index fragmentation 2013-07-22 17:28:20 +02:00
Michael Peter Christen
c15aa758dc removed failreason_t removal patch because that causes too much
confusion using an external solr. to clean up the index after a schema
change, use the index cleaner function from the online servlet
2013-07-22 14:17:38 +02:00
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
Michael Peter Christen
89c0aa0e74 added collection_sxt to error documents 2013-07-17 15:20:56 +02:00
orbiter
d0dc86cf3d logging of deadlocks (if any) during cleanup process 2013-07-17 12:38:58 +02:00
Michael Peter Christen
c6a6f159e8 fix for crawl stack domain counter 2013-07-16 18:18:55 +02:00
Michael Peter Christen
93d1bac140 do a more frequent optimization, reduces IO after optimization 2013-07-16 17:16:48 +02:00
Michael Peter Christen
b79471ee67 grr 2013-07-14 10:15:47 +02:00
Michael Peter Christen
a79f288ac1 automatically running optimize on solr if user/search is idle for some
time
2013-07-14 10:02:08 +02:00
orbiter
a9c8046c87 do a light optimization at the end of a crawl postprocessing 2013-07-13 19:09:46 +02:00
Michael Peter Christen
bcc623a843 refactoring of load_delay: this is a matter of client identification 2013-07-12 16:24:56 +02:00
orbiter
0d0b3a30f5 activate api actions after postprocessing of crawls 2013-07-12 16:05:48 +02:00
orbiter
2be456e7fb added a postprocessing field into api/status_p.xml to show if the
postprocessing task is running at that time (status: busy) or not
(status:idle)
2013-07-12 14:29:22 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based logger tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is a add-on to jdk logging to put log entries on a
concurrent message queue and log the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
Michael Peter Christen
57ffdfad4c added a crawl option to obey html-meta-robots-noindex. This is on by
default.
2013-07-03 14:50:06 +02:00
Michael Peter Christen
f1c5338210 prepartion for greedy crawl profiles and refactoring 2013-07-01 13:10:09 +02:00
Michael Peter Christen
f9d859f5dc now writing image alt texts and (camelcase-)parsed urls into a text
search field for a better image retrieval
2013-06-18 16:51:56 +02:00
Michael Peter Christen
bdf306e0a7 increased time-out for loading of seed-lists 2013-06-13 22:32:06 +02:00