Commit Graph

107 Commits

Author SHA1 Message Date
Michael Peter Christen
e1bc768f9d more IPv6 bugfixes 2014-10-06 17:44:27 +02:00
reger
fb1fcc2b03 handle noarchive tag, skip writing page to cache
http://mantis.tokeek.de/view.php?id=44
2014-10-01 04:35:34 +02:00
Michael Peter Christen
9ac0c93f17 fix for subpath crawl filter 2014-08-06 01:33:24 +02:00
Michael Peter Christen
66106bdaf0 fix for crawler attribute maxdompages 2014-08-05 21:32:25 +02:00
Michael Peter Christen
98f45c9032 fix for image alt attachment to AnchorURLs in html parser. 2014-08-01 12:04:15 +02:00
Michael Peter Christen
542c20a597 changed handling of crawl profile field crawlingIfOlder: this should be
filled with the date, when the url is recognized as to be outdated. That
field was partly misinterpreted and the time interval was filled in. In
case that all the urls which are in the index shall be treated as
outdated, the field is filled now with Long.MAX_VALUE because then all
crawl dates are before that date and therefore outdated.
2014-07-22 00:23:17 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL Object type which
now also has a different toString method that the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls, just just toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
Michael Peter Christen
b5fc2b63ea removed exist() retrieval functions from error cache and replaced it
with metadata retrieval from connectors directly. This should cause
better usage of the cache. Automatically increase the metadata cache if
more memory is available.
2014-07-11 19:52:25 +02:00
Michael Peter Christen
62c72360ee cleanup of checkAcceptanceInitially in CrawlStacker, should avoid
double-calling of solr
2014-07-11 18:36:04 +02:00
Michael Peter Christen
b5d78ba156 reduced number of solr queries during crawling 2014-07-11 18:05:11 +02:00
orbiter
d7d38f9135 made number of open files in crawler configurable and increased default
maximum number of open files from 100 to 1000. This number can be
changed with the attribut crawler.onDemandLimit
2014-05-31 09:29:55 +02:00
Michael Peter Christen
9c6228d948 fix for deadlocks in crawler 2014-04-17 16:58:17 +02:00
Michael Peter Christen
10cf8215bd added crawl depth for failed documents 2014-04-17 13:21:43 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into it's own queue. The primary
rule for urls taken from any queue is, that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermorem the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create 10.000s of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such on-demand file reader shall prevent that the number of file
pointers is over the system limit, which is usually about 10.000 open
files. Some parts of YaCy had to be adopted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adopted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
Michael Peter Christen
075b6f9278 refactoring of the crawl balancer: the balancer is turned into an
interface and the old balancer class is moved into LegacyBalancer to
make room for a fresh implementation of a crawl balancer.
2014-04-14 13:32:35 +02:00
Michael Peter Christen
6bd8c6f195 fix for wrong status codes of error pages 2014-04-10 09:08:59 +02:00
Michael Peter Christen
926d28dd3f fixed a bug which prevented crawl starts after a network switch 2014-04-04 14:43:35 +02:00
Michael Peter Christen
d4b5c457e4 NPE fix 2014-04-04 12:34:34 +02:00
Michael Peter Christen
b08375da33 fix for bad/missing values of size_i 2014-03-11 09:51:04 +01:00
Michael Peter Christen
e485fbd0ce - let crawl loader jobs die after 10 seconds without new jobs
- corrected shutdown order t prevent a deadlock during shutdown
2014-03-04 00:33:13 +01:00
Michael Peter Christen
bcd9dd9e1d enhanced concurrent loading by using a fixed set of concurrent loader
processes in favor of throwaway-processes. The control mechanism does
less often report a 'queue full' message to the busy loop which then
does not perform a long busy waiting; instead all requests are queued
and new loader processes are started if necessary up to a given limit
(as set before)
2014-03-03 22:13:40 +01:00
Michael Peter Christen
6ed9c0164e attaching names to all Threads to get a better view in profiling tools
like VisualVM
2014-02-28 15:02:01 +01:00
Michael Peter Christen
fdaeac374a - enhanced postprocessing speed and memory footprint (by using HashMaps
instead of TreeMaps)
- enhanced memory footprint of database indexes (by introduction of
optimize calls)
- optimize calls shrink the amount of used memory for index sets if they
are not changed afterwards any more
2014-02-28 14:01:09 +01:00
orbiter
ced1a96f9c fixed error cache 2014-02-25 02:16:22 +01:00
Michael Peter Christen
6ada0daae9 making latency_factor and maximum number of same hosts in loader queue
settings available in Crawler_p.html servlet for steering.
2014-01-21 19:28:00 +01:00
Michael Peter Christen
0168f80c28 new crawling factors can now be changed during runtime 2014-01-21 17:52:16 +01:00
Michael Peter Christen
77531850b5 reverted crawling strategy from latest commit. 2014-01-21 16:05:55 +01:00
Michael Peter Christen
c0da966dfa enhanced crawler speed 2014-01-20 21:46:40 +01:00
Michael Peter Christen
0d235a565b cleanup crawl loader jobs 2014-01-20 18:36:00 +01:00
reger
28eae57e8b spend CrawlQueues a fremem routine
- clears errorStack
- will not get hit often (but better little than nothing on low mem)
2014-01-10 10:24:33 +01:00
Michael Peter Christen
1a4a69c226 set more logger to 'final static' 2013-11-13 06:18:48 +01:00
Michael Peter Christen
87a956e881 calculating and showing the number of files and the average size of a
file in the HTCACHE in ConfigHTCache_p.html
2013-11-07 12:13:12 +01:00
orbiter
20bbde8665 fix for mustmatch regex computation: result had correct semantic, but
may have contained multiple same expressions within the disjunction of
domain-restrictions. This fix removes the redundant restrictions and
makes the regex shorter.
2013-10-18 13:55:37 +02:00
Michael Peter Christen
82bfd9e00a - crawl profiles shall be deleted from active and passive stacks if they
are deleted to terminate the crawl because otherwise the crawl will go
on after the load-from-passive stack policy.
- better check if a crawl is terminated using the loader queue.
2013-09-26 10:22:31 +02:00
Michael Peter Christen
91a875dff5 self-healing of mistakenly deactivated crawl profiles. This fixes a bug
which can happen in rare cases when a crawl start and a cleanup process
happen at the same time.
2013-09-25 18:27:54 +02:00
Michael Peter Christen
4f83d5f18c added the new field harvestkey_s to the collection index and the
webgraph index which is temporary filled with the crawl profile key.
This is used to select a set of documents for post-processing as soon as
a crawl is finished. Now the postprocessing for a specific crawl is
started when that specific crawl is finished and not at the end of all
post-processing steps.
2013-09-25 14:38:24 +02:00
orbiter
0e8d752462 refactoring 2013-09-24 19:55:59 +02:00
Michael Peter Christen
e40671ddb7 better and consistent deletions for error urls 2013-09-17 15:52:57 +02:00
Michael Peter Christen
2602be8d1e - removed ZURL data structure; removed also the ZURL data file
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: added must-match pattern
application to feed urls to filter out such urls which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs are now temporary objects in
memory
2013-09-17 15:27:02 +02:00
Michael Peter Christen
61c5e40687 - replaced the properties object in AnchorURL with distinct variables
for anchor attributes.
- this caused that large portions of the parser code had to be adopted
as well
- added a counter target_order_i for anchor links in webgraph
computation
2013-09-15 23:27:04 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
all unique links! This made it necessary, that a large portion of the
parser and link processing classes must be adopted to carry a different
type of link collection which carry a property attribute which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
1a8c64117f decreased the responseHeaderDB database which is now flushed more
frequently. This will preserve more documents in the cache in case of a
crash.
2013-09-11 13:03:58 +02:00
Michael Peter Christen
dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that
appears in the crawl url list
2013-09-05 13:22:16 +02:00
Michael Peter Christen
e137ff4171 refactoring (im preparation for new removeHost method) 2013-09-05 09:59:41 +02:00
orbiter
26366596d9 fix for a problem which ocurres when a site is crawled where the start
url is redirected.
2013-09-04 16:00:47 +02:00
Michael Peter Christen
a88a62f7aa added a feature to set a collection for a crawl result based on a
regular expression on th url: the collection attribut for a crawl start
may be now either a token or a list of tokens, seperated by ',' where a
token is either a string or a pair <string,pattern> where the string is
separated to the pattern with a ':' and the string is assigned to the
document as collection only if the pattern matches with the url.
2013-08-25 00:13:48 +02:00
Michael Peter Christen
e4cbe9232d fixed a crawler bug where a double-occurring url was not re-crawled
because the double-check error was written to the error-db and never
deleted. No the error-db is cleared on every start and these
double-messages are not written to the error-db any more.
2013-08-22 15:56:09 +02:00
Michael Peter Christen
765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
in intranets and the internet can now choose to appear as Googlebot.
This is an essential necessity to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no other search platform any more. All
commercial search engine providers have a built-in fake-Google User
Agent to be able to get the same search index as Google can do. Without
the resistance against obeying to robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on per-crawl-start basis. Every crawl start
can have a different user agent.
2013-08-22 14:23:47 +02:00
Michael Peter Christen
cf12835f20 replaced the single-text description solr field with a multi-value
description_txt text field
2013-07-30 12:48:57 +02:00
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00