Commit Graph

23 Commits

Author SHA1 Message Date
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
all unique links! This made it necessary, that a large portion of the
parser and link processing classes must be adopted to carry a different
type of link collection which carry a property attribute which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
a88a62f7aa added a feature to set a collection for a crawl result based on a
regular expression on th url: the collection attribut for a crawl start
may be now either a token or a list of tokens, seperated by ',' where a
token is either a string or a pair <string,pattern> where the string is
separated to the pattern with a ':' and the string is assigned to the
document as collection only if the pattern matches with the url.
2013-08-25 00:13:48 +02:00
Michael Peter Christen
765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
in intranets and the internet can now choose to appear as Googlebot.
This is an essential necessity to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no other search platform any more. All
commercial search engine providers have a built-in fake-Google User
Agent to be able to get the same search index as Google can do. Without
the resistance against obeying to robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on per-crawl-start basis. Every crawl start
can have a different user agent.
2013-08-22 14:23:47 +02:00
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based logger tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is a add-on to jdk logging to put log entries on a
concurrent message queue and log the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
Michael Peter Christen
57ffdfad4c added a crawl option to obey html-meta-robots-noindex. This is on by
default.
2013-07-03 14:50:06 +02:00
Michael Peter Christen
f1c5338210 prepartion for greedy crawl profiles and refactoring 2013-07-01 13:10:09 +02:00
Michael Peter Christen
25499eead5 - added a new field for the regular expression in crawl start
- added the field in crawl profile
- adopted logging end error management
- adopted duplicate document detection
- added a new rule to the indexing process to reject non-matching
content
- full redesign of the expert crawl start servlet
The new filter field can now be seen in /CrawlStartExpert_p.html at
Section "Document Filter", subsection item "Filter on Content of
Document"
2013-04-26 10:49:55 +02:00
Michael Peter Christen
3de784c8dd replaced more split and replaceAll missing pattern pre-compilation with
pre-compiled pattern
2012-11-26 13:40:53 +01:00
orbiter
5aa5202adf fixes for filesystem indexing 2012-11-24 10:27:29 +01:00
orbiter
354ef8000d - added 'deleteold' option to crawler which causes that documents are
deleted which are selected by a crawl filter (host or subpath)
- site crawl used this option be default now
- made option to deleteDomain() concurrency
2012-11-04 02:58:26 +01:00
Michael Peter Christen
0716a24737 added more / all new crawl profile fields into crawl profile editor 2012-10-31 15:13:05 +01:00
Michael Peter Christen
4a14122ba7 in case that a crawl profile has a collection assigned, use the
collection to show a name in the web interface. This should prevent that
much too long names make the interface unusable.
2012-10-31 14:08:33 +01:00
Michael Peter Christen
ac9540dfb6 removed options for stopwords which are not used 2012-10-30 12:36:36 +01:00
Michael Peter Christen
c25d7bcb80 - added concurrency for robots.txt loading
- changed data model for domain counter
2012-10-29 21:08:45 +01:00
Michael Peter Christen
85ca07b90e when a new crawl is started, an equal crawl, if still running, is
terminated and the corresponding crawl profile is deleted (this also
clears the crawl queue entries for that crawl profile)
2012-10-25 10:20:55 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of &amp; parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
53789555b9 fix for crawl start filter 2012-10-10 10:40:32 +02:00
Michael Peter Christen
76d218fbef fixes to crawl profiles 2012-10-08 10:50:40 +02:00
Michael Peter Christen
1533bfd63b refactoring 2012-09-25 21:20:03 +02:00
Michael Peter Christen
8219a445f3 refactoring 2012-09-21 16:46:57 +02:00
Michael Peter Christen
f879a344e7 fix for no depth limit default value 2012-09-21 16:05:17 +02:00
Michael Peter Christen
00c1c777fa refactoring 2012-09-21 15:48:16 +02:00