Commit Graph

16 Commits

Author SHA1 Message Date
Michael Peter Christen
47b1c81d08 - refactoring
- generalized writing of url attributes to solr documents
- added more url attributes to error documents
2013-08-20 15:46:04 +02:00
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
Michael Peter Christen
5c6946dd5f replaced usage of log4j by ConcurrentLog where possible
2013-07-09 14:42:39 +02:00
Michael Peter Christen
8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
reduced time-out of robots.txt load limit
2013-05-20 22:05:28 +02:00
Michael Peter Christen
038f956821 fix for sitemap detection: the sitemap url was not visible if it
appeared after the declaration of robots allow/deny for the crawler
because the sitemap parser terminated after the allow/deny rules had
been found. Now the parser reads the robots.txt until the end to
discover also sitemap rules at the end of the file.
2013-05-10 04:56:58 +02:00
Michael Peter Christen
af465cdca5 fix for wrong robots.txt loading for https protocol
see also: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4579
2013-01-16 17:38:06 +01:00
Michael Peter Christen
8f3bd0c387 fix for smb crawl situation (lost too many urls)
2012-12-26 19:15:11 +01:00
orbiter
5aa5202adf fixes for filesystem indexing
2012-11-24 10:27:29 +01:00
Michael Peter Christen
71ed8e5e07 bugfixes for crawler
2012-11-07 12:52:19 +01:00
Michael Peter Christen
0fe8be7981 enhanced data structures for balancer and latency computation which
should produce a bit better prognosis about forced waiting times.
2012-10-30 17:30:24 +01:00
Michael Peter Christen
0833937c1c better balancing and due-time computation also for no-delay intranet
hosts
2012-10-30 11:28:49 +01:00
Michael Peter Christen
c25d7bcb80 - added concurrency for robots.txt loading
- changed data model for domain counter
2012-10-29 21:08:45 +01:00
Michael Peter Christen
2d9e577ad0 replaced the custom robots.txt loader by the standard http loader
2012-10-28 22:48:11 +01:00
Michael Peter Christen
a33e2742cb - removed unnecessary synchronized and deadlock in crawler
- removed problem with monitoring object on Balancer.wait
- added missing user agent settings
2012-10-28 19:56:02 +01:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of & parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
00c1c777fa refactoring
2012-09-21 15:48:16 +02:00
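
Illustration for commit 038f956821 (sitemap detection): the message describes a parser that used to stop once the allow/deny rules were found and now reads robots.txt to the end. The following is a minimal Java sketch of that idea, not the actual YaCy code; the class name RobotsSitemapSketch and the method findSitemaps are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsSitemapSketch {

    /**
     * Collects every "Sitemap:" URL in a robots.txt body.
     * The loop always runs to the last line, so entries that
     * follow the allow/deny rules are not missed.
     */
    public static List<String> findSitemaps(final String robotsTxt) {
        final String prefix = "sitemap:";
        final List<String> sitemaps = new ArrayList<>();
        for (final String rawLine : robotsTxt.split("\\r?\\n")) {
            // strip a trailing comment and surrounding whitespace
            final int hash = rawLine.indexOf('#');
            final String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            // field names are matched case-insensitively
            if (line.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                final String url = line.substring(prefix.length()).trim();
                if (!url.isEmpty()) sitemaps.add(url);
            }
            // User-agent/Allow/Disallow lines are ignored in this sketch;
            // the important point is that seeing them does not end the loop
        }
        return sitemaps;
    }

    public static void main(final String[] args) {
        final String robots = "User-agent: *\n"
                + "Disallow: /private/\n"
                + "Allow: /public/\n"
                + "Sitemap: http://example.org/sitemap.xml\n";
        System.out.println(findSitemaps(robots));
    }
}
```

The sketch handles only the sitemap lines; a real parser would evaluate the allow/deny rules in the same pass instead of returning as soon as they are complete.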