Michael Peter Christen
bb4bf3d8fd
infinity timeout bug protection patch
2013-04-30 11:06:48 +02:00
Michael Peter Christen
8f3bd0c387
fix for smb crawl situation (lost too many urls)
2012-12-26 19:15:11 +01:00
Michael Peter Christen
1e002ab18e
added another blacklist-cleaner into balancer
2012-12-07 01:27:24 +01:00
Michael Peter Christen
fa27e5820f
- check blacklist (again) when taking urls from the crawl stack because
the blacklist may get extended during crawling
- removed debug output
2012-12-06 00:12:16 +01:00
Michael Peter Christen
5e182a566f
- added another enumeration method in kelondro data structure to get a
more random access to data for the balancer
- added random access inside the balancer
2012-11-23 13:58:39 +01:00
Michael Peter Christen
75dd706e1b
update to HostBrowser:
- time-out after 3 seconds to speed up display (may be incomplete)
- showing also all links from the balancer queue in the host list (after
the '/') and in the result browser view with tag 'loading'
2012-11-02 13:57:43 +01:00
Michael Peter Christen
0fe8be7981
enhanced data structures for balancer and latency computation which
should produce a slightly better prognosis of forced waiting times.
2012-10-30 17:30:24 +01:00
Michael Peter Christen
0833937c1c
better balancing and duetime-computation also for no-delay intranet
hosts
2012-10-30 11:28:49 +01:00
Michael Peter Christen
c326aa8f67
disabled writing new entries to crawl stacks so that a domain
with many documents cannot block refreshing of the crawl queue
2012-10-29 22:26:52 +01:00
Michael Peter Christen
c25d7bcb80
- added concurrency for robots.txt loading
- changed data model for domain counter
2012-10-29 21:08:45 +01:00
Michael Peter Christen
2d9e577ad0
replaced the custom robots.txt loader by the standard http loader
2012-10-28 22:48:11 +01:00
Michael Peter Christen
a33e2742cb
- removed unnecessary synchronized and deadlock in crawler
- removed problem with monitoring object on Balancer.wait
- added missing user agent settings
2012-10-28 19:56:02 +01:00
orbiter
8952153ecf
update to Balancer algorithm:
- create a load list from the current list of known hosts
- do not create this list for each Balancer.pop access
- create the list from those hosts which have a zero-waiting time
- select 1/3 from that list which have the most urls waiting
- get hosts from the waiting list in random order
- fixes for some delta-time computations
- always load all urls from hosts which have never been loaded before
2012-10-28 13:24:49 +01:00
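The host selection described in the Balancer update above can be sketched roughly as follows. This is only an illustration of the listed steps, not the code from the commit: the class `BalancerSketch`, the type `HostState`, and the method `buildLoadList` are hypothetical names. It also assumes one reading of the last bullet, namely that hosts which have never been loaded are always put on the list. The caller would build the list once and reuse it across several pop operations instead of rebuilding it on every access.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; none of these names come from the actual code base.
final class BalancerSketch {

    /** Assumed per-host snapshot of the crawl queue state. */
    static final class HostState {
        final String host;
        final int urlsWaiting;      // urls queued for this host
        final long waitingTimeMs;   // remaining forced delay, 0 = may load now
        final boolean neverLoaded;  // true if nothing was fetched from the host yet

        HostState(String host, int urlsWaiting, long waitingTimeMs, boolean neverLoaded) {
            this.host = host;
            this.urlsWaiting = urlsWaiting;
            this.waitingTimeMs = waitingTimeMs;
            this.neverLoaded = neverLoaded;
        }
    }

    /** Builds a load list once; it is meant to be reused for many pops. */
    static List<String> buildLoadList(List<HostState> knownHosts, Random rnd) {
        List<HostState> ready = new ArrayList<>();
        List<String> loadList = new ArrayList<>();
        for (HostState h : knownHosts) {
            if (h.urlsWaiting <= 0) continue;       // nothing to load
            if (h.neverLoaded) {
                loadList.add(h.host);               // assumption: always included
            } else if (h.waitingTimeMs <= 0) {
                ready.add(h);                       // zero waiting time
            }
        }
        // Keep the third of the ready hosts that has the most urls waiting.
        ready.sort(Comparator.comparingInt((HostState h) -> h.urlsWaiting).reversed());
        int keep = (ready.size() + 2) / 3;          // ceil(size / 3)
        for (HostState h : ready.subList(0, keep)) {
            loadList.add(h.host);
        }
        // Hosts are then taken from the list in random order.
        Collections.shuffle(loadList, rnd);
        return loadList;
    }
}
```

Rounding the one-third share up is a choice made in this sketch so that a short ready list still yields at least one host; the commit message does not say how the share is rounded.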
Michael Peter Christen
1533bfd63b
refactoring
2012-09-25 21:20:03 +02:00
Michael Peter Christen
8219a445f3
refactoring
2012-09-21 16:46:57 +02:00
Michael Peter Christen
00c1c777fa
refactoring
2012-09-21 15:48:16 +02:00