Commit Graph

134 Commits

Author SHA1 Message Date
Michael Peter Christen
8f3bd0c387 fix for smb crawl situation (lost too many urls) 2012-12-26 19:15:11 +01:00
Michael Peter Christen
7c3de8b4cd - fix for localhost detection
- added IPv6 patterns for localhost detection
2012-12-18 12:52:20 +01:00
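As an illustration of what pattern-based localhost detection with IPv6 forms can look like, here is a minimal sketch. The class name, method name and the exact set of patterns are assumptions made for this example; they are not taken from the commit itself.

```java
import java.util.regex.Pattern;

/** Illustrative only; not the YaCy implementation. */
final class LocalhostSketch {
    // Host name "localhost" or any IPv4 address with a leading 127 octet
    // (octet ranges are not validated in this sketch).
    private static final Pattern IPV4_OR_NAME = Pattern.compile(
            "localhost|127\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}", Pattern.CASE_INSENSITIVE);
    // IPv6 loopback in compressed and fully expanded form.
    private static final Pattern IPV6_LOOPBACK = Pattern.compile("::1|0:0:0:0:0:0:0:1");

    static boolean isLocalhost(String host) {
        if (host == null) return false;
        String h = host.trim();
        // URLs write IPv6 literals in brackets, e.g. http://[::1]/
        if (h.length() > 1 && h.startsWith("[") && h.endsWith("]")) {
            h = h.substring(1, h.length() - 1);
        }
        return IPV4_OR_NAME.matcher(h).matches() || IPV6_LOOPBACK.matcher(h).matches();
    }
}
```

The patterns are matched against the whole host string, so a name such as 127.0.0.1.example.org is not treated as local.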
Michael Peter Christen
a3cd3852ab introduced a better place to update the lastacc time value in latency 2012-12-07 15:49:23 +01:00
Michael Peter Christen
864abcd33d removed Latency update after URL selection because that causes
a completely wrong behaviour when cache fresh cases appear. Makes
re-crawling MUCH faster!
2012-12-07 15:35:44 +01:00
Michael Peter Christen
10527e28ae fix for wrong display of error urls in HostBrowser 2012-12-07 00:31:10 +01:00
Michael Peter Christen
756772fbd3 fix for waitingtime computation for intranet configuration 2012-12-06 17:40:52 +01:00
Michael Peter Christen
3de784c8dd replaced more split and replaceAll calls that lacked pattern
pre-compilation with pre-compiled patterns
2012-11-26 13:40:53 +01:00
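The kind of change described in this commit can be sketched as follows. String.replaceAll compiles its regular expression on every call, and String.split does so for anything but trivial separators, so hot code paths benefit from compiling the pattern once. The class, constants and regular expressions below are hypothetical examples, not code from the commit.

```java
import java.util.regex.Pattern;

/** Illustrative only; names and patterns are made up for this example. */
final class PrecompiledPatterns {
    // Compiled once at class load time and reused by every call.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replaces s.split("\\s*,\\s*"), which would recompile the regex per call.
    static String[] splitList(String s) {
        return COMMA.split(s);
    }

    // Replaces s.replaceAll("\\s+", " "), which would recompile the regex per call.
    static String collapseWhitespace(String s) {
        return WHITESPACE_RUN.matcher(s).replaceAll(" ");
    }
}
```

Pattern instances are immutable and safe to share between threads, which is why a static final field is a common place to keep them.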
orbiter
5aa5202adf fixes for filesystem indexing 2012-11-24 10:27:29 +01:00
Michael Peter Christen
efd2c4622d added a new fail type attribute for the index to distinguish two
separate fail types: network fail and forced exclusion (i.e. by robots
or forwarding rules).
2012-11-23 14:00:30 +01:00
Michael Peter Christen
d481abd087 added the visualization of error-urls to host browser
- only visible for admins
- a faceted search generates a huge list for all hosts in the host list
- the faceted search algorithms had to be modified for that
- within the browsing of the directory path, the error cause is written
to the url which is presented as error-url
- the errors are also accumulated for directory sums
2012-11-06 00:29:37 +01:00
Michael Peter Christen
791e1dcfdf when a new crawl is started, delete all entries about error-urls for
crawl-start domains
2012-11-05 22:14:27 +01:00
orbiter
354ef8000d - added 'deleteold' option to crawler which deletes documents
that are selected by a crawl filter (host or subpath)
- site crawl uses this option by default now
- made deleteDomain() optionally concurrent
2012-11-04 02:58:26 +01:00
Michael Peter Christen
75dd706e1b update to HostBrowser:
- time-out after 3 seconds to speed up display (may be incomplete)
- showing also all links from the balancer queue in the host list (after
the '/') and in the result browser view with tag 'loading'
2012-11-02 13:57:43 +01:00
Michael Peter Christen
0716a24737 added more / all new crawl profile fields into crawl profile editor 2012-10-31 15:13:05 +01:00
Michael Peter Christen
4a14122ba7 in case that a crawl profile has a collection assigned, use the
collection to show a name in the web interface. This should prevent that
much too long names make the interface unusable.
2012-10-31 14:08:33 +01:00
Michael Peter Christen
0fe8be7981 enhanced data structures for balancer and latency computation which
should produce a bit better prognosis about forced waiting times.
2012-10-30 17:30:24 +01:00
Michael Peter Christen
ac9540dfb6 removed options for stopwords which are not used 2012-10-30 12:36:36 +01:00
Michael Peter Christen
b2ffd49817 less latency 2012-10-30 12:26:32 +01:00
Michael Peter Christen
0833937c1c better balancing and duetime-computation also for no-delay intranet
hosts
2012-10-30 11:28:49 +01:00
Michael Peter Christen
c25d7bcb80 - added concurrency for robots.txt loading
- changed data model for domain counter
2012-10-29 21:08:45 +01:00
Michael Peter Christen
2d9e577ad0 replaced the custom robots.txt loader by the standard http loader 2012-10-28 22:48:11 +01:00
Michael Peter Christen
a33e2742cb - removed unnecessary synchronized and deadlock in crawler
- removed problem with monitoring object on Balancer.wait
- added missing user agent settings
2012-10-28 19:56:02 +01:00
orbiter
8952153ecf update to Balancer algorithm:
- create a load list from the current list of known hosts
- do not create this list for each Balancer.pop access
- create the list from those hosts which have a zero-waiting time
- select 1/3 from that list which have the most urls waiting
- get hosts from the waiting list in random order
- fixes for some delta-time computations
- always load all urls from hosts which have never been loaded before
2012-10-28 13:24:49 +01:00
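The selection steps listed in this commit message can be sketched roughly as below. This is one reading of the message, not the actual Balancer code: the HostState type, its fields, the rounding of "1/3" and the choice to include never-loaded hosts regardless of their waiting time are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.Random;

/** Illustrative only; class and field names are hypothetical, not YaCy's. */
final class LoadListSketch {

    /** Snapshot of one known host at the time the load list is built. */
    static final class HostState {
        final String host;
        final long waitingMillis;   // remaining forced waiting time
        final int queuedUrls;       // number of URLs waiting for this host
        final boolean neverLoaded;  // true if nothing was loaded from it yet

        HostState(String host, long waitingMillis, int queuedUrls, boolean neverLoaded) {
            this.host = host;
            this.waitingMillis = waitingMillis;
            this.queuedUrls = queuedUrls;
            this.neverLoaded = neverLoaded;
        }
    }

    private final Deque<String> loadList = new ArrayDeque<>();
    private final Random random;

    LoadListSketch(Random random) {
        this.random = random;
    }

    /** Builds a load list from the known hosts. */
    static List<String> selectHosts(List<HostState> known, Random random) {
        List<HostState> ready = new ArrayList<>();
        List<String> selected = new ArrayList<>();
        for (HostState h : known) {
            if (h.neverLoaded) {
                selected.add(h.host);      // assumption: never-loaded hosts are always taken
            } else if (h.waitingMillis <= 0) {
                ready.add(h);              // only zero-waiting hosts are candidates
            }
        }
        // Take the third of the ready hosts with the most URLs waiting (rounded up).
        ready.sort(Comparator.comparingInt((HostState h) -> h.queuedUrls).reversed());
        int take = (ready.size() + 2) / 3;
        for (int i = 0; i < take; i++) {
            selected.add(ready.get(i).host);
        }
        Collections.shuffle(selected, random);  // hosts are handed out in random order
        return selected;
    }

    /** Returns the next host, rebuilding the list only when it has run empty. */
    String nextHost(List<HostState> known) {
        if (loadList.isEmpty()) {
            loadList.addAll(selectHosts(known, random));
        }
        return loadList.poll();  // null if no host qualifies right now
    }
}
```

Keeping the list in a field and refilling it only when it is empty reflects the point that the list is not rebuilt on each pop access.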
Michael Peter Christen
85ca07b90e when a new crawl is started, an equal crawl, if still running, is
terminated and the corresponding crawl profile is deleted (this also
clears the crawl queue entries for that crawl profile)
2012-10-25 10:20:55 +02:00
Michael Peter Christen
ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
superfluous. The target is to make a solr document as the core of YaCy
documents which would cause that many conversions can be removed. On the
way to this target the Equivalence of URIMetadataRow and URIMetadataNode
had to be removed to expose the usage of the old URIMetadataRow data
structure.
This refactoring already removes unnecessary conversions and should
make memory usage during indexing lower.
2012-10-18 14:29:11 +02:00
Michael Peter Christen
e5b3c172ff removed hack which translated Solr documents to virtual RWI entries
which had been then mixed with remote RWIs. Now these Solr documents are
fed into the result set as they appear during local and remote
search. That makes the search much faster.
2012-10-17 17:45:41 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of & parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
53789555b9 fix for crawl start filter 2012-10-10 10:40:32 +02:00
Michael Peter Christen
76d218fbef fixes to crawl profiles 2012-10-08 10:50:40 +02:00
Michael Peter Christen
1533bfd63b refactoring 2012-09-25 21:20:03 +02:00
Michael Peter Christen
872f83ebe0 refactoring 2012-09-25 21:04:58 +02:00
Michael Peter Christen
8219a445f3 refactoring 2012-09-21 16:46:57 +02:00
Michael Peter Christen
f879a344e7 fix for no depth limit default value 2012-09-21 16:05:17 +02:00
Michael Peter Christen
00c1c777fa refactoring 2012-09-21 15:48:16 +02:00