10824 Commits

Author SHA1 Message Date
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for urls taken from any queue is that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 can create tens of
thousands of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only when an access is requested (OnDemandOpenFileIndex). Using
such an on-demand file reader is meant to keep the number of open file
pointers below the system limit, which is usually about 10,000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the port is
attached to the host name to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
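The queue layout described in this commit can be illustrated with a minimal in-memory sketch. It is not the HostBalancer code from the commit: the class name HostDepthBalancerSketch, its methods, and the "host:port" key format are assumptions made for illustration, and the real implementation keeps its queues in files rather than in memory. The sketch only shows the two stated rules: urls are taken at minimal crawl depth, and hosts take turns.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

/** Hypothetical sketch: one queue set per host, one FIFO queue per crawl depth. */
class HostDepthBalancerSketch {
    // "host:port" -> (crawl depth -> urls); the TreeMap keeps depths in ascending order
    private final Map<String, TreeMap<Integer, Queue<String>>> hosts = new LinkedHashMap<>();

    /** Adds a url to the queue of its host at the given crawl depth. */
    void push(String hostPort, int depth, String url) {
        hosts.computeIfAbsent(hostPort, h -> new TreeMap<>())
             .computeIfAbsent(depth, d -> new ArrayDeque<>())
             .add(url);
    }

    /** Returns a url with the globally minimal crawl depth, or null if all queues are empty. */
    String pop() {
        String bestHost = null;
        int bestDepth = Integer.MAX_VALUE;
        // find the host whose shallowest queue has the smallest depth;
        // on a tie, the host that has waited longest (earliest in the map) wins
        for (Map.Entry<String, TreeMap<Integer, Queue<String>>> e : hosts.entrySet()) {
            int depth = e.getValue().firstKey();
            if (depth < bestDepth) {
                bestDepth = depth;
                bestHost = e.getKey();
            }
        }
        if (bestHost == null) return null;
        TreeMap<Integer, Queue<String>> depths = hosts.remove(bestHost);
        Queue<String> queue = depths.get(bestDepth);
        String url = queue.poll();
        if (queue.isEmpty()) depths.remove(bestDepth);
        // re-insert the host at the end so that other hosts are served first next time
        if (!depths.isEmpty()) hosts.put(bestHost, depths);
        return url;
    }
}
```

With two hosts that both hold depth-0 urls, successive pop() calls alternate between them before any depth-1 url is returned.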
Michael Peter Christen
075b6f9278 refactoring of the crawl balancer: the balancer is turned into an
interface and the old balancer class is moved into LegacyBalancer to
make room for a fresh implementation of a crawl balancer.
2014-04-14 13:32:35 +02:00
Michael Peter Christen
8470dfe3f8 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-14 12:17:52 +02:00
reger
46016fa153 autoupdate fails to download latest release (1.71) due to default release blacklist
- removed the default version blacklist regex from init (for future versions)

!!! left existing update blacklist setting untouched !!!
(existing installations wanting autoupdate to 1.71 need to change the blacklist in ConfigUpdate_p.html)

- moved old blacklist patch to migration.java
2014-04-13 07:32:32 +02:00
Michael Peter Christen
8aeef73d49 fix for virtual root nodes 2014-04-11 15:12:34 +02:00
Michael Peter Christen
7c7fbb9818 find depth-matches also for edge targets 2014-04-11 12:27:21 +02:00
Michael Peter Christen
dd12dd392f introduction of a data structure for HyperlinkEdges which should use
less memory because it does not store the source link again for each edge
of the graph.
2014-04-11 12:09:33 +02:00
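One way to avoid storing the source link once per edge is to group the targets under their source. The sketch below only illustrates that idea: the class EdgeStoreSketch and its methods are hypothetical, plain strings stand in for url objects, and the HyperlinkEdges class in the commit may be organized differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: each source url is stored once, with all of its targets. */
class EdgeStoreSketch {
    private final Map<String, List<String>> targetsBySource = new LinkedHashMap<>();

    /** Adds the edge source -> target; the source key is shared by all of its edges. */
    void addEdge(String source, String target) {
        targetsBySource.computeIfAbsent(source, s -> new ArrayList<>()).add(target);
    }

    /** Returns the targets linked from the given source (empty if the source is unknown). */
    List<String> targetsOf(String source) {
        return targetsBySource.getOrDefault(source, Collections.emptyList());
    }

    /** Number of distinct source urls held in the store. */
    int sourceCount() {
        return targetsBySource.size();
    }

    /** Total number of edges over all sources. */
    int edgeCount() {
        int n = 0;
        for (List<String> targets : targetsBySource.values()) n += targets.size();
        return n;
    }
}
```

A page with three outgoing links then occupies one source entry and three target entries, rather than three (source, target) pairs.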
Michael Peter Christen
6ea8bb7348 using MultiProtocolURL for edge data which is faster (hash computation
is now much easier) and smaller in size
2014-04-11 10:58:37 +02:00
Michael Peter Christen
b21c208b4d enhanced hashcode computation for MultiProtocolURL 2014-04-11 10:23:48 +02:00
Michael Peter Christen
ce1d1b2fa0 fix for maximum tag length in parser 2014-04-11 09:56:44 +02:00
Michael Peter Christen
17e0956312 refactoring of SystemLoad calls (only one backend tool) 2014-04-11 09:25:18 +02:00
Michael Peter Christen
a37d067692 refactoring 2014-04-10 23:46:35 +02:00
orbiter
95780eed32 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-10 21:40:54 +02:00
Michael Peter Christen
67beef657f major redesign of the html parser: nested objects are now handled
with a stack of html tag objects instead of a recursive parse-again method,
which could cause bad performance and huge memory allocation. The new
method also produces better parsed image objects with exact anchor text
references.
2014-04-10 18:58:03 +02:00
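The difference between re-parsing nested content and keeping a stack of open tags can be shown with a toy scanner. This is a sketch under strong assumptions, not the YaCy parser: the class StackTagScannerSketch is hypothetical, and it ignores attributes, comments, entities, and self-closing tags. It makes a single pass over the input and collects the text of every element, including nested anchor text, without recursion.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: single-pass tag scanning with a stack of open tags. */
class StackTagScannerSketch {
    private static final class OpenTag {
        final String name;
        final StringBuilder text = new StringBuilder();
        OpenTag(String name) { this.name = name; }
    }

    /** Returns "name=text" for every closed element, in the order the elements close. */
    static List<String> scan(String html) {
        Deque<OpenTag> open = new ArrayDeque<>();
        List<String> closed = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int lt = html.indexOf('<', i);
            String text = html.substring(i, lt < 0 ? html.length() : lt);
            // text belongs to every element that is currently open
            for (OpenTag t : open) t.text.append(text);
            if (lt < 0) break;
            int gt = html.indexOf('>', lt);
            if (gt < 0) break; // unterminated tag: stop scanning
            String tag = html.substring(lt + 1, gt).trim();
            if (tag.startsWith("/")) {
                String name = tag.substring(1).trim();
                boolean known = false;
                for (OpenTag t : open) {
                    if (t.name.equals(name)) { known = true; break; }
                }
                if (known) {
                    // pop down to the matching tag; unclosed inner tags are discarded
                    OpenTag t;
                    do { t = open.pop(); } while (!t.name.equals(name));
                    closed.add(t.name + "=" + t.text);
                }
                // a closing tag without a matching open tag is ignored
            } else if (!tag.isEmpty()) {
                open.push(new OpenTag(tag.split("\\s+")[0]));
            }
            i = gt + 1;
        }
        return closed;
    }
}
```

Because every open tag sits on the stack while its content is scanned, memory use grows with the nesting depth of the document and no part of the input is parsed twice.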
Michael Peter Christen
6bd8c6f195 fix for wrong status codes of error pages 2014-04-10 09:08:59 +02:00
Michael Peter Christen
9e503b3376 also delete the robots.txt file from the cache when a new crawl is
started
2014-04-09 21:59:54 +02:00
orbiter
67501c9dda Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-09 19:58:54 +02:00
Michael Peter Christen
1c21b3256d fix for robots.txt handling: delete old entry before starting a new
crawl.
2014-04-09 18:33:48 +02:00
orbiter
c250fac9f4 linkstructure refactoring to get more options for clickdepth analysis 2014-04-09 17:52:51 +02:00
Michael Peter Christen
8068e68474 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-09 12:45:15 +02:00
Michael Peter Christen
bd886054cb new structure and enhancements for link graph computation:
- added order option to solr queries to be able to retrieve document
lists in specific order, here: link length
- added HyperlinkEdge class which manages the link structure
- integrated the HyperlinkEdge class into clickdepth computation
- extended the linkstructure.json servlet to show also the clickdepth
and other statistic information
2014-04-09 12:45:04 +02:00
reger
f326a67561 fix: typo in default charset in metadata2solr
update pom and NB build to Solr 4.7.1 libs
2014-04-06 22:31:22 +02:00
Michael Peter Christen
df138084c0 do solr optimization independently from memory and load constraints:
- not doing an optimization will likely cause a 'too many open files' exception
- without optimization, performance gets even worse, which would
prevent optimization in the future as well (this avoids a deadlock
situation)
2014-04-06 11:04:23 +02:00
Michael Peter Christen
ebd44a7080 replaced solr 4.6.1 with solr 4.7.1 and added index migration to
lucene_47
2014-04-06 10:45:03 +02:00
Michael Peter Christen
0f3fbae438 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-06 08:56:31 +02:00
reger
1a6e0354db update commons-compress.jar to 1.8 2014-04-06 03:59:11 +02:00
Michael Peter Christen
68417a05c5 different algorithm to test checkalive as it depends less on the
existence of wget (or curl) on the OS.
2014-04-06 01:20:03 +02:00
Michael Peter Christen
6b0e62ec59 Emergency bugfix for killYACY.sh: the file yacy00.log does not exist
when a 'too many open files' error occurs. In such a case, only the file
yacy00.log.lck exists.
In the long term a different solution should be found.
2014-04-06 01:00:09 +02:00
Michael Peter Christen
ee92d748b5 test using compound file format, see UseCompoundFile in
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
This appears to be necessary because a
java.io.FileNotFoundException: (Too many open files) occurs frequently.
See also: https://issues.apache.org/jira/browse/SOLR-4 and desperate
users at
http://stackoverflow.com/questions/3828343/too-many-open-file-exception-while-indexin-using-solr
We cannot force users to do a "ulimit -n 1000000", so this action seems
to be required.
2014-04-06 00:35:35 +02:00
Michael Peter Christen
d2055f3d4b next development version 1.71
It's nowhere explained or declared, but for some time we have followed the
scheme that odd version numbers are used for development versions and
even numbers for release versions. That concept may change sometime but
it is used at this time to distinguish development from main.
2014-04-06 00:32:10 +02:00
reger
d1b5180dd9 upd version in pom 2014-04-06 00:20:12 +02:00
Michael Peter Christen
d051d2d85f release 1.7 2014-04-04 17:05:03 +02:00
Michael Peter Christen
0a95fd27f3 update of seed list 2014-04-04 17:04:49 +02:00
Michael Peter Christen
6e84770fd9 Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1
Conflicts:
	locales/ru.lng
2014-04-04 16:42:50 +02:00
malykhin.dmitry
f509cd4aab Update russian translation 2014-04-04 18:28:57 +04:00
Michael Peter Christen
f296a529d5 update to german locale 2014-04-04 15:45:08 +02:00
Michael Peter Christen
734778c0c8 fixed a time-out problem in the default servlet which is also a logging
problem because the error log showed the wrong reason (file not found)
instead of the actual reason (time-out).
2014-04-04 15:27:29 +02:00
Michael Peter Christen
466d90ad42 fixed a problem with resource observer; probably coming from uncaught
exceptions within the apache library which appear only in concurrent
environments.
2014-04-04 15:26:39 +02:00
Michael Peter Christen
c8d4a63604 eliminating the word 'Facet' from the interface because it is ugly. If
people do not know what search navigation is, then they also do not know
what a 'facet' is.
2014-04-04 15:25:37 +02:00
Michael Peter Christen
e8ddd415a8 enhanced the new link structure graph 2014-04-04 14:43:54 +02:00
Michael Peter Christen
926d28dd3f fixed a bug which prevented crawl starts after a network switch 2014-04-04 14:43:35 +02:00
Michael Peter Christen
8443255e18 better link structure limit calibration 2014-04-04 12:48:55 +02:00
Michael Peter Christen
7f5733638b fix for linkstructure computation: now also detecting dead links 2014-04-04 12:47:29 +02:00
Michael Peter Christen
3ce8eff21b another fix for inbound/outbound detection 2014-04-04 12:41:59 +02:00
Michael Peter Christen
d4b5c457e4 NPE fix 2014-04-04 12:34:34 +02:00
Michael Peter Christen
36a66b0704 fix for parsing of numeric value in case that boolean values are given 2014-04-04 11:59:51 +02:00
orbiter
41730c8048 better logging in template engine: shows filename of servlets where
errors in templates occur
2014-04-04 10:55:46 +02:00
orbiter
3c1274057d fixed thread dump in case of wrong seeds 2014-04-04 10:54:56 +02:00
orbiter
18f9c40302 moved Edge class out of linkstructure servlet as this does not work on
non-eclipse driven environments (all non-dev cases)
2014-04-04 10:54:11 +02:00
orbiter
de95e5e524 reduced search activity corona strength in network image 2014-04-04 10:08:44 +02:00