10472 Commits

Author SHA1 Message Date
Michael Peter Christen
489c3fbc90 code simplifications / removed warnings 2014-01-21 17:53:39 +01:00
Michael Peter Christen
0168f80c28 new crawling factors can now be changed during runtime 2014-01-21 17:52:16 +01:00
Michael Peter Christen
be5e808236 - removed hardcoded load-test which is now handled in BusyQueues
steering, see /PerformanceQueues_p.html
- changed default values for crawler queue load limit (high, because
these jobs are started upon user request)
2014-01-21 17:48:45 +01:00
sixcooler
40a4030b55 configurable max-load values for YaCy-Threads:
try lower values on small systems like a Pi
2014-01-21 17:04:22 +01:00
sixcooler
6d8c023a5e lower client-connection for single-cpu-systems 2014-01-21 16:56:44 +01:00
Michael Peter Christen
77531850b5 reverted crawling strategy from latest commit. 2014-01-21 16:05:55 +01:00
Michael Peter Christen
c0da966dfa enhanced crawler speed 2014-01-20 21:46:40 +01:00
Michael Peter Christen
79809342fa added synchronization to the exists() call because concurrent calls to
that method showed up in thread dumps close to deadlock situations. It's also
better to synchronize IO operations because they become faster then.
2014-01-20 21:09:03 +01:00
Michael Peter Christen
9a6912f2e6 if a http client thread is still running but we do not wait for it any
more, call an interrupt
2014-01-20 18:39:36 +01:00
Michael Peter Christen
0d235a565b cleanup crawl loader jobs 2014-01-20 18:36:00 +01:00
Michael Peter Christen
1ea17bd9f3 - removed old metadata database and all migration code
- refactored all code which uses URIMetadataRow as standard for word
hash length and word hash ordering and moved that to the class 'Word',
because the class URIMetadataRow defined the old metadata data structure
and should be superfluous in the future
- removed unused methods from URIMetadataRow as preparation for further
removal of that class
2014-01-20 18:31:46 +01:00
reger
d3de309953 fix IOException logging issue in DefaultServlet
reason unclear, but .logException triggers another exception
2014-01-20 08:12:35 +01:00
reger
97e84439fb adjusted ConfigHeuristic and changed QueryGoal.getOriginalQueryString to .getQueryString
- since the specific heuristics Twitter & Blekko are no longer available or are redundant with OpenSearchHeuristic,
adjusted ConfigHeuristic to use OpensearchHeuristic settings only.
For this the default OSD search target list is made available (copied) by default and the other configs are removed.

- the return value of QueryGoal.getOriginalQueryString includes the query modifiers, which are held separately in a modifier object,
but in most (all) cases just the query term is expected; clarified this and renamed it to QueryGoal.getQueryString, which returns
just the search term (if needed, a .getOriginalQueryString could be implemented in Queryparameters, adding the modifiers)

- started to adjust internal html href references from absolute to relative (currently it is mixed).
For future development we should prefer relative href targets (less trouble with context-aware servlets)
2014-01-20 00:58:17 +01:00
reger
d24a0ec32c upd heuristic default list (heuristicopensearch.conf)
- Faroo Web taken out (requires api key) http://www.faroo.com/hp/api/api.html#description
- update Faroo News to new url
- Twitter taken out (change to Api 1.1 not supporting rss) https://dev.twitter.com/discussions/24239
2014-01-20 00:03:55 +01:00
Michael Peter Christen
022c6d3ce1 do YaCy p2p connections using a timeout-request which wraps the http
request in a separate thread and ignores the further result of a
request if it does not answer within the requested time-out. This is an
attempt to solve a problem with the peer-ping, which hangs whenever a peer
appears to be dead or blocked.
2014-01-19 15:21:23 +01:00
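The timeout-request idea described in this commit can be illustrated with standard `java.util.concurrent` classes. This is only a sketch under assumptions: the class name `TimeoutRequestSketch`, the method `callWithTimeout`, and the fallback parameter are hypothetical and are not taken from the YaCy source.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not YaCy code: runs a request in its own thread
// and stops waiting for it once the time-out has elapsed.
final class TimeoutRequestSketch {

    static <T> T callWithTimeout(Callable<T> request, long timeoutMillis, T fallback) {
        // Daemon thread so a hanging request cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timeout-request");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(request);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ignore any later result and interrupt the worker thread.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The request itself failed; treat it like an unreachable peer.
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A call such as `callWithTimeout(() -> ping(peer), 3000, null)` (with `ping` being whatever blocking call is wrapped) would return the fallback when the peer does not answer in time. Note that interrupting a thread only helps if the blocked operation reacts to interruption.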
Michael Peter Christen
42f3733a05 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-01-19 14:47:24 +01:00
Michael Peter Christen
25a6c05008 experimental removal of synchronization. This should work for all cases
where the size() and isEmpty() method is used only for statistics, which
happens at many locations in YaCy. If these methods are used for
structural reasons (like accessing the last element in an array) then it
may fail or cause other problems. As far as visible, this is not the
case.
2014-01-19 14:47:11 +01:00
Michael Peter Christen
5695280edd removed superfluous synchronization 2014-01-19 14:44:58 +01:00
Michael Peter Christen
a1977b7a75 removed debug code 2014-01-19 14:42:26 +01:00
orbiter
fd4abc0565 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-01-19 01:50:55 +01:00
orbiter
d5b8e473c8 added load limit for DHT transfer: RWI acceptance only if local load is
not too high
2014-01-19 01:50:42 +01:00
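A load-gated acceptance check of the kind this commit describes might look like the following. This is a sketch under assumptions: the class `LoadLimitSketch`, the method names, and the threshold parameter are hypothetical; only `OperatingSystemMXBean.getSystemLoadAverage()` is a real JDK call, and it returns a negative value when the load average is not available.

```java
import java.lang.management.ManagementFactory;

// Hypothetical sketch, not YaCy code: accept an incoming transfer only
// while the local system load is at or below a configurable limit.
final class LoadLimitSketch {

    /** Pure decision function; a negative load means "unknown". */
    static boolean acceptTransfer(double systemLoad, double maxLoad) {
        if (systemLoad < 0) {
            return true; // load not measurable on this platform, do not block
        }
        return systemLoad <= maxLoad;
    }

    /** Reads the current load average from the JVM and applies the limit. */
    static boolean acceptTransferNow(double maxLoad) {
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return acceptTransfer(load, maxLoad);
    }
}
```

Accepting transfers when the load is unknown is a design choice made for this example; a stricter policy could reject instead.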
reger
41c126978b fix bug: Crawl Start (Expert) crawls "?-URLs" even if told not to do so
http://bugs.yacy.net/view.php?id=329
2014-01-18 23:27:16 +01:00
reger
2614fa7aeb Skip remote Solr search if last try showed error
As the solr servlet may not be available (e.g. no public search page, old version, individual access setting) a /solr/select error is 
remembered in the seed.dna of the remote peer.
This is not permanent, as flag is not stored and the seed is reloaded on several occasions, it is just a memory of the recent past status.
Might also be set to "not available" on time-out of last try.
2014-01-18 18:48:52 +01:00
orbiter
a07e9b3582 concurrency-solid version of transmission limitation 2014-01-18 12:55:05 +01:00
orbiter
ec21f0494e removed -d64 jvm option because that causes problems on non-64 bit
linux, see http://bugs.yacy.net/view.php?id=349 and
http://bugs.yacy.net/view.php?id=339
2014-01-18 12:54:14 +01:00
orbiter
60ead31273 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-01-18 10:50:36 +01:00
orbiter
52bf7d1ac8 reduce load during dht transfer 2014-01-18 10:50:24 +01:00
sixcooler
f0587d4af5 NP-fix, which was found on a Pi under 'heavy' load 2014-01-18 00:03:44 +01:00
Michael Peter Christen
a9ed28c0b5 no commit if no action is requested 2014-01-17 14:54:44 +01:00
Michael Peter Christen
0bf3cab8c7 - better 'extra'-peer selection
- logging of health status for 'extra'-peer selection
- concurrency for remote peer IO and interrupting the threads if
time-out occurs
2014-01-17 14:54:19 +01:00
orbiter
e3c4456c8e Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-01-17 09:43:09 +01:00
orbiter
7f21d21d1d added synchronization to deeply-embedded solr connector
EmbeddedSolrConnector because deadlock situations show that methods in
lucene class seem to block.
2014-01-17 09:42:55 +01:00
reger
9b06774414 fix role name in GSA servlet 2014-01-17 01:00:02 +01:00
reger
0c754dd794 implemented DIGEST authentication, which is more secure for remote login
than BASIC, where the pwd is transmitted in near clear text (B64enc).
This has some implications, as RFC 2617 requires and recommends a password hash MD5(user:realm:pwd) for DIGEST.

!!! before activating DIGEST you have to reassign all passwords !!! to allow new calculation of the hash
- default authentication is still BASIC
- configuration at this time only manually in (DATA/settings) or defaults/web.xml (<auth-method>)
- the realmname is in defaults/yacy.init  adminRealm=YaCy-AdminUI
- fyi: the realmname is shown on login screen
- changing the realm name invalidates all passwords - but for security you are encouraged to do so (as localhostadmin)
- implemented to support both, old hashes for BASIC and new hashes for BASIC and DIGEST
- to differentiate old / new hashes, the hash prefix "MD5:" used in Jetty is used for new pwd-hashes ( "MD5:hash" )
2014-01-17 00:02:23 +01:00
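The password hash MD5(user:realm:pwd) mentioned in this commit, together with the "MD5:" prefix, can be computed roughly as follows. This is a sketch under assumptions: the class and method names are hypothetical, and encoding the input as UTF-8 is a choice made for this example (RFC 2617 does not fix a charset, and the actual YaCy or Jetty code may use a different one).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch, not YaCy code: builds the DIGEST password hash
// MD5(user:realm:pwd) and the prefixed form "MD5:<hash>".
final class DigestHashSketch {

    /** Lower-case hex MD5 of "user:realm:password". */
    static String ha1(String user, String realm, String password) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Assumption: UTF-8 input encoding.
            byte[] digest = md5.digest(
                    (user + ":" + realm + ":" + password).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Prefixed form, so new hashes can be told apart from old ones. */
    static String prefixedHash(String user, String realm, String password) {
        return "MD5:" + ha1(user, realm, password);
    }
}
```

Because the realm is part of the hashed string, the same user and password produce a different hash under a different realm name, which is why changing the realm invalidates all stored passwords.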
Michael Peter Christen
ba44eb1160 when scaling the number of remote peers, also consider the machine load
and the number of cores
2014-01-16 17:34:26 +01:00
Michael Peter Christen
f8ce7040ab remote search peer selection schema change:
- all non-dht targets (previously separated into 'robinson' for dht-like
queries and 'node' for solr queries) are now 'extra' peers, which are
queried using solr
- these extra-peers are now selected using a ranking on last-seen,
peer-tag-matches, node-peer flags, peer age, and link count. The ranking
is done using a weight and a random factor.
- the number of extra peers is 50% of the dht peers
- the dht peers now exclude too young peers to prevent bad results
during strong growth of the network
- the number of dht peers (and therefore extra-peers) is reduced when
the memory of the peer is low and/or some documents still appear in the
indexing-queue. This shall prevent a peer from deadlocks when p2p
queries are made in a fast sequence on weak hardware.
2014-01-16 17:27:14 +01:00
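A ranking that combines weighted criteria with a random factor, as outlined in this commit, could be sketched like this. Everything concrete here is an assumption made for illustration: the `Peer` fields, the weights, and the scoring formula are hypothetical and do not reproduce the actual YaCy selection code. Only the criteria names and the "50% of the dht peers" rule come from the commit message.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch, not YaCy code.
final class ExtraPeerRankingSketch {

    /** Peer attributes named after the criteria in the commit message. */
    static final class Peer {
        final String name;
        final long minutesSinceLastSeen;
        final int tagMatches;
        final boolean nodePeer;
        final int ageDays;
        final int linkCount;

        Peer(String name, long minutesSinceLastSeen, int tagMatches,
             boolean nodePeer, int ageDays, int linkCount) {
            this.name = name;
            this.minutesSinceLastSeen = minutesSinceLastSeen;
            this.tagMatches = tagMatches;
            this.nodePeer = nodePeer;
            this.ageDays = ageDays;
            this.linkCount = linkCount;
        }
    }

    /** Weighted sum of the criteria plus a random factor in [0, 1). */
    static double score(Peer p, Random random) {
        double s = 0.0;
        s += 4.0 / (1.0 + p.minutesSinceLastSeen / 60.0); // recently seen is better
        s += 3.0 * p.tagMatches;                           // peer-tag matches
        s += p.nodePeer ? 2.0 : 0.0;                       // node-peer flag
        s += Math.min(p.ageDays, 365) / 365.0;             // peer age, capped
        s += Math.log1p(p.linkCount) / 10.0;               // link count, damped
        s += random.nextDouble();                          // random factor
        return s;
    }

    /** Picks the best-scoring peers; the count is 50% of the dht peers. */
    static List<Peer> selectExtraPeers(List<Peer> candidates, int dhtPeerCount, Random random) {
        int wanted = Math.min(candidates.size(), dhtPeerCount / 2);
        // Score each peer once so the random factor stays stable while sorting.
        Map<Peer, Double> scores = new IdentityHashMap<>();
        for (Peer p : candidates) {
            scores.put(p, score(p, random));
        }
        List<Peer> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return new ArrayList<>(sorted.subList(0, wanted));
    }
}
```

The random factor keeps the selection from always hitting the same top peers when several candidates have similar scores.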
Michael Peter Christen
47a82e471c less blocking in SeedDB which caused deadlocks in peer ping 2014-01-16 13:10:20 +01:00
Michael Peter Christen
ec10ed45bd better logging in logger 2014-01-16 13:08:39 +01:00
Michael Peter Christen
a5d7961812 replaced old caching in SolrConnector with a new one which is better for
concurrency and should prevent 100% CPU usage after a long run of a
peer with a large number of documents.
2014-01-15 23:13:22 +01:00
Michael Peter Christen
84cf7e8e9f backmigration from solrj 4.6.0 to 4.5.1. This is necessary because
solrj-4.6.0 has a bug which prevents the attachment of a remote solr (as
tested with a SolrCloud). See bug report
https://issues.apache.org/jira/browse/SOLR-5532
This bug shall be fixed in Solr 4.6.1.
Fortunately, solrj-4.5.1 works together with solr-4.6.0 thus the current
index does not need to be changed.
2014-01-15 17:18:32 +01:00
reger
6e2fe777af simulate Authorization cookie for yacy servlet header 2014-01-10 19:31:36 +01:00
reger
ea7cef5d05 fix NPE in TemplateEngine
StackTrace For input string: ""
java.lang.NumberFormatException: For input string: ""
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:504)
	at java.lang.Integer.parseInt(Integer.java:527)
	at net.yacy.server.http.TemplateEngine.writeTemplate(TemplateEngine.java:241)
	at net.yacy.server.http.TemplateEngine.writeTemplate(TemplateEngine.java:199)
	at net.yacy.http.servlets.YaCyDefaultServlet.handleTemplate(YaCyDefaultServlet.java:896)
2014-01-10 18:11:32 +01:00
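The stack trace above shows `Integer.parseInt` failing on an empty string. One common way to guard against that is a parse helper with a fallback value. This is a sketch under assumptions: the class and method names are hypothetical, and it is not claimed to be the change made in this commit.

```java
// Hypothetical sketch, not YaCy code: parse an int without throwing on
// null, empty, or malformed input.
final class SafeIntParseSketch {

    static int parseIntOrDefault(String value, int fallback) {
        if (value == null) {
            return fallback;
        }
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return fallback; // Integer.parseInt("") would throw NumberFormatException
        }
        try {
            return Integer.parseInt(trimmed);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```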
reger
cb6d0c2113 implementing YaCy legacy role names
- taking out customized SecurityHandler code as the original/default seems to just work fine
- with this individual sec. constraints can be applied via web.xml (using legacy role names)
2014-01-10 14:07:49 +01:00
reger
530b9f6de8 Merge origin/master 2014-01-10 12:38:00 +01:00
reger
f09dbbef96 make SecurityHandler webappcontext ready 2014-01-10 12:36:42 +01:00
Michael Peter Christen
ca3411d805 added checkindex, solr index check 2014-01-10 12:27:49 +01:00
Michael Peter Christen
36594d0348 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-01-10 12:15:13 +01:00
reger
37f2a82a5d making root context (htroot) a WebAppContext
- this allows additional features, like servlet configuration via web.xml and many more things.
- currently the standard servlets are still configured in the code (so the supplied defaults/web.xml is not really needed, yet),
  but could be expanded
- lookup for web.xml - 1. in /DATA/SETTINGS then in /defaults
2014-01-10 10:42:47 +01:00
reger
f6099b730d disabled unused fields in default Solr collection schema 2014-01-10 10:26:45 +01:00
reger
28eae57e8b give CrawlQueues a freemem routine
- clears errorStack
- will not get hit often (but better little than nothing on low mem)
2014-01-10 10:24:33 +01:00