Commit Graph

308 Commits

Author SHA1 Message Date
Michael Peter Christen
69391e5d9e changed strategy to test existence of documents in Solr: using the
update time. The reason for that is a better caching for the crawler
double-check, which needs the update time for crawler steering.
2014-02-19 04:03:45 +01:00
Michael Peter Christen
ca8b100f96 run the cleanup process even when load is high, do postprocessing even
if load > 1 (but < 2) but only if there is enough memory (now: 0.5 GB
RAM available). The memory demand of the postprocessing is the reason
that systems block: they run into a frequent-GC chain which
almost locks the peer. If running with enough memory, the postprocessing
is fast and not damaging to the system.
Because the required RAM of 0.5 GB is never available in default
setting, the postprocessing will not run if the peer is not reconfigured
to use more memory.
2014-02-10 12:59:30 +01:00
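The gating rule described in ca8b100f96 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the class name `PostprocessingGate`, its method names, and the way available memory is estimated are all assumptions. Only the two thresholds (load below 2, 0.5 GB RAM available) come from the message.

```java
final class PostprocessingGate {
    // Thresholds taken from the commit message.
    static final double MAX_LOAD = 2.0;
    static final long MIN_AVAILABLE_BYTES = 512L * 1024L * 1024L; // 0.5 GB

    /** Assumed estimate: memory the JVM could still allocate. */
    static long availableBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.maxMemory() - rt.totalMemory() + rt.freeMemory();
    }

    /** Hypothetical check: postprocessing may run only with moderate load and enough memory. */
    static boolean mayRunPostprocessing(double systemLoad, long availableBytes) {
        return systemLoad < MAX_LOAD && availableBytes >= MIN_AVAILABLE_BYTES;
    }
}
```

A caller would pass the current system load and `PostprocessingGate.availableBytes()`. With a default JVM heap smaller than 0.5 GB the check can never succeed, which matches the remark that the peer must be reconfigured to use more memory.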
Michael Peter Christen
3d474a843e added memory protection for postprocessing 2014-02-09 12:36:56 +01:00
Michael Peter Christen
6e59ca4ebf removed jena library and all code that depended on jena. When jena was
introduced, it was also used for search facets. The generic search
facets are now deduced from generic solr fields which makes jena as tool
for facet semantics superfluous.
2014-02-07 01:20:06 +01:00
Michael Peter Christen
931541d198 re-inserted default value re-set button to performance queues and
patched missing values for recent new queues
2014-02-06 22:39:19 +01:00
Michael Peter Christen
8b14e92ba4 added button in host browser to re-load 404/failed documents 2014-01-23 15:56:36 +01:00
Michael Peter Christen
6ada0daae9 making latency_factor and maximum number of same hosts in loader queue
settings available in Crawler_p.html servlet for steering.
2014-01-21 19:28:00 +01:00
Michael Peter Christen
489c3fbc90 code simplifications / removed warnings 2014-01-21 17:53:39 +01:00
Michael Peter Christen
0168f80c28 new crawling factors can now be changed during runtime 2014-01-21 17:52:16 +01:00
Michael Peter Christen
be5e808236 - removed hardcoded load-test which is now handled in BusyQueues
steering, see /PerformanceQueues_p.html
- changed default values for crawler queue load limit (high, because
these jobs are started upon user request)
2014-01-21 17:48:45 +01:00
sixcooler
40a4030b55 configurable max-load values for YaCy-Threads:
try lower values on small systems like a Pi
2014-01-21 17:04:22 +01:00
Michael Peter Christen
77531850b5 reverted crawling strategy from latest commit. 2014-01-21 16:05:55 +01:00
Michael Peter Christen
0d235a565b cleanup crawl loader jobs 2014-01-20 18:36:00 +01:00
Michael Peter Christen
1ea17bd9f3 - removed old metadata database and all migration code
- refactored all code which uses URIMetadataRow as standard for word
hash length and word hash ordering and moved that to the class 'Word',
because the class URIMetadataRow defined the old metadata data structure
and should be superfluous in the future
- removed unused methods from URIMetadataRow as preparation for further
removal of that class
2014-01-20 18:31:46 +01:00
reger
97e84439fb adjusted ConfigHeuristic and changed QueryGoal.getOriginalQueryString to .getQueryString
- since the specific heuristics Twitter & Blekko are no longer available or are redundant with OpenSearchHeuristic,
adjusted ConfigHeuristic to use OpensearchHeuristic settings only.
For this the default OSD search target list is made available (copied) by default and the other configs are removed.

- the return of QueryGoal.getOriginalQueryString includes the queryModifier, which are held separately in a modifier object,
but in most (all) cases just the query term is expected, clarified and renamed it to QueryGoal.getQueryString which returns
just the search term (if needed a .getOriginalQueryString could be implemented in Queryparameters, adding the modifiers)

- started to adjust internal html href references from absolute to relative (currently it is mixed).
For future development we should prefer relative href targets (less trouble with context-aware servlets)
2014-01-20 00:58:17 +01:00
Michael Peter Christen
022c6d3ce1 do YaCy p2p connections using a timeout-request which wraps the http
request in a separate thread and ignores the further result of a
request if that does not answer within the requested time-out. This is an
attempt to solve a problem with the peer-ping, which hangs whenever a peer
appears to be dead or blocked.
2014-01-19 15:21:23 +01:00
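The timeout-request idea from 022c6d3ce1 (run the request in a separate thread, give up waiting after a time-out, ignore whatever arrives later) might look like the sketch below. It uses only `java.util.concurrent`. The class name `TimeoutRequest` and the choice to return `null` on time-out or failure are assumptions for illustration, not the commit's actual implementation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutRequest {
    // Daemon threads, so a hanging request cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timeout-request");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the request in a separate thread. Returns its result, or null
     * if it fails or does not answer within timeoutMillis.
     */
    static <T> T call(Callable<T> request, long timeoutMillis) {
        Future<T> future = POOL.submit(request);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // any later result is ignored
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            return null; // the request itself threw an exception
        }
    }
}
```

The caller never blocks longer than the time-out, even if the underlying request hangs on a dead or blocked peer. The worker thread may still linger until its own I/O gives up, which is the price of this approach.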
reger
0c754dd794 implemented DIGEST authentication, which is more secure for remote login
than BASIC, where the pwd is transmitted in near clear text (B64enc).
This has some implication as RFC 2617 requires and recommends a password hash MD5(user:realm:pwd) for DIGEST.

!!! before activating DIGEST you have to reassign all passwords !!! to allow new calculation of the hash
- default authentication is still BASIC
- configuration at this time only manually in (DATA/settings) or defaults/web.xml (<auth-method>)
- the realmname is in defaults/yacy.init  adminRealm=YaCy-AdminUI
- fyi: the realmname is shown on login screen
- changing the realm name invalidates all passwords - but for security you are encouraged to do so (as localhostadmin)
- implemented to support both, old hashes for BASIC and new hashes for BASIC and DIGEST
- to differentiate old / new hashes, the hash prefix "MD5:" used in Jetty is applied to new pwd-hashes ("MD5:hash")
2014-01-17 00:02:23 +01:00
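The password hash named in 0c754dd794, MD5(user:realm:pwd) stored with the "MD5:" prefix, can be computed roughly as in the following sketch. The class name `DigestPassword`, its method names, and the UTF-8 encoding are assumptions. Only the hash formula, the prefix, and the realm name `YaCy-AdminUI` come from the message.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestPassword {
    static final String PREFIX = "MD5:";

    /** Returns "MD5:" followed by the hex MD5 of user:realm:password. */
    static String encode(String user, String realm, String password) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Assumption: UTF-8 encoding of the input string.
            byte[] digest = md5.digest(
                (user + ":" + realm + ":" + password).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(PREFIX);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical helper: distinguishes new-style hashes from old ones by prefix. */
    static boolean isNewStyle(String storedHash) {
        return storedHash != null && storedHash.startsWith(PREFIX);
    }
}
```

Because the realm is part of the hashed string, `encode("admin", "YaCy-AdminUI", pwd)` and the same call with a different realm produce different hashes. That is why changing the realm name invalidates all stored passwords.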
Michael Peter Christen
f8ce7040ab remote search peer selection schema change:
- all non-dht targets (previously separated into 'robinson' for dht-like
queries and 'node' for solr queries) are now 'extra' peers, which are
queried using solr
- these extra-peers are now selected using a ranking on last-seen,
peer-tag-matches, node-peer flags, peer age, and link count. The ranking
is done using a weight and a random factor.
- the number of extra peers is 50% of the dht peers
- the dht peers now exclude too young peers to prevent bad results
during strong growth of the network
- the number of dht peers (and therefore extra-peers) is reduced when
the memory of the peer is low and/or some documents still appear in the
indexing-queue. This shall prevent a peer from deadlocks when p2p
queries are made in a fast sequence on weak hardware.
2014-01-16 17:27:14 +01:00
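The extra-peer selection from f8ce7040ab (a ranking over last-seen, tag matches, node flag, age, and link count, combined with a weight and a random factor, yielding half as many extra peers as dht peers) could be sketched as below. The `Peer` class, every weight, and the normalisation of each criterion are hypothetical. The commit message does not state them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class ExtraPeerSelection {

    /** Hypothetical, simplified peer record; not YaCy's actual seed class. */
    static final class Peer {
        final String name;
        final long lastSeenMinutesAgo;
        final int tagMatches;
        final boolean nodePeer;
        final int ageDays;
        final long linkCount;

        Peer(String name, long lastSeenMinutesAgo, int tagMatches,
             boolean nodePeer, int ageDays, long linkCount) {
            this.name = name;
            this.lastSeenMinutesAgo = lastSeenMinutesAgo;
            this.tagMatches = tagMatches;
            this.nodePeer = nodePeer;
            this.ageDays = ageDays;
            this.linkCount = linkCount;
        }
    }

    /** Weighted sum of the five criteria plus a random factor (all weights assumed). */
    static double score(Peer p, Random random) {
        double recency = 1.0 / (1.0 + p.lastSeenMinutesAgo);               // (0, 1]
        double age = Math.min(p.ageDays, 30) / 30.0;                       // [0, 1]
        double links = Math.min(1.0, Math.log10(1.0 + p.linkCount) / 9.0); // [0, 1]
        return 2.0 * recency
             + 2.0 * p.tagMatches
             + (p.nodePeer ? 1.0 : 0.0)
             + age
             + links
             + random.nextDouble();                                        // [0, 1)
    }

    /** Picks the best-scored candidates; their number is 50% of the dht peers. */
    static List<Peer> select(List<Peer> candidates, int dhtPeerCount, Random random) {
        int wanted = dhtPeerCount / 2;
        // Score each peer exactly once so the comparator stays consistent.
        Map<Peer, Double> scores = new IdentityHashMap<>();
        for (Peer p : candidates) {
            scores.put(p, score(p, random));
        }
        List<Peer> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble((Peer p) -> scores.get(p)).reversed());
        return new ArrayList<>(ranked.subList(0, Math.min(wanted, ranked.size())));
    }
}
```

The random factor keeps the same few top-ranked peers from receiving every query. Reducing `dhtPeerCount` under memory pressure, as the message describes, shrinks the extra-peer set proportionally.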
reger
6932aa4d7a use configured admin-username for api calls
- the admin user name can be configured, in apiExec calls the default "admin" username is used. 

TODO: the bin/apicall.sh script should likely take that into account.
2014-01-07 21:26:50 +01:00
orbiter
2ead4e44d9 introduced a new storage path ARCHIVE inside of DATA which will be used
as path for solr index dumps (instead of the SEGMENTS path). This will
make a maintenance of index backups easier. It will also provide a tool
to migrate from a freeworld index to a webportal index.
2014-01-07 17:53:49 +01:00
orbiter
3cb6c7861f fixed shutdown authentication problem 2014-01-06 01:48:54 +01:00
Michael Peter Christen
2939b47986 removed non-working realm setting in http client (auth for localhost was
added in previous commit)
2014-01-05 15:04:18 +01:00
Michael Peter Christen
7d6fc79eb8 refactoring (usage of constant names for attributes of authentication
check)
2014-01-05 04:23:44 +01:00
reger
e9081c0f17 moved startup execAPIActions call after Jetty startup
execAPIActions require http to be up. The 10s sleep was sufficient to allow Jetty to start, 
but it's more robust to place the call after http is assigned to switchboard/serverSwitch.
2014-01-01 10:28:49 +01:00
reger
8eaabb9600 remove dependency from old serverCore.java
- remaining getPortNr not needed 
  (as current release allows only to set plain integer as port,
   see ConfigBasic)
2013-12-29 02:00:44 +01:00
orbiter
15882beb19 fix for strange NPE
java.lang.NullPointerException
        at net.yacy.search.Switchboard.updateMySeed(Switchboard.java:3667)
        at net.yacy.peers.Network.peerPing(Network.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:107)
        at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:165)
2013-12-29 00:40:31 +01:00
orbiter
f3ac923a7e ftp client shall be able to open non-anonymous ftp servers if login
details are given
2013-12-28 22:42:02 +01:00
Michael Peter Christen
ee17bd0b69 added option to attach remote solr servers in read-only mode 2013-12-27 02:55:21 +01:00
reger
71cac1a278 added SSL/HTTPS connector to support SSL/https connection on port 8443
!!! attention !!! to make sure YaCy can start, https will be disabled if port 8443 is already in use
   - added ping test for above to migration 

- as of now port for https is hardcoded to default 8443
- unless urgently required, I'd leave it this way (it is standard to use different ports for http and https)

- post https port on ConfigBasic.html (if active)
2013-12-25 05:20:13 +01:00
Michael Peter Christen
cfa08024c7 removed optimization before postprocessing because that may cause a
time-out which will cause postprocessing to fail.
2013-12-04 16:04:29 +01:00
Michael Peter Christen
0db8e34625 enhanced webgraph processing 2013-12-04 01:54:45 +01:00
Michael Peter Christen
a16534cb0a tried to fix timeout and connection-lost problems when using an outside
solr.
2013-11-28 01:31:53 +01:00
orbiter
c2d720cdaf purge a lucene cache - possible memory leak fix 2013-11-18 22:47:35 +01:00
orbiter
19a051bec8 more monitoring for postprocessing and enhanced layout in Crawler
monitor page
2013-11-16 18:23:14 +01:00
Michael Peter Christen
6842783761 fixed and enhanced postprocessing 2013-11-16 08:23:21 +01:00
Michael Peter Christen
9d5895f643 enhanced and fixed postprocessing 2013-11-15 15:41:12 +01:00
orbiter
4234b0ed6c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-11-10 18:50:43 +01:00
orbiter
909bbb49d8 added (partly commented) test code for url rewrite methods .. to be
completed
2013-11-10 18:50:34 +01:00
Michael Peter Christen
81bb50118e found and fixed a huge memory leak in solr caching (inside Solr). The
not-flushed Solr cache is now handled in this way:
- it is smaller by default
- a Solr-internal process is started to flush the cache periodically
(this does NOT clean the cache, just removes old objects)
- a Solr-external process (the standard YaCy cleanup-process) now has
direct access to the solr internal cache and flushes it completely.
The time frame for such a flush is defined by the cleanup-process
frequency, by default 10 minutes.
2013-11-07 10:01:44 +01:00
Michael Peter Christen
234a974955 load images only if their parser flag is activated 2013-11-04 11:59:28 +01:00
Michael Peter Christen
e1c1e57877 less overhead calling exist() with only one hash 2013-11-04 09:37:31 +01:00
Michael Peter Christen
5afa6e3aee Automatically flush the log cache if a short memory status is reached.
For the default of 200 lines this can flush about 10MB.
2013-10-24 17:39:50 +02:00
Michael Peter Christen
030d0776ff Enhanced crawl start for very, very large crawl lists (i.e. > 5000)
which had a problem because of badly used concurrency.
This fix also caused a redesign of the whole host deletion process.
This should fix bug http://bugs.yacy.net/view.php?id=250
2013-10-24 16:20:20 +02:00
Michael Peter Christen
1b4fa2947d - fixed a problem which occurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together now make it possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
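The first fix in 1b4fa2947d (prefer an existing mime type assignment over the file extension when deciding the content domain) can be illustrated with the sketch below. The class `ContentDomain`, its enum values, and the small mime and extension tables are assumptions for illustration and do not mirror YaCy's own classifier.

```java
import java.util.Locale;

final class ContentDomain {
    enum Domain { TEXT, IMAGE, AUDIO, VIDEO, OTHER }

    /** Maps a mime type to a domain, or returns null if it gives no hint. */
    static Domain fromMime(String mimeType) {
        if (mimeType == null) return null;
        String m = mimeType.trim().toLowerCase(Locale.ROOT);
        if (m.startsWith("text/")) return Domain.TEXT;
        if (m.startsWith("image/")) return Domain.IMAGE;
        if (m.startsWith("audio/")) return Domain.AUDIO;
        if (m.startsWith("video/")) return Domain.VIDEO;
        return null;
    }

    /** Fallback: guesses the domain from the file extension of the URL. */
    static Domain fromExtension(String url) {
        int q = url.indexOf('?');
        String path = q >= 0 ? url.substring(0, q) : url;
        int dot = path.lastIndexOf('.');
        String ext = dot >= 0 ? path.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
        switch (ext) {
            case "jpg": case "jpeg": case "png": case "gif":
                return Domain.IMAGE;
            case "html": case "htm": case "txt":
                return Domain.TEXT;
            default:
                return Domain.OTHER;
        }
    }

    /** The mime type wins when present; the extension is only a fallback. */
    static Domain classify(String url, String mimeType) {
        Domain byMime = fromMime(mimeType);
        return byMime != null ? byMime : fromExtension(url);
    }
}
```

With this ordering, a page whose name ends in .jpg but which is served as `text/html`, as in the commons.wikimedia.org case mentioned in the message, is treated as text rather than as an image.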
Michael Peter Christen
82621bead0 When doing bootstrapping, always accept one seedlist-File without
checking the date of the file. This should help to start the peer in
case that the user has a completely wrong date setting.
2013-10-22 15:34:51 +02:00
Michael Peter Christen
691d7e70fa added hint to development/commit rss feed 2013-10-21 15:16:29 +02:00
Michael Peter Christen
74d0256e93 enhanced postprocessing: fixed bugs, enable proper postprocessing also
without the harvestingkey, remove crawl profiles after postprocessing,
speed-up for clickdepth computation.
2013-10-16 11:27:06 +02:00
Michael Peter Christen
90c8577840 enhanced ranking; patches to replace old ranking 2013-10-09 15:10:03 +02:00
Michael Peter Christen
3bf0104199 fix for crawl domain counter limitation (limit was reached too early) 2013-09-26 13:41:52 +02:00
Michael Peter Christen
82bfd9e00a - crawl profiles shall be deleted from active and passive stacks if they
are deleted to terminate the crawl because otherwise the crawl will go
on after the load-from-passive stack policy.
- better check if a crawl is terminated using the loader queue.
2013-09-26 10:22:31 +02:00