yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	69391e5d9e	changed strategy to test existence of documents in Solr: using the update time. The reason for that is a better caching for the crawler double-check, which needs the update time for crawler steering.	2014-02-19 04:03:45 +01:00
Michael Peter Christen	ca8b100f96	run the cleanup process even when load is high, do postprocessing even if load > 1 (but < 2) but only if there is enough memory (now: 0.5 GB RAM available). The memory amount of the postprocessing is the cause that systems block because they run into a frequent-GC chain which almost locks the peer. If running with enough memory, the postprocessing is fast and not damaging to the system. Because the required RAM of 0.5 GB is never available in default setting, the postprocessing will not run if the peer is not reconfigured to use more memory.	2014-02-10 12:59:30 +01:00
Michael Peter Christen	3d474a843e	added memory protection for postprocessing	2014-02-09 12:36:56 +01:00
Michael Peter Christen	6e59ca4ebf	removed jena library and all code that depended on jena. When jena was introduced, it was also used for search facets. The generic search facets are now deduced from generic solr fields which makes jena as tool for facet semantics superfluous.	2014-02-07 01:20:06 +01:00
Michael Peter Christen	931541d198	re-inserted default value re-set button to performance queues and patched missing values for recent new queues	2014-02-06 22:39:19 +01:00
Michael Peter Christen	8b14e92ba4	added button in host browser to re-load 404/failed documents	2014-01-23 15:56:36 +01:00
Michael Peter Christen	6ada0daae9	making latency_factor and maximum number of same hosts in loader queue settings available in Crawler_p.html servlet for steering.	2014-01-21 19:28:00 +01:00
Michael Peter Christen	489c3fbc90	code simplifications / removed warnings	2014-01-21 17:53:39 +01:00
Michael Peter Christen	0168f80c28	new crawling factors can now be changed during runtime	2014-01-21 17:52:16 +01:00
Michael Peter Christen	be5e808236	- removed hardcoded load-test which is now handled in BusyQueues steering, see /PerformanceQueues_p.html - changed default values for crawler queue load limit (high, because these jobs are started upon user request)	2014-01-21 17:48:45 +01:00
sixcooler	40a4030b55	configurable max-load values for YaCy-Threads: try lower values on smal systems like a Pi	2014-01-21 17:04:22 +01:00
Michael Peter Christen	77531850b5	reverted crawling strategy from latest commit.	2014-01-21 16:05:55 +01:00
Michael Peter Christen	0d235a565b	cleanup crawl loader jobs	2014-01-20 18:36:00 +01:00
Michael Peter Christen	1ea17bd9f3	- removed old metadata database and all migration code - refactored all code which uses URIMetadataRow as standard for word hash length and word hash ordering and moved that to the class 'Word', becuase the class URIMetadataRow defined the old metadata data structure and should be superfluous in the future - removed unused methods from URIMetadataRow as preparation for further removal of that class	2014-01-20 18:31:46 +01:00
reger	97e84439fb	adjusted ConfigHeuristic and changed QueryGoal.getOriginalQueryString to .getQueryString - since specific heuristic Twitter & Blekko is not longer available or redundant with OpenSearchHeuristic, adjusted ConfigHeuristic to use OpensearchHeuristic settings only. For this the default OSD search target list is made available (copied) by default and the other configs are removed. - the return of QueryGoal.getOriginalQueryString includes the queryModifier, which are held separately in a modifier object, but in most (all) cases just the query term is expected, clarified and renamed it to QueryGoal.getQueryString which returns just the search term (if needed a .getOrigianlQueryString could be implemented in Queryparameters, adding the modifiers) - started to adjust internal html href references from absolute to relative (currently it is mixed). For future development we should prefer relative href targets (less trouble with context aware servlets)	2014-01-20 00:58:17 +01:00
Michael Peter Christen	022c6d3ce1	do YaCy p2p connections using a timeout-request which covers the http request into a separate thread and ignores the furthure result of a request if that does not answer within the requested time-out. This is a try to solve a problem with the peer-ping, which hangs whenever a peer appears to be dead or blocked.	2014-01-19 15:21:23 +01:00
reger	0c754dd794	implemented DIGEST authentication, which is for remote login more secure as BASIC were pwd is transmitted near clear text (B64enc). This has some implication as RFC 2617 requires and recommends a password hash MD5(user:realm:pwd) for DIGEST. !!! before activating DIGEST you have to reassign all passwords !!! to allow new calculation of the hash - default authentication is still BASIC - configuration at this time only manually in (DATA/settings) or defaults/web.xml (<auth-method> - the realmname is in defaults/yacy.init adminRealm=YaCy-AdminUI - fyi: the realmname is shown on login screen - changing the realm name invalidates all passwords - but for security you are encouraged to do so (as localhostadmin) - implemented to support both, old hashes for BASIC and new hashes for BASIC and DIGEST - to differentiate old / new hash the in Jetty used hash-prefix "MD5:" is used for new pwd-hashes ( "MD5:hash" )	2014-01-17 00:02:23 +01:00
Michael Peter Christen	f8ce7040ab	remote search peer selection schema change: - all non-dht targets (previously separated into 'robinson' for dht-like queries and 'node' for solr queries) are non 'extra' peers, which are queries using solr - these extra-peers are now selected using a ranking on last-seen, peer-tag-matches, node-peer flags, peer age, and link count. The ranking is done using a weight and a random factor. - the number of extra peers is 50% of the dht peers - the dht peers now exclude too young peers to prevent bad results during strong growth of the network - the number of dht peers (and therefore extra-peers) is reduced when the memory of the peer is low and/or some documents still appear in the indexing-queue. This shall prevent a peer from deadlocks when p2p queries are made in a fast sequence on weak hardware.	2014-01-16 17:27:14 +01:00
reger	6932aa4d7a	use configured admin-username for api calls - the admin user name can be configured, in apiExec calls the default "admin" username is used. TODO: the bin/apicall.sh script should likely take that into account.	2014-01-07 21:26:50 +01:00
orbiter	2ead4e44d9	introduced a new storage path ARCHIVE inside of DATA which will be used as path for solr index dumps (instead of the SEGMENTS path). This will make a maintenance of index backups easier. It will also provide a tool to migrate from an freeworld index to a webportal index.	2014-01-07 17:53:49 +01:00
orbiter	3cb6c7861f	fixed shutdown authenticaton problem	2014-01-06 01:48:54 +01:00
Michael Peter Christen	2939b47986	removed non-working realm setting in http client (auth for localhost was added in previous commit)	2014-01-05 15:04:18 +01:00
Michael Peter Christen	7d6fc79eb8	refactoring (usage of constant names for attributes of authentication check)	2014-01-05 04:23:44 +01:00
reger	e9081c0f17	moved startup execAPIActions call after Jetty startup execAPIActions require http to be up. The 10s sleep was sufficient to allow Jetty to start, but it's more robust to place the call after http is assigned to switchboard/serverSwitch.	2014-01-01 10:28:49 +01:00
reger	8eaabb9600	remove dependency from old serverCore.java - remaining getPortNr not needed (as current release allows only to set plain integer as port, see ConfigBasic)	2013-12-29 02:00:44 +01:00
orbiter	15882beb19	fix for strange NPE java.lang.NullPointerException at net.yacy.search.Switchboard.updateMySeed(Switchboard.java:3667) at net.yacy.peers.Network.peerPing(Network.java:195) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:107) at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:165)	2013-12-29 00:40:31 +01:00
orbiter	f3ac923a7e	ftp client shall be able to open non-anonymous ftp servers if login details are given	2013-12-28 22:42:02 +01:00
Michael Peter Christen	ee17bd0b69	added option to attach remote solr servers in read-only mode	2013-12-27 02:55:21 +01:00
reger	71cac1a278	added SSL/HTTPS connector to support SSL/https connection on port 8443 !!! attention !!! to make sure YaCy can start, https will be disabled if port 8443 is used - added ping test for above to migration - as of now port for https is hardcoded to default 8443 - if not urgend required I'd leave it this way (it's standard) to use different ports for http and https - post https port on ConfigBasic.html (if active)	2013-12-25 05:20:13 +01:00
Michael Peter Christen	cfa08024c7	removed optimization bevore postprocessing because that may cause a time-out which will cause that postprocessing fails.	2013-12-04 16:04:29 +01:00
Michael Peter Christen	0db8e34625	enhanced webgraph processing	2013-12-04 01:54:45 +01:00
Michael Peter Christen	a16534cb0a	tried to fix timeout and connection-lost problems when using an outside solr.	2013-11-28 01:31:53 +01:00
orbiter	c2d720cdaf	purge a lucene cache - possible memory leak fix	2013-11-18 22:47:35 +01:00
orbiter	19a051bec8	more monitoring for postprocessing and enhanced layout in Crawler monitor page	2013-11-16 18:23:14 +01:00
Michael Peter Christen	6842783761	fixed and enhanced postprocessing	2013-11-16 08:23:21 +01:00
Michael Peter Christen	9d5895f643	enhanced and fixed postprocessing	2013-11-15 15:41:12 +01:00
orbiter	4234b0ed6c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-11-10 18:50:43 +01:00
orbiter	909bbb49d8	added (partly commented) test code for url rewrite methods .. to be completed	2013-11-10 18:50:34 +01:00
Michael Peter Christen	81bb50118e	found and fixed a huge memory leak in solr caching (inside Solr). The not-flushed Solr cache is now handled in this way: - it is smaller by default - an Solr-internal process is started to flush the cache periodically (this does NOT clean the cache, just removes old objects) - a Solr-external process (the standard YaCy cleanup-process) now has direct access to the solr internal cache and flushes them completely. The time frame for such a flush is defined by the cleanup-process frequency, by default 10 minutes.	2013-11-07 10:01:44 +01:00
Michael Peter Christen	234a974955	load image only if their parser flag is activated	2013-11-04 11:59:28 +01:00
Michael Peter Christen	e1c1e57877	less overhead calling exist() with only one hash	2013-11-04 09:37:31 +01:00
Michael Peter Christen	5afa6e3aee	Automatically flush the log cache if a short memory status is reached. For the default of 200 lines this can flush about 10MB.	2013-10-24 17:39:50 +02:00
Michael Peter Christen	030d0776ff	Enhanced crawl start for very, very large crawl lists (i.e. > 5000) which had a problem because of badly used concurrency. This fix also caused a redesign of the whole host deletion process. This should fix bug http://bugs.yacy.net/view.php?id=250	2013-10-24 16:20:20 +02:00
Michael Peter Christen	1b4fa2947d	- fixed a problem which ocurred when a document was not recognized with the right content domain (i.e. identifying that it is an image, text etc.) because it used the file extension and not an existing mime type assignment. - fixed the new setting that images shall be loaded for a better image search. - both fixes together makes it now possible to crawl commons.wikimedia.org which makes use of 'funny' document names (i.e. ending with .jpg while the document is html)	2013-10-23 00:16:54 +02:00
Michael Peter Christen	82621bead0	When doing bootstraping, always accept one seedlist-File without checking the date of the file. This should help to start the peer in case that the user has a completely wrong date setting.	2013-10-22 15:34:51 +02:00
Michael Peter Christen	691d7e70fa	added hint to development/commit rss feed	2013-10-21 15:16:29 +02:00
Michael Peter Christen	74d0256e93	enhanced postprocessing: fixed bugs, enable proper postprocessing also without the harvestingkey, remove crawl profiles after postprocessing, speed-up for clickdepth computation.	2013-10-16 11:27:06 +02:00
Michael Peter Christen	90c8577840	enhanced ranking; patches to replace old ranking	2013-10-09 15:10:03 +02:00
Michael Peter Christen	3bf0104199	fix for crawl domain counter limitation (limit was reached too early)	2013-09-26 13:41:52 +02:00
Michael Peter Christen	82bfd9e00a	- crawl profiles shall be deleted from active and passive stacks if they are deleted to terminate the crawl because otherwise the crawl will go on after the load-from-passive stack policy. - better check if a crawl is terminated using the loader queue.	2013-09-26 10:22:31 +02:00

1 2 3 4 5 ...

308 Commits