yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
orbiter	0c88a32c36	do not apply lazy value instantiation for numeric or boolean values because that is misleading and confusing in case of 0- or false-values and may cause NPEs in retrieval functions.	2014-04-23 08:41:36 +02:00
orbiter	8e04030596	in case of short memory, do not cut down robinson peers to 1, just reduce by 50%	2014-04-23 08:37:19 +02:00
orbiter	ccb1864d55	catch IllegalArgumentException for wrong process types (that is needed for migrations when new process types are introduced or disappear)	2014-04-22 23:14:05 +02:00
orbiter	4ee4ba1576	fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of lazy value instantiation of 0-value in crawldepth_i	2014-04-22 19:48:49 +02:00
orbiter	12ba890205	removed warnings	2014-04-22 19:35:15 +02:00
reger	d51f9cc863	add custom Jetty errorhandler to provide custom error page footer line - remove redundant mime check in UrlProxyServlet	2014-04-21 17:28:21 +02:00
reger	c193a02023	defer creation of new ArrayList after possible early return (to skip not used object allocation)	2014-04-21 17:16:06 +02:00
reger	727dfb5875	refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex	2014-04-20 01:41:30 +02:00
reger	79e7947442	- remove empty http0_9 status text array and unused default_charset = ISO-8859-1	2014-04-18 22:03:16 +02:00
reger	2dabe2009d	- remove unused manual http KeepAlive config (reducing references to obsolete httpdemon) - add port info to settings_http	2014-04-18 19:57:35 +02:00
Michael Peter Christen	5746aae3db	add canonical links to the same crawldepth, not the next crawldepth	2014-04-18 06:51:46 +02:00
Michael Peter Christen	74ab5ef9fa	increased runtime for postprocessing query job	2014-04-18 06:51:10 +02:00
Michael Peter Christen	8b32dd5f9e	special strategy for balancer: do not remove targets with zero wait time from the queue	2014-04-18 06:50:07 +02:00
Michael Peter Christen	9c6228d948	fix for deadlocks in crawler	2014-04-17 16:58:17 +02:00
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	2014-04-17 13:21:43 +02:00
Michael Peter Christen	7fefebaeca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-04-17 12:55:38 +02:00
Michael Peter Christen	c2f62e783f	- better subgraph handling, less overhead for crawls without the webgraph - usage of crawler crawldepth cache for the linkgraph target depth computation	2014-04-17 12:54:18 +02:00
Michael Peter Christen	06afb568e2	new Strategies in Balancer: - doublecheck cache now records the crawl depth as well - doublecheck cache is available from the outside (made static) - no more need to crawl hosts with lowest depth first, instead all hosts which have only singleton entries are preferred to reduce the number of files.	2014-04-17 12:52:54 +02:00
Michael Peter Christen	1aea01fe5b	fix for Table in case that requested file does not exist and paths also do not exist	2014-04-17 12:44:05 +02:00
reger	710054bb37	implement gzip input handling directly in defaultservlet (making reference to legacy httpdemon obsolete)	2014-04-17 03:20:29 +02:00
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	2014-04-16 22:16:20 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10,000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10,000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
Michael Peter Christen	075b6f9278	refactoring of the crawl balancer: the balancer is turned into an interface and the old balancer class is moved into LegacyBalancer to make room for a fresh implementation of a crawl balancer.	2014-04-14 13:32:35 +02:00
Michael Peter Christen	8470dfe3f8	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-04-14 12:17:52 +02:00
reger	46016fa153	autoupdate fails to download latest release (1.71) due to default release blacklist - removed the default version blacklist regex from init (for future versions) !!! left existing update blacklist setting untouched !!! (existing installations wanting autoupdate for 1.71 need to change blacklist in ConfigUpdate_p.html) - moved old blacklist patch to migration.java	2014-04-13 07:32:32 +02:00
Michael Peter Christen	8aeef73d49	fix for virtual root nodes	2014-04-11 15:12:34 +02:00
Michael Peter Christen	7c7fbb9818	find depth-matches also for edge targets	2014-04-11 12:27:21 +02:00
Michael Peter Christen	dd12dd392f	introduction of a data structure for HyperlinkEdges which should use less memory as it does no double-storage of source links for each edge of the graph.	2014-04-11 12:09:33 +02:00
Michael Peter Christen	6ea8bb7348	using MultiProtocolURL for edge data which is faster (hash computation is now much easier) and smaller in size	2014-04-11 10:58:37 +02:00
Michael Peter Christen	b21c208b4d	enhanced hashcode computation for MultiProtocolURL	2014-04-11 10:23:48 +02:00
Michael Peter Christen	ce1d1b2fa0	fix for maximum tag length in parser	2014-04-11 09:56:44 +02:00
Michael Peter Christen	17e0956312	refactoring of SystemLoad calls (only one backend tool)	2014-04-11 09:25:18 +02:00
Michael Peter Christen	a37d067692	refactoring	2014-04-10 23:46:35 +02:00
orbiter	95780eed32	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-04-10 21:40:54 +02:00
Michael Peter Christen	67beef657f	strong redesign of html parser: object recursion is now made using a stack on html tag objects, not using a recursive parse-again method which may cause bad performance and huge memory allocation. The new method also produced better parsed image objects with exact anchor text references.	2014-04-10 18:58:03 +02:00
Michael Peter Christen	6bd8c6f195	fix for wrong status codes of error pages	2014-04-10 09:08:59 +02:00
Michael Peter Christen	9e503b3376	also delete the robots.txt file from the cache when a new crawl is started	2014-04-09 21:59:54 +02:00
orbiter	67501c9dda	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-04-09 19:58:54 +02:00
Michael Peter Christen	1c21b3256d	fix for robots.txt handling: delete old entry before starting a new crawl.	2014-04-09 18:33:48 +02:00
orbiter	c250fac9f4	linkstructure refactoring to get more options for clickdepth analysis	2014-04-09 17:52:51 +02:00
Michael Peter Christen	8068e68474	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-04-09 12:45:15 +02:00
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	2014-04-09 12:45:04 +02:00
reger	f326a67561	fix: typo in default charset in metadata2solr update pom and NB build to Solr 4.7.1 libs	2014-04-06 22:31:22 +02:00
Michael Peter Christen	df138084c0	do solr optimization independently from memory and load constraints: - not doing an optimization will likely cause a too many files exception - without optimization performance will be even worse which would prevent optimization in the future as well (prevent a deadlock situation)	2014-04-06 11:04:23 +02:00
Michael Peter Christen	ebd44a7080	replaced solr 4.6.1 with solr 4.7.1 and added index migration to lucene_47	2014-04-06 10:45:03 +02:00
Michael Peter Christen	734778c0c8	fixed a time-out problem in the default servlet which is also a logging problem because the error log showed the wrong reason (file not found) instead of the actual reason (time-out).	2014-04-04 15:27:29 +02:00
Michael Peter Christen	466d90ad42	fixed a problem with resource observer; probably coming from uncaught exceptions within the apache library which appear only in concurrent environments.	2014-04-04 15:26:39 +02:00
Michael Peter Christen	e8ddd415a8	enhanced the new link structure graph	2014-04-04 14:43:54 +02:00
Michael Peter Christen	926d28dd3f	fixed a bug which prevented crawl starts after a network switch	2014-04-04 14:43:35 +02:00
Michael Peter Christen	3ce8eff21b	another fix for inbound/outbound detection	2014-04-04 12:41:59 +02:00

2641 Commits