yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	12fb9d7cd1	log postprocessing constraints in case that postprocessing is not performed	2014-08-04 14:19:37 +02:00
Michael Peter Christen	338f574bdc	no sorting if http/www unique fields are not demanded (makes query faster) and some code restructuring	2014-08-04 12:59:38 +02:00
Michael Peter Christen	0ceeceb35e	more logic on Solr queries; usage of the query terms in postprocessing, saving one query for double document detection now per document	2014-08-04 02:35:38 +02:00
orbiter	4099296b45	added new classes which shall reduce call overhead to Solr (stub)	2014-08-03 22:44:22 +02:00
orbiter	3491ab4c38	removed unused images from webgraph edge computation	2014-08-01 13:21:16 +02:00
orbiter	2371d6b8db	target linktexts must be string to enable search facets on these fields	2014-08-01 13:20:25 +02:00
Michael Peter Christen	001e05bb80	do not store failure of loading of robots.txt into the index as a fail document	2014-08-01 12:15:14 +02:00
Michael Peter Christen	05d58e4df0	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-08-01 12:04:25 +02:00
Michael Peter Christen	98f45c9032	fix for image alt attachment to AnchorURLs in html parser.	2014-08-01 12:04:15 +02:00
orbiter	22ce4fb4dd	better error handling for remote solr queries and exists-checks	2014-08-01 11:00:10 +02:00
orbiter	738989aab7	reverted commit `f94c91315b` because the webgraph has not enough performance for that	2014-07-29 18:49:42 +02:00
Michael Peter Christen	c115f3869c	enhanced snippet computation and test method in ViewFile	2014-07-28 15:42:57 +02:00
orbiter	1027f3d04a	fix for the usage of ready-prepared solr queries, some queries are formulated as edismax query but this was not set as query attribute. The defType=edismax property needs a qf-field, so this was added as well. Do not remove that field again! This fixes also a problem with title-unique computation.	2014-07-25 18:53:13 +02:00
Michael Peter Christen	f94c91315b	if the webgraph is used, then use it also for reference computation to avoid contradictions with references_i in the collection index.	2014-07-24 15:35:53 +02:00
Michael Peter Christen	6e1dc444c3	added a snippet test function in ViewFile: you can now search for a specific word on the document; the servlet returns the snippet in the same way as it would be shown in a search result.	2014-07-24 14:59:37 +02:00
Michael Peter Christen	b44626e55b	fixed target_alt_t in webgraph	2014-07-22 18:24:10 +02:00
Michael Peter Christen	504327b15c	fix for condition for writing the webgraph	2014-07-22 00:59:08 +02:00
Michael Peter Christen	542c20a597	changed handling of crawl profile field crawlingIfOlder: this should be filled with the date, when the url is recognized as to be outdated. That field was partly misinterpreted and the time interval was filled in. In case that all the urls which are in the index shall be treated as outdated, the field is filled now with Long.MAX_VALUE because then all crawl dates are before that date and therefore outdated.	2014-07-22 00:23:17 +02:00
Michael Peter Christen	4eec1a7452	refactoring (change Metadata name of load time data structure to avoid confusion with Node data which is also called metadata)	2014-07-21 23:54:23 +02:00
reger	f96cfdc84d	prevent array out of bound exception on getRankingProfile(x) on faulty &profileNr= query parameter	2014-07-21 00:04:54 +02:00
reger	a2cb366b25	Combine /heuristic search modifier with opensearch configured targets - with search modifier /heuristic a request is sent to all configured opensearch target systems (old /heuristic/blekko modifier no longer valid) - this allows to use opensearch heuristic on individual search request (in contrast to configuration HEURISTIC_OPENSEARCH=true which sends an osd request on all global searches) - the index.html searchoption text adjusted to be displayed only if option configured - add Archive-It to predefined systems	2014-07-20 00:00:43 +02:00
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead.	2014-07-18 12:43:01 +02:00
Michael Peter Christen	bf1b6b93e7	do not write CR values to webgraph if no CR values are computed	2014-07-16 18:13:29 +02:00
Michael Peter Christen	d07cdd8c3b	added SolrCloud access mode and configuration	2014-07-16 14:57:51 +02:00
Michael Peter Christen	8514bffc22	enhanced postprocessing status report	2014-07-16 14:57:25 +02:00
Michael Peter Christen	b5fc2b63ea	removed exist() retrieval functions from error cache and replaced it with metadata retrieval from connectors directly. This should cause better usage of the cache. Automatically increase the metadata cache if more memory is available.	2014-07-11 19:52:25 +02:00
Michael Peter Christen	62c72360ee	cleanup of checkAcceptanceInitially in CrawlStacker, should avoid double-calling of solr	2014-07-11 18:36:04 +02:00
Michael Peter Christen	b5d78ba156	reduced number of solr queries during crawling	2014-07-11 18:05:11 +02:00
Michael Peter Christen	fd87fa1613	removed more unnecessary exist-checks in ErrorCache	2014-07-11 16:48:08 +02:00
Michael Peter Christen	f2b476e08b	don't do a double check to solr for failed documents if they are not written to solr	2014-07-11 16:26:52 +02:00
orbiter	dab9a0786a	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-07-11 04:04:34 +02:00
orbiter	51bf5c85b0	Renamed the transmission cloud to buffer in dispatcher since the name 'cloud' was a bad idea. Changed also the accumulation process for peer targets so that every dht chunk is not assigned the set of redundant targets but they are assigned to redundant targets individually. This enhances the granularity of the target accumulation and should enhance the efficiency of the process. Finally the dht protocol client was enriched with the ability to remove the 'accept remote index' flag from peers or remove peers completely if they do not answer at all.	2014-07-11 04:04:09 +02:00
Michael Peter Christen	fb3dd56b02	fix for processing of noindex flag in http header	2014-07-10 17:13:35 +02:00
Michael Peter Christen	b0d941626f	fixed bugs in canonical, robots and title/description unique calculation	2014-07-10 15:40:38 +02:00
reger	d9472d043a	cleanup older unused classes	2014-07-10 02:20:01 +02:00
reger	665e12f88e	move startup time from old serverCore to switchboard (most used here) to make servercore eventually obsolete.	2014-07-10 02:17:56 +02:00
reger	336425912a	remove unused localSearchThread from SearchEvent	2014-07-10 02:14:03 +02:00
Michael Peter Christen	1092e798a5	fixed double content postprocessing	2014-07-07 19:15:11 +02:00
orbiter	59160984cc	timeline performance update	2014-07-03 13:06:29 +02:00
orbiter	2073e69034	fix for long periods in timeline	2014-07-02 11:29:50 +02:00
Michael Peter Christen	09dcdb9b19	update to solr 4.9.0	2014-07-01 16:39:00 +02:00
Michael Peter Christen	1cd4b2e8be	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-07-01 16:06:12 +02:00
Michael Peter Christen	8c52f0651b	refactoring of AccessTracker events & timeline fix	2014-07-01 16:06:01 +02:00
reger	431a5f9c4e	added test case for TextSnippet, removed obsolete/unused parameter and reference to MediaSnippet	2014-06-30 05:36:48 +02:00
Michael Peter Christen	5b94a257ce	no timeout for large reference collections	2014-06-29 22:26:22 +02:00
Michael Peter Christen	f5b817bac4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-06-29 22:25:08 +02:00
reger	a5707cd2eb	enable proper Author navigator - author facet is based on omitted author_sxt field - adjust to make author nav available on exist of author field but keep using author_sxt to construct the facet (why!?) - add check for querymodifier author in searchevent	2014-06-27 23:05:06 +02:00
Michael Peter Christen	74206a10c7	refactoring	2014-06-27 14:40:36 +02:00
orbiter	fec673c9d1	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-06-27 10:15:37 +02:00
orbiter	c59da9fe7a	added access tracker log reader stub	2014-06-27 10:14:36 +02:00
Michael Peter Christen	36e623d8bf	enhanced metadata enrichment for media file type search: - Web servers may now deliver YaCy-specific http header field with a title and keywords. The new http header fields are: X-YaCy-Media-Title - to be used for media (image, audio, video) titles X-YaCy-Media-Keywords - to be used for media (image, audio, video) keywords - both fields are written to document fields title and keywords and are searched also during image search. - to make the usage of arbitrary http header fields (including these new fields) possible in the /api/push_p.json servlet, a new POST argument is also introduced to push http header fields. The new POST attribute is named "responseHeader-X" (where X is the counter). It is allowed to use this attribute as multi-attribute several times, each can be filled with a http header line. - see /api/push_p.html for examples	2014-06-26 13:02:35 +02:00
Michael Peter Christen	b893c42a0f	bugfix for image search	2014-06-26 12:56:33 +02:00
orbiter	0bbb5040b8	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-06-15 12:38:52 +02:00
orbiter	9d5d86cd03	Added filter query options to the ranking servlet /RankingSolr_p.html. Filter queries are not actually related to ranking, but user requests have pointed out that specific boost queries to move results to the end of the result list are not sufficient. Such boost filters may be better executed as actual filter and therefore such a filter can now be statically applied to every search request. A typical use could be the expression "http_unique_b:true AND www_unique_b:true" which uses the recently introduced fields http_unique_b and www_unique_b which are true only for one of the alternatives with/without http(s) and with/without prefix 'www.' in host names.	2014-06-15 12:38:30 +02:00
Michael Peter Christen	d2151857f1	Added collection navigation: The collection field (can be filled e.g. in Crawl Start) can be used to add categories to YaCy index entries. The usage of that field was restricted to solr searches and post argument filters as implemented in commit `f7571386a3`. This commit extends collections to a full navigation option in the standard YaCy search interface. The field is not active by default but can be activated easily in the /ConfigSearchPage_p.html servlet (just check the 'Collection' facet field). Collections can now be used for (at least) two purposes: - to provide search tenants (through post argument collection) - to provide self-made category navigation Search requests may now have (independently from switched on or off collection facet) a "collection:<collection-name>" modifier attached; furthermore collection names may use disjunctions using the '|' pipe symbol. For example, this is a valid search request: www collection:user|proxy	2014-06-15 12:11:23 +02:00
Michael Peter Christen	74c249288a	added a push api to make it possible to upload files directly without crawling to the YaCy indexer. Files are uploaded using POST multipart requests; multiple file uploads are possible as well. Each file has attached the file date and mime type which is used to get the right parser for the submitted data. Also an url is submitted which is assigned to the document. The CrawlSwitchboard has a new option for default Crawl Profiles which are assigned dynamically from the new push interface.	2014-06-12 18:10:07 +02:00
Michael Peter Christen	ba6ffddefc	refactoring	2014-06-12 05:23:26 +02:00
Michael Peter Christen	0c324d735c	NPE fix for postprocessing without term index	2014-06-04 12:28:28 +02:00
Michael Peter Christen	922979aae1	added option to prefer http over https in unique-protocol ranking	2014-06-02 17:40:56 +02:00
Michael Peter Christen	b3b174e2b8	fixed webgraph postprocessing and status display in Crawler_p servlet	2014-06-02 15:06:38 +02:00
Michael Peter Christen	f23c4142e0	added option to configure a custom user agent within allip networks	2014-06-01 01:02:03 +02:00
Michael Peter Christen	8ad41a882c	fixed several problems with postprocessing: - unique-postprocessing was destroying results from other postprocessings; removed cross-updates as they had been not necessary - unique-postprocessing did not restrict on same protocol - inefficient concurrent update cache was redesigned completely - increased limits for concurrent blocking queues to prevent early time-out	2014-05-29 13:24:24 +02:00
Michael Peter Christen	ff5b3ac84d	added new fields http_unique_b and www_unique_b which can be used for ranking to prefer urls containing a www subdomain or using the https protocol	2014-05-27 15:28:28 +02:00
Michael Peter Christen	f0db501630	better handling of ranking parameters and new default values for date navigation which is done using ranking in solr.	2014-05-22 03:01:07 +02:00
Michael Peter Christen	53948da7d0	tried to make last_modified recognition smarter	2014-05-22 00:28:51 +02:00
Michael Peter Christen	6634b5b737	debug code for index distribution testing	2014-05-21 18:20:16 +02:00
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	2014-05-20 21:50:16 +02:00
sixcooler	830057d788	lower Segment-size (hope to get Segments of 10GB) see: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5216&p=30036#p30034	2014-05-19 17:55:03 +02:00
orbiter	c028ae9b09	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-05-18 21:21:17 +02:00
reger	e31493e139	"Use remote proxy for yacy" has no function, remove option and related config item see/fix bug http://mantis.tokeek.de/view.php?id=23 http://mantis.tokeek.de/view.php?id=189	2014-05-17 23:36:59 +02:00
orbiter	0d8072aa99	removed warnings	2014-05-13 22:29:05 +02:00
Michael Peter Christen	a1ac4c3b76	automatically clear graphics cache	2014-05-12 15:45:25 +02:00
reger	1432a817dd	respect "index media" switched off in CrawlStartExpert.html fix http://mantis.tokeek.de/view.php?id=64	2014-05-08 22:21:24 +02:00
Michael Peter Christen	4e734815e8	enhanced snippets: remove lines which are identical to the title and choose longer versions if possible. Prefer the description part.	2014-05-06 16:48:50 +02:00
Michael Peter Christen	e84e07399a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-05-06 14:51:57 +02:00
reger	8a7c68e4c7	content of surrogates/out never accessed (remove) After import the content is never accessed but may take up a lot of disk space, also the getLoadedOAIServer (which lists the files in surrogate out) is not used. Making the surrogate.out obsolete. Removed keeping of xmls after import.	2014-05-04 09:29:07 +02:00
Michael Peter Christen	229f2248b8	added configuration option for maximum load and minimum ram for postprocessing	2014-04-30 13:26:32 +02:00
orbiter	8e5ce7cd51	fixed a situation where finished crawls had not been detected.	2014-04-23 23:13:07 +02:00
orbiter	ccb1864d55	catch IllegalArgumentException for wrong process types (that is needed for migrations when new process types are introduced or disappear)	2014-04-22 23:14:05 +02:00
orbiter	4ee4ba1576	fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of lazy value instantiation of 0-value in crawldepth_i	2014-04-22 19:48:49 +02:00
reger	727dfb5875	refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex	2014-04-20 01:41:30 +02:00
Michael Peter Christen	5746aae3db	add canonical links to the same crawldepth, not the next crawldepth	2014-04-18 06:51:46 +02:00
Michael Peter Christen	74ab5ef9fa	increased runtime for postprocessing query job	2014-04-18 06:51:10 +02:00
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	2014-04-17 13:21:43 +02:00
Michael Peter Christen	c2f62e783f	- better subgraph handling, less overhead for crawls without the webgraph - usage of crawler crawldepth cache for the linkgraph target depth computation	2014-04-17 12:54:18 +02:00
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	2014-04-16 22:16:20 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is, that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such on-demand file reader shall prevent that the number of file pointers is over the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
Michael Peter Christen	075b6f9278	refactoring of the crawl balancer: the balancer is turned into an interface and the old balancer class is moved into LegacyBalancer to make room for a fresh implementation of a crawl balancer.	2014-04-14 13:32:35 +02:00
Michael Peter Christen	8aeef73d49	fix for virtual root nodes	2014-04-11 15:12:34 +02:00
Michael Peter Christen	7c7fbb9818	find depth-matches also for edge targets	2014-04-11 12:27:21 +02:00
Michael Peter Christen	dd12dd392f	introduction of a data structure for HyperlinkEdges which should use less memory as it does no double-storage of source links for each edge of the graph.	2014-04-11 12:09:33 +02:00
Michael Peter Christen	6ea8bb7348	using MultiProtocolURL for edge data which is faster (hash computation is now much easier) and smaller in size	2014-04-11 10:58:37 +02:00
Michael Peter Christen	a37d067692	refactoring	2014-04-10 23:46:35 +02:00
orbiter	95780eed32	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-04-10 21:40:54 +02:00
Michael Peter Christen	67beef657f	strong redesign of html parser: object recursion is now made using a stack on html tag objects, not using a recursive parse-again method which may cause bad performance and huge memory allocation. The new method also produced better parsed image objects with exact anchor text references.	2014-04-10 18:58:03 +02:00
Michael Peter Christen	6bd8c6f195	fix for wrong status codes of error pages	2014-04-10 09:08:59 +02:00
orbiter	67501c9dda	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-04-09 19:58:54 +02:00
Michael Peter Christen	1c21b3256d	fix for robots.txt handling: delete old entry before starting a new crawl.	2014-04-09 18:33:48 +02:00
orbiter	c250fac9f4	linkstructure refactoring to get more options for clickdepth analysis	2014-04-09 17:52:51 +02:00
Michael Peter Christen	8068e68474	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-04-09 12:45:15 +02:00
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	2014-04-09 12:45:04 +02:00
reger	f326a67561	fix: typo in default charset in metadata2solr update pom and NB build to Solr 4.7.1 libs	2014-04-06 22:31:22 +02:00
Michael Peter Christen	df138084c0	do solr optimization independently from memory and load constraints: - not doing an optimization will likely cause a too many files exception - without optimization performance will be even worse which would prevent optimization in the future as well (prevent a deadlock situation)	2014-04-06 11:04:23 +02:00
Michael Peter Christen	ebd44a7080	replaced solr 4.6.1 with solr 4.7.1 and added index migration to lucene_47	2014-04-06 10:45:03 +02:00
Michael Peter Christen	466d90ad42	fixed a problem with resource observer; probably coming from uncaught exceptions within the apache library which appear only in concurrency environments.	2014-04-04 15:26:39 +02:00
Michael Peter Christen	e8ddd415a8	enhanced the new link structure graph	2014-04-04 14:43:54 +02:00
Michael Peter Christen	926d28dd3f	fixed a bug which prevented crawl starts after a network switch	2014-04-04 14:43:35 +02:00
Michael Peter Christen	3ce8eff21b	another fix for inbound/outbound detection	2014-04-04 12:41:59 +02:00
orbiter	3c1274057d	fixed thread dump in case of wrong seeds	2014-04-04 10:54:56 +02:00
orbiter	18f9c40302	moved Edge class out of linkstructure servlet as this does not work on non-eclipse driven environments (all non-dev cases)	2014-04-04 10:54:11 +02:00
Michael Peter Christen	c64c10ef00	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-04-03 01:58:06 +02:00
Michael Peter Christen	48fbfa60c1	bugfix to inbound/outbound identification	2014-04-03 01:21:43 +02:00
reger	227c42bc96	eliminate obsolete URIMetaDataRow class by joining it with/into URIMetaDataNode.	2014-04-03 00:35:15 +02:00
Michael Peter Christen	cca851a417	introduced new solr field crawldepth_i which records the crawl depth of a document. This is the upper limit for the clickdepth_i value which may be shorter in case that the crawler did not take the shortest path to the document.	2014-04-02 23:37:01 +02:00
Michael Peter Christen	63c9fcf3e0	free configuration of postprocessing clickdepth maximum depth and time	2014-04-02 02:34:39 +02:00
Michael Peter Christen	8b44fcf0f4	added missing @Override annotation	2014-03-28 13:48:37 +01:00
Michael Peter Christen	e515dd460d	added linkscount_i and linksnofollowcount_i to the default solr schema	2014-03-27 23:36:08 +01:00
Michael Peter Christen	cbdfef7ce1	changed protocol facet to show also all other counts if one facet is selected	2014-03-27 13:29:14 +01:00
Michael Peter Christen	61ad194065	fix for source and target clickdepth in webgraph index	2014-03-26 16:00:05 +01:00
Marc Nause	809b4e1fd9	Team added support for URLs with unicode characters in host part to blacklist. Punycode is used to handle unicode characters.	2014-03-25 22:14:54 +01:00
reger	ca7444dbdf	limit filetype nav to known extension also on image/media search - on text search we limit filetype nav already to known extension, apply filter to image search	2014-03-23 23:10:29 +01:00
Michael Peter Christen	d1091e79f8	- added stealth button to navigation menu - more fixes to progress bar	2014-03-21 18:01:26 +01:00
orbiter	3c8d6e1eee	added adminAccount switch to ConfigAccounts_p servlet to switch on protection of all pages; some refactoring as well	2014-03-20 22:11:49 +01:00
orbiter	7d24bcb98d	added flag to require that all web pages, even such without a "_p" extension require authorization. (default off)	2014-03-20 19:09:47 +01:00
Michael Peter Christen	b08375da33	fix for bad/missing values of size_i	2014-03-11 09:51:04 +01:00
Michael Peter Christen	51800007c4	- added concurrency to postprocessing of webgraph document - bundled separate webgraph postprocessing steps into one	2014-03-06 01:43:48 +01:00
Michael Peter Christen	e485fbd0ce	- let crawl loader jobs die after 10 seconds without new jobs - corrected shutdown order to prevent a deadlock during shutdown	2014-03-04 00:33:13 +01:00
Michael Peter Christen	bcd9dd9e1d	enhanced concurrent loading by using a fixed set of concurrent loader processes in favor of throwaway-processes. The control mechanism does less often report a 'queue full' message to the busy loop which then does not perform a long busy waiting; instead all requests are queued and new loader processes are started if necessary up to a given limit (as set before)	2014-03-03 22:13:40 +01:00
Michael Peter Christen	6ed9c0164e	attaching names to all Threads to get a better view in profiling tools like VisualVM	2014-02-28 15:02:01 +01:00
Michael Peter Christen	fdaeac374a	- enhanced postprocessing speed and memory footprint (by using HashMaps instead of TreeMaps) - enhanced memory footprint of database indexes (by introduction of optimize calls) - optimize calls shrink the amount of used memory for index sets if they are not changed afterwards any more	2014-02-28 14:01:09 +01:00
Michael Peter Christen	d325cb8912	fixes and enhancements for postprocessing	2014-02-28 02:51:14 +01:00
Michael Peter Christen	7c1b968378	another fix for the shutdown exceptions	2014-02-28 01:53:32 +01:00
Michael Peter Christen	1d069c5861	make sure that postprocessed documents are overwritten	2014-02-27 12:27:15 +01:00
Michael Peter Christen	e644981697	added one more postprocessing low memory check	2014-02-27 00:34:13 +01:00
Michael Peter Christen	e1bf65c892	added short memory protection during postprocessing	2014-02-26 23:02:56 +01:00
Michael Peter Christen	7640834b37	removed double concurrency to put Solr documents into the index. The writings to the solr index are also buffered in ConcurrentUpdateSolrConnector	2014-02-26 22:21:00 +01:00
Michael Peter Christen	0f6b72f24b	do not use luke requests for remote solr servers if the result is different from normal requests. This happens if the remote solr is actually a solrCloud; in such cases the luke request returns only the result of the single solr peer, not the whole cloud. also done: some refactoring.	2014-02-26 14:30:48 +01:00
Michael Peter Christen	a2b66fe2eb	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-02-25 14:37:39 +01:00
Michael Peter Christen	9f6be762a6	- better logging for postprocessing - fixed collection bug in postprocessing	2014-02-25 14:37:30 +01:00
orbiter	ced1a96f9c	fixed error cache	2014-02-25 02:16:22 +01:00
orbiter	cfb647db6e	- introduced a miss cache in ConcurrentUpdateSolrConnector - better usage of cache - bugfix for postprocessing	2014-02-24 23:42:50 +01:00
orbiter	a87d8e4a8e	changed caching of ConcurrentUpdateSolrConnector: it caches now also the url along with the load date. While this takes much more memory, it eliminates database lookups for getURL() requests, which happen equally often. This speeds up remote solr configurations.	2014-02-24 22:59:58 +01:00
orbiter	f6e441dd77	refactoring	2014-02-24 21:01:56 +01:00
orbiter	76c53faeb2	removed unused code (HostStat)	2014-02-24 20:51:43 +01:00
reger	0923b09216	fix: allow 4 character admin user name (was min 5 char)	2014-02-24 00:01:11 +01:00
Michael Peter Christen	254a7ac66c	fixed cleaning of index	2014-02-22 01:35:01 +01:00
Michael Peter Christen	69391e5d9e	changed strategy to test existence of documents in Solr: using the update time. The reason for that is a better caching for the crawler double-check, which needs the update time for crawler steering.	2014-02-19 04:03:45 +01:00
Michael Peter Christen	790f103f32	delete fail-docs during postprocessing to prevent that they will appear again and stay in postprocessing forever.	2014-02-18 01:38:56 +01:00
Michael Peter Christen	9eb668e951	enhanced the resource observer. The resource observer is now able to recognize free disk space AND available space for YaCy. The amount of space which is assigned to YaCy is defined in new settings in the configuration file. Furthermore, there is now a cleanup process which deletes files in case that an autodelete is activated. The autodelete is now BY DEFAULT ON if the disk space is low, which means that YaCy starts to delete documents when the disk is full!	2014-02-12 01:00:44 +01:00
Michael Peter Christen	bf97e38b83	removed clearURLIndex, which is a stub remaining from the old metadata database and not needed any more	2014-02-11 22:01:25 +01:00


1048 Commits