yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	8f2d3ce2f9	reduced locking situation in crawler: shifted synchronized location and reduced time-out of robots.txt load limit	2013-05-20 22:05:28 +02:00
Michael Peter Christen	77faeada4d	small memory leak patch	2013-05-11 11:19:06 +02:00
orbiter	5d442dad82	avoid NPE in regex checker	2013-04-20 10:53:49 +02:00
Marc Nause	ac478384d3	*) did some long overdue refactoring	2013-04-13 23:04:44 +02:00
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resulting search index has now the following properties: - webgraph size will have about 40 times as many entries as default index - the complete index size will increase and may be about the double size of current amount As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	2013-02-22 15:45:15 +01:00
sixcooler	3a13906121	clear some more caches if running out of memory	2013-01-25 04:24:36 +01:00
Michael Peter Christen	84f82541e8	search process enhancements	2012-12-19 10:41:22 +01:00
reger	e80dfeca23	- making blacklist path part case insensitive (solving http://bugs.yacy.net/view.php?id=171 ) - blacklist test adding explicit response text "not blocked" if no blacklist match	2012-12-08 06:34:48 +01:00
reger	1faa045dc1	fix: prevent regex pattern compile error for blacklist import for path '' (extend it to '.')	2012-12-01 22:41:21 +01:00
Michael Peter Christen	2d9e577ad0	replaced the custom robots.txt loader by the standard http loader	2012-10-28 22:48:11 +01:00
Michael Peter Christen	ccc3760a47	Refactoring and redesign of data architecture to make URIMetadataRow superfluous. The target is to make a solr document as the core of YaCy documents which would cause that many conversions can be removed. On the way to this target the Equivalence of URIMetadataRow and URIMetadataNode had to be removed to expose the usage of the old URIMetadataRow data structure. This refactoring already removes unnecessary conversions and should make memory usage during indexing lower.	2012-10-18 14:29:11 +02:00
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	2012-10-10 11:46:22 +02:00
Michael Peter Christen	1533bfd63b	refactoring	2012-09-25 21:20:03 +02:00
Michael Peter Christen	00c1c777fa	refactoring	2012-09-21 15:48:16 +02:00
Michael Peter Christen	bbd242afb4	fix for a NPE	2012-07-30 14:51:01 +02:00
Michael Peter Christen	24d9db1613	snippet retrieval loading processes may use a smaller minimum load time value than crawling processes. This speeds up the search result preparation dramatically.	2012-07-30 10:38:23 +02:00
Michael Peter Christen	1687737771	Abstraction of HandleMap and HandleSet	2012-07-27 12:13:53 +02:00
orbiter	69e743d9e3	- more abstraction for the RWI index as preparation for solr integration - added options in search index to switch parts of the index on or off	2012-07-22 13:18:45 +02:00
orbiter	0cbda0b2b8	- replaced all length() == 0 and size() == 0 with isEmpty() - replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be done automatically - implemented some isEmpty() methods	2012-07-10 22:59:03 +02:00
Roland 'Quix0r' Haeder	aef9dd0350	- removed cleaning of blacklist cache on startup - added cleaning of blacklist cache if cache is modified in interface - extended cache saving to all cache types - moved cache location to DATA/LISTS - fixed static file path which was relative to the application path but should be relative to data path - which is different in debian and mac implementations	2012-07-10 13:08:16 +02:00
Michael Peter Christen	c3db015410	prevent loading of content from the cache when retrieval with IFFRESH is used and cache is stale. Should speed up snippet generation when cache strategy is IFFRESH.	2012-07-06 08:29:41 +02:00
Michael Peter Christen	b0c408788b	made class methods static where possible	2012-07-05 12:38:41 +02:00
Michael Peter Christen	5bd3c90907	- removed unnecessary semicolons - added default case for switch	2012-07-05 11:18:31 +02:00
Michael Peter Christen	7c1ba99755	removed more unused method parameters	2012-07-05 10:44:30 +02:00
Michael Peter Christen	0301aba1e9	removed unused method parameters	2012-07-05 10:23:07 +02:00
Michael Peter Christen	ea10766bfd	cleaned unnecessary nested code	2012-07-05 08:44:39 +02:00
Michael Peter Christen	1825f165b8	better integration of blacklist according to use case	2012-07-02 13:57:29 +02:00
Michael Peter Christen	03280fb161	removed segments-concept and the Segments class: the segments had been there to create a tenant-infrastructure but were never used since that was all much too complex. There will be a replacement using a solr navigation using a segment field in the search index.	2012-06-28 14:27:29 +02:00
Michael Peter Christen	77f795756c	fixing redirects and status codes: storing of status code in ResponseHeader to make it available for late evaluations, like storage in solr.	2012-06-25 18:17:31 +02:00
Michael Peter Christen	7dc59979bc	fix for npe, possibly for http://bugs.yacy.net/view.php?id=195	2012-06-18 21:25:39 +02:00
Michael Peter Christen	4ee6fb1de9	added missing blacklist dht cache storage (maybe due to mistakes in cherry picking)	2012-06-11 00:38:02 +02:00
Roland 'Quix0r' Haeder	e4d36fa5eb	Fix to make all values lower-case (this should make all existing blacklists compatible with the new enum)	2012-06-11 00:17:53 +02:00
Roland 'Quix0r' Haeder	edaa09b9b1	Rewrote all String blacklist types to enum 'BlacklistType', closes bug #143 Conflicts: htroot/Supporter.java htroot/yacy/crawlReceipt.java htroot/yacy/transferRWI.java htroot/yacy/transferURL.java source/de/anomic/crawler/CrawlStacker.java source/de/anomic/data/ListManager.java source/net/yacy/peers/Protocol.java source/net/yacy/repository/Blacklist.java source/net/yacy/repository/LoaderDispatcher.java source/net/yacy/search/Switchboard.java source/net/yacy/search/index/MetadataRepository.java source/net/yacy/search/index/Segment.java source/net/yacy/search/query/RWIProcess.java source/net/yacy/search/snippet/MediaSnippet.java	2012-06-11 00:17:30 +02:00
Michael Peter Christen	7a329465b3	using pre-compile pattern in blacklist; should enhance search speed	2012-06-04 15:34:53 +02:00
Michael Peter Christen	7e0ddbd275	added a "fromCache" flag in Response object to omit one cache.has() check during snippet generation. This should cause less blockings	2012-05-21 03:03:47 +02:00
Michael Peter Christen	659178942f	- Redesigned crawler and parser to accept embedded links from the NOLOAD queue and not from virtual documents generated by the parser. - The parser now generates nice description texts for NOLOAD entries which shall make it possible to find media content using the search index and not using the media prefetch algorithm during search (which was costly) - Removed the media-search prefetch process from image search	2012-04-24 16:07:03 +02:00
Michael Peter Christen	33d1062c79	refactoring: the cache belongs to the crawler	2012-04-21 13:34:07 +02:00
reger	a95f645a61	Bugfix class repository.Loaddispatcher fixed download file limit of 10000 line 355: final Response response = this.load(request, cachePolicy, 10000, true);	2012-01-26 04:10:44 +01:00
Michael Peter Christen	ef5192f8c9	using the generic document parser for crawl starts instead of the html parser. This makes it possible that every type of document can be a crawl start point, not only text documents or html documents. Tested this with a pdf document.	2012-01-23 17:27:29 +01:00
Marek Otahal	f40efb39af	Blacklist loadList() remove duplicates by using Set Signed-off-by: Marek Otahal <markotahal@gmail.com>	2012-01-09 01:18:01 +01:00
Michael Christen	eebc02f5c1	fix	2012-01-04 20:24:48 +01:00
Michael Christen	216a287a85	Merge commit '6d4e08ed06c5cd28c45981b2ebe31c7f7ec6fd83' into quix0r Conflicts: source/de/anomic/crawler/CrawlQueues.java	2012-01-04 20:16:37 +01:00
Michael Christen	361146dd7a	better error handling for file loader	2011-12-29 14:37:19 +01:00
Roland 'Quix0r' Haeder	6d4e08ed06	Rewrote filesize() to (hopefully) avoid a NPE, rewrote Blacklist class to concurrent classes to avoid a CME	2011-12-29 03:42:38 +01:00
Roland Haeder	319fd1f4aa	A concurrent access can happen on the blacklist (with latest introduced blacklist check in media snippet computation)	2011-12-28 21:40:44 +01:00
Roland 'Quix0r' Haeder	a3083d13bf	Blacklist checks are now always turned on, in media searches (e.g. image search) images matching blacklist entries are no longer shown to the user	2011-12-28 20:09:17 +01:00
orbiter	e22f8497c9	- tested the ARC methods - removed strict authentication (if password is empty; this was buggy and not useful; can be switched on if necessary globally and not for each interface method) - increased speed of CrawlResults page (no dns lookup any more) - increased speed of favicon display (removed dns lookup) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8104 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-25 14:09:25 +00:00
orbiter	5a55397f99	some last-minute performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-25 11:23:52 +00:00
orbiter	d2ea250d99	refactoring: - moved many classes from de.anomic to net.yacy - made more sub-packages for search classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7973 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-25 16:59:06 +00:00
orbiter	49e5ca579f	added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7931 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-07 10:08:57 +00:00

111 Commits