yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	659178942f	- Redesigned crawler and parser to accept embedded links from the NOLOAD queue and not from virtual documents generated by the parser. - The parser now generates nice description texts for NOLOAD entries which shall make it possible to find media content using the search index and not using the media prefetch algorithm during search (which was costly) - Removed the media-search prefetch process from image search	2012-04-24 16:07:03 +02:00
Michael Peter Christen	f5efdb21fd	refactoring	2012-04-24 12:54:41 +02:00
Michael Peter Christen	f8cd57c92f	new indexing strategy: ALL links that appear anywhere are indexed, not only links where the content can be parsed. All non-parseable links are placed into the noload queue. The search process must therefore be able to filter out non-text search results. - This fixes the problem that image search results appeared in the text search. - The interactive search can now retrieve ALL types of links - The p2p interface is now extended to retrieve only certain types of links (text, image, video, apps) - The search process has an extension to filter the right document type according to the search query	2012-04-22 02:05:17 +02:00
Michael Peter Christen	a1a5b015d8	refactoring: moved document Classification to cora package	2012-04-21 21:31:13 +02:00
Michael Peter Christen	a5d7da68a0	refactoring: removed dependency from switchboard in Balancer/CrawlQueues	2012-04-21 13:47:48 +02:00
Michael Peter Christen	33d1062c79	refactoring: the cache belongs to the crawler	2012-04-21 13:34:07 +02:00
Michael Christen	22f05c83ff	fixed default must-match filter for full domain crawls - the old filter was too restrictive and did not allow intranet crawls	2012-03-28 21:50:00 +02:00
Michael Peter Christen	0cc0290978	bugfix for a must-not-match pattern check. This bug did not make the check semantically wrong, but a trick meant to prevent an IP lookup when the filter is not used did not work. With this bugfix, crawling gets a huge speed boost for noload urls!	2012-02-27 00:52:44 +01:00
Michael Peter Christen	2fc8ecee36	ConcurrentLinkedQueue has a VERY long return time on the .size() method. See http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html and the following test program: public class QueueLengthTimeTest { public static long countTest(Queue<Integer> q, int c) { long t = System.currentTimeMillis(); for (int i = 0; i < c; i++) { q.add(q.size()); } return System.currentTimeMillis() - t; } public static void main(String[] args) { int c = 1; for (int i = 0; i < 100; i++) { Runtime.getRuntime().gc(); long t1 = countTest(new ArrayBlockingQueue<Integer>(c), c); Runtime.getRuntime().gc(); long t2 = countTest(new LinkedBlockingQueue<Integer>(), c); Runtime.getRuntime().gc(); long t3 = countTest(new ConcurrentLinkedQueue<Integer>(), c); System.out.println("count = " + c + ": ArrayBlockingQueue = " + t1 + ", LinkedBlockingQueue = " + t2 + ", ConcurrentLinkedQueue = " + t3); c = c * 2; } } }	2012-02-27 00:42:32 +01:00
Michael Peter Christen	c6c61be3f0	fix for http://bugs.yacy.net/view.php?id=148	2012-02-24 00:38:57 +01:00
Michael Peter Christen	0d148c3353	more logging in resource observer	2012-02-23 01:20:42 +01:00
Michael Peter Christen	2fa037ae1d	enhanced crawler	2012-02-23 01:20:24 +01:00
Lotus	ee89cf5ae5	fix must match filter for full domain crawl allow: http://www.example.com http://www.example.com/ http://www.example.com/abc.html?xyz=q block: http://www.example.com.cn http://www.example.com.cn/dsf	2012-02-07 16:13:13 +01:00
Michael Peter Christen	9ad1d8dde2	complete redesign of crawl queue monitoring: do not look at a ready-prepared crawl list but at the stacks of the domains that are stored for balanced crawling. This also affects the balancer since that does not need to prepare the pre-selected crawl list for monitoring. As an effect: - it is no longer possible to see the correct order of next to-be-crawled links, since that depends on the actual state of the balancer stack the next time another url is requested for loading - the balancer works better since the next url can be selected according to the current situation and not according to a pre-selected order.	2012-02-02 21:33:42 +01:00
Michael Peter Christen	1f4f60654a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/document/parser/pdfParser.java	2012-01-24 20:42:30 +01:00
Michael Peter Christen	2ee8cbeb2c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/search/Switchboard.java	2012-01-05 18:37:46 +01:00
Michael Peter Christen	992dbdf4bb	added noload statistic to servlets	2012-01-05 18:33:05 +01:00
Michael Christen	c21966bb43	fix	2012-01-04 23:02:12 +01:00
Michael Christen	13b05f9c08	fix	2012-01-04 23:01:04 +01:00
Michael Christen	e5d878c59e	Merge branch 'master' of ssh://gitorious.org/yacy/rc1 Conflicts: source/de/anomic/crawler/CrawlQueues.java	2012-01-04 22:08:17 +01:00
Michael Christen	ec26b2bea4	Merge commit 'fa08ed5ae5d72bddc3cc6a662b23103579e86109' into quix0r Conflicts: source/de/anomic/crawler/CrawlQueues.java	2012-01-04 20:32:42 +01:00
Michael Christen	216a287a85	Merge commit '6d4e08ed06c5cd28c45981b2ebe31c7f7ec6fd83' into quix0r Conflicts: source/de/anomic/crawler/CrawlQueues.java	2012-01-04 20:16:37 +01:00
stbrumm	d18095dc48	Patch for Issue 0000102 and fixes to the patch (private peer status is a property of a peer, not a status)	2012-01-03 17:49:37 +01:00
Roland 'Quix0r' Haeder	901f37d608	Also this ... :( #2	2011-12-29 00:36:56 +01:00
Roland 'Quix0r' Haeder	a985717ed2	Also this ... :(	2011-12-29 00:35:51 +01:00
Roland 'Quix0r' Haeder	5f490de554	Fix for ported fix from my old days ...	2011-12-29 00:34:46 +01:00
Roland 'Quix0r' Haeder	fa08ed5ae5	Fixed a lot of CHMOD rights (no need for execute flag on .java/.html) and introduced local/remote crawl size ratio based check	2011-12-29 00:33:16 +01:00
Michael Christen	9e5894c784	Removed handling of components objects for URIMetadataRows. This is a preparation to replace this rows with nodes from the node store.	2011-12-17 01:27:08 +01:00
Michael Christen	c04bfaa51b	refactoring	2011-12-16 23:59:29 +01:00
Michael Christen	6e66c9d7f1	fix for http://bugs.yacy.net/view.php?id=87	2011-12-05 23:46:42 +01:00
Michael Christen	e7e429705a	- less automatic indexing after a search (needs to reset the default crawl profiles) - fix for concurrency problem in storage of serverSwitch Properties - markup update	2011-12-05 16:22:11 +01:00
orbiter	11729061f2	added an option in the bookmark import process to put everything into the crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8134 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-12-03 00:27:01 +00:00
orbiter	8895d8c1cd	removed unnecessary log entries git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8117 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-27 16:54:48 +00:00
orbiter	5a55397f99	some last-minute performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-25 11:23:52 +00:00
orbiter	e4a82ddd8b	produce a bookmark entry from every crawl start. these bookmarks are always private. these bookmarks will be used to get a source reference for the search in case of intranet or portal searches. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8062 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-21 23:10:29 +00:00
orbiter	aa322bc6d0	fix git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8050 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-16 15:36:30 +00:00
orbiter	97d1347adb	added also a default accept field to robots.txt downloads git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8049 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-16 15:33:55 +00:00
orbiter	f183d3822c	added a default accept header in http requests since some http fraud detection functions check that this header field exists; see also: http://bad-behavior.ioerror.us/ in source file browser.inc.php git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8048 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-16 15:27:43 +00:00
orbiter	06352b8d6b	more logging git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8047 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-16 14:09:50 +00:00
orbiter	a99934226e	more logging for debugging of robots.txt git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8046 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-16 13:56:31 +00:00
orbiter	7a5841e061	fix for robot parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8045 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-16 13:12:46 +00:00
orbiter	458c20ff72	fix for robot parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8044 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-16 13:06:46 +00:00
orbiter	017a01714d	- enhanced logging in robots.txt parser for remote debugging - robots.txt is now more robust against database operations git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8043 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-16 01:03:49 +00:00
orbiter	eb1c7c041d	write info about robots.txt evaluation into getpageinfo_p.xml git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8038 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-15 00:33:54 +00:00
orbiter	775b44017e	refactoring git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8033 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-14 15:11:57 +00:00
orbiter	78ce3b13be	typo git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8027 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-10 11:57:26 +00:00
orbiter	85d6bf4ac4	fixed urls to media content during indexing git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8021 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-09 15:40:14 +00:00
orbiter	3a807e10cf	- added a cache for active crawl profiles to the crawl switchboard - moved the domain cache for domain counter from the crawl switchboard to the crawl profiles. the crawl domain counter is now therefore relative for each crawl start, not for the whole crawler. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8018 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-08 15:38:08 +00:00
orbiter	37e35f2741	normalization of url using urlencoding/decoding git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8017 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-08 12:02:22 +00:00
orbiter	1b86d06d1e	fix for http://bugs.yacy.net/view.php?id=62 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8004 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-10-26 10:07:16 +00:00
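
The test program quoted inline in commit 2fc8ecee36 is hard to read as a single line. A laid-out version might look like the sketch below. The imports are added here, and the loop bound of 2^14 elements is an assumption made so that the run finishes quickly; the quoted program instead doubles the count for 100 rounds. The comparison rests on the documented fact that `ConcurrentLinkedQueue.size()` is not a constant-time operation and has to traverse the queue.

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueLengthTimeTest {

    // Adds c elements to q, calling size() before every add,
    // and returns the elapsed wall-clock time in milliseconds.
    public static long countTest(Queue<Integer> q, int c) {
        long t = System.currentTimeMillis();
        for (int i = 0; i < c; i++) {
            q.add(q.size());
        }
        return System.currentTimeMillis() - t;
    }

    public static void main(String[] args) {
        // Assumption: cap the element count at 2^14 so the run terminates quickly.
        for (int c = 1; c <= (1 << 14); c = c * 2) {
            Runtime.getRuntime().gc();
            long t1 = countTest(new ArrayBlockingQueue<Integer>(c), c);
            Runtime.getRuntime().gc();
            long t2 = countTest(new LinkedBlockingQueue<Integer>(), c);
            Runtime.getRuntime().gc();
            long t3 = countTest(new ConcurrentLinkedQueue<Integer>(), c);
            System.out.println("count = " + c
                    + ": ArrayBlockingQueue = " + t1
                    + ", LinkedBlockingQueue = " + t2
                    + ", ConcurrentLinkedQueue = " + t3);
        }
    }
}
```

Because each `size()` call on a `ConcurrentLinkedQueue` walks the whole queue, the third column should grow roughly quadratically with the count, while the two blocking queues keep an internal counter. Actual timings depend on the JVM and machine.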


485 Commits
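
Commit ee89cf5ae5 lists URLs that a full-domain must-match filter should allow and block. The sketch below shows one way such a filter could be expressed as a regular expression. It is an illustration only, not the pattern YaCy actually generates: the class name `MustMatchFilterSketch`, the helper names, and the pattern itself (including the optional port) are assumptions made for this example.

```java
import java.util.regex.Pattern;

public class MustMatchFilterSketch {

    // Hypothetical helper: builds a pattern that accepts the bare host,
    // the host with an optional port, and any path or query below it.
    static Pattern fullDomainFilter(String host) {
        // Pattern.quote keeps the dots in the host name literal.
        return Pattern.compile("https?://" + Pattern.quote(host) + "(:\\d+)?(/.*)?");
    }

    // The whole URL must match, so "www.example.com.cn" is rejected.
    static boolean accepts(Pattern filter, String url) {
        return filter.matcher(url).matches();
    }

    public static void main(String[] args) {
        Pattern filter = fullDomainFilter("www.example.com");
        String[] urls = {
            "http://www.example.com",
            "http://www.example.com/",
            "http://www.example.com/abc.html?xyz=q",
            "http://www.example.com.cn",
            "http://www.example.com.cn/dsf"
        };
        for (String url : urls) {
            System.out.println((accepts(filter, url) ? "allow: " : "block: ") + url);
        }
    }
}
```

The important design point is that the host must be followed by either the end of the URL, a port, or a slash. A filter that only checks for the host as a prefix would also let through `www.example.com.cn`.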