yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	eca9380e3d	bugfix for crawler double-check: if an url is redirected, the redirect-target was not double-checked. This is now done by replacing the redirect-URL on the crawl queue again (where it is double-checked)	2014-08-06 12:35:12 +02:00
Michael Peter Christen	9ac0c93f17	fix for subpath crawl filter	2014-08-06 01:33:24 +02:00
Michael Peter Christen	66106bdaf0	fix for crawler attribute maxdompages	2014-08-05 21:32:25 +02:00
Michael Peter Christen	49d91b94c3	npe fix in crawler	2014-08-05 21:31:59 +02:00
Michael Peter Christen	c465b791af	typo	2014-08-04 16:13:39 +02:00
Michael Peter Christen	3c23b89823	less logging	2014-08-04 13:37:34 +02:00
Michael Peter Christen	1609763be5	toString fix	2014-08-04 12:58:39 +02:00
Michael Peter Christen	001e05bb80	do not store failure of loading of robots.txt into the index as a fail document	2014-08-01 12:15:14 +02:00
Michael Peter Christen	05d58e4df0	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-08-01 12:04:25 +02:00
Michael Peter Christen	98f45c9032	fix for image alt attachment to AnchorURLs in html parser.	2014-08-01 12:04:15 +02:00
orbiter	22ce4fb4dd	better error handling for remote solr queries and exists-checks	2014-08-01 11:00:10 +02:00
orbiter	e9163e7e10	fix for malformed hostpath names in crawl balancer	2014-07-29 11:18:45 +02:00
Michael Peter Christen	6e1dc444c3	added a snippet test function in ViewFile: you can now search for a specific word on the document; the servlet returns the snippet in the same way as it would be shown in a search result.	2014-07-24 14:59:37 +02:00
orbiter	4b06adb751	fix for file urls	2014-07-23 17:54:31 +02:00
Michael Peter Christen	542c20a597	changed handling of crawl profile field crawlingIfOlder: this should be filled with the date at which a url is considered outdated. That field was partly misinterpreted and the time interval was filled in instead. If all urls in the index shall be treated as outdated, the field is now filled with Long.MAX_VALUE, because then all crawl dates are before that date and therefore outdated.	2014-07-22 00:23:17 +02:00
Michael Peter Christen	4eec1a7452	refactoring (change Metadata name of load time data structure to avoid confusion with Node data which is also called metadata)	2014-07-21 23:54:23 +02:00
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type, which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls; use toNormalform(false) instead.	2014-07-18 12:43:01 +02:00
Michael Peter Christen	b5fc2b63ea	removed exist() retrieval functions from error cache and replaced it with metadata retrieval from connectors directly. This should cause better usage of the cache. Automatically increase the metadata cache if more memory is available.	2014-07-11 19:52:25 +02:00
Michael Peter Christen	62c72360ee	cleanup of checkAcceptanceInitially in CrawlStacker, should avoid double-calling of solr	2014-07-11 18:36:04 +02:00
Michael Peter Christen	b5d78ba156	reduced number of solr queries during crawling	2014-07-11 18:05:11 +02:00
Michael Peter Christen	06ab72d1af	enhanced crawler host round-robin strategy	2014-07-11 16:01:42 +02:00
Michael Peter Christen	49886fab08	enhanced debugging	2014-06-26 12:57:01 +02:00
Michael Peter Christen	b893c42a0f	bugfix for image search	2014-06-26 12:56:33 +02:00
Michael Peter Christen	74c249288a	added a push api to make it possible to upload files directly without crawling to the YaCy indexer. Files are uploaded using POST multipart requests; multiple file uploads are possible as well. Each file has attached the file date and mime type which is used to get the right parser for the submitted data. Also an url is submitted which is assigned to the document. The CrawlSwitchboard has a new option for default Crawl Profiles which are assigned dynamically from the new push interface.	2014-06-12 18:10:07 +02:00
Michael Peter Christen	ba6ffddefc	refactoring	2014-06-12 05:23:26 +02:00
reger	92d1604a31	Crawler hostbalancer does not delete finished queue files, use alternative delete to fight the symptom (and fix deletion of host dirs on startup). Root cause (which class holds a lock on .stack) not found. http://mantis.tokeek.de/view.php?id=404	2014-06-05 02:13:08 +02:00
orbiter	d7d38f9135	made number of open files in crawler configurable and increased default maximum number of open files from 100 to 1000. This number can be changed with the attribute crawler.onDemandLimit	2014-05-31 09:29:55 +02:00
reger	ca5437dd50	fix crawl of file://, also http://mantis.tokeek.de/view.php?id=149. Local files can be crawled (intranet mode). URL parsing fixed according to RFC 1738 (for unix and windows): for win like file:///c:/tmp or file://localhost/c:/tmp, for linux like file:///tmp or file://localhost/tmp. Host is ignored and path must be absolute.	2014-05-28 03:01:34 +02:00
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	2014-05-20 21:50:16 +02:00
reger	1600414450	fix NPE on continuing crawls after YaCy restart (Agent is then null)	2014-05-02 19:32:09 +02:00
Michael Peter Christen	c1c1be8f02	fix for slow crawling and better logging in balancer	2014-04-29 19:50:33 +02:00
Michael Peter Christen	3acf416335	npe fix	2014-04-29 19:24:05 +02:00
orbiter	2f63bd0261	enhanced Host Balancer strategy: fair round robin	2014-04-23 23:11:37 +02:00
Michael Peter Christen	8b32dd5f9e	special strategy for balancer: do not remove targets with zero wait time from the queue	2014-04-18 06:50:07 +02:00
Michael Peter Christen	9c6228d948	fix for deadlocks in crawler	2014-04-17 16:58:17 +02:00
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	2014-04-17 13:21:43 +02:00
Michael Peter Christen	06afb568e2	new Strategies in Balancer: - doublecheck cache now records the crawl depth as well - doublecheck cache is available from the outside (made static) - no more need to crawl hosts with lowest depth first, instead all hosts which have only singleton entries are preferred to reduce the number of files.	2014-04-17 12:52:54 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create tens of thousands of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10,000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
Michael Peter Christen	075b6f9278	refactoring of the crawl balancer: the balancer is turned into an interface and the old balancer class is moved into LegacyBalancer to make room for a fresh implementation of a crawl balancer.	2014-04-14 13:32:35 +02:00
Michael Peter Christen	6bd8c6f195	fix for wrong status codes of error pages	2014-04-10 09:08:59 +02:00
Michael Peter Christen	9e503b3376	also delete the robots.txt file from the cache when a new crawl is started	2014-04-09 21:59:54 +02:00
Michael Peter Christen	1c21b3256d	fix for robots.txt handling: delete old entry before starting a new crawl.	2014-04-09 18:33:48 +02:00
Michael Peter Christen	926d28dd3f	fixed a bug which prevented crawl starts after a network switch	2014-04-04 14:43:35 +02:00
Michael Peter Christen	d4b5c457e4	NPE fix	2014-04-04 12:34:34 +02:00
Michael Peter Christen	8b44fcf0f4	added missing @Override annotation	2014-03-28 13:48:37 +01:00
Michael Peter Christen	85a427ec54	support for multiple sitemaps in robots.txt	2014-03-14 13:33:23 +01:00
Michael Peter Christen	b08375da33	fix for bad/missing values of size_i	2014-03-11 09:51:04 +01:00
reger	dd5bf0b71b	cleanup old reference to HTTPDemon.setAlternativeResolver; optimize .yacyh check in AbstractRemoteHandler	2014-03-06 03:08:04 +01:00
Michael Peter Christen	e485fbd0ce	- let crawl loader jobs die after 10 seconds without new jobs - corrected shutdown order to prevent a deadlock during shutdown	2014-03-04 00:33:13 +01:00
Michael Peter Christen	bcd9dd9e1d	enhanced concurrent loading by using a fixed set of concurrent loader processes in favor of throwaway-processes. The control mechanism now reports a 'queue full' message to the busy loop less often, so the loop no longer performs long busy waiting; instead all requests are queued and new loader processes are started if necessary up to a given limit (as set before)	2014-03-03 22:13:40 +01:00

207 Commits
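
The queue layout described in commit da86f150ab (one queue per host and crawl depth, with the host name joined to its port) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not YaCy's implementation: the class name `HostBalancerSketch`, its methods, and the default-port table are hypothetical, it is written in Python rather than YaCy's Java, it keeps queues in memory instead of files in the QUEUES folder, and it approximates "minimal crawl depth" by taking the lowest depth within each host while rotating over hosts.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class HostBalancerSketch:
    """Hypothetical toy model: one FIFO queue per (host:port, crawl depth)."""

    # Assumed default ports, used only to build a host key for this sketch.
    DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

    def __init__(self):
        # host key -> {depth -> deque of urls}; insertion order drives the rotation
        self._hosts = OrderedDict()

    @classmethod
    def host_key(cls, url):
        """Join host name and port so http, https and ftp services stay separate."""
        parts = urlsplit(url)
        port = parts.port or cls.DEFAULT_PORTS.get(parts.scheme, 0)
        return f"{parts.hostname}:{port}"

    def push(self, url, depth):
        """Append a url to the queue for its host and crawl depth."""
        queues = self._hosts.setdefault(self.host_key(url), {})
        queues.setdefault(depth, deque()).append(url)

    def pop(self):
        """Take the next url: first host in rotation, lowest depth of that host."""
        if not self._hosts:
            return None
        key, queues = next(iter(self._hosts.items()))
        depth = min(queues)
        url = queues[depth].popleft()
        if not queues[depth]:
            del queues[depth]            # drop an exhausted depth queue
        if queues:
            self._hosts.move_to_end(key)  # host goes to the back of the rotation
        else:
            del self._hosts[key]          # host has nothing left to crawl
        return url, depth
```

Pushing two urls for one host at depths 0 and 1 and one url for a second host at depth 0, then calling `pop()` repeatedly, alternates between the hosts before returning to the deeper url of the first host.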