yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

luccioman fcf6b16db4 Added new crawler attribute for finer control over Media Type detection. The new "Media Type detection" section in the advanced crawl start page allows choosing between: - not loading URLs with an unknown or unsupported file extension, without checking the actual Media Type (relying on the Content-Type header for now). This was the old default behavior: faster, but not really accurate. - always cross-checking the URL file extension against the actual Media Type. This allows proper parsing of URLs that end with an apparently odd file extension but actually have a supported Media Type such as text/html. Sample URLs with misleading file extensions were added as documentation in the crawl start page. Fixes issue #244		2018-10-25 10:42:12 +02:00
data	Added new crawler attribute for finer control over Media Type detection	2018-10-25 10:42:12 +02:00
retrieval	Added new crawler attribute for finer control over Media Type detection	2018-10-25 10:42:12 +02:00
robots	Small perf improvement : initialize threads names early when possible	2018-05-23 14:45:35 +02:00
Balancer.java	Fixed display of crawler pending URLs counts in HostBrowser.html page.	2017-01-22 12:31:14 +01:00
CrawlStacker.java	Added new crawler attribute for finer control over Media Type detection	2018-10-25 10:42:12 +02:00
CrawlStarterFromScraper.java	Updated a license header typo.	2017-10-30 07:38:47 +01:00
CrawlSwitchboard.java	Do not block whole server startup on persisted crawl profile load error	2018-06-19 12:48:17 +02:00
FileCrawlStarterTask.java	removed transformer	2018-06-19 00:42:23 +02:00
HarvestProcess.java	fix for wrong display of error urls in HostBrowser	2012-12-07 00:31:10 +01:00
HostBalancer.java	Fixed NullPointerException case on malformed crawl queue folder name	2018-08-13 14:35:26 +02:00
HostQueue.java	Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems	2018-08-11 10:02:26 +02:00
IllegalCrawlProfileException.java	Crawl from local file : faster task end when manually terminating crawl.	2016-10-22 09:11:20 +02:00
LegacyBalancer.java	use supplied url port to get robots.txt in crawlers hostqueue	2016-03-02 00:12:34 +01:00
RecrawlBusyThread.java	Create recrawl requests with the relevant crawl profile.	2018-01-30 21:00:18 +01:00