yacy_search_server/source/net/yacy/crawler
luccioman fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
The new "Media Type detection" section in the advanced crawl start page
lets you choose between:
- not loading URLs with an unknown or unsupported file extension, without
checking the actual Media Type (relying on the Content-Type header for now).
This was the old default behavior: faster, but not really accurate.
- always cross-checking the URL file extension against the actual Media Type.
This allows proper parsing of URLs that end with an apparently odd file
extension but actually have a supported Media Type such as
text/html.
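
The two options above can be illustrated with a minimal Java sketch. This is not YaCy's actual code: the class `MediaTypeCheckSketch`, the `Strategy` enum, the method names, and the small sets of supported extensions and Media Types are all hypothetical and exist only to show the difference between the two behaviors. It also assumes that a URL with no file extension at all is decided by its Content-Type in both modes.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the two detection strategies; not YaCy's real classes. */
final class MediaTypeCheckSketch {

    /** Illustrative names for the two choices described in the commit message. */
    enum Strategy { TRUST_EXTENSION, CROSS_CHECK_MEDIA_TYPE }

    // Assumed, deliberately tiny sets; a real crawler would derive these from its parsers.
    static final Set<String> SUPPORTED_EXTENSIONS = Set.of("html", "htm", "txt", "pdf");
    static final Set<String> SUPPORTED_MEDIA_TYPES =
            Set.of("text/html", "text/plain", "application/pdf");

    /** Returns the lower-cased extension of the last path segment, or "" if there is none. */
    static String extensionOf(String urlPath) {
        String lastSegment = urlPath.substring(urlPath.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        return dot < 0 ? "" : lastSegment.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Strips parameters such as "; charset=UTF-8" from a Content-Type header value. */
    static String baseMediaType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon < 0 ? contentType : contentType.substring(0, semicolon);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Decides whether a resource should be parsed under the given strategy. */
    static boolean shouldParse(String urlPath, String contentType, Strategy strategy) {
        String extension = extensionOf(urlPath);
        boolean mediaTypeSupported = SUPPORTED_MEDIA_TYPES.contains(baseMediaType(contentType));
        if (extension.isEmpty()) {
            // Assumption of this sketch: no extension means only the header can decide.
            return mediaTypeSupported;
        }
        if (strategy == Strategy.TRUST_EXTENSION) {
            // Old default: an unknown or unsupported extension is rejected
            // without looking at the Content-Type header.
            return SUPPORTED_EXTENSIONS.contains(extension);
        }
        // Cross-check: the Content-Type header has the final word,
        // so an odd extension served as text/html is still parsed.
        return mediaTypeSupported;
    }
}
```

With these assumptions, a path such as `/download.xyz` served as `text/html; charset=UTF-8` is skipped under `TRUST_EXTENSION` but parsed under `CROSS_CHECK_MEDIA_TYPE`, which is the misleading-extension case the commit message describes.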

Sample URLs with misleading file extensions added as documentation in
the crawl start page.

fixes issue #244
2018-10-25 10:42:12 +02:00
data Added new crawler attribute for finer control over Media Type detection 2018-10-25 10:42:12 +02:00
retrieval Added new crawler attribute for finer control over Media Type detection 2018-10-25 10:42:12 +02:00
robots Small perf improvement : initialize threads names early when possible 2018-05-23 14:45:35 +02:00
Balancer.java Fixed display of crawler pending URLs counts in HostBrowser.html page. 2017-01-22 12:31:14 +01:00
CrawlStacker.java Added new crawler attribute for finer control over Media Type detection 2018-10-25 10:42:12 +02:00
CrawlStarterFromScraper.java Updated a license header typo. 2017-10-30 07:38:47 +01:00
CrawlSwitchboard.java Do not block whole server startup on persisted crawl profile load error 2018-06-19 12:48:17 +02:00
FileCrawlStarterTask.java removed transformer 2018-06-19 00:42:23 +02:00
HarvestProcess.java fix for wrong display of error urls in HostBrowser 2012-12-07 00:31:10 +01:00
HostBalancer.java Fixed NullPointerException case on malformed crawl queue folder name 2018-08-13 14:35:26 +02:00
HostQueue.java Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems 2018-08-11 10:02:26 +02:00
IllegalCrawlProfileException.java Crawl from local file : faster task end when manually terminating crawl. 2016-10-22 09:11:20 +02:00
LegacyBalancer.java use supplied url port to get robots.txt in crawlers hostqueue 2016-03-02 00:12:34 +01:00
RecrawlBusyThread.java Create recrawl requests with the relevant crawl profile. 2018-01-30 21:00:18 +01:00