yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
luccioman	7adbd1f87d	Fixed raw IPv6 addresses snapshots read/write on FAT32 and NTFS fs. Fixes issue #225	2018-09-12 17:34:40 +02:00
luccioman	4ee14ff3c5	Fixed NullPointerException case on malformed crawl queue folder name	2018-08-13 14:35:26 +02:00
luccioman	21ad9435ec	Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems. As reported by @vikulin in issue #187, crawling websites using a raw IPv6 address as host name in their URL failed when running on Microsoft Windows platforms (FAT32 or NTFS filesystems) when the YaCy crawler created the crawl queue folder, as the ':' character, which is part of an IPv6 address, is forbidden on these filesystems.	2018-08-11 10:02:26 +02:00
luccioman	dcad393fe5	Fixed exceeding max size of failreason_s Solr field on large link list. When using the 'From Link-List of URL' as a crawl start, with lists in the order of one or more thousands of links, the failreason_s Solr field maximum size (32kb) was exceeded by the string representation of the URL must-match filter when a crawl URL was rejected because it did not match.	2018-07-11 08:13:29 +02:00
luccioman	c726154a59	Fixed removal of URLs from the delegatedURL remote crawl stack. URLs were removed from the stack using their hash as a bytes array, whereas the hash is stored in the stack as a String instance.	2018-07-05 09:36:36 +02:00
luccioman	a15ac8e0ca	Made CrawlProfile loading tolerant to malformed json string attribute	2018-06-19 12:53:17 +02:00
luccioman	a715bb7876	Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml	2018-06-19 12:50:28 +02:00
luccioman	0b302c5004	Do not block whole server startup on persisted crawl profile load error	2018-06-19 12:48:17 +02:00
luccioman	4d9aa4ed1e	Fixed default crawl profile solr mustnotmatch query from previous commit	2018-06-19 11:58:47 +02:00
luccioman	cced94298a	Added a new crawler document filter type using Solr syntax. This makes it possible to set up much more advanced document crawl filters, by filtering on one or more document indexed fields before inserting in the index.	2018-06-19 10:12:20 +02:00
Michael Christen	e0dc632020	removed transformer: it was not used any more	2018-06-19 00:42:23 +02:00
luccioman	fa4399d5d2	Small perf improvement : initialize thread names early when possible. Initializing Thread names using the Thread constructor parameter is faster, as the constructor already sets a thread name even if no customized one is given, while an additional call to the Thread.setName() function internally does synchronized access, possibly runs an access check on the security manager, and performs a native call. Profiling a running YaCy server revealed that the total processing time spent on Thread.setName() for a typical p2p search was in the range of seconds.	2018-05-23 14:45:35 +02:00
luccioman	fb3032c530	Added a crawl filtering possibility on documents Media Type (MIME)	2018-03-23 10:28:19 +01:00
luccioman	e45afedee4	Added support for enclosures (media links) to the RSS loader	2018-03-21 08:22:29 +01:00
luccioman	aaefd5219c	Reduce log verbosity of RSS loader on feed items with no link	2018-03-20 10:09:17 +01:00
luccioman	17c7a85f18	Make StreamResponse usable in Java try-with-resources statements	2018-02-21 08:38:35 +01:00
luccioman	80fb1026d0	Create recrawl requests with the relevant crawl profile. Recrawl default profile was previously effectively used for crawl stacker acceptance check, but request entries were indeed still created with the "snippetGlobalText" profile.	2018-01-30 21:00:18 +01:00
luccioman	46b5249c20	Removed time condition on HostBalancer initialization in JUnit test. Its initialization in main application usage remains asynchronous.	2018-01-26 17:15:27 +01:00
luccioman	8b572b7337	Commit Solr index before simulating or starting recrawl job. This ensures up-to-date simulation query results, and recrawl processing.	2018-01-26 10:31:13 +01:00
luccioman	7baa99f26f	Fixed stored URL in web cache when redirection(s) occur. Associate cached content to the last redirection location, instead of the first URL of a redirection(s) chain : - for proper base URL processing in parsers (fixes mantis 636 - http://mantis.tokeek.de/view.php?id=636) - to prevent duplicated content in Solr index when recrawling a redirected URL	2018-01-20 18:56:40 +01:00
luccioman	9ddf92d143	Removed unnecessary reflection usage for workflow tasks. This improves code readability and maintainability (call hierarchies are easier to read) and possibly performance.	2018-01-15 10:05:49 +01:00
luccioman	897d3d30cc	Added new recrawl job profile to the list of default crawl profiles	2018-01-15 08:30:37 +01:00
luccioman	b712a0671e	Added a specific default crawl profile for the recrawl job. - with only light constraint on known indexed documents load date, as it can already be controlled by the selection query, and the goal of the job is indeed to recrawl selected documents now - using the iffresh cache strategy	2018-01-13 15:46:04 +01:00
luccioman	adf3fa493d	Added comments about crawl profiles recrawl cycles	2018-01-13 12:13:04 +01:00
luccioman	3638e16c2e	More comprehensive log on rejected recrawls caused by date constraint	2018-01-13 12:07:56 +01:00
luccioman	d47afe6fab	Use a constant for crawler reject reason prefix with specific processing	2018-01-13 10:45:00 +01:00
luccioman	4e03335625	Added more details to the recrawl job report	2018-01-12 11:47:13 +01:00
luccioman	433e241e4f	Added a report info box about the last terminated recrawl job, if any. For easier monitoring of recrawls.	2018-01-09 22:33:15 +01:00
luccioman	b2af25b14f	Added a stop condition to the Recrawl busy thread	2018-01-09 10:22:26 +01:00
luccioman	421728d25a	Made possible to customize selection query before launching a recrawl	2018-01-08 21:20:46 +01:00
luccioman	09c4ee56a7	Added optional https support for remote crawl and profile operations	2017-12-21 18:41:32 +01:00
luccioman	5db1c9155a	Do locale independent case conversion on hosts, schemes, and file exts. Required for proper operation when the default system locale is Turkish, as dotless and dotted i characters have specific case conversion rules in this language.	2017-12-19 13:52:05 +01:00
Michael Peter Christen	25573bd5ab	added a crawl filter based on <div> tag class names. When a crawl is started, a new field to exclude content from scraping is available. The field can be identified with the class name of div tags. All text contained in a div tag whose class matches the configured class name(s) is not indexed, while the remaining page is indexed.	2017-12-09 22:29:35 +01:00
luccioman	46f37e38dc	Customized Threads with generic name for easier monitoring.	2017-10-31 08:53:17 +01:00
luccioman	046be566e1	Updated a license header typo.	2017-10-30 07:38:47 +01:00
Apply55gx	3c905a2a5c	fix typo	2017-10-27 14:00:30 +02:00
luccioman	6cec2cdcb5	Use unredirected robots.txt URL when adding an entry to the table.	2017-08-16 14:21:07 +02:00
luccioman	3f0446f14b	Ensure proper synchronous robots entry retrieval on first check. Previously, when checking for the first time the robots.txt policy on an unknown host (not cached in the robots table), the result was always empty in the /getpageinfo_p.xml api and in the /CrawlCheck_p.html page. Subsequent calls, however, returned the correct information.	2017-08-16 09:30:33 +02:00
luccioman	5a646540cc	Support parsing gzip files from servers with redundant headers. Some web servers provide both 'Content-Encoding : "gzip"' and 'Content-Type : "application/x-gzip"' HTTP headers on their ".gz" files. It was annoying to fail on such resources, which are not so uncommon, although non-conforming (see RFC 7231 section 3.1.2.2 for "Content-Encoding" header specification https://tools.ietf.org/html/rfc7231#section-3.1.2.2)	2017-07-16 14:46:46 +02:00
luccioman	11a7f923d4	Distinguish response parsing failures from unexpected exceptions.	2017-07-16 14:39:53 +02:00
luccioman	452a17a8d5	Finer control on bounded input streams with custom stream implementation	2017-07-12 00:13:24 +02:00
luccioman	1e84956721	Support loading local files with a per request specified maximum size. Consistently with the HTTP loader implementation.	2017-07-11 09:04:23 +02:00
luccioman	bf55f1d6e5	Started support of partial parsing on large streamed resources. This enables the getpageinfo_p API to return something in a reasonable amount of time on resources in the megabytes size range. Support added first with the generic XML parser; for other formats, regular crawler limits apply as usual.	2017-07-08 09:04:03 +02:00
luccioman	433bdb7c0d	Respect maxFileSize limit also when streaming HTTP and when relevant. Constraint applied consistently with HTTP content full load in byte array.	2017-06-30 00:30:54 +02:00
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't become 'i' like in other locales such as "en", but becomes 'ı'.	2017-06-27 06:42:33 +02:00
luccioman	9dd790087d	Added HT Cache basic statistics (hit rate)	2017-06-15 09:50:02 +02:00
luccioman	28b451a0b3	Made Cache compression level and lock timeout user configurable	2017-06-14 19:02:08 +02:00
luccioman	a7394b479b	Limit the synchronization blocking time on some Cache operations. Using a Reentrant lock instead of the intrinsic synchronization lock permits limiting the blocking time to acquire a lock. Useful on a very busy Cache concurrently accessed by many threads : when the time to acquire a lock is too high, getting/storing content on the cache becomes inefficient, and it is then better to fall back to loading remote resources. Illustrated by the CacheTest stress test and some traces reported in mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )	2017-06-14 09:13:50 +02:00
luccioman	8399275142	Properly close file output streams even in exception scenarios.	2017-06-08 07:19:16 +02:00
luccioman	d98c04853d	Ensure proper closing of file input streams.	2017-06-02 12:14:29 +02:00
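
Commits 5db1c9155a and 8da3174867 above describe lower-casing technical strings independently of the default locale. The following is a minimal sketch of that idea, not the code from those commits: the class name LocaleSafeCase and the helper lowerTechnical are hypothetical, and Locale.ROOT is assumed here as the locale-neutral choice.

```java
import java.util.Locale;

public class LocaleSafeCase {

    /**
     * Lower-cases a technical string (host, scheme, file extension...)
     * with the locale-neutral Locale.ROOT, so the result does not depend
     * on the JVM default locale. Hypothetical helper, for illustration only.
     */
    static String lowerTechnical(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String ext = "FILE.HTML";

        // With Turkish rules, 'I' is lower-cased to the dotless 'ı' (U+0131).
        System.out.println(ext.toLowerCase(turkish));

        // With Locale.ROOT, 'I' is lower-cased to the plain ASCII 'i'.
        System.out.println(lowerTechnical(ext));
    }
}
```

Calling toLowerCase() without an argument uses the default locale, which is why the problem only shows up on systems configured with "tr".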


366 Commits
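
Commit a7394b479b above describes replacing the intrinsic synchronization lock with a reentrant lock so that the time spent waiting for the lock can be bounded. The following is a minimal sketch of that technique using java.util.concurrent.locks.ReentrantLock.tryLock with a timeout. It is not the YaCy Cache implementation: the class TimedLockCache, its in-memory map, and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLockCache {

    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, String> entries = new HashMap<>();
    private final long lockTimeoutMillis;

    public TimedLockCache(long lockTimeoutMillis) {
        this.lockTimeoutMillis = lockTimeoutMillis;
    }

    /** Tries to acquire the lock, waiting at most lockTimeoutMillis. */
    private boolean acquire() {
        try {
            return lock.tryLock(lockTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    /** Stores content; returns false when the lock was not acquired in time. */
    public boolean store(String url, String content) {
        if (!acquire()) {
            return false; // caller may simply skip caching
        }
        try {
            entries.put(url, content);
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Returns cached content, or null on a miss or a lock timeout. */
    public String load(String url) {
        if (!acquire()) {
            return null; // caller may fall back to loading the remote resource
        }
        try {
            return entries.get(url);
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        TimedLockCache cache = new TimedLockCache(100);
        cache.store("http://example.org/", "cached body");
        System.out.println(cache.load("http://example.org/"));
    }
}
```

In this sketch a lock timeout is reported the same way as a cache miss, which matches the fallback the commit message mentions: when the cache is too busy, load the remote resource instead of waiting.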