366 Commits

Author SHA1 Message Date
luccioman
7adbd1f87d Fixed raw IPv6 address snapshots read/write on FAT32 and NTFS fs
Fixes issue #225
2018-09-12 17:34:40 +02:00
luccioman
4ee14ff3c5 Fixed NullPointerException case on malformed crawl queue folder name 2018-08-13 14:35:26 +02:00
luccioman
21ad9435ec Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems
As reported by @vikulin in issue #187, crawling websites using a raw
IPv6 address as host name in their URL failed when running on Microsoft
Windows platforms (FAT32 or NTFS filesystems) when YaCy crawler created
the crawl queue folder, as the ':' character which is part of an IPv6
address is forbidden on these filesystems.
2018-08-11 10:02:26 +02:00
luccioman
dcad393fe5 Fixed exceeding max size of failreason_s Solr field on large link list
When using the 'From Link-List of URL' as a crawl start, with lists in
the order of one or more thousands of links, the failreason_s Solr field
maximum size (32kb) was exceeded by the string representation of the URL
must-match filter when a crawl URL was rejected because it did not match.
2018-07-11 08:13:29 +02:00
luccioman
c726154a59 Fixed removal of URLs from the delegatedURL remote crawl stack
URLs were removed from the stack using their hash as a bytes array,
whereas the hash is stored in the stack as String instance.
2018-07-05 09:36:36 +02:00
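The key type mismatch described in commit c726154a59 can be sketched as follows. This is a minimal illustration, assuming a plain `HashMap` keyed by the hash as a String; the class name, method names and sample hash are hypothetical and not taken from the YaCy code base.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class HashKeySketch {

    /** Tries to remove an entry using the raw bytes as key (the buggy pattern). */
    static boolean removeByBytes(Map<String, String> stack, byte[] hash) {
        // Map.remove(Object) compiles with any argument type, but a byte[]
        // is never equal to a String key, so nothing is removed.
        return stack.remove(hash) != null;
    }

    /** Converts the bytes to the String form used as key before removing. */
    static boolean removeByString(Map<String, String> stack, byte[] hash) {
        return stack.remove(new String(hash, StandardCharsets.US_ASCII)) != null;
    }

    public static void main(String[] args) {
        Map<String, String> stack = new HashMap<>();
        stack.put("hypotheticalHash", "http://example.org/"); // assumed sample entry
        byte[] hash = "hypotheticalHash".getBytes(StandardCharsets.US_ASCII);

        System.out.println(removeByBytes(stack, hash));  // false, entry still present
        System.out.println(removeByString(stack, hash)); // true, entry removed
    }
}
```

Because `Map.remove` accepts any `Object`, this kind of mismatch passes compilation silently, which is likely why it went unnoticed.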
luccioman
a15ac8e0ca Made CrawlProfile loading tolerant to malformed json string attribute 2018-06-19 12:53:17 +02:00
luccioman
a715bb7876 Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml 2018-06-19 12:50:28 +02:00
luccioman
0b302c5004 Do not block whole server startup on persisted crawl profile load error 2018-06-19 12:48:17 +02:00
luccioman
4d9aa4ed1e Fixed default crawl profile solr mustnotmatch query from previous commit 2018-06-19 11:58:47 +02:00
luccioman
cced94298a Added a new crawler document filter type using Solr syntax
This makes it possible to set up much more advanced document crawl filters,
by filtering on one or more document indexed fields before inserting in
the index.
2018-06-19 10:12:20 +02:00
Michael Christen
e0dc632020 removed transformer
it was not used any more
2018-06-19 00:42:23 +02:00
luccioman
fa4399d5d2 Small perf improvement : initialize threads names early when possible
Initializing Thread names using the Thread constructor parameter is
faster, as the constructor already sets a thread name even if no
customized one is given, while an additional call to the
Thread.setName() function internally performs synchronized access,
possibly runs an access check on the security manager and performs a
native call.

Profiling a running YaCy server revealed that the total processing time
spent on Thread.setName() for a typical p2p search was in the range of
seconds.
2018-05-23 14:45:35 +02:00
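The two naming styles compared in commit fa4399d5d2 can be sketched as below. The class, method and thread names are hypothetical, and the sketch only shows the two call patterns; it does not measure the cost difference reported by the commit.

```java
public class ThreadNameSketch {

    /** Names the thread directly through the constructor. */
    static Thread namedAtConstruction(Runnable task, String name) {
        return new Thread(task, name);
    }

    /** Creates the thread with a generated default name, then renames it. */
    static Thread namedAfterwards(Runnable task, String name) {
        Thread t = new Thread(task); // already receives a default "Thread-N" name
        t.setName(name);             // the extra call the commit avoids
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        // "hypothetical-search-worker" is an assumed name for illustration only
        Thread t = namedAtConstruction(() -> { }, "hypothetical-search-worker");
        t.start();
        t.join();
        System.out.println(t.getName());
    }
}
```

Both variants end with the same thread name; the first simply skips the second assignment.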
luccioman
fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME) 2018-03-23 10:28:19 +01:00
luccioman
e45afedee4 Added support for enclosures (media links) to the RSS loader 2018-03-21 08:22:29 +01:00
luccioman
aaefd5219c Reduce log verbosity of RSS loader on feed items with no link 2018-03-20 10:09:17 +01:00
luccioman
17c7a85f18 Make StreamResponse usable in Java try-with-resources statements 2018-02-21 08:38:35 +01:00
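What "usable in try-with-resources" means in commit 17c7a85f18 can be sketched as follows: the type has to implement `AutoCloseable`. The `StreamHolder` class below is a hypothetical stand-in and does not reproduce the real StreamResponse API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class TryWithResourcesSketch {

    /** Hypothetical wrapper around a stream, NOT YaCy's actual StreamResponse. */
    static class StreamHolder implements AutoCloseable {
        private final InputStream stream;
        boolean closed = false;

        StreamHolder(InputStream stream) {
            this.stream = stream;
        }

        InputStream getStream() {
            return stream;
        }

        @Override
        public void close() throws IOException {
            stream.close();
            closed = true;
        }
    }

    /** Reads one byte; the holder is closed even if read() throws. */
    static int readFirstByte(StreamHolder holder) {
        try (StreamHolder h = holder) {
            return h.getStream().read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StreamHolder holder = new StreamHolder(new ByteArrayInputStream(new byte[] { 42 }));
        System.out.println(readFirstByte(holder)); // 42
        System.out.println(holder.closed);         // true
    }
}
```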
luccioman
80fb1026d0 Create recrawl requests with the relevant crawl profile.
Recrawl default profile was previously effectively used for crawl
stacker acceptance check, but request entries were indeed still created
with the "snippetGlobalText" profile.
2018-01-30 21:00:18 +01:00
luccioman
46b5249c20 Removed time condition on HostBalancer initialization in JUnit test.
Its initialization in main application usage remains asynchronous.
2018-01-26 17:15:27 +01:00
luccioman
8b572b7337 Commit Solr index before simulating or starting recrawl job.
This ensures up-to-date simulation query results, and recrawl
processing.
2018-01-26 10:31:13 +01:00
luccioman
7baa99f26f Fixed stored URL in web cache when redirection(s) occurs.
Associate cached content to the last redirection location, instead of
the first URL of a redirection(s) chain :
 - for proper base URL processing in parsers (fixes mantis 636 -
http://mantis.tokeek.de/view.php?id=636)
 - to prevent duplicated content in Solr index when recrawling a
redirected URL
2018-01-20 18:56:40 +01:00
luccioman
9ddf92d143 Removed unnecessary reflection usage for workflow tasks.
This improves code readability and maintainability (call hierarchies are
easier to read) and possibly performance.
2018-01-15 10:05:49 +01:00
luccioman
897d3d30cc Added new recrawl job profile to the list of default crawl profiles 2018-01-15 08:30:37 +01:00
luccioman
b712a0671e Added a specific default crawl profile for the recrawl job.
- with only a light constraint on the load date of known indexed
documents, as it can already be controlled by the selection query, and
the goal of the job is indeed to recrawl selected documents now
- using the iffresh cache strategy
2018-01-13 15:46:04 +01:00
luccioman
adf3fa493d Added comments about crawl profiles recrawl cycles 2018-01-13 12:13:04 +01:00
luccioman
3638e16c2e More comprehensive log on rejected recrawls caused by date constraint 2018-01-13 12:07:56 +01:00
luccioman
d47afe6fab Use a constant for crawler reject reason prefix with specific processing 2018-01-13 10:45:00 +01:00
luccioman
4e03335625 Added more details to the recrawl job report 2018-01-12 11:47:13 +01:00
luccioman
433e241e4f Added a report info box about the last terminated recrawl job, if any
For easier monitoring of recrawls.
2018-01-09 22:33:15 +01:00
luccioman
b2af25b14f Added a stop condition to the Recrawl busy thread 2018-01-09 10:22:26 +01:00
luccioman
421728d25a Made it possible to customize the selection query before launching a recrawl 2018-01-08 21:20:46 +01:00
luccioman
09c4ee56a7 Added optional https support for remote crawl and profile operations 2017-12-21 18:41:32 +01:00
luccioman
5db1c9155a Do locale independent case conversion on hosts, schemes, and file exts.
Required for proper operation when the default system locale is Turkish,
as dotless and dotted i characters have specific case conversion rules
in this language.
2017-12-19 13:52:05 +01:00
Michael Peter Christen
25573bd5ab added a crawl filter based on <div> tag class names
When a crawl is started, a new field to exclude content from scraping is
available. The field can be identified with the class name of div tags.
All text contained in such a div tag where the configured class name(s)
match are not indexed, while the remaining page is indexed.
2017-12-09 22:29:35 +01:00
luccioman
46f37e38dc Customized Threads with generic name for easier monitoring. 2017-10-31 08:53:17 +01:00
luccioman
046be566e1 Updated a license header typo. 2017-10-30 07:38:47 +01:00
Apply55gx
3c905a2a5c fix typo 2017-10-27 14:00:30 +02:00
luccioman
6cec2cdcb5 Use unredirected robots.txt URL when adding an entry to the table. 2017-08-16 14:21:07 +02:00
luccioman
3f0446f14b Ensure proper synchronous robots entry retrieval on first check.
Previously, when checking the robots.txt policy for the first time on an
unknown host (not cached in the robots table), the result was always
empty in the /getpageinfo_p.xml api and in the /CrawlCheck_p.html page.
Subsequent calls, however, returned the correct information.
2017-08-16 09:30:33 +02:00
luccioman
5a646540cc Support parsing gzip files from servers with redundant headers.
Some web servers provide both 'Content-Encoding : "gzip"' and
'Content-Type : "application/x-gzip"' HTTP headers on their ".gz" files.
Failing on such resources was annoying, as they are not so uncommon
even though non-conforming (see RFC 7231 section 3.1.2.2 for
"Content-Encoding" header specification
https://tools.ietf.org/html/rfc7231#section-3.1.2.2)
2017-07-16 14:46:46 +02:00
luccioman
11a7f923d4 Distinguish response parsing failures from unexpected exceptions. 2017-07-16 14:39:53 +02:00
luccioman
452a17a8d5 Finer control on bounded input streams with custom stream implementation 2017-07-12 00:13:24 +02:00
luccioman
1e84956721 Support loading local files with a per request specified maximum size.
Consistently with the HTTP loader implementation.
2017-07-11 09:04:23 +02:00
luccioman
bf55f1d6e5 Started support of partial parsing on large streamed resources.
This enables the getpageinfo_p API to return something in a reasonable
amount of time on resources in the megabytes size range and above.
Support added first with the generic XML parser, for other formats
regular crawler limits apply as usual.
2017-07-08 09:04:03 +02:00
luccioman
433bdb7c0d Respect maxFileSize limit also when streaming HTTP and when relevant.
Constraint applied consistently with HTTP content full load in byte
array.
2017-06-30 00:30:54 +02:00
luccioman
8da3174867 Ensure lower case conversion consistency with any default locale.
Especially for Turkish-speaking users using "tr" as their system default
locale : strings for technical stuff (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' doesn't become
'i' like in other locales such as "en", but becomes 'ı'.
2017-06-27 06:42:33 +02:00
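The Turkish case conversion issue behind commits 8da3174867 and 5db1c9155a can be reproduced with the standard library alone. In this sketch the helper name `lowerForTechnicalUse` is hypothetical, and `Locale.ROOT` is one common choice for locale-independent conversion; the commits do not state which locale constant YaCy uses.

```java
import java.util.Locale;

public class LocaleCaseSketch {

    /** Lower-cases technical strings (URLs, tag names...) independently of the default locale. */
    static String lowerForTechnicalUse(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // With Turkish rules, 'I' becomes the dotless 'ı' (U+0131)
        System.out.println("TITLE".toLowerCase(turkish));   // tıtle

        // With a fixed locale the result is the same on every system
        System.out.println(lowerForTechnicalUse("TITLE"));  // title
    }
}
```

The same reasoning applies to `toUpperCase`, where Turkish rules map 'i' to the dotted capital 'İ'.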
luccioman
9dd790087d Added HT Cache basic statistics (hit rate) 2017-06-15 09:50:02 +02:00
luccioman
28b451a0b3 Made Cache compression level and lock timeout user configurable 2017-06-14 19:02:08 +02:00
luccioman
a7394b479b Limit the synchronization blocking time on some Cache operations.
Using a Reentrant lock instead of the intrinsic synchronization lock
permits limiting the blocking time to acquire a lock.

Useful on a very busy Cache concurrently accessed by many threads : when
the time to acquire a lock is too high, getting/storing content on the
cache becomes inefficient, and it is then better to fall back to loading
remote resources.

Illustrated by the CacheTest stress test and some traces reported in
mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )
2017-06-14 09:13:50 +02:00
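The bounded lock acquisition described in commit a7394b479b can be sketched with `ReentrantLock.tryLock(long, TimeUnit)`. The class below is a hypothetical single-entry cache for illustration only; it does not reflect the structure of YaCy's Cache class, and the fallback value stands in for "load the remote resource instead".

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLockSketch {

    private final ReentrantLock lock = new ReentrantLock();
    private String content; // hypothetical single cached value

    /** Stores content, giving up when the lock is not acquired in time. */
    boolean store(String value, long timeoutMs) {
        boolean acquired = false;
        try {
            acquired = lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (acquired) {
                content = value;
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        } finally {
            if (acquired) {
                lock.unlock();
            }
        }
    }

    /** Returns the cached content, or the fallback when the lock wait times out. */
    String get(long timeoutMs, String fallback) {
        boolean acquired = false;
        try {
            acquired = lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            return acquired ? content : fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            if (acquired) {
                lock.unlock();
            }
        }
    }

    public static void main(String[] args) {
        TimedLockSketch cache = new TimedLockSketch();
        cache.store("cached page", 100);
        System.out.println(cache.get(100, "load remotely instead"));
    }
}
```

An intrinsic `synchronized` block offers no such timeout: a waiting thread blocks until the monitor is released, which is the situation the commit aims to avoid on a busy cache.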
luccioman
8399275142 Properly close file output streams even in exception scenarios. 2017-06-08 07:19:16 +02:00
luccioman
d98c04853d Ensure proper closing of file input streams. 2017-06-02 12:14:29 +02:00