Commit Graph

172 Commits

Author SHA1 Message Date
Michael Peter Christen
e0ad8ca9da replaced json library from JSON.org with libandroid-json-java
This fixes https://github.com/yacy/yacy_search_server/issues/347
2020-04-24 11:45:25 +02:00
luccioman
6b45cd5799 New optional crawl filter on the URL a doc must match to crawl its links
Provides finer control over which parsed documents may add their links to
the crawl stack, complementing the existing crawl depth parameter.
2019-05-01 08:54:19 +02:00
luccioman
08ea0b0397 Added a configurable timeout to wkhtmltopdf calls for pdf snapshots
Necessary to prevent blocking the indexing workflow when some
wkhtmltopdf renderings fail without terminating.
2018-12-11 22:31:31 +01:00
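A minimal sketch of the timeout pattern this commit describes, assuming a configurable timeout in seconds; the class and method names are illustrative, not YaCy's actual code:

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: bound an external wkhtmltopdf call with a timeout.
public class PdfSnapshotSketch {
    public static boolean render(String url, String pdfPath, long timeoutSeconds)
            throws Exception {
        Process p = new ProcessBuilder("wkhtmltopdf", url, pdfPath).start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly(); // rendering hung: kill it so indexing can continue
            return false;
        }
        return p.exitValue() == 0;
    }
}
```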
luccioman
fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
New "Media Type detection" section in the advanced crawl start page
allow to choose between :
- not loading URLs with unknown or unsupported file extension without
checking the actual Media Type (relying Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross check URL file extension against the actual Media Type.
This lets properly parse URLs ending with an apparently odd file
extension, but which have actually a supported Media Type such as
text/html.

Sample URLs with misleading file extensions added as documentation in
the crawl start page.

fixes issue #244
2018-10-25 10:42:12 +02:00
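As a rough illustration of the cross-check, the actual Media Type can be read from the Content-Type response header before committing to a parse. This is a hypothetical helper, not YaCy's implementation:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: ask the server for headers only and compare the
// reported Media Type against what the file extension suggested.
public class MediaTypeCheck {
    static boolean isActuallyHtml(String address) throws Exception {
        HttpURLConnection con =
                (HttpURLConnection) new URL(address).openConnection();
        con.setRequestMethod("HEAD"); // headers only, no body download
        String contentType = con.getContentType(); // e.g. "text/html; charset=UTF-8"
        con.disconnect();
        return contentType != null && contentType.startsWith("text/html");
    }
}
```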
luccioman
7adbd1f87d Fixed raw IPV6 addresses snapshots read/write on FAT32 and NTFS fs
Fixes issue #225
2018-09-12 17:34:40 +02:00
luccioman
4ee14ff3c5 Fixed NullPointerException case on malformed crawl queue folder name 2018-08-13 14:35:26 +02:00
luccioman
dcad393fe5 Fixed exceeding max size of failreason_s Solr field on large link list
When using the 'From Link-List of URL' option as a crawl start, with lists
on the order of a thousand links or more, the maximum size (32 KB) of the
failreason_s Solr field was exceeded by the string representation of the
URL must-match filter whenever a crawl URL was rejected for not matching.
2018-07-11 08:13:29 +02:00
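A hedged sketch of the kind of guard this fix implies: cap the string before it is written to failreason_s. The limit value comes from the commit message; the exact handling in YaCy may differ:

```java
// Illustrative guard: truncate an overlong fail reason before storing it
// in the failreason_s Solr field (character count used here as an
// approximation of the byte limit).
static String capFailReason(String reason) {
    final int MAX_LENGTH = 32 * 1024;
    return reason.length() <= MAX_LENGTH ? reason
            : reason.substring(0, MAX_LENGTH - 3) + "...";
}
```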
luccioman
c726154a59 Fixed removal of URLs from the delegatedURL remote crawl stack
URLs were removed from the stack using their hash as a byte array,
whereas the hash is stored in the stack as a String instance.
2018-07-05 09:36:36 +02:00
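A toy demonstration of the bug class fixed here (not YaCy's actual stack code): a byte[] key can never match a String entry, because arrays use identity equals()/hashCode() and are a different type altogether.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyTypeDemo {
    public static void main(String[] args) {
        Map<String, String> stack = new HashMap<>();
        stack.put("AAAAAAAAAAAA", "entry");                 // hash stored as String
        byte[] hash = "AAAAAAAAAAAA".getBytes();
        System.out.println(stack.remove(hash));             // null: no-op, wrong key type
        System.out.println(stack.remove(new String(hash))); // "entry": removed
    }
}
```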
luccioman
a15ac8e0ca Made CrawlProfile loading tolerant of malformed JSON string attributes 2018-06-19 12:53:17 +02:00
luccioman
a715bb7876 Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml 2018-06-19 12:50:28 +02:00
luccioman
4d9aa4ed1e Fixed default crawl profile solr mustnotmatch query from previous commit 2018-06-19 11:58:47 +02:00
luccioman
cced94298a Added a new crawler document filter type using Solr syntax
This makes it possible to set up much more advanced document crawl
filters, by filtering on one or more indexed document fields before
insertion into the index.
2018-06-19 10:12:20 +02:00
luccioman
fa4399d5d2 Small perf improvement : initialize threads names early when possible
Initializing thread names via the Thread constructor parameter is faster,
as it already sets a thread name even when no customized one is given,
while an additional call to Thread.setName() internally performs
synchronized access, possibly runs an access check against the security
manager, and performs a native call.

Profiling a running YaCy server revealed that the total processing time
spent on Thread.setName() for a typical p2p search was in the range of
seconds.
2018-05-23 14:45:35 +02:00
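The difference in a nutshell; the thread names below are hypothetical, and the cost description follows the commit message:

```java
public class ThreadNamingDemo {
    public static void main(String[] args) {
        Runnable task = () -> { /* crawl or search work */ };
        // Cheaper: the name is assigned once inside the constructor.
        Thread a = new Thread(task, "crawler-worker-1");
        // Costlier: per the commit, setName() performs synchronized access,
        // may consult the security manager, and ends in a native call.
        Thread b = new Thread(task);
        b.setName("crawler-worker-2");
        a.start();
        b.start();
    }
}
```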
luccioman
fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME) 2018-03-23 10:28:19 +01:00
luccioman
09c4ee56a7 Added optional https support for remote crawl and profile operations 2017-12-21 18:41:32 +01:00
luccioman
5db1c9155a Do locale-independent case conversion on hosts, schemes, and file exts.
Required for proper operation when the default system locale is Turkish,
as the dotless and dotted i characters have specific case conversion
rules in this language.
2017-12-19 13:52:05 +01:00
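A minimal demonstration of the pitfall; Locale.ROOT is the standard remedy, though whether YaCy uses exactly this call is not stated in the commit:

```java
import java.util.Locale;

public class CaseDemo {
    public static void main(String[] args) {
        String host = "WIKI.EXAMPLE.COM";
        // Locale-sensitive: under tr-TR, 'I' lowercases to dotless 'ı'.
        System.out.println(host.toLowerCase(new Locale("tr", "TR")));
        // Locale-independent, as required for hosts, schemes and extensions:
        System.out.println(host.toLowerCase(Locale.ROOT)); // wiki.example.com
    }
}
```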
Michael Peter Christen
25573bd5ab added a crawl filter based on <div> tag class names
When a crawl is started, a new field to exclude content from scraping is
available. The field can be filled with the class names of div tags: all
text contained in a div tag whose class name matches the configured
name(s) is not indexed, while the remaining page is indexed.
2017-12-09 22:29:35 +01:00
luccioman
9dd790087d Added HT Cache basic statistics (hit rate) 2017-06-15 09:50:02 +02:00
luccioman
28b451a0b3 Made Cache compression level and lock timeout user configurable 2017-06-14 19:02:08 +02:00
luccioman
a7394b479b Limit the synchronization blocking time on some Cache operations.
Using a ReentrantLock instead of the intrinsic synchronization lock
permits limiting the time spent blocking to acquire a lock.

Useful on a very busy Cache concurrently accessed by many threads: when
the time to acquire a lock is too high, getting/storing content on the
cache becomes inefficient, and it is then better to fall back to loading
remote resources.

Illustrated by the CacheTest stress test and some traces reported in
mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )
2017-06-14 09:13:50 +02:00
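A sketch of the bounded-blocking pattern described above, assuming a configurable lock timeout in milliseconds; the class and method names are illustrative, not YaCy's Cache code:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class BoundedCacheAccess {
    private final ReentrantLock lock = new ReentrantLock();

    byte[] getContent(String key, long timeoutMs) throws InterruptedException {
        if (!lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return null; // cache too busy: caller falls back to remote loading
        }
        try {
            return load(key); // guarded cache read
        } finally {
            lock.unlock();
        }
    }

    private byte[] load(String key) { return new byte[0]; /* stub */ }
}
```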
luccioman
8399275142 Properly close file output streams even on exceptions scenarios. 2017-06-08 07:19:16 +02:00
luccioman
d98c04853d Ensure proper closing of file input streams. 2017-06-02 12:14:29 +02:00
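The pattern behind these two commits, as an illustrative snippet: try-with-resources closes streams on success and on exception alike.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCloseDemo {
    static void copy(String from, String to) throws IOException {
        try (InputStream in = new FileInputStream(from);
             OutputStream out = new FileOutputStream(to)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) >= 0) {
                out.write(buffer, 0, n); // both streams closed even if this throws
            }
        }
    }
}
```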
luccioman
39e081ef38 Fixed display of crawler pending URLs counts in HostBrowser.html page.
As described in mantis 722 (http://mantis.tokeek.de/view.php?id=722)

Also updated some Javadoc.
2017-01-22 12:31:14 +01:00
reger
87f6631a2a adjust Cache getHeader to prev. changes/commit 2016-12-18 01:02:56 +01:00
luccioman
f0639d810c Customized names for threads still using the default "Thread-n" pattern.
This makes thread monitoring easier to read.
2016-10-22 17:17:21 +02:00
luccioman
dcdea2d02f Fixed shutdown for crawler.MaxActiveThreads value greater than 200
Shutdown was hanging in CrawlQueues.close() at
this.workerQueue.put(POISON_REQUEST) when config value
crawler.MaxActiveThreads was greater than 200.

Revealed by "Collision" thread dumps in mantis 689
(http://mantis.tokeek.de/view.php?id=689#c1312)

Fixed consistency between this.worker.length and the this.workerQueue
capacity, and made the process more reliable using the non-blocking
offer() function.
2016-09-29 10:33:11 +02:00
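A toy model of the fix (capacities and names illustrative): when the number of workers exceeds the queue capacity, a blocking put() of the poison pill hangs forever, while offer() returns false instead of blocking.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PoisonShutdownDemo {
    public static void main(String[] args) {
        final String POISON_REQUEST = "POISON";
        final int workers = 250;
        // Capacity kept consistent with the worker count, as in the fix.
        BlockingQueue<String> workerQueue = new ArrayBlockingQueue<>(workers);
        for (int i = 0; i < workers; i++) {
            // offer() cannot hang: it returns false when the queue is full.
            if (!workerQueue.offer(POISON_REQUEST)) {
                System.out.println("queue full after " + i + " pills");
                break;
            }
        }
        System.out.println("enqueued " + workerQueue.size() + " poison pills");
    }
}
```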
luccioman
3ee4f56c39 Improved ErrorCache behavior when switching networks
Even after a network switch, ErrorCache was still holding a reference to
the previous Solr cores, thus becoming useless until the next YaCy
restart.

Initial error cache filling with recent errors from the index was also
missing after the switch.
2016-09-22 09:07:07 +02:00
Michael Peter Christen
5e165a8150 removed unused imports 2016-09-06 18:46:24 +02:00
reger
eb2a00b1d8 fix NPE on missing crawldepth_i 2016-05-15 01:26:38 +02:00
reger
7be1c7a05a fix logger name 2016-04-17 03:20:14 +02:00
reger
379e9b330d use supplied url port to get robots.txt in crawlers hostqueue 2016-03-02 00:12:34 +01:00
Ryszard Goń
a98c395023 Add the Autocrawl thread 2016-01-14 00:50:23 +01:00
reger
90686a75a2 fix flux factor (additional crawl delay by access count) calculation 2015-11-25 01:34:41 +01:00
reger
367fe388b9 fix exception thrown after sendError in DefaultServlet
- reduce debug exception logs in crawler
2015-09-05 01:57:30 +02:00
Michael Peter Christen
8f90767889 fix for filesystem crawl 2015-08-11 00:42:26 +02:00
Michael Peter Christen
fbeae20b3a try a healing of the cache if the index file is corrupted 2015-07-27 15:16:08 +02:00
Michael Peter Christen
3c4c69adea fix for
- bad regex computation for crawl start from file (limitation on domain
did not work)
- servlet error when starting crawl from a large list of URLs
2015-06-29 02:02:01 +02:00
Michael Peter Christen
9c12555be5 added link to Snapshots in search results if the snapshot exists and
option is set in ConfigSearchPage_p
(this is a stub: we also need a visualization of pdf files!)
2015-06-07 20:37:37 +02:00
Michael Peter Christen
197f7449e5 All entities of crawl profiles are now editable in the crawl profile
editor.
2015-05-28 16:07:40 +02:00
reger
3e742d1e34 Init remote crawler on demand
If the remote crawl option is not activated, skip the init of the remoteCrawlJob to save the resources of its queue and idling thread.
Deployment of the remoteCrawlJob is deferred until the option is activated.
2015-05-23 02:06:39 +02:00
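A sketch of the on-demand pattern; the field and method names here are hypothetical, not YaCy's API:

```java
// The job thread and its queue are created only when the option is enabled.
public class RemoteCrawlSwitch {
    private volatile Thread remoteCrawlJob; // stays null while the option is off

    synchronized void setRemoteCrawlEnabled(boolean enabled) {
        if (enabled && remoteCrawlJob == null) {
            remoteCrawlJob = new Thread(
                    () -> { /* poll the remote crawl queue */ },
                    "remoteCrawlJob");
            remoteCrawlJob.start(); // deployed only on activation
        }
    }
}
```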
Michael Peter Christen
97930a6aad added must-not-match filter to snapshot generation.
also: fixed some bugs
2015-05-08 13:46:27 +02:00
Ryszard Goń
ca1a70aec8 fix for Accept '?' URLs column in Crawl Profile List 2015-04-19 15:55:49 +02:00
Michael Peter Christen
fed26f33a8 enhanced timezone management for indexed data:
To support the new time parser and search functions in YaCy, a
high-precision detection of date and time within the day is necessary.
That requires detecting both the time zone of the document content and
the time zone of the user doing a search. The time zone of the search
request is determined automatically from the browser's time zone offset,
which is delivered with the search request invisibly to the user. The
time zone of web page content cannot be detected automatically and must
be an attribute of crawl starts; the advanced crawl start now provides an
input field to set the time zone as an offset in minutes. All parsers
must get a time zone offset passed in, which required a change of the
parser Java API. Many other changes correct the previously wrong handling
of dates in YaCy, which added a correction based on the server's time
zone. Now no correction is added and all dates in YaCy are in the UTC/GMT
time zone, a normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
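An illustrative rendering of the normalization rule stated above (not the actual parser API): a document date parsed with the crawl start's offset-in-minutes becomes a UTC instant.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class TimezoneDemo {
    static Instant toUtc(LocalDateTime parsedLocal, int offsetMinutes) {
        return parsedLocal.toInstant(ZoneOffset.ofTotalSeconds(offsetMinutes * 60));
    }

    public static void main(String[] args) {
        // 2015-04-15 13:17 at UTC+120 minutes is 11:17 UTC.
        System.out.println(toUtc(LocalDateTime.of(2015, 4, 15, 13, 17), 120));
    }
}
```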
Michael Peter Christen
3288489fd2 more logging during start-up 2015-04-11 13:00:32 +02:00
Michael Peter Christen
535f1ebe3b added a new way of content browsing in search results:
- date navigation

The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.

The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search result, a histogram showing
the number of documents for each day is displayed. To render these
histograms the morris.js library is used; morris.js also requires
raphael.js, which is now integrated in YaCy as well.

The histogram is now also displayed in the index browser by default.

To select a specific range from a search result, the following modifiers
had been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'to:' - modifier.

The histogram shows blue and green lines; the green lines denote weekend
days (Saturday and Sunday).

Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove the from and to modifiers and set an on:<date> modifier for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).

The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
2015-03-02 04:30:10 +01:00
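An illustrative parser for the modifier syntax above; an ISO yyyy-MM-dd date format is assumed here, and this is not YaCy's actual query handling:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateModifierDemo {
    private static final Pattern MODIFIER =
            Pattern.compile("\\b(from|to|on):(\\d{4}-\\d{2}-\\d{2})");

    public static void main(String[] args) {
        Matcher m = MODIFIER.matcher("yacy from:2015-01-01 to:2015-03-02");
        while (m.find()) { // prints each modifier with its date
            System.out.println(m.group(1) + " -> " + m.group(2));
        }
    }
}
```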
Michael Peter Christen
b5ac29c9a5 added an html field scraper which reads text from html entities of a
given css class and extends a given vocabulary with a term consisting of
the text content of the html class tag. Additionally, the term is
included in the semantic facet of the document. This allows the creation
of faceted search on documents without the pre-creation of vocabularies;
instead, the vocabulary is created on-the-fly, possibly for use in other
crawls. If any of the term scraping for a specific vocabulary succeeds on
a document, this vocabulary is excluded from auto-annotation on the page.

To use this feature, do the following:
- create a vocabulary on /Vocabulary_p.html (if not existent)
- in /CrawlStartExpert.html you will now see the vocabularies as columns
in a table. The second column provides text fields where you can name the
class of html entities from which the literal of the corresponding
vocabulary shall be scraped
- when doing a search, you will see the content of the scraped fields in
a navigation facet for the given vocabulary
2015-01-30 13:20:56 +01:00
Michael Peter Christen
69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
split(",") was used
2015-01-29 01:46:22 +01:00
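A local equivalent of the substitution this commit applies everywhere; CommonPattern.COMMA is YaCy's precompiled constant, shown here with a stand-in:

```java
import java.util.regex.Pattern;

public class SplitDemo {
    // Compile the regex once and reuse it for every split.
    private static final Pattern COMMA = Pattern.compile(",");

    public static void main(String[] args) {
        String csv = "crawler,index,search";
        String[] viaString = csv.split(",");    // may recompile the pattern per call
        String[] viaPattern = COMMA.split(csv); // reuses the precompiled pattern
        System.out.println(viaString.length + " / " + viaPattern.length);
    }
}
```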
Michael Peter Christen
bee5ee7cce removed some warnings 2015-01-27 17:00:20 +01:00
Michael Peter Christen
7db2888336 fixed font size and print page generation in pdf snapshots 2015-01-20 17:14:14 +01:00
Michael Peter Christen
3e6c3e2237 documents pushed over the api/push_p.html interface will have their
unique flag set by default
2015-01-06 15:22:59 +01:00