Commit Graph

180 Commits

Author SHA1 Message Date
luccioman
61c337f29a Decode blacklist entries for easier editing of non-ASCII chars
Not using the JDK URLDecoder.decode() function, as it strips '+'
characters when they occur after '?' (both characters having regular
expression semantics when used in blacklist path patterns)
2018-10-04 09:33:58 +02:00
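
The '+' concern in this commit can be illustrated with a small Java sketch. It compares the JDK decoder, which turns '+' into a space, with a hypothetical helper `decodeKeepingPlus` that decodes only percent-encoded bytes. The helper and class names are assumptions made for illustration; this is not YaCy's actual implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PercentDecodeSketch {

    /** JDK form decoding: '+' is treated as an encoded space. */
    static String jdkDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Hypothetical helper: decodes %XX sequences as UTF-8 and leaves '+' untouched. */
    static String decodeKeepingPlus(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '%' && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                // A valid %XX escape: emit the decoded byte
                out.write((Character.digit(s.charAt(i + 1), 16) << 4)
                        | Character.digit(s.charAt(i + 2), 16));
                i += 3;
            } else {
                // Any other character (including '+' and '?') is copied as-is
                int cp = s.codePointAt(i);
                byte[] raw = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(raw, 0, raw.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String pattern = "/caf%C3%A9/[a-z]+";
        System.out.println(jdkDecode(pattern));         // the '+' quantifier is lost
        System.out.println(decodeKeepingPlus(pattern)); // the '+' quantifier is kept
    }
}
```

Keeping '+' intact matters here because it is a regular expression quantifier in a blacklist path pattern, not an encoded space.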
luccioman
ed93221fa1 Improved normalization of blacklist path patterns having non-ASCII chars
Normalize blacklist path patterns using percent-encoding, both when a
pattern is edited in the web interface and when it is loaded from
configuration files.

Fixes issue #237
2018-10-02 14:36:13 +02:00
luccioman
dbf4c1cd76 Improved blacklist entry editing operations:
- Fixes issue #160: properly handle syntax exceptions with a
user-friendly message
- Fixes loss of information when editing multiple blacklist entries
- Fixes loss of entries when moving entries from one list to another
2018-02-13 18:24:26 +01:00
luccioman
7baa99f26f Fixed stored URL in web cache when redirections occur.
Associate cached content with the last redirection location, instead of
the first URL of a redirection chain:
 - for proper base URL processing in parsers (fixes mantis 636 -
http://mantis.tokeek.de/view.php?id=636)
 - to prevent duplicated content in Solr index when recrawling a
redirected URL
2018-01-20 18:56:40 +01:00
luccioman
5db1c9155a Do locale independent case conversion on hosts, schemes, and file exts.
Required for proper operation when the default system locale is Turkish,
as dotless and dotted i characters have specific case conversion rules
in this language.
2017-12-19 13:52:05 +01:00
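
A minimal Java sketch of the problem this commit addresses, assuming a hypothetical `normalize` helper (not a YaCy function): with a Turkish locale, the uppercase 'I' lowercases to the dotless 'ı', so a scheme such as "FILE" no longer matches "file" unless a fixed locale is passed explicitly.

```java
import java.util.Locale;

public class LocaleCaseSketch {

    /** Hypothetical helper: lowercases independently of the system default locale. */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Under Turkish rules, 'I' becomes the dotless 'ı' (U+0131)
        System.out.println("FILE".toLowerCase(turkish).equals("file")); // false
        // With Locale.ROOT the result does not depend on the platform locale
        System.out.println(normalize("FILE").equals("file"));           // true
    }
}
```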
Michael Peter Christen
25573bd5ab added a crawl filter based on <div> tag class names
When a crawl is started, a new field to exclude content from scraping is
available. The field can be identified with the class name of div tags.
All text contained in a div tag whose class matches one of the configured
class names is not indexed, while the rest of the page is indexed.
2017-12-09 22:29:35 +01:00
luccioman
d8eaf621cc Fixed blacklist returned location URL on empty parameters
2017-10-24 09:30:21 +02:00
luccioman
1e84956721 Support loading local files with a per request specified maximum size.
Consistently with the HTTP loader implementation.
2017-07-11 09:04:23 +02:00
luccioman
bf55f1d6e5 Started support of partial parsing on large streamed resources.
This enables the getpageinfo_p API to return a result in a reasonable
amount of time on resources in the megabyte size range and above.
Support is added first for the generic XML parser; for other formats
the regular crawler limits apply as usual.
2017-07-08 09:04:03 +02:00
luccioman
0b75e92ac2 Do not unnecessarily wrap loader IOExceptions in IOExceptions
2017-06-30 01:06:17 +02:00
luccioman
433bdb7c0d Respect maxFileSize limit also when streaming HTTP and when relevant.
Constraint applied consistently with HTTP content full load in byte
array.
2017-06-30 00:30:54 +02:00
luccioman
8399275142 Properly close file output streams even in exception scenarios.
2017-06-08 07:19:16 +02:00
luccioman
a04feac064 Ensure proper closing of file input streams on both success and failure
Also add, when possible, a warning level log message on an input stream
closing error instead of failing silently. This could help in understanding
some IO exceptions such as "too many files open".
2017-06-03 04:00:46 +02:00
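
The closing behaviour described in this commit could look roughly like the following Java sketch. The class, the `countBytes` method and the log message are assumptions made for illustration, not code taken from YaCy: the stream is closed in a `finally` block, and a failure to close is logged at warning level rather than swallowed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Level;
import java.util.logging.Logger;

public class StreamCloseSketch {
    private static final Logger LOG = Logger.getLogger(StreamCloseSketch.class.getName());

    /** Reads the file to its end; the stream is closed on success and on failure. */
    static long countBytes(Path file) throws IOException {
        InputStream in = null;
        try {
            in = Files.newInputStream(file);
            long count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } finally {
            if (in != null) {
                try {
                    in.close();
                } catch (IOException e) {
                    // Log the closing error instead of failing silently
                    LOG.log(Level.WARNING, "Could not close input stream for " + file, e);
                }
            }
        }
    }

    /** Small self-contained demonstration on a temporary file. */
    static long demo() {
        try {
            Path tmp = Files.createTempFile("sketch", ".bin");
            try {
                Files.write(tmp, new byte[] {1, 2, 3});
                return countBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```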
luccioman
a9cb083fa1 Improved consistency between loader openInputStream and load functions
2017-06-02 01:46:06 +02:00
luccioman
522a268305 Improved new blacklist entries URL scheme detection.
2017-05-04 16:36:45 +02:00
luccioman
58d23047dd Handle '?' and '+' chars as valid wild cards when adding to blacklist.
An entry such as "domain.com/[a-z]+" is a valid regular expression and
does not need additional ".*.*/.*" wildcards.
2017-05-04 11:19:59 +02:00
luccioman
f66438442e Extended Mediawiki dump import to remote URLs.
When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file is now streamed and processed directly, allowing import of
multi-GB dumps even with a low memory remote peer, and without needing to
manually download the dump file first.
2017-04-14 14:32:44 +02:00
luccioman
54405577aa Replaced absolute redirection locations by relative ones when possible.
This makes integration of YaCy behind a reverse proxy subfolder easier.
2017-02-09 16:42:21 +01:00
luccioman
339f005ced Blacklist import and update performance improvements.
Measurement sample : import from blacklist local file containing about
15000 entries
 - before refactoring : several minutes
 - after refactoring : a few seconds!
2017-01-06 12:24:31 +01:00
reger
395f2e8946 Make ServletRequest implement the standardized HttpServletRequest interface,
to make all readily available information from the original ServletRequest
available to YaCy servlets (without converting data to internal structures).
The implementation of the common interface allows easier integration of
YaCy servlets with the servlet standard (e.g. shared login service with
the servlet container etc.)
2016-11-14 01:37:16 +01:00
luccioman
f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
This makes threads monitoring easier to read.
2016-10-22 17:17:21 +02:00
luccioman
da362628fb Added fine log level for too long blacklist matching processing.
2016-10-17 22:32:19 +02:00
luccioman
4b699c469a Blacklist refactoring : extracted a function for easier unit testing
2016-10-13 15:33:31 +02:00
luccioman
242707f9b4 Fixed loadFromCache with strategy IFFRESH.
This fixes mantis 695 ( http://mantis.tokeek.de/view.php?id=695 ) :
crawl start with 'Link-List of URL' option on websites using cookies.
2016-10-10 01:10:35 +02:00
Michael Peter Christen
5e165a8150 removed unused imports
2016-09-06 18:46:24 +02:00
reger
5e335b32da fix Blacklist.contains() matching path pattern to string
similar to 5e9e871192
+ add proof testcase
2016-08-04 01:12:49 +02:00
reger
5e9e871192 fix Blacklist.remove by using pattern.toString to find the pattern to remove,
as the String path parameter never equaled a Pattern.
+ delete unused removeAll, as it does not persist changes after restart
2016-08-03 02:13:26 +02:00
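
The underlying issue can be sketched in Java as follows. The `removeByPath` helper and the use of a plain list are assumptions made for illustration, not the actual Blacklist data structure: a `String` is never equal to a `java.util.regex.Pattern`, so a lookup by string has to compare against the pattern's string form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PatternRemoveSketch {

    /** Hypothetical helper: removes every pattern whose source text equals the given path. */
    static boolean removeByPath(List<Pattern> patterns, String path) {
        return patterns.removeIf(p -> p.toString().equals(path));
    }

    public static void main(String[] args) {
        List<Pattern> patterns = new ArrayList<>();
        patterns.add(Pattern.compile(".*\\.jpg"));

        // A String never equals a Pattern, so this removes nothing
        System.out.println(patterns.remove((Object) ".*\\.jpg")); // false
        // Comparing against Pattern.toString() finds the entry
        System.out.println(removeByPath(patterns, ".*\\.jpg"));   // true
    }
}
```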
reger
1843ea7e69 on Blacklist.add pattern to source file also update internal entry maps
as in Blacklist.add(blacklistType) to make entry effective w/o restart
fix for http://mantis.tokeek.de/view.php?id=676
2016-08-02 02:41:03 +02:00
reger
efb9f1a8b7 save resource for unused blacklistFiles map
2016-05-12 00:13:57 +02:00
reger
06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
- Above brought up that parser start url parameter, declared as AnchorURL, uses only methods of parent object DigestURL (changed parameter declaration accordingly).
2016-02-16 02:05:58 +01:00
reger
b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
otherwise use header.mime() differentiated in prev. commit.
2015-12-20 15:49:24 +01:00
luc
f01d49c37a Process large or local file images dealing directly with content
InputStream.
2015-11-18 10:15:38 +01:00
Michael Peter Christen
fed26f33a8 enhanced timezone management for indexed data:
to support the new time parser and search functions in YaCy, a
high-precision detection of date and time of day is necessary. That
requires detecting both the time zone of the document content and the
time zone of the user doing a search. The time zone of the search
request is detected automatically from the browser's time zone offset,
which is delivered with the search request automatically and invisibly
to the user. The time zone for the content of web pages cannot be
detected automatically and must be an attribute of crawl starts. The
advanced crawl start now provides an input field to set the time zone
as an offset in minutes. All parsers must get a time zone offset
passed, so this required a change of the parser java api. Many other
changes were made to correct the wrong handling of dates in YaCy, which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are in the UTC/GMT time zone,
a normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
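
As a rough illustration of normalizing a parsed date to UTC with an offset given in minutes, here is a minimal Java sketch using `java.time`. The `toUtc` helper and the sign convention (positive offsets are east of UTC) are assumptions; the commit message does not state which convention YaCy uses.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class TimezoneOffsetSketch {

    /**
     * Hypothetical helper: interprets a local date-time found in a document
     * using an offset in minutes (assumed positive east of UTC) and returns
     * the corresponding UTC instant.
     */
    static Instant toUtc(LocalDateTime local, int offsetMinutes) {
        return local.toInstant(ZoneOffset.ofTotalSeconds(offsetMinutes * 60));
    }

    public static void main(String[] args) {
        LocalDateTime parsed = LocalDateTime.of(2015, 4, 15, 13, 17, 23);
        // With an offset of +120 minutes, 13:17:23 local time is 11:17:23 UTC
        System.out.println(toUtc(parsed, 120)); // 2015-04-15T11:17:23Z
    }
}
```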
Michael Peter Christen
b5ac29c9a5 added a html field scraper which reads text from html elements of a
given css class and extends a given vocabulary with a term consisting
of the text content of the html class tag. Additionally, the term is
included into the semantic facet of the document. This allows the
creation of faceted search over documents without the pre-creation of
vocabularies; instead, the vocabulary is created on-the-fly, possibly
for use in other crawls. If any of the term scraping for a specific
vocabulary is successful on a document, this vocabulary is excluded from
auto-annotation on the page.

To use this feature, do the following:
- create a vocabulary on /Vocabulary_p.html (if not existent)
- in /CrawlStartExpert.html you will now see the vocabularies as a column
in a table. The second column provides text fields where you can name
the class of html elements from which the literal of the corresponding
vocabulary shall be scraped
- when doing a search, you will see the content of the scraped fields in
a navigation facet for the given vocabulary
2015-01-30 13:20:56 +01:00
Michael Peter Christen
69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
split(",") was used
2015-01-29 01:46:22 +01:00
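
The idea of this commit can be sketched in Java as follows. The `COMMA` constant here is a local stand-in for `CommonPattern.COMMA`, whose exact definition is assumed rather than quoted: the pattern is compiled once and reused for every split.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class CommaSplitSketch {

    /** Stand-in for CommonPattern.COMMA (assumed definition): compiled once, reused everywhere. */
    static final Pattern COMMA = Pattern.compile(",");

    public static void main(String[] args) {
        String csv = "http,https,ftp";
        // Same result as csv.split(","), using the shared precompiled pattern
        String[] parts = COMMA.split(csv);
        System.out.println(Arrays.toString(parts)); // [http, https, ftp]
    }
}
```

Whether this is measurably faster than `String.split(",")` depends on the JDK version, so the sketch only shows the usage pattern.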
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory along
the pdf and jpg images
- a transaction layer was placed above the snapshot directory to
divide snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments using archived solr search results
between peers. This is currently unfinished; we need a protocol to move
snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and contains now two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only by peers
running on a server with wkhtmltopdf installed. The expert crawl start
provides the option for snapshots to everyone. PDF snapshots are now
optional and the option is only shown if wkhtmltopdf is installed.
- the snapshot api now supports requests for historised xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such xml files is identical with solr search results with
only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down the process a lot and
a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
reger
ff18129def ViewFile servlet: update index if newer,
so viewed text and metadata (stored) info is similar
- to achieve this, use request with profile to allow indexing (defaultglobaltext) and update index
   (the resource is loaded and parsed anyway, so it's not an expensive operation)

Request: remove 2 unused init parameters
- number of anchors of the parent
- forkfactor sum of anchors of all ancestors
2014-12-05 01:13:37 +01:00
Michael Peter Christen
e586e423aa in case that loading from the cache fails, load from wkhtmltopdf without
cache using the user agent string given in the crawl profile
2014-12-02 13:35:19 +01:00
Michael Peter Christen
d5bac64421 recognize more html file types for snapshots
2014-12-02 12:52:36 +01:00
Michael Peter Christen
25a64c51b3 moved snapshot generation out of the html handler so that
existing cache entries do not prevent the handler from being executed
2014-12-01 17:37:25 +01:00
reger
48aed15c48 skip loader wait cycle on concurrent access in nocache configuration.
In nocache config resource is loaded online, leaving no benefit to wait for a faster cache hit.
2014-09-26 23:49:10 +02:00
Michael Peter Christen
1735dbc9d9 enhanced image search: bugfixes and performance enhancements
2014-09-12 16:37:01 +02:00
Michael Peter Christen
ebd0be2cea fixes and speed updates for search process
2014-09-10 14:24:03 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL Object type which
now also has a different toString method than the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls, just use toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
Michael Peter Christen
fb3dd56b02 fix for processing of noindex flag in http header
2014-07-10 17:13:35 +02:00
Marc Nause
c97da1a0d8 First draft of a blacklist API.
2014-04-30 00:48:38 +02:00
reger
727dfb5875 refactor URIMetadataNode to further unify interaction with index
-  URIMetadataNode extending SolrDocument
- use language as stored (String), reducing conversion to string
- optimize debug code in transferIndex
2014-04-20 01:41:30 +02:00
Michael Peter Christen
10cf8215bd added crawl depth for failed documents
2014-04-17 13:21:43 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for urls taken from any queue is that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create tens of thousands of files. To be able to use that many file
readers, it was necessary to implement a new index data structure which
opens the file only if an access is wanted (OnDemandOpenFileIndex). The
usage of such an on-demand file reader shall prevent the number of file
pointers from exceeding the system limit, which is usually about 10,000
open files. Some parts of YaCy had to be adapted to handle the crawl
depth number correctly. The logging and the IndexCreateQueues servlet
had to be adapted to show the crawl queues differently, because the host
name is attached to the port on the host to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
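
The queue organization described in this commit can be sketched in memory with a few lines of Java. This is a simplified illustration under stated assumptions, not the HostBalancer implementation: the class name, the "host:port" key format and the rotation rule for hosts with equal depth are hypothetical, and the real queues are file-backed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class HostQueuesSketch {

    // host key ("host:port") -> crawl depth -> URLs waiting at that depth
    private final Map<String, TreeMap<Integer, Deque<String>>> queues = new LinkedHashMap<>();

    void push(String hostKey, int depth, String url) {
        queues.computeIfAbsent(hostKey, k -> new TreeMap<>())
              .computeIfAbsent(depth, d -> new ArrayDeque<>())
              .addLast(url);
    }

    /** Returns a URL with minimal crawl depth over all hosts, or null if nothing is queued. */
    String pop() {
        String bestHost = null;
        int bestDepth = Integer.MAX_VALUE;
        for (Map.Entry<String, TreeMap<Integer, Deque<String>>> e : queues.entrySet()) {
            int depth = e.getValue().firstKey(); // smallest depth queued for this host
            if (depth < bestDepth) {
                bestDepth = depth;
                bestHost = e.getKey();
            }
        }
        if (bestHost == null) {
            return null;
        }
        TreeMap<Integer, Deque<String>> byDepth = queues.remove(bestHost);
        Deque<String> queue = byDepth.get(bestDepth);
        String url = queue.pollFirst();
        if (queue.isEmpty()) {
            byDepth.remove(bestDepth);
        }
        if (!byDepth.isEmpty()) {
            // Re-insert at the end so hosts with equal depth take turns
            queues.put(bestHost, byDepth);
        }
        return url;
    }

    public static void main(String[] args) {
        HostQueuesSketch q = new HostQueuesSketch();
        q.push("a.example:80", 0, "http://a.example/");
        q.push("a.example:80", 1, "http://a.example/page");
        q.push("b.example:443", 0, "https://b.example/");
        for (String url = q.pop(); url != null; url = q.pop()) {
            System.out.println(url);
        }
    }
}
```

In this sketch, both depth-0 URLs are returned before the depth-1 URL, which mirrors the rule that the crawl depth of the next URL is minimal.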
Michael Peter Christen
6bd8c6f195 fix for wrong status codes of error pages
2014-04-10 09:08:59 +02:00