Commit Graph

120 Commits

Author SHA1 Message Date
luccioman
d41ad7af6f Restore initial locale at the end of a JUnit test case that modifies it. 2017-11-20 18:50:49 +01:00
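A minimal sketch of the restore pattern named in this commit, assuming a plain try/finally around the code that changes the default locale. The class and method names (LocaleRestoringTest, withDefaultLocale) are hypothetical and are not taken from the YaCy test sources.

```java
import java.util.Locale;
import java.util.function.Supplier;

class LocaleRestoringTest {

    /** Runs the given test body under a temporary default locale, then restores the initial one. */
    static <T> T withDefaultLocale(final Locale temporary, final Supplier<T> body) {
        final Locale initial = Locale.getDefault();
        Locale.setDefault(temporary);
        try {
            return body.get();
        } finally {
            // Restore even when the body throws, so that later tests are not affected
            Locale.setDefault(initial);
        }
    }
}
```

In a JUnit test case the same effect can be obtained by saving the locale in a set-up method and restoring it in a tear-down method.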
luccioman
7206f1ed71 Do locale neutral case conversions on domain names.
Required to properly run on systems with default locale set to Turkish
language, as with this locale the 'i' character has different upper and
lower case flavors than with other locales.
2017-11-20 18:47:46 +01:00
luccioman
398c66f06c Do locale neutral case conversions in MultiProtocolURL
For any relevant URL parts : host name, URL scheme, session ids or
technical parts (see https://url.spec.whatwg.org/#url-writing and
https://tools.ietf.org/html/rfc3986 for current standard references).

The remaining locale-sensitive conversion, used for detection of URL word
components in urlComps(), makes sense, but using the detected language would
be preferable to using the default system locale.
2017-11-20 15:23:33 +01:00
luccioman
9531b83598 Do locale neutral case conversions in Classification
Required for people using Turkish language as their default system
locale, as with this locale the 'i' character has different upper and
lower case flavors than with other locales.
2017-11-20 09:48:46 +01:00
luccioman
ac209cac2e Updated the generic top-level known domains list.
Using current IANA reference list at
https://www.iana.org/domains/root/db

The generated URL hashes on these domains stay the same but performance
is greatly improved as a DNS resolve request is required on URL hash
computation when the TLD part of the host name is unknown.

Mean hash computation time, measured on 1541 sample URLs (one per
TLD) on a computer with a DSL connection: about 230 ms before the change,
only 20 ms after.
2017-11-14 09:42:09 +01:00
luccioman
fcd57e2d0f Improved some JUnit tests isolation and resources release
The modified tests were successful when run manually from an IDE such
as Eclipse, but failed occasionally when run with Maven as part of the
overall test suite.
2017-11-08 09:33:30 +01:00
luccioman
e0eda84c24 Remove old hard-coded holiday dates from DateDetection class.
Replaced with rules relative to the current year, as already done for
part of the supported dates.
2017-11-07 19:02:09 +01:00
luccioman
73977ec0fe Added a html parser charset detection unit test 2017-11-06 09:14:03 +01:00
luccioman
285f0d6a39 Consistently encode snapshot image with format requested on the API.
Previously, calling /api/snapshot.png rendered JPEG encoded images.
2017-10-18 07:53:07 +02:00
luccioman
7c319c841e Fixed pdf2image conversion with imagemagick on PDFs having transparency
The target image format (jpeg) doesn't support transparency, so the
Html2ImageTest produced unusable black images when run on a Linux
machine having the imagemagick package installed.
2017-10-16 19:45:17 +02:00
luccioman
fe75f326d8 Fixed ProfilingGraph calculation integer overflows and added test class.
Complementary to fix proposed in PR #128 by @otteresk.
2017-10-16 09:18:12 +02:00
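The kind of overflow addressed here can be illustrated with a generic scaling computation. This is only a sketch under assumptions: the actual ProfilingGraph formulas are not shown in this log, and the class, method and parameter names below are hypothetical.

```java
class ScaledCoordinate {

    /** Overflow-prone: the int product wraps around once it exceeds Integer.MAX_VALUE. */
    static int scaleWithInt(final int value, final int width, final int range) {
        return value * width / range;
    }

    /** The intermediate product is computed on 64 bits before the division, so it cannot wrap. */
    static int scaleWithLong(final int value, final int width, final int range) {
        return (int) ((long) value * width / range);
    }
}
```

With value = 3000000, width = 1000 and range = 6000000, the int product exceeds Integer.MAX_VALUE and the first method returns a negative coordinate, while the second returns the expected half-width.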
luccioman
5bf76f058a Adjusted ResponseHeaderTest to succeed on slow or highly loaded CPU 2017-10-09 19:08:39 +02:00
luccioman
32c9dfa768 Added partial bzip2 stream parsing support and bzipParser Junit test 2017-10-04 18:33:09 +02:00
luccioman
dd9cb06d25 Fixed RWI distance calculation on multi words search queries.
Distance was lost when storing/retrieving references to intermediate
result container.

Now all JUnit tests are again successfully passing!
2017-10-04 08:41:43 +02:00
luccioman
c6ae87168a Added unit tests on the gzip parser. 2017-08-22 14:13:00 +02:00
luccioman
169ffdd1c7 Finer control on max links to parse in the html parser. 2017-08-22 14:11:35 +02:00
luccioman
4743a104b5 Added some unit tests on FileUtils. 2017-08-22 14:06:09 +02:00
luccioman
e41d046a9d Improved parsing support for OOXML spreadsheets (.xlsx)
As reported by edycop in mantis 765 (
http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was
quite incomplete.
Now properly support the "Shared String Table" entry in Office Open XML
spreadsheets, and also detect embedded URLs.

Integrating the Apache poi-ooxml library could be an option for finer
OOXML formats support, but their SAX style parsing example (
http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to
show that a custom SAX handler is still efficient for lightweight and
low memory footprint processing.
2017-08-21 09:38:20 +02:00
luccioman
780173008e Implemented partial stream parsing of tar archives.
Also added JUnit tests for the tar parser and fixed unwanted use of the
tar parser as a fallback on files included in a tar archive.
2017-08-14 14:57:58 +02:00
luccioman
acab6a6def Also handle text content when parsing XML within limits. 2017-08-14 14:47:01 +02:00
reger
f38fb7f02c Add junit test for AbstractOperations.addOperand() 2017-08-14 02:16:43 +02:00
luccioman
ed678186a8 Updated xml parser limited parsing test for use with the latest JDK. 2017-08-12 09:42:06 +02:00
luccioman
f369679d1c Fixed read/copy on input streams reading sometimes less than expected. 2017-07-11 09:00:27 +02:00
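For context, a single InputStream.read(byte[], int, int) call may return fewer bytes than requested even before the end of the stream. A minimal sketch of the usual remedy, a loop that reads until the buffer is full or the stream ends, is given here. The helper name is hypothetical and is not the YaCy implementation.

```java
import java.io.IOException;
import java.io.InputStream;

class StreamReads {

    /**
     * Reads until the buffer is full or the end of the stream is reached.
     * @return the number of bytes actually stored in the buffer
     */
    static int readFully(final InputStream in, final byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            // A single read may deliver less than requested: keep asking for the rest
            final int read = in.read(buffer, total, buffer.length - total);
            if (read < 0) {
                break; // end of stream
            }
            total += read;
        }
        return total;
    }
}
```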
luccioman
bf55f1d6e5 Started support of partial parsing on large streamed resources.
This enables the getpageinfo_p API to return something in a reasonable amount
of time on resources in the megabyte size range and above.
Support is added first in the generic XML parser; for other formats,
regular crawler limits apply as usual.
2017-07-08 09:04:03 +02:00
luccioman
2a87b08cea Removed temporary html parser test code 2017-07-03 14:53:36 +02:00
luccioman
90a7c1affa HTML parser : removed unnecessary remaining recursive processing
Recursive processing was removed in commit
67beef657f, but one remained for anchors
content (likely omitted from refactoring). It is no longer necessary :
other links such as images embedded in anchors are currently correctly
detected by the parser.

More annoying : that remaining recursive processing could lead to almost
endless processing when encountering some (invalid) HTML structures
involving nested anchors, as detected and reported by lucipher on YaCy
forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).
2017-07-03 10:00:53 +02:00
luccioman
9b1bb2545e Refactored plain-text URLs detection implementation.
For faster processing (measured about 2 times faster on many real-world
examples) and more advanced detection (previous algorithm detected only
URLs separated from the rest of the text by a space character).
2017-06-27 19:30:40 +02:00
luccioman
8da3174867 Ensure lower case conversion consistency with any default locale.
Especially for Turkish speaking users using "tr" as their system default
locale : strings for technical stuff (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' doesn't become
'i' like in other locales such as "en", but becomes 'ı'.
2017-06-27 06:42:33 +02:00
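The behaviour described in this commit can be reproduced with the standard String.toLowerCase(Locale) method. The sketch below assumes Locale.ROOT as the neutral locale; whether the commit uses Locale.ROOT or another fixed locale is not stated here, and the class and method names are hypothetical.

```java
import java.util.Locale;

class LocaleNeutralCase {

    /** Lower-cases a technical string (URL scheme, host name, tag name...) independently of the default locale. */
    static String lowerNeutral(final String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(final String[] args) {
        final Locale turkish = Locale.forLanguageTag("tr");
        // With Turkish rules, 'I' is lower-cased to the dotless 'ı' (U+0131)
        System.out.println("WIKI".toLowerCase(turkish));
        // With Locale.ROOT, 'I' is lower-cased to 'i' whatever the system default locale
        System.out.println(lowerNeutral("WIKI"));
    }
}
```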
luccioman
286f3018bd Made mime type and extension normalization locale independent.
Previously, upper cased mime type was incorrectly normalized when the
default locale is Turkish.
2017-06-26 17:33:56 +02:00
luccioman
319231a458 Added a generic XML parser, able to parse elements text and URLs.
This parser adds support for any XML based format other than already
supported XML vocabularies such as XHTML, RSS/Atom feeds... It will
eventually be used as a fallback if one of these specific parsers fails,
before falling back to the existing genericParser, which extracts little
useful information apart from URL tokens.
2017-06-26 16:30:21 +02:00
luccioman
64cec2790d Improved character encoding detection from Content-Type header
Also updated some related JavaDocs
2017-06-22 10:50:34 +02:00
luccioman
1acb7005d0 Added a basic JUnit test with test gz files for the gzip parser 2017-06-21 09:14:50 +02:00
luccioman
1e2fb76720 Properly close test files in htmlParser unit test 2017-06-21 09:11:17 +02:00
luccioman
9dd790087d Added HT Cache basic statistics (hit rate) 2017-06-15 09:50:02 +02:00
luccioman
28b451a0b3 Made Cache compression level and lock timeout user configurable 2017-06-14 19:02:08 +02:00
luccioman
a7394b479b Limit the synchronization blocking time on some Cache operations.
Using a Reentrant lock instead of the intrinsic synchronization lock
permits limiting the blocking time to acquire a lock.

Useful on a very busy Cache concurrently accessed by many threads : when
the time to acquire a lock is too high, getting/storing content on the
cache becomes inefficient, and it is then better to fall back to loading
remote resources.

Illustrated by the CacheTest stress test and some traces reported in
mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )
2017-06-14 09:13:50 +02:00
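A minimal sketch of the bounded lock acquisition described above, using ReentrantLock.tryLock(long, TimeUnit) from the standard library. The BoundedWaitCache class, its in-memory map and its method names are hypothetical simplifications, not the YaCy Cache class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class BoundedWaitCache {

    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, byte[]> entries = new HashMap<>();
    private final long lockTimeoutMillis;

    BoundedWaitCache(final long lockTimeoutMillis) {
        this.lockTimeoutMillis = lockTimeoutMillis;
    }

    /** Waits at most lockTimeoutMillis for the lock; returns false on timeout or interruption. */
    private boolean acquire() {
        try {
            return this.lock.tryLock(this.lockTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interruption status
            return false;
        }
    }

    /** @return true when stored, false when the lock could not be acquired in time */
    boolean store(final String url, final byte[] content) {
        if (!acquire()) {
            return false;
        }
        try {
            this.entries.put(url, content);
            return true;
        } finally {
            this.lock.unlock();
        }
    }

    /** @return the cached content, or null on a miss or a lock timeout (the caller then loads the remote resource) */
    byte[] getContent(final String url) {
        if (!acquire()) {
            return null;
        }
        try {
            return this.entries.get(url);
        } finally {
            this.lock.unlock();
        }
    }
}
```

An intrinsic synchronized block offers no such timeout: a thread blocks until the monitor is free, which is the limitation the commit message points out.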
Michael Peter Christen
6fe735945d migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8
Also: now Version 1.921
2017-06-09 12:25:23 +02:00
luccioman
a04feac064 Ensure file input streams proper closing in both success and failures
Also add, when possible, a warning-level log message on input stream
closing errors instead of failing silently. This could help in understanding
some IO exceptions such as "too many files open".
2017-06-03 04:00:46 +02:00
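The closing pattern described in this commit can be sketched as follows, assuming java.util.logging for the warning message. The SafeFileRead class and firstByte method are hypothetical, and the logging facility actually used in YaCy may differ.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.Logger;

class SafeFileRead {

    private static final Logger LOG = Logger.getLogger(SafeFileRead.class.getName());

    /** Reads the first byte of a file, closing the stream on both success and failure. */
    static int firstByte(final File file) throws IOException {
        InputStream in = null;
        try {
            in = new FileInputStream(file);
            return in.read();
        } finally {
            if (in != null) {
                try {
                    in.close();
                } catch (final IOException e) {
                    // Log instead of failing silently: repeated close failures can explain leaked file handles
                    LOG.log(Level.WARNING, "Could not close input stream on file " + file, e);
                }
            }
        }
    }
}
```

A try-with-resources statement also guarantees closing, but it propagates or suppresses the closing exception rather than logging it, which is why an explicit finally block is used in this sketch.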
luccioman
d98c04853d Ensure proper closing of file input streams. 2017-06-02 12:14:29 +02:00
luccioman
c226ded799 Fix unescape of URLs having some '%' chars but not percent-encoded 2017-05-30 12:32:14 +02:00
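One possible way to handle such URLs is a lenient decoder that only treats '%' as an escape when it is followed by two hexadecimal digits, and keeps it as a literal character otherwise. This is a sketch under that assumption; the LenientUnescape class is hypothetical and the actual YaCy fix may work differently.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class LenientUnescape {

    /** @return the value of an ASCII hexadecimal digit, or -1 when the byte is not one */
    private static int hex(final byte b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }

    /** Decodes valid %XX sequences as UTF-8 bytes and leaves any other '%' untouched. */
    static String unescape(final String s) {
        final byte[] in = s.getBytes(StandardCharsets.UTF_8);
        final ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        int i = 0;
        while (i < in.length) {
            final boolean encoded = in[i] == '%' && i + 2 < in.length
                    && hex(in[i + 1]) >= 0 && hex(in[i + 2]) >= 0;
            if (encoded) {
                out.write((hex(in[i + 1]) << 4) | hex(in[i + 2]));
                i += 3;
            } else {
                out.write(in[i]); // plain character, including a stray '%'
                i++;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```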
reger
077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
collection
2017-05-09 22:52:54 +02:00
luccioman
522a268305 Improved new blacklist entries URL scheme detection. 2017-05-04 16:36:45 +02:00
luccioman
31fff2c986 Extended WikiCode template inclusion syntax support.
Wiki templates are not rendered but syntax support is improved, which
greatly enhances snippets rendering on search results coming from a
MediaWiki dump import.
Tested on various dumps from Wikimedia at
https://dumps.wikimedia.org/backup-index.html
See also Wikipedia transclusion documentation at
https://en.wikipedia.org/wiki/Wikipedia:Transclusion
2017-04-27 09:50:04 +02:00
reger
7a7da698d4 fix unit test MultiProtocolURL(file) assertion for Windows path with
drive letter.
2017-04-20 00:47:52 +02:00
luccioman
23775e76e2 Fixed endless loop case in wikicode processing.
Detected when importing recent MediaWiki dumps containing some pages
with script content in plain text format (see Scribunto extension
https://www.mediawiki.org/wiki/Extension:Scribunto ).

Further improvement : modify the MediawikiImporter to prevent processing
revisions whose <model> is not wikitext.
2017-04-12 17:17:03 +02:00
luccioman
0bc868a819 Improved support for non ASCII chars in local file system URLs
Creating a MultiProtocolURL instance from a File object and then
retrieving a File with getFSFile() was inconsistent with file paths
containing space or non ASCII chars.
2017-04-12 09:23:10 +02:00
reger
777cb5b812 remove test case for Standard_MemoryControl which will always fail
see https://github.com/yacy/yacy_search_server/pull/114
2017-04-02 03:59:37 +02:00
reger
1ccc44e681 fix default/httpd.mime Z file extension to lower case
+ test case
2017-03-26 23:52:31 +02:00
reger
18c7563dbe Extend DCEntry.getLanguage conversion to ISO639-1 codes for more languages
by using icu.ULocale for languages not already covered (ICU normalizes
to ISO639-1 2 char codes).
Add test class.
Use DublinCore vocabulary declarations in DCEntry and SurrogateReader
for easier usage debugging.
Init SurrogateReader.inputSource on first use.
2017-03-05 02:26:10 +01:00
reger
275c0cddd1 Adjust DefaultServlet test case to recent change,
deprecate unused CONNECTION_PROP_PROTOCOL (also as it might be
misleading with getProtocol vs getScheme)
2017-02-26 02:39:52 +01:00