Commit Graph

35 Commits

Author SHA1 Message Date
luccioman
c6ae87168a Added unit tests on the gzip parser. 2017-08-22 14:13:00 +02:00
luccioman
169ffdd1c7 Finer control on max links to parse in the html parser. 2017-08-22 14:11:35 +02:00
luccioman
e41d046a9d Improved parsing support for OOXML spreadsheets (.xlsx)
As reported by edycop in mantis 765 (
http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was
quite incomplete.
Now properly support "Shared String Table" entry in Office Open XML
spreadsheets, and also detect embedded URLs.

Integrating the Apache poi-ooxml library could be an option for finer
OOXML formats support, but their SAX style parsing example (
http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to
show that a custom SAX handler is still efficient for lightweight and
low memory footprint processing.
2017-08-21 09:38:20 +02:00
luccioman
780173008e Implemented partial stream parsing of tar archives.
Also added JUnit tests for the tar parser and fixed unwanted use of the
tar parser as a fallback on files included in a tar archive.
2017-08-14 14:57:58 +02:00
luccioman
acab6a6def Also handle text content when parsing XML within limits. 2017-08-14 14:47:01 +02:00
luccioman
ed678186a8 Updated the xml parser limited parsing test for use with the latest jdk. 2017-08-12 09:42:06 +02:00
luccioman
bf55f1d6e5 Started support of partial parsing on large streamed resources.
This enables the getpageinfo_p API to return something in a reasonable amount
of time on resources in the megabytes size range and over.
Support added first with the generic XML parser, for other formats
regular crawler limits apply as usual.
2017-07-08 09:04:03 +02:00
luccioman
2a87b08cea Removed temporary html parser test code 2017-07-03 14:53:36 +02:00
luccioman
90a7c1affa HTML parser : removed unnecessary remaining recursive processing
Recursive processing was removed in commit
67beef657f, but one remained for anchors
content (likely omitted from refactoring). It is no longer necessary:
other links such as images embedded in anchors are currently correctly
detected by the parser.

More annoyingly, that remaining recursive processing could lead to almost
endless processing when encountering some (invalid) HTML structures
involving nested anchors, as detected and reported by lucipher on YaCy
forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).
2017-07-03 10:00:53 +02:00
luccioman
9b1bb2545e Refactored plain-text URLs detection implementation.
For faster processing (measured about 2 times faster on many real-world
examples) and more advanced detection (previous algorithm detected only
URLs separated from the rest of the text by a space character).
2017-06-27 19:30:40 +02:00
luccioman
8da3174867 Ensure lower case conversion consistency with any default locale.
Especially for Turkish-speaking users using "tr" as their system default
locale: strings for technical stuff (URLs, tag names, constants...)
must not be lower-cased with the default locale, as 'I' doesn't become
'i' like in other locales such as "en", but becomes 'ı'.
2017-06-27 06:42:33 +02:00
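The locale issue described in this commit can be reproduced with the standard library alone. The following is a minimal sketch, not the code from the commit: the class and method names are hypothetical, and `Locale.ROOT` is assumed here as the locale-neutral choice for technical strings.

```java
import java.util.Locale;

final class LocaleSafeLowerCase {

    /** Lower-cases a technical string (URL, tag name...) independently of the default locale. */
    static String normalize(String technical) {
        return technical.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Under Turkish case rules, 'I' maps to the dotless 'ı' (U+0131)
        System.out.println("TITLE".toLowerCase(turkish));
        // Locale-neutral conversion keeps the ASCII 'i'
        System.out.println(normalize("TITLE"));
    }
}
```

Calling `toLowerCase()` without an argument uses the JVM default locale, which is why the result depends on the user's system settings.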
luccioman
286f3018bd Made mime type and extension normalization locale independent.
Previously, upper cased mime type was incorrectly normalized when the
default locale is Turkish.
2017-06-26 17:33:56 +02:00
luccioman
319231a458 Added a generic XML parser, able to parse elements text and URLs.
This parser adds support for any XML based format other than already
supported XML vocabularies such as XHTML, RSS/Atom feeds... It will
eventually be used as a fallback if one of these specific parsers fails,
before falling back to the existing genericParser, which extracts little
useful information besides URL tokens.
2017-06-26 16:30:21 +02:00
luccioman
1acb7005d0 Added a basic JUnit test with test gz files for the gzip parser 2017-06-21 09:14:50 +02:00
luccioman
1e2fb76720 Properly close test files in htmlParser unit test 2017-06-21 09:11:17 +02:00
Michael Peter Christen
6fe735945d migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8
Also: now Version 1.921
2017-06-09 12:25:23 +02:00
luccioman
a04feac064 Ensure proper closing of file input streams on both success and failure
Also add when possible a warning level log message on input stream
closing error instead of failing silently. This could help understanding
some IO exceptions such as "too many files open".
2017-06-03 04:00:46 +02:00
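The idea of logging a warning when closing fails, instead of failing silently, could look like the sketch below. It is an illustration under assumptions and not the code from the commit: the class name, method name and the use of `java.util.logging` are all hypothetical choices.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

final class StreamClosing {

    private static final Logger LOG = Logger.getLogger(StreamClosing.class.getName());

    /**
     * Closes the resource and logs a warning if closing fails.
     * Returns true when the resource was closed without error.
     */
    static boolean closeLogged(Closeable resource) {
        if (resource == null) {
            return true; // nothing to close
        }
        try {
            resource.close();
            return true;
        } catch (IOException e) {
            // Make the failure visible instead of swallowing it
            LOG.log(Level.WARNING, "Could not close input stream", e);
            return false;
        }
    }
}
```

Such a helper would typically be called from a `finally` block, so that the stream is closed whether the read succeeded or threw.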
luccioman
d98c04853d Ensure proper closing of file input streams. 2017-06-02 12:14:29 +02:00
reger
077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
collection
2017-05-09 22:52:54 +02:00
reger
18c7563dbe Extend DCEntry.getLanguage to convert to ISO639-1 codes for more languages
by using icu.ULocale for languages not already covered (ICU normalizes
to ISO639-1 2 char codes).
Add test class
Use DublinCore vocabulary declarations in DCEntry and SurrogateReader
for easier usage debugging.
Init SurrogateReader.inputSource on first use.
2017-03-05 02:26:10 +01:00
reger
41e2ee0eca Fix call parameter for ConnectionInfo in MonitorHandler
(expected scheme e.g. http, was protocol version).
Deprecate obsolete custom X-...-Scheme header constant.
Use existing FORMAT_ANSIC Dateformatter in HeaderFramework.
Correct htmlParserTest (deleted one unintended println)
2017-02-25 23:55:17 +01:00
reger
f254fcfc67 fix htmlParser <script> text extraction on code containing an expression
recognized as a tag, like 1<a
reported in https://github.com/yacy/yacy_search_server/issues/109

Script content is ignored by default, but the text is filtered for html
tags. Modified scraper to skip tag filtering while within a <script>
section (until a closing </script> tag is detected).
Possible side effect: a missing </script> end tag will truncate trailing
content text.
2017-02-24 01:25:32 +01:00
luccioman
c9889991b9 Fixed 2 failing JUnit tests. 2017-01-09 17:59:01 +01:00
reger
cb95b7339a include html5 <time> tag in content scraper,
add "datetime" property of <time> tag to the scraper's startdate list.
Datetime is parsed as an iso8601 (xml) date; html5 also allows partial
dates and durations (not handled by this)
2016-12-24 03:11:35 +01:00
luccioman
7717a3d43d Fixed license headers on files created to improve favicon management. 2016-10-14 11:55:49 +02:00
luccioman
6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
	htroot/yacysearchitem.java
	source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
	source/net/yacy/search/schema/CollectionConfiguration.java
	source/net/yacy/server/serverObjects.java
2016-10-14 11:29:55 +02:00
reger
b752bcfecb adjust date in text detection to ignore some program version strings
like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650
+ expand test case
2016-10-06 23:37:12 +02:00
reger
b017e97421 optimize condenser language detection a little.
langdetect probabilities take letter case into account, add words from
description and anchors etc. as is.
+ add it to javadoc
2016-10-06 19:03:52 +02:00
reger
ae3717d087 adjust Tokenizer sentence count to ignore repeated punctuation (like !!!! )
+ remove unused sentenceword map (we use only the count)
+ upd test case for sentence count
2016-10-06 03:41:07 +02:00
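Counting a run of repeated punctuation as a single sentence end can be sketched as follows. This is a simplified, character-based illustration and not the Tokenizer code from the commit: the class name, the method name and the set of sentence-ending characters are assumptions.

```java
final class SentenceCounter {

    /** Counts sentence ends, treating a run such as "!!!!" or "?!" as one. */
    static int countSentences(String text) {
        int count = 0;
        boolean previousWasEnd = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isEnd = c == '.' || c == '!' || c == '?';
            // Only the first mark of a run starts a new count
            if (isEnd && !previousWasEnd) {
                count++;
            }
            previousWasEnd = isEnd;
        }
        return count;
    }
}
```

With this rule, a text like "Wow!!!! Really? Yes." counts as three sentences, not six.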
reger
1a79c64495 generalize DateDetection with holiday date rules readily available in icu
to make sure current dates are recognized (was fixed to 2014 - 2016)
+ adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text
+ moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing
+ add test case for parseline (used by query parser)
2016-10-02 03:19:12 +02:00
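The difference between matching the whole input and finding a match inside surrounding text can be shown with `java.util.regex`. This is a minimal sketch with a made-up pattern; the holiday name, class name and method names are hypothetical and are not the rules used by DateDetection.

```java
import java.util.regex.Pattern;

final class MatchVersusFind {

    // Hypothetical holiday pattern, for illustration only
    private static final Pattern HOLIDAY =
            Pattern.compile("new year's day", Pattern.CASE_INSENSITIVE);

    /** True only when the entire input is the holiday name. */
    static boolean wholeInputMatches(String text) {
        return HOLIDAY.matcher(text).matches();
    }

    /** True when the holiday name occurs anywhere in the input. */
    static boolean containsMatch(String text) {
        return HOLIDAY.matcher(text).find();
    }
}
```

`Matcher.matches()` requires the pattern to cover the whole input, so leading or trailing text makes it fail, while `Matcher.find()` scans for the first occurrence.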
reger
272cdd496a reactivate sentence counter in WordTokenizer for phrasepos ranking,
by counting punctuation (delivered as 1 char word) again.
2016-09-07 02:16:16 +02:00
reger
e310ec5f70 fix posInText ranking calculation to score 0 on no position info
+ fix Word posInText calc in Tokenizer to start with 1
+ test case
2016-09-06 00:05:59 +02:00
reger
ebde21079a refactor xlsParser to include Excel file attribute (like author) in parser result doc.
Similar to ppt and doc parser, completing a TODO in xlsParser.
2016-08-13 23:46:36 +02:00
luc
3cc5619d93 Improved HTML icons indexing and rendering in search results.
See http://mantis.tokeek.de/view.php?id=629
2016-02-02 09:57:54 +01:00
reger
84c970eaec move test classes to test/java (subdirectory as in Maven standard subdir layout)
because ViewImage*Test.java breaks test run
2016-01-16 19:22:27 +01:00