yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
luccioman	c6ae87168a	Added unit tests on the gzip parser.	2017-08-22 14:13:00 +02:00
luccioman	169ffdd1c7	Finer control on max links to parse in the html parser.	2017-08-22 14:11:35 +02:00
luccioman	e41d046a9d	Improved parsing support for OOXML spreadsheets (.xlsx). As reported by edycop in mantis 765 ( http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was quite incomplete. Now properly support "Shared String Table" entry in Office Open XML spreadsheets, and also detect embedded URLs. Integrating the Apache poi-ooxml library could be an option for finer OOXML formats support, but their SAX style parsing example ( http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to show that a custom SAX handler is still efficient for lightweight and low memory footprint processing.	2017-08-21 09:38:20 +02:00
luccioman	780173008e	Implemented partial stream parsing of tar archives. Also added JUnit tests for the tar parser and fixed unwanted use of the tar parser as a fallback on files included in a tar archive.	2017-08-14 14:57:58 +02:00
luccioman	acab6a6def	Also handle text content when parsing XML within limits.	2017-08-14 14:47:01 +02:00
luccioman	ed678186a8	Updated xml parser limited parsing test for use with the latest JDK.	2017-08-12 09:42:06 +02:00
luccioman	bf55f1d6e5	Started support of partial parsing on large streamed resources. Thus enable getpageinfo_p API to return something in a reasonable amount of time on resources over MegaBytes size range. Support added first with the generic XML parser, for other formats regular crawler limits apply as usual.	2017-07-08 09:04:03 +02:00
luccioman	2a87b08cea	Removed temporary html parser test code	2017-07-03 14:53:36 +02:00
luccioman	90a7c1affa	HTML parser : removed unnecessary remaining recursive processing. Recursive processing was removed in commit `67beef657f`, but one remained for anchors content (likely omitted from refactoring). It is no longer necessary : other links such as images embedded in anchors are currently correctly detected by the parser. More annoying : that remaining recursive processing could lead to almost endless processing when encountering some (invalid) HTML structures involving nested anchors, as detected and reported by lucipher on YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).	2017-07-03 10:00:53 +02:00
luccioman	9b1bb2545e	Refactored plain-text URLs detection implementation. For faster processing (measured about 2 times faster on many real-world examples) and more advanced detection (previous algorithm detected only URLs separated from the rest of the text by a space character).	2017-06-27 19:30:40 +02:00
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't become 'i' like in other locales such as "en", but becomes 'ı'.	2017-06-27 06:42:33 +02:00
luccioman	286f3018bd	Made mime type and extension normalization locale independent. Previously, upper cased mime type was incorrectly normalized when the default locale is Turkish.	2017-06-26 17:33:56 +02:00
luccioman	319231a458	Added a generic XML parser, able to parse elements text and URLs. This parser adds support for any XML based format other than already supported XML vocabularies such as XHTML, RSS/Atom feeds... It will eventually be used as a fallback if one of these specific parsers fails, before falling back to the existing genericParser which extracts not that much useful information except URL tokens.	2017-06-26 16:30:21 +02:00
luccioman	1acb7005d0	Added a basic JUnit test with test gz files for the gzip parser	2017-06-21 09:14:50 +02:00
luccioman	1e2fb76720	Properly close test files in htmlParser unit test	2017-06-21 09:11:17 +02:00
Michael Peter Christen	6fe735945d	migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8 Also: now Version 1.921	2017-06-09 12:25:23 +02:00
luccioman	a04feac064	Ensure proper closing of file input streams on both success and failure. Also add, when possible, a warning level log message on input stream closing error instead of failing silently. This could help understanding some IO exceptions such as "too many files open".	2017-06-03 04:00:46 +02:00
luccioman	d98c04853d	Ensure proper closing of file input streams.	2017-06-02 12:14:29 +02:00
reger	077d062be3	Adjust mergeDocuments to keep youngest last-modified date of document collection	2017-05-09 22:52:54 +02:00
reger	18c7563dbe	Extend DCEntry.getLanguage to convert to ISO639-1 codes for more languages by using icu.ULocale for languages not already covered (ICU normalizes to ISO639-1 2 char codes). Add test class. Use DublinCore vocabulary declarations in DCEntry and SurrogateReader for easier usage debugging. Init SurrogateReader.inputSource on first use.	2017-03-05 02:26:10 +01:00
reger	41e2ee0eca	Fix call parameter for ConnectionInfo in MonitorHandler (expected scheme e.g. http, was protocol version). Deprecate obsolete custom X-...-Scheme header constant. Use existing FORMAT_ANSIC Dateformatter in HeaderFramework. Correct htmlParserTest (removed one unintended println)	2017-02-25 23:55:17 +01:00
reger	f254fcfc67	fix htmlParser <script> text extraction on code containing expression recognized as tag like 1<a, reported in https://github.com/yacy/yacy_search_server/issues/109 Script content is ignored by default, but the text is filtered for html tags. Modified scraper to skip tag filtering while within a <script> section (until a closing </script> tag is detected). Possible side effect: a missing </script> end-tag will truncate trailing content text.	2017-02-24 01:25:32 +01:00
luccioman	c9889991b9	Fixed 2 failing JUnit tests.	2017-01-09 17:59:01 +01:00
reger	cb95b7339a	include html5 <time> tag in content scraper, add "datetime" property of <time> tag to scrapers startdate list. Datetime is parsed as iso8601 (xml) date, html5 allows partial as well as duration (not handled by this)	2016-12-24 03:11:35 +01:00
luccioman	7717a3d43d	Fixed license headers on files created to improve favicon management.	2016-10-14 11:55:49 +02:00
luccioman	6e1959f469	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git Conflicts: htroot/yacysearchitem.java source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java source/net/yacy/search/schema/CollectionConfiguration.java source/net/yacy/server/serverObjects.java	2016-10-14 11:29:55 +02:00
reger	b752bcfecb	adjust date in text detection to ignore some program version strings like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650 + expand test case	2016-10-06 23:37:12 +02:00
reger	b017e97421	optimize condenser language detection a little. langdetect probabilities take letter case into account, add words from description and anchors etc. as is. + add it to javadoc	2016-10-06 19:03:52 +02:00
reger	ae3717d087	adjust Tokenizer sentence count to ignore repeated punctuation (like !!!! ) + remove unused sentenceword map (we use only the count) + upd test case for sentence count	2016-10-06 03:41:07 +02:00
reger	1a79c64495	generalize DateDetection with holiday date rules readily available in icu to make sure current dates are recognized (was fixed to 2014 - 2016) + adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text + moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing + add test case for parseline (used by query parser)	2016-10-02 03:19:12 +02:00
reger	272cdd496a	reactivate sentence counter in WordTokenizer for phrasepos ranking, by counting punctuation (delivered as 1 char word) again.	2016-09-07 02:16:16 +02:00
reger	e310ec5f70	fix posInText ranking calculation to score 0 on no position info + fix Word posInText calc in Tokenizer to start with 1 + test case	2016-09-06 00:05:59 +02:00
reger	ebde21079a	refactor xlsParser to include Excel file attribute (like author) in parser result doc. Similar to ppt and doc parser, completing a TODO in xlsParser.	2016-08-13 23:46:36 +02:00
luc	3cc5619d93	Improved HTML icons indexing and rendering in search results. See http://mantis.tokeek.de/view.php?id=629	2016-02-02 09:57:54 +01:00
reger	84c970eaec	move test classes to test/java (subdirectory as in Maven standard subdir layout) because ViewImage*Test.java breaks test run	2016-01-16 19:22:27 +01:00

35 Commits
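
Note on commits 8da3174867 and 286f3018bd: the locale issue they describe can be reproduced with a few lines of plain Java. The sketch below is illustrative only. The class and method names are hypothetical and are not taken from the YaCy sources; it assumes the fix amounts to passing a fixed locale such as Locale.ROOT to String.toLowerCase, which is the usual way to make the conversion independent of the system default.

```java
import java.util.Locale;

/** Illustrative only: class and method names are hypothetical, not YaCy code. */
public class LocaleSafeLowerCase {

    /** Lower-cases a technical string (URL, tag name, MIME type) independently of the default locale. */
    public static String normalizeTechnical(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    /** Lower-cases with an explicit locale, to show what a given default locale would produce. */
    public static String lowerCaseWith(String value, Locale locale) {
        return value.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // With Turkish rules, 'I' maps to dotless 'ı' (U+0131): the result is "tıtle".
        System.out.println(lowerCaseWith("TITLE", turkish));
        // With Locale.ROOT the result is "title" whatever the system default locale is.
        System.out.println(normalizeTechnical("TITLE"));
    }
}
```

Calling toLowerCase() without an argument uses the JVM default locale, so the first behaviour is what a user with "tr" as system locale would get for tag names or MIME types unless a locale is passed explicitly.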