yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
luccioman	d41ad7af6f	Restore initial locale at the end of a JUnit test case which modifies it.	2017-11-20 18:50:49 +01:00
luccioman	7206f1ed71	Do locale neutral case conversions on domain names. Required to properly run on systems with default locale set to Turkish language, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	2017-11-20 18:47:46 +01:00
luccioman	398c66f06c	Do locale neutral case conversions in MultiProtocolURL. For any relevant URL parts: host name, URL scheme, session ids or technical parts (see https://url.spec.whatwg.org/#url-writing and https://tools.ietf.org/html/rfc3986 for current standard references). The remaining locale sensitive conversion used for detection of URL word components in urlComps() makes sense, but using the detected language would be preferable to using the default system locale.	2017-11-20 15:23:33 +01:00
luccioman	9531b83598	Do locale neutral case conversions in Classification. Required for people using Turkish language as their default system locale, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	2017-11-20 09:48:46 +01:00
luccioman	ac209cac2e	Updated the list of known generic top-level domains, using the current IANA reference list ( https://www.iana.org/domains/root/db ). The generated URL hashes on these domains stay the same, but performance is greatly improved, as a DNS resolve request is required on URL hash computation when the TLD part of the host name is unknown. Mean hash computation time measured on 1541 sample URLs (one on each TLD) on a computer with a DSL connection: about 230ms before the change, then only 20ms.	2017-11-14 09:42:09 +01:00
luccioman	fcd57e2d0f	Improved some JUnit tests isolation and resources release. The modified tests were successful when run manually from an IDE such as Eclipse, but failed occasionally when run with maven as part of the overall test suite.	2017-11-08 09:33:30 +01:00
luccioman	e0eda84c24	Remove old hard-coded holiday dates from DateDetection class. Replaced with rules relative to the current year, as already done for some of the supported dates.	2017-11-07 19:02:09 +01:00
luccioman	73977ec0fe	Added a html parser charset detection unit test	2017-11-06 09:14:03 +01:00
luccioman	285f0d6a39	Consistently encode snapshot image with format requested on the API. Previously, calling /api/snapshot.png rendered JPEG encoded images.	2017-10-18 07:53:07 +02:00
luccioman	7c319c841e	Fixed pdf2image conversion with imagemagick on PDFs having transparency. The target image format (jpeg) doesn't support transparency, so the Html2ImageTest produced unusable black images when run on a Linux machine having the imagemagick package installed.	2017-10-16 19:45:17 +02:00
luccioman	fe75f326d8	Fixed ProfilingGraph calculation integer overflows and added test class. Complementary to fix proposed in PR #128 by @otteresk.	2017-10-16 09:18:12 +02:00
luccioman	5bf76f058a	Adjusted ResponseHeaderTest to succeed on slow or highly loaded CPU	2017-10-09 19:08:39 +02:00
luccioman	32c9dfa768	Added partial bzip2 stream parsing support and bzipParser Junit test	2017-10-04 18:33:09 +02:00
luccioman	dd9cb06d25	Fixed RWI distance calculation on multi words search queries. Distance was lost when storing/retrieving references to intermediate result container. Now all JUnit tests are again successfully passing!	2017-10-04 08:41:43 +02:00
luccioman	c6ae87168a	Added unit tests on the gzip parser.	2017-08-22 14:13:00 +02:00
luccioman	169ffdd1c7	Finer control on max links to parse in the html parser.	2017-08-22 14:11:35 +02:00
luccioman	4743a104b5	Added some unit tests on FileUtils.	2017-08-22 14:06:09 +02:00
luccioman	e41d046a9d	Improved parsing support for OOXML spreadsheets (.xlsx). As reported by edycop in mantis 765 ( http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was quite incomplete. Now properly supports the "Shared String Table" entry in Office Open XML spreadsheets, and also detects embedded URLs. Integrating the Apache poi-ooxml library could be an option for finer OOXML formats support, but their SAX style parsing example ( http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to show that a custom SAX handler is still efficient for lightweight and low memory footprint processing.	2017-08-21 09:38:20 +02:00
luccioman	780173008e	Implemented partial stream parsing of tar archives. Also added JUnit tests for the tar parser and fixed unwanted use of the tar parser as a fallback on files included in a tar archive.	2017-08-14 14:57:58 +02:00
luccioman	acab6a6def	Also handle text content when parsing XML within limits.	2017-08-14 14:47:01 +02:00
reger	f38fb7f02c	Add junit test for AbstractOperations.addOperand()	2017-08-14 02:16:43 +02:00
luccioman	ed678186a8	Updated xml parser limited parsing test for use with the latest JDK.	2017-08-12 09:42:06 +02:00
luccioman	f369679d1c	Fixed read/copy on input streams reading sometimes less than expected.	2017-07-11 09:00:27 +02:00
luccioman	bf55f1d6e5	Started support of partial parsing on large streamed resources. This enables the getpageinfo_p API to return something in a reasonable amount of time on resources in the megabyte size range and above. Support added first with the generic XML parser; for other formats, regular crawler limits apply as usual.	2017-07-08 09:04:03 +02:00
luccioman	2a87b08cea	Removed temporary html parser test code	2017-07-03 14:53:36 +02:00
luccioman	90a7c1affa	HTML parser: removed unnecessary remaining recursive processing. Recursive processing was removed in commit `67beef657f`, but one remained for anchors content (likely omitted from refactoring). It is no longer necessary: other links such as images embedded in anchors are currently correctly detected by the parser. More annoying: that remaining recursive processing could lead to almost endless processing when encountering some (invalid) HTML structures involving nested anchors, as detected and reported by lucipher on YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).	2017-07-03 10:00:53 +02:00
luccioman	9b1bb2545e	Refactored plain-text URLs detection implementation. For faster processing (measured about 2 times faster on many real-world examples) and more advanced detection (previous algorithm detected only URLs separated from the rest of the text by a space character).	2017-06-27 19:30:40 +02:00
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale: strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't become 'i' like in other locales such as "en", but becomes 'ı'.	2017-06-27 06:42:33 +02:00
luccioman	286f3018bd	Made mime type and extension normalization locale independent. Previously, upper cased mime type was incorrectly normalized when the default locale is Turkish.	2017-06-26 17:33:56 +02:00
luccioman	319231a458	Added a generic XML parser, able to parse elements text and URLs. This parser adds support for any XML based format other than already supported XML vocabularies such as XHTML, RSS/Atom feeds... It will eventually be used as a fallback if one of these specific parsers fails, before falling back to the existing genericParser, which extracts little useful information except URL tokens.	2017-06-26 16:30:21 +02:00
luccioman	64cec2790d	Improved character encoding detection from Content-Type header. Also updated some related JavaDocs.	2017-06-22 10:50:34 +02:00
luccioman	1acb7005d0	Added a basic JUnit test with test gz files for the gzip parser	2017-06-21 09:14:50 +02:00
luccioman	1e2fb76720	Properly close test files in htmlParser unit test	2017-06-21 09:11:17 +02:00
luccioman	9dd790087d	Added HT Cache basic statistics (hit rate)	2017-06-15 09:50:02 +02:00
luccioman	28b451a0b3	Made Cache compression level and lock timeout user configurable	2017-06-14 19:02:08 +02:00
luccioman	a7394b479b	Limit the synchronization blocking time on some Cache operations. Using a Reentrant lock instead of the intrinsic synchronization lock permits limiting the blocking time to acquire a lock. Useful on a very busy Cache concurrently accessed by many threads : when the time to acquire a lock is too high, getting/storing content on the cache becomes inefficient, and it is then better to fall back to loading remote resources. Illustrated by the CacheTest stress test and some traces reported in mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )	2017-06-14 09:13:50 +02:00
Michael Peter Christen	6fe735945d	migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8 Also: now Version 1.921	2017-06-09 12:25:23 +02:00
luccioman	a04feac064	Ensure file input streams proper closing in both success and failures. Also add, when possible, a warning level log message on input stream closing error instead of failing silently. This could help understanding some IO exceptions such as "too many files open".	2017-06-03 04:00:46 +02:00
luccioman	d98c04853d	Ensure proper closing of file input streams.	2017-06-02 12:14:29 +02:00
luccioman	c226ded799	Fix unescape of URLs having some '%' chars but not percent-encoded	2017-05-30 12:32:14 +02:00
reger	077d062be3	Adjust mergeDocuments to keep youngest last-modified date of document collection	2017-05-09 22:52:54 +02:00
luccioman	522a268305	Improved new blacklist entries URL scheme detection.	2017-05-04 16:36:45 +02:00
luccioman	31fff2c986	Extended WikiCode template inclusion syntax support. Wiki templates are not rendered but syntax support is improved, which greatly enhances snippets rendering on search results coming from a MediaWiki dump import. Tested on various dumps from Wikimedia ( https://dumps.wikimedia.org/backup-index.html ). See also Wikipedia transclusion documentation at https://en.wikipedia.org/wiki/Wikipedia:Transclusion	2017-04-27 09:50:04 +02:00
reger	7a7da698d4	fix unit test MultiProtocolURL(file) assertion for Windows path with drive letter.	2017-04-20 00:47:52 +02:00
luccioman	23775e76e2	Fixed endless loop case in wikicode processing. Detected when importing recent MediaWiki dumps containing some pages with script content in plain text format (see Scribunto extension https://www.mediawiki.org/wiki/Extension:Scribunto ). Possible further improvement: modify the MediawikiImporter to prevent processing revisions whose <model> is not wikitext.	2017-04-12 17:17:03 +02:00
luccioman	0bc868a819	Improved support for non ASCII chars in local file system URLs. Creating a MultiProtocolURL instance from a File object and then retrieving a File with getFSFile() was inconsistent with file paths containing space or non ASCII chars.	2017-04-12 09:23:10 +02:00
reger	777cb5b812	remove test case for Standard_MemoryControl which will always fail, see https://github.com/yacy/yacy_search_server/pull/114	2017-04-02 03:59:37 +02:00
reger	1ccc44e681	fix default/httpd.mime Z file extension to lower case + test case	2017-03-26 23:52:31 +02:00
reger	18c7563dbe	Extend DCEntry.getLanguage to convert to ISO639-1 codes for more languages, by using icu.ULocale for languages not already covered (ICU normalizes to ISO639-1 2 char codes). Add test class. Use DublinCore vocabulary declarations in DCEntry and SurrogateReader for easier usage debugging. Init SurrogateReader.inputSource on first use.	2017-03-05 02:26:10 +01:00
reger	275c0cddd1	Adjust DefaultServlet test case to recent change, deprecate unused CONNECTION_PROP_PROTOCOL (also as it might be misleading with getProtocol vs getScheme)	2017-02-26 02:39:52 +01:00
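
Several entries above (7206f1ed71, 398c66f06c, 9531b83598, 8da3174867, 286f3018bd) deal with locale neutral case conversion. The following is a minimal sketch of the underlying issue, not the code from those commits: the class and method names (LocaleNeutralCase, lowerNeutral) and the sample host name are hypothetical, and only java.util.Locale and String.toLowerCase(Locale) are standard JDK APIs.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not taken from the YaCy sources.
final class LocaleNeutralCase {

    /**
     * Lower-cases a technical string (host name, URL scheme, MIME type...)
     * with the root locale, so the result does not depend on the
     * default system locale.
     */
    static String lowerNeutral(final String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(final String[] args) {
        final Locale turkish = Locale.forLanguageTag("tr-TR");
        final String host = "WIKI.EXAMPLE.ORG"; // assumed sample value

        // With Turkish rules, 'I' lower-cases to dotless 'ı' (U+0131),
        // so the host name no longer matches its ASCII form.
        System.out.println(host.toLowerCase(turkish));

        // With the root locale, 'I' lower-cases to plain ASCII 'i'.
        System.out.println(lowerNeutral(host));
    }
}
```

Calling toLowerCase() without an argument uses the default locale, which is why the same code can behave differently on a machine configured with "tr" as its system locale.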


120 Commits