yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
luccioman	e115e57cc7	Reduced text snippet extraction processing time. By not generating MD5 hashes on all words of indexed texts, processing time is reduced by 30 to 50% on indexed documents with more than 1Mbytes of plain text.	2018-05-11 15:42:53 +02:00
luccioman	3b89c232db	Easier tracking of the longest text snippet initializations, when text snippets statistics are enabled and the FINE log level is enabled on the TextSnippetStatistics class.	2018-05-01 09:58:05 +02:00
luccioman	eb20589e29	Fixed issue #158 : completed div CSS class ignore in crawl	2018-02-10 11:56:28 +01:00
luccioman	33593c22e9	Fixed loss of other modifiers on keywords/tags search navigation links	2018-02-06 17:17:13 +01:00
luccioman	9412881230	Added basic support for autotagging microdata annotated item types. With the appropriate vocabulary settings in the Vocabulary_p.html page, this can produce Vocabulary search facets displaying item types referenced in HTML documents by microdata annotation. Tested notably with, but not limited to, vocabulary classes/types defined by Schema.org and Dublin Core.	2018-02-06 10:25:38 +01:00
luccioman	5a14d34a7d	Refactoring : documented and extracted autotagging processing functions.	2018-02-02 10:27:36 +01:00
luccioman	58b9834729	Added HTML microdata typed items parsing capability. This adds the possibility for the HTML parser to gather typed items URLs annotated in HTML tags with itemscope and itemtype attributes (see microdata specification https://www.w3.org/TR/microdata/ ), notably Types from the schema.org vocabulary, but also Types/Classes from any other vocabulary, such as the common ones listed in the RDFa core context ( https://www.w3.org/2011/rdfa-context/rdfa-1.1.html ).	2018-02-02 09:31:40 +01:00
luccioman	fa6d030b0b	Moved dbtest to the test source folder.	2018-01-29 14:03:01 +01:00
luccioman	098ee63911	Added a manual performance test for the HostBalancer. Follows the report in mantis 776 (http://mantis.tokeek.de/view.php?id=776). Running the performance test with different control parameters suggests that YaCy's RowHandleMap, used in the balancer depthCache, is ultimately more efficient than, for example, the ConcurrentHashMap from JDK 8.	2018-01-28 12:41:56 +01:00
luccioman	46b5249c20	Removed time condition on HostBalancer initialization in JUnit test. Its initialization in main application usage remains asynchronous.	2018-01-26 17:15:27 +01:00
luccioman	36e9b1c5b3	Fixed occasional time-dependent failures of the SegmentTest test case, as highlighted by the latest automated Travis builds.	2018-01-02 10:21:07 +01:00
Michael Peter Christen	b907819cb4	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	2017-12-09 22:29:54 +01:00
Michael Peter Christen	25573bd5ab	Added a crawl filter based on <div> tag class names. When a crawl is started, a new field to exclude content from scraping is available. The field identifies div tags by their class name. All text contained in a div tag whose class matches the configured class name(s) is not indexed, while the rest of the page is indexed.	2017-12-09 22:29:35 +01:00
luccioman	d95b288f19	Removed use of deprecated Jetty IPAccessHandler for client filtering. Upgraded to InetAccessHandler. Added InetPathAccessHandler extension to InetAccessHandler to maintain the path patterns capability previously available in IPAccessHandler but lost in InetAccessHandler. Filtering on IPv6 addresses is now supported. Support for deprecated pattern formats such as "192.168." and "192.168.1.1/path" has been removed, but the automated migration at startup should convert any such patterns present in serverClient.	2017-12-08 15:12:08 +01:00
luccioman	0a120787e3	Improved accuracy of URLs search filters : protocol, tld, host, file ext	2017-12-01 11:19:31 +01:00
luccioman	d1c7dfd852	Fixed URL parsing with fragment and empty path	2017-12-01 09:48:42 +01:00
luccioman	e2f6427a63	Added a basic JUnit test for the Visio parser (vsdParser)	2017-11-22 09:06:16 +01:00
luccioman	d41ad7af6f	Restore the initial locale at the end of a JUnit test case that modifies it.	2017-11-20 18:50:49 +01:00
luccioman	7206f1ed71	Do locale neutral case conversions on domain names. Required to properly run on systems with default locale set to Turkish language, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	2017-11-20 18:47:46 +01:00
luccioman	398c66f06c	Do locale neutral case conversions in MultiProtocolURL. Applies to any relevant URL parts : host name, URL scheme, session ids or technical parts (see https://url.spec.whatwg.org/#url-writing and https://tools.ietf.org/html/rfc3986 for current standard references). The remaining locale sensitive conversion used for detection of URL word components in urlComps() makes sense, but using the detected language would be preferable to using the default system locale.	2017-11-20 15:23:33 +01:00
luccioman	9531b83598	Do locale neutral case conversions in Classification. Required for people using Turkish language as their default system locale, as with this locale the 'i' character has different upper and lower case flavors than with other locales.	2017-11-20 09:48:46 +01:00
luccioman	ac209cac2e	Updated the generic top-level known domains list, using the current IANA reference list at https://www.iana.org/domains/root/db . The generated URL hashes on these domains stay the same, but performance is greatly improved, as a DNS resolve request is required on URL hash computation when the TLD part of the host name is unknown. Hash computation mean time measured on 1541 sample URLs (one on each TLD) and a computer with a DSL connection : about 230ms before change, then only 20ms.	2017-11-14 09:42:09 +01:00
luccioman	fcd57e2d0f	Improved some JUnit tests isolation and resources release. The modified tests were successful when run manually from an IDE such as Eclipse, but failed occasionally when run with maven as part of the overall test suite.	2017-11-08 09:33:30 +01:00
luccioman	e0eda84c24	Removed old hard-coded holiday dates from the DateDetection class. Replaced with rules relative to the current year, as already done for some of the supported dates.	2017-11-07 19:02:09 +01:00
luccioman	73977ec0fe	Added a html parser charset detection unit test	2017-11-06 09:14:03 +01:00
luccioman	285f0d6a39	Consistently encode snapshot image with format requested on the API. Previously, calling /api/snapshot.png rendered JPEG encoded images.	2017-10-18 07:53:07 +02:00
luccioman	7c319c841e	Fixed pdf2image conversion with imagemagick on PDFs having transparency. The target image format (jpeg) doesn't support transparency, so the Html2ImageTest produced unusable black images when run on a Linux machine having the imagemagick package installed.	2017-10-16 19:45:17 +02:00
luccioman	fe75f326d8	Fixed ProfilingGraph calculation integer overflows and added test class. Complementary to fix proposed in PR #128 by @otteresk.	2017-10-16 09:18:12 +02:00
luccioman	5bf76f058a	Adjusted ResponseHeaderTest to succeed on slow or highly loaded CPU	2017-10-09 19:08:39 +02:00
luccioman	32c9dfa768	Added partial bzip2 stream parsing support and bzipParser Junit test	2017-10-04 18:33:09 +02:00
luccioman	dd9cb06d25	Fixed RWI distance calculation on multi words search queries. Distance was lost when storing/retrieving references to intermediate result container. Now all JUnit tests are again successfully passing!	2017-10-04 08:41:43 +02:00
luccioman	c6ae87168a	Added unit tests on the gzip parser.	2017-08-22 14:13:00 +02:00
luccioman	169ffdd1c7	Finer control on max links to parse in the html parser.	2017-08-22 14:11:35 +02:00
luccioman	4743a104b5	Added some unit tests on FileUtils.	2017-08-22 14:06:09 +02:00
luccioman	e41d046a9d	Improved parsing support for OOXML spreadsheets (.xlsx). As reported by edycop in mantis 765 ( http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was quite incomplete. Now properly supports the "Shared String Table" entry in Office Open XML spreadsheets, and also detects embedded URLs. Integrating the Apache poi-ooxml library could be an option for finer OOXML formats support, but their SAX style parsing example ( http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to show that a custom SAX handler is still efficient for lightweight and low memory footprint processing.	2017-08-21 09:38:20 +02:00
luccioman	780173008e	Implemented partial stream parsing of tar archives. Also added JUnit tests for the tar parser and fixed unwanted use of the tar parser as a fallback on files included in a tar archive.	2017-08-14 14:57:58 +02:00
luccioman	acab6a6def	Also handle text content when parsing XML within limits.	2017-08-14 14:47:01 +02:00
reger	f38fb7f02c	Add junit test for AbstractOperations.addOperand()	2017-08-14 02:16:43 +02:00
luccioman	ed678186a8	Updated the XML parser limited parsing test for use with the latest JDK.	2017-08-12 09:42:06 +02:00
luccioman	f369679d1c	Fixed read/copy on input streams reading sometimes less than expected.	2017-07-11 09:00:27 +02:00
luccioman	bf55f1d6e5	Started support of partial parsing on large streamed resources. This enables the getpageinfo_p API to return something in a reasonable amount of time on resources in the megabytes size range and above. Support added first with the generic XML parser; for other formats, regular crawler limits apply as usual.	2017-07-08 09:04:03 +02:00
luccioman	2a87b08cea	Removed temporary html parser test code	2017-07-03 14:53:36 +02:00
luccioman	90a7c1affa	HTML parser : removed unnecessary remaining recursive processing. Recursive processing was removed in commit `67beef657f`, but one remained for anchors content (likely omitted from refactoring). It is no longer necessary : other links such as images embedded in anchors are currently correctly detected by the parser. More importantly, that remaining recursive processing could lead to almost endless processing when encountering some (invalid) HTML structures involving nested anchors, as detected and reported by lucipher on YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).	2017-07-03 10:00:53 +02:00
luccioman	9b1bb2545e	Refactored plain-text URLs detection implementation. For faster processing (measured about 2 times faster on many real-world examples) and more advanced detection (previous algorithm detected only URLs separated from the rest of the text by a space character).	2017-06-27 19:30:40 +02:00
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' doesn't become 'i' like in other locales such as "en", but becomes 'ı'.	2017-06-27 06:42:33 +02:00
luccioman	286f3018bd	Made mime type and extension normalization locale independent. Previously, upper cased mime type was incorrectly normalized when the default locale is Turkish.	2017-06-26 17:33:56 +02:00
luccioman	319231a458	Added a generic XML parser, able to parse elements text and URLs. This parser adds support for any XML based format other than already supported XML vocabularies such as XHTML, RSS/Atom feeds... It will eventually be used as a fallback if one of these specific parsers fails, before falling back to the existing genericParser, which extracts little useful information apart from URL tokens.	2017-06-26 16:30:21 +02:00
luccioman	64cec2790d	Improved character encoding detection from the Content-Type header. Also updated some related JavaDocs.	2017-06-22 10:50:34 +02:00
luccioman	1acb7005d0	Added a basic JUnit test with test gz files for the gzip parser	2017-06-21 09:14:50 +02:00
luccioman	1e2fb76720	Properly close test files in htmlParser unit test	2017-06-21 09:11:17 +02:00
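
Several of the commits above (7206f1ed71, 398c66f06c, 9531b83598, 8da3174867, 286f3018bd) deal with locale neutral case conversions. The following is a minimal sketch of the underlying issue, not the actual YaCy code: the class name `LocaleNeutralCase` and the helper `lowerNeutral` are hypothetical names chosen for illustration. It assumes that technical strings such as host names are lower-cased with `Locale.ROOT` rather than with the system default locale.

```java
import java.util.Locale;

/** Illustrative helper (hypothetical, not part of the YaCy code base). */
final class LocaleNeutralCase {

    private LocaleNeutralCase() {
    }

    /**
     * Lower-cases a technical string (host name, URL scheme, tag name...)
     * with locale neutral rules, whatever the default system locale is.
     */
    static String lowerNeutral(final String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(final String[] args) {
        final Locale turkish = Locale.forLanguageTag("tr-TR");
        final String host = "WWW.WIKIPEDIA.ORG";

        // With Turkish rules, 'I' (U+0049) is lower-cased to dotless 'ı' (U+0131),
        // which no longer matches the expected ASCII host name.
        System.out.println(host.toLowerCase(turkish));

        // With locale neutral rules, 'I' is lower-cased to 'i' as expected.
        System.out.println(lowerNeutral(host));
    }
}
```

Calling `toLowerCase()` without an argument uses the default locale, so the same code can behave differently from one machine to another. Passing an explicit locale makes the conversion deterministic.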


134 Commits