yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	5ba5fb5d23	upgraded pdfbox to 3.0.0	2023-10-27 12:05:24 +02:00
Michael Peter Christen	0689f4f0ae	Check if the character is a minus sign and is followed by a letter or a digit. Treat it as part of the word/number.	2023-09-03 10:22:03 +02:00
Michael Peter Christen	5db97a8928	parser can now separate numbers from words also when they are not separated by space, e.g. 4.7Ohm	2023-09-02 19:15:22 +02:00
Michael Peter Christen	e3797de7de	enhanced the word tokenizer to recognize numbers in a proper way	2023-09-01 20:10:08 +02:00
Michael Peter Christen	8285fe715a	tab to spaces for classes supporting the condenser. This is a preparation step to make changes in condenser and parser more visible; no functional changes so far.	2023-09-01 11:00:42 +02:00
Michael Peter Christen	92dad3ed49	removed 7Zip parser because the old library could not be replaced by a maven repository	2023-07-27 23:11:27 +02:00
Michael Peter Christen	1c0f50985c	fixed documentation and some details of handling of keywords	2023-04-04 12:41:12 +02:00
Michael Christen	3472bcb4d3	patched a 'java.lang.NoSuchMethodError: com.twelvemonkeys.imageio.util.IIOUtil.lookupProviderByName' problem which occurred only on ARM	2023-03-05 01:17:28 +01:00
Michael Peter Christen	9fcd8f1bda	added canonical filter. Attention: this is on by default! (it should do the right thing)	2023-01-16 14:50:30 +01:00
Michael Christen	4304e07e6f	crawl profile adaptation to new tag valency attribute	2023-01-15 01:20:12 +01:00
Michael Peter Christen	5acd98f4da	introduction of tag-to-indexing relation TagValency	2023-01-13 17:20:18 +01:00
Michael Peter Christen	309adb814e	fixed jsonlist import from searchlab.eu using a direct URL	2022-10-25 00:51:53 +02:00
Michael Peter Christen	62d177bf59	stub for jsonlist index importer web page	2022-10-23 12:22:31 +02:00
Michael Peter Christen	efa0425f00	refactoring: moved jsonlist importer to importer class	2022-10-23 11:35:32 +02:00
Michael Peter Christen	d49f937b98	added iso, apk, dmg to extension-deny list; see also https://github.com/yacy/yacy_search_server/issues/510. zip is not on the list because it can be parsed	2022-10-05 16:28:50 +02:00
Michael Christen	867f96a32b	removed warnings	2022-10-04 22:05:32 +02:00
Michael Christen	8a06beaf24	removed finalize() methods, deprecated	2022-10-04 20:12:47 +02:00
Daleth Darko	3ced06c731	Various javadoc fixes	2022-01-26 11:22:43 +01:00
reger24	eae16287e9	Added epub (ebook) format to existing zipParser. *.epub files are zip files containing xhtml files with content and other artifact files, which the zipParser can already feed to index - extension "epub" - mime "epub+zip"	2022-01-24 13:51:27 +01:00
sgaebel	cdf901270c	always use HTTPClient by 'try with resources' pattern to free up resources	2021-10-31 23:06:23 +01:00
sgaebel	69adaa9f55	makes our HTTPClient closable	2021-10-31 23:06:02 +01:00
Michael Peter Christen	552ab7051b	fix for warc importer	2021-10-25 19:35:15 +02:00
Michael Peter Christen	e9c5e78868	replaced new Number(Number) with Number.valueOf to remove deprecation warnings for Java 9	2021-08-08 00:39:03 +02:00
Michael Peter Christen	9ef4503672	fixed some newInstance() warnings .. by adding .getDeclaredConstructor()	2021-08-07 18:46:53 +02:00
jfhs	10bddc2c2d	Decode HTML entities in all property values by default	2021-03-30 22:24:55 +02:00
jfhs	2135d259e3	Replace hardcoded html/xml entities with a file, support decoding all defined HTML entities	2021-03-30 22:24:54 +02:00
Michael Peter Christen	d3526c52af	fixed a problem in warc importer: do not fail if single WARC entries are faulty	2020-12-28 17:05:06 +01:00
Michael Peter Christen	d359d521a1	fixed warc importer. The importer tried to import gzipped files as plain warc. It now checks the file extension and decompresses automatically on the fly.	2020-12-10 11:19:25 +01:00
sgaebel	fc03c4b4fe	removes some warning and unused objects	2020-08-03 20:44:31 +02:00
sgaebel	df9ea0a42a	removes some warnings: unused imports, params	2020-07-27 22:20:49 +02:00
Michael Peter Christen	e0ad8ca9da	replaced json library from JSON.org with libandroid-json-java. This fixes https://github.com/yacy/yacy_search_server/issues/347	2020-04-24 11:45:25 +02:00
luccioman	e90405b6f0	Support parsing audio URLs without file extension. Also added a JUnit test for the audio tag parser	2019-04-09 11:40:21 +02:00
sgaebel	c2398fd890	remove warnings: 'Statement unnecessarily nested within else clause'	2019-01-10 20:02:57 +01:00
sgaebel	811d40a6c4	taking care of closing inputstreams, HTTPClient	2019-01-04 18:58:49 +01:00
luccioman	3fb449b3b6	Properly resolve relative URLs against document URL in html base tags. Fixes issue #256	2018-12-06 20:18:00 +01:00
luccioman	fcf6b16db4	Added new crawler attribute for finer control over Media Type detection. The new "Media Type detection" section in the advanced crawl start page allows choosing between: - not loading URLs with unknown or unsupported file extension without checking the actual Media Type (relying on the Content-Type header for now). This was the old default behavior, faster, but not really accurate. - always cross-checking the URL file extension against the actual Media Type. This allows properly parsing URLs ending with an apparently odd file extension, but which actually have a supported Media Type such as text/html. Sample URLs with misleading file extensions added as documentation in the crawl start page. Fixes issue #244	2018-10-25 10:42:12 +02:00
luccioman	54fbe166ba	Updated pdf cache clear steps consistently with current pdfbox version - Removed calls to clearResources functions that no longer exist (on PDFont class and its children) since the upgrade to pdfbox 2.n.n - Removed hacky usage of protected internal ClassLoader function. This removes the warnings displayed when running with JDK9 or JDK10 : [java] WARNING: Illegal reflective access by net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to method java.lang.ClassLoader.findLoadedClass(java.lang.String) [java] WARNING: Please consider reporting this to the maintainers of net.yacy.document.parser.pdfParser$ResourceCleaner [java] WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations [java] WARNING: All illegal access operations will be denied in a future release. Crawling thousands of pdf documents from various sources after the modifications were applied revealed no new memory leak related to pdfbox (measurements done with JVisualVM).	2018-08-16 18:23:42 +02:00
luccioman	685122363d	Added a parser for XZ compressed archives. As suggested by LA_FORGE on mantis 781 (http://mantis.tokeek.de/view.php?id=781)	2018-08-15 10:07:39 +02:00
luccioman	8a29551c54	Upgraded the OpenGeoDB dump URL. The status of the library in the DictionaryLoader_p.html page now also notifies the user that an upgrade can be applied when an older dump is already loaded. Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter chat.	2018-08-03 18:39:41 +02:00
luccioman	bb51555830	Removed remaining unsafe accesses to SimpleDateFormat instances. SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Prefer now DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	2018-07-02 10:00:40 +02:00
luccioman	e97580dfc7	Fixed unsafe concurrent access to generic SimpleDateFormat instances. SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Prefer now DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	2018-06-28 14:59:23 +02:00
Michael Christen	e0dc632020	removed transformer; it was not used any more	2018-06-19 00:42:23 +02:00
luccioman	fa4399d5d2	Small perf improvement: initialize thread names early when possible. Initializing Thread names using the Thread constructor parameter is faster, as the constructor already sets a thread name even if no customized one is given, while an additional call to the Thread.setName() function internally performs synchronized access, possibly runs an access check on the security manager, and performs a native call. Profiling a running YaCy server revealed that the total processing time spent on Thread.setName() for a typical p2p search was in the range of seconds.	2018-05-23 14:45:35 +02:00
luccioman	e357ade47d	Reduced memory footprint of text snippet extraction By not parsing and storing at first all sentences of a document, but only on the fly the ones necessary to compute the snippet.	2018-05-13 10:29:52 +02:00
luccioman	e115e57cc7	Reduced text snippet extraction processing time. By not generating MD5 hashes on all words of indexed texts, processing time is reduced by 30 to 50% on indexed documents with more than 1 MB of plain text.	2018-05-11 15:42:53 +02:00
luccioman	fb3032c530	Added a crawl filtering possibility on documents Media Type (MIME)	2018-03-23 10:28:19 +01:00
luccioman	cf62b571bd	Added RSS reader support for `enclosure` feed item sub element. Enclosure element (see http://www.rssboard.org/rss-specification#ltenclosuregtSubelementOfLtitemgt ) can be seen for example in podcasts feeds.	2018-03-20 07:38:29 +01:00
luccioman	3da2739bbd	Parse and index more common audio metadata text tag fields.	2018-03-15 09:59:57 +01:00
luccioman	846aba00fa	Added parsing of URLs possibly present in audio metadata tags	2018-03-13 23:08:52 +01:00
Michael Peter Christen	187075b878	added nav filter	2018-03-10 15:46:53 +01:00
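
The two SimpleDateFormat commits above (e97580dfc7, bb51555830) explain why a shared SimpleDateFormat is unsafe and why DateTimeFormatter is preferred. The following is a minimal sketch of that pattern, not code taken from YaCy: the class name `DateFormatSketch`, the method `formatDay`, the pattern and the UTC zone are all assumptions chosen for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration class; not part of the YaCy code base.
public class DateFormatSketch {

    // DateTimeFormatter is immutable, so a single shared instance can be
    // used by concurrent threads without any synchronization.
    private static final DateTimeFormatter DAY_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    // Formats a timestamp (milliseconds since the epoch) as a UTC day string.
    public static String formatDay(long epochMillis) {
        return DAY_FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(formatDay(0L)); // prints 1970-01-01
    }
}
```

With SimpleDateFormat, the equivalent shared static field would need a synchronized block or a per-thread instance, because the formatter mutates an internal calendar while it works.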


770 Commits
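
Commit fa4399d5d2 in the log above recommends passing the thread name to the Thread constructor instead of calling Thread.setName() afterwards. As a rough sketch of what that change looks like (the class `ThreadNameSketch` and the helper `namedWorker` are hypothetical names, not YaCy identifiers):

```java
// Hypothetical illustration class; not part of the YaCy code base.
public class ThreadNameSketch {

    // Sets the name at construction time, so no later setName() call is needed.
    public static Thread namedWorker(Runnable task, String name) {
        return new Thread(task, name);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = namedWorker(
                () -> System.out.println("running in " + Thread.currentThread().getName()),
                "snippet-worker");
        worker.start();
        worker.join();
    }
}
```

The variant being replaced would be `Thread t = new Thread(task); t.setName(name);`, which assigns a default name first and then pays for the additional setName() call described in the commit message.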