770 Commits

Author SHA1 Message Date
Michael Peter Christen
5ba5fb5d23 upgraded pdfbox to 3.0.0 2023-10-27 12:05:24 +02:00
Michael Peter Christen
0689f4f0ae Check if the character is a minus sign and is followed by a letter or a
digit. Treat it as part of the word/number.
2023-09-03 10:22:03 +02:00
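The rule described in this commit message can be sketched as a character scan. This is a minimal illustration under assumptions, not the project's actual tokenizer: the class name `MinusAwareTokenizer` and the choice to treat every other non-alphanumeric character as a separator are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the project's tokenizer.
final class MinusAwareTokenizer {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // A minus sign counts as part of a token only when a letter
            // or a digit follows it directly.
            boolean nextIsWordChar = i + 1 < text.length()
                    && Character.isLetterOrDigit(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || (c == '-' && nextIsWordChar)) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character ends the current token (assumption).
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this sketch, `tokenize("-5 e-mail a - b")` keeps `-5` and `e-mail` whole and drops the isolated minus sign.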
Michael Peter Christen
5db97a8928 parser can now separate numbers from words also when they are not
separated by space, i.e. 4.7Ohm
2023-09-02 19:15:22 +02:00
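One way to picture the behaviour described here is a regular expression that matches either a number or a run of letters. The sketch below is an assumption for illustration only; `NumberUnitSplitter` and its pattern are hypothetical and do not reproduce the parser's real implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the project's parser.
final class NumberUnitSplitter {

    // Either a number with an optional decimal part, or a run of letters.
    private static final Pattern TOKEN =
            Pattern.compile("\\d+(?:[.,]\\d+)?|\\p{L}+");

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("4.7Ohm")); // prints [4.7, Ohm]
    }
}
```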
Michael Peter Christen
e3797de7de enhanced the word tokenizer to recognize numbers in a proper way 2023-09-01 20:10:08 +02:00
Michael Peter Christen
8285fe715a tab to spaces for classes supporting the condenser.
This is a preparation step to make changes in condenser and parser more
visible; no functional changes so far.
2023-09-01 11:00:42 +02:00
Michael Peter Christen
92dad3ed49 removed 7Zip parser because the old library could not be replaced by a maven repository 2023-07-27 23:11:27 +02:00
Michael Peter Christen
1c0f50985c fixed documentation and some details of handling of keywords 2023-04-04 12:41:12 +02:00
Michael Christen
3472bcb4d3 patched a 'java.lang.NoSuchMethodError: com.twelvemonkeys.imageio.util.IIOUtil.lookupProviderByName' problem which occurred only on ARM 2023-03-05 01:17:28 +01:00
Michael Peter Christen
9fcd8f1bda added canonical filter
attention: this is on by default!
(it should do the right thing)
2023-01-16 14:50:30 +01:00
Michael Christen
4304e07e6f crawl profile adaptation to new tag valency attribute 2023-01-15 01:20:12 +01:00
Michael Peter Christen
5acd98f4da introduction of tag-to-indexing relation TagValency 2023-01-13 17:20:18 +01:00
Michael Peter Christen
309adb814e fixed jsonlist import from searchlab.eu using a direct URL 2022-10-25 00:51:53 +02:00
Michael Peter Christen
62d177bf59 stub for jsonlist index importer web page 2022-10-23 12:22:31 +02:00
Michael Peter Christen
efa0425f00 refactoring: moved jsonlist importer to importer class 2022-10-23 11:35:32 +02:00
Michael Peter Christen
d49f937b98 added iso,apk,dmg to extension-deny list
see also https://github.com/yacy/yacy_search_server/issues/510
zip is not on the list because it can be parsed
2022-10-05 16:28:50 +02:00
Michael Christen
867f96a32b removed warnings 2022-10-04 22:05:32 +02:00
Michael Christen
8a06beaf24 removed finalize() methods, deprecated 2022-10-04 20:12:47 +02:00
Daleth Darko
3ced06c731 Various javadoc fixes 2022-01-26 11:22:43 +01:00
reger24
eae16287e9 Added epub (ebook) format to existing zipParser
*.epub files are zip files containing xhtml files with content and other artifact files,
which the zipParser can already feed to the index
- extension "epub"
- mime "epub+zip"
2022-01-24 13:51:27 +01:00
sgaebel
cdf901270c always use HTTPClient by 'try with resources' pattern to free up
resources
2021-10-31 23:06:23 +01:00
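The "try with resources" pattern named in this commit can be sketched with a stand-in class. `DemoClient` below is hypothetical and only mimics a closable client; it is not the project's HTTPClient and makes no network calls.

```java
// Hypothetical illustration of the pattern; not the project's HTTPClient.
final class TryWithResourcesDemo {

    // Stand-in for a closable HTTP client.
    static final class DemoClient implements AutoCloseable {
        private boolean closed = false;

        String fetch(String url) {
            if (closed) {
                throw new IllegalStateException("client already closed");
            }
            return "response for " + url;
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true; // a real client would release connections here
        }
    }

    static String load(String url) {
        // close() is called automatically when the block exits,
        // whether it exits normally or with an exception.
        try (DemoClient client = new DemoClient()) {
            return client.fetch(url);
        }
    }
}
```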
sgaebel
69adaa9f55 makes our HTTPClient closable 2021-10-31 23:06:02 +01:00
Michael Peter Christen
552ab7051b fix for warc importer 2021-10-25 19:35:15 +02:00
Michael Peter Christen
e9c5e78868 replaced new Number(Number) with Number.instanceOf
to remove deprecation warnings for Java 9
2021-08-08 00:39:03 +02:00
Michael Peter Christen
9ef4503672 fixed some newInstance() warnings
.. by adding .getDeclaredConstructor()
2021-08-07 18:46:53 +02:00
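`Class.newInstance()` has been deprecated since Java 9, and the usual replacement goes through `getDeclaredConstructor()`. A minimal sketch, where the helper class `ReflectiveInstantiation` is a hypothetical name and the exception wrapping is an illustrative choice:

```java
// Hypothetical helper showing the replacement for Class.newInstance().
final class ReflectiveInstantiation {

    static <T> T create(Class<T> type) {
        try {
            // Before: type.newInstance()  (deprecated since Java 9)
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Wrapping in an unchecked exception is a choice made for this sketch.
            throw new IllegalStateException("cannot instantiate " + type.getName(), e);
        }
    }
}
```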
jfhs
10bddc2c2d Decode HTML entities in all property values by default 2021-03-30 22:24:55 +02:00
jfhs
2135d259e3 Replace hardcoded html/xml entities with a file, support decoding all defined HTML entities 2021-03-30 22:24:54 +02:00
Michael Peter Christen
d3526c52af fixed a problem in warc importer: do not fail if single WARC entries are
faulty
2020-12-28 17:05:06 +01:00
Michael Peter Christen
d359d521a1 fixed warc importer
The importer tried to import a gzipped file as plain warc.
It will now check the file extension and unzip automatically
on the fly.
2020-12-10 11:19:25 +01:00
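The extension check described above could look roughly like the following. This is a sketch under assumptions: `WarcStreamOpener`, its method names, and the `.gz` suffix test are hypothetical and only illustrate wrapping the raw stream in a `GZIPInputStream` when needed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; not the project's WARC importer.
final class WarcStreamOpener {

    // Decompress on the fly when the file name ends with ".gz".
    static InputStream open(String fileName, InputStream raw) throws IOException {
        if (fileName.toLowerCase(Locale.ROOT).endsWith(".gz")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Helper for the demo: read the whole (possibly compressed) content as text.
    static String readAll(String fileName, byte[] content) {
        try (InputStream in = open(fileName, new ByteArrayInputStream(content))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the demo: gzip a string in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```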
sgaebel
fc03c4b4fe removes some warnings and unused objects 2020-08-03 20:44:31 +02:00
sgaebel
df9ea0a42a removes some warnings: unused imports, params 2020-07-27 22:20:49 +02:00
Michael Peter Christen
e0ad8ca9da replaced json library from JSON.org with libandroid-json-java
This fixes https://github.com/yacy/yacy_search_server/issues/347
2020-04-24 11:45:25 +02:00
luccioman
e90405b6f0 Support parsing audio URLs without file extension
Added also a Junit for the audio tag parser
2019-04-09 11:40:21 +02:00
sgaebel
c2398fd890 remove warnings: 'Statement unnecessarily nested within else clause' 2019-01-10 20:02:57 +01:00
sgaebel
811d40a6c4 taking care of closing inputstreams, HTTPClient 2019-01-04 18:58:49 +01:00
luccioman
3fb449b3b6 Properly resolve relative URLs against document URL in html base tags
Fixes issue #256
2018-12-06 20:18:00 +01:00
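One reading of this fix is that the `href` of an HTML base tag may itself be relative and therefore has to be resolved against the document URL before links are resolved against it. The sketch below illustrates that reading with `java.net.URI`; the class `BaseHrefResolver` and its methods are hypothetical and not taken from the project.

```java
import java.net.URI;

// Hypothetical illustration; not the project's HTML parser.
final class BaseHrefResolver {

    // The effective base is the base href resolved against the document URL.
    static URI effectiveBase(URI documentUrl, String baseHref) {
        if (baseHref == null || baseHref.isEmpty()) {
            return documentUrl;
        }
        return documentUrl.resolve(baseHref);
    }

    // Links are then resolved against the effective base.
    static URI resolveLink(URI documentUrl, String baseHref, String link) {
        return effectiveBase(documentUrl, baseHref).resolve(link);
    }
}
```

For a document at `http://example.org/a/b/page.html` with base href `/assets/`, the link `img/logo.png` resolves to `http://example.org/assets/img/logo.png`.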
luccioman
fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
New "Media Type detection" section in the advanced crawl start page
allows choosing between:
- not loading URLs with unknown or unsupported file extension without
checking the actual Media Type (relying on the Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross-checking the URL file extension against the actual Media Type.
This allows proper parsing of URLs ending with an apparently odd file
extension but which actually have a supported Media Type such as
text/html.

Sample URLs with misleading file extensions added as documentation in
the crawl start page.

fixes issue #244
2018-10-25 10:42:12 +02:00
luccioman
54fbe166ba Updated pdf cache clear steps consistently with current pdfbox version
- Removed calls to clearResources functions that no longer exist (on PDFont
class and its children) since upgrade to pdfbox 2.n.n
- Removed hacky usage of protected internal ClassLoader function. This
removes the warnings displayed when running with JDK9 or JDK10 :

     [java] WARNING: Illegal reflective access by
net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to
method java.lang.ClassLoader.findLoadedClass(java.lang.String)
     [java] WARNING: Please consider reporting this to the maintainers
of net.yacy.document.parser.pdfParser$ResourceCleaner
     [java] WARNING: Use --illegal-access=warn to enable warnings of
further illegal reflective access operations
     [java] WARNING: All illegal access operations will be denied in a
future release

Crawling thousands of pdf documents from various sources after the
modifications were applied revealed no new memory leak related to pdfbox
(measurements done with JVisualVM).
2018-08-16 18:23:42 +02:00
luccioman
685122363d Added a parser for XZ compressed archives.
As suggested by LA_FORGE on mantis 781
(http://mantis.tokeek.de/view.php?id=781)
2018-08-15 10:07:39 +02:00
luccioman
8a29551c54 Upgraded the OpenGeoDB dump URL
The status of the library in the DictionaryLoader_p.html page now also
notifies the user that an upgrade can be applied when an older dump is
already loaded.

Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter
chat.
2018-08-03 18:39:41 +02:00
luccioman
bb51555830 Removed remaining unsafe accesses to SimpleDateFormat instances.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formatting dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-07-02 10:00:40 +02:00
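The preference stated above can be illustrated with a shared, immutable `DateTimeFormatter`. This is a minimal sketch; the class `ThreadSafeDates` and the `uuuu-MM-dd` pattern are assumptions chosen for the example, not code from the project.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration of sharing one formatter across threads.
final class ThreadSafeDates {

    // DateTimeFormatter is immutable and thread-safe, so a single shared
    // instance needs no synchronization, unlike SimpleDateFormat.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("uuuu-MM-dd");

    static String format(LocalDate date) {
        return DAY.format(date);
    }

    static LocalDate parse(String text) {
        return LocalDate.parse(text, DAY);
    }
}
```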
luccioman
e97580dfc7 Fixed unsafe concurrent access to generic SimpleDateFormat instances
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formatting dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-28 14:59:23 +02:00
Michael Christen
e0dc632020 removed transformer
it was not used any more
2018-06-19 00:42:23 +02:00
luccioman
fa4399d5d2 Small perf improvement : initialize threads names early when possible
Initializing Thread names using the Thread constructor parameter is
faster as it already sets a thread name even if no customized one is
given, while an additional call to the Thread.setName() function
internally does synchronized access, possibly runs an access check on the
security manager and performs a native call.

Profiling a running YaCy server revealed that the total processing time
spent on Thread.setName() for a typical p2p search was in the range of
seconds.
2018-05-23 14:45:35 +02:00
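The change amounts to passing the name to the `Thread` constructor instead of calling `setName()` afterwards. A minimal sketch, where `NamedThreads` is a hypothetical helper name:

```java
// Hypothetical illustration of naming a thread at construction time.
final class NamedThreads {

    static Thread create(String name, Runnable task) {
        // Before (extra call after construction):
        //   Thread t = new Thread(task);
        //   t.setName(name);
        // After: the name is set once, by the constructor.
        return new Thread(task, name);
    }
}
```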
luccioman
e357ade47d Reduced memory footprint of text snippet extraction
By not parsing and storing at first all sentences of a document, but
only on the fly the ones necessary to compute the snippet.
2018-05-13 10:29:52 +02:00
luccioman
e115e57cc7 Reduced text snippet extraction processing time.
By not generating MD5 hashes on all words of indexed texts, processing
time is reduced by 30 to 50% on indexed documents with more than 1Mbytes
of plain text.
2018-05-11 15:42:53 +02:00
luccioman
fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME) 2018-03-23 10:28:19 +01:00
luccioman
cf62b571bd Added RSS reader support for enclosure feed item sub element.
Enclosure element (see
http://www.rssboard.org/rss-specification#ltenclosuregtSubelementOfLtitemgt
) can be seen for example in podcasts feeds.
2018-03-20 07:38:29 +01:00
luccioman
3da2739bbd Parse and index more common audio metadata text tag fields. 2018-03-15 09:59:57 +01:00
luccioman
846aba00fa Added parsing of URLs possibly present in audio metadata tags 2018-03-13 23:08:52 +01:00
Michael Peter Christen
187075b878 added nav filter 2018-03-10 15:46:53 +01:00