Commit Graph

8947 Commits

Author SHA1 Message Date
Michael Peter Christen
3d3bdb0f5f added zim importer rule for mdwiki 2023-11-16 23:11:57 +01:00
Michael Peter Christen
4a611ac6a3 another possible fix for
https://github.com/yacy/yacy_search_server/issues/500
2023-11-15 23:45:53 +01:00
sgaebel
d72cd7916c Merge branch 'master' of https://github.com/yacy/yacy_search_server 2023-11-14 20:43:42 +01:00
sgaebel
0663ae3c99 adds synchronized dumplog 2023-11-14 20:42:00 +01:00
Michael Peter Christen
cff0991d85 test if this is helpful for https://github.com/yacy/yacy_search_server/issues/500 2023-11-13 16:41:19 +01:00
Michael Peter Christen
ceb07a5218 fixed problem with zim importer which crashed when non-valid urls appeared 2023-11-13 11:12:10 +01:00
Michael Peter Christen
3268a93019 added a 'minified' option to YaCy dumps 2023-11-13 10:27:50 +01:00
Michael Peter Christen
c20c4b8a21 modified export: added maximum number of docs per chunk
The export can now be written as several files, called chunks.
By default, still only one chunk is exported.
This function is required when the exported files are to be
imported into an elasticsearch/opensearch index. The bulk import function
of elasticsearch/opensearch is limited to 100MB. To make it possible to
import YaCy files, they must be split into chunks. Right now we
cannot estimate the chunk size in bytes, only as a number of documents.
The user must experiment to find the optimum maximum chunk size,
such as 50000 docs per chunk. Try this as a first attempt.
2023-11-12 22:11:55 +01:00
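The chunking described in the commit above can be illustrated with a minimal sketch. This is not YaCy's export code; the function name `chunk_documents` and the parameter `max_docs_per_chunk` are hypothetical and only show the idea of limiting each chunk by document count rather than by bytes.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunk_documents(docs: Iterable[T], max_docs_per_chunk: int) -> Iterator[List[T]]:
    """Yield lists of at most max_docs_per_chunk documents (hypothetical helper)."""
    if max_docs_per_chunk < 1:
        raise ValueError("max_docs_per_chunk must be at least 1")
    chunk: List[T] = []
    for doc in docs:
        chunk.append(doc)
        if len(chunk) == max_docs_per_chunk:
            # The chunk is full: hand it out and start a new one.
            yield chunk
            chunk = []
    if chunk:
        # The last chunk may hold fewer documents than the limit.
        yield chunk
```

Because the limit counts documents and not bytes, a chunk of large documents can still exceed a byte limit on the importing side, which is why the commit message recommends experimenting with the chunk size.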
Michael Peter Christen
24011dcbcc more file name extensions for json list surrogate files 2023-11-06 22:44:18 +01:00
Michael Peter Christen
34a9fc1a07 bugfixes to zim reader: 2023-11-05 12:46:37 +01:00
Michael Peter Christen
7db0534d8a Added a zim parser to the surrogate import option.
You can now import zim files into YaCy by simply moving them
to the DATA/SURROGATE/IN folder. They will be fetched and after
parsing moved to DATA/SURROGATE/OUT.
There are exceptions where the parser is not able to identify the
original URL of the documents in the zim file. In that case the file
is simply ignored.
This commit also carries an important fix to the pdf parser and an
increase of the maximum parsing speed to 60000 PPM which should make it
possible to index up to 1000 files in one second.
2023-11-05 02:16:40 +01:00
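As a rough sketch of the workflow in the commit above, a zim file could be handed to the surrogate importer by moving it into DATA/SURROGATE/IN. The helper name `queue_zim_for_import` and the `yacy_root` parameter are assumptions for illustration; the location of the YaCy root depends on the installation.

```python
import shutil
from pathlib import Path


def queue_zim_for_import(zim_file: Path, yacy_root: Path) -> Path:
    """Move a zim file into the surrogate inbox (hypothetical helper).

    yacy_root is assumed to be the directory that contains DATA/.
    """
    inbox = yacy_root / "DATA" / "SURROGATE" / "IN"
    # Create the inbox if it does not exist yet (an assumption of this sketch).
    inbox.mkdir(parents=True, exist_ok=True)
    target = inbox / zim_file.name
    shutil.move(str(zim_file), str(target))
    return target
```

According to the commit message, YaCy then fetches the file and moves it to DATA/SURROGATE/OUT after parsing.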
Michael Peter Christen
70e29937ef added a check in zim importer which tests if import URLs actually exist 2023-11-04 19:07:50 +01:00
Michael Peter Christen
496f768c44 modified cache strategy for zim clusters 2023-11-03 18:20:10 +01:00
Michael Peter Christen
fdc6311dc7 added parsing rules for wikibooks and wikinews in zim reader 2023-11-02 00:27:24 +01:00
Michael Peter Christen
2ea54b3503 fixed blob iterator in zim cluster definition 2023-11-01 23:43:27 +01:00
Michael Peter Christen
54fa5d3c2e added a cluster cache but it requires more testing 2023-11-01 19:52:44 +01:00
Michael Peter Christen
53b01dbf2e Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2023-11-01 18:57:04 +01:00
Michael Peter Christen
41856e9f34 added an optimized zim file entry iterator 2023-11-01 18:50:28 +01:00
Michael Peter Christen
1c0df28bfb added a zim importer that can be used for surrogate imports.
Cannot be used yet because it requires some security additions
to verify that the given urls actually work.
2023-11-01 18:48:40 +01:00
Michael Peter Christen
33b6878ded Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2023-10-29 14:58:47 +01:00
okybaca
4add1f6bc7 replaced all the links to legacy legacy wiki to legacy wiki 2023-10-29 13:12:24 +01:00
Michael Peter Christen
e2c86a8eba added a ZIM cluster pointer cache 2023-10-29 12:49:08 +01:00
Michael Peter Christen
4a54b24703 fix for "negative seek offset" error during extension of heap files.
This always happened when a heap file exceeded 2GB.
should fix https://github.com/yacy/yacy_search_server/issues/372
2023-10-29 09:32:21 +01:00
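The commit above does not state the root cause, but a common source of negative offsets at the 2GB boundary is a file position held in a signed 32-bit integer. The following sketch emulates that wraparound under this assumption; `to_int32` is a hypothetical helper and not part of YaCy.

```python
def to_int32(value: int) -> int:
    """Wrap an integer the way a signed 32-bit int would (illustration only)."""
    value &= 0xFFFFFFFF
    # Values with the top bit set represent negative numbers in two's complement.
    return value - 0x100000000 if value >= 0x80000000 else value


two_gb = 2 * 1024 ** 3             # 2147483648 bytes
last_safe_offset = two_gb - 1      # largest value a signed 32-bit int can hold
wrapped_offset = to_int32(two_gb)  # becomes negative after the wraparound
```

Keeping offsets in a 64-bit type avoids the wraparound for any realistic file size.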
Michael Peter Christen
9c8fb97985 introduced url list and title list caching and enhanced input stream
performance in ZIM reader
2023-10-29 00:43:12 +02:00
Michael Peter Christen
b0ae660790 added Zstandard compressed data decompression for ZIM files type 5
also: more generalization and performance enhancements
2023-10-28 12:24:29 +02:00
Michael Peter Christen
ad8ee3a0b6 fixed typo in class name 2023-10-28 08:57:42 +02:00
Michael Peter Christen
c4082c4ff2 refactoring of ZIM reader, simplification, removed unnecessary code 2023-10-28 08:56:58 +02:00
Michael Peter Christen
c2b6b6e7b9 Fixed a large number of problems in the ZIM reader.
This library was not prepared for large data because it was missing long
data types for pointers. I had to modify the code-base in a fundamental
way:
- proofreading,
- unclustering,
- refactoring,
- naming aligned with https://wiki.openzim.org/wiki/ZIM_file_format,
- change of exception handling,
- extension to more attributes as defined in the spec (bugfix for mime type
loading),
- bugfix to long parsing (prevented reading of large files).
The code is still very inefficient and requires more attention.
However the format is very useful for YaCy as there are numerous data
sources for ZIM-Files.
2023-10-27 15:49:23 +02:00
Michael Peter Christen
5ba5fb5d23 upgraded pdfbox to 3.0.0 2023-10-27 12:05:24 +02:00
Michael Peter Christen
1fefae9baf integrated the source code of an openzim file format reader. These are
the raw format reader files with no integration in YaCy yet, which
may follow as a next step. The zim file format is documented at
https://openzim.org and the reader code was taken from the archived,
unmaintained repository at https://github.com/openzim/zimreader-java
2023-10-27 10:59:06 +02:00
Michael Peter Christen
4308aa5415 removed the concept of an empty password meaning "no password used",
because we now start YaCy with a default password (yacy).
This affects all functions that check the current state of
password protection and that handled the empty-password case,
including the warnings to set a password when none is set (which
can no longer happen).
2023-10-25 22:56:06 +02:00
Michael Peter Christen
2c60ff14bb fixed default pw comparison 2023-10-25 13:59:02 +02:00
Michael Peter Christen
4da320bebf added a warning message in ConfigBasic in case that the default password
was not changed.
2023-10-24 23:36:26 +02:00
Michael Peter Christen
7830268be1 fix 756c817b5a
must be applied to all code where a transaction token is generated.
2023-10-21 13:00:49 +02:00
Michael Peter Christen
756c817b5a fix for https://github.com/yacy/yacy_search_server/issues/544 2023-10-21 11:45:26 +02:00
Michael Peter Christen
03bf259601 fix for https://github.com/yacy/yacy_search_server/issues/363
We still need to set the load in the process because a demand for higher
crawl speed may require increasing the maximum load limit. However,
following the criticism in the bug report, we never reduce the load limit
again.
2023-10-16 18:26:47 +02:00
mchristen
8fc51f66c6 fixed a test class which prevented compilation on latest jvm 2023-09-26 15:39:34 +02:00
Joel Strasser
53bafa1544 consistent formatting in string concatenation 2023-09-25 23:31:55 +02:00
Joel Strasser
22c4188001 additionally match release stub for YaCy version 2023-09-25 22:41:04 +02:00
Michael Peter Christen
ff8fe7b6a4 fix for ',' or '.' appearing within a word or number. This will not
tokenize the query into parts around that character to make it possible
to search for numbers or version numbers.
2023-09-03 11:37:25 +02:00
Michael Peter Christen
0689f4f0ae Check if the character is a minus sign and is followed by a letter or a
digit. Treat it as part of the word/number.
2023-09-03 10:22:03 +02:00
Michael Peter Christen
5db97a8928 parser can now separate numbers from words even when they are not
separated by a space, e.g. 4.7Ohm
2023-09-02 19:15:22 +02:00
Michael Peter Christen
e3797de7de enhanced the word tokenizer to recognize numbers in a proper way 2023-09-01 20:10:08 +02:00
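The tokenizer rules described in the four commits above (numbers containing '.' or ',', a minus sign attached to the following letter or digit, and splitting a number from an adjacent word such as 4.7Ohm) can be approximated with a regular expression. This is an illustrative sketch under my own reading of those messages, not YaCy's tokenizer; the names `NUMBER`, `WORD` and `tokenize` are hypothetical.

```python
import re
from typing import List

# A number: optional leading minus, digits, and '.' or ',' only between digits.
NUMBER = r"-?\d+(?:[.,]\d+)*"
# A word: letters, with '.', ',' or '-' allowed only between letter runs.
WORD = r"[^\W\d_]+(?:[.,-][^\W\d_]+)*"

TOKEN = re.compile(f"{NUMBER}|{WORD}")


def tokenize(text: str) -> List[str]:
    """Split text into number and word tokens (illustration only)."""
    return TOKEN.findall(text)


print(tokenize("4.7Ohm"))            # ['4.7', 'Ohm']
print(tokenize("solr 8.11.2, now"))  # ['solr', '8.11.2', 'now']
print(tokenize("e-mail at -5"))      # ['e-mail', 'at', '-5']
```

Trying the number pattern before the word pattern is what separates 4.7 from Ohm, while a trailing comma or period is dropped because it is not followed by another digit or letter.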
Michael Peter Christen
88cd17ea57 migrated solr from 8.9.0 to 8.11.2; activated also migration script. A YaCy index with solr 8.9.0 will automatically be migrated to 8.11.2. This is a preparation step to migrate to 9.0.0 soon. 2023-09-01 18:24:52 +02:00
Michael Peter Christen
0089f234f4 added npe protection 2023-09-01 12:18:47 +02:00
Michael Peter Christen
8285fe715a tab to spaces for classes supporting the condenser.
This is a preparation step to make changes in condenser and parser more
visible; no functional changes so far.
2023-09-01 11:00:42 +02:00
Michael Peter Christen
195bd2e444 extended the maximum header size to 16k to prevent http error 431 2023-08-19 15:21:24 +02:00
Michael Peter Christen
92dad3ed49 removed 7Zip parser because the old library could not be replaced by a maven repository 2023-07-27 23:11:27 +02:00
Michael Peter Christen
5afcba162b updated libraries 2023-07-27 22:55:46 +02:00
Michael Christen
a348146d8f setting connect host to 0.0.0.0 2023-06-29 10:46:05 +02:00