Commit Graph

28 Commits

Author SHA1 Message Date
Michael Peter Christen
34a9fc1a07 bugfixes to zim reader: 2023-11-05 12:46:37 +01:00
Michael Peter Christen
7db0534d8a Added a zim parser to the surrogate import option.
You can now import zim files into YaCy by simply moving them
to the DATA/SURROGATE/IN folder. They will be fetched and after
parsing moved to DATA/SURROGATE/OUT.
There are exceptions where the parser is not able to identify the
original URL of the documents in the zim file. In that case the file
is simply ignored.
This commit also carries an important fix to the pdf parser and an
increase of the maximum parsing speed to 60000 PPM (pages per minute),
which should make it possible to index up to 1000 files in one second.
2023-11-05 02:16:40 +01:00
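The import step described in commit 7db0534d8a amounts to moving a file into the surrogate input folder. A minimal sketch of that step follows; only the DATA/SURROGATE/IN path comes from the commit message. The class name, the method name and the idea of passing the YaCy installation root as an argument are assumptions made for illustration and are not part of YaCy.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of YaCy: drops a ZIM file into the surrogate input folder.
public class SurrogateDrop {

    /**
     * Moves zimFile into <yacyRoot>/DATA/SURROGATE/IN and returns the new path.
     * The folder is created if it does not exist yet.
     */
    public static Path dropIntoSurrogateIn(Path yacyRoot, Path zimFile) throws IOException {
        Path inDir = yacyRoot.resolve("DATA").resolve("SURROGATE").resolve("IN");
        Files.createDirectories(inDir);
        Path target = inDir.resolve(zimFile.getFileName());
        // Replace a file of the same name so that a repeated drop does not fail.
        return Files.move(zimFile, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SurrogateDrop <yacy-root> <file.zim>");
            return;
        }
        Path moved = dropIntoSurrogateIn(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("moved to " + moved);
    }
}
```

According to the commit message, YaCy then fetches the file and moves it to DATA/SURROGATE/OUT after parsing; the sketch does not model that part.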
Michael Peter Christen
496f768c44 modified cache strategy for zim clusters 2023-11-03 18:20:10 +01:00
Michael Peter Christen
fdc6311dc7 added parsing rules for wikibooks and wikinews in zim reader 2023-11-02 00:27:24 +01:00
Michael Peter Christen
2ea54b3503 fixed blob iterator in zim cluster definition 2023-11-01 23:43:27 +01:00
Michael Peter Christen
54fa5d3c2e added a cluster cache but it requires more testing 2023-11-01 19:52:44 +01:00
Michael Peter Christen
53b01dbf2e Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2023-11-01 18:57:04 +01:00
Michael Peter Christen
41856e9f34 added an optimized zim file entry iterator 2023-11-01 18:50:28 +01:00
Michael Peter Christen
1c0df28bfb added a zim importer that can be used for surrogate imports.
Cannot be used yet because it requires some security additions
to verify that the given urls actually work.
2023-11-01 18:48:40 +01:00
Michael Peter Christen
e2c86a8eba added a ZIM cluster pointer cache 2023-10-29 12:49:08 +01:00
Michael Peter Christen
9c8fb97985 introduced url list and title list caching and enhanced input stream
performance in ZIM reader
2023-10-29 00:43:12 +02:00
Michael Peter Christen
b0ae660790 added decompression of Zstandard-compressed data for ZIM files (compression type 5)
also: more generalization and performance enhancements
2023-10-28 12:24:29 +02:00
Michael Peter Christen
ad8ee3a0b6 fixed typo in class name 2023-10-28 08:57:42 +02:00
Michael Peter Christen
c4082c4ff2 refactoring of ZIM reader, simplification, removed unnecessary code 2023-10-28 08:56:58 +02:00
Michael Peter Christen
c2b6b6e7b9 Fixed a large number of problems in the ZIM reader.
This library was not prepared for large data because it was missing long
data types for pointers. I had to modify the code-base in a fundamental
way:
- proofreading,
- unclustering,
- refactoring,
- naming adapted to https://wiki.openzim.org/wiki/ZIM_file_format,
- change of exception handling,
- extension to more attributes as defined in the spec (bugfix for mime type
loading),
- bugfix to long parsing (prevented reading of large files).
Furthermore, the code is very inefficient and requires more attention.
However, the format is very useful for YaCy as there are numerous data
sources for ZIM files.
2023-10-27 15:49:23 +02:00
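The "missing long data types for pointers" problem named in commit c2b6b6e7b9 can be illustrated with a small sketch. It assumes that file offsets are stored as 8-byte little-endian integers; the class and method names are hypothetical and are not taken from the ZIM reader code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo, not the YaCy ZIM reader: why file offsets need the long type.
public class LongPointerDemo {

    /** Reads an 8-byte little-endian value as a long (assumed on-disk layout). */
    public static long readOffset(byte[] eightBytes) {
        return ByteBuffer.wrap(eightBytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        long offset = 3L * 1024 * 1024 * 1024; // an offset 3 GiB into a large file
        byte[] raw = ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(offset)
                .array();

        long asLong = readOffset(raw); // keeps the full value
        int asInt = (int) asLong;      // what an int-typed pointer would hold: wraps to a negative number

        System.out.println("as long: " + asLong);
        System.out.println("as int:  " + asInt);
    }
}
```

Any offset beyond Integer.MAX_VALUE (about 2 GiB) is lost when it is narrowed to int, which is consistent with the commit's remark that large files could not be read.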
Michael Peter Christen
1fefae9baf integrated the source code of an openzim file format reader. These are
the raw format reader files with no integration in YaCy yet, which will
maybe follow as a next step. The zim file format is documented at
https://openzim.org and the reader code was taken from the archived,
non-maintained repository at https://github.com/openzim/zimreader-java
2023-10-27 10:59:06 +02:00
Michael Peter Christen
ea8df27e95 modified org.json.* library to fit into the YaCy environment
as drop-in replacement.
Also made some fixes and enhancements to the library.
2020-04-24 11:42:06 +02:00
Michael Peter Christen
60dc1241a3 added org.json.* library
from https://android.googlesource.com/platform/libcore/+/refs/heads/master/json/src/main/java/org/json
as a preparation step for
https://github.com/yacy/yacy_search_server/issues/347
2020-04-24 10:28:43 +02:00
reger
681889ae64 use current tar library for untar files
- remove old source copy
2015-11-04 02:57:00 +01:00
Michael Peter Christen
d3964253ae - added @SuppressWarnings to unused servlet method parameters
- removed unnecessary casts
- removed unnecessary throw statements
2012-07-05 09:14:04 +02:00
Roland 'Quix0r' Haeder
a093ccf5eb Now using synchronization in all close() methods to make sure all objects
are 'closed' in an ordered way

Conflicts:
	source/de/anomic/http/server/ChunkedInputStream.java
	source/de/anomic/http/server/ChunkedOutputStream.java
	source/de/anomic/http/server/ContentLengthInputStream.java
	source/net/yacy/cora/protocol/Domains.java
	source/net/yacy/cora/services/federated/solr/SolrShardingConnector.java
	source/net/yacy/cora/services/federated/solr/SolrSingleConnector.java
	source/net/yacy/document/content/dao/PhpBB3Dao.java
	source/net/yacy/document/parser/html/AbstractTransformer.java
	source/net/yacy/kelondro/blob/BEncodedHeap.java
	source/net/yacy/kelondro/blob/HeapReader.java
	source/net/yacy/kelondro/index/RAMIndexCluster.java
	source/net/yacy/kelondro/io/ByteCountInputStream.java
	source/net/yacy/kelondro/logging/ConsoleOutErrHandler.java
	source/net/yacy/kelondro/table/SQLTable.java
2012-05-14 07:41:55 +02:00
orbiter
5a55397f99 some last-minute performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-25 11:23:52 +00:00
low012
3b40b98256 *) set SVN properties
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7567 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-08 01:51:51 +00:00
orbiter
dd459281c8 applied code changes that are recommended by PMD
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6563 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-10 23:09:48 +00:00
orbiter
4df88a4e7a - fixes for missing or bad hashCode computation
- fixes for bad equals() methods that were not used by hash maps, so some classes did not work as objects in hash maps.
- this may also affect some cases where double-checks should have worked but did not.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6495 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-11-20 12:11:56 +00:00
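The kind of problem addressed in commit 4df88a4e7a can be sketched as follows. This is a generic illustration of the equals()/hashCode() contract in Java, not code from the commit; the Key class and its fields are hypothetical.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical demo of a class that works correctly as an element of hash-based collections.
public class HashContractDemo {

    public static final class Key {
        private final String host;
        private final int port;

        public Key(String host, int port) {
            this.host = host;
            this.port = port;
        }

        // Must take Object: an overload such as equals(Key) is never called by hash maps.
        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof Key)) return false;
            Key k = (Key) other;
            return port == k.port && host.equals(k.host);
        }

        // Must be computed from the same fields that equals() compares.
        @Override
        public int hashCode() {
            return Objects.hash(host, port);
        }
    }

    /** Returns true if a second, equal Key instance is found in the set. */
    public static boolean seenTwice() {
        Set<Key> seen = new HashSet<>();
        seen.add(new Key("example.org", 80));
        return seen.contains(new Key("example.org", 80));
    }

    public static void main(String[] args) {
        System.out.println("equal key found: " + seenTwice());
    }
}
```

If either method were missing or inconsistent with the other, the lookup for the second instance would usually fail, which is how a check for already-seen objects can silently stop working.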
orbiter
8103ccec4c removed compiler warnings in imported classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6220 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-15 20:44:23 +00:00
orbiter
aee35bff6f replaced StringBuffer with StringBuilder in tar lib
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6213 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-14 13:31:57 +00:00
orbiter
f987fc6b4a added tar classes from apache ant tools
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6211 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-14 11:25:40 +00:00