37 Commits

Author SHA1 Message Date
orbiter
f453c14b5d removed unreachable catch blocks and unused imports
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2619 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-18 11:23:58 +00:00
theli
b298474e22 *) Bugfix needed because of changed plasmaCrawlLURL.load behavior
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2518 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-08 11:33:27 +00:00
orbiter
4866868c0e added write cache for LURLs
This was necessary to speed up the index receive process during global search


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2498 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-07 01:13:03 +00:00
orbiter
b7f4a1521b added options to switch on or off the kelondroFlexTable for NURL, EURL and PreNURL
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2456 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-24 22:21:22 +00:00
orbiter
314021453f * more logging
* option in yacy.init to set useCollectionIndex usage

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2374 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-10 21:21:50 +00:00
orbiter
279b1d969d Integrated new indexing data structure 'collections' into the main class
for indexing, the plasmaWordIndex.

The new data structure is ready-to-use, but currently disabled.
It can be activated by setting the static
plasmaWordIndex.useCollectionIndex
to true. This shall be done for testing purposes.

The new index is stored to
DATA/INDEX/PUBLIC/TEXT
The directory PLASMA shall be used only for the crawler in the future.

Attention: during testing the data structure in INDEX may change,
and indexes created with the new data structure may become useless.


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2348 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-05 22:22:14 +00:00
orbiter
c4e922885a replaced indexURLEntry with a new class that uses a kelondroRow.Entry object
to store the index entry. This is another step toward the new database structure.
A side effect of this change is that index storage uses much less RAM,
which affects the index RAM cache.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2341 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-02 19:59:28 +00:00
orbiter
417ed5102e redesign of database iterators:
Iteration of key elements in kelondroTree databases is no longer supported.
It is replaced by iteration of kelondroRow.Entry objects from the database.
Iteration of keys from the database was mostly followed by retrieval of the row
from the database, which caused unnecessary database load.
The index selection was also redesigned to use the new row iteration methods.
This affects many functions; the most important is the DHT selection routine, which is now much faster.



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2327 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-26 11:21:51 +00:00
orbiter
ad692fc6c7 implemented option to extract nurls from the database
(plus some iteration enhancements for nurls)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2325 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-24 16:40:59 +00:00
orbiter
7fd90ca7c8 * strict handling of NURL entry element generation, storage and stacking
* more space for EURL reason strings (you must delete the EURL db to use this)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2324 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-24 16:04:14 +00:00
orbiter
58df8b7bbf a large collection of different changes
* mainly for the transition to the new indexing database structure
* a bugfix for an endless loop inside kelondroTree iteration
* a bugfix for bulk read inside a kelondroTree iteration; the bug caused some elements to be iterated twice
* very strong speed enhancement for url/domain extraction

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2320 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-23 22:39:41 +00:00
orbiter
40aa735520 fixed timing problem that caused too long a delay during initialization of kelondroTree objects
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2288 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-11 23:44:44 +00:00
orbiter
92f4cb4d73 added option to configure the start-up delay time for kelondro database files.
The start-up delay is used to pre-load the database node cache.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2276 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-03 23:57:33 +00:00
theli
fb090652df *) use a more compact name for plasmaWordIndexAssortmentImporter.java because the long name
caused problems during untar operation

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2206 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-06-14 14:28:46 +00:00
orbiter
4a907a570f 1st step to migrate kelondroTree to usage of kelondroRow instead of byte[][]
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2162 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-31 23:31:46 +00:00
orbiter
5041d330ce refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2150 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-28 11:44:50 +00:00
orbiter
7b3b12888c refactoring: integrated indexContainer abstraction layer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2149 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-28 01:09:31 +00:00
orbiter
196b8abb30 refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2144 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-26 09:32:50 +00:00
orbiter
a930be4ba3 refactoring of index management:
generalized the index entry

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2121 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-19 23:19:20 +00:00
theli
9104001e7c *) Better error handling for assortment import
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2067 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-08 07:58:22 +00:00
orbiter
6c70f4a0cf renamed wordHashes (word hash set generation) to wordHashSet
This was done because the wordHashes iterator will get another integer
parameter and would then conflict with the wordHashes set generation

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1921 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-18 01:04:16 +00:00
hermens
4e9a8f41fd rwiDBCleaner + dbImporter: Iterate over small excerpts of
word hashes instead of the whole DB especially while changing
the DB in the process.
see http://www.yacy-forum.de/viewtopic.php?p=19136#19136



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1917 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-17 23:39:10 +00:00
hermens
474379ae63 remove TABs from plasmaDbImporter.java
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1916 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-17 21:52:36 +00:00
orbiter
3286b1f498 re-organisation of lurl-creation and -stacking
this was necessary to prevent useless writes to the database
when the url appears on the blacklist

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1905 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-17 10:16:07 +00:00
theli
fb4100d47b *) undoing last commit.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1856 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-08 07:59:32 +00:00
theli
a84cc71218 *) removing getTotalRuntime
- not needed anymore

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1855 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-08 07:37:21 +00:00
auron_x
dce08771d1 *) Fix for wrong estimated and elapsed times when import was paused
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1850 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-07 22:51:18 +00:00
hermens
b34713324a DBImport: remove words from source index even if nothing has been added to home index
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1849 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-07 22:19:39 +00:00
hermens
351bd0a678 *) dbImport: convert cacheSize to kb when creating plasma* objects
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1773 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-02-27 01:58:37 +00:00
theli
40dd6ec4fd *) experimental restructuring of db import function
- trying to reduce IO load by avoiding unnecessary db access
- trying to presort url list

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1671 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-02-16 13:07:01 +00:00
orbiter
7eb10675b3 re-organization of index management
this was done to be prepared for new storage algorithms


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1635 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-02-14 00:12:07 +00:00
orbiter
1e4578aab6 VERY EXPERIMENTAL removal of index ram cache flushing thread.
The cache will fill up and be flushed explicitly when it is full.
This shall remove double-access of assortments (indexing and flush)
during indexing process. Hopefully this should reduce IO.
The main idea is: the cache shall mainly be flushed by DHT transfer, and
only indexes that shall be hosted by the own peer are flushed to the
assortments. This needs further work.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1617 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-02-11 23:19:01 +00:00
orbiter
be77fe1a88 code clean-up
@Martin: please check why the variable assignments
         were in plasmaCrawlNURLImporter. As written, they were superfluous.
         You surely did not intend it that way.
         Should they perhaps be global variables?

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1529 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-02-04 15:25:48 +00:00
theli
50d85657b8 *) new import function for IndexImport_p.html
- can be used to import the crawling queue (noticeUrlDB + stacks)
   

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1518 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-02-02 16:46:58 +00:00
theli
442807cb29 *) Bugfix for last commit
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1506 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-01-31 15:26:11 +00:00
theli
22fd1ca9aa *) minor changes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1505 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-01-31 12:39:10 +00:00
theli
6a99304b2b *) Redesign of db import functionality
- restructuring to allow different import tasks to be controlled via one gui
- adding possibility to import a single assortment file
- adding possibility to set the cache size that should be used

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1504 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-01-31 12:30:24 +00:00