Commit Graph

56 Commits

Author SHA1 Message Date
orbiter
daf0f74361 joined anomic.net.URL, plasmaURL and url hash computation:
search profiling showed, that a major amount of time is wasted by computing url hashes. The computation does an intranet-check, which needs a DNS lookup. This caused that each urlhash computation needed 100-200 milliseconds, which caused remote searches to delay at least 1 second more that necessary. The solution to this problem is to attach a URL hash to the URL data structure, because that means that the url hash value can be filled after retrieval of the URL from the database. The redesign of the url/urlhash management caused a major redesign of many parts of the software. Since some parts had been decided to be given up they had been removed during this change to avoid unnecessary maintenance of unused code.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4074 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-05 09:01:35 +00:00
orbiter
bb426565f0 added new yacy protocol for mass url-pull for better remote crawling distribution
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4056 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-22 00:59:05 +00:00
orbiter
139c59ebbd - fixed dht selction problem: the seed tables used a wrong ordering
- cleaned some code

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3693 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-05-09 17:59:36 +00:00
theli
6f46245a51 *) Bookmarks: Ajax icon is displayed while loading title
*) First version of a sitemap parser added
   - currently only autodetection of sitemap files is supported
*) DB-Import restructured
   - pause/resume should work again now


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3666 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-05-06 09:52:04 +00:00
orbiter
5c3afb3202 added option to configure a path to a secondary index location.
this shall be used to store a fragment of the index on another physical device,
to split IO load and enhance access speed. The index is splitted in such a way
that the LURLs are stored to the secondary location, and the RWIs to the primary
location. This is especially useful for environments where symbolic links are
not possible and may cause IO access even if there is no write access to the
device which hosts the symbolic link.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3519 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-24 15:28:17 +00:00
orbiter
2cb16824e3 removed support for old database structures.
The new collection index will be more generalized to support other indexes
i.e. YBR block-rank computation. A clean-up of the many conditions to support
the old database was necessary.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3506 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-21 15:35:35 +00:00
orbiter
861f41e67e redesigned NURL-handling:
- the general NURL-index for all crawl stack types was splitted into separate indexes for these stacks
- the new NURL-index is managed by the crawl balancer
- the crawl balancer does not need an internal index any more, it is replaced by the NURL-index
- the NURL.Entry was generalized and is now a new class plasmaCrawlEntry
- the new class plasmaCrawlEntry replaces also the preNURL.Entry class, and will also replace the switchboardEntry class in the future
- the new class plasmaCrawlEntry is more accurate for date entries (holds milliseconds) and can contain larger 'name' entries (anchor tag names)
- the EURL object was replaced by a new ZURL object, which is a container for the plasmaCrawlEntry and some tracking information
- the EURL index is now filled with ZURL objects
- a new index delegatedURL holds ZURL objects about plasmaCrawlEntry obects to track which url is handed over to other peers
- redesigned handling of plasmaCrawlEntry - handover, because there is no need any more to convert one entry object into another
- found and fixed numerous bugs in the context of crawl state handling
- fixed a serious bug in kelondroCache which caused that entries could not be removed
- fixed some bugs in online interface and adopted monitor output to new entry objects
- adopted yacy protocol to handle new delegatedURL entries
all old crawl queues will disappear after this update!

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3483 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-16 13:25:56 +00:00
orbiter
d755a8026d - better OOM protection
- better memory allocation for FlexTable indexes
- splitting between static index and dynamic index (only the dynamic part must grow)
- to enable a merge-iteration of new splittet index, a huge number of classes needed to be adopted for new iterator classes
- added new iterator classes that support cloneable iterators
- adopted all iterator classes to implement cloneable itarators

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3453 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-08 16:15:40 +00:00
orbiter
1cba31de43 redesigned ram organization for database caches
- each cache can now allocate as much memory as is available
- no more fixed limits
- replaced old performance memory monitor by new one
- added supervision methods as static functions into the classes that provide cache functionality
- steering of ram allocation is done with two simple limits that are ram availability-relative


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3434 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-06 22:43:32 +00:00
orbiter
f7803a6ce4 enhanced crawl balancer
- new domains now get a chance to get crawled early
- less IO operations
- new balancing method
- better dump order at shutdown time
- bugfixes regarding not found url hashes (no more superfluous cache kill)
- domain access time is now shared over all balancer stacks
- viewing the stack does no more disturbish the balancing algorithm that much
- intelligent selection of best next domain using domain access times
- extra double-check (to double-check the double-check)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3384 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-02-21 16:23:31 +00:00
orbiter
93a5ace330 fix for http://www.yacy-forum.de/viewtopic.php?p=28544#28544
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3056 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-08 12:57:17 +00:00
orbiter
109ed0a0bb - cleaned up code; removed methods to write the old data structures
- added an assortment importer. the old database structures can
  be imported with
  java -classpath classes yacy -migrateassortments
- modified wordmigration. The indexes from WORDS are now imported
  to the collection database. The call is
  java -classpath classes yacy -migratewords
  (as it was)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3044 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-05 02:47:51 +00:00
orbiter
052f28312a removed assortments from indexing data structures
removed options to switch on assortments

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3041 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-02 19:34:59 +00:00
orbiter
bb7d4b5d5e refactoring to prepare new RWI entry object
- moved all url and index(RWI) entries to index package
- better naming to distinguish RWI entries and URL entries


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2937 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-08 16:17:47 +00:00
orbiter
8fdefd5c68 generalization of payload definition of index storage
this is one step forward to the migration to a new collection data format

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2912 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-05 02:10:40 +00:00
orbiter
2a9d868f6d - removed object cache from kelondroTree
- generalized object caching and added new object caching class
- added object caching wherever kelondroTree was used
- added object caching also to usage of kelondroFlex
- added object buffering (a write cache) to NURLs
- added many assert statements; fixed bugs here and there
- added missing close methods to latest added classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2858 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-24 13:48:16 +00:00
orbiter
06854988da - full integration of new LURL database in INDEX
- added migration method for urlHash.db into INDEX

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2819 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-19 21:14:37 +00:00
orbiter
77a59a115d refactoring of indexing methods
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2787 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-16 15:04:16 +00:00
orbiter
a5dd0d41af - refactoring of plasmaCrawlLURL.Entry to prepare new Entry format
- added test migration method to migrate the old LURL to a new LURL
the new LURL will be splitted into different tables for each month
this solves several problems:
- the biggest table in YaCy is splitted in different parts and can
  also be managed in filesystems that are limited to 2GB
- the oldest entries can easily be identified, used for re-crawl und
  deleted
- The complete database can be limited to a specific size (as wanted many times)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2755 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-12 23:14:41 +00:00
orbiter
f453c14b5d removed unreacheable catch blocks and unused imports
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2619 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-18 11:23:58 +00:00
theli
b298474e22 *) Bugfix needed because of changed plasmaCrawlLURL.load behavior
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2518 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-08 11:33:27 +00:00
orbiter
4866868c0e added write cache for LURLs
This was necessary to speed up the index receive process during global search


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2498 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-07 01:13:03 +00:00
orbiter
b7f4a1521b added options to switch on or off the kelondroFlexTable for NURL, EURL and PreNURL
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2456 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-24 22:21:22 +00:00
orbiter
314021453f * more logging
* option in yacy.init to set useCollectionIndex usage

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2374 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-10 21:21:50 +00:00
orbiter
279b1d969d Integrated new indexing data structure 'collections' into the main class
for indexing, the plasmaWordIndex.

The new data structure is ready-to-use, but currently disabled.
It can be activated by setting the static
plasmaWordIndex.useCollectionIndex
to true. This shall be done for testing purpose.

The new index is stored to
DATA/INDEX/PUBLIC/TEXT
The directory PLASMA shall be used only for crawler in the future.

Attention: during testing the data structure in INDEX may change,
and created indexes with the new data structure may get useless.


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2348 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-05 22:22:14 +00:00
orbiter
c4e922885a replaced indexURLEntry by new class that uses a kelondroRow.Entry object
to store the index entry. This is another step to move to the new database structure.
A side effect of this change is, that index storage uses much less RAM space,
which affects the index RAM cache.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2341 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-02 19:59:28 +00:00
orbiter
417ed5102e redesign of database iterators:
an iteration of key elements in kelondroTree databases is no longer supported.
this is now replaced by an iteration of kelondroRow.Entry objects from the database
Iteration of keys from the database was mostly followed by retrieval of the row
from the database, whcih caused unnecessary database load.
The index selection was also redesigned to use the new row iteration methods.
This affects many funktions, most important is the DHT selection routine which is now much faster.



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2327 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-26 11:21:51 +00:00
orbiter
ad692fc6c7 implemented option to extract nurls from the database
(plus some iteration enhancements for nurls)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2325 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-24 16:40:59 +00:00
orbiter
7fd90ca7c8 * strict handling of NURL entry element generation, storage and stacking
* more space for EURL reason strings (you must delete the EURL db to use this)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2324 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-24 16:04:14 +00:00
orbiter
58df8b7bbf a large collection of different changes
* mainly for the transition to the new indexing database structure
* a bugfix for an endless loop inside kelondroTree iteration
* a bugfix for bulk read inside a kelondroTree iteration; the bug caused that some elements had been iterated twice
* very strong speed enhancement for url/domain extraction

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2320 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-23 22:39:41 +00:00
orbiter
40aa735520 fixe timing problem causing too long delay during initialization of kelondroTree objects
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2288 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-11 23:44:44 +00:00
orbiter
92f4cb4d73 added option to configure the start-up delay time for kelondro database files.
the start-up delay is used to pre-load the database node cache

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2276 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-03 23:57:33 +00:00
theli
fb090652df *) use a more compact for plasmaWordIndexAssortmentImporter.java because the long name
caused problems during untar operation

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2206 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-06-14 14:28:46 +00:00
orbiter
4a907a570f 1st step to migrate kelondroTree to usage of kelondroRow instead of byte[][]
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2162 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-31 23:31:46 +00:00
orbiter
5041d330ce refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2150 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-28 11:44:50 +00:00
orbiter
7b3b12888c refactoring: integrated indexContainer abstraction layer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2149 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-28 01:09:31 +00:00
orbiter
196b8abb30 refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2144 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-26 09:32:50 +00:00
orbiter
a930be4ba3 refactoring of index management:
generalized the index entry

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2121 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-19 23:19:20 +00:00
theli
9104001e7c *) Better error handling for assortment import
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2067 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-08 07:58:22 +00:00
orbiter
6c70f4a0cf renamed wordHashes for a word hash set generation to wordHashSet
This was done because the wordHashes iterator will get another integer
parameter and then conflicts with the wordHashes set generation

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1921 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-18 01:04:16 +00:00
hermens
4e9a8f41fd rwiDBCleaner + dbImporter: Iterate over small excerpts of
word hashes instead of the whole DB especially while changing
the DB in the process.
see http://www.yacy-forum.de/viewtopic.php?p=19136#19136



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1917 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-17 23:39:10 +00:00
hermens
474379ae63 remove TABs from plasmaDbImporter.java
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1916 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-17 21:52:36 +00:00
orbiter
3286b1f498 re-organisation of lurl-creation and -stacking
this was necessary to prevent useless write to the database
in case of blacklist appearance of the url

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1905 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-17 10:16:07 +00:00
theli
fb4100d47b *) undoing last commit.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1856 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-08 07:59:32 +00:00
theli
a84cc71218 *) removing getTotalRuntime
- not needed anymore

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1855 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-08 07:37:21 +00:00
auron_x
dce08771d1 *) Fix for wrong estimated and elapsed times when import was paused
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1850 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-07 22:51:18 +00:00
hermens
b34713324a DBImport: remove words from source index even if nothing has been added to home index
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1849 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-07 22:19:39 +00:00
hermens
351bd0a678 *) dbImport: convert cacheSize to kb when creating plasma* objects
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1773 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-02-27 01:58:37 +00:00
theli
40dd6ec4fd *) experimental restructuring of db import function
- trying to reduce IO load by avoiding  unnecessary db access
   - trying to presort url list

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1671 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-02-16 13:07:01 +00:00
orbiter
7eb10675b3 re-organization of index management
this was done to be prepared for new storage algorithms


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1635 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-02-14 00:12:07 +00:00