92 Commits

Author SHA1 Message Date
orbiter
bb426565f0 added new yacy protocol for mass url-pull for better remote crawling distribution
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4056 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-22 00:59:05 +00:00
orbiter
344911bfaa shorter minimum delay values for intranet crawl targets
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4047 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-15 23:18:12 +00:00
orbiter
69d640b041 added missing synchronization in crawl balancer
to keep the synchronization from being triggered by the frequently called size() operation,
a notEmpty method was added that avoids the synchronization in many cases

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4025 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-03 12:21:46 +00:00
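
One possible reading of the notEmpty change above, as a minimal sketch: the exact size() stays synchronized, while the emptiness check reads a volatile counter and never takes the lock. The class BalancerSketch and all of its members are hypothetical and are not taken from the YaCy sources.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not the actual crawl balancer implementation.
public class BalancerSketch {
    private final Deque<String> stack = new ArrayDeque<>();
    // Updated only inside synchronized methods, read without locking.
    private volatile int cachedSize = 0;

    public synchronized void push(String urlHash) {
        stack.addLast(urlHash);
        cachedSize = stack.size();
    }

    public synchronized String pop() {
        String urlHash = stack.pollFirst(); // null if the stack is empty
        cachedSize = stack.size();
        return urlHash;
    }

    // Exact size, requires the lock.
    public synchronized int size() {
        return stack.size();
    }

    // Cheap check for callers that only need to know whether work is left.
    public boolean notEmpty() {
        return cachedSize > 0;
    }
}
```

Callers that poll frequently can use notEmpty() and fall back to size() only when the exact number matters.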
orbiter
9628db6cdc enhanced memory allocation during database access:
- refactoring of kelondroRecords; this class is now divided into
  kelondroAbstractRecords, kelondroRecords, kelondroCachedRecords, kelondroHandle and kelondroNode
- better abstraction of kelondroNodes, such nodes may now be created by different classes
- a new Node defining class kelondroEcoRecords defines Nodes that do not need so much allocation and System.arraycopy
- there is less memory transfer on the bus, especially for collection index
- web index access now needs only half of the memory


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4024 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-03 11:44:58 +00:00
orbiter
71fd972ac0 - reduced default search time
- caught the case where the web structure cannot be painted because there is too little data
- better logging when balance fails


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3892 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-06-14 15:21:01 +00:00
orbiter
861f41e67e redesigned NURL-handling:
- the general NURL-index for all crawl stack types was split into separate indexes for these stacks
- the new NURL-index is managed by the crawl balancer
- the crawl balancer does not need an internal index any more, it is replaced by the NURL-index
- the NURL.Entry was generalized and is now a new class plasmaCrawlEntry
- the new class plasmaCrawlEntry also replaces the preNURL.Entry class, and will also replace the switchboardEntry class in the future
- the new class plasmaCrawlEntry is more accurate for date entries (holds milliseconds) and can contain larger 'name' entries (anchor tag names)
- the EURL object was replaced by a new ZURL object, which is a container for the plasmaCrawlEntry and some tracking information
- the EURL index is now filled with ZURL objects
- a new index delegatedURL holds ZURL objects about plasmaCrawlEntry objects to track which url is handed over to other peers
- redesigned handling of plasmaCrawlEntry - handover, because there is no need any more to convert one entry object into another
- found and fixed numerous bugs in the context of crawl state handling
- fixed a serious bug in kelondroCache which prevented entries from being removed
- fixed some bugs in online interface and adapted monitor output to new entry objects
- adapted yacy protocol to handle new delegatedURL entries
all old crawl queues will disappear after this update!

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3483 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-16 13:25:56 +00:00
orbiter
581db87237 more debug code for
http://www.yacy-forum.de/viewtopic.php?p=33009#33009

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3479 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-14 15:04:06 +00:00
orbiter
81c4cc6bf7 better debugging of balancer failure
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3478 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-14 12:02:56 +00:00
orbiter
6faa262259 fix for NURL-fix
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3465 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-09 14:30:53 +00:00
orbiter
243a2f831b fixed problem with not found NURL-hashes
The cause for this problem could still not be found, but the effect
is handled much better. The NURL-pop will continue automatically until
it finds a hash that can be found.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3458 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-09 11:07:20 +00:00
orbiter
6ad39bae1e fixed shutdown problem
this fixes the 'inconsistency' messages during start-up

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3457 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-09 08:48:47 +00:00
orbiter
d755a8026d - better OOM protection
- better memory allocation for FlexTable indexes
- splitting between static index and dynamic index (only the dynamic part must grow)
- to enable a merge-iteration of the new split index, a huge number of classes needed to be adapted for new iterator classes
- added new iterator classes that support cloneable iterators
- adapted all iterator classes to implement cloneable iterators

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3453 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-08 16:15:40 +00:00
orbiter
4e8eb1dbe3 some minor changes here and there
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3441 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-07 14:22:10 +00:00
orbiter
1cba31de43 redesigned ram organization for database caches
- each cache can now allocate as much memory as is available
- no more fixed limits
- replaced old performance memory monitor by new one
- added supervision methods as static functions into the classes that provide cache functionality
- steering of ram allocation is done with two simple limits that are ram availability-relative


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3434 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-06 22:43:32 +00:00
orbiter
f7803a6ce4 enhanced crawl balancer
- new domains now get a chance to get crawled early
- less IO operations
- new balancing method
- better dump order at shutdown time
- bugfixes regarding not found url hashes (no more superfluous cache kill)
- domain access time is now shared over all balancer stacks
- viewing the stack no longer disturbs the balancing algorithm that much
- intelligent selection of best next domain using domain access times
- extra double-check (to double-check the double-check)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3384 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-02-21 16:23:31 +00:00
orbiter
dc0c06e43d PLEASE MAKE A BACK-UP OF YOUR COMPLETE DATA DIRECTORY BEFORE USING THIS
redesign for better IO performance
enhanced database seek-time by avoiding write operations at distant
positions of a database file. until now, a USEDC counter was written
at the head-section of a kelondroRecords database file (which is the
basic data structure of all kelondro database files) to store the
actual number of records that are contained in the database. Now, this
value is computed from the database file size. This is either done
only once at start-time, or continuously when run with asserts enabled.
The counter is then updated only in RAM, and written at close of the
file. If the close fails, the correct number can be computed from the
file size, and if this is not equal to the stored number, that is strong
evidence that YaCy was not shut down properly.
To preserve consistency, the complete storage-routine had to be re-written.
Another change enhances read of nodes in some cases, where the data-tail
can be read together with the data-head. This saves another IO lookup during
each DB node fetch.
Also includes many small bugfixes.
IF ANYTHING GOES WRONG, ALL YOUR DATA IS LOST: PLEASE MAKE A BACK-UP

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3375 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-02-20 08:35:51 +00:00
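
The idea of deriving the record count from the file size can be sketched as follows. This is a simplified illustration that assumes a fixed-size header followed by fixed-size records and ignores any free or deleted records; the class RecordCountSketch, its method names, and the example sizes are hypothetical and do not reflect the real kelondroRecords file layout.

```java
// Hypothetical sketch, not the actual kelondroRecords code.
public class RecordCountSketch {

    // Number of records implied by the file length alone.
    static long countFromFileSize(long fileLength, long headerSize, long recordSize) {
        if (recordSize <= 0) {
            throw new IllegalArgumentException("recordSize must be positive");
        }
        long payload = fileLength - headerSize;
        if (payload < 0) {
            throw new IllegalArgumentException("file is shorter than its header");
        }
        return payload / recordSize;
    }

    // A stored counter that disagrees with the computed one hints at an
    // unclean shutdown, because the counter is only written on close.
    static boolean looksCleanlyClosed(long storedCount, long fileLength,
                                      long headerSize, long recordSize) {
        return storedCount == countFromFileSize(fileLength, headerSize, recordSize);
    }
}
```

With this approach the counter in the file head only has to be written once when the file is closed, so normal writes no longer need a seek back to the start of the file.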
orbiter
8c1d2e0227 protection against crawl balancer failure:
a minimum of 500 milliseconds distance between two accesses
to the same domain is now ensured

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3354 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-02-09 09:48:23 +00:00
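
A minimal sketch of such a per-domain minimum delay, assuming the last access time of each domain is kept in a map. Only the 500 millisecond value comes from the commit message; the class DomainDelaySketch and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual crawl balancer code.
public class DomainDelaySketch {
    static final long MIN_DELTA_MS = 500; // value named in the commit message

    private final Map<String, Long> lastAccess = new HashMap<>();

    // Milliseconds the caller still has to wait before the domain
    // may be accessed again; 0 if it may be accessed right away.
    public long remainingWait(String domain, long nowMs) {
        Long last = lastAccess.get(domain);
        if (last == null) {
            return 0; // domain was never accessed
        }
        long elapsed = nowMs - last;
        return Math.max(0, MIN_DELTA_MS - elapsed);
    }

    // Remember when the domain was accessed.
    public void recordAccess(String domain, long nowMs) {
        lastAccess.put(domain, nowMs);
    }
}
```

A caller would check remainingWait(domain, System.currentTimeMillis()) before each fetch, sleep for the returned time if it is positive, and then call recordAccess.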
orbiter
773ba1e91a - generalized object order handling
- controlled object order for all database tables
- migrated DHT position computation to correct base64-decoded values
  this also closed the 'gaps' in the dht positions

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3049 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-06 03:02:57 +00:00
orbiter
052f28312a removed assortments from indexing data structures
removed options to switch on assortments

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3041 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-02 19:34:59 +00:00
orbiter
0b9370a9dc fix for http://www.yacy-forum.de/viewtopic.php?p=28108#28108
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3013 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-25 23:33:28 +00:00
orbiter
30888e7a2f implementation of search constraints
Such constraints may formulate specific restrictions to web searches
This is implemented by scraping information for constraints from a web
page during parsing, and storing flags to the pages within the web index.

In this first step, only information for index pages ("index of", directory listings)
is scraped and stored in flags
- added new flag class kelondroBitfield
- added scraper method in condenser
- added bitfield structure for all scrape types (see also condenser)
- added bitfield structure for appearance locations (see RWIEntry)
- added handover protocol for remote search and index distribution
- extended kelondroColumn class to hold bitfield types
- added another search attribute on search page (index.html)
- extended search-filter to enable filtering of non-matching constraints
- set all new database types to be default
- refactoring: moved word hash generation to condenser class

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2999 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-23 02:16:30 +00:00
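
The flag-based constraint filtering described above might look roughly like this. The sketch uses java.util.BitSet as a stand-in for kelondroBitfield, whose API is not shown in this log; the flag position, the title-based detection rule, and the class ConstraintSketch are assumptions made for illustration.

```java
import java.util.BitSet;
import java.util.Locale;

// Hypothetical sketch; BitSet stands in for kelondroBitfield.
public class ConstraintSketch {
    static final int FLAG_INDEX_OF = 0; // assumed bit position

    // Scrape flags from a page title during parsing.
    static BitSet scrapeFlags(String title) {
        BitSet flags = new BitSet();
        if (title.toLowerCase(Locale.ROOT).startsWith("index of")) {
            flags.set(FLAG_INDEX_OF);
        }
        return flags;
    }

    // A page matches if it carries every flag the constraint asks for.
    static boolean matches(BitSet pageFlags, BitSet constraint) {
        BitSet missing = (BitSet) constraint.clone();
        missing.andNot(pageFlags); // keep only required flags the page lacks
        return missing.isEmpty();
    }
}
```

Because the flags are stored alongside each index entry, a search filter can drop non-matching results without loading the page again.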
orbiter
497428c8ec refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2949 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-10 01:13:33 +00:00
orbiter
76fceb9997 refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2945 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-09 16:32:34 +00:00
orbiter
bb7d4b5d5e refactoring to prepare new RWI entry object
- moved all url and index(RWI) entries to index package
- better naming to distinguish RWI entries and URL entries


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2937 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-08 16:17:47 +00:00
orbiter
1751a799ac - deactivated all write buffers
- fixed a storage bug


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2933 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-07 10:56:36 +00:00
orbiter
147d88cf23 re-design of database caching
this should reduce IO a lot, because write caches are now activated for all databases
- added new caching class that combines a read- and write-cache.
- removed old read and write cache classes
- removed superfluous RAM index (can be replaced by kelondroRowSet)
- adapted all current classes that used the old caching methods
- more asserts, more bugfixes


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2865 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-26 13:50:50 +00:00
orbiter
2a9d868f6d - removed object cache from kelondroTree
- generalized object caching and added new object caching class
- added object caching wherever kelondroTree was used
- added object caching also to usage of kelondroFlex
- added object buffering (a write cache) to NURLs
- added many assert statements; fixed bugs here and there
- added missing close methods to latest added classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2858 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-24 13:48:16 +00:00
orbiter
06854988da - full integration of new LURL database in INDEX
- added migration method for urlHash.db into INDEX

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2819 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-19 21:14:37 +00:00
karlchenofhell
ebf0da2a45 - now the fix http://www.yacy-forum.de/viewtopic.php?t=2974 works
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2796 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-18 12:07:17 +00:00
orbiter
df1629b05a - code cleanup
- version 0.471
- moved surftipps to own web page


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2676 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-29 22:27:20 +00:00
theli
26dfbb7499 *) Bugfix for UTF-8: url names are now stored properly in stackcrawl, crawler, indexing queue and should be displayed correctly in the GUI
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2630 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-19 05:19:41 +00:00
orbiter
4866868c0e added write cache for LURLs
This was necessary to speed up the index receive process during global search


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2498 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-07 01:13:03 +00:00
orbiter
b7f4a1521b added options to switch on or off the kelondroFlexTable for NURL, EURL and PreNURL
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2456 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-24 22:21:22 +00:00
orbiter
c26da4893b reverted NURL to usage of kelondroTree, kelondroFlexTable still has problems with deleted entries
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2454 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-24 10:03:38 +00:00
orbiter
db1eae0227 * simplified initialization of database objects
* replaced kelondroTree for NURLs by kelondroFlex
* replaced kelondroTree for EURLs by kelondroFlex
take care, may be very buggy
please finish crawls before updating. crawls will be lost.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2452 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-24 02:19:25 +00:00
orbiter
6ad471ef96 * applied many compiler warning recommendations
* cleaned up code
* added unit test code
* migrated ranking RCI computation to kelondroFlex and kelondroCollectionIndex


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2414 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-16 19:49:31 +00:00
orbiter
cd5f7e137c fixed problem with NURL-generation upon first startup
(a new kelondroFlexTable was generated, which should not happen)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2402 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-12 23:24:50 +00:00
orbiter
9ae9062bd3 * disabled new kelondroFlex table for NURLs
* added new RAM index Class
* fixed possible synchronization problem in kelondroRecords


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2388 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-12 00:58:43 +00:00
orbiter
689bbcf9cd replaced kelondroTree db for NURLs by new kelondroFlexTable
The new database is only created if the old is deleted or does not exist

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2387 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-11 23:36:58 +00:00
orbiter
130e6d4719 generalized index object for eurl, nurl and lurl to prepare move
of these tables to new kelondroFlexTable Object

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2382 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-11 17:37:54 +00:00
orbiter
740d49751d * strict type and size check in kelondroRow handling
* adapted all code to use the declaration form of kelondroRow
* fixed a bug in kelondroRow which caused wrong parsing of encoding type
* the bug caused bad database behaviour in new indexCollection data structure.
  because of this bug, all existing test databases are now void. A new database is created
* the kelondroFlexTable and indexCollection data structures now store a declaration of the row definition
  into a properties file along the database files.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2375 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-08-11 03:20:44 +00:00
orbiter
ad692fc6c7 implemented option to extract nurls from the database
(plus some iteration enhancements for nurls)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2325 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-24 16:40:59 +00:00
orbiter
7fd90ca7c8 * strict handling of NURL entry element generation, storage and stacking
* more space for EURL reason strings (you must delete the EURL db to use this)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2324 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-24 16:04:14 +00:00
orbiter
5f72be2a95 some redesign of EURL storage
* store() is now called explicitly
* more urls are written to the EURL table
* the EURL stack does not store the complete entry any more, now only the URL hash


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2323 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-24 15:25:47 +00:00
orbiter
5214f571cd simplified method call in balancer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2303 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-19 00:42:20 +00:00
orbiter
3879a0ecd0 replaced java.net.URL usage by use of new class de.anomic.net.URL
This shall be seen as an experiment to exclude all cases where
there could be a DNS lookup during URL comparison.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2290 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-13 01:21:53 +00:00
orbiter
92f4cb4d73 added option to configure the start-up delay time for kelondro database files.
the start-up delay is used to pre-load the database node cache

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2276 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-03 23:57:33 +00:00
orbiter
c36e9fc8d3 full integration of kelondroRow
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2167 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-06-02 12:45:57 +00:00
orbiter
4a907a570f 1st step to migrate kelondroTree to usage of kelondroRow instead of byte[][]
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2162 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-31 23:31:46 +00:00
orbiter
3c3c047d0a integrated kelondroRow into kelondroStack
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2156 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-05-30 15:28:05 +00:00