Commit Graph

355 Commits

Author SHA1 Message Date
orbiter
c2359f20dd refactoring: better abstraction of reference and metadata prototypes.
This is a preparation to introduce other index tables as used now only for reverse text indexes. Next application of the reverse index is a citation index.
Moved to version 0.74

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5777 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 13:23:45 +00:00
orbiter
a9cea419ef Integration of the new index data structure IndexCell
This is the start of a testing phase for IndexCell data structure which will replace
the collections and caching strategy. IndexCall creation and maintenance is fast, has
no caching overhead, very low IO load and is the basis for the next data structure,
index segments.

IndexCell files are stored at DATA/<network>/TEXT/RICELL
With this commit still the old data structures are used, until a flag in yacy.conf is set.
To switch to the new data structure, set
useCell = true
in yacy.conf. Then you will have no access any more to TEXT/RICACHE and TEXT/RICOLLECTION

This code is still bleeding-edge development. Please do not use the new data structure for
production now. Future versions may have changed data types, or other storage locations.
The next main release will have a migration feature for old data structures.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5724 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-17 13:03:27 +00:00
orbiter
83792d9233 more refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5722 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-16 16:24:53 +00:00
orbiter
7f67238f8b refactoring of plasmaWordIndex: less methods in the class, separated the index to CachedIndexCollection
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5710 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-13 14:56:25 +00:00
orbiter
60078cf322 added next tool for url analysis: check for references, that occur in the URL-DB but not in the RICOLLECTIONS
to use this, you must user the -incollection command before (see SVN 5687) and you need a 
used.dump file that has been produced with that process.

Now you can use that file, to do a URL-hash compare with the urls in the URL-DB. To do that, execute
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -diffurlcol DATA/INDEX/freeworld/TEXT used.dump diffurlcol.dump
or use different names for the dump files or more memory.

As a result, you get the file diffurlcol.dump which contains all the url hashes that occur in the URL database, but not in the collections.
The file has the format
{hash-12}*
that means: 12 byte long hashes are listed without any separation.

The next step could be to process this file and delete all these URLs with the computed hashes, or to export them before deletion.



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5692 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-10 13:38:40 +00:00
orbiter
efcd95dc37 simplification of (internal) query process / refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5671 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-06 15:53:20 +00:00
orbiter
aa44d9bad9 more refactoring of kelondro.text / deleted de.anomic.index
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5664 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-02 11:04:13 +00:00
orbiter
6ffc6e3389 more refactoring of indexer and kelondro classes;
- integrating the indexer into kelondro as package 'text'
- renaming of classes in kelondro.index

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5663 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-02 10:00:32 +00:00
orbiter
76ef5f0f14 refactoring of index package: better names for the classes (to be continued)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5661 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-01 23:58:14 +00:00
orbiter
ef62ec635e removed overwriting of logging config
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5629 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-21 00:43:38 +00:00
hermens
02dfd6183b Fix logging in serverCore
Prevent NPEs from keeping stopped Sessions in the pool and blocking slots



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5625 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-20 18:54:01 +00:00
orbiter
c12bb8a6d0 - refactoring of the http client
- added a protection against memory leaks for the access tracker

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5621 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-19 16:24:46 +00:00
orbiter
c25c334b75 replaced old DHT transmission method with new method. Many things have changed! some of them:
- after a index selection is made, the index is splitted into its vertical components
- from differrent index selctions the splitted components can be accumulated before they are placed into the transmission queue
- each splitted chunk gets its own transmission thread
- multiple transmission threads are started concurrently
- the process can be monitored with the blocking queue servlet
To implement that, a new package de.anomic.yacy.dht was created. Some old files have been removed.
The new index distribution model using a vertical DHT was implemented. An abstraction of this model
is implemented in the new dht package as interface. The freeworld network has now a configuration
of two vertial partitions; sixteen partitions are planned and will be configured if the process is bug-free.
This modification has three main targets:
- enhance the DHT transmission speed
- with a vertical DHT, a search will speed up. With two partitions, two times. With sixteen, sixteen times.
- the vertical DHT will apply a semi-dht for URLs, and peers will receive a fraction of the overall URLs they received before.
  with two partitions, the fractions will be halve. With sixteen partitions, a 1/16 of the previous number of URLs.
BE CAREFULL, THIS IS A MAJOR CODE CHANGE, POSSIBLY FULL OF BUGS AND HARMFUL THINGS.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5586 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-10 00:06:59 +00:00
orbiter
9d282d2c16 - renamed interactivesearch to yacyinteractive
- added a configuration option to set the pop up page in Config Appearance
- added a minimized header option to yacyinteractive
- fixed a bug in yacysearch: default values when no query is done


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5569 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-03 13:04:02 +00:00
orbiter
180fe81ef7 quick hack to copy new log configuration over old one
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5565 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-02 14:49:46 +00:00
orbiter
94110df85a moved logging partially to kelondro
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5545 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-31 01:06:56 +00:00
orbiter
024da2916b refactoring of logging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5544 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 23:33:47 +00:00
orbiter
83ce65707a (almost) completed partition of classes in kelondro
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5543 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 22:44:20 +00:00
orbiter
7ee494fde5 more refactoring of kelondro:
- seperated BLOB from table classes
- renamed 'coding' package to 'order'

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5542 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 22:08:08 +00:00
orbiter
bf93767ec6 refactoring of kelondro database classes
(to be continued)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5540 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 15:33:00 +00:00
orbiter
fc27bf8c4c refactoring of kelondro classes:
kelondro shall become independent from other packages.
moved bytebuffer, date and memory to kelondro

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5539 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 14:48:11 +00:00
orbiter
419469ac27 added more methods to control the vertical DHT (not yet active .. )
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5514 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-23 15:32:27 +00:00
orbiter
243e73f53b removed unnecessary usage of kelondroBLOBTree
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5397 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-12-18 00:18:37 +00:00
orbiter
b0f2003792 fast database initialization and fast start.up of yacy:
- applied knowledge about concurrent files stream reading and index processing from the wikimedia reader
   to the EcoTable initialization process: the file reader is now concurrent to the index generation
- changed also some initialization processes to avoid some pauses during initialization

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5354 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-21 23:21:33 +00:00
orbiter
867d0f2f56 removed some unnecessary pause delays
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5346 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-14 23:36:33 +00:00
orbiter
3f746be5d4 - consolidation and refactoring of many DHT target - computing methods
- implemented vertical DHT acceptance ("my own DHT") to accept new targets
- added new target computation for global search: addresses vertical targets also
- enhanced remote crawling: collection of remote crawl urls if queue has less than 100 entries (was: 0 entries)
- better performance value computations for PPM selection in network configuration

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5319 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-06 10:07:53 +00:00
orbiter
22989d0d8a added property index.storeCommons to switch commons storage on or off
with index.storeCommons=false all currently stored commons are deleted!
Default is now 'true', but in future full releases it will be switched to 'false'

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5315 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-02 23:30:09 +00:00
orbiter
d09ddabd09 corrected a design mistake (5-byte hashes not necessary)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5119 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-09-04 21:28:00 +00:00
orbiter
536e77e8b7 modifications towards a single database operation to read/write http header and cached file at once:
- removed distinction between header file types for http and ftp; ftp is simulated by using http properties
- removed all old resourceInfo classes that handled this distinction
- introduced a new distinction between http request and http response objects
- unified new response objects with two other object types that had been introduced elsewhere
- changed all servlet call methods to use the new http request header object type
- divided static object keys for http header properties into request and response types
- refactoring here and there (a large number of type changes and many methods merged/moved)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5079 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-25 18:11:47 +00:00
danielr
753a1ae430 - changed default browser from netscape to firefox
- fixed "Inefficient use of keySet iterator instead of entrySet iterator" [WMI_WRONG_MAP_ITERATOR, FindBugs]
- fixed some possible null pointer accesses


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5063 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-20 07:54:56 +00:00
danielr
be28af50f5 - fixed "yacy2yacy no proxy"-problem
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5058 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-17 10:16:32 +00:00
danielr
621b473b18 * removed some warnings of findbugs (http://findbugs.sf.net)
- removed unnecessary code (unused variables, String.toString)
- corrected some calculations (cast int to double or long ;)
- improved little performance (using Integer.valueOf() instead of new Integer)
- log if some File-actions fail (mkdir(), delete(), ...) and some ignored exceptions
- finalized some (more) fields
- finally close some streams
- made inner classes static if not using environment
- generalized some equals (from specificClass to Object)
- fixed some potential nullpointer accesses


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5039 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-06 19:43:12 +00:00
danielr
17b7845eb5 * refactoring
- moved constants from plasmaSwitchboard to own class (all 232 ;)
- moved remoteProxy-Methods to httpRemoteProxyConfig, better names
- removed some unnecessary code (else-statements)
* formatting (correct indentation)
* minor bugfixes (due to findbugs.sf.net)
* hopefully fixed "missing quote" (announcing StringParts as UTF-8)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5031 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-02 13:57:00 +00:00
danielr
3bb870bfcd added final where possible
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5030 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-02 12:12:04 +00:00
orbiter
c3d461d191 - removed superfluous copyright statement
- updated my email address

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5011 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-07-20 17:14:51 +00:00
lotus
62afea0c9f some improvements for yacyTray
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5008 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-07-18 14:17:52 +00:00
orbiter
4acf0a61cd refactoring of kelondroObjects (mainly renaming to kelondroMap)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4982 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-07-10 22:08:16 +00:00
orbiter
e81be7d4f2 added many missing user-agent declarations for yacy http client connections.
the most important fix was the addition of the yacybot user-agent for robots.txt loading,
because web masters look for that access to see if the crawler behaves correctly.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4968 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-07-04 11:03:03 +00:00
orbiter
31efb8fbee - fix for LOG path generation when the DATA/LOG does not exists (fix for bug introduced in SVN 4923)
- some more/better asserts
- slight performance enhancements in remove method in index management. Works for all who do not run using asserts (the majority)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4928 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-06-14 22:51:47 +00:00
danielr
726218dd4a fixed logging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4926 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-06-14 14:10:53 +00:00
danielr
68c38c2d34 - WatchCrawler shows status without JavaScript
- Performance can be scaled + DHT-profile
- names for pool-threads
- some small refactorings


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4923 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-06-14 10:24:58 +00:00
orbiter
3330181aa0 refactoring:
find a better way to store BLOBs; generalize current BLOG data structure (kelondroDyn)
and prepare it to replace it with something better. The best candidate is the kelondroHeap,
which will become the kelondroBLOBHeap;
removed also some never-used classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4902 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-06-07 23:12:24 +00:00
danielr
7feae906aa - organize imports
- removed potential null pointer accesses
- removed unnecessary casts


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4893 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-06-06 16:01:27 +00:00
orbiter
c10eaf9bdb - fix for pop-up page upon first start
- added comments in opensearchdescription to explain fast mode
- release 0.59

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4890 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-06-05 20:46:31 +00:00
orbiter
9bef20b537 - added cleanup for unused server loggings: they are removed after the client had not been seen since one hour
- removed configBasic popup trigger when no password is set

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4875 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-06-02 21:49:59 +00:00
orbiter
6f1a3fce05 BF Bugfix
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4869 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-05-29 17:27:38 +00:00
orbiter
4229cd275c fixed several details about network switching, default password, random password and localhost authentification
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4830 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-05-20 09:29:01 +00:00
orbiter
fbb712c669 refactoring:
moved importer classes to crawler and plasma package

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4770 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-05-06 13:44:38 +00:00
orbiter
d2ba1fd2ab major step forward to network switching (target is easy switch to intranet or other networks .. and back)
This change is inspired by the need to see a network connected to the index it creates in a indexing team.
It is not possible to divide the network and the index. Therefore all control files for the network was moved to the network within the INDEX/<network-name> subfolder.
The remaining YACYDB is superfluous and can be deleted.
The yacyDB and yacyNews data structures are now part of plasmaWordIndex. Therefore all methods, using static access to yacySeedDB had to be rewritten. A special problem had been all the port forwarding methods which had been tightly mixed with seed construction. It was not possible to move the port forwarding functions to the place, meaning and usage of plasmaWordIndex. Therefore the port forwarding had been deleted (I guess nobody used it and it can be simulated by methods outside of YaCy).
The mySeed.txt is automatically moved to the current network position. A new effect causes that every network will create a different local seed file, which is ok, since the seed identifies the peer only against the network (it is the purpose of the seed hash to give a peer a location within the DHT).
No other functional change has been made. The next steps to enable network switcing are:
- shift of crawler tables from PLASMADB into the network (crawls are also network-specific)
- possibly shift of plasmaWordIndex code into yacy package (index management is network-specific)
- servlet to switch networks 

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4765 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-05-05 23:13:47 +00:00
danielr
d32fe84472 added default User-Agent
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4763 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-05-04 17:26:19 +00:00