Commit Graph

711 Commits

Author SHA1 Message Date
orbiter
6958eff196 removed unnecessary exceptions, extended testing in IntegerHandleIndex
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5701 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-12 07:35:17 +00:00
orbiter
13c666adef performance hack to ObjectIndex put() method:
The Java standard classes provide a Map interface whose put() method returns the object that was replaced by the argument of the put call. The kelondro ObjectIndex defined its put method in the same way, so it also returned the previous value of the Entry object. However, in most cases the calling code did not use this value, and omitting its return gives some performance benefit. This change implements a put method that does not return the previous value, to reflect the common use. The functionality to get the previous value is still maintained and is provided by a new 'replace' method.
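
A minimal sketch of the put/replace distinction described here. SketchIndex and its methods are hypothetical names backed by a plain HashMap; this is not the kelondro ObjectIndex code. With a HashMap the previous value costs nothing extra, so the sketch only shows the shape of the API. In an index where fetching the old entry is expensive, the void put() is presumably where the saving comes from.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical index used only for illustration; not the kelondro API.
final class SketchIndex<K, V> {
    private final Map<K, V> entries = new HashMap<>();

    // Stores the entry without reporting what was stored before.
    void put(K key, V value) {
        entries.put(key, value);
    }

    // Stores the entry and returns the previous value, or null if there was none.
    V replace(K key, V value) {
        return entries.put(key, value);
    }

    // Returns the current value for the key, or null if the key is absent.
    V get(K key) {
        return entries.get(key);
    }
}
```

Callers that never look at the old value use put(); the few that need it switch to replace().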

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5700 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-11 20:23:19 +00:00
orbiter
1f1be1518c added stub for another performance hack: concurrent indexes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5699 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-11 15:52:03 +00:00
orbiter
3e4c28e188 enhanced count feature for kelondroRowSet. This is about twice as fast as before. Should speed up the collection analysis (half time!)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5698 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-11 15:10:38 +00:00
orbiter
84e37387a2 fix for last commit and more testing stub
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5697 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-11 09:16:46 +00:00
orbiter
ca006c506d stub for performance enhancements for RowSet (no functional change yet)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5696 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-11 08:55:43 +00:00
orbiter
100247bdda added also an export and delete-feature to the URLAnalysis. This completes the clean-up feature for URLs. To do a complete clean-up of the url database, start the following:
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -incollection DATA/INDEX/freeworld/TEXT/RICOLLECTION used.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -diffurlcol DATA/INDEX/freeworld/TEXT used.dump diffurlcol.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -export DATA/INDEX/freeworld/TEXT xml urls.xml diffurlcol.dump
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -delete DATA/INDEX/freeworld/TEXT diffurlcol.dump

The export feature is optional; its purpose is to provide a back-up of the URLs that are to be deleted. The export function can also be used to create html files with embedded links and simple text files: simply replace the word 'xml' with 'html' or 'text'. The last argument in the call, the diffurlcol.dump value, can also be omitted, which causes the complete URL database to be exported. This is an alternative to the export function of the web interface.

The delete-feature is the only destructive method of the four presented here. Please use it with care. It is better to make a back-up of the url database files before starting the deletion.


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5694 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-10 20:52:10 +00:00
orbiter
60078cf322 added next tool for url analysis: check for references, that occur in the URL-DB but not in the RICOLLECTIONS
To use this, you must run the -incollection command first (see SVN 5687), and you need a
used.dump file that has been produced by that process.

Now you can use that file to compare its URL hashes with the URLs in the URL-DB. To do that, execute
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -diffurlcol DATA/INDEX/freeworld/TEXT used.dump diffurlcol.dump
or use different names for the dump files or more memory.

As a result, you get the file diffurlcol.dump which contains all the url hashes that occur in the URL database, but not in the collections.
The file has the format
{hash-12}*
that is, 12-byte hashes are listed one after another without any separator.
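
The {hash-12}* layout can be read back with a few lines of Java. This is a sketch under assumptions: HashDumpReader is a hypothetical helper, not part of URLAnalysis, and it assumes the hash bytes are ASCII characters so that they can be shown as strings.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for a dump of concatenated 12-byte hashes.
final class HashDumpReader {
    static final int HASH_LENGTH = 12;

    // Reads hashes until end of stream; fails if the last record is incomplete.
    static List<String> readHashes(InputStream in) throws IOException {
        List<String> hashes = new ArrayList<>();
        byte[] buf = new byte[HASH_LENGTH];
        while (true) {
            int n = in.readNBytes(buf, 0, HASH_LENGTH);
            if (n == 0) {
                break; // clean end of file
            }
            if (n < HASH_LENGTH) {
                throw new IOException("truncated hash record: " + n + " bytes");
            }
            // Assumption: hash bytes are ASCII characters.
            hashes.add(new String(buf, StandardCharsets.US_ASCII));
        }
        return hashes;
    }
}
```

Because the records have a fixed length and no separator, the number of hashes in a dump is simply the file length divided by 12.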

The next step could be to process this file and delete all these URLs with the computed hashes, or to export them before deletion.



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5692 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-10 13:38:40 +00:00
orbiter
dbdd10da84 better logging and startup behaviour for referenceHash computation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5690 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-09 22:32:04 +00:00
orbiter
d64836c34f added statistical analysis of URL reference
use that with the following command on a linux shell:
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -incollection DATA/INDEX/freeworld/TEXT/RICOLLECTION used.dump
for freeworld indexes.
For more details please see discussion below:
http://forum.yacy-websuche.de/viewtopic.php?p=13204#p13204


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5687 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-09 10:43:28 +00:00
orbiter
3b28daab40 code-beautification (to be consistent with external documentation paper)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5686 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-09 10:24:15 +00:00
orbiter
485c9406e5 fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=1915&hilit=&p=13249#p13249
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5684 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-09 10:14:49 +00:00
orbiter
b80db04667 - refactoring of IntegerHandleIndex and LongHandleIndex (better method names)
- fix for problem in httpdFileHandler: missing close of open files if template cache was disabled
- more memory for DHT selection required
- stub for URL reference hash statistics in index collections

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5682 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-08 21:37:17 +00:00
orbiter
16f5c6a85e fixed merge method initialization in ReferenceContainer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5676 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-07 10:45:14 +00:00
orbiter
d7a493b4f5 added experimental timeline api
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-06 16:01:29 +00:00
orbiter
efcd95dc37 simplification of (internal) query process / refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5671 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-06 15:53:20 +00:00
orbiter
d4b56d5819 added more asserts to BLOBHeap.flushBuffer() to fix the problem described in
http://forum.yacy-websuche.de/viewtopic.php?f=6&t=1679&hilit=&p=13109#p13109

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5666 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-03 23:24:19 +00:00
orbiter
aa44d9bad9 more refactoring of kelondro.text / deleted de.anomic.index
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5664 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-02 11:04:13 +00:00
orbiter
6ffc6e3389 more refactoring of indexer and kelondro classes;
- integrating the indexer into kelondro as package 'text'
- renaming of classes in kelondro.index

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5663 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-02 10:00:32 +00:00
orbiter
2df57b1fd1 refactoring of index collection class
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5660 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-01 23:07:45 +00:00
orbiter
8444357291 added new row iterator in kelondro table files that enumerates rows
without ordering them by the primary key. The result is a very fast enumeration of the Eco table data structure. Other table data types are not affected.
The new enumerator is used for the url export function that can be accessed from the online interface (Index Administration -> URL References -> Export). This export should now be much faster if all url database files are of type Eco.
The new enumeration is also used by other functions in YaCy, for example the initialization of the crawl balancer and the initialization of YaCy News.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5647 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-24 10:40:20 +00:00
orbiter
62505bb3cb more bugfixes as recommended by findbugs
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5619 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-17 09:12:47 +00:00
orbiter
6b450d09ca some fixes recommended by findbugs
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5618 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-16 23:31:54 +00:00
orbiter
e04a0e05c3 fix for last commit
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5614 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-16 16:21:12 +00:00
orbiter
a9ad863686 second part of 'doubles' fix - better handling of doubles in RAMIndex. More logging.
still missing: deletion of double entries in collections

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5613 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-16 16:13:48 +00:00
orbiter
59427064fb first part of 'doubles' fix (not fully ready yet)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5612 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-16 00:47:48 +00:00
orbiter
26978b2a25 - better memory protection in kelondro caches: computation of needed memory for cache grow
- removed excessive gc calls
- step to 16 vertical DHT partitions

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5611 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-15 23:35:59 +00:00
hermens
2173865f92 Prevent race condition when switching timezones.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5605 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-13 11:59:50 +00:00
orbiter
30a1de41b3 disabled the BufferedIOChunks, because I consider it broken.
I will try to fix that, but it is better to use no buffer than to use a broken one.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5600 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-11 15:21:48 +00:00
orbiter
411f2212f2 more memory leak fixing hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5599 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-11 13:31:10 +00:00
orbiter
333489420b - fix for NPE when loading the cytag image
- some hacks for less memory usage:
-- less usage of buffer and cache memory in EcoFS
-- buffer allocation on-demand in BufferedIOChunks
-- removed largest ybr idx

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5595 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-11 10:52:56 +00:00
orbiter
c25c334b75 replaced old DHT transmission method with new method. Many things have changed! Some of them:
- after an index selection is made, the index is split into its vertical components
- from different index selections the split components can be accumulated before they are placed into the transmission queue
- each split chunk gets its own transmission thread
- multiple transmission threads are started concurrently
- the process can be monitored with the blocking queue servlet
To implement that, a new package de.anomic.yacy.dht was created. Some old files have been removed.
The new index distribution model using a vertical DHT was implemented. An abstraction of this model
is implemented in the new dht package as interface. The freeworld network has now a configuration
of two vertical partitions; sixteen partitions are planned and will be configured if the process is bug-free.
This modification has three main targets:
- enhance the DHT transmission speed
- with a vertical DHT, a search will speed up: two times with two partitions, sixteen times with sixteen.
- the vertical DHT will apply a semi-dht for URLs, and peers will receive a fraction of the overall URLs they received before.
  With two partitions, the fraction will be one half. With sixteen partitions, 1/16 of the previous number of URLs.
BE CAREFUL, THIS IS A MAJOR CODE CHANGE, POSSIBLY FULL OF BUGS AND HARMFUL THINGS.
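
As a rough illustration of the vertical split and the one-thread-per-chunk transmission described above, here is a Java sketch. Everything in it is an assumption made for illustration: the class VerticalSplitSketch, its methods, and the partitioning rule (Java hashCode of the URL hash modulo the partition count) are not taken from de.anomic.yacy.dht, and the "transmission" only prints the chunk size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the YaCy dht package.
final class VerticalSplitSketch {

    // Assumed rule: map a URL hash to one of `partitions` vertical partitions.
    static int partitionOf(String urlHash, int partitions) {
        return Math.floorMod(urlHash.hashCode(), partitions);
    }

    // Splits the references of one index selection into its vertical components.
    static List<List<String>> split(List<String> urlHashes, int partitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String hash : urlHashes) {
            parts.get(partitionOf(hash, partitions)).add(hash);
        }
        return parts;
    }

    // Starts one concurrent task per chunk; a real transmission would send the chunk to a peer.
    static void transmitAll(List<List<String>> parts) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, parts.size()));
        for (int i = 0; i < parts.size(); i++) {
            final int p = i;
            pool.submit(() ->
                System.out.println("partition " + p + ": " + parts.get(p).size() + " references"));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

With an even distribution, each of n partitions holds about 1/n of the references, which matches the one half and 1/16 figures given in the message.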

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5586 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-10 00:06:59 +00:00
orbiter
01b97ef3f8 added new cybertag-tracking feature that was inspired by itgrl
from the forum discussion in
http://forum.yacy-websuche.de/viewtopic.php?p=12612#p12612

The feature will provide two basic entities:
- you can integrate image links which point to your yacy installation anywhere in the web.
  the image can be loaded with
  <img src="http://<yourpeer>:<yourport>/cytag.png?icon=invisible&nick=<yournickname_or_community_id>&tag=<anything>">
  This will place an invisible 1-pixel image. If you change icon=invisible to icon=redpill, you will see a red pill.
  Use this to track your activity on the web.
- you can view your tracks at
  http://localhost:8080/Tracks.html
- There is a public api to your tracks at
  http://localhost:8080/api/tracks_p.json
  which needs authentication
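
The image link above can also be assembled programmatically. The following Java sketch uses only the path and the three query parameters named in the message (icon, nick, tag); the class CytagLink and its method are hypothetical helpers, not part of YaCy.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds the cytag image tag described above.
final class CytagLink {

    // Returns an <img> tag pointing at /cytag.png on the given peer.
    static String imgTag(String host, int port, String icon, String nick, String tag) {
        String src = "http://" + host + ":" + port + "/cytag.png"
                + "?icon=" + enc(icon)
                + "&nick=" + enc(nick)
                + "&tag=" + enc(tag);
        return "<img src=\"" + src + "\">";
    }

    // Form-encodes a query value (spaces become '+').
    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

For example, CytagLink.imgTag("localhost", 8080, "redpill", "alice", "blog") yields a tag that shows the red pill icon instead of the invisible pixel.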


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5581 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-06 15:06:19 +00:00
borg-0300
b19bc611b0 gc: better logging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5578 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-05 19:42:32 +00:00
orbiter
b1f9c00118 fix for bug in merge operator initialization
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5577 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-05 15:26:16 +00:00
orbiter
b57c9da1f8 - fixes to doc, ppt, xls parser: better title
- fixes to httpd server response header generation
- fixes to a server date computation bug
- new Button in indexControl to view content of url in ViewFile


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5576 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-05 15:15:13 +00:00
f1ori
7936e58fe7 * sorry, previous version didn't compile
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5575 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-05 12:15:21 +00:00
f1ori
76cdc59789 * added some conversions to and from UTF-8
* this might fix problems on windows systems
  (like http://forum.yacy-websuche.de/viewtopic.php?f=6&t=1824)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5574 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-05 12:12:07 +00:00
orbiter
94110df85a moved logging partially to kelondro
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5545 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-31 01:06:56 +00:00
orbiter
024da2916b refactoring of logging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5544 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 23:33:47 +00:00
orbiter
83ce65707a (almost) completed partition of classes in kelondro
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5543 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 22:44:20 +00:00
orbiter
7ee494fde5 more refactoring of kelondro:
- separated BLOB from table classes
- renamed 'coding' package to 'order'

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5542 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 22:08:08 +00:00
orbiter
bf93767ec6 refactoring of kelondro database classes
(to be continued)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5540 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 15:33:00 +00:00
orbiter
fc27bf8c4c refactoring of kelondro classes:
kelondro shall become independent from other packages.
moved bytebuffer, date and memory to kelondro

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5539 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 14:48:11 +00:00
orbiter
6cbca1e508 extended last fix, preventing more sorts
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5533 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-29 16:42:01 +00:00
orbiter
f9672d3f97 applied fix for inefficient put method as recommended by celle, see
http://forum.yacy-websuche.de/viewtopic.php?p=12424#p12424

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5532 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-29 16:08:24 +00:00
orbiter
3154926311 some better memory protection and OOM prevention in EcoFS
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5523 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-26 20:29:20 +00:00
orbiter
dedfc7df7f removed distinction between DHT-in and DHT-out. This is necessary to make room for the new cell data structure, which cannot use this distinction in the first place but will provide the same meaning through different mechanisms (segments, later)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5511 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-22 00:03:54 +00:00
orbiter
b74159feb8 preparations to integrate the new 'cell' index data structure
(this commit is just to move development files to my other computer, no functionality change so far)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5509 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-21 18:23:37 +00:00
orbiter
cb76d9e0e4 more synchronized in BLOBHeap (will not fix problem with Runtime-Error as reported in forum)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5487 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-13 13:22:29 +00:00