yacy_search_server/source/de/anomic/kelondro/text
orbiter 60078cf322 added next tool for url analysis: check for references that occur in the URL-DB but not in the RICOLLECTIONS
To use this, you must first run the -incollection command (see SVN 5687), and you need a
used.dump file that has been produced by that process.

You can then use that file to compare its URL hashes with the URLs in the URL-DB. To do that, execute
java -Xmx1000m -cp classes de.anomic.data.URLAnalysis -diffurlcol DATA/INDEX/freeworld/TEXT used.dump diffurlcol.dump
or use different names for the dump files, or a larger -Xmx value for more memory.

The result is the file diffurlcol.dump, which contains all URL hashes that occur in the URL database but not in the collections.
The file has the format
{hash-12}*
that is, 12-byte hashes are listed back to back without any separator.
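
As an illustration of the {hash-12}* layout, here is a minimal sketch of a reader that splits such a dump into its 12-byte records. It is not part of YaCy: the class name HashDumpReader and the method readHashes are hypothetical, and the sketch assumes that each hash can be decoded as 12 ASCII characters. It requires Java 9 or later for InputStream.readNBytes.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a YaCy class: reads a dump of concatenated 12-byte hashes.
class HashDumpReader {

    // Record length taken from the format description {hash-12}*.
    static final int HASH_LENGTH = 12;

    // Reads the stream until EOF and returns one string per 12-byte record.
    // Assumption: hashes consist of ASCII characters only.
    static List<String> readHashes(InputStream in) throws IOException {
        List<String> hashes = new ArrayList<>();
        byte[] buf = new byte[HASH_LENGTH];
        while (true) {
            // readNBytes blocks until HASH_LENGTH bytes are read or EOF is reached.
            int n = in.readNBytes(buf, 0, HASH_LENGTH);
            if (n == 0) {
                break; // clean end of file
            }
            if (n < HASH_LENGTH) {
                throw new IOException("truncated record: " + n + " trailing bytes");
            }
            hashes.add(new String(buf, StandardCharsets.US_ASCII));
        }
        return hashes;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java HashDumpReader diffurlcol.dump
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            List<String> hashes = readHashes(in);
            System.out.println(hashes.size() + " hashes read");
        }
    }
}
```

Because there is no separator, a file whose length is not a multiple of 12 cannot be a valid dump; the sketch reports that case as an error instead of silently dropping the trailing bytes.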

A possible next step is to process this file and delete all URLs with the listed hashes, or to export them before deletion.



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5692 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-10 13:38:40 +00:00
..
AbstractBlacklist.java more refactoring of kelondro.text / deleted de.anomic.index 2009-03-02 11:04:13 +00:00
Blacklist.java more refactoring of kelondro.text / deleted de.anomic.index 2009-03-02 11:04:13 +00:00
DefaultBlacklist.java more refactoring of kelondro.text / deleted de.anomic.index 2009-03-02 11:04:13 +00:00
Document.java more refactoring of kelondro.text / deleted de.anomic.index 2009-03-02 11:04:13 +00:00
Index.java more refactoring of indexer and kelondro classes; 2009-03-02 10:00:32 +00:00
IndexCache.java fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=1915&hilit=&p=13249#p13249 2009-03-09 10:14:49 +00:00
IndexCell.java more refactoring of kelondro.text / deleted de.anomic.index 2009-03-02 11:04:13 +00:00
IndexCollection.java better logging and startup behaviour for referenceHash computation 2009-03-09 22:32:04 +00:00
IndexReader.java more refactoring of indexer and kelondro classes; 2009-03-02 10:00:32 +00:00
MetadataRepository.java added next tool for url analysis: check for references that occur in the URL-DB but not in the RICOLLECTIONS 2009-03-10 13:38:40 +00:00
MetadataRowContainer.java simplification of (internal) query process / refactoring 2009-03-06 15:53:20 +00:00
Phrase.java more refactoring of indexer and kelondro classes; 2009-03-02 10:00:32 +00:00
Reference.java more refactoring of indexer and kelondro classes; 2009-03-02 10:00:32 +00:00
ReferenceContainer.java fixed merge method initialization in ReferenceContainer 2009-03-07 10:45:14 +00:00
ReferenceContainerArray.java more refactoring of kelondro.text / deleted de.anomic.index 2009-03-02 11:04:13 +00:00
ReferenceContainerCache.java more refactoring of kelondro.text / deleted de.anomic.index 2009-03-02 11:04:13 +00:00
ReferenceContainerOrder.java more refactoring of kelondro.text / deleted de.anomic.index 2009-03-02 11:04:13 +00:00
ReferenceOrder.java more refactoring of indexer and kelondro classes; 2009-03-02 10:00:32 +00:00
ReferenceRow.java more refactoring of indexer and kelondro classes; 2009-03-02 10:00:32 +00:00
ReferenceVars.java more refactoring of indexer and kelondro classes; 2009-03-02 10:00:32 +00:00
URLMetadata.java more refactoring of kelondro.text / deleted de.anomic.index 2009-03-02 11:04:13 +00:00
Word.java more refactoring of indexer and kelondro classes; 2009-03-02 10:00:32 +00:00