Commit Graph

152 Commits

Author SHA1 Message Date
orbiter
138422990a - removed useCell option: the indexCell data structure is now the default index structure; old collection data is still migrated
- added some debugging output to balancer to find a bug
- removed unused classes for index collection handling
- changed some default values for the process handling: more memory needed to prevent OOM

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5856 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-22 22:39:12 +00:00
lotus
635b0a9da7 code-split
allow cgi indexing

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5839 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-20 13:28:28 +00:00
orbiter
fa3adbbfc6 added domain checks to surrogate reader and RWI transfer receiver to prevent spaming using surrogates
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5837 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-20 06:38:28 +00:00
lotus
ab0030d7a7 allow dht-out for remote-crawl processing peers on default settings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5834 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-18 20:04:01 +00:00
orbiter
4e97a31009 corrections in dublin core syntax
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5823 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-17 12:23:00 +00:00
orbiter
7dfe7e7cc6 fixed some problems with surrogate reader. This is now ready for testing.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5817 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-16 21:29:41 +00:00
orbiter
9050a3c4c5 alpha version of surrogate reading and indexing.
see the example file for an explanation.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5815 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-16 20:47:55 +00:00
orbiter
ad78e3a59f - less lines in rssTerminal
- crawl more documents: if remote crawling is enabled, a remote crawl list is also loaded if a local crawl is running in case that the indexer is idle

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5809 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-15 23:07:51 +00:00
orbiter
bc80dc913a added new surrogate reader (surrogates are parsed documents on batches)
this will open a new way to insert indexes to YaCy (instead crawling)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5808 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-15 15:30:25 +00:00
orbiter
e58320a507 added more info in log fore debugging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5805 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-15 07:37:36 +00:00
orbiter
c0e8ed5461 fixed problem with not http client
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5801 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-13 21:21:47 +00:00
orbiter
c2359f20dd refactoring: better abstraction of reference and metadata prototypes.
This is a preparation to introduce other index tables as used now only for reverse text indexes. Next application of the reverse index is a citation index.
Moved to version 0.74

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5777 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-03 13:23:45 +00:00
shostakovich
1f37cc6107 Robots.txt is now reused after one day. See forum-topic:
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=1669&p=13565#p13565

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5772 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-02 15:29:36 +00:00
orbiter
9bfb2641db - removed deprecated threads
- added automatic http client reset. this was necessary because excessive intranet crawling caused deadlocks. this hack solved the problem.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5768 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-01 20:13:57 +00:00
orbiter
b6c2167143 - patch for bad web structure dumps
- added automatic slow down of accessed to specific domains when access to a web page fails

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5765 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-01 13:21:47 +00:00
orbiter
0139988c04 - added writing of temporary file names and renaming to final file name when index dump/merge are done. Interrupted merges can be cleaned up.
- added clean-up of unfinished merges and unused idx/gap files
- enhanced merge file selection method

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5764 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-04-01 12:39:11 +00:00
orbiter
3621aa96ab - added a memory protection for the IndexCell migration
- fix for bad cell file selection

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5763 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 19:17:45 +00:00
orbiter
d39a5b42ca more care about open file handles. Now files also close on windows and can be deleted afterwards.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5760 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 12:42:12 +00:00
orbiter
029495e64d fixed bug introduced in SVN 5756 in EcoTable.put()
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5759 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-31 07:51:32 +00:00
orbiter
587838bd09 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5758 6c8d7289-2bf4-0310-a012-ef5d649a1542 2009-03-30 21:13:53 +00:00
orbiter
96eaecda3e - added migration class to go from index collections to the index cell data structure.
- added better control over file deletion, because this sometimes fails, especially on windows

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5756 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-30 15:31:25 +00:00
orbiter
37f892b988 added new concurrent merger class for IndexCell RWI data
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5735 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-20 14:54:37 +00:00
borg-0300
8c494afcfe svn attributes added
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5734 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-20 11:21:32 +00:00
orbiter
67aaffc0a2 - added Latency control to the crawler:
because of the strongly enhanced indexing speed when using the new IndexCell RWI data structures (> 2000PPM on my notebook), it is now necessary to control the crawling speed depending on the response time of the target server (which is also YaCy in case of some intranet indexing use cases).
The latency factor in crawl delay times is derived from the time that a target hosts takes to answer on http requests. For internet domains, the crawl delay is a minimum of twice the response time, in intranet cases the delay time is now a halve of the response time.

- added API to monitor the latency times of the crawler:
a new api at /api/latency_p.xml returns the current response times of domains, the time when the domain was accessed by the crawler the last time and many more attributes.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5733 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-20 10:21:23 +00:00
orbiter
61f9dbf0cc - fixed a display problem in watch crawler
- another small enhancement in balancer

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5729 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-18 21:25:52 +00:00
orbiter
b3f75e48fa - enhanced balancer: auto-solving of waiting-deadlocks
- removed deprecated cache-init size value
- more debug lines for IndexCell cache dump merge

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5728 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-18 20:21:19 +00:00
orbiter
d99ff745aa fix for http://forum.yacy-websuche.de/viewtopic.php?p=13378#p13378
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5726 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-18 10:29:13 +00:00
borg-0300
fd0976c0a7 refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5723 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-16 18:08:43 +00:00
borg-0300
ce79239322 "typo"
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5721 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-16 16:22:33 +00:00
orbiter
7dff1cba62 removed option to use different primary keys in kelondro tables
this option was never used and there is also no use to set other columns but the first as the primary key. as a result, access methods to the key do not need to compute key positions, and they work faster.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5711 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-13 16:52:31 +00:00
orbiter
7f67238f8b refactoring of plasmaWordIndex: less methods in the class, separated the index to CachedIndexCollection
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5710 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-13 14:56:25 +00:00
orbiter
14a1c33823 refactoring of wordIndex class
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5709 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-13 10:34:51 +00:00
orbiter
f6d989aa04 added new class RowSetArray which arranges RowSet objects like Elements in a hashtable, but still provides the functionality of sorted enumeration. The new class is now integrated into the ObjectIndexCache, which is the core class to provide index functions to all database files. The new index access is about twice as fast as before. This has strong speed enhancement effects on all parts of YaCy.
The speed of the kelondro indexing class ObjectIndexCache can be compared with Javas standard TreeMap with the main method in IntegerHandleIndex. The result is, that the kelondro indexing needs only 1/5 of the memory that TreeMap uses! In exchange, the kelondro classes are slower than TreeMap, about four (!) times slower. However, this is not so bad because the better use of the memory is a strong advantage and makes it possible that YaCy can maintain such a large number of document (> 50 million) in one peer.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5705 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-12 23:05:18 +00:00
orbiter
efcd95dc37 simplification of (internal) query process / refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5671 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-06 15:53:20 +00:00
orbiter
aa44d9bad9 more refactoring of kelondro.text / deleted de.anomic.index
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5664 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-02 11:04:13 +00:00
orbiter
76ef5f0f14 refactoring of index package: better names for the classes (to be continued)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5661 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-03-01 23:58:14 +00:00
orbiter
8444357291 added new row interator in kelondro tables files that enumerates rows
without an order by the primary key. The result is a very fast enumeration of the Eco table data structure. Other table data types are not affected.
The new enumerator is used for the url export function that can be accessed from the online interface (Index Administration -> URL References -> Export). This export should now be much faster, if all url database files are from type Eco
The new enumeration is also used at other functions in YaCy, i.e. the initialization of the crawl balancer and the initialization of YaCy News.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5647 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-24 10:40:20 +00:00
orbiter
9559bc23fd automatic clean-up of dead connections
(hope that works well..)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5626 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-20 22:27:02 +00:00
orbiter
4f9dae2571 remove reference in crawl entries
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5623 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-19 22:58:00 +00:00
orbiter
c12bb8a6d0 - refactoring of the http client
- added a protection against memory leaks for the access tracker

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5621 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-19 16:24:46 +00:00
orbiter
62505bb3cb more bugfixes as recommendet by findbugs
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5619 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-17 09:12:47 +00:00
orbiter
411f2212f2 more memory leak fixing hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5599 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-11 13:31:10 +00:00
orbiter
6c627dbdff update to the server core
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5591 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-10 13:26:26 +00:00
orbiter
6a876ecb88 first fixes to the DHT transmission process
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5588 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-10 00:48:54 +00:00
orbiter
c25c334b75 replaced old DHT transmission method with new method. Many things have changed! some of them:
- after a index selection is made, the index is splitted into its vertical components
- from differrent index selctions the splitted components can be accumulated before they are placed into the transmission queue
- each splitted chunk gets its own transmission thread
- multiple transmission threads are started concurrently
- the process can be monitored with the blocking queue servlet
To implement that, a new package de.anomic.yacy.dht was created. Some old files have been removed.
The new index distribution model using a vertical DHT was implemented. An abstraction of this model
is implemented in the new dht package as interface. The freeworld network has now a configuration
of two vertial partitions; sixteen partitions are planned and will be configured if the process is bug-free.
This modification has three main targets:
- enhance the DHT transmission speed
- with a vertical DHT, a search will speed up. With two partitions, two times. With sixteen, sixteen times.
- the vertical DHT will apply a semi-dht for URLs, and peers will receive a fraction of the overall URLs they received before.
  with two partitions, the fractions will be halve. With sixteen partitions, a 1/16 of the previous number of URLs.
BE CAREFULL, THIS IS A MAJOR CODE CHANGE, POSSIBLY FULL OF BUGS AND HARMFUL THINGS.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5586 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-10 00:06:59 +00:00
orbiter
65a1de6c05 longer timeout for remote crawl queries
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5573 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-04 16:55:09 +00:00
orbiter
94110df85a moved logging partially to kelondro
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5545 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-31 01:06:56 +00:00
orbiter
024da2916b refactoring of logging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5544 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 23:33:47 +00:00
orbiter
83ce65707a (almost) completed partition of classes in kelondro
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5543 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 22:44:20 +00:00
orbiter
7ee494fde5 more refactoring of kelondro:
- seperated BLOB from table classes
- renamed 'coding' package to 'order'

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5542 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 22:08:08 +00:00