Commit Graph

137 Commits

Author SHA1 Message Date
orbiter
a7d038bb7a The oai ListFriends source list becomes configurable: just write them into defaults/oaiListFriendsSource.xml
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6857 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-06 10:01:37 +00:00
orbiter
5fbf866cae - fixed resumption token generation for oai-pmh import
- relaxed dublin core parsing: the dc:reference tag may replace dc:identifier if this does not contain a valid url
- parsing of completeRecords number and presentation in the download list of oai import

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6850 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-02 22:20:24 +00:00
orbiter
fc5efcc05a enhanced and fixed OAI-PMH import
- now importing OAI-PMH server list from two sources
- simultaneous import from several servers (even > 2000)
- check buttons on OAI-PMH server list to select multiple servers for import start
- it is possible to select all servers at once for import
- imported XML data is gzipped after import from surrogate reader

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6847 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-30 14:03:51 +00:00
sixcooler
c2098f9399 close unused connections if there are too many for DHT
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6846 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-29 23:38:50 +00:00
orbiter
3aad50d38e :-(
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6841 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-26 15:26:08 +00:00
orbiter
9edd38fbc5 connectionCount limit too low?
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6840 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-26 15:24:47 +00:00
orbiter
7a05db0fcb fix to prevent too many open connections
- create fewer connections at maximum (smaller httpc connection pool size)
- create fewer connections per host (2, standard required by RFC)
- do not start DHT distributions if there are too many open connections
- clear open/idle connections earlier; run cleaner more often

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6839 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-25 23:08:36 +00:00
orbiter
2bc3cba6f1 - fix for 'do not write to cache' rule.
- do not read from cache if byte[] array is still filled from response object (will do less IO)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6836 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-24 08:22:45 +00:00
orbiter
8b8107b2a3 reduced IO-load and synchronization/blocking
- enhanced the Balancer performance when building new domain stacks using a new Table buffer
- added the new Table buffer BufferedObjectIndex class
- changed order of access to LURL-read (preferring segment over Crawl Queues), which reduces blocking time on the balancer
- fixed PPM setting in Crawler_p servlet (had doubled values)
- reduced synchronization in IndexCell because it is not necessary: reduced blocking during indexing/merging/dumping
- removed did-you-mean cache in IndexCell because it caused too much overhead and memory usage but was not very useful. This also reduces deadlocks that could be caused when searches are performed during indexing.




git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6819 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-18 21:55:20 +00:00
orbiter
1a8a134e0c continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775 and continued in SVN 6790
The result should be less use of new String() and less memory usage (since a String-encapsulated byte[] has 40 bytes overhead)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6815 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-15 13:22:59 +00:00
orbiter
55d8e686ea performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6807 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-13 23:29:55 +00:00
hermens
2f90f0ad56 Remove asserts blocking proxy use cases
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6793 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-10 15:12:39 +00:00
orbiter
25aef069a6 continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6790 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-08 00:11:32 +00:00
orbiter
a85c5bb8a7 added support for multiple (fail-over) network definition locations when http-locations are given. multiple locations can be given with a comma-separated list of urls pointing to the network definition file
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6780 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-27 23:15:15 +00:00
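The fail-over behaviour described in commit a85c5bb8a7 can be sketched as follows. This is only an illustration under stated assumptions, not the YaCy implementation: the class name FailoverLocations, the method loadFirstAvailable and the loader callback are hypothetical, and the actual HTTP download is replaced by a function argument so the example stays self-contained.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: try each location of a comma-separated list in order
// and return the first network definition that could be loaded.
final class FailoverLocations {

    static Optional<String> loadFirstAvailable(String locations,
                                               Function<String, Optional<String>> loader) {
        for (String location : locations.split(",")) {
            String url = location.trim();
            if (url.isEmpty()) continue;              // tolerate stray commas and blanks
            Optional<String> content = loader.apply(url);
            if (content.isPresent()) return content;  // first reachable location wins
        }
        return Optional.empty();                      // every location failed
    }

    public static void main(String[] args) {
        // Stand-in loader: pretend that only the second location is reachable.
        Function<String, Optional<String>> loader = url ->
            url.contains("mirror") ? Optional.of("definition from " + url)
                                   : Optional.<String>empty();
        System.out.println(loadFirstAvailable(
            "http://primary.example/net.unit, http://mirror.example/net.unit", loader));
    }
}
```

Keeping the list order as the priority order means the first entry acts as the primary location and later entries are only contacted when earlier ones fail.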
orbiter
1e8e79b9ef redesign of reference hash (URL-hash) parameter hand-over:
pass value as byte[], not as String. This should result in fewer
byte[] <-> String conversions during time-critical tasks.
This redesign is not yet complete, more to come ..

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6775 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-26 18:33:20 +00:00
orbiter
748abfcffa added patches to prevent yacy-protocol DoS settings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6751 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-19 15:31:15 +00:00
orbiter
e820ed061a avoiding excessive DNS lookups to determine localhost
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6750 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-19 14:28:25 +00:00
orbiter
3300930fc5 - (almost) fixed FTP crawler
- integrated/fixed SMB crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6742 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-11 15:43:06 +00:00
orbiter
617dfbbd06 allow 'authorization by encoded password' also if the requesting client is not from localhost but from the same host that yacy is running on.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6714 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-05 16:03:55 +00:00
orbiter
727dd9b193 - fixed a bug in robots.txt parser
- moved storage of robots.txt entries to WorkTables, so it is now possible to browse the robots entries with the table browser

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6710 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-04 11:58:07 +00:00
orbiter
54af9e6b49 - added parsing of robots meta-tag in html headers to detect a noindexing request
- added evaluation and indexing prevention in case that a noindexing is given in a html file

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6709 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-03 23:32:56 +00:00
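The noindex detection described in commit 54af9e6b49 can be illustrated with a small sketch. It is an assumption-laden example, not the parser YaCy uses: the class RobotsMetaSketch and the method isNoIndex are hypothetical, and a regular expression stands in for a real HTML parser (it only matches tags where the name attribute precedes the content attribute).

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: detect <meta name="robots" content="...noindex...">.
final class RobotsMetaSketch {

    // Simplification: expects name="robots" directly followed by content="...".
    private static final Pattern ROBOTS_META = Pattern.compile(
        "<meta\\s+name\\s*=\\s*[\"']robots[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    static boolean isNoIndex(String html) {
        Matcher m = ROBOTS_META.matcher(html);
        while (m.find()) {
            // The content attribute is a comma-separated list of directives.
            for (String directive : m.group(1).toLowerCase(Locale.ROOT).split(",")) {
                if (directive.trim().equals("noindex")) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String page = "<html><head><meta name=\"robots\" content=\"noindex, nofollow\"></head></html>";
        if (isNoIndex(page)) {
            System.out.println("document asks not to be indexed; skipping");
        }
    }
}
```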
orbiter
f175f9a2d3 changed the way the number of search requests is counted:
so far only search requests at the remote search interface had been counted.
This was done to protect the privacy of searchers, because nothing was counted or published at the peer's own search interface.
As a result, no search requests of robinson peers were counted, because they cannot be counted at a remote peer.
This change distinguishes search requests made locally at the local search interface from search requests that arrive at the local interface but were submitted from a remote IP without authentication.
Now 3 counters are maintained:
- partial count of remote searches
- total count of local searches on robinson peers from non-authenticated clients
- total count of local searches on robinson peers from localhost or authenticated clients
In the global statistic of search requests, the first two of the three counters are now added.
Because we have a large number of robinson peers with a large number of remote non-authenticated requests, the statistic should show at least three times the number of search requests.


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6696 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-24 13:53:55 +00:00
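The three counters from commit f175f9a2d3 could be modelled roughly like this. The class and method names are hypothetical and chosen only for illustration; the commit does not show how the counters are stored.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the three search counters and the published total.
final class SearchCounterSketch {

    private final AtomicLong remoteSearches = new AtomicLong();             // counter 1
    private final AtomicLong localAnonymousSearches = new AtomicLong();     // counter 2
    private final AtomicLong localAuthenticatedSearches = new AtomicLong(); // counter 3

    // A search that arrived at the remote search interface.
    void countRemote() {
        remoteSearches.incrementAndGet();
    }

    // A search at the local interface; the flag separates counter 3 from counter 2.
    void countLocal(boolean fromLocalhostOrAuthenticated) {
        if (fromLocalhostOrAuthenticated) {
            localAuthenticatedSearches.incrementAndGet();
        } else {
            localAnonymousSearches.incrementAndGet();
        }
    }

    // Only the first two counters enter the global statistic;
    // searches by the peer's own (authenticated) users stay private.
    long publishedTotal() {
        return remoteSearches.get() + localAnonymousSearches.get();
    }

    public static void main(String[] args) {
        SearchCounterSketch counters = new SearchCounterSketch();
        counters.countRemote();
        counters.countLocal(false);
        counters.countLocal(true);
        System.out.println(counters.publishedTotal());
    }
}
```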
orbiter
8030ed3319 self-healing for lost crawl profile handles
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6680 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-18 21:55:45 +00:00
orbiter
ef62d017e5 integrated session id filtering for crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-15 23:15:17 +00:00
orbiter
d8d9984913 added framework for session id filtering (not ready yet)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6671 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-15 22:30:41 +00:00
orbiter
74e736c903 missing file for last commit
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6645 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-04 14:52:58 +00:00
orbiter
d77782a8d5 removed bookmark tags file, tags are now stored only in RAM
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6638 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-01 22:44:59 +00:00
orbiter
24060885b6 - added Tables abstraction in data.Tables.java
fix for
http://forum.yacy-websuche.de/viewtopic.php?p=18910#p18910
http://forum.yacy-websuche.de/viewtopic.php?p=18894#p18894
http://forum.yacy-websuche.de/viewtopic.php?p=18814#p18814


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6631 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-29 18:02:09 +00:00
orbiter
7fdf59a77f misc NPE check
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6630 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-29 15:59:24 +00:00
orbiter
69c29acb6e no exception thread dump if parser cannot parse because that mime-type/extension is in the deny-set
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6611 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-22 13:21:37 +00:00
orbiter
8ce936bcdd added an api recording function: it shall be possible to record
all operations on YaCy in a database that should make it possible
1) to re-create a setting on fresh peers
2) to transmit a setting from one peer to another
3) to re-create crawl starts after a complete deletion of the index
This functionality will also support
4) scheduled re-crawls (new implementation)
To implement this, a new database structure has been created that stores maps into blob heaps. To encode maps, the b-encoding technique was used (this is the same encoding that torrent files use)
- added a b-encoder
- enhanced the b-decoder
- added a b-encoded map heap data structure
- added a table organisation based on b-encoded heaps
- added a servlet to maintain such tables (see Tables_p.html)
- integrated the servlet into the Advanced Settings menu
- added an api recording based on the new tables

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6606 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-21 22:06:03 +00:00
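For readers unfamiliar with the b-encoding mentioned in commit 8ce936bcdd, the following sketch shows the general technique. It is not YaCy's b-encoder: the class name BEncodeSketch is hypothetical, the output is built as a String for readability (a real encoder works on bytes), and keys are assumed to be ASCII so that String ordering matches byte ordering.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of b-encoding: integers, strings, lists and maps.
final class BEncodeSketch {

    static String encode(Object value) {
        StringBuilder out = new StringBuilder();
        write(value, out);
        return out.toString();
    }

    private static void write(Object value, StringBuilder out) {
        if (value instanceof Integer || value instanceof Long) {
            // integer: i<decimal>e
            out.append('i').append(((Number) value).longValue()).append('e');
        } else if (value instanceof String) {
            // string: <byte length>:<content>
            String s = (String) value;
            out.append(s.getBytes(StandardCharsets.UTF_8).length).append(':').append(s);
        } else if (value instanceof List) {
            // list: l<items>e
            out.append('l');
            for (Object item : (List<?>) value) write(item, out);
            out.append('e');
        } else if (value instanceof Map) {
            // dictionary: d<key><value>...e with keys in sorted order
            TreeMap<String, Object> sorted = new TreeMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sorted.put(String.valueOf(e.getKey()), e.getValue());
            }
            out.append('d');
            for (Map.Entry<String, Object> e : sorted.entrySet()) {
                write(e.getKey(), out);
                write(e.getValue(), out);
            }
            out.append('e');
        } else {
            throw new IllegalArgumentException("unsupported type: " + value);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("url", "http://a");
        row.put("depth", 3);
        System.out.println(encode(row)); // prints d5:depthi3e3:url8:http://ae
    }
}
```

Because keys are written in sorted order, two maps with the same content always produce the same encoded form, which is convenient when such records are stored in a heap file.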
orbiter
234f733a3d - relocation of seed db is better for network switch than re-initialization because of the embedding of the peers object in other objects
- small refactoring of blacklist interface code to remove PMD warnings


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6593 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-18 00:07:20 +00:00
orbiter
473b11033d fixed network switch process - crawling did not work after a switch before this fix
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6592 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-17 23:33:15 +00:00
orbiter
fd7b348973 some fixes for the network switch
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6591 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-17 22:07:08 +00:00
orbiter
f6731c6240 more logging etc.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6589 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-17 00:41:50 +00:00
orbiter
a06f7ddb33 more PMD recommendations
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6572 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-12 20:53:19 +00:00
orbiter
dd459281c8 applied code changes that are recommended by PMD
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6563 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-10 23:09:48 +00:00
orbiter
d77a8f3b3e added some modifications recommended by PMD for better performance
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6560 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-10 01:40:26 +00:00
orbiter
dff4f95c78 some patches to get the torrent parser working
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6551 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-07 00:42:12 +00:00
low012
82198acc06 *) minor changes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6537 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-28 11:06:49 +00:00
orbiter
57d729e377 fix for negative numbers in network statistic
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6532 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-21 11:36:48 +00:00
orbiter
362b7a929b added extensive memory protection logic to avoid out of memory errors that may be caused by the RowCollection memory allocation function
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6521 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-09 23:27:26 +00:00
orbiter
8281e29963 - more configuration for profiling graph (number of events)
- more logging for a shutdown: print reason and accessing IP into log


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6520 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-08 14:25:51 +00:00
orbiter
e34e63a039 preset of proper HashMap dimensions: should prevent re-hashing and increase performance
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6511 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-02 14:01:19 +00:00
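The HashMap pre-sizing from commit e34e63a039 relies on the fact that a java.util.HashMap grows (and re-hashes) once the number of entries exceeds capacity times load factor, which is 0.75 by default. A minimal sketch, with the hypothetical helper names capacityFor and newMap:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: choose an initial capacity so that the expected
// number of entries fits without triggering a resize.
final class PresizedMaps {

    private static final float DEFAULT_LOAD_FACTOR = 0.75f;

    static int capacityFor(int expectedEntries) {
        return (int) (expectedEntries / DEFAULT_LOAD_FACTOR) + 1;
    }

    static <K, V> Map<K, V> newMap(int expectedEntries) {
        return new HashMap<>(capacityFor(expectedEntries));
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCounts = newMap(1000);
        for (int i = 0; i < 1000; i++) {
            wordCounts.put("word" + i, i); // no re-hashing expected while filling
        }
        System.out.println(wordCounts.size());
    }
}
```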
orbiter
4a5100789f replaced _all_ size() == 0 with isEmpty() and all size() > 0 with !isEmpty(). The isEmpty() method is much faster in some cases, especially when used to access badly balanced hashtables where a size() operation becomes a large iteration.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6510 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-02 00:37:59 +00:00
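The replacement described in commit 4a5100789f can be illustrated with a collection from the Java standard library whose size() is documented as not being a constant-time operation, java.util.concurrent.ConcurrentLinkedQueue. The class and method names below are hypothetical; the sketch only shows that both checks give the same answer while isEmpty() does not need to count elements.

```java
import java.util.Collection;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch comparing the two ways of testing for emptiness.
final class EmptyCheck {

    // Preferred form: can answer after looking at the first element only.
    static boolean hasWork(Collection<?> queue) {
        return !queue.isEmpty();
    }

    // Older form: size() may have to traverse the whole collection first.
    static boolean hasWorkSlow(Collection<?> queue) {
        return queue.size() > 0;
    }

    public static void main(String[] args) {
        Collection<String> queue = new ConcurrentLinkedQueue<>();
        System.out.println(hasWork(queue));
        queue.add("http://example.org/");
        System.out.println(hasWork(queue));
    }
}
```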
orbiter
491ba6a1ba - some refactoring in workflow
- some refactoring in search process
- fixed image search for json and rss output
- search navigation on bottom of search result page in cases where there are more than 6 results on page
- fixes for number of displayed documents
- disabled pseudostemming

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6504 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-11-24 11:13:11 +00:00
orbiter
1dff620181 Better implementation of SortStack and SortStore and adaptations in all using classes to implement the necessary Comparable interface and hash code computation.
The better SortStack performance affects crawling and image search speed and quality.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6492 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-11-19 13:49:28 +00:00
orbiter
4c6312d103 enhanced image search
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6489 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-11-18 23:56:05 +00:00
orbiter
4c99d4683d possible fix for lost crawl profile handles: clean-up job did wrong measurement to see if crawl is still running.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6465 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-11-06 23:15:20 +00:00
orbiter
4431b9767e added about 450 replacements for printStackTrace() methods to pipe such traces into the log at DATA/LOG/
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6458 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-11-05 20:28:37 +00:00
orbiter
b0b7a4f9a5 - added function to OAI-PMH reader that can pull all records from a server using an evaluation of the resumption token to get URL to retrieve remaining records
- added monitoring for retrieved records

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6444 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-11-02 11:53:14 +00:00
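The resumption-token loop mentioned in commit b0b7a4f9a5 follows the OAI-PMH ListRecords convention: the first request names a metadataPrefix, and each following request carries only the resumptionToken returned by the previous response until the token is empty or absent. The sketch below is an illustration, not YaCy's reader: the class OaiHarvestSketch, the fetch callback and the maxPages guard are assumptions, and a regular expression replaces proper XML parsing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: follow resumption tokens until the list is complete.
final class OaiHarvestSketch {

    private static final Pattern TOKEN =
        Pattern.compile("<resumptionToken[^>]*>([^<]+)</resumptionToken>");

    static List<String> harvest(String baseUrl, Function<String, String> fetch, int maxPages) {
        List<String> pages = new ArrayList<>();
        String url = baseUrl + "?verb=ListRecords&metadataPrefix=oai_dc";
        while (url != null && pages.size() < maxPages) {
            String xml = fetch.apply(url);   // fetch is a stand-in for the HTTP download
            pages.add(xml);
            Matcher m = TOKEN.matcher(xml);
            String token = m.find() ? m.group(1).trim() : "";
            // An empty or missing token marks the last page.
            url = token.isEmpty()
                ? null
                : baseUrl + "?verb=ListRecords&resumptionToken=" + encode(token);
        }
        return pages;
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

Passing the download step in as a function keeps the loop testable without a network connection; the page limit protects against servers that keep returning tokens.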
orbiter
a0e891c63d - some redesign in UI menu structure to make room for new 'Content Integration' main menu containing import servlets for Wikimedia Dumps, phpbb3 forum imports and OAI-PMH imports
- extended the OAI-PMH test applet and integrated it into the menu. It still does not import OAI-PMH records, but shows that it is able to read and parse this data
- some redesign in ZURL storage: refactoring of access methods, better concurrency, less synchronization
- added a limitation to the LURL metadata database table cache to 20 million entries: this cache was until now not limited and only limited by the available RAM which may have caused a memory-leak-like behavior.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6440 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-31 11:58:06 +00:00
orbiter
52470d0de4 - fix for xls parser
- fix for image parser
- temporary integration of images as document types in the crawler and indexer for testing of the image parser

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6435 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-22 22:38:04 +00:00
orbiter
5e8038ac4d - refactoring of blacklists
- refactoring of event origin encoding


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6434 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-21 20:14:30 +00:00
orbiter
26fafd85a5 - more refactoring
- fixed problem with parsers

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6433 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-21 15:12:34 +00:00
orbiter
3528b970d6 - refactoring
- added new experimental (not-yet-working) image parser
- added new test image

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6431 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-19 22:34:44 +00:00
orbiter
a8ce192f63 - shifted main classes to new package net.yacy
- fixed some bugs in last commit

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6427 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-18 01:38:07 +00:00
orbiter
b79f4f062f refactoring of yacy documents and parsers: they depend now only on the kelondro classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6426 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-18 00:53:43 +00:00
orbiter
e7f18ba24b refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6399 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-11 00:24:42 +00:00
orbiter
ce8dc575ca refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6398 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-11 00:12:19 +00:00
orbiter
bea3b99aff moved table and util classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6397 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-10 01:14:19 +00:00
orbiter
4446acc8cd moved kelondro order
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6392 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-09 23:22:22 +00:00
orbiter
f677d534b1 start of a really extensive refactoring which will produce a hierarchical package structure with the domain yacy.net as package root
- moved here the logging classes as part of the new net.yacy.kelondro package

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6391 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-09 23:13:30 +00:00
orbiter
ea473e32b8 refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6390 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-09 22:27:50 +00:00
orbiter
735e2737e3 * added index segments
This is a major change in the organization of indexes.
Please consider a back-up of your data before you run this update.
All existing index files will be moved and renamed to a new position.
With this change, it will be possible to maintain different indexes for different purposes and it will be possible to have a distinction between DHT-in and DHT-out specific indexes. Tenants may also have their own index, and it may be possible to have histories and back-ups of indexes. This is just the beginning, many servlets must be adapted after this change, but all functions that had been there should still work.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6389 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-09 14:44:20 +00:00
orbiter
09de5da74a once again a performance hack
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6388 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-08 18:26:54 +00:00
orbiter
6e0dc39a7d - some fixes to prevent blocking situations
- better logging for the crawler
- better default values for the crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6377 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-06 21:52:55 +00:00
orbiter
04a548a1e3 - temporarily integrated the transferURL servlet as a static class instead of a class that is called using reflection, to investigate the OOM problems in that class
- fixes for numerous other problems
- removed dead code
- redesign of the strings-method, which now produces less memory overhead and may help to prevent OOMs
- another fix for the deadlock problem in SplitTable

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6373 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-05 20:11:41 +00:00
orbiter
6aa474f529 - better logging for web cache access and fail reasons
- better Exception handling for web cache access
- distinction between access of web cache for proxy and crawler


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6367 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-01 13:08:19 +00:00
orbiter
3671c37989 added experimental oai-pmh reader and integrated it with the existing dublin core parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6366 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-30 22:11:00 +00:00
orbiter
e627f75415 one more fix to badwords and stopwords
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6316 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-15 11:47:50 +00:00
orbiter
721b88efbd - fixed a problem loading blacklists with new yacycore.jar
- fixed badwords and stopwords initialization

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6315 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-15 11:46:02 +00:00
orbiter
68465c37af added a convenience class to add files into a YaCy index
to make this possible, the yacyURL must be able to process file:// urls, which has also been implemented
testing of the new class resulted in some bugfixes in other classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6313 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-14 21:17:42 +00:00
orbiter
573d03c7d7 added configuration to enable ram table copy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6304 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-07 20:30:57 +00:00
orbiter
700218846c disabled or removed sleep calls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6301 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-07 18:50:44 +00:00
orbiter
67eddaec4b changed way to integrate dictionary files:
they must be downloaded manually by the user and placed in DATA/DICTIONARIES/source
for each externally imported dictionary file there will be a translator that converts the input file once
into a YaCy-internal data format.
Files that will be provided together with yacy releases may still be placed in <root>/dictionaries

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6286 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-02 18:42:13 +00:00
orbiter
d656a94f55 fix for bad paths in dictionary processing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6285 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-02 18:24:41 +00:00
orbiter
3b9aaf9e9f - inserted new library tests inside DidYouMean
- some redesign of DidYouMean that was necessary to follow
  a special rule how a library should be used:
  - the library provides words that start or end with a test
    word which may be possibly also an empty set of words
  - all words that the DidYouMean produced with the four
    production rules are used to generate a set of
    library-completed words
  - if this process results in any words from the library,
    only library-generated words are taken
  - if there is no library-generated word at all, take the
    artificially generated word
  - all words that result from these rules are tested against
    the index
  - the result is ordered using a lightweight comparator that
    prefers short words
  - a not-so-much-io test against the index is being prepared
    next
- inserted the library initialization into the switchboard

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6284 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-02 13:41:56 +00:00
orbiter
1a9cfd8718 some performance hacks (CPU only, not IO)
this will cause better computation speed for single- and multi-core;
there are enhancements that will speed up old and slow machines as well
as multi-core CPUs. Indexing of surrogates has been sped up
from 4000 PPM to over 20000 PPM on a simple dual core office computer.
Since the enhancements are mostly in core routines, the hack should also
speed up search performance.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6276 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-28 13:28:11 +00:00
orbiter
72e5407115 refactoring of snippet cache
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6268 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-27 14:34:41 +00:00
orbiter
72ac5bd80f refactoring of search process.
this is the beginning of some architecture changes that will hopefully bring some more stability, speed and transparency to the search process.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6260 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-24 15:24:02 +00:00
orbiter
92edd24e70 fixed problem with switching of networks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6247 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-30 15:49:23 +00:00
orbiter
c4ae2cd03f fixed bug that caused deletion of crawl profiles at every application startup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6240 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-23 22:09:02 +00:00
orbiter
161d2fd2ef redesign of access to the HTCache (now http.client.Cache):
- better control to the cache by using combined request-header and content access methods
- refactoring of many classes to comply to this new access method
- make sure that the cache is always written if something was loaded
- some redesign of the process how http response results are fed into the new indexing queue
- introduction of a cache read policy:
 * never use the cache
 * use the cache if entry exist
 * use the cache if the proxy freshness rule confirms
 * use only the cache and go never online
- added configuration options for the crawl profiles to use the new cache policies. There is not yet an input during crawl start to set the policy but this will be added in another step.
- set the default policies for the existing crawl profiles. If you want them to appear in your default profiles you must delete the crawl profiles database; otherwise the policy is 'proxy freshness rule'
- enhanced some cache access methods in such a way that unnecessary retrievals are omitted (i.e. for size computation). That should reduce some IO but also a lot of CPU computation because sizes were computed after decompression of content after retrieval of the content from the disc.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6239 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-23 21:31:51 +00:00
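The four cache read policies listed in commit 161d2fd2ef can be expressed as an enum plus one decision function. The sketch below is hypothetical: the names CacheStrategy, Source and choose, as well as the two boolean inputs, are assumptions made for illustration and are not taken from the YaCy sources.

```java
// Hypothetical sketch of the four cache read policies.
final class CachePolicySketch {

    enum CacheStrategy {
        NOCACHE,    // never use the cache
        IFEXIST,    // use the cache if an entry exists
        IFFRESH,    // use the cache if the proxy freshness rule confirms
        CACHEONLY   // use only the cache and never go online
    }

    enum Source { CACHE, NETWORK, NONE }

    static Source choose(CacheStrategy strategy, boolean inCache, boolean fresh) {
        switch (strategy) {
            case NOCACHE:
                return Source.NETWORK;
            case IFEXIST:
                return inCache ? Source.CACHE : Source.NETWORK;
            case IFFRESH:
                return (inCache && fresh) ? Source.CACHE : Source.NETWORK;
            case CACHEONLY:
                return inCache ? Source.CACHE : Source.NONE; // a miss yields nothing
            default:
                throw new AssertionError(strategy);
        }
    }

    public static void main(String[] args) {
        // A stale cache entry under the freshness policy forces a reload.
        System.out.println(choose(CacheStrategy.IFFRESH, true, false));
    }
}
```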
f1ori
ba2e6de538 fix empty version string again
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6236 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-21 19:56:40 +00:00
orbiter
4da9042e8a code simplification
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6233 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-19 21:59:29 +00:00
orbiter
1d8d51075c refactoring:
- removed the plasma package. The name of that package came from a very early pre-version of YaCy, even before YaCy was named AnomicHTTPProxy. The Proxy project introduced search for cache contents using class files that had been developed during the plasma project. Information from 2002 about plasma can be found here:
http://web.archive.org/web/20020802110827/http://anomic.de/AnomicPlasma/index.html
We still have one class that comes mostly unchanged from the plasma project, the Condenser class. But this is now part of the document package and all other classes in the plasma package can be assigned to other packages.
- cleaned up the http package: better structure of that class and clean isolation of server and client classes. The old HTCache becomes part of the client sub-package of http.
- because the plasmaSwitchboard is now part of the search package all servlets had to be touched to declare a different package source.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6232 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-19 20:37:44 +00:00