403 Commits

Author SHA1 Message Date
f1ori
7d8de34778 * add a bit of documentation to DigestURI; use DigestURI(string) instead of DigestURI(string, null)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7276 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-26 16:10:20 +00:00
orbiter
ed4371dcf3 enhanced navigation implementation and enhanced tag cloud computation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7252 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-15 23:45:12 +00:00
orbiter
ca738ac924 - added a tag cloud to search results (using the topics)
- some refactoring of score classes
- added default package for new classes add_ymark and delete_ymark

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7251 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-15 22:01:39 +00:00
orbiter
e4d561971e added more score cluster options and made score cluster usage more transparent
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7248 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-14 11:40:02 +00:00
orbiter
7cd9d9d22a - enhanced DidYouMean computation using a faster count on index entries; this allows results to be ranked better
- added limitations on DidYouMean result sets according to input and output string length

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7246 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-12 22:02:10 +00:00
orbiter
24f1cba7b2 performance hacks:
- faster generation of index abstract compression during remote search
- less synchronization in IO record reading
- request index abstract generation only if necessary and faster time-out in remote search 

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7239 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-11 12:44:07 +00:00
orbiter
d607b30b6a performance enhancements for search and code review for database functions
- removed read cache from Records data structure because the read cache had no cache hit during search operation
- copied the old read-cache class to CachedRecords; the new Records class no longer has the cache, and a code review checked that data structures and synchronization are clean
- removed unnecessary synchronization from Table class during get()

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7237 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-11 11:01:50 +00:00
orbiter
0d363a94d7 more performance hacks
this makes YaCy search results VERY fast for all verify=false search cases
and it enhances the search speed also for all other snippet-fetch cases.
With this change my peer performed 100 Queries Per Second (!!!) while doing 10 queries simultaneously (!!!)
in an intranet index of 20000 URLs on my 16-core Mac

Check this yourself by doing:
cd bin
./searchtestmulti.sh
after finishing the run, divide 1000 by the given time per query in milliseconds (which gives the qps for one thread)
and then multiply by 10 (because 10 search threads have been started)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7231 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-09 08:55:57 +00:00
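The throughput arithmetic described in commit 0d363a94d7 can be written out as a small Java sketch. The class and method names are hypothetical, and the sketch assumes that the script reports the time per query in milliseconds; it is not taken from searchtestmulti.sh.

```java
// Hypothetical helper for illustration; not part of YaCy or searchtestmulti.sh.
final class QpsEstimate {
    // Queries per second for one thread: 1000 ms divided by the time per query in ms.
    static double perThread(double millisPerQuery) {
        return 1000.0 / millisPerQuery;
    }

    // Total throughput: the per-thread rate multiplied by the number of search threads.
    static double total(double millisPerQuery, int threads) {
        return perThread(millisPerQuery) * threads;
    }
}
```

With an assumed 100 ms per query and 10 threads, QpsEstimate.total(100.0, 10) evaluates to 100.0 queries per second.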
orbiter
091dd3f6ec - enhanced intranet search speed
- enhanced intranet portscan speed (better time-out)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7227 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-08 10:54:13 +00:00
orbiter
aacf572a26 - enhancements for search speed
- bug fixes in many classes including basic data structure classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7217 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-04 11:54:48 +00:00
orbiter
2c549ae341 fixed a number of small bugs:
- better crawl start for file paths and smb paths
- added time-out wrapper for dns resolving and reverse resolving to prevent blockings
- fixed intranet scanner result list check boxes
- prevented htcache usage in case of file and smb crawling (not necessary, documents are locally available)
- fixed rss feed loader
- fixed sitemap loader, which had not been restricted to single files (crawl-depth must be zero)
- clearing of crawl result lists when a network switch was done
- higher maximum file size for crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7214 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-30 23:57:58 +00:00
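The time-out wrapper for DNS resolving mentioned in commit 2c549ae341 could look roughly like the sketch below. This is an assumption about the general technique (run the blocking call on a worker thread and wait with a deadline), not the code from the commit; the class name TimeoutCall and the fallback parameter are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper for illustration; YaCy's actual implementation may differ.
final class TimeoutCall {
    // Runs the task on a daemon worker thread and returns the fallback value
    // if the task fails or does not finish within timeoutMillis.
    static <T> T call(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timeout-call");
            t.setDaemon(true); // a stuck lookup must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker if it is still blocked
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw, e.g. an unknown host
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A DNS lookup could then be wrapped as TimeoutCall.call(() -> java.net.InetAddress.getByName(host), 500, null), where the 500 ms limit is an arbitrary example value.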
orbiter
e54cb7fb0c more bugfixes (also for latest commit)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7202 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-28 10:20:46 +00:00
orbiter
be6b48311c misc bugfixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7201 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-28 10:00:33 +00:00
orbiter
48c0d508ac fixes for crawling of smb links (file length not always available)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7190 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-25 22:32:26 +00:00
orbiter
09c208a3ab patch for corrupted database files (just work on and forget key)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7177 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-20 14:38:56 +00:00
orbiter
97ee278931 enhanced search speed:
- better control of number of running search threads
- no time-out waiting time when no ranking feeding takes place
- local search queries by a remote peer may be faster up to 300 milliseconds
- a local search may even be faster

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7176 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-20 13:17:25 +00:00
orbiter
8da4eb5de6 addition to patch in SVN 7111
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7170 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-19 23:12:50 +00:00
orbiter
37baa8bae3 - fixes for concurrency exceptions and failed database integrity verification
- added link to yacystats peer when peer is more than one day old

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7164 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-17 10:20:04 +00:00
orbiter
461a2a6ec7 enhanced remote crawling:
- 300 ppm is default now (but this is switched off by default; if you switch it on you may want more traffic?)
- better timing for busy queue
- better amount of remote url retrieval
- better time-out values
- better tracking of availability of remote crawl urls
- more logging for result of receipt sending

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7159 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-16 09:34:17 +00:00
orbiter
0cf006865e refactoring and enhanced concurrency
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7155 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-15 11:38:03 +00:00
orbiter
83ac07874f - corrected return value of put() methods (not used anywhere, so it did not harm before)
- added use of LookAheadIterator which should prevent mistakes when coding iterators with embedded iterators
- added a fail-safe reaction in case of database corruption using iterators over database elements (no interruption then)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7154 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-15 10:43:14 +00:00
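The look-ahead pattern named in commit 83ac07874f can be sketched as follows. This is an illustrative reconstruction of the general idea, not the actual YaCy LookAheadIterator: the class name LookAhead and the fetch() method are hypothetical. A subclass implements only one method that returns the next element or null, which keeps hasNext() and next() consistent even when the source is itself a nested iterator.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Illustrative sketch; not the actual YaCy LookAheadIterator.
abstract class LookAhead<T> implements Iterator<T> {
    private T next;          // element fetched ahead of time, or null when exhausted
    private boolean primed;  // whether the first element has been fetched yet

    // Subclasses return the next element, or null when there are no more.
    protected abstract T fetch();

    @Override
    public boolean hasNext() {
        if (!primed) {
            next = fetch();
            primed = true;
        }
        return next != null;
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T current = next;
        next = fetch(); // look ahead for the following call
        return current;
    }
}
```

Because null marks the end of the iteration, this sketch cannot carry null elements.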
orbiter
14c843d364 more performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7148 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-14 15:00:34 +00:00
orbiter
39f409a7bb performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7147 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-14 14:32:24 +00:00
orbiter
906c572621 - enhanced index create menu structure
- clear search log caches each time a search is done

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7142 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-14 09:06:27 +00:00
orbiter
64860dc1bb enhanced search event logging (to be used for further improvements)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7140 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-13 09:33:04 +00:00
orbiter
7dbc357593 patch to identify corrupted database files
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7139 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-13 07:20:53 +00:00
sixcooler
17eebd4ef8 counting crawler traffic again:
fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2808

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7138 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-11 15:58:15 +00:00
orbiter
32f73d1aaa added copy for Info.plist for Mac application release updates (this file contains class paths and start parameters)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7133 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-10 09:48:09 +00:00
orbiter
570ca577c6 performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7129 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-09 22:42:54 +00:00
orbiter
348dece62f redesign of the SortStack and SortStore classes:
created a WeakPriorityBlockingQueue as special implementation
of a PriorityBlockingQueue with a weak object binding.
- better abstraction of ordering technique
- fixed some bugs according to result numbering (distinguish different counters in Queue)
- fixed an ordering bug in post-ranking (ordering was decreased instead of increased)
- reversed ordering numbering using a reversed ordering. The higher the ranking number the better (now).

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7128 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-09 15:30:25 +00:00
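The "higher ranking number is better" ordering from commit 348dece62f can be illustrated with the standard java.util.concurrent.PriorityBlockingQueue and a reversed comparator. This sketch does not reproduce WeakPriorityBlockingQueue or its weak object binding; the class name RankedQueue and the use of Long as the ranking type are assumptions.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Illustration only; this is not YaCy's WeakPriorityBlockingQueue.
final class RankedQueue {
    // Creates a queue whose head is always the entry with the highest ranking.
    static PriorityBlockingQueue<Long> create() {
        return new PriorityBlockingQueue<>(16, Comparator.<Long>reverseOrder());
    }
}
```

Polling such a queue returns entries from the highest ranking down to the lowest.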
orbiter
5fe828fa06 - replaced pdfbox and fontbox version 1.1.0 with 1.2.1
- added some clear statements that shall clear static cache size within the pdfbox library
- the pdfbox library contains a memory leak; it is unsafe to run a peer with pdf parser permanently on.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7120 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-07 17:13:47 +00:00
orbiter
24502fe3de performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7116 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-06 12:59:33 +00:00
orbiter
d865ef77a8 removed re-read of index in case of a bad index. This may not solve the problem but it addresses a 100% CPU problem on the peer. I'm afraid bad index files must be abandoned, and cannot be fixed this way.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7111 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-06 09:55:04 +00:00
orbiter
b2c9db48ea Performance enhancement
- introduced byte[] - based ARC method for MapHeap which avoids a String generation each time the cache is accessed
- bugfixing in required class ComparableARC

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7110 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-06 09:53:33 +00:00
orbiter
e8228fba09 less locking in time format computation, caching and during secondary (remote) search evaluation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7106 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-05 11:13:12 +00:00
orbiter
9c0c94683c because of a bug in search result caching count search results had not been generated as fast as possible.
with this fix search results are (even) faster.
Also enhanced: image search. This is now sped up using an image search result look-ahead

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7105 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-04 22:57:12 +00:00
orbiter
9d080f387e change in handling of the all-visible home path for storage in YaCy:
the home path can now be distinguished between
- data home; the path where the DATA directory is created
- application home; everything else
This will make it possible to store application data on Mac releases within the
~/Library/YaCy
directory; a place where Mac applications write their data.
Similar techniques will be possible for debian and windows.
To use the new data path, YaCy can be started with
-start <data path>
or
-gui <data path>


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7092 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-02 19:24:22 +00:00
orbiter
65eaf30f77 redesign of crawl profiles data structure. target will be:
- permanent storage of auto-dom statistics in profile
- storage of profiles in WorkTable data structure
not finished yet. No functional change yet.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7088 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-31 15:47:47 +00:00
orbiter
4f22e2df41 bugfixes for
- next-execution-time in scheduler
- deletion of scheduled rss feed loading (now deletes also the scheduling entry)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7075 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-26 16:42:00 +00:00
orbiter
42414a6ae3 added two more tables in rss reader interface:
- fresh recorded rss feeds (not yet loaded or in scheduler)
- rss feeds in scheduler
The first list has a button that can be used to place rss feeds into the scheduler
The second list has a button to delete rss feeds from the scheduler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7074 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-26 16:01:45 +00:00
orbiter
0010cd9db1 Support for indexing of RSS feeds!
- added a scanning in html parser for rss feeds
- storage of rss feed addresses, can be viewed with http://localhost:8080/Tables_p.html?table=rss
- rss items retrieved by http://localhost:8080/Load_RSS_p.html (in Index Creation menu) can be selected and indexed
- a rss feed retrieved in http://localhost:8080/Load_RSS_p.html can now be fully indexed
- indexing of rss feeds can be placed in scheduler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7073 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-25 18:24:54 +00:00
orbiter
0f276dd63f - MapHeap now implements Map<byte[], Map<String, String>>
- refactoring of method names to comply with Map method names

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7072 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-24 12:36:56 +00:00
orbiter
cf07b34c2d implemented the Map interface in the ARC classes so it will be possible to instantiate ARCs as
Map<byte[], Map<String, byte[]>>
Because such Maps with byte[] keys cannot be stored in hash maps (bad hashing on byte[])
another ARC with comparable Maps has been added

This will make it possible to move the HTCache class 'Cache' into the cora package because that
class may be used either with RAM caches (ARCs) or with file-based caches (BEncodedHeaps)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7071 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-23 23:38:03 +00:00
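The remark in commit cf07b34c2d about bad hashing on byte[] can be demonstrated with plain JDK collections. Java arrays inherit identity-based hashCode() and equals(), so a hash map cannot find a key by content, while a comparator-based map can. The class name and the unsigned lexicographic ordering below are assumptions for this sketch, not YaCy's own ordering or ARC classes.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustration with JDK maps only; not the YaCy ARC implementation.
final class ByteKeyMaps {
    // Lexicographic, unsigned byte-wise order (an assumption for this sketch).
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xff) - (b[i] & 0xff);
            if (c != 0) return c;
        }
        return a.length - b.length;
    };

    // HashMap relies on the arrays' identity-based hashCode/equals,
    // so a second array with the same content is not found.
    static boolean hashLookupFindsCopy() {
        Map<byte[], String> map = new HashMap<>();
        map.put(new byte[] {1, 2, 3}, "value");
        return map.containsKey(new byte[] {1, 2, 3});
    }

    // A comparator-based map compares contents, so the copy is found.
    static boolean treeLookupFindsCopy() {
        Map<byte[], String> map = new TreeMap<>(BYTE_ORDER);
        map.put(new byte[] {1, 2, 3}, "value");
        return map.containsKey(new byte[] {1, 2, 3});
    }
}
```

Here hashLookupFindsCopy() returns false and treeLookupFindsCopy() returns true.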
orbiter
c60d0282fd more abstraction for tables stored in heaps:
the BEncodedHeap now implements Map<byte[], Map<String, byte[]>>
This will make it possible that also different database storage types may be added that implement also the same Map<byte[], Map<String, byte[]>> interface.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7070 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-23 21:27:58 +00:00
orbiter
d1be64d491 removed wrong assert
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7069 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-23 21:02:28 +00:00
orbiter
3197ca42ed preparations to move the HTCache into cora:
- move the header framework classes to cora
- move the ARC caching classes to cora
- refactoring of code to call these classes from cora

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7068 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-23 12:32:02 +00:00
orbiter
844f158686 - removed dependencies in header framework:
moved http date methods from DateFormatter to HeaderFramework
  changed logging to log4j
- added ftp load access to MultiProtocolURI
- ensured termination of RSS feed iteration

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7067 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-23 11:41:12 +00:00
orbiter
5e7081cd19 refactoring towards a unified loading mechanism for MultiProtocolURIs
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7065 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-23 01:08:56 +00:00
orbiter
7aa860c505 - more logging
- more stability for database heap in case of buffer failure

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7058 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-21 10:16:05 +00:00
orbiter
66ac3a7d9d corrected database row iteration
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7055 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-20 23:33:56 +00:00
orbiter
70dd26ec95 added the new crawl scheduling function to the crawl start menu:
- the scheduler extends the option for re-crawl timing. Many people misunderstood the re-crawl timing feature because that was just a criterion for the url double-check and not a scheduler. Now the scheduler setting is combined with the re-crawl setting and people will have the choice between no re-crawl, re-crawl as was possible so far and a scheduled re-crawl. The 'classic' re-crawl time is set automatically when the scheduling function is selected
- removed the bookmark-based scheduler. This scheduler was not able to transport all attributes of a crawl start and did therefore not support special crawling starts i.e. for forums and wikis
- since the old scheduler was not able to crawl special forums and wikis, the must-not-match filter was statically fixed to all bad pages for these special use cases. Since the new scheduler can handle these filters, it is possible to remove the default settings for the filters
- removed the busy thread that was used to trigger the bookmark-based scheduler
- removed the crontab for the bookmark-based scheduler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7051 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-19 23:52:38 +00:00
orbiter
5a994c9796 added a scheduler based on API actions
- every process that is monitored with the API Steering interface can now be scheduled!
- added input methods in Steering interface to set a scheduling time
- added a view on the steering api that shows only crawl jobs inside the Crawl Profile servlet
- added a scheduling call process in the cleanup process handler that triggers the scheduled processes
As a result, the cleanup now also looks for scheduled processes. Such processes are therefore not executed exactly at
the target execution time; instead they are executed within the cleanup process time window.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7050 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-19 12:13:54 +00:00
orbiter
189a986ebd - modified api-call interface to record api calls with references to api-call database (carries pk)
- added recording date, last execution date and next execution date for a scheduler (scheduler to be implemented next)
- extended database access methods for more data formats, especially for date insert/retrieval
- extended 'Steering' interface to show new database fields
- migrated Steering to new http client
- extended cora http client to transmit authentication and also added some convenience methods (http response code)
- simplified database back-end (not so much specialized methods for multiple properties)
- extended date formatter to produce a special format to show dates in html (&nbsp; in spaces of date format)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7049 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-18 15:56:38 +00:00
orbiter
054c22e2c6 added TLDs from http://www.opennicproject.org
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7047 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-18 10:39:49 +00:00
orbiter
7fdb17bb96 redirect uncaught exceptions to logging + small other changes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7042 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-16 12:33:06 +00:00
orbiter
a82a93f2fc - better url double check in crawler
- more logging for error urls

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7032 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-11 09:54:18 +00:00
orbiter
a835a22b32 fixed isLocal() property (better recognition of intranet hosts)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7028 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-09 11:22:56 +00:00
orbiter
301a59e07f moved browser access method from kelondro/util/OS to gui/framework/Browser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7022 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-05 10:49:58 +00:00
orbiter
6388a58fc7 better memory management and slightly less (in total and temporary) RAM allocation:
- confirm that database objects that are not supposed to grow do not have a index memory management that is designed for growth
- changed index sorting method in such a way that it allocates less objects during quicksort
- database classes renaming (shorter, naming addresses that objects hold in RAM)
- added a large number of asserts to check if objects actually take the RAM that they should have


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7019 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-04 13:33:12 +00:00
orbiter
5924a0d851 - enhanced concurrency in database index access for multicore
- added statistics about database index caches in PerformanceMemory_p.html
- adapted many classes to use the new statistics
- added missing close statements

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7018 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-03 04:58:48 +00:00
orbiter
9ab06bc333 enhancement in sorting efficiency (database root operation): less object allocation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7015 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-03 02:42:28 +00:00
mikeworks
b12db14b9f Added Generics to new net.yacy.upnp.* classes to eliminate compiler warnings
Added @Deprecated for deprecated functions getIPDevices and getPPPDevices in class InternetGatewayDevice
Changed debug statement in Domains.java and corrected filename in comments header

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6993 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-24 13:48:45 +00:00
orbiter
60caade056 removed debug output
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6984 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-22 07:59:22 +00:00
orbiter
dec1419bc3 ;-)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6978 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-18 20:18:32 +00:00
orbiter
22dbbcfa56 better (and corrected) recognition of intranet and internet-addresses. This corrects the isLocal property that is used by network definitions to restrict index ranges to local and global addresses. Address locations (intranet or internet) had been partly identified by the top level domain of the host address. Since intranet addresses can also be addressed using a host name that is in a country domain it is necessary to do a dns resolving for each check. The check is supported by a local dns cache so the intranet/internet check should not affect network traffic too much. To ensure that the cache works properly the cache class was upgraded to better concurrency data structures.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6977 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-18 20:14:20 +00:00
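The idea in commit 22dbbcfa56, classifying a host by its resolved address rather than by its top level domain, can be sketched with java.net.InetAddress. The class name IntranetCheck is hypothetical, the local DNS cache from the commit is not shown, and treating an unresolvable host as "not intranet" is an assumption of this sketch.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Illustrative sketch; not the isLocal() implementation from YaCy.
final class IntranetCheck {
    // Resolves the host and classifies the resulting address instead of looking at the TLD.
    static boolean isIntranet(String host) {
        try {
            // Literal IP addresses are parsed directly; host names trigger a DNS lookup.
            InetAddress address = InetAddress.getByName(host);
            return address.isLoopbackAddress()
                || address.isSiteLocalAddress()   // e.g. 10/8, 172.16/12, 192.168/16
                || address.isLinkLocalAddress()   // e.g. 169.254/16
                || address.isAnyLocalAddress();
        } catch (UnknownHostException e) {
            return false; // assumption: unresolvable hosts are not treated as intranet
        }
    }
}
```

Because every check with a host name costs a DNS lookup, a cache in front of getByName, as the commit describes, keeps the extra network traffic small.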
low012
dc5f0e357c *) fixed SVN properties
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6972 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-18 10:02:03 +00:00
low012
01d6b952f0 *) minor changes for easier to read code, no functional changes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6971 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-18 10:00:43 +00:00
orbiter
25024d6ab2 fix for problem when accessing the metadata index. The index was not available for all peers with no RAM table copy.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6957 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-30 07:22:50 +00:00
orbiter
b6fb239e74 redesign of parser interface:
some file types are containers for several files. These containers had been parsed in such a way that the set of resulting parsed content was merged into one single document before parsing. Using this parser infrastructure it is not possible to parse document containers that contain individual files. An example is a rss file where the rss messages can be treated as individual documents with their own url reference. Another example is a surrogate file which was treated with a special operation outside of the parser infrastructure.
This commit introduces a redesigned parser interface and a new abstract parser implementation. The new parser interface now has only one entry point and always returns a set of parsed documents. For a single document, the parser method returns a set containing one document.
To be compliant with the new interface, the zip and tar parsers have also been completely redesigned. All parsers are now much simpler and cleaner in structure. The switchboard operations have been extended to operate with sets of parsed files, not single parsed files.
Additionally, parsing of jar manifest files has been added.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6955 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-29 19:20:45 +00:00
orbiter
150cf42a1b migrated all my LGPL 3 -licensed files to the LGPL 2.1 because LGPL 3 is not compatible with the GPL 2
see http://www.gnu.org/licenses/license-list.html for explanation
Since (as far as I know) nobody else has ever contributed to these files I may be allowed to just apply an older license.
You may consider this as a dual-licensing and may use and optionally replicate the older files under GPL 3.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6952 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-28 16:25:14 +00:00
orbiter
37b8827a7a - removed the UPnP library sources from sbbi and added the jar library again. The library was included to get support for fedora releases, but after this time the fact that the sbbi cannot be part of fedora should be re-discussed. If this will still not be possible, then we may integrate the sbbi UPnP package using reflection.
- cleaned up the code. The new eclipse helios provided new warnings for dead code. This change cleans up most of these warnings

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6945 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-26 10:32:47 +00:00
orbiter
777195e8d1 more abstraction for access of LoaderDispatcher and cache
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6937 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-22 12:28:53 +00:00
orbiter
7e2d6fac12 patch for bad values during local search join
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6934 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-20 00:31:00 +00:00
orbiter
986d4f34d9 added a consistency check for new queues
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6931 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-18 18:59:42 +00:00
orbiter
fbf021bb50 redesign of index abstract processing - currently disabled until enough peers have fix in SVN 6928
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6929 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-18 09:44:21 +00:00
orbiter
87087f12fe - scanned remote search process and enhanced some data structure and synchronizations here and there
- removed concurrency overhead for small number of index normalizations as it happens during remote search
- removed 'load only parseable' constraint for snippet fetch because some resources have no url file extension and had therefore been treated as not parseable and not searchable, although they may become parseable after loading, when their mime type is known
- this partly fixes some problems with http://forum.yacy-websuche.de/viewtopic.php?p=20300#p20300 but more changes are necessary to get all expected search results

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6926 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-17 11:59:40 +00:00
orbiter
de4f30bb2e UTF-8 fix
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6923 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-16 15:22:31 +00:00
orbiter
3a1cebb598 bugfixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6922 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-16 15:11:21 +00:00
orbiter
51332b787d reverted SVN 6869 as discussed with dulcedo in car after LinuxTag:
missing time-out may be cause of locks during DHT-out

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6920 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-15 20:30:53 +00:00
orbiter
b03caaa57a better handling of OOM situations
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6918 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-15 19:44:05 +00:00
orbiter
60e71876ad - more abstraction (HashMap -> Map)
- more concurrency-awareness (HashMap -> ConcurrentHashMap)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6910 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-01 13:02:11 +00:00
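The two substitutions in commit 60e71876ad (HashMap -> Map for abstraction, HashMap -> ConcurrentHashMap for concurrency) can be shown together in one small sketch. The class HostCounter and its methods are hypothetical and only illustrate the pattern; they are not taken from the YaCy code base.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical example class; not part of YaCy.
final class HostCounter {
    // The field is declared against the Map interface; only this line names the
    // concrete class, so the implementation can be swapped in one place.
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();

    // merge() is performed atomically by ConcurrentHashMap, so concurrent
    // callers do not lose updates.
    void record(String host) {
        counts.merge(host, 1, Integer::sum);
    }

    int count(String host) {
        return counts.getOrDefault(host, 0);
    }
}
```

With a plain HashMap behind the same field, concurrent record() calls could lose updates or corrupt the map.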
orbiter
a83772c71b fixes and enhancements for balancer:
- crawl lists for each domain now uses a HandleSet which should use less memory than LinkedLists
- but: fill more entries into the domain lists (all available entries)
- fixes to selection criteria (best domain selection)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6909 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-01 09:30:23 +00:00
orbiter
9cde05418f fixed url crawl list display
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6908 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-31 00:27:00 +00:00
orbiter
11639aef35 - added new protocol loader for 'file'-type URLs
- it is now possible to crawl the local file system with an intranet peer
- redesign of URL handling
- refactoring: created LGPLed package cora: 'content retrieval api' which may be used externally by other applications without yacy core elements because it has no dependencies to other parts of yacy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6902 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-25 12:54:57 +00:00
orbiter
6950d8a33d fixes to SMB crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6900 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-23 01:17:44 +00:00
orbiter
1defd580bc - added option to localization search to distinguish between a search for a location according to the search word only or for the relation between a web search results and locations found in the metadata fields
- used that to display two layers on map: cities and search result locations
- added many marker graphics for the display of the markers on the map
- some refactoring of the yacy news code plus bugfixes for latest move from Tree to Table data structure

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6889 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-19 12:53:09 +00:00
orbiter
118d589eff replaced the very very old data structure 'Records' with a simple table to fix the problem from
http://forum.yacy-websuche.de/viewtopic.php?p=20066#p20066

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6876 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-15 00:59:02 +00:00
orbiter
2a8f70f0ca - fix for caching of OSM tiles. if you want that this fix applies to your peer, please delete the crawl profiles
- fix for initial generation of crawl profiles (one more reason to remove your crawl profiles)
- more String -> byte[] migration
- more logging for cache store/hit

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6874 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-14 23:50:07 +00:00
orbiter
439b44be9e removed exit from computation in ReferenceContainerArray.get merge method
a warning is still given, but the method continues to compute as in normal operation
see also: http://forum.yacy-websuche.de/viewtopic.php?p=20038#p20038

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6869 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-11 23:36:40 +00:00
orbiter
789c6b26ce added a location search service: using the following servlet/example:
http://localhost:8080/yacysearch_location.kml?query=berlin&maximumTime=2000&maximumRecords=100

This will open any application that can consume kml data (which will probably be google earth) on your computer and display the search result as positions on a map


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6865 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-11 12:58:05 +00:00
orbiter
f23cbd2dab more bugfixes to date parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6864 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-11 11:32:46 +00:00
orbiter
cf43bdc87e This is a large bugfix and enhancement commit to support a better location detection for data
- fixes to http file server session handling
- fixes and enhancements to metadata date/time handling
- added dc:publisher metadata field and updated all document parsers
- fixed bug in metadata read procedure
- enhanced dublin core and rss parser to understand more fields more properly
- enhanced url selection in case that multiple urls are given in surrogates
- fix for condenser; failure when last word does not end with termination symbol

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6863 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-11 11:14:05 +00:00
orbiter
c45117f81f fixed dates in metadata
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6860 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-08 22:09:36 +00:00
orbiter
0a5fd15703 :-(
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6859 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-06 22:06:31 +00:00
orbiter
ac16f582aa fix for http://forum.yacy-websuche.de/viewtopic.php?p=20017#p20017
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6858 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-06 22:04:30 +00:00
orbiter
fc5efcc05a enhanced and fixed OAI-PMH import
- now importing OAI-PMH server list from two sources
- simultaneous import from several servers (even > 2000)
- check buttons on OAI-PMH server list to select multiple servers for import start
- it is possible to select all servers at once for import
- imported XML data is gzipped after import from surrogate reader

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6847 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-30 14:03:51 +00:00
orbiter
455a763d7c performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6845 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-28 08:38:57 +00:00
orbiter
b6cce08019 fixed a bug in rwi storage data size allocation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6843 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-27 22:22:16 +00:00
orbiter
90c3e5d6f6 - cleanup, removed unused imports
- added crawling queue sizes to /api/status_p.xml, syntax same as in queues_p.html
- fixed a bug in queue enumeration that caused an out-of-bounds exception

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6842 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-27 21:47:41 +00:00
orbiter
b18a7606a0 some performance hacks and fixes after reading dump in
http://forum.yacy-websuche.de/viewtopic.php?p=19920#p19920

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6837 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-25 21:37:36 +00:00
orbiter
4cd5418963 removed finalize methods because of a hint in
http://java.sun.com/javase/6/webnotes/trouble/TSG-VM/html/memleaks.html#gbyvh

The finalize method prevents the memory used by objects that define it from being collected promptly by the garbage collector. Instead, such objects are enqueued in a Java-internal finalizer queue. This slows down all operations that use many objects with finalize methods.

This fix does not remove all finalize methods, only those in classes that may be used for throw-away objects that are allocated many times. This should yield better run-time performance and fewer OutOfMemoryErrors

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6835 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-23 09:32:29 +00:00
orbiter
cff8ed134f added index check to prevent blocking in synchronization
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6832 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-22 22:16:38 +00:00
orbiter
b95ae2518b fix for assert
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6829 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-21 17:59:22 +00:00
orbiter
027b971bde fix for concurrent quicksort: catch jobs from ThreadPoolExecutor that had been rejected because of full processing queues.
Uncaught jobs may have been the cause of blocking and freezes under overload during heavy processing

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6827 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-21 13:44:59 +00:00
orbiter
8c40f1cb8e self-healing for broken table files (may cause other problems, but better than nothing)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6826 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-21 11:29:27 +00:00
orbiter
7b69d79727 enhanced remove() operation: in many cases it is not necessary to return the removed object to the caller.
for such cases the delete() operation was introduced which is sometimes much cheaper in operation since it does not need to create objects to hold the removed content and it does not need to read those objects.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6824 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-20 14:47:41 +00:00
orbiter
93ea0a4789 enhanced remove operation in search consequences (which are triggered when the snippet fetch proves that the word has disappeared from the page that was stored in the index)
- no direct deletion of references during search (shifted to time after search)
- bundling of all deletions for the references of a single word into one remove operation
- enhanced remove operation by caring that the collection is stored sorted (experimental)
- more String -> byte[] transition for search word lists
- clean up of unused code
- enhanced memory allocation of RowSet Objects (will use a little bit less memory which was wasted before)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6823 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-20 13:45:22 +00:00
orbiter
7a59012632 fix for NPE
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6822 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-20 07:43:48 +00:00
orbiter
1a6c2f77b4 fix for NPE in statistic servlet
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6821 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-20 00:08:43 +00:00
orbiter
64f29f990e a collection of performance hacks and code cleanup:
- removed usage of URL-Caches which could have been a memory leak
- removed unused classes and methods
- removed not necessary synchronizations
- added synchronization hacks where possible
- fine-tuned crawling speed to prevent IO of balancer
- fixed a bug in IODispatcher that may have caused that no merges were done
- reduced number of parameters in very often called methods (compare methods)
- reduced complexity of data structures of now massively used HandleSet class
- reduction of new String() and getBytes() usage / new methods to support this transition

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6820 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-19 16:42:37 +00:00
orbiter
8b8107b2a3 reduced IO-load and synchronization/blocking
- enhanced the Balancer performance when building new domain stacks using a new Table buffer
- added the new Table buffer BufferedObjectIndex class
- changed order of access to LURL-read (preferring segment over Crawl Queues); this will reduce blocking time on the balancer
- fixed PPM setting in Crawler_p servlet (had doubled values)
- reduced synchronization in IndexCell because it is not necessary: reduced blocking during indexing/merging/dumping
- removed did-you-mean cache in IndexCell because it caused too much overhead and more memory usage but was not very useful. This also reduces deadlocks that could be caused when searches are performed during indexing.




git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6819 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-18 21:55:20 +00:00
orbiter
ed07046870 flush only when > 3000 RWIs present + code cleanup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6817 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-16 16:07:19 +00:00
orbiter
3a50b5aa04 enhanced object hash computation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6816 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-15 14:19:29 +00:00
orbiter
1a8a134e0c continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775 and continued in SVN 6790
The result should be less usage of new String() and lower memory usage (since a String-encapsulated byte[] has 40 bytes overhead)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6815 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-15 13:22:59 +00:00
orbiter
dde394a977 - shifted some computation out of synchronization to allow more concurrency
- removed synchronization where not necessary

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6814 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-14 23:22:06 +00:00
orbiter
f204076d25 removed usage of temporary files: causes too much IO
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6813 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-14 22:17:18 +00:00
orbiter
650be3599f added a time-out to the RWI cache to flush the cache if it has not been written for ten minutes. This additional dump criterion is necessary because some data sources repeat their vocabulary, so the number of words in a RWI may not increase while the number of references in the RWI set increases. Now the RWI Buffer is flushed every 10 minutes, or later if a dump is already ongoing at that time.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6811 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-14 20:30:34 +00:00
orbiter
ff6cf24b80 replaced RowSetArray in ObjectIndexCache with RowSet to reduce complexity in MergeIterator. This complexity caused too much computing overhead when the RowSetArray had become very large.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6810 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-14 19:26:51 +00:00
orbiter
55d8e686ea performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6807 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-13 23:29:55 +00:00
orbiter
2f181d0027 introduced concurrency in HTCACHE storage compression
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6806 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-13 16:22:09 +00:00
orbiter
2e26744f4e more concurrency when normalizing RWI entries + cleanup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6805 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-13 14:47:57 +00:00
orbiter
aa083fc45c try to get a fix for OOM problem in case that there is no real problem with missing memory.
See also http://forum.yacy-websuche.de/viewtopic.php?p=19835#p19835

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6802 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-13 11:39:54 +00:00
orbiter
70e6222978 more concurrency during search requests
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6801 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-13 11:12:36 +00:00
low012
dc93cec3a8 *) Java 1.5 compatibility (see http://forum.yacy-websuche.de/viewtopic.php?f=8&t=2764)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6796 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-13 00:25:46 +00:00
orbiter
67ec58d8e7 search performance enhancement
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6795 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-12 07:31:43 +00:00
hermens
ef467a0303 Another workaround for the second part of http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2770
This should prevent URLs with bad referrer entries from being dropped by transferURL or even crashing the whole Transmission$Chunk


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6792 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-10 13:57:46 +00:00
orbiter
25aef069a6 continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6790 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-08 00:11:32 +00:00
orbiter
1e8e79b9ef redesign of reference hash (URL-hash) parameter hand-over:
pass value as byte[], not as String. This should result in fewer
byte[] <-> String conversions during time-critical tasks.
This redesign is not yet complete, more to come ..

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6775 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-26 18:33:20 +00:00
orbiter
72d8e9897b removed unnecessary cache flush call in backend of BufferedRecords
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6774 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-26 12:44:13 +00:00
orbiter
749ffbd642 - added another catch case for the index dump and index merge process that should cause non-blocking behavior in case that index dump and/or index merge caused any unexpected exception.
- reverted SVN 6766, this is too dangerous (may cause unexpected memory usage) and should not be necessary

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6773 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-26 10:46:40 +00:00
orbiter
312ca5d917 removed flush at end of every rwi entry since this reduces the write performance.
This should speed up RWI cache dump and RWI merge operations and should cause less blocking time during these processes for the indexer.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6771 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-26 10:41:20 +00:00
orbiter
0018163c07 moved table row/column matching method from front-end to back-end
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6770 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-26 10:01:27 +00:00
orbiter
31e29a8831 - removed synchronization during index dump and index cleaning
- added semaphores to synchronize index dump and index cleaning for each process separately

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6767 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-25 07:09:53 +00:00
orbiter
bb63c5d075 using a Pattern object with precompiled regular expressions to apply must-match constraints to search results: should speed up pre-sorting of search results and should cause richer search result sets
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6762 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-23 10:17:28 +00:00
orbiter
90dd197ae7 - no latency for local crawls
- catch interrupted exception during 'fast' crawls in workflow processor

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6759 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-22 09:12:18 +00:00
orbiter
36bd843ece fix for RFC5322 conformance as suggested by Quix0r in http://forum.yacy-websuche.de/viewtopic.php?p=19585#p19585
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6754 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-20 10:23:47 +00:00
orbiter
748abfcffa added patches to prevent yacy-protocol DoS settings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6751 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-19 15:31:15 +00:00
orbiter
e820ed061a avoiding excessive DNS lookups to determine localhost
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6750 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-19 14:28:25 +00:00
orbiter
0f8004f9da enhanced html parser to recognize a href tags inside header tags
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6743 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-11 17:52:07 +00:00
orbiter
1198b9989d bugfixes, more sorttable
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6739 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-10 15:39:36 +00:00
orbiter
ae2f3f000f better handling of table copy abandon .. prevent memory leak
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6734 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-09 13:32:15 +00:00
orbiter
0769517129 added a robots.txt monitor in the crawler monitor submenu
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6733 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-09 11:31:15 +00:00
orbiter
de01fe0e6d fix for bug in url parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6722 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-07 01:33:18 +00:00
orbiter
1bbe14d23f SVN 6716 unfortunately contained parts of the unfinished SMB integration. To fix compile errors the remaining parts of the SMB implementation stub are added with this commit.
This adds the jcifs smb library.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6717 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-05 21:46:22 +00:00
orbiter
884b262130 - added a new Wiki Namespace Navigator
- some redesign of Navigator data structures

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6716 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-05 21:25:49 +00:00
orbiter
270fb38674 - fixed some bugs in Table viewer
- added 'select all' feature in Tables_p
- enhanced ViewFile.html: has now an input field to load arbitrary resources from the web and analyze them (!!!)
- included the ViewFile servlet into the Index Administration menu
- show in ViewFile if resource is in url-db and/or in Web cache
- bugfixes to BEncodedHeap and Tables management

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6713 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-05 15:41:15 +00:00
orbiter
727dd9b193 - fixed a bug in robots.txt parser
- moved storage of robots.txt entries to WorkTables, so it is now possible to browse the robots entries with the table browser

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6710 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-04 11:58:07 +00:00
sixcooler
cd6de83905 next try for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2703
(reverted 6692)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6694 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-23 15:59:58 +00:00
sixcooler
bfe4693e9a fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2703
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6693 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-23 13:46:56 +00:00
orbiter
564927ce72 redesign of CrawlResult data structures because of OOM occurrences during URL deletion processes.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6675 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-16 23:06:04 +00:00
orbiter
30c8185139 fix for sid check
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6673 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-15 23:31:32 +00:00
orbiter
ef62d017e5 integrated session id filtering for crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-15 23:15:17 +00:00
orbiter
d8d9984913 added framework for session id filtering (not ready yet)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6671 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-15 22:30:41 +00:00
orbiter
2bc36de336 - fix for bug in svn 6669
- cleanup

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6670 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-15 22:06:13 +00:00
orbiter
d378ca4604 better handling of concurrency in seed
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6669 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-15 15:57:35 +00:00
orbiter
6538043d89 fix for http://forum.yacy-websuche.de/viewtopic.php?p=19189#p19189
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6668 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-15 15:45:31 +00:00
sixcooler
e071d71f19 fix for yacy-banner-network-values
http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2521

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6659 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-09 18:22:36 +00:00
sixcooler
787b588c33 reverted a part of svn6636:
- didn't work on blobs >2GB
- should be obsolete since svn6651
http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2652&sid=7fa98fd3edfc2a03f26394d545e3e3c1&p=19172#p19172

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6655 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-07 19:32:46 +00:00
lotus
11188cd7eb resource observer now uses the Java 6 method to check for free space. thus, disk observing now needs Java 6 installed.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6652 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-06 18:48:06 +00:00
sixcooler
089877f32c my first commit - hopefully fix for merge problem
- http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2652

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6651 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-05 19:38:00 +00:00
orbiter
d6391f2537 better handling of rewrite cases where the resulting rewrite blob entry is equal in size
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6648 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-04 23:17:47 +00:00
orbiter
ef9473d92c added another sixcooler suggestion: recycle corrupted records
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6647 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-04 16:25:05 +00:00
orbiter
fe78edac32 - view API calls in correct date-order
- execute recorded API calls in date-order

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6646 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-04 15:51:54 +00:00
orbiter
308a973503 refactoring of tables data organisation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6644 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-04 11:26:23 +00:00
orbiter
ada0ce9de3 refactoring of bookmarks: there is a big performance problem in the bookmarks code and furthermore the bookmarks
will lose their leading role for the re-crawl function when the new api tables work. To be prepared for a replacement
of such functions the bookmark class is re-organised.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6637 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-01 22:18:56 +00:00
orbiter
3751ab4ae2 added sixcoolers patch and more checks/removed unnecessary code
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6636 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-01 16:11:00 +00:00
orbiter
d8d8562c59 fill key with zeros during normalization
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6635 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-01 15:40:16 +00:00
orbiter
24060885b6 - added Tables abstraction in data.Tables.java
fix for
http://forum.yacy-websuche.de/viewtopic.php?p=18910#p18910
http://forum.yacy-websuche.de/viewtopic.php?p=18894#p18894
http://forum.yacy-websuche.de/viewtopic.php?p=18814#p18814


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6631 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-29 18:02:09 +00:00
orbiter
7fdf59a77f misc NPE check
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6630 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-29 15:59:24 +00:00
orbiter
4403304957 bugfix for list()
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6616 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-22 16:00:56 +00:00
orbiter
0098e6e859 bugfix for heap iterator
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6610 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-22 10:26:50 +00:00
orbiter
db19a941cf added new image index storage classes (not integrated yet)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6608 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-21 22:12:05 +00:00
orbiter
8ce936bcdd added an api recording function: it shall be possible to record
all operations on YaCy in a database that should make it possible
1) to re-create a setting on fresh peers
2) to transmit a setting from one peer to another
3) to re-create crawl starts after a complete deletion of the index
This functionality will also support
4) scheduled re-crawls (new implementation)
To implement this, a new database structure has been created that stores maps into blob heaps. To encode maps, the b-encoding technique was used (this is the same encoding that torrent files use)
- added a b-encoder
- enhanced the b-decoder
- added a b-encoded map heap data structure
- added a table organisation based on b-encoded heaps
- added a servlet to maintain such tables (see Tables_p.html)
- integrated the servlet into the Advanced Settings menu
- added an api recording based on the new tables

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6606 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-21 22:06:03 +00:00
orbiter
e80e060ca6 - increased thread priority for server threads
- decreased thread priority for crawler threads

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6596 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-19 11:29:22 +00:00
orbiter
f6731c6240 more logging etc.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6589 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-17 00:41:50 +00:00
orbiter
4f1f4863c4 fix for deadlock when initializing a SplitTable with a file of size 0, see also:
http://forum.yacy-websuche.de/viewtopic.php?p=18594#p18594

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6587 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-14 23:03:48 +00:00
orbiter
cc5dcf69ff missing change for last commit
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6585 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-14 14:20:18 +00:00
orbiter
ca1ef9a079 fix for http://forum.yacy-websuche.de/viewtopic.php?p=18584#p18584
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6584 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-14 13:38:14 +00:00
orbiter
938e806182 tried to fix date problem that may have prevented foreign peers from staying in the network
- removed unused code
- removed possibly wrong utc difference correction

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6581 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-13 20:01:46 +00:00
orbiter
5df628a2a4 - added BEncoder class
- added BEncodedHeap class that encodes B data structures and stores that to a heap
- refactoring of MapView, this is now named MapHeap to fit into the naming scheme of the BEncodedHeap

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6579 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-13 16:21:37 +00:00
orbiter
82f57f79e5 more PMD enhancements
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6576 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-13 00:23:07 +00:00
orbiter
a06f7ddb33 more PMD recommendations
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6572 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-12 20:53:19 +00:00
orbiter
eb79ceb3ff update to kelondro data structures
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6571 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-12 15:37:34 +00:00
orbiter
18172451a0 better search computation:
- increased sort limit, now 3000 entries, before: 1000
  this should cause that more results can be shown in case
  of strongly limiting constraints, like domain navigation
- enhanced the sort process
- check against domain navigator bugs
- fix in sort stack
- showing now all navigation pages at first search (not only next page)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6569 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-12 15:01:44 +00:00
orbiter
2113fcd7e5 - fixed usage of isEmpty() which is not available in java 1.5
- increased visibility of some methods

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6564 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-11 12:33:40 +00:00
orbiter
dd459281c8 applied code changes that are recommended by PMD
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6563 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-10 23:09:48 +00:00
lotus
eac2daf2e8 * re-enable DHT if enough memory is available again
* reset threshold on reconfiguration
(thanks to sixcooler)

* display status message in web interface

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6562 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-10 19:04:43 +00:00
orbiter
d77a8f3b3e added some modifications recommended by PMD for better performance
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6560 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-10 01:40:26 +00:00
orbiter
7f20963b41 add-on to last commit
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6556 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-09 00:17:39 +00:00
orbiter
eeca2ded92 fix for http://forum.yacy-websuche.de/viewtopic.php?p=18500#p18500
- catch uncaught OOM
- less wasting of memory

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6555 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-09 00:08:16 +00:00
lotus
32972139af added nice configuration for the resource observer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6554 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-07 17:19:50 +00:00
hermens
574f49903e Prevent blob merge from possibly losing the last container
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6549 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-05 01:34:46 +00:00
orbiter
83d05e9176 added sixcoolers hack with some modifications:
http://forum.yacy-websuche.de/viewtopic.php?p=15004#p15004
old index blobs where deletions have been made because of DHT transmission should be melted down to new blobs. This uses sixcoolers methods from the forum thread but modifies the process in such a way that the blobs are not merged with themselves but simply rewritten to smaller files.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6548 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-04 18:12:03 +00:00
orbiter
d0b7bf9ca2 added a decoder class for Bencoding
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6544 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-03 22:44:09 +00:00
low012
028657f019 *) adding more SVN properties
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6542 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-28 13:11:07 +00:00
low012
82d740050f *) adding more SVN properties
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6541 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-28 12:26:50 +00:00
low012
e04cb8cef0 *) adding more SVN properties
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6540 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-28 12:16:40 +00:00
low012
dcb1096fb0 *) adding more SVN properties
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6539 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-28 11:22:32 +00:00
low012
7d610e0063 *) minor changes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6538 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-28 11:20:34 +00:00
lotus
9bee0ac780 more logging for DHTrule
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6533 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-21 14:02:00 +00:00