69 Commits

Author SHA1 Message Date
orbiter
0c762daf4b better startup failure handling
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1205 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-12-12 23:59:58 +00:00
orbiter
9d9a87f445 limited htcache storage length
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1096 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-16 18:40:44 +00:00
orbiter
79818a320f introduced citation-rank transmission protocol and activate transport for anonymisation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1055 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-10 23:48:20 +00:00
theli
fb766413d1 *) Changes on httpc dns caching
   - Bugfix: the old dns cache did not handle case-insensitive hostnames correctly.
   - adding a possibility to set domain name patterns defining hostnames that should not be cached by the httpc dns cache
     e.g. borg-300.dyndns.org
     This can be done by setting the new httpc.nameCacheNoCachingPatterns property
   - using httpc.dnsResolve wherever possible within the source code
     [httpd.java,plasmaCrawlStacker.java]

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1044 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-07 10:57:54 +00:00
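The dns caching changes in the entry above can be illustrated with a minimal sketch. It assumes the cache is keyed by the lower-cased hostname and that the no-caching patterns are regular expressions matched against the whole hostname. The class `DnsCacheSketch` and its methods are hypothetical names chosen for illustration and are not taken from httpc.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical sketch of a hostname cache, not the actual httpc code.
final class DnsCacheSketch {
    private final Map<String, InetAddress> cache = new ConcurrentHashMap<>();
    private final List<Pattern> noCachePatterns;

    // noCachePatterns: assumed to be regular expressions for hostnames
    // that must be resolved freshly on every lookup (e.g. dynamic DNS names).
    DnsCacheSketch(List<Pattern> noCachePatterns) {
        this.noCachePatterns = noCachePatterns;
    }

    // Hostnames are case-insensitive, so every key is lower-cased first.
    static String normalize(String host) {
        return host.trim().toLowerCase(Locale.ROOT);
    }

    // A host is cacheable unless one of the configured patterns matches it.
    boolean isCacheable(String host) {
        String key = normalize(host);
        for (Pattern p : noCachePatterns) {
            if (p.matcher(key).matches()) return false;
        }
        return true;
    }

    // Returns the cached address if present, otherwise resolves the host
    // and stores the result only when the host is allowed to be cached.
    InetAddress resolve(String host) throws UnknownHostException {
        String key = normalize(host);
        InetAddress cached = cache.get(key);
        if (cached != null) return cached;
        InetAddress address = InetAddress.getByName(key);
        if (isCacheable(key)) cache.put(key, address);
        return address;
    }
}
```

With a pattern such as `.*\.dyndns\.org`, a lookup of `Borg-300.DynDNS.org` would be resolved on every call, while `WWW.Example.ORG` and `www.example.org` would share one cache entry.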
orbiter
bc420c62f6 fixed htcache path generation (never change a running system)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1041 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-07 01:31:11 +00:00
borg-0300
72cde1d894 getCachePath: no logging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1033 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-04 22:47:13 +00:00
borg-0300
1fbd72f9e0 rename "index.html" to "ndx"
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1032 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-04 22:39:33 +00:00
borg-0300
cd1107d85e added support for URLs with '?&'
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1030 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-04 17:25:15 +00:00
borg-0300
5fb2b017cb small change
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1029 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-04 16:37:56 +00:00
theli
b8ceb1ffde *) Adding better https support for crawler
   - solving problems with unknown certificates by implementing a dummy trust manager
   - adding https support to robots-parser 
   - Seed File can now be downloaded from https resources
   - adapting plasmaHTCache.java to support https URLs properly

*) URL Normalization
   - sub URLs are now normalized properly during indexing
   - pointing urlNormalForm function of plasmaParser to htmlFilterContentScraper function
   - normalizing URLs which were received by a crawlOrder request

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1024 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-03 15:28:37 +00:00
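A "dummy trust manager" as mentioned in the entry above is usually a trust manager that accepts every certificate chain. The following is a minimal sketch of that general technique using the standard `javax.net.ssl` API. The class name `DummyTrustSketch` is hypothetical, and the actual crawler code may differ. Note that such a manager disables certificate validation entirely, so it only suits cases like crawling public pages where the server's identity is not relied upon.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical sketch: an SSL context that trusts any server certificate.
final class DummyTrustSketch {

    // Accepts every certificate chain without performing any check.
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // intentionally empty: no validation
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // intentionally empty: no validation
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    };

    // Builds an SSLContext whose only trust manager is TRUST_ALL.
    static SSLContext newContext() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] { TRUST_ALL }, null);
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not create SSL context", e);
        }
    }
}
```

The socket factory of the returned context (`newContext().getSocketFactory()`) could then be used for https connections to hosts with self-signed or otherwise unknown certificates.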
borg-0300
a803a509ae bugfix: port handling in HTCache
program flow cleaned up


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1021 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-03 12:39:24 +00:00
theli
9a5ab62928 *) Adding yacy specific X-YACY-Index-Control header which can be used by clients
   to prevent yacy from indexing the response that belongs to a request where
   X-YACY-Index-Control is set to "no-index"

*) Bugfix for Seed-List download via Remote Proxy.
   Now the pragma and cache-control http headers of the request are properly set to "no-cache" 
   See: http://www.yacy-forum.de/viewtopic.php?p=11639#11639

*) Bugfix for http-Proxy
   yacy ignored the "no-cache" pragma and cache-control http headers that were sent in requests.
   Now, these request headers are evaluated properly

TODO: Missing evaluation of "no-store" request headers

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@971 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-23 10:35:05 +00:00
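The request-header handling described in the entry above can be sketched as two small checks. This is a minimal illustration under stated assumptions: headers are held in a case-insensitive map, and a `Cache-Control` value may carry several comma-separated directives. The class `RequestHeaderSketch` and its method names are hypothetical. Only the header names and the values "no-index" and "no-cache" come from the commit message.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of request-header evaluation, not the actual proxy code.
final class RequestHeaderSketch {

    // Builds a header map with case-insensitive keys from name/value pairs.
    static Map<String, String> headers(String... keyValues) {
        Map<String, String> map = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            map.put(keyValues[i], keyValues[i + 1]);
        }
        return map;
    }

    // The response may be indexed unless the client sent
    // X-YACY-Index-Control: no-index with the request.
    static boolean mayIndex(Map<String, String> request) {
        String value = request.get("X-YACY-Index-Control");
        return value == null || !value.trim().equalsIgnoreCase("no-index");
    }

    // True if the client asked for a fresh copy via Pragma or Cache-Control.
    static boolean demandsFreshCopy(Map<String, String> request) {
        return hasDirective(request.get("Pragma"), "no-cache")
            || hasDirective(request.get("Cache-Control"), "no-cache");
    }

    // Checks a comma-separated header value for one exact directive.
    private static boolean hasDirective(String value, String directive) {
        if (value == null) return false;
        for (String part : value.split(",")) {
            if (part.trim().toLowerCase(Locale.ROOT).equals(directive)) return true;
        }
        return false;
    }
}
```

The "no-store" directive named in the TODO is deliberately not handled here, since the commit states that its evaluation was still missing.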
borg-0300
58b670201d changing HTCacheSize no longer requires a restart
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@961 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-19 17:59:54 +00:00
theli
40777556c5 *) Connection Tracking
   - adding automatic refresh
   - accepts new parameter nameLookup which can be used to deactivate 
     yacy-peer name lookup (because we have problems with this on large seed-dbs)

*) ViewFile
   New page that can be used to view 
   - original content 
   - plain text content 
   - parsed content
   - parsed sentences 
   of a webpage specified by its url hash
   Mainly for debugging purposes at the moment

*) Robots.txt 
   Bugfix for if-modified-since usage
   TODO: synchronization of downloads to avoid loading the same robots-file 
   multiple times in parallel by different threads

*) Shutdown
   Better abortion of transferRWI and transferURL sessions on server shutdown

*) Status Page
   Adding icon to start/stop crawling via status page

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@950 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-18 07:45:27 +00:00
theli
a2fa75e688 *) Asynchronous queuing of crawl job URLs (stackCrawl)
   various checks like the blacklist check or the robots.txt disallow check are now
   done by a separate thread to unburden the indexer thread(s)
   TODO: maybe we have to introduce a threadpool here if it turns out that this single
         thread is a bottleneck because of the time-consuming robots.txt downloads

*) improved index transfer
   The index selection and transmission is done in parallel now to improve index 
   transfer performance.
   TODO: maybe we could speed up performance by using multiple transmission threads in
         parallel instead of only a single one.

*) gzip encoded post requests
   it is now configurable whether a gzip encoded post request should be sent on
   index transfer/distribution

*) storage Peer (very experimental and not optimized yet)
   Now it's possible to send the result of the yacy indexer thread to a remote peer
   instead of storing the indexed words locally.
   This can be done by setting the property "storagePeerHash" in the yacy config file
   - Please note that if the index transfer fails, the index is stored locally.
   - TODO: currently this index transfer is done by the indexer thread.
     To speed up the indexer
     a) this transmission should be done in parallel and
     b) multiple chunks should be bundled and transferred together


*) general performance improvements  
   - better memory cleanup after http request processing has finished
   - replacing some string concatenations with stringBuffers
   - replacing BufferedInputStreams with serverByteBuffer
   - replacing vectors with arraylists wherever possible
   - replacing hashtables with hashmaps wherever possible
   This was done because function calls to vector or hashtable functions
   take 3 times longer than calls to functions of arraylists or hashmaps.
   TODO: we should take a look at the class serverObject which is inherited from hashmap
         Do we really need a synchronization for this class?
   TODO: replace arraylists with linkedLists if random access to the list elements is not needed

*) Robots Parser supports if-modified-since downloads now
   If the downloaded robots.txt file is older than 7 days the robots parser tries to
   download the robots.txt with the if-modified-since header to avoid unnecessary downloads
   if the file was not changed. Additionally the ETag header is used to detect changes.

*) Crawler: better handling of unsupported mimeTypes + FileExtension

*) Bugfix: plasmaWordIndexEntity was not closed correctly in 
   - query.java
   - plasmaswitchboard.java

*) function minimizeUrlDB added to yacy.java 
   this function tests the current urlHashDB for unused urls
   ATTENTION: please don't use this function at the moment because
              it causes the wordIndexDB to flush all words into the
              word directory!

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@853 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-05 10:45:33 +00:00
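The robots parser item in the entry above (re-download after 7 days using if-modified-since, with ETag as an additional change indicator) can be sketched as follows. This is a minimal illustration, assuming the stored ETag is sent back in an `If-None-Match` header, which the commit message does not state explicitly. The class `RobotsRefreshSketch` and its methods are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of conditional robots.txt re-download logic.
final class RobotsRefreshSketch {
    // Age after which a stored robots.txt is revalidated (7 days per the commit).
    static final Duration MAX_AGE = Duration.ofDays(7);

    // HTTP date format, e.g. "Wed, 05 Oct 2005 10:45:33 GMT".
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                         .withZone(ZoneOffset.UTC);

    // True if the stored copy is older than MAX_AGE and should be revalidated.
    static boolean needsRevalidation(Instant loadedAt, Instant now) {
        return Duration.between(loadedAt, now).compareTo(MAX_AGE) > 0;
    }

    // Builds the conditional request headers; a server that still has the
    // same file can then answer 304 Not Modified without sending a body.
    static Map<String, String> conditionalHeaders(Instant lastModified, String etag) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (lastModified != null) {
            headers.put("If-Modified-Since", HTTP_DATE.format(lastModified));
        }
        if (etag != null) {
            headers.put("If-None-Match", etag); // assumption: ETag is sent back this way
        }
        return headers;
    }
}
```

A caller would first check `needsRevalidation` and, only when it returns true, issue the request with the headers from `conditionalHeaders`, keeping the stored copy on a 304 response.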
orbiter
7fc822a59b changed handling of time-zones
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@801 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-27 16:28:55 +00:00
orbiter
495bc8bec6 removed cache-control from low and medium priority caches which reduces memory use and computation overhead
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@774 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-22 20:01:26 +00:00
orbiter
71a31f0902 integrated and extended new memory performance menu; found and fixed bug in DHT caching
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@752 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-20 10:54:20 +00:00
orbiter
fb52a82008 added new performance page for memory settings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@751 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-20 10:10:34 +00:00
borg-0300
8260128ee9 changed getFreeSize();
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@675 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-07 11:22:41 +00:00
borg-0300
0a57fbcde5 Added new HashSet filesInUse;
Added new Function getFreeSize();

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-07 09:37:00 +00:00
borg-0300
da9c6857fb *) fixed a misunderstanding, no BUG ;)
*) finals and other

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@668 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-06 14:17:53 +00:00
borg-0300
81cb8feb15 back to 649 :/
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@651 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-04 22:03:44 +00:00
borg-0300
5194511e8e *) attempt to find bug
See: http://www.yacy-forum.de/viewtopic.php?t=1121

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@650 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-04 19:08:51 +00:00
borg-0300
7626823519 BUGFIX for last 'commit'
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@635 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-01 23:43:27 +00:00
borg-0300
971756e8dd the delete size is smaller
See: http://www.yacy-forum.de/viewtopic.php?t=1084

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@634 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-01 23:35:00 +00:00
borg-0300
cc493ef8c1 Added change from Hermes
See: http://www.yacy-forum.de/viewtopic.php?t=1050

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@629 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-01 11:18:41 +00:00
borg-0300
c1d7527929 better cache cleanup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@621 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-31 13:07:08 +00:00
theli
4fd5b95b1f *) Renaming Logger function names to reflect the proper Java Logging API Loglevels
   - please use logFine instead of logDebug
   - please use logSevere instead of logFailure and logError
   See: http://www.yacy-forum.de/viewtopic.php?p=8726#8726

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@615 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-30 21:32:59 +00:00
theli
6adf8a4bde *) Renaming Logger function names to reflect the proper Java Logging API Loglevels
   - please use logFine instead of logDebug
   - please use logFailure instead of logError
   See: http://www.yacy-forum.de/viewtopic.php?p=8726#8726

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@614 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-30 21:10:39 +00:00
theli
cc1df08069 *) Adding missing synchronized blocks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@608 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-30 14:57:32 +00:00
borg-0300
bf14e6def5 *) proxyCache, proxyCacheSize can be changed under 'Proxy Indexing'
   - paths are now absolute
*) move path check from plasmaHTCache to plasmaSwitchboard
   - only one path check when starting
*) small other

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@606 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-30 12:50:30 +00:00
theli
0c8a48e2cb *) converting php Session ID to lower case in function isCGI
See: http://www.yacy-forum.de/viewtopic.php?p=7671#7671

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@552 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-17 05:50:18 +00:00
theli
4654eae4e2 *) adding php Session ID to argument in function isCGI
See: http://www.yacy-forum.de/viewtopic.php?p=7671#7671

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@546 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-16 11:33:31 +00:00
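The two isCGI entries above (r546 and r552) suggest a substring check on a lower-cased URL, where a marker written in upper case would never match. The sketch below illustrates that reading. It assumes the URL is lower-cased before comparison, and the marker list is hypothetical except for the php session ID. `CgiCheckSketch` is not the actual class that contains isCGI.

```java
import java.util.Locale;

// Hypothetical sketch of a dynamic-URL check, not the original isCGI code.
final class CgiCheckSketch {
    // Markers must be lower case because they are compared against the
    // lower-cased URL; "phpsessid=" is the php session ID from the commits,
    // the other entries are assumed examples.
    private static final String[] MARKERS = { ".cgi", "/cgi-bin/", "phpsessid=" };

    // True if the URL looks dynamically generated and session-bound.
    static boolean isCGI(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (lower.contains(marker)) return true;
        }
        return false;
    }
}
```

With this arrangement, `http://example.org/forum/index.php?PHPSESSID=abc` is recognized regardless of how the parameter name is capitalized in the URL.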
orbiter
cd10370992 several bugfixes and dht selection / logging improvement
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@531 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-14 00:57:30 +00:00
orbiter
3610fe6b3a see http://www.yacy-forum.de/viewtopic.php?p=7410#7410
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@530 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-13 22:04:18 +00:00
orbiter
f5259f29e8 word cache behaviour fix and other fixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@519 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-11 23:33:19 +00:00
orbiter
91163db52e fix for more time-related problems in proxy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@486 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-03 00:52:32 +00:00
orbiter
fb6f238d70 fix for expires-problem
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@485 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-03 00:28:12 +00:00
orbiter
e84a177c49 many bugfixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@475 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-02 02:18:01 +00:00
orbiter
36707586c7 filtering of jsessionid
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@447 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-27 23:20:17 +00:00
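One plausible reading of "filtering of jsessionid" in the entry above is that the `;jsessionid=...` path parameter is removed from URLs so that the same page is not stored once per session. The sketch below shows that approach under this assumption. The commit could equally mean that such URLs are rejected outright, and `SessionIdFilterSketch` is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: strip the servlet session ID from a URL.
final class SessionIdFilterSketch {
    // Matches ";jsessionid=<value>" up to the next '?', '#' or ';'.
    private static final Pattern JSESSIONID =
        Pattern.compile(";jsessionid=[^?#;]*", Pattern.CASE_INSENSITIVE);

    // Returns the URL without any jsessionid path parameter.
    static String stripJSessionId(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }
}
```

For example, `http://example.org/shop/list.jsp;jsessionid=ABC123?page=2` would become `http://example.org/shop/list.jsp?page=2`, while URLs without the parameter pass through unchanged.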
orbiter
ad90f0ad13 activated RWI distribution to DHT for senior peers (default redundancy 3), necessary now for network growth
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@438 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-27 12:51:00 +00:00
orbiter
3470a72d48 fixed div by zero, set default delays, fixed release number format and display
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@435 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-26 11:47:50 +00:00
orbiter
be1f324fca performance setting for remote indexing configuration and latest changes for 0.39
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@424 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-22 13:56:19 +00:00
orbiter
c64970fa47 re-implemented proxy-busy-check and fixed some other things
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@421 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-21 11:17:04 +00:00
orbiter
51962d55bf added 'PPM', page-per-minute statistics
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@405 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-18 00:44:51 +00:00
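The 'PPM' figure from the entry above is a simple rate. A minimal sketch of the arithmetic follows, assuming it is computed from a page count and an elapsed time in milliseconds. `PpmSketch` and its method are hypothetical names, and the guard against a zero interval is an assumption added to keep the division safe.

```java
// Hypothetical sketch of a pages-per-minute calculation.
final class PpmSketch {
    // pages: number of pages processed; elapsedMillis: time span covered.
    // Returns 0 when no time has elapsed, to avoid a division by zero.
    static long pagesPerMinute(long pages, long elapsedMillis) {
        if (elapsedMillis <= 0) return 0;
        return pages * 60_000L / elapsedMillis;
    }
}
```

For instance, 30 pages processed in 120000 milliseconds (two minutes) gives 30 * 60000 / 120000 = 15 PPM.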
orbiter
2f0d7ea8d3 removed htcache statuses (superfluous now)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@396 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-09 00:33:34 +00:00
orbiter
419f8fb398 fixed bugs/missing code regarding new crawl stack
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@384 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-07 01:38:49 +00:00
orbiter
858cd94299 replaced indexing ram-queue by file-based stack-queue
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@381 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-06 14:48:41 +00:00
orbiter
1e7f062350 many bugfixes, memory leak fixes, performance enhancements; new kelondroHashtable; activated snippets
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@313 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-06-23 02:07:45 +00:00