Commit Graph

295 Commits

Author SHA1 Message Date
orbiter
2851658c2a re-integrated Martin's last change to crawl stacker from svn 882 that I had deleted accidentally
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@888 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-09 16:11:41 +00:00
orbiter
c83594528c integrated crawl stacker into thread control
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@887 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-09 15:59:09 +00:00
theli
959eefbc4f *) Robots.txt parser/ppt
   cutting off comments at the line end
*) Adding Threadpool for stackCrawl Thread to speed up robots.txt download
   and double url checks

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@882 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-09 04:43:07 +00:00
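
The comment handling named in this commit can be illustrated with a minimal sketch. It assumes that a '#' starts a comment that runs to the end of the line, as is conventional for robots.txt; the class and method names are hypothetical and not taken from the YaCy sources.

```java
// Hypothetical helper, not YaCy code: strips a trailing '#' comment from one robots.txt line.
final class RobotsLineCleaner {

    // Returns the line content before the first '#', with surrounding whitespace removed.
    static String stripComment(String line) {
        int hash = line.indexOf('#');
        String content = (hash >= 0) ? line.substring(0, hash) : line;
        return content.trim();
    }

    public static void main(String[] args) {
        // prints "Disallow: /private/"
        System.out.println(stripComment("Disallow: /private/ # keep crawlers out"));
        // A full-line comment leaves nothing to parse; prints "true"
        System.out.println(stripComment("# only a comment").isEmpty());
    }
}
```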
allo
f65c939a60 userDB Auth
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@874 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-07 13:49:07 +00:00
orbiter
1a5d98cd6d better imagePainter example and fix for typo http://www.yacy-forum.de/viewtopic.php?p=10920#10920
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@868 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-06 11:51:35 +00:00
orbiter
f6cf3967de fix for compile-bug in svn 583 (Martin, please check whether this is correct: FIFO or FILO stack?)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@854 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-05 12:21:30 +00:00
theli
a2fa75e688 *) Asynchronous queuing of crawl job URLs (stackCrawl)
   various checks like the blacklist check or the robots.txt disallow check are now
   done by a separate thread to unburden the indexer thread(s)
   TODO: maybe we have to introduce a threadpool here if it turns out that this single
         thread is a bottleneck because of the time-consuming robots.txt downloads

*) improved index transfer
   The index selection and transmission is done in parallel now to improve index 
   transfer performance.
   TODO: maybe we could speed up performance by using multiple transmission threads in
         parallel instead of only a single one.

*) gzip encoded post requests
   it is now configurable whether a gzip encoded post request should be sent on
   index transfer/distribution

*) storage Peer (very experimental and not optimized yet)
   Now it's possible to send the result of the yacy indexer thread to a remote peer
   instead of storing the indexed words locally.
   This can be done by setting the property "storagePeerHash" in the yacy config file
   - Please note that if the index transfer fails, the index is stored locally.
   - TODO: currently this index transfer is done by the indexer thread.
     To speed up the indexer
     a) this transmission should be done in parallel and
     b) multiple chunks should be bundled and transferred together


*) general performance improvements  
   - better memory cleanup after http request processing has finished
   - replacing some string concatenations with stringBuffers
   - replacing BufferedInputStreams with serverByteBuffer
   - replacing vectors with arraylists wherever possible
   - replacing hashtables with hashmaps wherever possible
   This was done because function calls to vector or hashtable functions
   take 3 times longer than calls to functions of arraylists or hashmaps.
   TODO: we should take a look at the class serverObject which is inherited from hashmap
         Do we really need synchronization for this class?
   TODO: replace arraylists with linkedLists if random access to the list elements is not needed

*) Robots Parser supports if-modified-since downloads now
   If the downloaded robots.txt file is older than 7 days the robots parser tries to
   download the robots.txt with the if-modified-since header to avoid unnecessary downloads
   if the file was not changed. Additionally the ETag header is used to detect changes.

*) Crawler: better handling of unsupported mimeTypes + FileExtension

*) Bugfix: plasmaWordIndexEntity was not closed correctly in 
   - query.java
   - plasmaswitchboard.java

*) function minimizeUrlDB added to yacy.java 
   this function tests the current urlHashDB for unused urls
   ATTENTION: please don't use this function at the moment because
              it causes the wordIndexDB to flush all words into the
              word directory!

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@853 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-05 10:45:33 +00:00
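
The asynchronous queuing described in the first item of this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the stackCrawl implementation: the class `AsyncCrawlStacker`, its methods, and the host-based blacklist are all hypothetical, and only the blacklist check is modelled (no robots.txt download).

```java
import java.net.URI;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: callers enqueue URLs and return at once,
// while a single worker thread runs the (possibly slow) checks.
final class AsyncCrawlStacker {
    private static final String POISON = "";   // sentinel that stops the worker
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final List<String> accepted = new CopyOnWriteArrayList<>();
    private final Thread worker;

    AsyncCrawlStacker(Set<String> blacklistedHosts) {
        worker = new Thread(() -> {
            try {
                for (String url = pending.take(); !url.equals(POISON); url = pending.take()) {
                    // Blacklist check happens here, off the caller's thread.
                    if (!blacklistedHosts.contains(URI.create(url).getHost())) {
                        accepted.add(url);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "stackCrawl");
        worker.start();
    }

    // Non-blocking for the caller: the queue is unbounded.
    void enqueue(String url) {
        pending.add(url);
    }

    // Lets the worker drain the queue, waits for it, and returns the accepted URLs.
    List<String> shutdownAndGet() {
        pending.add(POISON);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return accepted;
    }
}
```

If the single worker became the bottleneck that the TODO anticipates, the same queue could be drained by several workers instead of one.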
orbiter
6d5d0ac801 bugfix for startup problems
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@850 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-05 00:52:55 +00:00
orbiter
0c3a20d44f more + changed log for better understanding of outOfMemory bug and others
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@846 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-04 00:28:59 +00:00
theli
0fd9aa6c6e *) Bugfix: supportedFileExt Function didn't detect the file extension correctly because of missing conversion to lower case
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@837 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-03 10:48:41 +00:00
theli
8a33c9b309 *) Bugfix: supportedFileExt Function didn't detect the file extension correctly if there was a dot
in one of the parent directories of the file.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@836 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-03 10:21:13 +00:00
theli
28c5687ff9 *) Bugfix for "download of non supported file content" via crawler
See: http://www.yacy-forum.de/viewtopic.php?p=10724#10724

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@835 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-03 08:45:39 +00:00
theli
2b3f964037 *) Bugfix: supportedFileExt Function didn't chop http parameters before trying to detect the file extension
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@834 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-03 08:42:55 +00:00
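
The supportedFileExt fixes in this and the neighbouring commits name three pitfalls: HTTP parameters, dots in parent directories, and upper-case extensions. A minimal sketch that handles all three is shown here; `FileExtension.of` is a hypothetical helper that takes the path-and-query part of a URL, and it is not the YaCy function itself.

```java
import java.util.Locale;

// Hypothetical helper, not YaCy code: extracts a normalized file extension.
final class FileExtension {

    // Expects the path-and-query part of a URL, e.g. "/a/b/file.PDF?x=1".
    static String of(String pathAndQuery) {
        String path = pathAndQuery;
        int cut = path.indexOf('?');                 // chop HTTP parameters first
        if (cut >= 0) path = path.substring(0, cut);
        cut = path.indexOf('#');                     // and any fragment
        if (cut >= 0) path = path.substring(0, cut);

        // Look only at the last path segment so dots in parent directories are ignored.
        String file = path.substring(path.lastIndexOf('/') + 1);
        int dot = file.lastIndexOf('.');
        if (dot < 0 || dot == file.length() - 1) return "";

        // Lower-case so that "PDF" and "pdf" compare equal.
        return file.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(of("/files/Report.PDF"));        // prints "pdf"
        System.out.println(of("/index.php?file=a.zip"));    // prints "php"
        System.out.println(of("/docs/v1.2/readme").isEmpty()); // prints "true"
    }
}
```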
allo
ff1d3d0680 Init of userDB
Page layout of User_p.html

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@822 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-30 13:48:26 +00:00
orbiter
9c4306e41e fixed problem with htcache path
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@811 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-29 00:24:09 +00:00
orbiter
1669eaaa1a fixed svn 805
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@807 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-28 14:47:57 +00:00
borg-0300
ca82d690a9 changed one line too many in SVN 805
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@806 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-28 13:58:42 +00:00
borg-0300
4bb1f849a0 Bugfix for http://www.yacy-forum.de/viewtopic.php?t=1233
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@805 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-28 13:49:57 +00:00
orbiter
2c7b490e30 memory-logging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@804 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-28 00:52:54 +00:00
orbiter
7fc822a59b changed handling of time-zones
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@801 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-27 16:28:55 +00:00
theli
9b7f37fc37 *) Minor changes
   - more debugging output: storageTime for indexed document is logged now
   - saving memory in plasmaParserDocument.java, plasmaWordIndexEntryContainer.java (not a big deal)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@798 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-27 07:10:24 +00:00
theli
b5a8992d29 *) Setting some object fields to final
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@796 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-26 09:39:54 +00:00
theli
023be89586 *) Bugfix for "Robots.txt wird immer wieder geladen" ("robots.txt is loaded again and again")
See: http://www.yacy-forum.de/viewtopic.php?p=10241#10233

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@794 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-26 08:05:59 +00:00
theli
35c6c5ead7 *) Bugfix for "Blacklist und Crawlen" ("Blacklist and crawling") bug:
   Crawling continues even if URL is listed in Blacklist
   See: http://www.yacy-forum.de/viewtopic.php?p=10279#10279
   - missing return statement added. Thanks to allo for the
     code review.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@793 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-26 06:51:11 +00:00
orbiter
9e2fc7e5fe load balancing of crawl target domains
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@791 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-25 01:09:21 +00:00
orbiter
3fcc95a82c integrated crawl-profiles db in memory-performance monitor
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@788 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-24 00:33:27 +00:00
theli
fe6a6abc0b *) Adding robots.txt db to Performance Settings for Memory menu
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@785 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-23 01:31:29 +00:00
orbiter
3274ae725e increased cache size of robots database; however, this should be integrated into new memory control
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@784 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-23 00:37:31 +00:00
orbiter
c6d2f50375 changed order of robots and double-check
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@783 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-23 00:18:08 +00:00
orbiter
68d5ff2ef1 added stringbuffer in condenser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@782 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-22 23:43:45 +00:00
orbiter
495bc8bec6 removed cache-control from low and medium priority caches which reduces memory use and computation overhead
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@774 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-22 20:01:26 +00:00
orbiter
18d9e1a256 fix for http://www.yacy-forum.de/viewtopic.php?p=10026#10026
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@768 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-21 21:56:39 +00:00
orbiter
07f30931ec various configuration options in memory performance
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@763 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-21 14:21:45 +00:00
theli
b990dc1ad1 *) Replacing jsch 0.1.19 lib with newer version 0.1.21
*) Replacing PDFBox 0.7.1 lib with newer version 0.7.2
*) Refactoring of classes httpd/httpc/httpHeaders to
   make many methods for httpHeader/Requestline parsing
   reusable for new icap implementation
*) adding chunked input stream support
   - needed by new icap implementation
   - needed by future httpc HTTP/1.1 support 
*) httpd.java
   - moving all connection property constants to class httpHeader
   - moving readHeader function to class httpHeader
   - moving parseQuery function to class httpHeader
   - moving handleTransparentProxy function to class httpHeader
*) httpHeader.java
   - adding new function to parse the http response line
   - adding new function to convert http headers to a string that
     can be sent to the client
   - adding a function that generates a proper url using all parsed
     connection properties
*) ICAP Support
   - yacy now supports handling of icap response modification requests
   - this feature can be used by other icap enabled proxies to contact
     yacy as icap server, and to hand over the downloaded content to yacy
     for indexing
   - functionality was successfully tested with squid 2.5Stable 10 + icap patch
   - further icap services e.g. URL filtering based on yacy's blacklists are possible
*) plasmaSwitchboard.java
   - htcache entries that are still needed for indexing are now properly registered 
     as in use after system restart
   - extended logging: log message now shows parsing and indexing time for each sb. entry
    

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@757 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-20 21:49:47 +00:00
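
The chunked input stream support mentioned in this commit refers to the HTTP/1.1 chunked transfer coding, in which each chunk is preceded by its size in hexadecimal and a zero-size chunk ends the body. A minimal decoder sketch is given here; `ChunkedDecoder` is a hypothetical class written for illustration, it reads the whole body into memory, and it discards chunk extensions and trailer headers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch of HTTP/1.1 chunked decoding, not the YaCy implementation.
final class ChunkedDecoder {

    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            int semi = sizeLine.indexOf(';');          // drop chunk extensions, if any
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int size = Integer.parseInt(sizeLine.trim(), 16);
            if (size == 0) break;                      // last chunk

            byte[] chunk = new byte[size];
            int off = 0;
            while (off < size) {                       // read() may return fewer bytes than asked
                int n = in.read(chunk, off, size - off);
                if (n < 0) throw new IOException("unexpected end of stream inside chunk");
                off += n;
            }
            body.write(chunk, 0, size);
            readLine(in);                              // consume the CRLF after the chunk data
        }
        while (!readLine(in).isEmpty()) {
            // skip trailer headers up to the terminating empty line
        }
        return body.toByteArray();
    }

    // Convenience wrapper for in-memory input; converts IOException to an unchecked one.
    static byte[] decodeBytes(byte[] raw) {
        try {
            return decode(new ByteArrayInputStream(raw));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads up to the next LF and drops CR; returns "" at end of stream.
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) >= 0 && c != '\n') {
            if (c != '\r') sb.append((char) c);
        }
        return sb.toString();
    }
}
```

For example, decoding the bytes of `"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"` yields the bytes of `Wikipedia`.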
borg-0300
6d1de8abfd finals; cleaned;
Properties;

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@756 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-20 15:43:31 +00:00
orbiter
14bc880fa4 fixed bug with crashed profile database
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@753 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-20 11:20:29 +00:00
orbiter
71a31f0902 integrated and extended new memory performance menu; found and fixed bug in DHT caching
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@752 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-20 10:54:20 +00:00
orbiter
fb52a82008 added new performance page for memory settings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@751 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-20 10:10:34 +00:00
orbiter
cddd9aaa33 fixed SERIOUS bug with kelondroStack; affected all stack processing since 729
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@732 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-15 22:17:51 +00:00
orbiter
416c126815 fix for a profile = null problem and new monitor in crawl queue
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@730 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-15 21:39:37 +00:00
orbiter
2148c0cf49 replaced kelondro storage core; much less objects in kelondro cache now; less IO from DB
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@724 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-14 10:10:49 +00:00
theli
beefddf0e8 *) Adding option that allows an index transfer without deletion of the index
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@722 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-14 07:14:24 +00:00
rramthun
4036ee812a Updated German language file
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@721 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-13 16:29:59 +00:00
theli
40925f4fb7 *) Improving complete index transfer performance by automatically increasing size of transferred word chunk
for fast connections (much like normal DHT behavior)
   

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@719 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-13 10:29:04 +00:00
theli
91ab4d044b *) Adding automatic retry functionality to complete index transfer function
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@718 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-13 08:32:24 +00:00
theli
a62677f761 *) Adding additional logging output for complete index transfer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@717 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-13 06:44:38 +00:00
theli
b991d2e7dd *) Additional logging message for complete index transfer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@712 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-12 12:02:45 +00:00
theli
3c00c5f6c7 *) Complete Index Transfer
See: http://www.yacy-forum.de/viewtopic.php?p=9622

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@711 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-12 11:39:32 +00:00
theli
2cb084d426 *) Complete Index Transfer
See: http://www.yacy-forum.de/viewtopic.php?p=9622

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@707 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-12 10:37:16 +00:00
theli
d1de71e9f6 *) Suppress stacktrace on proxy error for "No route to host Exception"
See: http://www.yacy-forum.de/viewtopic.php?t=1153

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@704 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-11 20:21:38 +00:00