Commit Graph

22 Commits

Author SHA1 Message Date
orbiter
daf0f74361 joined anomic.net.URL, plasmaURL and url hash computation:
Search profiling showed that a large share of the time was spent computing URL hashes. The computation performs an intranet check, which needs a DNS lookup. As a result, each URL hash computation took 100-200 milliseconds, which delayed remote searches by at least 1 second more than necessary. The solution is to attach the URL hash to the URL data structure, so that the hash value can be filled in after the URL is retrieved from the database. The redesign of the url/urlhash management required a major redesign of many parts of the software. Parts that had already been marked for abandonment were removed during this change to avoid unnecessary maintenance of unused code.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4074 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-05 09:01:35 +00:00
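
The idea described in daf0f74361, carrying the hash along with the URL so that it is computed at most once, might look roughly like the sketch below. It is an illustration only, not the code from this commit: the class `HashedUrl`, its methods, and the truncated SHA-1 digest used as a stand-in for the real hash are all assumptions, and the intranet check with its DNS lookup is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical URL wrapper that stores its hash next to the URL string. */
final class HashedUrl {
    private final String url;
    private String hash;           // null until computed or supplied
    private int computations = 0;  // counts expensive computations (illustration only)

    /** Used when only the URL is known; the hash is computed lazily. */
    HashedUrl(String url) {
        this.url = url;
    }

    /** Used when the hash was stored together with the URL, e.g. in a database row. */
    HashedUrl(String url, String knownHash) {
        this.url = url;
        this.hash = knownHash;
    }

    String url() {
        return url;
    }

    /** Returns the cached hash, computing it at most once. */
    synchronized String hash() {
        if (hash == null) {
            hash = computeHash(url);
            computations++;
        }
        return hash;
    }

    synchronized int computations() {
        return computations;
    }

    /** Stand-in for the expensive computation: first 12 hex digits of SHA-1. */
    private static String computeHash(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 6; i++) {
                sb.append(String.format("%02x", digest[i] & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

With the second constructor, a URL loaded from storage never triggers the expensive path, which is the saving the commit message describes.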
orbiter
f890cc86aa inserted forwarding patch from fuchs
see http://forum.yacy-websuche.de/viewtopic.php?f=6&t=233

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4046 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-15 22:25:48 +00:00
orbiter
b5346141b3 made the plasmaHTCache static (there is only one internet, so we need only one cache)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4045 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-15 21:31:31 +00:00
orbiter
57a5b6fa71 some generalization of remote proxy configuration and setting handling in httpc
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4023 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-02 00:42:37 +00:00
orbiter
40b0547611 - documentation changes (removed old forum links)
- different handling of link quotation
- different handling of link normalization
- enhanced html/unicode en/de-coding

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3993 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-07-19 15:32:10 +00:00
rramthun
18a5380ee3 *) situation-dependent lock-buttons for search-page
*) removed one unused import and a double definition of "ogg" as media-type

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3817 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-06-07 15:26:41 +00:00
theli
d157201e08 *) Bugfix for "Unexpected end of ZLIB" error message
See: http://www.yacy-forum.de/viewtopic.php?t=3327

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3169 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-01-05 13:45:31 +00:00
orbiter
109ed0a0bb - cleaned up code; removed methods to write the old data structures
- added an assortment importer. the old database structures can
  be imported with
  java -classpath classes yacy -migrateassortments
- modified wordmigration. The indexes from WORDS are now imported
  to the collection database. The call is
  java -classpath classes yacy -migratewords
  (as it was)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3044 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-05 02:47:51 +00:00
orbiter
497428c8ec refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2949 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-10 01:13:33 +00:00
theli
a5b9b514c1 *) retry crawling without content-encoding if the content-encoding header was not correct
See: http://www.yacy-forum.de/viewtopic.php?p=26917#26917

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2811 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-19 08:45:52 +00:00
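
The retry behaviour described in a5b9b514c1 could be sketched as follows. This is a simplified illustration, not the crawler's actual code: the `Fetcher` interface and the class `EncodingRetry` are hypothetical, and the sketch assumes the server always claims gzip encoding when gzip is accepted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

final class EncodingRetry {
    /** Hypothetical fetcher: returns the raw response body for a URL. */
    interface Fetcher {
        byte[] fetch(String url, boolean acceptGzip) throws IOException;
    }

    /** Loads a URL, falling back to an unencoded request if decoding fails. */
    static byte[] load(Fetcher fetcher, String url) throws IOException {
        byte[] body = fetcher.fetch(url, true);
        try {
            return gunzip(body);
        } catch (IOException badEncoding) {
            // The body was not valid gzip although it was announced as such:
            // request the resource again without content encoding.
            return fetcher.fetch(url, false);
        }
    }

    /** Decompresses a gzip byte array; throws IOException on malformed input. */
    static byte[] gunzip(byte[] data) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```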
theli
1d4fb680ce *) CrawlWorker.java: only keep content in memory if size is equal or less than 5MB
TODO: make this limit configurable 

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2703 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 12:16:25 +00:00
theli
f17ce28b6d *) plasmaHTCache:
   - method loadResourceContent defined as deprecated.
     Please do not use this function to avoid OutOfMemory Exceptions
     when loading large files
   - new function getResourceContentStream to get an inputstream of a cache file
   - new function getResourceContentLength to get the size of a cached file
*) httpc.java:
   - Bugfix: resource content was loaded into memory even if this was not requested
*) Crawler:
   - new option to hold loaded resource content in memory
   - adding option to use the worker class without the worker pool 
     (needed by the snippet fetcher)
*) plasmaSnippetCache
   - snippet loader does not use a crawl-worker from pool but uses
     a newly created instance to avoid blocking by normal crawling
     activity.
   - now operates on streams instead of byte arrays to avoid OutOfMemory 
     Exceptions when operating on large files 
   - snippet loader now forces the crawl-worker to keep the loaded
     resource in memory to avoid IO 
*) plasmaCondenser: adding new function getWords that can directly operate on input streams
*) Parsers
   - keep resource in memory whenever possible (to avoid IO)
   - when parsing from a stream, the content length must now be passed to the parser function.
     The parsers need this length value to decide whether the parsed resource content is too large
     to hold in memory and must be stored to a file
   - AbstractParser.java: new function to pass the contentLength of a resource to the parsers
   


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2701 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 11:05:48 +00:00
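
The length-based decision described in f17ce28b6d (hold small resources in memory, store large ones to a file, and read from a stream in both cases) might be sketched like this. The class `ResourceBuffer`, its fields, and the `maxInMemory` parameter are hypothetical names chosen for the example; they are not taken from the YaCy sources.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Hypothetical holder for resource content, either in memory or on disk. */
final class ResourceBuffer {
    final byte[] inMemory;  // non-null if the content is held in memory
    final File onDisk;      // non-null if the content was written to a temp file

    private ResourceBuffer(byte[] inMemory, File onDisk) {
        this.inMemory = inMemory;
        this.onDisk = onDisk;
    }

    /**
     * Reads the stream into memory if the announced length is known and within
     * the limit; otherwise copies it to a temporary file without buffering it all.
     */
    static ResourceBuffer load(InputStream in, long contentLength, long maxInMemory)
            throws IOException {
        if (contentLength >= 0 && contentLength <= maxInMemory) {
            int initial = (int) Math.min(contentLength, 8192L);
            ByteArrayOutputStream out = new ByteArrayOutputStream(initial);
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new ResourceBuffer(out.toByteArray(), null);
        }
        File tmp = File.createTempFile("resource", ".tmp");
        tmp.deleteOnExit();
        // createTempFile already created the file, so it must be replaced.
        Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return new ResourceBuffer(null, tmp);
    }
}
```

Passing the limit as a parameter rather than hard-coding it keeps the threshold configurable.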
theli
a0ddf2ec11 *) AbstractCrawlWorker.java: delete already downloaded data on crawling error
*) plasmaSwitchboard.java: log unexpected errors while parsing/indexing

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2552 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-12 04:50:12 +00:00
theli
fded1f4a5d *) better handling of maximum file size limit in crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2543 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-11 08:26:39 +00:00
theli
63893003be *) Adding a settings page for the crawler that allows specifying a file size limit and the timeout to use.
*) adding first version of maximum filesize check for the crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2534 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-09 15:06:49 +00:00
theli
043edfa4d8 *) ftp/ResourceInfo.java ResourceInfo object for ftp resources added
*) ftp/CrawlWorker.java better error handling for ftp crawler
*) plasmaCrawlEURL.java: some error codes added

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2499 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-07 04:12:52 +00:00
theli
dae763d8e3
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2495 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-06 14:31:17 +00:00
theli
393a7d10be *) setting htCache.Entry fields to private
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2484 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-04 15:03:54 +00:00
theli
ab5a9bee66 *) adding some copyright headers
*) next step of restructuring for new crawlers
   - adding first testversion of ftp crawler class
   -- does not create a htCache entry yet

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2483 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-04 14:38:29 +00:00
theli
fce9e7741b *) next step of restructuring for new crawlers
   - renaming of http specific crawler settings

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2480 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-04 11:56:47 +00:00
theli
09b106eb04 *) next step of restructuring for new crawlers
   - adding interface class (plasma/crawler/plasmaCrawlWorker.java) for protocol-specific crawl-worker threads
   - moving reusable code into abstract crawl-worker class AbstractCrawlWorker.java
   - the load method of the worker threads should not be called directly anymore (e.g. by the snippet fetcher);
     to crawl a page and wait for the result, use function plasmaCrawlLoader.loadSync([...])

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2474 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-04 09:00:18 +00:00
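
The layering described in 09b106eb04 (a worker interface, an abstract base class for shared code, and a loader offering a blocking call) can be illustrated with the sketch below. All type names ending in `Sketch` are hypothetical and far simpler than the real classes; only the method name `loadSync` is taken from the commit message, and its parameters here are assumptions, since the message elides the real ones.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical protocol-independent worker contract. */
interface CrawlWorkerSketch {
    String protocol();
    String load(String url) throws Exception;
}

/** Shared behaviour lives here; subclasses only implement the transfer itself. */
abstract class AbstractCrawlWorkerSketch implements CrawlWorkerSketch {
    @Override
    public final String load(String url) throws Exception {
        if (!url.startsWith(protocol() + "://")) {
            throw new IllegalArgumentException("unsupported URL: " + url);
        }
        return fetch(url);
    }

    /** Protocol-specific part, supplied by each concrete worker. */
    protected abstract String fetch(String url) throws Exception;
}

/** Callers go through the loader instead of invoking a worker directly. */
final class CrawlLoaderSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    /** Runs the worker on a pool thread and blocks until the result is ready. */
    String loadSync(CrawlWorkerSketch worker, String url) throws Exception {
        Future<String> result = pool.submit(() -> worker.load(url));
        return result.get();
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

A failure inside the worker reaches the caller wrapped in a `java.util.concurrent.ExecutionException`, so the original exception is available through `getCause()`.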
theli
eb9b138986 *) next step of restructuring for new crawlers
   - conversion of the crawler pool into a keyed object pool
   - crawlers are now loaded based on the url protocol (of course works only for http now)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2473 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-04 06:52:55 +00:00
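
A keyed object pool, as mentioned in eb9b138986, keeps a separate set of idle objects per key, here the URL protocol. The sketch below is a minimal hand-written version for illustration; `KeyedPoolSketch` and its methods are hypothetical, and the commit itself may well rely on a pooling library instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/** Minimal keyed pool: one queue of idle objects per key (here: URL protocol). */
final class KeyedPoolSketch<T> {
    private final Map<String, Supplier<T>> factories = new HashMap<>();
    private final Map<String, Deque<T>> idle = new HashMap<>();

    /** Registers how to create new objects for a key. */
    synchronized void register(String key, Supplier<T> factory) {
        factories.put(key, factory);
        idle.put(key, new ArrayDeque<>());
    }

    /** Returns an idle object for the key, or creates one if none is idle. */
    synchronized T borrow(String key) {
        Supplier<T> factory = factories.get(key);
        if (factory == null) {
            throw new IllegalArgumentException("no worker for protocol: " + key);
        }
        T pooled = idle.get(key).pollFirst();
        return pooled != null ? pooled : factory.get();
    }

    /** Puts an object back so a later borrow for the same key can reuse it. */
    synchronized void giveBack(String key, T object) {
        Deque<T> queue = idle.get(key);
        if (queue == null) {
            throw new IllegalArgumentException("unknown key: " + key);
        }
        queue.addFirst(object);
    }

    /** Extracts the lower-cased protocol part of a URL, e.g. "http". */
    static String protocolOf(String url) {
        int i = url.indexOf("://");
        if (i <= 0) {
            throw new IllegalArgumentException("no protocol in: " + url);
        }
        return url.substring(0, i).toLowerCase(Locale.ROOT);
    }
}
```

With only an "http" factory registered, a request for any other protocol is rejected, which mirrors the remark that the scheme works only for http so far.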