Commit Graph

1708 Commits

Author SHA1 Message Date
orbiter
1dfab1abe3 more control for seed receive
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2709 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-04 08:55:01 +00:00
theli
1c0e65f55f *) Bugfix for problems with charset detection
See: http://www.yacy-forum.de/viewtopic.php?p=26196

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2708 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-04 04:54:21 +00:00
orbiter
db294687ea enhanced logging
- more logging output
- fix in log line preparation
- added filter to log page
- some small bugfixes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2707 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 22:55:59 +00:00
theli
a9a0f51303 *) suppressing InterruptedException errormessage
See: http://www.yacy-forum.de/viewtopic.php?t=2915

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2705 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 15:40:18 +00:00
theli
ce7ee74316 *) better errorhandling in filehandler (try catch block now starts before argument parsing)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2704 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 14:21:46 +00:00
theli
1d4fb680ce *) CrawlWorker.java: only keep content in memory if size is equal or less than 5MB
TODO: make this limit configurable 

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2703 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 12:16:25 +00:00
theli
1586d57187 *) odtParser: better handling of large files
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2702 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 12:00:26 +00:00
theli
f17ce28b6d *) plasmaHTCache:
- method loadResourceContent defined as deprecated. 
     Please do not use this function to avoid OutOfMemory Exceptions 
     when loading large files
   - new function getResourceContentStream to get an inputstream of a cache file
   - new function getResourceContentLength to get the size of a cached file
*) httpc.java:
   - Bugfix: resource content was loaded into memory even if this was not requested
*) Crawler:
   - new option to hold loaded resource content in memory
   - adding option to use the worker class without the worker pool 
     (needed by the snippet fetcher)
*) plasmaSnippetCache
   - snippet loader does not use a crawl-worker from pool but uses
     a newly created instance to avoid blocking by normal crawling
     activity.
   - now operates on streams instead of byte arrays to avoid OutOfMemory 
     Exceptions when operating on large files 
   - snippet loader now forces the crawl-worker to keep the loaded
     resource in memory to avoid IO 
*) plasmaCondenser: adding new function getWords that can directly operate on input streams
*) Parsers
   - keep resource in memory whenever possible (to avoid IO)
   - when parsing from stream the content length must be passed to the parser function now.
     this length value is needed by the parsers to decide if the parsed resource content is to large
     to hold it in memory and must be stored to file 
   - AbstractParser.java: new function to pass the contentLength of a resource to the parsers
   


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2701 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 11:05:48 +00:00
orbiter
630a955674 read snippets from cache in case they are not provided in RAM
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2700 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-02 17:18:24 +00:00
orbiter
bcf2b800b4 applied UTF-8 encoding parameter to yacy-internal protocol communication
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2694 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-02 13:35:38 +00:00
orbiter
c40fca08a2 fixed bad handling of string separation
you can now use a new encoding attribute to create strings from byte arrays

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2693 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-02 10:21:14 +00:00
orbiter
5a40ea7866 refactoring of wget string list generation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2692 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-02 09:59:20 +00:00
orbiter
dbc2e039bb added time-out option parameter to call hierarchy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2691 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-02 09:40:18 +00:00
orbiter
d4c239e4be - fixed problem in collection index with deletion of single url references
- added automatic deletion of not-found snippets after search

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2689 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-02 01:40:52 +00:00
orbiter
00746ca232 identified and fixed search performance problem caused by
snippet loading. Some access to header-db had been twice and even
more times in some cases. Snippet resource loading fixed.
Furthermore the snippet loading during remote search within the
remote peer has been disabled, but can be switched on remotely by
new flag 'includesnippet=true'

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2688 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-02 01:15:02 +00:00
orbiter
b033a80750 better control of failure in node seek of kelondroTree
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2686 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-02 00:13:19 +00:00
orbiter
310f1c41cd added option to see ranking scores in surftipps
and some cleanups

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2684 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-30 23:28:03 +00:00
theli
a2e3095044 *) Bugfix. Add missing plasmaParserDocument.close() calls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2680 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-30 10:09:01 +00:00
theli
cd5f349666 *) Better handling of large files during parsing
Extracted text of files that are larger than 5MB is stored in a temp file instead of keeping it in memory
*) plasmaParserDocument.java; getText now returnes an inputStream instead of a byte array
*) plasmaParserDocument.java: new function getTextBytes returns the parsed content as byte array
   Attention: the caller of this function has to ensure that enough memory is available to do this 
   to avoid OutOfMemory Exceptions
*) httpd.java: better error handling if the soaphander is not installed
*) pdfParser.java: 
   - better handling of documents with exotic charsets
   - better handling of large documents
   - better error logging of encrypted documents
*) rtfParser.java: Bugfix for UTF-8 support
*) tarParser.java: better handling of large documents
*) zipParser.java: better handling of large documents
*) plasmaCrawlEURL.java: new errorcode for encrypted documents
*) plasmaParserDocument.java: the extracted text can now be passed
   to this object as byte array or temp file   

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2679 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-30 09:31:53 +00:00
theli
8b2ceddb91 *) Displaying servere and warning logging messages in different colors on ViewLog_p.html
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2678 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-30 08:12:22 +00:00
low012
f8ac694e51 *) fixed a bug where searchword in snippets were not displayed bold in front of a punctuation mark (see http://www.yacy-forum.de/viewtopic.php?p=25998)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2677 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-30 00:27:42 +00:00
orbiter
df1629b05a - code cleanup
- version 0.471
- moved surftipps to own web page


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2676 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-29 22:27:20 +00:00
theli
c665f6cddb *) handling of quotes in charset string
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2674 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-28 06:29:15 +00:00
theli
b73efd5565 *) missing changes needed because of last commit
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2673 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-28 05:48:28 +00:00
theli
140ddba93f *) adding soap functions to pause and resume the crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2668 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-27 05:22:43 +00:00
orbiter
2463e5624a 'quick' release 0.47
- documentation update
- necessary bugfixes (missing css for new peers)
- reduced effect of search result redundancy filter
- removed some debug output, but not all

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2665 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-26 23:41:54 +00:00
theli
49fbb688df *) SOAP: old urlInfo renamed to urlInfoByHash, new urlInfo Function added.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2662 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-26 15:14:33 +00:00
theli
8f143d516b *) make snippet fetcher accessible via soap api
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2661 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-26 15:07:16 +00:00
theli
97615af406 *) Restructuring of YaCy SOAP services
- general functions moved to abstract service class
   - service class splitted into SearchService, CrawlService, StatusService
*) Bugfix for SOAP search services
   - Attention: some xml tages where renamed
   See: http://www.yacy-forum.de/viewtopic.php?p=25877
*) New SOAP service function urlInfo to view the parsed content of an URL
   See: http://www.yacy-forum.de/viewtopic.php?p=25869

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2660 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-26 14:47:44 +00:00
theli
241b881560 *) Redesign of YaCy SOAP handler
- should be more fail-safe now
   - better handling of compressed request bodies
   - better handling of persistent connections
   - better handling of AxisFaults

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2659 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-26 12:24:40 +00:00
theli
009a33170b *) Content-Location header added
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2658 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-26 04:32:01 +00:00
theli
1aa07a52cd *) Bugfix for UnsupportedEncodingException if the media type contains multiple parameters
See: http://www.yacy-forum.de/viewtopic.php?p=25832#25826

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2654 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-24 15:50:51 +00:00
theli
625c2ce6b1 *) bugfix for snippet fetching problem if content but not http header is available in cache
See: http://www.yacy-forum.de/viewtopic.php?p=25748

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2651 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-22 11:55:28 +00:00
theli
813a8a8179 *) migration of mimeTypeParser to jmimemagic 0.1
- better mimetype detection for rss feeds
   - better mimetype detection for odt documents (less memory consuming)
   - two new detector classes implementing MagicDetector interface of jmimemagic

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2650 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-22 11:40:46 +00:00
hermens
3f5a4153a0 Make Peers more receptible to transferred indexes
- Set MaxWordCount for dhtInCache to indexDistribution.dhtReceiptLimit
  so that the inCache gets flushed when the limit is passed
- Modify flushCacheSome to flush enough words to get below MaxWordCount immediately



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2649 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-22 10:58:58 +00:00
theli
57415b6889 *) Bugfix for surftipps UTF-8 problem
See: http://www.yacy-forum.de/viewtopic.php?t=2864

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2647 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-22 05:40:29 +00:00
allo
b0a4fcce8c fix from theli
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2642 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-20 18:03:24 +00:00
theli
b6c7b91582 *) Parser now throws an ParserException instead of returning null on parsing errors (e.g. needed by snippet fetcher)
*) better logging of parser failures
*) simplified usage of plasmaparser through switchboard
*) restructuring of crawler
   - crawler now returns an error message if it is used in sync mode (e.g. by snippet fetcher)
*) snippet-fetcher: more verbose error messages
*) serverByteBuffer.java: adding new function append(String,encoding)
*) serverFileUtils.java: adding functions to copy only a given number of bytes between streams


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2641 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-20 12:25:07 +00:00
orbiter
e03427871e enhanced surftipps:
- added switchh to show or hide surftipps
- more news contribute to surftipps
- added voting system for surftipps

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2638 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-20 07:17:41 +00:00
theli
1dc12d6659 *) Bugfix for shutdown problem caused by cacheScan thread
See: http://www.yacy-forum.de/viewtopic.php?p=25729

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2636 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-20 04:36:25 +00:00
borg-0300
42173462f5 rename cutUrlText to shortenURLString;
other little things;

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2635 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-19 20:47:45 +00:00
borg-0300
af1d89e381 check url == null added;
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2634 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-19 20:12:26 +00:00
theli
cc667b0aa5 *) htmlFilterContentScraper.java: adding support for link tag
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2633 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-19 16:13:13 +00:00
theli
26dfbb7499 *) Bugfix for UTF-8: url names are now stored properly in stackcrawl, crawler, indexing queue and should be displayed correct on the gui
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2630 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-19 05:19:41 +00:00
theli
cf6acff2c2 *) Bugfix. htmlFilterInputStream document analysis did not work properly for documents smaller than the
default InputStream Buffer size.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2629 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-19 04:58:34 +00:00
borg-0300
f18304ddd3 unused/not needed imports removes;
properties added;

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2628 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-18 22:21:18 +00:00
orbiter
ec031eb993 first version of surftipps
see http://localhost:8080/index.html

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2627 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-18 20:14:21 +00:00
borg-0300
b174fbd0ca "import ...*" removed;
properties added;

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2626 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-18 18:31:27 +00:00
orbiter
807756150e patch for strange bug reported by email
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2625 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-18 16:50:31 +00:00
theli
5c6251bced *) some improvements for extended html document charset support
- new class htmlFilterInputStream.java which allows to pre-analyze the html header to extract
     the charset meta data. This is only enabled for the crawler at the moment. Integration into 
     proxy needs more testing.     
   - adding eventlisterner interfaces to the htmlscraper to allow other classes to get informed
     about detected tags (used by the htmlFilterInputStream.java)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2624 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-18 15:36:04 +00:00