Commit Graph

28 Commits

Author SHA1 Message Date
orbiter
87a8747ce3 - enhanced recognition, parsing, management and double-occurrence-handling of image tags
- enhanced text parser (condenser): found and eliminated bad code parts; increase of speed
- added handling of image preview using the image cache from HTCACHE
- some other minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4507 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-02-25 14:08:15 +00:00
orbiter
efd0b8371a - added parsing of Dublin Core - compliant metadata (see RFC 5013 and ISO 15836) to html parser
- refactoring of plasmaParserDocument to use Dublin Core - compatible property names
- redesign of url handling in parser and condenser (less String-to-yacyURL conversion)
- more generics

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4352 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-01-22 11:51:43 +00:00
low012
b08f877e97 *) tried to get rid of warnings when compiling parsers (http://forum.yacy-websuche.de/viewtopic.php?t=660)
lots of warnings are gone, new one in htmlFilterContentScraper


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4293 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-12-27 22:37:02 +00:00
orbiter
e22014dc83 some memory enhancements when generating and displaying ymage objects
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4253 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-12-07 02:15:12 +00:00
orbiter
daf0f74361 joined anomic.net.URL, plasmaURL and url hash computation:
search profiling showed, that a major amount of time is wasted by computing url hashes. The computation does an intranet-check, which needs a DNS lookup. This caused that each urlhash computation needed 100-200 milliseconds, which caused remote searches to delay at least 1 second more that necessary. The solution to this problem is to attach a URL hash to the URL data structure, because that means that the url hash value can be filled after retrieval of the URL from the database. The redesign of the url/urlhash management caused a major redesign of many parts of the software. Since some parts had been decided to be given up they had been removed during this change to avoid unnecessary maintenance of unused code.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4074 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-05 09:01:35 +00:00
orbiter
40b0547611 - documentaton changes (removed old forum links)
- different handling of link quotation
- different handling of link normalization
- enhanced html/unicode en/de-coding

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3993 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-07-19 15:32:10 +00:00
rramthun
e12e934ade *) Fixed broken compile process.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3650 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-05-04 21:33:37 +00:00
orbiter
6b9eea3932 - removed differentiation between longTitle and shortTitle; this cannot be used for search results,
and it is difficult to get both types from all document types
- added some author parsing

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3489 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-18 12:33:19 +00:00
orbiter
a738b57b31 added author tag to indexing content
enhanced composition of title tag
TODO: insert author information for external parsers

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3488 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-17 01:18:34 +00:00
theli
f17ce28b6d *) plasmaHTCache:
- method loadResourceContent defined as deprecated. 
     Please do not use this function to avoid OutOfMemory Exceptions 
     when loading large files
   - new function getResourceContentStream to get an inputstream of a cache file
   - new function getResourceContentLength to get the size of a cached file
*) httpc.java:
   - Bugfix: resource content was loaded into memory even if this was not requested
*) Crawler:
   - new option to hold loaded resource content in memory
   - adding option to use the worker class without the worker pool 
     (needed by the snippet fetcher)
*) plasmaSnippetCache
   - snippet loader does not use a crawl-worker from pool but uses
     a newly created instance to avoid blocking by normal crawling
     activity.
   - now operates on streams instead of byte arrays to avoid OutOfMemory 
     Exceptions when operating on large files 
   - snippet loader now forces the crawl-worker to keep the loaded
     resource in memory to avoid IO 
*) plasmaCondenser: adding new function getWords that can directly operate on input streams
*) Parsers
   - keep resource in memory whenever possible (to avoid IO)
   - when parsing from stream the content length must be passed to the parser function now.
     this length value is needed by the parsers to decide if the parsed resource content is to large
     to hold it in memory and must be stored to file 
   - AbstractParser.java: new function to pass the contentLength of a resource to the parsers
   


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2701 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 11:05:48 +00:00
theli
cd5f349666 *) Better handling of large files during parsing
Extracted text of files that are larger than 5MB is stored in a temp file instead of keeping it in memory
*) plasmaParserDocument.java; getText now returnes an inputStream instead of a byte array
*) plasmaParserDocument.java: new function getTextBytes returns the parsed content as byte array
   Attention: the caller of this function has to ensure that enough memory is available to do this 
   to avoid OutOfMemory Exceptions
*) httpd.java: better error handling if the soaphander is not installed
*) pdfParser.java: 
   - better handling of documents with exotic charsets
   - better handling of large documents
   - better error logging of encrypted documents
*) rtfParser.java: Bugfix for UTF-8 support
*) tarParser.java: better handling of large documents
*) zipParser.java: better handling of large documents
*) plasmaCrawlEURL.java: new errorcode for encrypted documents
*) plasmaParserDocument.java: the extracted text can now be passed
   to this object as byte array or temp file   

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2679 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-30 09:31:53 +00:00
theli
813a8a8179 *) migration of mimeTypeParser to jmimemagic 0.1
- better mimetype detection for rss feeds
   - better mimetype detection for odt documents (less memory consuming)
   - two new detector classes implementing MagicDetector interface of jmimemagic

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2650 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-22 11:40:46 +00:00
theli
b6c7b91582 *) Parser now throws an ParserException instead of returning null on parsing errors (e.g. needed by snippet fetcher)
*) better logging of parser failures
*) simplified usage of plasmaparser through switchboard
*) restructuring of crawler
   - crawler now returns an error message if it is used in sync mode (e.g. by snippet fetcher)
*) snippet-fetcher: more verbose error messages
*) serverByteBuffer.java: adding new function append(String,encoding)
*) serverFileUtils.java: adding functions to copy only a given number of bytes between streams


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2641 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-20 12:25:07 +00:00
orbiter
3aac5b26da - added automatic tag generation when a web page from the search results is added
- added new image 'B' in front of search results for bookmark generation
- added news generation when a public bookmark is added
- the '+' in front of search results has new meaning: positive rating for that result
- added news generation when a '+' is hit

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2613 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-18 00:37:02 +00:00
theli
74c3e7cf29 *) storing document charset into plasmaParserDocument object (is needed later by the condenser)
*) htmlFilterContentScraper.java: using proper charset for document title
*) serverByteBuffer.java: adding new toString which allows to specify the charset for byte encoding


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2593 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-15 13:18:12 +00:00
theli
d0a5a53789 *) changes needed for multi-language support
- parsers may need to know the charset of the byte stream 

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2591 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-15 12:52:46 +00:00
theli
f3ac4dbbb9 *) better handling of server shutdown
See: e.g. http://www.yacy-forum.de/viewtopic.php?t=2584

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2468 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-03 14:59:00 +00:00
orbiter
3879a0ecd0 replaced java.net.URL usage by use of new class de.anomic.net.URL
This shall be seen as an experiment to exclude all cases where
there could be a DNS lookup during URL comparisment.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2290 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-07-13 01:21:53 +00:00
theli
45b39ee1be *) solving unpacking problems with to long filename by
a) renaming the parent folder in the tgz file to yacy
      (can be configured via build properties file)
   b) reconfiguring build file to throw an error if a file
      name is too long 
Please note that currently there is _no_ proplem with too long
class names because of step a.

      

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2207 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-06-14 15:18:41 +00:00
orbiter
83e0e765ec redesigned some parts of the html scanner & parser
to better support image tags

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1995 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-04-04 14:36:01 +00:00
orbiter
3d8a5ae652 code cleanup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1166 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-12-05 14:24:13 +00:00
orbiter
a04930f025 code cleanup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1158 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-12-04 23:51:28 +00:00
theli
bdf30117c1 *) Redesign of parser configuration
- restructuring of mimeTypes based on the parsers
   - displaying parser usage count
   - displaying human readably parser names
   - displaying parser version information

*) httpdFileHandler.java
   - adding possibility to support "streaming" servlets
     which are special servlets that can communicate with
     the client via the connection streams autonomous
   - the name of these new servlet types must end with the 
     file extension .stream
   - this feature will be needed by the yacy ScreenSaver
     class to fetch statistic data from the peer without the
     need to reconnect to the server all the time

*) Adding human readable names and version information for
   all supported parsers

*) plasmaParser.java
   - adding new structure to store parser statistic data

*) Adding openDocument parser
   - can be used to parse odt files

*) jmimemagic
   - adding rules to detect openDocument formats properly

*) serverLog.java
   - adding functions that can be used to query if a given
     logging level is enabled or not.


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1140 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-29 07:27:58 +00:00
hydrox
cb69047b91 *)cleanup access static methods and fields
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1016 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-02 17:56:26 +00:00
theli
6dd3ec0dc4 *) Adding debug="true" debuglevel="lines,vars,source" to ant build files
See: http://www.yacy-forum.de/viewtopic.php?p=4099


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@270 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-06-12 05:22:06 +00:00
theli
84f9d8f7f0 *) migrating ant build files to generate a single extension tar per default
This is buildconfigureable by changing the extensionTarget Property in the build.properties file
*) Trying to solve "yacy.java template replacement / ant build failed" bug

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@169 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-05-24 09:02:04 +00:00
theli
8bd49ba535 *) setting root dir for all tar files properly
*) automatically adding SVN Revision number to tar file name
*) introducing build.properties file that can be used to set the build version number and date

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@164 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-05-23 15:47:08 +00:00
theli
361f05978d Multiple updates regarding the yacy seedUpload facility,
optional content parsers, thread pool configuration ...

Please help me testing if everything works correct.

*) Migration of yacy seedUpload functionality
See: http://www.yacy-forum.de/viewtopic.php?t=256
- new uploaders can now be easily introduced because of a new modulare uploader system
- default uploaders are: none, file, ftp
- adding optional uploader for scp
- each uploader provides its own configuration file that will be 
  included into the settings page using the new template include feature
- Each uploader can define its libx dependencies. If not all needed libs are
  available, the uploader is deactivated automatically.

*) Migration of optional parsers
See: http://www.yacy-forum.de/viewtopic.php?t=198
- Parsers can now also define there libx dependencies
- adding parser for bzip compressed content
- adding parser for gzip compressed content
- adding parser for zip files
- adding parser for tar files
- adding parser to detect the mime-type of a file
  this is needed by the bzip/gzip Parser.java
- adding parser for rtf files
- removing extra configuration file yacy.parser
  the list of enabled parsers is now stored in the main config file

*) Adding configuration option in the performance dialog to configure
See: http://www.yacy-forum.de/viewtopic.php?t=267
- maxActive / maxIdle / minIdle values for httpd-session-threadpool
- maxActive / maxIdle / minIdle values for crawler-threadpool

*) Changing Crawling Filter behaviour
See: http://www.yacy-forum.de/viewtopic.php?p=2631

*) Replacing some hardcoded strings with the proper constants of the httpHeader class

*) Adding new libs to libx directory. This libs are
- needed by new content parsers
- needed by new optional seed uploader
- needed by SOAP API (which will be committed later)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@126 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-05-17 08:25:04 +00:00