Commit Graph

95 Commits

Author SHA1 Message Date
orbiter
4fefa53135 removed parser object pool, see also svn 4106
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4184 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-29 12:14:18 +00:00
orbiter
a31b9097a4 preparations for mass remote crawls:
two main changes must be implemented to enable mass remote crawls:
- shift control of robots.txt to crawl queue (away from stacker). This is necessary since remote
  crawls can contain unchecked urls. Each peer must check the robots to prevent that it is misused
  as crawl agent for unwanted file retrieval
- implement new index files that control double-check of remotely crawled urls

After removal of robots.txt checking from stacker threads, the multi-threading of this process is void.
Multithreading has been removed. Also the thread pools for the crawl threads had been removed, since
creation of these threads is not resource-consuming, for a detailed explanation see svn 4106

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4181 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-29 01:43:20 +00:00
fuchsi
f717beecb1 - Changed yFormatter handling to be more flexible and produce more readable code for server pages. There are serverObject.putNum() methods to allow adding of number type values in a formatted form, and put() methods for number types that add them without formatting. This reduces the need to transform them into Strings in server pages and removes the HTML encoding step which is unecessary for numbers.
- some minor code cleanups (mostly unnecessary casts, null checks)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4166 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-19 04:13:46 +00:00
orbiter
01e0669264 re-designed some parts of DHT position calculation (effect is the same as before)
and replaced old fist hash computation by new method that tries to find a gap in the current dht
to do this, it is necessary that the network bootstraping is done before the own hash is computed
this made further redesigns in peer initialization order necessary

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4117 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-01 12:30:23 +00:00
orbiter
341f7cb327 steps to enhance remote search performance:
- added a file size limitation, that disallows parsing of large documents during (offline-) remote search
- added profiling information to search result computation, visible at search access tracker. this info shows used time for URL fetch and snippet computation

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4112 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-26 10:11:50 +00:00
orbiter
6c819a6fd9 added cache to favicon display
added better synchronization for simultanous search requests

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4076 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-06 01:28:35 +00:00
orbiter
daf0f74361 joined anomic.net.URL, plasmaURL and url hash computation:
search profiling showed, that a major amount of time is wasted by computing url hashes. The computation does an intranet-check, which needs a DNS lookup. This caused that each urlhash computation needed 100-200 milliseconds, which caused remote searches to delay at least 1 second more that necessary. The solution to this problem is to attach a URL hash to the URL data structure, because that means that the url hash value can be filled after retrieval of the URL from the database. The redesign of the url/urlhash management caused a major redesign of many parts of the software. Since some parts had been decided to be given up they had been removed during this change to avoid unnecessary maintenance of unused code.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4074 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-05 09:01:35 +00:00
orbiter
4779f314fe first version of next-generation search interface:
- snippets are not fetched by browser using ajax, they are now fetched internally
- YaCy-internat threads control existence of snippets and sort out bad results
- search results are prepared using SSI includes
- the search result page is visible right after the search request, the results drop in when they are detected
- no more time-out strategy during search processes, results are shifted within queues when they arrive from remote peers
- added result page switching! after the first 10 results, the next page can be retrieved
- number of remote results is updated online on the result page as they drop in
- removed old snippet servelet (which had been also a security leak btw)
- media search is broken now, will be redesigned and fixed in another step


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4071 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-03 23:43:55 +00:00
orbiter
e332b844b2 - enhanced remote search: during waiting time for remote crawls
some urls are fetched so the url cache can be filled with these urls
- the url-prefetch is used to sort out some unresolved urls
- the snippet-fetcher is triggered with the search event id. This is used
  to remove missing snippets from the search cache so they will not be displayed again


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4060 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-26 18:18:35 +00:00
orbiter
b5346141b3 made the plasmaHTCache static (there is only one internet, so we need only one cache)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4045 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-15 21:31:31 +00:00
orbiter
947fc46904 refactoring of search process:
- re-designed remote request result processing
- re-designed local result accumulation, will be further enhanced with snippet fetcher
- removed search process handling in switchboad
- made snippet class static (there is no need for multiple snippet objects)
- removed some redundant tasks in server-side search process, should be a little bit faster now


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4043 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-15 11:36:59 +00:00
orbiter
40b0547611 - documentaton changes (removed old forum links)
- different handling of link quotation
- different handling of link normalization
- enhanced html/unicode en/de-coding

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3993 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-07-19 15:32:10 +00:00
orbiter
89e1848db6 fixed problem with favicons:
target servers had been able to see search words from the referrer of the favicon fetch.
This has been removed by using the getImage - servlet for favicon fetch.
Since java does not support loading of bmp and ico-Images, such parsers had been added.
The image parser had been coded from their original microsoft documentation.
This influences also the image-search functionality: there can now be a preview
of found bmp-images. Another benefit: favicons for search results are now cached with the HTCACHE.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3965 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-07-15 01:34:01 +00:00
low012
c59a7ce5c2 *) hopefully fixed a stupid bug (my fault of course) that sometimes messed up the marking of search words in the snippets (see http://www.yacy-forum.de/viewtopic.php?p=37329#37329)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3908 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-06-17 16:29:04 +00:00
theli
339153d40e *) favicons that are specified in the document content via html link-tags
are now detected and displayed on the search page (requested by allo).

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3845 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-06-09 15:22:37 +00:00
orbiter
6488ec8a80 no deletions in index in case that snippet-loading fails and there is no network connection
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3525 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-27 08:21:45 +00:00
orbiter
861f41e67e redesigned NURL-handling:
- the general NURL-index for all crawl stack types was splitted into separate indexes for these stacks
- the new NURL-index is managed by the crawl balancer
- the crawl balancer does not need an internal index any more, it is replaced by the NURL-index
- the NURL.Entry was generalized and is now a new class plasmaCrawlEntry
- the new class plasmaCrawlEntry replaces also the preNURL.Entry class, and will also replace the switchboardEntry class in the future
- the new class plasmaCrawlEntry is more accurate for date entries (holds milliseconds) and can contain larger 'name' entries (anchor tag names)
- the EURL object was replaced by a new ZURL object, which is a container for the plasmaCrawlEntry and some tracking information
- the EURL index is now filled with ZURL objects
- a new index delegatedURL holds ZURL objects about plasmaCrawlEntry obects to track which url is handed over to other peers
- redesigned handling of plasmaCrawlEntry - handover, because there is no need any more to convert one entry object into another
- found and fixed numerous bugs in the context of crawl state handling
- fixed a serious bug in kelondroCache which caused that entries could not be removed
- fixed some bugs in online interface and adopted monitor output to new entry objects
- adopted yacy protocol to handle new delegatedURL entries
all old crawl queues will disappear after this update!

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3483 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-16 13:25:56 +00:00
orbiter
9f929b5438 better snippet handling in case of snippet load fail
see also http://www.yacy-forum.de/viewtopic.php?p=31096#31096

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3475 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-03-13 22:18:36 +00:00
orbiter
f25c0e98d1 - replaced String by StringBuffer in condenser
- added CamelCase parser in condenser
- added option to switch on or off indexing for proxy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3292 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-01-29 01:11:22 +00:00
orbiter
0a050bc043 enhanced ranking
- redesign of data storage in plasmaSearchRankingProfile
- profiles are extended by new ranking parameters
- new RWI ranking parameters are considered during ranking
- appearance attributes (i.e. emphasised text) is now considered
- faster ranking
- some attributes that had been checked during post-ranking can now be
  checked during pre-ranking phase
- removed old ranking parameter on index.html page (will be replaced by profiles in the future)
- ranking can now consider appearances of media content
- snippet-loading for media types now work correctly (fetches only from the wanted media)
- ranking-profiles can be handed over the remote peers and apply there also
- re-search of same query with different domain now also re-triggers remote search

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3105 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-20 15:44:29 +00:00
orbiter
61798f0ae6 added option to distinguish between text crawl and media crawl
- for each crawl start, there is now a flag for text and media
- the localCrawl flag is superfluous
- added new crawl profiles
- if an image search is done, only media links are crawled for the snippets


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3100 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-19 03:10:46 +00:00
orbiter
e4570bffaf -implemented a specialized snippet-fetch for media content
-changed search result preparation for media search presentation

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3073 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-12 02:09:25 +00:00
low012
694a6e4f44 *) better text snipptes: any possible searchword (welt, linux, tag) in welt-linux-tag will be marked correctly now
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3072 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-11 15:19:35 +00:00
orbiter
bddc197453 reverted by-mistake removed change from low012/SVN 3068
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3070 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-11 11:07:36 +00:00
orbiter
1377c53aa3 extraction of media links from search results
these links are mixed to the snippets for testing purpose
(a final version will handle this differently)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3069 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-11 01:31:23 +00:00
low012
586add4c6c *) Better snippets: words like GNU/Linux will not prevent Linux or GNU from being marked if they are searchword (see http://www.yacy-forum.de/viewtopic.php?t=2891)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3068 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-10 23:13:53 +00:00
orbiter
937ccd4e76 fix for snippet-generation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3060 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-09 02:13:43 +00:00
orbiter
bf0d820659 - added correct flagging of word properties
- added self-healing to database in case that wrong free-pointers exist
- added presentation of media links in snippets (does not yet work correctly)
- code cleanup

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3055 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-12-08 02:14:56 +00:00
orbiter
ceb9e3aa17 - enhanced parser: collection of audio, video, image and application links
- enhanced condenser: better handling of utf-8 and pre-formatted texts


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3017 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-28 15:00:15 +00:00
orbiter
b5a29e9651 - fix for snippets that are too short
- added keyword to snippet fetch to suppres removal of not-found snippet words (for debugging)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3009 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-25 00:38:09 +00:00
orbiter
30888e7a2f implementation of search constraints
Such constraints may formulate specific restrictions to web searches
This is implemented by scraping information for constraints from a web
page during parsing, and storing flags to the pages within the web index.

In this first step, only information for index pages ("index of", directory listings)
are scraped and stored in flags
- added new flag class kelondroBitfield
- added scraper method in condenser
- added bitfield structure for all scrape types (see also condenser)
- added bitfield structure for appearance locations (see RWIEntry)
- added handover protocol for remote search and index distribution
- extended kelondroColumn class to hold bitfield types
- added another search attribute on search page (index.html)
- extended search-filter to enable filtering of non-matching constraints
- set all new database types to be default
- refactoring: moved word hash generation to condenser class

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2999 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-23 02:16:30 +00:00
orbiter
497428c8ec refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2949 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-10 01:13:33 +00:00
orbiter
bb7d4b5d5e refactoring to prepare new RWI entry object
- moved all url and index(RWI) entries to index package
- better naming to distinguish RWI entries and URL entries


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2937 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-11-08 16:17:47 +00:00
orbiter
b79e06615d - added new LURL.Entry class for next database migration
- refactoring of affected classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2802 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-18 22:25:07 +00:00
orbiter
a5dd0d41af - refactoring of plasmaCrawlLURL.Entry to prepare new Entry format
- added test migration method to migrate the old LURL to a new LURL
the new LURL will be splitted into different tables for each month
this solves several problems:
- the biggest table in YaCy is splitted in different parts and can
  also be managed in filesystems that are limited to 2GB
- the oldest entries can easily be identified, used for re-crawl und
  deleted
- The complete database can be limited to a specific size (as wanted many times)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2755 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-12 23:14:41 +00:00
orbiter
c8f3a7d363 added snippet-url re-indexing
- snippets will generate an entry in responseHeader.db
- there is now another default profile for snippet loading
- pages from snippet-loading will be indexed, indexing depth = 0
- better organization of default profiles

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2733 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-09 23:07:10 +00:00
low012
2cfd4633ac *) even better handling of searchwords in snippets, words can consist of letters and numbers now
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2732 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-09 21:08:13 +00:00
low012
2d3b7251a4 *) better handling of searchwords in snippets (see http://www.yacy-forum.de/viewtopic.php?t=2891 for details)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2728 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-09 13:37:38 +00:00
orbiter
1969522dc1 removed lowercase of snippets (and other things):
- added new sentence parser to condenser
- sentence parsing can now handle charsets

to do: charsets must be handed over to new sentence parser

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2712 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-07 00:06:09 +00:00
theli
f17ce28b6d *) plasmaHTCache:
- method loadResourceContent defined as deprecated. 
     Please do not use this function to avoid OutOfMemory Exceptions 
     when loading large files
   - new function getResourceContentStream to get an inputstream of a cache file
   - new function getResourceContentLength to get the size of a cached file
*) httpc.java:
   - Bugfix: resource content was loaded into memory even if this was not requested
*) Crawler:
   - new option to hold loaded resource content in memory
   - adding option to use the worker class without the worker pool 
     (needed by the snippet fetcher)
*) plasmaSnippetCache
   - snippet loader does not use a crawl-worker from pool but uses
     a newly created instance to avoid blocking by normal crawling
     activity.
   - now operates on streams instead of byte arrays to avoid OutOfMemory 
     Exceptions when operating on large files 
   - snippet loader now forces the crawl-worker to keep the loaded
     resource in memory to avoid IO 
*) plasmaCondenser: adding new function getWords that can directly operate on input streams
*) Parsers
   - keep resource in memory whenever possible (to avoid IO)
   - when parsing from stream the content length must be passed to the parser function now.
     this length value is needed by the parsers to decide if the parsed resource content is to large
     to hold it in memory and must be stored to file 
   - AbstractParser.java: new function to pass the contentLength of a resource to the parsers
   


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2701 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 11:05:48 +00:00
orbiter
630a955674 read snippets from cache in case they are not provided in RAM
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2700 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-02 17:18:24 +00:00
orbiter
dbc2e039bb added time-out option parameter to call hierarchy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2691 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-02 09:40:18 +00:00
orbiter
00746ca232 identified and fixed search performance problem caused by
snippet loading. Some access to header-db had been twice and even
more times in some cases. Snippet resource loading fixed.
Furthermore the snippet loading during remote search within the
remote peer has been disabled, but can be switched on remotely by
new flag 'includesnippet=true'

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2688 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-02 01:15:02 +00:00
theli
a2e3095044 *) Bugfix. Add missing plasmaParserDocument.close() calls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2680 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-30 10:09:01 +00:00
low012
f8ac694e51 *) fixed a bug where searchword in snippets were not displayed bold in front of a punctuation mark (see http://www.yacy-forum.de/viewtopic.php?p=25998)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2677 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-30 00:27:42 +00:00
orbiter
df1629b05a - code cleanup
- version 0.471
- moved surftipps to own web page


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2676 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-29 22:27:20 +00:00
theli
625c2ce6b1 *) bugfix for snippet fetching problem if content but not http header is available in cache
See: http://www.yacy-forum.de/viewtopic.php?p=25748

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2651 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-22 11:55:28 +00:00
theli
813a8a8179 *) migration of mimeTypeParser to jmimemagic 0.1
- better mimetype detection for rss feeds
   - better mimetype detection for odt documents (less memory consuming)
   - two new detector classes implementing MagicDetector interface of jmimemagic

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2650 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-22 11:40:46 +00:00
theli
b6c7b91582 *) Parser now throws an ParserException instead of returning null on parsing errors (e.g. needed by snippet fetcher)
*) better logging of parser failures
*) simplified usage of plasmaparser through switchboard
*) restructuring of crawler
   - crawler now returns an error message if it is used in sync mode (e.g. by snippet fetcher)
*) snippet-fetcher: more verbose error messages
*) serverByteBuffer.java: adding new function append(String,encoding)
*) serverFileUtils.java: adding functions to copy only a given number of bytes between streams


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2641 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-20 12:25:07 +00:00
theli
97d2a08ef1 *) restructuring needed to support parsing of documents using various charsets
- serverFileUtils.java: 
   -- adding methods to copy from stream to writer and readers to writers
   -- moving httpc writeX methods into serverFileUtils class
   - serverCharBuffer.java: removing inheritance from Writer class
   - replacing htmlFilterOutputStream by htmlFilterWriter class which handles
     content as char stream
   - htmlFilterContentTransformer.java: deactivating getText mode 
    (still needs to be migrated to use char streams instead of byte streams)
   - changes in several classes to use htmlFilterWriter instead of htmlFilterOutputStream
   - changes in Scraper and Transformer classes to operate on chars instead of bytes
   - httpdProxyHandler.java: bugfix. clientTimeout setting was missing in config file

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2617 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-09-18 10:12:11 +00:00