205 Commits

Author SHA1 Message Date
orbiter
44579fa06d - fixed a problem loading images through yacy's document loader,
this denied non-parseable documents which excluded all images
- fixed url of osm tile server

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6287 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-03 11:46:08 +00:00
orbiter
72e5407115 refactoring of snippet cache
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6268 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-27 14:34:41 +00:00
orbiter
8e56c2ace6 fix for fixes from this afternoon
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6253 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-07 22:53:49 +00:00
orbiter
cf739edc2e fix for possible deadlock, see
http://forum.yacy-websuche.de/viewtopic.php?p=17017#p17017

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6252 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-07 12:11:22 +00:00
orbiter
92edd24e70 fixed problem with switching of networks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6247 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-30 15:49:23 +00:00
orbiter
0575f12838 fix for deadlock
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6246 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-30 09:08:44 +00:00
orbiter
c0e17de2fb - fixes for some problems with the new crawling/caching strategies
- speed enhancements for the cache-only cache policy by using special no-delay rules in the balancer
- fixed some deadlock- and 100% CPU problems in the balancer

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6243 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-25 21:38:57 +00:00
orbiter
634a01a9a4 replaced wget-requests with caching requests
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6242 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-24 14:52:27 +00:00
orbiter
c6c97f23ad - added cache usage properties to crawl start
- added special rule to balancer to omit forced delays if cache is used exclusively
- extended the htCache size by default to 32GB

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6241 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-24 11:54:04 +00:00
orbiter
c4ae2cd03f fixed bug that caused deletion of crawl profiles at every application startup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6240 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-23 22:09:02 +00:00
orbiter
161d2fd2ef redesign of access to the HTCache (now http.client.Cache):
- better control of the cache by using combined request-header and content access methods
- refactoring of many classes to comply with this new access method
- make sure that the cache is always written if something was loaded
- some redesign of the process by which http response results are fed into the new indexing queue
- introduction of a cache read policy:
 * never use the cache
 * use the cache if the entry exists
 * use the cache if the proxy freshness rule confirms it
 * use only the cache and never go online
- added configuration options for the crawl profiles to use the new cache policies. There is not yet an input during crawl start to set the policy, but this will be added in another step.
- set the default policies for the existing crawl profiles. If you want them to appear in your default profiles you must delete the crawl profiles database; otherwise the policy is 'proxy freshness rule'
- enhanced some cache access methods in such a way that unnecessary retrievals are omitted (i.e. for size computation). That should reduce some IO but also a lot of CPU computation, because sizes were computed only after the content had been retrieved from disk and decompressed.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6239 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-23 21:31:51 +00:00
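The four cache read policies listed in 161d2fd2ef above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names CachePolicy, Source and CachePolicySketch are hypothetical and are not taken from the YaCy sources, and the freshness check is reduced to a boolean supplied by the caller.

```java
// Hypothetical sketch of a cache read policy; not the actual YaCy classes.
enum CachePolicy {
    NEVER,       // never use the cache
    IF_EXIST,    // use the cache if the entry exists
    IF_FRESH,    // use the cache if the proxy freshness rule confirms it
    CACHE_ONLY   // use only the cache and never go online
}

// Where the content should be taken from for one request.
enum Source { CACHE, ONLINE, NONE }

final class CachePolicySketch {

    // inCache: an entry for the URL exists in the cache
    // fresh:   the (assumed) proxy freshness rule accepts that entry
    static Source choose(CachePolicy policy, boolean inCache, boolean fresh) {
        switch (policy) {
            case NEVER:
                return Source.ONLINE;
            case IF_EXIST:
                return inCache ? Source.CACHE : Source.ONLINE;
            case IF_FRESH:
                return (inCache && fresh) ? Source.CACHE : Source.ONLINE;
            case CACHE_ONLY:
                // no online fallback: a cache miss yields nothing
                return inCache ? Source.CACHE : Source.NONE;
            default:
                throw new IllegalArgumentException("unknown policy: " + policy);
        }
    }
}
```

Only CACHE_ONLY can end without any content, which is what allows a balancer to skip forced delays for it: no remote host is ever contacted.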
orbiter
4da9042e8a code simplification
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6233 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-19 21:59:29 +00:00
orbiter
1d8d51075c refactoring:
- removed the plasma package. The name of that package came from a very early pre-version of YaCy, even before YaCy was named AnomicHTTPProxy. The Proxy project introduced search for cache contents using class files that had been developed during the plasma project. Information from 2002 about plasma can be found here:
http://web.archive.org/web/20020802110827/http://anomic.de/AnomicPlasma/index.html
We still have one class that comes mostly unchanged from the plasma project, the Condenser class. But this is now part of the document package and all other classes in the plasma package can be assigned to other packages.
- cleaned up the http package: better structure of that class and clean isolation of server and client classes. The old HTCache becomes part of the client sub-package of http.
- because the plasmaSwitchboard is now part of the search package all servlets had to be touched to declare a different package source.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6232 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-19 20:37:44 +00:00
orbiter
5bb8074150 removed the indexing queue. This queue was superfluous since the introduction of the blocking queues last year, where documents are parsed, analysed and stored in the index with concurrency.
- The indexing queue was a historic data structure that was introduced at the very beginning of the project as a part of the switchboard organisation object structure. Without the indexing queue the switchboard queue also becomes superfluous. It has been removed as well.
- Removing the switchboard queue requires that all servlets are called without an opaque generic ('<?>'). As a result, all servlets had to be modified.
- Many servlets displayed the indexing queue or the size of that queue. In the past months the indexer was so fast that the indexing queue mostly appeared empty, so it was no longer of any use. Because the queue has been removed, its display in the servlets also had to be removed.
- The surrogate work task had been a part of the indexing queue control structure. Without the indexing queue the surrogates needed their own task management. That has been integrated here.
- Because the indexing queue had a special queue entry object with properties attached to it, the properties had to be moved to the queue entry object of the new indexing queue within the blocking queue, the Response object. That object now also has the properties of the removed indexing queue entry object.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6225 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-17 13:59:21 +00:00
orbiter
b332dfad67 - inserted request object into response object, which now carries it instead of generating new objects
- fixed a problem with the crawler introduced in SVN 6216

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6222 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-15 23:08:35 +00:00
orbiter
ca72ed7526 - removed superfluous crawl cache
- refactoring of crawler classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6221 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-15 21:07:46 +00:00
orbiter
13c63f4082 a set of small fixes to crawling behaviour
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6216 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-15 14:15:51 +00:00
orbiter
43c8defd79 enhanced parser with more extension + mime attributes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6214 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-14 13:32:53 +00:00
orbiter
b2263bc720 enhanced document type recognition
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6209 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-14 11:01:05 +00:00
lotus
aa38eb5a20 * maxfilesize -1 for infinite filesize
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6208 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-14 08:39:39 +00:00
lotus
9cfe89c8fc * process content-length as soon as it is received
* corrected indentation

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6206 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-13 19:55:13 +00:00
lotus
9f083bb6b2 check filetype before loading (no more mp4 loading)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6200 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-12 16:50:11 +00:00
f1ori
f814e0fa81 enable warnings and fix most of them
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6196 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-11 21:01:27 +00:00
orbiter
57a88d435b redesign of parser mime type detection and parser steering
There is now a mime-blacklist instead of a mime-whitelist

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6190 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-10 14:22:17 +00:00
orbiter
21b8704fb4 refactoring of the ParserDispatcher and ParserConfig: resulted in Idiom, Parser and Classification classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6188 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-09 22:25:31 +00:00
orbiter
dafffd0153 refactoring of parsers and document processing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6182 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-08 21:48:08 +00:00
orbiter
024744245c small refactoring to prepare for new queues
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6173 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-04 12:17:10 +00:00
orbiter
24cb6d68bc - renamed Stack to RecordStack to avoid name confusion with new classes
- added new Stack class that implements a stack on BLOB files
- added new Stacks class that can be used for a set of Stacks (a 'Stack Database')
- added methods to other classes to support the new stacks

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6169 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-03 16:35:34 +00:00
orbiter
995da28c73 all stack/heap files that had been stored in DATA/PLASMA are now stored in the network-specific QUEUES path
There is no migration. All crawls must be restarted.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6167 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-02 17:01:23 +00:00
orbiter
409538e17a code cleanup and code simplification
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6161 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-30 22:20:55 +00:00
orbiter
1f1399e5c5 extending visibility of objects and methods to avoid synthetic accessor methods and increase performance
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6156 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-30 13:25:46 +00:00
orbiter
154bbc3364 code cleanup: call of static methods directly to the class
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6155 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-30 13:01:35 +00:00
orbiter
222850414e simplification of the code: removed unused classes, methods and variables
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6154 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-30 09:27:46 +00:00
orbiter
93dfb51fd4 problems with code style
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6153 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-29 22:22:35 +00:00
orbiter
9a674d8047 - After the removal of the Tree class some code simplifications are possible. This affects mostly the Records class, which can be refactored, resulting in a reduced number of classes.
- The EcoTable was renamed to Table.


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6151 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-28 21:37:37 +00:00
orbiter
c5122d6836 completed migration of BLOBTree to BLOBHeaps:
- removed migration code
- removed BLOBTree
after the removal of the BLOBTree, a lot of dead code appeared:
- removed dead code that was needed for BLOBTree
Some more classes may not have much use any more after the removal of BLOBTree, but still have some components that are needed elsewhere. Additional refactoring steps are needed to clean up dependencies, and then more code may appear that is unused and can be removed as well.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6150 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-28 21:02:56 +00:00
orbiter
ae015e8e98 refactoring of blob package classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6088 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-17 09:58:15 +00:00
orbiter
ce1adf9955 serialized all logging using concurrency:
high-performance search query situations as seen in the yacy-metager integration showed deadlock situations caused by synchronization effects inside of sun.java code. It appears that the logger is not completely safe against deadlocks in concurrent calls of the logger. One possible solution would be an outside synchronization with 'synchronized' statements, but that would apply further blocking on all high-efficiency methods that call the logger. It is much better to do a non-blocking hand-over of logging lines and work off log entries with a concurrent log writer. This also decouples IO operations from the logging call, since logging itself can cause IO operations when a log is written to a file. This commit not only moves the logger from kelondro to yacy.logging, it also adds the concurrency methods needed to realize non-blocking logging.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6078 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-15 21:19:54 +00:00
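The non-blocking hand-over of log lines described in ce1adf9955 above might look roughly like the following. This is a minimal sketch, not the yacy.logging implementation: the class name AsyncLogSketch, the poison-pill shutdown and the in-memory sink (standing in for a real file writer) are all assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: callers hand lines to a queue, one writer thread drains it.
final class AsyncLogSketch {

    // Unique sentinel object; compared by identity to signal shutdown.
    private static final String POISON = new String("POISON");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> sink = new CopyOnWriteArrayList<>();
    private final Thread writer;

    AsyncLogSketch() {
        writer = new Thread(() -> {
            try {
                while (true) {
                    String line = queue.take();   // only the writer thread blocks here
                    if (line == POISON) break;
                    sink.add(line);               // a real writer would do the file IO here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-writer");
        writer.setDaemon(true);
        writer.start();
    }

    // Called from any thread; the queue is unbounded, so offer() does not wait.
    void log(String line) {
        queue.offer(line);
    }

    // Drains the remaining lines, stops the writer and returns what was written.
    List<String> close() {
        queue.offer(POISON);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sink;
    }
}
```

The calling threads never take the logger's own locks and never wait for IO; the trade-off in this sketch is an unbounded queue, which can grow if the writer falls behind.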
orbiter
b8e738a7be a collection of
- small bug fixes
- better/more comments
- more asserts
- fixed synchronization
- test case enhancements
- code cleanup
- performance hacks

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6073 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-14 22:09:08 +00:00
orbiter
d58b395993 fix for http://forum.yacy-websuche.de/viewtopic.php?p=15693#p15693
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6049 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-11 09:38:25 +00:00
orbiter
b6e274f211 omit most of the forced crawl delays by using a separate delay table which flushes delayed URLs at the correct time
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6029 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-06 16:20:27 +00:00
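The idea behind the delay table in b6e274f211 above can be illustrated as follows: instead of sleeping when the next URL is not yet allowed to be fetched, the URL is parked together with its release time and moved back to the ready queue once that time has passed. This is a sketch under assumptions, not the balancer's actual code; DelayTableSketch and its method names are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of a delay table in front of a crawl queue.
final class DelayTableSketch {

    // release time (ms) -> URLs that may be fetched from that time on
    private final TreeMap<Long, List<String>> delayed = new TreeMap<>();
    // URLs that can be fetched immediately
    private final Deque<String> ready = new ArrayDeque<>();

    // Queue a URL that needs no delay.
    void push(String url) {
        ready.addLast(url);
    }

    // Park a URL until releaseTimeMs instead of blocking the crawl thread.
    void delay(String url, long releaseTimeMs) {
        delayed.computeIfAbsent(releaseTimeMs, k -> new ArrayList<>()).add(url);
    }

    // Return the next fetchable URL, or null if none is ready at nowMs.
    String pop(long nowMs) {
        flush(nowMs);
        return ready.pollFirst();
    }

    // Move every URL whose release time has passed to the ready queue.
    private void flush(long nowMs) {
        while (!delayed.isEmpty() && delayed.firstKey() <= nowMs) {
            ready.addAll(delayed.pollFirstEntry().getValue());
        }
    }
}
```

A crawl thread using such a table keeps working on other hosts while delayed URLs wait, rather than sleeping for the full crawl delay of a single host.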
orbiter
d50be59088 - added an automatic re-construction of the domain stack after 10 minutes. This adds urls to the domain stack that were left over because of stack size limitations when the domain stack was last created
- changed the busy sleep time for the crawl thread to 30 milliseconds. This is sufficient to crawl with 2000 PPM.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6028 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-06 09:34:44 +00:00
orbiter
5fdba0fa51 - fixed a non-working selection rule in balancer
- more safety around crawl-delay, be more fail-safe
- better logging in case of long forced crawl-delays

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6027 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-06 08:46:59 +00:00
orbiter
f5602404d5 another speed boost for the balancer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6026 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-06 02:37:04 +00:00
orbiter
95e8cbd1c3 new fully redesigned balancer and bugfixes regarding lost profile handles and killed crawls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6025 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-06 01:56:31 +00:00
orbiter
42ae40b9f6 some bugfixes to database close() methods
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6023 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-04 22:43:46 +00:00
orbiter
88426912ad more refactoring to make the segment object easier to use and to be prepared to integrate author navigation
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5992 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-05-29 10:03:35 +00:00
orbiter
99bf0b8e41 refactoring of plasmaWordIndex:
divided that class into three parts:
- the peers object is now hosted by the plasmaSwitchboard
- the crawler elements are now in a new class, crawler.CrawlerSwitchboard
- the index elements are core of the new segment data structure, which is a bundle of different indexes for the full text and (in the future) navigation indexes and the metadata store. The new class is now in kelondro.text.Segment

The refactoring is inspired by the roadmap to create index segments, the option to host different indexes on one peer.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5990 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-05-28 14:26:05 +00:00
orbiter
3d4b826ca5 migration of all databases that use the deprecated BLOBTree format into the BLOBHeap format. Old databases are migrated automatically.
This removes the last very IO-intensive data structures which were still used for Wiki, Blog and Bookmarks. Old database files will still remain in the DATA subdirectory but can be deleted manually if no major bugs appear during migration. There is no need for any user action, all migration is done automatically.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5986 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-05-27 15:04:04 +00:00
orbiter
63a0255166 - refactoring: added new content package, which will contain connector classes for different types of data sources to import texts into the YaCy index
- refactoring: migrated data objects for the new connector classes
- added a DAO interface class to specify an abstract interface for database retrieval connector methods

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5977 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-05-26 07:44:22 +00:00