Commit Graph

48 Commits

Author SHA1 Message Date
orbiter
e627f75415 one more fix to badwords and stopwords
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6316 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-15 11:47:50 +00:00
orbiter
721b88efbd - fixed a problem loading blacklists with new yacycore.jar
- fixed badwords and stopwords initialization

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6315 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-15 11:46:02 +00:00
orbiter
68465c37af added a convenience class to add files into a YaCy index
to make this possible, the yacyURL must be able to process file:// urls, which has also been implemented
testing of the new class resulted in some bugfixes in other classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6313 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-14 21:17:42 +00:00
orbiter
573d03c7d7 added configuration to enable ram table copy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6304 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-07 20:30:57 +00:00
orbiter
3be54e1891 fix to rule when to use a ram table copy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6302 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-07 19:22:12 +00:00
orbiter
700218846c disabled or removed sleep calls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6301 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-07 18:50:44 +00:00
low012
53bbdfd19a *) setting SVN keywords
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6297 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-05 20:41:21 +00:00
low012
25f6145934 *) preventing a null pointer exception when the search word is empty, only one character is entered, or all search words are removed by filters
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6296 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-05 20:31:39 +00:00
orbiter
af3a696fc4 added a fast-fail concept in search processes. The search now has better control if all the remote searches may bring any result. If all processes are finished, then all search tasks fail fast.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6290 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-03 23:09:53 +00:00
orbiter
67eddaec4b changed way to integrate dictionary files:
they must be downloaded manually by the user and placed in DATA/DICTIONARIES/source
for each externally imported dictionary file there will be a translator that converts the input file once
into a YaCy-internal data format.
Files that will be provided together with yacy releases may still be placed in <root>/dictionaries

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6286 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-02 18:42:13 +00:00
orbiter
d656a94f55 fix for bad paths in dictionary processing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6285 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-02 18:24:41 +00:00
orbiter
3b9aaf9e9f - inserted new library tests inside DidYouMean
- some redesign of DidYouMean that was necessary to follow
  a special rule how a library should be used:
  - the library provides words that start or end with a test
    word; this may also be an empty set of words
  - all words that the DidYouMean produced with the four
    production rules are used to generate a set of
    library-completed words
  - if this process results in any words from the library,
    only library-generated words are taken
  - if there is no library-generated word at all, take the
    artificially generated words
  - all words that result from these rules are tested against
    the index
  - the result is ordered using a lightweight comparator that
    prefers short words
  - a less IO-intensive test against the index is being prepared
    next
- inserted the library initialization into the switchboard

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6284 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-02 13:41:56 +00:00
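The selection rule described in the commit message for r6284 can be sketched roughly as follows. This is a minimal illustration, not the YaCy implementation: the class DidYouMeanSketch, its method names, and the use of plain Set<String> objects for the library and the index are all assumptions made for this example. The four production rules themselves are not modelled; their output is passed in as the generated set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the rule from the commit message; all names are assumptions.
class DidYouMeanSketch {

    // Lightweight comparator that prefers short words.
    // Ties are broken alphabetically so that the order is deterministic.
    static final Comparator<String> SHORT_FIRST = (a, b) -> {
        int byLength = Integer.compare(a.length(), b.length());
        return byLength != 0 ? byLength : a.compareTo(b);
    };

    // Library lookup: all words that start or end with the test word.
    // The result may be an empty set.
    static Set<String> complete(Set<String> library, String testWord) {
        Set<String> completed = new TreeSet<>();
        for (String word : library) {
            if (word.startsWith(testWord) || word.endsWith(testWord)) {
                completed.add(word);
            }
        }
        return completed;
    }

    // generated: words produced by the production rules (assumed to be given).
    // index: stand-in for the search index, modelled as a set of known words.
    static List<String> suggest(Set<String> generated, Set<String> library, Set<String> index) {
        Set<String> fromLibrary = new TreeSet<>();
        for (String word : generated) {
            fromLibrary.addAll(complete(library, word));
        }
        // Library-generated words win; fall back to the generated words otherwise.
        Set<String> candidates = fromLibrary.isEmpty() ? generated : fromLibrary;

        // Keep only candidates that pass the test against the index.
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (index.contains(candidate)) {
                result.add(candidate);
            }
        }
        result.sort(SHORT_FIRST);
        return result;
    }

    public static void main(String[] args) {
        Set<String> library = new TreeSet<>(Arrays.asList("house", "household", "mouse"));
        Set<String> index = new TreeSet<>(Arrays.asList("house", "household", "hose"));
        Set<String> generated = new TreeSet<>(Arrays.asList("hous", "hose"));
        System.out.println(suggest(generated, library, index));
    }
}
```

Passing an empty library to suggest exercises the fallback branch, where only the generated words are tested against the index.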
orbiter
bf8ed00e9e removed debugging code
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6280 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-30 11:03:34 +00:00
orbiter
ead48c4b25 fix for preparation of search result pages with offset > 10:
- less pages are fetched in advance
- just-in-time fetch of next required pages
- fix for missing hand-over of offset to fetch threads

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6279 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-30 10:28:23 +00:00
orbiter
10d3e856b5 better concurrency, less blocking & performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6277 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-29 23:34:14 +00:00
orbiter
1a9cfd8718 some performance hacks (CPU only, not IO)
this will cause better computation speed for single- and multi-core;
there are enhancements that will speed up old and slow machines as well
as multi-core CPUs. Indexing of surrogates has been sped up
from 4000 PPM to over 20000 PPM on a simple dual core office computer.
Since the enhancements are mostly in core routines, the hack should also
speed up search performance.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6276 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-28 13:28:11 +00:00
orbiter
92407009b2 cleanup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6275 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-27 23:20:59 +00:00
orbiter
0ba1beaf56 separated rwi constraint evaluation from rwi ranking and added concurrency
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6274 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-27 22:54:32 +00:00
orbiter
b0637600d5 enhanced url constraint computation: better position of constraint check during retrieval process
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6272 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-27 20:20:07 +00:00
orbiter
61748285c3 more refactoring of search
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6270 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-27 15:19:48 +00:00
orbiter
323a8e733d removed unused classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6269 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-27 14:42:05 +00:00
orbiter
72e5407115 refactoring of snippet cache
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6268 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-27 14:34:41 +00:00
orbiter
e7736d9c8d more refactoring: made all variables in SearchEvent private
to prepare splitting of the class into two parts: local and remote search

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6265 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-26 15:59:55 +00:00
orbiter
d8ca6e6bf1 more refactoring for search
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6263 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-25 21:27:01 +00:00
orbiter
fe4a4e3f6b added missing class
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6261 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-24 21:03:40 +00:00
orbiter
72ac5bd80f refactoring of search process.
this is the beginning of some architecture changes that will hopefully bring some more stability, speed and transparency to the search process.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6260 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-24 15:24:02 +00:00
orbiter
d9744b1b5d replaced old caching strategy control class with lightweight simplearc
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6254 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-07 23:01:33 +00:00
orbiter
92edd24e70 fixed problem with switching of networks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6247 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-30 15:49:23 +00:00
orbiter
c4ae2cd03f fixed bug that caused deletion of crawl profiles at every application startup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6240 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-23 22:09:02 +00:00
orbiter
161d2fd2ef redesign of access to the HTCache (now http.client.Cache):
- better control of the cache by using combined request-header and content access methods
- refactoring of many classes to comply with this new access method
- make sure that the cache is always written if something was loaded
- some redesign of the process by which http response results are fed into the new indexing queue
- introduction of a cache read policy:
 * never use the cache
 * use the cache if an entry exists
 * use the cache if the proxy freshness rule confirms it
 * use only the cache and never go online
- added configuration options for the crawl profiles to use the new cache policies. There is not yet an input during crawl start to set the policy, but this will be added in another step.
- set the default policies for the existing crawl profiles. If you want them to appear in your default profiles you must delete the crawl profiles database; otherwise the policy is 'proxy freshness rule'
- enhanced some cache access methods in such a way that unnecessary retrievals are omitted (i.e. for size computation). That should reduce some IO but also a lot of CPU computation because sizes were computed after decompression of content after retrieval of the content from the disc.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6239 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-23 21:31:51 +00:00
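The four cache read policies listed in the commit message for r6239 could be modelled along these lines. This is only a sketch under stated assumptions: the class CachePolicySketch, the enum constant names, the Source enum and the decide method are invented for illustration and are not taken from the YaCy source. The freshness check is reduced to a boolean that the caller supplies.

```java
// Hypothetical sketch of the cache read policy; all names are assumptions.
class CachePolicySketch {

    // One constant per policy named in the commit message.
    enum CachePolicy {
        NOCACHE,   // never use the cache
        IFEXIST,   // use the cache if an entry exists
        IFFRESH,   // use the cache if the proxy freshness rule confirms it
        CACHEONLY  // use only the cache and never go online
    }

    // Where the content should be loaded from; NONE means nothing is loaded.
    enum Source { CACHE, NETWORK, NONE }

    // inCache: an entry exists in the cache.
    // fresh: the proxy freshness rule accepts the cached entry.
    static Source decide(CachePolicy policy, boolean inCache, boolean fresh) {
        switch (policy) {
            case NOCACHE:
                return Source.NETWORK;
            case IFEXIST:
                return inCache ? Source.CACHE : Source.NETWORK;
            case IFFRESH:
                return (inCache && fresh) ? Source.CACHE : Source.NETWORK;
            case CACHEONLY:
                return inCache ? Source.CACHE : Source.NONE;
            default:
                throw new IllegalArgumentException("unknown policy: " + policy);
        }
    }

    public static void main(String[] args) {
        // Print the decision for every policy when a stale entry is cached.
        for (CachePolicy policy : CachePolicy.values()) {
            System.out.println(policy + " -> " + decide(policy, true, false));
        }
    }
}
```

Keeping the decision in one pure function makes each policy easy to check in isolation; how a real loader reacts to Source.NONE is left open here.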
f1ori
ba2e6de538 fix empty version string again
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6236 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-21 19:56:40 +00:00
orbiter
4da9042e8a code simplification
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6233 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-19 21:59:29 +00:00
orbiter
1d8d51075c refactoring:
- removed the plasma package. The name of that package came from a very early pre-version of YaCy, even before YaCy was named AnomicHTTPProxy. The Proxy project introduced search for cache contents using class files that had been developed during the plasma project. Information from 2002 about plasma can be found here:
http://web.archive.org/web/20020802110827/http://anomic.de/AnomicPlasma/index.html
We still have one class that comes mostly unchanged from the plasma project, the Condenser class. But this is now part of the document package and all other classes in the plasma package can be assigned to other packages.
- cleaned up the http package: better structure of that class and clean isolation of server and client classes. The old HTCache becomes part of the client sub-package of http.
- because the plasmaSwitchboard is now part of the search package all servlets had to be touched to declare a different package source.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6232 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-19 20:37:44 +00:00
orbiter
b332dfad67 - inserted the request object into the response object, which now carries it instead of generating new objects
- fixed a problem with the crawler introduced in SVN 6216

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6222 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-15 23:08:35 +00:00
orbiter
ca72ed7526 -removed superfluous crawl cache
-refactoring of crawler classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6221 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-15 21:07:46 +00:00
orbiter
b2263bc720 enhanced document type recognition
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6209 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-14 11:01:05 +00:00
orbiter
57a88d435b redesign of parser mime type detection and parser steering
There is now a mime-blacklist instead of a mime-whitelist

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6190 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-10 14:22:17 +00:00
orbiter
21b8704fb4 refactoring of the ParserDispatcher and ParserConfig: resulted in Idiom, Parser and Classification classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6188 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-09 22:25:31 +00:00
orbiter
8ca1f5d400 - some work to integrate the html parser the same way as the other parsers are integrated (not finished)
- added migration of code of settings pages (hmm.. does not work correctly yet, sorry)
- more refactoring
- removed more unused code

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6187 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-09 20:56:30 +00:00
orbiter
0e8647d62f refactoring of search classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6184 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-08 22:14:57 +00:00
orbiter
dafffd0153 refactoring of parsers and document processing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6182 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-08 21:48:08 +00:00
orbiter
222850414e simplification of the code: removed unused classes, methods and variables
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6154 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-30 09:27:46 +00:00
orbiter
99fa265e1d fix for search bug caused by tenant patch
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6125 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-22 22:31:29 +00:00
orbiter
57af311627 fix for wrong urls in navigator when a tenant is used
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6119 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-22 12:25:18 +00:00
orbiter
8b8877c233 moved image collector
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6087 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-16 21:48:09 +00:00
orbiter
be1c7ddc64 refactoring of search classes -- moved Ranking Profile to search package
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6086 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-16 21:45:40 +00:00
orbiter
bc6dd8194b refactoring: moved search query class to new search package
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6075 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-15 11:49:00 +00:00
orbiter
a4805defdd added stub for new search process
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6074 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-06-15 11:46:23 +00:00