Commit Graph

505 Commits

Author SHA1 Message Date
orbiter
acf771d5e1 - fixed bug with too much RAM in crawler queue
- fixed dir bug
- better calculation of TF for join
- better waiting-on-result logic

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4424 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-01-31 23:40:47 +00:00
orbiter
a8a5df4a51 - more dublin core naming of page metadata
- better presentation of result counters in search results

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4420 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-01-30 21:58:30 +00:00
orbiter
15397298dc - refactoring of indexControlRWIs: moved statics to own class; better Dublin Core naming
- fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=759&hilit=&p=4866#p4866
- some bugfixes in EcoTable according remove method
- switched more tables to Eco: crawl Profiles, htcache, seeddb, newsdb

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4397 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-01-24 22:49:00 +00:00
orbiter
efd0b8371a - added parsing of Dublin Core - compliant metadata (see RFC 5013 and ISO 15836) to html parser
- refactoring of plasmaParserDocument to use Dublin Core - compatible property names
- redesign of url handling in parser and condenser (less String-to-yacyURL conversion)
- more generics

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4352 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-01-22 11:51:43 +00:00
orbiter
f4e9ff6ce9 more generics
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4343 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-01-19 00:40:19 +00:00
orbiter
45339c3db5 more generics
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4341 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-01-18 17:14:02 +00:00
orbiter
dc26d6262b - removed write buffer from kelondroCache (was never used because buggy; will now be replaced by new EcoBuffer)
- added new data structure 'eco' for an index file that should use only 50% of write-IO compared to kelondroFlex
The new eco index is not used yet, but already successfully tested with the collectionIndex
The main purpose is to replace the kelondroFlex at every point when enough RAM is available.
Othervise, the kelondroFlex stays as option in case of low memory (which then can even use a file-index)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4337 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-01-17 12:12:52 +00:00
orbiter
a5054c038d - added large number of generics
- redesign of ordering structures in kelondro (old did not work with strict generics)
- 50% IO reduction during read access on kelondroFlex (ommiting of read on index table)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4320 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-01-11 00:12:01 +00:00
orbiter
03e7782269 more generics
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4305 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-01-06 19:23:38 +00:00
orbiter
4dc438f7e7 moved to Java 1.5:
- changed build script to use java 1.5 compiler
- first stept to resolve missing generics definition (about 400 from over 4100 'missing'-warnings)
- added key-iterator to kelondro databases (for rapid from-memory enumerations, will be used for domain name collection, not used yet)

please set your development environment to use java 1.5!


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4292 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-12-27 17:56:59 +00:00
fuchsi
21b8d1b918 small cosmetic change for static fields in serverCore (special protocol ASCII entities) to improve readability
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4275 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-12-14 19:17:54 +00:00
orbiter
f243e338cf implemented online caution also for local and remote search
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4252 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-12-06 21:53:17 +00:00
orbiter
b46bcaa5d8 changed method of profiling
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4248 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-12-04 20:19:13 +00:00
orbiter
aefb3f7765 added memory graph picture to PerformanceMemory_p.html
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4241 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-30 03:22:42 +00:00
orbiter
9b0ae4b989 added referrer to remote crawl url list
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4236 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-29 13:58:00 +00:00
orbiter
89b9b2b02a redesigned remote crawl process:
- instead of pushing urls to other peers, the urls are actively pulled
  by the peer that wants to do a remote crawl
- the remote crawl push process had been removed
- a process that adds urls from remote peers had been added
- the server-side interface for providing 'limit'-urls exists since 0.55 and works with this version
- the list-interface had been removed
- servlets using the list-interface had been removed (this implementation did not properly manage double-check)
- changes in configuration file to support new pull-process
- fixed a bug in crawl balancer (status was not saved/closed properly)
- the yacy/urls-protocol was extended to support different networks/clusters
- many interface-adoptions to new stack counters

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4232 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-29 02:07:37 +00:00
orbiter
af10f729df fixed image search and favicon loading
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4225 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-22 01:34:29 +00:00
orbiter
c527969185 - enhanced monitoring of ranking parameters
for details, please try http://localhost:8080/IndexControlRWIs_p.html
- fixed computation of ranking ordering in some cases

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4220 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-16 14:48:09 +00:00
orbiter
6eaa5a0e64 enhanced local search speed. The ranking process is now 6 times faster that before.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4197 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-07 22:38:09 +00:00
fuchsi
425e4ead66 Allow absolute paths in configuration settings.
- before absolute paths would be expanded incorrectly, e.g.: fooPath=/a/b/c would become /path/to/yacy/root/a/b/c. Now you can put nearly every dynamically generated data with a configurable path to a location outside of yacys root dir without having to use symlinks (probably good for third party distribution packaging).
- abstractServerSwitch.getConfigPath(setting, default) returns a File instance, either with an absolute path or relative to the applications root path.

- exceptions (hardcoded): 
  DATA/LOG/yacy.logging
  DATA/SETTINGS/httpProxy.conf
  DATA/SETTINGS/user.db
TODO: all of these are the global configuration files and they should probably be put into _one_ command line configurable settings path, so it would be possible to package them in /etc/ for example.

- add missing workPath to yacy.init (it was used in code, but there was no default in the file)
- fix broken skinPath (was skinsPath in yacy.init but skinsPath in the code) + a few other broken config reading caused by typos.
- replaced path setting names and their default values with the related static fields in plasmaSwitchboard where not already done/existing

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4196 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-04 10:36:25 +00:00
borg-0300
a5d28785b1 less OOM (works for me)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4194 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-02 14:55:46 +00:00
orbiter
ccbfb15b6b enhancement to crawl stacker enqueue order
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4192 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-01 00:57:32 +00:00
orbiter
55c87b3b12 changed behavior of crawl stacker
- final flush only when tabletype = RAM
- prestacker (dns prefetch) only if tabletype = RAM and busytime <= 100
- number of maximun entries in stacker is configurable in yacy.init (stacker.slots)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4186 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-31 11:32:40 +00:00
orbiter
a31b9097a4 preparations for mass remote crawls:
two main changes must be implemented to enable mass remote crawls:
- shift control of robots.txt to crawl queue (away from stacker). This is necessary since remote
  crawls can contain unchecked urls. Each peer must check the robots to prevent that it is misused
  as crawl agent for unwanted file retrieval
- implement new index files that control double-check of remotely crawled urls

After removal of robots.txt checking from stacker threads, the multi-threading of this process is void.
Multithreading has been removed. Also the thread pools for the crawl threads had been removed, since
creation of these threads is not resource-consuming, for a detailed explanation see svn 4106

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4181 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-29 01:43:20 +00:00
fuchsi
508de558f7 sbStackCrawlThread is null during first cleanProfiles() run at startup.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4152 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-08 15:56:40 +00:00
fuchsi
70614385ef Attempt to fix the "lost profile handle" bug.
It seems improbable, but it might happen, that during a crawl all queues (indexing, crawling, ...) except the crawl URL stacker ran empty. This commit adds an additional check for an empty crawl stacker queue before executing the profile cleaner.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4151 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-08 15:11:26 +00:00
orbiter
33fb2f756d added emergency fail case in remote crawls
in extreme situations this will cause that no remote crawls are send out any more
this is bad, but it protects the case where failing remote crawls fill up the local queue too much,
which is even worse

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4141 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-04 10:40:30 +00:00
fuchsi
03c5b4ad68 more fixes to the yacysearch.rss, it's now 100% valid according to http://feedvalidator.org
- RFC-822 date time had to include the time instead of date only
- <opensearch:link> doesn't exist -> <atom:link>, see http://www.opensearch.org/Specifications/OpenSearch/1.1
- <link> elements are mandatory for <channel> and <item>

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4131 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-03 04:00:52 +00:00
fuchsi
7404f2c35c Fix some of the issues with the RSS search interface, see http://forum.yacy-websuche.de/viewtopic.php?f=6&t=392
Note: the new DateFormatter822 in the plasmaSwitchboard is just a copy of the DateFormatter that always uses the US locale to allow formatting of a loocale independent date String.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4124 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-02 21:28:29 +00:00
orbiter
98abe0804d another enhancement to crawl starts with link files
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4123 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-02 20:30:42 +00:00
orbiter
01e0669264 re-designed some parts of DHT position calculation (effect is the same as before)
and replaced old fist hash computation by new method that tries to find a gap in the current dht
to do this, it is necessary that the network bootstraping is done before the own hash is computed
this made further redesigns in peer initialization order necessary

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4117 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-01 12:30:23 +00:00
orbiter
842308ea97 - redesigned crawl start menu, integrated monitoring pages
- removed web structure picture from indexing menu and grouped it together with htcache monitor
- added a database for terminated crawls, when a crawl is finished it is automatically moved to the new database
- extended crawl profile edit servlet, shows now also terminated crawls
- option that was used to delete profiles is now redesigned to a function that moves the current crawl to the terminated crawls and removes all urls from the current queues!
- fixed here and there problems with indexing queues
- enhances indexing speed by changing cache flush sizes.
- changed behaviour of crawl result servlet: the list of crawled urls is shown if there is one, othevise the overview window is shown

attention: the new profile databases are not compatible with the old one. current crawls will be lost! the web index is not touched.
next steps: the database of terminated crawls can be used to start with them a new crawl. This is useful if one wants to re-crawl specific pages and wants to use a old crawl profile.


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4113 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-28 01:21:31 +00:00
orbiter
3c74014004 automatic deletion of dead client connections
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4110 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-25 22:46:11 +00:00
orbiter
4275727d69 fix for peer ping problem (implemented a 3-time re-ping); cause for 'Connection reset' still unknown
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4095 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-12 00:42:53 +00:00
fuchsi
5b0c1449e1 various fixes and cleanups for blacklist handling:
1. avoid adding duplicate file name entries in config properties for lists, 
2. correctly merge all path masks from all list files for the same host masks,
3. rewrite helper methods standard java methods for Collection transformations,
4. merged various methods with identical functionality for different Collection implementations into one,
5. minor refactoring to improve code readability.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4087 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-10 06:20:27 +00:00
orbiter
6c819a6fd9 added cache to favicon display
added better synchronization for simultanous search requests

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4076 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-06 01:28:35 +00:00
orbiter
daf0f74361 joined anomic.net.URL, plasmaURL and url hash computation:
search profiling showed, that a major amount of time is wasted by computing url hashes. The computation does an intranet-check, which needs a DNS lookup. This caused that each urlhash computation needed 100-200 milliseconds, which caused remote searches to delay at least 1 second more that necessary. The solution to this problem is to attach a URL hash to the URL data structure, because that means that the url hash value can be filled after retrieval of the URL from the database. The redesign of the url/urlhash management caused a major redesign of many parts of the software. Since some parts had been decided to be given up they had been removed during this change to avoid unnecessary maintenance of unused code.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4074 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-09-05 09:01:35 +00:00
orbiter
f81ef40cc4 no dht activity for small networks; this is not needed if the network is small
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4062 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-26 22:35:26 +00:00
orbiter
a34d9b8609 * added a search history cache that maintains search results for 10 minutes
it is necessary for the new search process that will do automatic re-searches
a positive effect is, that when a re-search is done it can be monitored how many
results had been contributed from other peers. The message for this contribution
was moved from the end of the result page to the top.
* enhanced re-search time when a global search was done an the local index has
already a great number of results for this word
* re-organised presearch computation; must be further enhanced

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4059 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-24 23:12:59 +00:00
orbiter
bb426565f0 added new yacy protocol for mass url-pull for better remote crawling distribution
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4056 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-22 00:59:05 +00:00
orbiter
b5346141b3 made the plasmaHTCache static (there is only one internet, so we need only one cache)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4045 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-15 21:31:31 +00:00
orbiter
947fc46904 refactoring of search process:
- re-designed remote request result processing
- re-designed local result accumulation, will be further enhanced with snippet fetcher
- removed search process handling in switchboad
- made snippet class static (there is no need for multiple snippet objects)
- removed some redundant tasks in server-side search process, should be a little bit faster now


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4043 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-15 11:36:59 +00:00
orbiter
5605887571 refactoring of search processes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4030 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-05 23:57:25 +00:00
orbiter
62347b50f4 added security layer for ViewImage:
- images may be requested by localhost and authorized users only, if the request is done using a clear-text URL
- the image may be requested also using a code that can be a license to retrieve a URL for everyone
- some servelets produce URL licenses for ViewImage, like image search results


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4027 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-03 23:06:53 +00:00
orbiter
69d640b041 added missing synchronization in crawl balancer
to avoid that the synchronization is triggered during many-time-used size() operation
a notEmpty method was added that can avoid the synchronization many times

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4025 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-03 12:21:46 +00:00
orbiter
9628db6cdc enhanced memory allocation during database access:
- refactoring of kelondroRecords; this class is now divided into
  kelondroAbstractRecords, kelondroRecords, kelondroCachedRecords, kelondroHandle and kelondroNode
- better abstraction of kelondroNodes, such nodes may now be crated by different classes
- a new Node defining class kelondroEcoRecords defines Nodes that do not need so much allocation and System.arraycopy
- there is less memory transfer on the bus, especially for collection index
- now half of memory needed for web index access


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4024 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-08-03 11:44:58 +00:00
orbiter
e76fe1c078 - replaced unicode characters in copyright holder name ('Brausse')
- more logging for bootstrap seedlist loading
- larger DHT chunks

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4015 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-07-31 10:00:17 +00:00
orbiter
7ff4357184 fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=206&hilit=&p=1130#p1130
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4004 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-07-24 18:11:30 +00:00
orbiter
9ca46a8c69 indexing of local (intranet) urls enabled
To do this, one must create a separate YaCy network that has a local URL domain
A description how to do this is here: http://www.yacy-websuche.de/wiki/index.php/De:Netzdefinition

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4001 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-07-24 00:46:17 +00:00
orbiter
40b0547611 - documentaton changes (removed old forum links)
- different handling of link quotation
- different handling of link normalization
- enhanced html/unicode en/de-coding

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3993 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-07-19 15:32:10 +00:00