Commit Graph

2647 Commits

Author SHA1 Message Date
low012
383dc815d2 *) fix for commit 4212
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4217 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-14 19:14:53 +00:00
orbiter
3491531cea - fixed 'appears in url' flag in index generation
- extended index administration page, shows some properties to the web links now

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4216 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-14 01:15:28 +00:00
orbiter
ec7ba0d3d0 - fixed problem with too small sort fields (sortbound was not set)
- slightly changed handling of date in indexURLEntry

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4214 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-13 02:24:10 +00:00
orbiter
bc2368e907 fix for problem with remote crawl referrers
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4210 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-12 16:32:50 +00:00
orbiter
875096552f fix for NPE in case that remote search results are empty
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4209 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-12 15:54:50 +00:00
orbiter
64b3b79e44 - fix for termination problem with uniq()
- addition to seed dna interpretation

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4208 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-12 14:39:30 +00:00
orbiter
0abf33ed03 - tried to remove deadlock
- enhanced searchtime in kelondroRowSets
- enhanced uniq() - reverse enumeration causes less time in case of mass removal of doubles

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4207 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-12 01:14:51 +00:00
low012
a4010f7dc8 *) fixed bug where dots were added after numbers < 1000: "123" was transformed to "123." which is undesirable
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4206 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-11 21:42:50 +00:00
orbiter
2421127612 fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=513&hilit=
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4204 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-11 01:32:54 +00:00
orbiter
d0d2771883 disabled multiprocessoring of rowCollection.sort for testing purpose
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4202 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-11 00:28:22 +00:00
orbiter
edc4da5317 fix for division by zero in test reoutine
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4201 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-10 08:57:00 +00:00
orbiter
df38aaf7bd update to RowCollection sort speed-enhancements:
- better handling of small collections (less overhead)
- usage of pre-sorted limits
- different re-sort limit
- more testing procedures

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4200 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-09 15:34:11 +00:00
orbiter
0eb60cfe6f better handling of seed properties
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4199 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-09 09:40:42 +00:00
orbiter
ecba35de72 enhanced computing speed of kelondro core function: sorting
the enhancement was made by using better organized data structures and
multi-threading during the sort. A sort can be divided into two separate
processes when the first partition of the quicksort algorithm was done.
Generating a separate thread and starting the thread takes only 10 milliseconds,
so using a separate thread makes only sense if the data amount is large.
statistics about the speed-up:
without ehancement: 250 milliseconds for 100000 entries
with data structure enhancement: 170 milliseconds for 100000 entries
with additional second thread (if second processor is present): 130 milliseconds.

For dual-processor systems, this means about 100% speed-up
a test can be made with the following command:
java -classpath classes de.anomic.kelondro.kelondroRowCollection


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4198 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-09 00:51:38 +00:00
orbiter
6eaa5a0e64 enhanced local search speed. The ranking process is now 6 times faster that before.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4197 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-07 22:38:09 +00:00
fuchsi
425e4ead66 Allow absolute paths in configuration settings.
- before absolute paths would be expanded incorrectly, e.g.: fooPath=/a/b/c would become /path/to/yacy/root/a/b/c. Now you can put nearly every dynamically generated data with a configurable path to a location outside of yacys root dir without having to use symlinks (probably good for third party distribution packaging).
- abstractServerSwitch.getConfigPath(setting, default) returns a File instance, either with an absolute path or relative to the applications root path.

- exceptions (hardcoded): 
  DATA/LOG/yacy.logging
  DATA/SETTINGS/httpProxy.conf
  DATA/SETTINGS/user.db
TODO: all of these are the global configuration files and they should probably be put into _one_ command line configurable settings path, so it would be possible to package them in /etc/ for example.

- add missing workPath to yacy.init (it was used in code, but there was no default in the file)
- fix broken skinPath (was skinsPath in yacy.init but skinsPath in the code) + a few other broken config reading caused by typos.
- replaced path setting names and their default values with the related static fields in plasmaSwitchboard where not already done/existing

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4196 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-04 10:36:25 +00:00
borg-0300
e8d32d9f62 other loglevel
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4195 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-02 16:06:54 +00:00
borg-0300
a5d28785b1 less OOM (works for me)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4194 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-02 14:55:46 +00:00
orbiter
ccbfb15b6b enhancement to crawl stacker enqueue order
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4192 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-11-01 00:57:32 +00:00
hermens
5c5344ae97 Beautify log
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4190 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-31 16:29:07 +00:00
hermens
35cf196204 transferRanking(): Do not flush more ranking files than requested by caller.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4189 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-31 15:55:52 +00:00
hermens
d0aa8cf25d Only update handshaked peer's last seed date if it has not been updated recently.
Unil now the newer data was overwritten by old data from before the handshake.


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4188 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-31 15:47:48 +00:00
hermens
8f9d65da67 Small corrections to dhtFlushControl()
- Test wCacheMaxChunk against maxURLinCache(), not getMaxWordCount(). This triggered a flush everytime dhtFlushControl() was called.
- If triggered, flush at least 1 entry.


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4187 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-31 14:21:58 +00:00
orbiter
55c87b3b12 changed behavior of crawl stacker
- final flush only when tabletype = RAM
- prestacker (dns prefetch) only if tabletype = RAM and busytime <= 100
- number of maximun entries in stacker is configurable in yacy.init (stacker.slots)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4186 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-31 11:32:40 +00:00
hermens
18144043e6 Correct UTC Offset at beginning/end of daylight savings time
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4185 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-30 19:20:02 +00:00
orbiter
4fefa53135 removed parser object pool, see also svn 4106
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4184 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-29 12:14:18 +00:00
orbiter
a31b9097a4 preparations for mass remote crawls:
two main changes must be implemented to enable mass remote crawls:
- shift control of robots.txt to crawl queue (away from stacker). This is necessary since remote
  crawls can contain unchecked urls. Each peer must check the robots to prevent that it is misused
  as crawl agent for unwanted file retrieval
- implement new index files that control double-check of remotely crawled urls

After removal of robots.txt checking from stacker threads, the multi-threading of this process is void.
Multithreading has been removed. Also the thread pools for the crawl threads had been removed, since
creation of these threads is not resource-consuming, for a detailed explanation see svn 4106

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4181 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-29 01:43:20 +00:00
fuchsi
a718858e8b seed.CCOUNT is interpreted as a double value not int
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4180 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-24 23:25:48 +00:00
fuchsi
0e1738899f * Complete number localization and provide a more reasonable interface to serverObjects:
- put(key, value) methods are now used if a value added to the map should be kept as it is. Numbers are transformed (but not formatted) to an equivalent String representation.
- putASIS(...) have been removed, now done with simple put(...) (see above).
- puNum(...) can be used for number values which should be stored in a formatted way, either depending on the current locale setting for yacy (default) or in a "none" locale (see javadocs and setLocalize()).
- putHTML(...) escapes special characters into corresponding HTML enities ('<' => '&lt;') which was done with put(...) before and so was called too often, becauses it is necessary only for very few cases. Additionally there is a "forXML" mode which only replaces < > & ".
In short: Use put(...) for almost everything, use putXY(...) if you need some special transformation of the value.
A few bugs have been fixed as well, and there should be a small performance improvement for complex pages with a lot of values.

* added additional Sum/Avg rows to access tracker pages, see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=456
* removed duplicate code (mostly related to the big changes above).

TODO:
- make sure, number formats work as expected _everywhere_, report overseen stuff http://forum.yacy-websuche.de/viewtopic.php?f=5&t=437
- probably a good idea to add special putDate() methods as they are used in many pages and create duplicated formatting code + maybe some centralized handling for memory value formatting.
- further improve the speed of page creation for the WatchCrawler.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4178 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-24 21:38:19 +00:00
orbiter
f8318436a1 fix for last commit
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4177 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-22 16:32:39 +00:00
orbiter
7d57b80598 distinct keepOrder strategy, more discrete implementation of enhancement introduced in SVN 4158
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4176 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-22 15:26:47 +00:00
orbiter
9a7b093eed tried to avoid endless loop, see also:
http://forum.yacy-websuche.de/viewtopic.php?f=6&t=467&hilit=

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4175 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-22 14:35:45 +00:00
orbiter
b856e377a9 some additions and a small bugfix to SVN 4158
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4173 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-21 23:26:22 +00:00
hermens
501a7aae90 Small correction
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4172 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-20 12:02:31 +00:00
hermens
caff520988 Removed unnecessary and unused code.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4171 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-20 11:56:15 +00:00
hermens
d732840f8a Avoid ConcurrentModificationException when accessing the PerformanceQueues page while yacy is indexing.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4170 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-19 23:36:40 +00:00
fuchsi
35303f9504 add real size values (KBytes) of the DHT-In/Out-RAM-Caches to the PerformanceQueues page. A lot of users seem to tweak this value and it might help in finding the best size in relation to the peer's memory ressources.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4169 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-19 21:47:07 +00:00
fuchsi
38bbd4a4b3 no code changes. just touched yacyClient.java to trigger a rebuild of the file in an uncleaned tree.
NOTE: run "ant clean" before building SVN 4166/4167 in a tree that includes class files from a previous build to make sure, that every class file is rebuilt!

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4167 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-19 15:31:38 +00:00
fuchsi
f717beecb1 - Changed yFormatter handling to be more flexible and produce more readable code for server pages. There are serverObject.putNum() methods to allow adding of number type values in a formatted form, and put() methods for number types that add them without formatting. This reduces the need to transform them into Strings in server pages and removes the HTML encoding step which is unecessary for numbers.
- some minor code cleanups (mostly unnecessary casts, null checks)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4166 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-19 04:13:46 +00:00
fuchsi
ca83f5a8d9 Add external lib FontBox which is part of the PDFBox (they extracted the font handling code into this package in 0.7.3).
Add the packages to the eclipse .classpath.
Closes: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=453

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4165 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-18 19:53:52 +00:00
fuchsi
3352474dd8 Remove grouping separator in Network.xml (yacystats will woork without it) and format a few more numbers.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4163 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-16 13:29:11 +00:00
fuchsi
06e6a1ff62 Add a generalized Formatter class yFormatter inspired by http://forum.yacy-websuche.de/viewtopic.php?f=5&t=437
At the current state it allows formatting of numbers (integer + decimal types) for output according to the Locale derived from the language setting in yacy. Network.(html|xml) and Status.html have been changed to use it for now (TODO: should be integrated into other servlets as well to reduce duplicate formatting code).
NOTE: For now the output format for Network.xml simulates the old behaviour which is wrong (it uses '.' as decimal and grouping separator), to make sure external scripts like the yacystats.de one won't break with this update.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4162 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-16 02:12:31 +00:00
fuchsi
e77aec8c9d fix handling of encrypted PDF-Documents (with default user password "")
- update PDFBox package to current version 0.7.3
- use new security model in PDFBox to "guess" wether we can decrypt a document or not
NOTE: When upgrading to this version make sure the old PDFBox-0.7.2.jar is removed from libx/

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4161 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-15 13:18:38 +00:00
fuchsi
b5f7df8d0a Speed up remove operations in rowCollections.
- Array element shifting during remove is only done when it is necessary to keep the order of a row collection.
- This will speed up the most expensive operation "common word shrinking" by a factor of 500-1000 (in the worst cases we shifted > 60 GB of data during this operation)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4158 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-11 17:17:08 +00:00
low012
fdb0b861f8 *) fixed wrong calculation of network words, network links, network PPM if peer is senior or principal peer
*) added network QPH
*) banner is cached for 1 second to avoid DOS
*) still no logo


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4154 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-09 21:47:37 +00:00
fuchsi
508de558f7 sbStackCrawlThread is null during first cleanProfiles() run at startup.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4152 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-08 15:56:40 +00:00
fuchsi
70614385ef Attempt to fix the "lost profile handle" bug.
It seems improbable, but it might happen, that during a crawl all queues (indexing, crawling, ...) except the crawl URL stacker ran empty. This commit adds an additional check for an empty crawl stacker queue before executing the profile cleaner.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4151 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-08 15:11:26 +00:00
low012
507ecd8afa *) added banner that can be displayed like this: http://localhost:8080/Banner.png
possible arguments: textcolor, bgcolor, bordercolor
   example: http://localhost:8000/Banner.png?textcolor=ffffff&bgcolor=121212&bordercolor=ffffff
   take care: YaCy uses CMY color model!
*) there are still some known bugs, but I can't continue coding right now


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4149 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-07 21:59:36 +00:00
fuchsi
9b0948cb4c gnarf. mixed up the positions. finally fixed...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4143 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-04 10:58:01 +00:00
fuchsi
c0f5fc51ef bugfix for last commit
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4142 6c8d7289-2bf4-0310-a012-ef5d649a1542
2007-10-04 10:47:48 +00:00