Commit Graph

3310 Commits

Author SHA1 Message Date
f1ori
90e78b2cf6 * improve encoding detection of http service
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5337 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-12 21:06:32 +00:00
orbiter
ef66438662 - more space in error db to store larger error messages
- added hash to HTCACHE storage files which will make it possible to join separate caches by just copying files

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5329 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-11 21:42:12 +00:00
orbiter
674ad2d55b different handling of error cases that occur during loading files with http or ftp:
methods throw exception instead of returning an error string

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5328 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-11 21:33:40 +00:00
danielr
538359a0ff simple fix to get DHT working again (maybe something more has to be done ;)
fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=1578

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5327 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-11 18:55:16 +00:00
f1ori
7e1fe05e3c * added utf8-encoding to many getBytes-calls
* utf8 should work now

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5323 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-08 20:24:31 +00:00
lotus
fad044fb54 update to snippet marker:
- do not display indexed html (solves xss issues)
the single words are analyzed for already marked parts. this is needed to avoid false encoding of the marker (<b>) tags.
- improved speed for existing routine
heavily used regex patterns are now precompiled

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5322 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-08 10:08:53 +00:00
lotus
16723d0fa6 ask another peer if crawljob loading fails
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5321 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-06 14:14:34 +00:00
orbiter
1b18d4bcf3 enhancement to crawling and remote crawling:
- for redirector and remote crawling, place the crawling url on the notice queue instead of enqueueing it directly in the crawler queue
- when a request to a remote crawl provider fails, remove the peer from the network to prevent the url fetcher from getting stuck again

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5320 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-06 12:30:55 +00:00
orbiter
3f746be5d4 - consolidation and refactoring of many DHT target-computing methods
- implemented vertical DHT acceptance ("my own DHT") to accept new targets
- added new target computation for global search: addresses vertical targets also
- enhanced remote crawling: collection of remote crawl urls if queue has fewer than 100 entries (was: 0 entries)
- better performance value computations for PPM selection in network configuration

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5319 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-06 10:07:53 +00:00
orbiter
d014b2728a Design-check, Extension and Refactoring of DHT target position computation:
- two different (but mathematically equivalent) computations of the DHT distance have been consolidated
- moved from 0.0 .. 1.0 double-range position computation to 0 .. Long.Max range for DHT targets
- added fast Long-to-hash computation
- high-precision target computation of gaps for new peers
- added new target computation for horizontal and vertical DHT targets (not yet in use)
- old horizontal-only DHT targets will be upwards compatible to new horizontal and vertical DHT positions

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5318 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-03 00:27:23 +00:00
orbiter
dd27ce7216 added control logic to ECO tables that deletes ram copies of the tables if they get too large
table copies in ram are now abandoned if less than 20 MB ram is left

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5317 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-02 23:53:09 +00:00
orbiter
38e6ba5d00 forgot to re-rename commonsPath
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5316 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-02 23:39:02 +00:00
orbiter
22989d0d8a added property index.storeCommons to switch commons storage on or off
with index.storeCommons=false all currently stored commons are deleted!
Default is now 'true', but in future full releases it will be switched to 'false'

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5315 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-02 23:30:09 +00:00
f1ori
4b4ce75396 * http-server: submit charset from html metatags
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5314 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-01 23:17:51 +00:00
f1ori
69e695bd4b * detect charset for directory index
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5313 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-01 22:14:31 +00:00
f1ori
340ecd919d * include non-ASCII characters in visible characters
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5312 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-01 21:13:57 +00:00
lotus
5cf0cbb47e javadoc
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5311 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-01 08:56:58 +00:00
lotus
8d07607d1d update to resource observer:
- returns high/medium/low disk space
- pauses crawling on medium disk space
- disables index receive on low disk space

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5310 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-31 11:33:17 +00:00
f1ori
d0543a7c39 * fix the debug ant-target
* fix yacy-subdomain handling (http://forum.yacy-websuche.de/viewtopic.php?f=6&t=1556)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5307 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-27 22:16:56 +00:00
low012
baae3d91b1 *) fixed warning when compiling listManager
*) fixed display of the information showing which parts of YaCy (crawler, proxy, ...) a blacklist is activated for
*) replaced regular put() with putXML() in several cases

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5305 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-27 16:56:19 +00:00
danielr
a4fb76e93c undo r5300 (not fixed as seen after longer run)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5303 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-25 23:20:09 +00:00
low012
a99a629ed4 *) quick fix to prevent comments for blog entries which don't exist (http://forum.yacy-websuche.de/viewtopic.php?f=5&t=1554)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5302 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-25 12:04:10 +00:00
low012
00e27e5050 *) fixed bug which made it possible to write files outside of the DATA/LIST directory when creating a new blacklist
*) a blacklist will only be created if no blacklist with the same name exists (some refactoring was necessary for this)
*) further minor fixes
*) to be continued...

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5301 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-25 00:11:03 +00:00
danielr
0f9c0bd0d5 fix for ConcurrentModificationException at de.anomic.index.indexContainerHeap$heapCacheIterator.next(indexContainerHeap.java:324)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5300 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-24 14:00:41 +00:00
danielr
103ad2a437 some javadoc
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5299 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-24 13:58:26 +00:00
orbiter
b098522977 some very small advances to index utf-8 (not working yet), inserted also debugging code
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5298 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-22 22:04:13 +00:00
orbiter
2f49666908 integrated the character decoding into the parser, removed old code
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5297 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-22 20:56:13 +00:00
orbiter
49293c1358 fix for deadlock in new encoder :-(
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5296 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-22 19:36:34 +00:00
orbiter
0edec2b760 FULL redesign of algorithms in htmlTools to encode/decode strings from/to unicode and html.
The old process used an inefficient way to detect html encoding strings in texts.
All calling methods have been adapted to call the new class in an enhanced way with fewer parameters.

Many classes in interfaces used an XML encoding only (instead of full html conversion from unicode to html); this behavior was not changed with this commit but should be checked again since it points to possible XSS leaks

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5295 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-22 18:59:04 +00:00
orbiter
958ec20cd0 removed specialized umlaut handling in html parser. This has to be replaced by something that is able to convert all possible html encodings into utf-8. Please see SVN 5293 for test cases.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5294 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-22 11:11:55 +00:00
f1ori
2e53cbc66a should compile now
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5292 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-22 08:50:30 +00:00
f1ori
f3bf2e379e should compile again
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5291 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-22 07:35:49 +00:00
f1ori
dd8441f102 fix bug: data from plasmaParser is already converted to UTF-8
After removing the restrictions in the code, YaCy should be able to index Unicode characters!

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5290 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-21 20:19:10 +00:00
orbiter
6941bf42b1 performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5288 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-20 14:07:09 +00:00
orbiter
9b0c4b1063 redesign of parts of the new BLOB buffer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5287 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-19 22:30:44 +00:00
orbiter
1778fb420d - added some performance tweaks to the new BLOB buffer
- removed the now superfluous HT storage thread
- reduced the number of file decompressions by deferring compression to a later moment

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5286 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-19 18:10:42 +00:00
orbiter
9663e61449 added another class to handle BLOB writings to the new HTCACHE data storage:
- entries are buffered and written as stream with many entries at once (saves many IO accesses)
- entries are compressed with gzip: increases capacity of cache
- concurrency for stream-writing and compression: all writings to the cache are non-blocking

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5284 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-18 08:57:48 +00:00
orbiter
382226da94 fix for bug introduced in SVN 5281: parameters were switched
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5282 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-17 12:14:57 +00:00
danielr
f2fd043797 refactoring (moved duplicate code into methods)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5281 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-17 10:39:32 +00:00
danielr
c612046e5e r5278 java 1.5 compatible
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5280 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-17 09:59:59 +00:00
f1ori
af71ec93bf oops, forgot to import something
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5279 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-16 22:44:25 +00:00
f1ori
9e65e9141c * always use UTF-8 for encoding hashes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5278 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-16 22:35:27 +00:00
orbiter
826ca79735 refactoring and new architecture to store the files of the web cache:
- cache entries are no longer stored as individual files
- a new database structure using BLOBHeap files stores many cache entries in common files
- all file-writing procedures have been migrated to generate byte[] objects which are written with the new database methods

this is only an intermediate step to the final architecture, where cached files are written together with their metadata in one single database structure.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5276 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-16 21:24:09 +00:00
danielr
f095137238 - respecting httpdMaxBusySessions (refusing new connections if limit is hit)
- comments in serverBusyThread converted to JavaDoc
- better debug output for npe-case in diskUsage

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5274 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-16 10:53:32 +00:00
orbiter
8ba33f104e fix for npe
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5269 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-13 21:59:53 +00:00
orbiter
998861acfd - some refactoring in BLOBHeap to enable more gap processing functions
- better gap merging in BLOBHeap
- shrinking of heap file if gap is at end of file when file is closed

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5268 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-13 21:15:54 +00:00
lotus
9d50bfd0b3 fix for npe: http://forum.yacy-websuche.de/viewtopic.php?p=10562
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5267 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-13 09:09:53 +00:00
orbiter
766cad6e93 enhancement in memory management of BLOB Heap files / merging of deleted entries
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5266 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-12 22:15:01 +00:00
orbiter
7860d5d632 fix for bug in seed list management (cause was bad class overloading, only visual effects!)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5265 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-12 19:51:53 +00:00
orbiter
ffed5fc415 fixed problem with lost peers in database
migrated seedDB from BLOBTree to BLOBHeap

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5263 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-10 14:40:02 +00:00