orbiter
e6044e5198
bugfix for
...
http://www.yacy-forum.de/viewtopic.php?p=27207#27207
and
http://www.yacy-forum.de/viewtopic.php?p=27219#27219
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2875 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-28 21:43:12 +00:00
orbiter
78b7f6f7fd
bugfix for index remove bug,
...
appeared after search where snippet-loading triggered word removal
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2869 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-28 00:22:10 +00:00
orbiter
147d88cf23
re-design of database caching
...
this should reduce IO a lot, because write caches are now activated for all databases
- added new caching class that combines a read- and write-cache.
- removed old read and write cache classes
- removed superfluous RAM index (can be replaced by kelonodroRowSet)
- adapted all current classes that used the old caching methods
- more asserts, more bugfixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2865 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-26 13:50:50 +00:00
orbiter
4e363108e1
- removed bad debug code that caused a large and unnecessary delay during global search
...
- fixed problem that global search results disappear after a search
- removed some stopwords
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2861 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-25 02:24:41 +00:00
orbiter
2a9d868f6d
- removed object cache from kelondroTree
...
- generalized object caching and added new object caching class
- added object caching wherever kelondroTree was used
- added object caching also to usage of kelondroFlex
- added object buffering (a write cache) to NURLs
- added many assert statements; fixed bugs here and there
- added missing close methods to latest added classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2858 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-24 13:48:16 +00:00
orbiter
3ffc5b8793
fixed problem with serverCharBuffer.append(char)
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2821 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-19 21:44:02 +00:00
orbiter
06854988da
- full integration of new LURL database in INDEX
...
- added migration method for urlHash.db into INDEX
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2819 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-19 21:14:37 +00:00
octoate
e4a3574b77
StringBuffer now resets every time the parser is called
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2817 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-19 16:58:45 +00:00
karlchenofhell
ce237aefad
- assortment-sizes table from PerformanceQueues_p.html is not shown if not used
...
- escape query- and fragment-part of an url as well
- new resolveBackpath for urls: http://www.yacy-forum.de/viewtopic.php?t=2679#24867
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2815 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-19 15:27:24 +00:00
theli
a5b9b514c1
*) retry crawling without content-encoding if the content-encoding header was not correct
...
See: http://www.yacy-forum.de/viewtopic.php?p=26917#26917
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2811 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-19 08:45:52 +00:00
theli
92f774edd1
*) Better charset encoding detection
...
*) New testclass for charset encoding detection tests
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2808 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-19 07:02:18 +00:00
orbiter
b79e06615d
- added new LURL.Entry class for next database migration
...
- refactoring of affected classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2802 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-18 22:25:07 +00:00
octoate
cc24dde5e0
First version of a MS Excel parser based on Apache POI
...
(event based parsing)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2801 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-18 19:13:37 +00:00
karlchenofhell
4c63129136
- stupid mistake...
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2798 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-18 15:14:38 +00:00
karlchenofhell
ebf0da2a45
- now the fix http://www.yacy-forum.de/viewtopic.php?t=2974 works
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2796 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-18 12:07:17 +00:00
theli
3d152bfe43
*) Logging message added
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2794 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-18 04:23:00 +00:00
karlchenofhell
b5e40e2fa2
- fix for http://www.yacy-forum.de/viewtopic.php?t=2974 (no cache-sizes for new db)
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2792 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-17 21:01:35 +00:00
orbiter
77a59a115d
refactoring of indexing methods
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2787 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-16 15:04:16 +00:00
theli
cbb1e710b9
*) removing old class
...
- was replaced by plasma/urlPattern/defaultURLPattern
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2765 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-13 13:03:32 +00:00
orbiter
c6d46f7ebd
null pointer bugfix
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2761 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-13 08:03:11 +00:00
theli
decb09df6d
*) Trying to be more tolerant against wrong charset names
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2760 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-13 05:30:20 +00:00
theli
e9afe39cbb
*) Trying to be more tolerant against wrong charset names
...
See: http://www.yacy-forum.de/viewtopic.php?p=26662
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2759 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-13 05:08:56 +00:00
theli
7526c831a8
*) Suppressing stack trace
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2758 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-13 04:34:49 +00:00
orbiter
50f2578c55
- some bugfixing and code cleanup
...
- assortments can now be left out completely if they do not exist
before startup and the collection index is selected.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2757 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-13 01:19:26 +00:00
orbiter
bdf4c7c51e
added missing files for last commit
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2756 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-12 23:17:16 +00:00
orbiter
a5dd0d41af
- refactoring of plasmaCrawlLURL.Entry to prepare new Entry format
...
- added test migration method to migrate the old LURL to a new LURL
the new LURL will be split into different tables for each month
this solves several problems:
- the biggest table in YaCy is split into different parts and can
also be managed in filesystems that are limited to 2GB
- the oldest entries can easily be identified, used for re-crawl and
deleted
- The complete database can be limited to a specific size (as requested many times)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2755 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-12 23:14:41 +00:00
octoate
1c4076da8a
First version of the MS Powerpoint parser based on Apache POI
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2753 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-12 17:28:53 +00:00
theli
5b75d64d7d
*) bugfix for last commit
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2750 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-12 09:39:25 +00:00
theli
71ed104bc7
*) adding additional rpm mimetype (used by packman)
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2749 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-12 09:32:24 +00:00
orbiter
6396f5971e
bugfixes and migration attempt toward new kelondroFlex db
...
- more synchronization
- bugfix for remove in collections
- bugfix in kelondroFlex (wrong exception condition!)
- options to use RAM, FLEX and TREE tables for Crawl URL stacker
- default for Crawl URL stacker is now FLEX (!)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2746 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-11 00:46:45 +00:00
hermens
48f81acc0e
revert SVN 2744, it is not needed
...
(this resulted from a small misunderstanding of the newest cache layout)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2745 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-10 22:02:23 +00:00
hermens
1da9aece12
Repair DNS prefetch during cacheScan
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2744 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-10 21:34:27 +00:00
theli
22649408ad
*) Better error handling for charset encoding problem during content parsing
...
See: http://www.yacy-forum.de/viewtopic.php?t=2952
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2737 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-10 10:14:03 +00:00
theli
a9c7e3f061
*) Bugfix for NoSuchElementException
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2735 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-10 08:39:27 +00:00
orbiter
c8f3a7d363
added snippet-url re-indexing
...
- snippets will generate an entry in responseHeader.db
- there is now another default profile for snippet loading
- pages from snippet-loading will be indexed, indexing depth = 0
- better organization of default profiles
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2733 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-09 23:07:10 +00:00
low012
2cfd4633ac
*) even better handling of searchwords in snippets, words can consist of letters and numbers now
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2732 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-09 21:08:13 +00:00
orbiter
e17fea7015
files in htcache are now stored in different hash/tree subdirectories
...
according to storage method
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2730 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-09 18:18:49 +00:00
low012
2d3b7251a4
*) better handling of searchwords in snippets (see http://www.yacy-forum.de/viewtopic.php?t=2891 for details)
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2728 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-09 13:37:38 +00:00
orbiter
25ae3d3161
generalized definition of hexhash
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2725 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-09 10:07:07 +00:00
orbiter
f0d747c723
removed deprecated method
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2723 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-09 02:47:37 +00:00
orbiter
5ff77612ac
bugfix for old WORDS storage method
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2722 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-09 02:20:27 +00:00
orbiter
0f10bdde22
more generic cache methods
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2721 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-09 02:13:13 +00:00
hermens
6557112d8f
small fix for plasmaURLPool.getURL() needed for new alternative htcache layout
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2719 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-08 17:32:01 +00:00
hermens
440c6ee657
Implement alternative htcache layout
...
mostly according to: http://www.yacy-forum.de/viewtopic.php?p=26205#26205
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2718 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-08 17:25:19 +00:00
orbiter
fd61209797
lines inside tags without punctuation are extended by a single dot.
...
This enables the condenser to distinguish the lines in a better way.
The result is a better preparation of snippets.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2715 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-08 01:24:00 +00:00
orbiter
1969522dc1
removed lowercase of snippets (and other things):
...
- added new sentence parser to condenser
- sentence parsing can now handle charsets
to do: charsets must be handed over to new sentence parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2712 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-07 00:06:09 +00:00
orbiter
43614f1b36
bugfix in collection index. the index for collections was not created correctly
...
The bugfix includes a migration function which starts automatically
after startup of yacy.
This applies to you only if you are using the new collection index.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2711 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-05 23:47:08 +00:00
orbiter
db294687ea
enhanced logging
...
- more logging output
- fix in log line preparation
- added filter to log page
- some small bugfixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2707 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 22:55:59 +00:00
theli
a9a0f51303
*) suppressing InterruptedException error message
...
See: http://www.yacy-forum.de/viewtopic.php?t=2915
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2705 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 15:40:18 +00:00
theli
1d4fb680ce
*) CrawlWorker.java: only keep content in memory if its size is less than or equal to 5MB
...
TODO: make this limit configurable
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@2703 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-10-03 12:16:25 +00:00