Commit Graph

474 Commits

Author SHA1 Message Date
orbiter
d1d9fbae5c enabled URLAnalysis to operate on multiple input files; just use a wildcard when calling the class from the command line
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5658 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-26 23:47:41 +00:00
orbiter
7ea53fe47b added another url list transformation option:
- check the list and drop lines that do not contain valid urls
- normalize the urls
- remove duplicates
- sort the list
- split the list into smaller chunks
This is all done in one process, which can be called with a new -sort option

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5655 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-26 21:51:23 +00:00
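Note on 7ea53fe47b: the steps listed for the -sort option can be sketched roughly as below. This is an illustration only, not the URLAnalysis source; the class and method names (UrlListSortSketch, normalize, process), the restriction to http/https, and the in-memory chunking are all assumptions made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical sketch of "validate, normalize, de-duplicate, sort, split".
final class UrlListSortSketch {

    // Returns a normalized form of the line, or null if it is not a usable http(s) URL.
    static String normalize(String line) {
        try {
            URI u = new URI(line.trim()).normalize(); // also resolves "." and ".." segments
            if (u.getScheme() == null || u.getHost() == null) return null;
            String scheme = u.getScheme().toLowerCase(Locale.ROOT);
            if (!scheme.equals("http") && !scheme.equals("https")) return null;
            String rawPath = u.getRawPath();
            String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;
            String query = u.getRawQuery() == null ? "" : "?" + u.getRawQuery();
            String port = u.getPort() >= 0 ? ":" + u.getPort() : "";
            return scheme + "://" + u.getHost().toLowerCase(Locale.ROOT) + port + path + query;
        } catch (URISyntaxException e) {
            return null; // invalid line, dropped by the caller
        }
    }

    // Drops invalid lines, removes duplicates, sorts, and splits into chunks.
    static List<List<String>> process(List<String> lines, int chunkSize) {
        TreeSet<String> unique = new TreeSet<>(); // sorted and duplicate-free
        for (String line : lines) {
            String n = normalize(line);
            if (n != null) unique.add(n);
        }
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String url : unique) {
            current.add(url);
            if (current.size() == chunkSize) {
                chunks.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) chunks.add(current);
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "http://b.example/", "http://a.example/", "http://a.example", "garbage line");
        System.out.println(process(lines, 1));
    }
}
```

A real run over large exports would stream to files rather than hold everything in memory; the sketch only shows the order of the steps.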
orbiter
54625360f7 performance update
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5653 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-25 23:27:21 +00:00
orbiter
d884c4718a added gzip support for URLAnalysis:
url lists can also be compressed with gzip.
If such a file is handed over to URLAnalysis, the output will also be written as a .gz file

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5652 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-25 13:40:51 +00:00
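Note on d884c4718a: reading and writing a line-based list with optional gzip compression might look like the sketch below. It uses only java.util.zip from the JDK; the class name GzipListSketch, the in-memory byte arrays, and the rule "compressed input implies compressed output" expressed through a boolean flag are assumptions for illustration, not the actual URLAnalysis code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: the same reader/writer code path, with gzip wrapped in on demand.
final class GzipListSketch {

    // A file name ending in .gz is treated as compressed (assumed convention).
    static boolean isGzipName(String fileName) {
        return fileName.endsWith(".gz");
    }

    static byte[] write(List<String> lines, boolean gzip) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            OutputStream os = gzip ? new GZIPOutputStream(bytes) : bytes;
            try (Writer w = new OutputStreamWriter(os, StandardCharsets.UTF_8)) {
                for (String line : lines) {
                    w.write(line);
                    w.write('\n');
                }
            } // closing the writer finishes the gzip stream
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<String> read(byte[] data, boolean gzip) {
        try {
            InputStream is = new ByteArrayInputStream(data);
            if (gzip) is = new GZIPInputStream(is);
            List<String> lines = new ArrayList<>();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))) {
                for (String line; (line = r.readLine()) != null; ) lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean gzip = isGzipName("20090224213823.txt.gz"); // example name only
        byte[] stored = write(Arrays.asList("http://a.example/", "http://b.example/"), gzip);
        System.out.println(read(stored, gzip));
    }
}
```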
orbiter
cf9b74e6e3 added another method to process url lists: extract hosts only
This can be used like
java -Xmx2000m -cp classes de.anomic.data.URLAnalysis -host DATA/EXPORT/20090224213823.txt

also changed the call used to generate statistics; please now use
java -Xmx2000m -cp classes de.anomic.data.URLAnalysis -stat DATA/EXPORT/20090224213823.txt


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5650 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-24 22:51:07 +00:00
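Note on cf9b74e6e3: the "extract hosts only" mode could be approximated as below. This is a sketch under assumptions, not the code behind the -host option; the class name HostExtractSketch and the choice to return a sorted, duplicate-free set are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: collect the distinct host names from a list of URLs.
final class HostExtractSketch {

    static SortedSet<String> hosts(List<String> urls) {
        SortedSet<String> hosts = new TreeSet<>();
        for (String url : urls) {
            try {
                String host = new URI(url.trim()).getHost();
                if (host != null) hosts.add(host.toLowerCase(Locale.ROOT));
            } catch (URISyntaxException e) {
                // lines that are not valid URLs are skipped
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        System.out.println(hosts(Arrays.asList(
            "http://www.example.org/a", "http://www.example.org/b", "http://Other.example.net/")));
    }
}
```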
orbiter
89d8e824ed memory protection for URLAnalysis
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5649 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-24 22:05:09 +00:00
orbiter
0f6fa804ff performance update to URLAnalysis
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5648 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-24 21:35:33 +00:00
orbiter
e8f5f2f612 added tool to analyse url strings
and to generate statistics about words occurring in urls

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5646 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-24 10:00:35 +00:00
orbiter
c12bb8a6d0 - refactoring of the http client
- added a protection against memory leaks for the access tracker

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5621 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-19 16:24:46 +00:00
orbiter
411f2212f2 more memory leak fixing hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5599 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-11 13:31:10 +00:00
orbiter
333489420b - fix for NPE when loading the cytag image
- some hacks for less memory usage:
-- less usage of buffer and cache memory in EcoFS
-- buffer allocation on-demand in BufferedIOChunks
-- removed largest ybr idx

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5595 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-11 10:52:56 +00:00
orbiter
c25c334b75 replaced old DHT transmission method with new method. Many things have changed! Some of them:
- after an index selection is made, the index is split into its vertical components
- from different index selections the split components can be accumulated before they are placed into the transmission queue
- each split chunk gets its own transmission thread
- multiple transmission threads are started concurrently
- the process can be monitored with the blocking queue servlet
To implement that, a new package de.anomic.yacy.dht was created. Some old files have been removed.
The new index distribution model using a vertical DHT was implemented. An abstraction of this model
is implemented in the new dht package as an interface. The freeworld network now has a configuration
of two vertical partitions; sixteen partitions are planned and will be configured if the process is bug-free.
This modification has three main targets:
- enhance the DHT transmission speed
- with a vertical DHT, a search will speed up: two times with two partitions, sixteen times with sixteen.
- the vertical DHT will apply a semi-DHT for URLs, and peers will receive a fraction of the overall URLs they received before:
  one half with two partitions, 1/16 of the previous number of URLs with sixteen partitions.
BE CAREFUL, THIS IS A MAJOR CODE CHANGE, POSSIBLY FULL OF BUGS AND HARMFUL THINGS.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5586 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-02-10 00:06:59 +00:00
orbiter
94110df85a moved logging partially to kelondro
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5545 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-31 01:06:56 +00:00
orbiter
024da2916b refactoring of logging
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5544 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 23:33:47 +00:00
orbiter
83ce65707a (almost) completed partition of classes in kelondro
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5543 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 22:44:20 +00:00
orbiter
7ee494fde5 more refactoring of kelondro:
- separated BLOB from table classes
- renamed 'coding' package to 'order'

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5542 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 22:08:08 +00:00
orbiter
bf93767ec6 refactoring of kelondro database classes
(to be continued)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5540 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 15:33:00 +00:00
orbiter
fc27bf8c4c refactoring of kelondro classes:
kelondro shall become independent from other packages.
moved bytebuffer, date and memory to kelondro

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5539 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-30 14:48:11 +00:00
apfelmaennchen
3484e55be4 - small fix for bookmarksDB
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5527 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-28 06:57:11 +00:00
apfelmaennchen
6dd52422ea - added two dialogs to manage bookmark tags in YaCy-UI
- fixed renameTag() in bookmarksDB
- added /api/bookmarks/tags/addTag.xml
- added /api/bookmarks/tags/editTag.xml

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5525 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-28 00:15:43 +00:00
apfelmaennchen
3dc208fad0 bugfix: bookmarks can now handle folder names like /news and /newspaper without getting confused...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5470 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-11 19:39:51 +00:00
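Note on 3dc208fad0: the commit does not say what caused the confusion between /news and /newspaper. One common cause is a plain string-prefix test, and the sketch below shows a separator-aware check under that assumption. The class and method names are hypothetical and this is not the bookmarksDB code.

```java
// Hypothetical sketch: folder membership that does not treat "/newspaper" as part of "/news".
final class FolderMatchSketch {

    // True if itemFolder is the folder itself or lies below it.
    static boolean isInFolder(String itemFolder, String folder) {
        if (folder.equals("/")) return itemFolder.startsWith("/");
        // A bare startsWith(folder) would also accept "/newspaper" for "/news".
        return itemFolder.equals(folder) || itemFolder.startsWith(folder + "/");
    }

    public static void main(String[] args) {
        System.out.println(isInFolder("/news/tech", "/news"));
        System.out.println(isInFolder("/newspaper", "/news"));
    }
}
```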
low012
f26b8fcb1b *) comment mode is 'moderated' instead of 'activated' by default now (to avoid spam being visible)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5465 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-01-10 12:58:35 +00:00
orbiter
e004da48d3 - added fast fingerprint computation for files (any). Will be used in new index dump method
- refactoring

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5415 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-12-29 12:22:13 +00:00
orbiter
7535fd7447 - refactoring of CrawlEntry and CrawlStacker
- introduced blocking queues in CrawlStacker to make it ready for concurrency
- added a second busy thread for the CrawlStacker
The CrawlStacker is multithreaded. It shall be transformed into a BlockingThread in another step.
The concurrency of the stacker will hopefully solve some problems with cases where DNS blocks.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5395 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-12-17 22:53:06 +00:00
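Note on 7535fd7447: the idea of feeding a stacker through a blocking queue that several worker threads drain can be sketched as follows. This is a generic java.util.concurrent illustration, not the CrawlStacker implementation; the class name, the stop marker, and the placeholder for the slow step (for example a DNS lookup) are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: several workers take entries from one blocking queue.
final class StackerQueueSketch {

    // Unique marker object; compared by identity so no real URL can collide with it.
    private static final String STOP = new String("<stop>");

    static List<String> stackAll(List<String> urls, int workers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> stacked = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    for (String url = queue.take(); url != STOP; url = queue.take()) {
                        // A slow, blocking step would go here; while one worker
                        // waits, the other workers keep draining the queue.
                        stacked.add(url);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }
        queue.addAll(urls);
        for (int i = 0; i < workers; i++) queue.add(STOP); // one marker per worker
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return stacked;
    }

    public static void main(String[] args) {
        System.out.println(stackAll(Arrays.asList("http://a.example/", "http://b.example/"), 2).size());
    }
}
```

The order in which entries are processed is not deterministic once more than one worker is running.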
lotus
18513e2ee2 npe fix: http://forum.yacy-websuche.de/viewtopic.php?t=1646
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5393 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-12-16 13:36:13 +00:00
orbiter
e1acdb952c fix for problem with userDB and bookmarksDB which was caused by changes in kelondroRA in SVN 5376
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5385 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-12-08 00:17:45 +00:00
orbiter
47292e696a more performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5379 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-12-04 12:54:16 +00:00
orbiter
d39d420b39 performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5376 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-12-03 15:38:29 +00:00
orbiter
0b4808ba3d added new interactive search feature:
- while the user types a search query, the local database is searched
- results are presented interactively

This was implemented using a new JSON result format for search results in YaCy
- added JSON as file format for servlets
- refactoring of current search servlets (xml and html)
- added JSON output format for search results
- added AJAX-based search page that uses the yacysearch.json servlet to print results as a query is typed

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5373 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-12-02 15:24:25 +00:00
low012
e423fa9846 *) added method to only get file names in directory listing which match a filter
*) only files which end with .black will be listed as blacklists
*) added a little bit of Javadoc

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5366 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-25 20:26:06 +00:00
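Note on e423fa9846: listing only the files whose names end in .black can be done with the JDK's java.io.FilenameFilter, as in this sketch. The class name and the helper method are hypothetical; only File.list(FilenameFilter) and the .black suffix come from the JDK and the commit message respectively.

```java
import java.io.File;
import java.io.FilenameFilter;

// Hypothetical sketch: directory listing restricted to *.black files.
final class BlacklistListingSketch {

    static final FilenameFilter BLACK_FILES = (dir, name) -> name.endsWith(".black");

    // Returns the matching file names, or an empty array if dir is not a readable directory.
    static String[] listBlacklists(File dir) {
        String[] names = dir.list(BLACK_FILES); // null when dir cannot be listed
        return names == null ? new String[0] : names;
    }

    public static void main(String[] args) {
        for (String name : listBlacklists(new File("DATA/LISTS"))) { // example path only
            System.out.println(name);
        }
    }
}
```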
orbiter
dba7ef5144 extended crawling constraints:
- removed never-used secondary crawl depth
- added a must-not-match filter that can be used to exclude urls from a crawl
- added stub for crawl tags which will be used to identify search results that had been produced from specific crawls
please update the yacybar: replace property name 'crawlFilter' with 'mustmatch'.
Additionally, a new parameter named 'mustnotmatch' can be used, which should by default be the empty string (match-never)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5342 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-14 09:58:56 +00:00
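Note on dba7ef5144: a must-match / must-not-match pair can be sketched with java.util.regex as below. It assumes full-match semantics, under which the empty pattern matches no non-empty URL, which fits the "match-never" default mentioned above. The class and method names are hypothetical and this is not the crawler's own filter code.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: a URL is crawled only if it matches 'mustmatch'
// and does not match 'mustnotmatch'.
final class CrawlFilterSketch {

    static boolean accept(String url, String mustmatch, String mustnotmatch) {
        // Pattern.matches requires the whole URL to match, so "" never matches a real URL.
        return Pattern.matches(mustmatch, url) && !Pattern.matches(mustnotmatch, url);
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.org/docs/a.html", ".*", ""));
        System.out.println(accept("http://example.org/a.pdf", ".*", ".*\\.pdf"));
    }
}
```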
f1ori
7e1fe05e3c * added utf8-encoding to many getBytes-calls
* utf8 should work now


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5323 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-11-08 20:24:31 +00:00
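Note on 7e1fe05e3c: the difference an explicit encoding makes can be shown with the sketch below. It uses StandardCharsets, which is only available since Java 7; on the Java versions of that period the equivalent call would have been getBytes("UTF-8"). The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: encode and decode with an explicit charset instead of
// relying on the platform default used by the no-argument getBytes().
final class Utf8BytesSketch {

    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "Stra\u00dfe"; // contains one non-ASCII character
        System.out.println(encode(word).length);
        System.out.println(decode(encode(word)).equals(word));
    }
}
```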
low012
baae3d91b1 *) fixed warning when compiling listManager
*) fixed display of the parts of YaCy (crawler, proxy, ...) for which a blacklist is activated
*) replaced regular put() with putXML() in several cases

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5305 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-27 16:56:19 +00:00
low012
a99a629ed4 *) quick fix to prevent comments for blog entries which don't exist (http://forum.yacy-websuche.de/viewtopic.php?f=5&t=1554)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5302 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-25 12:04:10 +00:00
low012
00e27e5050 *) fixed bug which made it possible to write files outside of the DATA/LIST directory when creating a new blacklist
*) a blacklist will only be created if no blacklist with same name exists (some refactoring has been necessary for this)
*) further minor fixes
*) to be continued...

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5301 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-25 00:11:03 +00:00
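Note on 00e27e5050: a check that keeps a user-supplied blacklist name inside its base directory might look like the sketch below. It uses java.nio.file, which postdates this commit, and is not the fix that was actually applied; the class and method names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: reject names such as "../../evil.black" that would
// resolve to a location outside the base directory.
final class SafeListPathSketch {

    static boolean isInside(Path baseDir, String fileName) {
        Path base = baseDir.toAbsolutePath().normalize();
        Path target = base.resolve(fileName).normalize(); // collapses ".." segments
        // Path.startsWith compares whole path elements, not characters.
        return target.startsWith(base) && !target.equals(base);
    }

    public static void main(String[] args) {
        Path base = Paths.get("DATA", "LIST");
        System.out.println(isInside(base, "my.black"));
        System.out.println(isInside(base, "../../evil.black"));
    }
}
```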
orbiter
0edec2b760 FULL redesign of algorithms in htmlTools to encode/decode strings from/to unicode and html.
The old process used an inefficient way to detect html encoding strings in texts.
All calling methods have been adapted to call the new class in an enhanced way with fewer parameters.

Many classes in interfaces used XML encoding only (instead of full html conversion from unicode to html); this behavior was not changed with this commit but should be reviewed again since it points to possible XSS leaks

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5295 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-22 18:59:04 +00:00
orbiter
6fb865fbdc - fix of bug in iterator in kelondroBLOBHeap which caused bug in crawl profile listing
- some refactoring of classes that use kelondroMap (Map instead of HashMap)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5262 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-10-10 08:39:11 +00:00
apfelmaennchen
b97ff24b43 bookmarksDB / xbel.xml:
- added support for folder=/foldername
- it crashes if foldername ends with /

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5207 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-09-24 21:16:13 +00:00
lotus
0a0cc3bf67 added missing classes to build target "run"
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5201 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-09-23 15:54:12 +00:00
lotus
a81cb78211 finally some putHTML on htroot/xml/
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5188 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-09-20 07:55:30 +00:00
apfelmaennchen
7b63c66a08 - bugfix in bookmarksDB.Tag.hasPublicItems()
- this annoying little bug prevented display of public items without admin login for /xml/bookmarks/...

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5151 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-09-14 18:45:08 +00:00
orbiter
05dbba4bab added logging conditions to all fine and finest log line calls
this avoids the overhead of generating log lines that are then not printed

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5102 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-09-03 00:30:21 +00:00
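Note on 05dbba4bab: the guard pattern described above can be sketched with java.util.logging as follows. The logger name, class name, and the counter used to make the saved work visible are assumptions for illustration; YaCy's own logging wrapper is not shown here.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: build a fine-level log line only when it would be logged.
final class GuardedLogSketch {

    static final Logger LOG = Logger.getLogger("sketch.guarded");
    static int linesBuilt = 0; // counts how often the costly message was assembled

    static String describe(Object entry) {
        linesBuilt++; // stands in for an expensive toString or concatenation
        return "entry " + entry;
    }

    static void process(Object entry) {
        // Without this check, describe(entry) would run even when FINE is disabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("processing " + describe(entry));
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO);
        process("a");
        System.out.println(linesBuilt);
    }
}
```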
apfelmaennchen
aa6ae77e5e - autoReCrawl: fix for filter settings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5088 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-26 21:51:05 +00:00
apfelmaennchen
8ae29bad57 - fix to previous change of Crawl Profile Names
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5087 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-26 20:42:29 +00:00
apfelmaennchen
434104e4a0 - change Crawl profile name for autoreCrawl
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5085 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-26 18:08:48 +00:00
lotus
0df2e47012 changed auto recrawl to comply with new date format
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5083 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-26 13:36:10 +00:00
orbiter
536e77e8b7 modifications towards a single database operation to read/write http header and cached file at once:
- removed distinction between header file types for http and ftp; ftp is simulated by using http properties
- removed all old resourceInfo classes that handled this distinction
- introduced a new distinction between http request and http response objects
- unified new response objects with two other object types that had been introduced elsewhere
- changed all servlet call methods to use the new http request header object type
- divided static object keys for http header properties into request and response types
- refactoring here and there (a large number of type changes and many methods merged/moved)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5079 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-25 18:11:47 +00:00
apfelmaennchen
bd931a82f7 - added dynamic filters to autoReCrawl.conf
- Restrict to sub-path: sub
- Restrict to start-domain: dom

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5070 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-22 18:05:05 +00:00
apfelmaennchen
b3fc5e96a3 - removed unused import from bookmarksDB
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5067 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-20 21:26:06 +00:00
apfelmaennchen
bc048db7b6 - bugfix for bookmarksDB's rebuildDates()
- dates are now saved as String.valueOf(TimeStamp)
- it might be a good idea to delete (backup) bookmarkDates.db and restart YaCy to rebuild it 

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5066 6c8d7289-2bf4-0310-a012-ef5d649a1542
2008-08-20 21:25:05 +00:00