Commit Graph

137 Commits

Author SHA1 Message Date
orbiter
e3964f2c31 better catch of network definition load error; continue with secondary network load definition location
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7270 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-25 09:20:45 +00:00
orbiter
e3e3b49d52 - enhanced main release recognition
- yacybot user agent now includes the yacy network name (not the peer name!)
- refactoring and clean-up (mostly turned tab into spaces)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7266 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-24 21:43:01 +00:00
orbiter
58e74282af added a word counter statistic in condenser which is used by the did-you-mean to calculate best matches for given search words.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7258 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-18 11:35:09 +00:00
orbiter
2c549ae341 fixed a number of small bugs:
- better crawl star for files paths and smb paths
- added time-out wrapper for dns resolving and reverse resolving to prevent blockings
- fixed intranet scanner result list check boxes
- prevented htcache usage in case of file and smb crawling (not necessary, documents are locally available)
- fixed rss feed loader
- fixes sitemap loader which had not been restricted to single files (crawl-depth must be zero)
- clearing of crawl result lists when a network switch was done
- higher maximum file size for crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7214 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-30 23:57:58 +00:00
orbiter
e63896f2a8 added an intranet scanner and a servlet which shows all intranet addresses and an option to start a site-crawl for all these addresses at once.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7203 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-28 12:18:54 +00:00
orbiter
d2fd93135c - moved yacybot user agent string definition to MultiProtocolURI since there are basic access mechanisms where the bot string is needed
- migrated the 'yacy' user agent to 'yacybot' in many client methods since the 'yacy' user agent is only used for the proxy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7199 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-27 14:54:32 +00:00
orbiter
a83186ac7d fix for bug in cytrails
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7192 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-26 10:32:40 +00:00
orbiter
10a9cb1971 simplified snippet computation process and separated the algorithm into two classes
also enhances selection criteria for best snippet line computation

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7182 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-22 20:50:02 +00:00
orbiter
461a2a6ec7 enhanced remote crawling:
- 300 ppm is default now (but this is switched off by default; if you switch it on you may want more traffic?)
- better timing for busy queue
- better amount of remote url retrieval
- better time-out values
- better tracking of availability of remote crawl urls
- more logging for result of receipt sending

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7159 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-16 09:34:17 +00:00
orbiter
7ebef56add - redesign of a part of the remote search client to make it possible to have a test environment for remote search performance tests
- added a remote search test main methods in yacyClient

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7146 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-14 13:35:47 +00:00
orbiter
3c0e07ba72 removed all delays in shutdown process
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7143 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-14 09:13:28 +00:00
orbiter
64860dc1bb enhanced search event logging (to be used for further improvements)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7140 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-13 09:33:04 +00:00
orbiter
4c21d8dc9d - changed default values for online caution (the pausing may not be necessary any more)
- fixed bug in WeakPriorityBlockingQueue
- show favicon faster using pre-loading (same technique as used for fast image search)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7130 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-09 23:25:19 +00:00
orbiter
114bdd8ba7 fixed old sitemap importer which was not able to parse urls containing post elements
- removed old parser
- removed old importer framework (was only used by removed old parser)
- added a new sitemap parser in parser framework
- linked new parser with parser access in old sitemap processing routines

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7126 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-08 14:13:15 +00:00
orbiter
9d080f387e change in handling of the all-visible home path for storage in YaCy:
the home path can now be distinguished between
- data home; the path where the DATA directory is created
- application home; everything else
This will make it possible to store application data on Mac releases within the
~/Library/YaCy
directory; a place where Mac applications write their data.
Similar techniques will be possible for debian and windows.
To use the new data path, YaCy can be started with
-start <data path>
or
-gui <data path>


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7092 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-02 19:24:22 +00:00
orbiter
65eaf30f77 redesign of crawl profiles data structure. target will be:
- permanent storage of auto-dom statistics in profile
- storage of profiles in WorkTable data structure
not finished yet. No functional change yet.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7088 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-31 15:47:47 +00:00
orbiter
4f22e2df41 bugfixes for
- next-execution-time in scheduler
- deletion of scheduled rss feed loading (now deletes also the scheduling entry)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7075 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-26 16:42:00 +00:00
orbiter
42414a6ae3 added two more tables in rss reader interface:
- fresh recorded rss feeds (not yet loaded or in scheduler)
- rss feeds in scheduler
The first list has a button that can be used to place rss feeds into the scheduler
The second list has a button to delete rss feeds from the scheduler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7074 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-26 16:01:45 +00:00
orbiter
0010cd9db1 Support for indexing of RSS feeds!
- added a scanning in html parser for rss feeds
- storage of rss feed addresses, can be viewed with http://localhost:8080/Tables_p.html?table=rss
- rss items retrieved by http://localhost:8080/Load_RSS_p.html (in Index Creation menu) can be selected and indexed
- a rss feed retrieved in http://localhost:8080/Load_RSS_p.html can now be fully indexed
- indexing of rss feeds can be placed in scheduler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7073 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-25 18:24:54 +00:00
orbiter
3197ca42ed preparations to move the HTCache into cora:
- move the header framework classes to cora
- move the ARC caching classes to cora
- refactoring of code to call these classes from cora

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7068 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-23 12:32:02 +00:00
orbiter
90531f78ff refactoring of the cora package to get subpackages for http and ftp (smb to come)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7063 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-22 22:32:39 +00:00
sixcooler
661867923a ... migrating to HttpComponents-Client-4.x ...
The Client is dead, long live the Client!
(no references to the old client)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7060 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-22 17:38:27 +00:00
orbiter
7aa860c505 - more logging
- more stability for database heap in case of buffer failure

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7058 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-21 10:16:05 +00:00
orbiter
70dd26ec95 added the new crawl scheduling function to the crawl start menu:
- the scheduler extends the option for re-crawl timing. Many people misunderstood the re-crawl timing feature because that was just a criteria for the url double-check and not a scheduler. Now the scheduler setting is combined with the re-crawl setting and people will have the choice between no re-crawl, re-crawl as was possible so far and a scheduled re-crawl. The 'classic' re-crawl time is set automatically when the scheduling function is selected
- removed the bookmark-based scheduler. This scheduler was not able to transport all attributes of a crawl start and did therefore not support special crawling starts i.e. for forums and wikis
- since the old scheduler was not aber to crawl special forums and wikis, the must-not-match filter was statically fixed to all bad pages for these special use cases. Since the new scheduler can handle these filters, it is possible to remove the default settings for the filters
- removed the busy thread that was used to trigger the bookmark-based scheduler
- removed the crontab for the bookmark-based scheduler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7051 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-19 23:52:38 +00:00
orbiter
5a994c9796 added a scheduler based on API actions
- every process that is monitored with the API Steering interface can now be scheduled!
- added input methods in Steering interface to set a scheduling time
- added a view on the steering api that shows only crawl jobs inside the Crawl Profile servlet
- added a scheduling call process in the cleanup process handler that triggers the scheduled processes
This causes that the cleanup now also looks for scheduled processes. Such processes are therefore not executed at
the same time as given in the target execution time but they will be executed within the cleanup process time window.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7050 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-19 12:13:54 +00:00
orbiter
a82a93f2fc - better url double check in crawler
- more logging for error urls

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7032 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-11 09:54:18 +00:00
sixcooler
a6ed6e8cb9 ... migrating to HttpComponents-Client-4.x ...
make the occurrence of multiple header-keys possible

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7031 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-10 21:22:30 +00:00
orbiter
ec72387165 added a very early test version of a YaCy gui component.
The gui currently does nothing else than providing a search window that sends the search string to the browser
The gui is started when YaCy is started with the option -g or --gui, like
./startYACY.sh -g
The gui will primary be used to provide a 'real' macintosh version that can be started and operated like any other macintosh application. A special mac application wrapper will follow.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7021 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-05 10:43:03 +00:00
sixcooler
15e8c13526 ... migrating to HttpComponents-Client-4.x ...
(gzip decompression, httploader, robots, ...)

+ enable proxy-crawling while log is fine

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7001 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-27 01:16:26 +00:00
mikeworks
aa663cda4d ConfigUpdate_p.html and ConfigUpdate_p.java: Added check for downloaded releases and disabled buttons in case no new releases available
de.lng: Updated German translation for additional String in ConfigUpdate_p.html
XHTML 1.0 Strict fixes for all the other .html files
yacy/ui/css/yacyui-portalsearch.css: added .hidden class that was removed from ConfigProperties_p.html
Switchboard.java: Added URL for thread Remote Crawl Job and set URL for Remote Crawl URL Loader to null to fix empty href=""

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6996 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-25 13:19:16 +00:00
sixcooler
b7102eff92 ... migrating to HttpComponents-Client-4.x ...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6989 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-22 23:08:37 +00:00
sixcooler
52718e6dcb ... migrating to HttpComponents-Client-4.x ...
monitoring: replaced unused 'idletime' by uploading bytes
added some kind of 'upload-throttling' at dht-out :-)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6983 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-22 00:51:41 +00:00
sixcooler
5fa8038f10 ... migrating to HttpComponents-Client-4.x ...
monitoring and first try to use remoteProxy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6979 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-20 01:14:28 +00:00
sixcooler
0e56d29335 ... migrating to HttpComponents-Client-4.x ...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6968 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-15 00:59:53 +00:00
sixcooler
c5c67f0504 start migrating to HttpComponents-Client-4.x
see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2872

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6965 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-07-12 23:07:05 +00:00
orbiter
b6fb239e74 redesign of parser interface:
some file types are containers for several files. These containers had been parsed in such a way that the set of resulting parsed content was merged into one single document before parsing. Using this parser infrastructure it is not possible to parse document containers that contain individual files. An example is a rss file where the rss messages can be treated as individual documents with their own url reference. Another example is a surrogate file which was treated with a special operation outside of the parser infrastructure.
This commit introduces a redesigned parser interface and a new abstract parser implementation. The new parser interface has now only one entry point and returns always a set of parsed documents. In case of single documents the parser method returns a set of one documents.
To be compliant with the new interface, the zip and tar parser had been also completely redesigned. All parsers are now much more simple and cleaner in its structure. The switchboard operations had been extended to operate with sets of parsed files, not single parsed files.
additionally, parsing of jar manifest files had been added.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6955 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-29 19:20:45 +00:00
orbiter
11b7853940 added a configuration page for search heuristics. currently you can switch on there:
- a site-operation heuristic that loads all direct links from a portal page if the site-operator is used
- a direct crawl for search results from scroogle for the given search terms
The configuration page can be found directly beside the network configuration page


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6951 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-27 21:38:16 +00:00
orbiter
1557e0f2d0 - some refactoring for internal RSSFeed (protocol of all actions as seen on status page)
- added dht-out to internal RSSFeed (you can see now messages about distributed indexes on status page)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6948 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-26 22:39:27 +00:00
orbiter
dcd01698b4 added a 'transition feature' that shall lower the barrier to move from g**gle to yacy (yes!):
Here a new concept called 'search heuristics' is introduced. A heuristic is a kind of 'shortcut' to good results in IT, here for good search results. In this case it will be used to get a very transparent way to compare what YaCy is able to produce as search result and what g**gle produces as search result. Here is what your can do now:
- add the phrase 'heuristic:scroogle' to your search query, like 'oil spill heuristic:scroogle' and then a call to scroogle is made to get anonymous search results from g**gle.
- these results are _not_ taken as meta-search results, but are used to instantly feed a crawling and indexing process. This happens very fast, here 20 results from scroogle are taken and loaded all simultanously, parsed and indexed immediately and from the results of the parsed content the search result is feeded, along to the normal p2p search
- when new results from that heuristic (more to come) get part of the search results, then it is verified if such results are redundant to existing (they had been part of the normal YaCy search result anyway) or if they had been completely new to YaCy.
- in the search results the new search results from heuristics are marked with a 'H ++' and search results from heuristics that had been already found by YaCy are marked with a 'H ='. That means:
- you can now see YaCy and Scroogle search results in one result page but you also see that you would not have 'missed' the g**gle results when you would only have used YaCy.

- to make it short: YaCy now subsumes g**gle results. If you use only YaCy, you miss nothing.

to come: a configuration page that let you configure the usage of heuristics and get this feature by default.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6944 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-25 16:44:57 +00:00
orbiter
3a9dc52ac2 added a fascinating new way to search _and_ start a web crawl at the same time:
implemented a hint from dulcedo "use site: - operator as crawl start point".
YaCy already was able to search using a site-constraint. This function is now extended with a instant crawling feature.
When you now use the site-operator, then the landing page of the site iand every page that is linked from this page are loaded, indexed and selected for the search result within that search request. When the remote server responds quickly enough, then this process can result in search results during the normal search result preparation .. just in some seconds.


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6941 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-23 11:19:32 +00:00
orbiter
87087f12fe - scanned remote search process and enhanced some data structure and synchronizations here and there
- removed concurrency overhead for small number of index normalizations as it happens during remote search
- removed 'load only parseable' constraint for snippet fetch because some resources may not have any url file extension and these had therefore not been parseable and searcheable since they may become parseable after loading when their mime type is known
- this partly fixes some problems with http://forum.yacy-websuche.de/viewtopic.php?p=20300#p20300 but more changes are necessary to get all expected search results

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6926 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-17 11:59:40 +00:00
orbiter
56ff9d5fd4 - extended news size from 512 to 1024 characters
- a new news db will be created (news1024.db), the old one (news.db) can be deleted
- peers with too large news payload are not ignored any more (they may have been invisible because they had a too large news payload!)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6917 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-15 10:43:47 +00:00
orbiter
60e71876ad - more abstraction (HashMap -> Map)
- more concurrency-awareness (HashMap -> ConcurrentHashMap)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6910 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-01 13:02:11 +00:00
orbiter
3f93a0cc8f redesign of remote proxy settings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6903 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-26 00:01:16 +00:00
orbiter
11639aef35 - added new protocol loader for 'file'-type URLs
- it is now possible to crawl the local file system with an intranet peer
- redesign of URL handling
- refactoring: created LGPLed package cora: 'content retrieval api' which may be used externally by other applications without yacy core elements because it has no dependencies to other parts of yacy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6902 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-25 12:54:57 +00:00
orbiter
6950d8a33d fixes to SMB crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6900 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-23 01:17:44 +00:00
orbiter
1defd580bc - added option to localization search to distinguish between a search for a location according to the search word only or for the relation between a web search results and locations found in the metadata fields
- used that to display two layers on map: cities and search result locations
- added many marker grafics for the display of the markers on the map
- some refactoring of the yacy news code plus bugfixes for latest move from Tree to Table data structure

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6889 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-19 12:53:09 +00:00
orbiter
2a8f70f0ca - fix for caching of OSM tiles. if you want that this fix applies to your peer, please delete the crawl profiles
- fix for initial generation of crawl profiles (one more reason to remove your crawl profiles)
- more String -> byte[] migration
- more logging for cache store/hit

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6874 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-14 23:50:07 +00:00
orbiter
789c6b26ce added a location search service: using the following servlet/example:
http://localhost:8080/yacysearch_location.kml?query=berlin&maximumTime=2000&maximumRecords=100

This will open any application that can consume kml data (which will probably be google earth) on your computer and displays the search result as positions on a map


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6865 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-11 12:58:05 +00:00
orbiter
c45117f81f fixed dates in metadata
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6860 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-08 22:09:36 +00:00