Commit Graph

3371 Commits

Author SHA1 Message Date
orbiter
87082f407e less String object creation during search
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7756 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-30 04:19:20 +00:00
orbiter
4bea3f9714 hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources:
used a ASCII String <-> byte[] conversion wherever possible. Many Strings in YaCy are hashes which are pure ASCII (base64 hashes).
The new ASCII String <-> byte[] conversion method have less computation overhead than the UTF8 conversion.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7746 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-27 08:24:54 +00:00
orbiter
e28bd0d038 fix for some possible causes of memory leaks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7741 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 14:35:32 +00:00
orbiter
8d9b5dda3b disabled did-you-mean computation for json and rss search results where this info is not used
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7739 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 12:35:24 +00:00
orbiter
10e2f588f8 - enhanced ybr ranking computation
- many speed/performance hacks
- added solr charding and new charding web interface
- added option to switch off the yacy index when using solr
- added new fail-url categories which are used to make a distinction which fail-urls to be sent to solr
- refactoring/renaming of some method names to distinguish host/url hashes better
- a large number of bug/npe fixes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7738 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 10:57:02 +00:00
orbiter
3ed4a09368 small features, some bug fixes and performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7733 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-23 21:08:04 +00:00
orbiter
3ec94d87c4 show dom counter only for active crawls where the dom counter is enabled within the crawl profile
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7731 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-22 19:34:20 +00:00
orbiter
b45701d20f this is a re-implementation of the YaCy Block Rank feature
This time it works like this:
- each peer provides its ranking information using the yacy/idx.json servlet
- peers with more than 1 GB ram will load this information from all other peers, combine that into one ranking table and store it locally. This happens during the start-up of the peer concurrently. The new generated file with the ranking information is at DATA/INDEX/<network>/QUEUES/hostIndex.blob
- this index is then computed to generate a new fresh ranking table. Peers which can calculate their own ranking table will do that every start-up to get latest feature updates until the feature is stable
- I computed new ranking tables as part of the distribition and commit it here also
- the YBR feature must be enabled manually by setting the YBR value in the ranking servlet to level 15. A default configuration for that is also in the commit but it does not affect your current installation only fresh peers
- a recursive block rank refinement is implemented but disabled at this point. it needs more testing

Please play around with the ranking settings and see if this helped to make search results better.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7729 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-18 14:26:28 +00:00
orbiter
123375bfba added a new yacy protocol servlet 'idx'. This returns an index to one of the data entities that is stored in YaCy.
This servlet currently only serves for indexes to the web structure hosts. It can be tested by calling
http://localhost:8090/yacy/idx.json?object=host
This yacy protocol servlet is the first one that returns JSON code and that also shows index entries in a readable format. This will make the development of API applications much easier. This is also an example implementation for possible json versions of the other existing YaCy protocol interfaces.

The main purpose of this new feature is to provide a distributed block rank collection feature. Creating a block rank is very difficult if the forward-link data is first collected and then one peer must create a backward-link index. This interface provides already a partial backward index and therefore a collection of all these indexes needs only to be joined which is very easy. The result should be the computation of new block rank tables that all peers can perform.

To reduce load from peers this servlet buffers all data and refreshes it only once in 12 hours. This very slow update cycle is needed because the interface will be called round-robin from all peers once after start-up.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7724 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-15 22:57:31 +00:00
orbiter
d326f1486a added timeout setting to scanner interface
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7723 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-14 11:30:41 +00:00
low012
f0d5ddfa92 *) preventing potential NPE which occured if user deleted DATA/RELEASE manually and opened ConfigureUpdate_p.java then
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7722 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-14 09:23:19 +00:00
orbiter
5c981762c6 added bigrange option for network scan
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7721 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-14 09:13:16 +00:00
orbiter
bade61696f speed-up of network port scanner
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7719 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-14 09:03:16 +00:00
orbiter
b04382bc59 added topmenu as defined for search to wiki
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7718 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-14 08:29:16 +00:00
lotus
229df8b626 restart link after memory changed
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7717 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-13 17:24:03 +00:00
orbiter
1d8b0f74f4 one more fix for SVN 7713
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7716 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-13 15:31:24 +00:00
orbiter
0960261769 fix for svn 7713
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7715 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-13 15:20:57 +00:00
apfelmaennchen
7e368000c8 transparent progress bar
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7714 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-13 13:40:23 +00:00
orbiter
5b579e21a3 code cleanup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7713 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-13 06:21:40 +00:00
orbiter
fcd4b03892 show progress of search after display of results is finished
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7712 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-13 06:20:00 +00:00
orbiter
039126cfaf better handling of on/off switched solr indexing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7709 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-08 22:47:20 +00:00
lotus
f123dbec79 fix in heuristics config
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7707 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-07 18:52:20 +00:00
orbiter
897b4e8b9c another hack to prevent black images
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7706 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-07 07:45:02 +00:00
orbiter
9248a4eef4 reduce teh effect of 'Bildersuche findet generierte HTML-Seiten als Bilder'
see http://bugs.yacy.net/view.php?id=9

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7705 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-07 07:37:46 +00:00
orbiter
0621a15f89 fix for wrong search result counter: added a counter for all filtered out entities
see also http://bugs.yacy.net/view.php?id=5

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7704 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-06 23:04:27 +00:00
apfelmaennchen
61c9a791c4 YMarks: sidebar with tabs for tags and folders
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7703 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-06 21:36:35 +00:00
orbiter
deda54d684 - relaxed matching of string-search (this is now case-insensitive)
- added transport of string-search pattern to remote search protocol
- fixed a problem parsing snippets with a '-' inside

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7700 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-05 22:37:06 +00:00
orbiter
6e42d4de88 - added full-String search function: find things that match exactly what is quoted in the query
- re-structuring authentification methods to fix a problem with API steering

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7697 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-05 00:25:14 +00:00
apfelmaennchen
8b8db2aaba YMarks: some small changes/fixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7695 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-03 21:21:06 +00:00
apfelmaennchen
441035f1f4 YMarks: some improvements to flexigrid quick search on YMarks.html
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7694 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-02 20:11:58 +00:00
orbiter
6fa439c82b - refactoring of robots
- added option to crawler to send error-URLs to solr
- changed solr scheme slightly (no multi-value fields where no multi values are)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7693 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-02 14:05:51 +00:00
sixcooler
1ea0bc775c @apfelmaenchen:
is this the expected, but forgotten change?
Please correct if I'm wrong
(this let me build Yacy again)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7692 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-02 10:46:05 +00:00
apfelmaennchen
e7c2ea193b YMark:
- general improvements on importers, especially on auto tagging
- added get_tags (needed for tag clouds etc.)
- improved flexigrid support
- added YMarks.html (not fully working) that will eventually replace Bookmarks.html

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7691 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-01 21:42:48 +00:00
orbiter
3b578a28ef some patches to prevent that empty or bad IP information is broadcasted
- on client-side: fix bad IP reports from remote Peers by replacing their reported IP with their server IP if the reported IP is bad, broken or disallowed
- on server-side: the same during a peer ping (here the ping'ed server acts also as client during the back-ping) and also when receiving a message or a search where the client sends also its seed. Here the IP is replaced by the client IP if the reported IP is broken or bad

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7687 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-29 10:58:12 +00:00
orbiter
8b95a26866 better magic
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7684 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-29 02:00:37 +00:00
orbiter
2700a58e5a added a magic to the peer ping that will be used in case that the contacting peer requests that it's reported IP shall be used for a back-ping. The back-ping now also returns the same magic which will make it possible that the requested peer can verify that the back-pinged peer is actually the same peer.
This is also a protection against the foced-fake of a external IP: if such an IP was faked, then the next ping from the affected peer to another peer looks like a staticIP report. Such a bad staticIP-by-faked-response can now be discovered and fixed by the peer that gets the second ping after the first ping contained a faked response.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7683 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-29 01:52:20 +00:00
orbiter
f6077b3cc0 added more attributes for html parser and enhanced data structures
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7679 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-28 13:09:01 +00:00
f1ori
d671de8c17 add ranking weight to json-search-results
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7677 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-28 11:18:14 +00:00
orbiter
d8e934c085 better abstraction of http client identification
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7675 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-26 13:35:29 +00:00
orbiter
b77b8cac0c - enhanced html parser: recognized much more details in the content
- added more properties to solr index
- refactoring
- more constants in switchboard
- fix for some NPEs
- recognition of more images
- removed synchronization in HandleMap (obviously not necessary?)
- added a nolocal configuration to remove excessive dns lookup (works only on allip - default off). Indexes produced with this setting are all flagged with 'local' and are (on purpose) not usable for freeworld because they will be rejected as beeing local.



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-21 13:58:49 +00:00
low012
bc84d2bc9d *) fixed typo in stop script
*) added <u> </u> tags for underlined text in Wiki Code
*) minor code changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7671 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-20 22:54:29 +00:00
apfelmaennchen
b2281f0b7d YMark: intermediate work towards flexigrid support
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7670 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-20 22:33:01 +00:00
apfelmaennchen
60412d2bb3 YMark:
- more refactoring >> YMarkEntry
- integration of SurrogateReader as bookmark importer
- various small bug fixes e.g. get_xbel.xml

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7668 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-18 21:42:14 +00:00
orbiter
3d5104d357 - fixed a bug in crawl start with file name (npe in new url)
- added deletion of solr index in IndexControlRWIs
- added asynchronous adding of large url lists (happens when crawls are startet with file)
- fixed npe in Image display
- replaced language warning with fine logging
- added a domain name cache in Domains that helps to speed up the isLocal property (less DNS lookups)
- added a new storage class for this new cache: KeyList. The domain key list is stored in DATA/WORK/globalhosts.list
- added concurrent solr updates and chunked transfers (50 documents until a commit is done) for high-speed feeding (> 40000 ppm)
- fixed a bug in content scraper that chopped off large parts of crawl lists (using crawl start from file)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7666 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-18 16:11:16 +00:00
orbiter
08108f0ece fix for http://bugs.yacy.net/view.php?id=12
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7665 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-17 22:53:15 +00:00
apfelmaennchen
62855f9567 YMark: code clean up and some small fixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7661 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-16 21:19:42 +00:00
apfelmaennchen
667e912b19 YMark:
- some improvements to firefox json bookmark importer
- test import with: /api/ymarks/test_import.html
- view ymarks with: /api/ymarks/test_treeview.html


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7660 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-16 09:09:33 +00:00
apfelmaennchen
a0e4960a4d YMark:
- first attempt for a firefox json bookmark importer
- added JSON library json-simple-1.1.jar

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7658 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-15 20:58:58 +00:00
orbiter
958ff4778e enhanced location search:
search is now done using verify=false (instead of verify=cacheonly) which will cause that much more targets can be found.
This showed a bug where no location information was used from the metadata (and other metadata information) if cache=false is requested. The bug was fixed.

Added also location parsing from wikimedia dumps. A wikipedia dump can now also be a source for a location search.
Fixed many smaller bugs in connection with location search.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7657 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-15 15:54:19 +00:00
orbiter
19fd13d3bc Added federated index storage to solr.
YaCy supports now the storage to remote solr indexes.
More federated storage (and search) methods may follow.

The remote index scheme is the same as produced by the SolrCell; see
http://wiki.apache.org/solr/ExtractingRequestHandler
Because this default scheme is used, the default example scheme can be used as solr configuration
This is also the same scheme that solr uses if documents are imported with apache tika.

federated solr storage is switched off by default.

To use this, do the following:
- set federated.service.solr.indexing.enabled = true
- download solr from http://www.apache.org/dyn/closer.cgi/lucene/solr/
- extract the solr (3.1) package, 'cd example' and start solr with 'java -jar start.jar'
- start yacy and then start a crawler. The crawler will fill both, YaCy and solr indexes.
- to check whats in solr after indexing, open http://localhost:8983/solr/admin/

Until now it is not possible to use the solr index to search with YaCy in that solr index.
This functionality is now available for two reasons:
1) to compare the functionality of Solr and YaCy and to compare the search speed
2) to use YaCy as a search appliance for people who need a crawler or other source harvesting methods
   that YaCy provides (like dublin core reading, wikimedia dump reading, rss feed reader etc) if people still
   want to use solr instead of YaCy.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7654 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-14 20:05:04 +00:00