Commit Graph

186 Commits

Author SHA1 Message Date
orbiter
19fd13d3bc Added federated index storage to solr.
YaCy supports now the storage to remote solr indexes.
More federated storage (and search) methods may follow.

The remote index scheme is the same as produced by the SolrCell; see
http://wiki.apache.org/solr/ExtractingRequestHandler
Because this default scheme is used, the default example scheme can be used as solr configuration
This is also the same scheme that solr uses if documents are imported with apache tika.

federated solr storage is switched off by default.

To use this, do the following:
- set federated.service.solr.indexing.enabled = true
- download solr from http://www.apache.org/dyn/closer.cgi/lucene/solr/
- extract the solr (3.1) package, 'cd example' and start solr with 'java -jar start.jar'
- start yacy and then start a crawler. The crawler will fill both, YaCy and solr indexes.
- to check whats in solr after indexing, open http://localhost:8983/solr/admin/

Until now it is not possible to use the solr index to search with YaCy in that solr index.
This functionality is now available for two reasons:
1) to compare the functionality of Solr and YaCy and to compare the search speed
2) to use YaCy as a search appliance for people who need a crawler or other source harvesting methods
   that YaCy provides (like dublin core reading, wikimedia dump reading, rss feed reader etc) if people still
   want to use solr instead of YaCy.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7654 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-14 20:05:04 +00:00
orbiter
4c013d9088 more UTF8 getBytes() performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7649 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-12 05:02:36 +00:00
orbiter
cb6f709a16 - enhancements in surrogate reading
- better display of map in location search

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7636 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-02 00:11:37 +00:00
low012
1ff9947f91 *) added new user right: extended search right (allows to define users who can query more results than anonymous users)
*) cleaned up code a little bit

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7635 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-01 23:32:40 +00:00
orbiter
b1a8d0c020 enhancements to web cache and less strict caching rules
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7620 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-22 10:35:26 +00:00
orbiter
ba03ca8620 added more configuration options for search:
- removed configuration button for 'search only for admin' from index.html and added this to ConfigPortal
- added configuration of link verification options (iffresh, cacheonly, nocache, ifexist) to ConfigPortal
- added configuration of navigation options to ConfigPortal
- added an option to switch off automatic index cleaning in case that a link verification method fails


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7613 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-21 07:50:34 +00:00
orbiter
a50f28e6e7 - fixed missing save operation for peer name change
- fixed import of mediawiki dump files
- added script to add mediawiki dump files

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7609 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-19 23:52:09 +00:00
orbiter
cb1f49d0f2 replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7558 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-07 20:36:40 +00:00
orbiter
8d14916c74 more patches for a better out-of-memory management
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7555 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-07 01:45:11 +00:00
orbiter
c2c5b12882 - even less memory for circle tool
- background thread for bookmark initialization: this uses a DNS lookup which may cause long waiting times during startup

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7554 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-06 12:30:22 +00:00
orbiter
6dfaf6fef7 fix for bug in deletion of old seeds
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7547 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-04 10:00:37 +00:00
orbiter
42d90664f3 - fixed a memory leak in the httpc.post method (no finish)
- patched some more memory-saving relevant code
- some more minor bug fixes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7541 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-01 09:03:33 +00:00
orbiter
e3ef4e3021 - increased default peer ping time from 2 minutes to 1 minute
- filtering out too old peers when reading seed lists (limit is now 240 minutes)
- added concurrent host names resolving in front of the http client because the http client uses the java built-in DNS resolve which is not multithreading-safe (i have seen deadlocks in thread dumps showing that this bug in jdk is still there)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7515 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-23 09:42:01 +00:00
orbiter
cd19d0517e added dns resolve to HTTPClient POST using a dns cache to prevent that that not-thread-safe built-in dns cache inside apache http client is used
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7513 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-22 22:58:19 +00:00
orbiter
4bd65532da initialization of libraries concurrently (faster start-up)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7510 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-22 10:57:38 +00:00
orbiter
5892fff51f introduction of dht-burst modes: this can expand the number of target peers in some cases where a better heuristic is needed. The problematic cases are either when a muti-word search is made (still a hard case for our term-oriented DHT) or when a network operator wants that all robinson peers are asked. We therefore introduced two new network steering values that switch on more peers during the peer selection. Because the number of peers can now be very large, the number of maximum httpc connections was also increased.
Please see new coments in yacy.network.freeworld.unit for details of the new DHT selection methods.
The number of maximum peers is now not fixed to a specific number but may increase with
- the partition exponent
- the number of redundant peers
- the robinson burst percentage
- the multiword burst percentage
The maximum can then be the number of senior peers (all visible peers).

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7479 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-13 17:37:28 +00:00
orbiter
4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
- some restructuring of the document counting and logging structures was necessary
- better abstraction of CrawlProfiles
- added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation
- more refactoring to get the LibraryProvider more clean
- some refactoring of the Condenser class

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-12 00:01:40 +00:00
orbiter
0cdfb82963 replaced more appearance of double values by float values
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7461 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-02 00:06:29 +00:00
f1ori
982aa689ef * fix StringIndexOutOfBoundException in WebStructureGraph
* add better escaping to saveMap and loadMap

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7458 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-31 14:25:09 +00:00
orbiter
88773e4daa changed the default port from 8080 to 8090
see also: http://forum.yacy-websuche.de/viewtopic.php?p=21683#p21683

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7454 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-28 10:54:13 +00:00
orbiter
6c35b68f17 - removed 'peerName' property from the yacy settings file because this information is stored in the yacy seed file
- the own seed file gets the lead for storage of the peer name
- exchanged default peer name generation method with one that does not use the local ip
- default peer names are now strings starting with '_anon'
- added another switch to suppress forwarding to ConfigBasic if the name was already changed
- replaced all usages of the yacy.conf peerName with access to the local seed
- changes to the peer name are now applied directly and not after the next peer ping


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7453 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-28 10:12:17 +00:00
orbiter
3ae8f40fc8 removed yacy.network.group - this feature was never used
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7442 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-22 09:50:36 +00:00
orbiter
efb4ca8fa8 modified auto-delete of search failure-words:
- words are now not deleted from the search index automatically if index receive is switched off
- a flag in the network definition defines if this feature is switched on at all
- the search filter for not-found word references is switched off for server-side remote searches

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7441 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-22 09:46:00 +00:00
f1ori
4e29e9712a * create cleanupjob for cached failed urls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7437 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-17 15:04:00 +00:00
f1ori
a321c7673d * adminAccountForLocalhost only for localhost
* yacy crawls local domains also, if no password is set (the interface is already protected)
* it's not required anymore, to set a password in intranet mode

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7436 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-17 11:37:30 +00:00
orbiter
c93f4dda72 - cleaned up yacy news
- removed unused methods
- avoid news generation in case that the peer runs in robinson mode

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7431 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-12 00:00:14 +00:00
orbiter
10ae8d961b - cora package has now no dependencies to other yacy packages and becomes a 'base' package (refactoring)
- cleaned up (removed special code and documentation for 27c3)
- added remote search functions to be used within cora

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7420 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-03 20:52:54 +00:00
lotus
0e54233408 UPnP: map port again if we are not reachable (e.g. when router rebooted)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7419 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-02 21:17:21 +00:00
orbiter
0769f4caa6 added search suggestions for interactive search: is only shown if there are no search results
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7411 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-29 14:30:25 +00:00
orbiter
a4c9d27287 - moved some variables from Stwitchboard to new class AccessTracker
- added a limitation in access tracking to delete queries which are older than 10 minutes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7410 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-29 01:54:27 +00:00
f1ori
2521677a45 * deny adminForLocalhost and intranet network setup also on bootup and not only on network switch
* require authentication for yacybot what ever adminForLocalhost is set to
  (after this patch, is the rule from above really nesseccary,
  the crawler also checks the robots.txt)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7376 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-15 21:39:02 +00:00
f1ori
9d2159582f * fix system update if urls are in blacklist (for example for very general blacklists like *.de)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7375 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-15 19:20:00 +00:00
orbiter
56264dcc17 - added CamelCase parser to MultiProtocolURI: generate better to-be-indexed words from urls
- integrated new parser into loader processes: enrich document parser
- fixed a concurrent modification exception in kelondro iterator
- hand-over of document size from crawler to indexer

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7374 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-15 00:03:19 +00:00
orbiter
acab6801d9 added new network scanner
- you can scan any ip or host in the internet for services
- this replaces the intranet scanner

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7371 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-13 18:19:37 +00:00
orbiter
a563b05b60 enhanced crawler:
- added a new queue 'noload' which can be filled with urls where it is already known that the content cannot be loaded. This may be because there is no parser available or the file is too big
- the noload queue is emptied with the parser process which indexes the file names only
- the 'start from file' functionality now also reads from ftp crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7368 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-11 00:31:57 +00:00
orbiter
db99db4be9 some redesign of the search-fail-response mechanism:
when a search fails for a single url because the snippet cannot be generated, then the url reference is deleted from the index. This mechanism was redesign and enhanced. The process now also writes into the work tables into the table searchfl to prepare a re-indexing mechanism.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7364 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-06 14:34:58 +00:00
f1ori
4915d1781a * use local backup-file, if remote network-definition is not availible
* resolve single point of failure in networks, managed by central network-definitions

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7363 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-06 11:09:16 +00:00
orbiter
4e2c14efbb fixed bugs in parser and ftp client
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7360 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-02 11:05:04 +00:00
orbiter
cc6499bf8d - added http://blekko.com as search heuristic (like scroogle). This was easy since they deliver their search results also as rss feed
- renamed YaCys search result modifications keywords for RECENT, NEAR and language: to the blekko slashtag naming scheme. YaCy now supports the following blekko-like slash built-in slashtags:
/date
 - for search results ordered by date (most recent up)
 /near
 - for search results where search words appear near to each other (closest up)
 /language/<lang>
 - for a sorting by language where the wanted language gets up. Example: /language/de
  

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7350 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-29 18:08:20 +00:00
orbiter
a9f754c45f removed unused CR accumulation and distribution process
this was never used and extended in the last years. The resulting YBR ranking criteria
is still a good idea and will be used in the future. Possible generation methods for YBR
ranking are:
- "trust-rank" using the link structure as can be discovered in a single crawl (idea from FSCONS)
- "block-rank" calculated from the local link structure
- a distributed "block-rank" using the xml API to the link structure from other peers

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7349 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-29 11:07:42 +00:00
low012
9b3fae9496 *) cleaning up the code a little bit
*) program to interface, not implementation

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7345 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-28 02:57:31 +00:00
low012
38fdf43587 *) renamed classes according to standard Java coding conventions
*) String.isEmpty() was introduced in Java 1.6, but we still use Java 1.5

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7330 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-21 01:29:32 +00:00
low012
025e3f4790 *) renamed classes according to standard Java coding conventions
*) removed unsused code

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7328 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-21 00:39:21 +00:00
orbiter
445619f3ec added a submenu ConfigHTCache_p.html to set the size of the HTCache separately from the proxy configuration.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7291 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-02 23:57:11 +00:00
f1ori
acd93b1b31 * add failsafe mechanisme to domainlist retrieval
domainlist is saved locally, if none of the given urls in network.unit.domainlist
  could be retrieved, the file from the last boot is used instead

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7289 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-02 17:57:48 +00:00
f1ori
def4253555 * add option to network definition to provide a domainlist (syntax like in blacklists)
* crawler and search allow only urls matching one in domainlist (if list is provided)
* this may be useful to prevent dedicated networks from being "polluted"
* FilterEngine is improved Backlist-object, Blacklist may inherit from FilterEngine in the future

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7285 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-30 14:44:33 +00:00
mikeworks
caabebf9be Fixed spelling mistake omiting -> omitting in debug messages in ConfigUpdate_p.java and Switchboard.java
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7280 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-28 04:03:11 +00:00
f1ori
7d8de34778 * add a bit documentation to DigestURI, use DigestURI(string) instead of DigestURI(string, null)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7276 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-26 16:10:20 +00:00
orbiter
25a8e55bc9 more logging about bad seeds
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7275 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-26 15:00:22 +00:00
orbiter
e3964f2c31 better catch of network definition load error; continue with secondary network load definition location
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7270 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-25 09:20:45 +00:00