Commit Graph

149 Commits

Author SHA1 Message Date
Michael Peter Christen
7611bf79bd Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1
Conflicts:
	locales/ru.lng
2014-09-10 13:24:49 +02:00
Michael Peter Christen
c115f3869c enhanced snippet computation and test method in ViewFile 2014-07-28 15:42:57 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL Object type which
now also has a different toString method that the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls, just just toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
reger
336425912a remove unused localSearchThread from SearchEvent 2014-07-10 02:14:03 +02:00
reger
431a5f9c4e added test case for TextSnippet,
removed obsolete/unused parameter and reference to MediaSnippet
2014-06-30 05:36:48 +02:00
reger
a5707cd2eb enable proper Author navigator
- author facet is based on omitted author_sxt field
- adjust to make author nav available on exist of author field but keep using author_sxt to construct the facet (why!?)
- add check for querymodifier author in searchevent
2014-06-27 23:05:06 +02:00
Michael Peter Christen
b893c42a0f bugfix for image search 2014-06-26 12:56:33 +02:00
Michael Peter Christen
d2151857f1 Added collection navigation:
The collection field (can be filled i.e. in Crawl Start) can be used to
add categories to YaCy index entries. The usage of that field was
restricted to solr searches and post argument filters as implemented in
commit f7571386a3.
This commit extends collections to a full navigation option in the
standard YaCy search interface. The field is not active by default but
can be activated easily in the /ConfigSearchPage_p.html servlet (just
check the 'Collection' facet field). Collections can now be used for (at
least) two purposes:
- to provide search tenants (through post argument collection)
- to provide self-made category navigation
Search requests may now have (independently from switched on or off
collection facet) a "collection:<collection-name>" modifier attached;
firthermore collection names may use disjunctions using the '|' pipe
symbol. For example, this is a valid search request:
www collection:user|proxy
2014-06-15 12:11:23 +02:00
Michael Peter Christen
f0db501630 better handling of ranking parameters and new default values for date
navigation which is done using ranking in solr.
2014-05-22 03:01:07 +02:00
Michael Peter Christen
4e734815e8 enhanced snippets: remove lines which are identical to the title and
choose longer versions if possible. Prefer the description part.
2014-05-06 16:48:50 +02:00
reger
727dfb5875 refactore URIMetadataNode to further unify interaction with index
-  URIMetadataNode extending SolrDocument
- use language as stored (String), reducing conversion to string
- optimize debug code in transferIndex
2014-04-20 01:41:30 +02:00
Michael Peter Christen
8b44fcf0f4 added missing @Override annotation 2014-03-28 13:48:37 +01:00
reger
ca7444dbdf limit filetype nav to known extension also on image/media search
- on text search we limit filetype nav already to known extension, apply filter to image search
2014-03-23 23:10:29 +01:00
Michael Peter Christen
6ed9c0164e attaching names to all Threads to get a better view in profiling tools
like VisualVM
2014-02-28 15:02:01 +01:00
Michael Peter Christen
751c128544 extra sleep for remote searches enhances search results because there is
more time for more remote peers to contribute on the first result page
2014-02-09 14:57:17 +01:00
Michael Peter Christen
0cabcbbe83 more efficient wordcount 2014-02-09 14:45:12 +01:00
reger
9b24dae2b7 add language navigation filter clause to rwi results 2014-01-25 22:59:23 +01:00
reger
f307d65dcf prepare for a language navigator
works fine to restrict language for local solrSearches.
More work needs to be done to make rwi/remote searches respect the modifier.language restriction.
2014-01-24 03:11:25 +01:00
orbiter
5ec0c969c9 fix for http://bugs.yacy.net/view.php?id=354 2014-01-22 20:59:53 +01:00
Michael Peter Christen
1ea17bd9f3 - removed old metadata database and all migration code
- refactored all code which uses URIMetadataRow as standard for word
hash length and word hash ordering and moved that to the class 'Word',
becuase the class URIMetadataRow defined the old metadata data structure
and should be superfluous in the future
- removed unused methods from URIMetadataRow as preparation for further
removal of that class
2014-01-20 18:31:46 +01:00
reger
97e84439fb adjusted ConfigHeuristic and changed QueryGoal.getOriginalQueryString to .getQueryString
- since specific heuristic Twitter & Blekko is not longer available or redundant with OpenSearchHeuristic,
adjusted ConfigHeuristic to use OpensearchHeuristic settings only.
For this the default OSD search target list is made available (copied) by default and the other configs are removed.

- the return of QueryGoal.getOriginalQueryString includes the queryModifier, which are held separately in a modifier object,
but in most (all) cases just the query term is expected, clarified and renamed it to QueryGoal.getQueryString which returns
just the search term (if needed a .getOrigianlQueryString could be implemented in Queryparameters, adding the modifiers)

- started to adjust internal html href references from absolute to relative (currently it is mixed).
For future development we should prefer relative href targets (less trouble with context aware  servlets)
2014-01-20 00:58:17 +01:00
Michael Peter Christen
f8ce7040ab remote search peer selection schema change:
- all non-dht targets (previously separated into 'robinson' for dht-like
queries and 'node' for solr queries) are non 'extra' peers, which are
queries using solr
- these extra-peers are now selected using a ranking on last-seen,
peer-tag-matches, node-peer flags, peer age, and link count. The ranking
is done using a weight and a random factor.
- the number of extra peers is 50% of the dht peers
- the dht peers now exclude too young peers to prevent bad results
during strong growth of the network
- the number of dht peers (and therefore extra-peers) is reduced when
the memory of the peer is low and/or some documents still appear in the
indexing-queue. This shall prevent a peer from deadlocks when p2p
queries are made in a fast sequence on weak hardware.
2014-01-16 17:27:14 +01:00
reger
dd8ea0cdd6 fix "add to blacklist" button style in IndexControlRWIs_p
- added default filename filter to select field (as only addition to *.black list is permanent)

- modified Blacklist_p header/legend to show all active blacklists 
  (to support understanding that all configured lists are active)
- removed obsolete code in Blacklist_p servlet
2013-12-30 20:03:59 +01:00
Michael Peter Christen
09412ea3a4 counting search requests in solr interface 2013-12-12 03:37:19 +01:00
Michael Peter Christen
78eac85161 better calibration of caches and queue maximum sizes 2013-12-04 23:15:10 +01:00
Michael Peter Christen
2c39b65409 fixes for searches containing stopwords. The fix was done using a
reconstruction of the search word set access method to protect that
words are deleted from the sets from the outside of the QueryGoal class.
2013-11-26 02:24:47 +01:00
orbiter
61409788eb less word hash computations (removing some overhead because of MD5
calcs) using the clear word in a normalized form.
2013-11-25 15:20:54 +01:00
Michael Peter Christen
087df05e24 added option to Config_Network_p.html to enable remote search while
DHT-Receive is switched off.
2013-11-13 13:38:01 +01:00
Michael Peter Christen
1a4a69c226 set more logger to 'final static' 2013-11-13 06:18:48 +01:00
Michael Peter Christen
1b4fa2947d - fixed a problem which ocurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together makes it now possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
orbiter
8ac2e8c8c9 added location navigator which causes that the image to the map search
is visible whenever a location is available in the search result.
To activate this, the search.navigation property in yacy.conf must be
modified to the new default values.
2013-09-24 11:26:51 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
all unique links! This made it necessary, that a large portion of the
parser and link processing classes must be adopted to carry a different
type of link collection which carry a property attribute which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
85456f46b2 added two new fields, exact_signature_copycount_i and
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assigns this to each document. Thus, each
document there is a number assigned which shows how many copies of this
document exists.
These fields are disabled by default.
2013-09-04 23:11:53 +02:00
Michael Peter Christen
a2511b5600 turned images_alt_txt back to images_alt_sxt because it is not necessary
to index the alt text. Indexed image Text is in images_text_t
2013-09-04 10:47:18 +02:00
Michael Peter Christen
85b1922244 activated image type navigation for image search 2013-09-03 13:34:01 +02:00
Michael Peter Christen
ab1201fdfd fixed wrong facet count 2013-09-03 12:22:29 +02:00
Michael Peter Christen
049c3b3f2e added an option to exclude image search results from text search. This
is on by default.
2013-09-03 11:14:23 +02:00
Michael Peter Christen
a8c5bfcf58 avoid to create unnecessary objects 2013-09-03 09:48:05 +02:00
Michael Peter Christen
cb85b22725 redesign of the image search process (with much better results,
unfortunately the index schema has changed and p2p image search will not
be muchmuch better until many people update)
2013-09-02 18:55:38 +02:00
reger
a5019bc470 make Vocabulary Navigator tags a hard result entry filter
by checking vocabulary tags also for rwi results (currently a filter is applied to the solr query)

TODO: as vocabularies are only locally valid, auto-switch to Searchdom.LOCAL could be considered.
2013-08-13 03:07:25 +02:00
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based logger tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is a add-on to jdk logging to put log entries on a
concurrent message queue and log the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
Michael Peter Christen
8caaf6203a fixed false multiple-generation of remote facet search which
caused high cpu usage on remote side.
2013-06-28 12:39:36 +02:00
reger
d367b1f4d9 add null pointer check to stopword fix 2013-06-07 00:13:45 +02:00
reger
7480e87386 - fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247
- append language setting specific stopword list

- remove unused OVERHANG stack type
2013-06-06 22:07:54 +02:00
Michael Peter Christen
409d6edf53 Store node/solr search threads to be able to send them an interrupt
signal in case that a cleanup process wants to remove the search
process. Added also a new cleanup process which can reduce the number of
stored searches to a specific number which can be higher or lower
according to the remaining RAM. The cleanup process is called every time
a search ist started.
2013-05-30 12:38:15 +02:00
Michael Peter Christen
0c1a018bbd removed 'later' tactic because it used too much RAM, reduced number of
soft commits, reduced caching size of search events, ensured that solr
results are processed before connection is closed to keep that stuff not
too long in RAM
2013-05-29 18:27:27 +02:00
orbiter
da621e827e prevent NPE in case RWI is disabled 2013-05-28 16:26:38 +02:00
Michael Peter Christen
c2b1075dcf activating pollImmediately in case that DHT receive is off. This will
cause a much faster search result when running in public robinson mode.
2013-05-28 10:36:49 +02:00
Michael Peter Christen
8dbc80da70 redesign of index.exist-test: this shall now not be done using a single
id to be tested, but with a collection of ids. This will cause only a
single call to solr instead of many. The result is a much better
performace when testing the existence of many urls. The effect should
cause very much less IO during index transmission, both on sender and
receiver side.
2013-05-17 13:59:37 +02:00