Commit Graph

299 Commits

Author SHA1 Message Date
Michael Peter Christen
6ed9c0164e attaching names to all Threads to get a better view in profiling tools
like VisualVM
2014-02-28 15:02:01 +01:00
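Note on the commit above: a minimal sketch of how worker threads can be given readable names so that they show up clearly in a profiler such as VisualVM. The class name `NamedThreadFactory` and the prefix scheme are assumptions for illustration, not the code from this commit.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: every thread created here gets "<prefix>-<n>" as its name.
final class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger(0);

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable r) {
        // The second constructor argument is the thread name shown by profiling tools.
        return new Thread(r, prefix + "-" + counter.incrementAndGet());
    }
}
```

Such a factory can be passed to `Executors.newFixedThreadPool(int, ThreadFactory)` so that pooled threads are named as well.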
Michael Peter Christen
0f6b72f24b do not use luke requests for remote solr servers if the result is
different from normal requests. This happens if the remote solr is
actually a solrCloud; in such cases the luke request returns only the
result of the single solr peer, not the whole cloud.
also done: some refactoring.
2014-02-26 14:30:48 +01:00
Michael Peter Christen
751c128544 extra sleep for remote searches enhances search results because there is
more time for more remote peers to contribute on the first result page
2014-02-09 14:57:17 +01:00
Michael Peter Christen
0cabcbbe83 more efficient wordcount 2014-02-09 14:45:12 +01:00
reger
b693ce9759 allow combining selection of different search nav's (facets)
- selecting more than one nav combines the 2 selections (with AND)
- unselecting one nav clears all selected

(e.g. selecting filetype:pdf and /language/fr shows only French PDFs)
2014-01-30 22:57:27 +01:00
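Note on the commit above: a rough sketch of the described selection behaviour, assuming that selected navigators are kept as modifier strings and appended to the query, where they act as an AND restriction. `NavSelection` and its methods are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the navigator selection state.
final class NavSelection {
    private final Set<String> modifiers = new LinkedHashSet<>();

    // Selecting a further navigator adds its modifier to the existing ones.
    void select(String modifier) {
        modifiers.add(modifier);
    }

    // As described in the commit message, unselecting one navigator clears all
    // selections; the argument is therefore not inspected in this sketch.
    void unselect(String modifier) {
        modifiers.clear();
    }

    // Appends all selected modifiers to the search terms, separated by spaces.
    String toQuery(String terms) {
        return modifiers.isEmpty() ? terms : terms + " " + String.join(" ", modifiers);
    }
}
```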
reger
cb71413d19 fix page nav to keep the modifier
(was new issue)
2014-01-30 22:00:32 +01:00
orbiter
416481c33e added a boost on appearance of combined words (in the same order the
user submitted them) when searching for more than one word
2014-01-30 10:51:08 +01:00
reger
9b24dae2b7 add language navigation filter clause to rwi results 2014-01-25 22:59:23 +01:00
reger
f307d65dcf prepare for a language navigator
works fine to restrict language for local solrSearches.
More work needs to be done to make rwi/remote searches respect the modifier.language restriction.
2014-01-24 03:11:25 +01:00
orbiter
5ec0c969c9 fix for http://bugs.yacy.net/view.php?id=354 2014-01-22 20:59:53 +01:00
Michael Peter Christen
1ea17bd9f3 - removed old metadata database and all migration code
- refactored all code which uses URIMetadataRow as standard for word
hash length and word hash ordering and moved that to the class 'Word',
because the class URIMetadataRow defined the old metadata data structure
and should be superfluous in the future
- removed unused methods from URIMetadataRow as preparation for further
removal of that class
2014-01-20 18:31:46 +01:00
reger
97e84439fb adjusted ConfigHeuristic and changed QueryGoal.getOriginalQueryString to .getQueryString
- since the specific heuristics Twitter & Blekko are no longer available or redundant with OpenSearchHeuristic,
adjusted ConfigHeuristic to use OpensearchHeuristic settings only.
For this the default OSD search target list is made available (copied) by default and the other configs are removed.

- the return of QueryGoal.getOriginalQueryString includes the queryModifier, which are held separately in a modifier object,
but in most (all) cases just the query term is expected, clarified and renamed it to QueryGoal.getQueryString which returns
just the search term (if needed, a .getOriginalQueryString could be implemented in Queryparameters, adding the modifiers)

- started to adjust internal html href references from absolute to relative (currently it is mixed).
For future development we should prefer relative href targets (less trouble with context-aware servlets)
2014-01-20 00:58:17 +01:00
Michael Peter Christen
f8ce7040ab remote search peer selection schema change:
- all non-dht targets (previously separated into 'robinson' for dht-like
queries and 'node' for solr queries) are now 'extra' peers, which are
queried using solr
- these extra-peers are now selected using a ranking on last-seen,
peer-tag-matches, node-peer flags, peer age, and link count. The ranking
is done using a weight and a random factor.
- the number of extra peers is 50% of the dht peers
- the dht peers now exclude too young peers to prevent bad results
during strong growth of the network
- the number of dht peers (and therefore extra-peers) is reduced when
the memory of the peer is low and/or some documents still appear in the
indexing-queue. This shall prevent a peer from deadlocks when p2p
queries are made in a fast sequence on weak hardware.
2014-01-16 17:27:14 +01:00
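Note on the commit above: a sketch of a ranking that combines the listed criteria with weights and a random factor, then keeps half as many extra peers as there are dht peers. All weights, the `Peer` fields and the class name `ExtraPeerSelector` are assumptions chosen for illustration; the actual values used in the commit are not stated here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class ExtraPeerSelector {

    // Hypothetical peer description; the field names are illustrative only.
    static final class Peer {
        final String name;
        final long lastSeenMinutes; // minutes since the peer was last seen
        final int tagMatches;       // number of peer tags matching the query
        final boolean nodePeer;     // node-peer flag
        final int ageDays;          // peer age in days
        final int linkCount;        // link count reported by the peer

        Peer(String name, long lastSeenMinutes, int tagMatches,
             boolean nodePeer, int ageDays, int linkCount) {
            this.name = name;
            this.lastSeenMinutes = lastSeenMinutes;
            this.tagMatches = tagMatches;
            this.nodePeer = nodePeer;
            this.ageDays = ageDays;
            this.linkCount = linkCount;
        }
    }

    // Weighted sum of the criteria plus a bounded random factor (assumed weights).
    static double score(Peer p, Random rnd) {
        double s = 10.0 / (1.0 + p.lastSeenMinutes); // recently seen peers score higher
        s += 5.0 * p.tagMatches;
        s += p.nodePeer ? 3.0 : 0.0;
        s += Math.min(p.ageDays, 30) / 10.0;
        s += Math.log10(1.0 + p.linkCount);
        s += rnd.nextDouble() * 2.0;                 // random factor in [0, 2)
        return s;
    }

    // Picks the best-ranked peers; the count is 50% of the number of dht peers.
    static List<Peer> select(List<Peer> candidates, int dhtPeerCount, Random rnd) {
        int n = Math.min(candidates.size(), dhtPeerCount / 2);
        // Compute every score once so the random part stays stable while sorting.
        Map<Peer, Double> scores = new IdentityHashMap<>();
        for (Peer p : candidates) {
            scores.put(p, score(p, rnd));
        }
        List<Peer> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Peer p) -> scores.get(p)).reversed());
        return new ArrayList<>(sorted.subList(0, n));
    }
}
```

Computing each score once before sorting matters here: calling the random factor inside the comparator would make the ordering inconsistent.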
Michael Peter Christen
9bd71fdbb4 made the access tracker class static because it shall be used by the
jetty auth module
2014-01-05 05:04:28 +01:00
Michael Peter Christen
b9d36e45e0 removed the &amp explicit encoding of ampersand character since this is
double-translated within the template replacement process.
2014-01-05 03:40:10 +01:00
orbiter
dcf46ce8f6 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-12-31 15:20:49 +01:00
orbiter
343d2ef49a new data type for access tracker (unfinished) 2013-12-31 15:20:34 +01:00
reger
dd8ea0cdd6 fix "add to blacklist" button style in IndexControlRWIs_p
- added default filename filter to select field (as only addition to *.black list is permanent)

- modified Blacklist_p header/legend to show all active blacklists 
  (to support understanding that all configured lists are active)
- removed obsolete code in Blacklist_p servlet
2013-12-30 20:03:59 +01:00
reger
abbf487023 fix QueryGoal Image query (missing space)
see query log example .. url_file_ext_s:(jpg OR png OR gif) ORcontent_type:(image/*)) ..
2013-12-29 20:14:10 +01:00
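Note on the commit above: the missing-space problem in the logged query (`ORcontent_type`) is the kind of error that joining clauses with a delimiter avoids. The following is only a sketch; the class and method names are hypothetical, while the field names are taken from the logged query.

```java
import java.util.List;

final class ImageQuerySketch {

    // Builds the image clause by joining parts with " OR " instead of
    // concatenating operator strings by hand.
    static String imageClause(List<String> extensions) {
        String byExtension = "url_file_ext_s:(" + String.join(" OR ", extensions) + ")";
        String byMime = "content_type:(image/*)";
        return "(" + String.join(" OR ", byExtension, byMime) + ")";
    }
}
```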
reger
26e9d7e066 fix NPE in IndexControlRWIs_p.html
- metatags may be null
Caused by: java.lang.NullPointerException
	at net.yacy.search.query.QueryParams.getFacets(QueryParams.java:445)
	at net.yacy.search.query.QueryParams.getBasicParams(QueryParams.java:400)
	at net.yacy.search.query.QueryParams.solrTextQuery(QueryParams.java:345)
	at net.yacy.search.query.QueryParams.solrQuery(QueryParams.java:334)
	at net.yacy.search.query.SearchEvent.<init>(SearchEvent.java:290)
	at net.yacy.search.query.SearchEventCache.getEvent(SearchEventCache.java:176)
	at IndexControlRWIs_p.genSearchresult(IndexControlRWIs_p.java:641)
	at IndexControlRWIs_p.respond(IndexControlRWIs_p.java:141)
2013-12-29 08:05:37 +01:00
orbiter
3961b643a3 write solr searches to search log 2013-12-29 01:25:44 +01:00
Michael Peter Christen
25f9c35033 add patch to prevent naive search mistakes, such as the use of
regular expressions, from producing no results. A '*' followed by a dot or
any expression is now treated as a filetype search.
2013-12-27 00:34:55 +01:00
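Note on the commit above: one possible reading of the described rewrite, sketched under the assumption that a token such as `*.pdf` is turned into the `filetype:` modifier. `NaiveWildcardRewriter` is a hypothetical name and the real patch may handle more cases.

```java
final class NaiveWildcardRewriter {

    // Rewrites tokens like "*.pdf" or "*pdf" to "filetype:pdf"; other tokens are kept.
    static String rewrite(String query) {
        StringBuilder out = new StringBuilder();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (token.startsWith("*") && token.length() > 1) {
                String ext = token.substring(1);
                if (ext.startsWith(".")) {
                    ext = ext.substring(1);
                }
                if (!ext.isEmpty()) {
                    token = "filetype:" + ext;
                }
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token);
        }
        return out.toString();
    }
}
```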
Michael Peter Christen
25250405f1 solr servlet preparation for join with jetty branch 2013-12-20 00:45:58 +01:00
Michael Peter Christen
09412ea3a4 counting search requests in solr interface 2013-12-12 03:37:19 +01:00
Michael Peter Christen
78eac85161 better calibration of caches and queue maximum sizes 2013-12-04 23:15:10 +01:00
Michael Peter Christen
c8af19bd37 removed unnecessary check which causes a NPE when searching with empty
search string
2013-12-04 17:58:36 +01:00
Michael Peter Christen
6f3a923691 fixed urlmask which was not able to combine several constraints 2013-12-04 13:48:01 +01:00
Michael Peter Christen
ae55d69ef6 include/exclude size NPE fix (recently added) 2013-11-26 11:47:04 +01:00
Michael Peter Christen
2c39b65409 fixes for searches containing stopwords. The fix was done by
reworking the access method of the search word sets so that words cannot
be deleted from the sets from outside the QueryGoal class.
2013-11-26 02:24:47 +01:00
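Note on the commit above: a common way to protect an internal set against deletion from outside is to hand out a read-only view. This is a generic sketch with a hypothetical class `GoalSketch`; it does not claim to show how QueryGoal was actually changed.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

final class GoalSketch {
    private final Set<String> includeWords = new LinkedHashSet<>();

    GoalSketch(String... words) {
        Collections.addAll(includeWords, words);
    }

    // Callers can read and iterate, but remove() or add() on the returned
    // view throws UnsupportedOperationException.
    Set<String> getIncludeWords() {
        return Collections.unmodifiableSet(includeWords);
    }
}
```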
orbiter
61409788eb less word hash computations (removing some overhead because of MD5
calcs) using the clear word in a normalized form.
2013-11-25 15:20:54 +01:00
Michael Peter Christen
bf1bdd52a6 prevent requesting of 0-facets (which actually exist) 2013-11-15 15:41:41 +01:00
Michael Peter Christen
087df05e24 added option to Config_Network_p.html to enable remote search while
DHT-Receive is switched off.
2013-11-13 13:38:01 +01:00
Michael Peter Christen
1a4a69c226 set more loggers to 'final static' 2013-11-13 06:18:48 +01:00
Michael Peter Christen
69b8d61c47 fix for search requests in GSA interface which contain 'funny'
characters (like ':' etc.)
2013-11-12 15:54:54 +01:00
reger
7b17cdf6dd add content_type:image/* to image search
- see numerous idx entries with content_type image without url_file_ext_s (for various reasons) which should be included in result
- try it yourself with following sample query
   /solr/select?q=content_type:image/* AND -url_file_ext_s:[* TO *]&defType=edismax&fl=sku,url_file_ext_s,content_type

also addresses URLs without an extension or with a deviating extension.
2013-11-07 03:11:03 +01:00
Michael Peter Christen
1b4fa2947d - fixed a problem which occurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together now make it possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
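Note on the commit above: a sketch of classifying a document by its mime type first and falling back to the file extension only when no usable mime type exists. The enum, the class name and the short extension list are assumptions for illustration.

```java
import java.util.Locale;

final class ContentDomainSketch {

    enum Domain { TEXT, IMAGE, UNKNOWN }

    // Prefers an existing mime type assignment; the extension is only a fallback.
    static Domain classify(String mimeType, String url) {
        if (mimeType != null && !mimeType.isEmpty()) {
            String m = mimeType.toLowerCase(Locale.ROOT);
            if (m.startsWith("image/")) {
                return Domain.IMAGE;
            }
            if (m.startsWith("text/")) {
                return Domain.TEXT;
            }
        }
        String path = url.toLowerCase(Locale.ROOT);
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        int dot = path.lastIndexOf('.');
        String ext = dot >= 0 ? path.substring(dot + 1) : "";
        switch (ext) {
            case "jpg": case "jpeg": case "png": case "gif":
                return Domain.IMAGE;
            case "html": case "htm": case "txt":
                return Domain.TEXT;
            default:
                return Domain.UNKNOWN;
        }
    }
}
```

With this order, a page whose name ends in .jpg but which is served as text/html is treated as text, which matches the commons.wikimedia.org case mentioned in the message.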
Michael Peter Christen
78e7aadb26 removed unused initialization method 2013-10-07 23:51:28 +02:00
Michael Peter Christen
4fbc4740df removed warnings 2013-10-07 23:41:50 +02:00
orbiter
8ac2e8c8c9 added location navigator which makes the image link to the map search
visible whenever a location is available in the search result.
To activate this, the search.navigation property in yacy.conf must be
modified to the new default values.
2013-09-24 11:26:51 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
only the unique links! This made it necessary to adapt a large portion of the
parser and link processing classes to carry a different type of link
collection, which carries the property attributes that are attached to
web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
85456f46b2 added two new fields, exact_signature_copycount_i and
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assign this count to each document. Thus, each
document is assigned a number which shows how many copies of this
document exist.
These fields are disabled by default.
2013-09-04 23:11:53 +02:00
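Note on the commit above: a sketch of how a copy count per document could be derived from document signatures. It assumes that the count is the number of other documents sharing the same signature (so a unique document gets 0); whether the real fields count this way is not stated in the message. `CopyCountSketch` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class CopyCountSketch {

    // Input: document id -> signature. Output: document id -> number of copies.
    static Map<String, Integer> copyCounts(Map<String, Long> signatureByDocId) {
        // First pass: count how many documents share each signature.
        Map<Long, Integer> perSignature = new HashMap<>();
        for (Long signature : signatureByDocId.values()) {
            perSignature.merge(signature, 1, Integer::sum);
        }
        // Second pass: assign to each document the count of the other documents.
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : signatureByDocId.entrySet()) {
            result.put(e.getKey(), perSignature.get(e.getValue()) - 1);
        }
        return result;
    }
}
```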
Michael Peter Christen
a2511b5600 turned images_alt_txt back to images_alt_sxt because it is not necessary
to index the alt text. Indexed image Text is in images_text_t
2013-09-04 10:47:18 +02:00
Michael Peter Christen
85b1922244 activated image type navigation for image search 2013-09-03 13:34:01 +02:00
Michael Peter Christen
9e12fdff23 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-03 12:22:57 +02:00
Michael Peter Christen
ab1201fdfd fixed wrong facet count 2013-09-03 12:22:29 +02:00
Michael Peter Christen
049c3b3f2e added an option to exclude image search results from text search. This
is on by default.
2013-09-03 11:14:23 +02:00
Michael Peter Christen
a8c5bfcf58 avoid creating unnecessary objects 2013-09-03 09:48:05 +02:00
Michael Peter Christen
dc179bd61f fix for catchall query goal for image search 2013-09-03 07:55:21 +02:00
reger
392174de8c remove all_words, all_strings lists from QueryGoal
- only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only
2013-09-02 23:09:43 +02:00
Michael Peter Christen
169ef8963d one more fix for image search 2013-09-02 20:02:26 +02:00