Commit Graph

325 Commits

Author SHA1 Message Date
Michael Peter Christen
c115f3869c enhanced snippet computation and test method in ViewFile 2014-07-28 15:42:57 +02:00
orbiter
1027f3d04a fix for the usage of ready-prepared solr queries, some queries are
formulated as edismax query but this was not set as query attribut. The
defType=edismax property needs a qf-field, so this was added as well. Do
not remove that field again! This fixes also a problem with title-unique
computation.
2014-07-25 18:53:13 +02:00
Michael Peter Christen
6e1dc444c3 added a snippet test function in ViewFile: you can now search for a
specific word on the document; the servlet returns the snippet in the
same way as it would be shown in a search result.
2014-07-24 14:59:37 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL Object type which
now also has a different toString method that the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls, just just toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
reger
336425912a remove unused localSearchThread from SearchEvent 2014-07-10 02:14:03 +02:00
orbiter
59160984cc timeline performance update 2014-07-03 13:06:29 +02:00
Michael Peter Christen
1cd4b2e8be Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-07-01 16:06:12 +02:00
Michael Peter Christen
8c52f0651b refactoring of AccessTracker events & timeline fix 2014-07-01 16:06:01 +02:00
reger
431a5f9c4e added test case for TextSnippet,
removed obsolete/unused parameter and reference to MediaSnippet
2014-06-30 05:36:48 +02:00
Michael Peter Christen
f5b817bac4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-06-29 22:25:08 +02:00
reger
a5707cd2eb enable proper Author navigator
- author facet is based on omitted author_sxt field
- adjust to make author nav available on exist of author field but keep using author_sxt to construct the facet (why!?)
- add check for querymodifier author in searchevent
2014-06-27 23:05:06 +02:00
Michael Peter Christen
74206a10c7 refactoring 2014-06-27 14:40:36 +02:00
orbiter
fec673c9d1 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-06-27 10:15:37 +02:00
orbiter
c59da9fe7a added access tracker log reader stub 2014-06-27 10:14:36 +02:00
Michael Peter Christen
b893c42a0f bugfix for image search 2014-06-26 12:56:33 +02:00
orbiter
0bbb5040b8 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-06-15 12:38:52 +02:00
orbiter
9d5d86cd03 Added filter query options to the ranking servlet /RankingSolr_p.html.
Filter queries are not actually related to ranking, but user requests
have pointed out that specific boost queries to move results to the end
of the result list are not sufficient. Such boost filters may be better
executed as actual filter and therefore such a filter can now be
statically applied to every search request. A typical use could be the
expression "http_unique_b:true AND www_unique_b:true" which uses the
recently introduced fields http_unique_b and www_unique_b which are true
only for one of the alternatives with/without http(s) and with/without
prefix 'www.' in host names.
2014-06-15 12:38:30 +02:00
Michael Peter Christen
d2151857f1 Added collection navigation:
The collection field (can be filled i.e. in Crawl Start) can be used to
add categories to YaCy index entries. The usage of that field was
restricted to solr searches and post argument filters as implemented in
commit f7571386a3.
This commit extends collections to a full navigation option in the
standard YaCy search interface. The field is not active by default but
can be activated easily in the /ConfigSearchPage_p.html servlet (just
check the 'Collection' facet field). Collections can now be used for (at
least) two purposes:
- to provide search tenants (through post argument collection)
- to provide self-made category navigation
Search requests may now have (independently from switched on or off
collection facet) a "collection:<collection-name>" modifier attached;
firthermore collection names may use disjunctions using the '|' pipe
symbol. For example, this is a valid search request:
www collection:user|proxy
2014-06-15 12:11:23 +02:00
Michael Peter Christen
f0db501630 better handling of ranking parameters and new default values for date
navigation which is done using ranking in solr.
2014-05-22 03:01:07 +02:00
Michael Peter Christen
4e734815e8 enhanced snippets: remove lines which are identical to the title and
choose longer versions if possible. Prefer the description part.
2014-05-06 16:48:50 +02:00
reger
727dfb5875 refactore URIMetadataNode to further unify interaction with index
-  URIMetadataNode extending SolrDocument
- use language as stored (String), reducing conversion to string
- optimize debug code in transferIndex
2014-04-20 01:41:30 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into it's own queue. The primary
rule for urls taken from any queue is, that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermorem the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create 10.000s of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such on-demand file reader shall prevent that the number of file
pointers is over the system limit, which is usually about 10.000 open
files. Some parts of YaCy had to be adopted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adopted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
Michael Peter Christen
8b44fcf0f4 added missing @Override annotation 2014-03-28 13:48:37 +01:00
Michael Peter Christen
cbdfef7ce1 changed protocol facet to show also all other counts if one facet is
selected
2014-03-27 13:29:14 +01:00
reger
ca7444dbdf limit filetype nav to known extension also on image/media search
- on text search we limit filetype nav already to known extension, apply filter to image search
2014-03-23 23:10:29 +01:00
Michael Peter Christen
d1091e79f8 - added stealth button to navigation menu
- more fixes to progress bar
2014-03-21 18:01:26 +01:00
Michael Peter Christen
6ed9c0164e attaching names to all Threads to get a better view in profiling tools
like VisualVM
2014-02-28 15:02:01 +01:00
Michael Peter Christen
0f6b72f24b do not use luke requests for remote solr servers if the result is
different from normal requests. This happens if the remote solr is
actually a solrCloud; in such cases the luke request returns only the
result of the single solr peer, not the whole cloud.
also done: some refactoring.
2014-02-26 14:30:48 +01:00
Michael Peter Christen
751c128544 extra sleep for remote searches enhances search results because there is
more time for more remote peers to contribute on the first result page
2014-02-09 14:57:17 +01:00
Michael Peter Christen
0cabcbbe83 more efficient wordcount 2014-02-09 14:45:12 +01:00
reger
b693ce9759 allow combining selection of different search nav's (facets)
- selecting more than one nav combines the 2 selections (with AND)
- unselecting one nav clears all selected

(e.g. select filetype:pdf and /language/fr shows ~ french pdf's only)
2014-01-30 22:57:27 +01:00
reger
cb71413d19 fix page nav, to keeping modifier
(was new issue)
2014-01-30 22:00:32 +01:00
orbiter
416481c33e added a boost on appearance of combined words (in the same order the
user submitted that) when searching for more than one word
2014-01-30 10:51:08 +01:00
reger
9b24dae2b7 add language navigation filter clause to rwi results 2014-01-25 22:59:23 +01:00
reger
f307d65dcf prepare for a language navigator
works fine to restrict language for local solrSearches.
More work needs to be done to make rwi/remote searches respect the modifier.language restriction.
2014-01-24 03:11:25 +01:00
orbiter
5ec0c969c9 fix for http://bugs.yacy.net/view.php?id=354 2014-01-22 20:59:53 +01:00
Michael Peter Christen
1ea17bd9f3 - removed old metadata database and all migration code
- refactored all code which uses URIMetadataRow as standard for word
hash length and word hash ordering and moved that to the class 'Word',
becuase the class URIMetadataRow defined the old metadata data structure
and should be superfluous in the future
- removed unused methods from URIMetadataRow as preparation for further
removal of that class
2014-01-20 18:31:46 +01:00
reger
97e84439fb adjusted ConfigHeuristic and changed QueryGoal.getOriginalQueryString to .getQueryString
- since specific heuristic Twitter & Blekko is not longer available or redundant with OpenSearchHeuristic,
adjusted ConfigHeuristic to use OpensearchHeuristic settings only.
For this the default OSD search target list is made available (copied) by default and the other configs are removed.

- the return of QueryGoal.getOriginalQueryString includes the queryModifier, which are held separately in a modifier object,
but in most (all) cases just the query term is expected, clarified and renamed it to QueryGoal.getQueryString which returns
just the search term (if needed a .getOrigianlQueryString could be implemented in Queryparameters, adding the modifiers)

- started to adjust internal html href references from absolute to relative (currently it is mixed).
For future development we should prefer relative href targets (less trouble with context aware  servlets)
2014-01-20 00:58:17 +01:00
Michael Peter Christen
f8ce7040ab remote search peer selection schema change:
- all non-dht targets (previously separated into 'robinson' for dht-like
queries and 'node' for solr queries) are non 'extra' peers, which are
queries using solr
- these extra-peers are now selected using a ranking on last-seen,
peer-tag-matches, node-peer flags, peer age, and link count. The ranking
is done using a weight and a random factor.
- the number of extra peers is 50% of the dht peers
- the dht peers now exclude too young peers to prevent bad results
during strong growth of the network
- the number of dht peers (and therefore extra-peers) is reduced when
the memory of the peer is low and/or some documents still appear in the
indexing-queue. This shall prevent a peer from deadlocks when p2p
queries are made in a fast sequence on weak hardware.
2014-01-16 17:27:14 +01:00
Michael Peter Christen
9bd71fdbb4 made the access tracker class static because it shall be used by the
jetty auth module
2014-01-05 05:04:28 +01:00
Michael Peter Christen
b9d36e45e0 removed the &amp explicit encoding of ampersand character since this is
double-translated within the template replacement process.
2014-01-05 03:40:10 +01:00
orbiter
dcf46ce8f6 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-12-31 15:20:49 +01:00
orbiter
343d2ef49a new data type for access tracker (unfinished) 2013-12-31 15:20:34 +01:00
reger
dd8ea0cdd6 fix "add to blacklist" button style in IndexControlRWIs_p
- added default filename filter to select field (as only addition to *.black list is permanent)

- modified Blacklist_p header/legend to show all active blacklists 
  (to support understanding that all configured lists are active)
- removed obsolete code in Blacklist_p servlet
2013-12-30 20:03:59 +01:00
reger
abbf487023 fix QueryGoal Image query (missing space)
see query log example .. url_file_ext_s:(jpg OR png OR gif) ORcontent_type:(image/*)) ..
2013-12-29 20:14:10 +01:00
reger
26e9d7e066 fix NPE in IndexControlRWIs_p.html
- metatags my be null
Caused by: java.lang.NullPointerException
	at net.yacy.search.query.QueryParams.getFacets(QueryParams.java:445)
	at net.yacy.search.query.QueryParams.getBasicParams(QueryParams.java:400)
	at net.yacy.search.query.QueryParams.solrTextQuery(QueryParams.java:345)
	at net.yacy.search.query.QueryParams.solrQuery(QueryParams.java:334)
	at net.yacy.search.query.SearchEvent.<init>(SearchEvent.java:290)
	at net.yacy.search.query.SearchEventCache.getEvent(SearchEventCache.java:176)
	at IndexControlRWIs_p.genSearchresult(IndexControlRWIs_p.java:641)
	at IndexControlRWIs_p.respond(IndexControlRWIs_p.java:141)
2013-12-29 08:05:37 +01:00
orbiter
3961b643a3 write solr searches to search log 2013-12-29 01:25:44 +01:00
Michael Peter Christen
25f9c35033 add patch which shall prevent that naive search mistakes like usage of
regular expressions cause no results. Usage of '*' followed by a dot or
any expression will now cause that this expression is used as a filetype
search.
2013-12-27 00:34:55 +01:00
Michael Peter Christen
25250405f1 solr servlet preparation for join with jetty branch 2013-12-20 00:45:58 +01:00
Michael Peter Christen
09412ea3a4 counting search requests in solr interface 2013-12-12 03:37:19 +01:00