Commit Graph

1346 Commits

Author SHA1 Message Date
luccioman
9e86d183b8 Disable manual search results resorting when resorting is done with JS
Also added a constant for the js resorting setting key.
2017-09-13 07:58:05 +02:00
luccioman
66cb9c4ff9 Added Solr filter queries for audio, video and application domains
Inspired from the existing one used on image search, and consistent with
post filtering on content domain applied in SearchEvent.addNodes().

These filters are quite simplistic but at least audio, video or
application search now return results. Previously, when filtering on
these content domains, many results pages (and often even the first
page) were empty while the total results count suggested that results
should be available. This was because filtering on domain was only
applied AFTER requesting Solr indexes.
2017-09-08 11:16:37 +02:00
luccioman
5d3ceb31b7 Improved search navigators counters accuracy and consistency.
- added some missing increments from RWI results
- decrement relevant navigator counts when solr or RWI results are
evicted because duplicates detection or constraints checked belatedly
- do not compute facets when unnecessary to avoid unwanted CPU load
- do not increment from facets when already done
- do not rely on facets on remote solr peers requests, as most of the
time only a limited part of their total results if fetched (thus also
preventing unnecessary load on remote peers)
- use a concurrency friendly score map for the dates navigators to
prevent unwanted ConcurrentModificationExceptions

This improves the situation for the most obvious inconsistencies in
search navigators counts, but more has to be done for a true accuracy
(notably when query modifiers constraints are applied belatedly - after
the solr or RWI retrieval request - such as the content domain
constraint)
2017-09-06 16:58:40 +02:00
luccioman
a28428047a Fixed count of filtered results from local solr.
Was inadequately modified in my previous related commits (making next
pages buttons unavailable in Search portal mode), as
SearchEvent.local_solr_available did not count the total filtered
results but only the ones within the currently fetched result page(s).
2017-08-31 11:24:59 +02:00
luccioman
3c9df6e0ce Use local solr filtered results in total search results count.
This modification has indeed low incidence as eventual query modifiers
are already applied when requesting the local solr index. 
It mainly impact doublons detected with results from remote peers.

Also updated javadocs for clarification.
2017-08-30 12:23:45 +02:00
luccioman
a1a0515312 Added a button to manually refresh sorting of p2p search results.
As a server-side oriented alternative to the JavaScript realtime
resorting feature proposed in PR #104.
The goal is the same as in this PR : having the possibility compensate
the network latency of various peers results fetching and obtain once
possible a consistently ranked result set.
2017-08-28 19:03:51 +02:00
luccioman
4eba88f2ff Removed some unnecessary uses of java.lang.reflect api.
This improves code browsing and readability, making search by references
or call hierarchy IDE features more accurate.
2017-08-24 18:47:18 +02:00
luccioman
da3dbf9ea1 Use Javadoc style comments on SearchEvent properties.
For better code readability and understanding.
2017-08-23 08:20:37 +02:00
reger
e918ec199e Replace deprecated ConcurrentHashSet with recommended Java8
ConcurrentHashMap.newKeySet() in postprocessDocuments()
2017-08-06 23:26:27 +02:00
reger
275d65fffe Patch last_modified date with internal FirstSeenTime() if no date provided
to make sure updated documents are indexed with their last-modified
date as provided in current crawl. 
(to patch moddate always with firstseen might bear the risk of miss 
actual updates).
2017-08-05 22:30:06 +02:00
luccioman
0ee8c030c4 Log an error when Solr folder migration fails for some reason. 2017-07-17 15:35:10 +02:00
luccioman
dcc56318bb Made remote search max system load limits configurable from UI.
As reported by davide on YaCy forums (
http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6004 ) when the
system is on high load, unless reading carefully YaCy configuration
file, it could be difficult to understand why remote search results are
not fetched.
2017-06-30 11:30:54 +02:00
reger
ddd13b776d Add keyword constraint to rwi query result filter
To discard rwi results not matching query keyword: parameter
2017-06-30 02:11:18 +02:00
luccioman
8da3174867 Ensure lower case conversion consistency with any default locale.
Especially for Turkish speaking users using "tr" as their system default
locale : strings for technical stuff (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' doesn't becomes
'i' like in other locales such as "en", but becomes 'ı'.
2017-06-27 06:42:33 +02:00
luccioman
28b451a0b3 Made Cache compression level and lock timeout user configurable 2017-06-14 19:02:08 +02:00
luccioman
a7394b479b Limit the synchronization blocking time on some Cache operations.
Using a Reentrant lock instead of the intrinsic synchronization lock
permits limiting the blocking time to acquire a lock.

Useful on a very busy Cache concurrently accessed by many threads : when
the time to acquire a lock is too high, getting/storing content on the
cache becomes inefficient, and it is then better to fall back to loading
remote resources.

Illustrated by the CacheTest stress test and some traces reported in
mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )
2017-06-14 09:13:50 +02:00
Michael Peter Christen
6fe735945d migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8
Also: now Version 1.921
2017-06-09 12:25:23 +02:00
luccioman
8399275142 Properly close file output streams even on exceptions scenarios. 2017-06-08 07:19:16 +02:00
luccioman
a04feac064 Ensure file input streams proper closing in both success and failures
Also add when possible a warning level log message on input stream
closing error instead of failing silently. This could help understanding
some IO exceptions such as "too many files open".
2017-06-03 04:00:46 +02:00
reger
a814f3d885 Introduce keyword query parameter
This enables keyword navigator to filter on keywords. Added search page
output and layout config for keywords, allowing e.g. in Intranet use
to display the keywords. No styling or links applied to the keyword
text (but is desirable possibly in combination with bootstrap-tagsinput
for future/intranet).
2017-06-02 01:00:21 +02:00
luccioman
02ec0ed13c Quoted param value in Solr query to avoid unwanted traces in logs
When Webgraph Solr core is enabled, crawling and removing from index an
URL whose hash starts with the '-' character (example URL :
https://cs.wikipedia.org/ whose hash is "-2-HuTEndn4x") produced a full
ParseException stack trace in YaCy logs. This was not blocking because
the Solr query parser is able to escape itself the query and run it
successfully, but filled uselessly YaCy logs.
2017-05-24 08:43:03 +02:00
Michael Peter Christen
3b1d640a3c enhanced debugging 2017-05-18 00:28:12 +02:00
luccioman
31ad043bb9 Added user interface feedback on results feeding termination status.
Added as an additional icon with title in the search progress bar, to
inform about background search feeder threads terminated or still
running. While giving a bit more information to users about the p2p
search process, this can help choosing whether or not wait a little bit
more time before going to the next page, in order to get results from
various sources sorted as best as possible (see #91 for a discussion
about sorting accuracy and network latency).

Other related modifications included :
 - regular updates to statistics in the progress bar until the
background feeders are completely terminated.
 - removed some uses of unsecure and discouraged JavaScript elements
2017-05-15 13:15:16 +02:00
sgaebel
ff6392215e added closing of lst-Tag in solr-Export 2017-05-13 20:38:25 +02:00
luccioman
d90b001e1b Improved previous merge "Show ranking in HTML UI".
- added the new setting as configurable in the "Debug/Analysis" settings
page. Debug/analysis is its main purpose for now as there is currently
no nice and "understansable" ranking score info servlet (see forum
discussion http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5884 ) 
- render in the "Search Page Layout" page preview when enabled
- added constants
2017-05-11 18:02:33 +02:00
luccioman
654801523e Fixed StringIndexOutOfBoundsException case.
Revealed by commit c77e43a : the exception was then thrown when indexing
pages containing mailto: scheme URL links with the Solr Webgraph core
enabled.
Fixed the error case and restored filtering on mailto links in
Document.resortLinks() as these URLs still should not appear in
Document.hyperlinks.
2017-05-09 18:32:47 +02:00
Michael Peter Christen
200b100fb8 added patch to rewrite altered yacy grid schema into yacy schema
This generates the stub and protocol parts of an url for inboundlinks,
outboundlinks and images
2017-05-01 11:38:02 +02:00
reger
9ad4d16829 Add a responsHeader to the solr index export with a format identifier
and export parameter (in accordance with response xml format) for easier
format detection on import.
2017-04-30 23:53:52 +02:00
luccioman
9697209ef6 Fixed Index Export feature for compatibility with old indexed documents.
This is a fix for mantis 682 (http://mantis.tokeek.de/view.php?id=682)
and issue #116
2017-04-28 11:39:51 +02:00
Michael Peter Christen
973d74712f added yacy grid flatjson surrogate parser 2017-04-25 08:44:02 +02:00
luccioman
b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
This page was always empty, as described in mantis 740
(http://mantis.tokeek.de/view.php?id=740)
2017-04-24 18:24:26 +02:00
luccioman
527d494c1a Fixed "Unchecked conversion" compilation warnings. 2017-04-24 13:27:07 +02:00
Michael Peter Christen
335868edba Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-17 12:26:27 +02:00
luccioman
f66438442e Extended Mediawiki dump import to remote URLs.
When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file now is directly streamed and processed, allowing import of several
GB dumps even with a low memory remote peer, and without need to
manually download the dump file first.
2017-04-14 14:32:44 +02:00
reger
7b80189bda Activate hosts navigator plugin. This includes rwi results in the navigator
count.
This might be tangential related to http://mantis.tokeek.de/view.php?id=736
as the example includes a local index search, while rwi results are not
counted.
2017-04-10 22:42:06 +02:00
Michael Peter Christen
f5ad29edb1 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-07 09:15:15 +02:00
Michael Peter Christen
76e9135526 added flatjson parser (stub, unfinished) 2017-04-07 09:15:05 +02:00
reger
b7417ac329 Introduce a Keyword search navigator using the index field keywords.
The keywords field string is split into words as navigator entries.

A keyword navigator facet is essential for search appliance usage were
documents and metadata use often specialized keyword vocabularies to 
filter search results. This navi can be used without custom index schema.

As we don't have defined a search query command to filter "keywords" yet,
the filtering is limited by adding the keyword to the search query.
2017-04-05 00:08:25 +02:00
reger
ba339a2a45 Add servlet to import warc file from filesystem IndexImportWarc_p.html.
Apply Importer interface to WarcImporter
2017-04-02 03:32:21 +02:00
Michael Peter Christen
1d81b8f102 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-01 01:04:27 +02:00
Michael Peter Christen
69081bce00 added export to elasticsearch. The export dump can easily be imported to
elasticsearch using the command
curl -XPOST localhost:9200/collection1/yacy/_bulk --data-binary
@yacy_dump_XXX.flatjson
2017-04-01 01:04:17 +02:00
reger
510f11d374 Implement surrogate import from Warc archives (as first option handle
warc = Web ARChive File Format.
Warc files with extension .warc or compressed warc.gz can be placed in the
DATA/surrogate/in and contained responses are imported to the index.
The used library is stream based so we can easily extend it later to use
and load warc's from the net.
2017-03-31 00:58:11 +02:00
reger
3dd23c178b Introduce the option to configure a shutdown port.
A port value of -1 will disable this option.

If set to a value greater 0, YaCy listens on this of on the local loopback 
address (127.0.0.1) for a shutdown or restart signal.
E.g. connect to http://localhost:8005/shutdown will stop the YaCy server.
http://localhost:8005/restart will restart it.
This option allows to stop YaCy locally independant from the web web 
frontend (which might be configured for password protected remote access).
2017-03-19 02:30:08 +01:00
reger
a2afb4bae0 add switchboardconstants for server ports config keys 2017-03-18 20:02:26 +01:00
reger
9b6d1abd9e eliminate some compiler unchecked and deprecation warnings
in nav plugins by explicite type declaration and replacing date.getYear
with Calendar.get
2017-03-09 01:42:36 +01:00
luccioman
0173b0bc32 Added an advanced settings page for referrer policy settings.
Feedback will be welcome, notably on the descriptive content of this
page.
2017-03-03 12:05:30 +01:00
luccioman
cdcd923375 Privacy enhancement : added settings to control referrer policy.
HTTP "Referer" header sent by the browser when using YaCy can now be
controlled either with the referrer meta tag as a global policy, or only
for search result links by adding the attribute rel="noreferrer".

To improve privacy with the less possible regressions, the default is
set as meta tag with value "origin-when-cross-origin" : internal YaCy
links behavior is not affected, but when visiting external websites
referrer url is not empty but stripped from query parameters and path.

Older browsers, Safari, MS IE and Edge do not support the referrer meta
tag, so the standard but less flexible noreferrer link type can also be
enabled as an alternative.

User-friendly settings page to be implemented.
2017-02-28 18:11:54 +01:00
reger
86534a56f7 fixed ReindexSolrBusyThread new and unexpected repeat of same query with
low number of found documents - by adding additional end condition to 
remove processed query with number of found docs <= process-chunck-size.

Noticed on query h4_txt:[* TO *], found 21, process 21, call of commit happend
but on next cycle same query again 21 docs found (while h4_txt was removed 
from schema and committed inputdocuments).
2017-02-27 23:00:46 +01:00
luccioman
ac766327d3 Switched a few more Solr fields from strictly mandatory to optional 2017-02-24 11:08:18 +01:00
Burkhard
4fdc11cae8 Update SearchEvent.java
Fix NPE on disabled local SolrIndex, occuring on search moving to the 2nd result page.
The debug purpose only setting to disabeling local SolrIndex (System Admin -> Debug Settings) should long term probably be removed from production code.
2017-02-22 02:01:48 +01:00