13180 Commits

Author SHA1 Message Date
luccioman
31ad043bb9 Added user interface feedback on results feeding termination status.
Added as an additional icon with title in the search progress bar, to
inform about background search feeder threads terminated or still
running. While giving a bit more information to users about the p2p
search process, this can help them choose whether or not to wait a little
longer before going to the next page, in order to get results from
various sources sorted as best as possible (see #91 for a discussion
about sorting accuracy and network latency).

Other related modifications included :
 - regular updates to statistics in the progress bar until the
background feeders are completely terminated.
 - removed some uses of insecure and discouraged JavaScript elements
2017-05-15 13:15:16 +02:00
sgaebel
ff6392215e added closing of lst-Tag in solr-Export 2017-05-13 20:38:25 +02:00
luccioman
d90b001e1b Improved previous merge "Show ranking in HTML UI".
- added the new setting as configurable in the "Debug/Analysis" settings
page. Debug/analysis is its main purpose for now as there is currently
no nice and "understandable" ranking score info servlet (see forum
discussion http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5884 )
- render in the "Search Page Layout" page preview when enabled
- added constants
2017-05-11 18:02:33 +02:00
luccioman
efe1232d90 Merge branch 'html-show-ranking' of
https://github.com/JeremyRand/yacy_search_server

Conflicts:
	defaults/yacy.init
2017-05-11 14:53:57 +02:00
luccioman
0f0f42b509 Added some JavaDoc 2017-05-11 08:33:19 +02:00
reger
077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
collection
2017-05-09 22:52:54 +02:00
luccioman
654801523e Fixed StringIndexOutOfBoundsException case.
Revealed by commit c77e43a : the exception was then thrown when indexing
pages containing mailto: scheme URL links with the Solr Webgraph core
enabled.
Fixed the error case and restored filtering on mailto links in
Document.resortLinks() as these URLs still should not appear in
Document.hyperlinks.
2017-05-09 18:32:47 +02:00
luccioman
b297f5bdbe Updated Debian package post install script admin password encoding.
To fit the now default HTTP authentication method set to Digest in
commit f7fce1b.
Also fixed the unauthenticated access from localhost setting when first
installing the Debian package and leaving the prompted password field
empty.
2017-05-09 12:20:41 +02:00
luccioman
7623d7728f Fixed Debian install message misspelling. 2017-05-09 12:15:41 +02:00
luccioman
522a268305 Improved new blacklist entries URL scheme detection. 2017-05-04 16:36:45 +02:00
luccioman
532981b363 Updated putHTML() JavaDoc 2017-05-04 11:21:27 +02:00
luccioman
58d23047dd Handle '?' and '+' chars as valid wild cards when adding to blacklist.
An entry such as "domain.com/[a-z]+" is a valid regular expression and
does not need additional ".*.*/.*" wildcards.
2017-05-04 11:19:59 +02:00
luccioman
4564541b3b Fixed blacklist Regex containing '+' characters rendering.
As reported on YaCy forum by shni
(http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5970) when a
blacklist entry contained both '?' and '+' characters, the '+' chars
were wrongly decoded and rendered as spaces.
2017-05-04 11:12:58 +02:00
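
The symptom described in the commit above is typical of form-style URL decoding, where a literal '+' stands for a space. The following is a minimal sketch in Python, not YaCy's actual Java code; the sample blacklist entry is made up for illustration.

```python
from urllib.parse import quote, unquote, unquote_plus

# Hypothetical blacklist entry containing both '?' and '+' (illustration only).
entry = "domain.com/a+b?c"

# A query value where '?' was percent-encoded but '+' was left as-is.
partially_encoded = "domain.com/a+b%3Fc"

# Form-style decoding turns the literal '+' into a space: the reported symptom.
form_decoded = unquote_plus(partially_encoded)

# Plain percent-decoding keeps the '+' untouched.
plain_decoded = unquote(partially_encoded)

# Encoding every reserved character (safe="") escapes '+' as %2B,
# so the value survives either decoding style.
fully_encoded = quote(entry, safe="")
round_trip = unquote_plus(fully_encoded)
```

Escaping '+' as `%2B` before the value is placed in a URL is one common way to keep it intact regardless of which decoder the receiving side uses.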
luccioman
0612a8f4f2 Fixed the previously added link to scheduled dump operations. 2017-05-04 08:45:30 +02:00
luccioman
a87281b498 Added MediaWiki dump import scheduling feature.
Checking the last modified date by default to prevent unnecessary long
running operations.
2017-05-03 18:53:01 +02:00
luccioman
10c03c6c64 Improved MediaWiki dump import monitoring.
When import thread is terminated :
 - now stop refreshing and stay on the monitoring page to give user a
feedback after a long running import
 - added link to the next monitoring step : results from surrogates
reader
 - added link to new import
 
On the new import page, added a link to the last import report, if any.
2017-05-02 09:38:45 +02:00
luccioman
edd7ccac40 Added some JavaDoc 2017-05-02 09:33:11 +02:00
luccioman
79fdf14b0a Fixed regression introduced by commit 9ad4d16
On MediaWiki dump imports, the SurrogateReader was trying to unread too
many bytes, then failing with the following exception :
"java.io.IOException: Push back buffer is full".
2017-05-02 09:32:04 +02:00
Michael Peter Christen
7678fd67e3 copied fix from yacy_grid_parser for wrong array type 2017-05-01 11:44:26 +02:00
Michael Peter Christen
200b100fb8 added patch to rewrite altered yacy grid schema into yacy schema
This generates the stub and protocol parts of an url for inboundlinks,
outboundlinks and images
2017-05-01 11:38:02 +02:00
reger
9ad4d16829 Add a responseHeader to the solr index export with a format identifier
and export parameter (in accordance with response xml format) for easier
format detection on import.
2017-04-30 23:53:52 +02:00
luccioman
9697209ef6 Fixed Index Export feature for compatibility with old indexed documents.
This is a fix for mantis 682 (http://mantis.tokeek.de/view.php?id=682)
and issue #116
2017-04-28 11:39:51 +02:00
luccioman
88c062639b Added some JavaDoc 2017-04-28 11:36:48 +02:00
luccioman
8d288f5dba Crawl results page : apply table lines number limit.
Take into account the already existing default limit value (especially
useful after a long crawl or surrogates import), or a custom one from
parameter "count".
Added a "Show all" link for convenience.
2017-04-27 18:24:54 +02:00
luccioman
31fff2c986 Extended WikiCode template inclusion syntax support.
Wiki templates are not rendered but syntax support is improved, which
greatly enhances snippets rendering on search results coming from a
MediaWiki dump import.
Tested on various dumps from Wikimedia at
https://dumps.wikimedia.org/backup-index.html
See also Wikipedia transclusion documentation at
https://en.wikipedia.org/wiki/Wikipedia:Transclusion
2017-04-27 09:50:04 +02:00
Michael Peter Christen
973d74712f added yacy grid flatjson surrogate parser 2017-04-25 08:44:02 +02:00
luccioman
b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
This page was always empty, as described in mantis 740
(http://mantis.tokeek.de/view.php?id=740)
2017-04-24 18:24:26 +02:00
luccioman
527d494c1a Fixed "Unchecked conversion" compilation warnings. 2017-04-24 13:27:07 +02:00
reger
2b03e40134 upd to jwat-1.0.5 2017-04-22 23:32:40 +02:00
reger
7a7da698d4 fix unit test MultiProtocolURL(file) assertion for Windows path with
drive letter.
2017-04-20 00:47:52 +02:00
reger
c77e43a391 Take out mailto collect in internal parsed document
As earlier plans to make use of mailto as separate webgraph entity didn't
materialize (see  http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5726&p=32493&hilit=mailto#p32493)
free the unused handling and resources.
2017-04-20 00:18:18 +02:00
Michael Peter Christen
335868edba Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-17 12:26:27 +02:00
reger
bec34d3546 Add url input field as source for WarcImporter
allowing to import warc from url without prior download.
2017-04-16 04:25:29 +02:00
reger
d3df8a46c4 fix unresolved_pattern on missing post parameter api/message.html 2017-04-14 21:14:26 +02:00
luccioman
f66438442e Extended Mediawiki dump import to remote URLs.
When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file now is directly streamed and processed, allowing import of several
GB dumps even with a low memory remote peer, and without need to
manually download the dump file first.
2017-04-14 14:32:44 +02:00
luccioman
e5c3b16748 Improved http client close time on stream processing errors. 2017-04-14 14:23:50 +02:00
luccioman
23775e76e2 Fixed endless loop case in wikicode processing.
Detected when importing recent MediaWiki dumps containing some pages
with script content in plain text format (see Scribunto extension
https://www.mediawiki.org/wiki/Extension:Scribunto ).

Further improvement : modify the MediawikiImporter to prevent processing
revisions whose <model> is not wikitext.
2017-04-12 17:17:03 +02:00
luccioman
0bc868a819 Improved support for non ASCII chars in local file system URLs
Creating a MultiProtocolURL instance from a File object and then
retrieving a File with getFSFile() was inconsistent with file paths
containing space or non ASCII chars.
2017-04-12 09:23:10 +02:00
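
The consistency requirement in the commit above (file path to URL and back without loss) can be illustrated with a small round trip. This is a sketch in Python under stated assumptions: absolute POSIX paths and UTF-8 percent-encoding. The helper names are hypothetical and do not correspond to the MultiProtocolURL API.

```python
from urllib.parse import quote, unquote

FILE_SCHEME = "file://"

def path_to_file_url(path: str) -> str:
    """Percent-encode an absolute POSIX path as a file URL (UTF-8 assumed)."""
    # quote() keeps '/' literal and escapes spaces and non-ASCII characters.
    return FILE_SCHEME + quote(path)

def file_url_to_path(url: str) -> str:
    """Inverse of path_to_file_url: strip the scheme and decode the escapes."""
    if not url.startswith(FILE_SCHEME):
        raise ValueError("not a file URL: " + url)
    return unquote(url[len(FILE_SCHEME):])

# Sample path with a space and non-ASCII characters (illustration only).
original = "/home/user/mes documents/été.txt"
url = path_to_file_url(original)
restored = file_url_to_path(url)
```

The point of the sketch is only that encoding and decoding must be exact inverses; if one side escapes characters that the other side does not unescape, the restored path no longer matches the file on disk.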
luccioman
7edddd7b0d Improved error reports on various wiki dump prerequisites failure cases.
Also added some JavaDoc.
2017-04-11 08:21:34 +02:00
luccioman
dfe8d4139b Used a text input for wiki dump import file selection.
Using an HTML "file" input was confusing (as reported by promocore on
YaCy forum : http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5965) ,
and it only worked with MS IE/Edge on a local YaCy peer :
 - for security reasons some current major browsers such as Firefox or
Chrome do not allow sending full file path information when using a file
form input
 - the local file system selection popup doesn't make sense when you
want to import a dump on a remote YaCy server
2017-04-11 07:34:17 +02:00
reger
3a71430030 Adjust ConfigSearchPage_p to activated hosts navigator as plugin 2017-04-10 22:58:20 +02:00
reger
7b80189bda Activate hosts navigator plugin. This includes rwi results in the navigator
count.
This might be tangentially related to http://mantis.tokeek.de/view.php?id=736
as the example includes a local index search, while rwi results are not
counted.
2017-04-10 22:42:06 +02:00
reger
05a1b14b4a add missing text from ConfigRobotsTxt_p to master.lng
and add a link to the Translation Editor on the Translation News page.
2017-04-09 21:42:05 +02:00
reger
a39c00a93f add servlet to list users in UserDB and made user editor available in
separate servlet for a quick and easy overview of configured users and
selection for edit.
2017-04-09 02:09:32 +02:00
reger
a4498e17c0 fix edit current user form to required post method
introduced with cde237b687
2017-04-08 22:54:57 +02:00
Michael Peter Christen
f5ad29edb1 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-07 09:15:15 +02:00
Michael Peter Christen
76e9135526 added flatjson parser (stub, unfinished) 2017-04-07 09:15:05 +02:00
reger
46a4aaf09c upd to Solr-5.5.4 2017-04-06 21:18:01 +02:00
reger
b7417ac329 Introduce a Keyword search navigator using the index field keywords.
The keywords field string is split into words as navigator entries.

A keyword navigator facet is essential for search appliance usage where
documents and metadata often use specialized keyword vocabularies to
filter search results. This navi can be used without custom index schema.

As we have not yet defined a search query command to filter "keywords",
the filtering is limited to adding the keyword to the search query.
2017-04-05 00:08:25 +02:00
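
The navigator idea in the commit above (split the keywords field into words, use the words as facet entries, refine by appending a word to the query) can be sketched as follows. This is an illustration in Python, not YaCy's implementation: the delimiters, the lowercasing and both function names are assumptions.

```python
import re
from collections import Counter

def keyword_facet(keyword_fields):
    """Count, for each keyword, how many documents mention it."""
    counts = Counter()
    for field in keyword_fields:
        # Assumed delimiters: commas, semicolons and whitespace.
        # A set is used so a word repeated in one document counts once.
        words = {w.lower() for w in re.split(r"[,;\s]+", field) if w}
        counts.update(words)
    return counts

def refine_query(query, keyword):
    """Filter by simply appending the keyword to the search query."""
    return query + " " + keyword

# Keywords fields of three hypothetical documents.
fields = ["solr, search", "search index", "Search"]
facet = keyword_facet(fields)
refined = refine_query("yacy", "search")
```

Because the refinement is a plain query term rather than a dedicated field filter, it may also match documents that contain the word outside the keywords field, which is the limitation the commit message mentions.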
reger
eddb7a9804 upd to pdfbox-2.0.5.jar and transitive dependency xmpcore-5.1.3.jar
required by metadata-extractor-2.10.1 (fix build.xml compiler warning)
2017-04-04 00:59:26 +02:00