Commit Graph

13401 Commits

Author SHA1 Message Date
luccioman
d98c04853d Ensure proper closing of file input streams. 2017-06-02 12:14:29 +02:00
luccioman
c53c58fa85 Ensure the ChunkIterator stream is closed in every possible use case.
Also log any close failures instead of failing
silently.
This should help prevent holding too many unreleased system file
handles, as in the case reported by eros on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988&sid=b00e7486c1bf7e48a0d63eb328ccca02
)
2017-06-02 09:47:45 +02:00
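A minimal sketch of the pattern described in the commit message above: close the stream on every code path and log a close failure rather than swallowing it. The class and method names (`CloseLoggingSketch`, `closeAndTrace`, `firstByte`) are hypothetical and are not taken from the YaCy code base.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.Logger;

class CloseLoggingSketch {

    private static final Logger LOG = Logger.getLogger(CloseLoggingSketch.class.getName());

    /** Closes the stream; a close failure is logged instead of being silently ignored. */
    static boolean closeAndTrace(final InputStream stream) {
        if (stream == null) {
            return true;
        }
        try {
            stream.close();
            return true;
        } catch (final IOException e) {
            LOG.log(Level.WARNING, "Could not close stream", e);
            return false;
        }
    }

    /** Reads the first byte and closes the stream whether the read succeeds or fails. */
    static int firstByte(final InputStream source) {
        try {
            return source.read();
        } catch (final IOException e) {
            LOG.log(Level.WARNING, "Read failed", e);
            return -1;
        } finally {
            closeAndTrace(source);
        }
    }

    public static void main(final String[] args) {
        System.out.println(firstByte(new ByteArrayInputStream(new byte[] { 42 })));
    }
}
```

Where the stream's whole lifetime fits in one method, a try-with-resources block gives the same guarantee with less code; the explicit helper is shown because the commit also mentions tracing close failures.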
luccioman
29e52bda39 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2017-06-02 01:47:53 +02:00
luccioman
a9cb083fa1 Improved consistency between loader openInputStream and load functions 2017-06-02 01:46:06 +02:00
reger
a814f3d885 Introduce keyword query parameter
This enables the keyword navigator to filter on keywords. Added search page
output and layout config for keywords, allowing them to be displayed,
e.g. for intranet use. No styling or links are applied to the keyword
text yet (this would be desirable, possibly in combination with
bootstrap-tagsinput, for future or intranet use).
2017-06-02 01:00:21 +02:00
luccioman
cbccf97361 Added JavaDoc to the getpageinfo_p API servlet. 2017-05-30 17:38:16 +02:00
luccioman
c226ded799 Fix unescape of URLs having some '%' chars but not percent-encoded 2017-05-30 12:32:14 +02:00
luccioman
bd88fd303e Deprecated duplicated and internally unused getpageinfo servlet.
Redirects are set up to ease the transition for any external users:
 - /api/getpageinfo.xml to /api/getpageinfo_p.xml
 - /api/getpageinfo.json to /api/getpageinfo_p.json
2017-05-30 09:29:28 +02:00
luccioman
306a82dd71 Fixed scraper NullPointerException cases on malformed URLs. 2017-05-30 08:48:20 +02:00
luccioman
aa55d71cf5 Fixed a NullPointerException case on Digest authentication.
Could occur when upgrading from a Debian package configured with Basic
authentication (as in release 1.92.9000) to a more recent one with
Digest authentication, without having re-encoded the admin password (for
example with dpkg-reconfigure).

As reported by eros on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988#p33686).
2017-05-29 19:16:09 +02:00
reger
b65a04087b upd to pdfbox-2.0.6.jar 2017-05-24 22:13:42 +02:00
luccioman
02ec0ed13c Quoted param value in Solr query to avoid unwanted traces in logs
When the Webgraph Solr core is enabled, crawling a URL whose hash starts
with the '-' character and then removing it from the index (example URL:
https://cs.wikipedia.org/ whose hash is "-2-HuTEndn4x") produced a full
ParseException stack trace in YaCy logs. This was not blocking because
the Solr query parser is able to escape the query itself and run it
successfully, but it needlessly filled YaCy logs.
2017-05-24 08:43:03 +02:00
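A minimal sketch of the quoting idea from the commit above, assuming the value is wrapped in double quotes so that a leading '-' is read as part of the term rather than as an operator. The helper names and the field name `source_id_s` are hypothetical; the actual query built by YaCy is not shown in the log.

```java
class SolrQuoteSketch {

    /** Wraps a value in double quotes, escaping embedded backslashes and quotes. */
    static String quote(final String value) {
        return '"' + value.replace("\\", "\\\\").replace("\"", "\\\"") + '"';
    }

    /** Builds a field:"value" clause (the field name is supplied by the caller). */
    static String fieldQuery(final String field, final String value) {
        return field + ":" + quote(value);
    }

    public static void main(final String[] args) {
        // Hash taken from the commit message; the field name is an assumption.
        System.out.println(fieldQuery("source_id_s", "-2-HuTEndn4x"));
    }
}
```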
luccioman
1be4d32f99 Restored search page default behavior for Tab, Page Up and Down keys
Replaced by shortcuts defined by the HTML "accesskey" attribute, which
has the advantage of being announced by screen readers when focusing the
corresponding buttons, unlike custom JavaScript key handlers.
Now, with Firefox:
 - "Alt + Shift + n" for next page
 - "Alt + Shift + p" for previous page

Following ARIA recommendation : "keyboard shortcuts enhance, not
replace, standard keyboard access." ( see
https://www.w3.org/TR/wai-aria-practices/#kbd_shortcuts_behavior_design)

Fix for mantis 711 (http://mantis.tokeek.de/view.php?id=711)
2017-05-23 07:25:40 +02:00
reger
1737af37cf Set request originator to own peer in warc importer
in addition to change in 039162fbf0
2017-05-22 01:56:11 +02:00
reger
039162fbf0 Change warc importer to use the defaultsurrogate-crawl profile, as reported
by LA_FORGE http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5990 and
analysed by @luccioman (see comment 510f11d374).
Using another crawl profile without setting the originator creates a conflict.
2017-05-22 01:34:08 +02:00
Michael Peter Christen
3b1d640a3c enhanced debugging 2017-05-18 00:28:12 +02:00
Michael Peter Christen
7de7879f13 added a cache to prevent too many seed enumerations 2017-05-18 00:28:00 +02:00
luccioman
bd7411a53a Enable p2p and cluster communication when "Protection of all pages" on
As reported by paul89 on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5958 ), when setting
the "Protection of all pages" to "On" in the "ConfigAccounts_p.html"
page, the peer became completely unreachable by others, which is not the
purpose of this feature.
But the restriction still makes sense as a security measure and is
maintained in private "Robinson mode", where any peer-to-peer
or cluster communication would be rejected anyway.
2017-05-17 09:00:29 +02:00
luccioman
45346c1be8 Added missing accessibility attributes on search results progress bar. 2017-05-16 09:44:13 +02:00
luccioman
91a06bc669 Annotated search result information separators for screen readers. 2017-05-15 13:31:24 +02:00
luccioman
31ad043bb9 Added user interface feedback on results feeding termination status.
Added as an additional icon with a title in the search progress bar, to
show whether the background search feeder threads have terminated or are
still running. While giving users a bit more information about the p2p
search process, this can help in deciding whether to wait a little
longer before going to the next page, in order to get results from
various sources sorted as well as possible (see #91 for a discussion
about sorting accuracy and network latency).

Other related modifications included :
 - regular updates to statistics in the progress bar until the
background feeders are completely terminated.
 - removed some uses of insecure and discouraged JavaScript elements
2017-05-15 13:15:16 +02:00
sgaebel
ff6392215e added closing of lst-Tag in solr-Export 2017-05-13 20:38:25 +02:00
luccioman
d90b001e1b Improved previous merge "Show ranking in HTML UI".
- added the new setting as configurable in the "Debug/Analysis" settings
page. Debug/analysis is its main purpose for now, as there is currently
no nice and "understandable" ranking score info servlet (see forum
discussion http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5884 )
- render in the "Search Page Layout" page preview when enabled
- added constants
2017-05-11 18:02:33 +02:00
luccioman
efe1232d90 Merge branch 'html-show-ranking' of
https://github.com/JeremyRand/yacy_search_server

Conflicts:
	defaults/yacy.init
2017-05-11 14:53:57 +02:00
luccioman
0f0f42b509 Added some JavaDoc 2017-05-11 08:33:19 +02:00
reger
077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
collection
2017-05-09 22:52:54 +02:00
luccioman
654801523e Fixed StringIndexOutOfBoundsException case.
Revealed by commit c77e43a : the exception was then thrown when indexing
pages containing mailto: scheme URL links with the Solr Webgraph core
enabled.
Fixed the error case and restored filtering on mailto links in
Document.resortLinks() as these URLs still should not appear in
Document.hyperlinks.
2017-05-09 18:32:47 +02:00
luccioman
b297f5bdbe Updated Debian package post install script admin password encoding.
To fit the now default HTTP authentication method set to Digest in
commit f7fce1b.
Also fixed the unauthenticated access from localhost setting when first
installing the Debian package and leaving the prompted password field
empty.
2017-05-09 12:20:41 +02:00
luccioman
7623d7728f Fixed Debian install message misspelling. 2017-05-09 12:15:41 +02:00
luccioman
522a268305 Improved new blacklist entries URL scheme detection. 2017-05-04 16:36:45 +02:00
luccioman
532981b363 Updated putHTML() JavaDoc 2017-05-04 11:21:27 +02:00
luccioman
58d23047dd Handle '?' and '+' chars as valid wild cards when adding to blacklist.
An entry such as "domain.com/[a-z]+" is a valid regular expression and
does not need additional ".*.*/.*" wildcards.
2017-05-04 11:19:59 +02:00
luccioman
4564541b3b Fixed rendering of blacklist regexes containing '+' characters.
As reported on YaCy forum by shni
(http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5970) when a
blacklist entry contained both '?' and '+' characters, the '+' chars
were wrongly decoded and rendered as spaces.
2017-05-04 11:12:58 +02:00
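For illustration, one common way a '+' turns into a space is form-style URL decoding of a value that was never encoded. Whether this was the exact cause in YaCy is an assumption; the sketch below only shows the general mechanism with the standard `URLDecoder` and `URLEncoder` classes, and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class PlusDecodingSketch {

    /** Form-style decoding: a literal '+' in the input becomes a space. */
    static String decode(final String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (final UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Form-style encoding: '+' is written as %2B, so it survives a later decode. */
    static String encode(final String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (final UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(final String[] args) {
        final String entry = "domain.com/[a-z]+?"; // hypothetical blacklist entry
        System.out.println(decode(entry));          // the '+' is lost
        System.out.println(decode(encode(entry)));  // the '+' is preserved
    }
}
```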
luccioman
0612a8f4f2 Fixed the previously added link to scheduled dump operations. 2017-05-04 08:45:30 +02:00
luccioman
a87281b498 Added MediaWiki dump import scheduling feature.
Checking the last modified date by default to prevent unnecessary long
running operations.
2017-05-03 18:53:01 +02:00
luccioman
10c03c6c64 Improved MediaWiki dump import monitoring.
When the import thread has terminated:
 - now stop refreshing and stay on the monitoring page to give the user
feedback after a long-running import
 - added a link to the next monitoring step: results from the surrogates
reader
 - added a link to a new import

On the new import page, added a link to the last import report, if any.
2017-05-02 09:38:45 +02:00
luccioman
edd7ccac40 Added some JavaDoc 2017-05-02 09:33:11 +02:00
luccioman
79fdf14b0a Fixed regression introduced by commit 9ad4d16
On MediaWiki dump imports, the SurrogateReader was trying to unread too
many bytes, then failing with the following exception :
"java.io.IOException: Push back buffer is full".
2017-05-02 09:32:04 +02:00
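A minimal sketch of how this kind of failure arises with the standard `java.io.PushbackInputStream`: unreading more bytes than the pushback buffer can hold raises an `IOException`. The class name, method name and buffer sizes are hypothetical and unrelated to the actual SurrogateReader code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

class PushbackSketch {

    /**
     * Tries to unread `count` bytes into a pushback buffer of `capacity` bytes.
     * Returns null on success, or the exception message on failure.
     */
    static String tryUnread(final int capacity, final int count) {
        try (PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(new byte[0]), capacity)) {
            in.unread(new byte[count]);
            return null;
        } catch (final IOException e) {
            return e.getMessage();
        }
    }

    public static void main(final String[] args) {
        System.out.println(tryUnread(4, 4)); // fits in the buffer
        System.out.println(tryUnread(4, 5)); // one byte too many
    }
}
```

The usual remedies are to size the pushback buffer for the largest amount that may be unread, or to unread only the bytes actually consumed.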
Michael Peter Christen
7678fd67e3 copied fix from yacy_grid_parser for wrong array type 2017-05-01 11:44:26 +02:00
Michael Peter Christen
200b100fb8 added patch to rewrite altered yacy grid schema into yacy schema
This generates the stub and protocol parts of an url for inboundlinks,
outboundlinks and images
2017-05-01 11:38:02 +02:00
reger
9ad4d16829 Add a responseHeader to the solr index export with a format identifier
and export parameter (in accordance with response xml format) for easier
format detection on import.
2017-04-30 23:53:52 +02:00
luccioman
9697209ef6 Fixed Index Export feature for compatibility with old indexed documents.
This is a fix for mantis 682 (http://mantis.tokeek.de/view.php?id=682)
and issue #116
2017-04-28 11:39:51 +02:00
luccioman
88c062639b Added some JavaDoc 2017-04-28 11:36:48 +02:00
luccioman
8d288f5dba Crawl results page: apply a limit to the number of table rows.
Take into account the already existing default limit value (especially
useful after a long crawl or surrogates import), or a custom one from
parameter "count".
Added a "Show all" link for convenience.
2017-04-27 18:24:54 +02:00
luccioman
31fff2c986 Extended WikiCode template inclusion syntax support.
Wiki templates are not rendered but syntax support is improved, which
greatly enhances snippet rendering on search results coming from a
MediaWiki dump import.
Tested on various dumps from Wikimedia at
https://dumps.wikimedia.org/backup-index.html
See also Wikipedia transclusion documentation at
https://en.wikipedia.org/wiki/Wikipedia:Transclusion
2017-04-27 09:50:04 +02:00
Michael Peter Christen
973d74712f added yacy grid flatjson surrogate parser 2017-04-25 08:44:02 +02:00
luccioman
b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
This page was always empty, as described in mantis 740
(http://mantis.tokeek.de/view.php?id=740)
2017-04-24 18:24:26 +02:00
luccioman
527d494c1a Fixed "Unchecked conversion" compilation warnings. 2017-04-24 13:27:07 +02:00
reger
2b03e40134 upd to jwat-1.0.5 2017-04-22 23:32:40 +02:00
reger
7a7da698d4 fix unit test MultiProtocolURL(file) assertion for Windows path with
drive letter.
2017-04-20 00:47:52 +02:00