Commit Graph

12759 Commits

Author SHA1 Message Date
luccioman
54cfcc3f56 CrawlCheck_p.html : also display info about disallowed URLs. 2016-10-12 11:26:59 +02:00
luccioman
8b341e9818 Robots : properly handle URLs including non ASCII characters
This fixes GitHub issue 80 (
https://github.com/yacy/yacy_search_server/issues/80 ) reported by
Lord-Protector.
2016-10-12 11:25:36 +02:00
luccioman
75bb77f0cb Refactoring : extracted a method to handle authorized action links. 2016-10-12 09:31:42 +02:00
luccioman
c996b04741 HTML validation : fixed URL encoding of search results action links. 2016-10-12 09:16:47 +02:00
luccioman
2b81703828 Refactored search result action links construction.
These are long URLs with common parts : it is valuable to build the
common parts only once.
2016-10-12 08:45:32 +02:00
reger
e68b00678e prevent negative score on URIMetadataNode - in the special case where no
solr score is supplied.
+ assert before use & test case
2016-10-11 19:54:50 +02:00
luccioman
242707f9b4 Fixed loadFromCache with strategy IFFRESH.
This fixes mantis 695 ( http://mantis.tokeek.de/view.php?id=695 ) :
crawl start with 'Link-List of URL' option on websites using cookies.
2016-10-10 01:10:35 +02:00
reger
c778219768 remove module for swfparser from maven parent pom
no longer required for the build
see a4465c97d6
2016-10-07 23:49:03 +02:00
luccioman
094aed8664 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-10-07 11:06:34 +02:00
luccioman
c7402a2f89 Removed invalid empty form action.
A form action URL must not be empty (see
https://www.w3.org/TR/html/sec-forms.html#element-attrdef-form-action ).
Omitting the action attribute has the same effect (relaunching the same GET
action) but is valid HTML.
2016-10-07 10:57:31 +02:00
luccioman
37df2e19fd Removed xmlns attribute which no longer makes sense in HTML5 pages. 2016-10-07 10:46:20 +02:00
luccioman
94924e288f Added some accessibility improvements to the main interface.
Tested with NVDA screen reader.
2016-10-07 10:44:45 +02:00
luccioman
dd86f7c44e Fixed HTML validation errors and grouped radios options in fieldsets 2016-10-07 10:43:06 +02:00
luccioman
fc0c72c84b Switched to the short HTML Doctype
These pages were already no longer XHTML 1.0 because they made use of the HTML5
syntax and elements.
Applied current (2016) HTML standard recommended Doctype declaration
(see https://www.w3.org/TR/html/syntax.html#the-doctype ).
2016-10-07 10:42:23 +02:00
reger
7c81160f45 correct blacklist export as text url to blacklists_p.txt
was using servlet for network access and missing network.unit.name
fix for http://mantis.tokeek.de/view.php?id=694
+ prevent unresolved_pattern in yacy/list servlet
2016-10-07 03:03:41 +02:00
reger
b752bcfecb adjust date in text detection to ignore some program version strings
like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650
+ expand test case
2016-10-06 23:37:12 +02:00
reger
b017e97421 optimize condenser language detection a little.
langdetect probabilities take letter case into account, add words from
description and anchors etc. as is.
+ add it to javadoc
2016-10-06 19:03:52 +02:00
reger
ae3717d087 adjust Tokenizer sentence count to ignore repeated punctuation (like !!!! )
+ remove unused sentenceword map (we use only the count)
+ upd test case for sentence count
2016-10-06 03:41:07 +02:00
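A minimal sketch of the idea behind this sentence-count adjustment, not the actual YaCy Tokenizer code: a run of consecutive terminators such as "!!!!" is counted as a single sentence end. The class and method names are hypothetical, and the terminator set (. ! ?) is an assumption.

```java
public class SentenceCountSketch {

    /** Counts sentence ends, treating a run of terminators as one. */
    static int countSentences(String text) {
        int count = 0;
        boolean previousWasTerminator = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumed terminator set; the real Tokenizer may use a different one.
            boolean terminator = c == '.' || c == '!' || c == '?';
            if (terminator && !previousWasTerminator) {
                count++; // only the first terminator of a run is counted
            }
            previousWasTerminator = terminator;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSentences("Wow!!!! Really? Yes.")); // prints 3
    }
}
```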
luccioman
b5eb7a9217 Removed unnecessary crawlingDomFilterDepth hidden field.
It had incorrect "-UNRESOLVED_PATTERN-" value (see second part of
mantis 691 http://mantis.tokeek.de/view.php?id=691 )

Note : crawlingDomFilterDepth is apparently unused in current (2016)
YaCy code-base. It was also unnecessary because crawlingDomFilterCheck
hidden field is set to "off".
2016-10-05 13:48:22 +02:00
luccioman
f6d7c6ee1f Fixed the beginning of recorded action URLs displayed in /Table_API_p.html
Removed scheme, host and port from URL to avoid dealing with http/https,
external host and port retrieving issues.

What's more, this is consistent with how URLs are displayed in
/Tables_p.html?table=api&count=100&reverse=on&search= or
Tables_p.xml?table=api&count=100&search=

This fixes mantis 691 first part
(http://mantis.tokeek.de/view.php?id=691)
2016-10-05 12:20:37 +02:00
reger
474f0476c6 adjust Tokenizer sentence count on trailing text after last recognized sentence
+ upd test case for rwi multi-word-query  (leaving results known to fail untested)
2016-10-05 05:52:37 +02:00
luccioman
34658ddb9b Merge pull request #76 from luccioman/crawler
Crawl monitoring : refresh running crawls table
2016-10-04 05:06:18 +02:00
luccioman
0065c9b9ea Crawl monitoring : refresh running crawls table
Fix mantis 690 ( http://mantis.tokeek.de/view.php?id=690 ). 
Tested on :
- MS Windows 10 : Edge, Firefox 49, Chrome 53
- Debian Jessie : Firefox ESR 45
2016-10-04 03:56:03 +02:00
luccioman
e1e632ad84 Switched to the short HTML Doctype
This page was already no longer XHTML 1.0 as it makes use of the HTML5
<progress> element.
Applied current HTML standard recommended Doctype declaration (see
https://www.w3.org/TR/html/syntax.html#the-doctype ).
2016-10-04 03:56:02 +02:00
luccioman
4d8611e5e7 Tables accessibility : added missing <thead> sections. 2016-10-04 03:56:02 +02:00
luccioman
9fb3142317 Restricted variables scope to function handleStatus() in Crawler.js
Missing 'var' in declaration was unnecessarily giving global scope to
these variables.
2016-10-04 03:56:02 +02:00
reger
3861ac9293 upd maven dependency-check plugin to reflect changes of https://nvd.nist.gov
+ upd unknown ant script with current lib/jsch version
2016-10-04 03:05:26 +02:00
reger
681a61dafb adjust rwi index result word position handling used for rwi ranking
- correct WordReferenceVars.toRowEntry posintext parameter
to set expected min posintext (the difference is on multi-word queries,
while positions are ordered by search word order).
- modified posofphrase/posinphrase join operation
 - to set min posofphrase
 - and keep posinphrase if not same posofphrase (was set to 0, no differentiation during ranking)
+ fix compiler msg (missing type declaration)
2016-10-04 01:42:18 +02:00
reger
14f7577231 add support for older Word versions (Word6/Word95) to docParser 2016-10-03 01:52:51 +02:00
reger
8794e06721 upd to poi-3.15.jar 2016-10-03 01:48:35 +02:00
reger
e25f2ee88b mention date search parameter in search option help (index.html) 2016-10-02 06:36:34 +02:00
reger
1a79c64495 generalize DateDetection with holiday date rules readily available in icu
to make sure current dates are recognized (was fixed to 2014 - 2016)
+ adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text
+ moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing
+ add test case for parseline (used by query parser)
2016-10-02 03:19:12 +02:00
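The switch from a whole-string match to find() mentioned above can be illustrated with the standard java.util.regex API. This is only a sketch: the pattern and the class name are illustrative assumptions, not the rules used by DateDetection.

```java
import java.util.regex.Pattern;

public class FindVersusMatchesSketch {

    // Illustrative pattern only; the real holiday rules are not shown in the log.
    static final Pattern HOLIDAY = Pattern.compile("Thanksgiving");

    /** True only when the whole input is the holiday name. */
    static boolean matchesWholeInput(String s) {
        return HOLIDAY.matcher(s).matches();
    }

    /** True when the holiday name occurs anywhere, so leading and trailing text is tolerated. */
    static boolean foundAnywhere(String s) {
        return HOLIDAY.matcher(s).find();
    }

    public static void main(String[] args) {
        String line = "Happy Thanksgiving to all";
        System.out.println(matchesWholeInput(line)); // prints false
        System.out.println(foundAnywhere(line));     // prints true
    }
}
```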
reger
6f68f08354 correct DateDetection Silvester date
add Thanksgiving
2016-10-01 03:16:27 +02:00
reger
32a2e3a22a have RSSFeed.getChannel return empty message on missing channel element,
a) required b) prevent NPE in rss servlets
+ add test
2016-09-30 21:46:57 +02:00
reger
fedb9f8151 del double entry in master.lng 2016-09-30 21:42:42 +02:00
luccioman
8d57b5b970 Added some javadocs. 2016-09-30 17:12:55 +02:00
luccioman
4585a60d7e Made use of the constant corresponding to the hard-coded value. 2016-09-30 17:12:29 +02:00
luccioman
60df09fff9 Fixed some HTML validation errors : Illegal character in query
Now encode space characters in URLs query part.
2016-09-30 10:54:53 +02:00
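One way to obtain a query part without raw space characters is the multi-argument java.net.URI constructor, which percent-encodes characters that are illegal in each component. This is a sketch under assumptions: the host, port, path and class name are hypothetical, and the commit itself may encode spaces differently.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class QueryEncodingSketch {

    /** Builds a URL whose query has illegal characters (such as spaces) percent-encoded. */
    static String buildUrl(String authority, String path, String query) {
        try {
            // The five-argument constructor quotes illegal characters per component.
            return new URI("http", authority, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical local address, for illustration only.
        System.out.println(buildUrl("localhost:8090", "/yacysearch.html", "query=two words"));
        // prints http://localhost:8090/yacysearch.html?query=two%20words
    }
}
```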
luccioman
a76a46a2e9 Removed invalid rel="[count]" from links in tagcloud.
These are not valid link relationships, and do not appear to be used in
scripting or styling. 
If necessary, a valid alternative could be to add an attribute such as
data-count="[count]"
2016-09-30 09:43:51 +02:00
reger
862f28eaa6 display number of documents/rss-items for label "docs" in load_rss_p servlet
(as replacement for the rarely used "docs" rss-tag for a url to the rss-specification)
2016-09-29 23:59:10 +02:00
luccioman
5027912f30 Fixed <p> spacers : block elements such as <div> are not allowed inside 2016-09-29 14:24:15 +02:00
luccioman
abe489a0b5 Removed unnecessary ARIA "form" role on native HTML form elements.
This fixes warnings reported by W3C Nu Html Checker
(https://validator.w3.org/nu/).
2016-09-29 13:42:07 +02:00
luccioman
cca4186044 Fixed HTML validation error : "Stray end tag div" 2016-09-29 11:42:59 +02:00
luccioman
dcdea2d02f Fixed shutdown for crawler.MaxActiveThreads value greater than 200
Shutdown was hanging in CrawlQueues.close() at
this.workerQueue.put(POISON_REQUEST) when config value
crawler.MaxActiveThreads was greater than 200.

Revealed by "Collision" Threads dumps in mantis 689
(http://mantis.tokeek.de/view.php?id=689#c1312)

Fixed consistency between this.worker.length and this.workerQueue
capacity, and made the process more reliable using non-blocking offer()
function.
2016-09-29 10:33:11 +02:00
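The difference between a blocking put() and a non-blocking offer() on a bounded queue can be sketched as follows. This is not the CrawlQueues code: the class name, the marker value and the sizes are assumptions chosen only to show why put() can hang when more markers are sent than the queue can hold.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PoisonShutdownSketch {

    // Hypothetical shutdown marker, standing in for POISON_REQUEST.
    static final String POISON = "POISON";

    /**
     * Offers one shutdown marker per worker without blocking.
     * Returns how many markers the queue accepted.
     */
    static int signalShutdown(BlockingQueue<String> queue, int workerCount) {
        int accepted = 0;
        for (int i = 0; i < workerCount; i++) {
            // offer() returns false when the queue is full;
            // put() would block here until a consumer frees a slot.
            if (queue.offer(POISON)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Queue capacity (2) deliberately smaller than the worker count (5).
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        System.out.println(signalShutdown(queue, 5) + " of 5 markers accepted");
        // prints 2 of 5 markers accepted
    }
}
```

Keeping the queue capacity consistent with the number of workers, as the commit describes, means every worker can receive its marker; offer() additionally guarantees the shutdown call returns even if that invariant is broken.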
reger
ada473ced2 fix ConfigBasic servlet parameter name for Japanese _jp->_ja 2016-09-28 16:08:36 +02:00
luccioman
d286ba2c3e Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-09-28 14:53:08 +02:00
luccioman
b8f6458152 Prevent yacy main thread from hanging on browser opening process.
First fix for mantis 689 (http://mantis.tokeek.de/view.php?id=689).

On Debian Linux, with a headless jre and no open browser,
browser.openBrowserClassic() was called and waited forever for the browser
process to end (p.waitFor()). YaCy shutdown was therefore not working until
the browser was closed.

Also modified browser opening command for Unix platform to open the
default browser (with xdg-open util) instead of Firefox.

xdg-open also has the advantage of being asynchronous (not blocking).
2016-09-28 14:52:30 +02:00
reger
cf3a4bdf52 upd to pdfbox-2.0.3 2016-09-27 23:12:10 +02:00
reger
70e1eb30a5 prevent StringIndexOutOfBounds in getLocalFile()
+ tighten patching of DOS path w/o protocol to drive "LETTER":
2016-09-27 22:40:36 +02:00
luccioman
1bb0b135ac Avoid duplication of various MS Windows file URL flavors
Fix for mantis 692 (http://mantis.tokeek.de/view.php?id=692)
2016-09-27 07:53:08 +02:00