Commit Graph

3716 Commits

Author SHA1 Message Date
luccioman
3ccd89e274 Fixed MultiProtocolURL.resolveBackpath to handle remaining '..' segments 2016-10-13 16:18:24 +02:00
luccioman
4b699c469a Blacklist refactoring : extracted a function for easier unit testing 2016-10-13 15:33:31 +02:00
luccioman
54cfcc3f56 CrawlCheck_p.html : also display info about disallowed URLs. 2016-10-12 11:26:59 +02:00
luccioman
8b341e9818 Robots : properly handle URLs including non ASCII characters
This fixes GitHub issue 80 (
https://github.com/yacy/yacy_search_server/issues/80 ) reported by
Lord-Protector.
2016-10-12 11:25:36 +02:00
reger
e68b00678e prevent negative score on URIMetadataNode - in the special case were no
solr score is supplied.
+ assert before use & test case
2016-10-11 19:54:50 +02:00
luccioman
242707f9b4 Fixed loadFromCache with strategy IFFRESH.
This fixes mantis 695 ( http://mantis.tokeek.de/view.php?id=695 ) :
crawl start with 'Link-List of URL' option on websites using cookies.
2016-10-10 01:10:35 +02:00
reger
b752bcfecb adjust date in text detection to ignore some program version strings
like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650
+ expand test case
2016-10-06 23:37:12 +02:00
reger
b017e97421 optimize condenser language detection a little.
langdetect probabilities take letter case into account, add words from
description and anchors etc. as is.
+ add it to javadoc
2016-10-06 19:03:52 +02:00
reger
ae3717d087 adjust Tokenizer sentence count to ignore repeated punktuation (like !!!! )
+ remove unused sentenceword map (we use only the count)
+ upd test case for sentence count
2016-10-06 03:41:07 +02:00
reger
474f0476c6 adjust Tokenizer sentence count on trailing text after last recognized sentence
+ upd test case for rwi multi-word-query  (leaving results known to fail untested)
2016-10-05 05:52:37 +02:00
reger
3861ac9293 upd maven dependency-check plugin to reflect changes of https://nvd.nist.gov
+ upd unknown ant script with current lib/jsch version
2016-10-04 03:05:26 +02:00
reger
681a61dafb adjust rwi index result word position handling used for rwi ranking
- correct WordReferenceVars.toRowEntry posintext parameter
to set expected min posintext (the difference is on multi-word queries,
while positions are ordered by search word order).
- modified posofphrase/posinphrase join operation
 - to set min posofphrase
 - and keep posinphrase if not same posofphrase (was set to 0, no differentiation during ranking)
+ fix compiler msg (missing type declaration)
2016-10-04 01:42:18 +02:00
reger
14f7577231 add support for older Word versions (Word6/Word95) to docParser 2016-10-03 01:52:51 +02:00
reger
1a79c64495 generalize DateDetection with holiday date rules readily available in icu
to make sure current dates are recognized (was fixed to 2014 - 2016)
+ adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text
+ moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing
+ add test case for parseline (used by query parser)
2016-10-02 03:19:12 +02:00
reger
6f68f08354 correct DateDetection Silvester date
add Thanksgiving
2016-10-01 03:16:27 +02:00
reger
32a2e3a22a have RSSFeed.getChannel return empty message on missing channel element,
a) required b) prevent NPE in rss servlets
+ add test
2016-09-30 21:46:57 +02:00
luccioman
8d57b5b970 Added some javadocs. 2016-09-30 17:12:55 +02:00
luccioman
60df09fff9 Fixed some HTML validation errors : Illegal character in query
Now encode space characters in URLs query part.
2016-09-30 10:54:53 +02:00
reger
862f28eaa6 display number of documents/rss-items for label "docs" in load_rss_p servlet
(as replacement for the rarely used "docs" rss-tag for a url to the rss-specification)
2016-09-29 23:59:10 +02:00
luccioman
dcdea2d02f Fixed shutdown for crawler.MaxActiveThreads value greater than 200
Shutdown was hanging in CrawlQueues.close() at
this.workerQueue.put(POISON_REQUEST) when config value
crawler.MaxActiveThreads was greater than 200.

Revealed by "Collision" Threads dumps in mantis 689
(http://mantis.tokeek.de/view.php?id=689#c1312)

Fixed consistency between this.worker.length and this.workerQueue
capacity, and made the process more reliable using non-blocking offer()
function.
2016-09-29 10:33:11 +02:00
luccioman
d286ba2c3e Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-09-28 14:53:08 +02:00
luccioman
b8f6458152 Prevent yacy main thread from hanging on browser opening process.
First fix for mantis 689 (http://mantis.tokeek.de/view.php?id=689).

On Debian Linux, with a headless jre and no open browser,
browser.openBrowserClassic() was called and waited forever the browser
process end (p.waitFor()). YaCy shutdown was therefore not working until
the browser was closed.

Also modified browser opening command for Unix platform to open the
default the browser (with xdg-open util) instead of Firefox.

xdg-open also has the advantage to be asynchronous (not blocking).
2016-09-28 14:52:30 +02:00
reger
70e1eb30a5 prevent StringIndexOutOfBounds in getLocalFile()
+ tighten patching of DOS path w/o protocol to drive "LETTER":
2016-09-27 22:40:36 +02:00
luccioman
1bb0b135ac Avoid duplication of various MS Windows file URLs flavors
Fix for mantis 692 (http://mantis.tokeek.de/view.php?id=692)
2016-09-27 07:53:08 +02:00
luccioman
b9a8476f02 Removed unused import 2016-09-27 07:41:45 +02:00
reger
e73c1eea8c remove unused rootpattern, leftover from commit
9a5ab4e2c1 (diff-d2b184283abed53ae260fc9eabdaef40)
2016-09-26 02:54:58 +02:00
reger
6f8c3ccea4 improve url hash computation for file path with mixed java & windows
file.separator to compute equal hashes (by normalizing path for computation)
+ expand test case for to check mixed java / windows file url notation
like e.g. file:///c:/test/file.html vs. file:///c:\test/file.html
- relates partially to http://mantis.tokeek.de/view.php?id=692
2016-09-25 22:08:12 +02:00
reger
efcb6a1e74 fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison)
+ add mime text/xml as in use for rss in the wild
2016-09-23 23:37:12 +02:00
luccioman
b3b75b0498 Accessibility : add a customizable alternative text to YaCy log
Applied W3C recommendations :
https://www.w3.org/TR/html51/semantics-embedded-content.html#a-link-or-button-containing-nothing-but-an-image
and
https://www.w3.org/TR/html51/semantics-embedded-content.html#logos-insignia-flags-or-emblems
2016-09-22 16:08:33 +02:00
luccioman
f2bc1b268d Updated URL fragment validation rules according to current standards
See RFC 3986 (https://tools.ietf.org/html/rfc3986) or URL living
standard (https://url.spec.whatwg.org/)
2016-09-22 11:28:33 +02:00
luccioman
b1b8e69da8 Fixed NullPointerException cases 2016-09-22 11:25:33 +02:00
luccioman
3ee4f56c39 Improved ErrorCache behavior when switching networks
Even after network switch, ErroCache was still holding a reference to
the previous Solr cores, thus becoming useless until next YaCy restart.

Initial error cache filling with recent errors from the index was also
missing after the swtich.
2016-09-22 09:07:07 +02:00
luccioman
7d5ba2afa4 Added some JavaDoc and moved crawlStacker close at the right place. 2016-09-22 08:21:14 +02:00
luccioman
8edbcd8ad4 Log eventual Solr instances close errors.
We do not want to block on this kind of error, but this should not
silently fail as it may have later consequences.
2016-09-22 08:20:01 +02:00
reger
330768c8a2 fix for solr write.lock after mode change http://mantis.tokeek.de/view.php?id=686
The embedded core holds a lock on the index and must be closed. Earlier commit
comment states that core should be closed with solr instance instead on close 
of connector.
Adjusted the InstanceMirror.close() to take care of closing the embedded 
instance to release the lock.
In 2 routines of fulltext this was already explicite implemented (disconnectLocalSolr).
Now this disconnect is part of the InstanceMirror.close().
2016-09-22 00:16:22 +02:00
reger
585d2a6441 test case: for NewsPool to check the id modificator (for unique id)
and observe the distribution order .. hands on.
+ add test/DATA to gitignor
2016-09-20 01:55:56 +02:00
luccioman
de5c873e38 Removed unused JavaScript file docs.min.js
This file is used by Bootstrap documentation website
(http://getbootstrap.com/) but is not part of the Bootstrap distribution
and has not be included in a Bootstrap based application.
2016-09-20 00:17:42 +02:00
Michael Peter Christen
df51e4ef07 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2016-09-19 11:01:58 +02:00
Michael Peter Christen
e063aaf97f enable fuzzy search, solr style (append a ~ to get a fuzzyness on the
word)
2016-09-19 11:01:39 +02:00
reger
ff6589fc0f test case: simulating multi word query for local rwi index
Purpose of the test case is to be able to (controlled) analyse the rwi ranking for
multi word searches (with focus on posintext and word-distance ranking)
2016-09-18 00:59:27 +02:00
reger
e990297d2e avoid NPE on hello message with missing "yourip" key
http://mantis.tokeek.de/view.php?id=684
2016-09-15 23:26:25 +02:00
reger
e51ab8c7aa hack to generate a unique message-id for messages created in the same second
by optionally add a 1 second offset counter to the current time (which is
used as the unique id part)
2016-09-15 02:59:32 +02:00
Michael Peter Christen
b82300358a removed version number check because it does not work any more if
version numbers are expressed in a different way as we expect. That
could cause that YaCy does not run on systems which are appropriate but
we simply do not understand the version string.
2016-09-14 16:32:57 +02:00
Michael Peter Christen
2107674999 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-09-14 16:24:55 +02:00
Michael Peter Christen
0d28f563f4 fix for java version "9-ea" 2016-09-14 16:24:32 +02:00
reger
3b694b3935 add some javadoc to rwi wordreference distance, position
to remember facts for http://mantis.tokeek.de/view.php?id=683
Init missing word position to 0 like in other non text body words
2016-09-14 00:36:19 +02:00
reger
a4465c97d6 as requested, disable/remove old swf parser
http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5861#p33098
2016-09-13 02:47:36 +02:00
reger
7f63fc50f3 prepare a IndexSegment test case for RWI index testing
+ prevent NPE in Segment.clear() on missing embedded solr instance.
2016-09-11 23:25:44 +02:00
reger
96467c5467 remove not needed counter in Tokeninzer (completing last changes)
including a small change, word posintext counting. 
We remember/store 1st posintext. Previously following words got a handle (posintext)
excluding found. Now it just counts and assigns true posintext as handle (posintext)
2016-09-10 18:23:09 +02:00
luccioman
d66b0f7b7b Fixed french messages encoding in YaCy tray.
Also added the missing french translations.
2016-09-09 07:43:33 +02:00