Commit Graph

78 Commits

Author SHA1 Message Date
luccioman
c226ded799 Fix unescape of URLs having some '%' chars but not percent-encoded 2017-05-30 12:32:14 +02:00
reger
077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
collection
2017-05-09 22:52:54 +02:00
luccioman
522a268305 Improved new blacklist entries URL scheme detection. 2017-05-04 16:36:45 +02:00
luccioman
31fff2c986 Extended WikiCode template inclusion syntax support.
Wiki templates are not rendered but syntax support is improved, which
greatly enhance snippets rendering on search results coming from a
MediaWiki dump import.
Tested on various dumps from Wikimedia at
https://dumps.wikimedia.org/backup-index.html
See also Wikipedia transclusion documentation at
https://en.wikipedia.org/wiki/Wikipedia:Transclusion
2017-04-27 09:50:04 +02:00
reger
7a7da698d4 fix unit test MultiProtocolURL(file) assertion for Windows path with
drive letter.
2017-04-20 00:47:52 +02:00
luccioman
23775e76e2 Fixed endless loop case in wikicode processing.
Detected when importing recent MediaWiki dumps containing some pages
with script content in plain text format (see Scribunto extension
https://www.mediawiki.org/wiki/Extension:Scribunto ).

Further improvement : modify the MediawikiImporter to prevent processing
revisions whose <model> is not wikitext.
2017-04-12 17:17:03 +02:00
luccioman
0bc868a819 Improved support for non ASCII chars in local file system URLs
Creating a MultiProtocolURL instance from a File object and then
retrieving a File with getFSFile() was inconsistent with file paths
containing space or non ASCII chars.
2017-04-12 09:23:10 +02:00
reger
777cb5b812 remove test case for Standard_MemoryControl which will always fail
see https://github.com/yacy/yacy_search_server/pull/114
2017-04-02 03:59:37 +02:00
reger
1ccc44e681 fix default/httpd.mime Z file extension to lower case
+ test case
2017-03-26 23:52:31 +02:00
reger
18c7563dbe Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages
by using icu.ULocale for languages not already covered (ICU normalizes 
to ISO639-1 2 char codes).
Add test class
Use DublinCore vocabulary declarations in DCEntry and SurrogateReader 
for easier usage debugging, 
Init SurrogateReader.inputSource on first use.
2017-03-05 02:26:10 +01:00
reger
275c0cddd1 Adjust DefaultServlet test case to recent change,
depreciate unused CONNECTION_PROP_PROTOCOL (also as it might be 
misleading with getProtocol vs getScheme)
2017-02-26 02:39:52 +01:00
reger
41e2ee0eca Fix call parameter for ConnectionInfo in MonitorHandler
(expected scheme e.g. http, was protocol version).
Depreceate obsolete custom X-...-Scheme header constant.
Use existing FORMAT_ANSIC Dateformatter in HeaderFramework.
Correct htmlParserTest (del one not intended println)
2017-02-25 23:55:17 +01:00
reger
f254fcfc67 fix htmlParser <script> text extraction on code containing expression
recognized as tag like 1<a
reported in https://github.com/yacy/yacy_search_server/issues/109

Script content is ignored by default, but the text is filtered for html
tags. Modified scraper to skip tag filtering while within a <script> 
section (until a closing tag is detected </script>. 
Possible side effect, missing </script> end-tag will truncate trailing 
content text.
2017-02-24 01:25:32 +01:00
luccioman
2f191e0e1c Improved MultiprocotolURL non ASCII characters support.
After @sinkuu Pull Request #108 added JUnit tests, updated some JavaDoc
and also improved URL tokenization to support non ASCII characters.
2017-02-23 11:09:43 +01:00
luccioman
5c8958bcea Updated Javadoc and Junit tests for the WebStructureGraph class. 2017-01-17 17:01:56 +01:00
luccioman
d9766ca981 Fixed WatchWebStructure_p.html render to include https URLs.
As described in mantis 721 (http://mantis.tokeek.de/view.php?id=721)
WatchWebStructure_p.html failed to include in its structure view https
and other protocols and ports than default http.
2017-01-16 18:41:58 +01:00
luccioman
ed3dd5e31a Fixed webstructure.xml API used with a domain name 'about' parameter.
As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720),
when requesting this API with a domain name instead of a complete URL
only HTTP references on default port were listed.
2017-01-16 16:41:06 +01:00
luccioman
0da1e6ba16 Factored code re-implementing DigestURL.hosthash() method.
This ensure consistent implementation of the url host hash generation
and easier usage finding in source code.

Also added a unit test for this function.
2017-01-16 10:18:42 +01:00
luccioman
86adfef30f Added automated unit tests and perfs test for WebStructureGraph class.
Fixed references count when multiple links target the same domain name
in one document.
2017-01-13 16:10:59 +01:00
luccioman
c9889991b9 Fixed 2 failing JUNit tests. 2017-01-09 17:59:01 +01:00
reger
083df255e4 fix html tag attribute parsing containing attribute w/o value
e.g. itemscope or autofocus (in such case the next key was not properly
recognized).
2016-12-24 06:57:11 +01:00
reger
cb95b7339a include html5 <time> tag in content scraper,
add "datetime" property of <time> tag to scrapers startdate list.
Datetime is parsed as iso8601 (xml) date, html5 allows partial as well
as duration (not handled by this)
2016-12-24 03:11:35 +01:00
luccioman
aa9ddf3c23 Added control over Robots.txt active threads maximum number.
When starting a crawl from a file containing thousands of links,
configuration setting "crawler.MaxActiveThreads" is effective to prevent
saturating the system with too many outgoing HTTP connections threads
launched by the crawler.
But robots.txt was not affected by this setting and was indefinitely
increasing the number of concurrently loading threads until most ot the
connections timed out.

To improve performance control, added a pool of threads for Robots.txt,
consistently used in its ensureExist() and massCrawlCheck() methods.
The Robots.txt threads pool max size can now be configured in the
/PerformanceQueus_p.html page, or with the new
"robots.txt.MaxActiveThreads" setting, initialized with the same default
value as the crawler.
2016-11-23 18:13:05 +01:00
reger
fdcf33f08f fix Domain.stripToHostName for some IPv6 cases
add unit test for it
2016-11-19 16:37:16 +01:00
reger
ac6e198bd1 add unit test for Domains.stripToPort,
simplify ipv6 check
2016-11-19 06:22:55 +01:00
luccioman
a0dfbaca6a FileUtils : added some JavaDocs and unit test cases 2016-11-16 15:12:21 +01:00
reger
395f2e8946 Make ServletRequest implement the standardized HttpServletRequest interface,
to make all readily available information from the original ServletRequest
available to YaCy servlets (without converting data to internal structures).
The implementation of the common interface allows easier integration of
YaCy servlets with the servlet standard (e.g. shared login service with
the servlet container etc.)
2016-11-14 01:37:16 +01:00
luccioman
7296e3884f Switched even more URLs to pure relative ones.
Thus a YaCy peer can run behind a reverse proxy subfolder without need
for the reverse proxy to rewrite HTML links (a CPU costly operation).

Tested on Debian Jessie with an apache2 reverse proxy.

See related mantis issues http://mantis.tokeek.de/view.php?id=106 and
http://mantis.tokeek.de/view.php?id=701
2016-11-09 02:40:33 +01:00
luccioman
731684105a Improved absolute URLs rendering in OpenSearch desc and RSS feeds.
When the peer is behind a reverse proxy providing SSL/TLS encryption,
the rendered absolute URLs should start with https when the user browser
requested https : added limited support to the X-Forwarded-Proto HTTP
header notably provided on Heroku platform.
Also added some unit tests.
2016-11-08 02:39:45 +01:00
reger
c9e81d2fa0 fix Column parsing from celldefinition string, without cellwidth def.
(outofbound exception)
2016-11-06 03:34:24 +01:00
reger
af39a76bf6 Reduce number of default max. search navigator lines (from 10000)
to 100 + make it configurable
2016-10-29 04:19:46 +02:00
reger
20a1b29ed3 add simple test case for ReferenceContainer helpful for debugging
calculated ranking parameter
2016-10-26 01:38:40 +02:00
reger
3c7220bc7b Refacture rwi reference word position and word distance calculation
used for rwi ranking.
Main changes:  
- introduce a  posintext() to access the stored value. This reduces also mem alloc of position array for WordReferenceRow (index access)
- use the positions() array for joined references on multi-word queries if needed (otherwise allow positions() to be null
- adjust assignments and the min() max() and distance() calculation accordingly
2016-10-23 19:40:02 +02:00
luccioman
c3c4a52408 Added more examples in Blacklist JUnit test. 2016-10-19 13:14:20 +02:00
reger
8b74a6bf57 fix min/max calculation of WordReferenceVars.distance()
Issue was the calculation in AbstractReference with positions.clear() call,
this made distance result always 0 (distance needs min 2 positions) and created concurrency issues.
+ unit test of changes
2016-10-17 23:58:28 +02:00
luccioman
7717a3d43d Fixed license headers on files created to improve favicon management. 2016-10-14 11:55:49 +02:00
luccioman
6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
	htroot/yacysearchitem.java
	source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
	source/net/yacy/search/schema/CollectionConfiguration.java
	source/net/yacy/server/serverObjects.java
2016-10-14 11:29:55 +02:00
luccioman
7136b1ad60 HTML validation : fixed URL encoding of Pictures link. 2016-10-14 09:58:14 +02:00
luccioman
3ccd89e274 Fixed MultiProtocolURL.resolveBackpath to handle remaining '..' segments 2016-10-13 16:18:24 +02:00
luccioman
f1f4459f88 Added some unit tests for Blacklist.isListed() 2016-10-13 15:39:47 +02:00
reger
e68b00678e prevent negative score on URIMetadataNode - in the special case were no
solr score is supplied.
+ assert before use & test case
2016-10-11 19:54:50 +02:00
reger
b752bcfecb adjust date in text detection to ignore some program version strings
like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650
+ expand test case
2016-10-06 23:37:12 +02:00
reger
b017e97421 optimize condenser language detection a little.
langdetect probabilities take letter case into account, add words from
description and anchors etc. as is.
+ add it to javadoc
2016-10-06 19:03:52 +02:00
reger
ae3717d087 adjust Tokenizer sentence count to ignore repeated punktuation (like !!!! )
+ remove unused sentenceword map (we use only the count)
+ upd test case for sentence count
2016-10-06 03:41:07 +02:00
reger
474f0476c6 adjust Tokenizer sentence count on trailing text after last recognized sentence
+ upd test case for rwi multi-word-query  (leaving results known to fail untested)
2016-10-05 05:52:37 +02:00
reger
1a79c64495 generalize DateDetection with holiday date rules readily available in icu
to make sure current dates are recognized (was fixed to 2014 - 2016)
+ adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text
+ moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing
+ add test case for parseline (used by query parser)
2016-10-02 03:19:12 +02:00
reger
32a2e3a22a have RSSFeed.getChannel return empty message on missing channel element,
a) required b) prevent NPE in rss servlets
+ add test
2016-09-30 21:46:57 +02:00
luccioman
4585a60d7e Made use of the constant corresponding to the hard-coded value. 2016-09-30 17:12:29 +02:00
luccioman
1bb0b135ac Avoid duplication of various MS Windows file URLs flavors
Fix for mantis 692 (http://mantis.tokeek.de/view.php?id=692)
2016-09-27 07:53:08 +02:00
reger
6f8c3ccea4 improve url hash computation for file path with mixed java & windows
file.separator to compute equal hashes (by normalizing path for computation)
+ expand test case for to check mixed java / windows file url notation
like e.g. file:///c:/test/file.html vs. file:///c:\test/file.html
- relates partially to http://mantis.tokeek.de/view.php?id=692
2016-09-25 22:08:12 +02:00