197 Commits

Author SHA1 Message Date
luccioman
5c8958bcea Updated Javadoc and Junit tests for the WebStructureGraph class. 2017-01-17 17:01:56 +01:00
luccioman
d9766ca981 Fixed WatchWebStructure_p.html render to include https URLs.
As described in mantis 721 (http://mantis.tokeek.de/view.php?id=721)
WatchWebStructure_p.html failed to include in its structure view https
and other protocols and ports than default http.
2017-01-16 18:41:58 +01:00
luccioman
ed3dd5e31a Fixed webstructure.xml API used with a domain name 'about' parameter.
As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720),
when requesting this API with a domain name instead of a complete URL
only HTTP references on default port were listed.
2017-01-16 16:41:06 +01:00
luccioman
0da1e6ba16 Factored code re-implementing DigestURL.hosthash() method.
This ensures a consistent implementation of the URL host hash generation
and makes its usages easier to find in the source code.

Also added a unit test for this function.
2017-01-16 10:18:42 +01:00
luccioman
86adfef30f Added automated unit tests and performance tests for WebStructureGraph class.
Fixed references count when multiple links target the same domain name
in one document.
2017-01-13 16:10:59 +01:00
luccioman
c9889991b9 Fixed 2 failing JUnit tests. 2017-01-09 17:59:01 +01:00
reger
083df255e4 fix HTML tag attribute parsing for tags containing an attribute without a value,
e.g. itemscope or autofocus (in such cases the next key was not properly
recognized).
2016-12-24 06:57:11 +01:00
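The fix above can be illustrated with a minimal sketch of attribute parsing that tolerates value-less attributes. This is not the YaCy scraper code: the class name `TagAttributeSketch`, the method `parse` and the regular expression are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of YaCy: parses the attribute part of an HTML tag.
class TagAttributeSketch {

    // An attribute name, optionally followed by '=' and a double-quoted,
    // single-quoted or unquoted value. The value part is optional, so a bare
    // attribute such as "itemscope" does not swallow the following key.
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "([^\\s=/>\"']+)(?:\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>\"']+)))?");

    static Map<String, String> parse(String attributeText) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(attributeText);
        while (m.find()) {
            String value = ""; // value-less attributes are stored with an empty value
            for (int g = 2; g <= 4; g++) {
                if (m.group(g) != null) {
                    value = m.group(g);
                }
            }
            attributes.put(m.group(1).toLowerCase(Locale.ROOT), value);
        }
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println(parse("itemscope itemtype=\"http://schema.org/Event\" autofocus"));
    }
}
```

In this sketch both `itemscope` and `autofocus` end up as keys with an empty value, and `itemtype` keeps its quoted value.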
reger
cb95b7339a include html5 <time> tag in content scraper,
add "datetime" property of <time> tag to scrapers startdate list.
Datetime is parsed as iso8601 (xml) date, html5 allows partial as well
as duration (not handled by this)
2016-12-24 03:11:35 +01:00
luccioman
aa9ddf3c23 Added control over Robots.txt active threads maximum number.
When starting a crawl from a file containing thousands of links,
configuration setting "crawler.MaxActiveThreads" is effective to prevent
saturating the system with too many outgoing HTTP connections threads
launched by the crawler.
But robots.txt was not affected by this setting and was indefinitely
increasing the number of concurrently loading threads until most of the
connections timed out.

To improve performance control, added a pool of threads for Robots.txt,
consistently used in its ensureExist() and massCrawlCheck() methods.
The Robots.txt threads pool max size can now be configured in the
/PerformanceQueues_p.html page, or with the new
"robots.txt.MaxActiveThreads" setting, initialized with the same default
value as the crawler.
2016-11-23 18:13:05 +01:00
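The idea of capping concurrent robots.txt loads with a thread pool can be sketched as follows. This is only an illustration under stated assumptions: the class `RobotsLoaderPoolSketch` and its methods are hypothetical and do not mirror the actual ensureExist() or massCrawlCheck() code; the peak counter exists only to make the cap observable.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not YaCy code: a bounded pool for robots.txt loading tasks.
class RobotsLoaderPoolSketch {
    private final ExecutorService pool;
    private final AtomicInteger active = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    RobotsLoaderPoolSketch(int maxActiveThreads) {
        // Fixed-size pool: surplus tasks wait in the queue instead of
        // each starting a new thread and a new outgoing connection.
        this.pool = Executors.newFixedThreadPool(maxActiveThreads);
    }

    void load(Runnable fetchTask) {
        pool.execute(() -> {
            int now = active.incrementAndGet();
            peak.accumulateAndGet(now, Math::max); // remember the highest concurrency seen
            try {
                fetchTask.run();
            } finally {
                active.decrementAndGet();
            }
        });
    }

    int peakActive() {
        return peak.get();
    }

    // Stops accepting tasks and waits for the queued ones; returns false on timeout or interrupt.
    boolean shutdownAndWait(long seconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        RobotsLoaderPoolSketch loader = new RobotsLoaderPoolSketch(4);
        for (int i = 0; i < 100; i++) {
            loader.load(() -> {
                try {
                    Thread.sleep(5); // stands in for a robots.txt download
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        loader.shutdownAndWait(30);
        System.out.println("peak concurrent loads: " + loader.peakActive());
    }
}
```

However many tasks are submitted, the observed peak never exceeds the configured maximum, which is the property the new setting is meant to provide.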
reger
fdcf33f08f fix Domains.stripToHostName for some IPv6 cases
add unit test for it
2016-11-19 16:37:16 +01:00
reger
ac6e198bd1 add unit test for Domains.stripToPort,
simplify ipv6 check
2016-11-19 06:22:55 +01:00
luccioman
a0dfbaca6a FileUtils : added some JavaDocs and unit test cases 2016-11-16 15:12:21 +01:00
reger
395f2e8946 Make ServletRequest implement the standardized HttpServletRequest interface,
to make all readily available information from the original ServletRequest
available to YaCy servlets (without converting data to internal structures).
The implementation of the common interface allows easier integration of
YaCy servlets with the servlet standard (e.g. shared login service with
the servlet container etc.)
2016-11-14 01:37:16 +01:00
luccioman
7296e3884f Switched even more URLs to pure relative ones.
Thus a YaCy peer can run behind a reverse proxy subfolder without need
for the reverse proxy to rewrite HTML links (a CPU costly operation).

Tested on Debian Jessie with an apache2 reverse proxy.

See related mantis issues http://mantis.tokeek.de/view.php?id=106 and
http://mantis.tokeek.de/view.php?id=701
2016-11-09 02:40:33 +01:00
luccioman
731684105a Improved absolute URLs rendering in OpenSearch desc and RSS feeds.
When the peer is behind a reverse proxy providing SSL/TLS encryption,
the rendered absolute URLs should start with https when the user browser
requested https : added limited support to the X-Forwarded-Proto HTTP
header notably provided on Heroku platform.
Also added some unit tests.
2016-11-08 02:39:45 +01:00
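A minimal sketch of how an X-Forwarded-Proto header could influence the scheme of rendered absolute URLs is given here. The class `ForwardedSchemeSketch` and both method names are hypothetical, and taking the first entry of a comma-separated header value is an assumption based on common proxy convention, not a description of the YaCy implementation.

```java
import java.util.Locale;

// Hypothetical sketch, not YaCy code: choose the scheme for absolute URLs
// when the peer may sit behind a TLS-terminating reverse proxy.
class ForwardedSchemeSketch {

    static String effectiveScheme(String requestScheme, String forwardedProto) {
        if (forwardedProto != null) {
            // Assumption: with chained proxies the header may hold a list,
            // and the first entry describes the client-facing connection.
            int comma = forwardedProto.indexOf(',');
            String first = (comma >= 0 ? forwardedProto.substring(0, comma) : forwardedProto)
                    .trim().toLowerCase(Locale.ROOT);
            if (first.equals("http") || first.equals("https")) {
                return first;
            }
        }
        // Header missing or not recognized: fall back to the scheme of the direct request.
        return requestScheme;
    }

    static String absoluteUrl(String scheme, String hostAndPort, String path) {
        return scheme + "://" + hostAndPort + path;
    }

    public static void main(String[] args) {
        String scheme = effectiveScheme("http", "https");
        System.out.println(absoluteUrl(scheme, "peer.example.org", "/opensearchdescription.xml"));
    }
}
```

Unrecognized header values are ignored rather than echoed into the output, so a malformed header cannot inject an arbitrary scheme into generated links.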
reger
c9e81d2fa0 fix Column parsing from celldefinition string, without cellwidth def.
(outofbound exception)
2016-11-06 03:34:24 +01:00
reger
af39a76bf6 Reduce number of default max. search navigator lines (from 10000)
to 100 + make it configurable
2016-10-29 04:19:46 +02:00
reger
20a1b29ed3 add simple test case for ReferenceContainer helpful for debugging
calculated ranking parameter
2016-10-26 01:38:40 +02:00
reger
3c7220bc7b Refactor rwi reference word position and word distance calculation
used for rwi ranking.
Main changes:
- introduce a posintext() to access the stored value. This also reduces mem alloc of position array for WordReferenceRow (index access)
- use the positions() array for joined references on multi-word queries if needed (otherwise allow positions() to be null)
- adjust assignments and the min() max() and distance() calculation accordingly
2016-10-23 19:40:02 +02:00
luccioman
c3c4a52408 Added more examples in Blacklist JUnit test. 2016-10-19 13:14:20 +02:00
reger
8b74a6bf57 fix min/max calculation of WordReferenceVars.distance()
Issue was the calculation in AbstractReference with positions.clear() call,
this made distance result always 0 (distance needs min 2 positions) and created concurrency issues.
+ unit test of changes
2016-10-17 23:58:28 +02:00
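To make the "distance needs min 2 positions" remark concrete, here is a small sketch of a word distance computed without modifying the position list. The formula used (average gap between consecutive positions) is an assumption chosen for illustration, and `WordDistanceSketch` is a hypothetical class, not the WordReferenceVars or AbstractReference code.

```java
// Hypothetical sketch, not YaCy code: a read-only word distance calculation.
class WordDistanceSketch {

    // Returns the average gap between consecutive word positions,
    // or 0 when fewer than two positions are known.
    static int distance(int[] positions) {
        if (positions == null || positions.length < 2) {
            return 0;
        }
        int sum = 0;
        for (int i = 1; i < positions.length; i++) {
            sum += Math.abs(positions[i] - positions[i - 1]);
        }
        // The input array is never cleared or altered, so repeated and
        // concurrent calls see the same positions.
        return sum / (positions.length - 1);
    }

    public static void main(String[] args) {
        System.out.println(distance(new int[] {3, 7, 8}));
    }
}
```

Because the method only reads its input, calling it twice on the same positions yields the same value, which is the property that clearing the positions inside the calculation would break.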
luccioman
93ea366778 Updated license header file name 2016-10-15 11:34:50 +02:00
luccioman
4c0be4d5d4 Fixed maven compilation error
Removed unit test yacysearchitemTest from default maven Junit tests
path, as yacysearchitem class is not in maven build classpath.
2016-10-15 11:34:23 +02:00
luccioman
7717a3d43d Fixed license headers on files created to improve favicon management. 2016-10-14 11:55:49 +02:00
luccioman
6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
	htroot/yacysearchitem.java
	source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
	source/net/yacy/search/schema/CollectionConfiguration.java
	source/net/yacy/server/serverObjects.java
2016-10-14 11:29:55 +02:00
luccioman
7136b1ad60 HTML validation : fixed URL encoding of Pictures link. 2016-10-14 09:58:14 +02:00
luccioman
3ccd89e274 Fixed MultiProtocolURL.resolveBackpath to handle remaining '..' segments 2016-10-13 16:18:24 +02:00
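One plausible treatment of leftover '..' segments is the one RFC 3986 describes for dot-segment removal, where a '..' that would climb above the root is simply discarded. The sketch below follows that rule; `BackpathSketch` is a hypothetical class, it assumes the path starts with '/', and it is not the MultiProtocolURL.resolveBackpath code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not YaCy code: resolve "." and ".." segments in a URL path.
class BackpathSketch {

    static String resolve(String path) {
        String[] segments = path.split("/", -1);
        Deque<String> kept = new ArrayDeque<>();
        boolean trailingSlash = false;
        for (int i = 0; i < segments.length; i++) {
            String s = segments[i];
            boolean last = (i == segments.length - 1);
            if (i == 0 && s.isEmpty()) {
                continue; // the empty string before the leading slash
            }
            if (s.equals("..")) {
                kept.pollLast(); // drops the parent segment; does nothing at the root
                trailingSlash = last;
            } else if (s.equals(".") || (last && s.isEmpty())) {
                trailingSlash = last;
            } else {
                kept.addLast(s);
            }
        }
        String joined = "/" + String.join("/", kept);
        return (trailingSlash && !kept.isEmpty()) ? joined + "/" : joined;
    }

    public static void main(String[] args) {
        System.out.println(resolve("/a/b/../c"));
        System.out.println(resolve("/../../a"));
    }
}
```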
luccioman
f1f4459f88 Added some unit tests for Blacklist.isListed() 2016-10-13 15:39:47 +02:00
reger
e68b00678e prevent negative score on URIMetadataNode - in the special case where no
solr score is supplied.
+ assert before use & test case
2016-10-11 19:54:50 +02:00
reger
b752bcfecb adjust date in text detection to ignore some program version strings
like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650
+ expand test case
2016-10-06 23:37:12 +02:00
reger
b017e97421 optimize condenser language detection a little.
langdetect probabilities take letter case into account, add words from
description and anchors etc. as is.
+ add it to javadoc
2016-10-06 19:03:52 +02:00
reger
ae3717d087 adjust Tokenizer sentence count to ignore repeated punctuation (like !!!! )
+ remove unused sentenceword map (we use only the count)
+ upd test case for sentence count
2016-10-06 03:41:07 +02:00
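Counting sentences while ignoring repeated punctuation can be sketched as below. The class `SentenceCountSketch` and the choice of '.', '!' and '?' as sentence terminators are assumptions for illustration; the actual Tokenizer works on a word stream and may differ in detail.

```java
// Hypothetical sketch, not YaCy code: count sentences, treating a run of
// terminators such as "!!!!" or "?!" as a single sentence end.
class SentenceCountSketch {

    static int countSentences(String text) {
        int count = 0;
        boolean inRun = false; // true while the previous character was a terminator
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = (c == '.' || c == '!' || c == '?');
            if (terminator && !inRun) {
                count++; // only the first terminator of a run counts
            }
            inRun = terminator;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSentences("Stop!!!! Really? Yes."));
    }
}
```

Text after the last terminator is not counted in this sketch.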
reger
474f0476c6 adjust Tokenizer sentence count on trailing text after last recognized sentence
+ upd test case for rwi multi-word-query  (leaving results known to fail untested)
2016-10-05 05:52:37 +02:00
reger
1a79c64495 generalize DateDetection with holiday date rules readily available in icu
to make sure current dates are recognized (was fixed to 2014 - 2016)
+ adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text
+ moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing
+ add test case for parseline (used by query parser)
2016-10-02 03:19:12 +02:00
reger
32a2e3a22a have RSSFeed.getChannel return empty message on missing channel element,
a) required b) prevent NPE in rss servlets
+ add test
2016-09-30 21:46:57 +02:00
luccioman
4585a60d7e Made use of the constant corresponding to the hard-coded value. 2016-09-30 17:12:29 +02:00
luccioman
1bb0b135ac Avoid duplication of various MS Windows file URLs flavors
Fix for mantis 692 (http://mantis.tokeek.de/view.php?id=692)
2016-09-27 07:53:08 +02:00
reger
6f8c3ccea4 improve url hash computation for file path with mixed java & windows
file.separator to compute equal hashes (by normalizing path for computation)
+ expand test case to check mixed java / windows file url notation
like e.g. file:///c:/test/file.html vs. file:///c:\test/file.html
- relates partially to http://mantis.tokeek.de/view.php?id=692
2016-09-25 22:08:12 +02:00
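The normalization idea can be sketched as: unify the separators first, then hash. The class `FileUrlHashSketch` is hypothetical, and the plain MD5 hex digest stands in for the real YaCy URL hash, which is computed differently.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch, not YaCy code: equal hashes for file URLs that differ
// only in the path separator used.
class FileUrlHashSketch {

    // Replace Windows separators by '/' so both notations share one form.
    static String normalize(String fileUrl) {
        return fileUrl.replace('\\', '/');
    }

    // Assumption: a plain MD5 hex digest is used here only as a placeholder hash.
    static String hash(String fileUrl) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalize(fileUrl).getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String a = hash("file:///c:/test/file.html");
        String b = hash("file:///c:\\test/file.html");
        System.out.println(a.equals(b));
    }
}
```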
reger
330768c8a2 fix for solr write.lock after mode change http://mantis.tokeek.de/view.php?id=686
The embedded core holds a lock on the index and must be closed. An earlier commit
comment states that the core should be closed with the solr instance instead of on close
of the connector.
Adjusted InstanceMirror.close() to take care of closing the embedded
instance to release the lock.
In 2 routines of fulltext this was already explicitly implemented (disconnectLocalSolr).
Now this disconnect is part of InstanceMirror.close().
2016-09-22 00:16:22 +02:00
reger
11786457b7 add test case for EmbeddedSolrConnector close()
for issue http://mantis.tokeek.de/view.php?id=686
(without solving the issue here)
2016-09-21 21:08:21 +02:00
reger
585d2a6441 test case: for NewsPool to check the id modificator (for unique id)
and observe the distribution order .. hands on.
+ add test/DATA to .gitignore
2016-09-20 01:55:56 +02:00
reger
ff6589fc0f test case: simulating multi word query for local rwi index
Purpose of the test case is to be able to (controlled) analyse the rwi ranking for
multi word searches (with focus on posintext and word-distance ranking)
2016-09-18 00:59:27 +02:00
reger
7f63fc50f3 prepare an IndexSegment test case for RWI index testing
+ prevent NPE in Segment.clear() on missing embedded solr instance.
2016-09-11 23:25:44 +02:00
reger
272cdd496a reactivate sentence counter in WordTokenizer for phrasepos ranking,
by counting punctuation (delivered as 1 char word) again.
2016-09-07 02:16:16 +02:00
Michael Peter Christen
5e165a8150 removed unused imports 2016-09-06 18:46:24 +02:00
reger
e310ec5f70 fix posInText ranking calculation to score 0 on no position info
+ fix Word posInText calc in Tokenizer to start with 1
+ test case
2016-09-06 00:05:59 +02:00
reger
39dd244693 fix ConcurrentScoreMap.set() calculation of totalCount()
+ test case
2016-09-04 22:18:07 +02:00
reger
ebde21079a refactor xlsParser to include Excel file attribute (like author) in parser result doc.
Similar to ppt and doc parser, completing a TODO in xlsParser.
2016-08-13 23:46:36 +02:00
reger
5e335b32da fix Blacklist.contains() matching path pattern to string
similar to 5e9e871192
+ add proof testcase
2016-08-04 01:12:49 +02:00
reger
f89d4eb51d fix MultiProtocolURL init (assign of host) for urls with '/' in query part
+ add to test case
2016-07-17 04:17:01 +02:00