Commit Graph

288 Commits

Author SHA1 Message Date
Michael Peter Christen
3959d43a5c fixed doku link 2021-08-03 16:57:24 +02:00
sgaebel
fc03c4b4fe removes some warning and unused objects 2020-08-03 20:44:31 +02:00
sgaebel
4a495df63a removes some deprecation-warnings 2020-07-31 17:28:06 +02:00
sgaebel
dd9d4b1188 replace org.junit.Assert.assertThat by
org.hamcrest.MatcherAssert.assertThat from hamcrest 2.2 to avoid
deprecation-warning
2020-07-28 19:09:26 +02:00
sgaebel
df9ea0a42a removes some warnings: unused imports, params 2020-07-27 22:20:49 +02:00
Michael Peter Christen
64a17faca0 added debug code to parser test to investigate why this fails in travis
build
2020-07-12 16:57:46 +02:00
Michael Peter Christen
f0f12f875b fix for failing parser test: new forum link 2019-10-13 11:12:32 +02:00
Michael Christen
3a46b07603 fixed many links to old forum, now https://searchlab.eu 2019-06-15 11:43:27 +02:00
luccioman
e90405b6f0 Support parsing audio URLs without file extension
Added also a Junit for the audio tag parser
2019-04-09 11:40:21 +02:00
luccioman
3fb449b3b6 Properly resolve relative URLs against document URL in html base tags
Fixes issue #256
2018-12-06 20:18:00 +01:00
luccioman
9daeea823b Fixed concurrency issue on cache used for circles rendering
Without synchronization lock, concurrent rendering of images including
circles could lead to glitches as reported in issue #248
2018-11-10 22:00:49 +01:00
luccioman
61c337f29a Decode blacklist entries for easier edition of non ascii chars
Not using the JDK URLDecoder.decode() function, as it strips '+'
characters when they occur after '?' (both characters having regular
expression semantics when used in blacklist path patterns)
2018-10-04 09:33:58 +02:00
luccioman
ed93221fa1 Improved normalization of blacklist path patterns having non ascii chars
Normalize blacklist path patterns using percent-encoding, at pattern
edition in web interface and at loading from configuration files.

Fixes issue #237
2018-10-02 14:36:13 +02:00
luccioman
685122363d Added a parser for XZ compressed archives.
As suggested by LA_FORGE on mantis 781
(http://mantis.tokeek.de/view.php?id=781)
2018-08-15 10:07:39 +02:00
luccioman
26aa5f7a0f Suppress compilation warning on unit testing intentional failure 2018-07-08 08:14:07 +02:00
luccioman
f895745e1c Removed more unsafe concurrent accesses to SimpleDateFormat instances.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-29 15:49:55 +02:00
luccioman
e97580dfc7 Fixed unsafe conccurent access to generic SimpleDateFormat instances
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-28 14:59:23 +02:00
luccioman
cced94298a Added a new crawler document filter type using Solr syntax
This makes possbile to set up much more advanced document crawl filters,
by filtering on one or more document indexed fields before inserting in
the index.
2018-06-19 10:12:20 +02:00
luccioman
2c155ece77 Fixed JUnit test after removal of unused Transformer 2018-06-19 07:07:18 +02:00
luccioman
e357ade47d Reduced memory footprint of text snippet extraction
By not parsing and storing at first all sentences of a document, but
only on the fly the ones necessary to compute the snippet.
2018-05-13 10:29:52 +02:00
luccioman
e115e57cc7 Reduced text snippet extraction processing time.
By not generating MD5 hashes on all words of indexed texts, processing
time is reduced by 30 to 50% on indexed documents with more than 1Mbytes
of plain text.
2018-05-11 15:42:53 +02:00
luccioman
3b89c232db Easier tracking of longest text snippets initializations
When text snippets statistics are enabled and FINE log level is enabled
on the TextSnippetStatistics class.
2018-05-01 09:58:05 +02:00
luccioman
8d7099a081 Handle escaped line breaks and separators in vocabulary import from CSV 2018-02-15 07:29:17 +01:00
luccioman
eb20589e29 Fixed issue #158 : completed div CSS class ignore in crawl 2018-02-10 11:56:28 +01:00
luccioman
33593c22e9 Fixed loss of other modifiers on keywords/tags search navigation links 2018-02-06 17:17:13 +01:00
luccioman
9412881230 Added basic support for autotagging microdata annotated item types.
With the appropriate vocabulary settings in Vocabulary_p.html page, this
can produce Vocabulary search facets displaying item types referenced in
html documents by microdata annotation.
Tested notably, but not limited to, vocabulary classes/types defined by
Schema.org and Dublin Core.
2018-02-06 10:25:38 +01:00
luccioman
5a14d34a7d Refactoring : documented and extracted autotagging processing functions. 2018-02-02 10:27:36 +01:00
luccioman
58b9834729 Added HTML microdata typed items parsing capability.
This adds the possibility for the HTML parser to gather typed items URLs
annotated in HTML tags with itemscope and itemtype attributes (see
microdata specification https://www.w3.org/TR/microdata/ ), notably
Types from the schema.org vocabulary, but also Types/Classes from any
other vocabulary, such as the common ones listed in the RDFa core
context ( https://www.w3.org/2011/rdfa-context/rdfa-1.1.html ).
2018-02-02 09:31:40 +01:00
luccioman
fa6d030b0b Moved dbtest to the test source folder. 2018-01-29 14:03:01 +01:00
luccioman
098ee63911 Added a manual performance test for the HostBalancer.
Consequently to the report in mantis 776
(http://mantis.tokeek.de/view.php?id=776).

Running the perfs test with different control parameters seems to reveal
that the YaCy's RowHandleMap used in the balancer depthCache is finally
more efficient than for example the ConcurrentHashMap from JDK 8.
2018-01-28 12:41:56 +01:00
luccioman
46b5249c20 Removed time condition on HostBalancer initialization in JUnit test.
Its initialization in main application usage remains asynchronous.
2018-01-26 17:15:27 +01:00
luccioman
36e9b1c5b3 Fixed SegmentTest test case time dependant occasional failures
As highlighted by latest automated Travis builds.
2018-01-02 10:21:07 +01:00
Michael Peter Christen
b907819cb4 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2017-12-09 22:29:54 +01:00
Michael Peter Christen
25573bd5ab added a crawl filter based on <div> tag class names
When a crawl is started, a new field to exclude content from scraping is
available. The field can be identified with the class name of div tags.
All text contained in such a div tag where the configured class name(s)
match are not indexed, while the remaining page is indexed.
2017-12-09 22:29:35 +01:00
luccioman
d95b288f19 Removed use of deprecated Jetty IPAccessHandler for client filtering.
Upgraded to InetAccessHandler.
Added InetPathAccessHandler extension to InetAccessHandler to maintain
path patterns capability previously available in IPAccessHandler but
lost in InetAccessHandler.

Filtering on IPv6 addresses is now supported.

Support for deprecated pattern formats such as "192.168." and
"192.168.1.1/path" has been removed, but startup automated migration
should convert such patterns eventually present in serverClient.
2017-12-08 15:12:08 +01:00
luccioman
0a120787e3 Improved accuracy of URLs search filters : protocol, tld, host, file ext 2017-12-01 11:19:31 +01:00
luccioman
d1c7dfd852 Fixed URL parsing with fragment and empty path 2017-12-01 09:48:42 +01:00
luccioman
e2f6427a63 Added a basic JUnit test for the Visio parser (vsdParser) 2017-11-22 09:06:16 +01:00
luccioman
d41ad7af6f Restore initial locale at the end of a JUnit test case which modify it. 2017-11-20 18:50:49 +01:00
luccioman
7206f1ed71 Do locale neutral case conversions on domain names.
Required to properly run on systems with default locale set to Turkish
language, as with this locale the 'i' character has different upper and
lower case flavors than with other locales.
2017-11-20 18:47:46 +01:00
luccioman
398c66f06c Do locale neutral case conversions in MultiProtocolURL
For any relevant URL parts : host name, URL scheme, session ids or
technical parts (see https://url.spec.whatwg.org/#url-writing and
https://tools.ietf.org/html/rfc3986 for current standard references).

Remaining locale sensitive conversion used for detection of URL word
components in urlComps() makes sense but using detected language would
be preferable than using the default system locale.
2017-11-20 15:23:33 +01:00
luccioman
9531b83598 Do locale neutral case conversions in Classification
Required for people using Turkish language as their default system
locale, as with this locale the 'i' character has different upper and
lower case flavors than with other locales.
2017-11-20 09:48:46 +01:00
luccioman
ac209cac2e Updated the generic top-level known domains list.
Using current IANA reference list at
https://www.iana.org/domains/root/db

The generated URL hashes on these domains stay the same but performance
is greatly improved as a DNS resolve request is required on URL hash
computation when the TLD part of the host name is unknown.

Hash computation mean time measured on 1541 sample URLs (one on each
TLD) and a computer with a DSL connection : about 230ms before change,
then only 20ms.
2017-11-14 09:42:09 +01:00
luccioman
fcd57e2d0f Improved some JUnit tests isolation and resources release
The modified tests were successfull when run manually from an IDE such
as Eclipse, but failed occasionnally when run with maven as part of the
overall test suite.
2017-11-08 09:33:30 +01:00
luccioman
e0eda84c24 Remove old hard-coded holiday dates from DateDection class.
Replaced with rules based relative to current year as already done for a
part of the supported dates.
2017-11-07 19:02:09 +01:00
luccioman
73977ec0fe Added a html parser charset detection unit test 2017-11-06 09:14:03 +01:00
luccioman
285f0d6a39 Consistently encode snapshot image with format requested on the API.
Previously, calling /api/snapshot.png rendered JPEG encoded images.
2017-10-18 07:53:07 +02:00
luccioman
7c319c841e Fixed pdf2image conversion with imagemagick on PDFs having transparency
The target image format (jpeg) doesn't support transparency, so the
Html2ImageTest produced unusable black images when ran on a linux
machine having imagemagick package installed.
2017-10-16 19:45:17 +02:00
luccioman
fe75f326d8 Fixed ProfilingGraph calculation integer overflows and added test class.
Complementary to fix proposed in PR #128 by @otteresk.
2017-10-16 09:18:12 +02:00
luccioman
5bf76f058a Adjusted ResponseHeaderTest to succeed on slow or highly loaded CPU 2017-10-09 19:08:39 +02:00