Commit Graph

4213 Commits

Author SHA1 Message Date
luccioman
aaefd5219c Reduce log verbosity of RSS loader on feed items with no link 2018-03-20 10:09:17 +01:00
luccioman
cf62b571bd Added RSS reader support for enclosure feed item sub element.
Enclosure element (see
http://www.rssboard.org/rss-specification#ltenclosuregtSubelementOfLtitemgt
) can be seen for example in podcasts feeds.
2018-03-20 07:38:29 +01:00
luccioman
e5f5de0fc7 Added some JavaDoc to the RSSMessage class. 2018-03-19 11:15:31 +01:00
luccioman
0d7625ecfb Handle Solr fields restrict and alias in YaCy html and exml writers
Thus allowing for example to read more easily the local Solr index full
metadata in HTML by restricting if desired to some fields of interest.

See Solr documentation about the 'fl' (Field List) parameter at
https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html#CommonQueryParameters-Thefl_FieldList_Parameter
2018-03-16 11:35:42 +01:00
luccioman
3da2739bbd Parse and index more common audio metadata text tag fields. 2018-03-15 09:59:57 +01:00
luccioman
846aba00fa Added parsing of URLs possibly present in audio metadata tags 2018-03-13 23:08:52 +01:00
Michael Peter Christen
187075b878 added nav filter 2018-03-10 15:46:53 +01:00
luccioman
bcbd0ae1a4 Enabled partial parsing of audio resources. 2018-03-01 20:50:44 +01:00
luccioman
fda0189613 Updated audio file extensions with ones recently added to audioTagParser 2018-02-28 13:46:40 +01:00
luccioman
978e2be95b Give other parsers a chance on audioTagParser error
As done in all other parsers, ultimately falling back to the
genericParser, which creates a minimal index entry.
2018-02-28 12:27:17 +01:00
luccioman
9e5846a26e Small fix on svg parser error message 2018-02-28 12:23:52 +01:00
luccioman
11611dbdcf Reuse existing File copy function to handle audio parser tmp files 2018-02-28 11:58:32 +01:00
luccioman
f77f8f40f9 Factored audio parser tag processing 2018-02-28 08:19:13 +01:00
luccioman
9a7a353d0e Removed some unnecessary intermediate list creation on array copy. 2018-02-28 07:49:40 +01:00
luccioman
fb6457f5bc Fixed NPE case when an audio resource is parsed with a null tag 2018-02-28 07:31:32 +01:00
luccioman
c3ff50c17a Updated the list of audio file formats supported by the audioTagParser
Follows upgrade to Jaudiotagger dependency to version 2.2.5.
2018-02-27 18:04:12 +01:00
luccioman
1b90479a76 Added missing vocabulary navigator increment on results from RWI 2018-02-23 11:36:03 +01:00
luccioman
46c9da6428 Allow creation of vocabularies from remote CSV file URLs. 2018-02-21 08:41:13 +01:00
luccioman
17c7a85f18 Make StreamResponse usable in Java try-with-resources statements 2018-02-21 08:38:35 +01:00
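
As an illustration of what being usable in try-with-resources involves, here is a minimal sketch. It does not reproduce YaCy's StreamResponse: the class DemoStreamResponse and all of its methods are hypothetical names. The only firm requirement is that the type implements AutoCloseable, so that close() runs automatically when the try block exits.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a response object that owns a content stream. */
class DemoStreamResponse implements AutoCloseable {
    private final InputStream content;
    private boolean closed = false;

    DemoStreamResponse(final InputStream content) {
        this.content = content;
    }

    /** Convenience factory for the example: wraps an in-memory byte array. */
    static DemoStreamResponse ofBytes(final byte[] data) {
        return new DemoStreamResponse(new ByteArrayInputStream(data));
    }

    InputStream getContentStream() {
        return this.content;
    }

    boolean isClosed() {
        return this.closed;
    }

    /** Invoked automatically at the end of a try-with-resources block. */
    @Override
    public void close() {
        this.closed = true;
        try {
            this.content.close();
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one byte; the response is closed even if reading fails. */
    static int firstByteOf(final DemoStreamResponse response) {
        try (DemoStreamResponse r = response) {
            return r.getContentStream().read();
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller no longer needs an explicit finally block: after firstByteOf returns, isClosed() on the same instance reports true.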
luccioman
b67742336e Provide user interface messages on vocabulary creation read/write errors 2018-02-19 11:48:40 +01:00
luccioman
3e8dd90211 Use https rather than http in links and queries to openstreetmap.org 2018-02-15 19:14:07 +01:00
luccioman
3a973dbb23 Removed unused import 2018-02-14 09:27:17 +01:00
luccioman
e9527cd0e5 Reuse the same Pattern instance when matching multiple key/values 2018-02-14 07:14:25 +01:00
luccioman
dbf4c1cd76 Improved blacklist entries editing operations :
- Fixes issue #160 : handle properly syntax exceptions with a user
friendly message
- Fixes loss of information on multiple blacklist entries editions
- Fixes loss of entries when moving entries from one list to another
2018-02-13 18:24:26 +01:00
reger
87077b8fb6 Adjust and move Language Navigator to be member of the navigator plugin
list.
2018-02-12 00:16:34 +01:00
luccioman
eb20589e29 Fixed issue #158 : completed div CSS class ignore in crawl 2018-02-10 11:56:28 +01:00
luccioman
0cdee4e26a Fixed loss of "meanCount" search param when using facets or page buttons
Then on new search queries, no suggestions at all could be displayed.
2018-02-08 08:07:30 +01:00
luccioman
117a859879 Do not clear all search modifiers when unselecting one modifier.
Previously, when clicking a selected facet in the search results page to
unselect it, any other selected modifiers/facets were also
removed.
2018-02-07 15:54:46 +01:00
luccioman
33593c22e9 Fixed loss of other modifiers on keywords/tags search navigation links 2018-02-06 17:17:13 +01:00
luccioman
a9dc0874c0 Remove old query terms from search results suggestions links.
Especially when old terms were misspelled, suggestions links then
provided most of the time empty results.
2018-02-06 15:14:14 +01:00
luccioman
9412881230 Added basic support for autotagging microdata annotated item types.
With the appropriate vocabulary settings in Vocabulary_p.html page, this
can produce Vocabulary search facets displaying item types referenced in
html documents by microdata annotation.
Tested notably, but not limited to, vocabulary classes/types defined by
Schema.org and Dublin Core.
2018-02-06 10:25:38 +01:00
luccioman
5a14d34a7d Refactoring : documented and extracted autotagging processing functions. 2018-02-02 10:27:36 +01:00
luccioman
58b9834729 Added HTML microdata typed items parsing capability.
This adds the possibility for the HTML parser to gather typed items URLs
annotated in HTML tags with itemscope and itemtype attributes (see
microdata specification https://www.w3.org/TR/microdata/ ), notably
Types from the schema.org vocabulary, but also Types/Classes from any
other vocabulary, such as the common ones listed in the RDFa core
context ( https://www.w3.org/2011/rdfa-context/rdfa-1.1.html ).
2018-02-02 09:31:40 +01:00
luccioman
80fb1026d0 Create recrawl requests with the relevant crawl profile.
Recrawl default profile was previously effectively used for crawl
stacker acceptance check, but request entries were indeed still created
with the "snippetGlobalText" profile.
2018-01-30 21:00:18 +01:00
luccioman
539925a275 Added a utility to generate/update XLIFF master file from lng files. 2018-01-29 18:34:47 +01:00
luccioman
fa6d030b0b Moved dbtest to the test source folder. 2018-01-29 14:03:01 +01:00
luccioman
6cd3847d0a Fixed NullPointerException case on Table init with relative file path.
Can occur for example when running dbtest with relative test table file
name (without explicit parent folder).
2018-01-29 14:00:43 +01:00
luccioman
28883d8a71 Shutdown daemon threads at the end of dbtest 2018-01-29 13:56:37 +01:00
luccioman
929e0d6eae Replaced improper ByteBuffer.equals() implementation by Arrays.equals()
Renamed also ByteBuffer.equals() to startsWith() as this is the
appropriate function implementation semantics.
2018-01-29 13:38:25 +01:00
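
The distinction this commit draws, full equality versus prefix matching, can be sketched as follows. This is only an illustration under assumptions: PrefixMatchDemo and its methods are hypothetical and operate on plain byte arrays, not on YaCy's own ByteBuffer class.

```java
import java.util.Arrays;

/** Hypothetical helper contrasting full equality with prefix matching. */
class PrefixMatchDemo {

    /** True only when both arrays have the same length and the same bytes. */
    static boolean sameContent(final byte[] a, final byte[] b) {
        return Arrays.equals(a, b);
    }

    /** True when 'data' begins with every byte of 'prefix', in order. */
    static boolean startsWith(final byte[] data, final byte[] prefix) {
        if (prefix.length > data.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With data holding the bytes of "yacy" and prefix holding the bytes of "ya", startsWith is true while sameContent is false, which is why a method with prefix semantics should not be named equals().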
luccioman
46b5249c20 Removed time condition on HostBalancer initialization in JUnit test.
Its initialization in main application usage remains asynchronous.
2018-01-26 17:15:27 +01:00
luccioman
8b572b7337 Commit Solr index before simulating or starting recrawl job.
This ensures up-to-date simulation query results, and recrawl
processing.
2018-01-26 10:31:13 +01:00
luccioman
733cacdbb8 Revised the RDFaParser main launcher for minimal proper operation.
This parser is still not enabled in the main text parsers list. More
would have to be done to make it functional.
2018-01-25 07:57:56 +01:00
luccioman
7baa99f26f Fixed stored URL in web cache when redirection(s) occurs.
Associate cached content to the last redirection location, instead of
the first URL of a redirection(s) chain :
 - for proper base URL processing in parsers (fixes mantis 636 -
http://mantis.tokeek.de/view.php?id=636)
 - to prevent duplicated content in Solr index when recrawling a
redirected URL
2018-01-20 18:56:40 +01:00
luccioman
9ddf92d143 Removed unnecessary reflection usage for workflow tasks.
This improves code readability and maintainability (call hierarchies are
easier to read) and possibly performance.
2018-01-15 10:05:49 +01:00
luccioman
897d3d30cc Added new recrawl job profile to the list of default crawl profiles 2018-01-15 08:30:37 +01:00
luccioman
9624516bf8 Refresh recrawl job profile threshold date like other default profiles 2018-01-15 08:06:28 +01:00
luccioman
b712a0671e Added a specific default crawl profile for the recrawl job.
- with only a light constraint on known indexed documents load date, as it
can already be controlled by the selection query, and the goal of the
job is indeed to recrawl selected documents now
- using the iffresh cache strategy
2018-01-13 15:46:04 +01:00
luccioman
adf3fa493d Added comments about crawl profiles recrawl cycles 2018-01-13 12:13:04 +01:00
luccioman
3638e16c2e More comprehensive log on rejected recrawls caused by date constraint 2018-01-13 12:07:56 +01:00
luccioman
d47afe6fab Use a constant for crawler reject reason prefix with specific processing 2018-01-13 10:45:00 +01:00
luccioman
4e03335625 Added more details to the recrawl job report 2018-01-12 11:47:13 +01:00
luccioman
6425963cee Fixed internal tables exact value match iterator 2018-01-10 18:38:42 +01:00
luccioman
0c9e0b3566 Record recrawl calls to make them schedulable 2018-01-10 17:05:53 +01:00
luccioman
433e241e4f Added a report info box about the last terminated recrawl job, if any
For easier monitoring of recrawls.
2018-01-09 22:33:15 +01:00
luccioman
b2af25b14f Added a stop condition to the Recrawl busy thread 2018-01-09 10:22:26 +01:00
luccioman
421728d25a Made it possible to customize selection query before launching a recrawl 2018-01-08 21:20:46 +01:00
luccioman
36e9b1c5b3 Fixed SegmentTest test case time-dependent occasional failures
As highlighted by latest automated Travis builds.
2018-01-02 10:21:07 +01:00
luccioman
8a4ea1c11e Added UI switch to control content domain constraint per search request 2018-01-02 08:13:14 +01:00
reger
f8071ac8ae Make TokenizedStringNavigator (used for keyword search facet) active
check case insensitive.
As keywords are compared lower case, make sure user input keyword:Key
or keyword:key will be shown as active in facet entry key.
2017-12-28 02:51:52 +01:00
luccioman
e6907fdab3 Added optional search parameter/setting to control content domain filter
This allows choosing, in configuration or per search request, whether
to extend results beyond the strict content domain filter (image,
video, audio or application).

Related graphical controls to be added to user interface.
2017-12-23 18:56:17 +01:00
luccioman
f52217c939 Enable full size images preview for users with extended search rights 2017-12-22 11:39:30 +01:00
luccioman
09c4ee56a7 Added optional https support for remote crawl and profile operations 2017-12-21 18:41:32 +01:00
luccioman
5db1c9155a Do locale independent case conversion on hosts, schemes, and file exts.
Required for proper operation when the default system locale is Turkish,
as dotless and dotted i characters have specific case conversion rules
in this language.
2017-12-19 13:52:05 +01:00
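
The Turkish locale problem can be reproduced with standard Java alone. The sketch below is illustrative, not YaCy code: the class LocaleNeutralCase and its methods are hypothetical. It relies only on String.toLowerCase(Locale) and Locale.ROOT from the JDK.

```java
import java.util.Locale;

/** Hypothetical helper showing locale-neutral versus locale-sensitive lower-casing. */
class LocaleNeutralCase {

    /** Lower-cases a technical token (scheme, host, file extension) independently of the system locale. */
    static String normalizeToken(final String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    /** Lower-cases with an explicit locale, as toLowerCase() does implicitly with the default locale. */
    static String lowerWithLocale(final String text, final Locale locale) {
        return text.toLowerCase(locale);
    }
}
```

With the Turkish locale (Locale.forLanguageTag("tr")), lowerWithLocale("FILE", ...) yields "f\u0131le", containing the dotless i (U+0131), so a comparison with "file" fails. normalizeToken("FILE") returns "file" whatever the default locale is.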
luccioman
1c4803e40a Enable optional https support for /yacy/transferURL API calls.
Also updated some Javadoc and consistently use Switchboard instance as a
constructor parameter where relevant.
2017-12-19 12:30:49 +01:00
luccioman
c6e1befbca Restored peer URL host name stripping removed from previous commit.
Still useful for peers with IPv6 addresses.
2017-12-15 17:03:35 +01:00
luccioman
17e004599d Started implementing optional https preference for protocol operations
Introduced through the new configurable setting
network.unit.protocol.https.preferred, defaulting to false for now.

Allows preferring https, when available on remote peers, to
perform YaCy protocol operations, notably hello or transferRWI.

Not yet implemented for every YaCy protocol operation.
2017-12-15 11:28:46 +01:00
Michael Peter Christen
b907819cb4 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2017-12-09 22:29:54 +01:00
Michael Peter Christen
25573bd5ab added a crawl filter based on <div> tag class names
When a crawl is started, a new field to exclude content from scraping is
available. The field can be identified with the class name of div tags.
All text contained in such a div tag where the configured class name(s)
match are not indexed, while the remaining page is indexed.
2017-12-09 22:29:35 +01:00
luccioman
d95b288f19 Removed use of deprecated Jetty IPAccessHandler for client filtering.
Upgraded to InetAccessHandler.
Added InetPathAccessHandler extension to InetAccessHandler to maintain
path patterns capability previously available in IPAccessHandler but
lost in InetAccessHandler.

Filtering on IPv6 addresses is now supported.

Support for deprecated pattern formats such as "192.168." and
"192.168.1.1/path" has been removed, but automated migration at startup
should convert any such patterns present in serverClient.
2017-12-08 15:12:08 +01:00
reger
cc7a93e6b6 remove deprecated jetty continuation class from urlproxyservlet
(was a long time carry over, while not supporting async requests)
2017-12-08 01:01:07 +01:00
Michael Peter Christen
607b39b427 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
	htroot/yacysearchitem.java
2017-12-07 15:25:41 +01:00
Michael Peter Christen
4355de0f3c (more!) evaluation of XRealIP from nginx reverse proxy 2017-12-07 15:16:11 +01:00
luccioman
a4494d6e01 Improved support for internationalized domain names on "site:" modifier
Allow typing directly internationalized domain names including non ASCII
characters in the search field. 
Search is done using the ASCII Compatible Encoding (ACE) representation.
2017-12-04 18:23:26 +01:00
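
The conversion to the ASCII Compatible Encoding can be illustrated with the JDK's java.net.IDN class. This is a sketch under assumptions: whether YaCy uses this exact class is not stated in the commit message, and SiteModifierIdn and toAce are hypothetical names.

```java
import java.net.IDN;

/** Hypothetical helper converting a host typed by the user to its ACE form. */
class SiteModifierIdn {

    /** Returns the ASCII Compatible Encoding (Punycode with "xn--" prefix) of a host name. */
    static String toAce(final String host) {
        return IDN.toASCII(host);
    }
}
```

For example, toAce("b\u00fccher.example") returns "xn--bcher-kva.example", while a plain ASCII host such as "example.org" is returned unchanged.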
luccioman
d07006bac4 Do locale independent case conversion on "filetype:" query modifier. 2017-12-04 14:11:29 +01:00
luccioman
8fbf25d1ed Made "site:" query modifier case insensitive. 2017-12-04 14:08:34 +01:00
luccioman
867388e05b Refactored 'site:' query modifier parsing into a dedicated function. 2017-12-04 13:58:15 +01:00
luccioman
c9d80b5b77 Prefer fine URL match over approximate URL mask regex on final filtering
Also prevent adding a redundant and CPU costly Solr url mask filter
query when possible
2017-12-01 11:52:52 +01:00
luccioman
0a120787e3 Improved accuracy of URLs search filters : protocol, tld, host, file ext 2017-12-01 11:19:31 +01:00
luccioman
d1c7dfd852 Fixed URL parsing with fragment and empty path 2017-12-01 09:48:42 +01:00
luccioman
e07ef1b610 Apply tld query modifier on Solr host_s mandatory field.
The filter has thus much more chances to be effective than when applied
on the optional field host_dnc_s.
2017-12-01 08:46:46 +01:00
luccioman
478e92deff Fixed url mask filter generated when protocol modifier is not null 2017-11-30 20:21:45 +01:00
luccioman
29de4a65d7 Refactored url mask filter build from query modifiers
For better readability and easier unit testing.
2017-11-30 09:20:32 +01:00
reger
d5a75537e4 remove redundant setting of timeout for remoteinstance
and replace deprecated updatesolrclient instantiation with recommended builder
2017-11-26 02:53:51 +01:00
luccioman
f01aac31fd Made it possible to use https for remote search on peers with SSL enabled.
Default is still http to prevent any regressions, but a new setting is
available to choose https as the preferred protocol to perform remote
searches. 
New configuration setting 'remotesearch.https.preferred' is manually
editable in yacy.conf file or in Advanced Properties page
(/ConfigProperties_p.html).
Should be enabled as default in the future for improved privacy. 
Https could also eventually be used for other peers communications.
2017-11-24 14:10:41 +01:00
luccioman
e2f6427a63 Added a basic JUnit test for the Visio parser (vsdParser) 2017-11-22 09:06:16 +01:00
luccioman
1e9cdaabd4 Do locale neutral case conversion of HTML charset name.
Required to properly run on systems with default locale set to Turkish
language, as with this locale the 'i' character has different upper and
lower case flavors than with other locales.
2017-11-20 18:52:45 +01:00
luccioman
7206f1ed71 Do locale neutral case conversions on domain names.
Required to properly run on systems with default locale set to Turkish
language, as with this locale the 'i' character has different upper and
lower case flavors than with other locales.
2017-11-20 18:47:46 +01:00
luccioman
398c66f06c Do locale neutral case conversions in MultiProtocolURL
For any relevant URL parts : host name, URL scheme, session ids or
technical parts (see https://url.spec.whatwg.org/#url-writing and
https://tools.ietf.org/html/rfc3986 for current standard references).

Remaining locale sensitive conversion used for detection of URL word
components in urlComps() makes sense but using detected language would
be preferable to using the default system locale.
2017-11-20 15:23:33 +01:00
luccioman
9531b83598 Do locale neutral case conversions in Classification
Required for people using Turkish language as their default system
locale, as with this locale the 'i' character has different upper and
lower case flavors than with other locales.
2017-11-20 09:48:46 +01:00
luccioman
d22fc0d0a2 Updated lists of known sponsored and country-code TLDs.
Using current IANA reference list at
https://www.iana.org/domains/root/db .

As for previous update on known generic TLDs list, the generated URL
hashes on these domains stay the same but it improves performance of URL
hash computation for URLs on these domains.
2017-11-16 09:50:55 +01:00
luccioman
ac209cac2e Updated the generic top-level known domains list.
Using current IANA reference list at
https://www.iana.org/domains/root/db

The generated URL hashes on these domains stay the same but performance
is greatly improved as a DNS resolve request is required on URL hash
computation when the TLD part of the host name is unknown.

Hash computation mean time measured on 1541 sample URLs (one on each
TLD) and a computer with a DSL connection : about 230ms before change,
then only 20ms.
2017-11-14 09:42:09 +01:00
luccioman
938d8a9731 Added some JavaDoc 2017-11-14 09:24:13 +01:00
luccioman
e0eda84c24 Remove old hard-coded holiday dates from DateDetection class.
Replaced with rules relative to the current year, as already done for
part of the supported dates.
2017-11-07 19:02:09 +01:00
luccioman
cb10daba92 Renamed Chinese & Greek lng files using ISO639-1 codes.
Previously named with their ISO 3166-1 country code : this way, when
setting language to "Browser" in ConfigBasic.html, it didn't work
properly when browser preferred language was Chinese or Greek as their
respective language codes are "zh" and "el" (not "cn" and "gr" which are
their country codes)
2017-11-04 11:06:05 +01:00
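
Why language codes rather than country codes are the right key can be sketched with java.util.Locale. The class LngFileCodes, the method lngFileFor and the "<code>.lng" file naming pattern are assumptions made for illustration; the commit message only states that the files are now named after ISO 639-1 codes.

```java
import java.util.Locale;

/** Hypothetical helper mapping a browser language tag to a translation file name. */
class LngFileCodes {

    /** Extracts the ISO 639-1 language part of a tag such as "zh-CN" and appends ".lng". */
    static String lngFileFor(final String browserLanguageTag) {
        final String language = Locale.forLanguageTag(browserLanguageTag).getLanguage();
        return language + ".lng";
    }
}
```

A browser preferring Chinese typically sends a tag like "zh-CN", whose language part is "zh", so lngFileFor("zh-CN") gives "zh.lng". Likewise "el-GR" gives "el.lng". A file named after the country code ("cn" or "gr") would never be matched this way.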
luccioman
46f37e38dc Customized Threads with generic name for easier monitoring. 2017-10-31 08:53:17 +01:00
luccioman
046be566e1 Updated a license header typo. 2017-10-30 07:38:47 +01:00
Apply55gx
3c905a2a5c fix typo 2017-10-27 14:00:30 +02:00
luccioman
8e732d437c Enable HTTP Digest authentication for non admin users.
Also ensure authentication is not lost by Digest timeout when navigating
between index.html and search results page.

This way, running searches with extended features on a remote peer or a
password protected peer works with a regular user (with "Extended
search" rights). 
When authenticating on the search page with a user without "Extended
search" rights, it appears as authenticated, but has just its usual
access to the public search features.
2017-10-26 07:51:18 +02:00
luccioman
d8eaf621cc Fixed blacklist returned location URL on empty parameters 2017-10-24 09:30:21 +02:00
luccioman
af198b990b Added an optional login link/status to the search public top nav bar.
Thus allowing a more convenient way (without the need to go to the admin
section) to login when searching on your remote or password protected
peer and benefit from extended search features such as Heuristics,
Bookmarking or JavaScript re-sorting.

Can be disabled using the ConfigSearchPage_p.html.
2017-10-21 10:57:36 +02:00