Commit Graph

8786 Commits

Author SHA1 Message Date
luccioman
ed93221fa1 Improved normalization of blacklist path patterns having non ascii chars
Normalize blacklist path patterns using percent-encoding, at pattern
edition in web interface and at loading from configuration files.

Fixes issue #237
2018-10-02 14:36:13 +02:00
luccioman
2a73b63d9e Use a constant default target file name for seed SCP upload method
To make seed upload (in /Settings_p.html?page=seed page) with SCP easier
when the user specify a remote target directory path.

See report by @vikulin in issue #227
2018-09-16 10:37:47 +02:00
luccioman
b5eabb626f Removed some dead code 2018-09-14 14:02:32 +02:00
luccioman
db7ad76366 Improved support for Java logs file pattern options
- support of "%h" and "%t" pattern components
- more proper initialization of file handler when the data folder is not
the default one, notably to prevent a non blocking but ugly error stack
trace reported by the log manager at startup with that kind of setup
2018-09-13 12:17:02 +02:00
luccioman
7adbd1f87d Fixed raw IPV6 addresses snapshots read/write on FAT32 and NTFS fs
Fixes issue #225
2018-09-12 17:34:40 +02:00
luccioman
9b1c87033b Fixed logs folder checking and creation
Previously, if YaCy log folder was for example at
`/home/user/yacy/DATA/LOG`, because of improper truncation of log path,
an unnecessary directory creation was atempted at `/home/us`.
2018-08-31 08:34:28 +02:00
luccioman
c29588dd6a Made possible to provide an absolute data root path for start script
Previously, only a path relative to the user home folder could be
provided
2018-08-30 18:16:22 +02:00
luccioman
d03c098b54 Removed deprecated warning comments about imports and Debian installer
Deprecated by commit be5d3a1066 , as
classpath is now defined in yacycore.jar Manifest file.
2018-08-22 22:35:00 +02:00
luccioman
5b60b4225f Fixed encoding of '+' character on search pages links
As revealed by issue #216
2018-08-20 18:44:04 +02:00
luccioman
54fbe166ba Updated pdf cache clear steps consistently with current pdfbox version
- Removed calls to no more existing clearResources functions (on PDFont
class and its children) since upgrade to pdfbox 2.n.n
- Removed hacky usage of protected internal ClassLoader function. This
removes the warnings displayed when running with JDK9 or JDK10 :

     [java] WARNING: Illegal reflective access by
net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to
method java.lang.ClassLoader.findLoadedClass(java.lang.String)
     [java] WARNING: Please consider reporting this to the maintainers
of net.yacy.document.parser.pdfParser$ResourceCleaner
     [java] WARNING: Use --illegal-access=warn to enable warnings of
further illegal reflective access operations
     [java] WARNING: All illegal access operations will be denied in a
future release

Crawling thousands of pdf documents from various sources after
modifications applied, revealed no new memory leak related to pdfbox
(measurements done with JVisualVM).
2018-08-16 18:23:42 +02:00
luccioman
685122363d Added a parser for XZ compressed archives.
As suggested by LA_FORGE on mantis 781
(http://mantis.tokeek.de/view.php?id=781)
2018-08-15 10:07:39 +02:00
luccioman
4ee14ff3c5 Fixed NullPointerException case on malformed crawl queue folder name 2018-08-13 14:35:26 +02:00
luccioman
21ad9435ec Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems
As reported by @vikulin in issue #187, crawling websites using a raw
IPv6 address as host name in their URL failed when running on Microsoft
Windows platforms (FAT32 or NTFS filesystems) when YaCy crawler created
the crawl queue folder, as the ':' character which is part of an IPV6
address is forbidden on these filesystems.
2018-08-11 10:02:26 +02:00
luccioman
8a29551c54 Upgraded the OpenGeoDB dump URL
The status of the library in the DictionaryLoader_p.html page now also
advertises the user that an upgrade can be applied when an older dump is
already loaded.

Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter
chat.
2018-08-03 18:39:41 +02:00
luccioman
373edf9eac Adjusted yjson Solr writer to support responses from an external Solr
Worked previously only with responses from YaCy embedded Solr, now able
to render the response when YaCy is configured to use an external Solr
index.
2018-07-31 16:22:21 +02:00
luccioman
87bd17b1cf Simplified a little bit the RSS OpenSearch Solr writer 2018-07-31 16:02:50 +02:00
luccioman
dc49ca9c27 Fixed a NPE case on the Solr OpenSearch response writer
Occurred when omitHeader parameter is set to true
2018-07-29 16:30:37 +02:00
luccioman
f4267ed247 Made Solr OpenSearch RSS writer compatible with external Solr index
Worked previously only with responses from YaCy embedded Solr, now able
to render the response when YaCy is configured to use an external Solr
index.
2018-07-28 11:03:31 +02:00
luccioman
b1410f593a Fixed stylesheet relative URLs rendering in Solr html writer
Relative URLs to CSS stylesheets were not properly rendered when using
the Solr html response writer and the "/solr/collection1/select" entry
point instead of "/solr/select".
2018-07-25 08:03:25 +02:00
luccioman
89c59814da Improved rendering of the Solr api relative url in the html writer
In order to have a consistent relative url when using either
/solr/select or /solr/collection1/select entry point.
2018-07-24 10:13:55 +02:00
luccioman
bf4f320b16 Optionally render the response header when using the Solr html writer
With params rendered as html input fields for conveniently modifying
params values and refreshing results.
2018-07-23 18:36:57 +02:00
luccioman
313204ae2c Override qf and df Solr params with defaults only when they are not set 2018-07-23 13:50:24 +02:00
luccioman
bdafb14336 Removed redundant synchronization lock on network switch function
Was useless as done in an already synchronized block, and the lock
object was assigned a new value in that same block, and nowhere else a
lock is requested on that same object.
2018-07-16 09:20:23 +02:00
luccioman
d5f44ea216 Removed unnecessary synchronization lock from serverSwitch constructor
Lock was useless here as it was set on an object instance attribute
while the object itself is not yet constructed and no other threads can
access it.
2018-07-16 09:13:50 +02:00
luccioman
dcad393fe5 Fixed exceeding max size of failreason_s Solr field on large link list
When using the 'From Link-List of URL' as a crawl start, with lists in
the order of one or more thousands of links, the failreason_s Solr field
maximum size (32kb) was exceeded by the string representation of the URL
must-match filter when a crawl URL was rejected because not matching.
2018-07-11 08:13:29 +02:00
luccioman
f467601561 Properly lock solrInstances for reboot and restoration of embedded Solr
Putting a synchronization lock directly on the solrInstances property
was ineffective as it is assigned a new (unlocked) instance in these
operations.
2018-07-08 08:57:59 +02:00
luccioman
9630f81306 Fixed small unnecessary lines of code 2018-07-08 08:15:26 +02:00
luccioman
876bcd2f54 Fixed useless comparison between int parameter and Long.MAX_VALUE 2018-07-08 08:11:01 +02:00
luccioman
c726154a59 Fixed removal of URLs from the delegatedURL remote crawl stack
URLs were removed from the stack using their hash as a bytes array,
whereas the hash is stored in the stack as String instance.
2018-07-05 09:36:36 +02:00
luccioman
2bdd71de60 Added server side columns sorting on the Process Scheduler table
For easier usage of large tables in the Table_API_p.html page.
2018-07-04 10:28:32 +02:00
luccioman
bb51555830 Removed remaining unsafe accesses to SimpleDateFormat instances.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-07-02 10:00:40 +02:00
luccioman
f895745e1c Removed more unsafe concurrent accesses to SimpleDateFormat instances.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-29 15:49:55 +02:00
luccioman
e97580dfc7 Fixed unsafe conccurent access to generic SimpleDateFormat instances
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-28 14:59:23 +02:00
luccioman
8811700e2e Upgraded Jetty dependency from 9.4.9 to 9.4.11 2018-06-20 09:33:26 +02:00
luccioman
d53c33e4ef Fixed potential infinite loop case (does not occur in current code base) 2018-06-20 07:51:59 +02:00
luccioman
a15ac8e0ca Made CrawlProfile loading tolerant to malformed json string attribute 2018-06-19 12:53:17 +02:00
luccioman
a715bb7876 Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml 2018-06-19 12:50:28 +02:00
luccioman
0b302c5004 Do not block whole server startup on persisted crawl profile load error 2018-06-19 12:48:17 +02:00
luccioman
4d9aa4ed1e Fixed default crawl profile solr mustnotmatch query from previous commit 2018-06-19 11:58:47 +02:00
luccioman
cced94298a Added a new crawler document filter type using Solr syntax
This makes possbile to set up much more advanced document crawl filters,
by filtering on one or more document indexed fields before inserting in
the index.
2018-06-19 10:12:20 +02:00
Michael Christen
e0dc632020 removed transformer
it was not used any more
2018-06-19 00:42:23 +02:00
luccioman
9bc7b6c39d Allow edtion of scheduled next execution dates for finer control
Can be useful more especially when scheduling many API calls over a long
period of time to precisely adjust each scheduled date/time.
2018-06-11 11:38:58 +02:00
luccioman
40e8c7b89b Use the heavy ConcurrentUpdateSolrClient only when necessary
Prefer the lightweight HttpSolrClient when no updates are performed on
the remote Solr instance, as recommended by Solr documentation itself.
2018-06-08 11:18:29 +02:00
luccioman
bd4cfeda3f Add a max acceptable limit to the size of Solr responses on p2p search
Following activation of gzip compression on responses, to ensure
uncompressed content can fit on available memory.
2018-06-08 10:33:23 +02:00
luccioman
de4ea95687 Consistently allow gzip compression of remote Solr responses
Was already enabled when requesting remote Solr with https or with
authentication (as an external Solr index)
2018-06-07 15:20:37 +02:00
luccioman
cea8187161 Reuse expired connections evictors threads provided by apache and solr 2018-06-06 14:24:05 +02:00
luccioman
b5dc1f376f Made outgoing pools max total connections user configurable
For a finer control over the maximum simultaneously active outgoing
connections.
2018-06-06 09:36:50 +02:00
luccioman
387d646c0e Added gzip compression of responses returned to user-agents accepting it
Enabled as default, but can be disabled using the "Server Access
Settings" admin page.
2018-06-05 13:35:39 +02:00
luccioman
a7a4ba3287 Apply remote solr configured timeout on getting connection from pool 2018-06-02 17:38:14 +02:00
luccioman
ee6670fb8f Use a common pooled http connection manager for remote solr instances
For a better control on the maximum simultaneous outgoing http
connections, as already done for any other http connections (crawls, rwi
search, p2p protocol) using the net.yacy.cora.protocol.http.HTTPClient
2018-05-29 09:24:21 +02:00
luccioman
d28f9ba0f6 Removed use of deprecated ConcurrentUpdateSolrClient constructor 2018-05-26 21:00:24 +02:00
luccioman
8a749aa5ad Trace level log message for monitoring remote solr response times 2018-05-26 20:58:05 +02:00
luccioman
35826a3091 Added a search page customization setting to display or not favicons
If not interested in displaying this on your search results and notably
on a peer with limited resources this can help saving some CPU and
outgoing network connections.
2018-05-25 11:13:43 +02:00
luccioman
0082b5ab2a Added missing default Solr http client connection timeout initialization
Consistently with the custom Solr http client used for https connections
to remote Solr peers or to YaCy external Solr storage.

This prevent remote Solr requests threads to wait for establishing a
connection to a remote peer longer than the configured timeout.
2018-05-24 09:24:52 +02:00
luccioman
fa4399d5d2 Small perf improvement : initialize threads names early when possible
Initializing Thread names using the Thread constructor parameter is
faster as it already sets a thread name even if no customized one is
given, while an additional call to the Thread.setName() function
internally do synchronized access, eventually runs access check on the
security manager and performs a native call.

Profiling a running YaCy server revealed that the total processing time
spent on Thread.setName() for a typical p2p search was in the range of
seconds.
2018-05-23 14:45:35 +02:00
luccioman
84d82bfdd7 Adjusted suggestions timeout management
* less CPU usage using the Solr 'allowedTime' parameter
* increase chances to get some results even when a first operation step
goes in time out by letting some time for final snippets results
processing
2018-05-21 14:49:43 +02:00
luccioman
65854bcb22 Fixed NullPointerException when omitHeader=true on external Solr server 2018-05-18 11:30:14 +02:00
luccioman
c4d984cec8 Fixed Solr response header duplication when requesting external Solr 2018-05-18 11:28:30 +02:00
luccioman
124cc24aa3 Properly handle embedded Solr partial results
Solr can provide partial results for example when a processing time
limit (specified with the parameter `timeAllowed`) is exceeded.

Before this fix, getting partial results from an embedded Solr index
resulted in a ClassCastException :
"org.apache.solr.common.SolrDocumentList cannot be cast to
org.apache.solr.response.ResultContext".
2018-05-18 10:14:54 +02:00
luccioman
3ce44cf250 Fixed largest snippet get : don't reject ones starting with a space char 2018-05-14 18:26:25 +02:00
luccioman
f511e16d50 Prevent duplication of Solr query highlight fields parameters
That was caused by concurrent modifications (with addHighlightField()
function) to the same SolrQuery instance when requesting Solr on remote
peers in p2p search.
2018-05-14 15:26:44 +02:00
luccioman
e357ade47d Reduced memory footprint of text snippet extraction
By not parsing and storing at first all sentences of a document, but
only on the fly the ones necessary to compute the snippet.
2018-05-13 10:29:52 +02:00
luccioman
e115e57cc7 Reduced text snippet extraction processing time.
By not generating MD5 hashes on all words of indexed texts, processing
time is reduced by 30 to 50% on indexed documents with more than 1Mbytes
of plain text.
2018-05-11 15:42:53 +02:00
sgaebel
4b79851e12 corrected icons_sizes_sxt to SolrType.string 2018-05-01 14:04:15 +02:00
luccioman
3b89c232db Easier tracking of longest text snippets initializations
When text snippets statistics are enabled and FINE log level is enabled
on the TextSnippetStatistics class.
2018-05-01 09:58:05 +02:00
luccioman
3c4344cb12 Fixed text snippet max init time statistic rendering 2018-05-01 09:39:41 +02:00
reger
a8234b7ea7 Make sure for image resource url enabled index image pixel size fields are filled
if at least one of the image size fields is enabled in index (images_height_val,
images_width_val, images_pixel_val). 
Previously all fields were required to be enabled (hint: default setting 
is height + width enabled)
2018-04-30 04:59:34 +02:00
luccioman
e67df103b5 Removed more remaining uses of deprecated Seed.getIP() function. 2018-04-29 08:26:53 +02:00
luccioman
addd18c993 Removed some remaining uses of deprecated Seed.getIP() 2018-04-26 09:39:30 +02:00
luccioman
c35d0568b6 Support for preferred https in peers communication on more operations 2018-04-24 08:08:24 +02:00
luccioman
e914d17aca Updated call to function deprecated since commons-codec version 1.11 2018-04-23 08:07:56 +02:00
luccioman
a3ec7a7a5f Added analysis optional setting to compute statistics on text snippets
Thus producing some basic stats on processing times for snippets
generation and counts on snippets per source type.
2018-04-15 09:55:08 +02:00
luccioman
1889d484de Added Solr HTML writer support for responses from remote instances 2018-04-12 09:23:00 +02:00
luccioman
2af3bf79c7 Improve rendering of remote Solr admin URLs
- properly handle IPv6 loopback address replacement
 - replace loopback address or host only when accessing peer remotely
 - replace loopback part with the peer hostname as requested rather than
with its seed public IP as this works better for Intranet mode and when
peer is behind a reverse proxy.
2018-04-10 11:15:31 +02:00
luccioman
bb74de7d59 Removed unnecessary "/admin" suffix from remote Solr instance admin URL
For quite quite a long time now, the Solr /admin URL suffix indeed
redirects to the Solr base context (see
https://issues.apache.org/jira/browse/SOLR-3337)
2018-04-09 00:01:45 +02:00
luccioman
0d34034f17 Ensure an embedded Solr is available for Solr dump/restore operations
Otherwise, these operations triggered NullPointerException when only an
external Solr index is attached.
2018-04-07 13:42:06 +02:00
luccioman
d92b191942 Ensure no remote Solr is attached before "Shut Down and Re-Start Solr"
Otherwise once this operation is applied, the remote Solr(s) instances
are deconnected and the embedded Solr is connected even if disabled by
setting "core.service.fulltext".

Also use constants for related default setting values.
2018-04-06 20:34:54 +02:00
luccioman
26d8ad591c Adjusted Solr select servlet output when using an external Solr only
- Use the EnhancedXMLResponseWriter only when requested output is "exml"
- Use the Standard Solr writers when possible, for example for json, xml
or javabin output formats
- Return an error when the requested format can not been rendered with
an external Solr server only

Important : this modification is necessary for peers using exclusively
an external Solr server to be reachable as robinson targets in p2p
search, as the binary format ("javabin") is the default Solr exchange
format for peers.
Before this, when a peer requested a remote one attached only to an
external Solr (no embedded one), it ended with "Invalid type" error, as
the remote peer answered with xml although binary format was requested.
2018-04-06 15:16:54 +02:00
luccioman
69690c13a0 Optionally allow external Solr server with self-signed certificate
This is necessary when you want to attach to a dedicated external Solr
server protected with basic http authentication and requested over https
but having only a self-signed certificate.
2018-04-04 18:16:26 +02:00
luccioman
b882f85900 Fixed NPE case in Solr select servlet on external Solr only setup
Regression introduced with commit
0d7625ecfb
2018-04-03 15:36:17 +02:00
luccioman
2fd4d05e2f Added a shared Java constant for setting key server.servlets.called 2018-04-02 15:16:10 +02:00
luccioman
ba9cd14516 Removed hard-coded patch for Solr 5.0 on ranking boost function
The current default boost function
(`recip(ms(NOW,last_modified),3.16e-11,1,1)`) for the Date ranking
profile is indeed working fine.
What can trigger the error `unexpected docvalues type NUMERIC for field
'last_modified'` is the previous default boost function (quite old now)
or any custom one using the Solr `ord` or `rord` functions on the
last_modified field.
Then the problem was that the migration code in the Switchboard supposed
to detect the old date boost function was incorrect (one trailing right
parenthesis in excess), so the deprecated function remained.

This fixes issue #169.
2018-03-26 16:24:27 +02:00
luccioman
fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME) 2018-03-23 10:28:19 +01:00
luccioman
e45afedee4 Added support for enclosures (media links) to the RSS loader 2018-03-21 08:22:29 +01:00
luccioman
aaefd5219c Reduce log verbosity of RSS loader on feed items with no link 2018-03-20 10:09:17 +01:00
luccioman
cf62b571bd Added RSS reader support for enclosure feed item sub element.
Enclosure element (see
http://www.rssboard.org/rss-specification#ltenclosuregtSubelementOfLtitemgt
) can be seen for example in podcasts feeds.
2018-03-20 07:38:29 +01:00
luccioman
e5f5de0fc7 Added some JavaDoc to the RSSMessage class. 2018-03-19 11:15:31 +01:00
luccioman
0d7625ecfb Handle Solr fields restrict and alias in YaCy html and exml writers
Thus allowing for example to read more easily the local Solr index full
metadata in HTML by restricting if desired to some fields of interest.

See Solr documentation about the 'fl' (Field List) parameter at
https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html#CommonQueryParameters-Thefl_FieldList_Parameter
2018-03-16 11:35:42 +01:00
luccioman
3da2739bbd Parse and index more common audio metadata text tag fields. 2018-03-15 09:59:57 +01:00
luccioman
846aba00fa Added parsing of URLs eventually present in audio metadata tags 2018-03-13 23:08:52 +01:00
Michael Peter Christen
187075b878 added nav filter 2018-03-10 15:46:53 +01:00
luccioman
bcbd0ae1a4 Enabled partial parsing of audio resources. 2018-03-01 20:50:44 +01:00
luccioman
fda0189613 Updated audio file extensions with ones recently added to audioTagParser 2018-02-28 13:46:40 +01:00
luccioman
978e2be95b Let a chance for other parsers on audioTagParser error
As done in all other parsers, eventually falling back in the end to the
genericParser which creates a minimal index entry.
2018-02-28 12:27:17 +01:00
luccioman
9e5846a26e Small fix on svg parser error message 2018-02-28 12:23:52 +01:00
luccioman
11611dbdcf Reuse existing File copy function to handle audio parser tmp files 2018-02-28 11:58:32 +01:00
luccioman
f77f8f40f9 Factored audio parser tag processing 2018-02-28 08:19:13 +01:00
luccioman
9a7a353d0e Removed some unnecessary intermediate list creation on array copy. 2018-02-28 07:49:40 +01:00
luccioman
fb6457f5bc Fixed NPE case when on audio resource parsed with null tag 2018-02-28 07:31:32 +01:00
luccioman
c3ff50c17a Updated the list of audio file formats supported by the audioTagParser
Follows upgrade to Jaudiotagger dependency to version 2.2.5.
2018-02-27 18:04:12 +01:00
luccioman
1b90479a76 Added missing vocabulary navigator increment on results from RWI 2018-02-23 11:36:03 +01:00
luccioman
46c9da6428 Allow creation of vocabularies from remote CSV file URLs. 2018-02-21 08:41:13 +01:00
luccioman
17c7a85f18 Make StreamResponse usable in Java try-with-resources statements 2018-02-21 08:38:35 +01:00
luccioman
b67742336e Provide user interface messages on vocabulary creation read/write errors 2018-02-19 11:48:40 +01:00
luccioman
3e8dd90211 Use https rather than http in links and queries to openstreetmap.org 2018-02-15 19:14:07 +01:00
luccioman
3a973dbb23 Removed unused import 2018-02-14 09:27:17 +01:00
luccioman
e9527cd0e5 Reuse the same Pattern instance when matching multiple key/values 2018-02-14 07:14:25 +01:00
luccioman
dbf4c1cd76 Improved blacklist entries editing operations :
- Fixes issue #160 : handle properly syntax exceptions with a user
friendly message
- Fixes loss of information on multiple blacklist entries editions
- Fixes loss of entries when moving entries from one list to another
2018-02-13 18:24:26 +01:00
reger
87077b8fb6 Adjust and move Language Navigator to be member of the navigatior plugin
list.
2018-02-12 00:16:34 +01:00
luccioman
eb20589e29 Fixed issue #158 : completed div CSS class ignore in crawl 2018-02-10 11:56:28 +01:00
luccioman
0cdee4e26a Fixed loss of "meanCount" search param when using facets or page buttons
Then on new search queries, no suggestions at all could be displayed.
2018-02-08 08:07:30 +01:00
luccioman
117a859879 Do not clear all search modifiers when unselecting one modifier.
Previously, when clicking a selected facet in the search results page to
unselect it, all other eventually selected modifiers/facets were also
removed.
2018-02-07 15:54:46 +01:00
luccioman
33593c22e9 Fixed loss of other modifiers on keywords/tags search navigation links 2018-02-06 17:17:13 +01:00
luccioman
a9dc0874c0 Remove old query terms from search results suggestions links.
Especially when old terms were misspelled, suggestions links then
provided most of the time empty results.
2018-02-06 15:14:14 +01:00
luccioman
9412881230 Added basic support for autotagging microdata annotated item types.
With the appropriate vocabulary settings in Vocabulary_p.html page, this
can produce Vocabulary search facets displaying item types referenced in
html documents by microdata annotation.
Tested notably, but not limited to, vocabulary classes/types defined by
Schema.org and Dublin Core.
2018-02-06 10:25:38 +01:00
luccioman
5a14d34a7d Refactoring : documented and extracted autotagging processing functions. 2018-02-02 10:27:36 +01:00
luccioman
58b9834729 Added HTML microdata typed items parsing capability.
This adds the possibility for the HTML parser to gather typed items URLs
annotated in HTML tags with itemscope and itemtype attributes (see
microdata specification https://www.w3.org/TR/microdata/ ), notably
Types from the schema.org vocabulary, but also Types/Classes from any
other vocabulary, such as the common ones listed in the RDFa core
context ( https://www.w3.org/2011/rdfa-context/rdfa-1.1.html ).
2018-02-02 09:31:40 +01:00
luccioman
80fb1026d0 Create recrawl requests with the relevant crawl profile.
Recrawl default profile was previously effectively used for crawl
stacker acceptance check, but request entries were indeed still created
with the "snippetGlobalText" profile.
2018-01-30 21:00:18 +01:00
luccioman
539925a275 Added an utility to generate/update XLIFF master file from lng files. 2018-01-29 18:34:47 +01:00
luccioman
fa6d030b0b Moved dbtest to the test source folder. 2018-01-29 14:03:01 +01:00
luccioman
6cd3847d0a Fixed NullPointerException case on Table init with relative file path.
Can occur for example when running dbtest with relative test table file
name (wihout explicit parent folder).
2018-01-29 14:00:43 +01:00
luccioman
28883d8a71 Shutdown daemon threads at the end of dbtest 2018-01-29 13:56:37 +01:00
luccioman
929e0d6eae Replaced improper ByteBuffer.equals() implementation by Arrays.equals()
Renamed also ByteBuffer.equals() to startsWith() as this is the
appropriate function implementation semantics.
2018-01-29 13:38:25 +01:00
luccioman
46b5249c20 Removed time condition on HostBalancer initialization in JUnit test.
Its initialization in main application usage remains asynchronous.
2018-01-26 17:15:27 +01:00
luccioman
8b572b7337 Commit Solr index before simulating or starting recrawl job.
This ensures up-to-date simulation query results, and recrawl
processing.
2018-01-26 10:31:13 +01:00
luccioman
733cacdbb8 Revised the RDFaParser main launcher for minimal proper operation.
This parser is still not enabled in the main text parsers list. More
would have to be done to make it functional.
2018-01-25 07:57:56 +01:00
luccioman
7baa99f26f Fixed stored URL in web cache when redirection(s) occurs.
Associate cached content to the last redirection location, instead of
the first URL of a redirection(s) chain :
 - for proper base URL processing in parsers (fixes mantis 636 -
http://mantis.tokeek.de/view.php?id=636)
 - to prevent duplicated content in Solr index when recrawling a
redirected URL
2018-01-20 18:56:40 +01:00
luccioman
9ddf92d143 Removed unncessary reflection usage for workflow tasks.
This improves code readability and maintainability (calls hierarchy are
easier to read) and eventually performance.
2018-01-15 10:05:49 +01:00
luccioman
897d3d30cc Added new recrawl job profile to the list of default crawl profiles 2018-01-15 08:30:37 +01:00
luccioman
9624516bf8 Refresh recrawl job profile threshold date like other default profiles 2018-01-15 08:06:28 +01:00
luccioman
b712a0671e Added a specific default crawl profile for the recrawl job.
- with only light constraint on known indexed documents load date, as it
can already been controlled by the selection query, and the goal of the
job is indeed to recrawl selected documents now
- using the iffresh cache strategy
2018-01-13 15:46:04 +01:00
luccioman
adf3fa493d Added comments about crawl profiles recrawl cycles 2018-01-13 12:13:04 +01:00
luccioman
3638e16c2e More comprehensive log on rejected recrawls caused by date constraint 2018-01-13 12:07:56 +01:00
luccioman
d47afe6fab Use a constant for crawler reject reason prefix with specific processing 2018-01-13 10:45:00 +01:00
luccioman
4e03335625 Added more details to the recrawl job report 2018-01-12 11:47:13 +01:00
luccioman
6425963cee Fixed internal tables exact value match iterator 2018-01-10 18:38:42 +01:00
luccioman
0c9e0b3566 Record recrawl calls to make them schedulable 2018-01-10 17:05:53 +01:00
luccioman
433e241e4f Added a report info box about eventual last terminated recrawl job
For easier monitoring of recrawls.
2018-01-09 22:33:15 +01:00
luccioman
b2af25b14f Added a stop condition to the Recrawl busy thread 2018-01-09 10:22:26 +01:00
luccioman
421728d25a Made possible to customize selection query before launching a recrawl 2018-01-08 21:20:46 +01:00
luccioman
36e9b1c5b3 Fixed SegmentTest test case time dependant occasional failures
As highlighted by latest automated Travis builds.
2018-01-02 10:21:07 +01:00
luccioman
8a4ea1c11e Added UI switch to control content domain constraint per search request 2018-01-02 08:13:14 +01:00
reger
f8071ac8ae Make TokenizedStringNavigator (used for keyword search facet) active
check case insensitive.
As keywords are compared lower case, make sure user input keyword:Key
or keyword:key will be shown as active in facet entry key.
2017-12-28 02:51:52 +01:00
luccioman
e6907fdab3 Added optional search parameter/setting to control content domain filter
Thus allowing to choose at configuration or per search request, whether
extending or not results beyond strict content domain filter (image,
video, audio or application).

Related graphical controls to be added to user interface.
2017-12-23 18:56:17 +01:00
luccioman
f52217c939 Enable full size images preview for users with extended search rights 2017-12-22 11:39:30 +01:00
luccioman
09c4ee56a7 Added optional https support for remote crawl and profile operations 2017-12-21 18:41:32 +01:00
luccioman
5db1c9155a Do locale independant case conversion on hosts, schemes, and file exts.
Required for proper operation when the default system locale is Turkish,
as dottless and dotted i characters have specific case conversion rules
in this language.
2017-12-19 13:52:05 +01:00
luccioman
1c4803e40a Enable optional https support for /yacy/transferURL API calls.
Also updated some Javadoc and consistently use Switchboard instance as a
constructor parameter where relevant.
2017-12-19 12:30:49 +01:00
luccioman
c6e1befbca Restored peer URL host name stripping removed from previous commit.
Still useful for peers with IPv6 addresses.
2017-12-15 17:03:35 +01:00
luccioman
17e004599d Started implementing optional https preference for protocol operations
Introduced through the new configurable setting
network.unit.protocol.https.preferred, defaulting to false for now.

Let choose to prefer using https when available on remote peers to
perform YaCy protocol operations including notably hello or transferRWI.

Not yet implemented for every YaCy protocol operations.
2017-12-15 11:28:46 +01:00