Commit Graph

8849 Commits

Author SHA1 Message Date
Michael Peter Christen
4e9b425f98 missing fix for latest commit 2020-12-03 00:40:51 +01:00
Michael Peter Christen
3213d9db37 updated jetty from 9.4.17 to 9.4.35
and fixed a bug in ServerSideIncludes that appeared only in that recent
version of jetty
2020-12-03 00:21:15 +01:00
Michael Peter Christen
787fec0658 reduced complexity - removed concurrency in sort 2020-12-02 18:39:45 +01:00
Michael Peter Christen
cef5fde343 adding message to UI to make port change transparent 2020-12-02 18:05:38 +01:00
Michael Peter Christen
52228cb6be added a gc to cleanup process (once every 10 minutes) 2020-12-02 00:13:00 +01:00
Michael Peter Christen
22841ffbf1 creating a threaddump during every cleanup process
to be able to find out what a peer did (not) last time before a crash
2020-12-01 03:00:24 +01:00
Michael Peter Christen
36e616271b do better documentation on how to set a default password 2020-12-01 02:18:08 +01:00
Michael Peter Christen
df2bf9ef28 try to fix maven build error 2020-11-29 14:24:33 +01:00
Michael Peter Christen
264bab6700 trying to fight the UI unavaiability
this path addresses a possible issue with too many open connections to
remote peers
2020-11-29 14:15:34 +01:00
Michael Peter Christen
7947baeb49 removed all remaining deprecation warnings 2020-11-23 00:03:18 +01:00
Michael Peter Christen
c0f6d6e11d removed one deprecation warning for jetty library initializing ssl
server port
2020-11-22 23:27:58 +01:00
Michael Peter Christen
133440a7a6 some debug lines 2020-11-22 23:12:04 +01:00
sgaebel
3431f91db9 removes unused 'unused' tokens 2020-08-04 20:09:34 +02:00
sgaebel
fc03c4b4fe removes some warning and unused objects 2020-08-03 20:44:31 +02:00
sgaebel
4a495df63a removes some deprecation-warnings 2020-07-31 17:28:06 +02:00
sgaebel
dd9d4b1188 replace org.junit.Assert.assertThat by
org.hamcrest.MatcherAssert.assertThat from hamcrest 2.2 to avoid
deprecation-warning
2020-07-28 19:09:26 +02:00
sgaebel
df9ea0a42a removes some warnings: unused imports, params 2020-07-27 22:20:49 +02:00
sgaebel
9bc2297161 fixes deleting during recrawl 2020-07-22 22:15:00 +02:00
sgaebel
80785b785e adds deleting during recrawl 2020-07-09 19:32:16 +02:00
Michael Peter Christen
e0ad8ca9da replaced json library from JSON.org with libandroid-json-java
This fixes https://github.com/yacy/yacy_search_server/issues/347
2020-04-24 11:45:25 +02:00
Michael Peter Christen
ea8df27e95 modified org.json.* library to fit into the YaCy environment
as drop-in replacement.
Also made some fixes and enhancements to the library.
2020-04-24 11:42:06 +02:00
Michael Peter Christen
60dc1241a3 added org.json.* library
from https://android.googlesource.com/platform/libcore/+/refs/heads/master/json/src/main/java/org/json
as a preparation step for
https://github.com/yacy/yacy_search_server/issues/347
2020-04-24 10:28:43 +02:00
Michael Peter Christen
053e54a2c7 grand CORS for json files 2019-11-05 11:50:56 +01:00
Michael Christen
cfa27d2fd5 fixed links 2019-10-20 20:20:50 +02:00
Michael Christen
cb20aa7e54 removed donation message in search result column 2019-10-17 01:35:44 +02:00
Michael Christen
25227676ae removed some warnings 2019-09-28 02:07:08 +02:00
luccioman
6b45cd5799 New optional crawl filter on the URL a doc must match to crawl its links
For finer control over which parsed documents can trigger an addition of
their links to the crawl stack, complementary to the existing crawl
depth parameter.
2019-05-01 08:54:19 +02:00
luccioman
d16bc99835 Added "Show Metadata" links to the ViewFile.html links mode
To conveniently follow parsed links in the file viewer
2019-04-18 15:31:38 +02:00
luccioman
a5771b1f14 Made SNI extension user configurable without the need for server restart
TLS Server Name Indication (SNI) extension activation can now be
configured with the new Settings_p.html?page=httpClient administration
page.
SNI extension is also now enabled by default, as in 2019 the
unrecognized_name(112) alert is more properly handled by major web
servers TLS implementations, following the RFC 6066 standard.

Related YaCy issues : #153 #189 and #272
JDK 1.7 bug :
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7127374
Apache httpd issue :
https://bz.apache.org/bugzilla/show_bug.cgi?id=56241
RFC 6066 : https://tools.ietf.org/html/rfc6066#section-3
2019-04-14 15:41:13 +02:00
luccioman
e90405b6f0 Support parsing audio URLs without file extension
Added also a Junit for the audio tag parser
2019-04-09 11:40:21 +02:00
luccioman
a8316c79da Allow JS resorting of search results by unauthenticated users
Acces rate limitations to this search mode by unauthenticated users are
set low by default to prevent unwanted server overload but can be
customized through the SearchAccessRate_p.html configuration page

Fixes #291
2019-04-03 14:21:53 +02:00
luccioman
0ab2b49c31 Made /yacysearch access rate limitations user configurable
With a new admin page at /SearchAccessRate_p.html in menu Network Access
> Local Search > Access Rate Limitations
2019-04-02 17:42:50 +02:00
luccioman
5b7e41202a Added Solr GSA writer support for responses from remote instances 2019-03-27 18:23:41 +01:00
luccioman
4d8a948455 Properly close PDF snapshots loaded with pdfbox library 2019-03-22 09:50:30 +01:00
luccioman
74e6d6e984 Added Solr GrepHTML writer support for responses from remote instances 2019-03-20 18:24:16 +01:00
luccioman
5e6501974d Added Solr snapshots writer support for responses from remote instances 2019-03-19 11:25:44 +01:00
luccioman
384c37102c Improve accuracy of total results count on latest pages in Stealth mode
Previously, when mixing results from local RWI and local Solr (Stealth
mode), total local Solr count could be ignored on last result pages,
when the page offset was higher than local Solr count but lower than
total RWI count.
2019-03-04 10:05:47 +01:00
luccioman
5e9a08355a Improved logging for federated search
- Do not use spaces in logger identifier name so the log level can be
configured in yacy.logging
- Hold the logger instance to avoid the logging system to look for it
from its name at each appended log message
2019-02-02 09:59:24 +01:00
luccioman
9782a98a9c Added the possibility to customize facets sort type and direction
Previously search navigators/facets elements were sorted only by counts.
Now from the ConfigSearchPage_p.html admin page, sort direction
(ascending/descending) and type (on counts or labels) can be customized
independently for each navigator.
2019-01-24 18:43:06 +01:00
sgaebel
c2398fd890 remove warnings: 'Statement unnecessarily nested within else clause' 2019-01-10 20:02:57 +01:00
sgaebel
811d40a6c4 taking care of closing inputstreams, HTTPClient 2019-01-04 18:58:49 +01:00
sgaebel
8d2e7262d9 Recrawl:
- set the chunksize to 100 to meet the max of the embedded solr
- re-enable sorting (the case where we switched it of should be away)
- enable recrawling on remote-solr
2019-01-04 18:46:59 +01:00
sgaebel
8f58c1dcfa extend the SolrServlet to be usable as remote solr (incl. update)
this feature needs to be enabled by uncomment the url-pattern
2019-01-04 18:27:44 +01:00
luccioman
7223a2fdb1 Removed usage of now deprecated Jetty function 2018-12-22 14:42:22 +01:00
luccioman
440d9f2fa0 Exclude peers with empty or disabled RWI from remote RWI search 2018-12-20 14:53:01 +01:00
luccioman
08ea0b0397 Added a configurable timeout to wkhtmltopdf calls for pdf snapshots
Necessary to prevent blocking the indexing workflow when some
wkhtmltopdf renderings fail without terminating
2018-12-11 22:31:31 +01:00
luccioman
3fb449b3b6 Properly resolve relative URLs against document URL in html base tags
Fixes issue #256
2018-12-06 20:18:00 +01:00
luccioman
73a6e45524 Extended detection of external tools used for Snapshots generation
This enable detecting wkhtmltopdf and Imagemagick convert executables
when they are at system Path in addition to common installation paths.
2018-12-06 09:53:08 +01:00
luccioman
7dc1f60619 Fixed detection of absolute data folder path on MS Windows 2018-11-18 10:08:20 +01:00
luccioman
595e144797 Trace a message on incomplete proper server finish when killing process 2018-11-15 17:32:22 +01:00
luccioman
9daeea823b Fixed concurrency issue on cache used for circles rendering
Without synchronization lock, concurrent rendering of images including
circles could lead to glitches as reported in issue #248
2018-11-10 22:00:49 +01:00
Michael Peter Christen
c347e7d3f8 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2018-11-08 14:42:52 +01:00
Michael Peter Christen
848e9304d9 evil bots may crawl harder 2018-11-08 14:42:40 +01:00
luccioman
a997133260 Fixed gzip decompression regression on index transfer APIs
Processing of gzip encoded incoming requests (on /yacy/transferRWI.html
and /yacy/transferURL.html) was no more working since upgrade to Jetty
9.4.12 (see commit 51f4be1).

To prevent any conflicting behavior with Jetty internals, use now the
GzipHandler provided by Jetty to decompress incoming gzip encoded
requests rather than the previously used custom GZIPRequestWrapper.

Fixes issue #249
2018-11-07 14:52:42 +01:00
luccioman
e85f231bdf Fixed termination of Host browser and link structure Solr query threads
On some conditions (especially when reaching timeout), concurrent Solr
query tasks used by the /HostBrowser.html and /api/linkstructure.json
never terminated, thus leaking resources, as reported by @Vort in issue
#246
2018-11-06 10:10:09 +01:00
luccioman
fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
New "Media Type detection" section in the advanced crawl start page
allow to choose between :
- not loading URLs with unknown or unsupported file extension without
checking the actual Media Type (relying Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross check URL file extension against the actual Media Type.
This lets properly parse URLs ending with an apparently odd file
extension, but which have actually a supported Media Type such as
text/html.

Sample URLs with misleading file extensions added as documentation in
the crawl start page.

fixes issue #244
2018-10-25 10:42:12 +02:00
luccioman
a83a56473e Added suport for PDF snapshots generation when running on MS Windows 2018-10-18 12:41:57 +02:00
luccioman
8852c97cee Added basic styling for cleaner rendering of missing image snapshots
For the output of the Solr snapshots writer
2018-10-15 18:19:57 +02:00
luccioman
746e0e788d Render a relevant HTTP status code on snapshot image rendering error
Instead of a null response body which is not very helpful.
2018-10-14 10:30:30 +02:00
luccioman
50b6edfcf5 Updated Solr snapshots writer for a cleaner html head 2018-10-13 10:36:39 +02:00
luccioman
f366f43d6b Made snapshots size customizable in Solr snapshots response writer 2018-10-13 10:22:47 +02:00
luccioman
7a62fc0e66 Fixed concurrency issue in custom classloader used for template classes
As reported in issue #241, the problem is only critical (random but
complete crash of the JVM) when upgrading to JDK11.
2018-10-11 18:34:39 +02:00
luccioman
61c337f29a Decode blacklist entries for easier edition of non ascii chars
Not using the JDK URLDecoder.decode() function, as it strips '+'
characters when they occur after '?' (both characters having regular
expression semantics when used in blacklist path patterns)
2018-10-04 09:33:58 +02:00
luccioman
ed93221fa1 Improved normalization of blacklist path patterns having non ascii chars
Normalize blacklist path patterns using percent-encoding, at pattern
edition in web interface and at loading from configuration files.

Fixes issue #237
2018-10-02 14:36:13 +02:00
luccioman
2a73b63d9e Use a constant default target file name for seed SCP upload method
To make seed upload (in /Settings_p.html?page=seed page) with SCP easier
when the user specify a remote target directory path.

See report by @vikulin in issue #227
2018-09-16 10:37:47 +02:00
luccioman
b5eabb626f Removed some dead code 2018-09-14 14:02:32 +02:00
luccioman
db7ad76366 Improved support for Java logs file pattern options
- support of "%h" and "%t" pattern components
- more proper initialization of file handler when the data folder is not
the default one, notably to prevent a non blocking but ugly error stack
trace reported by the log manager at startup with that kind of setup
2018-09-13 12:17:02 +02:00
luccioman
7adbd1f87d Fixed raw IPV6 addresses snapshots read/write on FAT32 and NTFS fs
Fixes issue #225
2018-09-12 17:34:40 +02:00
luccioman
9b1c87033b Fixed logs folder checking and creation
Previously, if YaCy log folder was for example at
`/home/user/yacy/DATA/LOG`, because of improper truncation of log path,
an unnecessary directory creation was atempted at `/home/us`.
2018-08-31 08:34:28 +02:00
luccioman
c29588dd6a Made possible to provide an absolute data root path for start script
Previously, only a path relative to the user home folder could be
provided
2018-08-30 18:16:22 +02:00
luccioman
d03c098b54 Removed deprecated warning comments about imports and Debian installer
Deprecated by commit be5d3a1066 , as
classpath is now defined in yacycore.jar Manifest file.
2018-08-22 22:35:00 +02:00
luccioman
5b60b4225f Fixed encoding of '+' character on search pages links
As revealed by issue #216
2018-08-20 18:44:04 +02:00
luccioman
54fbe166ba Updated pdf cache clear steps consistently with current pdfbox version
- Removed calls to no more existing clearResources functions (on PDFont
class and its children) since upgrade to pdfbox 2.n.n
- Removed hacky usage of protected internal ClassLoader function. This
removes the warnings displayed when running with JDK9 or JDK10 :

     [java] WARNING: Illegal reflective access by
net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to
method java.lang.ClassLoader.findLoadedClass(java.lang.String)
     [java] WARNING: Please consider reporting this to the maintainers
of net.yacy.document.parser.pdfParser$ResourceCleaner
     [java] WARNING: Use --illegal-access=warn to enable warnings of
further illegal reflective access operations
     [java] WARNING: All illegal access operations will be denied in a
future release

Crawling thousands of pdf documents from various sources after
modifications applied, revealed no new memory leak related to pdfbox
(measurements done with JVisualVM).
2018-08-16 18:23:42 +02:00
luccioman
685122363d Added a parser for XZ compressed archives.
As suggested by LA_FORGE on mantis 781
(http://mantis.tokeek.de/view.php?id=781)
2018-08-15 10:07:39 +02:00
luccioman
4ee14ff3c5 Fixed NullPointerException case on malformed crawl queue folder name 2018-08-13 14:35:26 +02:00
luccioman
21ad9435ec Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems
As reported by @vikulin in issue #187, crawling websites using a raw
IPv6 address as host name in their URL failed when running on Microsoft
Windows platforms (FAT32 or NTFS filesystems) when YaCy crawler created
the crawl queue folder, as the ':' character which is part of an IPV6
address is forbidden on these filesystems.
2018-08-11 10:02:26 +02:00
luccioman
8a29551c54 Upgraded the OpenGeoDB dump URL
The status of the library in the DictionaryLoader_p.html page now also
advertises the user that an upgrade can be applied when an older dump is
already loaded.

Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter
chat.
2018-08-03 18:39:41 +02:00
luccioman
373edf9eac Adjusted yjson Solr writer to support responses from an external Solr
Worked previously only with responses from YaCy embedded Solr, now able
to render the response when YaCy is configured to use an external Solr
index.
2018-07-31 16:22:21 +02:00
luccioman
87bd17b1cf Simplified a little bit the RSS OpenSearch Solr writer 2018-07-31 16:02:50 +02:00
luccioman
dc49ca9c27 Fixed a NPE case on the Solr OpenSearch response writer
Occurred when omitHeader parameter is set to true
2018-07-29 16:30:37 +02:00
luccioman
f4267ed247 Made Solr OpenSearch RSS writer compatible with external Solr index
Worked previously only with responses from YaCy embedded Solr, now able
to render the response when YaCy is configured to use an external Solr
index.
2018-07-28 11:03:31 +02:00
luccioman
b1410f593a Fixed stylesheet relative URLs rendering in Solr html writer
Relative URLs to CSS stylesheets were not properly rendered when using
the Solr html response writer and the "/solr/collection1/select" entry
point instead of "/solr/select".
2018-07-25 08:03:25 +02:00
luccioman
89c59814da Improved rendering of the Solr api relative url in the html writer
In order to have a consistent relative url when using either
/solr/select or /solr/collection1/select entry point.
2018-07-24 10:13:55 +02:00
luccioman
bf4f320b16 Optionally render the response header when using the Solr html writer
With params rendered as html input fields for conveniently modifying
params values and refreshing results.
2018-07-23 18:36:57 +02:00
luccioman
313204ae2c Override qf and df Solr params with defaults only when they are not set 2018-07-23 13:50:24 +02:00
luccioman
bdafb14336 Removed redundant synchronization lock on network switch function
Was useless as done in an already synchronized block, and the lock
object was assigned a new value in that same block, and nowhere else a
lock is requested on that same object.
2018-07-16 09:20:23 +02:00
luccioman
d5f44ea216 Removed unnecessary synchronization lock from serverSwitch constructor
Lock was useless here as it was set on an object instance attribute
while the object itself is not yet constructed and no other threads can
access it.
2018-07-16 09:13:50 +02:00
luccioman
dcad393fe5 Fixed exceeding max size of failreason_s Solr field on large link list
When using the 'From Link-List of URL' as a crawl start, with lists in
the order of one or more thousands of links, the failreason_s Solr field
maximum size (32kb) was exceeded by the string representation of the URL
must-match filter when a crawl URL was rejected because not matching.
2018-07-11 08:13:29 +02:00
luccioman
f467601561 Properly lock solrInstances for reboot and restoration of embedded Solr
Putting a synchronization lock directly on the solrInstances property
was ineffective as it is assigned a new (unlocked) instance in these
operations.
2018-07-08 08:57:59 +02:00
luccioman
9630f81306 Fixed small unnecessary lines of code 2018-07-08 08:15:26 +02:00
luccioman
876bcd2f54 Fixed useless comparison between int parameter and Long.MAX_VALUE 2018-07-08 08:11:01 +02:00
luccioman
c726154a59 Fixed removal of URLs from the delegatedURL remote crawl stack
URLs were removed from the stack using their hash as a bytes array,
whereas the hash is stored in the stack as String instance.
2018-07-05 09:36:36 +02:00
luccioman
2bdd71de60 Added server side columns sorting on the Process Scheduler table
For easier usage of large tables in the Table_API_p.html page.
2018-07-04 10:28:32 +02:00
luccioman
bb51555830 Removed remaining unsafe accesses to SimpleDateFormat instances.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-07-02 10:00:40 +02:00
luccioman
f895745e1c Removed more unsafe concurrent accesses to SimpleDateFormat instances.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-29 15:49:55 +02:00
luccioman
e97580dfc7 Fixed unsafe conccurent access to generic SimpleDateFormat instances
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formating dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-28 14:59:23 +02:00
luccioman
8811700e2e Upgraded Jetty dependency from 9.4.9 to 9.4.11 2018-06-20 09:33:26 +02:00
luccioman
d53c33e4ef Fixed potential infinite loop case (does not occur in current code base) 2018-06-20 07:51:59 +02:00
luccioman
a15ac8e0ca Made CrawlProfile loading tolerant to malformed json string attribute 2018-06-19 12:53:17 +02:00
luccioman
a715bb7876 Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml 2018-06-19 12:50:28 +02:00
luccioman
0b302c5004 Do not block whole server startup on persisted crawl profile load error 2018-06-19 12:48:17 +02:00
luccioman
4d9aa4ed1e Fixed default crawl profile solr mustnotmatch query from previous commit 2018-06-19 11:58:47 +02:00
luccioman
cced94298a Added a new crawler document filter type using Solr syntax
This makes possbile to set up much more advanced document crawl filters,
by filtering on one or more document indexed fields before inserting in
the index.
2018-06-19 10:12:20 +02:00
Michael Christen
e0dc632020 removed transformer
it was not used any more
2018-06-19 00:42:23 +02:00
luccioman
9bc7b6c39d Allow edtion of scheduled next execution dates for finer control
Can be useful more especially when scheduling many API calls over a long
period of time to precisely adjust each scheduled date/time.
2018-06-11 11:38:58 +02:00
luccioman
40e8c7b89b Use the heavy ConcurrentUpdateSolrClient only when necessary
Prefer the lightweight HttpSolrClient when no updates are performed on
the remote Solr instance, as recommended by Solr documentation itself.
2018-06-08 11:18:29 +02:00
luccioman
bd4cfeda3f Add a max acceptable limit to the size of Solr responses on p2p search
Following activation of gzip compression on responses, to ensure
uncompressed content can fit on available memory.
2018-06-08 10:33:23 +02:00
luccioman
de4ea95687 Consistently allow gzip compression of remote Solr responses
Was already enabled when requesting remote Solr with https or with
authentication (as an external Solr index)
2018-06-07 15:20:37 +02:00
luccioman
cea8187161 Reuse expired connections evictors threads provided by apache and solr 2018-06-06 14:24:05 +02:00
luccioman
b5dc1f376f Made outgoing pools max total connections user configurable
For a finer control over the maximum simultaneously active outgoing
connections.
2018-06-06 09:36:50 +02:00
luccioman
387d646c0e Added gzip compression of responses returned to user-agents accepting it
Enabled as default, but can be disabled using the "Server Access
Settings" admin page.
2018-06-05 13:35:39 +02:00
luccioman
a7a4ba3287 Apply remote solr configured timeout on getting connection from pool 2018-06-02 17:38:14 +02:00
luccioman
ee6670fb8f Use a common pooled http connection manager for remote solr instances
For a better control on the maximum simultaneous outgoing http
connections, as already done for any other http connections (crawls, rwi
search, p2p protocol) using the net.yacy.cora.protocol.http.HTTPClient
2018-05-29 09:24:21 +02:00
luccioman
d28f9ba0f6 Removed use of deprecated ConcurrentUpdateSolrClient constructor 2018-05-26 21:00:24 +02:00
luccioman
8a749aa5ad Trace level log message for monitoring remote solr response times 2018-05-26 20:58:05 +02:00
luccioman
35826a3091 Added a search page customization setting to display or not favicons
If not interested in displaying this on your search results and notably
on a peer with limited resources this can help saving some CPU and
outgoing network connections.
2018-05-25 11:13:43 +02:00
luccioman
0082b5ab2a Added missing default Solr http client connection timeout initialization
Consistently with the custom Solr http client used for https connections
to remote Solr peers or to YaCy external Solr storage.

This prevent remote Solr requests threads to wait for establishing a
connection to a remote peer longer than the configured timeout.
2018-05-24 09:24:52 +02:00
luccioman
fa4399d5d2 Small perf improvement : initialize threads names early when possible
Initializing Thread names using the Thread constructor parameter is
faster as it already sets a thread name even if no customized one is
given, while an additional call to the Thread.setName() function
internally do synchronized access, eventually runs access check on the
security manager and performs a native call.

Profiling a running YaCy server revealed that the total processing time
spent on Thread.setName() for a typical p2p search was in the range of
seconds.
2018-05-23 14:45:35 +02:00
luccioman
84d82bfdd7 Adjusted suggestions timeout management
* less CPU usage using the Solr 'allowedTime' parameter
* increase chances to get some results even when a first operation step
goes in time out by letting some time for final snippets results
processing
2018-05-21 14:49:43 +02:00
luccioman
65854bcb22 Fixed NullPointerException when omitHeader=true on external Solr server 2018-05-18 11:30:14 +02:00
luccioman
c4d984cec8 Fixed Solr response header duplication when requesting external Solr 2018-05-18 11:28:30 +02:00
luccioman
124cc24aa3 Properly handle embedded Solr partial results
Solr can provide partial results for example when a processing time
limit (specified with the parameter `timeAllowed`) is exceeded.

Before this fix, getting partial results from an embedded Solr index
resulted in a ClassCastException :
"org.apache.solr.common.SolrDocumentList cannot be cast to
org.apache.solr.response.ResultContext".
2018-05-18 10:14:54 +02:00
luccioman
3ce44cf250 Fixed largest snippet get : don't reject ones starting with a space char 2018-05-14 18:26:25 +02:00
luccioman
f511e16d50 Prevent duplication of Solr query highlight fields parameters
That was caused by concurrent modifications (with addHighlightField()
function) to the same SolrQuery instance when requesting Solr on remote
peers in p2p search.
2018-05-14 15:26:44 +02:00
luccioman
e357ade47d Reduced memory footprint of text snippet extraction
By not parsing and storing at first all sentences of a document, but
only on the fly the ones necessary to compute the snippet.
2018-05-13 10:29:52 +02:00
luccioman
e115e57cc7 Reduced text snippet extraction processing time.
By not generating MD5 hashes on all words of indexed texts, processing
time is reduced by 30 to 50% on indexed documents with more than 1Mbytes
of plain text.
2018-05-11 15:42:53 +02:00
sgaebel
4b79851e12 corrected icons_sizes_sxt to SolrType.string 2018-05-01 14:04:15 +02:00
luccioman
3b89c232db Easier tracking of longest text snippets initializations
When text snippets statistics are enabled and FINE log level is enabled
on the TextSnippetStatistics class.
2018-05-01 09:58:05 +02:00
luccioman
3c4344cb12 Fixed text snippet max init time statistic rendering 2018-05-01 09:39:41 +02:00
reger
a8234b7ea7 Make sure for image resource url enabled index image pixel size fields are filled
if at least one of the image size fields is enabled in index (images_height_val,
images_width_val, images_pixel_val). 
Previously all fields were required to be enabled (hint: default setting 
is height + width enabled)
2018-04-30 04:59:34 +02:00
luccioman
e67df103b5 Removed more remaining uses of deprecated Seed.getIP() function. 2018-04-29 08:26:53 +02:00
luccioman
addd18c993 Removed some remaining uses of deprecated Seed.getIP() 2018-04-26 09:39:30 +02:00
luccioman
c35d0568b6 Support for preferred https in peers communication on more operations 2018-04-24 08:08:24 +02:00
luccioman
e914d17aca Updated call to function deprecated since commons-codec version 1.11 2018-04-23 08:07:56 +02:00
luccioman
a3ec7a7a5f Added analysis optional setting to compute statistics on text snippets
Thus producing some basic stats on processing times for snippets
generation and counts on snippets per source type.
2018-04-15 09:55:08 +02:00
luccioman
1889d484de Added Solr HTML writer support for responses from remote instances 2018-04-12 09:23:00 +02:00
luccioman
2af3bf79c7 Improve rendering of remote Solr admin URLs
- properly handle IPv6 loopback address replacement
 - replace loopback address or host only when accessing peer remotely
 - replace loopback part with the peer hostname as requested rather than
with its seed public IP as this works better for Intranet mode and when
peer is behind a reverse proxy.
2018-04-10 11:15:31 +02:00
luccioman
bb74de7d59 Removed unnecessary "/admin" suffix from remote Solr instance admin URL
For quite quite a long time now, the Solr /admin URL suffix indeed
redirects to the Solr base context (see
https://issues.apache.org/jira/browse/SOLR-3337)
2018-04-09 00:01:45 +02:00
luccioman
0d34034f17 Ensure an embedded Solr is available for Solr dump/restore operations
Otherwise, these operations triggered NullPointerException when only an
external Solr index is attached.
2018-04-07 13:42:06 +02:00
luccioman
d92b191942 Ensure no remote Solr is attached before "Shut Down and Re-Start Solr"
Otherwise once this operation is applied, the remote Solr(s) instances
are deconnected and the embedded Solr is connected even if disabled by
setting "core.service.fulltext".

Also use constants for related default setting values.
2018-04-06 20:34:54 +02:00
luccioman
26d8ad591c Adjusted Solr select servlet output when using an external Solr only
- Use the EnhancedXMLResponseWriter only when requested output is "exml"
- Use the Standard Solr writers when possible, for example for json, xml
or javabin output formats
- Return an error when the requested format can not been rendered with
an external Solr server only

Important : this modification is necessary for peers using exclusively
an external Solr server to be reachable as robinson targets in p2p
search, as the binary format ("javabin") is the default Solr exchange
format for peers.
Before this, when a peer requested a remote one attached only to an
external Solr (no embedded one), it ended with "Invalid type" error, as
the remote peer answered with xml although binary format was requested.
2018-04-06 15:16:54 +02:00
luccioman
69690c13a0 Optionally allow external Solr server with self-signed certificate
This is necessary when you want to attach to a dedicated external Solr
server protected with basic http authentication and requested over https
but having only a self-signed certificate.
2018-04-04 18:16:26 +02:00
luccioman
b882f85900 Fixed NPE case in Solr select servlet on external Solr only setup
Regression introduced with commit
0d7625ecfb
2018-04-03 15:36:17 +02:00
luccioman
2fd4d05e2f Added a shared Java constant for setting key server.servlets.called 2018-04-02 15:16:10 +02:00
luccioman
ba9cd14516 Removed hard-coded patch for Solr 5.0 on ranking boost function
The current default boost function
(`recip(ms(NOW,last_modified),3.16e-11,1,1)`) for the Date ranking
profile is indeed working fine.
What can trigger the error `unexpected docvalues type NUMERIC for field
'last_modified'` is the previous default boost function (quite old now)
or any custom one using the Solr `ord` or `rord` functions on the
last_modified field.
Then the problem was that the migration code in the Switchboard supposed
to detect the old date boost function was incorrect (one trailing right
parenthesis in excess), so the deprecated function remained.

This fixes issue #169.
2018-03-26 16:24:27 +02:00
luccioman
fb3032c530 Added a crawl filtering possibility on documents Media Type (MIME) 2018-03-23 10:28:19 +01:00
luccioman
e45afedee4 Added support for enclosures (media links) to the RSS loader 2018-03-21 08:22:29 +01:00
luccioman
aaefd5219c Reduce log verbosity of RSS loader on feed items with no link 2018-03-20 10:09:17 +01:00
luccioman
cf62b571bd Added RSS reader support for enclosure feed item sub element.
Enclosure element (see
http://www.rssboard.org/rss-specification#ltenclosuregtSubelementOfLtitemgt
) can be seen for example in podcasts feeds.
2018-03-20 07:38:29 +01:00
luccioman
e5f5de0fc7 Added some JavaDoc to the RSSMessage class. 2018-03-19 11:15:31 +01:00