Commit Graph

8973 Commits

Author SHA1 Message Date
sgaebel
69adaa9f55 makes our HTTPClient closable 2021-10-31 23:06:02 +01:00
sgaebel
fc4275f901 handle all references for client, response, request to be able to close
them
2021-10-31 23:05:50 +01:00
sgaebel
e7d3a363f2 refactor to use finish() 2021-10-31 11:22:35 +01:00
sgaebel
4fc876f4a3 revert back to use EntityUtils.consumeQuietly - as it simply closes the
underlying stream
2021-10-31 11:22:28 +01:00
sgaebel
4f0392e93e refactor use of AuthSchemeProvider 2021-10-31 11:21:59 +01:00
sgaebel
b74f337859 removes double setting of UserAgent 2021-10-31 11:21:06 +01:00
sgaebel
965748fefb some refactoring using try with resources 2021-10-31 11:20:28 +01:00
Michael Peter Christen
552ab7051b fix for warc importer 2021-10-25 19:35:15 +02:00
Michael Peter Christen
3c86b7b780 attempt to make a Mac Release using gradle
This is almost working with many workarounds:
- run rm lib/yacycore.jar
- run ./gradlew clean build bundleNative
- run ant clean all
- run again rm lib/yacycore.jar
- run ./fixMacBuild.sh

The build is then inside build/mac/YaCy.app

Right now this works so far but it does not have the correct release
number inside.

Target is to make this work for Windows releases and to embed the JRE
entirely.
2021-10-25 18:37:39 +02:00
Michael Peter Christen
999c819e3e Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2021-10-24 20:50:14 +02:00
Michael Peter Christen
fd770e90e2 spike to identify paths for YaCy within mac application bundles 2021-10-24 20:49:59 +02:00
Michael Peter Christen
d19872fd26 making sure that crawl queues are closed correctly to prevent data loss 2021-10-14 00:30:04 +02:00
sgaebel
90507c0fdc comments out printing query params to std.out 2021-10-04 18:03:06 +02:00
Michael Peter Christen
be0aebad84 fixes https://github.com/yacy/yacy_search_server/issues/424 2021-10-04 14:38:49 +02:00
Michael Peter Christen
63ad8ce6b2 removed ymarks
had not been used for a long time
2021-09-16 22:23:51 +02:00
Michael Peter Christen
ef5a71a592 enhanced crawl start response time
for very very large crawl start lists
2021-09-16 21:01:01 +02:00
Michael Peter Christen
4cadd557dc removed synchronization in table creation
to avoid possible deadlocks when handling OnDemandOpenFileIndex
which happens quite often during wide crawling
2021-09-15 19:34:49 +02:00
admin
9b7668fa58 reduced memory footprint during indexing/crawling 2021-08-24 12:24:52 +02:00
Michael Peter Christen
e6a87e0426 enhanced crawler
a main problem when crawling is long waiting time caused by crawl-delay
values from robots.txt entries. that attribute is not supported by
google and interpreted by yandex and bing in different ways. In large
crawls there is always one host which blocks the whole crawl with
extreme large values. YaCy now still obeys crawl-delay but limits them
to 10 seconds.
Additionally the blocking logic when loading new robots.txt was analyzed
and a deadlock was removed. Furthermore the construction of new queue
lists was redesigned and it was ensured that always a large list of
different hosts for host-balancing is provided for the loader.
2021-08-17 15:23:21 +02:00
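Note on e6a87e0426: the crawl-delay cap described in this message can be sketched roughly as below. This is only an illustration under stated assumptions, not the YaCy code: the class name, method name, and the use of milliseconds are hypothetical; only the 10-second upper bound comes from the commit message.

```java
import java.util.concurrent.TimeUnit;

final class CrawlDelayCap {
    // Upper bound taken from the commit message: 10 seconds.
    static final long MAX_CRAWL_DELAY_MILLIS = TimeUnit.SECONDS.toMillis(10);

    /**
     * Returns the delay to honor for a host: the robots.txt crawl-delay,
     * clamped to the range [0, 10 s]. Negative input is treated as "no delay".
     */
    static long effectiveDelayMillis(long robotsCrawlDelayMillis) {
        if (robotsCrawlDelayMillis < 0) {
            return 0;
        }
        return Math.min(robotsCrawlDelayMillis, MAX_CRAWL_DELAY_MILLIS);
    }

    public static void main(String[] args) {
        System.out.println(effectiveDelayMillis(2_000));      // small values pass through
        System.out.println(effectiveDelayMillis(3_600_000));  // an hour is cut down to the cap
    }
}
```

With a clamp like this, a single host that announces an extreme crawl-delay can no longer stall a balanced queue for longer than the cap.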
Michael Peter Christen
e9c5e78868 replaced new Number(Number) with Number.instanceOf
to remove deprecation warnings for Java 9
2021-08-08 00:39:03 +02:00
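Note on e9c5e78868: `java.lang.Number` has no `instanceOf` method, so the sketch below assumes the message refers to the `valueOf` factory methods of the boxed types, which replace the constructors deprecated since Java 9. The class and method names are hypothetical.

```java
final class BoxingExample {
    /** Boxes an int via the factory method instead of the deprecated new Integer(int). */
    static Integer boxInt(int value) {
        return Integer.valueOf(value);
    }

    /** Boxes a long via the factory method instead of the deprecated new Long(long). */
    static Long boxLong(long value) {
        return Long.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(boxInt(42));
        System.out.println(boxLong(8973L));
    }
}
```

Besides silencing the deprecation warning, the factory methods may return cached instances for frequently used values instead of allocating a new object each time.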
Michael Peter Christen
9e13d77de4 removed call to class.finalize() because of deprecation in java 9
next: removal of finalize() implementation
after testing with assert false
2021-08-07 18:57:49 +02:00
Michael Peter Christen
9ef4503672 fixed some newInstance() warnings
.. by adding .getDeclaredConstructor()
2021-08-07 18:46:53 +02:00
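Note on 9ef4503672: `Class.newInstance()` is deprecated since Java 9; the replacement named in the message goes through the no-argument constructor. A minimal sketch follows, with a hypothetical helper name and an error-handling choice (wrapping in `IllegalStateException`) that is an assumption rather than what YaCy does.

```java
final class ReflectiveFactory {
    /**
     * Creates an instance through the no-argument constructor.
     * All reflective failures are wrapped in an unchecked exception.
     */
    static <T> T create(Class<T> type) {
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = create(StringBuilder.class);
        System.out.println(sb.append("created").toString());
    }
}
```

Unlike the deprecated call, this form reports exceptions thrown by the constructor wrapped in an `InvocationTargetException` instead of rethrowing them undeclared.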
Michael Peter Christen
1d41380f0a better support for mac-specific tray functions in java 9 2021-07-12 17:27:59 +02:00
Michael Peter Christen
e81b770f79 enabled crawl starts with very large sets of start urls
i.e. 10MB large url list with approx 0.5 million start points
2021-06-30 10:45:58 +02:00
Michael Peter Christen
c623a3252e fix for jdk 14 bug 2021-04-23 09:11:03 +02:00
Michael Peter Christen
dbd211a1ad removed/replaced reflection in memory tool 2021-04-22 20:24:13 +02:00
Michael Peter Christen
1cdb21592b added hazelcast and some modifications to align legacy YaCy with
YaCyGrid
2021-04-15 20:39:22 +02:00
Michael Christen
42ea2a1c6f Merge pull request #405 from jfhs/jfhs/support-all-html-entities
Improve HTML entities support
2021-03-31 01:44:54 +02:00
Michael Christen
b2af745dd6 Merge pull request #404 from lnceballosz/master
NGI0 - Updating licensing aspects according REUSE
2021-03-30 23:48:21 +02:00
jfhs
10bddc2c2d Decode HTML entities in all property values by default 2021-03-30 22:24:55 +02:00
jfhs
2135d259e3 Replace hardcoded html/xml entities with a file, support decoding all defined HTML entities 2021-03-30 22:24:54 +02:00
Michael Peter Christen
8f876a8c72 added concurrency to enhance indexing speed during json surrogate import 2021-03-30 12:07:36 +02:00
Michael Peter Christen
f8cbaeef93 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2021-03-29 18:46:53 +02:00
Michael Peter Christen
a857e3d3d5 fix for json importer 2021-03-29 18:46:42 +02:00
sgaebel
1546232c94 adds ranking for multi document queries only 2021-03-20 17:48:35 +01:00
sgaebel
93b353d22d does not boost or add fields for zero-row-queries (exists()) 2021-03-20 17:48:26 +01:00
sgaebel
f16cd154f7 removes unused imports and variables 2021-03-20 15:14:09 +01:00
sgaebel
c69c462a15 replaces an expensive getLoadTimeURL() by exists()
refactors urlExists to getHarvestProcess as that is what it does
2021-03-20 15:01:31 +01:00
sgaebel
a5488ac8f5 uses edismax queries on query counts > 1 only 2021-03-20 01:06:09 +01:00
sgaebel
26223dc25a replaces getLoadTime() by exists() with a simpler query
since solr-8.8.1 getLoadTime() causes a high cpu usage
2021-03-20 01:06:02 +01:00
sgaebel
8e4d014c06 removes useless SolrRequestInfo.clearRequestInfo(), avoids spamming the
log
2021-03-18 22:33:39 +01:00
Lina Ceballos
a96752f5ab adding SPDX license and copyright headers 2021-03-11 12:17:11 +01:00
Michael Peter Christen
e18d0ef544 trying to set a higher priority to the process that is involved in index
export
2021-03-09 00:04:05 +01:00
Michael Peter Christen
8b4394a6c5 fixes for solr 8.8.1 migration
- replace new guava 30 with older 25 because that is the correct
dependency for solr 8.8.1. The newer one did actually not work!
- index will be created in a DATA/INDEX/freeworld/SEGMENTS/solr_8_8_1
subfolder. The older solr_6_6 index is not touched but also not
migrated. The index starts with fresh (empty) content.
- Older indexes must be migrated by hand (export/import) so far until a
better solution is found.
- Large schema adaptations for lucene 8.8.1
2021-03-08 13:39:27 +01:00
Michael Peter Christen
ed9789214e fixed seed initialization problem 2021-03-06 13:35:46 +01:00
Al Sutton
8ade8b8775 Remove forced clear to match new behaviour in 2da71c2a40 2021-03-04 16:37:56 +00:00
Al Sutton
09695fc6d3 Update exceptions to match updated API 2021-03-04 16:34:02 +00:00
Al Sutton
69014a701e Update API Usage 2021-03-04 16:14:56 +00:00
Michael Peter Christen
3da7628117 use environment variables to overwrite configuration variables
you can i.e. do:
export YACY_PORT=8092 && ./startYACY.sh
Just prepend "YACY_" to the uppercase version of the configuration variable
name and replace all "." with "_".
2021-02-09 20:26:49 +01:00
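Note on 3da7628117: the naming rule in this message (uppercase the configuration key, replace "." with "_", and use the "YACY_" prefix as in the YACY_PORT example) can be sketched as below. The class, the method names and the fallback logic are hypothetical; the real lookup in YaCy may differ.

```java
import java.util.Locale;
import java.util.Map;

final class EnvOverride {
    /** Maps a configuration key such as "port" to an environment name such as "YACY_PORT". */
    static String envNameFor(String configKey) {
        return "YACY_" + configKey.toUpperCase(Locale.ROOT).replace('.', '_');
    }

    /** Returns the environment value if one is set, otherwise the configured value. */
    static String resolve(String configKey, String configuredValue, Map<String, String> env) {
        String override = env.get(envNameFor(configKey));
        return override != null ? override : configuredValue;
    }

    public static void main(String[] args) {
        // System.getenv() supplies the real process environment.
        System.out.println(resolve("port", "8090", System.getenv()));
    }
}
```

Passing the environment in as a map keeps the mapping rule testable without changing the process environment.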
Michael Peter Christen
13a2e6dc6e Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2021-01-25 11:49:32 +01:00
Michael Peter Christen
0ae8ccf657 Make it possible to set an empty password disabling the authentication
protocol completely
If you now set an empty password, then the http server will not ask to
authenticate. This is required for environments where we attach an outside
authentication service like keycloak or similar using authentication
in an ingress proxy.
This change is part of the approach to run YaCy inside of a kubernetes
cluster where we do not want individual authentication of peers and want
to apply an ingress authentication.
2021-01-25 11:49:21 +01:00
Michael Peter Christen
96592a10cf added option to set yacy configuration values using environment
variables
To use that feature, set an environment variable with prefix "yacy." and
suffix identical to the yacy configuration attribute name.
Additionally we implemented a way to set a peer name using the setting
"network.unit.agent". This can therefore now be used to set a peer name
with the java call parameter
-Dyacy.network.unit.agent=anonymous
The purpose for this feature is the ability to set peer names in
mass-deployed kubernetes clusters to the same name to prevent that we
are flooding peer name statistics with auto-deployment-generated names.
2021-01-24 22:50:37 +01:00
Michael Peter Christen
198826c362 added network scanner process to discover all YaCy peers in the intranet
this will be used to wire YaCy peers in a kubernetes cluster
2021-01-23 15:14:49 +01:00
Michael Peter Christen
d9602e8325 Implemented a new syntax in the template engine to simplify json APIs
Added also an example for one of the existing APIs. The problem is the
comma separator between objects which must not be there for the last
entry in a sequence. The new syntax adds the separator symbol
automatically.
2021-01-18 00:01:08 +01:00
Michael Peter Christen
5a7f12a9c1 allow network scans for non-standard http/https ports 2021-01-11 00:28:24 +01:00
sgaebel
b8d264f7ec fixes logging 2021-01-04 20:53:40 +01:00
Michael Peter Christen
4c920d05b5 removed superfluous lines 2020-12-29 20:19:58 +01:00
Michael Peter Christen
907f121d0c do not overwrite PW with random PW 2020-12-29 20:18:25 +01:00
Michael Peter Christen
3e6a1e0a49 fixed surrogate process counter 2020-12-28 18:26:22 +01:00
Michael Peter Christen
d3526c52af fixed a problem in warc importer: do not fail if single WARC entries are
faulty
2020-12-28 17:05:06 +01:00
Michael Peter Christen
3078b74e1d Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2020-12-22 00:46:56 +01:00
Michael Peter Christen
01cc32217f fixed apicall call method parameters
and verification in transaction manager
which did not have an exception for localhost/basic authentication
2020-12-22 00:46:47 +01:00
Michael Peter Christen
63f58e4785 enhanced strategy in host browser
limit number of fresh hosts in round robin hashes
2020-12-20 23:15:55 +01:00
Michael Peter Christen
9be36800a4 increased redirect depth by one
this makes sense if one redirect replaces http with https and another
replaces www subdomain by without (and vice versa)
2020-12-20 19:44:16 +01:00
Michael Peter Christen
d0abb0cedb enabling all crawl profiles in all network modes
also: increased default internet crawl speed to
4 urls/s/host
2020-12-19 01:00:51 +01:00
Michael Peter Christen
baad56d83d beautified default peer names 2020-12-14 02:08:49 +01:00
Michael Peter Christen
43a9f4f574 updated solr 6.6.6 -> 7.7.3
dropped GSA support (GSA API is still in YaCy Grid)
The 6.6.6 solr index works without migration also with 7.7.3
2020-12-12 02:06:43 +01:00
Michael Peter Christen
c0d9a3e9a7 turned HostBrowser into a admin-only page, now called IndexBrowser
This was required because spiders and bots crawled through this page and
created load on the peer without use for the user or the YaCy network.
2020-12-11 00:50:52 +01:00
Michael Peter Christen
d359d521a1 fixed warc importer
The importer tried to import gzipped files as plain warc.
It will now check the file extension and unzip automatically
on-the-fly.
2020-12-10 11:19:25 +01:00
Michael Peter Christen
e54ab39958 Going back to basic authentication for console/shell commands
This does not affect security because:
- it is going to localhost only
- only users who have already access to the pw hash can do this
- no clear text pw is transmitted because that is not stored anywhere
The switch to basic is required because these commands are required
in the context of hosting on root servers and docker containers
where a password change must be done. But the password shell command
was not working without password which made the concept unusable.
This deficit made it virtually impossible for root server operators
to use YaCy because they had been unable to set up a proper password.
2020-12-09 02:36:55 +01:00
Michael Peter Christen
6271e9122c javadoc fix 2020-12-09 02:22:47 +01:00
Michael Peter Christen
e0f4e3fd9a enhanced ability to debug the code 2020-12-09 02:22:30 +01:00
Michael Peter Christen
eea2d71851 prevent creation of auth schema factories every time a servlet is called 2020-12-06 01:49:34 +01:00
Michael Peter Christen
fcc9386ed3 enhanced the (already fast!) png exporter 2020-12-03 12:18:07 +01:00
Michael Peter Christen
4e9b425f98 missing fix for latest commit 2020-12-03 00:40:51 +01:00
Michael Peter Christen
3213d9db37 updated jetty from 9.4.17 to 9.4.35
and fixed a bug in ServerSideIncludes that appeared only in that recent
version of jetty
2020-12-03 00:21:15 +01:00
Michael Peter Christen
787fec0658 reduced complexity - removed concurrency in sort 2020-12-02 18:39:45 +01:00
Michael Peter Christen
cef5fde343 adding message to UI to make port change transparent 2020-12-02 18:05:38 +01:00
Michael Peter Christen
52228cb6be added a gc to cleanup process (once every 10 minutes) 2020-12-02 00:13:00 +01:00
Michael Peter Christen
22841ffbf1 creating a threaddump during every cleanup process
to be able to find out what a peer did (not) last time before a crash
2020-12-01 03:00:24 +01:00
Michael Peter Christen
36e616271b do better documentation on how to set a default password 2020-12-01 02:18:08 +01:00
Michael Peter Christen
df2bf9ef28 try to fix maven build error 2020-11-29 14:24:33 +01:00
Michael Peter Christen
264bab6700 trying to fight the UI unavailability
this patch addresses a possible issue with too many open connections to
remote peers
2020-11-29 14:15:34 +01:00
Michael Peter Christen
7947baeb49 removed all remaining deprecation warnings 2020-11-23 00:03:18 +01:00
Michael Peter Christen
c0f6d6e11d removed one deprecation warning for jetty library initializing ssl
server port
2020-11-22 23:27:58 +01:00
Michael Peter Christen
133440a7a6 some debug lines 2020-11-22 23:12:04 +01:00
sgaebel
3431f91db9 removes unused 'unused' tokens 2020-08-04 20:09:34 +02:00
sgaebel
fc03c4b4fe removes some warning and unused objects 2020-08-03 20:44:31 +02:00
sgaebel
4a495df63a removes some deprecation-warnings 2020-07-31 17:28:06 +02:00
sgaebel
dd9d4b1188 replace org.junit.Assert.assertThat by
org.hamcrest.MatcherAssert.assertThat from hamcrest 2.2 to avoid
deprecation-warning
2020-07-28 19:09:26 +02:00
sgaebel
df9ea0a42a removes some warnings: unused imports, params 2020-07-27 22:20:49 +02:00
sgaebel
9bc2297161 fixes deleting during recrawl 2020-07-22 22:15:00 +02:00
sgaebel
80785b785e adds deleting during recrawl 2020-07-09 19:32:16 +02:00
Michael Peter Christen
e0ad8ca9da replaced json library from JSON.org with libandroid-json-java
This fixes https://github.com/yacy/yacy_search_server/issues/347
2020-04-24 11:45:25 +02:00
Michael Peter Christen
ea8df27e95 modified org.json.* library to fit into the YaCy environment
as drop-in replacement.
Also made some fixes and enhancements to the library.
2020-04-24 11:42:06 +02:00
Michael Peter Christen
60dc1241a3 added org.json.* library
from https://android.googlesource.com/platform/libcore/+/refs/heads/master/json/src/main/java/org/json
as a preparation step for
https://github.com/yacy/yacy_search_server/issues/347
2020-04-24 10:28:43 +02:00
Michael Peter Christen
053e54a2c7 grant CORS for json files 2019-11-05 11:50:56 +01:00
Michael Christen
cfa27d2fd5 fixed links 2019-10-20 20:20:50 +02:00
Michael Christen
cb20aa7e54 removed donation message in search result column 2019-10-17 01:35:44 +02:00
Michael Christen
25227676ae removed some warnings 2019-09-28 02:07:08 +02:00
luccioman
6b45cd5799 New optional crawl filter on the URL a doc must match to crawl its links
For finer control over which parsed documents can trigger an addition of
their links to the crawl stack, complementary to the existing crawl
depth parameter.
2019-05-01 08:54:19 +02:00
luccioman
d16bc99835 Added "Show Metadata" links to the ViewFile.html links mode
To conveniently follow parsed links in the file viewer
2019-04-18 15:31:38 +02:00
luccioman
a5771b1f14 Made SNI extension user configurable without the need for server restart
TLS Server Name Indication (SNI) extension activation can now be
configured with the new Settings_p.html?page=httpClient administration
page.
SNI extension is also now enabled by default, as in 2019 the
unrecognized_name(112) alert is more properly handled by major web
servers TLS implementations, following the RFC 6066 standard.

Related YaCy issues : #153 #189 and #272
JDK 1.7 bug :
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7127374
Apache httpd issue :
https://bz.apache.org/bugzilla/show_bug.cgi?id=56241
RFC 6066 : https://tools.ietf.org/html/rfc6066#section-3
2019-04-14 15:41:13 +02:00
luccioman
e90405b6f0 Support parsing audio URLs without file extension
Added also a Junit for the audio tag parser
2019-04-09 11:40:21 +02:00
luccioman
a8316c79da Allow JS resorting of search results by unauthenticated users
Access rate limitations to this search mode by unauthenticated users are
set low by default to prevent unwanted server overload but can be
customized through the SearchAccessRate_p.html configuration page

Fixes #291
2019-04-03 14:21:53 +02:00
luccioman
0ab2b49c31 Made /yacysearch access rate limitations user configurable
With a new admin page at /SearchAccessRate_p.html in menu Network Access
> Local Search > Access Rate Limitations
2019-04-02 17:42:50 +02:00
luccioman
5b7e41202a Added Solr GSA writer support for responses from remote instances 2019-03-27 18:23:41 +01:00
luccioman
4d8a948455 Properly close PDF snapshots loaded with pdfbox library 2019-03-22 09:50:30 +01:00
luccioman
74e6d6e984 Added Solr GrepHTML writer support for responses from remote instances 2019-03-20 18:24:16 +01:00
luccioman
5e6501974d Added Solr snapshots writer support for responses from remote instances 2019-03-19 11:25:44 +01:00
luccioman
384c37102c Improve accuracy of total results count on latest pages in Stealth mode
Previously, when mixing results from local RWI and local Solr (Stealth
mode), total local Solr count could be ignored on last result pages,
when the page offset was higher than local Solr count but lower than
total RWI count.
2019-03-04 10:05:47 +01:00
luccioman
5e9a08355a Improved logging for federated search
- Do not use spaces in logger identifier name so the log level can be
configured in yacy.logging
- Hold the logger instance to avoid the logging system to look for it
from its name at each appended log message
2019-02-02 09:59:24 +01:00
luccioman
9782a98a9c Added the possibility to customize facets sort type and direction
Previously search navigators/facets elements were sorted only by counts.
Now from the ConfigSearchPage_p.html admin page, sort direction
(ascending/descending) and type (on counts or labels) can be customized
independently for each navigator.
2019-01-24 18:43:06 +01:00
sgaebel
c2398fd890 remove warnings: 'Statement unnecessarily nested within else clause' 2019-01-10 20:02:57 +01:00
sgaebel
811d40a6c4 taking care of closing inputstreams, HTTPClient 2019-01-04 18:58:49 +01:00
sgaebel
8d2e7262d9 Recrawl:
- set the chunksize to 100 to meet the max of the embedded solr
- re-enable sorting (the case where we switched it off should be gone)
- enable recrawling on remote-solr
2019-01-04 18:46:59 +01:00
sgaebel
8f58c1dcfa extend the SolrServlet to be usable as remote solr (incl. update)
this feature needs to be enabled by uncommenting the url-pattern
2019-01-04 18:27:44 +01:00
luccioman
7223a2fdb1 Removed usage of now deprecated Jetty function 2018-12-22 14:42:22 +01:00
luccioman
440d9f2fa0 Exclude peers with empty or disabled RWI from remote RWI search 2018-12-20 14:53:01 +01:00
luccioman
08ea0b0397 Added a configurable timeout to wkhtmltopdf calls for pdf snapshots
Necessary to prevent blocking the indexing workflow when some
wkhtmltopdf renderings fail without terminating
2018-12-11 22:31:31 +01:00
luccioman
3fb449b3b6 Properly resolve relative URLs against document URL in html base tags
Fixes issue #256
2018-12-06 20:18:00 +01:00
luccioman
73a6e45524 Extended detection of external tools used for Snapshots generation
This enables detecting wkhtmltopdf and Imagemagick convert executables
when they are at system Path in addition to common installation paths.
2018-12-06 09:53:08 +01:00
luccioman
7dc1f60619 Fixed detection of absolute data folder path on MS Windows 2018-11-18 10:08:20 +01:00
luccioman
595e144797 Trace a message on incomplete proper server finish when killing process 2018-11-15 17:32:22 +01:00
luccioman
9daeea823b Fixed concurrency issue on cache used for circles rendering
Without synchronization lock, concurrent rendering of images including
circles could lead to glitches as reported in issue #248
2018-11-10 22:00:49 +01:00
Michael Peter Christen
c347e7d3f8 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2018-11-08 14:42:52 +01:00
Michael Peter Christen
848e9304d9 evil bots may crawl harder 2018-11-08 14:42:40 +01:00
luccioman
a997133260 Fixed gzip decompression regression on index transfer APIs
Processing of gzip encoded incoming requests (on /yacy/transferRWI.html
and /yacy/transferURL.html) was no longer working since upgrade to Jetty
9.4.12 (see commit 51f4be1).

To prevent any conflicting behavior with Jetty internals, use now the
GzipHandler provided by Jetty to decompress incoming gzip encoded
requests rather than the previously used custom GZIPRequestWrapper.

Fixes issue #249
2018-11-07 14:52:42 +01:00
luccioman
e85f231bdf Fixed termination of Host browser and link structure Solr query threads
On some conditions (especially when reaching timeout), concurrent Solr
query tasks used by the /HostBrowser.html and /api/linkstructure.json
never terminated, thus leaking resources, as reported by @Vort in issue
#246
2018-11-06 10:10:09 +01:00
luccioman
fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
New "Media Type detection" section in the advanced crawl start page
allows choosing between:
- not loading URLs with unknown or unsupported file extension without
checking the actual Media Type (relying on the Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross check URL file extension against the actual Media Type.
This allows properly parsing URLs ending with an apparently odd file
extension, but which actually have a supported Media Type such as
text/html.

Sample URLs with misleading file extensions added as documentation in
the crawl start page.

fixes issue #244
2018-10-25 10:42:12 +02:00
luccioman
a83a56473e Added support for PDF snapshots generation when running on MS Windows 2018-10-18 12:41:57 +02:00
luccioman
8852c97cee Added basic styling for cleaner rendering of missing image snapshots
For the output of the Solr snapshots writer
2018-10-15 18:19:57 +02:00
luccioman
746e0e788d Render a relevant HTTP status code on snapshot image rendering error
Instead of a null response body which is not very helpful.
2018-10-14 10:30:30 +02:00
luccioman
50b6edfcf5 Updated Solr snapshots writer for a cleaner html head 2018-10-13 10:36:39 +02:00
luccioman
f366f43d6b Made snapshots size customizable in Solr snapshots response writer 2018-10-13 10:22:47 +02:00
luccioman
7a62fc0e66 Fixed concurrency issue in custom classloader used for template classes
As reported in issue #241, the problem is only critical (random but
complete crash of the JVM) when upgrading to JDK11.
2018-10-11 18:34:39 +02:00
luccioman
61c337f29a Decode blacklist entries for easier editing of non ascii chars
Not using the JDK URLDecoder.decode() function, as it strips '+'
characters when they occur after '?' (both characters having regular
expression semantics when used in blacklist path patterns)
2018-10-04 09:33:58 +02:00
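Note on 61c337f29a: `java.net.URLDecoder.decode()` follows form-encoding rules and turns every '+' into a space, which would corrupt a regular-expression quantifier in a blacklist path pattern. The sketch below shows the effect and one possible workaround (protecting '+' as "%2B" before decoding). The workaround and all names are assumptions for illustration; the commit does not say how YaCy decodes instead.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class PlusDecoding {
    /** Plain JDK decoding: percent-escapes are decoded and '+' becomes a space. */
    static String jdkDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a mandatory charset on every JVM, so this cannot happen in practice.
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical workaround: escape literal '+' first so it survives decoding. */
    static String decodeKeepingPlus(String s) {
        return jdkDecode(s.replace("+", "%2B"));
    }

    public static void main(String[] args) {
        String pattern = "/search?.+%C3%A9";
        System.out.println(jdkDecode(pattern));         // the '+' quantifier is lost
        System.out.println(decodeKeepingPlus(pattern)); // the '+' quantifier is kept
    }
}
```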
luccioman
ed93221fa1 Improved normalization of blacklist path patterns having non ascii chars
Normalize blacklist path patterns using percent-encoding, at pattern
edition in web interface and at loading from configuration files.

Fixes issue #237
2018-10-02 14:36:13 +02:00
luccioman
2a73b63d9e Use a constant default target file name for seed SCP upload method
To make seed upload (in /Settings_p.html?page=seed page) with SCP easier
when the user specify a remote target directory path.

See report by @vikulin in issue #227
2018-09-16 10:37:47 +02:00
luccioman
b5eabb626f Removed some dead code 2018-09-14 14:02:32 +02:00
luccioman
db7ad76366 Improved support for Java logs file pattern options
- support of "%h" and "%t" pattern components
- more proper initialization of file handler when the data folder is not
the default one, notably to prevent a non blocking but ugly error stack
trace reported by the log manager at startup with that kind of setup
2018-09-13 12:17:02 +02:00
luccioman
7adbd1f87d Fixed raw IPV6 addresses snapshots read/write on FAT32 and NTFS fs
Fixes issue #225
2018-09-12 17:34:40 +02:00
luccioman
9b1c87033b Fixed logs folder checking and creation
Previously, if YaCy log folder was for example at
`/home/user/yacy/DATA/LOG`, because of improper truncation of log path,
an unnecessary directory creation was attempted at `/home/us`.
2018-08-31 08:34:28 +02:00
luccioman
c29588dd6a Made possible to provide an absolute data root path for start script
Previously, only a path relative to the user home folder could be
provided
2018-08-30 18:16:22 +02:00
luccioman
d03c098b54 Removed deprecated warning comments about imports and Debian installer
Deprecated by commit be5d3a1066 , as
classpath is now defined in yacycore.jar Manifest file.
2018-08-22 22:35:00 +02:00
luccioman
5b60b4225f Fixed encoding of '+' character on search pages links
As revealed by issue #216
2018-08-20 18:44:04 +02:00
luccioman
54fbe166ba Updated pdf cache clear steps consistently with current pdfbox version
- Removed calls to no more existing clearResources functions (on PDFont
class and its children) since upgrade to pdfbox 2.n.n
- Removed hacky usage of protected internal ClassLoader function. This
removes the warnings displayed when running with JDK9 or JDK10 :

     [java] WARNING: Illegal reflective access by
net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to
method java.lang.ClassLoader.findLoadedClass(java.lang.String)
     [java] WARNING: Please consider reporting this to the maintainers
of net.yacy.document.parser.pdfParser$ResourceCleaner
     [java] WARNING: Use --illegal-access=warn to enable warnings of
further illegal reflective access operations
     [java] WARNING: All illegal access operations will be denied in a
future release

Crawling thousands of pdf documents from various sources after
modifications applied, revealed no new memory leak related to pdfbox
(measurements done with JVisualVM).
2018-08-16 18:23:42 +02:00
luccioman
685122363d Added a parser for XZ compressed archives.
As suggested by LA_FORGE on mantis 781
(http://mantis.tokeek.de/view.php?id=781)
2018-08-15 10:07:39 +02:00
luccioman
4ee14ff3c5 Fixed NullPointerException case on malformed crawl queue folder name 2018-08-13 14:35:26 +02:00
luccioman
21ad9435ec Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems
As reported by @vikulin in issue #187, crawling websites using a raw
IPv6 address as host name in their URL failed when running on Microsoft
Windows platforms (FAT32 or NTFS filesystems) when YaCy crawler created
the crawl queue folder, as the ':' character which is part of an IPV6
address is forbidden on these filesystems.
2018-08-11 10:02:26 +02:00
luccioman
8a29551c54 Upgraded the OpenGeoDB dump URL
The status of the library in the DictionaryLoader_p.html page now also
notifies the user that an upgrade can be applied when an older dump is
already loaded.

Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter
chat.
2018-08-03 18:39:41 +02:00
luccioman
373edf9eac Adjusted yjson Solr writer to support responses from an external Solr
Worked previously only with responses from YaCy embedded Solr, now able
to render the response when YaCy is configured to use an external Solr
index.
2018-07-31 16:22:21 +02:00
luccioman
87bd17b1cf Simplified a little bit the RSS OpenSearch Solr writer 2018-07-31 16:02:50 +02:00
luccioman
dc49ca9c27 Fixed a NPE case on the Solr OpenSearch response writer
Occurred when omitHeader parameter is set to true
2018-07-29 16:30:37 +02:00
luccioman
f4267ed247 Made Solr OpenSearch RSS writer compatible with external Solr index
Worked previously only with responses from YaCy embedded Solr, now able
to render the response when YaCy is configured to use an external Solr
index.
2018-07-28 11:03:31 +02:00
luccioman
b1410f593a Fixed stylesheet relative URLs rendering in Solr html writer
Relative URLs to CSS stylesheets were not properly rendered when using
the Solr html response writer and the "/solr/collection1/select" entry
point instead of "/solr/select".
2018-07-25 08:03:25 +02:00
luccioman
89c59814da Improved rendering of the Solr api relative url in the html writer
In order to have a consistent relative url when using either
/solr/select or /solr/collection1/select entry point.
2018-07-24 10:13:55 +02:00
luccioman
bf4f320b16 Optionally render the response header when using the Solr html writer
With params rendered as html input fields for conveniently modifying
params values and refreshing results.
2018-07-23 18:36:57 +02:00
luccioman
313204ae2c Override qf and df Solr params with defaults only when they are not set 2018-07-23 13:50:24 +02:00
luccioman
bdafb14336 Removed redundant synchronization lock on network switch function
Was useless as done in an already synchronized block, and the lock
object was assigned a new value in that same block, and nowhere else a
lock is requested on that same object.
2018-07-16 09:20:23 +02:00
luccioman
d5f44ea216 Removed unnecessary synchronization lock from serverSwitch constructor
Lock was useless here as it was set on an object instance attribute
while the object itself is not yet constructed and no other threads can
access it.
2018-07-16 09:13:50 +02:00
luccioman
dcad393fe5 Fixed exceeding max size of failreason_s Solr field on large link list
When using the 'From Link-List of URL' as a crawl start, with lists in
the order of one or more thousands of links, the failreason_s Solr field
maximum size (32kb) was exceeded by the string representation of the URL
must-match filter when a crawl URL was rejected because not matching.
2018-07-11 08:13:29 +02:00
luccioman
f467601561 Properly lock solrInstances for reboot and restoration of embedded Solr
Putting a synchronization lock directly on the solrInstances property
was ineffective as it is assigned a new (unlocked) instance in these
operations.
2018-07-08 08:57:59 +02:00
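Note on f467601561: synchronizing on a field that is reassigned inside the critical section protects nothing, because later threads lock the new object while earlier ones still hold the old one. A common remedy is a dedicated final lock object, sketched below. The class is hypothetical and `Object` merely stands in for the real type of solrInstances; the commit does not state which locking construct YaCy uses.

```java
final class InstanceHolder {
    // Dedicated lock that is never reassigned, so all threads contend on the same monitor.
    private final Object lock = new Object();

    // Stands in for the solrInstances property (hypothetical type).
    private Object instances = new Object();

    /** Replaces the held instance, as a reboot or restore operation would. */
    void reboot() {
        synchronized (lock) {
            instances = new Object();
        }
    }

    /** Reads the current instance under the same lock. */
    Object current() {
        synchronized (lock) {
            return instances;
        }
    }

    public static void main(String[] args) {
        InstanceHolder holder = new InstanceHolder();
        Object before = holder.current();
        holder.reboot();
        System.out.println(before != holder.current());
    }
}
```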
luccioman
9630f81306 Fixed small unnecessary lines of code 2018-07-08 08:15:26 +02:00
luccioman
876bcd2f54 Fixed useless comparison between int parameter and Long.MAX_VALUE 2018-07-08 08:11:01 +02:00
luccioman
c726154a59 Fixed removal of URLs from the delegatedURL remote crawl stack
URLs were removed from the stack using their hash as a bytes array,
whereas the hash is stored in the stack as String instance.
2018-07-05 09:36:36 +02:00
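Note on c726154a59: this kind of bug compiles silently because `Map.remove` accepts any `Object`, yet a `byte[]` never equals a `String` key. A minimal sketch follows; the map, the method names and the ASCII conversion are assumptions for illustration and not the actual delegatedURL stack code.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class HashKeyMismatch {
    /** Buggy variant: looks the key up as a byte[], which never matches a String key. */
    static boolean removeByBytes(Map<String, String> stack, byte[] hash) {
        return stack.remove((Object) hash) != null;
    }

    /** Fixed variant: converts the hash to the String form used as the key. */
    static boolean removeByString(Map<String, String> stack, byte[] hash) {
        return stack.remove(new String(hash, StandardCharsets.US_ASCII)) != null;
    }

    public static void main(String[] args) {
        Map<String, String> stack = new HashMap<>();
        stack.put("AAAAAAAAAAAA", "http://example.org/");
        byte[] hash = "AAAAAAAAAAAA".getBytes(StandardCharsets.US_ASCII);
        System.out.println(removeByBytes(stack, hash));  // entry stays in the map
        System.out.println(removeByString(stack, hash)); // entry is removed
    }
}
```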
luccioman
2bdd71de60 Added server side columns sorting on the Process Scheduler table
For easier usage of large tables in the Table_API_p.html page.
2018-07-04 10:28:32 +02:00
luccioman
bb51555830 Removed remaining unsafe accesses to SimpleDateFormat instances.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formatting dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-07-02 10:00:40 +02:00
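As a minimal sketch of the replacement recommended in the commit above: a `DateTimeFormatter` is immutable and can be shared as a static constant across threads without synchronization. The class name, pattern and UTC zone below are illustrative assumptions, not taken from the YaCy sources.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class SafeDateFormat {
    // Immutable and thread-safe: one shared instance is enough.
    // Pattern and zone are assumptions chosen for the example.
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    public static String format(long epochMillis) {
        return FORMATTER.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

A shared `SimpleDateFormat` would instead need external synchronization or one instance per thread.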
luccioman
f895745e1c Removed more unsafe concurrent accesses to SimpleDateFormat instances.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formatting dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-29 15:49:55 +02:00
luccioman
e97580dfc7 Fixed unsafe concurrent access to generic SimpleDateFormat instances
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formatting dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Prefer now DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-28 14:59:23 +02:00
luccioman
8811700e2e Upgraded Jetty dependency from 9.4.9 to 9.4.11 2018-06-20 09:33:26 +02:00
luccioman
d53c33e4ef Fixed potential infinite loop case (does not occur in current code base) 2018-06-20 07:51:59 +02:00
luccioman
a15ac8e0ca Made CrawlProfile loading tolerant to malformed json string attribute 2018-06-19 12:53:17 +02:00
luccioman
a715bb7876 Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml 2018-06-19 12:50:28 +02:00
luccioman
0b302c5004 Do not block whole server startup on persisted crawl profile load error 2018-06-19 12:48:17 +02:00
luccioman
4d9aa4ed1e Fixed default crawl profile solr mustnotmatch query from previous commit 2018-06-19 11:58:47 +02:00
luccioman
cced94298a Added a new crawler document filter type using Solr syntax
This makes it possible to set up much more advanced document crawl filters,
by filtering on one or more document indexed fields before inserting in
the index.
2018-06-19 10:12:20 +02:00
Michael Christen
e0dc632020 removed transformer
it was not used any more
2018-06-19 00:42:23 +02:00
luccioman
9bc7b6c39d Allow editing of scheduled next execution dates for finer control
Especially useful when scheduling many API calls over a long
period of time to precisely adjust each scheduled date/time.
2018-06-11 11:38:58 +02:00
luccioman
40e8c7b89b Use the heavy ConcurrentUpdateSolrClient only when necessary
Prefer the lightweight HttpSolrClient when no updates are performed on
the remote Solr instance, as recommended by Solr documentation itself.
2018-06-08 11:18:29 +02:00
luccioman
bd4cfeda3f Add a max acceptable limit to the size of Solr responses on p2p search
Following activation of gzip compression on responses, to ensure
uncompressed content can fit in available memory.
2018-06-08 10:33:23 +02:00
luccioman
de4ea95687 Consistently allow gzip compression of remote Solr responses
Was already enabled when requesting remote Solr with https or with
authentication (as an external Solr index)
2018-06-07 15:20:37 +02:00
luccioman
cea8187161 Reuse expired connections evictors threads provided by apache and solr 2018-06-06 14:24:05 +02:00
luccioman
b5dc1f376f Made outgoing pools max total connections user configurable
For a finer control over the maximum simultaneously active outgoing
connections.
2018-06-06 09:36:50 +02:00
luccioman
387d646c0e Added gzip compression of responses returned to user-agents accepting it
Enabled as default, but can be disabled using the "Server Access
Settings" admin page.
2018-06-05 13:35:39 +02:00
luccioman
a7a4ba3287 Apply remote solr configured timeout on getting connection from pool 2018-06-02 17:38:14 +02:00
luccioman
ee6670fb8f Use a common pooled http connection manager for remote solr instances
For a better control on the maximum simultaneous outgoing http
connections, as already done for any other http connections (crawls, rwi
search, p2p protocol) using the net.yacy.cora.protocol.http.HTTPClient
2018-05-29 09:24:21 +02:00
luccioman
d28f9ba0f6 Removed use of deprecated ConcurrentUpdateSolrClient constructor 2018-05-26 21:00:24 +02:00
luccioman
8a749aa5ad Trace level log message for monitoring remote solr response times 2018-05-26 20:58:05 +02:00
luccioman
35826a3091 Added a search page customization setting to display or not favicons
If you are not interested in displaying them in your search results, notably
on a peer with limited resources, this can help save some CPU and
outgoing network connections.
2018-05-25 11:13:43 +02:00
luccioman
0082b5ab2a Added missing default Solr http client connection timeout initialization
Consistently with the custom Solr http client used for https connections
to remote Solr peers or to YaCy external Solr storage.

This prevents remote Solr request threads from waiting to establish a
connection to a remote peer longer than the configured timeout.
2018-05-24 09:24:52 +02:00
luccioman
fa4399d5d2 Small perf improvement : initialize threads names early when possible
Initializing Thread names using the Thread constructor parameter is
faster as the constructor already sets a thread name even if no customized
one is given, while an additional call to the Thread.setName() function
internally does synchronized access, possibly runs an access check on the
security manager and performs a native call.

Profiling a running YaCy server revealed that the total processing time
spent on Thread.setName() for a typical p2p search was in the range of
seconds.
2018-05-23 14:45:35 +02:00
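The change described in the commit above amounts to passing the name to the `Thread` constructor instead of calling `Thread.setName()` afterwards. A minimal sketch, with a hypothetical helper class and thread name:

```java
public class NamedThreads {
    // Names the thread at construction time, avoiding a later setName() call.
    public static Thread named(Runnable task, String name) {
        return new Thread(task, name);
    }

    public static void main(String[] args) throws InterruptedException {
        // "p2p-search-worker" is an illustrative name only.
        Thread worker = named(
                () -> System.out.println(Thread.currentThread().getName()),
                "p2p-search-worker");
        worker.start();
        worker.join();
    }
}
```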
luccioman
84d82bfdd7 Adjusted suggestions timeout management
* less CPU usage using the Solr 'timeAllowed' parameter
* increased chances of getting some results even when a first operation step
times out, by leaving some time for final snippet results
processing
2018-05-21 14:49:43 +02:00
luccioman
65854bcb22 Fixed NullPointerException when omitHeader=true on external Solr server 2018-05-18 11:30:14 +02:00
luccioman
c4d984cec8 Fixed Solr response header duplication when requesting external Solr 2018-05-18 11:28:30 +02:00
luccioman
124cc24aa3 Properly handle embedded Solr partial results
Solr can provide partial results for example when a processing time
limit (specified with the parameter `timeAllowed`) is exceeded.

Before this fix, getting partial results from an embedded Solr index
resulted in a ClassCastException :
"org.apache.solr.common.SolrDocumentList cannot be cast to
org.apache.solr.response.ResultContext".
2018-05-18 10:14:54 +02:00
luccioman
3ce44cf250 Fixed largest snippet get : don't reject ones starting with a space char 2018-05-14 18:26:25 +02:00
luccioman
f511e16d50 Prevent duplication of Solr query highlight fields parameters
That was caused by concurrent modifications (with addHighlightField()
function) to the same SolrQuery instance when requesting Solr on remote
peers in p2p search.
2018-05-14 15:26:44 +02:00
luccioman
e357ade47d Reduced memory footprint of text snippet extraction
By not parsing and storing at first all sentences of a document, but
only on the fly the ones necessary to compute the snippet.
2018-05-13 10:29:52 +02:00
luccioman
e115e57cc7 Reduced text snippet extraction processing time.
By not generating MD5 hashes on all words of indexed texts, processing
time is reduced by 30 to 50% on indexed documents with more than 1Mbytes
of plain text.
2018-05-11 15:42:53 +02:00