yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
sgaebel	69adaa9f55	makes our HTTPClient closable	2021-10-31 23:06:02 +01:00
sgaebel	fc4275f901	handle all references for client, response, request to be able to close them	2021-10-31 23:05:50 +01:00
sgaebel	e7d3a363f2	refactor to use finish()	2021-10-31 11:22:35 +01:00
sgaebel	4fc876f4a3	revert back to use EntityUtils.consumeQuietly - as it simply closes the underlying stream	2021-10-31 11:22:28 +01:00
sgaebel	4f0392e93e	refactor use of AuthSchemeProvider	2021-10-31 11:21:59 +01:00
sgaebel	b74f337859	removes double setting of UserAgent	2021-10-31 11:21:06 +01:00
sgaebel	965748fefb	some refactoring using try with resources	2021-10-31 11:20:28 +01:00
Michael Peter Christen	552ab7051b	fix for warc importer	2021-10-25 19:35:15 +02:00
Michael Peter Christen	3c86b7b780	attempt to make a Mac Release using gradle. This is almost working with many workarounds: - run rm lib/yacycore.jar - run ./gradlew clean build bundleNative - run ant clean all - run again rm lib/yacycore.jar - run ./fixMacBuild.sh The build is then inside build/mac/YaCy.app. Right now this works so far but it does not have the correct release number inside. Target is to make this work for Windows releases and to embed the JRE entirely.	2021-10-25 18:37:39 +02:00
Michael Peter Christen	999c819e3e	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	2021-10-24 20:50:14 +02:00
Michael Peter Christen	fd770e90e2	spike to identify paths for YaCy within mac application bundles	2021-10-24 20:49:59 +02:00
Michael Peter Christen	d19872fd26	making sure that crawl queues are closed correctly to prevent data loss	2021-10-14 00:30:04 +02:00
sgaebel	90507c0fdc	comments out printing query params to std.out	2021-10-04 18:03:06 +02:00
Michael Peter Christen	be0aebad84	fixes https://github.com/yacy/yacy_search_server/issues/424	2021-10-04 14:38:49 +02:00
Michael Peter Christen	63ad8ce6b2	removed ymarks, which had not been used for a long time	2021-09-16 22:23:51 +02:00
Michael Peter Christen	ef5a71a592	enhanced crawl start response time for very very large crawl start lists	2021-09-16 21:01:01 +02:00
Michael Peter Christen	4cadd557dc	removed synchronization in table creation to avoid possible deadlocks when handling OnDemandOpenFileIndex which happens quite often during wide crawling	2021-09-15 19:34:49 +02:00
admin	9b7668fa58	reduced memory footprint during indexing/crawling	2021-08-24 12:24:52 +02:00
Michael Peter Christen	e6a87e0426	enhanced crawler: a main problem when crawling is the long waiting time caused by crawl-delay values from robots.txt entries. That attribute is not supported by google and is interpreted by yandex and bing in different ways. In large crawls there is always one host which blocks the whole crawl with extremely large values. YaCy now still obeys crawl-delay but limits it to 10 seconds. Additionally the blocking logic when loading new robots.txt was analyzed and a deadlock was removed. Furthermore the construction of new queue lists was redesigned and it was ensured that a large list of different hosts for host-balancing is always provided for the loader.	2021-08-17 15:23:21 +02:00
Michael Peter Christen	e9c5e78868	replaced new Number(Number) with Number.valueOf to remove deprecation warnings for Java 9	2021-08-08 00:39:03 +02:00
Michael Peter Christen	9e13d77de4	removed call to class.finalize() because of deprecation in java 9 next: removal of finalize() implementation after testing with assert false	2021-08-07 18:57:49 +02:00
Michael Peter Christen	9ef4503672	fixed some newInstance() warnings .. by adding .getDeclaredConstructor()	2021-08-07 18:46:53 +02:00
Michael Peter Christen	1d41380f0a	better support for mac-specific tray functions in java 9	2021-07-12 17:27:59 +02:00
Michael Peter Christen	e81b770f79	enabled crawl starts with very large sets of start urls i.e. 10MB large url list with approx 0.5 million start points	2021-06-30 10:45:58 +02:00
Michael Peter Christen	c623a3252e	fix for jdk 14 bug	2021-04-23 09:11:03 +02:00
Michael Peter Christen	dbd211a1ad	removed/replaced reflection in memory tool	2021-04-22 20:24:13 +02:00
Michael Peter Christen	1cdb21592b	added hazelcast and some modifications to align legacy YaCy with YaCyGrid	2021-04-15 20:39:22 +02:00
Michael Christen	42ea2a1c6f	Merge pull request #405 from jfhs/jfhs/support-all-html-entities Improve HTML entities support	2021-03-31 01:44:54 +02:00
Michael Christen	b2af745dd6	Merge pull request #404 from lnceballosz/master NGI0 - Updating licensing aspects according REUSE	2021-03-30 23:48:21 +02:00
jfhs	10bddc2c2d	Decode HTML entities in all property values by default	2021-03-30 22:24:55 +02:00
jfhs	2135d259e3	Replace hardcoded html/xml entities with a file, support decoding all defined HTML entities	2021-03-30 22:24:54 +02:00
Michael Peter Christen	8f876a8c72	added concurrency to enhance indexing speed during json surrogate import	2021-03-30 12:07:36 +02:00
Michael Peter Christen	f8cbaeef93	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	2021-03-29 18:46:53 +02:00
Michael Peter Christen	a857e3d3d5	fix for json importer	2021-03-29 18:46:42 +02:00
sgaebel	1546232c94	adds ranking for multi document queries only	2021-03-20 17:48:35 +01:00
sgaebel	93b353d22d	does not boost or add fields for zero-row-queries (exists())	2021-03-20 17:48:26 +01:00
sgaebel	f16cd154f7	removes unused imports and variables	2021-03-20 15:14:09 +01:00
sgaebel	c69c462a15	replaces an expensive getLoadTimeURL() by exists(); refactors urlExists to getHarvestProcess as that is what it does	2021-03-20 15:01:31 +01:00
sgaebel	a5488ac8f5	uses edismax queries on query counts > 1 only	2021-03-20 01:06:09 +01:00
sgaebel	26223dc25a	replaces getLoadTime() by exists() with a simpler query since solr-8.8.1 getLoadTime() causes a high cpu usage	2021-03-20 01:06:02 +01:00
sgaebel	8e4d014c06	removes useless SolrRequestInfo.clearRequestInfo(), avoids spamming the log	2021-03-18 22:33:39 +01:00
Lina Ceballos	a96752f5ab	adding SPDX license and copyright headers	2021-03-11 12:17:11 +01:00
Michael Peter Christen	e18d0ef544	trying to set a higher priority to the process that is involved in index export	2021-03-09 00:04:05 +01:00
Michael Peter Christen	8b4394a6c5	fixes for solr 8.8.1 migration - replace new guava 30 with older 25 because that is the correct dependency for solr 8.8.1. The newer one did actually not work! - index will be created in a DATA/INDEX/freeworld/SEGMENTS/solr_8_8_1 subfolder. The older solr_6_6 index is not touched but also not migrated. The index starts with fresh (empty) content. - Older indexes must be migrated by hand (export/import) so far until a better solution is found. - Large schema adaptations for lucene 8.8.1	2021-03-08 13:39:27 +01:00
Michael Peter Christen	ed9789214e	fixed seed initialization problem	2021-03-06 13:35:46 +01:00
Al Sutton	8ade8b8775	Remove forced clear to match new behaviour in `2da71c2a40`	2021-03-04 16:37:56 +00:00
Al Sutton	09695fc6d3	Update exceptions to match updated API	2021-03-04 16:34:02 +00:00
Al Sutton	69014a701e	Update API Usage	2021-03-04 16:14:56 +00:00
Michael Peter Christen	3da7628117	use environment variables to overwrite configuration variables. You can e.g. do: export YACY_PORT=8092 && ./startYACY.sh Just prepend "YACY_" to the uppercase version of the configuration variable name and replace all "." with "_".	2021-02-09 20:26:49 +01:00
Michael Peter Christen	13a2e6dc6e	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	2021-01-25 11:49:32 +01:00
Michael Peter Christen	0ae8ccf657	Make it possible to set an empty password disabling the authentication protocol completely. If you now set an empty password, then the http server will not ask to authenticate. This is required for environments where we attach an outside authentication service like keycloak or similar using authentication in an ingress proxy. This change is part of the approach to run YaCy inside of a kubernetes cluster where we do not want individual authentication of peers and want to apply an ingress authentication.	2021-01-25 11:49:21 +01:00
Michael Peter Christen	96592a10cf	added option to set yacy configuration values using environment variables. To use that feature, set an environment variable with prefix "yacy." and suffix identical to the yacy configuration attribute name. Additionally we implemented a way to set a peer name using the setting "network.unit.agent". This can therefore now be used to set a peer name with the java call parameter -Dyacy.network.unit.agent=anonymous The purpose of this feature is the ability to set peer names in mass-deployed kubernetes clusters to the same name to prevent flooding peer name statistics with auto-deployment-generated names.	2021-01-24 22:50:37 +01:00
Michael Peter Christen	198826c362	added network scanner process to discover all YaCy peers in the intranet this will be used to wire YaCy peers in a kubernetes cluster	2021-01-23 15:14:49 +01:00
Michael Peter Christen	d9602e8325	Implemented a new syntax in the template engine to simplify json APIs Added also an example for one of the existing APIs. The problem is the comma separator between objects which must not be there for the last entry in a sequence. The new syntax adds the separator symbol automatically.	2021-01-18 00:01:08 +01:00
Michael Peter Christen	5a7f12a9c1	allow network scans for non-standard http/https ports	2021-01-11 00:28:24 +01:00
sgaebel	b8d264f7ec	fixes logging	2021-01-04 20:53:40 +01:00
Michael Peter Christen	4c920d05b5	removed superfluous lines	2020-12-29 20:19:58 +01:00
Michael Peter Christen	907f121d0c	do not overwrite PW with random PW	2020-12-29 20:18:25 +01:00
Michael Peter Christen	3e6a1e0a49	fixed surrogate process counter	2020-12-28 18:26:22 +01:00
Michael Peter Christen	d3526c52af	fixed a problem in warc importer: do not fail if single WARC entries are faulty	2020-12-28 17:05:06 +01:00
Michael Peter Christen	3078b74e1d	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	2020-12-22 00:46:56 +01:00
Michael Peter Christen	01cc32217f	fixed apicall call method parameters and verification in transaction manager which did not have an exception for localhost/basic authentication	2020-12-22 00:46:47 +01:00
Michael Peter Christen	63f58e4785	enhanced strategy in host browser limit number of fresh hosts in round robin hashes	2020-12-20 23:15:55 +01:00
Michael Peter Christen	9be36800a4	increased redirect depth by one this makes sense if one redirect replaces http with https and another replaces www subdomain by without (and vice versa)	2020-12-20 19:44:16 +01:00
Michael Peter Christen	d0abb0cedb	enabling all crawl profiles in all network modes also: increased default internet crawl speed to 4 urls/s/host	2020-12-19 01:00:51 +01:00
Michael Peter Christen	baad56d83d	beautified default peer names	2020-12-14 02:08:49 +01:00
Michael Peter Christen	43a9f4f574	updated solr 6.6.6 -> 7.7.3 dropped GSA support (GSA API is still in YaCy Grid) The 6.6.6 solr index works without migration also with 7.7.3	2020-12-12 02:06:43 +01:00
Michael Peter Christen	c0d9a3e9a7	turned HostBrowser into a admin-only page, now called IndexBrowser This was required because spiders and bots crawled through this page and created load on the peer without use for the user or the YaCy network.	2020-12-11 00:50:52 +01:00
Michael Peter Christen	d359d521a1	fixed warc importer. The importer tried to import gzipped files as plain warc. It will now check the file extension and automatically unzip on-the-fly.	2020-12-10 11:19:25 +01:00
Michael Peter Christen	e54ab39958	Going back to basic authentication for console/shell commands This does not affect security because: - it is going to localhost only - only users who have already access to the pw hash can do this - no clear text pw is transmitted because that is not stored anywhere The switch to basic is required because these commands are required in the context of hosting on root servers and docker containers where a password change must be done. But the password shell command was not working without password which made the concept unusable. This deficit made it virtually impossible for root server operators to use YaCy because they had been unable to set up a proper password.	2020-12-09 02:36:55 +01:00
Michael Peter Christen	6271e9122c	javadoc fix	2020-12-09 02:22:47 +01:00
Michael Peter Christen	e0f4e3fd9a	enhanced ability to debug the code	2020-12-09 02:22:30 +01:00
Michael Peter Christen	eea2d71851	prevent creation of auth schema factories every time a servlet is called	2020-12-06 01:49:34 +01:00
Michael Peter Christen	fcc9386ed3	enhanced the (already fast!) png exporter	2020-12-03 12:18:07 +01:00
Michael Peter Christen	4e9b425f98	missing fix for latest commit	2020-12-03 00:40:51 +01:00
Michael Peter Christen	3213d9db37	updated jetty from 9.4.17 to 9.4.35 and fixed a bug in ServerSideIncludes that appeared only in that recent version of jetty	2020-12-03 00:21:15 +01:00
Michael Peter Christen	787fec0658	reduced complexity - removed concurrency in sort	2020-12-02 18:39:45 +01:00
Michael Peter Christen	cef5fde343	adding message to UI to make port change transparent	2020-12-02 18:05:38 +01:00
Michael Peter Christen	52228cb6be	added a gc to cleanup process (once every 10 minutes)	2020-12-02 00:13:00 +01:00
Michael Peter Christen	22841ffbf1	creating a threaddump during every cleanup process to be able to find out what a peer did (not) last time before a crash	2020-12-01 03:00:24 +01:00
Michael Peter Christen	36e616271b	do better documentation on how to set a default password	2020-12-01 02:18:08 +01:00
Michael Peter Christen	df2bf9ef28	try to fix maven build error	2020-11-29 14:24:33 +01:00
Michael Peter Christen	264bab6700	trying to fight the UI unavailability: this patch addresses a possible issue with too many open connections to remote peers	2020-11-29 14:15:34 +01:00
Michael Peter Christen	7947baeb49	removed all remaining deprecation warnings	2020-11-23 00:03:18 +01:00
Michael Peter Christen	c0f6d6e11d	removed one deprecation warning for jetty library initializing ssl server port	2020-11-22 23:27:58 +01:00
Michael Peter Christen	133440a7a6	some debug lines	2020-11-22 23:12:04 +01:00
sgaebel	3431f91db9	removes unused 'unused' tokens	2020-08-04 20:09:34 +02:00
sgaebel	fc03c4b4fe	removes some warning and unused objects	2020-08-03 20:44:31 +02:00
sgaebel	4a495df63a	removes some deprecation-warnings	2020-07-31 17:28:06 +02:00
sgaebel	dd9d4b1188	replace org.junit.Assert.assertThat by org.hamcrest.MatcherAssert.assertThat from hamcrest 2.2 to avoid deprecation-warning	2020-07-28 19:09:26 +02:00
sgaebel	df9ea0a42a	removes some warnings: unused imports, params	2020-07-27 22:20:49 +02:00
sgaebel	9bc2297161	fixes deleting during recrawl	2020-07-22 22:15:00 +02:00
sgaebel	80785b785e	adds deleting during recrawl	2020-07-09 19:32:16 +02:00
Michael Peter Christen	e0ad8ca9da	replaced json library from JSON.org with libandroid-json-java This fixes https://github.com/yacy/yacy_search_server/issues/347	2020-04-24 11:45:25 +02:00
Michael Peter Christen	ea8df27e95	modified org.json.* library to fit into the YaCy environment as drop-in replacement. Also made some fixes and enhancements to the library.	2020-04-24 11:42:06 +02:00
Michael Peter Christen	60dc1241a3	added org.json.* library from https://android.googlesource.com/platform/libcore/+/refs/heads/master/json/src/main/java/org/json as a preparation step for https://github.com/yacy/yacy_search_server/issues/347	2020-04-24 10:28:43 +02:00
Michael Peter Christen	053e54a2c7	grant CORS for json files	2019-11-05 11:50:56 +01:00
Michael Christen	cfa27d2fd5	fixed links	2019-10-20 20:20:50 +02:00
Michael Christen	cb20aa7e54	removed donation message in search result column	2019-10-17 01:35:44 +02:00
Michael Christen	25227676ae	removed some warnings	2019-09-28 02:07:08 +02:00
luccioman	6b45cd5799	New optional crawl filter on the URL a doc must match to crawl its links For finer control over which parsed documents can trigger an addition of their links to the crawl stack, complementary to the existing crawl depth parameter.	2019-05-01 08:54:19 +02:00
luccioman	d16bc99835	Added "Show Metadata" links to the ViewFile.html links mode To conveniently follow parsed links in the file viewer	2019-04-18 15:31:38 +02:00
luccioman	a5771b1f14	Made SNI extension user configurable without the need for server restart TLS Server Name Indication (SNI) extension activation can now be configured with the new Settings_p.html?page=httpClient administration page. SNI extension is also now enabled by default, as in 2019 the unrecognized_name(112) alert is more properly handled by major web servers TLS implementations, following the RFC 6066 standard. Related YaCy issues : #153 #189 and #272 JDK 1.7 bug : https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7127374 Apache httpd issue : https://bz.apache.org/bugzilla/show_bug.cgi?id=56241 RFC 6066 : https://tools.ietf.org/html/rfc6066#section-3	2019-04-14 15:41:13 +02:00
luccioman	e90405b6f0	Support parsing audio URLs without file extension Added also a Junit for the audio tag parser	2019-04-09 11:40:21 +02:00
luccioman	a8316c79da	Allow JS resorting of search results by unauthenticated users. Access rate limitations to this search mode by unauthenticated users are set low by default to prevent unwanted server overload but can be customized through the SearchAccessRate_p.html configuration page. Fixes #291	2019-04-03 14:21:53 +02:00
luccioman	0ab2b49c31	Made /yacysearch access rate limitations user configurable With a new admin page at /SearchAccessRate_p.html in menu Network Access > Local Search > Access Rate Limitations	2019-04-02 17:42:50 +02:00
luccioman	5b7e41202a	Added Solr GSA writer support for responses from remote instances	2019-03-27 18:23:41 +01:00
luccioman	4d8a948455	Properly close PDF snapshots loaded with pdfbox library	2019-03-22 09:50:30 +01:00
luccioman	74e6d6e984	Added Solr GrepHTML writer support for responses from remote instances	2019-03-20 18:24:16 +01:00
luccioman	5e6501974d	Added Solr snapshots writer support for responses from remote instances	2019-03-19 11:25:44 +01:00
luccioman	384c37102c	Improve accuracy of total results count on latest pages in Stealth mode Previously, when mixing results from local RWI and local Solr (Stealth mode), total local Solr count could be ignored on last result pages, when the page offset was higher than local Solr count but lower than total RWI count.	2019-03-04 10:05:47 +01:00
luccioman	5e9a08355a	Improved logging for federated search - Do not use spaces in logger identifier name so the log level can be configured in yacy.logging - Hold the logger instance to avoid the logging system to look for it from its name at each appended log message	2019-02-02 09:59:24 +01:00
luccioman	9782a98a9c	Added the possibility to customize facets sort type and direction Previously search navigators/facets elements were sorted only by counts. Now from the ConfigSearchPage_p.html admin page, sort direction (ascending/descending) and type (on counts or labels) can be customized independently for each navigator.	2019-01-24 18:43:06 +01:00
sgaebel	c2398fd890	remove warnings: 'Statement unnecessarily nested within else clause'	2019-01-10 20:02:57 +01:00
sgaebel	811d40a6c4	taking care of closing inputstreams, HTTPClient	2019-01-04 18:58:49 +01:00
sgaebel	8d2e7262d9	Recrawl: - set the chunksize to 100 to meet the max of the embedded solr - re-enable sorting (the case where we switched it off should be away) - enable recrawling on remote-solr	2019-01-04 18:46:59 +01:00
sgaebel	8f58c1dcfa	extend the SolrServlet to be usable as remote solr (incl. update); this feature needs to be enabled by uncommenting the url-pattern	2019-01-04 18:27:44 +01:00
luccioman	7223a2fdb1	Removed usage of now deprecated Jetty function	2018-12-22 14:42:22 +01:00
luccioman	440d9f2fa0	Exclude peers with empty or disabled RWI from remote RWI search	2018-12-20 14:53:01 +01:00
luccioman	08ea0b0397	Added a configurable timeout to wkhtmltopdf calls for pdf snapshots Necessary to prevent blocking the indexing workflow when some wkhtmltopdf renderings fail without terminating	2018-12-11 22:31:31 +01:00
luccioman	3fb449b3b6	Properly resolve relative URLs against document URL in html base tags Fixes issue #256	2018-12-06 20:18:00 +01:00
luccioman	73a6e45524	Extended detection of external tools used for Snapshots generation. This enables detecting wkhtmltopdf and Imagemagick convert executables when they are at system Path in addition to common installation paths.	2018-12-06 09:53:08 +01:00
luccioman	7dc1f60619	Fixed detection of absolute data folder path on MS Windows	2018-11-18 10:08:20 +01:00
luccioman	595e144797	Trace a message on incomplete proper server finish when killing process	2018-11-15 17:32:22 +01:00
luccioman	9daeea823b	Fixed concurrency issue on cache used for circles rendering Without synchronization lock, concurrent rendering of images including circles could lead to glitches as reported in issue #248	2018-11-10 22:00:49 +01:00
Michael Peter Christen	c347e7d3f8	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	2018-11-08 14:42:52 +01:00
Michael Peter Christen	848e9304d9	evil bots may crawl harder	2018-11-08 14:42:40 +01:00
luccioman	a997133260	Fixed gzip decompression regression on index transfer APIs Processing of gzip encoded incoming requests (on /yacy/transferRWI.html and /yacy/transferURL.html) was no more working since upgrade to Jetty 9.4.12 (see commit `51f4be1`). To prevent any conflicting behavior with Jetty internals, use now the GzipHandler provided by Jetty to decompress incoming gzip encoded requests rather than the previously used custom GZIPRequestWrapper. Fixes issue #249	2018-11-07 14:52:42 +01:00
luccioman	e85f231bdf	Fixed termination of Host browser and link structure Solr query threads On some conditions (especially when reaching timeout), concurrent Solr query tasks used by the /HostBrowser.html and /api/linkstructure.json never terminated, thus leaking resources, as reported by @Vort in issue #246	2018-11-06 10:10:09 +01:00
luccioman	fcf6b16db4	Added new crawler attribute for finer control over Media Type detection. New "Media Type detection" section in the advanced crawl start page allows choosing between : - not loading URLs with unknown or unsupported file extension without checking the actual Media Type (relying on the Content-Type header for now). This was the old default behavior, faster, but not really accurate. - always cross check URL file extension against the actual Media Type. This lets YaCy properly parse URLs ending with an apparently odd file extension, but which actually have a supported Media Type such as text/html. Sample URLs with misleading file extensions added as documentation in the crawl start page. Fixes issue #244	2018-10-25 10:42:12 +02:00
luccioman	a83a56473e	Added support for PDF snapshots generation when running on MS Windows	2018-10-18 12:41:57 +02:00
luccioman	8852c97cee	Added basic styling for cleaner rendering of missing image snapshots For the output of the Solr snapshots writer	2018-10-15 18:19:57 +02:00
luccioman	746e0e788d	Render a relevant HTTP status code on snapshot image rendering error Instead of a null response body which is not very helpful.	2018-10-14 10:30:30 +02:00
luccioman	50b6edfcf5	Updated Solr snapshots writer for a cleaner html head	2018-10-13 10:36:39 +02:00
luccioman	f366f43d6b	Made snapshots size customizable in Solr snapshots response writer	2018-10-13 10:22:47 +02:00
luccioman	7a62fc0e66	Fixed concurrency issue in custom classloader used for template classes As reported in issue #241, the problem is only critical (random but complete crash of the JVM) when upgrading to JDK11.	2018-10-11 18:34:39 +02:00
luccioman	61c337f29a	Decode blacklist entries for easier edition of non ascii chars Not using the JDK URLDecoder.decode() function, as it strips '+' characters when they occur after '?' (both characters having regular expression semantics when used in blacklist path patterns)	2018-10-04 09:33:58 +02:00
luccioman	ed93221fa1	Improved normalization of blacklist path patterns having non ascii chars Normalize blacklist path patterns using percent-encoding, at pattern edition in web interface and at loading from configuration files. Fixes issue #237	2018-10-02 14:36:13 +02:00
luccioman	2a73b63d9e	Use a constant default target file name for seed SCP upload method. To make seed upload (in /Settings_p.html?page=seed page) with SCP easier when the user specifies a remote target directory path. See report by @vikulin in issue #227	2018-09-16 10:37:47 +02:00
luccioman	b5eabb626f	Removed some dead code	2018-09-14 14:02:32 +02:00
luccioman	db7ad76366	Improved support for Java logs file pattern options - support of "%h" and "%t" pattern components - more proper initialization of file handler when the data folder is not the default one, notably to prevent a non blocking but ugly error stack trace reported by the log manager at startup with that kind of setup	2018-09-13 12:17:02 +02:00
luccioman	7adbd1f87d	Fixed raw IPV6 addresses snapshots read/write on FAT32 and NTFS fs Fixes issue #225	2018-09-12 17:34:40 +02:00
luccioman	9b1c87033b	Fixed logs folder checking and creation. Previously, if YaCy log folder was for example at `/home/user/yacy/DATA/LOG`, because of improper truncation of log path, an unnecessary directory creation was attempted at `/home/us`.	2018-08-31 08:34:28 +02:00
luccioman	c29588dd6a	Made possible to provide an absolute data root path for start script Previously, only a path relative to the user home folder could be provided	2018-08-30 18:16:22 +02:00
luccioman	d03c098b54	Removed deprecated warning comments about imports and Debian installer Deprecated by commit `be5d3a1066` , as classpath is now defined in yacycore.jar Manifest file.	2018-08-22 22:35:00 +02:00
luccioman	5b60b4225f	Fixed encoding of '+' character on search pages links As revealed by issue #216	2018-08-20 18:44:04 +02:00
luccioman	54fbe166ba	Updated pdf cache clear steps consistently with current pdfbox version - Removed calls to no more existing clearResources functions (on PDFont class and its children) since upgrade to pdfbox 2.n.n - Removed hacky usage of protected internal ClassLoader function. This removes the warnings displayed when running with JDK9 or JDK10 : [java] WARNING: Illegal reflective access by net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to method java.lang.ClassLoader.findLoadedClass(java.lang.String) [java] WARNING: Please consider reporting this to the maintainers of net.yacy.document.parser.pdfParser$ResourceCleaner [java] WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations [java] WARNING: All illegal access operations will be denied in a future release Crawling thousands of pdf documents from various sources after modifications applied, revealed no new memory leak related to pdfbox (measurements done with JVisualVM).	2018-08-16 18:23:42 +02:00
luccioman	685122363d	Added a parser for XZ compressed archives. As suggested by LA_FORGE on mantis 781 (http://mantis.tokeek.de/view.php?id=781)	2018-08-15 10:07:39 +02:00
luccioman	4ee14ff3c5	Fixed NullPointerException case on malformed crawl queue folder name	2018-08-13 14:35:26 +02:00
luccioman	21ad9435ec	Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems As reported by @vikulin in issue #187, crawling websites using a raw IPv6 address as host name in their URL failed when running on Microsoft Windows platforms (FAT32 or NTFS filesystems) when YaCy crawler created the crawl queue folder, as the ':' character which is part of an IPV6 address is forbidden on these filesystems.	2018-08-11 10:02:26 +02:00
luccioman	8a29551c54	Upgraded the OpenGeoDB dump URL The status of the library in the DictionaryLoader_p.html page now also advertises the user that an upgrade can be applied when an older dump is already loaded. Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter chat.	2018-08-03 18:39:41 +02:00
luccioman	373edf9eac	Adjusted yjson Solr writer to support responses from an external Solr Worked previously only with responses from YaCy embedded Solr, now able to render the response when YaCy is configured to use an external Solr index.	2018-07-31 16:22:21 +02:00
luccioman	87bd17b1cf	Simplified a little bit the RSS OpenSearch Solr writer	2018-07-31 16:02:50 +02:00
luccioman	dc49ca9c27	Fixed a NPE case on the Solr OpenSearch response writer Occurred when omitHeader parameter is set to true	2018-07-29 16:30:37 +02:00
luccioman	f4267ed247	Made Solr OpenSearch RSS writer compatible with external Solr index Worked previously only with responses from YaCy embedded Solr, now able to render the response when YaCy is configured to use an external Solr index.	2018-07-28 11:03:31 +02:00
luccioman	b1410f593a	Fixed stylesheet relative URLs rendering in Solr html writer Relative URLs to CSS stylesheets were not properly rendered when using the Solr html response writer and the "/solr/collection1/select" entry point instead of "/solr/select".	2018-07-25 08:03:25 +02:00
luccioman	89c59814da	Improved rendering of the Solr api relative url in the html writer In order to have a consistent relative url when using either /solr/select or /solr/collection1/select entry point.	2018-07-24 10:13:55 +02:00
luccioman	bf4f320b16	Optionally render the response header when using the Solr html writer With params rendered as html input fields for conveniently modifying params values and refreshing results.	2018-07-23 18:36:57 +02:00
luccioman	313204ae2c	Override qf and df Solr params with defaults only when they are not set	2018-07-23 13:50:24 +02:00
luccioman	bdafb14336	Removed redundant synchronization lock on network switch function Was useless as done in an already synchronized block, and the lock object was assigned a new value in that same block, and nowhere else a lock is requested on that same object.	2018-07-16 09:20:23 +02:00
luccioman	d5f44ea216	Removed unnecessary synchronization lock from serverSwitch constructor Lock was useless here as it was set on an object instance attribute while the object itself is not yet constructed and no other threads can access it.	2018-07-16 09:13:50 +02:00
luccioman	dcad393fe5	Fixed exceeding max size of failreason_s Solr field on large link list When using the 'From Link-List of URL' as a crawl start, with lists in the order of one or more thousands of links, the failreason_s Solr field maximum size (32kb) was exceeded by the string representation of the URL must-match filter when a crawl URL was rejected because not matching.	2018-07-11 08:13:29 +02:00
luccioman	f467601561	Properly lock solrInstances for reboot and restoration of embedded Solr Putting a synchronization lock directly on the solrInstances property was ineffective as it is assigned a new (unlocked) instance in these operations.	2018-07-08 08:57:59 +02:00
luccioman	9630f81306	Fixed small unnecessary lines of code	2018-07-08 08:15:26 +02:00
luccioman	876bcd2f54	Fixed useless comparison between int parameter and Long.MAX_VALUE	2018-07-08 08:11:01 +02:00
luccioman	c726154a59	Fixed removal of URLs from the delegatedURL remote crawl stack URLs were removed from the stack using their hash as a bytes array, whereas the hash is stored in the stack as String instance.	2018-07-05 09:36:36 +02:00
luccioman	2bdd71de60	Added server side columns sorting on the Process Scheduler table For easier usage of large tables in the Table_API_p.html page.	2018-07-04 10:28:32 +02:00
luccioman	bb51555830	Removed remaining unsafe accesses to SimpleDateFormat instances. SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Prefer now DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	2018-07-02 10:00:40 +02:00
luccioman	f895745e1c	Removed more unsafe concurrent accesses to SimpleDateFormat instances. SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Prefer now DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	2018-06-29 15:49:55 +02:00
luccioman	e97580dfc7	Fixed unsafe concurrent access to generic SimpleDateFormat instances. SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Prefer now DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	2018-06-28 14:59:23 +02:00
luccioman	8811700e2e	Upgraded Jetty dependency from 9.4.9 to 9.4.11	2018-06-20 09:33:26 +02:00
luccioman	d53c33e4ef	Fixed potential infinite loop case (does not occur in current code base)	2018-06-20 07:51:59 +02:00
luccioman	a15ac8e0ca	Made CrawlProfile loading tolerant to malformed json string attribute	2018-06-19 12:53:17 +02:00
luccioman	a715bb7876	Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml	2018-06-19 12:50:28 +02:00
luccioman	0b302c5004	Do not block whole server startup on persisted crawl profile load error	2018-06-19 12:48:17 +02:00
luccioman	4d9aa4ed1e	Fixed default crawl profile solr mustnotmatch query from previous commit	2018-06-19 11:58:47 +02:00
luccioman	cced94298a	Added a new crawler document filter type using Solr syntax This makes it possible to set up much more advanced document crawl filters, by filtering on one or more document indexed fields before inserting in the index.	2018-06-19 10:12:20 +02:00
Michael Christen	e0dc632020	removed transformer it was not used any more	2018-06-19 00:42:23 +02:00
luccioman	9bc7b6c39d	Allow editing of scheduled next execution dates for finer control Can be especially useful when scheduling many API calls over a long period of time to precisely adjust each scheduled date/time.	2018-06-11 11:38:58 +02:00
luccioman	40e8c7b89b	Use the heavy ConcurrentUpdateSolrClient only when necessary Prefer the lightweight HttpSolrClient when no updates are performed on the remote Solr instance, as recommended by Solr documentation itself.	2018-06-08 11:18:29 +02:00
luccioman	bd4cfeda3f	Add a max acceptable limit to the size of Solr responses on p2p search Following activation of gzip compression on responses, to ensure uncompressed content can fit on available memory.	2018-06-08 10:33:23 +02:00
luccioman	de4ea95687	Consistently allow gzip compression of remote Solr responses Was already enabled when requesting remote Solr with https or with authentication (as an external Solr index)	2018-06-07 15:20:37 +02:00
luccioman	cea8187161	Reuse expired connections evictors threads provided by apache and solr	2018-06-06 14:24:05 +02:00
luccioman	b5dc1f376f	Made outgoing pools max total connections user configurable For a finer control over the maximum simultaneously active outgoing connections.	2018-06-06 09:36:50 +02:00
luccioman	387d646c0e	Added gzip compression of responses returned to user-agents accepting it Enabled as default, but can be disabled using the "Server Access Settings" admin page.	2018-06-05 13:35:39 +02:00
luccioman	a7a4ba3287	Apply remote solr configured timeout on getting connection from pool	2018-06-02 17:38:14 +02:00
luccioman	ee6670fb8f	Use a common pooled http connection manager for remote solr instances For a better control on the maximum simultaneous outgoing http connections, as already done for any other http connections (crawls, rwi search, p2p protocol) using the net.yacy.cora.protocol.http.HTTPClient	2018-05-29 09:24:21 +02:00
luccioman	d28f9ba0f6	Removed use of deprecated ConcurrentUpdateSolrClient constructor	2018-05-26 21:00:24 +02:00
luccioman	8a749aa5ad	Trace level log message for monitoring remote solr response times	2018-05-26 20:58:05 +02:00
luccioman	35826a3091	Added a search page customization setting to display or not favicons If not interested in displaying this on your search results and notably on a peer with limited resources this can help saving some CPU and outgoing network connections.	2018-05-25 11:13:43 +02:00
luccioman	0082b5ab2a	Added missing default Solr http client connection timeout initialization Consistently with the custom Solr http client used for https connections to remote Solr peers or to YaCy external Solr storage. This prevents remote Solr request threads from waiting longer than the configured timeout to establish a connection to a remote peer.	2018-05-24 09:24:52 +02:00
luccioman	fa4399d5d2	Small perf improvement : initialize threads names early when possible Initializing Thread names using the Thread constructor parameter is faster as it already sets a thread name even if no customized one is given, while an additional call to the Thread.setName() function internally does synchronized access, eventually runs an access check on the security manager and performs a native call. Profiling a running YaCy server revealed that the total processing time spent on Thread.setName() for a typical p2p search was in the range of seconds.	2018-05-23 14:45:35 +02:00
luccioman	84d82bfdd7	Adjusted suggestions timeout management * less CPU usage using the Solr 'timeAllowed' parameter * increase chances to get some results even when a first operation step times out, by leaving some time for final snippets results processing	2018-05-21 14:49:43 +02:00
luccioman	65854bcb22	Fixed NullPointerException when omitHeader=true on external Solr server	2018-05-18 11:30:14 +02:00
luccioman	c4d984cec8	Fixed Solr response header duplication when requesting external Solr	2018-05-18 11:28:30 +02:00
luccioman	124cc24aa3	Properly handle embedded Solr partial results Solr can provide partial results for example when a processing time limit (specified with the parameter `timeAllowed`) is exceeded. Before this fix, getting partial results from an embedded Solr index resulted in a ClassCastException : "org.apache.solr.common.SolrDocumentList cannot be cast to org.apache.solr.response.ResultContext".	2018-05-18 10:14:54 +02:00
luccioman	3ce44cf250	Fixed largest snippet get : don't reject ones starting with a space char	2018-05-14 18:26:25 +02:00
luccioman	f511e16d50	Prevent duplication of Solr query highlight fields parameters That was caused by concurrent modifications (with addHighlightField() function) to the same SolrQuery instance when requesting Solr on remote peers in p2p search.	2018-05-14 15:26:44 +02:00
luccioman	e357ade47d	Reduced memory footprint of text snippet extraction By not parsing and storing at first all sentences of a document, but only on the fly the ones necessary to compute the snippet.	2018-05-13 10:29:52 +02:00
luccioman	e115e57cc7	Reduced text snippet extraction processing time. By not generating MD5 hashes on all words of indexed texts, processing time is reduced by 30 to 50% on indexed documents with more than 1Mbytes of plain text.	2018-05-11 15:42:53 +02:00

8973 Commits