yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
luccioman	a997133260	Fixed gzip decompression regression on index transfer APIs Processing of gzip encoded incoming requests (on /yacy/transferRWI.html and /yacy/transferURL.html) has not worked since the upgrade to Jetty 9.4.12 (see commit `51f4be1`). To prevent any conflicting behavior with Jetty internals, now use the GzipHandler provided by Jetty to decompress incoming gzip encoded requests rather than the previously used custom GZIPRequestWrapper. Fixes issue #249	2018-11-07 14:52:42 +01:00
luccioman	e85f231bdf	Fixed termination of Host browser and link structure Solr query threads Under some conditions (especially when reaching timeout), concurrent Solr query tasks used by /HostBrowser.html and /api/linkstructure.json never terminated, thus leaking resources, as reported by @Vort in issue #246	2018-11-06 10:10:09 +01:00
luccioman	fcf6b16db4	Added new crawler attribute for finer control over Media Type detection New "Media Type detection" section in the advanced crawl start page allows choosing between: - not loading URLs with unknown or unsupported file extension without checking the actual Media Type (relying on the Content-Type header for now). This was the old default behavior, faster, but not really accurate. - always cross checking the URL file extension against the actual Media Type. This lets YaCy properly parse URLs ending with an apparently odd file extension, but which actually have a supported Media Type such as text/html. Sample URLs with misleading file extensions added as documentation in the crawl start page. Fixes issue #244	2018-10-25 10:42:12 +02:00
luccioman	a83a56473e	Added support for PDF snapshots generation when running on MS Windows	2018-10-18 12:41:57 +02:00
luccioman	8852c97cee	Added basic styling for cleaner rendering of missing image snapshots For the output of the Solr snapshots writer	2018-10-15 18:19:57 +02:00
luccioman	746e0e788d	Render a relevant HTTP status code on snapshot image rendering error Instead of a null response body which is not very helpful.	2018-10-14 10:30:30 +02:00
luccioman	50b6edfcf5	Updated Solr snapshots writer for a cleaner html head	2018-10-13 10:36:39 +02:00
luccioman	f366f43d6b	Made snapshots size customizable in Solr snapshots response writer	2018-10-13 10:22:47 +02:00
luccioman	7a62fc0e66	Fixed concurrency issue in custom classloader used for template classes As reported in issue #241, the problem is only critical (random but complete crash of the JVM) when upgrading to JDK11.	2018-10-11 18:34:39 +02:00
luccioman	61c337f29a	Decode blacklist entries for easier editing of non ascii chars Not using the JDK URLDecoder.decode() function, as it strips '+' characters when they occur after '?' (both characters having regular expression semantics when used in blacklist path patterns)	2018-10-04 09:33:58 +02:00
luccioman	ed93221fa1	Improved normalization of blacklist path patterns having non ascii chars Normalize blacklist path patterns using percent-encoding, both when a pattern is edited in the web interface and when it is loaded from configuration files. Fixes issue #237	2018-10-02 14:36:13 +02:00
luccioman	2a73b63d9e	Use a constant default target file name for seed SCP upload method To make seed upload (in /Settings_p.html?page=seed page) with SCP easier when the user specifies a remote target directory path. See report by @vikulin in issue #227	2018-09-16 10:37:47 +02:00
luccioman	b5eabb626f	Removed some dead code	2018-09-14 14:02:32 +02:00
luccioman	db7ad76366	Improved support for Java logs file pattern options - support of "%h" and "%t" pattern components - more proper initialization of file handler when the data folder is not the default one, notably to prevent a non blocking but ugly error stack trace reported by the log manager at startup with that kind of setup	2018-09-13 12:17:02 +02:00
luccioman	7adbd1f87d	Fixed raw IPV6 addresses snapshots read/write on FAT32 and NTFS fs Fixes issue #225	2018-09-12 17:34:40 +02:00
luccioman	9b1c87033b	Fixed logs folder checking and creation Previously, if YaCy log folder was for example at `/home/user/yacy/DATA/LOG`, because of improper truncation of log path, an unnecessary directory creation was attempted at `/home/us`.	2018-08-31 08:34:28 +02:00
luccioman	c29588dd6a	Made it possible to provide an absolute data root path for start script Previously, only a path relative to the user home folder could be provided	2018-08-30 18:16:22 +02:00
luccioman	d03c098b54	Removed deprecated warning comments about imports and Debian installer Deprecated by commit `be5d3a1066` , as classpath is now defined in yacycore.jar Manifest file.	2018-08-22 22:35:00 +02:00
luccioman	5b60b4225f	Fixed encoding of '+' character on search pages links As revealed by issue #216	2018-08-20 18:44:04 +02:00
luccioman	54fbe166ba	Updated pdf cache clear steps consistently with current pdfbox version - Removed calls to no longer existing clearResources functions (on PDFont class and its children) since upgrade to pdfbox 2.n.n - Removed hacky usage of protected internal ClassLoader function. This removes the warnings displayed when running with JDK9 or JDK10 : [java] WARNING: Illegal reflective access by net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to method java.lang.ClassLoader.findLoadedClass(java.lang.String) [java] WARNING: Please consider reporting this to the maintainers of net.yacy.document.parser.pdfParser$ResourceCleaner [java] WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations [java] WARNING: All illegal access operations will be denied in a future release Crawling thousands of pdf documents from various sources after the modifications were applied revealed no new memory leak related to pdfbox (measurements done with JVisualVM).	2018-08-16 18:23:42 +02:00
luccioman	685122363d	Added a parser for XZ compressed archives. As suggested by LA_FORGE on mantis 781 (http://mantis.tokeek.de/view.php?id=781)	2018-08-15 10:07:39 +02:00
luccioman	4ee14ff3c5	Fixed NullPointerException case on malformed crawl queue folder name	2018-08-13 14:35:26 +02:00
luccioman	21ad9435ec	Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems As reported by @vikulin in issue #187, crawling websites using a raw IPv6 address as host name in their URL failed when running on Microsoft Windows platforms (FAT32 or NTFS filesystems) when YaCy crawler created the crawl queue folder, as the ':' character which is part of an IPV6 address is forbidden on these filesystems.	2018-08-11 10:02:26 +02:00
luccioman	8a29551c54	Upgraded the OpenGeoDB dump URL The status of the library in the DictionaryLoader_p.html page now also informs the user that an upgrade can be applied when an older dump is already loaded. Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter chat.	2018-08-03 18:39:41 +02:00
luccioman	373edf9eac	Adjusted yjson Solr writer to support responses from an external Solr Worked previously only with responses from YaCy embedded Solr, now able to render the response when YaCy is configured to use an external Solr index.	2018-07-31 16:22:21 +02:00
luccioman	87bd17b1cf	Simplified a little bit the RSS OpenSearch Solr writer	2018-07-31 16:02:50 +02:00
luccioman	dc49ca9c27	Fixed a NPE case on the Solr OpenSearch response writer Occurred when omitHeader parameter is set to true	2018-07-29 16:30:37 +02:00
luccioman	f4267ed247	Made Solr OpenSearch RSS writer compatible with external Solr index Worked previously only with responses from YaCy embedded Solr, now able to render the response when YaCy is configured to use an external Solr index.	2018-07-28 11:03:31 +02:00
luccioman	b1410f593a	Fixed stylesheet relative URLs rendering in Solr html writer Relative URLs to CSS stylesheets were not properly rendered when using the Solr html response writer and the "/solr/collection1/select" entry point instead of "/solr/select".	2018-07-25 08:03:25 +02:00
luccioman	89c59814da	Improved rendering of the Solr api relative url in the html writer In order to have a consistent relative url when using either /solr/select or /solr/collection1/select entry point.	2018-07-24 10:13:55 +02:00
luccioman	bf4f320b16	Optionally render the response header when using the Solr html writer With params rendered as html input fields for conveniently modifying params values and refreshing results.	2018-07-23 18:36:57 +02:00
luccioman	313204ae2c	Override qf and df Solr params with defaults only when they are not set	2018-07-23 13:50:24 +02:00
luccioman	bdafb14336	Removed redundant synchronization lock on network switch function Was useless as done in an already synchronized block, and the lock object was assigned a new value in that same block, and nowhere else a lock is requested on that same object.	2018-07-16 09:20:23 +02:00
luccioman	d5f44ea216	Removed unnecessary synchronization lock from serverSwitch constructor Lock was useless here as it was set on an object instance attribute while the object itself is not yet constructed and no other threads can access it.	2018-07-16 09:13:50 +02:00
luccioman	dcad393fe5	Fixed exceeding max size of failreason_s Solr field on large link list When using the 'From Link-List of URL' as a crawl start, with lists in the order of one or more thousands of links, the failreason_s Solr field maximum size (32kb) was exceeded by the string representation of the URL must-match filter when a crawl URL was rejected because not matching.	2018-07-11 08:13:29 +02:00
luccioman	f467601561	Properly lock solrInstances for reboot and restoration of embedded Solr Putting a synchronization lock directly on the solrInstances property was ineffective as it is assigned a new (unlocked) instance in these operations.	2018-07-08 08:57:59 +02:00
luccioman	9630f81306	Fixed small unnecessary lines of code	2018-07-08 08:15:26 +02:00
luccioman	876bcd2f54	Fixed useless comparison between int parameter and Long.MAX_VALUE	2018-07-08 08:11:01 +02:00
luccioman	c726154a59	Fixed removal of URLs from the delegatedURL remote crawl stack URLs were removed from the stack using their hash as a bytes array, whereas the hash is stored in the stack as String instance.	2018-07-05 09:36:36 +02:00
luccioman	2bdd71de60	Added server side columns sorting on the Process Scheduler table For easier usage of large tables in the Table_API_p.html page.	2018-07-04 10:28:32 +02:00
luccioman	bb51555830	Removed remaining unsafe accesses to SimpleDateFormat instances. SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Prefer now DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	2018-07-02 10:00:40 +02:00
luccioman	f895745e1c	Removed more unsafe concurrent accesses to SimpleDateFormat instances. SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Prefer now DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	2018-06-29 15:49:55 +02:00
luccioman	e97580dfc7	Fixed unsafe concurrent access to generic SimpleDateFormat instances SimpleDateFormat must not be used by concurrent threads without synchronization for parsing or formatting dates as it is not thread-safe (internally holds a calendar instance that is not synchronized). Prefer now DateTimeFormatter when possible as it is thread-safe without concurrent access performance bottleneck (does not internally use synchronization locks).	2018-06-28 14:59:23 +02:00
luccioman	8811700e2e	Upgraded Jetty dependency from 9.4.9 to 9.4.11	2018-06-20 09:33:26 +02:00
luccioman	d53c33e4ef	Fixed potential infinite loop case (does not occur in current code base)	2018-06-20 07:51:59 +02:00
luccioman	a15ac8e0ca	Made CrawlProfile loading tolerant to malformed json string attribute	2018-06-19 12:53:17 +02:00
luccioman	a715bb7876	Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml	2018-06-19 12:50:28 +02:00
luccioman	0b302c5004	Do not block whole server startup on persisted crawl profile load error	2018-06-19 12:48:17 +02:00
luccioman	4d9aa4ed1e	Fixed default crawl profile solr mustnotmatch query from previous commit	2018-06-19 11:58:47 +02:00
luccioman	cced94298a	Added a new crawler document filter type using Solr syntax This makes it possible to set up much more advanced document crawl filters, by filtering on one or more document indexed fields before inserting in the index.	2018-06-19 10:12:20 +02:00
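
The three SimpleDateFormat commits above (e97580dfc7, f895745e1c, bb51555830) rely on DateTimeFormatter being safe to share between threads. The following is a minimal sketch of that idea, not code taken from YaCy: the class name, the date pattern, the UTC zone, and the pool size are all assumptions chosen for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration class, not part of the YaCy code base.
public class SafeDateFormatting {

    // DateTimeFormatter is immutable and thread-safe, so a single shared
    // instance can serve every thread without any synchronization lock.
    // The pattern and the UTC zone are illustrative assumptions.
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    /** Formats a timestamp given in milliseconds since the epoch. */
    public static String format(final long epochMillis) {
        return FORMATTER.format(Instant.ofEpochMilli(epochMillis));
    }

    /** Formats the same instant from many concurrent tasks and checks every result. */
    public static boolean allThreadsAgree(final int tasks) {
        final ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            final Callable<String> task = () -> format(0L);
            final List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                results.add(pool.submit(task));
            }
            for (final Future<String> result : results) {
                if (!"19700101000000".equals(result.get())) {
                    return false;
                }
            }
            return true;
        } catch (final InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(final String[] args) {
        System.out.println(format(0L));           // prints 19700101000000
        System.out.println(allThreadsAgree(1000)); // prints true
    }
}
```

A shared static SimpleDateFormat used the same way would need an external lock or one instance per thread, which is the bottleneck the commit messages mention.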


4257 Commits