yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
luccioman	a1a0515312	Added a button to manually refresh sorting of p2p search results. As a server-side oriented alternative to the JavaScript realtime resorting feature proposed in PR #104. The goal is the same as in this PR : having the possibility to compensate for the network latency of fetching results from various peers and to obtain, once possible, a consistently ranked result set.	2017-08-28 19:03:51 +02:00
luccioman	4eba88f2ff	Removed some unnecessary uses of java.lang.reflect api. This improves code browsing and readability, making search by references or call hierarchy IDE features more accurate.	2017-08-24 18:47:18 +02:00
luccioman	da3dbf9ea1	Use Javadoc style comments on SearchEvent properties. For better code readability and understanding.	2017-08-23 08:20:37 +02:00
luccioman	c6ae87168a	Added unit tests on the gzip parser.	2017-08-22 14:13:00 +02:00
luccioman	169ffdd1c7	Finer control on max links to parse in the html parser.	2017-08-22 14:11:35 +02:00
luccioman	e41d046a9d	Improved parsing support for OOXML spreadsheets (.xlsx). As reported by edycop in mantis 765 ( http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was quite incomplete. Now properly supports the "Shared String Table" entry in Office Open XML spreadsheets, and also detects embedded URLs. Integrating the Apache poi-ooxml library could be an option for finer OOXML formats support, but their SAX style parsing example ( http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to show that a custom SAX handler is still efficient for lightweight and low memory footprint processing.	2017-08-21 09:38:20 +02:00
reger	51a4e03c93	Allow to stop currently running warc import (stop button)	2017-08-20 22:17:27 +02:00
luccioman	6cec2cdcb5	Use unredirected robots.txt URL when adding an entry to the table.	2017-08-16 14:21:07 +02:00
luccioman	3f0446f14b	Ensure proper synchronous robots entry retrieval on first check. Previously, when checking for the first time the robots.txt policy on an unknown host (not cached in the robots table), the result was always empty in the /getpageinfo_p.xml api and in the /CrawlCheck_p.html page. Subsequent calls, however, returned the correct information.	2017-08-16 09:30:33 +02:00
luccioman	b23a563065	Prevent search result failure on incomplete images information. Complements the recent modification related to images in commit `7f395ef`. Unfortunately, the metadata of many documents fetched from the freeworld p2p network has only partial information about embedded images. Without proper error handling, this made many searches in p2p mode fail completely.	2017-08-15 10:11:05 +02:00
Michael Peter Christen	30d71c6359	added usage of X-Real-IP http header to identify request IPs which came through NGINX reverse proxy configurations	2017-08-15 07:16:01 +02:00
Michael Peter Christen	f45378c11c	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	2017-08-14 20:12:26 +02:00
Michael Peter Christen	7f395ef937	added image link in search results This should be a help to make a preview of search results. The image is computed from the list of embedded images, it is always the first image in that list. In rss-type results the image is presented like <media:content medium="image" url="https://abc.xyz/logo.png"/> as defined in http://www.rssboard.org/media-rss#media-content	2017-08-14 20:12:09 +02:00
luccioman	780173008e	Implemented partial stream parsing of tar archives. Also added JUnit tests for the tar parser and fixed unwanted use of the tar parser as a fallback on files included in a tar archive.	2017-08-14 14:57:58 +02:00
luccioman	acab6a6def	Also handle text content when parsing XML within limits.	2017-08-14 14:47:01 +02:00
reger	2a07799ad1	Correction of `d03e2c98ea` Fix Conjunction.addOperator to do nothing if term is empty, to prevent a query string with a repeated logical operator like "field:term AND AND field:term", possibly causing out of memory in postprocessing_doublecontent	2017-08-14 01:03:15 +02:00
reger	d03e2c98ea	Fix Conjunction.addOperator to do nothing if term is empty, to prevent a query string with a repeated logical operator like "field:term AND AND field:term", possibly causing out of memory in postprocessing_doublecontent	2017-08-14 00:52:03 +02:00
reger	b6a41df4f7	Remove deprecated YaCyProxyServlet, which was replaced by UrlProxyServlet	2017-08-12 21:53:04 +02:00
luccioman	8a94fef9e0	Prevent unwanted cached bytes duplication on stream parsing.	2017-08-12 09:43:49 +02:00
reger	4979439e87	Skip public post of jre version. Added to determine switch to java8 `596b5dfa59`	2017-08-06 23:41:53 +02:00
reger	e918ec199e	Replace deprecated ConcurrentHashSet with recommended Java8 ConcurrentHashMap.newKeySet() in postprocessDocuments()	2017-08-06 23:26:27 +02:00
reger	fb71994342	Harmonizing use of xml reader / sax parser in XMLBlacklistImporter eliminating the need for lib/xercesImpl.jar	2017-08-05 23:47:27 +02:00
reger	275d65fffe	Patch last_modified date with internal FirstSeenTime() if no date provided to make sure updated documents are indexed with their last-modified date as provided in current crawl. (Always patching moddate with firstseen might bear the risk of missing actual updates.)	2017-08-05 22:30:06 +02:00
reger	d1b23afed6	Remove obsolete Protocol parameter ttl (time to live) not interpreted in target yacy/query.html also Protocol.querySeed() not used and parameter not interpreted in target servlet yacy/query.html	2017-08-01 00:59:53 +02:00
reger	15d78b1064	Replace deprecated getIP with getIPs in Protocol transferURL() and getProfile(). Remember used ip for error handling and departInterface	2017-07-31 01:55:01 +02:00
reger	ed36b47bec	Replace one more deprecated peerDeparture in Protocol.transferIndex() by moving/using interfaceDeparture() in transferRWI()	2017-07-30 23:02:15 +02:00
luccioman	0ee8c030c4	Log an error when Solr folder migration fails for some reason.	2017-07-17 15:35:10 +02:00
luccioman	5a646540cc	Support parsing gzip files from servers with redundant headers. Some web servers provide both 'Content-Encoding : "gzip"' and 'Content-Type : "application/x-gzip"' HTTP headers on their ".gz" files. It was annoying to fail on such resources, which are not so uncommon, although non conforming (see RFC 7231 section 3.1.2.2 for "Content-Encoding" header specification https://tools.ietf.org/html/rfc7231#section-3.1.2.2)	2017-07-16 14:46:46 +02:00
luccioman	11a7f923d4	Distinguish response parsing failures from unexpected exceptions.	2017-07-16 14:39:53 +02:00
luccioman	eda7b0aeb6	Merge branch 'master' of https://github.com/yacy/yacy_search_server	2017-07-15 08:49:25 +02:00
reger	3005be7349	Clean up unmaintained and unused AugmentParser trail.	2017-07-15 00:19:23 +02:00
luccioman	cb4f1358e1	Added gzip parser support for max content bytes limit	2017-07-13 08:18:40 +02:00
luccioman	5216c681a9	Added HTML parser support for maximum content bytes parsing limit	2017-07-13 08:12:10 +02:00
luccioman	4aafebc014	Merge pull request #122 from Scarfmonster/patch-1 I also reproduced the issue, and the fix is working fine. Thanks @Scarfmonster	2017-07-12 16:03:23 +02:00
luccioman	651fad6da5	Added RSS parser support for maximum content bytes parsing limit	2017-07-12 00:18:12 +02:00
luccioman	452a17a8d5	Finer control on bounded input streams with custom stream implementation	2017-07-12 00:13:24 +02:00
luccioman	f8f1959ebb	Added parsing within bounds implementation to the generic parser.	2017-07-11 09:07:48 +02:00
luccioman	e0f400a0bd	Support trying multiple parsers even when streaming on large resources.	2017-07-11 09:06:37 +02:00
luccioman	1e84956721	Support loading local files with a per request specified maximum size. Consistently with the HTTP loader implementation.	2017-07-11 09:04:23 +02:00
luccioman	f369679d1c	Fixed read/copy on input streams reading sometimes less than expected.	2017-07-11 09:00:27 +02:00
luccioman	bf55f1d6e5	Started support of partial parsing on large streamed resources. Thus enable getpageinfo_p API to return something in a reasonable amount of time on resources over MegaBytes size range. Support added first with the generic XML parser, for other formats regular crawler limits apply as usual.	2017-07-08 09:04:03 +02:00
luccioman	90a7c1affa	HTML parser : removed unnecessary remaining recursive processing. Recursive processing was removed in commit `67beef657f`, but one remained for anchors content (likely omitted from refactoring). It is no longer necessary : other links such as images embedded in anchors are currently correctly detected by the parser. More annoying : that remaining recursive processing could lead to almost endless processing when encountering some (invalid) HTML structures involving nested anchors, as detected and reported by lucipher on YaCy forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).	2017-07-03 10:00:53 +02:00
reger	e6e20dab52	upd to Jetty 9.4.6.v20170531 Modify loginservice to the changes in Jetty, partially based on pull request #101 https://github.com/yacy/yacy_search_server/pull/101 by @automenta	2017-07-01 23:58:28 +02:00
luccioman	dcc56318bb	Made remote search max system load limits configurable from UI. As reported by davide on YaCy forums ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6004 ) when the system is on high load, unless reading carefully YaCy configuration file, it could be difficult to understand why remote search results are not fetched.	2017-06-30 11:30:54 +02:00
reger	ddd13b776d	Add keyword constraint to rwi query result filter To discard rwi results not matching query keyword: parameter	2017-06-30 02:11:18 +02:00
luccioman	e82eaee4b6	Apply consistent behavior on HTTP resource size exceeding limit. On content size known from HTTP headers, terminates the connection faster and improves error report quality by reporting the relevant message "Content to download exceed maximum value..." rather than the previous "no response (NULL) for url...".	2017-06-30 01:13:47 +02:00
luccioman	0b75e92ac2	Do not wrap unnecessarily loader IOExceptions in IOExceptions	2017-06-30 01:06:17 +02:00
luccioman	433bdb7c0d	Respect maxFileSize limit also when streaming HTTP and when relevant. Constraint applied consistently with HTTP content full load in byte array.	2017-06-30 00:30:54 +02:00
luccioman	9b1bb2545e	Refactored plain-text URLs detection implementation. For faster processing (measured about 2 times faster on many real-world examples) and more advanced detection (previous algorithm detected only URLs separated from the rest of the text by a space character).	2017-06-27 19:30:40 +02:00
luccioman	8da3174867	Ensure lower case conversion consistency with any default locale. Especially for Turkish speaking users using "tr" as their system default locale : strings for technical stuff (URLs, tag names, constants...) must not be lower cased with the default locale, as 'I' does not become 'i' like in other locales such as "en", but becomes 'ı'.	2017-06-27 06:42:33 +02:00
luccioman	286f3018bd	Made mime type and extension normalization locale independent. Previously, upper cased mime type was incorrectly normalized when the default locale is Turkish.	2017-06-26 17:33:56 +02:00
luccioman	319231a458	Added a generic XML parser, able to parse elements text and URLs. This parser adds support for any XML based format other than already supported XML vocabularies such as XHTML, RSS/Atom feeds... It will eventually be used as a fallback if one of these specific parsers fails, before falling back to the existing genericParser, which extracts little useful information except URL tokens.	2017-06-26 16:30:21 +02:00
Ryszard Goń	3cedbbd4ed	Wrong password was removed after the SSL certificate import Removing the keystore password will prevent ssl from working after the next restart. The certificate password should be removed instead. Fixes http://mantis.tokeek.de/view.php?id=687	2017-06-23 02:23:49 +02:00
luccioman	64cec2790d	Improved character encoding detection from Content-Type header Also updated some related JavaDocs	2017-06-22 10:50:34 +02:00
luccioman	0487336ec3	Prevent integer overflow in table statistics and use strong typing	2017-06-19 17:02:11 +02:00
luccioman	d2a4a27f52	Improved stream-oriented parsing entering conditions.	2017-06-17 09:26:37 +02:00
luccioman	9dd790087d	Added HT Cache basic statistics (hit rate)	2017-06-15 09:50:02 +02:00
luccioman	5fdd5d16b1	Use volatile to ensure concurrent threads use up to date property value	2017-06-15 09:48:22 +02:00
luccioman	28b451a0b3	Made Cache compression level and lock timeout user configurable	2017-06-14 19:02:08 +02:00
luccioman	a7394b479b	Limit the synchronization blocking time on some Cache operations. Using a Reentrant lock instead of the intrinsic synchronization lock permits limiting the blocking time to acquire a lock. Useful on a very busy Cache concurrently accessed by many threads : when the time to acquire a lock is too high, getting/storing content on the cache becomes inefficient, and it is then better to fall back to loading remote resources. Illustrated by the CacheTest stress test and some traces reported in mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )	2017-06-14 09:13:50 +02:00
Michael Peter Christen	c94a8c76bd	re-added solr synchronization hack	2017-06-09 12:50:36 +02:00
Michael Peter Christen	6fe735945d	migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8 Also: now Version 1.921	2017-06-09 12:25:23 +02:00
luccioman	ce89492319	Ensure system resource release by closing document stream.	2017-06-08 07:36:11 +02:00
luccioman	8399275142	Properly close file output streams even on exceptions scenarios.	2017-06-08 07:19:16 +02:00
luccioman	4e4dc6c4e5	Removed unnecessary finalize implementation. On such private classes with limited scope but with frequent instance creations and removals within the application lifecycle, implementing the finalize method is particularly unwanted as it decreases the garbage collector performance. What's more the Object.finalize() method is now deprecated in the JDK 9 and will eventually disappear from future releases (see https://bugs.openjdk.java.net/browse/JDK-8177970)	2017-06-06 10:30:02 +02:00
luccioman	a04feac064	Ensure file input streams proper closing in both success and failures Also add when possible a warning level log message on input stream closing error instead of failing silently. This could help understanding some IO exceptions such as "too many files open".	2017-06-03 04:00:46 +02:00
luccioman	d98c04853d	Ensure proper closing of file input streams.	2017-06-02 12:14:29 +02:00
luccioman	c53c58fa85	Ensure closing ChunkIterator stream in every possible use case. Also trace in logs the eventual close failures instead of failing silently. This should help prevent holding too many unreleased system file handles, as in the case reported by eros on YaCy forum (http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988&sid=b00e7486c1bf7e48a0d63eb328ccca02 )	2017-06-02 09:47:45 +02:00
luccioman	29e52bda39	Merge branch 'master' of https://github.com/yacy/yacy_search_server	2017-06-02 01:47:53 +02:00
luccioman	a9cb083fa1	Improved consistency between loader openInputStream and load functions	2017-06-02 01:46:06 +02:00
reger	a814f3d885	Introduce keyword query parameter This enables keyword navigator to filter on keywords. Added search page output and layout config for keywords, allowing e.g. in Intranet use to display the keywords. No styling or links applied to the keyword text (but is desirable possibly in combination with bootstrap-tagsinput for future/intranet).	2017-06-02 01:00:21 +02:00
luccioman	c226ded799	Fix unescape of URLs having some '%' chars but not percent-encoded	2017-05-30 12:32:14 +02:00
luccioman	306a82dd71	Fixed scraper NullPointerException cases on malformed URLs.	2017-05-30 08:48:20 +02:00
luccioman	aa55d71cf5	Fixed a NullPointerException case on Digest authentication. Could occur when upgrading from a Debian package configured with Basic authentication (as in release 1.92.9000) to a more recent one with Digest authentication, without having re-encoded the admin password (for example with dpkg-reconfigure). As reported by eros on YaCy forum (http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988#p33686).	2017-05-29 19:16:09 +02:00
luccioman	02ec0ed13c	Quoted param value in Solr query to avoid unwanted traces in logs. When Webgraph Solr core is enabled, crawling and removing from index a URL whose hash starts with the '-' character (example URL : https://cs.wikipedia.org/ whose hash is "-2-HuTEndn4x") produced a full ParseException stack trace in YaCy logs. This was not blocking because the Solr query parser is able to escape the query itself and run it successfully, but it filled YaCy logs uselessly.	2017-05-24 08:43:03 +02:00
reger	1737af37cf	Set request originator to own peer in warc importer in addition to change in `039162fbf0`	2017-05-22 01:56:11 +02:00
reger	039162fbf0	Change warc importer to use defaultsurrogate-crawl profile. As reported by LA_FORGE http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5990 and analysed by @luccioman (see comment `510f11d374`), using another crawlprofile without setting originator creates a conflict.	2017-05-22 01:34:08 +02:00
Michael Peter Christen	3b1d640a3c	enhanced debugging	2017-05-18 00:28:12 +02:00
Michael Peter Christen	7de7879f13	added a cache to prevent too many seed enumerations	2017-05-18 00:28:00 +02:00
luccioman	bd7411a53a	Enable p2p and cluster communication when "Protection of all pages" on. As reported by paul89 on YaCy forum (http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5958 ), when setting the "Protection of all pages" to "On" in the "ConfigAccounts_p.html" page, the peer became completely unreachable by others, which is not the purpose of this feature. But the restriction still makes sense as a security enforcement and is maintained in private "Robinson mode" where by the way any peer-to-peer or cluster communication would be rejected.	2017-05-17 09:00:29 +02:00
luccioman	31ad043bb9	Added user interface feedback on results feeding termination status. Added as an additional icon with title in the search progress bar, to inform about background search feeder threads terminated or still running. While giving a bit more information to users about the p2p search process, this can help choosing whether or not to wait a little bit more time before going to the next page, in order to get results from various sources sorted as best as possible (see #91 for a discussion about sorting accuracy and network latency). Other related modifications included : - regular updates to statistics in the progress bar until the background feeders are completely terminated. - removed some uses of insecure and discouraged JavaScript elements	2017-05-15 13:15:16 +02:00
sgaebel	ff6392215e	added closing of lst-Tag in solr-Export	2017-05-13 20:38:25 +02:00
luccioman	d90b001e1b	Improved previous merge "Show ranking in HTML UI". - added the new setting as configurable in the "Debug/Analysis" settings page. Debug/analysis is its main purpose for now as there is currently no nice and "understandable" ranking score info servlet (see forum discussion http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5884 ) - render in the "Search Page Layout" page preview when enabled - added constants	2017-05-11 18:02:33 +02:00
luccioman	0f0f42b509	Added some JavaDoc	2017-05-11 08:33:19 +02:00
reger	077d062be3	Adjust mergeDocuments to keep youngest last-modified date of document collection	2017-05-09 22:52:54 +02:00
luccioman	654801523e	Fixed StringIndexOutOfBoundsException case. Revealed by commit `c77e43a` : the exception was then thrown when indexing pages containing mailto: scheme URL links with the Solr Webgraph core enabled. Fixed the error case and restored filtering on mailto links in Document.resortLinks() as these URLs still should not appear in Document.hyperlinks.	2017-05-09 18:32:47 +02:00
luccioman	522a268305	Improved new blacklist entries URL scheme detection.	2017-05-04 16:36:45 +02:00
luccioman	532981b363	Updated putHTML() JavaDoc	2017-05-04 11:21:27 +02:00
luccioman	58d23047dd	Handle '?' and '+' chars as valid wild cards when adding to blacklist. An entry such as "domain.com/[a-z]+" is a valid regular expression and does not need additional "../.*" wildcards.	2017-05-04 11:19:59 +02:00
luccioman	a87281b498	Added MediaWiki dump import scheduling feature. Checking the last modified date by default to prevent unnecessary long running operations.	2017-05-03 18:53:01 +02:00
luccioman	edd7ccac40	Added some JavaDoc	2017-05-02 09:33:11 +02:00
luccioman	79fdf14b0a	Fixed regression introduced by commit `9ad4d16`. On MediaWiki dump imports, the SurrogateReader was trying to unread too many bytes, then failing with the following exception : "java.io.IOException: Push back buffer is full".	2017-05-02 09:32:04 +02:00
Michael Peter Christen	7678fd67e3	copied fix from yacy_grid_parser for wrong array type	2017-05-01 11:44:26 +02:00
Michael Peter Christen	200b100fb8	added patch to rewrite altered yacy grid schema into yacy schema This generates the stub and protocol parts of an url for inboundlinks, outboundlinks and images	2017-05-01 11:38:02 +02:00
reger	9ad4d16829	Add a responsHeader to the solr index export with a format identifier and export parameter (in accordance with response xml format) for easier format detection on import.	2017-04-30 23:53:52 +02:00
luccioman	9697209ef6	Fixed Index Export feature for compatibility with old indexed documents. This is a fix for mantis 682 (http://mantis.tokeek.de/view.php?id=682) and issue #116	2017-04-28 11:39:51 +02:00
luccioman	88c062639b	Added some JavaDoc	2017-04-28 11:36:48 +02:00
luccioman	31fff2c986	Extended WikiCode template inclusion syntax support. Wiki templates are not rendered but syntax support is improved, which greatly enhance snippets rendering on search results coming from a MediaWiki dump import. Tested on various dumps from Wikimedia at https://dumps.wikimedia.org/backup-index.html See also Wikipedia transclusion documentation at https://en.wikipedia.org/wiki/Wikipedia:Transclusion	2017-04-27 09:50:04 +02:00
Michael Peter Christen	973d74712f	added yacy grid flatjson surrogate parser	2017-04-25 08:44:02 +02:00
luccioman	b1da92648e	Fixed surrogates import monitoring page (/CrawlResults.html?process=7) This page was always empty, as described in mantis 740 (http://mantis.tokeek.de/view.php?id=740)	2017-04-24 18:24:26 +02:00
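
Commits 8da3174867 and 286f3018bd above describe lower-casing technical strings independently of the default locale. The following is a minimal sketch of that pitfall, not the code from those commits: the class name `LocaleSafeLowerCase` and the method `lowerTechnical` are hypothetical, and the use of `Locale.ROOT` is an assumption about one common way to get locale-independent conversion.

```java
import java.util.Locale;

public class LocaleSafeLowerCase {

    /**
     * Lower-cases a technical string (URL part, tag name, mime type...)
     * with locale-neutral rules, so the result does not depend on the
     * JVM default locale. Hypothetical helper, for illustration only.
     */
    static String lowerTechnical(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String tag = "TITLE";

        // Turkish casing rules map 'I' to the dotless 'ı' (U+0131),
        // so the tag name no longer matches the expected "title".
        System.out.println(tag.toLowerCase(turkish));

        // Locale-neutral conversion keeps the ASCII mapping 'I' -> 'i'.
        System.out.println(lowerTechnical(tag));
    }
}
```

Calling `String.toLowerCase()` without an argument uses the JVM default locale, which is why the behavior described in those commits only shows up on systems configured with a locale such as "tr".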


4087 Commits