yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
luccioman	6b45cd5799	New optional crawl filter on the URL a doc must match to crawl its links For finer control over which parsed documents can trigger an addition of their links to the crawl stack, complementary to the existing crawl depth parameter.	2019-05-01 08:54:19 +02:00
luccioman	a5771b1f14	Made SNI extension user configurable without the need for server restart TLS Server Name Indication (SNI) extension activation can now be configured with the new Settings_p.html?page=httpClient administration page. SNI extension is also now enabled by default, as in 2019 the unrecognized_name(112) alert is more properly handled by major web servers TLS implementations, following the RFC 6066 standard. Related YaCy issues : #153 #189 and #272 JDK 1.7 bug : https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7127374 Apache httpd issue : https://bz.apache.org/bugzilla/show_bug.cgi?id=56241 RFC 6066 : https://tools.ietf.org/html/rfc6066#section-3	2019-04-14 15:41:13 +02:00
luccioman	e90405b6f0	Support parsing audio URLs without file extension Added also a Junit for the audio tag parser	2019-04-09 11:40:21 +02:00
luccioman	08ea0b0397	Added a configurable timeout to wkhtmltopdf calls for pdf snapshots Necessary to prevent blocking the indexing workflow when some wkhtmltopdf renderings fail without terminating	2018-12-11 22:31:31 +01:00
luccioman	fcf6b16db4	Added new crawler attribute for finer control over Media Type detection New "Media Type detection" section in the advanced crawl start page allow to choose between : - not loading URLs with unknown or unsupported file extension without checking the actual Media Type (relying Content-Type header for now). This was the old default behavior, faster, but not really accurate. - always cross check URL file extension against the actual Media Type. This lets properly parse URLs ending with an apparently odd file extension, but which have actually a supported Media Type such as text/html. Sample URLs with misleading file extensions added as documentation in the crawl start page. fixes issue #244	2018-10-25 10:42:12 +02:00
luccioman	54fbe166ba	Updated pdf cache clear steps consistently with current pdfbox version - Removed calls to no more existing clearResources functions (on PDFont class and its children) since upgrade to pdfbox 2.n.n - Removed hacky usage of protected internal ClassLoader function. This removes the warnings displayed when running with JDK9 or JDK10 : [java] WARNING: Illegal reflective access by net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to method java.lang.ClassLoader.findLoadedClass(java.lang.String) [java] WARNING: Please consider reporting this to the maintainers of net.yacy.document.parser.pdfParser$ResourceCleaner [java] WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations [java] WARNING: All illegal access operations will be denied in a future release Crawling thousands of pdf documents from various sources after modifications applied, revealed no new memory leak related to pdfbox (measurements done with JVisualVM).	2018-08-16 18:23:42 +02:00
luccioman	bdafb14336	Removed redundant synchronization lock on network switch function Was useless as done in an already synchronized block, and the lock object was assigned a new value in that same block, and nowhere else a lock is requested on that same object.	2018-07-16 09:20:23 +02:00
luccioman	dcad393fe5	Fixed exceeding max size of failreason_s Solr field on large link list When using the 'From Link-List of URL' as a crawl start, with lists in the order of one or more thousands of links, the failreason_s Solr field maximum size (32kb) was exceeded by the string representation of the URL must-match filter when a crawl URL was rejected because not matching.	2018-07-11 08:13:29 +02:00
luccioman	2bdd71de60	Added server side columns sorting on the Process Scheduler table For easier usage of large tables in the Table_API_p.html page.	2018-07-04 10:28:32 +02:00
luccioman	cced94298a	Added a new crawler document filter type using Solr syntax This makes possbile to set up much more advanced document crawl filters, by filtering on one or more document indexed fields before inserting in the index.	2018-06-19 10:12:20 +02:00
luccioman	b5dc1f376f	Made outgoing pools max total connections user configurable For a finer control over the maximum simultaneously active outgoing connections.	2018-06-06 09:36:50 +02:00
luccioman	ee6670fb8f	Use a common pooled http connection manager for remote solr instances For a better control on the maximum simultaneous outgoing http connections, as already done for any other http connections (crawls, rwi search, p2p protocol) using the net.yacy.cora.protocol.http.HTTPClient	2018-05-29 09:24:21 +02:00
luccioman	fa4399d5d2	Small perf improvement : initialize threads names early when possible Initializing Thread names using the Thread constructor parameter is faster as it already sets a thread name even if no customized one is given, while an additional call to the Thread.setName() function internally do synchronized access, eventually runs access check on the security manager and performs a native call. Profiling a running YaCy server revealed that the total processing time spent on Thread.setName() for a typical p2p search was in the range of seconds.	2018-05-23 14:45:35 +02:00
luccioman	a3ec7a7a5f	Added analysis optional setting to compute statistics on text snippets Thus producing some basic stats on processing times for snippets generation and counts on snippets per source type.	2018-04-15 09:55:08 +02:00
luccioman	d92b191942	Ensure no remote Solr is attached before "Shut Down and Re-Start Solr" Otherwise once this operation is applied, the remote Solr(s) instances are deconnected and the embedded Solr is connected even if disabled by setting "core.service.fulltext". Also use constants for related default setting values.	2018-04-06 20:34:54 +02:00
luccioman	69690c13a0	Optionally allow external Solr server with self-signed certificate This is necessary when you want to attach to a dedicated external Solr server protected with basic http authentication and requested over https but having only a self-signed certificate.	2018-04-04 18:16:26 +02:00
luccioman	ba9cd14516	Removed hard-coded patch for Solr 5.0 on ranking boost function The current default boost function (`recip(ms(NOW,last_modified),3.16e-11,1,1)`) for the Date ranking profile is indeed working fine. What can trigger the error `unexpected docvalues type NUMERIC for field 'last_modified'` is the previous default boost function (quite old now) or any custom one using the Solr `ord` or `rord` functions on the last_modified field. Then the problem was that the migration code in the Switchboard supposed to detect the old date boost function was incorrect (one trailing right parenthesis in excess), so the deprecated function remained. This fixes issue #169.	2018-03-26 16:24:27 +02:00
luccioman	fb3032c530	Added a crawl filtering possibility on documents Media Type (MIME)	2018-03-23 10:28:19 +01:00
luccioman	c3ff50c17a	Updated the list of audio file formats supported by the audioTagParser Follows upgrade to Jaudiotagger dependency to version 2.2.5.	2018-02-27 18:04:12 +01:00
luccioman	9412881230	Added basic support for autotagging microdata annotated item types. With the appropriate vocabulary settings in Vocabulary_p.html page, this can produce Vocabulary search facets displaying item types referenced in html documents by microdata annotation. Tested notably, but not limited to, vocabulary classes/types defined by Schema.org and Dublin Core.	2018-02-06 10:25:38 +01:00
luccioman	9ddf92d143	Removed unncessary reflection usage for workflow tasks. This improves code readability and maintainability (calls hierarchy are easier to read) and eventually performance.	2018-01-15 10:05:49 +01:00
luccioman	9624516bf8	Refresh recrawl job profile threshold date like other default profiles	2018-01-15 08:06:28 +01:00
luccioman	d47afe6fab	Use a constant for crawler reject reason prefix with specific processing	2018-01-13 10:45:00 +01:00
luccioman	09c4ee56a7	Added optional https support for remote crawl and profile operations	2017-12-21 18:41:32 +01:00
luccioman	1c4803e40a	Enable optional https support for /yacy/transferURL API calls. Also updated some Javadoc and consistently use Switchboard instance as a constructor parameter where relevant.	2017-12-19 12:30:49 +01:00
Michael Peter Christen	25573bd5ab	added a crawl filter based on <div> tag class names When a crawl is started, a new field to exclude content from scraping is available. The field can be identified with the class name of div tags. All text contained in such a div tag where the configured class name(s) match are not indexed, while the remaining page is indexed.	2017-12-09 22:29:35 +01:00
luccioman	46f37e38dc	Customized Threads with generic name for easier monitoring.	2017-10-31 08:53:17 +01:00
luccioman	8e732d437c	Enable HTTP Digest authentication for non admin users. Also ensure authentication is not lost by Digest timeout when navigating between index.html and search results page. This way, running searches with extended features on a remote peer or a password protected peer works with a regular user (with "Extended search" rights). When authenticating on the search page with a user without "Extended search" rights, it appears as authenticated, but has just its usual access to the public search features.	2017-10-26 07:51:18 +02:00
luccioman	af198b990b	Added an optional login link/status to the search public top nav bar. Thus allowing a more convenient way (wihout the need to go to the admin section) to login when searching on your remote or password protected peer and benefit from extended search features such as Heuristics, Bookmarking or JavasScript resorting. Can be disabled using the ConfigSearchPage_p.html.	2017-10-21 10:57:36 +02:00
luccioman	ef8aea7f8d	Made the dates navigator max elements number user configurable. Also used object properties on QueryParams instances, rather than using mutable class (static) properties.	2017-09-25 09:19:08 +02:00
luccioman	4eba88f2ff	Removed some unnecessary uses of java.lang.reflect api. This improves code browsing and readability, making search by references or call hierarchy IDE features more accurate.	2017-08-24 18:47:18 +02:00
luccioman	28b451a0b3	Made Cache compression level and lock timeout user configurable	2017-06-14 19:02:08 +02:00
luccioman	a7394b479b	Limit the synchronization blocking time on some Cache operations. Using a Reentrant lock instead of the intrinsic synchronization lock permits limiting the blocking time to acquire a lock. Useful on a very busy Cache concurrently accessed by many threads : when the time to acquire a lock is too high, getting/storing content on the cache becomes inefficient, and it is then better to fall back to loading remote resources. Illustrated by the CacheTest stress test and some traces reported in mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )	2017-06-14 09:13:50 +02:00
luccioman	8399275142	Properly close file output streams even on exceptions scenarios.	2017-06-08 07:19:16 +02:00
luccioman	a04feac064	Ensure file input streams proper closing in both success and failures Also add when possible a warning level log message on input stream closing error instead of failing silently. This could help understanding some IO exceptions such as "too many files open".	2017-06-03 04:00:46 +02:00
Michael Peter Christen	200b100fb8	added patch to rewrite altered yacy grid schema into yacy schema This generates the stub and protocol parts of an url for inboundlinks, outboundlinks and images	2017-05-01 11:38:02 +02:00
Michael Peter Christen	973d74712f	added yacy grid flatjson surrogate parser	2017-04-25 08:44:02 +02:00
luccioman	b1da92648e	Fixed surrogates import monitoring page (/CrawlResults.html?process=7) This page was always empty, as described in mantis 740 (http://mantis.tokeek.de/view.php?id=740)	2017-04-24 18:24:26 +02:00
Michael Peter Christen	f5ad29edb1	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	2017-04-07 09:15:15 +02:00
Michael Peter Christen	76e9135526	added flatjson parser (stub, unfinished)	2017-04-07 09:15:05 +02:00
reger	ba339a2a45	Add servlet to import warc file from filesystem IndexImportWarc_p.html. Apply Importer interface to WarcImporter	2017-04-02 03:32:21 +02:00
reger	510f11d374	Implement surrogate import from Warc archives (as first option handle warc = Web ARChive File Format. Warc files with extension .warc or compressed warc.gz can be placed in the DATA/surrogate/in and contained responses are imported to the index. The used library is stream based so we can easily extend it later to use and load warc's from the net.	2017-03-31 00:58:11 +02:00
reger	3dd23c178b	Introduce the option to configure a shutdown port. A port value of -1 will disable this option. If set to a value greater 0, YaCy listens on this of on the local loopback address (127.0.0.1) for a shutdown or restart signal. E.g. connect to http://localhost:8005/shutdown will stop the YaCy server. http://localhost:8005/restart will restart it. This option allows to stop YaCy locally independant from the web web frontend (which might be configured for password protected remote access).	2017-03-19 02:30:08 +01:00
reger	a2afb4bae0	add switchboardconstants for server ports config keys	2017-03-18 20:02:26 +01:00
luccioman	c68a8be2d9	Refactored and enforced Solr mandatory fields for proper operation - Added a new method to check activation of mandatory fields on Collection Configuration commit, consistently with checks previously performed in Switchboard startup and with mandatory fields in the default schema. - Reorganized default schema and CollectionConfiguration enumeration : moved no more mandatory fields in a specific section, and moved fields enabled at startup to the mandatory section. - Marked mandatory fields as required and with stronger font in the IndexSchema_p.html page	2017-02-20 10:48:07 +01:00
luccioman	0da1e6ba16	Factored code re-implementing DigestURL.hosthash() method. This ensure consistent implementation of the url host hash generation and easier usage finding in source code. Also added a unit test for this function.	2017-01-16 10:18:42 +01:00
luccioman	6a4d51d8f9	Cleaned up some Javadoc warnings.	2017-01-09 16:44:47 +01:00
reger	68d4dc5cc5	Complete harmonization RequestHeader getCookie with std ServletRequest to use javax.servlet.http.Cookie parameters. Depreciate now obsolete getHeaderCookies. Adjust setting of MaxAge to spec if >= 0 otherwise keep default.	2017-01-02 03:04:21 +01:00
luccioman	1df558a6c6	Fixed YaCy proper shutdown triggered by SIGTERM signal. The main shutdown hook thread was not properly waiting for the main thread termination which consequently could not properly close resources and threads. After terminating a running YaCy peer this way (Ctrl+C in console, or kill <pid> for example), you could see the still existing DATA/yacy.running file. Tested with : - Debian Jessie openjdk 7 and 8 : regular shutdown, Ctrl+C, kill command, system restart while yacy is running - Windows 10 Oracle JDK 7 and 8 : non regression on regular shutdown	2016-12-28 09:47:27 +01:00
luccioman	3ca695390c	FTP crawl start URLs : applied crawl profile depth control Applied rules : - when the FTP URL denotes a file resource, stack it as any start URL : eventually embedded links can be followed applying the usual depth rules - when the FTP URL denotes a directory, list files under this directory and stack them for crawl, and repeat the process on sub folders until crawl depth is reached	2016-12-22 16:25:09 +01:00

1 2 3 4 5 ...

504 Commits