Commit Graph

34 Commits

Author SHA1 Message Date
Michael Peter Christen
9fcd8f1bda added canonical filter
attention: this is on by default!
(it should do the right thing)
2023-01-16 14:50:30 +01:00
Michael Peter Christen
5a52b01c09 front-end integration of tag valency 2023-01-15 20:13:45 +01:00
Michael Peter Christen
a2a40a3096 new link to crawlstart api documentation 2022-09-29 00:25:51 +02:00
Michael Peter Christen
3959d43a5c fixed documentation link 2021-08-03 16:57:24 +02:00
Michael Peter Christen
d0abb0cedb enabling all crawl profiles in all network modes
also: increased default internet crawl speed to
4 urls/s/host
2020-12-19 01:00:51 +01:00
Michael Peter Christen
f03e16d3df enhanced crawl start url check experience
urls are now urlencoded, and a check is also performed
when a url is copied into the url field via
copy-paste (a sketch of such a check follows this entry)
2020-01-16 20:59:02 +01:00
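A minimal sketch of such a check and encoding (a hypothetical helper,
not the actual YaCy servlet code): the pasted string is split into its
components and rebuilt through the multi-argument java.net.URI
constructor, which percent-encodes illegal characters.

    import java.net.MalformedURLException;
    import java.net.URI;
    import java.net.URISyntaxException;
    import java.net.URL;

    public class CrawlStartUrlCheck {
        // Re-encode a URL pasted into the crawl start field; throws if
        // the string is not a syntactically valid absolute URL.
        public static String encode(String pasted)
                throws MalformedURLException, URISyntaxException {
            URL url = new URL(pasted.trim()); // splits the components
            URI encoded = new URI(url.getProtocol(), url.getUserInfo(),
                    url.getHost(), url.getPort(), url.getPath(),
                    url.getQuery(), url.getRef());
            return encoded.toASCIIString(); // percent-encoded form
        }
    }

For example, "http://example.com/a b" becomes "http://example.com/a%20b".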
luccioman
6b45cd5799 New optional crawl filter on the URL a doc must match to crawl its links
This gives finer control over which parsed documents may add their
links to the crawl stack, complementing the existing crawl depth
parameter (example pattern below).
2019-05-01 08:54:19 +02:00
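For illustration (a hypothetical pattern, not part of the commit):
with a filter such as

    https?://example\.com/blog/.*

only documents whose own URL matches the expression contribute their
links to the crawl stack; all other documents are still indexed but
act as leaves of the crawl tree.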
luccioman
fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
New "Media Type detection" section in the advanced crawl start page
allow to choose between :
- not loading URLs with unknown or unsupported file extension without
checking the actual Media Type (relying Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross check URL file extension against the actual Media Type.
This lets properly parse URLs ending with an apparently odd file
extension, but which have actually a supported Media Type such as
text/html.

Sample URLs with misleading file extensions added as documentation in
the crawl start page.

fixes issue #244
2018-10-25 10:42:12 +02:00
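A minimal sketch of the cross-check idea (a hypothetical helper, not
the actual crawler code): issue a HEAD request and compare the Media
Type reported by the server with the one guessed from the file
extension.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class MediaTypeCheck {
        // Returns the Media Type reported by the server, e.g. "text/html",
        // stripped of any "; charset=..." parameter.
        public static String actualMediaType(URL url) throws IOException {
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("HEAD");
            try {
                String contentType = con.getContentType();
                return contentType == null ? null : contentType.split(";")[0].trim();
            } finally {
                con.disconnect();
            }
        }
    }

A URL such as http://example.com/page.apk would be skipped by the old
behavior, but parsed by the new one if actualMediaType returns
text/html.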
luccioman
92e10d7d1c Added a crawl start hint message about whether wkhtmltopdf is available,
as this tool is required to produce pdf snapshots
2018-10-16 08:02:43 +02:00
luccioman
534f09e92b Added and updated hint messages about remote crawler status
To help identify why remote crawl results may not be received.
2018-07-06 11:30:30 +02:00
luccioman
cced94298a Added a new crawler document filter type using Solr syntax
This makes it possible to set up much more advanced document crawl
filters by filtering on one or more indexed document fields before
insertion into the index (example filter below).
2018-06-19 10:12:20 +02:00
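For illustration (hypothetical field names; check the fields actually
present in your Solr index schema), such a filter could keep only
English documents that carry a title:

    language_s:en AND title:[* TO *]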
luccioman
fb3032c530 Added a crawl filtering option on the document Media Type (MIME) 2018-03-23 10:28:19 +01:00
Michael Peter Christen
187075b878 added nav filter 2018-03-10 15:46:53 +01:00
luccioman
7c644090ff Fixed CrawlStartExpert.html HTML validation errors
Validated with Nu Html Checker 17.11.1
2018-02-16 11:35:15 +01:00
luccioman
519fc9a600 Issue #156: new option to clean up (or not) the search cache on crawl start
Also prevents unnecessary search event cache clean-up on each access
to the crawl monitor page (Crawler_p.html).
2018-02-16 10:19:41 +01:00
luccioman
eb20589e29 Fixed issue #158: completed the div CSS class ignore feature in crawls 2018-02-10 11:56:28 +01:00
luccioman
79a2ba306a Updated links to Java Regular Expressions documentation to version 8 2017-12-19 11:14:20 +01:00
Michael Peter Christen
25573bd5ab added a crawl filter based on <div> tag class names
When a crawl is started, a new field to exclude content from scraping
is available. Content is identified by the class names of div tags:
all text contained in a div tag whose class name matches one of the
configured names is not indexed, while the rest of the page is indexed
(example below).
2017-12-09 22:29:35 +01:00
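For illustration (hypothetical markup): with the class name
"navigation" configured, a page containing

    <div class="navigation">this text is not indexed</div>
    <p>this text is indexed</p>

is indexed without the contents of the navigation div.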
luccioman
0f80c978d6 Limit the number of initially previewed links in crawl start pages.
This prevents rendering a big and inconvenient scrollbar on resources
containing many links.
If really needed, a preview of all links is still available with a
"Show all links" button.

This does not affect the number of links used once the crawl is
actually started, as the list is then loaded again server-side.
2017-06-17 09:33:14 +02:00
luccioman
62f75417ef Updated Pattern JavaDoc links to current minimum (1.7) JDK version. 2016-11-14 00:18:40 +01:00
luccioman
812abfc868 Converted one more set of URLs to purely relative ones.
This makes YaCy peer configuration behind a reverse proxy subfolder
easier: there is no need for the reverse proxy to rewrite HTML links
or URLs in css files (a sketch of such a setup follows this entry).

Tested on Debian Jessie with an apache2 reverse proxy.

See the related mantis issues http://mantis.tokeek.de/view.php?id=106
and http://mantis.tokeek.de/view.php?id=701
2016-11-12 15:54:35 +01:00
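A minimal sketch of such a setup (assumptions: Apache with mod_proxy,
YaCy on localhost:8090, served under the /yacy/ subfolder; not part of
the commit):

    ProxyPass /yacy/ http://localhost:8090/
    ProxyPassReverse /yacy/ http://localhost:8090/

Because the pages use purely relative URLs, no rewriting of HTML or
css response bodies is needed.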
Michael Peter Christen
97930a6aad added must-not-match filter to snapshot generation.
also: fixed some bugs
2015-05-08 13:46:27 +02:00
Michael Peter Christen
fed26f33a8 enhanced timezone management for indexed data:
To support the new time parser and search functions in YaCy, a
high-precision detection of date and time of day is necessary. That
requires detecting both the time zone of the document content and the
time zone of the user doing a search. The time zone of the search
request is determined automatically from the browser's time zone
offset, which is delivered with the search request automatically and
invisibly to the user. The time zone for the content of web pages
cannot be detected automatically and must be an attribute of crawl
starts; the advanced crawl start now provides an input field to set
the time zone as an offset in minutes. All parsers must now be passed
a time zone offset, which required a change to the parser java api (a
sketch of the normalization follows this entry). Many other changes
correct the previously wrong handling of dates in YaCy, which added a
correction based on the time zone of the server. Now no correction is
added and all dates in YaCy are in the UTC/GMT time zone, a normalized
time zone for all peers.
2015-04-15 13:17:23 +02:00
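A minimal sketch of the described normalization (a hypothetical
helper, not the actual parser api; assuming the offset is given in
minutes east of UTC, so UTC = local time minus offset):

    import java.util.Date;

    public class TimezoneNormalizer {
        // Normalize a date parsed from document content to UTC, given
        // the crawl-start time zone offset in minutes east of UTC.
        public static Date toUTC(Date parsedLocal, int offsetMinutes) {
            return new Date(parsedLocal.getTime() - offsetMinutes * 60000L);
        }
    }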
Michael Peter Christen
1309619a71 remove remote indexing option in crawl start if not in p2p mode 2015-02-04 11:37:07 +01:00
Michael Peter Christen
b5ac29c9a5 added an html field scraper which reads text from html entities of a
given css class and extends a given vocabulary with a term consisting
of the text content of the html class tag. Additionally, the term is
included in the semantic facet of the document. This allows the
creation of faceted search over documents without pre-created
vocabularies; instead, the vocabulary is created on the fly, possibly
for use in other crawls. If any of the term scraping for a specific
vocabulary is successful on a document, this vocabulary is excluded
from auto-annotation on the page.

To use this feature, do the following (an example follows this entry):
- create a vocabulary on /Vocabulary_p.html (if not existing)
- in /CrawlStartExpert.html you will now see the vocabularies as
columns in a table. The second column provides text fields where you
can name the class of html entities from which the literals of the
corresponding vocabulary shall be scraped
- when doing a search, you will see the content of the scraped fields
in a navigation facet for the given vocabulary
2015-01-30 13:20:56 +01:00
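For illustration (hypothetical vocabulary and markup): with a
vocabulary "Location" and the class name "city" entered in its text
field, a page containing

    <span class="city">Karlsruhe</span>

extends the Location vocabulary with the term "Karlsruhe" and shows it
in the corresponding search facet.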
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the
solr index and stored as individual xml files in the snapshot
directory alongside the pdf and jpg images
- a transaction layer was placed above the snapshot directory to
distinguish snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments between peers using archived
solr search results. This is currently unfinished; we need a protocol
to move snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and now contains two
snapshot subdirectories: inventory and archive (layout sketch below)
- snapshots may now be generated by everyone, not only by peers
running on a server with wkhtmltopdf installed. The expert crawl start
provides the snapshot option to everyone. PDF snapshots are now
optional, and the option is only shown if wkhtmltopdf is installed.
- the snapshot api now provides requests for historized xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such an xml file is identical to a solr search result
with only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down the process a lot,
and a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
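The resulting directory layout, as described above (the DATA/HTCACHE
parent is an assumption, matching the SNAPSHOTS location mentioned in
the pdf snapshot commit below):

    DATA/HTCACHE/snapshot/inventory/
    DATA/HTCACHE/snapshot/archive/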
Michael Peter Christen
6f0167fac1 get cloned crawl start parameter for snapshots 2014-12-02 12:52:05 +01:00
Michael Peter Christen
97f6089a41 YaCy can now create web page snapshots as pdf documents which can
later be transcoded into jpg for image previews. To create such pdfs
you must do the following (a sketch of the rendering step follows this
entry):

Install wkhtmltopdf and imagemagick on your OS:
On a Mac, download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from
http://wkhtmltopdf.org/downloads.html and download
http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip
On Debian, run "apt-get install wkhtmltopdf imagemagick"

Then, in /Settings_p.html?page=ProxyAccess, check "Transparent Proxy"
and "Always Fresh" - these are used by wkhtmltopdf to fetch web pages
through the YaCy proxy. With "Always Fresh" it is possible to get all
pages from the proxy cache.

Finally, you will see a new option when starting an expert web crawl.
You can set a maximum crawl depth up to which pdf generation should be
performed. The resulting pdfs are then available in
DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf
2014-12-01 15:03:09 +01:00
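A minimal sketch of the rendering step (a hypothetical helper; YaCy's
actual invocation, proxy wiring and file naming may differ): call the
external wkhtmltopdf binary to write a URL to a pdf file.

    import java.io.File;
    import java.io.IOException;

    public class PdfSnapshot {
        // Render the given URL to a pdf using the wkhtmltopdf tool,
        // which must be installed on the host system.
        public static File render(String url, File targetPdf)
                throws IOException, InterruptedException {
            Process p = new ProcessBuilder("wkhtmltopdf", url,
                    targetPdf.getAbsolutePath()).inheritIO().start();
            if (p.waitFor() != 0)
                throw new IOException("wkhtmltopdf failed for " + url);
            return targetPdf;
        }
    }

The jpg preview can then be produced from the pdf with imagemagick.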
orbiter
f642cfbe30 added hint to the regular expression tester 2014-08-27 18:40:20 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with a rel="nofollow"
attribute in the <a> tag for each crawl (a sketch of the decision
follows this entry). This introduces a lot of changes because it
extends the usage of the AnchorURL object type, which now also has a
toString method different from the underlying DigestURL.toString. It
is therefore not advised to use .toString at all for urls; use
toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
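A minimal sketch of the 'obey nofollow' decision (a hypothetical
helper, not the AnchorURL api): a link is skipped when the crawl obeys
nofollow and the anchor's rel attribute contains the nofollow token.

    public class NofollowFilter {
        // Decide whether a link may be followed, given the crawl option
        // and the raw rel attribute of the <a> tag (may be null).
        public static boolean followLink(boolean obeyNofollow, String rel) {
            if (!obeyNofollow || rel == null) return true;
            for (String token : rel.toLowerCase().split("\\s+")) {
                if (token.equals("nofollow")) return false;
            }
            return true;
        }
    }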
Michael Peter Christen
1b279d7a7e fixed external link 2014-06-27 15:12:53 +02:00
reger
89e2c5e884 fix: allow enable of CrawlStartExpert.html #file 2014-05-17 22:56:15 +02:00
Michael Peter Christen
a2fba6584f use the submitted default userAgent when cloning a crawl 2014-04-30 05:05:02 +02:00
orbiter
d29b6db270 made crawl start pages public, since they do not reveal individual
information and are also not used as the servlet that actually starts
the crawl (which is Crawler_p.html).
2014-03-31 20:42:39 +02:00