yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
reger	8410536f75	keep svnRevision in .init for convert of .conf until release >1.83	2016-03-20 18:12:55 +01:00
reger	726ebee65a	include Version config string in yacy.init (replacing svnRevision)	2016-03-20 03:42:33 +01:00
Michael Peter Christen	f4591b1b51	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	2016-03-11 18:12:38 +01:00
Michael Peter Christen	1ce38fdaed	0n - added experimental zeronet network which supports intranet peers (still needs work)	2016-03-11 08:55:51 +01:00
Michael Peter Christen	d05ffa1c51	update to seed list	2016-03-11 07:20:38 +01:00
reger	16724c1283	remove unused proxyCookieWhiteList from yacy.init	2016-03-11 01:14:54 +01:00
Michael Peter Christen	5d635879f8	Merge pull request #40 from Scarfmonster/autocrawl Automatic crawling	2016-01-14 22:19:55 +01:00
Ryszard Goń	7d6e0d8470	Add missing settings to autocrawl settings page	2016-01-14 03:27:33 +01:00
Ryszard Goń	a98c395023	Add the Autocrawl thread	2016-01-14 00:50:23 +01:00
reger	4765e374e6	altered calc. of search result items per page to display taking the existing limits into account but make it consistent with search option screen for admin and public user changes: - configured default number of items per page (ConfigPortal_p.html) is used as is (no hardcoded limit) - otherwise requests are limited to 100 results per page ( = search option, index.html) (this basically is the major change, inc. limit from 20 to 100 for public user) P.S. - the older grant of more (1000), if no online snippet calculation, is kept (for the time being) see http://mantis.tokeek.de/view.php?id=627	2016-01-13 01:30:49 +01:00
Ryszard Goń	1728cd30c6	Create autocrawl profiles	2016-01-12 16:28:34 +01:00
reger	e8256bb3b1	remove blekko from opensearch config (not available) see https://blekko.com/ http://searchengineland.com/goodbye-blekko-search-engine-joins-ibms-watson-team-217633	2016-01-04 04:49:10 +01:00
reger	a5faf73afa	remove obsolete yacy.init entries interaction.* (related to removed triplestore)	2015-12-29 15:41:19 +01:00
sixcooler	dce1cb65c4	Merge remote-tracking branch 'choose_remote_name/master'	2015-12-28 23:20:42 +01:00
reger	e84d94f8ca	fix mime table for ms office / open office documents (causing wrong parser detect in intranet mode)	2015-12-22 17:48:24 +01:00
reger	15e46b2bad	exclude in/outboundlinksnofollowcount_i from default schema fields (not used in any function)	2015-12-19 21:25:08 +01:00
luc	8c4ab9c76b	Added an option to eventually limit size of remote solr documents put to local index. See mantis #626.	2015-12-16 02:20:03 +01:00
luc	55a4d15775	Added a note on deprecated default search field and operator.	2015-12-14 23:55:12 +01:00
reger	b2c8bc0ae6	remove md5_s from default index fields it is not assigned a value / not used Due to above also excluded from transfer protocol.	2015-11-27 02:41:02 +01:00
sixcooler	f5a9948860	do not store subfield *_coordinate	2015-11-10 20:32:42 +01:00
sixcooler	fca353e5eb	set startuptype of most solr handlers to lazy	2015-11-10 20:32:05 +01:00
reger	c720b4c249	remove override of dynamicField coordinate_p in solr schema (coordinate_p is not a mandatory field as such doesn't need to be declared as schema.field)	2015-10-24 22:44:28 +02:00
reger	f0b5bc93a3	remove obsolete yacy.init entry "secureHttps" not used anywhere	2015-10-19 03:47:28 +02:00
reger	5e45f1a460	enable Solr schema dynamicField _p (type=location) for YaCy coordinate_p field	2015-09-01 21:47:25 +02:00
sixcooler	87e4abe393	fight the fieldcache by using DocValues: in Solr-5.x the fieldcache has moved and was not cleared anymore. This results in a huge fieldcache. (http://lucene.apache.org/#highlights-of-the-lucene-release-include https://issues.apache.org/jira/browse/LUCENE-5666) Here I try to use DocValues where it is possible. For this I used the Api-Scheme as new basis for the Solr-Schema. This needs at least a complete optimization of the Solr-Index to get a smaller FieldCache. Everything that is indexed with these settings will not use the Fieldcache at all.	2015-08-31 20:24:41 +02:00
reger	250f6457f0	remove expired domain titan.deep-one.in from bootstrap.seedlist	2015-08-26 23:58:08 +02:00
Michael Peter Christen	df3314ac1a	added a new facet type based on a probabilistic classifier using bayesian filters. This can be used to classify documents during indexing-time using a predefined bayesian filter. New wordings: - a context is a class where different categories are possible. The context name is equal to a facet name. - a category is a facet type within a facet navigation. Each context must have several categories, at least one custom name (things you want to discover) and one with the exact name "negative". To use this, you must do: - for each context, you must create a directory within DATA/CLASSIFICATION with the name of the context (the facet name) - within each context directory, you must create text files with one document each per line for every category. One of these categories MUST have the name 'negative.txt'. Then, each new document is classified to match within one of the given categories for each context.	2015-08-10 14:27:44 +02:00
Michael Peter Christen	e1cd9c0dba	added another default network / commented out	2015-07-09 16:25:11 +02:00
reger	00d2062813	Remove deprecated AdminHandlers in solrconfig.xml to avoid warning log W org.apache.solr.handler.admin.AdminHandlers <requestHandler name="/admin/" class="solr.admin.AdminHandlers" /> is deprecated . It is not required anymore	2015-07-01 00:58:23 +02:00
Michael Peter Christen	694b22f165	migration to Solr 5.2: huge benefits - this is a lot faster! This is a very complex migration: many classes had been renamed or removed, dependencies changed and the solr index type is now aligned to be a solr cloud repository. Together with the Solr 5.2 library update, one other dependent library had been updated as well: httpclient 4.4->4.4.1 Older indexes are migrated from 4_10 to 5_2. However, the new index structure is more efficient and we recommend to re-index everything. Please use the index export before you do the update to a large surrogate xml file. After the update, start with an empty index and then initialize this with your dump.	2015-06-24 01:55:51 +02:00
Michael Peter Christen	9c12555be5	added link to Snapshots in search results if the snapshot exists and option is set in ConfigSearchPage_p (this is a stub: we also need a visualization of pdf files!)	2015-06-07 20:37:37 +02:00
reger	6bc8a9b11e	make Quality of Service Servlet available to prioritize requests from local host This assigns priorities to incoming requests. Higher priority numbers are served before lower. (disabled by default in defaults/web.xml, uncomment or copy entry to DATA/Settings/web.xml)	2015-04-26 04:29:32 +02:00
Michael Peter Christen	b060ba900d	added parsing of contentprop attribute in html tags for content='startDate' and content='endDate'. The value of these field is now written to new solr fields startDates_dts and endDates_dts.	2015-04-13 16:20:00 +02:00
Michael Peter Christen	4cb4f67f38	added parsing of dd, dt and article html fields. The parsed result is written to special solr fields which are deactivated by default.	2015-04-12 22:02:45 +02:00
Michael Peter Christen	36e9cdb376	testing switching off cold searchers; maybe this brings performance enhancements when using large facets	2015-04-07 13:14:41 +02:00
Michael Peter Christen	535f1ebe3b	added a new way of content browsing in search results: - date navigation The date is taken from the CONTENT of the documents / web pages, NOT from a date submitted in the context of metadata (i.e. http header or html head form). This makes it possible to search for documents in the future, i.e. when documents contain event descriptions for future events. The date is written to an index field which is now enabled by default. All documents are scanned for contained date mentions. To visualize the dates for a specific search results, a histogram showing the number of documents for each day is displayed. To render these histograms the morris.js library is used. Morris.js requires also raphael.js which is now also integrated in YaCy. The histogram is now also displayed in the index browser by default. To select a specific range from a search result, the following modifiers had been introduced: from:<date> to:<date> These modifiers can be used separately (i.e. only 'from' or only 'to') to describe an open interval or combined to have a closed interval. Both dates are inclusive. To select a specific single date only, use the 'to:' - modifier. The histogram shows blue and green lines; the green lines denote weekend days (saturday and sunday). Clicking on bars in the histogram has the following reaction: 1st click: add a from:<date> modifier for the date of the bar 2nd click: add a to:<date> modifier for the date of the bar 3rd click: remove from and date modifier and set an on:<date> for the bar When the on:<date> modifier is used, the histogram shows an unlimited time period. This makes it possible to click again (4th click) which is then interpreted as a 1st click again (sets a from modifier). The display feature is NOT switched on by default; to switch it on use the /ConfigSearchPage_p.html servlet.	2015-03-02 04:30:10 +01:00
reger	ba276d3e64	add description_txt to default query fields, Dublin Core Metadata field extracted by most parsers.	2015-02-22 05:42:04 +01:00
reger	fe6f5a395d	fix Umlaut handling in blekko heuristic search term http://mantis.tokeek.de/view.php?id=169 observation: blekko seems to block xxxbot agents (=0 results)	2015-02-08 23:40:33 +01:00
Michael Peter Christen	97ba5ddbb7	configuration option for maxload limit for remote search	2015-02-04 01:12:25 +01:00
Michael Peter Christen	ac19690d30	refactoring with CommonPattern.COMMA	2015-01-29 01:35:28 +01:00
Michael Peter Christen	cf9b22ca5c	do not reindex based on vocabulary fields (there are meanwhile many of them) and some default settings	2015-01-29 01:22:28 +01:00
reger	24f68a4eb7	refactor opensearch heuristic introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors, which provide the query() functionality, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector. The manager enforces now a min 15s delay between calls to external systems. Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses an additional config file for fieldname translation. default heuristicopensearch.conf: - openbdb.com removed - seems to no longer deliver results - config via solrconnector to datacite.org added (large technical library archive)	2015-01-19 03:30:35 +01:00
reger	4eb89d7f15	revert clickservlet (default was indeed a mistake)	2015-01-05 09:10:20 +01:00
Michael Peter Christen	61ae9d2d11	do not use the clickservlet by default. From my personal view, this technique should not be used at all! This project is about privacy, the existence of a click servlet is one example why people should NOT use a search portal if such exists.	2015-01-05 08:21:51 +01:00
sixcooler	5594c43d2e	bump to Solr-/Lucene-4.10.3	2015-01-04 18:47:47 +01:00
reger	d44d8996d0	Added a “don't store remote search results” option This is intended for peers who want to participate in the P2P network but don't wish to load/fill-up their index with metadata of every received search result. The DHT transfer is not affected by this option (and will work as usual, so that a peer disabling the new store to index switch still receives and holds the metadata according to DHT rules). Downside for the local peer is that search speed will not improve if search terms are only avail. remote or by quick hits in local index. To be able to improve the local index a Click-Servlet option was added additionally. If switched on, all search result links point to this servlet, which forwards the user's browser (by html header) to the desired page and feeds the page to the fulltext-index. The servlet accepts a parameter defining the action to perform (see defaults/web.xml, index, crawl, crawllinks) The option check-boxes are placed in ConfigPortal.html	2015-01-04 11:10:45 +01:00
reger	e177d69387	remove obsolete config footer option (ConfigPortal user.login) no footer or footer-option in use remove unused yacy.init item allowUnlimitedReceiveIndexFrom	2014-12-29 03:50:00 +01:00
reger	6a04563578	Init Jetty using setDefaultDescriptor (web.xml) to defaults/web.xml so web.xml in defaults dir is applied first and optional DATA/SETTINGS/web.xml loaded on top. By using this Jetty feature (default web.xml) we assure that changes to the default are applied to existing installations and individual addition/changes are still respected.	2014-12-27 00:10:14 +01:00
Michael Peter Christen	eb78388a98	changed prefer strategy for http unique in such a way that http is preferred over https. While this is a bad idea from the standpoint of security it is more common applicable for environments where http and https mix and for some domains https is not available. Then the double-check is possible even if no postprocessing is performed.	2014-12-21 19:17:06 +01:00
Michael Peter Christen	d14114697c	the miss cache does not seem to work, it sometimes contains urlhashes from documents which actually are inside the index. This can be reproduced using the crawl result table at http://localhost:8090/CrawlResults.html?process=5 The cache is temporarily disabled to remove the bad behaviour, however a later reactivation of that feature may be possible.	2014-12-21 17:31:51 +01:00

590 Commits