yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	addba047e2	changes in ranking computation - an existing ranking servlet for solr was extended. It is now possible to set boost values for fields, boost functions and boost queries. - The ranking can have different instances, but currently only the first one is used - added an abstraction layer for fields which can be used for search and those fields can be edited in the solr ranking configuration - the ranking value from solr within the field score is used to combine remote search requests, which all are created using the same locally defined boost values - reduced the number of fields which are used for search (makes it faster) - replaced some text fields by string fields (makes indexing faster) - removed classes which had no use - made a large number of experiments for a better ranking and created a temporary setting which prefers hits inside titles - adjusted also the RWI-based ranking computation to 'prefer title' - made special cases like for portal search where no post-processing and post-ranking is wanted: this keeps the original ranking order as done by Solr - fixed many bugs with old settings for ranking	2013-03-13 14:47:00 +01:00
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resulting search index has now the following properties: - webgraph size will have about 40 times as many entries as default index - the complete index size will increase and may be about the double size of current amount As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	2013-02-22 15:45:15 +01:00
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented to the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of a solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding on top of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; they must retrieve the actual connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	2013-02-21 13:23:55 +01:00
Michael Peter Christen	b6de1f42dc	Full redesign of solr connection architecture. This was done to support multiple solr cores instead of just one. Therefore it is now necessary to distinguish between solr server connections (called an 'Instance') and a connection to a single solr core. One Instance may now have multiple connector classes assigned to it, each connecting to a single core. To support multiple cores it is also necessary to distinguish between the connection configuration and the configuration of the index schema. We will have multiple schema configurations in the future, one for each solr core. Because of this, the IndexFederated servlet had to be split into two parts; the new servlet for the schema editor is now the IndexSchema servlet.	2013-02-15 01:38:10 +01:00
Michael Peter Christen	762b687e47	extended the serverObjects to be able to hold multiple values for a single key. This is done using the solr class MultiMapSolrParams. That class is needed in the OpensearchResultWriter to get multiple facet requests.	2013-02-11 22:12:15 +01:00
Michael Peter Christen	1052263af3	- added a new solr field references_i which stores the number of INCOMING links to the corresponding web page. This information is taken from the reverse link index (a 'little sister' of the RWI index). - this field can be of use to enhance the ranking because a web page with more incoming links can be more important than others. But this is not true for typical link pages like menus. Therefore the number of outgoing links is needed. - added a new solr attribute 'bf' to solr queries which is a boost function extension. this field can contain a formula which computes the boost according to given field values. After some experiments the following formula is now default: div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4 This takes the number of references and the inbound links. Further experiments are needed to enhance that formula.	2012-12-18 14:42:35 +01:00
Michael Peter Christen	cb5cbec14d	distinguishing modified query string and original query string	2012-12-15 00:05:46 +01:00
Michael Peter Christen	2b7d46bc1f	using a filter query for the site parameter in GSA api	2012-12-07 14:54:49 +01:00
Michael Peter Christen	8aa08261a7	update to Solr Boost handling	2012-12-05 12:26:42 +01:00
Michael Peter Christen	72f165d58b	added a Boost class which stores solr query boost values. The class can be configured using the yacy.init file. The boost information is taken from the configuration each time when a query to solr is done.	2012-12-02 16:54:29 +01:00
Michael Peter Christen	3de784c8dd	replaced more split and replaceAll missing pattern pre-compilation with pre-compiled pattern	2012-11-26 13:40:53 +01:00
Michael Peter Christen	d48e9788d2	enhanced search result processing behavior - query less at one time; query more often - in between the small queries, evaluate results - remove fields from search results which are not needed	2012-11-26 12:24:35 +01:00
Michael Peter Christen	4eab3aae60	removed overhead by preventing generation of full search results when only the url is requested	2012-11-23 01:35:28 +01:00
Michael Peter Christen	a114bb23bb	- using edismax in gsa interface - generating less field data for gsa search results - using a boost query in gsa interface to move double content to the end of the result list	2012-11-22 13:03:33 +01:00
Michael Peter Christen	952e143580	FINALLY YaCy can now search for full strings using double- or singlequoted strings in the search query line!!!	2012-11-18 16:03:34 +01:00
orbiter	5dfd6359cb	redesign of the QueryParams class: introduced QueryGoal which holds the query string parser. This shall be used to create a proper full-string matching which is handled then by QueryGoal.	2012-11-18 01:22:41 +01:00
Michael Peter Christen	5fd3b93661	added deletion of hosts during crawl start if deleteold option was given	2012-11-13 16:54:28 +01:00
Michael Peter Christen	d64445c3cb	because we have the inurl:<term> - searchmodifier, we don't actually need regular expressions as search attributes. They have now been removed from the advanced search page while they are still created internally. The filter is then expressed against solr as regular expression filter query. If the expression points out a selection of a specific protocol, host or filetype this is then translated into a faceted query.	2012-11-13 11:45:56 +01:00
Michael Peter Christen	2e7219f9fd	removed highlighting of search results within collections in GSA interface	2012-11-09 16:25:24 +01:00
Michael Peter Christen	5105256927	update to search result logging (this was a remaining issue from the solr 4.0.0 migration)	2012-11-07 14:15:27 +01:00
Michael Peter Christen	e2c4c3c7d3	migration to solr 4.0.0	2012-11-02 12:29:48 +01:00
Michael Peter Christen	1168d09de8	more refactoring - integrated the code of SnippetProcess into SearchEvent	2012-11-01 17:40:06 +01:00
Michael Peter Christen	ef937af35d	more custom field usage in gsa search result	2012-10-18 15:26:55 +02:00
Michael Peter Christen	f8a3ab2d82	added the usage of synonyms to the GSA search interface	2012-10-02 14:29:45 +02:00
Michael Peter Christen	1533bfd63b	refactoring	2012-09-25 21:20:03 +02:00
Michael Peter Christen	872f83ebe0	refactoring	2012-09-25 21:04:58 +02:00
Michael Peter Christen	00c1c777fa	refactoring	2012-09-21 15:48:16 +02:00
orbiter	563d584420	removed more dependencies in cora from kelondro	2012-09-21 11:02:36 +02:00
Michael Peter Christen	975bc95ddf	added default facet fields for json response format (stub)	2012-09-14 12:09:20 +02:00
Michael Peter Christen	b69ed96f0b	- added collections to yacydoc - changed yacydoc.htm to yacydoc.json - added query logging in solr and gsa search result	2012-09-10 15:20:55 +02:00
orbiter	66ac4076c2	added disjunction '|' option to site parameter in GSA API	2012-09-06 22:35:55 +02:00
Michael Peter Christen	b2b516cc3e	added a collection attribute to crawls and searches: - a solr field collection_sxt can be used to store a set of crawl tags - when this field is activated, a crawl tag can be assigned when crawls are started - the content of the collection field can be comma-separated, all of them are assigned to the documents when they are indexed as result of such a crawl start - a search result can be drilled down to a specific collection; this is currently only available in the solr interface and also in the gsa interface using the 'site' option - this adds a mandatory field for gsa queries (the google api demands that field all the time)	2012-09-03 15:26:08 +02:00
Michael Peter Christen	c72c435517	- moved the gsa search interface from /gsa/searchresult? to /gsa/search? - fixed the NB field data	2012-08-31 14:00:53 +02:00
Michael Peter Christen	3142e675e8	fixed problems with GSA api: - better FS attribute - highlighting of searched words in title	2012-08-29 16:48:53 +02:00
Michael Peter Christen	3b19fe7b52	- fixed num parameter in GSA api - changed FS attribute in GSA api	2012-08-29 16:28:32 +02:00
orbiter	479bfca571	refactoring	2012-08-23 09:30:11 +02:00
Michael Peter Christen	48a82bc705	log queries anonymously from gsa+solr requests	2012-08-22 23:50:40 +02:00
Michael Peter Christen	ab6ec4ec52	added snippet computation to solr/rss and gsa result writer	2012-08-22 17:37:34 +02:00
Michael Peter Christen	0ad52ac4c3	gsa bugfix for date parser	2012-08-21 02:39:28 +02:00
Michael Peter Christen	3ce4c2f937	fixes for gsa result format	2012-08-21 01:57:46 +02:00
Michael Peter Christen	2d5fdfeb65	added authorization-based maximum results limitation to solr and gsa search	2012-08-20 17:10:48 +02:00
Michael Peter Christen	0cab06c47c	refactoring	2012-08-17 15:52:33 +02:00
Michael Peter Christen	06a78eecb7	code simplification	2012-08-17 14:43:32 +02:00
Michael Peter Christen	597bb76e4f	get the peer location more quickly	2012-08-16 16:28:57 +02:00
Michael Peter Christen	d988ba50cf	added a very rudimentary, incomplete, non-verified GSA response writer for solr. Try this: http://localhost:8090/gsa/searchresult?q=pdf&site=col1&num=10	2012-08-14 12:40:26 +02:00
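
The GSA-style endpoint introduced in d988ba50cf is queried with a plain URL. The following is a minimal sketch in Python of how such a request URL could be assembled. It assumes the later path /gsa/search (see c72c435517), the local port 8090 from the example URL above, and the '|' disjunction for the site parameter (see 66ac4076c2). The helper name gsa_url and the collection names are hypothetical and not part of YaCy.

```python
from urllib.parse import urlencode


def gsa_url(query, sites, num=10, base="http://localhost:8090/gsa/search"):
    """Build a GSA-style search URL (hypothetical helper, not a YaCy API).

    `sites` is a list of collection names; they are joined with '|',
    the disjunction character named in the commit log.
    """
    params = {"q": query, "site": "|".join(sites), "num": num}
    # urlencode percent-encodes the '|' separator as %7C
    return base + "?" + urlencode(params)


print(gsa_url("pdf", ["col1", "col2"]))
```

Passing a single collection, as in `gsa_url("pdf", ["col1"])`, reproduces the parameter set of the original example URL.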

45 Commits
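
The default boost function from commit 1052263af3, div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4, can be evaluated by hand to see how the two counts interact. The sketch below is an illustration in Python, not YaCy code. It assumes that the trailing ^0.4 acts as a multiplicative weight on the function value, which is how Solr's dismax-style bf parameter usually treats a ^ suffix; if it were an exponent the numbers would differ. The function name boost is hypothetical.

```python
import math


def boost(references, inbound_links, weight=0.4):
    """Evaluate div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6)).

    `weight` models the trailing ^0.4 as a multiplicative weight
    (an assumption about the bf syntax, see the text above).
    """
    raw = (1 + references) / math.pow(1 + inbound_links, 1.6)
    return weight * raw


# A page with many references and few inbound links scores higher than a
# link-heavy page (for example a menu) with the same number of references.
print(boost(references=9, inbound_links=0))
print(boost(references=9, inbound_links=50))
```

The exponent 1.6 makes the denominator grow faster than the numerator, so pages consisting mostly of links are penalised even when they are referenced often.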