yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	252bb51f98	fix for wrong mime type in noload crawler	2013-03-07 15:31:00 +01:00
orbiter	0f7ea7ad9f	- enhanced solr.add procedure for mass adds - removed unused solr access classes - made snippet generation for documents from YaCy RWI/DHT concurrent (as it was before the search process removal) - reduced the number of remote results in settings file because the processing of such mass document adds is too CPU-intensive (in Solr)	2013-03-01 15:27:17 +01:00
Michael Peter Christen	840fa22135	disabled clickdepth computation during crawling since that is repeated during clean-up phase.	2013-02-28 02:25:39 +01:00
orbiter	d74472f562	corrected result counter	2013-02-27 22:40:23 +01:00
Michael Peter Christen	c95a84103a	complete redesign of search process: - removed 'worker' processes - no internal time-out behaviour: methods either are successful or return null - waiting is only done on top-level - removed snippet-production; this is replaced by solr snippets - removed statistics based on solr size queries (they had been VERY long); the statistics (like suggestions or tag cloud) are now again based on the old but very fast RWI index. In portal or intranet mode the RWI index is usually switched off; if you like to have statistics again then you must switch on the rwis again in this mode. - fixed many bugs regarding correct page counter	2013-02-26 17:16:31 +01:00
Michael Peter Christen	008288719c	fix for schema export to consider also automatically generated coordinate fields	2013-02-25 01:13:03 +01:00
Michael Peter Christen	089dee1770	- generalized SchemaConfiguration into super-class Configuration and adopted other classes which used the configuration-only access for that class - removed many warnings - adjusted logging	2013-02-25 00:09:41 +01:00
Michael Peter Christen	58e1e6fa2b	fixes to schema	2013-02-23 08:14:10 +01:00
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resulting search index now has the following properties: - webgraph size will have about 40 times as many entries as the default index - the complete index size will increase and may be about double the current size. As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index means that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get full access to the new index, the API to solr now has two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	2013-02-22 15:45:15 +01:00
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented in the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of a solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding on top of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; they must retrieve the actual connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	2013-02-21 13:23:55 +01:00
Michael Peter Christen	b6de1f42dc	Full redesign of solr connection architecture. This was done to support multiple solr cores instead of just one. Therefore it is now necessary to distinguish between solr server connections (called an 'Instance') and a connection to a single solr core. One Instance may now have multiple connector classes assigned to it, each connecting to a single core. To support multiple cores it is also necessary to distinguish between the connection configuration and the configuration of the index schema. We will have multiple schema configurations in the future, one for each solr core. As a result, the IndexFederated servlet had to be split into two parts; the new Servlet for the Schema editor is now in the IndexSchema Servlet.	2013-02-15 01:38:10 +01:00
Michael Peter Christen	4111606654	removed the commitWithin attribute because that is not the right way for us to update the index. May also be superfluous with the solr 4.0 softcommit.	2013-02-13 02:29:47 +01:00
Michael Peter Christen	3a6097966d	added jsonp option to yjson result writer	2013-02-13 01:11:57 +01:00
Michael Peter Christen	de58043205	Added image license generation for solr image search results when results are generated within yjson result writer. This makes it possible to view images in yacyinteractive from solr.	2013-02-13 00:33:53 +01:00
Michael Peter Christen	d3508fa8ff	fixed json search, quotes, auto-facets, urls etc. for yacyinteractive.html	2013-02-13 00:01:38 +01:00
Michael Peter Christen	1db23e9eac	Moved methods from SolrServerConnector to AbstractSolrConnector with the result that most of these methods become superfluous in other classes. This is a generalization step towards multi-indexes in Solr.	2013-02-12 22:03:10 +01:00
Michael Peter Christen	c34af7fe94	extended JSON Response Writer and Opensearch Response Writer for the Solr search interface in such a way that it is possible to use this interface for the yacyinteractive search. This search interface is now much faster using the Solr search directly. For the Solr interface it was necessary to create a translation from the YaCy search modifiers to the Solr facet selection. This was added in such a way that it becomes generic for the normal YaCy search and as an on-top evaluation for Solr queries.	2013-02-12 03:42:46 +01:00
Michael Peter Christen	d70d99fab5	added more metadata fields and facets to OpensearchResponseWriter. This should make it possible to replace the original and enriched yacy opensearch result with a solr output in opensearch format.	2013-02-11 22:10:14 +01:00
Michael Peter Christen	7806680ab8	fixed a problem with re-feeding of already indexed documents with coordinates attached.	2013-02-08 12:45:54 +01:00
Michael Peter Christen	19c46e4acf	catch more exceptions	2013-02-04 21:24:39 +01:00
Michael Peter Christen	921091c3a6	use thread-safe http connection manager for authenticated remote solr connections	2013-02-04 17:48:04 +01:00
Michael Peter Christen	3834829b37	bugfixes and more logging for solr connector	2013-02-04 16:42:10 +01:00
Michael Peter Christen	80fe3d7860	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/cora/federate/solr/connector/EmbeddedSolrConnector.java	2013-02-04 10:57:54 +01:00
Michael Peter Christen	4323621a76	update to Solr 4.1.0	2013-02-04 10:55:49 +01:00
reger	160ce568b3	move testing SolrServlet.main to test, making include of jetty.jar in distribution and classpath obsolete - move jetty.jar to test library - move SolrServlet.main as is to test, add also a junit test simulating main - add build.xml cleanup for EmbeddedSolrConnectorTest created test/DATA - adjust some test compile errors	2013-02-03 22:32:38 +01:00
Michael Peter Christen	d1cb4cbc84	enhanced network scanner, is faster and more flexible now - start more processes - remove superfluous host name resolution - better/more flexible subnet ip range calculation - prefer ipv4 makes better usable ip pre-settings in servlet - extended servlet by new subnet /20 - option - redesign of scanner start process in servlet (generalization)	2013-02-02 09:51:43 +01:00
Michael Peter Christen	f748b0aa7c	NPE fix	2013-02-02 07:20:02 +01:00
Michael Peter Christen	0b6566a389	optimizations when starting large crawl requests with many start urls in one request: - allow larger match-fields in html interface - delete all host hashes at once from zurl - when deleting by host, do not count size of deleted entries since that was the reason it took so long	2013-01-31 13:15:28 +01:00
sixcooler	3a13906121	clear some more caches if running out of memory	2013-01-25 04:24:36 +01:00
Michael Peter Christen	8651ec35fe	turned author_s into the multi-valued field author_sxt	2013-01-24 18:24:31 +01:00
Michael Peter Christen	4735bd47f4	- changed solr commit call and added an optimize option. Since Solr 4.0.0 there is a new softcommit feature which implements a near-real-time (NRT) search option. The softcommit does not do IO and does not cause performance issues. YaCy has now an extension in its solr connectors to use the softcommit feature. The softcommit call now replaces all places where a hard commit was used. Furthermore the commit strategy when doing a search from the web interface was changed (it's done every time before a search is done). The softcommit feature was implemented because it was needed for the following changes (customer demands), which are also included in this git commit: - added a feature to identify all documents which have unique titles and/or unique descriptions. These unique flags are disabled by default. - added also a feature to set a flag when the url from a canonical tag is equal to the document url. This is also disabled by default. To support the new softcommit strategy, the commitWithinMs option was set to -1 to disable automatic commit based on document insert times. If documents are inserted permanently then also a commit would happen permanently whenever the commitWithinMs time is reached. This would conflict with the regular autocommit of 10 minutes and the new softcommit strategy.	2013-01-23 14:40:58 +01:00
Michael Peter Christen	9ccdd21d76	Merge remote-tracking branch 'aleksejs/fixtrans' Conflicts: locales/ru.lng Tried to merge this but I had to do this 'blind'. Sorry if I deleted something that was right.	2013-01-22 11:54:38 +01:00
Michael Peter Christen	db024a4e19	added new solr fields (unused yet; implementation will follow)	2013-01-21 18:02:29 +01:00
sixcooler	f3e705c4fe	bump to httpclient / httpcore 4.2.3 (bugfix-release)	2013-01-17 20:10:49 +01:00
Michael Peter Christen	38d3feae65	added separate delete commands for the local+remote solr index, the old metadata and old rwi and for the citation index. The important advancement is the separation of the citation index deletion because that index is responsible for the linkdepth calculation. Now a search index can be deleted without the citation index and that should cause that less clickdepths must be post-processed.	2013-01-04 16:39:34 +01:00
Michael Peter Christen	5c0c56cfe1	Preparations to produce a click depth attribute in the search index. This attribute can be used for ranking and for other purposes (customer demand). The click depth is computed in two steps: - during indexing the current fill-state of the reverse link index is used to backtrack the current page to the root page. The length of that backtrack is the clickdepth. But this does not discover the shortest click depth. To get this, a second process to check again is needed - added a process tag that can be used to do operations on the existing index after a crawl; i.e. calculating the shortest clickpath. Added a field to control this operation but not a method to operate on this. - added a visualization of the clickpath length in the host browser	2013-01-02 20:55:43 +01:00
Michael Peter Christen	6861af87e2	removed warnings	2013-01-02 19:05:48 +01:00
Michael Peter Christen	295884fd54	- Merge commit '168b1d130d9d67b5e8855a0b50c4ba7ad4a416f8' - fixed conflict in htroot/yacysearch.java - removed nedres check because that causes that the remote server is not called at all in most cases (local index has already results but we want more) - fixed a regex bug (a '=' too much)	2013-01-02 15:08:07 +01:00
reger	168b1d130d	Adding heuristic to get search results from configured systems which support opensearch specification - any system supporting opensearch specification can be configured - search query is only forwarded to remote system if not enough results available on local peer - discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config - sample config file with some general search engines with opensearch support	2012-12-29 08:24:48 +01:00
Michael Peter Christen	eb90d38cd7	added missing extension 'mkv' for navigation	2012-12-27 13:56:13 +01:00
Michael Peter Christen	4a9182ae16	use the search configuration to default the cacheStrategy to the value as given in the search configuration	2012-12-27 03:19:21 +01:00
Michael Peter Christen	98819ec3d9	use solr boost configuration to select search fields. At this time it is possible to enter a negative boost value to switch that value off. This might be different in the future with a better input interface.	2012-12-27 03:17:45 +01:00
Michael Peter Christen	24c9bb35f7	extended the Scheduler: introduced scheduled events - an event type (once, regular) can be selected - for this event type, a fixed time can be selected. This may be either directly after startup or at one of the full hours of a day (==25 options) The main point about this feature is the opportunity to start an action directly after startup. That makes it possible to create YaCy distributions which, when started for the first time, start to index parts of the intranet/internet by themselves.	2012-12-22 16:27:14 +01:00
Michael Peter Christen	84f82541e8	search process enhancements	2012-12-19 10:41:22 +01:00
Michael Peter Christen	02020b590b	- removed all extension types from extension navigation which are not proper/known - automatically show the protocol navigation if there is more than http and https - automatically show the extension navigation if there is some media content	2012-12-19 02:38:05 +01:00
Michael Peter Christen	01200f06cc	using the author field as solr-native facet. this makes it necessary to introduce a copy-field for the author field to be copied to a string field. This field is then used to generate facets. Without this field, the facet would consist only of the words of the author names, not of the full author string.	2012-12-19 01:56:33 +01:00
Michael Peter Christen	eac9650b31	added another solr field clickdepth_i which reflects the number of clicks which are necessary to get from the portal of a host to a specific document. At this time, only the start document is flagged with clickdepth '0', all other with '-1'. To get the actual clickdepth, a process must use crawled information to collect the actual number of clicks. This will be added in another/next step.	2012-12-18 17:20:42 +01:00
Michael Peter Christen	1052263af3	- added a new solr field references_i which stores the number of INCOMING links to the corresponding web page. This information is taken from the reverse link index (a 'little sister' of the RWI index). - this field can be of use to enhance the ranking because a web page with more incoming links can be more important than others. But this is not true for typical link pages like menus. Therefore the number of outgoing links is needed. - added a new solr attribute 'bf' to solr queries which is a boost function extension. this field can contain a formula which computes the boost according to given field values. After some experiments the following formula is now default: div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4 This takes the number of references and the inbound links. Further experiments are needed to enhance that formula.	2012-12-18 14:42:35 +01:00
Michael Peter Christen	7c3de8b4cd	- fix for localhost detection - added IPv6 patterns for localhost detection	2012-12-18 12:52:20 +01:00
Michael Peter Christen	34f8786508	removed dependency of vocabulary navigation from Jena and its triplestore; the vocabulary search is now done using generic solr fields which are created on-the-fly during runtime.	2012-12-18 02:29:03 +01:00
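
The default boost function from commit 1052263af3 can be evaluated by hand to see how it behaves. The following is only a sketch in Python, not YaCy or Solr code: the function name `link_boost` and its parameters are hypothetical, and it assumes that the trailing `^0.4` is a Solr query boost (a weight multiplied onto the function value) rather than an exponent.

```python
def link_boost(references: int, inbound_links: int, weight: float = 0.4) -> float:
    """Evaluate div(add(1, references_i), pow(add(1, inboundlinkscount_i), 1.6)).

    The result is scaled by `weight`, which stands in for the trailing ^0.4
    (assumed here to be a multiplicative query boost, not an exponent).
    """
    numerator = 1 + references                 # add(1, references_i)
    denominator = (1 + inbound_links) ** 1.6   # pow(add(1, inboundlinkscount_i), 1.6)
    return weight * numerator / denominator    # div(...), then the assumed weight


if __name__ == "__main__":
    # Sample values are made up for illustration only.
    for refs, links in [(0, 0), (9, 0), (9, 9), (99, 9)]:
        print(refs, links, round(link_boost(refs, links), 4))
```

Under these assumptions the value grows linearly with `references_i` and falls off faster than linearly with `inboundlinkscount_i`, which matches the stated intent of ranking link-heavy pages such as menus lower.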


591 Commits