yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
orbiter	e1bfe9d07a	- reduction of the concurrently running processes to make YaCy more adjusted to smaller and 1-core devices. - the workflow processor now starts no process at all. these are started as soon as parser/condenser/indexing queues are filled. - better abstraction	2013-04-25 11:33:17 +02:00
Michael Peter Christen	c091000165	added collection attribute also to the rss feed reader	2013-04-24 01:14:35 +02:00
Michael Peter Christen	d937c55204	extended limitation of dom export size from 100000 to 100000000	2013-04-22 22:33:13 +02:00
Michael Peter Christen	566d6c980c	checking of document signature for a double-document check now refers only to documents within the same domain	2013-04-17 16:15:27 +02:00
Michael Peter Christen	7ab5093321	added new solr title_exact_signature_l and description_exact_signature_l to be able to identify unique title and unique description fields.	2013-04-16 01:35:15 +02:00
Michael Peter Christen	f24ac518e6	redesign of exists()-query (can now be called with query) and the CachedSolrConnector which based its cache on the key value. This will be used to correct the title_unique_b and description_unique_b field.	2013-04-15 14:08:30 +02:00
Michael Peter Christen	27d6222880	added new field host_extent_i which, after a crawl and postprocessing, holds the number of documents for the host where the document is hosted. This is necessary for ranking and the norming of references per local host in the ranking computation.	2013-04-14 20:52:40 +02:00
reger	518b20147c	skip postprocessing during document.store if no citation index connected (prevent null pointer exception)	2013-04-14 02:01:27 +02:00
Michael Peter Christen	ada3f27de7	added three new fields for a better ranking: references_internal_i, references_external_i and references_exthosts_i. These can be used to count and evaluate the number of external links to every web page. An experimental ranking function can be e.g.: div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))	2013-04-12 16:17:14 +02:00
Michael Peter Christen	edc0b33f6d	- showing references count and clickdepth in host browser - fixed generation and presentation of both values	2013-04-11 14:46:13 +02:00
orbiter	940c6849ee	enhanced did-you-mean (a bit): can now remember previously searched words (plus small enhancements)	2013-03-17 10:52:31 +01:00
Michael Peter Christen	6300730d7f	refactoring of clickdepth computation as preparation for clickdepth computation of webgraph links	2013-03-14 12:13:02 +01:00
orbiter	47114910d5	fix for possible memory leaks	2013-03-13 17:55:37 +01:00
Michael Peter Christen	25300913fa	fixes to search debugging after testing with the different search debugging options	2013-03-05 21:28:22 +01:00
Michael Peter Christen	c2fde018b5	concurrent snippet fetching from solr results which do not have snippets	2013-03-05 12:24:01 +01:00
Michael Peter Christen	2b6c79d347	in method exists() also use the new caching-stacks for documents/metadata	2013-03-04 01:13:17 +01:00
Michael Peter Christen	0d7b4bc891	better protection against OOM during search flush and fixed missing result push	2013-03-03 23:45:47 +01:00
Michael Peter Christen	221ed7d764	- enhanced concurrency during search without IO blocking - introduced a second queue to flush remote search results (now: old metadata structure from DHT peers) - fixed result counters	2013-03-03 22:38:50 +01:00
Michael Peter Christen	3b1d9dc884	made index storage from DHT search results concurrent. This prevents blocking by high CPU usage during search. Also: removed query from Solr for DHT search results; results are taken from the pending queue.	2013-03-02 10:25:52 +01:00
orbiter	f13c0b2abd	fix for search	2013-03-01 19:18:16 +01:00
orbiter	0f7ea7ad9f	- enhanced solr.add procedure for mass adds - removed unused solr access classes - made snippet generation for documents from YaCy RWI/DHT concurrent (as it was before the search process removal) - reduced the number of remote results in settings file because the processing of such mass document adds is too CPU-intensive (in Solr)	2013-03-01 15:27:17 +01:00
orbiter	d74472f562	corrected result counter	2013-02-27 22:40:23 +01:00
Michael Peter Christen	c95a84103a	complete redesign of search process: - removed 'worker' processes - no internal time-out behaviour: methods either are successful or return null - waiting is only done on top-level - removed snippet-production; this is replaced by solr snippets - removed statistics based on solr size queries (they had been VERY long); the statistics (like suggestions or tag cloud) are now again based on the old but very fast RWI index. In portal or intranet mode the RWI index is usually switched off; if you like to have statistics again then you must switch on the rwis again in this mode. - fixed many bugs regarding correct page counter	2013-02-26 17:16:31 +01:00
Michael Peter Christen	089dee1770	- generalized SchemaConfiguration into super-class Configuration and adapted other classes which used the configuration-only access for that class - removed many warnings - adjusted logging	2013-02-25 00:09:41 +01:00
Michael Peter Christen	c16de49f64	fix for webgraph delete query	2013-02-24 18:17:58 +01:00
Michael Peter Christen	56d5946a59	- added flags in IndexFederated_p.html to switch on or off the webgraph index (new solr core webgraph) .. this is now off by default - completely redesigned this servlet - added description how to attach a remote solr - adjusted naming of servlet and menus - moved 'lazy initialization' attribute from IndexSchema to IndexFederated (this is a general option) back again.	2013-02-24 18:09:34 +01:00
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resulting search index has now the following properties: - webgraph size will have about 40 times as many entries as default index - the complete index size will increase and may be about double the current size As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index means that some old parts in YaCy can be removed, e.g. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	2013-02-22 15:45:15 +01:00
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented in the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of a solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding on top of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; they must retrieve the actual connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	2013-02-21 13:23:55 +01:00
Michael Peter Christen	b6de1f42dc	Full redesign of solr connection architecture. This was done to support multiple solr cores instead of just one. Therefore it is now necessary to distinguish between solr server connections (called an 'Instance') and a connection to a single solr core. One Instance may now have multiple connector classes assigned to it, each connecting to a single core. To support multiple cores it is also necessary to distinguish between the connection configuration and the configuration of the index schema. We will have multiple schema configurations in the future, one for each solr core. As a result, the IndexFederated servlet had to be split into two parts; the new Servlet for the Schema editor is now in the IndexSchema Servlet.	2013-02-15 01:38:10 +01:00
Michael Peter Christen	4111606654	removed the commitWithin attribute because that is not the right way to update the index for us. May also be superfluous with the solr 4.0 softcommit.	2013-02-13 02:29:47 +01:00
Michael Peter Christen	7806680ab8	fixed a problem with re-feeding of already indexed documents with coordinates attached.	2013-02-08 12:45:54 +01:00
Michael Peter Christen	4323621a76	update to Solr 4.1.0	2013-02-04 10:55:49 +01:00
Michael Peter Christen	7dfcc92b71	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-01-31 13:15:42 +01:00
Michael Peter Christen	0b6566a389	optimizations when starting large crawl requests with many start urls in one request: - allow larger match-fields in html interface - delete all host hashes at once from zurl - when deleting by host, do not count size of deleted entries since that was the reason it took so long	2013-01-31 13:15:28 +01:00
orbiter	a2160054d7	ability to create vocabularies also without any objectspace: this iterates over all urls in the index to create terms	2013-01-30 19:33:48 +01:00
orbiter	ecc10a752c	fixes to index enumeration for vocabulary production	2013-01-29 18:14:14 +01:00
Michael Peter Christen	0fe7b6fd3b	migrated the index export methods from the old metadata to solr. Now exports are done using solr queries. removed superfluous methods and servlets.	2013-01-24 12:39:19 +01:00
Michael Peter Christen	1768c82010	removed field selection because that created documents with that field only which was not useful when re-writing the same document	2013-01-24 03:26:38 +01:00
Michael Peter Christen	4735bd47f4	- changed solr commit call and added an optimize option. Since Solr 4.0.0 there is a new softcommit feature which implements a near-real-time (NRT) search option. The softcommit does not do IO and does not cause performance issues. YaCy has now an extension in its solr connectors to use the softcommit feature. The softcommit call now replaces all places where a hard commit was used. Furthermore the commit strategy when doing a search from the web interface was changed (it's done every time before a search is done). The softcommit feature was implemented because it was needed for the following changes (customer demands), which are also included in this git commit: - added a feature to identify all documents which have unique titles and/or unique descriptions. These unique flags are disabled by default. - added also a feature to set a flag when the url from a canonical tag is equal to the document url. This is also disabled by default. To support the new softcommit strategy, the commitWithinMs option was set to -1 to disable automatic commit based on document insert times. If documents are inserted permanently then also a commit would happen permanently whenever the commitWithinMs time is reached. This would conflict with the regular autocommit of 10 minutes and the new softcommit strategy.	2013-01-23 14:40:58 +01:00
reger	3897bb4409	added (manual) urldb migration (link on: Index Administration -> Federated Solr Index) - migrates all entries in old urldb Metadata coordinate (lat / lon) NumberFormatException still relatively often (see excerpt below), - added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format) - removed possible type conversion for lat() / lon() comparison with 0.0f, changed to 0.0 (leaving it to the compiler/optimizer to choose number format) current log excerpt for NumberFormatException: W 2013/01/14 00:10:07 StackTrace For input string: "-" java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152) ... Caused by: java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152)	2013-01-14 03:06:24 +01:00
reger	3b6e08b49f	prevent checking of urldb if empty - disconnect urlIndexFile if empty - add missing lock class in submenuSearchConfiguration	2013-01-12 15:20:23 +01:00
Michael Peter Christen	38d3feae65	added separate delete commands for the local+remote solr index, the old metadata and old rwi and for the citation index. The important advancement is the separation of the citation index deletion because that index is responsible for the linkdepth calculation. Now a search index can be deleted without the citation index and that should cause that less clickdepths must be post-processed.	2013-01-04 16:39:34 +01:00
Michael Peter Christen	6f0baaa309	added the clickdepth post-processing: some links may have 'shortcuts' to already calculated click depths. These are then calculated if the crawl buffer is empty and therefore no new 'shortcuts' can be discovered. The status of the clickdepth stack (to-be-processed) can be seen using a solr search command like this: http://localhost:8090/solr/select?q=process_sxt:[%20TO%20]&start=0&rows=30&fl=sku,clickdepth_i,process_sxt	2013-01-04 16:37:39 +01:00
Michael Peter Christen	0f5b6f38c1	enhanced root-url detection	2013-01-03 19:21:21 +01:00
Michael Peter Christen	5c0c56cfe1	Preparations to produce a click depth attribute in the search index. This attribute can be used for ranking and for other purposes (demand by customer) The click depth is computed in two steps: - during indexing the current fill-state of the reverse link index is used to backtrack the current page to the root page. The length of that backtrack is the clickdepth. But this does not discover the shortest click depth. To get this, a second process to check again is needed - added a process tag that can be used to do operations on the existing index after a crawl; e.g. calculating the shortest clickpath. Added a field to control this operation but not a method to operate on this. - added a visualization of the clickpath length in the host browser	2013-01-02 20:55:43 +01:00
reger	4987caf1c9	- apply fix for localhost handling (from yacy2solr) also to metadata2solr	2012-12-23 01:30:52 +01:00
Michael Peter Christen	2a4c064c89	using the publisher information for the author field if no author is given. This applies to cases where only the copyright field in the html header is filled but not the author field	2012-12-19 01:54:35 +01:00
Michael Peter Christen	eac9650b31	added another solr field clickdepth_i which reflects the number of clicks which are necessary to get from the portal of a host to a specific document. At this time, only the start document is flagged with clickdepth '0', all other with '-1'. To get the actual clickdepth, a process must use crawled information to collect the actual number of clicks. This will be added in another/next step.	2012-12-18 17:20:42 +01:00
Michael Peter Christen	1052263af3	- added a new solr field references_i which stores the number of INCOMING links to the corresponding web page. This information is taken from the reverse link index (a 'little sister' of the RWI index). - this field can be of use to enhance the ranking because a web page with more incoming links can be more important than others. But this is not true for typical link pages like menus. Therefore the number of outgoing links is needed. - added a new solr attribute 'bf' to solr queries which is a boost function extension. this field can contain a formula which computes the boost according to given field values. After some experiments the following formula is now default: div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4 This takes the number of references and the inbound links. Further experiments are needed to enhance that formula.	2012-12-18 14:42:35 +01:00
Michael Peter Christen	34f8786508	removed dependency of vocabulary navigation from Jena and its triplestore; the vocabulary search is now done using generic solr fields which are created on-the-fly during runtime.	2012-12-18 02:29:03 +01:00
Michael Peter Christen	fb0fa9a102	- fixed 'delete from subpath' during crawl start which deleted nothing; now works; - changed some crawl start html design details	2012-12-11 13:38:28 +01:00
orbiter	a4a780b871	- fix for bad url conversion in bookmarks when using smb urls - fix for localhost hosts in solr schema host handling	2012-12-10 07:22:42 +01:00
Michael Peter Christen	72f165d58b	added a Boost class which stores solr query boost values. The class can be configured using the yacy.init file. The boost information is taken from the configuration each time when a query to solr is done.	2012-12-02 16:54:29 +01:00
Michael Peter Christen	8fc3679c66	using more pre-compile pattern for split methods	2012-11-26 13:11:55 +01:00
Michael Peter Christen	b7004043ea	- added a field cache for solr queries which call only for a single value - fixed a version conflict exception within a solr add request	2012-11-24 22:30:05 +01:00
Michael Peter Christen	efd2c4622d	added a new fail type attribute for the index to distinguish two separate fail types: network fail and forced exclusion (i.e. by robots or forwarding rules).	2012-11-23 14:00:30 +01:00
Michael Peter Christen	4eab3aae60	removed overhead by preventing generation of full search results when only the url is requested	2012-11-23 01:35:28 +01:00
Michael Peter Christen	d6b82840f8	added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignature. As a result, a signature of the document is written to the solr search index. Additionally for each time when a signature is written, it is checked if the signature exists already in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because here the first time a long number is used as value in the Solr store, also the value naming in the YaCySchema had to be adapted and normalized. This caused that many files had to be changed.	2012-11-21 18:46:49 +01:00
Michael Peter Christen	f5ca5cea44	- added field options to all solr queries. This can be used to restrict the actual data which is fetched from solr. - used the new field options to reduce generic options like getting the load date or the count of search results. should increase overall speed - used the new field options to reduce overhead in the host browser during acquisition of links. - used the field options to make checking of links in crawler faster - if the crawler is paused, the crawl queue is not cleaned	2012-11-19 17:24:34 +01:00
orbiter	5dfd6359cb	redesign of the QueryParams class: introduced QueryGoal which holds the query string parser. This shall be used to create a proper full-string matching which is handled then by QueryGoal.	2012-11-18 01:22:41 +01:00
Michael Peter Christen	5fd3b93661	added deletion of hosts during crawl start if deleteold option was given	2012-11-13 16:54:28 +01:00
Michael Peter Christen	842faf96a2	fixed media search	2012-11-07 17:27:13 +01:00
Michael Peter Christen	93001586a0	removed warnings, removed too-fast pausing of crawls	2012-11-07 15:37:14 +01:00
Michael Peter Christen	12c0db20e5	fixed npe for surrogate import	2012-11-07 02:46:51 +01:00
Michael Peter Christen	52df6ee369	more logging	2012-11-07 02:04:08 +01:00
Michael Peter Christen	15d1460b40	added information about the reason of pausing of crawls	2012-11-06 15:21:56 +01:00
Michael Peter Christen	2371ef031c	added solr faceted search support to YaCy search results added solr highlighting / YaCy snippets to YaCy search results - facets are now much more complete - facets are computed and searched much faster - snippet computation is done by solr if solr knows the snippet	2012-11-06 14:32:08 +01:00
Michael Peter Christen	d481abd087	added the visualization of error-urls to host browser - only visible for admins - a faceted search generates a huge list for all hosts in the host list - the faceted search algorithms had to be modified for that - within the browsing of the directory path, the error cause is written to the url which is presented as error-url - the errors are also accumulated for directory sums	2012-11-06 00:29:37 +01:00
Michael Peter Christen	97f82994a6	automatically pause the crawler if there is a problem with solr	2012-11-05 16:34:42 +01:00
Michael Peter Christen	8fb370d9f8	renovated the way search results are counted. should be correct now...	2012-11-05 03:19:28 +01:00
orbiter	354ef8000d	- added 'deleteold' option to crawler which causes that documents are deleted which are selected by a crawl filter (host or subpath) - site crawl uses this option by default now - made option to deleteDomain() concurrency	2012-11-04 02:58:26 +01:00
Michael Peter Christen	75dd706e1b	update to HostBrowser: - time-out after 3 seconds to speed up display (may be incomplete) - showing also all links from the balancer queue in the host list (after the '/') and in the result browser view with tag 'loading'	2012-11-02 13:57:43 +01:00
Michael Peter Christen	e2c4c3c7d3	migration to solr 4.0.0	2012-11-02 12:29:48 +01:00
Michael Peter Christen	9330ad4838	- fixed the delete option in host browser - added a delete method which can be used to delete a full subpath in solr.	2012-11-02 01:22:31 +01:00
Michael Peter Christen	6629e37685	tried to clean up the search process mess	2012-11-01 17:16:43 +01:00
Michael Peter Christen	f8f05ecba7	- added a delete button in host browser to delete a complete subpath - removed storage of default collection name - default is now "user" - made stacking of crawl start points concurrent	2012-10-31 17:44:45 +01:00
Michael Peter Christen	c326aa8f67	disabled writing new entries to crawl stacks to prevent that a domain with many documents block refreshing of the crawl queue	2012-10-29 22:26:52 +01:00
Michael Peter Christen	6905182d41	- fix for number of words log message - adding meta:refresh also to crawler stack	2012-10-29 21:42:31 +01:00
Michael Peter Christen	799d71bc67	enhanced solr caching: - increased cache size which is needed for longer solr commit time - speed hacks on cache write code	2012-10-28 20:31:29 +01:00
Michael Peter Christen	8e1248ffe3	force a commit in advance of a search for the administrator to get most recent results even if commit time is high and an indexing is ongoing.	2012-10-26 15:35:42 +02:00
Michael Peter Christen	3b48c78190	added an option to force a commit to solr. may be used by a search front-end in case that the commitWithinMs time is too short to get recently indexed documents.	2012-10-26 07:39:07 +02:00
Michael Peter Christen	ce0e5b1e17	- more refactoring / private methods - fix for usage of custom solr field names	2012-10-18 15:09:04 +02:00
Michael Peter Christen	ccc3760a47	Refactoring and redesign of data architecture to make URIMetadataRow superfluous. The target is to make a solr document as the core of YaCy documents which would cause that many conversions can be removed. On the way to this target the Equivalence of URIMetadataRow and URIMetadataNode had to be removed to expose the usage of the old URIMetadataRow data structure. This refactoring already removes unnecessary conversions and should make memory usage during indexing lower.	2012-10-18 14:29:11 +02:00
Michael Peter Christen	e5b3c172ff	removed hack which translated Solr documents to virtual RWI entries which had been then mixed with remote RWIs. Now these Solr documents are fed into the result set as they appear during local and remote search. That makes the search much faster.	2012-10-17 17:45:41 +02:00
Michael Peter Christen	5d16c23a1f	specified more URIMetadata as URIMetadataNode	2012-10-16 18:26:21 +02:00
Michael Peter Christen	43f3345c90	- removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code	2012-10-16 18:11:57 +02:00
Michael Peter Christen	cc98496ff3	enhanced the HostBrowser: - showing also outbound links to other domains if there are any - the outbound links browser shows also the link structure image - showing even inbound links if the web structure graph has information about that - removed the left menu and made the HostBrowser a part of the top menu for search - moved the file search also to the top menu - added hover information in the HostBrowser to explain what the click means - because the HostBrowser also links to the Metadata viewer ViewFile, there should be a button to switch back to the HostBrowser: added that also.	2012-10-16 17:13:18 +02:00
Michael Peter Christen	21fe8339b4	- enhanced generation of url objects - enhanced computation of link structure graphics - enhanced collection of data for link structures	2012-10-15 13:17:13 +02:00
Michael Peter Christen	1b02408936	use less cache	2012-10-11 14:32:37 +02:00
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	2012-10-10 11:46:22 +02:00
Michael Peter Christen	7e3e45fd04	added Open Graph Metadata default fields, see http://ogp.me/ns#	2012-10-09 17:28:48 +02:00
Michael Peter Christen	c3e5f667a7	added schema.org breadcrumb counter to parser and solr schema	2012-10-09 13:02:43 +02:00
Michael Peter Christen	bd769de604	since the solr index is now used for all pages that are indexed locally, there is no need for the RWI index if the index is not transferred to another peer. Therefore the creation of RWI index data is now suppressed if DHT is disabled. This applies for all intranet and portal mode configurations, but not for public robinson modes. A robinson may switch back to public mode and then transmit its data. That means if someone never wants to switch to DHT mode, it would be more appropriate to choose the portal mode.	2012-10-09 11:48:55 +02:00
Michael Peter Christen	f8a3ab2d82	added the usage of synonyms to the GSA search interface	2012-10-02 14:29:45 +02:00
Michael Peter Christen	3d33a5bdf6	turned the synonyms_t Text field into a multi-valued String field synonyms_sxt	2012-10-02 11:13:06 +02:00
Michael Peter Christen	3b959ee002	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-10-02 10:14:09 +02:00
orbiter	3190347814	added a synonyms_t field to solr and a process to read synonym files. This can be used to add another stemming to solr using stemming files that are expressed as synonyms for grammatical alternatives. The synonym/stemming files must have the following form: - each line is a comma-separated list of synonyms - the list of synonyms may be enclosed with {} (like the GSA synonyms file) - the file may contain comments which are lines starting with a '#' The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and are activated by default whenever a synonym file is in place. Then, for each word that is found in a document all synonyms are added to a long text field which is stored into synonyms_t. Processes using the synonyms must query with that field as optional matcher.	2012-10-02 00:02:50 +02:00
Michael Peter Christen	411d0e839b	added an underline text field to solr to record all underlined texts	2012-10-01 14:16:49 +02:00
Michael Peter Christen	c4a3d8870f	fixed computation of links in host browser which are not indexed but known by the crawler. Such links are now displayed in grey color.	2012-09-29 02:13:11 +02:00
Michael Peter Christen	24d2ee3c52	- better date ranking - more protection against NPE and time travel effects	2012-09-26 18:36:32 +02:00
Michael Peter Christen	ca313e404f	- if a "/date" modifier is used, the solr remote query applies an ordering by date (ascending) - added also some 'anti-timetravel' protection (check if date is in the future within any metadata date field)	2012-09-26 16:56:33 +02:00
Michael Peter Christen	a4214694df	We assert that no other metadata storage than solr is used now. Therefore a property like solrConnected() must be true all the time. Removal of this method causes removal of all write operations to the old metadata index.	2012-09-26 16:05:11 +02:00
Michael Peter Christen	562183932b	- removed ip_s from default profile since that needs a DNS lookup to create a document entry. This makes remote search much slower. - removed synchronization of add method if ip_s is activated to prevent that a user configuration causes bad behavior. The disadvantage of that is that an index dump can cause data loss if an indexing is running during index dump - caught more exceptions and more NPE - better abstraction in MirrorSolrConnector - slight performance enhancement when only the index count is requested (rows=0 is sufficient to get a total count)	2012-09-26 13:38:04 +02:00
Michael Peter Christen	1533bfd63b	refactoring	2012-09-25 21:20:03 +02:00
Michael Peter Christen	872f83ebe0	refactoring	2012-09-25 21:04:58 +02:00
Michael Peter Christen	fb9460f0a8	using the search filter to drill down search to file types. A search like "mp3 filetype:mp3" will now maybe surprise you.	2012-09-25 17:52:33 +02:00
Michael Peter Christen	15ea053c3a	- added xml output in IndexControlURLs to get the storage page of index dump commands - adjusted the apicall.sh script to get the downloaded text as output to stdout which is necessary to parse the content out of it - added indexdump.sh script which creates a solr dump and prints out the storage path for the index dump - added synchronization to the Fulltext class to prevent that data is stored to a non-existing solr index while this index is disabled during the storage of the dump	2012-09-25 00:19:52 +02:00
Michael Peter Christen	1b474139dd	used the new zip writer/reader to add a solr dump process: the whole solr index can be written to a zip dump and also restored during runtime	2012-09-24 17:05:28 +02:00
Michael Peter Christen	8219a445f3	refactoring	2012-09-21 16:46:57 +02:00
Michael Peter Christen	00c1c777fa	refactoring	2012-09-21 15:48:16 +02:00
orbiter	563d584420	removed more dependencies in cora from kelondro	2012-09-21 11:02:36 +02:00
Michael Peter Christen	62add1d564	added the protocol and the file name extension to the solr fields since these fields are probably facets in file search	2012-09-11 22:46:39 +02:00
Michael Peter Christen	9db032664e	activate two solr fields which will be used by administration interface (later)	2012-09-11 20:15:54 +02:00
Michael Peter Christen	4634f0e626	fix for images_withalt	2012-09-10 12:30:03 +02:00
Michael Peter Christen	10b911eed4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-09-07 22:07:02 +02:00
Michael Peter Christen	be67c70a47	added Solr fields: inboundlinks_text_chars_val inboundlinks_text_words_val inboundlinks_alttag_txt outboundlinks_text_chars_val outboundlinks_text_words_val outboundlinks_alttag_txt	2012-09-07 22:06:51 +02:00
orbiter	d73fff0e0e	added solr field images_withalt_i	2012-09-07 21:33:45 +02:00
sixcooler	e78fe3f477	also do a clearcache on the solr-connector-caches	2012-09-06 22:07:07 +02:00
Michael Peter Christen	d8425e6809	added collections to crawl monitor	2012-09-04 14:47:53 +02:00
Michael Peter Christen	ee23fc7a32	added h1..h6 counter fields	2012-09-04 14:11:11 +02:00
Michael Peter Christen	b2b516cc3e	added a collection attribute to crawls and searches: - a solr field collection_sxt can be used to store a set of crawl tags - when this field is activated, a crawl tag can be assigned when crawls are started - the content of the collection field can be comma-separated, all of them are assigned to the documents when they are indexed as result of such a crawl start - a search result can be drilled down to a specific collection; this is currently only available in the solr interface and also in the gsa interface using the 'site' option - this adds a mandatory field for gsa queries (the google api demands that field all the time)	2012-09-03 15:26:08 +02:00
Michael Peter Christen	f75b3f8a47	added more patches to work without RWI data structure	2012-08-31 14:35:56 +02:00
Michael Peter Christen	31d4d38804	- extended the solr interface by a references-by-word-count method - reduced danger that a non-existing RWI database causes NPEs - added Solr queries to did-you-mean: this makes it possible that our did-you-mean algorithm works together with only Solr and without RWIs	2012-08-31 13:03:00 +02:00
Michael Peter Christen	528d6763fa	- added new solr fields: title_count_i, title_chars_val, title_words_val description_count_i, description_chars_val, description_words_val - added many asserts to ensure data type correctness from YaCy to Solr and vice versa - made many fixes according to new findings from these asserts (!)	2012-08-31 10:30:43 +02:00
Michael Peter Christen	2ddc33646a	added new field for solr: url_paths_sxt url_parameter_i url_parameter_key_sxt url_parameter_value_sxt url_chars_i	2012-08-29 16:11:23 +02:00
Michael Peter Christen	316b5fe116	- added a solr type definition verifier - fixed type definition found by the verifier - added multivalue-string fields for solr with extension 'sxt' - added multivalue-integer fields for solr with extension 'val' - renamed some solr attributes from txt to sxt - changed solr query line to an explicit AND/OR structure - added a country code second level domain list to Domains class; with parser - added a host string parser to get domain class name, country-code second-level domain and subdomain out of it - removed old coordinate attributes	2012-08-28 16:58:06 +02:00
Michael Peter Christen	e8acd542b5	- added faceted drill-down for host and geolocation to solr queries - added a new geolocation field to index schema, the old values are migrated if possible	2012-08-27 14:41:33 +02:00
orbiter	29171e2f6c	fixed generation of ontologies from index enumerations	2012-08-24 14:13:42 +02:00
orbiter	01a63ef595	redesign of YaCySchema and SolrDoc handling	2012-08-23 09:51:45 +02:00
orbiter	479bfca571	refactoring	2012-08-23 09:30:11 +02:00
Michael Peter Christen	4716546ef5	- reduced memory usage in index transmission using a transformation of Node to Row objects - removed peerDeparture in solr remote search in case that peer does not answer (this may be normal because it is allowed to switch this off)	2012-08-22 16:30:33 +02:00
orbiter	716ea0cfe2	sorted the solr schema into mandatory and optional fields; reduced number of used fields to reduce solr index size	2012-08-21 23:52:56 +02:00
orbiter	9b8c8c0f47	fix from gaston in http://forum.yacy-websuche.de/viewtopic.php?p=26909#p26909	2012-08-21 21:03:26 +02:00
orbiter	d7ea45f698	- get nice text_t values from metadata conversions that are stored into solr as fulltext search index. - added slow migration from old metadata to solr index entries: each entry from the old metadata is removed from that data structure and written into solr.	2012-08-18 19:36:21 +02:00
orbiter	780f8974e7	added remaining iteration methods for solr in fulltext class	2012-08-18 15:39:14 +02:00
orbiter	ee01c12e56	fixes for putDocument and putMetadata	2012-08-18 13:05:27 +02:00
orbiter	cc47a0876e	reverted `bf55f69176` to have a fall-back option in case that memory problems as reported in http://forum.yacy-websuche.de/viewtopic.php?p=26901#p26901 for full-solr installation are too strong and we have to work with a 'small memory footprint' peer system.	2012-08-18 10:28:40 +02:00
Michael Peter Christen	0cab06c47c	refactoring	2012-08-17 15:52:33 +02:00
Michael Peter Christen	bf55f69176	removed write methods to old metadata file type; all metadata now goes to solr	2012-08-17 15:46:26 +02:00
Michael Peter Christen	40c0856489	refactoring	2012-08-17 15:33:02 +02:00
Michael Peter Christen	06a78eecb7	code simplification	2012-08-17 14:43:32 +02:00
Michael Peter Christen	18f989dfb1	- refactoring (load -> getMetadata) - added getDocument to retrieve Solr documents which shall replace getMetadata	2012-08-17 01:34:38 +02:00
Michael Peter Christen	e5ef840f40	- renamed DoubleSolrConnector to MirrorSolrConnector and added a hit/miss/document cache to the MirrorSolrConnector. - more abstraction to SolrDocument in Connector interface - bugfixes in Solr field reader	2012-08-13 13:32:32 +02:00
Michael Peter Christen	b51df6c7e8	- added coordinate storage in solr schema - fixed shutdown process - fixed some solr-to-metadata reading - added a large number of metadata attributes in ViewFile.html	2012-08-13 10:40:04 +02:00
Michael Peter Christen	bd4f03bc85	removed unused class	2012-08-11 01:05:40 +02:00
orbiter	e816b88b55	changed behaviour of metadata storage: in case that any solr is attached, the metadata is not written to the metadata-db, even if it is enabled but instead to solr. This prevents that metadata is written in two store systems at the same time. It is also the next step to migrate the current metadata-db to solr.	2012-08-10 15:39:10 +02:00
orbiter	2571e0d47a	removed unused classes	2012-08-10 14:47:44 +02:00
Michael Peter Christen	f9c0e6e950	- Implemented and integrated the URIMetadataNode object which is a metadata representation from the solr index. This shall replace metadata from the built-in database in the future. - added the Solr-driven metadata into the search index of YaCy which makes it now possible to run YaCy without the old metadata index. This is a major step forward to a full migration to Solr.	2012-08-10 13:26:51 +02:00
Michael Peter Christen	136fcb1ad9	refactoring	2012-08-10 06:47:13 +02:00
Michael Peter Christen	bca4a16603	replaced the multivalue generic string field name suffix _ss by _txt because _ss is not part of the standard solr example schema.	2012-08-06 17:58:09 +02:00

320 Commits