yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	58fe986cca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-07-30 12:49:14 +02:00
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	2013-07-30 12:48:57 +02:00
reger	f2d99053ed	Field Re-Indexing: prevent endless error loop in ReindexSolrBusyThread on Solr exception (by skipping query causing the exception) (occured during testing while working on q=store:[* TO *])	2013-07-29 01:32:02 +02:00
Michael Peter Christen	c3b2301b2f	fix for http://bugs.yacy.net/view.php?id=268	2013-07-25 09:21:37 +02:00
orbiter	056b42f5aa	- added information about segment count to status_p.xml - also moved this information from the old index structure, which is still in use for the RWI/DHT index to that front-end	2013-07-23 18:03:33 +02:00
orbiter	6fb2811e68	fixes for problems with remote solr and non-activated webgraph index	2013-07-23 16:46:44 +02:00
orbiter	c124037f19	removed forced non-soft commits to prevent index fragmentation	2013-07-22 17:28:20 +02:00
Roland Haeder	be0ff6018f	Removed trailing spaces + some more final	2013-07-17 18:44:24 +02:00
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	2013-07-17 18:31:30 +02:00
Michael Peter Christen	1fd006cc56	fixes using the embedded connector	2013-07-17 12:41:54 +02:00
orbiter	5533fc8e01	fix for bug 260	2013-07-14 17:40:28 +02:00
Michael Peter Christen	bcc623a843	refactoring of load_delay: this is a matter of client identification	2013-07-12 16:24:56 +02:00
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based logger tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is a add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	2013-07-09 14:28:25 +02:00
Michael Peter Christen	203921006a	redesign of citation index storage	2013-06-30 02:11:46 +02:00
sixcooler	e5abccdfe4	added optimize-option	2013-06-28 14:51:37 +02:00
Michael Peter Christen	570511f3c8	removed fields references_internal_id_sxt and references_internal_url_sxt because they had been shown to be superfluous. The citation of referrer in the host browser is possible without them. Therefore now the host browser does not only show internal, but also external referrer to each link.	2013-06-13 13:01:28 +02:00
Michael Peter Christen	ffc570f95f	removed forced soft commit since this may be the cause for a performance problem	2013-06-11 14:51:26 +02:00
Michael Peter Christen	f7e77a21bf	Added a citation reference computation for intra-domain link structures. While the values for the reference evaluation are computed, also a backlink-structure can be discovered and written to the index as well. The host browser has been extended to show such backlinks to each presented links. The host browser therefore can now show an information where an document is linked. The new citation reference is computed as likelyhood for a random click path with recursive usage of previously computed likelyhood. This process is repeated until the likelyhood converges to a specific number. This number is then normalized to a ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to rank popularity within intra-domain link structures.	2013-06-07 13:20:57 +02:00
Michael Peter Christen	9fc0c4df98	fix for bad exists 'enhancement'; see bug: http://bugs.yacy.net/view.php?id=245	2013-06-02 13:50:12 +02:00
reger	8a7fcb391d	enable use of solrcore.properties for property substitution of solrconfig.xml - move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties - add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties reason: on 32bit MMapDirectoryFactory may fail with..... Caused by: java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849) at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)	2013-06-01 05:43:08 +02:00
Michael Peter Christen	164603b946	cleanup	2013-05-30 12:47:22 +02:00
Michael Peter Christen	a1644ca0fd	new workflow processor in Segment to enqueue indexing documents to solr	2013-05-30 12:34:53 +02:00
Michael Peter Christen	0c1a018bbd	removed 'later' tactic because it used too much RAM, reduced number of soft commits, reduced caching size of search events, ensured that solr results are processed before connection is closed to keep that stuff not too long in RAM	2013-05-29 18:27:27 +02:00
Michael Peter Christen	709e9b8ce7	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-05-29 13:49:42 +02:00
Michael Peter Christen	281959a2d7	added option to re-boot the embedded solr during run-time. Added also API recording for this method so it can be repeated automatically. The index dump generation is now also available for API recording. Added some synchronization in backend which was necessary for this.	2013-05-29 13:09:34 +02:00
orbiter	da621e827e	prevent NPE in case RWI is disabled	2013-05-28 16:26:38 +02:00
Michael Peter Christen	2b563debbf	javadoc of new multiple-exist test	2013-05-27 13:45:09 +02:00
Michael Peter Christen	8f2d3ce2f9	reduced locking situation in crawler: shifted synchronized location and reduced time-out of robots.txt load limit	2013-05-20 22:05:28 +02:00
Michael Peter Christen	b68fbe7d21	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/migration.java	2013-05-17 14:13:07 +02:00
Michael Peter Christen	8dbc80da70	redesign of index.exist-test: this shall now not be done using a single id to be tested, but with a collection of ids. This will cause only a single call to solr instead of many. The result is a much better performace when testing the existence of many urls. The effect should cause very much less IO during index transmission, both on sender and receiver side.	2013-05-17 13:59:37 +02:00
reger	7f63d3747d	more generic field selection for reindex option of documents with disabled fields using Luke request to compare config with actual fields in index	2013-05-15 23:16:32 +02:00
reger	79401cb938	added reindex option for documents with disabled or obsolete fields to Solr Schema Editor page (IndexSchema_p.html) this allows to remove obsolete fields from the index (according to current schema config) by selecting all documents containig disabled fields.	2013-05-13 04:06:57 +02:00
Michael Peter Christen	b24d1d18e4	removed synchronization and concurrency in Fulltext class, concurrent deletions are now handled in ConcurrentUpdateSolrConnector	2013-05-11 10:53:12 +02:00
Michael Peter Christen	ad050ec88d	- upgraded httpclient, httpcore and httpmime - removed httpclient 3.1 which has been used by solrj < 4.x.x and is now not used any more - fixed some parts in YaCy which used methods from httpclient 3.1	2013-05-09 00:22:45 +02:00
Michael Peter Christen	e26bdd4a52	fixes to deletion methods (removed unnecessary concurrency and added removal of crawl queue entries)	2013-05-08 13:26:25 +02:00
Michael Peter Christen	f7f3e28c5e	prevent that the size of the index is computed too many times. Because the index size is now provided by solr, and the only way to do that is a match for [* TO *], a size computation is quite complex and time-consuming. Therefore this patch prevents that the method is called at all and if necessary puts a DOS-preventing barrier in front of it.	2013-05-08 11:50:46 +02:00
Michael Peter Christen	cca19d94d4	re-declared some fields to be of type string rather than text which makes them more efficient and less large	2013-05-06 16:45:54 +02:00
Michael Peter Christen	3841854c97	abstraction of catchall term	2013-05-04 00:14:22 +02:00
orbiter	7de5b9cfa0	fix for http://bugs.yacy.net/view.php?id=233 - check geolocation coordinates and accept only those, which are well-formed - the solr push process does not stop crawling any more if after 20 requests to Solr Solr does not accept the record. Instead, a severe log entry asks the user to create a bug request	2013-05-03 00:24:39 +02:00
Michael Peter Christen	f36a7da5f6	- re-introduced existById in solr connector. - intruduced raw-queries for the re-introduced byId-Queries (they are hopefully faster than full edismax queries) - removed the cached solr connector (testing this) to rely only on the solr built-in search caches. That should save some RAM (also). We will see if this is usable.	2013-04-28 21:20:14 +02:00
Michael Peter Christen	3502b4c697	refactoring (renaming) of yacy-solr api	2013-04-27 01:32:18 +02:00
Michael Peter Christen	3a0fcfbeda	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-04-26 10:50:08 +02:00
orbiter	e1bfe9d07a	- reduction of the concurrently running processes to make YaCy more adjusted to smaller and 1-core devices. - the workflow processor now starts no process at all. these are started as soon as parser/condenser/indexing queues are filled. - better abstraction	2013-04-25 11:33:17 +02:00
Michael Peter Christen	c091000165	added collection attribute also to the rss feed reader	2013-04-24 01:14:35 +02:00
Michael Peter Christen	d937c55204	extended limitation of dom export size from 100000 to 100000000	2013-04-22 22:33:13 +02:00
Michael Peter Christen	566d6c980c	checking of document signature for a double-document check now refers only to documents within the same domain	2013-04-17 16:15:27 +02:00
Michael Peter Christen	7ab5093321	added new solr title_exact_signature_l and description_exact_signature_l to be able to identify unique title and unique description fields.	2013-04-16 01:35:15 +02:00
Michael Peter Christen	f24ac518e6	redesign of exists()-query (can now be called with query) and the CachedSolrConnector which based its cache on the key value. This will be used to correct the title_unique_b and description_unique_b field.	2013-04-15 14:08:30 +02:00
Michael Peter Christen	27d6222880	added new field host_extent_i which, after a crawl and postprocessing, holds the number of documents for the host where the document is hosted. This is necessary for ranking and the norming of references per local host in the ranking computation.	2013-04-14 20:52:40 +02:00
reger	518b20147c	skip postprocessing during document.store if no citation index connected (prevent null pointer exception)	2013-04-14 02:01:27 +02:00
Michael Peter Christen	ada3f27de7	added three new field for a better ranking: references_internal_i, references_external_i and references_exthosts_i. These can be used to count and evaluate the number of external links to every web page. An experimental ranking function can be i.e.: div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))	2013-04-12 16:17:14 +02:00
Michael Peter Christen	edc0b33f6d	- showing references count and clickdepth in host browser - fixed generation and presentation of both values	2013-04-11 14:46:13 +02:00
orbiter	940c6849ee	enhanced did-you-mean (a bit): can now remember previously searched words (plus small enhancements)	2013-03-17 10:52:31 +01:00
Michael Peter Christen	6300730d7f	refactoring of clickdepth computation as preparation for clickdepth computation of webgraph links	2013-03-14 12:13:02 +01:00
orbiter	47114910d5	fix for possible memory leaks	2013-03-13 17:55:37 +01:00
Michael Peter Christen	25300913fa	fixes to search debugging after testing with the different search debugging options	2013-03-05 21:28:22 +01:00
Michael Peter Christen	c2fde018b5	concurrent snippet fetching from solr results which do not have snippets	2013-03-05 12:24:01 +01:00
Michael Peter Christen	2b6c79d347	in method exists() also use the new caching-stacks for documents/metadata	2013-03-04 01:13:17 +01:00
Michael Peter Christen	0d7b4bc891	better protection against OOM during search flush and fixed missing result push	2013-03-03 23:45:47 +01:00
Michael Peter Christen	221ed7d764	- enhanced concurrency during search without IO blocking - introduced a second queue to flush remote search results (now: old metadata structure from DHT peers) - fixed result counters	2013-03-03 22:38:50 +01:00
Michael Peter Christen	3b1d9dc884	made index storage from DHT search result concurrently. This prevents blocking by high CPU usage during search. Also: removed query from Solr for DHT search results; results are taken from the pending queue.	2013-03-02 10:25:52 +01:00
orbiter	f13c0b2abd	fix for search	2013-03-01 19:18:16 +01:00
orbiter	0f7ea7ad9f	- enhanced solr.add procedure for mass adds - removed unused solr access classes - made snippet generation for documents aus YaCy RWI/DHT concurrent (as it was before the search process removation) - reduced the number of remote results in settings file because the processing of such mass documents add is too CPU-intensive (in Solr)	2013-03-01 15:27:17 +01:00
orbiter	d74472f562	corrected result counter	2013-02-27 22:40:23 +01:00
Michael Peter Christen	c95a84103a	complete redesign of search process: - removed 'worker' processes - no internal time-out behaviour: methods either are successful or return null - waiting is only done on top-level - removed snippet-production; this is replaced by solr snippets - removed statistics based on solr size queries (they had been VERY long); the statistics (like suggestions or tag cloud) are now again based on the old but very fast RWI index. In portal or intranet mode the RWI index is usually switched off; if you like to have statistics again then you must switch on the rwis again in this mode. - fixed many bugs regarding correct page counter	2013-02-26 17:16:31 +01:00
Michael Peter Christen	089dee1770	- generalized SchemaConfiguration into super-class Configuration and adopted other classes which used the configuration-only access for that class - removed many warnings - adjusted logging	2013-02-25 00:09:41 +01:00
Michael Peter Christen	c16de49f64	fix for webgraph delete query	2013-02-24 18:17:58 +01:00
Michael Peter Christen	56d5946a59	- added flags in IndexFederated_p.html to switch on or off the webgraph index (new solr core webgraph) .. this is now off by default - completely redesigned this servlet - added description how to attach a remote solr - adjusted naming of servlet and menues - moved 'lazy initialization' attribut from IndexSchema to IndexFederated (this is a general option) back again.	2013-02-24 18:09:34 +01:00
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr field in the core 'webgraph'. The default schema uses only some of them and the resting search index has now the following properties: - webgraph size will have about 40 times as much entries as default index - the complete index size will increase and may be about the double size of current amount As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also avaiable for p2p operation but client access is not yet implemented.	2013-02-22 15:45:15 +01:00
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented to the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of an solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding ontop of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; the must retrieve the actuall connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	2013-02-21 13:23:55 +01:00
Michael Peter Christen	b6de1f42dc	Full redesign of solr connection architecture. This was done to support multiple solr cores instead of just one. Therefore it is now necessary to distuingish between solr server connections (called an 'Instance') and a connection to a single solr core. One Instance may now have multiple connector classes assigned to it, each connecting to a single core. To support multiple cores it is also necessary to distinguish between the connection configuration and the configuration of the index schema. We will have multiple schema configurations in the future, each for every solr core. This caused that the IndexFederated servlet had to be split into two parts, the new Servlet for the Schema editor is now in the IndexSchema Servlet.	2013-02-15 01:38:10 +01:00
Michael Peter Christen	4111606654	removed the commitWithin attribute because that is not the way how the index is updated the right way for us. May also be be superfluous with the solr 4.0 softcommit.	2013-02-13 02:29:47 +01:00
Michael Peter Christen	7806680ab8	fixed a problem with re-feeding of already indexed documents whith coordinates attached.	2013-02-08 12:45:54 +01:00
Michael Peter Christen	4323621a76	update to Solr 4.1.0	2013-02-04 10:55:49 +01:00
Michael Peter Christen	7dfcc92b71	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-01-31 13:15:42 +01:00
Michael Peter Christen	0b6566a389	optimizations when starting large crawl requests with many start urls in one request: - allow larger match-fields in html interface - delete all host hashes at once from zurl - when deleting by host, do not count size of deleted entries since that was the reason it took so long	2013-01-31 13:15:28 +01:00
orbiter	a2160054d7	ability to create vocabularies also without any objectspace: this iterates over all urls in the index do create terms	2013-01-30 19:33:48 +01:00
orbiter	ecc10a752c	fixes to index enumeration for vocabulary production	2013-01-29 18:14:14 +01:00
Michael Peter Christen	0fe7b6fd3b	migrated the index export methods from the old metadata to solr. Now exports are done using solr queries. removed superfluous methods and servlets.	2013-01-24 12:39:19 +01:00
Michael Peter Christen	1768c82010	removed field selection because that created documents with that field only which was not useful when re-writing the same document	2013-01-24 03:26:38 +01:00
Michael Peter Christen	4735bd47f4	- changed solr commit call and added an optimize option. Since Solr 4.0.0 there is a new softcommit feature which implements a near-real-time (NRT) search option. The softcommit does not do IO and does not cause performance issues. YaCy has now an extension in its solr connectors to use the softcommit feature. The softcommit call now replaces all places where a hard commit was used. Furthermore the commit strategy in when doing a search from the web interface was changed (it's done every time before a search is done). The softcommit feature was implemented because it was needed for the following changes (customer demands), which is also included in this git commit: - added a feature to identify all documents which have unique titles and/or unique descriptions. These unique flags are disabled by default. - added also a feature to set a flag when the url from a canonical tag is equal to the document url. This is also disabled by default. To support the new softcommit strategy, the commitWithinMs option was set to -1 do disable automatic commit based on document insert times. If documents are inserted permanently then also a commit would happen permanently whenever the commitWithinMs time is reached. This would conflict with the regular autocommit of 10 minutes and the new softcommit strategy.	2013-01-23 14:40:58 +01:00
reger	3897bb4409	added (manual) urldb migration (link on: Index Administraton -> Federated Solr Index) - migrates all entries in old urldb Metadata coordinate (lat / lon) NumberFormatException still relative often (see excerpt below), - added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format) - removed possible typ conversion for lat() / lon() comparison with 0.0f, changed to 0.0 (leaving it to the compiler/optimizer to choose number format) current log excerpt for NumberFormatException: W 2013/01/14 00:10:07 StackTrace For input string: "-" java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152) ... Caused by: java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152)	2013-01-14 03:06:24 +01:00
reger	3b6e08b49f	prevent checking of urldb if empty - disconnect urlIndexFile if empty - add missing lock class in submenuSearchConfiguration	2013-01-12 15:20:23 +01:00
Michael Peter Christen	38d3feae65	added separate delete commands for the local+remote solr index, the old metadata and old rwi and for the citation index. The important advancement is the separation of the citation index deletion because that index is responsible for the linkdepth calculation. Now a search index can be deleted without the citation index and that should cause that less clickdepths must be post-processed.	2013-01-04 16:39:34 +01:00
Michael Peter Christen	6f0baaa309	added the clickdepth post-processing: some links may have 'shortcuts' to already calculated click depths. There are then calculated if the crawl buffer is empty and therefore no new 'shortcuts' can be discovered. The status of the clickdepth stack (to-be-processed) can be seen using a solr search command like this: http://localhost:8090/solr/select?q=process_sxt:[%20TO%20]&start=0&rows=30&fl=sku,clickdepth_i,process_sxt	2013-01-04 16:37:39 +01:00
Michael Peter Christen	0f5b6f38c1	enhanced root-url detection	2013-01-03 19:21:21 +01:00
Michael Peter Christen	5c0c56cfe1	Preparations to produce a click depth attribute in the search index. This attribute can be used for ranking and for other purpose (demand by customer) The click depth is computed in two steps: - during indexing the current fill-state of the reverse link index is used to backtrack the current page to the root page. The length of that backtrack is the clickdepth. But this does not discover the shortest click depth. To get this, a second process to check again is needed - added a process tag that can be used to do operations on the existing index after a crawl; i.e. calculation the shortest clickpath. Added a field to control this operation but not a method to operate on this. - added a visualization of the clickpath length in the host browser	2013-01-02 20:55:43 +01:00
reger	4987caf1c9	- apply fix for localhost handling (from yacy2solr) also to metadata2solr	2012-12-23 01:30:52 +01:00
Michael Peter Christen	2a4c064c89	using the publisher information for the author field if no author is given. This applies to cases where only the copyright field in the html header is filled but not the author field	2012-12-19 01:54:35 +01:00
Michael Peter Christen	eac9650b31	added another solr field clickdepth_i which reflects the number of clicks which are necessary to get from the portal of a host to a specific document. At this time, only the start document is flagged with clickdepth '0', all other with '-1'. To get the actual clickdepth, a process must use crawled information to collect the actual number of clicks. This will be added in another/next step.	2012-12-18 17:20:42 +01:00
Michael Peter Christen	1052263af3	- added a new solr field references_i which stores the number of INCOMING links to the corresponding web page. This information is taken from the reverse link index (a 'little sister' of the RWI index). - this field can be of use to enhance the ranking because a web page with more incoming links can be more more important than others. But this is not true for typical link pages like menues. Therefore the number of outgoing links is needed. - added a new solr attribute 'bf' to solr queries which is a boost function extension. this field can contain a formula which comuptes the boost according to given field values. After some experiments the following forumla is now default: div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4 This takes the number of references and the inbound links. Further experiments are needed to enhance that forumula.	2012-12-18 14:42:35 +01:00
Michael Peter Christen	34f8786508	removed dependency of vocabulary navigation from Jena and it's triplestore; the vocabulary search is now done using generic solr fields which are created on-the-fly during runtime.	2012-12-18 02:29:03 +01:00
Michael Peter Christen	fb0fa9a102	- fixed 'delete from subpath' during crawl start which deleted nothing; now works; - changed some crawl start html design details	2012-12-11 13:38:28 +01:00
orbiter	a4a780b871	- fix for bad url conversion in bookmarks when using smb urls - fix for localhost hosts in solr schema host handling	2012-12-10 07:22:42 +01:00
Michael Peter Christen	72f165d58b	added a Boost class which stores solr query boost values. The class can be configured using the yacy.init file. The boost information is taken from the configuration each time when a query to solr is done.	2012-12-02 16:54:29 +01:00
Michael Peter Christen	8fc3679c66	using more pre-compile pattern for split methods	2012-11-26 13:11:55 +01:00
Michael Peter Christen	b7004043ea	- added a field cache for solr queries which call only for a single value - fixed a version conflict exception within a solr add request	2012-11-24 22:30:05 +01:00
Michael Peter Christen	efd2c4622d	added a new fail type attribute for the index to distinguish two separate fail types: network fail and forced exclusion (i.e. by robots or forwarding rules).	2012-11-23 14:00:30 +01:00
Michael Peter Christen	4eab3aae60	removed overhead by preventing generation of full search results when only the url is requested	2012-11-23 01:35:28 +01:00
Michael Peter Christen	d6b82840f8	added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignatue. As a result, a signature of the document is written to the solr search index. Additionally for each time when a signature is written, it is checked if the singature exists already in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because here the first time a long number is used as value in the Solr store, also the value naming in the YaCySchema had to be adopted and normalized. This caused that many files had to be changed.	2012-11-21 18:46:49 +01:00
Michael Peter Christen	f5ca5cea44	- added field options to all solr queries. This can be used to restrict the actual data which is fetched from solr. - used the new field options to reduce generic options like getting the load date or the count of search results. should increase overall speed - used the new field options to reduce overhead in the host browser during aquisition of links. - used the field options to make checking of links in crawler faster - if the crawler is paused, the crawl queue is not cleaned	2012-11-19 17:24:34 +01:00
orbiter	5dfd6359cb	redesign of the QueryParams class: introduced QueryGoal which holds the query string parser. This shall be used to create a proper full-string matching which is handled then by QueryGoal.	2012-11-18 01:22:41 +01:00
Michael Peter Christen	5fd3b93661	added deletion of hosts during crawl start if deleteold option was given	2012-11-13 16:54:28 +01:00
Michael Peter Christen	842faf96a2	fixed media search	2012-11-07 17:27:13 +01:00
Michael Peter Christen	93001586a0	removed warnings, removed too-fast pausing of crawls	2012-11-07 15:37:14 +01:00
Michael Peter Christen	12c0db20e5	fixed npe for surrogate import	2012-11-07 02:46:51 +01:00
Michael Peter Christen	52df6ee369	more logging	2012-11-07 02:04:08 +01:00
Michael Peter Christen	15d1460b40	added information about the reason of pausing of crawls	2012-11-06 15:21:56 +01:00
Michael Peter Christen	2371ef031c	added solr faceted search support to YaCy search results added solr highlighting / YaCy snippets to YaCy search results - facets are now much more complete - facets are computed and searched much faster - snippet computation is done by solr if solr knows the snippet	2012-11-06 14:32:08 +01:00
Michael Peter Christen	d481abd087	added the visualization of error-urls to host browser - only visible for admins - a faceted search generates a huge list for all hosts in the host list - the faceted search algorithms had to be modified for that - within the browsing of the directory path, the error cause is written to the url which is presented as error-url - the errors are also accumulated for directory sums	2012-11-06 00:29:37 +01:00
Michael Peter Christen	97f82994a6	automatically pause the crawler if there is a problem with solr	2012-11-05 16:34:42 +01:00
Michael Peter Christen	8fb370d9f8	renovated the way how search results are count. should be correct now...	2012-11-05 03:19:28 +01:00
orbiter	354ef8000d	- added 'deleteold' option to crawler which causes that documents are deleted which are selected by a crawl filter (host or subpath) - site crawl used this option be default now - made option to deleteDomain() concurrency	2012-11-04 02:58:26 +01:00
Michael Peter Christen	75dd706e1b	update to HostBrowser: - time-out after 3 seconds to speed up display (may be incomplete) - showing also all links from the balancer queue in the host list (after the '/') and in the result browser view with tag 'loading'	2012-11-02 13:57:43 +01:00
Michael Peter Christen	e2c4c3c7d3	migration to solr 4.0.0	2012-11-02 12:29:48 +01:00
Michael Peter Christen	9330ad4838	- fixed the delete option in host browser - added a delete method which can be used to delete a full subpath in solr.	2012-11-02 01:22:31 +01:00
Michael Peter Christen	6629e37685	tried to clean up the search process mess	2012-11-01 17:16:43 +01:00
Michael Peter Christen	f8f05ecba7	- added a delete button in host browser to delete a complete subpath - removed storage of default collection name - default is now "user" - made stacking of crawl start points concurrently	2012-10-31 17:44:45 +01:00
Michael Peter Christen	c326aa8f67	disabled writing new entries to crawl stacks to prevent that a domain with many documents block refreshing of the crawl queue	2012-10-29 22:26:52 +01:00
Michael Peter Christen	6905182d41	- fix for number of words log message - adding meta:refresh also to crawler stack	2012-10-29 21:42:31 +01:00
Michael Peter Christen	799d71bc67	enhanced solr caching: - increased cache size which is needed for longer solr commit time - speed hacks on cache write code	2012-10-28 20:31:29 +01:00
Michael Peter Christen	8e1248ffe3	force a commit in advance of a search for the administrator to get most recent results even if commit time is high and an indexing is ongoing.	2012-10-26 15:35:42 +02:00
Michael Peter Christen	3b48c78190	added an option to force a commit to solr. may be used by a search front-end in case that the commitWithinMs time is too short to get recently indexed documents.	2012-10-26 07:39:07 +02:00
Michael Peter Christen	ce0e5b1e17	- more refactoring / private methods - fix for usage of custom solr field names	2012-10-18 15:09:04 +02:00
Michael Peter Christen	ccc3760a47	Refactoring and redesign of data architecture to make URIMetadataRow superfluous. The target is to make a solr document as the core of YaCy documents which would cause that many conversions can be removed. On the way to this target the Equivalence of URIMetadataRow and URIMetadataNode had to be removed to expose the usage of the old URIMetadataRow data structure. This refactoring already removes unneccessary conversions and should make memory usage during indexing lower.	2012-10-18 14:29:11 +02:00
Michael Peter Christen	e5b3c172ff	removed hack which translated Solr documents to virtual RWI entries which had been then mixed with remote RWIs. Now these Solr documents are feeded into the result set as they appear during local and remote search. That makes the search much faster.	2012-10-17 17:45:41 +02:00
Michael Peter Christen	5d16c23a1f	specified more URIMetadata as URIMetadataNode	2012-10-16 18:26:21 +02:00
Michael Peter Christen	43f3345c90	- removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code	2012-10-16 18:11:57 +02:00
Michael Peter Christen	cc98496ff3	enhanced the HostBrowser: - showing also outbound links to other domains if there are any - the outbound links browser shows also the link structure image - showing even inbound links if the web structure graph has information about that - removed the left menu and made the HostBrowser a part of the top menu for search - moved the file search also to the top menu - added hover information in the HostBrowser to explain what the click means - because the HostBrowser also links to the Metadata viewer ViewFile, there should be a button to switch back to the HostBrowser: added that also.	2012-10-16 17:13:18 +02:00
Michael Peter Christen	21fe8339b4	- enhanced generation of url objects - enhanced computation of link structure graphics - enhanced collection of data for link structures	2012-10-15 13:17:13 +02:00
Michael Peter Christen	1b02408936	use less cache	2012-10-11 14:32:37 +02:00
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	2012-10-10 11:46:22 +02:00
Michael Peter Christen	7e3e45fd04	added Open Graph Metadata default fields, see http://ogp.me/ns#	2012-10-09 17:28:48 +02:00
Michael Peter Christen	c3e5f667a7	added schema.org breadcrumb counter to parser and solr schema	2012-10-09 13:02:43 +02:00
Michael Peter Christen	bd769de604	since the solr index is now used for all pages that are indexed locally, there is no need for the RWI index if the index is not transfered to another peer. Therefore the creation of RWI index data is now suppressed if DHT is disabled. This applies for all intranet and portal mode configurations, but not for public robinson modes. A robinson may switch back to public mode and then transmit its data. That means if someone wants to switch never to DHT mode, it would be more appropriate to choose the portal mode.	2012-10-09 11:48:55 +02:00
Michael Peter Christen	f8a3ab2d82	added the usage of synonyms to the GSA search interface	2012-10-02 14:29:45 +02:00
Michael Peter Christen	3d33a5bdf6	turned the synonyms_t Text field into a multi-valued String field synonyms_sxt	2012-10-02 11:13:06 +02:00
Michael Peter Christen	3b959ee002	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-10-02 10:14:09 +02:00
orbiter	3190347814	added a synonyms_t field to solr and a process to read synonym files. This can be used to add another stemming to solr using stemming files that are expressed as synonyms for grammatical alternatives. The synonym/stemming files must have the following form: - each line is a comma-separated list of synonyms - the list of synonyms may be enclosed with {} (like the GSA synonyms file) - the file may contain comments which are lines starting with a '#' The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and are activated by default whenever a synonym file is in place. Then, for each word that is found in a document all synonyms are added to a long text field which is stored into synonyms_t. Processes using the synonyms must query with that field as optional matcher.	2012-10-02 00:02:50 +02:00
Michael Peter Christen	411d0e839b	added an underline text field to solr to record all underlined texts	2012-10-01 14:16:49 +02:00
Michael Peter Christen	c4a3d8870f	fixed computation of links in host browser which are not indexed but knwon by the crawler. Such links are now displayed in grey color.	2012-09-29 02:13:11 +02:00
Michael Peter Christen	24d2ee3c52	- better date ranking - more protection against NPE and time travel effects	2012-09-26 18:36:32 +02:00
Michael Peter Christen	ca313e404f	- if a "/date" modifier is used, the solr remote query applies an ordering by date (ascending) - added also some 'anti-timetravel' protection (check if date is in the future within any metadata date field)	2012-09-26 16:56:33 +02:00
Michael Peter Christen	a4214694df	We assert that no other metadata storage than solr is used now. Therefore a property like solrConnected() must be true all the time. Removal of this method causes removal of all write operations to the old metadata index.	2012-09-26 16:05:11 +02:00
Michael Peter Christen	562183932b	- removed ip_s from default profile since that needs a DNS lookup to create an document entry. This makes remote search much slower. - removed synchronization of add method if ip_s is activated to prevent that a user configuration causes bad behavior. The disadvantage of that is, that a index dump can cause data loss if an indexing is running during index dump - catched more exceptions and more NPE - better abstraction in MirrorSolrConnector - slight performance enhancement when only the index count is requested (rows=0 is sufficient to get a total count)	2012-09-26 13:38:04 +02:00
Michael Peter Christen	1533bfd63b	refactoring	2012-09-25 21:20:03 +02:00
Michael Peter Christen	872f83ebe0	refactoring	2012-09-25 21:04:58 +02:00
Michael Peter Christen	fb9460f0a8	using the search filter to drill down search to file types. A search like "mp3 filetype:mp3" will now maybe surprise you.	2012-09-25 17:52:33 +02:00
Michael Peter Christen	15ea053c3a	- added xml output in IndexControlURLs to get the storage page of index dump commands - adjusted the apicall.sh script to get the downloaded text as output to stdout which is necessary to parse the content out of it - added indexdump.sh script which creates a solr dump and prints out the storage path for the index dump - added synchronization to the Fulltext class to prevent that data is stored to a non-existing solr index while this index is disabled during the storage of the dump	2012-09-25 00:19:52 +02:00
Michael Peter Christen	1b474139dd	used the new zip writer/reader to add a solr dump process: the whole solr index can be written to a zip dump and also restored during runtime	2012-09-24 17:05:28 +02:00

1 2 3 4 5 ...

362 Commits