yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Roland Haeder	7263bb82fb	Fix for NPE on shutdown: java.lang.NullPointerException at net.yacy.search.Switchboard.storeDocumentIndex(Switchboard.java:2732) at net.yacy.search.Switchboard.access00(Switchboard.java:207) at net.yacy.search.Switchboard.run(Switchboard.java:3049)	2013-07-27 09:55:43 +02:00
orbiter	080d80c9de	do not write an empty failreason in case that there is no fail. Because of the lazy instantiation rule this value was not actually written, but if lazy instantiation is switched on, then this causes that all crawl starts delete all crawl-start-hosts completely because this looks for filled error reasons.	2013-07-26 17:53:28 +02:00
Michael Peter Christen	61e015268b	fix in forced deletion: forced commit needed	2013-07-25 09:53:19 +02:00
Michael Peter Christen	c3b2301b2f	fix for http://bugs.yacy.net/view.php?id=268	2013-07-25 09:21:37 +02:00
orbiter	3e901dcb06	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-07-23 19:33:07 +02:00
orbiter	f50b596e0b	do not run dht distribution if system load is over 2.5	2013-07-23 19:32:32 +02:00
orbiter	056b42f5aa	- added information about segment count to status_p.xml - also moved this information from the old index structure, which is still in use for the RWI/DHT index to that front-end	2013-07-23 18:03:33 +02:00
orbiter	6fb2811e68	fixes for problems with remote solr and non-activated webgraph index	2013-07-23 16:46:44 +02:00
sixcooler	af740f3058	changed optimization to a segment-size of index-size/5.000.000 + one if not idle + one (and force) if postprocessing	2013-07-23 14:21:12 +02:00
orbiter	5364c4dcc9	delayed first peer-ping to send the first ping out after the http got up; if the ping comes before the http is up, it cannot be recognized as senior peer (if at all). See also: http://bugs.yacy.net/view.php?id=266	2013-07-22 18:21:37 +02:00
orbiter	e24016e30a	added the property federated.service.solr.indexing.timeout to yacy.init to provide a configurable time-out for solr; see also: http://bugs.yacy.net/view.php?id=254	2013-07-22 17:45:12 +02:00
orbiter	c124037f19	removed forced non-soft commits to prevent index fragmentation	2013-07-22 17:28:20 +02:00
Michael Peter Christen	c15aa758dc	removed failreason_t removal patch because that causes too much confusion using an external solr. to clean up the index after a schema change, use the index cleaner function from the online servlet	2013-07-22 14:17:38 +02:00
Roland Haeder	be0ff6018f	Removed trailing spaces + some more final	2013-07-17 18:44:24 +02:00
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	2013-07-17 18:31:30 +02:00
Michael Peter Christen	89c0aa0e74	added collection_sxt to error documents	2013-07-17 15:20:56 +02:00
Michael Peter Christen	0df5195cb0	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-07-17 12:42:06 +02:00
Michael Peter Christen	1fd006cc56	fixes using the embedded connector	2013-07-17 12:41:54 +02:00
orbiter	d0dc86cf3d	logging of deadlocks (if any) during cleanup process	2013-07-17 12:38:58 +02:00
Michael Peter Christen	c6a6f159e8	fix for crawl stack domain counter	2013-07-16 18:18:55 +02:00
Michael Peter Christen	93d1bac140	do a more frequent optimization, reduces IO after optimization	2013-07-16 17:16:48 +02:00
orbiter	290e24564b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-07-14 17:41:32 +02:00
orbiter	5533fc8e01	fix for bug 260	2013-07-14 17:40:28 +02:00
Michael Peter Christen	b79471ee67	grr	2013-07-14 10:15:47 +02:00
Michael Peter Christen	a79f288ac1	automatically running optimize on solr if user/search is idle for some time	2013-07-14 10:02:08 +02:00
orbiter	a9c8046c87	do a light optimization at the end of a crawl postprocessing	2013-07-13 19:09:46 +02:00
orbiter	a548354c71	replaced type of solr schema object sku of text_en_splitting_tight by string	2013-07-13 18:54:09 +02:00
orbiter	2f1ec8d4a2	npe fix	2013-07-13 11:10:05 +02:00
Michael Peter Christen	bcc623a843	refactoring of load_delay: this is a matter of client identification	2013-07-12 16:24:56 +02:00
orbiter	0d0b3a30f5	activate api actions after postprocessing of crawls	2013-07-12 16:05:48 +02:00
orbiter	2be456e7fb	added a postprocessing field into api/status_p.xml to show if the postprocessing task is running at that time (status: busy) or not (status:idle)	2013-07-12 14:29:22 +02:00
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based loggers tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is an add-on to jdk logging that puts log entries on a concurrent message queue and logs the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	2013-07-09 14:28:25 +02:00
Michael Peter Christen	a2c8116a8f	accept (but ignore) a '+' sign in front of search words	2013-07-08 16:20:40 +02:00
sixcooler	d5d8936f9d	For indexes that are changing rapidly in NRT situations, fcs (stands for Field Cache per Segment) may be a better choice than the default fc. (saves memory) see: http://wiki.apache.org/solr/SimpleFacetParameters#facet.method	2013-07-04 19:08:53 +02:00
Michael Peter Christen	57ffdfad4c	added a crawl option to obey html-meta-robots-noindex. This is on by default.	2013-07-03 14:50:06 +02:00
Michael Peter Christen	5a5d411ec0	new robots_i attribute fields	2013-07-02 14:29:13 +02:00
Michael Peter Christen	f1c5338210	preparation for greedy crawl profiles and refactoring	2013-07-01 13:10:09 +02:00
Michael Peter Christen	e6f361f474	adding the canonical tag to crawl queues	2013-07-01 13:09:41 +02:00
Michael Peter Christen	203921006a	redesign of citation index storage	2013-06-30 02:11:46 +02:00
Michael Peter Christen	32aa1d4569	removed unused option for queries	2013-06-28 15:32:36 +02:00
sixcooler	e5abccdfe4	added optimize-option	2013-06-28 14:51:37 +02:00
Michael Peter Christen	8caaf6203a	fixed false multiple-generation of remote facet search which caused high cpu usage on remote side.	2013-06-28 12:39:36 +02:00
Michael Peter Christen	823ae4d6a7	added url_protocol_s to error documents	2013-06-26 16:51:36 +02:00
Michael Peter Christen	9a6fcdf597	npe fix	2013-06-25 16:36:16 +02:00
Michael Peter Christen	16d1d744fa	added url_file_name_s in default collection schema for the file name without the file extension. This part of the file path is removed from the multi-field url_paths_sxt, which has now not the file name as last part of the path list. The same applies to the new fields source_file_name_s and target_file_name_s in the webgraph schema.	2013-06-25 16:27:20 +02:00
Michael Peter Christen	f9d859f5dc	now writing image alt texts and (camelcase-)parsed urls into a text search field for a better image retrieval	2013-06-18 16:51:56 +02:00
orbiter	8792e6c6e9	stub for better image indexing	2013-06-18 13:28:30 +02:00
Michael Peter Christen	bdf306e0a7	increased time-out for loading of seed-lists	2013-06-13 22:32:06 +02:00
Michael Peter Christen	570511f3c8	removed fields references_internal_id_sxt and references_internal_url_sxt because they had been shown to be superfluous. The citation of referrer in the host browser is possible without them. Therefore now the host browser does not only show internal, but also external referrer to each link.	2013-06-13 13:01:28 +02:00
Michael Peter Christen	1762911f57	added synchronizations and timeouts in solr api; missing synchronizations in index modification methods causes deadlocks inside solr.	2013-06-12 02:13:18 +02:00
Michael Peter Christen	ffc570f95f	removed forced soft commit since this may be the cause for a performance problem	2013-06-11 14:51:26 +02:00
Michael Peter Christen	6115bef335	added a 'greedy learning' mechanism which causes a 'fresh' yacy to load linked web pages from search results until the total number of web pages reaches 15000. This gives fresh peers a 'boost' to get a personalized search index faster.	2013-06-11 14:42:30 +02:00
Michael Peter Christen	8e965ffd16	fix for host compare in case that the host is null. This happens when doing a search in the intranet for file resources (they don't have a host).	2013-06-10 16:23:58 +02:00
Michael Peter Christen	f7a4377812	usage of the new normalized link popularity CRn as default ranking function. This replaces the previous formula, which was bad. Before you update to this version, please check if you changed the ranking function yourself before, since it will be overwritten.	2013-06-07 13:22:22 +02:00
Michael Peter Christen	f7e77a21bf	Added a citation reference computation for intra-domain link structures. While the values for the reference evaluation are computed, a backlink structure can also be discovered and written to the index. The host browser has been extended to show such backlinks for each presented link. The host browser can therefore now show where a document is linked from. The new citation reference is computed as the likelihood of a random click path with recursive usage of the previously computed likelihood. This process is repeated until the likelihood converges to a specific number. This number is then normalized to a ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to rank popularity within intra-domain link structures.	2013-06-07 13:20:57 +02:00
reger	d367b1f4d9	add null pointer check to stopword fix	2013-06-07 00:13:45 +02:00
reger	7480e87386	- fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247 - append language setting specific stopword list - remove unused OVERHANG stack type	2013-06-06 22:07:54 +02:00
Michael Peter Christen	9fc0c4df98	fix for bad exists 'enhancement'; see bug: http://bugs.yacy.net/view.php?id=245	2013-06-02 13:50:12 +02:00
reger	8a7fcb391d	enable use of solrcore.properties for property substitution of solrconfig.xml - move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties - add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties reason: on 32bit MMapDirectoryFactory may fail with..... Caused by: java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849) at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)	2013-06-01 05:43:08 +02:00
Michael Peter Christen	f7e887bf49	added missing class	2013-05-30 16:39:48 +02:00
Michael Peter Christen	5f92c68f1f	removed block rank ranking and all YBR files in /ranking	2013-05-30 13:01:22 +02:00
Michael Peter Christen	164603b946	cleanup	2013-05-30 12:47:22 +02:00
Michael Peter Christen	409d6edf53	Store node/solr search threads to be able to send them an interrupt signal in case that a cleanup process wants to remove the search process. Added also a new cleanup process which can reduce the number of stored searches to a specific number which can be higher or lower according to the remaining RAM. The cleanup process is called every time a search is started.	2013-05-30 12:38:15 +02:00
Michael Peter Christen	2a8b99ea82	remove text_t in search result after snippet has been computed to save space in search result cache	2013-05-30 12:35:47 +02:00
Michael Peter Christen	a1644ca0fd	new workflow processor in Segment to enqueue indexing documents to solr	2013-05-30 12:34:53 +02:00
Michael Peter Christen	0c1a018bbd	removed 'later' tactic because it used too much RAM, reduced number of soft commits, reduced caching size of search events, ensured that solr results are processed before connection is closed to keep that stuff not too long in RAM	2013-05-29 18:27:27 +02:00
Michael Peter Christen	5344a1c5f7	getting the trash out	2013-05-29 16:09:05 +02:00
Michael Peter Christen	709e9b8ce7	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-05-29 13:49:42 +02:00
Michael Peter Christen	281959a2d7	added option to re-boot the embedded solr during run-time. Added also API recording for this method so it can be repeated automatically. The index dump generation is now also available for API recording. Added some synchronization in backend which was necessary for this.	2013-05-29 13:09:34 +02:00
orbiter	da621e827e	prevent NPE in case RWI is disabled	2013-05-28 16:26:38 +02:00
Michael Peter Christen	c2b1075dcf	activating pollImmediately in case that DHT receive is off. This will cause a much faster search result when running in public robinson mode.	2013-05-28 10:36:49 +02:00
Michael Peter Christen	2b563debbf	javadoc of new multiple-exist test	2013-05-27 13:45:09 +02:00
Michael Peter Christen	8f2d3ce2f9	reduced locking situation in crawler: shifted synchronized location and reduced time-out of robots.txt load limit	2013-05-20 22:05:28 +02:00
Michael Peter Christen	b68fbe7d21	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/migration.java	2013-05-17 14:13:07 +02:00
Michael Peter Christen	06d3063dc9	- no downcase when using collection modifier - removed warnings	2013-05-17 14:11:10 +02:00
Michael Peter Christen	8dbc80da70	redesign of index.exist-test: this shall now not be done using a single id to be tested, but with a collection of ids. This will cause only a single call to solr instead of many. The result is a much better performance when testing the existence of many urls. The effect should cause much less IO during index transmission, both on sender and receiver side.	2013-05-17 13:59:37 +02:00
reger	7f63d3747d	more generic field selection for reindex option of documents with disabled fields using Luke request to compare config with actual fields in index	2013-05-15 23:16:32 +02:00
Michael Peter Christen	44e363f37f	refactoring of WorkflowProcessor, added process counter, update of process counter if a blocking thread dies. Added also a new column in PerformanceConcurrency_p servlet to show the actual number of concurrent processes.	2013-05-13 13:28:07 +02:00
Michael Peter Christen	4058369288	fixed query expressions for collection selection (added quotes)	2013-05-13 13:27:01 +02:00
reger	79401cb938	added reindex option for documents with disabled or obsolete fields to Solr Schema Editor page (IndexSchema_p.html) this allows to remove obsolete fields from the index (according to current schema config) by selecting all documents containing disabled fields.	2013-05-13 04:06:57 +02:00
orbiter	cf36c1614f	prevent that concurrent deletion process causes wrong double-check in crawl start	2013-05-12 21:37:45 +02:00
Michael Peter Christen	b24d1d18e4	removed synchronization and concurrency in Fulltext class, concurrent deletions are now handled in ConcurrentUpdateSolrConnector	2013-05-11 10:53:12 +02:00
Michael Peter Christen	b9b446bca6	- added ssl configuration sign (a lock) to network statistic/table - fixed a bug in bitfield	2013-05-10 17:32:21 +02:00
reger	4fc6837690	- fix monitor url of crawl job in PerformanceQueues_p.html - reduce logging of every index add (switch embeddedsolr.add from info to debug)	2013-05-10 04:38:13 +02:00
Michael Peter Christen	ad050ec88d	- upgraded httpclient, httpcore and httpmime - removed httpclient 3.1 which has been used by solrj < 4.x.x and is now not used any more - fixed some parts in YaCy which used methods from httpclient 3.1	2013-05-09 00:22:45 +02:00
orbiter	a1c989002b	fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4652 generate dht data even if dht receive and dht transmission is switched off	2013-05-08 16:48:45 +02:00
Michael Peter Christen	e26bdd4a52	fixes to deletion methods (removed unnecessary concurrency and added removal of crawl queue entries)	2013-05-08 13:26:25 +02:00
Michael Peter Christen	f7f3e28c5e	prevent that the size of the index is computed too many times. Because the index size is now provided by solr, and the only way to do that is a match for [* TO *], a size computation is quite complex and time-consuming. Therefore this patch prevents that the method is called at all and if necessary puts a DOS-preventing barrier in front of it.	2013-05-08 11:50:46 +02:00
Michael Peter Christen	cca19d94d4	re-declared some fields to be of type string rather than text which makes them more efficient and less large	2013-05-06 16:45:54 +02:00
Michael Peter Christen	3841854c97	abstraction of catchall term	2013-05-04 00:14:22 +02:00
Michael Peter Christen	ea85674be2	added the date to error documents	2013-05-04 00:14:00 +02:00
orbiter	7de5b9cfa0	fix for http://bugs.yacy.net/view.php?id=233 - check geolocation coordinates and accept only those, which are well-formed - the solr push process does not stop crawling any more if after 20 requests to Solr Solr does not accept the record. Instead, a severe log entry asks the user to create a bug request	2013-05-03 00:24:39 +02:00
Michael Peter Christen	bb4bf3d8fd	infinity timeout bug protection patch	2013-04-30 11:06:48 +02:00
Michael Peter Christen	d1be4127e7	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-04-29 19:31:40 +02:00
Michael Peter Christen	f36a7da5f6	- re-introduced existById in solr connector. - introduced raw-queries for the re-introduced byId-Queries (they are hopefully faster than full edismax queries) - removed the cached solr connector (testing this) to rely only on the solr built-in search caches. That should save some RAM (also). We will see if this is usable.	2013-04-28 21:20:14 +02:00
reger	46fa800bc7	added httpstatus_i to automatically switched on fields (used in all search queries)	2013-04-27 03:11:44 +02:00
Michael Peter Christen	3502b4c697	refactoring (renaming) of yacy-solr api	2013-04-27 01:32:18 +02:00
Michael Peter Christen	3a0fcfbeda	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-04-26 10:50:08 +02:00
Michael Peter Christen	25499eead5	- added a new field for the regular expression in crawl start - added the field in crawl profile - adapted logging and error management - adapted duplicate document detection - added a new rule to the indexing process to reject non-matching content - full redesign of the expert crawl start servlet The new filter field can now be seen in /CrawlStartExpert_p.html at Section "Document Filter", subsection item "Filter on Content of Document"	2013-04-26 10:49:55 +02:00
orbiter	e1bfe9d07a	- reduction of the concurrently running processes to make YaCy more adjusted to smaller and 1-core devices. - the workflow processor now starts no process at all. these are started as soon as parser/condenser/indexing queues are filled. - better abstraction	2013-04-25 11:33:17 +02:00
Michael Peter Christen	c091000165	added collection attribute also to the rss feed reader	2013-04-24 01:14:35 +02:00
orbiter	f7571386a3	added a 'collection' property attribute in yacysearch.html which can be used to select between different collections as defined during a crawl start with the 'collection' attribute. This actually implements the ability to prepare search tenants which restrict their search results to a specific collection. The main use for this is to provide tenants to the yaml4 interface (at this time).	2013-04-23 20:42:54 +02:00
Michael Peter Christen	d937c55204	extended limitation of dom export size from 100000 to 100000000	2013-04-22 22:33:13 +02:00
Michael Peter Christen	50421171c3	added new schema fields: hreflang_url_sxt and hreflang_cc_sxt for http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077 navigation_url_sxt and navigation_type_sxt for http://googlewebmastercentral.blogspot.de/2011/09/pagination-with-relnext-and-relprev.html publisher_url_s for http://support.google.com/plus/answer/1713826?hl=de all fields are disabled by default and not written to the index.	2013-04-18 17:21:17 +02:00
Michael Peter Christen	566d6c980c	checking of document signature for a double-document check now refers only to documents within the same domain	2013-04-17 16:15:27 +02:00
Michael Peter Christen	d05dc07cff	setting of new default values for ranking	2013-04-16 15:02:00 +02:00
Michael Peter Christen	97775fbebc	fixed ranking for add-function queries: this did not work. The option was removed. All function queries are now boosts (multiplies the score according to a function). This is also the recommended way to boost rankings based on functions as explained in http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/	2013-04-16 14:45:14 +02:00
Michael Peter Christen	7ab5093321	added new solr title_exact_signature_l and description_exact_signature_l to be able to identify unique title and unique description fields.	2013-04-16 01:35:15 +02:00
Michael Peter Christen	f24ac518e6	redesign of exists()-query (can now be called with query) and the CachedSolrConnector which based its cache on the key value. This will be used to correct the title_unique_b and description_unique_b field.	2013-04-15 14:08:30 +02:00
Michael Peter Christen	27d6222880	added new field host_extent_i which, after a crawl and postprocessing, holds the number of documents for the host where the document is hosted. This is necessary for ranking and the norming of references per local host in the ranking computation.	2013-04-14 20:52:40 +02:00
reger	518b20147c	skip postprocessing during document.store if no citation index connected (prevent null pointer exception)	2013-04-14 02:01:27 +02:00
Michael Peter Christen	ada3f27de7	added three new fields for a better ranking: references_internal_i, references_external_i and references_exthosts_i. These can be used to count and evaluate the number of external links to every web page. An experimental ranking function can be, e.g.: div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))	2013-04-12 16:17:14 +02:00
Michael Peter Christen	082e3274d6	- setting the same default ranking in the solr interface as for YaCy search interfaces if no other ranking attributes are given - using the YaCy ranking in the GSA interface only if there was not given a GSA-style sort attribute - to avoid confusion about correct ranking attributes, only the default '0'-ranking profile is used and not scenario-adopted (site, date) because that should be configurable in the web interface before it is used actually for ranking.	2013-04-12 10:48:41 +02:00
Michael Peter Christen	a20941c067	resume paused crawls on startup; user expects that restarts 'heal' everything	2013-04-11 15:07:08 +02:00
Michael Peter Christen	edc0b33f6d	- showing references count and clickdepth in host browser - fixed generation and presentation of both values	2013-04-11 14:46:13 +02:00
reger	566a3b0294	fix: Index Administration > Reverse Word Index (IndexControlRWIs_p) corrected use of word search to word-hash search - removed duplicate QueryParams.hashes2Handles , redundant with .hashes2Set	2013-04-08 21:25:21 +02:00
Michael Peter Christen	cf0acd2cb4	upgrade to solr 4.2.1	2013-04-06 16:11:24 +02:00
orbiter	e4d26d1cb4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-03-17 10:52:42 +01:00
orbiter	940c6849ee	enhanced did-you-mean (a bit): can now remember previously searched words (plus small enhancements)	2013-03-17 10:52:31 +01:00
reger	d57b221921	add: reset Solr schema field selection to default button in IndexSchema_p	2013-03-17 03:46:29 +01:00
Michael Peter Christen	9406a2e438	fixed NPE during index abstract computation	2013-03-15 10:04:27 +01:00
Michael Peter Christen	2d36a7eaf5	- do not create a new query for all remote peers - no document search this time - adjusted banner and network to not show 'WORDS' but DHT Chunks. This is to avoid confusion for robinson peers which do not create Word Entries	2013-03-15 00:14:28 +01:00
Michael Peter Christen	4af0839be2	use appropriate ranking for each search situation: - when using the /date modifier, a date ranking profile is used - when using a site: modifier, a ranking profile supporting longer urls is used	2013-03-14 21:13:12 +01:00
Michael Peter Christen	b8ed66a55d	added all clickdepth computations for source and target paths in webstructure core	2013-03-14 17:54:33 +01:00
Michael Peter Christen	6300730d7f	refactoring of clickdepth computation as preparation for clickdepth computation of webgraph links	2013-03-14 12:13:02 +01:00
Michael Peter Christen	2080fc7406	removed unused tag fields	2013-03-14 10:35:21 +01:00
orbiter	6b13dd0d3d	added clickdepth field writing for webgraph core (unfinished)	2013-03-14 01:35:38 +01:00
orbiter	47114910d5	fix for possible memory leaks	2013-03-13 17:55:37 +01:00
Michael Peter Christen	addba047e2	changes in ranking computation - an existing ranking servlet for solr was extended. It is now possible to set boost values for fields, boost functions and boost queries. - The ranking can have different instances, but currently only the first one is used - added an abstraction layer for fields which can be used for search and those fields can be edited in the solr ranking configuration - the ranking value from solr within the field score is used to combine remote search requests, which all are created using the same locally defined boost values - reduced the number of fields which are used for search (makes it faster) - replaced some text fields by string fields (makes indexing faster) - removed classes which had no use - made a large number of experiments for a better ranking and created a temporary setting which prefers hits inside titles - adjusted also the RWI-based ranking computation to 'prefer title' - made special cases like for portal search where no post-processing and post-ranking is wanted: this keeps the original ranking order as done by Solr - fixed many bugs with old settings for ranking	2013-03-13 14:47:00 +01:00
orbiter	ab74d559fb	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-03-11 18:23:43 +01:00
Michael Peter Christen	4490133909	removed target_tag_s (superfluous)	2013-03-11 10:46:29 +01:00
orbiter	cd197bb555	fix for NPE if surrogates do not exist	2013-03-10 19:46:06 +01:00
Michael Peter Christen	25300913fa	fixes to search debugging after testing with the different search debugging options	2013-03-05 21:28:22 +01:00
Michael Peter Christen	81380ae5c8	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-03-05 12:24:10 +01:00
Michael Peter Christen	c2fde018b5	concurrent snippet fetching from solr results which do not have snippets	2013-03-05 12:24:01 +01:00
orbiter	b1140e3d82	added debug switches for detailed search testing	2013-03-05 12:19:32 +01:00
orbiter	cdbfddf091	added filter queries for better image, audio and video results	2013-03-04 21:18:54 +01:00
Michael Peter Christen	587ef83eab	added missing cleanup statements for short memory cases during search	2013-03-04 13:01:24 +01:00
Michael Peter Christen	2b6c79d347	in method exists() also use the new caching-stacks for documents/metadata	2013-03-04 01:13:17 +01:00
Michael Peter Christen	ae734b3f8d	enhanced the search result processing - no waiting time at the end - switched on 'classic' snippet production and verification (again)	2013-03-04 00:17:29 +01:00
Michael Peter Christen	0d7b4bc891	better protection against OOM during search flush and fixed missing result push	2013-03-03 23:45:47 +01:00
Michael Peter Christen	221ed7d764	- enhanced concurrency during search without IO blocking - introduced a second queue to flush remote search results (now: old metadata structure from DHT peers) - fixed result counters	2013-03-03 22:38:50 +01:00
Michael Peter Christen	3b1d9dc884	made index storage from DHT search result concurrently. This prevents blocking by high CPU usage during search. Also: removed query from Solr for DHT search results; results are taken from the pending queue.	2013-03-02 10:25:52 +01:00
orbiter	f13c0b2abd	fix for search	2013-03-01 19:18:16 +01:00
orbiter	0f7ea7ad9f	- enhanced solr.add procedure for mass adds - removed unused solr access classes - made snippet generation for documents from YaCy RWI/DHT concurrent (as it was before the search process removal) - reduced the number of remote results in settings file because the processing of such mass documents add is too CPU-intensive (in Solr)	2013-03-01 15:27:17 +01:00
Michael Peter Christen	f327ffedb4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-02-28 15:55:13 +01:00
orbiter	9c09fd7d0b	better/less requests to local solr; the request is made in chunks which are exactly at only that size which is needed to present the current search result page. This will also cause that next solr request are made automatically during switching to next pages.	2013-02-28 14:04:08 +01:00
Michael Peter Christen	840fa22135	disabled clickdepth computation during crawling since that is repeated during clean-up phase.	2013-02-28 02:25:39 +01:00
orbiter	d74472f562	corrected result counter	2013-02-27 22:40:23 +01:00
Michael Peter Christen	d957739441	removed size request	2013-02-26 17:53:44 +01:00
Michael Peter Christen	c95a84103a	complete redesign of search process: - removed 'worker' processes - no internal time-out behaviour: methods either are successful or return null - waiting is only done on top-level - removed snippet-production; this is replaced by solr snippets - removed statistics based on solr size queries (they had been VERY long); the statistics (like suggestions or tag cloud) are now again based on the old but very fast RWI index. In portal or intranet mode the RWI index is usually switched off; if you like to have statistics again then you must switch on the rwis again in this mode. - fixed many bugs regarding correct page counter	2013-02-26 17:16:31 +01:00
Michael Peter Christen	35fa718b77	testing to use solr for portalsearch caused some bugfixing but no full success: try to comment out the solr search request in yacy-portalsearch.js	2013-02-25 14:31:50 +01:00
Michael Peter Christen	008288719c	fix for schema export to consider also automatically generated coordinate fields	2013-02-25 01:13:03 +01:00
Michael Peter Christen	089dee1770	- generalized SchemaConfiguration into super-class Configuration and adopted other classes which used the configuration-only access for that class - removed many warnings - adjusted logging	2013-02-25 00:09:41 +01:00
Michael Peter Christen	c16de49f64	fix for webgraph delete query	2013-02-24 18:17:58 +01:00
Michael Peter Christen	56d5946a59	- added flags in IndexFederated_p.html to switch on or off the webgraph index (new solr core webgraph) .. this is now off by default - completely redesigned this servlet - added description how to attach a remote solr - adjusted naming of servlet and menus - moved 'lazy initialization' attribute from IndexSchema to IndexFederated (this is a general option) back again.	2013-02-24 18:09:34 +01:00
Michael Peter Christen	14cceb6b17	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: htroot/IndexFederated_p.html source/net/yacy/cora/federate/solr/YaCySchema.java source/net/yacy/peers/Protocol.java source/net/yacy/search/Switchboard.java source/net/yacy/search/index/Segment.java also moved portalsearch-dev to yacy-portalsearch to be able to fix problems with new attachment to solr of the search widget	2013-02-23 08:48:33 +01:00
reger	f291d60c5f	on remote Solr search take only locally enabled schema fields from remote solrdocument for the inputdocument added to local index	2013-02-22 22:17:45 +01:00
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr field in the core 'webgraph'. The default schema uses only some of them and the resting search index has now the following properties: - webgraph size will have about 40 times as much entries as default index - the complete index size will increase and may be about the double size of current amount As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also avaiable for p2p operation but client access is not yet implemented.	2013-02-22 15:45:15 +01:00
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented to the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of a solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding on top of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; they must retrieve the actual connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	2013-02-21 13:23:55 +01:00
Michael Peter Christen	33bc255e85	prevent that crawl starts with very large url lists cause a time-out in the user front-end	2013-02-15 01:58:28 +01:00
Michael Peter Christen	b6de1f42dc	Full redesign of solr connection architecture. This was done to support multiple solr cores instead of just one. Therefore it is now necessary to distinguish between solr server connections (called an 'Instance') and a connection to a single solr core. One Instance may now have multiple connector classes assigned to it, each connecting to a single core. To support multiple cores it is also necessary to distinguish between the connection configuration and the configuration of the index schema. We will have multiple schema configurations in the future, each for every solr core. This caused that the IndexFederated servlet had to be split into two parts, the new Servlet for the Schema editor is now in the IndexSchema Servlet.	2013-02-15 01:38:10 +01:00
Michael Peter Christen	4111606654	removed the commitWithin attribute because that is not the right way for us to update the index. May also be superfluous with the solr 4.0 softcommit.	2013-02-13 02:29:47 +01:00
Michael Peter Christen	de58043205	Added image license generation for solr image search results when results are generated within yjson result writer. This makes it possible to view images in yacyinteractive from solr.	2013-02-13 00:33:53 +01:00
Michael Peter Christen	d3508fa8ff	fixed json search, quotes, auto-facets, urls etc. for yacyinteractive.html	2013-02-13 00:01:38 +01:00
Michael Peter Christen	c34af7fe94	extended JSON Response Writer and Opensearch Response Writer for the Solr search interface in such a way that it is possible to use this interface for the yacyinteractive search. This search interface is now much faster using the Solr search directly. For the Solr interface it was necessary to create a translation from the YaCy search modifiers to the Solr facet selection. This was added in such a way that it becomes generic for the normal YaCy search and as an on-top evaluation for Solr queries.	2013-02-12 03:42:46 +01:00
Michael Peter Christen	6f6ddaf7e7	A robinson peer does not need to write RWI data if such peers are only searched using the solr interface. Searching public robinsons will be done with solr only in the future.	2013-02-08 17:58:54 +01:00
Michael Peter Christen	7806680ab8	fixed a problem with re-feeding of already indexed documents with coordinates attached.	2013-02-08 12:45:54 +01:00
Michael Peter Christen	eb80405a16	added a disable function in RemoteCrawl_p servlet which prevents setting of remote crawl if peer is not a senior or principal peer	2013-02-05 12:47:20 +01:00
Michael Peter Christen	e8f7b85b98	fixes to internal RWI usage if RWI is switched off (NPE etc)	2013-02-04 17:11:02 +01:00
Michael Peter Christen	3834829b37	bugfixes and more logging for solr connector	2013-02-04 16:42:10 +01:00
Michael Peter Christen	4323621a76	update to Solr 4.1.0	2013-02-04 10:55:49 +01:00
Michael Peter Christen	592adf7ccb	fix for domain navigation	2013-02-02 07:21:18 +01:00
Michael Peter Christen	7dfcc92b71	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-01-31 13:15:42 +01:00
Michael Peter Christen	0b6566a389	optimizations when starting large crawl requests with many start urls in one request: - allow larger match-fields in html interface - delete all host hashes at once from zurl - when deleting by host, do not count size of deleted entries since that was the reason it took so long	2013-01-31 13:15:28 +01:00
orbiter	a2160054d7	ability to create vocabularies also without any objectspace: this iterates over all urls in the index to create terms	2013-01-30 19:33:48 +01:00
orbiter	ecc10a752c	fixes to index enumeration for vocabulary production	2013-01-29 18:14:14 +01:00
sixcooler	3a13906121	clear some more caches if running out of memory	2013-01-25 04:24:36 +01:00
Michael Peter Christen	8651ec35fe	turned author_s into the multi-valued field author_sxt	2013-01-24 18:24:31 +01:00
Michael Peter Christen	0fe7b6fd3b	migrated the index export methods from the old metadata to solr. Now exports are done using solr queries. removed superfluous methods and servlets.	2013-01-24 12:39:19 +01:00
Michael Peter Christen	1768c82010	removed field selection because that created documents with that field only which was not useful when re-writing the same document	2013-01-24 03:26:38 +01:00
Michael Peter Christen	4735bd47f4	- changed solr commit call and added an optimize option. Since Solr 4.0.0 there is a new softcommit feature which implements a near-real-time (NRT) search option. The softcommit does not do IO and does not cause performance issues. YaCy has now an extension in its solr connectors to use the softcommit feature. The softcommit call now replaces all places where a hard commit was used. Furthermore the commit strategy when doing a search from the web interface was changed (it's done every time before a search is done). The softcommit feature was implemented because it was needed for the following changes (customer demands), which are also included in this git commit: - added a feature to identify all documents which have unique titles and/or unique descriptions. These unique flags are disabled by default. - added also a feature to set a flag when the url from a canonical tag is equal to the document url. This is also disabled by default. To support the new softcommit strategy, the commitWithinMs option was set to -1 to disable automatic commit based on document insert times. If documents are inserted permanently then also a commit would happen permanently whenever the commitWithinMs time is reached. This would conflict with the regular autocommit of 10 minutes and the new softcommit strategy.	2013-01-23 14:40:58 +01:00
Michael Peter Christen	cba038f97b	one more NPE fix	2013-01-17 21:52:56 +01:00
Michael Peter Christen	c3d50d91f8	relaxing site operator for www prefix: - when using a site operator search for a domain where the domain has a www prefix, also the domain without the www is enclosed - when using a site operator search for a domain where the domain has no www prefix, also the domain with the www is enclosed - in the host navigator, all domains with and without a www prefix are accumulated. That means that the host navigator does never show a host with a www prefix. This should prevent usage mistakes of the site operator.	2013-01-16 14:54:35 +01:00
Michael Peter Christen	db49e91724	fixed a NPE which may appear for freeworld peers without any rwi index data. The NPE looked like: Caused by: java.lang.NullPointerException at net.yacy.search.query.SearchEvent.<init>(SearchEvent.java:279) at net.yacy.search.query.SearchEventCache.getEvent(SearchEventCache.java:155) at search.respond(search.java:314) ... 12 more	2013-01-16 11:07:20 +01:00
Michael Peter Christen	4faa07c214	added a timeout for topic computation (solr is here much slower than the old metadata-db)	2013-01-15 16:20:43 +01:00
Michael Peter Christen	d2d5be032d	added an 'inlink' search option according to the suggestion in the YaCy forum at http://forum.yacy-websuche.de/viewtopic.php?f=18&t=4572#p27410 The feature was not called 'haslink' but called 'inlink' to have an analogous naming like 'inurl'. This causes now that you can search for words in links of the document, like: * inlink:yacy searches all documents which link to pages which have a 'yacy' in the url.	2013-01-14 12:50:21 +01:00
reger	3897bb4409	added (manual) urldb migration (link on: Index Administration -> Federated Solr Index) - migrates all entries in old urldb Metadata coordinate (lat / lon) NumberFormatException still relatively often (see excerpt below), - added try/catch for URIMetadataRow (seems not to be needed in URIMetaDataNode, as Solr internally checks for number format) - removed possible type conversion for lat() / lon() comparison with 0.0f, changed to 0.0 (leaving it to the compiler/optimizer to choose number format) current log excerpt for NumberFormatException: W 2013/01/14 00:10:07 StackTrace For input string: "-" java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152) ... Caused by: java.lang.NumberFormatException: For input string: "-" at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source) at java.lang.Double.parseDouble(Unknown Source) at net.yacy.kelondro.data.meta.URIMetadataRow$Components.lon(URIMetadataRow.java:525) at net.yacy.kelondro.data.meta.URIMetadataRow.lon(URIMetadataRow.java:279) at net.yacy.search.index.SolrConfiguration.metadata2solr(SolrConfiguration.java:277) at net.yacy.search.index.Fulltext.putMetadata(Fulltext.java:329) at transferURL.respond(transferURL.java:152)	2013-01-14 03:06:24 +01:00
reger	3b6e08b49f	prevent checking of urldb if empty - disconnect urlIndexFile if empty - add missing lock class in submenuSearchConfiguration	2013-01-12 15:20:23 +01:00
reger	f143804382	fix configuration for search page navigators - added additional config page (ConfigSearchPage_p) for easy setup of search page layout (to not overload ConfigPortal page) - currently redundant setting with part of ConfigPortal page - added missing config for filetype and protocol navigator - adjusted init of SearchEvent to check navigation config setting - renamed RankigProcess.getTopicNavigator to getTopics (to distinguish between added SearchEvent.getTopicNavigator)	2013-01-05 19:00:54 +01:00
Michael Peter Christen	becd52a984	added also a re-calculation of reference counts during the post-processing of clickcount calculations. This is a really nice thing to have because the reference count affects ranking.	2013-01-05 00:58:27 +01:00
Michael Peter Christen	38d3feae65	added separate delete commands for the local+remote solr index, the old metadata and old rwi and for the citation index. The important advancement is the separation of the citation index deletion because that index is responsible for the linkdepth calculation. Now a search index can be deleted without the citation index and that should cause that less clickdepths must be post-processed.	2013-01-04 16:39:34 +01:00
Michael Peter Christen	6f0baaa309	added the clickdepth post-processing: some links may have 'shortcuts' to already calculated click depths. These are then calculated if the crawl buffer is empty and therefore no new 'shortcuts' can be discovered. The status of the clickdepth stack (to-be-processed) can be seen using a solr search command like this: http://localhost:8090/solr/select?q=process_sxt:[%20TO%20]&start=0&rows=30&fl=sku,clickdepth_i,process_sxt	2013-01-04 16:37:39 +01:00
Michael Peter Christen	0f5b6f38c1	enhanced root-url detection	2013-01-03 19:21:21 +01:00
Michael Peter Christen	5c0c56cfe1	Preparations to produce a click depth attribute in the search index. This attribute can be used for ranking and for other purposes (demand by customer) The click depth is computed in two steps: - during indexing the current fill-state of the reverse link index is used to backtrack the current page to the root page. The length of that backtrack is the clickdepth. But this does not discover the shortest click depth. To get this, a second process to check again is needed - added a process tag that can be used to do operations on the existing index after a crawl; i.e. calculating the shortest clickpath. Added a field to control this operation but not a method to operate on this. - added a visualization of the clickpath length in the host browser	2013-01-02 20:55:43 +01:00
reger	f301336adf	fix: no results with configuration citation reference index switched off - urlcitationindex != null check added to ResultEntry.referencesCount - plus other places where conflicting procedure was used (and urlcitationindex not already checked != null)	2012-12-30 02:13:48 +01:00
orbiter	fe50702eb0	added a filterscannerfail attribute to QueryParams which causes that a check to the network scanner fail/success status can be used/suppressed for search results. This is a feature that comes with the port scanner.	2012-12-29 17:47:34 +01:00
Michael Peter Christen	eb90d38cd7	added missing extension 'mkv' for navigation	2012-12-27 13:56:13 +01:00
Michael Peter Christen	4a9182ae16	use the search configuration to default the cacheStrategy to the value as given in the search configuration	2012-12-27 03:19:21 +01:00
Michael Peter Christen	98819ec3d9	use solr boost configuration to select search fields. At this time it is possible to enter a negative boost value to switch that value off. This might be different in the future with a better input interface.	2012-12-27 03:17:45 +01:00

767 Commits