yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	1b4fa2947d	- fixed a problem which occurred when a document was not recognized with the right content domain (i.e. identifying that it is an image, text etc.) because it used the file extension and not an existing mime type assignment. - fixed the new setting that images shall be loaded for a better image search. - both fixes together make it now possible to crawl commons.wikimedia.org which makes use of 'funny' document names (i.e. ending with .jpg while the document is html)	2013-10-23 00:16:54 +02:00
Michael Peter Christen	82621bead0	When doing bootstrapping, always accept one seedlist-File without checking the date of the file. This should help to start the peer in case that the user has a completely wrong date setting.	2013-10-22 15:34:51 +02:00
Michael Peter Christen	691d7e70fa	added hint to development/commit rss feed	2013-10-21 15:16:29 +02:00
Michael Peter Christen	c833d02cf5	fixed webgraph postprocessing (did nothing and repeated to do this...)	2013-10-16 11:49:04 +02:00
Michael Peter Christen	74d0256e93	enhanced postprocessing: fixed bugs, enable proper postprocessing also without the harvestingkey, remove crawl profiles after postprocessing, speed-up for clickdepth computation.	2013-10-16 11:27:06 +02:00
Michael Peter Christen	d328cc4a83	fix for didyoumean, added also more asian alphabets	2013-10-09 16:17:50 +02:00
Michael Peter Christen	90c8577840	enhanced ranking; patches to replace old ranking	2013-10-09 15:10:03 +02:00
Michael Peter Christen	1b61bd40ed	- Added new solr field url_file_name_tokens_t which stores the file name tokens. This can be used to enhance the ranking. - Added also a rating_i field as basis for later usage. - enhanced the tokenization process.	2013-10-08 23:48:13 +02:00
orbiter	5f5a97bafc	added the anchor text within web pages to the searchable entities of a web page. This can be of benefit for the ranking if these fields are used for boosts.	2013-10-08 18:41:07 +02:00
orbiter	705b3338ee	list more fields available for search and for ranking boosts	2013-10-08 18:15:35 +02:00
Michael Peter Christen	78e7aadb26	removed unused initialization method	2013-10-07 23:51:28 +02:00
Michael Peter Christen	4fbc4740df	removed warnings	2013-10-07 23:41:50 +02:00
Michael Peter Christen	21aa6a0321	migration to Solr 4.5.0	2013-10-07 17:09:40 +02:00
Michael Peter Christen	101a6e6e14	Patch the citation index for links with canonical tags. This shall fulfill the following requirement: If a document A links to B and B contains a 'canonical C', then the citation rank computation shall consider that A links to C and B does not link to C. To do so, we first must collect all canonical links, find all references to them, get the anchor list of the documents and patch the citation reference of these links.	2013-10-07 11:15:58 +02:00
Michael Peter Christen	b28d43decc	added two more fields source_cr_host_norm_i,target_cr_host_norm_i in webgraph and an addition to postprocessing to copy all cr ranking attributes to the link edges associated to the postprocessing documents	2013-09-27 16:57:05 +02:00
Michael Peter Christen	a52f3a597e	fix for canonical-from-http-header feature	2013-09-27 15:09:04 +02:00
Michael Peter Christen	2dd7c5be44	added parsing of http-canonical tags (untested, could not find an example page)	2013-09-27 13:17:50 +02:00
Michael Peter Christen	3bf0104199	fix for crawl domain counter limitation (limit was reached too early)	2013-09-26 13:41:52 +02:00
Michael Peter Christen	82bfd9e00a	- crawl profiles shall be deleted from active and passive stacks if they are deleted to terminate the crawl because otherwise the crawl will go on after the load-from-passive stack policy. - better check if a crawl is terminated using the loader queue.	2013-09-26 10:22:31 +02:00
Michael Peter Christen	91a875dff5	self-healing of mistakenly deactivated crawl profiles. This fixes a bug which can happen in rare cases when a crawl start and a cleanup process happen at the same time.	2013-09-25 18:27:54 +02:00
Michael Peter Christen	095053a9b4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-25 17:32:52 +02:00
sixcooler	0cae420d8e	some dns-timing changes: since httpclient uses the domain-cache it is useful not to clean the domain cache until crawling is running (domains are filled into this cache) On huge crawl-starts (eg. from file) my DNS did not follow the high rates - so I reduced the rate and give some more time(-out)	2013-09-25 15:01:28 +02:00
Michael Peter Christen	4f83d5f18c	added the new field harvestkey_s to the collection index and the webgraph index which is temporary filled with the crawl profile key. This is used to select a set of documents for post-processing as soon as a crawl is finished. Now the postprocessing for a specific crawl is started when that specific crawl is finished and not at the end of all post-processing steps.	2013-09-25 14:38:24 +02:00
orbiter	14442efa6d	when profiles are cleaned, there shall be first a callback showing which profiles are cleaned. This shall enable a profile-termination-driven postprocessing. To do this, index writings must carry the profile key which will be implemented in another (next) step.	2013-09-25 11:04:12 +02:00
orbiter	8ac2e8c8c9	added location navigator which causes that the image to the map search is visible whenever a location is available in the search result. To activate this, the search.navigation property in yacy.conf must be modified to the new default values.	2013-09-24 11:26:51 +02:00
Michael Peter Christen	96ed0c980e	- added hosthash to all documents (also fail documents which is needed there for deletion), this fixes a problem for the deletion of old documents for new crawl starts - added clickdepth and citation computation for fail documents	2013-09-23 18:09:42 +02:00
orbiter	828603e4f1	fix for 100%CPU problem in error cache cleaning process	2013-09-21 10:20:13 +02:00
orbiter	c64b51134e	hack to add all tokens from the url to text_t. This was working for the RWI index (and still is working) but not for solr-only search indexes. Maybe we should find a solution using a separate search field instead.	2013-09-21 08:57:43 +02:00
orbiter	f3be1930cb	CPU problem when pushing to the error cache; wrong class, ConcurrentHashMap needed for concurrency	2013-09-20 16:51:50 +02:00
Michael Peter Christen	e40671ddb7	better and consistent deletions for error urls	2013-09-17 15:52:57 +02:00
Michael Peter Christen	2602be8d1e	- removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs are now temporary objects in memory	2013-09-17 15:27:02 +02:00
Michael Peter Christen	31920385f7	set anchor rel attribute of all links to "nofollow" if the html meta contains a robots:nofollow or if the http header contains a "X-Robots-Tag: nofollow"	2013-09-16 16:14:56 +02:00
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - this caused that large portions of the parser code had to be adapted as well - added a counter target_order_i for anchor links in webgraph computation	2013-09-15 23:27:04 +02:00
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary to adapt a large portion of the parser and link processing classes to carry a different type of link collection, which carries property attributes that are attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI had been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	2013-09-15 00:30:23 +02:00
Michael Peter Christen	35ab2cef7b	added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in html meta fields to get a correct (or: better) date timestamp. The http:last-modified mostly does not work because it is set to the current date from most CMS.	2013-09-10 10:31:57 +02:00
Michael Peter Christen	9cc8468b30	added tools to visualize image generation (i.e. during testing)	2013-09-09 12:58:26 +02:00
Michael Peter Christen	dbef8ccfcb	forced deletion of ZURL entries for a specific host for each host that appears in the crawl url list	2013-09-05 13:22:16 +02:00
Michael Peter Christen	e137ff4171	refactoring (in preparation for new removeHost method)	2013-09-05 09:59:41 +02:00
Michael Peter Christen	7a5574cd51	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-04 23:12:04 +02:00
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assign this to each document. Thus, each document is assigned a number which shows how many copies of this document exist. These fields are disabled by default.	2013-09-04 23:11:53 +02:00
orbiter	26366596d9	fix for a problem which occurs when a site is crawled where the start url is redirected.	2013-09-04 16:00:47 +02:00
Michael Peter Christen	a2511b5600	turned images_alt_txt back to images_alt_sxt because it is not necessary to index the alt text. Indexed image Text is in images_text_t	2013-09-04 10:47:18 +02:00
Michael Peter Christen	85b1922244	activated image type navigation for image search	2013-09-03 13:34:01 +02:00
Michael Peter Christen	9e12fdff23	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-03 12:22:57 +02:00
Michael Peter Christen	ab1201fdfd	fixed wrong facet count	2013-09-03 12:22:29 +02:00
Michael Peter Christen	049c3b3f2e	added an option to exclude image search results from text search. This is on by default.	2013-09-03 11:14:23 +02:00
Michael Peter Christen	69f85265e1	added an option to put image links to the crawl queue and handle these like normal documents. Using this option (by default on at this moment; this might change soon) it is possible to get the exif data into the search index to be used in image search.	2013-09-03 11:13:45 +02:00
Michael Peter Christen	a8c5bfcf58	avoid to create unnecessary objects	2013-09-03 09:48:05 +02:00
Michael Peter Christen	5a0de1b77d	moving image description text to image text field	2013-09-03 09:47:27 +02:00
Michael Peter Christen	dc179bd61f	fix for catchall query goal for image search	2013-09-03 07:55:21 +02:00
reger	392174de8c	remove all_words, all_strings lists from QueryGoal - only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only	2013-09-02 23:09:43 +02:00
Michael Peter Christen	169ef8963d	one more fix for image search	2013-09-02 20:02:26 +02:00
Michael Peter Christen	cb85b22725	redesign of the image search process (with much better results, unfortunately the index schema has changed and p2p image search will not be much better until many people update)	2013-09-02 18:55:38 +02:00
reger	29967102a2	optimized QueryGoal (reducing mem and computation by removing all_hashes) - all_hashes used for text highlighting and word distance computation which can be done with include_hashes only	2013-09-02 04:19:53 +02:00
orbiter	f106345eef	link strings should not be tokenized	2013-09-01 14:35:36 +02:00
orbiter	deadeb406e	image alt tag strings should be tokenized	2013-09-01 13:48:10 +02:00
Michael Peter Christen	1a3e42eca4	index migration to lucene 4.4	2013-08-26 12:49:39 +02:00
Michael Peter Christen	a88a62f7aa	added a feature to set a collection for a crawl result based on a regular expression on the url: the collection attribute for a crawl start may be now either a token or a list of tokens, separated by ',' where a token is either a string or a pair <string,pattern> where the string is separated from the pattern with a ':' and the string is assigned to the document as collection only if the pattern matches with the url.	2013-08-25 00:13:48 +02:00
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	2013-08-22 14:23:47 +02:00
Michael Peter Christen	47b1c81d08	- refactoring - generalized writing of url attributes to solr documents - added more url attributes to error documents	2013-08-20 15:46:04 +02:00
Michael Peter Christen	697613170d	less logging for postprocessing (this was a debugging logging with high CPU load)	2013-08-17 09:25:32 +02:00
reger	a5019bc470	make Vocabulary Navigator tags a hard result entry filter by checking vocabulary tags also for rwi results (currently a filter is applied to the solr query) TODO: as vocabularies are only locally valid, auto-switch to Searchdom.LOCAL could be considered.	2013-08-13 03:07:25 +02:00
reger	a67a4b7d86	improve tld: query modifier filter pattern (to prevent tld:net accepting www.abcinet.org)	2013-08-12 21:20:23 +02:00
reger	02fe8b43ba	Field Re-Indexing: display list of fields in reindex queue change servlet to display statistic on 1st click (instead after refresh)	2013-08-11 04:51:29 +02:00
sixcooler	7f501b7c38	clear some caches before reporting low Memory do not break lines in Network-table-rows	2013-08-08 14:38:26 +02:00
Michael Peter Christen	2857499467	fix to collection schema; bug appeared for _txt fields with empty String as content	2013-07-31 13:32:05 +02:00
Michael Peter Christen	58fe986cca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-07-30 12:49:14 +02:00
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	2013-07-30 12:48:57 +02:00
reger	f2d99053ed	Field Re-Indexing: prevent endless error loop in ReindexSolrBusyThread on Solr exception (by skipping query causing the exception) (occurred during testing while working on q=store:[* TO *])	2013-07-29 01:32:02 +02:00
orbiter	d05e0c5368	wait a bit longer before doing the first peer ping	2013-07-27 11:00:35 +02:00
orbiter	b8f57f7703	don't be noisy when doing background tasks that may be allowed to fail	2013-07-27 10:51:58 +02:00
Roland Haeder	0343f0668c	Fix for NPE: E 2013/07/26 20:29:29 BUSYTHREAD Runtime Error in serverInstantThread.job, thread 'net.yacy.search.Switchboard.cleanupJob': null; target exception: null java.lang.NullPointerException at net.yacy.search.schema.CollectionConfiguration.convergenceStep(CollectionConfiguration.java:1116) at net.yacy.search.schema.CollectionConfiguration.postprocessing(CollectionConfiguration.java:897) at net.yacy.search.Switchboard.cleanupJob(Switchboard.java:2296) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:107) at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:165) Conflicts: source/net/yacy/search/schema/CollectionConfiguration.java	2013-07-27 10:19:46 +02:00
Roland Haeder	b58ca8622d	Some cleanups: - added SKINS_PATH_DEFAULT as same as LISTS_PATH_DEFAULT was added - Added 'final' keyword to a string	2013-07-27 10:13:57 +02:00
Roland Haeder	7263bb82fb	Fix for NPE on shutdown: java.lang.NullPointerException at net.yacy.search.Switchboard.storeDocumentIndex(Switchboard.java:2732) at net.yacy.search.Switchboard.access00(Switchboard.java:207) at net.yacy.search.Switchboard.run(Switchboard.java:3049)	2013-07-27 09:55:43 +02:00
orbiter	080d80c9de	do not write an empty failreason in case that there is no fail. Because of the lazy instantiation rule this value was not actually written, but if lazy instantiation is switched on, then this causes that all crawl starts delete all crawl-start-hosts completely because this looks for filled error reasons.	2013-07-26 17:53:28 +02:00
Michael Peter Christen	61e015268b	fix in forced deletion: forced commit needed	2013-07-25 09:53:19 +02:00
Michael Peter Christen	c3b2301b2f	fix for http://bugs.yacy.net/view.php?id=268	2013-07-25 09:21:37 +02:00
orbiter	3e901dcb06	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-07-23 19:33:07 +02:00
orbiter	f50b596e0b	do not run dht distribution if system load is over 2.5	2013-07-23 19:32:32 +02:00
orbiter	056b42f5aa	- added information about segment count to status_p.xml - also moved this information from the old index structure, which is still in use for the RWI/DHT index to that front-end	2013-07-23 18:03:33 +02:00
orbiter	6fb2811e68	fixes for problems with remote solr and non-activated webgraph index	2013-07-23 16:46:44 +02:00
sixcooler	af740f3058	changed optimization to a segment-size of index-size/5.000.000 + one if not idle + one (and force) if postprocessing	2013-07-23 14:21:12 +02:00
orbiter	5364c4dcc9	delayed first peer-ping to send the first ping out after the http got up; if the ping comes before the http is up, it cannot be recognized as senior peer (if at all). See also: http://bugs.yacy.net/view.php?id=266	2013-07-22 18:21:37 +02:00
orbiter	e24016e30a	added the property federated.service.solr.indexing.timeout to yacy.init to provide a configurable time-out for solr; see also: http://bugs.yacy.net/view.php?id=254	2013-07-22 17:45:12 +02:00
orbiter	c124037f19	removed forced non-soft commits to prevent index fragmentation	2013-07-22 17:28:20 +02:00
Michael Peter Christen	c15aa758dc	removed failreason_t removal patch because that causes too much confusion using an external solr. to clean up the index after a schema change, use the index cleaner function from the online servlet	2013-07-22 14:17:38 +02:00
Roland Haeder	be0ff6018f	Removed trailing spaces + some more final	2013-07-17 18:44:24 +02:00
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	2013-07-17 18:31:30 +02:00
Michael Peter Christen	89c0aa0e74	added collection_sxt to error documents	2013-07-17 15:20:56 +02:00
Michael Peter Christen	0df5195cb0	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-07-17 12:42:06 +02:00
Michael Peter Christen	1fd006cc56	fixes using the embedded connector	2013-07-17 12:41:54 +02:00
orbiter	d0dc86cf3d	logging of deadlocks (if any) during cleanup process	2013-07-17 12:38:58 +02:00
Michael Peter Christen	c6a6f159e8	fix for crawl stack domain counter	2013-07-16 18:18:55 +02:00
Michael Peter Christen	93d1bac140	do a more frequent optimization, reduces IO after optimization	2013-07-16 17:16:48 +02:00
orbiter	290e24564b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-07-14 17:41:32 +02:00
orbiter	5533fc8e01	fix for bug 260	2013-07-14 17:40:28 +02:00
Michael Peter Christen	b79471ee67	grr	2013-07-14 10:15:47 +02:00
Michael Peter Christen	a79f288ac1	automatically running optimize on solr if user/search is idle for some time	2013-07-14 10:02:08 +02:00
orbiter	a9c8046c87	do a light optimization at the end of a crawl postprocessing	2013-07-13 19:09:46 +02:00
orbiter	a548354c71	replaced type of solr schema object sku of text_en_splitting_tight by string	2013-07-13 18:54:09 +02:00
orbiter	2f1ec8d4a2	npe fix	2013-07-13 11:10:05 +02:00
Michael Peter Christen	bcc623a843	refactoring of load_delay: this is a matter of client identification	2013-07-12 16:24:56 +02:00
orbiter	0d0b3a30f5	activate api actions after postprocessing of crawls	2013-07-12 16:05:48 +02:00
orbiter	2be456e7fb	added a postprocessing field into api/status_p.xml to show if the postprocessing task is running at that time (status: busy) or not (status:idle)	2013-07-12 14:29:22 +02:00
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based loggers tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is an add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	2013-07-09 14:28:25 +02:00
Michael Peter Christen	a2c8116a8f	accept (but ignore) a '+' sign in front of search words	2013-07-08 16:20:40 +02:00
sixcooler	d5d8936f9d	For indexes that are changing rapidly in NRT situations, fcs (stands for Field Cache per Segment) may be a better choice than the default fc. (saves memory) see: http://wiki.apache.org/solr/SimpleFacetParameters#facet.method	2013-07-04 19:08:53 +02:00
Michael Peter Christen	57ffdfad4c	added a crawl option to obey html-meta-robots-noindex. This is on by default.	2013-07-03 14:50:06 +02:00
Michael Peter Christen	5a5d411ec0	new robots_i attribute fields	2013-07-02 14:29:13 +02:00
Michael Peter Christen	f1c5338210	preparation for greedy crawl profiles and refactoring	2013-07-01 13:10:09 +02:00
Michael Peter Christen	e6f361f474	adding the canonical tag to crawl queues	2013-07-01 13:09:41 +02:00
Michael Peter Christen	203921006a	redesign of citation index storage	2013-06-30 02:11:46 +02:00
Michael Peter Christen	32aa1d4569	removed unused option for queries	2013-06-28 15:32:36 +02:00
sixcooler	e5abccdfe4	added optimize-option	2013-06-28 14:51:37 +02:00
Michael Peter Christen	8caaf6203a	fixed false multiple-generation of remote facet search which caused high cpu usage on remote side.	2013-06-28 12:39:36 +02:00
Michael Peter Christen	823ae4d6a7	added url_protocol_s to error documents	2013-06-26 16:51:36 +02:00
Michael Peter Christen	9a6fcdf597	npe fix	2013-06-25 16:36:16 +02:00
Michael Peter Christen	16d1d744fa	added url_file_name_s in default collection schema for the file name without the file extension. This part of the file path is removed from the multi-field url_paths_sxt, which has now not the file name as last part of the path list. The same applies to the new fields source_file_name_s and target_file_name_s in the webgraph schema.	2013-06-25 16:27:20 +02:00
Michael Peter Christen	f9d859f5dc	now writing image alt texts and (camelcase-)parsed urls into a text search field for a better image retrieval	2013-06-18 16:51:56 +02:00
orbiter	8792e6c6e9	stub for better image indexing	2013-06-18 13:28:30 +02:00
Michael Peter Christen	bdf306e0a7	increased time-out for loading of seed-lists	2013-06-13 22:32:06 +02:00
Michael Peter Christen	570511f3c8	removed fields references_internal_id_sxt and references_internal_url_sxt because they had been shown to be superfluous. The citation of referrer in the host browser is possible without them. Therefore now the host browser does not only show internal, but also external referrer to each link.	2013-06-13 13:01:28 +02:00
Michael Peter Christen	1762911f57	added synchronizations and timeouts in solr api; missing synchronizations in index modification methods causes deadlocks inside solr.	2013-06-12 02:13:18 +02:00
Michael Peter Christen	ffc570f95f	removed forced soft commit since this may be the cause for a performance problem	2013-06-11 14:51:26 +02:00
Michael Peter Christen	6115bef335	added a 'greedy learning' mechanism which will cause that a 'fresh' yacy will load linked web pages from search results until the total number of web pages reaches 15000. This shall give fresh peers a 'boost' to get a personalized search index faster.	2013-06-11 14:42:30 +02:00
Michael Peter Christen	8e965ffd16	fix for host compare in case that the host is null. This happens when doing a search in the intranet for file resources (they don't have a host).	2013-06-10 16:23:58 +02:00
Michael Peter Christen	f7a4377812	usage of the new normalized link popularity CRn as default ranking function. This replaces the previous formula, which was bad. Before you update to this version, please check if you changed the ranking function yourself before, since it will be overwritten.	2013-06-07 13:22:22 +02:00
Michael Peter Christen	f7e77a21bf	Added a citation reference computation for intra-domain link structures. While the values for the reference evaluation are computed, also a backlink-structure can be discovered and written to the index as well. The host browser has been extended to show such backlinks for each presented link. The host browser can therefore now show where a document is linked from. The new citation reference is computed as likelihood for a random click path with recursive usage of previously computed likelihood. This process is repeated until the likelihood converges to a specific number. This number is then normalized to a ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to rank popularity within intra-domain link structures.	2013-06-07 13:20:57 +02:00
reger	d367b1f4d9	add null pointer check to stopword fix	2013-06-07 00:13:45 +02:00
reger	7480e87386	- fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247 - append language setting specific stopword list - remove unused OVERHANG stack type	2013-06-06 22:07:54 +02:00
Michael Peter Christen	9fc0c4df98	fix for bad exists 'enhancement'; see bug: http://bugs.yacy.net/view.php?id=245	2013-06-02 13:50:12 +02:00
reger	8a7fcb391d	enable use of solrcore.properties for property substitution of solrconfig.xml - move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties - add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties reason: on 32bit MMapDirectoryFactory may fail with..... Caused by: java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849) at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)	2013-06-01 05:43:08 +02:00
Michael Peter Christen	f7e887bf49	added missing class	2013-05-30 16:39:48 +02:00
Michael Peter Christen	5f92c68f1f	removed block rank ranking and all YBR files in /ranking	2013-05-30 13:01:22 +02:00
Michael Peter Christen	164603b946	cleanup	2013-05-30 12:47:22 +02:00
Michael Peter Christen	409d6edf53	Store node/solr search threads to be able to send them an interrupt signal in case that a cleanup process wants to remove the search process. Added also a new cleanup process which can reduce the number of stored searches to a specific number which can be higher or lower according to the remaining RAM. The cleanup process is called every time a search is started.	2013-05-30 12:38:15 +02:00
Michael Peter Christen	2a8b99ea82	remove text_t in search result after snippet has been computed to save space in search result cache	2013-05-30 12:35:47 +02:00
Michael Peter Christen	a1644ca0fd	new workflow processor in Segment to enqueue indexing documents to solr	2013-05-30 12:34:53 +02:00
Michael Peter Christen	0c1a018bbd	removed 'later' tactic because it used too much RAM, reduced number of soft commits, reduced caching size of search events, ensured that solr results are processed before connection is closed to keep that stuff not too long in RAM	2013-05-29 18:27:27 +02:00
Michael Peter Christen	5344a1c5f7	getting the trash out	2013-05-29 16:09:05 +02:00
Michael Peter Christen	709e9b8ce7	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-05-29 13:49:42 +02:00
Michael Peter Christen	281959a2d7	added option to re-boot the embedded solr during run-time. Added also API recording for this method so it can be repeated automatically. The index dump generation is now also available for API recording. Added some synchronization in backend which was necessary for this.	2013-05-29 13:09:34 +02:00
orbiter	da621e827e	prevent NPE in case RWI is disabled	2013-05-28 16:26:38 +02:00
Michael Peter Christen	c2b1075dcf	activating pollImmediately in case that DHT receive is off. This will cause a much faster search result when running in public robinson mode.	2013-05-28 10:36:49 +02:00
Michael Peter Christen	2b563debbf	javadoc of new multiple-exist test	2013-05-27 13:45:09 +02:00
Michael Peter Christen	8f2d3ce2f9	reduced locking situation in crawler: shifted synchronized location and reduced time-out of robots.txt load limit	2013-05-20 22:05:28 +02:00
Michael Peter Christen	b68fbe7d21	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/migration.java	2013-05-17 14:13:07 +02:00
Michael Peter Christen	06d3063dc9	- no downcase when using collection modifier - removed warnings	2013-05-17 14:11:10 +02:00
Michael Peter Christen	8dbc80da70	redesign of index.exist-test: this shall now not be done using a single id to be tested, but with a collection of ids. This will cause only a single call to solr instead of many. The result is a much better performance when testing the existence of many urls. The effect should cause very much less IO during index transmission, both on sender and receiver side.	2013-05-17 13:59:37 +02:00
reger	7f63d3747d	more generic field selection for reindex option of documents with disabled fields using Luke request to compare config with actual fields in index	2013-05-15 23:16:32 +02:00
Michael Peter Christen	44e363f37f	refactoring of WorkflowProcessor, added process counter, update of process counter if a blocking thread dies. Added also a new column in PerformanceConcurrency_p servlet to show the actual number of concurrent processes.	2013-05-13 13:28:07 +02:00
Michael Peter Christen	4058369288	fixed query expressions for collection selection (added quotes)	2013-05-13 13:27:01 +02:00
reger	79401cb938	added reindex option for documents with disabled or obsolete fields to Solr Schema Editor page (IndexSchema_p.html) this allows to remove obsolete fields from the index (according to current schema config) by selecting all documents containing disabled fields.	2013-05-13 04:06:57 +02:00
orbiter	cf36c1614f	prevent that concurrent deletion process causes wrong double-check in crawl start	2013-05-12 21:37:45 +02:00
Michael Peter Christen	b24d1d18e4	removed synchronization and concurrency in Fulltext class, concurrent deletions are now handled in ConcurrentUpdateSolrConnector	2013-05-11 10:53:12 +02:00
Michael Peter Christen	b9b446bca6	- added ssl configuration sign (a lock) to network statistic/table - fixed a bug in bitfield	2013-05-10 17:32:21 +02:00
reger	4fc6837690	- fix monitor url of crawl job in PerformanceQueues_p.html - reduce logging of every index add (switch embeddedsolr.add from info to debug)	2013-05-10 04:38:13 +02:00
Michael Peter Christen	ad050ec88d	- upgraded httpclient, httpcore and httpmime - removed httpclient 3.1 which has been used by solrj < 4.x.x and is now not used any more - fixed some parts in YaCy which used methods from httpclient 3.1	2013-05-09 00:22:45 +02:00
orbiter	a1c989002b	fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4652 generate dht data even if dht receive and dht transmission is switched off	2013-05-08 16:48:45 +02:00
Michael Peter Christen	e26bdd4a52	fixes to deletion methods (removed unnecessary concurrency and added removal of crawl queue entries)	2013-05-08 13:26:25 +02:00
Michael Peter Christen	f7f3e28c5e	prevent that the size of the index is computed too many times. Because the index size is now provided by solr, and the only way to do that is a match for [* TO *], a size computation is quite complex and time-consuming. Therefore this patch prevents that the method is called at all and if necessary puts a DOS-preventing barrier in front of it.	2013-05-08 11:50:46 +02:00
Michael Peter Christen	cca19d94d4	re-declared some fields to be of type string rather than text which makes them more efficient and less large	2013-05-06 16:45:54 +02:00
Michael Peter Christen	3841854c97	abstraction of catchall term	2013-05-04 00:14:22 +02:00
Michael Peter Christen	ea85674be2	added the date to error documents	2013-05-04 00:14:00 +02:00
orbiter	7de5b9cfa0	fix for http://bugs.yacy.net/view.php?id=233 - check geolocation coordinates and accept only those, which are well-formed - the solr push process does not stop crawling any more if after 20 requests to Solr Solr does not accept the record. Instead, a severe log entry asks the user to create a bug request	2013-05-03 00:24:39 +02:00
Michael Peter Christen	bb4bf3d8fd	infinity timeout bug protection patch	2013-04-30 11:06:48 +02:00
Michael Peter Christen	d1be4127e7	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-04-29 19:31:40 +02:00
Michael Peter Christen	f36a7da5f6	- re-introduced existById in solr connector. - introduced raw-queries for the re-introduced byId-Queries (they are hopefully faster than full edismax queries) - removed the cached solr connector (testing this) to rely only on the solr built-in search caches. That should save some RAM (also). We will see if this is usable.	2013-04-28 21:20:14 +02:00
reger	46fa800bc7	added httpstatus_i to automatically switched on fields (used in all search queries)	2013-04-27 03:11:44 +02:00
Michael Peter Christen	3502b4c697	refactoring (renaming) of yacy-solr api	2013-04-27 01:32:18 +02:00
Michael Peter Christen	3a0fcfbeda	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-04-26 10:50:08 +02:00
Michael Peter Christen	25499eead5	- added a new field for the regular expression in crawl start - added the field in crawl profile - adapted logging and error management - adapted duplicate document detection - added a new rule to the indexing process to reject non-matching content - full redesign of the expert crawl start servlet The new filter field can now be seen in /CrawlStartExpert_p.html at Section "Document Filter", subsection item "Filter on Content of Document"	2013-04-26 10:49:55 +02:00
orbiter	e1bfe9d07a	- reduction of the concurrently running processes to make YaCy more adjusted to smaller and 1-core devices. - the workflow processor now starts no process at all. these are started as soon as parser/condenser/indexing queues are filled. - better abstraction	2013-04-25 11:33:17 +02:00
Michael Peter Christen	c091000165	added collection attribute also to the rss feed reader	2013-04-24 01:14:35 +02:00
orbiter	f7571386a3	added a 'collection' property attribute in yacysearch.html which can be used to select between different collections as defined during a crawl start with the 'collection' attribute. This actually implements the ability to prepare search tenants which restrict their search results to a specific collection. The main use for this is to provide tenants to the yaml4 interface (at this time).	2013-04-23 20:42:54 +02:00
Michael Peter Christen	d937c55204	extended limitation of dom export size from 100000 to 100000000	2013-04-22 22:33:13 +02:00
Michael Peter Christen	50421171c3	added new schema fields: hreflang_url_sxt and hreflang_cc_sxt for http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077 navigation_url_sxt and navigation_type_sxt for http://googlewebmastercentral.blogspot.de/2011/09/pagination-with-relnext-and-relprev.html publisher_url_s for http://support.google.com/plus/answer/1713826?hl=de all fields are disabled by default and not written to the index.	2013-04-18 17:21:17 +02:00
Michael Peter Christen	566d6c980c	checking of document signature for a double-document check now refers only to documents within the same domain	2013-04-17 16:15:27 +02:00
Michael Peter Christen	d05dc07cff	setting of new default values for ranking	2013-04-16 15:02:00 +02:00
Michael Peter Christen	97775fbebc	fixed ranking for add-function queries: this did not work. The option was removed. All function queries are now boosts (multiplies the score according to a function). This is also the recommended way to boost rankings based on functions as explained in http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/	2013-04-16 14:45:14 +02:00
Michael Peter Christen	7ab5093321	added new solr title_exact_signature_l and description_exact_signature_l to be able to identify unique title and unique description fields.	2013-04-16 01:35:15 +02:00
Michael Peter Christen	f24ac518e6	redesign of exists()-query (can now be called with query) and the CachedSolrConnector which based its cache on the key value. This will be used to correct the title_unique_b and description_unique_b field.	2013-04-15 14:08:30 +02:00
Michael Peter Christen	27d6222880	added new field host_extent_i which, after a crawl and postprocessing, holds the number of documents for the host where the document is hosted. This is necessary for ranking and the norming of references per local host in the ranking computation.	2013-04-14 20:52:40 +02:00
reger	518b20147c	skip postprocessing during document.store if no citation index connected (prevent null pointer exception)	2013-04-14 02:01:27 +02:00
Michael Peter Christen	ada3f27de7	added three new fields for a better ranking: references_internal_i, references_external_i and references_exthosts_i. These can be used to count and evaluate the number of external links to every web page. An experimental ranking function can be, e.g.: div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))	2013-04-12 16:17:14 +02:00
Michael Peter Christen	082e3274d6	- setting the same default ranking in the solr interface as for YaCy search interfaces if no other ranking attributes are given - using the YaCy ranking in the GSA interface only if there was not given a GSA-style sort attribute - to avoid confusion about correct ranking attributes, only the default '0'-ranking profile is used and not scenario-adopted (site, date) because that should be configurable in the web interface before it is used actually for ranking.	2013-04-12 10:48:41 +02:00
Michael Peter Christen	a20941c067	resume paused crawls on startup; user expects that restarts 'heal' everything	2013-04-11 15:07:08 +02:00
Michael Peter Christen	edc0b33f6d	- showing references count and clickdepth in host browser - fixed generation and presentation of both values	2013-04-11 14:46:13 +02:00
reger	566a3b0294	fix: Index Administration > Reverse Word Index (IndexControlRWIs_p) corrected use of word search to word-hash search - removed duplicate QueryParams.hashes2Handles , redundant with .hashes2Set	2013-04-08 21:25:21 +02:00
Michael Peter Christen	cf0acd2cb4	upgrade to solr 4.2.1	2013-04-06 16:11:24 +02:00
orbiter	e4d26d1cb4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-03-17 10:52:42 +01:00
orbiter	940c6849ee	enhanced did-you-mean (a bit): can now remember previously searched words (plus small enhancements)	2013-03-17 10:52:31 +01:00
reger	d57b221921	add: reset Solr schema field selection to default button in IndexSchema_p	2013-03-17 03:46:29 +01:00
Michael Peter Christen	9406a2e438	fixed NPE during index abstract computation	2013-03-15 10:04:27 +01:00
Michael Peter Christen	2d36a7eaf5	- do not create a new query for all remote peers - no document search this time - adjusted banner and network to not show 'WORDS' but DHT Chunks. This is to avoid confusion for robinson peers which do not create Word Entries	2013-03-15 00:14:28 +01:00
Michael Peter Christen	4af0839be2	use appropriate ranking for each search situation: - when using the /date modifier, a date ranking profile is used - when using a site: modifier, a ranking profile supporting longer urls is used	2013-03-14 21:13:12 +01:00
Michael Peter Christen	b8ed66a55d	added all clickdepth computations for source and target paths in webstructure core	2013-03-14 17:54:33 +01:00
Michael Peter Christen	6300730d7f	refactoring of clickdepth computation as preparation for clickdepth computation of webgraph links	2013-03-14 12:13:02 +01:00
Michael Peter Christen	2080fc7406	removed unused tag fields	2013-03-14 10:35:21 +01:00
orbiter	6b13dd0d3d	added clickdepth field writing for webgraph core (unfinished)	2013-03-14 01:35:38 +01:00

840 Commits