yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	96ed0c980e	- added hosthash to all documents (also fail documents which is needed there for deletion), this fixes a problem for the deletion of old documents for new crawl starts - added clickdepth and citation computation for fail documents	2013-09-23 18:09:42 +02:00
Michael Peter Christen	179ad281f9	close include byte buffer after usage	2013-09-23 12:19:51 +02:00
reger	6b9a624808	remove double declaration of TLD_any_zone_filter	2013-09-23 03:01:08 +02:00
orbiter	828603e4f1	fix for 100%CPU problem in error cache cleaning process	2013-09-21 10:20:13 +02:00
orbiter	c64b51134e	hack to add all tokens from the url to text_t. This was working for the RWI index (and still is working) but not for solr-only search indexes. Maybe we should find a solution using a separate search field instead.	2013-09-21 08:57:43 +02:00
orbiter	6e8377b8ad	do not check all words with synonym library if the library is empty	2013-09-21 08:56:24 +02:00
orbiter	70ba74b23a	disabled ipv4 preference to enable ipv6-only networks like freifunk	2013-09-20 16:52:37 +02:00
orbiter	f3be1930cb	CPU problem when pusing to the error cache; wrong class, ConcurrentHashMap needed for concurrency	2013-09-20 16:51:50 +02:00
Michael Peter Christen	e40671ddb7	better and consistent deletions for error urls	2013-09-17 15:52:57 +02:00
Michael Peter Christen	2602be8d1e	- removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs are now temporary objects in memory	2013-09-17 15:27:02 +02:00
Michael Peter Christen	31920385f7	set anchor rel attribute of all links to "nofollow" if the html meta contains a robots:nofollow or if the http header contains a "X-Robots-Tag: nofollow"	2013-09-16 16:14:56 +02:00
Michael Peter Christen	57e00baf26	fix for parsing of image links inside of anchor links (image-links)	2013-09-15 23:54:46 +02:00
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - this caused that large portions of the parser code had to be adopted as well - added a counter target_order_i for anchor links in webgraph computation	2013-09-15 23:27:04 +02:00
Michael Peter Christen	3ea9bb4427	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-15 00:30:41 +02:00
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary, that a large portion of the parser and link processing classes must be adopted to carry a different type of link collection which carry a property attribute which are attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI had been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	2013-09-15 00:30:23 +02:00
reger	603368fc3e	remove redundant declaration of USER_AGENT	2013-09-14 18:29:44 +02:00
Michael Peter Christen	1a8c64117f	decreased the responseHeaderDB database which is now flushed more frequently. This will preserve more documents in the cache in case of a crash.	2013-09-11 13:03:58 +02:00
Michael Peter Christen	35ab2cef7b	added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in html meta fields to get a correct (or: better) date timestamp. The http:last-modified mostly does not work because it is set to the current date from most CMS.	2013-09-10 10:31:57 +02:00
Michael Peter Christen	9cc8468b30	added tools to visualize image generation (i.e. during testing)	2013-09-09 12:58:26 +02:00
Michael Peter Christen	dbef8ccfcb	forced deletion of ZURL entries for a specific host for each host that appears in the crawl url list	2013-09-05 13:22:16 +02:00
Michael Peter Christen	e137ff4171	refactoring (im preparation for new removeHost method)	2013-09-05 09:59:41 +02:00
Michael Peter Christen	7a5574cd51	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-04 23:12:04 +02:00
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assigns this to each document. Thus, each document there is a number assigned which shows how many copies of this document exists. These fields are disabled by default.	2013-09-04 23:11:53 +02:00
orbiter	26366596d9	fix for a problem which ocurres when a site is crawled where the start url is redirected.	2013-09-04 16:00:47 +02:00
Michael Peter Christen	a2511b5600	turned images_alt_txt back to images_alt_sxt because it is not necessary to index the alt text. Indexed image Text is in images_text_t	2013-09-04 10:47:18 +02:00
Michael Peter Christen	85b1922244	activated image type navigation for image search	2013-09-03 13:34:01 +02:00
Michael Peter Christen	9e12fdff23	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-03 12:22:57 +02:00
Michael Peter Christen	ab1201fdfd	fixed wrong facet count	2013-09-03 12:22:29 +02:00
Michael Peter Christen	049c3b3f2e	added an option to exclude image search results from text search. This is on by default.	2013-09-03 11:14:23 +02:00
Michael Peter Christen	69f85265e1	added an option to put image links to the crawl queue and handle these like normal documents. Using this option (by default on at this moment; this might change soon) it is possible to get the exif data into the search index to be used in image search.	2013-09-03 11:13:45 +02:00
Michael Peter Christen	e8e558a9b7	fix for content domain classification in URIMetadataNode	2013-09-03 10:49:09 +02:00
Michael Peter Christen	a8c5bfcf58	avoid to create unnecessary objects	2013-09-03 09:48:05 +02:00
Michael Peter Christen	5a0de1b77d	moving image description text to image text field	2013-09-03 09:47:27 +02:00
Michael Peter Christen	dc179bd61f	fix for catchall query goal for image search	2013-09-03 07:55:21 +02:00
reger	392174de8c	remove all_words, all_strings lists from QueryGoal - only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only	2013-09-02 23:09:43 +02:00
Michael Peter Christen	169ef8963d	one more fix for image search	2013-09-02 20:02:26 +02:00
Michael Peter Christen	cb85b22725	redesign of the image search process (with much better results, unfortunately the index schema has changed and p2p image search will not be muchmuch better until many people update)	2013-09-02 18:55:38 +02:00
reger	29967102a2	optimized QueryGoal (reducing mem and computation by removing all_hashes) - all_hashes used for text highlighting and word distance computation which can be done with include_hashes only	2013-09-02 04:19:53 +02:00
orbiter	f106345eef	link strings should not be tokenized	2013-09-01 14:35:36 +02:00
orbiter	deadeb406e	image alt tag strings should be tokenized	2013-09-01 13:48:10 +02:00
reger	d0e78082d1	return field names in index instead of in schema for SolrServerConnector.getFields	2013-08-31 06:25:12 +02:00
Michael Peter Christen	1a3e42eca4	index migration to lucene 4.4	2013-08-26 12:49:39 +02:00
Michael Peter Christen	a88a62f7aa	added a feature to set a collection for a crawl result based on a regular expression on th url: the collection attribut for a crawl start may be now either a token or a list of tokens, seperated by ',' where a token is either a string or a pair <string,pattern> where the string is separated to the pattern with a ':' and the string is assigned to the document as collection only if the pattern matches with the url.	2013-08-25 00:13:48 +02:00
Michael Peter Christen	3c5abedabf	NPE during shutdown fix	2013-08-24 23:36:50 +02:00
Michael Peter Christen	e4cbe9232d	fixed a crawler bug where a double-occurring url was not re-crawled because the double-check error was written to the error-db and never deleted. No the error-db is cleared on every start and these double-messages are not written to the error-db any more.	2013-08-22 15:56:09 +02:00
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	2013-08-22 14:23:47 +02:00
Michael Peter Christen	0f3d8890db	removed an assert which causes a shortcut call circuit	2013-08-22 10:12:25 +02:00
Michael Peter Christen	6d5fefe060	added missing files :(	2013-08-20 16:31:34 +02:00
Michael Peter Christen	554c0351dd	fix for http://bugs.yacy.net/view.php?id=286	2013-08-20 16:10:26 +02:00
Michael Peter Christen	47b1c81d08	- refactoring - generalized writing of url attributes to solr documents - added more url attributes to error documents	2013-08-20 15:46:04 +02:00

1 2 3 4 5 ...

6509 Commits