yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	1a8c64117f	decreased the responseHeaderDB database which is now flushed more frequently. This will preserve more documents in the cache in case of a crash.	2013-09-11 13:03:58 +02:00
Michael Peter Christen	35ab2cef7b	added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in html meta fields to get a correct (or: better) date timestamp. The http:last-modified header mostly does not work because most CMS set it to the current date.	2013-09-10 10:31:57 +02:00
Michael Peter Christen	9cc8468b30	added tools to visualize image generation (i.e. during testing)	2013-09-09 12:58:26 +02:00
Michael Peter Christen	dbef8ccfcb	forced deletion of ZURL entries for a specific host for each host that appears in the crawl url list	2013-09-05 13:22:16 +02:00
Michael Peter Christen	e137ff4171	refactoring (in preparation for new removeHost method)	2013-09-05 09:59:41 +02:00
Michael Peter Christen	7a5574cd51	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-04 23:12:04 +02:00
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assign this count to each document. Thus, each document has a number assigned which shows how many copies of this document exist. These fields are disabled by default.	2013-09-04 23:11:53 +02:00
orbiter	26366596d9	fix for a problem which occurs when a site is crawled where the start url is redirected.	2013-09-04 16:00:47 +02:00
Michael Peter Christen	a2511b5600	turned images_alt_txt back to images_alt_sxt because it is not necessary to index the alt text. Indexed image text is in images_text_t	2013-09-04 10:47:18 +02:00
Michael Peter Christen	85b1922244	activated image type navigation for image search	2013-09-03 13:34:01 +02:00
Michael Peter Christen	9e12fdff23	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-09-03 12:22:57 +02:00
Michael Peter Christen	ab1201fdfd	fixed wrong facet count	2013-09-03 12:22:29 +02:00
Michael Peter Christen	049c3b3f2e	added an option to exclude image search results from text search. This is on by default.	2013-09-03 11:14:23 +02:00
Michael Peter Christen	69f85265e1	added an option to put image links to the crawl queue and handle these like normal documents. Using this option (by default on at this moment; this might change soon) it is possible to get the exif data into the search index to be used in image search.	2013-09-03 11:13:45 +02:00
Michael Peter Christen	e8e558a9b7	fix for content domain classification in URIMetadataNode	2013-09-03 10:49:09 +02:00
Michael Peter Christen	a8c5bfcf58	avoid creating unnecessary objects	2013-09-03 09:48:05 +02:00
Michael Peter Christen	5a0de1b77d	moving image description text to image text field	2013-09-03 09:47:27 +02:00
Michael Peter Christen	dc179bd61f	fix for catchall query goal for image search	2013-09-03 07:55:21 +02:00
reger	392174de8c	remove all_words, all_strings lists from QueryGoal - only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only	2013-09-02 23:09:43 +02:00
Michael Peter Christen	169ef8963d	one more fix for image search	2013-09-02 20:02:26 +02:00
Michael Peter Christen	cb85b22725	redesign of the image search process (with much better results; unfortunately the index schema has changed and p2p image search will not be much better until many people update)	2013-09-02 18:55:38 +02:00
reger	29967102a2	optimized QueryGoal (reducing mem and computation by removing all_hashes) - all_hashes used for text highlighting and word distance computation which can be done with include_hashes only	2013-09-02 04:19:53 +02:00
orbiter	f106345eef	link strings should not be tokenized	2013-09-01 14:35:36 +02:00
orbiter	deadeb406e	image alt tag strings should be tokenized	2013-09-01 13:48:10 +02:00
reger	d0e78082d1	return field names in index instead of in schema for SolrServerConnector.getFields	2013-08-31 06:25:12 +02:00
Michael Peter Christen	1a3e42eca4	index migration to lucene 4.4	2013-08-26 12:49:39 +02:00
Michael Peter Christen	a88a62f7aa	added a feature to set a collection for a crawl result based on a regular expression on the url: the collection attribute for a crawl start may now be either a token or a list of tokens, separated by ',' where a token is either a string or a pair <string,pattern> where the string is separated from the pattern with a ':' and the string is assigned to the document as collection only if the pattern matches the url.	2013-08-25 00:13:48 +02:00
Michael Peter Christen	3c5abedabf	NPE during shutdown fix	2013-08-24 23:36:50 +02:00
Michael Peter Christen	e4cbe9232d	fixed a crawler bug where a double-occurring url was not re-crawled because the double-check error was written to the error-db and never deleted. Now the error-db is cleared on every start and these double-messages are not written to the error-db any more.	2013-08-22 15:56:09 +02:00
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can. Without the ability to resist obeying robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on a per-crawl-start basis. Every crawl start can have a different user agent.	2013-08-22 14:23:47 +02:00
Michael Peter Christen	0f3d8890db	removed an assert which causes a shortcut call circuit	2013-08-22 10:12:25 +02:00
Michael Peter Christen	6d5fefe060	added missing files :(	2013-08-20 16:31:34 +02:00
Michael Peter Christen	554c0351dd	fix for http://bugs.yacy.net/view.php?id=286	2013-08-20 16:10:26 +02:00
Michael Peter Christen	47b1c81d08	- refactoring - generalized writing of url attributes to solr documents - added more url attributes to error documents	2013-08-20 15:46:04 +02:00
Michael Peter Christen	1c62fa7698	fix for bad snippets in gsa api	2013-08-18 10:37:25 +02:00
Michael Peter Christen	697613170d	less logging for postprocessing (this was a debugging logging with high CPU load)	2013-08-17 09:25:32 +02:00
reger	b4016ff324	- remove possible double initialization of rdfa parser - use ordered list to use preferred parser for mime/extension first (relates to html, rdfa, argument parser) - harmonize xhtml extension config for the 3 html base parsers	2013-08-14 21:12:10 +02:00
reger	f0575bd44b	FieldReIndex: omit active vocabulary fields from reindex detection	2013-08-14 00:00:30 +02:00
reger	a5019bc470	make Vocabulary Navigator tags a hard result entry filter by checking vocabulary tags also for rwi results (currently a filter is applied to the solr query) TODO: as vocabularies are only locally valid, auto-switch to Searchdom.LOCAL could be considered.	2013-08-13 03:07:25 +02:00
reger	a67a4b7d86	improve tld: query modifier filter pattern (to prevent tld:net accepting www.abcinet.org)	2013-08-12 21:20:23 +02:00
reger	02fe8b43ba	Field Re-Indexing: display list of fields in reindex queue; change servlet to display statistics on 1st click (instead of after refresh)	2013-08-11 04:51:29 +02:00
sixcooler	7f501b7c38	clear some caches before reporting low memory; do not break lines in Network-table-rows	2013-08-08 14:38:26 +02:00
reger	b355dd52c6	Index Administration - Field Re-Indexing: exclude internal Solr _version_ field from obsolete field check	2013-08-08 00:55:21 +02:00
sixcooler	8a96140f92	fix / workaround for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4750 + Seed.hash should be final	2013-08-01 16:40:58 +02:00
Michael Peter Christen	2857499467	fix to collection schema; bug appeared for _txt fields with empty String as content	2013-07-31 13:32:05 +02:00
Michael Peter Christen	dbfa865700	added a stub of a class for crawler redesign	2013-07-31 13:16:32 +02:00
Michael Peter Christen	76afcccaaf	fix for default boolean post values: the default value MUST NOT be TRUE, because it's normal that a boolean value is missing in the post argument if a checkbox is not selected. Added also some style enhancements to IndexFederated, removed the Solr attachment manual and replaced it with a link to the wiki which explains this in more detail.	2013-07-31 10:49:26 +02:00
orbiter	252c525709	fixed feed api servlet and enhanced RSSReader class	2013-07-31 06:18:30 +02:00
orbiter	d38c3c14d8	fix for CGI test	2013-07-31 05:43:58 +02:00
Michael Peter Christen	31902f54df	fix for NPE which happens within solr code at MultiMapSolrParams.java, line 52 in case that the array arr.length == 0	2013-07-30 14:32:59 +02:00
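
The collection attribute syntax described in commit a88a62f7aa can be illustrated with a small sketch. This is not YaCy's code (YaCy is written in Java); the function name `collections_for_url` is hypothetical, and two details are assumptions the commit message does not settle: that the pattern must match the whole url, and that a pattern never contains a ','.

```python
import re
from typing import List


def collections_for_url(attribute: str, url: str) -> List[str]:
    """Return the collection names that apply to `url`.

    `attribute` is a comma-separated list of tokens. A token is either
    a plain name (always assigned) or `name:pattern` (assigned only if
    the regular expression `pattern` matches the url).
    """
    assigned = []
    # Assumption: patterns contain no ',' so a plain split is enough.
    for token in attribute.split(","):
        token = token.strip()
        if not token:
            continue
        # Split at the first ':' only, so the pattern may contain ':'.
        name, separator, pattern = token.partition(":")
        if not separator:
            # Plain token: assigned unconditionally.
            assigned.append(name)
        elif re.fullmatch(pattern, url):
            # Assumption: the pattern has to match the entire url.
            assigned.append(name)
    return assigned


if __name__ == "__main__":
    attribute = r"user,docs:.*/docs/.*,pdf:.*\.pdf"
    print(collections_for_url(attribute, "http://example.org/docs/a.html"))
    print(collections_for_url(attribute, "http://example.org/x.pdf"))
```

With the attribute above, every document gets the collection `user`, while `docs` and `pdf` are assigned only to urls that match their patterns.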

2059 Commits