yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
reger	7c82cd4415	add an end condition to svgParser for wrong content (if parser chosen just by file extension)	2015-09-29 22:57:33 +02:00
reger	356d4d1301	remove rdfParser from init (current function identical with genericParser)	2015-09-26 17:30:34 +02:00
reger	c647d899e3	add svgParser to parse metadata from svg images Reads document level included title and description and skips the graphic content to save bandwidth. svg metadata element is not interpreted - remove rdfParser from init (current function identical with genericParser)	2015-09-26 17:27:33 +02:00
reger	bad34804fe	optimize parseInt for <img> tag attribute parsing Performs better than using Numberformat.parse or parseInt(substring())	2015-09-26 15:42:23 +02:00
Michael Peter Christen	6ebc2451a9	Merge pull request #14 from luccioman/master Translator refactoring : no more regular expression processing	2015-09-24 13:50:23 +02:00
reger	2f51baff4f	check for loading error (includes unsupported formats) to prevent blank thumbnail display in image search because of unhandled sources which don't load on click. Now the cross icon indicates the problem (including unsupported formats)	2015-09-24 01:58:19 +02:00
luc	5578886f6f	Merge branch 'master' of https://github.com/luccioman/yacy_search_server.git	2015-09-23 21:04:20 +02:00
luc	c38d6c1f37	Correction for mantis 535: inurl: parameter doesn't work on URLs with upper-case letters	2015-09-23 21:01:51 +02:00
reger	52e3eb4ce8	harmonize/correct assignment to Ymarkmeta.mime replace use of deprecated	2015-09-23 00:13:10 +02:00
Michael Peter Christen	87f358058e	Fix for index entries which have id's not computed as hash from the url. This makes it possible to operate with outside-computed url hashes in enterprise environments not using the built-in crawler from YaCy.	2015-09-22 11:56:17 +02:00
reger	3f2b8ab5e5	optionally include mime in p2p url exchange string if doctype decodes to ambiguous mime and default conversion is not equal to original	2015-09-22 00:12:31 +02:00
reger	a3195d78ae	add Portuguese month names to date recognition	2015-09-20 23:28:42 +02:00
reger	d2cc11ea8f	fix html parser taking <style> content as text. Noticed some result descriptions contain css content from style tag. Added <style> to tag list to scrape its content not as text + test case included	2015-09-19 05:30:55 +02:00
Michael Peter Christen	5f706797cb	patch for a bug inside of solr since solr 5.0 when using a boost function with a numeric date field: "unexpected docvalues type NUMERIC for field 'last_modified' (expected one of [SORTED, SORTED_SET]). Use UninvertingReader or index with docvalues." This is a well-known bug inside solr which currently prevents the 'sort by date' option in the YaCy search interface from being used. Without this patch no results at all are displayed (since the exception prevents that). Now there is at least a result but it is not ordered properly.	2015-09-18 02:25:44 +02:00
reger	7889fc2389	Hack to prevent Solr issue on partial update on a document containing multivalued date field (regardless if these fields part of update). Switch partial update option off in postprocessing if schema contains *_dts (multivalued date field). see http://mantis.tokeek.de/view.php?id=601	2015-09-13 20:23:15 +02:00
reger	b4cbdea1e7	adapt SolrServerConnector.add to handle error on partial update input document. In case of error we deleted the original document and added the new doc to the index. This is not valid for partial update documents (which contain only a subset of the fields). Remove the "delete" error handling step.	2015-09-13 20:19:50 +02:00
reger	98ab655917	on reindex delete index document with invalid url if discovered	2015-09-12 23:06:13 +02:00
reger	1e8369e18b	use a parsed date in Document.toString	2015-09-12 22:00:40 +02:00
luccioman	199b2ce52d	Translator refactoring : to simplify locale files writing, process keys as simple string and no more as regular expressions. Updated all locale files to adapt to refactored Translator : removed useless escaped characters and did minor corrections. Performed minor syntax corrections on some html source files. Added an util to translate all html source files with all locales without launching full YaCy application. Corrected main arguments parsing on other translation utils.	2015-09-11 17:20:11 +02:00
luccioman	4dd9c0d5d9	Merge from main repository	2015-09-08 08:54:48 +02:00
reger	3428b6f13b	improve filtering by filetype navigator. The used url-filter for filetype doesn't require ".ext" resulting in too many matches, add a sort-out filter for RWI results.	2015-09-07 02:36:22 +02:00
reger	e37a4f0b3d	prevent metadata records in index w/o valid url by throwing MalformedURL exception on URIMetadataNode creation	2015-09-06 22:19:05 +02:00
reger	41c4eade51	extract modification date from vCard (vcfParser)	2015-09-06 04:28:27 +02:00
reger	8768896975	extract lastmodified from openoffice doc set lastmod date in office document parsers	2015-09-06 00:04:54 +02:00
Michael Peter Christen	c40c302748	when many crawl queues are generated, this NPE can occur; probably caused as concurrency issue: W 2015/09/05 14:09:10 ConcurrentLog java.lang.NullPointerException java.lang.NullPointerException at java.util.TreeMap.rotateRight(TreeMap.java:2239) at java.util.TreeMap.fixAfterInsertion(TreeMap.java:2271) at java.util.TreeMap.put(TreeMap.java:582) at net.yacy.kelondro.table.Table.<init>(Table.java:235) at net.yacy.crawler.HostQueue.openStack(HostQueue.java:229) at net.yacy.crawler.HostQueue.getStack(HostQueue.java:204) at net.yacy.crawler.HostQueue.push(HostQueue.java:397) at net.yacy.crawler.HostBalancer.push(HostBalancer.java:237) at net.yacy.crawler.data.NoticedURL.push(NoticedURL.java:184) at net.yacy.crawler.CrawlStacker.stackCrawl(CrawlStacker.java:355) at net.yacy.crawler.CrawlStacker.job(CrawlStacker.java:134) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at net.yacy.kelondro.workflow.InstantBlockingThread.job(InstantBlockingThread.java:101) at net.yacy.kelondro.workflow.AbstractBlockingThread.run(AbstractBlockingThread.java:82) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)	2015-09-05 14:12:17 +02:00
reger	367fe388b9	fix exception throw after sendError in DefaultServlet - reduce debug exception logs in crawler	2015-09-05 01:57:30 +02:00
luccioman	9752bd5f88	Added utils to help translation without launching full YaCy application : - translate all source files with a locale - list all non translated files with a locale	2015-09-04 13:44:44 +02:00
luccioman	2f0f0180e2	Added a function to list files recursively.	2015-09-04 13:42:57 +02:00
luccioman	7e4c1d2282	Translator refactoring : - deleted useless new StringBuilder allocation - use of a new reusable FileNameFilter - added javadoc	2015-09-04 13:42:10 +02:00
reger	802ccaead6	fix init of error cache, use latest faildates => load_date_dt	2015-09-02 02:36:31 +02:00
reger	dba7f15073	apply same size constraint on result image from doc as for linked images see `19f1308bf0`	2015-09-01 23:22:48 +02:00
reger	4cf875336c	complete TODO: getFileExtension handle dot in query part + testcase	2015-08-31 23:28:03 +02:00
sixcooler	87e4abe393	fight the fieldcache by using DocValues: in Solr-5.x the fieldcache has moved and was not cleared anymore. This results in a huge fieldcache. (http://lucene.apache.org/#highlights-of-the-lucene-release-include https://issues.apache.org/jira/browse/LUCENE-5666) Here I try to use DocValues where it is possible. For this I used the Api-Scheme as new basis for the Solr-Schema. This needs at least a complete optimization of the Solr-Index to get a smaller FieldCache. Everything that is indexed with these settings will not use the Fieldcache at all.	2015-08-31 20:24:41 +02:00
reger	eaf0e8ff2c	start recording/indexing pixel size for image document as for linked images	2015-08-31 01:58:36 +02:00
reger	c33229fc0c	check mime prior to ext for metadata modification for images	2015-08-30 23:02:19 +02:00
reger	19f1308bf0	enforce the result images limit to > 16x16px for linked images http://mantis.tokeek.de/view.php?id=594	2015-08-30 02:19:52 +02:00
reger	0e4ba0360b	fix NPE on .yacyh result url of disconnected peer (cleanup yacyshare remaining)	2015-08-25 23:26:17 +02:00
reger	7ed812a2bf	log missing seed.port in favour of exception to prevent repeating throws	2015-08-25 02:19:00 +02:00
reger	206883f80d	fix: Preserve protocol in url proxy to connect to http/https. Display warning if https target is viewed over http	2015-08-25 01:16:41 +02:00
reger	f7b0b3b7b3	avoid runtime exception by earlier testing for seed.ip=null	2015-08-23 23:01:20 +02:00
Michael Peter Christen	906b5fd742	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	2015-08-11 00:42:46 +02:00
Michael Peter Christen	8f90767889	fix for filesystem crawl	2015-08-11 00:42:26 +02:00
sixcooler	a3dd4be749	added / corrected charset to be 1.7 compatible. @Orbiter: please check if this is ok for you	2015-08-10 20:53:20 +02:00
Michael Peter Christen	8028410ab7	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	2015-08-10 14:27:53 +02:00
Michael Peter Christen	df3314ac1a	added a new facet type based on a probabilistic classifier using bayesian filters. This can be used to classify documents during indexing-time using a predefined bayesian filter. New wordings: - a context is a class where different categories are possible. The context name is equal to a facet name. - a category is a facet type within a facet navigation. Each context must have several categories, at least one custom name (things you want to discover) and one with the exact name "negative". To use this, you must do: - for each context, you must create a directory within DATA/CLASSIFICATION with the name of the context (the facet name) - within each context directory, you must create text files with one document each per line for every category. One of these categories MUST have the name 'negative.txt'. Then, each new document is classified to match within one of the given categories for each context.	2015-08-10 14:27:44 +02:00
reger	1409cabe8b	exclude more default search fields from text copy to text_t for metadata index documents	2015-08-09 21:01:30 +02:00
reger	e2e73258ca	remove obsolete interface SearchAccumulator and unused SRURSSConnector Thread inheritance	2015-08-08 18:35:49 +02:00
Michael Peter Christen	dbbad23e12	removed warnings	2015-08-03 05:37:34 +02:00
Michael Peter Christen	500cfa9457	enhanced logging	2015-08-03 05:17:22 +02:00
Michael Peter Christen	c14bc8d9b7	revert of fq transformation (recent fix)	2015-08-03 05:15:34 +02:00


3371 Commits