yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
luccioman	654801523e	Fixed StringIndexOutOfBoundsException case. Revealed by commit `c77e43a` : the exception was then thrown when indexing pages containing mailto: scheme URL links with the Solr Webgraph core enabled. Fixed the error case and restored filtering on mailto links in Document.resortLinks() as these URLs still should not appear in Document.hyperlinks.	2017-05-09 18:32:47 +02:00
luccioman	edd7ccac40	Added some JavaDoc	2017-05-02 09:33:11 +02:00
luccioman	79fdf14b0a	Fixed regression introduced by commit `9ad4d16` On MediaWiki dump imports, the SurrogateReader was trying to unread too many bytes, then failing with the following exception : "java.io.IOException: Push back buffer is full".	2017-05-02 09:32:04 +02:00
Michael Peter Christen	7678fd67e3	copied fix from yacy_grid_parser for wrong array type	2017-05-01 11:44:26 +02:00
reger	9ad4d16829	Add a responsHeader to the solr index export with a format identifier and export parameter (in accordance with response xml format) for easier format detection on import.	2017-04-30 23:53:52 +02:00
luccioman	527d494c1a	Fixed "Unchecked conversion" compilation warnings.	2017-04-24 13:27:07 +02:00
reger	c77e43a391	Take out mailto collect in internal parsed document As earlier plans to make use of mailto as separate webgraph entity didn't materialize (see http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5726&p=32493&hilit=mailto#p32493) free the unused handling and resources.	2017-04-20 00:18:18 +02:00
reger	bec34d3546	Add url input field as source for WarcImporter allowing to import warc from url without prior download.	2017-04-16 04:25:29 +02:00
luccioman	f66438442e	Extended Mediawiki dump import to remote URLs. When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote file now is directly streamed and processed, allowing import of several GB dumps even with a low memory remote peer, and without need to manually download the dump file first.	2017-04-14 14:32:44 +02:00
reger	ba339a2a45	Add servlet to import warc file from filesystem IndexImportWarc_p.html. Apply Importer interface to WarcImporter	2017-04-02 03:32:21 +02:00
reger	510f11d374	Implement surrogate import from Warc archives (as first option handle warc = Web ARChive File Format). Warc files with extension .warc or compressed warc.gz can be placed in the DATA/surrogate/in and contained responses are imported to the index. The used library is stream based so we can easily extend it later to use and load warc's from the net.	2017-03-31 00:58:11 +02:00
reger	209a7374bd	remove unused import pdfParser	2017-03-09 22:57:51 +01:00
reger	de1c1c16db	Improve pdf text extraction resource handling. For short pdf <= 3 pages use already extracted content, only for long pdf > 3 pages reassign content and close internal writer (to directly free buffers)	2017-03-09 22:56:33 +01:00
reger	18c7563dbe	Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages by using icu.ULocale for languages not already covered (ICU normalizes to ISO639-1 2 char codes). Add test class Use DublinCore vocabulary declarations in DCEntry and SurrogateReader for easier usage debugging, Init SurrogateReader.inputSource on first use.	2017-03-05 02:26:10 +01:00
reger	f254fcfc67	fix htmlParser <script> text extraction on code containing expression recognized as tag like 1<a reported in https://github.com/yacy/yacy_search_server/issues/109 Script content is ignored by default, but the text is filtered for html tags. Modified scraper to skip tag filtering while within a <script> section (until a closing tag </script> is detected). Possible side effect, missing </script> end-tag will truncate trailing content text.	2017-02-24 01:25:32 +01:00
Michael Peter Christen	02d0b3172c	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	2017-01-24 15:56:37 +01:00
Michael Peter Christen	d4f45cf05e	added dc.date.modified and dc.date.created to date parser	2017-01-24 15:56:29 +01:00
reger	df80c57842	add ukr and pol to DCEntry.getLanguage ISO639-2 3-char language code conversion to deliver uk, pl 2-char code and use if else to return on match	2017-01-22 00:01:18 +01:00
luccioman	6a4d51d8f9	Cleaned up some Javadoc warnings.	2017-01-09 16:44:47 +01:00
reger	4c9be29a55	fix concurrency issue with htmlParser using not current scraper data resulting in incorrect data for some html index metadata. Details see http://mantis.tokeek.de/view.php?id=717	2017-01-06 03:01:52 +01:00
reger	b522d540b9	Include itemprop latitude/longitude (see schema.org) in attribute parsing for lat/lon. Harmonize number parsing for lat/lon to parseDouble. Fix endDate_dts value assignment.	2016-12-25 23:39:55 +01:00
reger	083df255e4	fix html tag attribute parsing containing attribute w/o value e.g. itemscope or autofocus (in such case the next key was not properly recognized).	2016-12-24 06:57:11 +01:00
reger	cb95b7339a	include html5 <time> tag in content scraper, add "datetime" property of <time> tag to scrapers startdate list. Datetime is parsed as iso8601 (xml) date, html5 allows partial as well as duration (not handled by this)	2016-12-24 03:11:35 +01:00
reger	c50e23c495	reduce creation of empty legacy RequestHeader() in situation where null is acceptable (less for garbage collection).	2016-12-18 02:38:43 +01:00
luccioman	d27adc2b92	Fixed language detector initialization and NullPointerException cases. NullPointerException occurred when using an Identificator instance which encountered an error in its constructor. This error could be caused by a missing "langdetect" folder in the current folder of the main process, or by simultaneous first calls to the constructor, initializing concurrently the DetectorFactory.langlist. Fixes the mantis 714 (http://mantis.tokeek.de/view.php?id=714)	2016-12-05 18:12:21 +01:00
luccioman	3f561c1635	Fixed a NullPointerException case. Could occur when a search request was performed just after peer startup, and the Switchboard Thread "LibraryProvider.initialize" had completed, thus requesting a ProbabilisticClassifier not completely initialized (and having a null contexts property).	2016-12-02 13:45:45 +01:00
luccioman	3092a8ced5	Fixed thread name consistency for improved monitoring. Some tasks were modifying the current thread name without restoring it once finished as it is effectively done elsewhere.	2016-11-23 17:59:52 +01:00
luccioman	eec5779889	Added a name prefix to pooled threads for easier monitoring. Using JVM monitoring tools, it is then easier to identify tasks running inside thread pool with a custom prefix rather than the generic one : "pool-".	2016-11-23 11:21:14 +01:00
reger	8fe28a83f2	harmonize used lastmodified date for rwi and fulltext in storeDocument	2016-11-02 03:43:39 +01:00
luccioman	f0639d810c	Customized name for Threads still using the default "Thread-n" pattern. This makes threads monitoring easier to read.	2016-10-22 17:17:21 +02:00
luccioman	47af33a04c	Advanced Crawl from local file : better processing of large files. Applied strategy : when there is no restriction on domains or sub-path(s), stack anchor links once discovered by the content scraper instead of waiting for the complete parsing of the file. This makes it possible to handle a crawling start file with thousands of links in a reasonable amount of time. Performance limitation : even if the crawl starts faster with a large file, the content of the parsed file still is fully loaded in memory.	2016-10-21 13:03:31 +02:00
luccioman	7717a3d43d	Fixed license headers on files created to improve favicon management.	2016-10-14 11:55:49 +02:00
luccioman	6e1959f469	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git Conflicts: htroot/yacysearchitem.java source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java source/net/yacy/search/schema/CollectionConfiguration.java source/net/yacy/server/serverObjects.java	2016-10-14 11:29:55 +02:00
reger	b752bcfecb	adjust date in text detection to ignore some program version strings like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650 + expand test case	2016-10-06 23:37:12 +02:00
reger	b017e97421	optimize condenser language detection a little. langdetect probabilities take letter case into account, add words from description and anchors etc. as is. + add it to javadoc	2016-10-06 19:03:52 +02:00
reger	ae3717d087	adjust Tokenizer sentence count to ignore repeated punctuation (like !!!! ) + remove unused sentenceword map (we use only the count) + upd test case for sentence count	2016-10-06 03:41:07 +02:00
reger	474f0476c6	adjust Tokenizer sentence count on trailing text after last recognized sentence + upd test case for rwi multi-word-query (leaving results known to fail untested)	2016-10-05 05:52:37 +02:00
reger	14f7577231	add support for older Word versions (Word6/Word95) to docParser	2016-10-03 01:52:51 +02:00
reger	1a79c64495	generalize DateDetection with holiday date rules readily available in icu to make sure current dates are recognized (was fixed to 2014 - 2016) + adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text + moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing + add test case for parseline (used by query parser)	2016-10-02 03:19:12 +02:00
reger	6f68f08354	correct DateDetection Silvester date add Thanksgiving	2016-10-01 03:16:27 +02:00
reger	efcb6a1e74	fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison) + add mime text/xml as in use for rss in the wild	2016-09-23 23:37:12 +02:00
luccioman	b1b8e69da8	Fixed NullPointerException cases	2016-09-22 11:25:33 +02:00
reger	a4465c97d6	as requested, disable/remove old swf parser http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5861#p33098	2016-09-13 02:47:36 +02:00
reger	96467c5467	remove not needed counter in Tokenizer (completing last changes) including a small change, word posintext counting. We remember/store 1st posintext. Previously following words got a handle (posintext) excluding found. Now it just counts and assigns true posintext as handle (posintext)	2016-09-10 18:23:09 +02:00
reger	272cdd496a	reactivate sentence counter in WordTokenizer for phrasepos ranking, by counting punctuation (delivered as 1 char word) again.	2016-09-07 02:16:16 +02:00
Michael Peter Christen	5e165a8150	removed unused imports	2016-09-06 18:46:24 +02:00
reger	e310ec5f70	fix posInText ranking calculation to score 0 on no position info + fix Word posInText calc in Tokenizer to start with 1 + test case	2016-09-06 00:05:59 +02:00
reger	4c7a77662a	eliminate dependency on file-extension in storeDocument but use supported mime-type to also support handling of urls w/o corresponding file-extension. For this refactor use of document.getParserObject() to always return a Parser (for clean logic) and define/move the scraperObject as local var of AbstractParser. Adjust related calls to getParserObject (where actually a scraperObject is wanted). Additionally skip appending url token to parsed text for dht metadata entries (by default returned as result by rwi index).	2016-08-14 03:53:16 +02:00
reger	ebde21079a	refactor xlsParser to include Excel file attribute (like author) in parser result doc. Similar to ppt and doc parser, completing a TODO in xlsParser.	2016-08-13 23:46:36 +02:00
reger	27163af0e1	improve detection of referenced links by taking http and https link protocol into account + correct query start detection of commit `f89d4eb51d`	2016-07-17 23:42:25 +02:00
luccioman	6e96c7341a	Merge remote-tracking branch 'origin/master' Conflicts: htroot/Load_MediawikiWiki.java htroot/Load_PHPBB3.java htroot/ViewImage.java	2016-07-03 18:59:00 +02:00
reger	9e94989237	upd to PDFBox 2.0.1	2016-05-20 23:12:16 +02:00
reger	24b0fa2a38	extend snapshot Html2Image.pdf2image to use PDFBox image export capability if no external tool installed (and for Win) Resulting jpg are not always perfect (if graphic included) but imho sufficient.	2016-05-16 02:13:33 +02:00
reger	1d940e5a94	upd commons-compress 1.11	2016-04-16 23:31:03 +02:00
reger	764f5100f0	fix delete of temp file after odt & ooxml parser. Close zipfile after parsing	2016-03-04 23:05:55 +01:00
reger	06d0e2aeb9	result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method. - Above brought up that parser start url parameter, declared as AnchorURL uses only methods of parent object DigestURL (changed parameter declaration accordingly).	2016-02-16 02:05:58 +01:00
luc	9f712146df	Display icons in ViewFile "links" mode.	2016-02-10 10:08:07 +01:00
reger	6f0b073bf3	override detected language (statistic langdetect) only with TLD determined language if langdetect probability is not high. + additionally truncate zh-cn / zh-tw returned by langdetect to 2 char ISO639-1 zh used by YaCy	2016-02-07 21:16:22 +01:00
reger	b65e2b527d	include use of condenser's content text for language detection. Language identification may show poor performance on documents with short or no title but clear lang indication in text content. Using content text too improves lang detection. + remove double caching of text in Identificator	2016-02-07 01:52:32 +01:00
luc	3cc5619d93	Improved HTML icons indexing and rendering in search results. See http://mantis.tokeek.de/view.php?id=629	2016-02-02 09:57:54 +01:00
reger	2048b7e057	support scraping start-/enddate from html tag with property "datetime" This may be used in html5 <time> tag (which we don't explicitly support yet for date in content scraping).	2016-01-26 21:27:44 +01:00
reger	900d4584ba	complete resource cleanup of lists in contentscraper's close()	2016-01-25 23:54:20 +01:00
reger	1f18653de0	pass parsed swf content through htmlscraper Swf may contain subset of html tags which should appear as text. Especially <font> tag may totally screw up metadata servlet if not filtered out.	2016-01-21 02:55:05 +01:00
reger	18ecf57792	add support of compressed swf to swfParser from JavaSWF2 (source compatible to WebCat). Moved swf file signature check to parser Changed use of synced vector to list swf InStream	2016-01-20 00:58:29 +01:00
reger	ff27824964	fix swfParser reading file signature before passing to library (current version expects data w/o signature)	2016-01-10 01:16:31 +01:00
luc	571bc55937	Refactoring : use StandardCharsets constants instead of hard-coded charset names.	2016-01-05 23:37:05 +01:00
reger	46ac0867ff	fix poison mediawikiimporter output queue also after ExecutionException in worker thread. Writer of importer needs a poison to close the file. On exception (e.g. OOM) add a poison marker in outermost try/catch to assure output queue will terminate in this condition too (and closes+renames the surrogate/in/xxx.prt file)	2015-12-28 02:32:00 +01:00
reger	a7591d3ed0	fix mediawikiimporter number format exception on coordinate parsing handle incomplete metadata like "NS=43/50//N". For other {expr ... } type entries a try catch added	2015-12-27 01:59:15 +01:00
reger	e84d94f8ca	fix mime table for ms office / open office documents (causing wrong parser detect in intranet mode)	2015-12-22 17:48:24 +01:00
reger	45b9bd8403	adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters, and feeding hyperlinks to webgraph processing.	2015-12-21 04:42:26 +01:00
reger	0c5548a7ff	fix (todo) remove redundant holding of email link nameproperty in parser document	2015-12-18 02:35:44 +01:00
reger	6b7c10cef8	fix dc:date in mediawikiimporter/document.writexml to use lastmodified	2015-12-17 02:53:10 +01:00
reger	14803d58cd	let html scraper accept html5 <link rel="icon"> for favicon links	2015-12-17 00:36:08 +01:00
reger	4d2b934487	prevent mailto links getting into parser result document's in/outbound link collection by checking mailto scheme early. - fix upper case mailto protocol assignment - add test case for getProtocol	2015-12-16 03:01:17 +01:00
luc	8ebefa4233	Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was failing. Looks like it was broken since Commit `b43811d38c`	2015-12-08 03:34:03 +01:00
luc	7736ee5a42	Updated MediaWimporter main() : display usage in console and stop properly without calling System.exit	2015-12-08 03:30:51 +01:00
luc	27d11f8671	Fixed isSolrDump function : PushBackInputStream was not unread when returning false (for example with a WikiMedia dump).	2015-12-07 21:58:36 +01:00
Michael Peter Christen	135a123a77	less logging in new language detection	2015-12-03 00:39:15 +01:00
Michael Peter Christen	d6e9834040	Merge branch 'master' of https://github.com/Scarfmonster/yacy_search_server # Conflicts: # .classpath # build.xml	2015-11-30 16:54:54 +01:00
Michael Peter Christen	d82d311995	Merge branch 'master' of https://github.com/luccioman/yacy_search_server # Conflicts: # .classpath	2015-11-30 13:34:10 +01:00
reger	e163ea88f6	fix vsdParser (Visio) parser return statement (final block un-necessary throw)	2015-11-28 02:43:38 +01:00
luc	f0478bb14d	BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys imageio-bmp-3.2 library. - better BMP format flavours support - handle PNG encoded icons - handle transparency Added some javadoc url references to .classpath	2015-11-20 09:38:16 +01:00
reger	0d3c5b223e	have psParser cleanup temp file	2015-11-17 23:45:29 +01:00
reger	7d0d19cb8e	avoid File.deleteOnExit() on temp files JVM registers each file in a list regardless of already deleted and never cleans up the list during runtime. This accumulates to a considerable amount of mem during large crawls and/or long uptime. To tackle this, all temp files are now created in a subdir of java.io.tmpdir and the jvm tmpdir property is set to this subdir, which is deleted by code on shutdown. Additionally let pdfParser use this tmp subdir too.	2015-11-17 22:27:07 +01:00
reger	02e4489a23	set tmpfile.deleteOnExit by default, to make sure files are removed on shutdown.	2015-11-16 21:37:45 +01:00
reger	52a9040ae6	Sort out double keywords (dc_subject) early in parsed documents - by direct using Set vs. List - remove not needed String[] getter	2015-11-13 01:48:28 +01:00
reger	20e18d79f8	harmonize document title for archive parsers	2015-11-10 01:29:13 +01:00
reger	112ae013f4	update bzip and bzip parser process, to return one document for the file with combined parser results of the containing file and registers it with supplied url and mime of the archive.	2015-11-07 19:13:18 +01:00
reger	e76a90837b	update zip and tar parser process, to return one document for the file with combined parser results of the containing files.	2015-11-06 23:58:55 +01:00
reger	8532565c7d	optimize order of parsers to try - start with a parser matching the remote supplied mime	2015-11-04 21:52:02 +01:00
reger	5d71fc70e3	fix tarParser early exit on looping content - adjust check of data available according to doc - return null on no recognized content (to not exit TextParser next parser try) - use commons.compress directly	2015-11-03 22:14:14 +01:00
reger	2fcf6f104c	fix bzipParser recognition - Bzip2Inputstream checks magic byte itself to identify bz2 (leave it in input) - try to supply fitting mime for parsing bz2 content	2015-11-03 03:35:01 +01:00
reger	bbe9df2bb3	fix MediawikiImporter for bz2 dump skip reading bz2 file magicbyte to identify bz2 format as inputstream reset would be required. Common compress reads and checks the magicbytes internally and throws ioexception if wrong, making preread obsolete.	2015-10-25 03:06:15 +01:00
reger	c6687dd560	fix a system.out to log.fine in bmpParser	2015-10-25 00:26:45 +02:00
Michael Peter Christen	ac034db8bc	Merge branch 'master' of https://github.com/luccioman/yacy_search_server # Conflicts: # htroot/js/highslide/highslide.js # source/net/yacy/document/ImageParser.java	2015-10-24 11:22:35 +08:00
luc	5902ce032e	Corrected NullPointerException case when ImageIO reader is not found for image format.	2015-10-19 14:11:26 +02:00
reger	c6495a5b62	add a log entry on parsing ajax crawling scheme snapshot (prev. commit `9252e36aeb`)	2015-10-18 06:19:12 +02:00
reger	9252e36aeb	implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content see freshly deprecated https://developers.google.com/webmasters/ajax-crawling/ Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in header and parses supplied html snapshot instead of mostly empty ajax/scripted page. Implementation supports also hash-bang urls (url with anchor starting with ! like ...path#!hashfragment) but our crawler filters it (use of hash-bang is controversially discussed and proposal is deprecated, makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time). Quick - how does it work - if metatag fragment with content "!" is found - htmlparser tries to get content of html snapshot (using a different url) - htmlparser returns 2 documents (original url and snapshot content - but using same original url) - after parsing result documents are joined (and stored to index containing content also from snapshot page... as the original ajax page contains typically no parseable html content)	2015-10-18 05:51:01 +02:00
Michael Peter Christen	7d075a1d76	added log lines	2015-10-16 23:30:04 +02:00
luc	d6522fa4a2	Integrated haraldk/TwelveMonkeys library to first add TIF image format support.	2015-10-15 10:06:51 +02:00
reger	78e8c6f3e5	refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES not used for genericImageParser	2015-10-11 01:23:52 +02:00
reger	d54c5d310a	add links with image extension not automatically to image links. With the wide spread use e.g. of Wikimedia the url file extension of links with image extension often point to html.	2015-10-10 23:49:58 +02:00
reger	851e8f6c8a	check jpeg file signature in genericImageParser to fail early without further object allocation if source is not a jpeg.	2015-10-05 01:58:31 +02:00
reger	d5330391de	remove some unused var allocation in parser	2015-10-01 23:11:58 +02:00
reger	7c82cd4415	add an end condition to svgParser for wrong content (if parser chosen just by file extension)	2015-09-29 22:57:33 +02:00
reger	356d4d1301	remove rdfParser from init (current function identical with genericParser)	2015-09-26 17:30:34 +02:00
reger	c647d899e3	add svgParser to parse metadata from svg images Reads document level included title and description and skips the graphic content to save bandwidth. svg metadata element is not interpreted - remove rdfParser from init (current function identical with genericParser)	2015-09-26 17:27:33 +02:00
reger	bad34804fe	optimize parseInt for <img> tag attribute parsing Performance is better than using Numberformat.parse or parseInt(substring())	2015-09-26 15:42:23 +02:00
reger	2f51baff4f	check for loading error (includes unsupported formats) to prevent blank thumbnail display in image search because of not handled sources which don't load on click. Now the cross icon indicates the problem (including not supported format)	2015-09-24 01:58:19 +02:00
reger	a3195d78ae	add Portuguese month names to date recognition	2015-09-20 23:28:42 +02:00
reger	d2cc11ea8f	fix html parser taking <style> content as text. Noticed some result description contain css content from style tag. Added <style> to tag list to scrape its content not as text + test case included	2015-09-19 05:30:55 +02:00
reger	1e8369e18b	use a parsed date in Document.toString	2015-09-12 22:00:40 +02:00
reger	41c4eade51	extract modification date from vCard (vcfParser)	2015-09-06 04:28:27 +02:00
reger	8768896975	extract lastmodified from openoffice doc set lastmod date in office document parsers	2015-09-06 00:04:54 +02:00
sixcooler	a3dd4be749	added / corrected charset to be 1.7 compatible. @Orbiter: please check if this is ok for you	2015-08-10 20:53:20 +02:00
Michael Peter Christen	df3314ac1a	added a new facet type based on a probabilistic classifier using bayesian filters. This can be used to classify documents during indexing-time using a pre-defined bayesian filter. New wordings: - a context is a class where different categories are possible. The context name is equal to a facet name. - a category is a facet type within a facet navigation. Each context must have several categories, at least one custom name (things you want to discover) and one with the exact name "negative". To use this, you must do: - for each context, you must create a directory within DATA/CLASSIFICATION with the name of the context (the facet name) - within each context directory, you must create text files with one document each per line for every category. One of these categories MUST have the name 'negative.txt'. Then, each new document is classified to match within one of the given categories for each context.	2015-08-10 14:27:44 +02:00
Michael Peter Christen	7b412e8c07	added msg (text emails) format; should be handled by html parser.	2015-07-08 17:36:37 +02:00
Ryszard Goń	59096935d0	Use language-detection library for increased accuracy	2015-07-02 18:41:13 +02:00
Michael Peter Christen	90f75c8c3d	added enrichment of synonyms and vocabularies for imported documents during surrogate reading: those attributes from the dump are removed during the import process and replaced by new detected attributes according to the setting of the YaCy peer. This may cause that all such attributes are removed if the importing peer has no synonyms and/or no vocabularies defined.	2015-07-02 00:23:50 +02:00
Michael Peter Christen	7829480b82	refactoring: separated condenser and tokenizer	2015-07-01 18:28:18 +02:00
Michael Peter Christen	593de05922	enhanced surrogate import process speed (dramatically!)	2015-06-29 12:28:34 +02:00
reger	7478338a40	remove augmented parsing activation from frontend experimental implementation not used and based on error prone experimental rdfaparser	2015-06-05 00:51:00 +02:00
reger	11aa2edfe1	remove RDFa parser activation from frontend reason: experimental implementation of RDFa parser not executed (limited to special urls) but may cause error on normal html parsing due to an inputstream.reset	2015-06-05 00:15:16 +02:00
Michael Peter Christen	d0aff91f23	fix for index import	2015-06-01 01:56:09 +02:00
Michael Peter Christen	b43811d38c	added surrogate import process for exported solr dumps. Just throw your solr dump file into DATA/SURROGATES/in/ and it will be imported!	2015-05-30 13:19:59 +02:00
reger	8a9622c31c	fix string OoB on getImagelinks with long alttext in description calculation	2015-05-24 01:59:40 +02:00
Michael Peter Christen	ff29b0e503	added option to re-index exported xml snapshot dumps to HTCACHE/snapshots by just placing them in the SURROGATES/in path	2015-05-08 15:30:26 +02:00
Michael Peter Christen	6f4fe4b175	revert of `8a7c68e4c7` keeping surrogates after processing is essential for some users. If the space they are taking is too high, please set up an automatic deletion process (like a cronjob).	2015-05-08 14:01:30 +02:00
Michael Peter Christen	fed26f33a8	enhanced timezone management for indexed data: to support the new time parser and search functions in YaCy a high precision detection of date and time on the day is necessary. That requires that the time zone of the document content and the time zone of the user, doing a search, is detected. The time zone of the search request is done automatically using the browsers time zone offset which is delivered to the search request automatically and invisible to the user. The time zone for the content of web pages cannot be detected automatically and must be an attribute of crawl starts. The advanced crawl start now provides an input field to set the time zone in minutes as an offset number. All parsers must get a time zone offset passed, so this required the change of the parser java api. A lot of other changes had been made which correct the wrong handling of dates in YaCy which was to add a correction based on the time zone of the server. Now no correction is added and all dates in YaCy are UTC/GMT time zone, a normalized time zone for all peers.	2015-04-15 13:17:23 +02:00
Michael Peter Christen	b060ba900d	added parsing of contentprop attribute in html tags for content='startDate' and content='endDate'. The value of these field is now written to new solr fields startDates_dts and endDates_dts.	2015-04-13 16:20:00 +02:00
Michael Peter Christen	4cb4f67f38	added parsing of dd, dt and article html fields. The parsed result is written to special solr fields which are deactivated by default.	2015-04-12 22:02:45 +02:00
Michael Peter Christen	4d00175157	<experimental> added parsing of <article> html element. Whenever such an element occurs, the complete content of all article elements replaces the parsed <content> part of documents.	2015-04-10 16:16:20 +02:00
reger	2e8c24e02a	fix link to DeReWo download file	2015-03-11 20:02:23 +01:00
Michael Peter Christen	893889bc7b	added special terms for on: - Date modifier: tomorrow, today; i.e.: search for: "Berlin on:tomorrow" to find events happening tomorrow in Berlin	2015-03-02 13:10:05 +01:00
Michael Peter Christen	535f1ebe3b	added a new way of content browsing in search results: - date navigation The date is taken from the CONTENT of the documents / web pages, NOT from a date submitted in the context of metadata (i.e. http header or html head form). This makes it possible to search for documents in the future, i.e. when documents contain event descriptions for future events. The date is written to an index field which is now enabled by default. All documents are scanned for contained date mentions. To visualize the dates for a specific search results, a histogram showing the number of documents for each day is displayed. To render these histograms the morris.js library is used. Morris.js requires also raphael.js which is now also integrated in YaCy. The histogram is now also displayed in the index browser by default. To select a specific range from a search result, the following modifiers had been introduced: from:<date> to:<date> These modifiers can be used separately (i.e. only 'from' or only 'to') to describe an open interval or combined to have a closed interval. Both dates are inclusive. To select a specific single date only, use the 'to:' - modifier. The histogram shows blue and green lines; the green lines denote weekend days (saturday and sunday). Clicking on bars in the histogram has the following reaction: 1st click: add a from:<date> modifier for the date of the bar 2nd click: add a to:<date> modifier for the date of the bar 3rd click: remove from and date modifier and set a on:<date> for the bar When the on:<date> modifier is used, the histogram shows an unlimited time period. This makes it possible to click again (4th click) which is then interpreted as a 1st click again (sets a from modifier). The display feature is NOT switched on by default; to switch it on use the /ConfigSearchPage_p.html servlet.	2015-03-02 04:30:10 +01:00
reger	2d2299f484	fix mimetype of rss items in rss parser - remove self reference as anchor for items	2015-02-25 01:58:42 +01:00
Michael Peter Christen	b432049d59	enhanced date parsing time	2015-02-25 01:05:46 +01:00
reger	a0f04db9ea	add extracted description/subject to pptParser	2015-02-22 05:31:56 +01:00
reger	7e35518787	add extracted description/subject to docParser	2015-02-16 00:50:16 +01:00
Michael Peter Christen	1f5b5c0111	npe fix for latest scraper feature	2015-02-10 08:33:30 +01:00
Michael Peter Christen	ee97302a23	hack to make date detection faster (while it becomes a bit incomplete regarding language alternatives)	2015-02-09 18:46:06 +01:00
Michael Peter Christen	b5ac29c9a5	added a html field scraper which reads text from html entities of a given css class and extends a given vocabulary with a term consisting of the text content of the html class tag. Additionally, the term is included into the semantic facet of the document. This allows the creation of faceted search to documents without the pre-creation of vocabularies; instead, the vocabulary is created on-the-fly, possibly for use in other crawls. If any of the term scraping for a specific vocabulary is successful on a document, this vocabulary is excluded for auto-annotation on the page. To use this feature, do the following: - create a vocabulary on /Vocabulary_p.html (if not existent) - in /CrawlStartExpert.html you will now see the vocabularies as column in a table. The second column provides text fields where you can name the class of html entities where the literal of the corresponding vocabulary shall be scraped out - when doing a search, you will see the content of the scraped fields in a navigation facet for the given vocabulary	2015-01-30 13:20:56 +01:00
Michael Peter Christen	de3e373913	using precompiled CommonPattern.TAB for split	2015-01-29 02:22:28 +01:00
Michael Peter Christen	1f5047b15f	using precompiled pattern CommonPattern.SEMICOLON for splits	2015-01-29 02:19:41 +01:00
Michael Peter Christen	69eacdf4eb	applying precompiled CommonPattern.COMMA.split to all places where split(",") was used	2015-01-29 01:46:22 +01:00
reger	5ca0762179	fix: OOM on parsing ico file by genericImageParser trace: java.lang.OutOfMemoryError: Java heap space at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75) at java.awt.image.Raster.createPackedRaster(Raster.java:467) at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032) at java.awt.image.BufferedImage.<init>(BufferedImage.java:331) at net.yacy.document.parser.images.bmpParser$IMAGEMAP.<init>(bmpParser.java:149) at net.yacy.document.parser.images.bmpParser.parse(bmpParser.java:69) at net.yacy.document.parser.images.genericImageParser.parse(genericImageParser.java:116)	2015-01-24 23:17:07 +01:00
Michael Peter Christen	4144c7cc52	do not write frame links to webgraph	2015-01-06 14:14:25 +01:00
reger	3ac1d14a21	improve TexParser.mimeOf( fileextension ) by returning 1st defined in supported list. This prevents unusual mapping of supported fileextension -> mimetype (like htm=application/x-tex)	2015-01-02 04:20:02 +01:00
Michael Peter Christen	d2792a43fd	do not write iframe and embed links into webgraph, but use them anyway for crawling	2015-01-02 02:44:03 +01:00
Michael Peter Christen	6ad43c4a8b	removed debug code	2014-12-22 14:24:09 +01:00

770 Commits