yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
reger	e310ec5f70	fix posInText ranking calculation to score 0 on no position info + fix Word posInText calc in Tokenizer to start with 1 + test case	2016-09-06 00:05:59 +02:00
reger	4c7a77662a	eliminate dependency on file-extension in storeDocument but use supported mime-type to also support handling of urls w/o corresponding file-extension. For this refactor use of document.getParserObject() to always return a Parser (for clean logic) and define/move the scraperObject as local var of AbstractParser. Adjust related calls to getParserObject (where actually a scraperObject is wanted). Additionally skip appending url token to parsed text for dht metadata entries (by default returned as result by rwi index).	2016-08-14 03:53:16 +02:00
reger	ebde21079a	refactor xlsParser to include Excel file attribute (like author) in parser result doc. Similar to ppt and doc parser, completing a TODO in xlsParser.	2016-08-13 23:46:36 +02:00
reger	27163af0e1	improve detection of referenced links by taking http and https link protocol into account + correct query start detection of commit `f89d4eb51d`	2016-07-17 23:42:25 +02:00
reger	9e94989237	upd to PDFBox 2.0.1	2016-05-20 23:12:16 +02:00
reger	24b0fa2a38	extend snapshot Html2Image.pdf2image to use PDFBox image export capability if no external tool installed (and for Win). Resulting jpg are not always perfect (if graphic included) but imho sufficient.	2016-05-16 02:13:33 +02:00
reger	1d940e5a94	upd commons-compress 1.11	2016-04-16 23:31:03 +02:00
reger	764f5100f0	fix delete of temp file after odt & ooxml parser. Close zipfile after parsing	2016-03-04 23:05:55 +01:00
reger	06d0e2aeb9	result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method. - Above brought up that parser start url parameter, declared as AnchorURL uses only methods of parent object DigestURL (changed parameter declaration accordingly).	2016-02-16 02:05:58 +01:00
reger	6f0b073bf3	override detected language (statistic langdetect) only with TLD determined language if langdetect probability is not high. + additionally truncate zh-cn / zh-tw returned by langdetect to 2 char ISO639-1 zh used by YaCy	2016-02-07 21:16:22 +01:00
reger	b65e2b527d	include use of condenser's content text for language detection. Language identification may show poor performance on documents with short or no title but clear lang indication in text content. Using content text too improves lang detection. + remove double caching of text in Identificator	2016-02-07 01:52:32 +01:00
reger	2048b7e057	support scraping start-/enddate from html tag with property "datetime". This may be used in html5 <time> tag (which we don't explicitly support yet for date in content scraping).	2016-01-26 21:27:44 +01:00
reger	900d4584ba	complete resource cleanup of lists in contentscraper's close()	2016-01-25 23:54:20 +01:00
reger	1f18653de0	pass parsed swf content through htmlscraper. Swf may contain subset of html tags which should appear as text. Especially <font> tag may totally screw up metadata servlet if not filtered out.	2016-01-21 02:55:05 +01:00
reger	18ecf57792	add support of compressed swf to swfParser from JavaSWF2 (source compatible to WebCat). Moved swf file signature check to parser. Changed use of synced vector to list swf InStream	2016-01-20 00:58:29 +01:00
reger	ff27824964	fix swfParser reading file signature before passing to library (current version expects data w/o signature)	2016-01-10 01:16:31 +01:00
luc	571bc55937	Refactoring : use StandardCharsets constants instead of hard-coded charset names.	2016-01-05 23:37:05 +01:00
reger	46ac0867ff	fix poison mediawikiimporter output queue also after ExecutionException in worker thread. Writer of importer needs a poison to close the file. On exception (e.g. OOM) add a poison marker in outermost try/catch to assure output queue will terminate in this condition too (and closes+renames the surrogate/in/xxx.prt file)	2015-12-28 02:32:00 +01:00
reger	a7591d3ed0	fix mediawikiimporter number format exception on coordinate parsing: handle incomplete metadata like "NS=43/50//N". For other {expr ... } type entries a try catch added	2015-12-27 01:59:15 +01:00
reger	e84d94f8ca	fix mime table for ms office / open office documents (causing wrong parser detect in intranet mode)	2015-12-22 17:48:24 +01:00
reger	45b9bd8403	adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters, and feeding hyperlinks to webgraph processing.	2015-12-21 04:42:26 +01:00
reger	0c5548a7ff	fix (todo) remove redundant holding of email link nameproperty in parser document	2015-12-18 02:35:44 +01:00
reger	6b7c10cef8	fix dc:date in mediawikiimporter/document.writexml to use lastmodified	2015-12-17 02:53:10 +01:00
reger	14803d58cd	let html scraper accept html5 <link rel="icon"> for favicon links	2015-12-17 00:36:08 +01:00
reger	4d2b934487	prevent mailto links getting into parser result document's in/outbound link collection by checking mailto scheme early. - fix upper case mailto protocol assignment - add test case for getProtocol	2015-12-16 03:01:17 +01:00
luc	8ebefa4233	Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was failing. Looks like it was broken since Commit `b43811d38c`	2015-12-08 03:34:03 +01:00
luc	7736ee5a42	Updated MediaWimporter main() : display usage in console and stop properly without calling System.exit	2015-12-08 03:30:51 +01:00
luc	27d11f8671	Fixed isSolrDump function : PushBackInputStream was not unread when returning false (for example with a WikiMedia dump).	2015-12-07 21:58:36 +01:00
Michael Peter Christen	135a123a77	less logging in new language detection	2015-12-03 00:39:15 +01:00
Michael Peter Christen	d6e9834040	Merge branch 'master' of https://github.com/Scarfmonster/yacy_search_server # Conflicts: # .classpath # build.xml	2015-11-30 16:54:54 +01:00
Michael Peter Christen	d82d311995	Merge branch 'master' of https://github.com/luccioman/yacy_search_server # Conflicts: # .classpath	2015-11-30 13:34:10 +01:00
reger	e163ea88f6	fix vsdParser (Visio) parser return statement (final block un-necessary throw)	2015-11-28 02:43:38 +01:00
luc	f0478bb14d	BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys imageio-bmp-3.2 library. - better BMP format flavours support - handle PNG encoded icons - handle transparency Added some javadoc url references to .classpath	2015-11-20 09:38:16 +01:00
reger	0d3c5b223e	have psParser cleanup temp file	2015-11-17 23:45:29 +01:00
reger	7d0d19cb8e	avoid File.deleteOnExit() on temp files JVM registers each file in a list regardless of already deleted and never cleans up the list during runtime. This accumulates to a considerable amount of mem during large crawls and/or long uptime. To tackle this, all temp files are now created in a subdir of java.io.tmpdir and the jvm tmpdir property is set to this subdir, which is deleted by code on shutdown. Additionally let pdfParser use this tmp subdir too.	2015-11-17 22:27:07 +01:00
reger	02e4489a23	set tmpfile.deleteOnExit by default, to make sure files are removed on shutdown.	2015-11-16 21:37:45 +01:00
reger	52a9040ae6	Sort out double keywords (dc_subject) early in parsed documents - by direct using Set vs. List - remove not needed String[] getter	2015-11-13 01:48:28 +01:00
reger	20e18d79f8	harmonize document title for archive parsers	2015-11-10 01:29:13 +01:00
reger	112ae013f4	update bzip and bzip parser process, to return one document for the file with combined parser results of the containing file and registers it with supplied url and mime of the archive.	2015-11-07 19:13:18 +01:00
reger	e76a90837b	update zip and tar parser process, to return one document for the file with combined parser results of the containing files.	2015-11-06 23:58:55 +01:00
reger	8532565c7d	optimize order of parsers to try - start with a parser matching the remote supplied mime	2015-11-04 21:52:02 +01:00
reger	5d71fc70e3	fix tarParser early exit on looping content - adjust check of data available according to doc - return null on no recognized content (to not exit TextParser next parser try) - use commons.compress directly	2015-11-03 22:14:14 +01:00
reger	2fcf6f104c	fix bzipParser recognition - Bzip2Inputstream checks magic byte itself to identify bz2 (leave it in input) - try to supply fitting mime for parsing bz2 content	2015-11-03 03:35:01 +01:00
reger	bbe9df2bb3	fix MediawikiImporter for bz2 dump skip reading bz2 file magicbyte to identify bz2 format as inputstream reset would be required. Common compress reads and checks the magicbytes internally and throws ioexception if wrong, making preread obsolete.	2015-10-25 03:06:15 +01:00
reger	c6687dd560	fix a system.out to log.fine in bmpParser	2015-10-25 00:26:45 +02:00
Michael Peter Christen	ac034db8bc	Merge branch 'master' of https://github.com/luccioman/yacy_search_server # Conflicts: # htroot/js/highslide/highslide.js # source/net/yacy/document/ImageParser.java	2015-10-24 11:22:35 +08:00
luc	5902ce032e	Corrected NullPointerException case when ImageIO reader is not found for image format.	2015-10-19 14:11:26 +02:00
reger	c6495a5b62	add a log entry on parsing ajax crawling scheme snapshot (prev. commit `9252e36aeb`)	2015-10-18 06:19:12 +02:00
reger	9252e36aeb	implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content, see freshly deprecated https://developers.google.com/webmasters/ajax-crawling/ Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in header and parses supplied html snapshot instead of mostly empty ajax/scripted page. Implementation supports also hash-bang urls (url with anchor starting with ! like ...path#!hashfragment) but our crawler filters it (use of hash-bang is controversially discussed and proposal is deprecated, makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time). Quick - how does it work - if metatag fragment with content "!" is found - htmlparser tries to get content of html snapshot (using a different url) - htmlparser returns 2 documents (original url and snapshot content - but using same original url) - after parsing result documents are joined (and stored to index containing content also from snapshot page... as the original ajax page contains typically no parseable html content)	2015-10-18 05:51:01 +02:00
Michael Peter Christen	7d075a1d76	added log lines	2015-10-16 23:30:04 +02:00


621 Commits