Commit Graph

770 Commits

Author SHA1 Message Date
luccioman
654801523e Fixed StringIndexOutOfBoundsException case.
Revealed by commit c77e43a : the exception was then thrown when indexing
pages containing mailto: scheme URL links with the Solr Webgraph core
enabled.
Fixed the error case and restored filtering on mailto links in
Document.resortLinks() as these URLs still should not appear in
Document.hyperlinks.
2017-05-09 18:32:47 +02:00
luccioman
edd7ccac40 Added some JavaDoc 2017-05-02 09:33:11 +02:00
luccioman
79fdf14b0a Fixed regression introduced by commit 9ad4d16
On MediaWiki dump imports, the SurrogateReader was trying to unread too
many bytes, then failing with the following exception :
"java.io.IOException: Push back buffer is full".
2017-05-02 09:32:04 +02:00
Michael Peter Christen
7678fd67e3 copied fix from yacy_grid_parser for wrong array type 2017-05-01 11:44:26 +02:00
reger
9ad4d16829 Add a responsHeader to the solr index export with a format identifier
and export parameter (in accordance with response xml format) for easier
format detection on import.
2017-04-30 23:53:52 +02:00
luccioman
527d494c1a Fixed "Unchecked conversion" compilation warnings. 2017-04-24 13:27:07 +02:00
reger
c77e43a391 Take out mailto collect in internal parsed document
As earlier plans to make use of mailto as separate webgraph entity didn't
materialize (see  http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5726&p=32493&hilit=mailto#p32493)
free the unused handling and resources.
2017-04-20 00:18:18 +02:00
reger
bec34d3546 Add url input field as source for WarcImporter
allowing to import warc from url without prior download.
2017-04-16 04:25:29 +02:00
luccioman
f66438442e Extended Mediawiki dump import to remote URLs.
When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file now is directly streamed and processed, allowing import of several
GB dumps even with a low memory remote peer, and without need to
manually download the dump file first.
2017-04-14 14:32:44 +02:00
reger
ba339a2a45 Add servlet to import warc file from filesystem IndexImportWarc_p.html.
Apply Importer interface to WarcImporter
2017-04-02 03:32:21 +02:00
reger
510f11d374 Implement surrogate import from Warc archives (as first option handle
warc = Web ARChive File Format.
Warc files with extension .warc or compressed warc.gz can be placed in the
DATA/surrogate/in and contained responses are imported to the index.
The used library is stream based so we can easily extend it later to use
and load warc's from the net.
2017-03-31 00:58:11 +02:00
reger
209a7374bd remove unused import pdfParser 2017-03-09 22:57:51 +01:00
reger
de1c1c16db Improve pdf text extraction resource handling.
For sort pdf <= 3 pages use already extracted content,
only for long pdf > 3 pages reassign content and close internal writer (to direct free buffers)
2017-03-09 22:56:33 +01:00
reger
18c7563dbe Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages
by using icu.ULocale for languages not already covered (ICU normalizes 
to ISO639-1 2 char codes).
Add test class
Use DublinCore vocabulary declarations in DCEntry and SurrogateReader 
for easier usage debugging, 
Init SurrogateReader.inputSource on first use.
2017-03-05 02:26:10 +01:00
reger
f254fcfc67 fix htmlParser <script> text extraction on code containing expression
recognized as tag like 1<a
reported in https://github.com/yacy/yacy_search_server/issues/109

Script content is ignored by default, but the text is filtered for html
tags. Modified scraper to skip tag filtering while within a <script> 
section (until a closing tag is detected </script>. 
Possible side effect, missing </script> end-tag will truncate trailing 
content text.
2017-02-24 01:25:32 +01:00
Michael Peter Christen
02d0b3172c Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2017-01-24 15:56:37 +01:00
Michael Peter Christen
d4f45cf05e added dc.date.modified and dc.date.created to date parser 2017-01-24 15:56:29 +01:00
reger
df80c57842 add ukr and pol to DCEntry.getLanguage ISO639-2 3-char language code
conversion to deliver uk, pl 2-char code
and use if else to return on match
2017-01-22 00:01:18 +01:00
luccioman
6a4d51d8f9 Cleaned up some Javadoc warnings. 2017-01-09 16:44:47 +01:00
reger
4c9be29a55 fix concurrency issue with htmlParser using not current scraper data
resulting in incorrect data for some html index metadata.
Details see http://mantis.tokeek.de/view.php?id=717
2017-01-06 03:01:52 +01:00
reger
b522d540b9 Include itemprop latitude/longitude (see schema.org) in attribute
parsing for lat/lon.
Harmonize number parsing for lat/lon to parseDouble.
Fix endDate_dts value assignment.
2016-12-25 23:39:55 +01:00
reger
083df255e4 fix html tag attribute parsing containing attribute w/o value
e.g. itemscope or autofocus (in such case the next key was not properly
recognized).
2016-12-24 06:57:11 +01:00
reger
cb95b7339a include html5 <time> tag in content scraper,
add "datetime" property of <time> tag to scrapers startdate list.
Datetime is parsed as iso8601 (xml) date, html5 allows partial as well
as duration (not handled by this)
2016-12-24 03:11:35 +01:00
reger
c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
is acceptable (less for garbage collection).
2016-12-18 02:38:43 +01:00
luccioman
d27adc2b92 Fixed language detector initialization and NullPointerException cases.
NullPointerException occurred when using and Identificator instance
which encountered and error in its constructor.
This error could be caused by a missing "langdetect" folder in the
current folder of the main process, or by simultaneous first calls to
the constructor, initializing concurrently the DetectorFactory.langlist.

Fixes the mantis 714 (http://mantis.tokeek.de/view.php?id=714)
2016-12-05 18:12:21 +01:00
luccioman
3f561c1635 Fixed a NullPointerException case.
Could occur when a search request was performed just after peer startup,
and the Switchboard Thread "LibraryProvider.initialize" had completed,
thus requesting a ProbabilisticClassifier not completely initialized
(and having a null contexts property).
2016-12-02 13:45:45 +01:00
luccioman
3092a8ced5 Fixed thread name consistency for improved monitoring.
Some tasks were modifying the current thread name without restoring it
once finished as it is effectively done elsewhere.
2016-11-23 17:59:52 +01:00
luccioman
eec5779889 Added a name prefix to pooled threads for easier monitoring.
Using JVM monitoring tools, it is then easier to identify tasks running
inside thread pool with a custom prefix rather than the generic one :
"pool-".
2016-11-23 11:21:14 +01:00
reger
8fe28a83f2 harmonize used lastmodified date for rwi and fulltext in storeDocument 2016-11-02 03:43:39 +01:00
luccioman
f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
This makes threads monitoring easier to read.
2016-10-22 17:17:21 +02:00
luccioman
47af33a04c Advanced Crawl from local file : better processing of large files.
Applied strategy : when there is no restriction on domains or
sub-path(s), stack anchor links once discovered by the content scraper
instead of waiting the complete parsing of the file. 

This makes it possible to handle a crawling start file with thousands of
links in a reasonable amount of time.

Performance limitation : even if the crawl start faster with a large
file, the content of the parsed file still is fully loaded in memory.
2016-10-21 13:03:31 +02:00
luccioman
7717a3d43d Fixed license headers on files created to improve favicon management. 2016-10-14 11:55:49 +02:00
luccioman
6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
	htroot/yacysearchitem.java
	source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
	source/net/yacy/search/schema/CollectionConfiguration.java
	source/net/yacy/server/serverObjects.java
2016-10-14 11:29:55 +02:00
reger
b752bcfecb adjust date in text detection to ignore some program version strings
like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650
+ expand test case
2016-10-06 23:37:12 +02:00
reger
b017e97421 optimize condenser language detection a little.
langdetect probabilities take letter case into account, add words from
description and anchors etc. as is.
+ add it to javadoc
2016-10-06 19:03:52 +02:00
reger
ae3717d087 adjust Tokenizer sentence count to ignore repeated punktuation (like !!!! )
+ remove unused sentenceword map (we use only the count)
+ upd test case for sentence count
2016-10-06 03:41:07 +02:00
reger
474f0476c6 adjust Tokenizer sentence count on trailing text after last recognized sentence
+ upd test case for rwi multi-word-query  (leaving results known to fail untested)
2016-10-05 05:52:37 +02:00
reger
14f7577231 add support for older Word versions (Word6/Word95) to docParser 2016-10-03 01:52:51 +02:00
reger
1a79c64495 generalize DateDetection with holiday date rules readily available in icu
to make sure current dates are recognized (was fixed to 2014 - 2016)
+ adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text
+ moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing
+ add test case for parseline (used by query parser)
2016-10-02 03:19:12 +02:00
reger
6f68f08354 correct DateDetection Silvester date
add Thanksgiving
2016-10-01 03:16:27 +02:00
reger
efcb6a1e74 fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison)
+ add mime text/xml as in use for rss in the wild
2016-09-23 23:37:12 +02:00
luccioman
b1b8e69da8 Fixed NullPointerException cases 2016-09-22 11:25:33 +02:00
reger
a4465c97d6 as requested, disable/remove old swf parser
http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5861#p33098
2016-09-13 02:47:36 +02:00
reger
96467c5467 remove not needed counter in Tokeninzer (completing last changes)
including a small change, word posintext counting. 
We remember/store 1st posintext. Previously following words got a handle (posintext)
excluding found. Now it just counts and assigns true posintext as handle (posintext)
2016-09-10 18:23:09 +02:00
reger
272cdd496a reactivate sentence counter in WordTokenizer for phrasepos ranking,
by counting punktuation (delivered as 1 char word) again.
2016-09-07 02:16:16 +02:00
Michael Peter Christen
5e165a8150 removed unused imports 2016-09-06 18:46:24 +02:00
reger
e310ec5f70 fix posInText ranking calculation to score 0 on no position info
+ fix Word posInText calc in Tokenizer to start with 1
+ test case
2016-09-06 00:05:59 +02:00
reger
4c7a77662a eleminate dependency on file-extension in storeDocument but use supported mime-type
to also support handling of urls w/o corresponding file-extension.
For this refactor use of document.getParserObject() to alway return a Parser (for clean logic)
and define/move the scraperObject as local var of AbstractParser.
Adjust related calls to getParserObject (where actually a scraperObject is wanted).
Addionally skip appending url token to parsed text for dht metadata entries 
(by default returned as result by rwi index).
2016-08-14 03:53:16 +02:00
reger
ebde21079a refactor xlsParser to include Excel file attribute (like author) in parser result doc.
Similar to ppt and doc parser, completing a TODO in xlsParser.
2016-08-13 23:46:36 +02:00
reger
27163af0e1 improve detection of referenced links by taking http and https link protocol
into account
+ correct query start detection of commit f89d4eb51d
2016-07-17 23:42:25 +02:00
luccioman
6e96c7341a Merge remote-tracking branch 'origin/master'
Conflicts:
	htroot/Load_MediawikiWiki.java
	htroot/Load_PHPBB3.java
	htroot/ViewImage.java
2016-07-03 18:59:00 +02:00
reger
9e94989237 upd to PDFBox 2.0.1 2016-05-20 23:12:16 +02:00
reger
24b0fa2a38 extend snapshot Html2Image.pdf2image to use PDFBox image export capability
if no external tool installed (and for Win)
Resulting jpg are not always perfect (if graphic included) but imho sufficient.
2016-05-16 02:13:33 +02:00
reger
1d940e5a94 upd commons-compress 1.11 2016-04-16 23:31:03 +02:00
reger
764f5100f0 fix delete of temp file after odt % ooxml parser
Close zipfile after parsing
2016-03-04 23:05:55 +01:00
reger
06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader methode.
- Above brought up that parser start url parameter, declared as AnchorURL uses only methodes of parent object DigestURL (changed parameter declaration accordingly).
2016-02-16 02:05:58 +01:00
luc
9f712146df Display icons in ViewFile "links" mode. 2016-02-10 10:08:07 +01:00
reger
6f0b073bf3 override detected language (statistic langdetect) only with TLD determided
language if langdetect probability is not high.
+ additionally truncate zh-cn / zh-tw returned by langdetect to 2 char ISO639-1 zh
used by YaCy
2016-02-07 21:16:22 +01:00
reger
b65e2b527d include use of condenser's content text for language detection.
Language identification may show poor performance on documents with short or no
title but clear lang indication in text content. Using content text too
improves lang detection.
+ remove double caching of text in Identificator
2016-02-07 01:52:32 +01:00
luc
3cc5619d93 Improved HTML icons indexing and rendering in search results.
See http://mantis.tokeek.de/view.php?id=629
2016-02-02 09:57:54 +01:00
reger
2048b7e057 support scraping start-/enddate from html tag with property "datetime"
This may be used in html5 <time> tag (which we don't explicite support yet for date in content scraping).
2016-01-26 21:27:44 +01:00
reger
900d4584ba complet resource cleanup of lists in contentscraper's close() 2016-01-25 23:54:20 +01:00
reger
1f18653de0 pass parsed swf content trough htmlscraper
Swf may contain subset of html tags which shoul'd appear as text.
Especially <font> tag may totally screw up metadata servlet if not filtered out.
2016-01-21 02:55:05 +01:00
reger
18ecf57792 add support of compressed swf to swfParser
from JavaSWF2 (source compatible to WebCat).
Moved swf file signature check to parser
Changed use of synced vector to list swf InStream
2016-01-20 00:58:29 +01:00
reger
ff27824964 fix swfParser reading file signature
before passing to library (current version expects data w/o signature)
2016-01-10 01:16:31 +01:00
luc
571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
charset names.
2016-01-05 23:37:05 +01:00
reger
46ac0867ff fix poison mediawikiimporter output queue also after ExecutionException
in worker thread.
Writer of importer keeps needs a poison to close the file. On exception (e.g. OOM)
add a poison marker in outer most try/catch to assure output queue will terminate
in this condition too (and closes+renames the surrogate/in/xxx.prt file)
2015-12-28 02:32:00 +01:00
reger
a7591d3ed0 fix mediawikiimporter number format exception on coordinate parsing
handle uncomplete metadata like "NS=43/50//N". 
For other {expr ... } type entries a try catch added
2015-12-27 01:59:15 +01:00
reger
e84d94f8ca fix mime table for ms office / open office documents
(causing wrong parser detect in intranet mode)
2015-12-22 17:48:24 +01:00
reger
45b9bd8403 adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters,
and feeding hyperlinks to webgraph processing.
2015-12-21 04:42:26 +01:00
reger
0c5548a7ff fix (todo) remove redundant holding of email link nameproperty in parser document 2015-12-18 02:35:44 +01:00
reger
6b7c10cef8 fix dc:date in mediawikiimporter/document.writexml to use lastmodified 2015-12-17 02:53:10 +01:00
reger
14803d58cd let html scraper accept html5 <link rel="icon"> for favicon links 2015-12-17 00:36:08 +01:00
reger
4d2b934487 prevent mailto links getting into parser result document's in/outbound link collection
by checking mailto scheme early.
- fix upper case mailto protocol assignment
- add test case for getProtocol
2015-12-16 03:01:17 +01:00
luc
8ebefa4233 Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was
failing. Looks like it was broken since Commit
b43811d38c
2015-12-08 03:34:03 +01:00
luc
7736ee5a42 Updated MediaWimporter main() : display usage in console and stop
properly without calling System.exit
2015-12-08 03:30:51 +01:00
luc
27d11f8671 Fixed isSolrDump function : PushBackInputStream was not unread when
returning false (for example with a WikiMedia dump).
2015-12-07 21:58:36 +01:00
Michael Peter Christen
135a123a77 less logging in new language detection 2015-12-03 00:39:15 +01:00
Michael Peter Christen
d6e9834040 Merge branch 'master' of
https://github.com/Scarfmonster/yacy_search_server

# Conflicts:
#	.classpath
#	build.xml
2015-11-30 16:54:54 +01:00
Michael Peter Christen
d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	.classpath
2015-11-30 13:34:10 +01:00
reger
e163ea88f6 fix vsdParser (Visio) parser return statement
(final block un-necessary throw)
2015-11-28 02:43:38 +01:00
luc
f0478bb14d BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
imageio-bmp-3.2 library.

 - better BMP format flavours support
 - handle PNG encoded icons
 - handle transparency
 
Added some javadoc url references to .classpath
2015-11-20 09:38:16 +01:00
reger
0d3c5b223e have psParser cleanup temp file 2015-11-17 23:45:29 +01:00
reger
7d0d19cb8e avoid File.deleteOnExit() on temp files
JVM registers each file in a list regardless of already deleted and never
cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir 
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
2015-11-17 22:27:07 +01:00
reger
02e4489a23 set tmpfile.deleteOnExit by default,
to make sure files are removed on shutdown.
2015-11-16 21:37:45 +01:00
reger
52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
- by direct using Set vs. List
- remove not neede String[] getter
2015-11-13 01:48:28 +01:00
reger
20e18d79f8 harmonize document title for archive parsers 2015-11-10 01:29:13 +01:00
reger
112ae013f4 update bzip and bzip parser process,
to return one document for the file with combined parser results of the
containing file and registers it with supplied url and mime of the archive.
2015-11-07 19:13:18 +01:00
reger
e76a90837b update zip and tar parser process,
to return one document for the file with combined parser results of the
containing files.
2015-11-06 23:58:55 +01:00
reger
8532565c7d optimize order of parsers to try
- start with a parser matching the remote supplied mime
2015-11-04 21:52:02 +01:00
reger
5d71fc70e3 fix tarParser early exit on looping content
- adjust check of data available according to doc 
- return null on no recognized content (to not exit TextParser next parser try)
- use commons.compress directly
2015-11-03 22:14:14 +01:00
reger
2fcf6f104c fix bzipParser recognition
- Bzip2Inputstream checks magic byte itself to identify bz2 (leave it in input)
- try to suppy fitting mime for parsing bz2 content
2015-11-03 03:35:01 +01:00
reger
bbe9df2bb3 fix MediawikiImporter for bz2 dump
skip reading bz2 file magicbyte to identify bz2 format as inputstream reset would be required. Common compress reads and checks the magicbytes internally and throws ioexception if wrong, making preread obsolete.
2015-10-25 03:06:15 +01:00
reger
c6687dd560 fix a system.out to log.fine
in bmpParser
2015-10-25 00:26:45 +02:00
Michael Peter Christen
ac034db8bc Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	htroot/js/highslide/highslide.js
#	source/net/yacy/document/ImageParser.java
2015-10-24 11:22:35 +08:00
luc
5902ce032e Corrected NullPointerException case when ImageIO reader is not found for
image format.
2015-10-19 14:11:26 +02:00
reger
c6495a5b62 add a log entry on parsing ajax crawling scheme snapshot
(prev. commit 9252e36aeb)
2015-10-18 06:19:12 +02:00
reger
9252e36aeb implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content
see freshly deprecated https://developers.google.com/webmasters/ajax-crawling/
Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in header and parses supplied html snapshot instead of mostly empty ajax/scripted page.
Implementation supports also hash-bang urls (url with anchor starting with ! like  ...path#!hashfragment) but our crawler filters it
(use of hash-bang is controversly discussed and proposal is deprecated, makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time).
Quick - how does it work
- if metatag fragment with content "!" is found
   - htmlparser tries to get content of htmls snapshot (using a different url)
   - htmlparser returns 2 documents (original url and snapshot content - but using same original url)
- after parsing result documents are joined (and stored to index containing content also from snapshot page... as the original ajax page contains typically no parseable html content)
2015-10-18 05:51:01 +02:00
Michael Peter Christen
7d075a1d76 added log lines 2015-10-16 23:30:04 +02:00
luc
d6522fa4a2 Integrated haraldk/TwelveMonkeys library to first add TIF image format
support.
2015-10-15 10:06:51 +02:00
reger
78e8c6f3e5 refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES
not used for genericImageParser
2015-10-11 01:23:52 +02:00
reger
d54c5d310a add links with image extension not automatically to image links.
With the wide spread use e.g. of Wikimedia the url file extension of links with image extension often point to html.
2015-10-10 23:49:58 +02:00
reger
851e8f6c8a check jpeg file signature in genericImageParser
to fail early without further object allocation if source is not a jpeg.
2015-10-05 01:58:31 +02:00
reger
d5330391de remove some unused var allocation in parser 2015-10-01 23:11:58 +02:00
reger
7c82cd4415 add a end condition to svgParser for wrong content
(if parser choosen just by file extension)
2015-09-29 22:57:33 +02:00
reger
356d4d1301 remove rdfParser from init (current function identical with genericParser) 2015-09-26 17:30:34 +02:00
reger
c647d899e3 add svgParser to parse metadate from svg images
Reads document level included title and description and skips the graphic content to save bandwidth.
svg metadata element is not interpreted
- remove rdfParser from init (current function identical with genericParser)
2015-09-26 17:27:33 +02:00
reger
bad34804fe optimize parseInt for <img> tag attribute parsing
Performance better as using Numberformat.parse or parseInt(substring())
2015-09-26 15:42:23 +02:00
reger
2f51baff4f check for loading error (includs unsupported formats)
to prevent blank thumbnail display in image search because of not handled source which don't load on click.
Now the cross icon indicates the problem (inlcuding not supported format)
2015-09-24 01:58:19 +02:00
reger
a3195d78ae add Portuguese month names to date recognition 2015-09-20 23:28:42 +02:00
reger
d2cc11ea8f fix html parser taking <style> content as text.
Noticed some result description contain css content from style tag.
Added <style> to tag list to scrape it's content not as text
+ test case included
2015-09-19 05:30:55 +02:00
reger
1e8369e18b use a parsed date in Document.toString 2015-09-12 22:00:40 +02:00
reger
41c4eade51 extract modification date from vCard (vcfParser) 2015-09-06 04:28:27 +02:00
reger
8768896975 extract lastmodified from openoffice doc
set lastmod date in office document parsers
2015-09-06 00:04:54 +02:00
sixcooler
a3dd4be749 added / corrected charste to be 1.7 compatible.
@Orbiter: please check is this is ok for you
2015-08-10 20:53:20 +02:00
Michael Peter Christen
df3314ac1a added a new facet type based on a probabilistic classifier using
bayesian filters. This can be used to classify documents during
indexing-time using a pre-definied bayesian filter.

New wordings:
- a context is a class where different categories are possible. The
context name is equal to a facet name.
- a category is a facet type within a facet navigation. Each context
must have several categories, at least one custom name (things you want
to discover) and one with the exact name "negative".

To use this, you must do:
- for each context, you must create a directory within
DATA/CLASSIFICATION with the name of the context (the facet name)
- within each context directory, you must create text files with one
document each per line for every categroy. One of these categories MUST
have the name 'negative.txt'.

Then, each new document is classified to match within one of the given
categories for each context.
2015-08-10 14:27:44 +02:00
Michael Peter Christen
7b412e8c07 added msg (text emails) format; should be handled by html parser. 2015-07-08 17:36:37 +02:00
Ryszard Goń
59096935d0 Use language-detection library for increased accuracy 2015-07-02 18:41:13 +02:00
Michael Peter Christen
90f75c8c3d added enrichment of synonyms and vocabularies for imported documents
during surrogate reading: those attributes from the dump are removed
during the import process and replaced by new detected attributes
according to the setting of the YaCy peer.
This may cause that all such attributes are removed if the importing
peer has no synonyms and/or no vocabularies defined.
2015-07-02 00:23:50 +02:00
Michael Peter Christen
7829480b82 refactoring: separated condenser and tokenizer 2015-07-01 18:28:18 +02:00
Michael Peter Christen
593de05922 enhanced surrogate import process speed (dramatically!) 2015-06-29 12:28:34 +02:00
reger
7478338a40 remove augmented parsing activation from frontend
experimental implementation not used and based on error prone experimental rdfaparser
2015-06-05 00:51:00 +02:00
reger
11aa2edfe1 remove RDFa parser activation from frontend
reason: experimental implementatin of RDFa parser not executed (limited to special urls) but may cause error on normal html parsing due to a inputstream.reset
2015-06-05 00:15:16 +02:00
Michael Peter Christen
d0aff91f23 fix for index import 2015-06-01 01:56:09 +02:00
Michael Peter Christen
b43811d38c added surrogate import process for exported solr dumps.
Just throw your solr dump file into DATA/SURROGATES/in/ and it will be
imported!
2015-05-30 13:19:59 +02:00
reger
8a9622c31c fix string OoB on getImagelinks with long alttext
in description calculation
2015-05-24 01:59:40 +02:00
Michael Peter Christen
ff29b0e503 added option to re-index exported xml snapshot dumps to
HTCACHE/snapshots by just placing them in the SURROGATES/in path
2015-05-08 15:30:26 +02:00
Michael Peter Christen
6f4fe4b175 revert of 8a7c68e4c7
keeping surrogates after processing is essential for some users. If the
space they are taking is too high, please set up an automatic deletion
process (like a cronjob).
2015-05-08 14:01:30 +02:00
Michael Peter Christen
fed26f33a8 enhanced timezone managament for indexed data:
to support the new time parser and search functions in YaCy a high
precision detection of date and time on the day is necessary. That
requires that the time zone of the document content and the time zone of
the user, doing a search, is detected. The time zone of the search
request is done automatically using the browsers time zone offset which
is delivered to the search request automatically and invisible to the
user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must get a time zone offset passed, so
this required the change of the parser java api. A lot of other changes
had been made which corrects the wrong handling of dates in YaCy which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are UTC/GMT time zone, a
normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
Michael Peter Christen
b060ba900d added parsing of contentprop attribute in html tags for
content='startDate' and content='endDate'. The value of these field is
now written to new solr fields startDates_dts and endDates_dts.
2015-04-13 16:20:00 +02:00
Michael Peter Christen
4cb4f67f38 added parsing of dd, dt and article html fields. The parsed result is
written to special solr fields which are deactivated by default.
2015-04-12 22:02:45 +02:00
Michael Peter Christen
4d00175157 <experimental> added parsing of <article> html element.
Whenever such an element occurs, the complete content of all article
elements replaces the parsed <content> part of documents.
2015-04-10 16:16:20 +02:00
reger
2e8c24e02a fix link to DeReWo download file 2015-03-11 20:02:23 +01:00
Michael Peter Christen
893889bc7b added special terms for on: - Date modifier: tomorrow, today; i.e.:
search for: "Berlin on:tomorrow" to find events happening tomorrow in
Berlin
2015-03-02 13:10:05 +01:00
Michael Peter Christen
535f1ebe3b added a new way of content browsing in search results:
- date navigation

The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.

The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search results, a histogram
showing the number of documents for each day is displayed. To render
these histograms the morris.js library is used. Morris.js requires also
raphael.js which is now also integrated in YaCy.

The histogram is now also displayed in the index browser by default.

To select a specific range from a search result, the following modifiers
had been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'to:' - modifier.

The histogram shows blue and green lines; the green lines denot weekend
days (saturday and sunday).

Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove from and date modifier and set a on:<date> for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).

The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
2015-03-02 04:30:10 +01:00
reger
2d2299f484 fix mimetype of rss items in rss parser
- remove self reference as anchor for items
2015-02-25 01:58:42 +01:00
Michael Peter Christen
b432049d59 enhanced date parsing time 2015-02-25 01:05:46 +01:00
reger
a0f04db9ea add extracted description/subject to pptParser 2015-02-22 05:31:56 +01:00
reger
7e35518787 add extracted description/subject to docParser 2015-02-16 00:50:16 +01:00
Michael Peter Christen
1f5b5c0111 npe fix for latest scraper feature 2015-02-10 08:33:30 +01:00
Michael Peter Christen
ee97302a23 hack to make date detection faster (while it becomes a bit incomplete
regarding language alternatives)
2015-02-09 18:46:06 +01:00
Michael Peter Christen
b5ac29c9a5 added a html field scraper which reads text from html entities of a
given css class and extends a given vocabulary with a term consisting
with the text content of the html class tag. Additionally, the term is
included into the semantic facet of the document. This allows the
creation of faceted search to documents without the pre-creation of
vocabularies; instead, the vocabulary is created on-the-fly, possibly
for use in other crawls. If any of the term scraping for a specific
vocabulary is successful on a document, this vocabulary is excluded for
auto-annotation on the page.

To use this feature, do the following:
- create a vocabulary on /Vocabulary_p.html (if not existent)
- in /CrawlStartExpert.html you will now see the vocabularies as column
in a table. The second column provides text fields where you can name
the class of html entities where the literal of the corresponding
vocabulary shall be scraped out
- when doing a search, you will see the content of the scraped fields in
a navigation facet for the given vocabulary
2015-01-30 13:20:56 +01:00
Michael Peter Christen
de3e373913 using precompiled CommonPattern.TAB for split 2015-01-29 02:22:28 +01:00
Michael Peter Christen
1f5047b15f using precompiled pattern CommonPattern.SEMICOLON for splits 2015-01-29 02:19:41 +01:00
Michael Peter Christen
69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
split(",") was used
2015-01-29 01:46:22 +01:00
reger
5ca0762179 fix: eom on parsing ico file by genericImageParser
trace: java.lang.OutOfMemoryError: Java heap space
	at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75)
	at java.awt.image.Raster.createPackedRaster(Raster.java:467)
	at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032)
	at java.awt.image.BufferedImage.<init>(BufferedImage.java:331)
	at net.yacy.document.parser.images.bmpParser$IMAGEMAP.<init>(bmpParser.java:149)
	at net.yacy.document.parser.images.bmpParser.parse(bmpParser.java:69)
	at net.yacy.document.parser.images.genericImageParser.parse(genericImageParser.java:116)
2015-01-24 23:17:07 +01:00
Michael Peter Christen
4144c7cc52 do not write frame links to webgraph 2015-01-06 14:14:25 +01:00
reger
3ac1d14a21 improve TexParser.mimeOf( fileextension ) by returning 1st defined in supported list.
This prevents unusual mapping of supported fileextension -> mimetype
(like htm=application/x-tex)
2015-01-02 04:20:02 +01:00
Michael Peter Christen
d2792a43fd do not write iframe and embed links into webgraph, but use them anyway
for crawling
2015-01-02 02:44:03 +01:00
Michael Peter Christen
6ad43c4a8b removed debug code 2014-12-22 14:24:09 +01:00