Commit Graph

517 Commits

Author SHA1 Message Date
Michael Peter Christen
8b5d074715 fix for image parser (there is a class missing!) 2014-12-16 12:10:15 +01:00
reger
9edc7308aa update to metadata-extractor-2.7.0.jar
add 2 simple JUnit test cases for jpeg and tif parsing
2014-12-15 20:45:05 +01:00
Michael Peter Christen
bbf0ac40c3 add the actual DateDetection class... (missed in latest commit) 2014-12-14 13:43:30 +01:00
Michael Peter Christen
66b5a56976 Added and integrated a new date detection class which can identify date
expressions within the fulltext of a document. This class also attempts to
identify dates that are abbreviated, lack a year, or are described by
the names of special days, like 'Halloween'. If a date has no year
given, the current year and the following years are considered.

This process is therefore able to identify a large set of dates for a
document, either because several dates are given in the document
or because a date is ambiguous. Four new Solr fields are used to store the
parsing result:

dates_in_content_sxt:
if date expressions can be found in the content, these dates are listed
here in order of appearance

dates_in_content_count_i:
the number of entries in dates_in_content_sxt

date_in_content_min_dt:
if dates_in_content_sxt is filled, this contains the oldest date from
the list of available dates

date_in_content_max_dt:
if dates_in_content_sxt is filled, this contains the most recent date from
the list of available dates, which may also lie in the future

These fields are deactivated by default because the evaluation of
regular expressions to detect the date is still too CPU intensive. Future
enhancements may allow them to be switched on by default.

The purpose of these fields is the creation of calendar-like search
facets, to be implemented next.
2014-12-14 13:40:45 +01:00
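The relationship between the four fields described above can be illustrated with a minimal sketch. This is not the DateDetection class from the commit: the class name DateFieldsSketch, the method detect and the single day.month.year pattern are assumptions made for illustration, and java.time is used for brevity although it is newer than the Java 7 mentioned elsewhere in this log. A date without a year is expanded to the current and the following year, as the commit message describes.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not YaCy code.
class DateFieldsSketch {

    // day.month. followed by an optional four-digit year, e.g. 14.12.2014 or 31.10.
    private static final Pattern DMY =
            Pattern.compile("\\b(\\d{1,2})\\.(\\d{1,2})\\.(\\d{4})?");

    // Returns all detected dates in order of appearance (cf. dates_in_content_sxt).
    static List<LocalDate> detect(String text, int currentYear) {
        List<LocalDate> dates = new ArrayList<>();
        Matcher m = DMY.matcher(text);
        while (m.find()) {
            int day = Integer.parseInt(m.group(1));
            int month = Integer.parseInt(m.group(2));
            try {
                if (m.group(3) != null) {
                    dates.add(LocalDate.of(Integer.parseInt(m.group(3)), month, day));
                } else {
                    // No year given: consider the current and the following year.
                    dates.add(LocalDate.of(currentYear, month, day));
                    dates.add(LocalDate.of(currentYear + 1, month, day));
                }
            } catch (DateTimeException e) {
                // Not a valid calendar date (e.g. 99.99.), skip it.
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        List<LocalDate> dates = detect("Released 14.12.2014, party on 31.10. again", 2014);
        System.out.println("dates_in_content_sxt     = " + dates);
        System.out.println("dates_in_content_count_i = " + dates.size());
        if (!dates.isEmpty()) {
            System.out.println("date_in_content_min_dt   = " + Collections.min(dates));
            System.out.println("date_in_content_max_dt   = " + Collections.max(dates));
        }
    }
}
```

The ambiguity mentioned in the commit shows up here as two entries for the year-less date, which is why the maximum can lie in the future.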
Michael Peter Christen
6a1865f507 refactoring date -> lastModified 2014-12-11 23:37:41 +01:00
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory alongside
the pdf and jpg images
- a transaction layer was placed above the snapshot directory to
distinguish snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments using archived solr search results
between peers. This is currently unfinished, we need a protocol to move
snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and now contains two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only by peers
running on a server with wkhtmltopdf installed. The expert crawl start
provides the option for snapshots to everyone. PDF snapshots are now
optional and the option is only shown if wkhtmltopdf is installed.
- the snapshot api now provides the request for historised xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such xml files is identical with solr search results with
only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down the process a lot and
a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
reger
28456dfc09 skip creation of unused Bluelist contenttransformer 2014-12-02 21:03:00 +01:00
Michael Peter Christen
321840fde3 Replaced all fixed thread pools with cached thread pools. The cached
thread pools flush their idle cached threads after 60 seconds.
As a result, YaCy now runs constantly with about 50 threads,
about 100 at peak times. Previously, about 400 threads had been cached
and kept in a hibernation state, so the numproc counter
in /proc/user_beancounters (exists only in VM-hosted linux) was as high
as the number of cached threads. VM supervisors therefore
terminated whole VM sessions when a limit was reached. Many VM providers
have limits of numproc=96, which made it virtually impossible to run YaCy
on such machines. With this change, it will be possible to run many YaCy
instances even on VM hosts.
2014-12-02 16:26:07 +01:00
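The difference between the two pool types in the commit above can be seen in a small sketch using the standard java.util.concurrent factories. The class name ThreadPoolSketch and its helper methods are hypothetical, and the cast to ThreadPoolExecutor relies on what the JDK's Executors factory methods return in practice rather than on a documented guarantee, so treat it as an assumption.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not YaCy code.
class ThreadPoolSketch {

    // Seconds an idle thread above the core size is kept before it is terminated.
    static long idleTimeoutSeconds(ExecutorService pool) {
        return ((ThreadPoolExecutor) pool).getKeepAliveTime(TimeUnit.SECONDS);
    }

    // Number of threads kept alive even when they are idle.
    static int coreThreads(ExecutorService pool) {
        return ((ThreadPoolExecutor) pool).getCorePoolSize();
    }

    public static void main(String[] args) {
        ExecutorService fixed = Executors.newFixedThreadPool(4);
        ExecutorService cached = Executors.newCachedThreadPool();
        try {
            // Fixed pool: started threads stay around, busy or not.
            System.out.println("fixed:  core=" + coreThreads(fixed)
                    + " idleTimeout=" + idleTimeoutSeconds(fixed) + "s");
            // Cached pool: no core threads, idle threads are dropped after the timeout.
            System.out.println("cached: core=" + coreThreads(cached)
                    + " idleTimeout=" + idleTimeoutSeconds(cached) + "s");
        } finally {
            fixed.shutdown();
            cached.shutdown();
        }
    }
}
```

A cached pool has no upper bound on the number of threads, so it trades a lower idle thread count for potentially higher peaks under load.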
Michael Peter Christen
a1ee101079 recognize more html file extensions 2014-12-02 12:10:44 +01:00
reger
0c97cc2440 skip unused call parameter for hashSentence() 2014-11-30 19:42:33 +01:00
reger
5790c7242e skip tokenizing punctuation as word in WordTokenizer
remove unused variables in condenser related to Tokenizer
2014-11-29 17:16:05 +01:00
Michael Peter Christen
6a2a669db4 added loading of the synonyms file from addon/synonyms into the
knowledge loader
2014-11-19 17:36:56 +01:00
Michael Peter Christen
07c5b57953 removed warnings 2014-10-15 11:19:25 +02:00
reger
59c6532a65 add link extraction to pdfParser
this extracts clickable links in pdf and adds it to the list of links

include a test case for this function

this is the corrected comment for commit:
aa2e15d846
2014-10-06 04:51:31 +02:00
reger
aa2e15d846 allow url parameter in worktable apicall
allow url=wwwl?param=a&param=b (with ?, & encoded)
fix:  http://mantis.tokeek.de/view.php?id=100

fix double adding of  '&' in MultiProtocolURL.escape()
2014-10-05 20:05:03 +02:00
reger
b0c87d8240 fix image search expand box, cut-off of 2nd capture line height
tested with IE11 and Firefox 32 (change worked for both to show 2nd line without cutting off height)

+fix charset parameter in metadataImageParser
+update start errMsgTxt to "java 1.7"
2014-10-03 01:43:05 +02:00
Michael Peter Christen
3073c69aee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-09-30 14:54:06 +02:00
reger
eaccce3467 added metadataImageParser for tif and psd (Photoshop) images.
This is a modified genericImageParser adding tif (and psd) support even if java ImageIO plugin for tif is not installed in JDK.
Adds just tif and psd to the available parsers.
Uses the same library to extract metadata, so could eventually be merged with genericImageParser.
All detected metadata are added to the parsed document (potentially some more than with genericImageParser)
2014-09-30 05:04:47 +02:00
reger
a69f5358ff use javax ImageIO getReader to add supported image extension/mime
genericImageParser uses javax ImageIO; the supported images depend on the available plugins for the ImageIO package (this is JDK installation specific). Jpeg, png and gif are available by default. Tif and others are only supported if a plugin is available (in the classpath).
Add supported image type dynamically on startup.
2014-09-29 07:42:51 +02:00
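Querying ImageIO for its installed readers at startup could look like the following minimal sketch. ImageIO.getReaderFileSuffixes() and ImageIO.getReaderMIMETypes() are standard javax.imageio calls; whether the commit uses exactly these calls is an assumption, and the class and method names here are hypothetical.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import javax.imageio.ImageIO;

// Hypothetical sketch, not YaCy code.
class ImageTypesSketch {

    // File extensions for which an ImageIO reader plugin is currently installed.
    static Set<String> supportedExtensions() {
        Set<String> result = new TreeSet<>();
        for (String suffix : ImageIO.getReaderFileSuffixes()) {
            if (!suffix.isEmpty()) {
                result.add(suffix.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    // MIME types for which an ImageIO reader plugin is currently installed.
    static Set<String> supportedMimeTypes() {
        Set<String> result = new TreeSet<>();
        for (String mime : ImageIO.getReaderMIMETypes()) {
            if (!mime.isEmpty()) {
                result.add(mime.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The output depends on the JDK installation and the plugins on the classpath.
        System.out.println("extensions: " + supportedExtensions());
        System.out.println("mime types: " + supportedMimeTypes());
    }
}
```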
Michael Peter Christen
67cd4c37bd activated the new apk parser which was already finished but not included in
the parser initialization. To make the apk parser usable, the handling
of application type links had to be modified. Now all documents which
have no parser attached are placed in the noload-queue while all
other documents are parsed using the associated parser class. This may
have side effects on other parsers and the display of different file
classes (images, apps, videos).
2014-09-24 13:32:58 +02:00
reger
03a7a29db3 limit OAI import urn resolver try for Deutsche National Library
The resolver service of the National Library uses the name space nbn, so the use of nbn-resolving.de is limited to urn:nbn: accordingly
- add resolver for rfc's
2014-09-14 01:38:27 +02:00
orbiter
b6d57f06eb enhanced the apk parser (up to being production-ready).
The parser is not yet activated and will be after the next release step.
2014-09-04 09:41:42 +02:00
orbiter
c9e593cf78 removed warnings 2014-08-11 23:53:12 +02:00
reger
e9eae45b55 simplify rssreader and improve atom feed link extraction
- type detection (rss/atom) 
    - init type parameter overwritten during parse, parameter obsolete
    - detection by endtag changed to simpler first-tag evaluation
- channel image not used, removed related extra parser handling
    - remove unused code (set/getImage) in rssfeed
- atom link extraction to account for possible multiple link tags
   - spec limits link to one with rel="alternate" or one without rel attribute
     not accounting for the following type & hreflang exception yet:

   o  atom:entry elements MUST NOT contain more than one atom:link
      element with a rel attribute value of "alternate" that has the
      same combination of type and hreflang attribute values.
2014-08-10 01:29:16 +02:00
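The first-tag evaluation mentioned in the commit above can be sketched with the standard StAX API: read up to the first start element and decide from its name. The class FeedTypeSketch, the enum Type and the mapping of rdf:RDF to RSS are assumptions for illustration, not the rssreader code.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not YaCy code.
class FeedTypeSketch {

    enum Type { RSS, ATOM, UNKNOWN }

    // Decides the feed type from the first element only, without reading the rest.
    static Type detect(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String root = reader.getLocalName();
                    if ("rss".equals(root) || "RDF".equals(root)) return Type.RSS;
                    if ("feed".equals(root)) return Type.ATOM;
                    return Type.UNKNOWN;
                }
            }
        } catch (XMLStreamException e) {
            // Not well-formed up to the first tag: treat as unknown.
        }
        return Type.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(detect("<rss version=\"2.0\"><channel/></rss>"));
        System.out.println(detect("<feed xmlns=\"http://www.w3.org/2005/Atom\"/>"));
    }
}
```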
reger
8f77719091 fix "Ljava.lang.String" in crawl queue anchor name
(e.g. IndexCreateQueues_p.html?stack=LOCAL with images in queue)
2014-08-04 02:38:58 +02:00
Michael Peter Christen
98f45c9032 fix for image alt attachment to AnchorURLs in html parser. 2014-08-01 12:04:15 +02:00
orbiter
08409ec680 no idea why the words max was an ordered one. This change increases speed
during document processing a bit
2014-07-23 17:54:16 +02:00
Michael Peter Christen
b44626e55b fixed target_alt_t in webgraph 2014-07-22 18:24:10 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL Object type which
now also has a different toString method than the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls; use toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
Michael Peter Christen
e039e78210 small bugfixes 2014-07-16 16:04:38 +02:00
Michael Peter Christen
fb3dd56b02 fix for processing of noindex flag in http header 2014-07-10 17:13:35 +02:00
Michael Peter Christen
f3a6b6e21e fix for bad URL decoding 2014-07-10 01:59:29 +02:00
Michael Peter Christen
aee5b108e5 added linkScraperParser, a parser which ignores the text like the
generic parser but extracts links like the htmlParser. This should be
used for ASCII documents without known text format annotation like
source code files or json documents. Probably also good for xml files
without known schema.
2014-07-07 13:37:17 +02:00
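The idea of ignoring the text but collecting links from an unstructured ASCII document can be sketched with a plain regular expression. This is not the linkScraperParser implementation: the class name, the method extractLinks and the URL pattern are assumptions, and the pattern is deliberately simple (it may, for example, keep trailing punctuation).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not YaCy code.
class LinkScrapeSketch {

    // http/https URLs up to the next whitespace, quote or angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Returns all URLs in order of appearance; the surrounding text is ignored.
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String json = "{\"home\": \"http://example.org/a\", \"doc\": \"https://example.org/b?x=1\"}";
        System.out.println(extractLinks(json));
    }
}
```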
reger
40133ba2d0 fix NPE in Condenser,
discovered by calling IndexControlRWI, "Word Deletion" with "for every resolvable and deleted URL reference"
2014-07-06 13:24:36 +02:00
reger
cb2c17d236 extract author and keywords in .doc and .ppt parser 2014-06-29 02:54:09 +02:00
orbiter
fec673c9d1 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-06-27 10:15:37 +02:00
orbiter
4a66af716d added apkParser stub (work in progress) 2014-06-27 10:15:01 +02:00
reger
2d67f29244 adjust mergeDocument after parsing to
- preserve charset and languages
- fix merge of author
2014-06-26 22:16:15 +02:00
Michael Peter Christen
0d29b972cc Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-06-26 13:02:56 +02:00
reger
7847a93558 fix AbstractParser.singleList not adding null strings
- prevents null titles in oo... parser  (as detected by ParserTest)
- correct ParserTest dc_description check (dc_description allowed to return 0 length array)
2014-06-26 02:56:45 +02:00
Michael Peter Christen
8acae852a0 write <em>-tagged texts also into the bold_txt field 2014-06-25 11:51:11 +02:00
reger
3b559e7846 optimize pdfParser
skip starting reader thread if all content already read
2014-06-10 04:25:20 +02:00
reger
09f73b790f fix pdfParser not closed warning from pdfbox
for encrypted pdf on exit due to missing permission to extract
2014-06-08 08:20:30 +02:00
reger
d8d318233e fix logging settings
- add missing .level
- remove obsolete jena settings
- set default level=INFO to prevent debug logging of not explicitly specified classes
2014-06-01 06:43:50 +02:00
orbiter
97983ba89f fixed generics warnings for generic array instantiation that appeared
after migration to Java 7
2014-05-20 21:50:16 +02:00
orbiter
88f4af90da removed warnings 2014-05-13 22:27:31 +02:00
reger
8a7c68e4c7 content of surrogates/out never accessed (remove)
After import the content is never accessed but may take up a lot of disk space,
also the getLoadedOAIServer (which lists the files in surrogate out) is not used.
This makes the surrogate.out obsolete. Removed keeping of xmls after import.
2014-05-04 09:29:07 +02:00
reger
2eb7682772 add html5 audio/video <source> tag to html content scraper
- <source src=.. type=..> tag content is added to embed collection
2014-04-29 00:41:29 +02:00
reger
0b6db04e40 fix contentscraper img height/width parsing
prevent numberformat exception on common "100px" property

- include in test case
2014-04-28 04:59:47 +02:00
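A tolerant way to read width and height attributes such as "100px" is to parse only the leading digits instead of passing the whole value to Integer.parseInt. The following is a minimal sketch under that assumption; the class DimensionSketch and the -1 return value for unusable input are hypothetical and not taken from the contentscraper.

```java
// Hypothetical sketch, not YaCy code.
class DimensionSketch {

    // Parses the leading decimal digits of an attribute value like "100px".
    // Returns -1 if the value is null or does not start with a digit.
    static int parseDimension(String value) {
        if (value == null) return -1;
        String s = value.trim();
        int end = 0;
        while (end < s.length() && s.charAt(end) >= '0' && s.charAt(end) <= '9') {
            end++;
        }
        if (end == 0) return -1;
        try {
            return Integer.parseInt(s.substring(0, end));
        } catch (NumberFormatException e) {
            // Digit run too long for an int.
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDimension("100px"));
        System.out.println(parseDimension("auto"));
    }
}
```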
reger
121d25be38 recover sax fatal error on OAI-PMH import of xml with entity error
this allows loading to continue with the next resumptionToken even if an import file caused a sax parser error
fix http://mantis.tokeek.de/view.php?id=63
2014-04-25 01:05:28 +02:00