yacy_search_server/source/net/yacy/crawler/data
Michael Peter Christen 66b5a56976 Added and integrated new date detection class which can identify date
notions within the fulltext of a document. This class also attempts to
identify dates that are abbreviated, that lack a year, or that are
described by names for special days, like 'Halloween'. If a date has no
year given, the current year and following years are considered.
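
The handling of year-less dates described above could be sketched roughly as follows. This is only an illustration, not the class added by this commit: the class name `YearlessDateSketch`, the method `candidates`, and the `yearsAhead` parameter are hypothetical, and the commit message does not say how many following years are actually considered.

```java
import java.time.LocalDate;
import java.time.MonthDay;
import java.util.ArrayList;
import java.util.List;

public class YearlessDateSketch {

    /**
     * Expands a month/day pair that has no year into concrete candidate
     * dates: one for the given current year and one for each of the
     * following yearsAhead years (hypothetical parameter).
     */
    public static List<LocalDate> candidates(MonthDay monthDay, int currentYear, int yearsAhead) {
        List<LocalDate> result = new ArrayList<>();
        for (int year = currentYear; year <= currentYear + yearsAhead; year++) {
            // February 29 only exists in leap years; skip the other years
            // instead of silently shifting the date to February 28.
            if (monthDay.isValidYear(year)) {
                result.add(monthDay.atYear(year));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A named day such as 'Halloween' maps to a fixed month/day pair.
        MonthDay halloween = MonthDay.of(10, 31);
        // prints [2014-10-31, 2015-10-31, 2016-10-31]
        System.out.println(candidates(halloween, 2014, 2));
    }
}
```

Each candidate counts as a separate detected date, which is one reason a single ambiguous expression can contribute several dates to a document.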

This process can therefore assign a large set of dates to a
document, either because several dates are given in the document
or because a date is ambiguous. Four new Solr fields are used to store the
parsing result:

dates_in_content_sxt:
if date expressions can be found in the content, these dates are listed
here in order of appearance

dates_in_content_count_i:
the number of entries in dates_in_content_sxt

date_in_content_min_dt:
if dates_in_content_sxt is filled, this contains the oldest date from
the list of available dates

date_in_content_max_dt:
if dates_in_content_sxt is filled, this contains the latest date from
the list of available dates, which may also lie in the future
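
The relationship between the four fields can be illustrated with a small sketch. It assumes the detected dates are already available as a list in order of appearance. The class `DateFieldsSketch` and its methods are hypothetical names chosen for illustration and do not reflect how YaCy actually fills the Solr document.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class DateFieldsSketch {

    /** Value for dates_in_content_count_i: the number of detected dates. */
    public static int count(List<LocalDate> datesInContent) {
        return datesInContent.size();
    }

    /** Value for date_in_content_min_dt: the oldest date, or null if the list is empty. */
    public static LocalDate min(List<LocalDate> datesInContent) {
        return datesInContent.isEmpty() ? null : Collections.min(datesInContent);
    }

    /** Value for date_in_content_max_dt: the latest date, or null if the list is empty. */
    public static LocalDate max(List<LocalDate> datesInContent) {
        return datesInContent.isEmpty() ? null : Collections.max(datesInContent);
    }

    public static void main(String[] args) {
        // Dates in order of appearance, as they would be listed in dates_in_content_sxt.
        List<LocalDate> dates = Arrays.asList(
                LocalDate.of(2014, 12, 14),
                LocalDate.of(2013, 1, 24),
                LocalDate.of(2015, 10, 31));
        System.out.println(count(dates)); // prints 3
        System.out.println(min(dates));   // prints 2013-01-24
        System.out.println(max(dates));   // prints 2015-10-31
    }
}
```

The min and max values are only meaningful when the list is non-empty, which matches the "if dates_in_content_sxt is filled" condition in the field descriptions.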

These fields are deactivated by default because the evaluation of
regular expressions to detect the date is still too CPU intensive. Future
enhancements may allow this to be switched on by default.

The purpose of these fields is the creation of calendar-like search
facets, to be implemented next.
2014-12-14 13:40:45 +01:00
..
Cache.java more IPv6 bugfixes 2014-10-06 17:44:27 +02:00
CrawlProfile.java enhanced the snapshot functionality: 2014-12-09 16:20:34 +01:00
CrawlQueues.java ViewFile servlet: update index if newer, 2014-12-05 01:13:37 +01:00
Latency.java - added a new Crawler Balancer: HostBalancer and HostQueues: 2014-04-16 21:34:28 +02:00
NoticedURL.java reduce number of calls to queue.size() because that may be a bottleneck 2014-11-23 20:09:32 +01:00
ResultImages.java fix for image alt attachment to AnchorURLs in html parser. 2014-08-01 12:04:15 +02:00
ResultURLs.java migrated the index export methods from the old metadata to solr. Now 2013-01-24 12:39:19 +01:00
Snapshots.java added concurrent generation of snapshot pdfs 2014-12-10 14:10:05 +01:00
Transactions.java Added and integrated new date detection class which can identify date 2014-12-14 13:40:45 +01:00