11657 Commits

Author SHA1 Message Date
Michael Peter Christen
bbf0ac40c3 add the actual DateDetection class... (missed in latest commit) 2014-12-14 13:43:30 +01:00
Michael Peter Christen
66b5a56976 Added and integrated a new date detection class which can identify date
expressions within the full text of a document. This class also attempts to
identify dates that are abbreviated, lack a year, or are described
by names for special days, like 'Halloween'. If a date has
no year given, the current year and following years are considered.

This process can therefore assign a large set of dates to a
document, either because there are several dates given in the document
or because a date is ambiguous. Four new Solr fields are used to store the
parsing result:

dates_in_content_sxt:
if date expressions can be found in the content, these dates are listed
here in order of the appearances

dates_in_content_count_i:
the number of entries in dates_in_content_sxt

date_in_content_min_dt:
if dates_in_content_sxt is filled, this contains the oldest date from
the list of available dates

date_in_content_max_dt:
if dates_in_content_sxt is filled, this contains the youngest date from
the list of available dates, which may also lie in the future

These fields are deactivated by default because the evaluation of
regular expressions to detect the dates is still too CPU-intensive.
Future enhancements may allow this to be switched on by default.

The purpose of these fields is the creation of calendar-like search
facets, to be implemented next.
2014-12-14 13:40:45 +01:00
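
The detection and the four fields described in 66b5a56976 can be illustrated with a minimal Python sketch. It is not the DateDetection class itself: the regular expressions, the `NAMED_DAYS` table, the function names, and the choice to expand a missing year to exactly the current and the next year are all assumptions made for illustration.

```python
import re
from datetime import date

# Hypothetical lookup tables; the real DateDetection class may use others.
NAMED_DAYS = {"halloween": (10, 31), "christmas eve": (12, 24)}
MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {}
for number, name in enumerate(MONTH_NAMES, start=1):
    MONTHS[name] = number        # full name, e.g. "december"
    MONTHS[name[:3]] = number    # abbreviation, e.g. "dec"

# ISO dates such as 2014-12-14
ISO_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
# "Dec 20" or "December 20, 2014"; the year is optional
MONTH_DAY_RE = re.compile(r"\b([A-Za-z]+)\.?\s+(\d{1,2})(?:,?\s+(\d{4}))?\b")


def _candidates(years, month, day):
    """Build one date per year, skipping impossible ones such as Feb 30."""
    result = []
    for year in years:
        try:
            result.append(date(year, month, day))
        except ValueError:
            pass
    return result


def detect_dates(text, today):
    """Return all candidate dates in order of appearance in the text."""
    hits = []  # (position in text, date)
    for m in ISO_RE.finditer(text):
        for d in _candidates([int(m.group(1))], int(m.group(2)), int(m.group(3))):
            hits.append((m.start(), d))
    # Assumption: a missing year expands to the current and the next year.
    open_years = [today.year, today.year + 1]
    for m in MONTH_DAY_RE.finditer(text):
        month = MONTHS.get(m.group(1).lower())
        if month is None:
            continue  # the word before the number is not a month name
        years = [int(m.group(3))] if m.group(3) else open_years
        for d in _candidates(years, month, int(m.group(2))):
            hits.append((m.start(), d))
    for name, (month, day) in NAMED_DAYS.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
            for d in _candidates(open_years, month, day):
                hits.append((m.start(), d))
    hits.sort(key=lambda hit: hit[0])  # stable sort keeps the year order per match
    return [d for _, d in hits]


def solr_fields(dates):
    """Map the detected dates to the four field names from the commit message."""
    fields = {
        "dates_in_content_sxt": [d.isoformat() for d in dates],
        "dates_in_content_count_i": len(dates),
    }
    if dates:  # min/max are only set if dates_in_content_sxt is filled
        fields["date_in_content_min_dt"] = min(dates).isoformat()
        fields["date_in_content_max_dt"] = max(dates).isoformat()
    return fields
```

Calling `solr_fields(detect_dates(text, date.today()))` yields a dictionary keyed by the field names above. The plain `YYYY-MM-DD` strings are a simplification; the value format Solr actually expects for these fields is not stated in the commit message.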
Michael Peter Christen
c3c2b6999b fixes on wkhtmltopdf 2014-12-14 04:03:20 +01:00
Michael Peter Christen
114f0afc1e enable sku as anchor in html response writer 2014-12-14 04:02:13 +01:00
Michael Peter Christen
aa80cb1159 enhanced tagging preparation speed which reduces initialization time for
very large vocabularies
2014-12-13 09:54:41 +01:00
Michael Peter Christen
6a1865f507 refactoring date -> lastModified 2014-12-11 23:37:41 +01:00
Michael Peter Christen
ab6cc3c88c added concurrent generation of snapshot pdfs 2014-12-10 14:10:05 +01:00
Michael Peter Christen
ff035a20e7 fix for vocabulary import (double term detection) 2014-12-10 14:09:34 +01:00
Michael Peter Christen
e6650050fe fix for Is Facet checkbox 2014-12-10 13:14:39 +01:00
Michael Peter Christen
bd3ed5cae5 added charset detection to vocabulary reader 2014-12-10 13:11:51 +01:00
Michael Peter Christen
413eeefed4 added character set detection library from
http://www-archive.mozilla.org/projects/intl/chardet.html
2014-12-10 13:08:29 +01:00
Michael Peter Christen
7bfc5b80cb added new options to vocabulary editor:
- new switch 'isFacet' which enables or disables the usage of the vocabulary for
search facets. This should be used for large
vocabularies, since searches in Solr are extremely slow if facets for a
large set of alternative terms are generated
- new option to disable auto-enrichment from synonyms
- new option to add synonyms from another column when importing from csv
- automatically recognize double occurrences in synonyms and bundle
terms for such synonyms
2014-12-10 12:20:27 +01:00
Michael Peter Christen
87b53b3572 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-09 16:20:44 +01:00
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory along
the pdf and jpg images
- a transaction layer was placed above the snapshot directory to
distinguish snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments using archived solr search results
between peers. This is currently unfinished, we need a protocol to move
snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and now contains two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only by peers
running on a server with wkhtmltopdf installed. The expert crawl start
provides the option for snapshots to everyone. PDF snapshots are now
optional and the option is only shown if wkhtmltopdf is installed.
- the snapshot api now provides the request for historised xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such xml files is identical with solr search results with
only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down the process a lot and
a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
reger
5d67e165d9 remove redundant null check in ResponseHeader.lastModified
added a JUnit testcase for ResponseHeader dates (using age()),
adjusted age() to pass all tests
2014-12-09 00:58:08 +01:00
Michael Peter Christen
4111d42c81 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-08 12:40:12 +01:00
Michael Peter Christen
793ce6d13b added confirmation dialogs for row deletion 2014-12-08 11:41:28 +01:00
Michael Peter Christen
cdc21d43b1 more robustness for broken table data in Table_API_p.html -- see bug
report http://mantis.tokeek.de/view.php?id=495
2014-12-08 11:35:40 +01:00
reger
1d3ea35d69 prevent NPE on host link for too short HeuristicCfg.OpenSearchURL 2014-12-08 01:35:37 +01:00
Michael Peter Christen
a95af11050 enhancement for clearing the crawl queue 2014-12-07 23:43:38 +01:00
reger
5f0bb1214f modified FieldReIndex to reindex queries with a low number of documents first
by internally using a score map with the number of documents as score
and working through the list from low to high.
2014-12-07 04:31:09 +01:00
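
The low-to-high ordering described in 5f0bb1214f can be sketched in Python as follows. The function name and the example field names are hypothetical, and a plain dictionary stands in for the score map used inside FieldReIndex.

```python
def reindex_order(doc_counts):
    """Return field names ordered from fewest to most matching documents.

    doc_counts maps a field name to its number of documents (the score).
    Ties are broken by field name so the order is deterministic.
    """
    ranked = sorted(doc_counts.items(), key=lambda item: (item[1], item[0]))
    return [field for field, _count in ranked]
```

For example, `reindex_order({"a_s": 500, "b_i": 3, "c_t": 42})` puts `b_i` first, so small reindex jobs finish before the large ones start.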
reger
8055ed5b2a update to commons-logging-1.2 2014-12-06 22:32:24 +01:00
reger
e52370728a fix startup stop on missing HTCACHE/SNAPSHOT directory 2014-12-06 02:25:24 +01:00
reger
e5236aa7ca Merge origin/master 2014-12-06 01:44:03 +01:00
reger
70cf7060a4 coding fixes suggested in
http://mantis.tokeek.de/view.php?id=509
http://mantis.tokeek.de/view.php?id=510
2014-12-06 01:42:24 +01:00
Michael Peter Christen
d97deb5555 npe fix 2014-12-06 00:43:12 +01:00
Michael Peter Christen
4fe4bf29ad added rss feed output to snapshot servlet which can be used to get a
list of latest/oldest entries in the snapshot database. This is an
example:
http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100

The properties depth, order, host and maxcount can be omitted. The
meanings of the fields are:
host: select only urls from this host or all, if not given
depth: select only urls at that crawl depth or all, if not given
maxcount: select at most the given number of urls or 10, if not given
order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to
select the first entries or ANY to select any

The rss feed needs administration rights to work, a call to this servlet
with rss extension must attach login credentials.
2014-12-06 00:25:05 +01:00
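
A request URL for the feed described in 4fe4bf29ad could be assembled like this. The helper name and its signature are assumptions; only the path, the parameter names, and the three order values come from the commit message. Authentication is left out, although the feed needs administration rights.

```python
from urllib.parse import urlencode

VALID_ORDERS = ("LATESTFIRST", "OLDESTFIRST", "ANY")


def snapshot_rss_url(base="http://localhost:8090", depth=None, order=None,
                     host=None, maxcount=None):
    """Build a snapshot.rss URL; parameters left as None are omitted."""
    if order is not None and order not in VALID_ORDERS:
        raise ValueError("order must be one of " + ", ".join(VALID_ORDERS))
    params = {}
    if depth is not None:
        params["depth"] = depth
    if order is not None:
        params["order"] = order
    if host is not None:
        params["host"] = host
    if maxcount is not None:
        params["maxcount"] = maxcount
    url = base + "/api/snapshot.rss"
    query = urlencode(params)
    return url + "?" + query if query else url
```

Omitting a parameter leaves the choice of default to the servlet, for example all hosts, all depths, and at most 10 urls as listed above.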
Michael Peter Christen
8b522687e0 added toString() methods to feed classes which makes it possible to
export full rss feed files out of the RSSFeed class
2014-12-06 00:18:14 +01:00
reger
568c991405 remove the unused Request variable
(fix of  prev. commit)
2014-12-05 03:03:28 +01:00
reger
d6539ba597 Merge origin/master 2014-12-05 01:15:41 +01:00
reger
ff18129def ViewFile servlet: update index if newer,
so viewed text and metadata (stored) info is similar
- to achieve this, use request with profile to allow indexing (defaultglobaltext) and update index
   (the resource is loaded and parsed anyway, so it's not an expensive operation)

Request: remove 2 unused init parameters
- number of anchors of the parent
- forkfactor sum of anchors of all ancestors
2014-12-05 01:13:37 +01:00
Michael Peter Christen
a304058840 added Image Events as another option to generate images on a Mac if
Ghostscript is not available or does not work...
2014-12-04 01:21:24 +01:00
Michael Peter Christen
d83de9ecf5 added another path for the convert command because on older Macs
ImageMagick has a different installation location
2014-12-03 18:07:05 +01:00
Michael Peter Christen
226aea5914 added a servlet which can create preview images, preview thumbnails and
preview pdfs from web pages, i.e.:
http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/

This also supports an on-the-fly generation of the preview documents if
the user is an administrator. Otherwise, the servlet fails.
To enable this, you must add wkhtmltopdf, imagemagick and (on headless
servers) xvfb to your operating system.

for detailed instructions, see
97f6089a41
2014-12-03 11:45:48 +01:00
reger
28456dfc09 skip creation of unused Bluelist contenttransformer 2014-12-02 21:03:00 +01:00
Michael Peter Christen
321840fde3 Replaced all fixed thread pools with cached thread pools. The cached
thread pools will flush their cached (dead) threads after 60 seconds.
As a result, YaCy now runs constantly with about 50 threads,
about 100 at peak times. Previously, about 400 threads had been cached
and kept in a hibernation state, so the numproc counter
in /proc/user_beancounters (exists only in VM-hosted Linux) was as high
as the cached number of threads. VM supervisors therefore
terminated whole VM sessions if a limit was reached. Many VM providers
have limits of numproc=96, which made it virtually impossible to run YaCy
on such machines. With this change, it will be possible to run many YaCy
instances even on VM hosts.
2014-12-02 16:26:07 +01:00
Michael Peter Christen
181911376c showing list of all threads in threaddump using the ThreadMXBean counter
(this obviously shows more threads than before?)
2014-12-02 16:21:06 +01:00
Michael Peter Christen
7bfab5eb9d set Busy- and Blocking-Threads to daemon mode (they will now not prevent
YaCy from termination if still running)
2014-12-02 16:05:00 +01:00
Michael Peter Christen
64887f6b21 show number of threads on status page 2014-12-02 16:04:11 +01:00
Michael Peter Christen
e586e423aa in case that loading from the cache fails, load from wkhtmltopdf without
cache using the user agent string given in the crawl profile
2014-12-02 13:35:19 +01:00
Michael Peter Christen
d5bac64421 recognize more html file types for snapshots 2014-12-02 12:52:36 +01:00
Michael Peter Christen
6f0167fac1 get cloned crawl start parameter for snapshots 2014-12-02 12:52:05 +01:00
Michael Peter Christen
a1ee101079 recognize more html file extensions 2014-12-02 12:10:44 +01:00
Michael Peter Christen
8480641f2d fix to xvfb-run usage (quotes did not parse in xvfb-run, default values
are appropriate)
2014-12-02 11:51:12 +01:00
Michael Peter Christen
68b040e31e added fail-over for a missing http proxy service (e.g. on overload) and quiet
mode
2014-12-01 18:21:52 +01:00
Michael Peter Christen
25a64c51b3 moved snapshot generation out of the html handler so that it is not
skipped when existing cache entries cause the handler not to be executed
2014-12-01 17:37:25 +01:00
Michael Peter Christen
c35170a305 more logging 2014-12-01 16:50:37 +01:00
Michael Peter Christen
e8be07ec78 grr 2014-12-01 16:38:07 +01:00
Michael Peter Christen
6f81bb756c wrap wkhtmltopdf with xvfb if necessary 2014-12-01 16:26:28 +01:00
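
The wrapping mentioned in 6f81bb756c might look like the following Python sketch. Treating an unset `DISPLAY` variable as the sign of a headless server is an assumption, as is the function name; `xvfb-run` is called with its default settings, in line with the later fix 8480641f2d.

```python
import os


def wkhtmltopdf_command(url, pdf_path, environ=None):
    """Return the argument list for creating a pdf snapshot of url.

    The command is prefixed with xvfb-run when no X display is available.
    Passing a list (not a shell string) avoids any quoting of the url.
    """
    env = os.environ if environ is None else environ
    command = ["wkhtmltopdf", url, pdf_path]
    if not env.get("DISPLAY"):
        # Assumption: no DISPLAY means a headless server that needs xvfb.
        command = ["xvfb-run"] + command
    return command
```

The returned list can be handed to `subprocess.run(command, check=True)` on a machine where wkhtmltopdf and, for headless servers, xvfb are installed.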
Michael Peter Christen
0119f8665d more logging when failing to create pdf snapshot 2014-12-01 16:00:45 +01:00