Commit Graph

16 Commits

Author SHA1 Message Date
luccioman
d27adc2b92 Fixed language detector initialization and NullPointerException cases.
A NullPointerException occurred when using an Identificator instance
whose constructor had encountered an error.
This error could be caused by a missing "langdetect" folder in the
working directory of the main process, or by simultaneous first calls to
the constructor concurrently initializing DetectorFactory.langlist.

Fixes Mantis issue 714 (http://mantis.tokeek.de/view.php?id=714)
2016-12-05 18:12:21 +01:00
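
The concurrent-initialization problem described in this commit can be illustrated with the initialization-on-demand holder idiom. This is a minimal sketch, not the code from the commit: `LanguageProfiles`, `LOAD_COUNT` and the hard-coded profile list are hypothetical stand-ins for the shared langdetect state.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the shared language profile list; not YaCy's actual class.
final class LanguageProfiles {
    // Counts how often the expensive load runs (for demonstration only).
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    private LanguageProfiles() {}

    // The JVM initializes Holder exactly once, on first access,
    // even when several threads call profiles() at the same time.
    private static final class Holder {
        static final List<String> PROFILES = load();
    }

    private static List<String> load() {
        LOAD_COUNT.incrementAndGet();
        // Assumption: a real loader would read the "langdetect" folder and fall back
        // to an empty list on failure instead of leaving callers with a null reference.
        return Collections.unmodifiableList(Arrays.asList("de", "en", "fr"));
    }

    static List<String> profiles() {
        return Holder.PROFILES;
    }
}
```

Because class initialization is serialized by the JVM, no explicit lock is needed and every caller sees the same fully built list.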
reger
b017e97421 optimize condenser language detection a little.
langdetect probabilities take letter case into account, so words from the
description, anchors, etc. are added as is.
+ document this in the javadoc
2016-10-06 19:03:52 +02:00
reger
6f0b073bf3 override the detected language (statistical langdetect) with the
TLD-determined language only if the langdetect probability is not high.
+ additionally truncate zh-cn / zh-tw returned by langdetect to the 2-char
ISO 639-1 code zh used by YaCy
2016-02-07 21:16:22 +01:00
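
The override rule in this commit might be sketched as follows. The class name `LanguageChoice`, the method names and the `HIGH_PROBABILITY` cut-off of 0.9 are assumptions for illustration; the commit message does not state the actual threshold or signatures.

```java
import java.util.Locale;

// Hypothetical helper illustrating the rule; not the code from the commit.
final class LanguageChoice {
    // Assumed cut-off; the commit does not state the actual value.
    static final double HIGH_PROBABILITY = 0.9;

    private LanguageChoice() {}

    // Truncates codes such as "zh-cn" or "zh-tw" to the two-letter ISO 639-1 form.
    static String toIso639_1(String code) {
        String lower = code.toLowerCase(Locale.ROOT);
        return lower.length() > 2 ? lower.substring(0, 2) : lower;
    }

    // Prefers the statistically detected language; the TLD language wins only
    // when nothing was detected or the detection probability is not high.
    static String choose(String detected, double probability, String tldLanguage) {
        if (detected == null) {
            return tldLanguage;
        }
        if (tldLanguage != null && probability < HIGH_PROBABILITY) {
            return tldLanguage;
        }
        return toIso639_1(detected);
    }
}
```

With these assumptions, `choose("zh-cn", 0.99, "de")` keeps the detected language and returns `zh`, while `choose("en", 0.5, "de")` falls back to `de`.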
reger
b65e2b527d include use of condenser's content text for language detection.
Language identification may perform poorly on documents with a short or
missing title but a clear language indication in the text content. Using the
content text as well improves language detection.
+ remove double caching of text in Identificator
2016-02-07 01:52:32 +01:00
Michael Peter Christen
135a123a77 less logging in new language detection 2015-12-03 00:39:15 +01:00
Ryszard Goń
59096935d0 Use language-detection library for increased accuracy 2015-07-02 18:41:13 +02:00
Michael Peter Christen
8b44fcf0f4 added missing @Override annotation 2014-03-28 13:48:37 +01:00
Michael Peter Christen
1a4a69c226 set more loggers to 'final static' 2013-11-13 06:18:48 +01:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based loggers tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a major performance issue. To overcome
this problem, this is an add-on to jdk logging that puts log entries on a
concurrent message queue and logs the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
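
The queue-based approach described in this commit can be sketched roughly as below. This is not the ConcurrentLog implementation: `QueuedLog`, its `written` list (a stand-in for the real log sink) and the poison-pill shutdown are assumptions, and the sketch uses a worker thread where the commit message says "separate process".

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a logger that hands entries to a single writer thread.
final class QueuedLog {
    // Unique instance used as a shutdown marker; compared by identity.
    private static final String POISON = new String("POISON");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    // Stands in for the real log sink (file, console, jdk logger).
    final List<String> written = new CopyOnWriteArrayList<>();
    private final Thread worker;

    QueuedLog() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    String msg = queue.take(); // blocks only the writer thread
                    if (msg == POISON) {
                        return;
                    }
                    written.add(msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "queued-log");
        worker.setDaemon(true);
        worker.start();
    }

    // Callers only enqueue; they never wait for the sink.
    void info(String message) {
        queue.add(message);
    }

    // Drains everything queued so far, then stops the writer thread.
    void close() {
        queue.add(POISON);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling threads pay only for an enqueue on an unbounded queue, so contention on the underlying logger is confined to the single writer.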
Michael Peter Christen
e101c2e0e2 added changes from copperdust (submitted by email):
1. Improved and fixed language detection:
	1.1 Identificator.java - recognition fix (improved)
	1.2 DCEntry.java - fix (changed detection order because detection from
the tld is incorrect in many cases)
	1.3 MultiProtocolURI.java - fixed and enhanced language from tld
detection (all currently used top-level domains; ccTLD added but not
tested).
2. Ukrainian language update.
3. Main Slavic languages langstats (tested and works fine).
2012-02-22 12:21:27 +01:00
orbiter
f667b9c289 enhanced identificator: using AtomicInteger for counter
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7785 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-19 13:31:10 +00:00
low012
3d95981f7d *) cleaning up the code a little bit
*) minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7396 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-27 17:07:21 +00:00
orbiter
60e71876ad - more abstraction (HashMap -> Map)
- more concurrency-awareness (HashMap -> ConcurrentHashMap)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6910 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-01 13:02:11 +00:00
orbiter
dd459281c8 applied code changes that are recommended by PMD
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6563 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-10 23:09:48 +00:00
orbiter
4a5100789f replaced _all_ size() == 0 with isEmpty() and all size() > 0 with !isEmpty(). The isEmpty() method is much faster in some cases, especially when used to access badly balanced hashtables where a size() operation becomes a large iteration.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6510 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-02 00:37:59 +00:00
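
The difference this commit refers to shows up in any collection whose size() has to count elements. The sketch below uses java.util.concurrent.ConcurrentLinkedQueue as an example, since its size() is documented as not constant-time; the class `EmptyCheck` and its method names are hypothetical and do not come from the commit.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical helper contrasting the two styles of emptiness check.
final class EmptyCheck {
    private EmptyCheck() {}

    // On a ConcurrentLinkedQueue, size() traverses every node to count them.
    static boolean hasWorkSlow(Queue<String> q) {
        return q.size() > 0;
    }

    // isEmpty() only needs to find the first live element.
    static boolean hasWork(Queue<String> q) {
        return !q.isEmpty();
    }

    // Builds a small queue for demonstration purposes.
    static Queue<String> sample(int n) {
        Queue<String> q = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < n; i++) {
            q.add("item-" + i);
        }
        return q;
    }
}
```

Both methods return the same answer; only the cost differs, and it grows with the number of queued elements for the size()-based variant.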
orbiter
b79f4f062f refactoring of yacy documents and parsers: they depend now only on the kelondro classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6426 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-18 00:53:43 +00:00