9 Commits

Author SHA1 Message Date
luccioman
5a14d34a7d Refactoring: documented and extracted autotagging processing functions. 2018-02-02 10:27:36 +01:00
reger
b017e97421 optimize condenser language detection a little.
langdetect probabilities take letter case into account, add words from
description and anchors etc. as is.
+ add it to javadoc
2016-10-06 19:03:52 +02:00
reger
ae3717d087 adjust Tokenizer sentence count to ignore repeated punctuation (like !!!!)
+ remove unused sentenceword map (we use only the count)
+ upd test case for sentence count
2016-10-06 03:41:07 +02:00
reger
474f0476c6 adjust Tokenizer sentence count on trailing text after last recognized sentence
+ upd test case for rwi multi-word-query (leaving results known to fail untested)
2016-10-05 05:52:37 +02:00
reger
96467c5467 remove unneeded counter in Tokenizer (completing last changes)
including a small change to word posintext counting.
We remember/store the 1st posintext. Previously, following words got a handle (posintext)
excluding found. Now it just counts and assigns the true posintext as handle (posintext).
2016-09-10 18:23:09 +02:00
reger
272cdd496a reactivate sentence counter in WordTokenizer for phrasepos ranking,
by counting punctuation (delivered as a 1-char word) again.
2016-09-07 02:16:16 +02:00
reger
e310ec5f70 fix posInText ranking calculation to score 0 on no position info
+ fix Word posInText calc in Tokenizer to start with 1
+ test case
2016-09-06 00:05:59 +02:00
Michael Peter Christen
90f75c8c3d added enrichment of synonyms and vocabularies for imported documents
during surrogate reading: those attributes from the dump are removed
during the import process and replaced by newly detected attributes
according to the settings of the YaCy peer.
This may cause all such attributes to be removed if the importing
peer has no synonyms and/or no vocabularies defined.
2015-07-02 00:23:50 +02:00
Michael Peter Christen
7829480b82 refactoring: separated condenser and tokenizer 2015-07-01 18:28:18 +02:00