Commit Graph

3039 Commits

Author SHA1 Message Date
reger
0c97cc2440 skip unused call parameter for hashSentence() 2014-11-30 19:42:33 +01:00
reger
5790c7242e skip tokenizing punctuation as a word in WordTokenizer
remove unused variables in condenser related to Tokenizer
2014-11-29 17:16:05 +01:00
reger
f07392ff17 add. use host port parameter in YaCyApp 2014-11-29 15:27:16 +01:00
Michael Peter Christen
ad0da5f246 added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
2014-11-29 11:56:32 +01:00
Michael Peter Christen
1d45d9405a security bugfix 2014-11-28 01:19:01 +01:00
Michael Peter Christen
ff728b4aa5 ignore url errors during search 2014-11-27 20:50:55 +01:00
Michael Peter Christen
8317914ce3 changed vocabulary navigator object type to TreeMap to get a specific
order into the vocabularies. The order is now lexicographic, which is less
random than a hashed order
2014-11-27 07:44:41 +01:00
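
The ordering effect described in this commit can be illustrated with a minimal sketch. The class and method names below are hypothetical and not taken from the YaCy source; the sketch only shows that copying a hashed map into a TreeMap yields keys in lexicographic order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not the YaCy implementation.
final class VocabularyOrderSketch {
    // Returns the vocabulary names in lexicographic (natural String) order,
    // regardless of the iteration order of the map passed in.
    static List<String> orderedNames(Map<String, Integer> navigators) {
        TreeMap<String, Integer> sorted = new TreeMap<String, Integer>(navigators);
        return new ArrayList<String>(sorted.keySet());
    }
}
```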
Michael Peter Christen
d5c1b07768 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-11-26 18:07:17 +01:00
Michael Peter Christen
c0f9f6ac66 added option to change the navbar-default, i.e. usable for dark skins 2014-11-26 18:01:35 +01:00
Michael Peter Christen
10794e8efd trying facet.method fc instead of fcs to handle large facets 2014-11-25 23:11:42 +01:00
Michael Peter Christen
041b605cfe Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-11-25 09:48:48 +01:00
Michael Peter Christen
f1f74e8626 toString fix 2014-11-24 20:53:40 +01:00
Michael Peter Christen
30276a2b48 prevent that a local Solr search and a local RWI search are running
concurrently. When a RWI search result is flushed into the result set,
it does Solr Queries (which replaced the old-style Metadata Queries) and
they are possibly running concurrently to a previously started Solr
search. Both methods may block each other with IO. To enhance the speed,
they are now serialized. Because the Solr search results may result in
better results using the more advanced and configurable Ranking methods,
this result is preferred over the RWI search result. However, remote RWI
search results are still fed concurrently into the search result as
well.
2014-11-24 20:53:19 +01:00
Michael Peter Christen
84763126e0 added option to make the YaCy proxy act as if the cache is never stale.
If set to 'Always Fresh' the cache is always used if the entry in the cache
exists. This is a good way to archive web content and access it without
going online again in case the documents exist.
To do so, open /Settings_p.html?page=ProxyAccess and check the "Always
Fresh" checkbox.
This is set to false by default, which behaves as before.
If you set this to true, then you have your web archive in DATA/HTCACHE.
Copy this to carry around your private copy of the internet!
2014-11-24 20:28:52 +01:00
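
The decision this option changes can be sketched as a small predicate. This is an assumption-laden illustration: the class, method, and parameter names are hypothetical, and the actual proxy handler may evaluate more conditions than shown here.

```java
// Hypothetical illustration of the 'Always Fresh' decision, not the YaCy code.
final class AlwaysFreshSketch {
    // cached:      an entry for the URL exists in the cache
    // stale:       the entry would normally be considered outdated
    // alwaysFresh: the 'Always Fresh' option is enabled
    static boolean serveFromCache(boolean cached, boolean stale, boolean alwaysFresh) {
        if (!cached) {
            return false; // nothing to serve, the document must be loaded online
        }
        // With 'Always Fresh' the staleness check is skipped entirely.
        return alwaysFresh || !stale;
    }
}
```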
reger
1e7ee72240 fix path lookup to ./defaults/yacy.badwords
(fix of commit ee277b9b3e)
2014-11-23 23:29:20 +01:00
reger
7d863d6254 fix empty text facet entry
(noticed on Author facet)
2014-11-23 23:12:01 +01:00
Michael Peter Christen
a39419f2ef more stacks shall be considered for on-demand loading, not only
deep-depth stacks to prevent "too many open files" problem
2014-11-23 20:11:23 +01:00
Michael Peter Christen
5bb52f79be reduce number of calls to queue.size() because that may be a bottleneck
during crawling
2014-11-23 20:09:32 +01:00
Michael Peter Christen
4920ab7b76 optimize usage of size() cache 2014-11-23 20:07:32 +01:00
reger
ee277b9b3e allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/)
if the file exists in DATA/SETTINGS it is loaded, otherwise the file in ./defaults is loaded
   (if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default)

move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory
2014-11-23 05:22:23 +01:00
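
The lookup order described in this commit (local file first, shipped default otherwise) can be sketched as follows. The class and method names are hypothetical, and the additional fallback to solr/lang/stopwords_xx.txt is left out of this sketch.

```java
import java.io.File;

// Hypothetical illustration of the lookup order, not the YaCy implementation.
final class WordListLookupSketch {
    // Prefers a user-provided list in the settings directory (e.g. DATA/SETTINGS)
    // and falls back to the shipped list in the defaults directory (e.g. ./defaults).
    static File resolve(File settingsDir, File defaultsDir, String name) {
        File local = new File(settingsDir, name);
        return local.isFile() ? local : new File(defaultsDir, name);
    }
}
```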
reger
de56266bcb remove redundant toLower for topwords 2014-11-22 22:49:23 +01:00
Michael Peter Christen
a34f837592 better delete all files in path when removing host crawl stack 2014-11-22 12:09:07 +01:00
Michael Peter Christen
10b1db430a if we have many hosts, use on-demand earlier 2014-11-22 12:04:04 +01:00
Michael Peter Christen
1324927e66 prevent division by zero 2014-11-22 12:01:00 +01:00
Michael Peter Christen
2beb6abeb6 disabled crazy sleep loop 2014-11-21 14:38:54 +01:00
Michael Peter Christen
70f03f7c8e do not cache search requests to Solr if the result is used for
doublechecking. If a double-check comes from cached results the
doublecheck fails.
2014-11-20 18:45:27 +01:00
Michael Peter Christen
a0b84e4def use a LinkedHashMap for facets to maintain facet order as given by solr 2014-11-20 18:44:29 +01:00
reger
ef5dc68313 include domtype to searcheventcache id
to differentiate between local / global events for reuse of cached events
fix for http://mantis.tokeek.de/view.php?id=493
2014-11-20 02:04:43 +01:00
Michael Peter Christen
0dc6e0a5f2 added option to enrich vocabularies with synonyms from synonym database 2014-11-19 18:12:43 +01:00
Michael Peter Christen
6a2a669db4 added loading of the synonyms file from addon/synonyms into the
knowledge loader
2014-11-19 17:36:56 +01:00
Michael Peter Christen
c67c5c0709 added new solr schema fields which record the occurrences of vocabulary
matchings. These matches can be used for result boosting, i.e. if a
document contains words from a specific vocabulary, boost it.
2014-11-18 15:02:34 +01:00
Michael Peter Christen
a67a465415 fix field counter for multi-fields in html writer for the solr servlet 2014-11-18 12:11:18 +01:00
Michael Peter Christen
ec9d021568 added option in vocabulary editor to import CSV files with different
encodings (preselected windows-type character encoding which is typical
for CSV files). Fixed also other problems with character encoding in
dictionary files. Automatically generated vocabularies are now also
noted in the API steering.
2014-11-17 14:22:40 +01:00
reger
3c818fc912 add a check of java version string >=1.7 to startup class
stopping start with error msg on version < 1.7
2014-11-16 01:26:07 +01:00
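
A version check of this kind can be sketched as below. This is only an illustration under assumptions: the class and method names are hypothetical, and the startup class may parse the version string differently. The input is the kind of value returned by System.getProperty("java.version"), such as "1.7.0_71".

```java
// Hypothetical illustration of a ">= 1.7" version check, not the YaCy startup code.
final class JavaVersionCheckSketch {
    // Accepts strings such as "1.6.0_45", "1.7.0_71" or "11.0.2".
    static boolean isSupported(String version) {
        String[] parts = version.split("[._\\-+]");
        int major = parseOrZero(parts[0]);
        int minor = parts.length > 1 ? parseOrZero(parts[1]) : 0;
        return major > 1 || (major == 1 && minor >= 7);
    }

    // Non-numeric components (for example "ea") count as zero.
    private static int parseOrZero(String s) {
        try {
            return Integer.parseInt(s);
        } catch (NumberFormatException e) {
            return 0;
        }
    }
}
```

A caller would typically print an error message and stop when isSupported(System.getProperty("java.version")) returns false.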
Michael Peter Christen
0550b54d56 added fix to postprocessing: avoid caching of postprocessing collection
to always get fresh lists of documents. This is necessary since the
postprocessing changes the same documents which the
postprocessing-collection query selects.
2014-11-14 16:34:55 +01:00
Michael Peter Christen
68e8039fd1 added high-precision scheduler for API processes. This also allows
the execution to depend on available RAM or CPU load. The
default value for CPU load is 4.0 and the check runs once a minute.
2014-11-14 10:02:50 +01:00
Michael Peter Christen
8aee7f940e added missing class for latest changes 2014-11-13 01:30:12 +01:00
Michael Peter Christen
97039049e4 fix in key enumeration methods for cases where the enumeration is done
in reverse order.
2014-11-13 01:15:31 +01:00
Michael Peter Christen
7e1b0b6712 fix for wildcard patch in search queries 2014-11-13 00:59:30 +01:00
Michael Peter Christen
0a879c98e7 added new 'firstSeen' database table and necessary data structures which
hold a date for each URL to record when a url was first seen. This is
then used to overwrite the modification date for urls upon recrawl in
case that the first-seen date is before the latest document date. This
behaviour is necessary due to the common behaviour of content management
systems which always attach the current date to all documents. Using the
firstSeen database it is possible to approximate a real first document
creation date in case that the crawler starts frequently for the same
domain. As a result the search results ordered by date have a much
better quality, and so does the usage of YaCy as a search agent for the
latest news.
2014-11-13 00:58:58 +01:00
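
The first-seen rule described in this commit can be sketched with an in-memory map. This is a minimal sketch under assumptions: the class and method names are hypothetical, dates are plain millisecond timestamps, and the real implementation uses a persistent database table rather than a HashMap.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the first-seen rule, not the YaCy implementation.
final class FirstSeenSketch {
    private final Map<String, Long> firstSeen = new HashMap<String, Long>();

    // Records the first time a url is seen and returns the date to index:
    // the first-seen date if it is earlier than the date reported by the
    // document, otherwise the reported date.
    long effectiveDate(String url, long reportedModified, long now) {
        Long seen = firstSeen.get(url);
        if (seen == null) {
            seen = Long.valueOf(now);
            firstSeen.put(url, seen);
        }
        return seen.longValue() < reportedModified ? seen.longValue() : reportedModified;
    }
}
```

With this rule, a page whose content management system reports the current date on every recrawl keeps the date of its first crawl instead of appearing new each time.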
Michael Peter Christen
421ee64f33 another fix to ordering of table indexes; fixes also network stats
graphics
2014-11-11 13:57:04 +01:00
Michael Peter Christen
1db476c67e fix for bad table iteration 2014-11-10 18:52:01 +01:00
reger
e4316e2d74 skip creation of local var in proxyhandler.storetocache 2014-11-09 04:17:14 +01:00
sixcooler
9c6e3a6b1c fix assertion failure in version-string for Solr-4.10.2 by changing
the assert - hope that is ok
+ add forgotten NB project changes
2014-11-07 22:43:50 +01:00
sixcooler
725b206fb4 update to solr-/lucene-4.10.2 2014-11-07 18:51:31 +01:00
Michael Peter Christen
5c97ecb30f fix of bad query generation for search facets 2014-11-07 18:11:49 +01:00
Michael Peter Christen
95d87f00b3 fix for bad query generation in doublecheck in postprocessing 2014-11-07 18:11:23 +01:00
orbiter
72c2bc5189 fix for search in case where local peer has no local seed address in
portal mode
2014-11-02 21:16:51 +01:00
orbiter
5be352da99 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-11-02 20:35:08 +01:00
orbiter
0fcd8097a3 removed unused options from BusyThreads 2014-11-02 20:08:49 +01:00