3083 Commits

Author SHA1 Message Date
Michael Peter Christen
8317914ce3 changed vocabulary navigator object type to TreeMap to give the
vocabularies a defined order. The order is now lexicographic, which is
less random than a hashed order
2014-11-27 07:44:41 +01:00
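The ordering effect described in this commit can be illustrated with a minimal sketch. The class name, the method, and the vocabulary names are hypothetical and are not taken from the YaCy source; the sketch only relies on the documented behaviour that a TreeMap iterates its keys in their natural (for strings, lexicographic) order, while a HashMap gives no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not YaCy code.
class VocabularyOrderSketch {

    // Copies the navigators into a TreeMap and returns the keys in iteration order.
    static List<String> orderedNames(Map<String, Integer> navigators) {
        return new ArrayList<>(new TreeMap<>(navigators).keySet());
    }

    public static void main(String[] args) {
        // A HashMap makes no promise about the order of its keys.
        Map<String, Integer> hashed = new HashMap<>();
        hashed.put("places", 3);
        hashed.put("authors", 7);
        hashed.put("topics", 5);
        // After the copy, the names come out sorted lexicographically.
        System.out.println(orderedNames(hashed));
    }
}
```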
Michael Peter Christen
d5c1b07768 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-11-26 18:07:17 +01:00
Michael Peter Christen
c0f9f6ac66 added option to change the navbar-default, i.e. usable for dark skins 2014-11-26 18:01:35 +01:00
Michael Peter Christen
10794e8efd trying facet.method fc instead of fcs to handle large facets 2014-11-25 23:11:42 +01:00
Michael Peter Christen
041b605cfe Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-11-25 09:48:48 +01:00
Michael Peter Christen
f1f74e8626 toString fix 2014-11-24 20:53:40 +01:00
Michael Peter Christen
30276a2b48 prevent a local Solr search and a local RWI search from running
concurrently. When an RWI search result is flushed into the result set,
it issues Solr queries (which replaced the old-style metadata queries) and
these may run concurrently with a previously started Solr
search. Both methods may block each other with IO. To improve speed,
they are now serialized. Because the Solr search may produce
better results through its more advanced and configurable ranking methods,
its result is preferred over the RWI search result. However, remote RWI
search results are still fed concurrently into the search result as
well.
2014-11-24 20:53:19 +01:00
Michael Peter Christen
84763126e0 added option to make the YaCy proxy act as if the cache is never stale. If
set to 'Always Fresh', the cache is always used if the entry exists in the
cache. This is a good way to archive web content and access it without
going online again in case the documents exist.
To do so, open /Settings_p.html?page=ProxyAccess and check the "Always
Fresh" checkbox.
This is set to false by default, which behaves as before.
If you set this to true, then you have your web archive in DATA/HTCACHE.
Copy this to carry around your private copy of the internet!
2014-11-24 20:28:52 +01:00
reger
1e7ee72240 fix path lookup to ./defaults/yacy.badwords
(fix of commit ee277b9b3e)
2014-11-23 23:29:20 +01:00
reger
7d863d6254 fix empty text facet entry
(noticed on Author facet)
2014-11-23 23:12:01 +01:00
Michael Peter Christen
a39419f2ef more stacks shall be considered for on-demand loading, not only
deep-depth stacks to prevent "too many open files" problem
2014-11-23 20:11:23 +01:00
Michael Peter Christen
5bb52f79be reduce number of calls to queue.size() because that may be a bottleneck
during crawling
2014-11-23 20:09:32 +01:00
Michael Peter Christen
4920ab7b76 optimize usage of size() cache 2014-11-23 20:07:32 +01:00
reger
ee277b9b3e allow for local yacy.stopwords and yacy.badwords lists (in DATA/SETTINGS/)
if the file exists in DATA/SETTINGS it is loaded, otherwise the file in ./defaults is loaded
   (if the locale file ./defaults/stopwords.xx doesn't exist, solr/lang/stopwords_xx.txt is taken as default)

move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory
2014-11-23 05:22:23 +01:00
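The lookup rule in this commit (prefer the file in DATA/SETTINGS, otherwise fall back to ./defaults) can be sketched as follows. This is a hedged illustration only: the class name, the method, and the `appRoot` parameter are hypothetical, and the additional locale fallback to solr/lang/stopwords_xx.txt is deliberately left out.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration, not YaCy code.
class WordListLocatorSketch {

    // Returns the user-provided file in DATA/SETTINGS if it exists,
    // otherwise the shipped file in ./defaults (which may itself be missing).
    static Path locate(Path appRoot, String fileName) {
        Path local = appRoot.resolve("DATA").resolve("SETTINGS").resolve(fileName);
        if (Files.isRegularFile(local)) {
            return local;
        }
        return appRoot.resolve("defaults").resolve(fileName);
    }
}
```

A caller would pass the installation directory and a name such as `yacy.stopwords`, then check whether the returned path exists before reading it.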
reger
de56266bcb remove redundant toLower for topwords 2014-11-22 22:49:23 +01:00
Michael Peter Christen
a34f837592 better delete all files in path when removing host crawl stack 2014-11-22 12:09:07 +01:00
Michael Peter Christen
10b1db430a if we have many hosts, use on-demand earlier 2014-11-22 12:04:04 +01:00
Michael Peter Christen
1324927e66 prevent division by zero 2014-11-22 12:01:00 +01:00
Michael Peter Christen
2beb6abeb6 disabled crazy sleep loop 2014-11-21 14:38:54 +01:00
Michael Peter Christen
70f03f7c8e do not cache search requests to Solr if the result is used for
doublechecking. If a double-check comes from cached results the
doublecheck fails.
2014-11-20 18:45:27 +01:00
Michael Peter Christen
a0b84e4def use a LinkedHashMap for facets to maintain facet order as given by solr 2014-11-20 18:44:29 +01:00
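Why a LinkedHashMap preserves the order delivered by Solr can be shown with a small sketch. The class, method, and facet values are hypothetical and not taken from the YaCy source; the only fact relied on is that a LinkedHashMap iterates in insertion order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not YaCy code.
class FacetOrderSketch {

    // Collects (name, count) pairs in the order they arrive,
    // for example the order of entries in a facet response.
    static Map<String, Integer> collect(String[] names, int[] counts) {
        Map<String, Integer> facets = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            facets.put(names[i], counts[i]);
        }
        return facets; // iteration order equals insertion order
    }
}
```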
reger
ef5dc68313 include domtype to searcheventcache id
to differentiate between local / global events for reuse of cached events
fix for http://mantis.tokeek.de/view.php?id=493
2014-11-20 02:04:43 +01:00
Michael Peter Christen
0dc6e0a5f2 added option to enrich vocabularies with synonyms from synonym database 2014-11-19 18:12:43 +01:00
Michael Peter Christen
6a2a669db4 added loading of the synonyms file from addon/synonyms into the
knowledge loader
2014-11-19 17:36:56 +01:00
Michael Peter Christen
c67c5c0709 added new solr schema fields which record the occurrences of vocabulary
matchings. These matches can be used for result boosting, i.e. if a
document contains words from a specific vocabulary, boost it.
2014-11-18 15:02:34 +01:00
Michael Peter Christen
a67a465415 fix field counter for multi-fields in html writer for the solr servlet 2014-11-18 12:11:18 +01:00
Michael Peter Christen
ec9d021568 added option in vocabulary editor to import CSV files with different
encodings (preselected windows-type character encoding which is typical
for CSV files). Fixed also other problems with character encoding in
dictionary files. Automatically generated vocabularies are now also
noted in the API steering.
2014-11-17 14:22:40 +01:00
reger
3c818fc912 add a check of java version string >=1.7 to startup class
stopping start with error msg on version < 1.7
2014-11-16 01:26:07 +01:00
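A version check of this kind might look like the sketch below. It is not the code from the commit: the class and method names are hypothetical, and the parsing rule (split on `.`, `_` and `-`, then compare the leading numbers) is an assumption chosen so that both legacy strings such as `1.7.0_71` and newer ones such as `11.0.2` are handled.

```java
// Hypothetical illustration, not YaCy code.
class JavaVersionCheckSketch {

    // Returns true if the version string denotes Java 1.7 or newer.
    static boolean isAtLeast17(String version) {
        String[] parts = version.split("[._-]");
        try {
            int major = Integer.parseInt(parts[0]);
            if (major > 1) {
                return true; // new-style versions such as "9" or "11.0.2"
            }
            if (major < 1 || parts.length < 2) {
                return false;
            }
            return Integer.parseInt(parts[1]) >= 7; // legacy "1.x" versions
        } catch (NumberFormatException e) {
            return false; // unparsable strings are treated as too old
        }
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        if (!isAtLeast17(version)) {
            System.err.println("Java 1.7 or newer is required, found " + version);
            System.exit(1);
        }
        System.out.println("Java version ok: " + version);
    }
}
```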
Michael Peter Christen
0550b54d56 added fix to postprocessing: avoid caching of postprocessing collection
to always get fresh lists of documents. This is necessary since the
postprocessing changes the same documents which the
postprocessing-collection query selects.
2014-11-14 16:34:55 +01:00
Michael Peter Christen
68e8039fd1 added high-precision scheduler for API processes. This also makes it
possible to make execution dependent on available RAM or CPU load. The
default value for CPU load is 4.0 and the check runs once a minute.
2014-11-14 10:02:50 +01:00
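A load-dependent gate like the one described could be sketched as below. Everything here is an assumption for illustration: the class and method names are hypothetical, only the threshold 4.0 comes from the commit message, and the use of `OperatingSystemMXBean.getSystemLoadAverage()` (which returns a negative value when the load is unavailable) is one possible way to read the load, not necessarily the one YaCy uses. The RAM condition and the once-a-minute scheduling are omitted.

```java
import java.lang.management.ManagementFactory;

// Hypothetical illustration, not YaCy code.
class LoadGateSketch {

    static final double MAX_LOAD = 4.0; // default value named in the commit message

    // Pure decision function: a negative load means "not available" and does not block.
    static boolean mayRun(double systemLoad, double maxLoad) {
        return systemLoad < 0 || systemLoad <= maxLoad;
    }

    // Reads the one-minute system load average and applies the default threshold.
    static boolean mayRunNow() {
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return mayRun(load, MAX_LOAD);
    }
}
```

A scheduler would call `mayRunNow()` before starting a due API process and skip or postpone the run when it returns false.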
Michael Peter Christen
8aee7f940e added missing class for latest changes 2014-11-13 01:30:12 +01:00
Michael Peter Christen
97039049e4 fix in key enumeration methods for cases where the enumeration is done
in reverse order.
2014-11-13 01:15:31 +01:00
Michael Peter Christen
7e1b0b6712 fix for wildcard patch in search queries 2014-11-13 00:59:30 +01:00
Michael Peter Christen
0a879c98e7 added new 'firstSeen' database table and necessary data structures which
hold a date for each URL to record when the URL was first seen. This is
then used to overwrite the modification date for URLs upon recrawl in
case the first-seen date is before the latest document date. This
behaviour is necessary because content management
systems commonly attach the current date to all documents. Using the
firstSeen database it is possible to approximate the real document
creation date if the crawler runs frequently on the same
domain. As a result, search results ordered by date are of much
better quality, and YaCy works better as a search agent for the latest
news.
2014-11-13 00:58:58 +01:00
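The date rule from this commit can be sketched in a few lines. The class, its in-memory map, and the method names are hypothetical stand-ins for the 'firstSeen' table; the sketch only shows the rule stated in the message: keep the date of the first observation and use it whenever it is earlier than the date the document reports.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not YaCy code.
class FirstSeenSketch {

    // Stand-in for the 'firstSeen' table: URL -> time of first observation (millis).
    private final Map<String, Long> firstSeen = new HashMap<>();

    // Stores the crawl time only on the first observation and returns the stored value.
    long observe(String url, long crawlTimeMillis) {
        Long known = firstSeen.get(url);
        if (known == null) {
            firstSeen.put(url, crawlTimeMillis);
            return crawlTimeMillis;
        }
        return known;
    }

    // Uses the first-seen date when it is earlier than the date reported by the document.
    long effectiveDate(String url, long crawlTimeMillis, long documentDateMillis) {
        long seen = observe(url, crawlTimeMillis);
        return Math.min(seen, documentDateMillis);
    }
}
```

With this rule, a page whose CMS stamps every response with the current date keeps the date of its first crawl on every later recrawl.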
Michael Peter Christen
421ee64f33 another fix to ordering of table indexes; fixes also network stats
graphics
2014-11-11 13:57:04 +01:00
Michael Peter Christen
1db476c67e fix for bad table iteration 2014-11-10 18:52:01 +01:00
reger
e4316e2d74 skip creation of local var in proxyhandler.storetocache 2014-11-09 04:17:14 +01:00
sixcooler
9c6e3a6b1c fix assertion failure in version string for Solr-4.10.2 by changing
the assert - hope that is ok
+ add forgotten NB project changes
2014-11-07 22:43:50 +01:00
sixcooler
725b206fb4 update to solr-/lucene-4.10.2 2014-11-07 18:51:31 +01:00
Michael Peter Christen
5c97ecb30f fix of bad query generation for search facets 2014-11-07 18:11:49 +01:00
Michael Peter Christen
95d87f00b3 fix for bad query generation in doublecheck in postprocessing 2014-11-07 18:11:23 +01:00
orbiter
72c2bc5189 fix for search in case where local peer has no local seed address in
portal mode
2014-11-02 21:16:51 +01:00
orbiter
5be352da99 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-11-02 20:35:08 +01:00
orbiter
0fcd8097a3 removed unused options from BusyThreads 2014-11-02 20:08:49 +01:00
Michael Peter Christen
fe8b1d137d emergency bugfix for 100% CPU in image drawing 2014-11-02 13:28:10 +01:00
Michael Peter Christen
92007e5d2d more enhancements to postprocessing speed 2014-11-02 12:52:23 +01:00
Michael Peter Christen
9a7fe9e0d1 fix for bad timing computation in postprocessing 2014-10-31 23:17:56 +01:00
Michael Peter Christen
bd16119a00 another fix for postprocessing (the query for "" on numeric field did
not work in external solr)
2014-10-31 17:44:45 +01:00
Michael Peter Christen
327e83bfe7 more fixes in postprocessing: partitioning of the complete queue to
enable smaller queries
2014-10-31 17:30:24 +01:00
orbiter
2bc6199408 more concurrency for postprocessing 2014-10-30 21:52:52 +01:00
orbiter
a83cf26c38 more fixes and enhancements to postprocessing 2014-10-30 20:53:57 +01:00
orbiter
71758f0d62 enhanced postprocessing by usage of a field-list generation to prevent
lazy initialization of the documents. This is useful because the
documents must be read completely anyway.
2014-10-30 18:05:48 +01:00
orbiter
7856fbdbe8 fix for npe (in rare cases) 2014-10-30 15:20:35 +01:00
orbiter
8a2b569d7c fix for literal computation 2014-10-30 15:01:27 +01:00
orbiter
856da2712b Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-10-29 16:53:18 +01:00
orbiter
ca9cd7b58a more IPv6 fixes 2014-10-29 16:52:58 +01:00
Michael Peter Christen
b4585e9546 added new index size history image in /Status.html page 2014-10-29 13:37:44 +01:00
Michael Peter Christen
167c5a51f0 IPv6 fix 2014-10-28 15:36:13 +01:00
Michael Peter Christen
fe537679de fix for exact_signature_unique_b, exact_signature_copycount_i,
fuzzy_signature_unique_b and fuzzy_signature_copycount_i: apply same
criteria for 'valid document' as for title and description uniqueness
test.
2014-10-24 15:04:40 +02:00
sixcooler
eb9d2705d2 fix for ConnectionInfo.cleanup of server-connections 2014-10-22 11:25:07 +02:00
Michael Peter Christen
2e5214eb21 added field postprocessing.partialUpdate to settings which can be used
to switch on or off partial updates. Both options should cause the same
result. Default is on.
2014-10-17 14:17:49 +02:00
Michael Peter Christen
11074d8d24 fix for an SSL bug that appears only in Java 7.
The bug was reported in
http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5407&p=30956#p30956
a solution was described in
http://teknosrc.com/javax-net-ssl-sslprotocolexception-handshake-alert-unrecognized_name-solved/
which worked for this example given in the yacy forum
2014-10-17 13:25:17 +02:00
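A commonly cited workaround for the `unrecognized_name` handshake alert in Java 7 is to disable the SNI extension through the `jsse.enableSNIExtension` system property. Whether this commit does exactly that is an assumption; the class and method names below are hypothetical, and the sketch only shows how such a switch would be set.

```java
// Hypothetical illustration, not YaCy code.
class SniWorkaroundSketch {

    // Assumption: called early during startup, before any TLS connection is opened.
    static void disableSni() {
        System.setProperty("jsse.enableSNIExtension", "false");
    }
}
```

The trade-off is that servers which rely on SNI to select a certificate for a virtual host may then present the wrong certificate.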
Michael Peter Christen
e96490e3a1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-10-17 12:51:35 +02:00
Michael Peter Christen
77662e08e1 concurrently initialize the error cache; also extended the cache by
a factor of 10, up to 1000 entries. This error cache is only used to catch up
paused crawls between shutdown+startup
2014-10-17 12:45:26 +02:00
sixcooler
d8fcc4a2f5 added a timeout on Jetty connectors 2014-10-16 20:36:12 +02:00
Michael Peter Christen
0f0b60404b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-10-15 22:04:45 +02:00
sixcooler
72561926aa do not overwrite yacy.conf in case of an exception
may be a fix for http://mantis.tokeek.de/view.php?id=180
2014-10-15 18:13:54 +02:00
Michael Peter Christen
07c5b57953 removed warnings 2014-10-15 11:19:25 +02:00
orbiter
fa2ad101ec enhanced graphics computation (avoiding long string parsing for colours) 2014-10-15 10:31:24 +02:00
orbiter
ef813cec91 added proper copyright notice to OSM tiles presented at the search
result page
2014-10-15 09:13:23 +02:00
Michael Peter Christen
fca11701f0 better profiling of solr queries 2014-10-15 00:55:42 +02:00
Michael Peter Christen
2e09da9832 npe fix 2014-10-14 12:48:15 +02:00
Michael Peter Christen
d80418f1b1 added partial updates to solr during postprocessing: during
postprocessing the solr documents are no longer retrieved completely.
Instead, only the fields needed for the postprocessing are extracted. When
Solr documents are written, this is done using partial updates.

This increases postprocessing speed by about 50% for embedded Solr
configurations. For external Solr configurations the enhancement should
be much higher because postprocessing with a remote Solr is very slow.
With partial updates to a remote Solr, this method should perform
much better than before; the gain is expected to be even higher
than the increase with local Solr.
2014-10-14 12:19:59 +02:00
Michael Peter Christen
b1cfbc4a04 added new solr field url_paths_count_i which can be used to enhance the
index browser and maybe also for ranking; possibly also for
SEO-with-YaCy applications.
2014-10-13 23:51:19 +02:00
Michael Peter Christen
e69883d5ab fix-fix for
30d4402cd1
2014-10-13 16:51:27 +02:00
Michael Peter Christen
30d4402cd1 fixed location search 2014-10-13 14:28:11 +02:00
Michael Peter Christen
6983dff334 explain crawl denial when not switched to intranet mode 2014-10-11 09:02:12 +02:00
Michael Peter Christen
f818f84adb more ipv6 fixes 2014-10-11 00:34:07 +02:00
Michael Peter Christen
afd5bd5f5f slightly enhanced Network table computation by using a lazy initialized
bitfield for peer flags
2014-10-10 14:40:31 +02:00
Michael Peter Christen
2c2b50e65d refactoring (class name should start with uppercase letter) 2014-10-10 14:32:21 +02:00
Michael Peter Christen
bc275dca07 added network history graph image /NetworkHistory.png which can show
many different statistics about the history of the peer.
2014-10-10 14:06:47 +02:00
Marc Nause
ce9368246b Merge branch 'master' of gitorious.org:yacy/rc1 2014-10-09 13:35:31 +02:00
Marc Nause
5603809deb Minor changes:
*) reduced visibility of a method
*) updated comments
2014-10-09 13:31:36 +02:00
Michael Peter Christen
d8beafba3a fix for values in CrawlProfileEditor table and xml; now the full profile
is available in the xml.
2014-10-09 13:27:20 +02:00
Michael Peter Christen
ec95dfa2e6 fixed crawl profile xml result which did not show the correct crawl
status.
2014-10-08 18:48:57 +02:00
Michael Peter Christen
8c1a89cb34 added another decoration flag to switch off network graphics in crawler
monitor and index browser: decoration.grafics.linkstructure
Please set this to false to remove the graphics from the interface.
2014-10-08 17:12:35 +02:00
Michael Peter Christen
ee27be3399 misc bugfixes (concurrency, memory protection) 2014-10-08 15:22:29 +02:00
Michael Peter Christen
9b1958e8ca more ipv6 bugfixes 2014-10-08 15:21:49 +02:00
Michael Peter Christen
7817fc50c9 added a high cpu cycle monitor to PerformanceQueues 2014-10-08 15:20:43 +02:00
Michael Peter Christen
5082feb103 less volume for effect sounds 2014-10-08 15:04:35 +02:00
Michael Peter Christen
e8392e2ff2 fix for local search 2014-10-08 13:44:03 +02:00
Michael Peter Christen
0bfc69b29b more ipv6 bugfixes 2014-10-08 12:38:56 +02:00
Michael Peter Christen
a27563e5c3 removed the atmo sound clips because they were too large 2014-10-07 23:42:41 +02:00
Michael Peter Christen
883622306e Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/peers/Protocol.java
2014-10-07 23:33:28 +02:00
Michael Peter Christen
97995a1dd9 fix for remote search process 2014-10-07 23:30:32 +02:00
Michael Peter Christen
0843b12ef3 ipv6 fix: avoid that the shrunk own ip set is overwritten with the (non-valid)
set of local IPs
2014-10-07 22:36:01 +02:00
Michael Peter Christen
92c5d97486 fix for bad node flag setting with IPv6 2014-10-07 22:16:18 +02:00
orbiter
c27bad9326 more ipv6 fixes 2014-10-07 20:09:48 +02:00
orbiter
cddf884bc4 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-10-07 19:27:14 +02:00
Michael Peter Christen
460858fb22 more ipv6 fixes 2014-10-07 18:53:23 +02:00