Commit Graph

11423 Commits

Author SHA1 Message Date
reger
28456dfc09 skip creation of unused Bluelist contenttransformer 2014-12-02 21:03:00 +01:00
Michael Peter Christen
321840fde3 Replaced all fixed thread pools with cached thread pools. The cached
thread pools flush their cached (idle) threads after 60 seconds.
As a result, YaCy now runs constantly with about 50 threads,
about 100 at peak times. Previously, about 400 threads had been cached
and kept in a hibernation state, so the numproc counter
in /proc/user_beancounters (exists only in VM-hosted linux) was as high
as the number of cached threads. This caused VM supervisors to
terminate whole VM sessions when a limit was reached. Many VM providers
have limits of numproc=96, which made it virtually impossible to run YaCy
on such machines. With this change, it will be possible to run many YaCy
instances even on VM hosts.
2014-12-02 16:26:07 +01:00
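
The difference between the two pool types described above can be sketched as follows. This is a minimal illustration, not YaCy's actual code: the class and method names are hypothetical, and the keep-alive is made configurable only so the effect shows up without waiting the 60 seconds that Executors.newCachedThreadPool() uses.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedPoolSketch {

    // Same shape as Executors.newCachedThreadPool(): no core threads,
    // idle threads are discarded after the keep-alive time.
    static ThreadPoolExecutor cachedPool(long keepAliveMillis) {
        return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                keepAliveMillis, TimeUnit.MILLISECONDS,
                new SynchronousQueue<Runnable>());
    }

    // Same shape as Executors.newFixedThreadPool(size): core threads
    // stay alive even when they have nothing to do.
    static ThreadPoolExecutor fixedPool(int size) {
        return new ThreadPoolExecutor(size, size,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
    }

    // Runs the given number of short tasks, waits idleMillis, then
    // reports how many threads the pool still holds.
    static int threadsLeftAfterIdle(ThreadPoolExecutor pool, int tasks, long idleMillis) {
        CountDownLatch done = new CountDownLatch(tasks);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(done::countDown);
            }
            done.await();
            Thread.sleep(idleMillis);
            return pool.getPoolSize();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor cached = cachedPool(100);
        ThreadPoolExecutor fixed = fixedPool(4);
        System.out.println("cached pool threads after idle: "
                + threadsLeftAfterIdle(cached, 4, 1000));
        System.out.println("fixed pool threads after idle:  "
                + threadsLeftAfterIdle(fixed, 4, 1000));
        cached.shutdown();
        fixed.shutdown();
    }
}
```

On a host that counts every thread against a process limit, the cached variant only holds threads while there is work for them, which is the behaviour the commit relies on.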
Michael Peter Christen
181911376c show list of all threads in threaddump using the ThreadMXBean counter
(this obviously shows more threads than before?)
2014-12-02 16:21:06 +01:00
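
A minimal sketch of listing threads through the ThreadMXBean, using only the standard java.lang.management API. The class and method names are hypothetical and this is not the YaCy threaddump code. The bean reports every live thread in the JVM, including daemon threads, which may be one reason the list is longer than one gathered by other means (an assumption, since the commit does not say how threads were listed before).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class ThreadCountSketch {

    // Number of live threads (daemon and non-daemon) as seen by the JVM.
    static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    // Name and state of every live thread known to the ThreadMXBean.
    static List<String> liveThreadNames() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = bean.getThreadInfo(bean.getAllThreadIds());
        List<String> names = new ArrayList<>();
        for (ThreadInfo info : infos) {
            // getThreadInfo returns null for threads that died in the meantime
            if (info != null) {
                names.add(info.getThreadName() + " [" + info.getThreadState() + "]");
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("live threads: " + liveThreadCount());
        for (String name : liveThreadNames()) {
            System.out.println("  " + name);
        }
    }
}
```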
Michael Peter Christen
7bfab5eb9d set Busy- and Blocking-Threads to daemon mode (they no longer prevent
YaCy from terminating if they are still running)
2014-12-02 16:05:00 +01:00
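
Daemon mode as used above can be sketched like this. The helper name is hypothetical and the worker is a stand-in for a long-running busy thread, not YaCy's own thread classes. The JVM exits once only daemon threads remain, so the endless loop below does not keep the process alive.

```java
public class DaemonSketch {

    // Creates a thread in daemon mode; setDaemon must be called before start().
    static Thread daemon(String name, Runnable job) {
        Thread t = new Thread(job, name);
        t.setDaemon(true);
        return t;
    }

    public static void main(String[] args) {
        Thread worker = daemon("busy-worker", () -> {
            while (true) {
                try {
                    Thread.sleep(1000); // stand-in for periodic work
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        worker.start();
        System.out.println("main returns; the JVM can exit although "
                + worker.getName() + " is still running");
    }
}
```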
Michael Peter Christen
64887f6b21 show number of threads on status page 2014-12-02 16:04:11 +01:00
Michael Peter Christen
e586e423aa in case that loading from the cache fails, load from wkhtmltopdf without
cache using the user agent string given in the crawl profile
2014-12-02 13:35:19 +01:00
Michael Peter Christen
d5bac64421 recognize more html file types for snapshots 2014-12-02 12:52:36 +01:00
Michael Peter Christen
6f0167fac1 get cloned crawl start parameter for snapshots 2014-12-02 12:52:05 +01:00
Michael Peter Christen
a1ee101079 recognize more html file extensions 2014-12-02 12:10:44 +01:00
Michael Peter Christen
8480641f2d fix to xvfb-run usage (quotes did not parse in xvfb-run, default values
are appropriate)
2014-12-02 11:51:12 +01:00
Michael Peter Christen
68b040e31e added fail-over for a missing http proxy service (e.g. overload) and quiet
mode
2014-12-01 18:21:52 +01:00
Michael Peter Christen
25a64c51b3 moved snapshot generation out of the html handler so that
existing cache entries no longer prevent the handler from being executed
2014-12-01 17:37:25 +01:00
Michael Peter Christen
c35170a305 more logging 2014-12-01 16:50:37 +01:00
Michael Peter Christen
e8be07ec78 grr 2014-12-01 16:38:07 +01:00
Michael Peter Christen
6f81bb756c wrap wkhtmltopdf with xvfb if necessary 2014-12-01 16:26:28 +01:00
Michael Peter Christen
0119f8665d more logging when failing to create pdf snapshot 2014-12-01 16:00:45 +01:00
Michael Peter Christen
416fe886e3 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-01 15:20:24 +01:00
Michael Peter Christen
60f27bdf49 added the property timeoutrequests to configuration to disable
TimeoutRequests. The purpose is to test if YaCy runs better on VMs where
there is a limitation of concurrent processes; see
/proc/user_beancounters in row numproc; this value is limited and should
be low. Try to set timeoutrequests to keep this low. (works only after
restart)
2014-12-01 15:20:10 +01:00
Michael Peter Christen
97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
be transcoded into jpg for image previews. To create such pdfs you must
do:

Add wkhtmltopdf and imagemagick to your OS, which you can do as follows:
On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from
http://wkhtmltopdf.org/downloads.html and download
http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip
In Debian do "apt-get install wkhtmltopdf imagemagick"

Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and
"Always Fresh" - this is used by wkhtmltopdf to fetch web pages using
the YaCy proxy. Using "Always Fresh" it is possible to get all pages
from the proxy cache.

Finally, you will see a new option when starting an expert web crawl.
You can set a maximum depth for crawling which should cause a pdf
generation. The resulting pdfs are then available in
DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf
2014-12-01 15:03:09 +01:00
Michael Peter Christen
41d00350e4 moved network configuration to Use Case submenu; this is necessary
because the definition of portal peers within the YaCy freeworld network
is otherwise split across two different main menus.
2014-12-01 01:12:51 +01:00
reger
ff80700aff replace deprecated Solr DateField.formatExternal with recommended TrieDateField.formatExternal 2014-12-01 00:21:30 +01:00
Michael Peter Christen
9ea120dbe5 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-11-30 22:02:25 +01:00
reger
aa7122f079 update to guava.18.0.jar and jsch.0.1.51.jar 2014-11-30 19:43:53 +01:00
reger
0c97cc2440 skip unused call parameter for hashSentence() 2014-11-30 19:42:33 +01:00
reger
221f86dd5e position api icon (ViewFile.html) 2014-11-30 01:58:14 +01:00
reger
4c14a8b44d update to poi-3.10.1.jar 2014-11-29 22:36:02 +01:00
reger
ea633a794c including small junit test case for WordTokenizer 2014-11-29 22:13:24 +01:00
reger
5790c7242e do not tokenize punctuation as a word in WordTokenizer
remove unused variables in condenser related to Tokenizer
2014-11-29 17:16:05 +01:00
reger
f07392ff17 add. use host port parameter in YaCyApp 2014-11-29 15:27:16 +01:00
Michael Peter Christen
09d2867050 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-11-29 12:05:19 +01:00
Michael Peter Christen
ad0da5f246 added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
2014-11-29 11:56:32 +01:00
reger
aa0faeabc5 adjust translation text of error msg on empty query
(ru: needs correction)
2014-11-29 03:09:55 +01:00
reger
c475be2937 fix (enable) error msg on empty query 2014-11-28 22:44:33 +01:00
reger
ef5c5b4489 update to Jetty 9.2.4 2014-11-28 20:24:39 +01:00
reger
f709132961 remove obsolete alternate link
fix api link
2014-11-28 01:40:46 +01:00
Michael Peter Christen
5f5c7d69d1 added image screenshot generator 2014-11-28 01:25:52 +01:00
Michael Peter Christen
3c71e1c872 show vocabularies in search result (in case of debugging) 2014-11-28 01:19:31 +01:00
Michael Peter Christen
1d45d9405a security bugfix 2014-11-28 01:19:01 +01:00
Michael Peter Christen
ff728b4aa5 ignore url errors during search 2014-11-27 20:50:55 +01:00
Michael Peter Christen
c94c24638f disabled postprocessing by default. If you read this: please disable
postprocessing in your peer as well: open /IndexSchema_p.html, then
deselect field process_sxt
2014-11-27 12:13:20 +01:00
Michael Peter Christen
2fce2e2697 larger boost fields for ranking 2014-11-27 12:11:54 +01:00
Michael Peter Christen
6c03ff8355 bold words in snippets should not be coloured black in the base style
because there are styles with dark backgrounds which make the bold word
invisible
2014-11-27 08:08:05 +01:00
Michael Peter Christen
8317914ce3 changed vocabulary navigator object type to TreeMap to get a specific
order into the vocabularies. The order is now lexicographic, which is less
random than a hashed order
2014-11-27 07:44:41 +01:00
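
The ordering difference between a hashed map and a TreeMap can be sketched as below. The class name, method name and vocabulary names are made up for illustration; the actual navigator types in YaCy are not shown in this log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NavigatorOrderSketch {

    // Returns the vocabulary names in lexicographic (natural String) order
    // by copying the entries into a TreeMap.
    static List<String> orderedNames(Map<String, Integer> counts) {
        return new ArrayList<>(new TreeMap<>(counts).keySet());
    }

    public static void main(String[] args) {
        Map<String, Integer> hashed = new HashMap<>();
        hashed.put("topics", 12);
        hashed.put("authors", 3);
        hashed.put("places", 7);
        // HashMap iteration order depends on hash codes and is not specified
        System.out.println("hashed order:  " + hashed.keySet());
        // TreeMap iteration order follows the keys' natural ordering
        System.out.println("ordered names: " + orderedNames(hashed));
    }
}
```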
Michael Peter Christen
d5c1b07768 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-11-26 18:07:17 +01:00
Michael Peter Christen
c0f9f6ac66 added option to change the navbar-default, i.e. usable for dark skins 2014-11-26 18:01:35 +01:00
Michael Peter Christen
10794e8efd trying facet.method fc instead of fcs to handle large facets 2014-11-25 23:11:42 +01:00
Michael Peter Christen
041b605cfe Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-11-25 09:48:48 +01:00
Michael Peter Christen
f1f74e8626 toString fix 2014-11-24 20:53:40 +01:00
Michael Peter Christen
30276a2b48 prevent a local Solr search and a local RWI search from running
concurrently. When a RWI search result is flushed into the result set,
it does Solr Queries (which replaced the old-style Metadata Queries) and
they may run concurrently with a previously started Solr
search. Both methods may block each other with IO. To enhance the speed,
they are now serialized. Because the Solr search may produce
better results using the more advanced and configurable Ranking methods,
its result is preferred over the RWI search result. However, remote RWI
search results are still fed concurrently into the search result as
well.
2014-11-24 20:53:19 +01:00
Michael Peter Christen
84763126e0 added option to make the YaCy proxy act as if the cache is never stale. If
set to 'Always Fresh' the cache is always used if the entry in the cache
exists. This is a good way to archive web content and access it without
going online again in case the documents exist.
To do so, open /Settings_p.html?page=ProxyAccess and check the "Always
Fresh" checkbox.
This is set to false by default, which behaves as before.
If you set this to true, then you have your web archive in DATA/HTCACHE.
Copy this to carry around your private copy of the internet!
2014-11-24 20:28:52 +01:00
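
The cache decision described in the commit above can be sketched as a single predicate. This is a hypothetical reading of the message, not the proxy's real code: the class, method and parameter names are invented, and the freshness check itself is reduced to a boolean.

```java
public class AlwaysFreshSketch {

    // Decides whether a request is answered from the cache.
    //   entryExists  - a cache entry for the URL is present
    //   entryIsFresh - the normal staleness check considers it fresh
    //   alwaysFresh  - the "Always Fresh" option is switched on
    static boolean serveFromCache(boolean entryExists, boolean entryIsFresh, boolean alwaysFresh) {
        if (!entryExists) {
            return false; // nothing cached, the page must be loaded online
        }
        return alwaysFresh || entryIsFresh;
    }

    public static void main(String[] args) {
        // A stale entry is still served when "Always Fresh" is on
        System.out.println(serveFromCache(true, false, true));
        // With the option off, the same stale entry is reloaded
        System.out.println(serveFromCache(true, false, false));
    }
}
```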