yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	413eeefed4	added character set detection library from http://www-archive.mozilla.org/projects/intl/chardet.html	2014-12-10 13:08:29 +01:00
Michael Peter Christen	7bfc5b80cb	added new options to vocabulary editor: - new switch 'isFacet' which enables or disables the use of the vocabulary for search facets. This should be used for large vocabularies, since searches in solr are extremely slow if facets for a large set of alternative terms are generated - new option to disable auto-enrichment from synonyms - new option to add synonyms from another column when importing from csv - automatically recognize double-occurrences in synonyms and bundle terms for such synonyms	2014-12-10 12:20:27 +01:00
Michael Peter Christen	87b53b3572	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-12-09 16:20:44 +01:00
Michael Peter Christen	8df8ffbb6d	enhanced the snapshot functionality: - snapshots can now also be xml files which are extracted from the solr index and stored as individual xml files in the snapshot directory along with the pdf and jpg images - a transaction layer was placed above the snapshot directory to distinguish snapshots into 'inventory' and 'archive'. This may be used to do transactions of index fragments using archived solr search results between peers. This is currently unfinished, we need a protocol to move snapshots from inventory to archive - the SNAPSHOT directory was renamed to snapshot and now contains two snapshot subdirectories: inventory and archive - snapshots may now be generated by everyone, not only by peers running on a server with wkhtmltopdf installed. The expert crawl start provides the option for snapshots to everyone. PDF snapshots are now optional and the option is only shown if wkhtmltopdf is installed. - the snapshot api now provides the request for historised xml files, i.e. call: http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ The result of such xml files is identical to solr search results with only one hit. The pdf generation has been moved from the http loading process to the solr document storage process. This may slow down the process a lot and a different version of the process may be needed.	2014-12-09 16:20:34 +01:00
reger	5d67e165d9	remove redundant null check in ResponseHeader.lastModified added a JUnit testcase for ResponseHeader dates (using age()), adjusted age() to pass all tests	2014-12-09 00:58:08 +01:00
Michael Peter Christen	4111d42c81	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-12-08 12:40:12 +01:00
Michael Peter Christen	793ce6d13b	added confirmation dialogs for row deletion	2014-12-08 11:41:28 +01:00
Michael Peter Christen	cdc21d43b1	more robustness for broken table data in Table_API_p.html -- see bug report http://mantis.tokeek.de/view.php?id=495	2014-12-08 11:35:40 +01:00
reger	1d3ea35d69	prevent NPE on host link for too short HeuristicCfg.OpenSearchURL	2014-12-08 01:35:37 +01:00
Michael Peter Christen	a95af11050	enhancement for clearing the crawl queue	2014-12-07 23:43:38 +01:00
reger	5f0bb1214f	modified FieldReIndex to reindex queries with a low number of documents first by internally using a score map with the number of documents as score and working through the list from low to high.	2014-12-07 04:31:09 +01:00
reger	8055ed5b2a	update to commons-logging-1.2	2014-12-06 22:32:24 +01:00
reger	e52370728a	fix startup stop on missing HTCACHE/SNAPSHOT directory	2014-12-06 02:25:24 +01:00
reger	e5236aa7ca	Merge origin/master	2014-12-06 01:44:03 +01:00
reger	70cf7060a4	coding fixes suggested in http://mantis.tokeek.de/view.php?id=509 http://mantis.tokeek.de/view.php?id=510	2014-12-06 01:42:24 +01:00
Michael Peter Christen	d97deb5555	npe fix	2014-12-06 00:43:12 +01:00
Michael Peter Christen	4fe4bf29ad	added rss feed output to snapshot servlet which can be used to get a list of latest/oldest entries in the snapshot database. This is an example: http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100 The properties depth, order, host and maxcount can be omitted. The meanings of the fields are: host: select only urls from this host or all, if not given depth: select only urls at that crawl depth or all, if not given maxcount: select at most the given number of urls or 10, if not given order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to select the first entries or ANY to select any The rss feed needs administration rights to work, a call to this servlet with rss extension must attach login credentials.	2014-12-06 00:25:05 +01:00
Michael Peter Christen	8b522687e0	added toString() methods to feed classes which makes it possible to export full rss feed files out of the RSSFeed class	2014-12-06 00:18:14 +01:00
reger	568c991405	remove the unused Request variable (fix of prev. commit)	2014-12-05 03:03:28 +01:00
reger	d6539ba597	Merge origin/master	2014-12-05 01:15:41 +01:00
reger	ff18129def	ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to achieve this, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded and parsed anyway, so it's not an expensive operation) Request: remove 2 unused init parameters - number of anchors of the parent - forkfactor sum of anchors of all ancestors	2014-12-05 01:13:37 +01:00
Michael Peter Christen	a304058840	added Image Events as another option to generate images with a mac if no Ghostscript is available or does not work...	2014-12-04 01:21:24 +01:00
Michael Peter Christen	d83de9ecf5	added another path for the convert command because on older Macs ImageMagick has a different installation location	2014-12-03 18:07:05 +01:00
Michael Peter Christen	226aea5914	added a servlet which can create preview images, preview thumbnails and preview pdfs from web pages, i.e.: http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/ This also supports on-the-fly generation of the preview documents if the user is an administrator. Otherwise, the servlet fails. To enable this, you must add wkhtmltopdf, imagemagick and (on headless servers) xvfb to your operating system. For detailed instructions, see `97f6089a41`	2014-12-03 11:45:48 +01:00
reger	28456dfc09	skip creation of unused Bluelist contenttransformer	2014-12-02 21:03:00 +01:00
Michael Peter Christen	321840fde3	Replaced all fixed thread pools with cached thread pools. The cached thread pools will flush their cached (dead) threads after 60 seconds. As a result, YaCy now runs constantly with about 50 threads, about 100 at peak times. Previously, about 400 threads had been cached and kept in a hibernation state, so the numproc counter in /proc/user_beancounters (exists only in VM-hosted linux) was as high as the cached number of threads. This caused VM supervisors to terminate whole VM sessions if a limit was reached. Many VM providers have limits of numproc=96 which made it virtually impossible to run YaCy on such machines. With this change, it will be possible to run many YaCy instances even on VM hosts.	2014-12-02 16:26:07 +01:00
Michael Peter Christen	181911376c	showing list of all threads in threaddump using the ThreadMXBean counter (this obviously shows more threads than before?)	2014-12-02 16:21:06 +01:00
Michael Peter Christen	7bfab5eb9d	set Busy- and Blocking-Threads to daemon mode (they will now not prevent YaCy from termination if still running)	2014-12-02 16:05:00 +01:00
Michael Peter Christen	64887f6b21	show number of threads on status page	2014-12-02 16:04:11 +01:00
Michael Peter Christen	e586e423aa	in case that loading from the cache fails, load from wkhtmltopdf without cache using the user agent string given in the crawl profile	2014-12-02 13:35:19 +01:00
Michael Peter Christen	d5bac64421	recognize more html file types for snapshots	2014-12-02 12:52:36 +01:00
Michael Peter Christen	6f0167fac1	get cloned crawl start parameter for snapshots	2014-12-02 12:52:05 +01:00
Michael Peter Christen	a1ee101079	recognize more html file extensions	2014-12-02 12:10:44 +01:00
Michael Peter Christen	8480641f2d	fix to xvfb-run usage (quotes did not parse in xvfb-run, default values are appropriate)	2014-12-02 11:51:12 +01:00
Michael Peter Christen	68b040e31e	added fail-over missing http proxy service (i.e. overload) and quiet mode	2014-12-01 18:21:52 +01:00
Michael Peter Christen	25a64c51b3	moved snapshot generation out of the html handler so that existing cache entries do not prevent the handler from being executed	2014-12-01 17:37:25 +01:00
Michael Peter Christen	c35170a305	more logging	2014-12-01 16:50:37 +01:00
Michael Peter Christen	e8be07ec78	grr	2014-12-01 16:38:07 +01:00
Michael Peter Christen	6f81bb756c	wrap wkhtmltopdf with xvfb if necessary	2014-12-01 16:26:28 +01:00
Michael Peter Christen	0119f8665d	more logging when failing to create pdf snapshot	2014-12-01 16:00:45 +01:00
Michael Peter Christen	416fe886e3	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-12-01 15:20:24 +01:00
Michael Peter Christen	60f27bdf49	added the property timeoutrequests to configuration to disable TimeoutRequests. The purpose is to test if YaCy runs better on VMs where there is a limitation of concurrent processes; see /proc/user_beancounters in row numproc; this value is limited and should be low. Try to set timeoutrequests to keep this low. (works only after restart)	2014-12-01 15:20:10 +01:00
Michael Peter Christen	97f6089a41	YaCy can now create web page snapshots as pdf documents which can later be transcoded into jpg for image previews. To create such pdfs you must do: Add wkhtmltopdf and imagemagick to your OS, which you can do: On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from http://wkhtmltopdf.org/downloads.html and download http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip In Debian do "apt-get install wkhtmltopdf imagemagick" Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and "Always Fresh" - this is used by wkhtmltopdf to fetch web pages using the YaCy proxy. Using "Always Fresh" it is possible to get all pages from the proxy cache. Finally, you will see a new option when starting an expert web crawl. You can set a maximum depth for crawling which should cause a pdf generation. The resulting pdfs are then available in DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf	2014-12-01 15:03:09 +01:00
Michael Peter Christen	41d00350e4	moved network configuration to Use Case submenu; this is necessary because the definition of portal peers within the YaCy freeworld network is otherwise split into two different main menus.	2014-12-01 01:12:51 +01:00
reger	ff80700aff	replace deprecated Solr DateField.formatExternal with recommended TrieDateField.formatExternal	2014-12-01 00:21:30 +01:00
Michael Peter Christen	9ea120dbe5	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-11-30 22:02:25 +01:00
reger	aa7122f079	update to guava.18.0.jar and jsch.0.1.51.jar	2014-11-30 19:43:53 +01:00
reger	0c97cc2440	skip unused call parameter for hashSentence()	2014-11-30 19:42:33 +01:00
reger	221f86dd5e	position api icon (ViewFile.html)	2014-11-30 01:58:14 +01:00
reger	4c14a8b44d	update to poi-3.10.1.jar	2014-11-29 22:36:02 +01:00
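
Commit 321840fde3 above describes replacing fixed thread pools with cached ones whose idle threads are released after 60 seconds. The following is only a minimal sketch of that general technique using the standard java.util.concurrent classes. It is not YaCy's actual code: the class name CachedPoolSketch, the helper names, and the shortened keep-alive used for the demonstration are all assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of the YaCy code base.
class CachedPoolSketch {

    // Same construction as Executors.newCachedThreadPool(), but with a
    // configurable keep-alive (the JDK default for cached pools is 60 seconds).
    static ThreadPoolExecutor cachedPool(long keepAliveMillis) {
        return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                keepAliveMillis, TimeUnit.MILLISECONDS,
                new SynchronousQueue<Runnable>());
    }

    // Runs `tasks` blocking jobs at once, releases them, waits `idleMillis`,
    // and returns { pool size while busy, pool size after the idle period }.
    static int[] peakAndIdleSize(int tasks, long keepAliveMillis, long idleMillis) {
        ThreadPoolExecutor pool = cachedPool(keepAliveMillis);
        CountDownLatch started = new CountDownLatch(tasks);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // keep the worker busy until released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await();                 // all workers are now running
            int peak = pool.getPoolSize();
            release.countDown();             // let every task finish
            Thread.sleep(idleMillis);        // idle longer than the keep-alive
            int afterIdle = pool.getPoolSize();
            return new int[] { peak, afterIdle };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] sizes = peakAndIdleSize(8, 100, 1000);
        System.out.println("busy=" + sizes[0] + " afterIdle=" + sizes[1]);
    }
}
```

A fixed pool would keep its threads alive after the burst, which is the behaviour the commit message associates with a high numproc count. With a cached pool the thread count follows the actual load.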


11447 Commits
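
Commit 4fe4bf29ad above documents the snapshot.rss call and its optional properties depth, order, host and maxcount. As a rough sketch, a client could assemble such a request URL as follows. The class SnapshotRssUrl and its build method are hypothetical helpers, not part of YaCy. Only the base URL, the parameter names and the order values are taken from the commit message. The sketch does not attach the administrator credentials that the commit message says the rss call requires.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side helper; not part of the YaCy code base.
class SnapshotRssUrl {

    // Order values as listed in the commit message.
    enum Order { LATESTFIRST, OLDESTFIRST, ANY }

    // Builds the request URL; any argument except `base` may be null,
    // in which case the parameter is left out and the servlet default applies.
    static String build(String base, Integer depth, Order order, String host, Integer maxcount) {
        Map<String, String> params = new LinkedHashMap<>();
        if (depth != null) params.put("depth", depth.toString());
        if (order != null) params.put("order", order.name());
        if (host != null) params.put("host", host);
        if (maxcount != null) params.put("maxcount", maxcount.toString());

        StringBuilder url = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator).append(p.getKey()).append('=').append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // Percent-encodes a query value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8090/api/snapshot.rss",
                2, Order.LATESTFIRST, "yacy.net", 100));
    }
}
```

Passing null for every optional argument yields the bare servlet URL, which per the commit message selects all hosts and depths with at most 10 entries.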