yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	3d5e354471	small changes to search headline colour	2014-04-29 18:46:50 +02:00
Michael Peter Christen	d79d7dde55	fix for result display	2014-04-29 16:24:21 +02:00
Michael Peter Christen	362c988c05	design fixes to better use the new colours	2014-04-29 16:24:01 +02:00
Michael Peter Christen	71efc76170	new default skin pdbootstrap which keeps the design shapes but slightly changes the colours to match with bootstrap colours	2014-04-29 16:23:42 +02:00
Michael Peter Christen	bbadccbd8d	better buttons	2014-04-29 16:22:31 +02:00
reger	2eb7682772	add html5 audio/video <source> tag to html content scraper - <source src=.. type=..> tag content is added to embed collection	2014-04-29 00:41:29 +02:00
Michael Peter Christen	a9963d5c95	bootstrap update	2014-04-28 11:52:13 +02:00
Michael Peter Christen	b2bbb9a0b5	Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1	2014-04-28 09:17:21 +02:00
reger	0b6db04e40	fix contentscraper img height/width parsing: prevent NumberFormatException on common "100px" property - include in test case	2014-04-28 04:59:47 +02:00
malykhin.dmitry	37424b0c42	Update russian translation	2014-04-28 01:54:34 +04:00
reger	4e57000a40	remove redundant javascript & id in index.html to set focus to query field in IE11	2014-04-27 22:22:00 +02:00
reger	ffc5b75c73	optimize and fix lat / lon assignment	2014-04-27 20:52:06 +02:00
reger	9313447de2	reimplement tighter lat/lon calc in URIMetadataNode from old MetadataRow, considering http://mantis.tokeek.de/view.php?id=272	2014-04-27 18:20:33 +02:00
reger	d812f80784	add exit proxy link to UrlProxy - on proxied pages a link to exit the proxy is added to the top of the page. Link text can be configured in web.xml init-parameter (see default/web.xml). If missing, no link is displayed.	2014-04-26 22:27:59 +02:00
reger	78d08998db	throw MalformedURLException on unknown protocol, i.e. anything other than the supported http, https, ftp, file, smb, \\, mailto	2014-04-26 01:30:51 +02:00
reger	bb8181b2be	fix: resolve url without path but with searchpart - e.g. http://yacy.net?q=test was resolved as host "yacy.net?q=test", now host="yacy.net" path="/" - fixes http://mantis.tokeek.de/view.php?id=47 - added test case for getHost	2014-04-25 20:15:55 +02:00
orbiter	a3542f29b4	npe fix	2014-04-25 09:26:20 +02:00
orbiter	c48d2a2a02	npe fix	2014-04-25 09:23:10 +02:00
reger	121d25be38	recover from sax fatal error on OAI-PMH import of xml with entity error - this allows loading to continue with the next resumptionToken even if the import file caused a sax parser error - fixes http://mantis.tokeek.de/view.php?id=63	2014-04-25 01:05:28 +02:00
reger	81dc2aa536	add current css to HTMLResponseWriter to fix metadata view (using css from metas.template except js links)	2014-04-23 23:41:10 +02:00
orbiter	2fd8a0ead6	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-04-23 23:13:23 +02:00
orbiter	8e5ce7cd51	fixed a situation where finished crawls had not been detected.	2014-04-23 23:13:07 +02:00
orbiter	c6f0bd05f8	better removal of stored urls when doing a crawl start	2014-04-23 23:12:08 +02:00
orbiter	2f63bd0261	enhanced Host Balancer strategy: fair round robin	2014-04-23 23:11:37 +02:00
orbiter	0c88a32c36	do not apply lazy value instantiation for numeric or boolean values because that is misleading and confusing in case of 0- or false-values and may cause NPEs in retrieval functions.	2014-04-23 08:41:36 +02:00
orbiter	8e04030596	in case of short memory, do not cut down robinson peers to 1, just reduce by 50%	2014-04-23 08:37:19 +02:00
reger	86f6975edc	exclude html tags in in/outboundlinks_anchortext_txt parsed text - some outboundlinks_anchortext_txt in index contain e.g. <span>text</span> or more tags, remove all tags for text property (inline img tags are still parsed) - added test case for above (to htmlParserTest) - fix solr test case	2014-04-23 00:55:16 +02:00
orbiter	469e0a62f1	added new button to terminate all crawls	2014-04-22 23:14:54 +02:00
orbiter	ccb1864d55	catch IllegalArgumentException for wrong process types (that is needed for migrations when new process types are introduced or disappear)	2014-04-22 23:14:05 +02:00
orbiter	4ee4ba1576	fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of lazy value instantiation of 0-value in crawldepth_i	2014-04-22 19:48:49 +02:00
orbiter	12ba890205	removed warnings	2014-04-22 19:35:15 +02:00
reger	d51f9cc863	add custom Jetty errorhandler to provide custom error page footer line - remove redundant mime check in UrlProxyServlet	2014-04-21 17:28:21 +02:00
reger	c193a02023	defer creation of new ArrayList after possible early return (to skip not used object allocation)	2014-04-21 17:16:06 +02:00
reger	727dfb5875	refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex	2014-04-20 01:41:30 +02:00
reger	79e7947442	- remove empty http0_9 status text array and unused default_charset = ISO-8859-1	2014-04-18 22:03:16 +02:00
reger	2dabe2009d	- remove unused manual http KeepAlive config (reducing references to obsolete httpdemon) - add port info to settings_http	2014-04-18 19:57:35 +02:00
Michael Peter Christen	5746aae3db	add canonical links to the same crawldepth, not the next crawldepth	2014-04-18 06:51:46 +02:00
Michael Peter Christen	74ab5ef9fa	increased runtime for postprocessing query job	2014-04-18 06:51:10 +02:00
Michael Peter Christen	8b32dd5f9e	special strategy for balancer: do not remove targets with zero wait time from the queue	2014-04-18 06:50:07 +02:00
Michael Peter Christen	9c6228d948	fix for deadlocks in crawler	2014-04-17 16:58:17 +02:00
Michael Peter Christen	7a2f3e2353	increased resource.disk.used.max.steadystate and resource.disk.used.max.overshot by 4 times because first users reached that limit and wondered why the crawler was paused automatically :) The crawler will now stop at 2TB disk usage :)	2014-04-17 16:19:38 +02:00
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	2014-04-17 13:21:43 +02:00
Michael Peter Christen	7fefebaeca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-04-17 12:55:38 +02:00
Michael Peter Christen	c2f62e783f	- better subgraph handling, less overhead for crawls without the webgraph - usage of crawler crawldepth cache for the linkgraph target depth computation	2014-04-17 12:54:18 +02:00
Michael Peter Christen	06afb568e2	new Strategies in Balancer: - doublecheck cache now records the crawl depth as well - doublecheck cache is available from the outside (made static) - no more need to crawl hosts with lowest depth first, instead all hosts which have only singleton entries are preferred to reduce the number of files.	2014-04-17 12:52:54 +02:00
Michael Peter Christen	1aea01fe5b	fix for Table in case that requested file does not exist and paths also do not exist	2014-04-17 12:44:05 +02:00
reger	710054bb37	implement gzip input handling directly in defaultservlet (making reference to legacy httpdemon obsolete)	2014-04-17 03:20:29 +02:00
Michael Peter Christen	b4b0d14c04	fix for display bug	2014-04-16 22:24:04 +02:00
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	2014-04-16 22:16:20 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates each crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
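
The queue layout described in commit da86f150ab (one queue set per host, one queue per crawl depth, urls always taken at minimal crawl depth) can be illustrated with a small in-memory sketch. This is not YaCy's HostBalancer or HostQueues code: the class name HostQueuesSketch, the methods push and pop, and the "host:port" key format are hypothetical names chosen for illustration, and the real implementation stores its queues in files under the QUEUES folder rather than in memory. Ties between hosts at the same depth are broken by rotating hosts, which is only one possible reading of "fair to all hosts".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not the YaCy HostBalancer implementation. */
public class HostQueuesSketch {

    // host key (assumed form "host:port") -> crawl depth -> FIFO queue of urls
    private final Map<String, TreeMap<Integer, Deque<String>>> hosts = new LinkedHashMap<>();

    /** Enqueue a url in the queue for its host and crawl depth. */
    public void push(String hostKey, int depth, String url) {
        hosts.computeIfAbsent(hostKey, h -> new TreeMap<>())
             .computeIfAbsent(depth, d -> new ArrayDeque<>())
             .addLast(url);
    }

    /** Take the next url at the minimal crawl depth, or null if nothing is queued. */
    public String pop() {
        String bestHost = null;
        int bestDepth = Integer.MAX_VALUE;
        // Find the first host (in rotation order) whose shallowest queue is minimal.
        for (Map.Entry<String, TreeMap<Integer, Deque<String>>> e : hosts.entrySet()) {
            int depth = e.getValue().firstKey(); // maps in 'hosts' are never empty
            if (depth < bestDepth) {
                bestDepth = depth;
                bestHost = e.getKey();
            }
        }
        if (bestHost == null) return null;

        // Remove the host so it can be re-inserted at the end of the rotation.
        TreeMap<Integer, Deque<String>> depths = hosts.remove(bestHost);
        Deque<String> queue = depths.get(bestDepth);
        String url = queue.pollFirst();
        if (queue.isEmpty()) depths.remove(bestDepth);
        if (!depths.isEmpty()) hosts.put(bestHost, depths);
        return url;
    }
}
```

With two hosts that each hold one url at depth 0 and one at depth 1, pop returns both depth-0 urls before either depth-1 url, alternating between the hosts.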

10823 Commits