yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
luccioman	aa9ddf3c23	Added control over Robots.txt active threads maximum number. When starting a crawl from a file containing thousands of links, configuration setting "crawler.MaxActiveThreads" is effective to prevent saturating the system with too many outgoing HTTP connections threads launched by the crawler. But robots.txt was not affected by this setting and was indefinitely increasing the number of concurrently loading threads until most of the connections timed out. To improve performance control, added a pool of threads for Robots.txt, consistently used in its ensureExist() and massCrawlCheck() methods. The Robots.txt threads pool max size can now be configured in the /PerformanceQueues_p.html page, or with the new "robots.txt.MaxActiveThreads" setting, initialized with the same default value as the crawler.	2016-11-23 18:13:05 +01:00
reger	08a0acc35d	make a YearNavigator available, usable as a SearchEvent.navigator plugin. It can take any Date field of the index and displays a list of year strings in reverse order by the year (not the score/count). To allow defining the index field to use, the fieldname (and title) can be appended to the navi's name "year", e.g. year:load_date_dt:LoadDate. It also works with the dates_in_content_dts field (from the graphical date navigator). Here the query parameters from: to: are used on selection as query modifiers (for other dates there is currently no query parameter available, so selection won't work to filter search results). Not included in the UI search page layout config so far (to experiment with it, a manual change to the conf is needed).	2016-11-21 16:52:53 +01:00
reger	bad8f87998	remove old/obsolete clear text "adminAccount" credential entry from init and setConfig (.,empty) from servlets/code	2016-11-20 00:20:47 +01:00
luccioman	7296e3884f	Switched even more URLs to pure relative ones. Thus a YaCy peer can run behind a reverse proxy subfolder without need for the reverse proxy to rewrite HTML links (a CPU costly operation). Tested on Debian Jessie with an apache2 reverse proxy. See related mantis issues http://mantis.tokeek.de/view.php?id=106 and http://mantis.tokeek.de/view.php?id=701	2016-11-09 02:40:33 +01:00
luccioman	84b81c1af0	Switched more URLs to relative ones when possible. This permits an easier and more flexible reverse proxy configuration. Some related mantis issues : http://mantis.tokeek.de/view.php?id=106 and http://mantis.tokeek.de/view.php?id=701	2016-11-08 03:05:51 +01:00
reger	af39a76bf6	Reduce number of default max. search navigator lines (from 10000) to 100 + make it configurable	2016-10-29 04:19:46 +02:00
luccioman	6e1959f469	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git Conflicts: htroot/yacysearchitem.java source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java source/net/yacy/search/schema/CollectionConfiguration.java source/net/yacy/server/serverObjects.java	2016-10-14 11:29:55 +02:00
JeremyRand	4963ecb0a0	Add preference (disabled by default) to show the ranking for each result on the HTML UI.	2016-10-04 11:49:16 +00:00
luccioman	b3b75b0498	Accessibility : add a customizable alternative text to YaCy logo. Applied W3C recommendations : https://www.w3.org/TR/html51/semantics-embedded-content.html#a-link-or-button-containing-nothing-but-an-image and https://www.w3.org/TR/html51/semantics-embedded-content.html#logos-insignia-flags-or-emblems	2016-09-22 16:08:33 +02:00
reger	35a7d57260	update lucenematchversion to current (5.2.0 -> 5.5.0) there should be no need for reindex by the update	2016-07-23 18:36:43 +02:00
Marc Nause	1f7013a1e3	removed unused properties in default config (CGI capabilities of YaCy's HTTPd have been removed many moons ago)	2016-07-21 21:36:00 +02:00
luccioman	893a40995a	Merge branch 'master' of https://github.com/yacy/yacy_search_server.git	2016-07-04 21:24:40 +02:00
Michael Peter Christen	634e48309b	another peer list update	2016-07-04 11:02:36 +02:00
Michael Peter Christen	16420e5507	added another principal peer	2016-07-03 22:50:50 +02:00
luccioman	6e96c7341a	Merge remote-tracking branch 'origin/master' Conflicts: htroot/Load_MediawikiWiki.java htroot/Load_PHPBB3.java htroot/ViewImage.java	2016-07-03 18:59:00 +02:00
JeremyRand	433217b33e	Properly support multiple Boost Queries. (Previous code was broken because it concatenated multiple Boost Queries together rather than passing Solr an array.)	2016-05-20 20:17:51 -05:00
reger	ef24593347	delete obsolete SEARCHRESULT busythread constants not used since 29.05.2013 18:27:27 `0c1a018bbd`	2016-05-04 01:30:10 +02:00
reger	8410536f75	keep svnRevision in .init for convert of .conf until release >1.83	2016-03-20 18:12:55 +01:00
reger	726ebee65a	include Version config string in yacy.init (replacing svnRevision)	2016-03-20 03:42:33 +01:00
Michael Peter Christen	f4591b1b51	Merge branch 'master' of git@github.com:yacy/yacy_search_server.git	2016-03-11 18:12:38 +01:00
Michael Peter Christen	1ce38fdaed	0n - added experimental zeronet network which supports intranet peers (still needs work)	2016-03-11 08:55:51 +01:00
Michael Peter Christen	d05ffa1c51	update to seed list	2016-03-11 07:20:38 +01:00
reger	16724c1283	remove unused proxyCookieWhiteList from yacy.init	2016-03-11 01:14:54 +01:00
luc	3cc5619d93	Improved HTML icons indexing and rendering in search results. See http://mantis.tokeek.de/view.php?id=629	2016-02-02 09:57:54 +01:00
Michael Peter Christen	5d635879f8	Merge pull request #40 from Scarfmonster/autocrawl Automatic crawling	2016-01-14 22:19:55 +01:00
Ryszard Goń	7d6e0d8470	Add missing settings to autocrawl settings page	2016-01-14 03:27:33 +01:00
Ryszard Goń	a98c395023	Add the Autocrawl thread	2016-01-14 00:50:23 +01:00
reger	4765e374e6	altered calc. of search result items per page to display taking the existing limits into account but make it consistent with search option screen for admin and public user changes: - configured default number of items per page (ConfigPortal_p.html) is used as is (no hardcoded limit) - otherwise requests are limited to 100 results per page ( = search option, index.html) (this basically is the major change, inc. limit from 20 to 100 for public user) P.S. - the older grant of more (1000), if no online snippet calculation, is kept (for the time being) see http://mantis.tokeek.de/view.php?id=627	2016-01-13 01:30:49 +01:00
Ryszard Goń	1728cd30c6	Create autocrawl profiles	2016-01-12 16:28:34 +01:00
reger	e8256bb3b1	remove blekko from opensearch config (not available) see https://blekko.com/ http://searchengineland.com/goodbye-blekko-search-engine-joins-ibms-watson-team-217633	2016-01-04 04:49:10 +01:00
reger	a5faf73afa	remove obsolete yacy.init entries interaction.* (related to removed triplestore)	2015-12-29 15:41:19 +01:00
sixcooler	dce1cb65c4	Merge remote-tracking branch 'choose_remote_name/master'	2015-12-28 23:20:42 +01:00
reger	e84d94f8ca	fix mime table for ms office / open office documents (causing wrong parser detect in intranet mode)	2015-12-22 17:48:24 +01:00
reger	15e46b2bad	exclude in/outboundlinksnofollowcount_i from default schema fields (not used in any function)	2015-12-19 21:25:08 +01:00
luc	8c4ab9c76b	Added an option to eventually limit size of remote solr documents put to local index. See mantis #626.	2015-12-16 02:20:03 +01:00
luc	55a4d15775	Added a note on deprecated default search field and operator.	2015-12-14 23:55:12 +01:00
reger	b2c8bc0ae6	remove md5_s from default index fields it is not assigned a value / not used Due to above also excluded from transfer protocol.	2015-11-27 02:41:02 +01:00
sixcooler	f5a9948860	do not store subfield *_coordinate	2015-11-10 20:32:42 +01:00
sixcooler	fca353e5eb	set startuptype of most solr handlers to lazy	2015-11-10 20:32:05 +01:00
reger	c720b4c249	remove override of dynamicField coordinate_p in solr schema (coordinate_p is not a mandatory field as such doesn't need to be declared as schema.field)	2015-10-24 22:44:28 +02:00
reger	f0b5bc93a3	remove obsolete yacy.init entry "secureHttps" not used anywhere	2015-10-19 03:47:28 +02:00
reger	5e45f1a460	enable Solr schema dynamicField _p (type=location) for YaCy coordinate_p field	2015-09-01 21:47:25 +02:00
sixcooler	87e4abe393	fight the fieldcache by using DocValues: in Solr-5.x the fieldcache has moved and was not cleared anymore. This results in a huge fieldcache. (http://lucene.apache.org/#highlights-of-the-lucene-release-include https://issues.apache.org/jira/browse/LUCENE-5666) Here I try to use DocValues where it is possible. For this I used the Api-Scheme as new basis for the Solr-Schema. This needs at least a complete optimization of the Solr-Index to get a smaller FieldCache. Everything that is indexed with these settings will not use the Fieldcache at all.	2015-08-31 20:24:41 +02:00
reger	250f6457f0	remove expired domain titan.deep-one.in from bootstrap.seedlist	2015-08-26 23:58:08 +02:00
Michael Peter Christen	df3314ac1a	added a new facet type based on a probabilistic classifier using bayesian filters. This can be used to classify documents during indexing-time using a pre-defined bayesian filter. New wordings: - a context is a class where different categories are possible. The context name is equal to a facet name. - a category is a facet type within a facet navigation. Each context must have several categories, at least one custom name (things you want to discover) and one with the exact name "negative". To use this, you must do: - for each context, you must create a directory within DATA/CLASSIFICATION with the name of the context (the facet name) - within each context directory, you must create text files with one document each per line for every category. One of these categories MUST have the name 'negative.txt'. Then, each new document is classified to match within one of the given categories for each context.	2015-08-10 14:27:44 +02:00
Michael Peter Christen	e1cd9c0dba	added another default network / commented out	2015-07-09 16:25:11 +02:00
reger	00d2062813	Remove deprecated AdminHandlers in solrconfig.xml to avoid warning log: W org.apache.solr.handler.admin.AdminHandlers <requestHandler name="/admin/" class="solr.admin.AdminHandlers" /> is deprecated. It is not required anymore	2015-07-01 00:58:23 +02:00
Michael Peter Christen	694b22f165	migration to Solr 5.2: huge benefits - this is a lot faster! This is a very complex migration: many classes had been renamed or removed, dependencies changed and the solr index type is now aligned to be a solr cloud repository. Together with the Solr 5.2 library update, one other dependent library had been updated as well: httpclient 4.4->4.4.1 Older indexes are migrated from 4_10 to 5_2. However, the new index structure is more efficient and we recommend to re-index everything. Please use the index export before you do the update to a large surrogate xml file. After the update, start with an empty index and then initialize this with your dump.	2015-06-24 01:55:51 +02:00
Michael Peter Christen	9c12555be5	added link to Snapshots in search results if the snapshot exists and option is set in ConfigSearchPage_p (this is a stub: we also need a visualization of pdf files!)	2015-06-07 20:37:37 +02:00
reger	6bc8a9b11e	make Quality of Service Servlet available to prioritize requests from local host This assigns priorities to incoming requests. Higher priority numbers are served before lower. (disabled by default in defaults/web.xml, uncomment or copy entry to DATA/Settings/web.xml)	2015-04-26 04:29:32 +02:00
Michael Peter Christen	b060ba900d	added parsing of contentprop attribute in html tags for content='startDate' and content='endDate'. The value of these field is now written to new solr fields startDates_dts and endDates_dts.	2015-04-13 16:20:00 +02:00
Michael Peter Christen	4cb4f67f38	added parsing of dd, dt and article html fields. The parsed result is written to special solr fields which are deactivated by default.	2015-04-12 22:02:45 +02:00
Michael Peter Christen	36e9cdb376	testing switching off cold searchers; maybe this brings performance enhancements when using large facets	2015-04-07 13:14:41 +02:00
Michael Peter Christen	535f1ebe3b	added a new way of content browsing in search results: - date navigation The date is taken from the CONTENT of the documents / web pages, NOT from a date submitted in the context of metadata (i.e. http header or html head form). This makes it possible to search for documents in the future, i.e. when documents contain event descriptions for future events. The date is written to an index field which is now enabled by default. All documents are scanned for contained date mentions. To visualize the dates for a specific search results, a histogram showing the number of documents for each day is displayed. To render these histograms the morris.js library is used. Morris.js requires also raphael.js which is now also integrated in YaCy. The histogram is now also displayed in the index browser by default. To select a specific range from a search result, the following modifiers had been introduced: from:<date> to:<date> These modifiers can be used separately (i.e. only 'from' or only 'to') to describe an open interval or combined to have a closed interval. Both dates are inclusive. To select a specific single date only, use the 'to:' - modifier. The histogram shows blue and green lines; the green lines denote weekend days (saturday and sunday). Clicking on bars in the histogram has the following reaction: 1st click: add a from:<date> modifier for the date of the bar 2nd click: add a to:<date> modifier for the date of the bar 3rd click: remove from and date modifier and set a on:<date> for the bar When the on:<date> modifier is used, the histogram shows an unlimited time period. This makes it possible to click again (4th click) which is then interpreted as a 1st click again (sets a from modifier). The display feature is NOT switched on by default; to switch it on use the /ConfigSearchPage_p.html servlet.	2015-03-02 04:30:10 +01:00
reger	ba276d3e64	add description_txt to default query fields, Dublin Core Metadata field extracted by most parsers.	2015-02-22 05:42:04 +01:00
reger	fe6f5a395d	fix Umlaut handling in blekko heuristic search term http://mantis.tokeek.de/view.php?id=169 observation: blekko seems to block xxxbot agents (=0 results)	2015-02-08 23:40:33 +01:00
Michael Peter Christen	97ba5ddbb7	configuration option for maxload limit for remote search	2015-02-04 01:12:25 +01:00
Michael Peter Christen	ac19690d30	refactoring with CommonPattern.COMMA	2015-01-29 01:35:28 +01:00
Michael Peter Christen	cf9b22ca5c	do not reindex based on vocabulary fields (there are meanwhile many of them) and some default settings	2015-01-29 01:22:28 +01:00
reger	24f68a4eb7	refactor opensearch heuristic introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors, which provide the query() functionality, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector. The manager now enforces a min 15s delay between calls to external systems. Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses an additional config file for fieldname translation. default heuristicopensearch.conf: - openbdb.com removed - seems to no longer deliver results - config via solrconnector to datacite.org added (large technical library archive)	2015-01-19 03:30:35 +01:00
reger	4eb89d7f15	revert clickservlet (default was indeed a mistake)	2015-01-05 09:10:20 +01:00
Michael Peter Christen	61ae9d2d11	do not use the clickservlet by default. From my personal view, this technique should not be used at all! This project is about privacy, the existence of a click servlet is one example why people should NOT use a search portal if such exists.	2015-01-05 08:21:51 +01:00
sixcooler	5594c43d2e	bump to Solr-/Lucene-4.10.3	2015-01-04 18:47:47 +01:00
reger	d44d8996d0	Added a “don't store remote search results” option This is intended for peers who want to participate in the P2P network but don't wish to load/fill-up their index with metadata of every received search result. The DHT transfer is not affected by this option (and will work as usual, so that a peer disabling the new store to index switch still receives and holds the metadata according to DHT rules). Downside for the local peer is that search speed will not improve if search terms are only avail. remote or by quick hits in local index. To be able to improve the local index a Click-Servlet option was added additionally. If switched on, all search result links point to this servlet, which forwards the users browser (by html header) to the desired page and feeds the page to the fulltext-index. The servlet accepts a parameter defining the action to perform (see defaults/web.xml, index, crawl, crawllinks) The option check-boxes are placed in ConfigPortal.html	2015-01-04 11:10:45 +01:00
reger	e177d69387	remove obsolete config footer option (ConfigPortal user.login) no footer or footer-option in use remove unused yacy.init item allowUnlimitedReceiveIndexFrom	2014-12-29 03:50:00 +01:00
reger	6a04563578	Init Jetty using setDefaultDescriptor (web.xml) to defaults/web.xml so web.xml in defaults dir is applied first and optional DATA/SETTINGS/web.xml loaded on top. By using this Jetty feature (default web.xml) we assure that changes to the default are applied to existing installations and individual addition/changes are still respected.	2014-12-27 00:10:14 +01:00
Michael Peter Christen	eb78388a98	changed prefer strategy for http unique in such a way that http is preferred over https. While this is a bad idea from the standpoint of security it is more commonly applicable for environments where http and https mix and for some domains https is not available. Then the double-check is possible even if no postprocessing is performed.	2014-12-21 19:17:06 +01:00
Michael Peter Christen	d14114697c	the miss cache does not seem to work, it sometimes contains urlhashes from documents which actually are inside the index. This can be reproduced using the crawl result table at http://localhost:8090/CrawlResults.html?process=5 The cache is temporarily disabled to remove the bad behaviour, however a later reactivation of that feature may be possible.	2014-12-21 17:31:51 +01:00
reger	446f374ba9	fix yacy.init comment http://mantis.tokeek.de/view.php?id=513	2014-12-14 19:12:18 +01:00
Michael Peter Christen	66b5a56976	Added and integrated new date detection class which can identify date notions within the fulltext of a document. This class attempts to identify also dates given abbreviated or with missing year or described with names for special days, like 'Halloween'. In case that a date has no year given, the current year and following years are considered. This process is therefore able to identify a large set of dates to a document, either because there are several dates given in the document or the date is ambiguous. Four new Solr fields are used to store the parsing result: dates_in_content_sxt: if date expressions can be found in the content, these dates are listed here in order of the appearances dates_in_content_count_i: the number of entries in dates_in_content_sxt date_in_content_min_dt: if dates_in_content_sxt is filled, this contains the oldest date from the list of available dates #date_in_content_max_dt: if dates_in_content_sxt is filled, this contains the youngest date from the list of available dates, that may also be possibly in the future These fields are deactivated by default because the evaluation of regular expressions to detect the date is yet too CPU intensive. Maybe future enhancements will cause that this is switched on by default. The purpose of these fields is the creation of calendar-like search facets, to be implemented next.	2014-12-14 13:40:45 +01:00
Michael Peter Christen	114f0afc1e	enable sku as anchor in html response writer	2014-12-14 04:02:13 +01:00
Michael Peter Christen	60f27bdf49	added the property timeoutrequests to configuration to disable TimeoutRequests. The purpose is to test if YaCy runs better on VMs where there is a limitation of concurrent processes; see /proc/user_beancounters in row numproc; this value is limited and should be low. Try to set timeoutrequests to keep this low. (works only after restart)	2014-12-01 15:20:10 +01:00
Michael Peter Christen	1d45d9405a	security bugfix	2014-11-28 01:19:01 +01:00
Michael Peter Christen	c94c24638f	disabled postprocessing by default. If you read this: please disable postprocessing in your peer as well: open /IndexSchema_p.html, then deselect field process_sxt	2014-11-27 12:13:20 +01:00
Michael Peter Christen	c0f9f6ac66	added option to change the navbar-default, i.e. usable for dark skins	2014-11-26 18:01:35 +01:00
Michael Peter Christen	84763126e0	added option to make the YaCy proxy act as if the cache is never stale. If set to 'Always Fresh' the cache is always used if the entry in the cache exists. This is a good way to archive web content and access it without going online again in case the documents exist. To do so, open /Settings_p.html?page=ProxyAccess and check the "Always Fresh" checkbox. This is set to false, which behaves as before. If you set this to true, then you have your web archive in DATA/HTCACHE. Copy this to carry around your private copy of the internet!	2014-11-24 20:28:52 +01:00
reger	ee277b9b3e	allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/) if file in DATA/SETTINGS it is loaded otherwise file in ./defaults is loaded (if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default) move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory	2014-11-23 05:22:23 +01:00
Michael Peter Christen	c67c5c0709	added new solr schema fields which record the occurrences of vocabulary matchings. These matches can be used for result boosting, i.e. if a document contains words from a specific vocabulary, boost it.	2014-11-18 15:02:34 +01:00
Michael Peter Christen	68e8039fd1	added high-precision scheduler for API processes. This allows also to make the execution in dependency of available RAM or CPU load. The default value for CPU load is 4.0 and the check runs once a minute.	2014-11-14 10:02:50 +01:00
sixcooler	725b206fb4	update to solr-/lucene-4.10.2	2014-11-07 18:51:31 +01:00
Michael Peter Christen	26279b0993	added debug code for statistics about document attributes related to domains	2014-10-29 10:50:08 +01:00
Michael Peter Christen	2e5214eb21	added field postprocessing.partialUpdate to settings which can be used to switch on or off partial updates. Both options should cause the same result. Default is on.	2014-10-17 14:17:49 +02:00
Michael Peter Christen	b1cfbc4a04	added new solr field url_paths_count_i which can be used to enhance the index browser and maybe also for ranking; possibly also for SEO-with-YaCy applications.	2014-10-13 23:51:19 +02:00
Michael Peter Christen	8c1a89cb34	added another decoration flag to switch off network graphics in crawler monitor and index browser: decoration.grafics.linkstructure Please set this to false to remove the graphics from the interface.	2014-10-08 17:12:35 +02:00
Michael Peter Christen	bc221a0f9c	less load and more ram prerequisite for crawl steps	2014-10-08 14:27:38 +02:00
Michael Peter Christen	2a052f446a	Added an experimental audio feedback system. This is the first element of a new 'decoration' component which may hold switches for different external appearance parameters. The first switch in that context is decoration.audio (as usual in yacy.init). This value is set to false by default, that means the audio feedback element is switched off by default. To switch it on, set decoration.audio = true (using /ConfigProperties_p.html). You will then hear sounds for the following events: - remote searches - incoming dht transmissions - new documents from the crawler Sound clips are stored in htroot/env/soundclips/ which is done so because a future implementation will read these files using the http client and with configurable urls which will make it very easy for the user to replace the given sounds with own sounds.	2014-10-07 17:51:07 +02:00
Michael Peter Christen	f03dd0df24	updated seedlist	2014-09-16 15:49:03 +02:00
Michael Peter Christen	2b1cf26828	removed solr warning during startup	2014-09-02 13:25:30 +02:00
Michael Peter Christen	57ce7eeff3	fixed localhost authorization and replaced the adminRealm with an info string which is visible in the browser. That makes it possible that the browser instructs the user how to change a forgotten admin password (during runtime).	2014-09-02 13:15:19 +02:00
orbiter	f318d7c285	enhanced date-ordered ranking	2014-09-01 13:01:30 +02:00
orbiter	b3ebd38079	removed the HTDOCS repository concept because the concept to host files on the YaCy http server is obsolete; YaCy can index file:// and smb:// paths	2014-08-26 19:02:53 +02:00
reger	ec5b1d9e33	let NETWORK_WHITELIST take precedence over NETWORK_BLACKLIST this makes it easier to config exception (for private networks), like blacklist= .* whitelist= 10\..,127\.. ..... allows only listed ip pattern	2014-08-26 01:02:38 +02:00
orbiter	2371d6b8db	target linktexts must be string to enable search facets on these fields	2014-08-01 13:20:25 +02:00
orbiter	161a11070c	yacystats is gone :(	2014-07-29 11:12:01 +02:00
reger	7328c2883b	fix typo in .init description http://mantis.tokeek.de/view.php?id=430	2014-07-26 00:38:53 +02:00
reger	94819f0797	set .ini default boost fields to same as assigned by button "reset to default" (in RankingSolr_p) - fix typo http://mantis.tokeek.de/view.php?id=430	2014-07-26 00:17:41 +02:00
reger	a2cb366b25	Combine /heuristic search modifier with opensearch configured targets - with search modifier /heuristic a request is sent to all configured opensearch target systems (old /heuristic/blekko modifier no longer valid) - this allows to use opensearch heuristic on individual search request (in contrast to configuration HEURISTIC_OPENSEARCH=true which sends an osd request on all global searches) - the index.html searchoption text adjusted to be displayed only if option configured - add Archive-It to predefined systems	2014-07-20 00:00:43 +02:00
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead.	2014-07-18 12:43:01 +02:00
Michael Peter Christen	1092e798a5	fixed double content postprocessing	2014-07-07 19:15:11 +02:00
Michael Peter Christen	09dcdb9b19	update to solr 4.9.0	2014-07-01 16:39:00 +02:00
orbiter	0bbb5040b8	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-06-15 12:38:52 +02:00
orbiter	9d5d86cd03	Added filter query options to the ranking servlet /RankingSolr_p.html. Filter queries are not actually related to ranking, but user requests have pointed out that specific boost queries to move results to the end of the result list are not sufficient. Such boost filters may be better executed as actual filter and therefore such a filter can now be statically applied to every search request. A typical use could be the expression "http_unique_b:true AND www_unique_b:true" which uses the recently introduced fields http_unique_b and www_unique_b which are true only for one of the alternatives with/without http(s) and with/without prefix 'www.' in host names.	2014-06-15 12:38:30 +02:00
Michael Peter Christen	d2151857f1	Added collection navigation: The collection field (can be filled i.e. in Crawl Start) can be used to add categories to YaCy index entries. The usage of that field was restricted to solr searches and post argument filters as implemented in commit `f7571386a3`. This commit extends collections to a full navigation option in the standard YaCy search interface. The field is not active by default but can be activated easily in the /ConfigSearchPage_p.html servlet (just check the 'Collection' facet field). Collections can now be used for (at least) two purposes: - to provide search tenants (through post argument collection) - to provide self-made category navigation Search requests may now have (independently from switched on or off collection facet) a "collection:<collection-name>" modifier attached; furthermore collection names may use disjunctions using the '|' pipe symbol. For example, this is a valid search request: www collection:user|proxy	2014-06-15 12:11:23 +02:00
Michael Peter Christen	922979aae1	added option to prefer http over https in unique-protocol ranking	2014-06-02 17:40:56 +02:00
reger	d8d318233e	fix logging settings - add missing .level - remove obsolete jena settings - set default level=INFO to prevent debug logging of not explicitly specified classes	2014-06-01 06:43:50 +02:00
Michael Peter Christen	698f053658	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-06-01 01:02:12 +02:00
Michael Peter Christen	f23c4142e0	added option to configure a custom user agent within allip networks	2014-06-01 01:02:03 +02:00
orbiter	d7d38f9135	made number of open files in crawler configurable and increased default maximum number of open files from 100 to 1000. This number can be changed with the attribute crawler.onDemandLimit	2014-05-31 09:29:55 +02:00
Michael Peter Christen	ff5b3ac84d	added new fields http_unique_b and www_unique_b which can be used for ranking to prefer urls containing a www subdomain or using the https protocol	2014-05-27 15:28:28 +02:00
Michael Peter Christen	f0db501630	better handling of ranking parameters and new default values for date navigation which is done using ranking in solr.	2014-05-22 03:01:07 +02:00
Michael Peter Christen	d4157184ec	migration to Solr 4.8.1 This includes also an update to zookeeper 3.4.6 and a new library that Solr initializes by default: org.restlet from http://restlet.com/download/current#release=stable&edition=jse&distribution=zip which is included in version 2.2.1 from may 6th 2014	2014-05-21 11:48:08 +02:00
orbiter	2944822bb0	updated bootstrap seed list	2014-05-20 13:27:40 +02:00
reger	e31493e139	"Use remote proxy for yacy" has no function, remove option and related config item see/fix bug http://mantis.tokeek.de/view.php?id=23 http://mantis.tokeek.de/view.php?id=189	2014-05-17 23:36:59 +02:00
reger	f02203fb2f	fix xml validation error on defaults/web.xml	2014-05-11 04:39:59 +02:00
Michael Peter Christen	229f2248b8	added configuration option for maximum load and minimum ram for postprocessing	2014-04-30 13:26:32 +02:00
Michael Peter Christen	3d5e354471	small changes to search headline colour	2014-04-29 18:46:50 +02:00
Michael Peter Christen	71efc76170	new default skin pdbootstrap which keeps the design shapes but slightly changes the colours to match with bootstrap colours	2014-04-29 16:23:42 +02:00
reger	d812f80784	add exit proxy link to UrlProxy on proxied pages a link to exit proxy is added to top of page. Link text can be configured in web.xml init-parameter (see default/web.xml). If missing no link is displayed.	2014-04-26 22:27:59 +02:00
reger	2dabe2009d	- remove unused manual http KeepAlive config (reducing references to obsolete httpdemon) - add port info to settings_http	2014-04-18 19:57:35 +02:00
Michael Peter Christen	7a2f3e2353	increased resource.disk.used.max.steadystate and resource.disk.used.max.overshot by 4 times because first users reached that limit and wondered why the crawler was paused automatically :) The crawler will now stop at 2TB disk usage :)	2014-04-17 16:19:38 +02:00
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	2014-04-16 22:16:20 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such on-demand file reader shall prevent that the number of file pointers is over the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
reger	46016fa153	autoupdate fails to download latest release (1.71) due to default release blacklist - removed the default version blacklist regex from init (for future versions) !!! left existing update blacklist setting untouched !!! (existing installation wanting autoupdate for 1.71 need to change blacklist in ConfigUpdate_p.html) - moved old blacklist patch to migration.java	2014-04-13 07:32:32 +02:00
Michael Peter Christen	ebd44a7080	replaced solr 4.6.1 with solr 4.7.1 and added index migration to lucene_47	2014-04-06 10:45:03 +02:00
Michael Peter Christen	ee92d748b5	test using compound file format, see UseCompoundFile in https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig This appears to be necessary as many times a java.io.FileNotFoundException: (Too many open files) appears. See also: https://issues.apache.org/jira/browse/SOLR-4 and desperate users at http://stackoverflow.com/questions/3828343/too-many-open-file-exception-while-indexin-using-solr We cannot force users to do a "ulimit -n 1000000", so this action seems to be required.	2014-04-06 00:35:35 +02:00
Michael Peter Christen	0a95fd27f3	update of seed list	2014-04-04 17:04:49 +02:00
Michael Peter Christen	cca851a417	introduced new solr field crawldepth_i which records the crawl depth of a document. This is the upper limit for the clickdepth_i value which may be shorter in case that the crawler did not take the shortest path to the document.	2014-04-02 23:37:01 +02:00
Michael Peter Christen	39b641d6cd	added tutorial mode - some menu items will only appear if you 'qualify' for them. Thus, the first-time user will only see four menu items. The other items will unfold as the user interacts.	2014-04-02 02:33:17 +02:00
reger	b12200cafe	alternative UrlProxyServlet (for /proxy.html) using different url rewrite rules - use JSoup parser for selective rewrite of html body <a href= links only, instead of regex which rewrites also header href/src links - this improves display of pages which use header <base> tag - tags with src attribute are taken from original location (like css) improving display and are not routed through the indexer Disadvantage: scripting links will drop out of proxy Setting of the servlet through web.xml exclusively (in case one would like to quickly switch back to the YaCyProxyServlet, leaving the existing code of YaCyProxyServlet untouched available)	2014-03-30 04:04:02 +02:00
Michael Peter Christen	e515dd460d	added linkscount_i and linksnofollowcount_i to the default solr schema	2014-03-27 23:36:08 +01:00
Michael Peter Christen	a7bc130e27	removed performance settings - they are incomplete and buggy - it was not easy to explain - it did not comply with a KISS strategy - setting a performance of low priority actually caused crashing of a peer - there was nobody who would maintain that functionality	2014-03-27 22:00:57 +01:00
Michael Peter Christen	a28fefba2d	activated language facet by default	2014-03-23 11:39:46 +01:00
Michael Peter Christen	617dd9c97b	- added new input field in index.html - changed progress bar in yacysearch.html - moved pagination navigation to page bottom - moved search term input field to headline	2014-03-21 02:42:09 +01:00
orbiter	7d24bcb98d	added flag to require authorization for all web pages, even those without a "_p" extension. (default off)	2014-03-20 19:09:47 +01:00
reger	1fe26550a0	remove AugmentedBrowsing_p.html augmented browsing switch (has no function in code, previously used in conjunction with http://reflect.ws)	2014-03-19 22:40:35 +01:00
reger	e972b87a8a	remove AugmentedBrowsingFilters_p.html as none of the settings are used currently config settings from the page also removed from yacy.init augmentation.reflect augmentation.addDoctype augmentation.reparse interaction.overlayinteraction.enabled	2014-03-17 20:27:04 +01:00
reger	a373fb717d	remove more unused from legacy server.http - triggerOnlineAction not used - useTemplateCache not used	2014-03-14 03:12:04 +01:00
orbiter	f77afa9d1d	add index on _val fields, this affects especially title length. An index on a field makes search facets on that field possible	2014-03-04 11:24:04 +01:00
Michael Peter Christen	de8f7994ab	as crawling has a low-cpu demand, we want it to run even if the CPU load is VERY high. This applies also if the CPU load is high because of in-cache crawling; in that case we want to experience a high-CPU load as much as possible	2014-02-25 14:17:33 +01:00
Michael Peter Christen	9eb668e951	enhanced the resource observer The resource observer is now able to recognize free disk space AND available space for YaCy. The amount of space which is assigned for YaCy are defined in new settings in the configuration file. Furthermore, there is now a cleanup process which deletes files in case that an autodelete is activated. The autodelete is now BY DEFAULT ON if the disk space is low, which means that YaCy starts to delete documents when the disk is full!	2014-02-12 01:00:44 +01:00
Michael Peter Christen	ca8b100f96	run the cleanup process even when load is high, do postprocessing even if load > 1 (but < 2) but only if there is enough memory (now: 0.5 GB RAM available). The memory amount of the postprocessing is the cause that systems block because they run into a frequent-GC chain which almost locks the peer. If running with enough memory, the postprocessing is fast and not damaging to the system. Because the required RAM of 0.5 GB is never available in default setting, the postprocessing will not run if the peer is not reconfigured to use more memory.	2014-02-10 12:59:30 +01:00
Michael Peter Christen	6e59ca4ebf	removed jena library and all code that depended on jena. When jena was introduced, it was also used for search facets. The generic search facets are now deduced from generic solr fields which makes jena as tool for facet semantics superfluous.	2014-02-07 01:20:06 +01:00
Michael Peter Christen	931541d198	re-inserted default value re-set button to performance queues and patched missing values for recent new queues	2014-02-06 22:39:19 +01:00
Michael Peter Christen	4b7f2fcf38	updated bootstrap seed list	2014-01-27 13:55:06 +01:00
reger	a71718a459	add config value for ssl/https port (default=8443) adjust server routines to use config	2014-01-27 01:09:56 +01:00
reger	cf553e5045	added hint to web.xml and for completeness the full set of hardcoded mappings	2014-01-23 23:56:45 +01:00
Michael Peter Christen	a8fdaace31	changed the web.xml as well to migrate the solr servlet	2014-01-23 18:41:45 +01:00
Michael Peter Christen	be5e808236	- removed hardcoded load-test which is now handled in BusyQueues steering, see /PerformanceQueues_p.html - changed default values for crawler queue load limit (high, because these jobs are started upon user request)	2014-01-21 17:48:45 +01:00
sixcooler	40a4030b55	configurable max-load values for YaCy-Threads: try lower values on small systems like a Pi	2014-01-21 17:04:22 +01:00
Michael Peter Christen	77531850b5	reverted crawling strategy from latest commit.	2014-01-21 16:05:55 +01:00

708 Commits