yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
orbiter	0bbb5040b8	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-06-15 12:38:52 +02:00
orbiter	9d5d86cd03	Added filter query options to the ranking servlet /RankingSolr_p.html. Filter queries are not actually related to ranking, but user requests have pointed out that specific boost queries to move results to the end of the result list are not sufficient. Such boost filters may be better executed as actual filter and therefore such a filter can now be statically applied to every search request. A typical use could be the expression "http_unique_b:true AND www_unique_b:true" which uses the recently introduced fields http_unique_b and www_unique_b which are true only for one of the alternatives with/without http(s) and with/without prefix 'www.' in host names.	2014-06-15 12:38:30 +02:00
Michael Peter Christen	d2151857f1	Added collection navigation: The collection field (can be filled e.g. in Crawl Start) can be used to add categories to YaCy index entries. The usage of that field was restricted to solr searches and post argument filters as implemented in commit `f7571386a3`. This commit extends collections to a full navigation option in the standard YaCy search interface. The field is not active by default but can be activated easily in the /ConfigSearchPage_p.html servlet (just check the 'Collection' facet field). Collections can now be used for (at least) two purposes: - to provide search tenants (through post argument collection) - to provide self-made category navigation Search requests may now have (independently of whether the collection facet is switched on or off) a "collection:<collection-name>" modifier attached; furthermore, collection names may use disjunctions using the '|' pipe symbol. For example, this is a valid search request: www collection:user|proxy	2014-06-15 12:11:23 +02:00
Michael Peter Christen	922979aae1	added option to prefer http over https in unique-protocol ranking	2014-06-02 17:40:56 +02:00
reger	d8d318233e	fix logging settings - add missing .level - remove obsolete jena settings - set default level=INFO to prevent debug logging of not explicitly specified classes	2014-06-01 06:43:50 +02:00
Michael Peter Christen	698f053658	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-06-01 01:02:12 +02:00
Michael Peter Christen	f23c4142e0	added option to configure a custom user agent within allip networks	2014-06-01 01:02:03 +02:00
orbiter	d7d38f9135	made number of open files in crawler configurable and increased default maximum number of open files from 100 to 1000. This number can be changed with the attribute crawler.onDemandLimit	2014-05-31 09:29:55 +02:00
Michael Peter Christen	ff5b3ac84d	added new fields http_unique_b and www_unique_b which can be used for ranking to prefer urls containing a www subdomain or using the https protocol	2014-05-27 15:28:28 +02:00
Michael Peter Christen	f0db501630	better handling of ranking parameters and new default values for date navigation which is done using ranking in solr.	2014-05-22 03:01:07 +02:00
Michael Peter Christen	d4157184ec	migration to Solr 4.8.1 This includes also an update to zookeeper 3.4.6 and a new library that Solr initializes by default: org.restlet from http://restlet.com/download/current#release=stable&edition=jse&distribution=zip which is included in version 2.2.1 from may 6th 2014	2014-05-21 11:48:08 +02:00
orbiter	2944822bb0	updated bootstrap seed list	2014-05-20 13:27:40 +02:00
reger	e31493e139	"Use remote proxy for yacy" has no function, remove option and related config item see/fix bug http://mantis.tokeek.de/view.php?id=23 http://mantis.tokeek.de/view.php?id=189	2014-05-17 23:36:59 +02:00
reger	f02203fb2f	fix xml validation error on defaults/web.xml	2014-05-11 04:39:59 +02:00
Michael Peter Christen	229f2248b8	added configuration option for maximum load and minimum ram for postprocessing	2014-04-30 13:26:32 +02:00
Michael Peter Christen	3d5e354471	small changes to search headline colour	2014-04-29 18:46:50 +02:00
Michael Peter Christen	71efc76170	new default skin pdbootstrap which keeps the design shapes but slightly changes the colours to match with bootstrap colours	2014-04-29 16:23:42 +02:00
reger	d812f80784	add exit proxy link to UrlProxy on proxied pages a link to exit proxy is added to top of page. Link text can be configured in web.xml init-parameter (see default/web.xml). If missing no link is displayed.	2014-04-26 22:27:59 +02:00
reger	2dabe2009d	- remove unused manual http KeepAlive config (reducing references to obsolete httpdemon) - add port info to settings_http	2014-04-18 19:57:35 +02:00
Michael Peter Christen	7a2f3e2353	increased resource.disk.used.max.steadystate and resource.disk.used.max.overshot by 4 times because first users reached that limit and wondered why the crawler was paused automatically :) The crawler will now stop at 2TB disk usage :)	2014-04-17 16:19:38 +02:00
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	2014-04-16 22:16:20 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
reger	46016fa153	autoupdate fails to download latest release (1.71) due to default release blacklist - removed the default version blacklist regex from init (for future versions) !!! left existing update blacklist setting untouched !!! (existing installation wanting autoupdate for 1.71 need to change blacklist in ConfigUpdate_p.html) - moved old blacklist patch to migration.java	2014-04-13 07:32:32 +02:00
Michael Peter Christen	ebd44a7080	replaced solr 4.6.1 with solr 4.7.1 and added index migration to lucene_47	2014-04-06 10:45:03 +02:00
Michael Peter Christen	ee92d748b5	test using compound file format, see UseCompoundFile in https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig This appears to be necessary as many times a java.io.FileNotFoundException: (Too many open files) appears. See also: https://issues.apache.org/jira/browse/SOLR-4 and desperate users at http://stackoverflow.com/questions/3828343/too-many-open-file-exception-while-indexin-using-solr We cannot force users to do a "ulimit -n 1000000", so this action seems to be required.	2014-04-06 00:35:35 +02:00
Michael Peter Christen	0a95fd27f3	update of seed list	2014-04-04 17:04:49 +02:00
Michael Peter Christen	cca851a417	introduced new solr field crawldepth_i which records the crawl depth of a document. This is the upper limit for the clickdepth_i value which may be shorter in case that the crawler did not take the shortest path to the document.	2014-04-02 23:37:01 +02:00
Michael Peter Christen	39b641d6cd	added tutorial mode - some menu items will only appear if you 'qualify' for them. Thus, the first-time user will only see four menu items. The other items will unfold as the user interacts.	2014-04-02 02:33:17 +02:00
reger	b12200cafe	alternative UrlProxyServlet (for /proxy.html) using different url rewrite rules - use JSoup parser for selective rewrite of html body <a href= links only, instead of regex which rewrites also header href/src links - this improves display of pages which use header <base> tag - tags with src attribute are taken from original location (like css) improving display and are not routed through the indexer Disadvantage: scripting links will drop out of proxy Setting of the servlet through web.xml exclusively (in case one would like to quickly switch back to the YaCyProxyServlet, leaving the existing code of YaCyProxyServlet untouched available)	2014-03-30 04:04:02 +02:00
Michael Peter Christen	e515dd460d	added linkscount_i and linksnofollowcount_i to the default solr schema	2014-03-27 23:36:08 +01:00
Michael Peter Christen	a7bc130e27	removed performance settings - they are incomplete and buggy - it was not easy to explain - it did not comply with a KISS strategy - setting a performance of low priority actually caused crashing of a peer - there was nobody who would maintain that functionality	2014-03-27 22:00:57 +01:00
Michael Peter Christen	a28fefba2d	activated language facet by default	2014-03-23 11:39:46 +01:00
Michael Peter Christen	617dd9c97b	- added new input field in index.html - changed progress bar in yacysearch.html - moved pagination navigation to page bottom - moved search term input field to headline	2014-03-21 02:42:09 +01:00
orbiter	7d24bcb98d	added flag to require that all web pages, even such without a "_p" extension require authorization. (default off)	2014-03-20 19:09:47 +01:00
reger	1fe26550a0	remove AugmentedBrowsing_p.html augmented browsing switch (has no function in code, previously used in conjunction with http://reflect.ws)	2014-03-19 22:40:35 +01:00
reger	e972b87a8a	remove AugmentedBrowsingFilters_p.html as none of the settings are used currently config settings from the page also removed from yacy.init augmentation.reflect augmentation.addDoctype augmentation.reparse interaction.overlayinteraction.enabled	2014-03-17 20:27:04 +01:00
reger	a373fb717d	remove more unused from legacy server.http - triggerOnlineAction not used - useTemplateCache not used	2014-03-14 03:12:04 +01:00
orbiter	f77afa9d1d	add index on _val fields, this affects especially title length; an index on a field makes search facets on that field possible	2014-03-04 11:24:04 +01:00
Michael Peter Christen	de8f7994ab	as crawling has a low-cpu demand, we want it to run even if the CPU load is VERY high. This applies also if the CPU load is high because of in-cache crawling; in that case we want to experience a high-CPU load as much as possible	2014-02-25 14:17:33 +01:00
Michael Peter Christen	9eb668e951	enhanced the resource observer The resource observer is now able to recognize free disk space AND available space for YaCy. The amount of space which is assigned for YaCy are defined in new settings in the configuration file. Furthermore, there is now a cleanup process which deletes files in case that an autodelete is activated. The autodelete is now BY DEFAULT ON if the disk space is low, which means that YaCy starts to delete documents when the disk is full!	2014-02-12 01:00:44 +01:00
Michael Peter Christen	ca8b100f96	run the cleanup process even when load is high, do postprocessing even if load > 1 (but < 2) but only if there is enough memory (now: 0.5 GB RAM available). The memory amount of the postprocessing is the cause that systems block because they run into a frequent-GC chain which almost locks the peer. If running with enough memory, the postprocessing is fast and not damaging to the system. Because the required RAM of 0.5 GB is never available in default setting, the postprocessing will not run if the peer is not reconfigured to use more memory.	2014-02-10 12:59:30 +01:00
Michael Peter Christen	6e59ca4ebf	removed jena library and all code that depended on jena. When jena was introduced, it was also used for search facets. The generic search facets are now deduced from generic solr fields which makes jena as tool for facet semantics superfluous.	2014-02-07 01:20:06 +01:00
Michael Peter Christen	931541d198	re-inserted default value re-set button to performance queues and patched missing values for recent new queues	2014-02-06 22:39:19 +01:00
Michael Peter Christen	4b7f2fcf38	updated bootstrap seed list	2014-01-27 13:55:06 +01:00
reger	a71718a459	add config value for ssl/https port (default=8443) adjust server routines to use config	2014-01-27 01:09:56 +01:00
reger	cf553e5045	added hint to web.xml and for completeness the full set of hardcoded mappings	2014-01-23 23:56:45 +01:00
Michael Peter Christen	a8fdaace31	changed the web.xml as well to migrate the solr servlet	2014-01-23 18:41:45 +01:00
Michael Peter Christen	be5e808236	- removed hardcoded load-test which is now handled in BusyQueues steering, see /PerformanceQueues_p.html - changed default values for crawler queue load limit (high, because these jobs are started upon user request)	2014-01-21 17:48:45 +01:00
sixcooler	40a4030b55	configurable max-load values for YaCy-Threads: try lower values on small systems like a Pi	2014-01-21 17:04:22 +01:00
Michael Peter Christen	77531850b5	reverted crawling strategy from latest commit.	2014-01-21 16:05:55 +01:00
reger	97e84439fb	adjusted ConfigHeuristic and changed QueryGoal.getOriginalQueryString to .getQueryString - since the specific heuristics Twitter & Blekko are no longer available or redundant with OpenSearchHeuristic, adjusted ConfigHeuristic to use OpensearchHeuristic settings only. For this the default OSD search target list is made available (copied) by default and the other configs are removed. - the return of QueryGoal.getOriginalQueryString includes the queryModifier, which are held separately in a modifier object, but in most (all) cases just the query term is expected, clarified and renamed it to QueryGoal.getQueryString which returns just the search term (if needed a .getOriginalQueryString could be implemented in Queryparameters, adding the modifiers) - started to adjust internal html href references from absolute to relative (currently it is mixed). For future development we should prefer relative href targets (less trouble with context aware servlets)	2014-01-20 00:58:17 +01:00
reger	d24a0ec32c	upd heuristic default list (heuristicopensearch.conf) - Faroo Web taken out (requires api key) http://www.faroo.com/hp/api/api.html#description - update Faroo News to new url - Twitter taken out (change to Api 1.1 not supporting rss) https://dev.twitter.com/discussions/24239	2014-01-20 00:03:55 +01:00
reger	0c754dd794	implemented DIGEST authentication, which is more secure for remote login than BASIC, where the pwd is transmitted in near clear text (B64enc). This has some implication as RFC 2617 requires and recommends a password hash MD5(user:realm:pwd) for DIGEST. !!! before activating DIGEST you have to reassign all passwords !!! to allow new calculation of the hash - default authentication is still BASIC - configuration at this time only manually in (DATA/settings) or defaults/web.xml (<auth-method>) - the realmname is in defaults/yacy.init adminRealm=YaCy-AdminUI - fyi: the realmname is shown on login screen - changing the realm name invalidates all passwords - but for security you are encouraged to do so (as localhostadmin) - implemented to support both, old hashes for BASIC and new hashes for BASIC and DIGEST - to differentiate old / new hash, the hash-prefix "MD5:" used in Jetty is used for new pwd-hashes ( "MD5:hash" )	2014-01-17 00:02:23 +01:00
Michael Peter Christen	f8ce7040ab	remote search peer selection schema change: - all non-dht targets (previously separated into 'robinson' for dht-like queries and 'node' for solr queries) are now 'extra' peers, which are queried using solr - these extra-peers are now selected using a ranking on last-seen, peer-tag-matches, node-peer flags, peer age, and link count. The ranking is done using a weight and a random factor. - the number of extra peers is 50% of the dht peers - the dht peers now exclude too young peers to prevent bad results during strong growth of the network - the number of dht peers (and therefore extra-peers) is reduced when the memory of the peer is low and/or some documents still appear in the indexing-queue. This shall prevent a peer from deadlocks when p2p queries are made in a fast sequence on weak hardware.	2014-01-16 17:27:14 +01:00
reger	f09dbbef96	make SecurityHandler webappcontext ready	2014-01-10 12:36:42 +01:00
reger	37f2a82a5d	making root context (htroot) a WebAppContext - this allows additional features, like servlet configuration via web.xml and many more things. - currently the standard servlets are still configured in the code (so the supplied defaults/web.xml is not really needed, yet), but could be expanded - lookup for web.xml - 1. in /DATA/SETTINGS then in /defaults	2014-01-10 10:42:47 +01:00
reger	f6099b730d	disabled unused fields in default Solr collection schema	2014-01-10 10:26:45 +01:00
orbiter	2ead4e44d9	introduced a new storage path ARCHIVE inside of DATA which will be used as path for solr index dumps (instead of the SEGMENTS path). This will make maintenance of index backups easier. It will also provide a tool to migrate from a freeworld index to a webportal index.	2014-01-07 17:53:49 +01:00
reger	fbdd89e198	Merge origin/master	2013-12-27 06:53:14 +01:00
reger	65a2f3d5e7	tweak Jetty credentials to work with YaCy UserDB - user entry in UserDB with admin right can login to access protected pages - dto. admin user, chosen username is stored in conf (adminAccountUserName=)	2013-12-27 06:45:22 +01:00
Michael Peter Christen	ee17bd0b69	added option to attach remote solr servers in read-only mode	2013-12-27 02:55:21 +01:00
Michael Peter Christen	84167adb49	removed unused anomichttpd code after migration to jetty	2013-12-23 01:23:40 +01:00
Michael Peter Christen	7603e879dc	Merge branch 'master' into HEAD Conflicts: .classpath source/net/yacy/cora/federate/solr/SolrServlet.java	2013-12-20 01:19:06 +01:00
Michael Peter Christen	2f16770681	migrated to solr 4.6.0	2013-12-19 21:51:05 +01:00
reger	92d9c56f9f	Merge origin/master into jetty	2013-12-05 22:53:29 +01:00
Michael Peter Christen	e3c2f09de9	- reduce computation in case that specific postprocessing fields are not selected - de-select citation rank computation	2013-12-04 17:48:12 +01:00
reger	effea4bca0	Merge origin/master into jetty Conflicts: source/net/yacy/cora/federate/solr/SolrServlet.java	2013-11-29 22:39:52 +01:00
Michael Peter Christen	a16534cb0a	tried to fix timeout and connection-lost problems when using an outside solr.	2013-11-28 01:31:53 +01:00
reger	f111f30ace	Merge origin/master into jetty	2013-11-17 00:18:25 +01:00
Michael Peter Christen	5ec5be5769	fixed logging for remote solr configuration	2013-11-15 15:36:24 +01:00
Michael Peter Christen	24a052ecb9	removed debug code for existsByIds	2013-11-13 13:41:18 +01:00
Michael Peter Christen	087df05e24	added option to Config_Network_p.html to enable remote search while DHT-Receive is switched off.	2013-11-13 13:38:01 +01:00
Michael Peter Christen	899e7e92b0	added debug code	2013-11-09 02:37:12 +01:00
Michael Peter Christen	a5c1249ee2	reverted autowarming setting in solrconfig	2013-11-09 01:43:44 +01:00
reger	1437c45383	merge rc1/master	2013-11-07 21:30:17 +01:00
Michael Peter Christen	81bb50118e	found and fixed a huge memory leak in solr caching (inside Solr). The not-flushed Solr cache is now handled in this way: - it is smaller by default - a Solr-internal process is started to flush the cache periodically (this does NOT clean the cache, just removes old objects) - a Solr-external process (the standard YaCy cleanup-process) now has direct access to the solr internal caches and flushes them completely. The time frame for such a flush is defined by the cleanup-process frequency, by default 10 minutes.	2013-11-07 10:01:44 +01:00
Michael Peter Christen	7f768b42d3	we do not need the load-image flag any more since this is now controlled by parser switches	2013-11-06 15:00:57 +01:00
reger	f017066197	Merge origin/master into jetty	2013-10-27 15:09:24 +01:00
Michael Peter Christen	f1bfe64361	integrated startpage to compare_yacy	2013-10-26 00:33:36 +02:00
Michael Peter Christen	9bb7eab389	hacks to prevent storage of data longer than necessary during search and some speed enhancements. This should reduce the memory usage during heavy-load search a bit.	2013-10-25 15:05:30 +02:00
orbiter	3c3cb78555	- removed a lot of garbage and bloated code from GuiHandler. - transformed log lines to String before they are stored because the storage space is about 1:250 (45kb for one line before transformation, 180 bytes afterwards) - this saves up to 10MB RAM so we can increase the number of lines to 1000 again.	2013-10-24 20:42:34 +02:00
Michael Peter Christen	6aabc4e5c8	reduced logging line memory, 10000 lines had filled up 450MB! grrr. (thank you, a bomb from the past)	2013-10-24 16:17:53 +02:00
Michael Peter Christen	1b4fa2947d	- fixed a problem which occurred when a document was not recognized with the right content domain (i.e. identifying that it is an image, text etc.) because it used the file extension and not an existing mime type assignment. - fixed the new setting that images shall be loaded for a better image search. - both fixes together now make it possible to crawl commons.wikimedia.org which makes use of 'funny' document names (i.e. ending with .jpg while the document is html)	2013-10-23 00:16:54 +02:00
reger	f46c723398	allow to choose used http server, YaCy-Anomic or Jetty - defaults to Jetty (in this branch) - add server version info & config option -> Admin Console -> Advanced Settings -> Http Networking	2013-10-17 03:34:22 +02:00
Michael Peter Christen	820b896146	Replaced the inframe loading from yacy.net for donations with the loading of this iframe from the local host. To make this more flexible, this iframe is loaded once after startup from yacy.net.	2013-10-15 16:46:06 +02:00
reger	cf32a92629	- add size check to multipart form data handling of YaCyDefaultServlet (same as in HTTPDemon.parseMultipart) - reduce Jetty logging - give build.run a bit more memory (set to YaCy.default 600m from 512m)	2013-10-13 20:56:03 +02:00
reger	a44eede8b8	merge rc1/master	2013-10-11 01:50:25 +02:00
Michael Peter Christen	90c8577840	enhanced ranking; patches to replace old ranking	2013-10-09 15:10:03 +02:00
Michael Peter Christen	1b61bd40ed	- Added new solr field url_file_name_tokens_t which stores the file name tokens. This can be used to enhance the ranking. - Added also a rating_i field as basis for later usage. - enhanced the tokenization process.	2013-10-08 23:48:13 +02:00
orbiter	5f5a97bafc	added the anchor text within web pages to the searchable entities of a web page. This can be of benefit for the ranking if these fields are used for boosts.	2013-10-08 18:41:07 +02:00
Michael Peter Christen	21aa6a0321	migration to Solr 4.5.0	2013-10-07 17:09:40 +02:00
reger	c7c706fd9f	merge with rc1/master	2013-09-30 03:46:39 +02:00
Michael Peter Christen	b28d43decc	added two more fields source_cr_host_norm_i,target_cr_host_norm_i in webgraph and an addition to postprocessing to copy all cr ranking attributes to the link edges associated to the postprocessing documents	2013-09-27 16:57:05 +02:00
Michael Peter Christen	4f83d5f18c	added the new field harvestkey_s to the collection index and the webgraph index which is temporary filled with the crawl profile key. This is used to select a set of documents for post-processing as soon as a crawl is finished. Now the postprocessing for a specific crawl is started when that specific crawl is finished and not at the end of all post-processing steps.	2013-09-25 14:38:24 +02:00
orbiter	8ac2e8c8c9	added location navigator which causes that the image to the map search is visible whenever a location is available in the search result. To activate this, the search.navigation property in yacy.conf must be modified to the new default values.	2013-09-24 11:26:51 +02:00
reger	5111841e5b	- reduce Jetty debug logging - fix Context path initialization	2013-09-23 01:30:45 +02:00
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - as a result, large portions of the parser code had to be adapted as well - added a counter target_order_i for anchor links in webgraph computation	2013-09-15 23:27:04 +02:00
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assign this to each document. Thus, each document is assigned a number which shows how many copies of this document exist. These fields are disabled by default.	2013-09-04 23:11:53 +02:00
Michael Peter Christen	a2511b5600	turned images_alt_txt back to images_alt_sxt because it is not necessary to index the alt text. Indexed image Text is in images_text_t	2013-09-04 10:47:18 +02:00
Michael Peter Christen	69f85265e1	added an option to put image links to the crawl queue and handle these like normal documents. Using this option (by default on at this moment; this might change soon) it is possible to get the exif data into the search index to be used in image search.	2013-09-03 11:13:45 +02:00
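
The DIGEST authentication commit above (0c754dd794) refers to the RFC 2617 password hash MD5(user:realm:pwd), stored with the prefix "MD5:". A minimal sketch of that calculation follows. It is written in Python for illustration only and is not YaCy's Java code; the function names, the UTF-8 encoding, and the sample user name and password are assumptions. Only the realm name YaCy-AdminUI and the "MD5:" prefix are taken from the commit message.

```python
import hashlib


def digest_ha1(user: str, realm: str, password: str) -> str:
    """Return MD5(user:realm:password) as lowercase hex (the RFC 2617 A1 hash).

    Assumption: the three parts are encoded as UTF-8 before hashing.
    """
    a1 = f"{user}:{realm}:{password}".encode("utf-8")
    return hashlib.md5(a1).hexdigest()


def stored_credential(user: str, realm: str, password: str) -> str:
    """Hypothetical helper: prepend the "MD5:" prefix named in the commit message."""
    return "MD5:" + digest_ha1(user, realm, password)


if __name__ == "__main__":
    # "admin" and "secret" are made-up sample values.
    print(stored_credential("admin", "YaCy-AdminUI", "secret"))
```

Because the realm is part of the hashed string, the same user and password yield a different hash under a different realm name, which is consistent with the commit's note that changing the realm name invalidates all passwords.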

558 Commits