yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
reger	a2cb366b25	Combine /heuristic search modifier with opensearch configured targets - with search modifier /heuristic a request is sent to all configured opensearch target systems (old /heuristic/blekko modifier no longer valid) - this allows to use opensearch heuristic on individual search requests (in contrast to configuration HEURISTIC_OPENSEARCH=true, which sends an OSD request on all global searches) - the index.html searchoption text adjusted to be displayed only if option configured - add Archive-It to predefined systems	2014-07-20 00:00:43 +02:00
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls; use toNormalform(false) instead.	2014-07-18 12:43:01 +02:00
Michael Peter Christen	1092e798a5	fixed double content postprocessing	2014-07-07 19:15:11 +02:00
Michael Peter Christen	09dcdb9b19	update to solr 4.9.0	2014-07-01 16:39:00 +02:00
orbiter	0bbb5040b8	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-06-15 12:38:52 +02:00
orbiter	9d5d86cd03	Added filter query options to the ranking servlet /RankingSolr_p.html. Filter queries are not actually related to ranking, but user requests have pointed out that specific boost queries to move results to the end of the result list are not sufficient. Such boost filters may be better executed as actual filter and therefore such a filter can now be statically applied to every search request. A typical use could be the expression "http_unique_b:true AND www_unique_b:true" which uses the recently introduced fields http_unique_b and www_unique_b which are true only for one of the alternatives with/without http(s) and with/without prefix 'www.' in host names.	2014-06-15 12:38:30 +02:00
Michael Peter Christen	d2151857f1	Added collection navigation: The collection field (can be filled e.g. in Crawl Start) can be used to add categories to YaCy index entries. The usage of that field was restricted to solr searches and post argument filters as implemented in commit `f7571386a3`. This commit extends collections to a full navigation option in the standard YaCy search interface. The field is not active by default but can be activated easily in the /ConfigSearchPage_p.html servlet (just check the 'Collection' facet field). Collections can now be used for (at least) two purposes: - to provide search tenants (through post argument collection) - to provide self-made category navigation Search requests may now have (independently from switched on or off collection facet) a "collection:<collection-name>" modifier attached; furthermore collection names may use disjunctions using the '|' pipe symbol. For example, this is a valid search request: www collection:user|proxy	2014-06-15 12:11:23 +02:00
Michael Peter Christen	922979aae1	added option to prefer http over https in unique-protocol ranking	2014-06-02 17:40:56 +02:00
reger	d8d318233e	fix logging settings - add missing .level - remove obsolete jena settings - set default level=INFO to prevent debug logging of not explicitly specified classes	2014-06-01 06:43:50 +02:00
Michael Peter Christen	698f053658	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-06-01 01:02:12 +02:00
Michael Peter Christen	f23c4142e0	added option to configure a custom user agent within allip networks	2014-06-01 01:02:03 +02:00
orbiter	d7d38f9135	made number of open files in crawler configurable and increased default maximum number of open files from 100 to 1000. This number can be changed with the attribute crawler.onDemandLimit	2014-05-31 09:29:55 +02:00
Michael Peter Christen	ff5b3ac84d	added new fields http_unique_b and www_unique_b which can be used for ranking to prefer urls containing a www subdomain or using the https protocol	2014-05-27 15:28:28 +02:00
Michael Peter Christen	f0db501630	better handling of ranking parameters and new default values for date navigation which is done using ranking in solr.	2014-05-22 03:01:07 +02:00
Michael Peter Christen	d4157184ec	migration to Solr 4.8.1. This also includes an update to zookeeper 3.4.6 and a new library that Solr initializes by default: org.restlet from http://restlet.com/download/current#release=stable&edition=jse&distribution=zip which is included in version 2.2.1 from May 6th, 2014	2014-05-21 11:48:08 +02:00
orbiter	2944822bb0	updated bootstrap seed list	2014-05-20 13:27:40 +02:00
reger	e31493e139	"Use remote proxy for yacy" has no function, remove option and related config item see/fix bug http://mantis.tokeek.de/view.php?id=23 http://mantis.tokeek.de/view.php?id=189	2014-05-17 23:36:59 +02:00
reger	f02203fb2f	fix xml validation error on defaults/web.xml	2014-05-11 04:39:59 +02:00
Michael Peter Christen	229f2248b8	added configuration option for maximum load and minimum ram for postprocessing	2014-04-30 13:26:32 +02:00
Michael Peter Christen	3d5e354471	small changes to search headline colour	2014-04-29 18:46:50 +02:00
Michael Peter Christen	71efc76170	new default skin pdbootstrap which keeps the design shapes but slightly changes the colours to match with bootstrap colours	2014-04-29 16:23:42 +02:00
reger	d812f80784	add exit proxy link to UrlProxy on proxied pages a link to exit proxy is added to top of page. Link text can be configured in web.xml init-parameter (see default/web.xml). If missing no link is displayed.	2014-04-26 22:27:59 +02:00
reger	2dabe2009d	- remove unused manual http KeepAlive config (reducing references to obsolete httpdemon) - add port info to settings_http	2014-04-18 19:57:35 +02:00
Michael Peter Christen	7a2f3e2353	increased resource.disk.used.max.steadystate and resource.disk.used.max.overshot by 4 times because first users reached that limit and wondered why the crawler was paused automatically :) The crawler will now stop at 2TB disk usage :)	2014-04-17 16:19:38 +02:00
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	2014-04-16 22:16:20 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create tens of thousands of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10,000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
reger	46016fa153	autoupdate fails to download latest release (1.71) due to default release blacklist - removed the default version blacklist regex from init (for future versions) !!! left existing update blacklist setting untouched !!! (existing installation wanting autoupdate for 1.71 need to change blacklist in ConfigUpdate_p.html) - moved old blacklist patch to migration.java	2014-04-13 07:32:32 +02:00
Michael Peter Christen	ebd44a7080	replaced solr 4.6.1 with solr 4.7.1 and added index migration to lucene_47	2014-04-06 10:45:03 +02:00
Michael Peter Christen	ee92d748b5	test using compound file format, see UseCompoundFile in https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig This appears to be necessary as many times a java.io.FileNotFoundException: (Too many open files) appears. See also: https://issues.apache.org/jira/browse/SOLR-4 and desperate users at http://stackoverflow.com/questions/3828343/too-many-open-file-exception-while-indexin-using-solr We cannot force users to do a "ulimit -n 1000000", so this action seems to be required.	2014-04-06 00:35:35 +02:00
Michael Peter Christen	0a95fd27f3	update of seed list	2014-04-04 17:04:49 +02:00
Michael Peter Christen	cca851a417	introduced new solr field crawldepth_i which records the crawl depth of a document. This is the upper limit for the clickdepth_i value which may be shorter in case that the crawler did not take the shortest path to the document.	2014-04-02 23:37:01 +02:00
Michael Peter Christen	39b641d6cd	added tutorial mode - some menu items will only appear if you 'qualify' for them. Thus, the first-time user will only see four menu items. The other items will unfold as the user interacts.	2014-04-02 02:33:17 +02:00
reger	b12200cafe	alternative UrlProxyServlet (for /proxy.html) using different url rewrite rules - use JSoup parser for selective rewrite of html body <a href= links only, instead of regex which rewrites also header href/src links - this improves display of pages which use header <base> tag - tags with src attribute are taken from original location (like css) improving display and are not routed through the indexer Disadvantage: scripting links will drop out of proxy Setting of the servlet through web.xml exclusively (in case one would like to quickly switch back to the YaCyProxyServlet, leaving the existing code of YaCyProxyServlet untouched available)	2014-03-30 04:04:02 +02:00
Michael Peter Christen	e515dd460d	added linkscount_i and linksnofollowcount_i to the default solr schema	2014-03-27 23:36:08 +01:00
Michael Peter Christen	a7bc130e27	removed performance settings - they are incomplete and buggy - it was not easy to explain - it did not comply with a KISS strategy - setting a performance of low priority actually caused crashing of a peer - there was nobody who would maintain that functionality	2014-03-27 22:00:57 +01:00
Michael Peter Christen	a28fefba2d	activated language facet by default	2014-03-23 11:39:46 +01:00
Michael Peter Christen	617dd9c97b	- added new input field in index.html - changed progress bar in yacysearch.html - moved pagination navigation to page bottom - moved search term input field to headline	2014-03-21 02:42:09 +01:00
orbiter	7d24bcb98d	added flag to require that all web pages, even such without a "_p" extension require authorization. (default off)	2014-03-20 19:09:47 +01:00
reger	1fe26550a0	remove AugmentedBrowsing_p.html augmented browsing switch (has no function in code, previously used in conjunction with http://reflect.ws)	2014-03-19 22:40:35 +01:00
reger	e972b87a8a	remove AugmentedBrowsingFilters_p.html as none of the settings are used currently config settings from the page also removed from yacy.init augmentation.reflect augmentation.addDoctype augmentation.reparse interaction.overlayinteraction.enabled	2014-03-17 20:27:04 +01:00
reger	a373fb717d	remove more unused from legacy server.http - triggerOnlineAction not used - useTemplateCache not used	2014-03-14 03:12:04 +01:00
orbiter	f77afa9d1d	add index on _val fields, this affects especially title length; an index on a field makes search facets on that field possible	2014-03-04 11:24:04 +01:00
Michael Peter Christen	de8f7994ab	as crawling has a low-cpu demand, we want it to run even if the CPU load is VERY high. This applies also if the CPU load is high because of in-cache crawling; in that case we want to experience a high-CPU load as much as possible	2014-02-25 14:17:33 +01:00
Michael Peter Christen	9eb668e951	enhanced the resource observer. The resource observer is now able to recognize free disk space AND available space for YaCy. The amount of space which is assigned for YaCy is defined in new settings in the configuration file. Furthermore, there is now a cleanup process which deletes files in case that an autodelete is activated. The autodelete is now BY DEFAULT ON if the disk space is low, which means that YaCy starts to delete documents when the disk is full!	2014-02-12 01:00:44 +01:00
Michael Peter Christen	ca8b100f96	run the cleanup process even when load is high, do postprocessing even if load > 1 (but < 2) but only if there is enough memory (now: 0.5 GB RAM available). The memory amount of the postprocessing is the cause that systems block because they run into a frequent-GC chain which almost locks the peer. If running with enough memory, the postprocessing is fast and not damaging to the system. Because the required RAM of 0.5 GB is never available in default setting, the postprocessing will not run if the peer is not reconfigured to use more memory.	2014-02-10 12:59:30 +01:00
Michael Peter Christen	6e59ca4ebf	removed jena library and all code that depended on jena. When jena was introduced, it was also used for search facets. The generic search facets are now deduced from generic solr fields which makes jena as tool for facet semantics superfluous.	2014-02-07 01:20:06 +01:00
Michael Peter Christen	931541d198	re-inserted default value re-set button to performance queues and patched missing values for recent new queues	2014-02-06 22:39:19 +01:00
Michael Peter Christen	4b7f2fcf38	updated bootstrap seedlist list	2014-01-27 13:55:06 +01:00
reger	a71718a459	add config value for ssl/https port (default=8443) adjust server routines to use config	2014-01-27 01:09:56 +01:00
reger	cf553e5045	added hint to web.xml and for completeness the full set of hardcoded mappings	2014-01-23 23:56:45 +01:00
Michael Peter Christen	a8fdaace31	changed the web.xml as well to migrate the solr servlet	2014-01-23 18:41:45 +01:00
Michael Peter Christen	be5e808236	- removed hardcoded load-test which is now handled in BusyQueues steering, see /PerformanceQueues_p.html - changed default values for crawler queue load limit (high, because these jobs are started upon user request)	2014-01-21 17:48:45 +01:00
sixcooler	40a4030b55	configurable max-load values for YaCy-Threads: try lower values on small systems like a Pi	2014-01-21 17:04:22 +01:00
Michael Peter Christen	77531850b5	reverted crawling strategy from latest commit.	2014-01-21 16:05:55 +01:00
reger	97e84439fb	adjusted ConfigHeuristic and changed QueryGoal.getOriginalQueryString to .getQueryString - since specific heuristic Twitter & Blekko is no longer available or redundant with OpenSearchHeuristic, adjusted ConfigHeuristic to use OpensearchHeuristic settings only. For this the default OSD search target list is made available (copied) by default and the other configs are removed. - the return of QueryGoal.getOriginalQueryString includes the queryModifier, which are held separately in a modifier object, but in most (all) cases just the query term is expected, clarified and renamed it to QueryGoal.getQueryString which returns just the search term (if needed a .getOriginalQueryString could be implemented in Queryparameters, adding the modifiers) - started to adjust internal html href references from absolute to relative (currently it is mixed). For future development we should prefer relative href targets (less trouble with context aware servlets)	2014-01-20 00:58:17 +01:00
reger	d24a0ec32c	upd heuristic default list (heuristicopensearch.conf) - Faroo Web taken out (requires api key) http://www.faroo.com/hp/api/api.html#description - update Faroo News to new url - Twitter taken out (change to Api 1.1 not supporting rss) https://dev.twitter.com/discussions/24239	2014-01-20 00:03:55 +01:00
reger	0c754dd794	implemented DIGEST authentication, which is more secure for remote login than BASIC, where the pwd is transmitted in near clear text (B64enc). This has some implications as RFC 2617 requires and recommends a password hash MD5(user:realm:pwd) for DIGEST. !!! before activating DIGEST you have to reassign all passwords !!! to allow new calculation of the hash - default authentication is still BASIC - configuration at this time only manually in (DATA/settings) or defaults/web.xml (<auth-method>) - the realmname is in defaults/yacy.init adminRealm=YaCy-AdminUI - fyi: the realmname is shown on login screen - changing the realm name invalidates all passwords - but for security you are encouraged to do so (as localhostadmin) - implemented to support both, old hashes for BASIC and new hashes for BASIC and DIGEST - to differentiate old / new hash the hash-prefix "MD5:" used in Jetty is used for new pwd-hashes ( "MD5:hash" )	2014-01-17 00:02:23 +01:00
Michael Peter Christen	f8ce7040ab	remote search peer selection schema change: - all non-dht targets (previously separated into 'robinson' for dht-like queries and 'node' for solr queries) are now 'extra' peers, which are queried using solr - these extra-peers are now selected using a ranking on last-seen, peer-tag-matches, node-peer flags, peer age, and link count. The ranking is done using a weight and a random factor. - the number of extra peers is 50% of the dht peers - the dht peers now exclude too young peers to prevent bad results during strong growth of the network - the number of dht peers (and therefore extra-peers) is reduced when the memory of the peer is low and/or some documents still appear in the indexing-queue. This shall prevent a peer from deadlocks when p2p queries are made in a fast sequence on weak hardware.	2014-01-16 17:27:14 +01:00
reger	f09dbbef96	make SecurityHandler webappcontext ready	2014-01-10 12:36:42 +01:00
reger	37f2a82a5d	making root context (htroot) a WebAppContext - this allows additional features, like servlet configuration via web.xml and many more things. - currently the standard servlets are still configured in the code (so the supplied defaults/web.xml is not really needed, yet), but could be expanded - lookup for web.xml - 1. in /DATA/SETTINGS then in /defaults	2014-01-10 10:42:47 +01:00
reger	f6099b730d	disabled unused fields in default Solr collection schema	2014-01-10 10:26:45 +01:00
orbiter	2ead4e44d9	introduced a new storage path ARCHIVE inside of DATA which will be used as path for solr index dumps (instead of the SEGMENTS path). This will make maintenance of index backups easier. It will also provide a tool to migrate from a freeworld index to a webportal index.	2014-01-07 17:53:49 +01:00
reger	fbdd89e198	Merge origin/master	2013-12-27 06:53:14 +01:00
reger	65a2f3d5e7	tweak Jetty credentials to work with YaCy UserDB - user entry in UserDB with admin right can login to access protected pages - dto. admin user, chosen username is stored in conf (adminAccountUserName=)	2013-12-27 06:45:22 +01:00
Michael Peter Christen	ee17bd0b69	added option to attach remote solr servers in read-only mode	2013-12-27 02:55:21 +01:00
Michael Peter Christen	84167adb49	removed unused anomichttpd code after migration to jetty	2013-12-23 01:23:40 +01:00
Michael Peter Christen	7603e879dc	Merge branch 'master' into HEAD Conflicts: .classpath source/net/yacy/cora/federate/solr/SolrServlet.java	2013-12-20 01:19:06 +01:00
Michael Peter Christen	2f16770681	migrated to solr 4.6.0	2013-12-19 21:51:05 +01:00
reger	92d9c56f9f	Merge origin/master into jetty	2013-12-05 22:53:29 +01:00
Michael Peter Christen	e3c2f09de9	- reduce computation in case that specific postprocessing fields are not selected - de-select citation rank computation	2013-12-04 17:48:12 +01:00
reger	effea4bca0	Merge origin/master into jetty Conflicts: source/net/yacy/cora/federate/solr/SolrServlet.java	2013-11-29 22:39:52 +01:00
Michael Peter Christen	a16534cb0a	tried to fix timeout and connection-lost problems when using an outside solr.	2013-11-28 01:31:53 +01:00
reger	f111f30ace	Merge origin/master into jetty	2013-11-17 00:18:25 +01:00
Michael Peter Christen	5ec5be5769	fixed logging for remote solr configuration	2013-11-15 15:36:24 +01:00
Michael Peter Christen	24a052ecb9	removed debug code for existsByIds	2013-11-13 13:41:18 +01:00
Michael Peter Christen	087df05e24	added option to Config_Network_p.html to enable remote search while DHT-Receive is switched off.	2013-11-13 13:38:01 +01:00
Michael Peter Christen	899e7e92b0	added debug code	2013-11-09 02:37:12 +01:00
Michael Peter Christen	a5c1249ee2	reverted autowarming setting in solrconfig	2013-11-09 01:43:44 +01:00
reger	1437c45383	merge rc1/master	2013-11-07 21:30:17 +01:00
Michael Peter Christen	81bb50118e	found and fixed a huge memory leak in solr caching (inside Solr). The not-flushed Solr cache is now handled in this way: - it is smaller by default - a Solr-internal process is started to flush the cache periodically (this does NOT clean the cache, just removes old objects) - a Solr-external process (the standard YaCy cleanup-process) now has direct access to the solr internal cache and flushes them completely. The time frame for such a flush is defined by the cleanup-process frequency, by default 10 minutes.	2013-11-07 10:01:44 +01:00
Michael Peter Christen	7f768b42d3	we do not need the load-image flag any more since this is now controlled by parser switches	2013-11-06 15:00:57 +01:00
reger	f017066197	Merge origin/master into jetty	2013-10-27 15:09:24 +01:00
Michael Peter Christen	f1bfe64361	integrated startpage to compare_yacy	2013-10-26 00:33:36 +02:00
Michael Peter Christen	9bb7eab389	hacks to prevent storage of data longer than necessary during search and some speed enhancements. This should reduce the memory usage during heavy-load search a bit.	2013-10-25 15:05:30 +02:00
orbiter	3c3cb78555	- removed a lot of garbage and bloated code from GuiHandler. - transformed log lines to String before they are stored because the storage space is about 1:250 (45kb for one line before transformation, 180 bytes afterwards) - this saves up to 10MB RAM so we can increase the number of lines to 1000 again.	2013-10-24 20:42:34 +02:00
Michael Peter Christen	6aabc4e5c8	reduced logging line memory, 10000 lines had filled up 450MB! grrr. (thank you, a bomb from the past)	2013-10-24 16:17:53 +02:00
Michael Peter Christen	1b4fa2947d	- fixed a problem which occurred when a document was not recognized with the right content domain (i.e. identifying that it is an image, text etc.) because it used the file extension and not an existing mime type assignment. - fixed the new setting that images shall be loaded for a better image search. - both fixes together make it now possible to crawl commons.wikimedia.org which makes use of 'funny' document names (i.e. ending with .jpg while the document is html)	2013-10-23 00:16:54 +02:00
reger	f46c723398	allow to choose used http server, YaCy-Anomic or Jetty - defaults to Jetty (in this branch) - add server version info & config option -> Admin Console -> Advanced Settings -> Http Networking	2013-10-17 03:34:22 +02:00
Michael Peter Christen	820b896146	Replaced the inframe loading from yacy.net for donations with the loading of this iframe from the local host. To make this more flexible, this iframe is loaded once after startup from yacy.net.	2013-10-15 16:46:06 +02:00
reger	cf32a92629	- add size check to multipart form data handling of YaCyDefaultServlet (same as in HTTPDemon.parseMultipart) - reduce Jetty logging - give build.run a bit more memory (set to YaCy.default 600m from 512m)	2013-10-13 20:56:03 +02:00
reger	a44eede8b8	merge rc1/master	2013-10-11 01:50:25 +02:00
Michael Peter Christen	90c8577840	enhanced ranking; patches to replace old ranking	2013-10-09 15:10:03 +02:00
Michael Peter Christen	1b61bd40ed	- Added new solr field url_file_name_tokens_t which stores the file name tokens. This can be used to enhance the ranking. - Added also a rating_i field as basis for later usage. - enhanced the tokenization process.	2013-10-08 23:48:13 +02:00
orbiter	5f5a97bafc	added the anchor text within web pages to the searchable entities of a web page. This can be of benefit for the ranking if these fields are used for boosts.	2013-10-08 18:41:07 +02:00
Michael Peter Christen	21aa6a0321	migration to Solr 4.5.0	2013-10-07 17:09:40 +02:00
reger	c7c706fd9f	merge with rc1/master	2013-09-30 03:46:39 +02:00
Michael Peter Christen	b28d43decc	added two more fields source_cr_host_norm_i,target_cr_host_norm_i in webgraph and an addition to postprocessing to copy all cr ranking attributes to the link edges associated to the postprocessing documents	2013-09-27 16:57:05 +02:00
Michael Peter Christen	4f83d5f18c	added the new field harvestkey_s to the collection index and the webgraph index which is temporary filled with the crawl profile key. This is used to select a set of documents for post-processing as soon as a crawl is finished. Now the postprocessing for a specific crawl is started when that specific crawl is finished and not at the end of all post-processing steps.	2013-09-25 14:38:24 +02:00
orbiter	8ac2e8c8c9	added location navigator which causes that the image to the map search is visible whenever a location is available in the search result. To activate this, the search.navigation property in yacy.conf must be modified to the new default values.	2013-09-24 11:26:51 +02:00
reger	5111841e5b	- reduce Jetty debug logging - fix Context path initialization	2013-09-23 01:30:45 +02:00
Michael Peter Christen	61c5e40687	- replaced the properties object in AnchorURL with distinct variables for anchor attributes. - this caused that large portions of the parser code had to be adapted as well - added a counter target_order_i for anchor links in webgraph computation	2013-09-15 23:27:04 +02:00
Michael Peter Christen	85456f46b2	added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assign this to each document. Thus, each document is assigned a number which shows how many copies of this document exist. These fields are disabled by default.	2013-09-04 23:11:53 +02:00
Michael Peter Christen	a2511b5600	turned images_alt_txt back to images_alt_sxt because it is not necessary to index the alt text. Indexed image Text is in images_text_t	2013-09-04 10:47:18 +02:00
Michael Peter Christen	69f85265e1	added an option to put image links to the crawl queue and handle these like normal documents. Using this option (by default on at this moment; this might change soon) it is possible to get the exif data into the search index to be used in image search.	2013-09-03 11:13:45 +02:00
orbiter	f106345eef	link strings should not be tokenized	2013-09-01 14:35:36 +02:00
orbiter	deadeb406e	image alt tag strings should be tokenized	2013-09-01 13:48:10 +02:00
Michael Peter Christen	1a3e42eca4	index migration to lucene 4.4	2013-08-26 12:49:39 +02:00
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying to robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on per-crawl-start basis. Every crawl start can have a different user agent.	2013-08-22 14:23:47 +02:00
sixcooler	1bc6003057	raise autoCommit maxTime to 3 minutes to reduce IO; lower mergeFactor again (5) for fewer segments	2013-08-06 03:58:53 +02:00
orbiter	944ae5686c	added donation plea to the about box as default (you can replace this in your peer!)	2013-08-01 12:11:56 +02:00
Michael Peter Christen	58fe986cca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-07-30 12:49:14 +02:00
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	2013-07-30 12:48:57 +02:00
orbiter	e7fcb81cea	we should not do too much greedylearning at this time as we don't have enough experience with it. set greedylearning.limit.doccount to a much lower limit.	2013-07-27 11:22:40 +02:00
orbiter	bf0ad04e1b	apply load limitation also to dht-in	2013-07-27 10:42:38 +02:00
orbiter	f50b596e0b	do not run dht distribution if system load is over 2.5	2013-07-23 19:32:32 +02:00
orbiter	e24016e30a	added the property federated.service.solr.indexing.timeout to yacy.init to provide a configurable time-out for solr; see also: http://bugs.yacy.net/view.php?id=254	2013-07-22 17:45:12 +02:00
Roland Haeder	98e10f95e2	Added some cora package loggers	2013-07-17 18:28:10 +02:00
orbiter	1b43e02b86	Merge branch 'master' of git://gitorious.org/~quix0r/yacy/quix0rs-yacy-rc1	2013-07-13 18:54:18 +02:00
orbiter	a548354c71	replaced type of solr schema object sku of text_en_splitting_tight by string	2013-07-13 18:54:09 +02:00
Roland Haeder	ebbb3bc5c1	Fixed CHMOD on many files + added missing loggers (e.g. jena) and made some noisy loggers quiet	2013-07-13 13:12:36 +02:00
orbiter	e609ec388a	metager whitelist update	2013-07-10 15:13:04 +02:00
Michael Peter Christen	2716dfc46c	increase crawler speed by reduction of the busysleep time	2013-07-08 23:40:31 +02:00
Michael Peter Christen	57ffdfad4c	added a crawl option to obey html-meta-robots-noindex. This is on by default.	2013-07-03 14:50:06 +02:00
Michael Peter Christen	5a5d411ec0	new robots_i attribute fields	2013-07-02 14:29:13 +02:00
orbiter	7c6ccc426c	set crawlingQ to true by default because most webpages are dynamic and crawlingQ should only be switched off in case of crawler traps	2013-06-29 20:28:14 +02:00
Michael Peter Christen	16d1d744fa	added url_file_name_s in default collection schema for the file name without the file extension. This part of the file path is removed from the multi-field url_paths_sxt, which has now not the file name as last part of the path list. The same applies to the new fields source_file_name_s and target_file_name_s in the webgraph schema.	2013-06-25 16:27:20 +02:00
orbiter	8792e6c6e9	stub for better image indexing	2013-06-18 13:28:30 +02:00
Michael Peter Christen	570511f3c8	removed fields references_internal_id_sxt and references_internal_url_sxt because they had been shown to be superfluous. The citation of referrer in the host browser is possible without them. Therefore now the host browser does not only show internal, but also external referrer to each link.	2013-06-13 13:01:28 +02:00
Michael Peter Christen	fd1776a3b0	added a new 'Citations' function: each search result item can now be explored for citations within other documents. A click on the 'Citations' link shows an analysis with all text lines in the document each with a complete list of documents which contain the same line. A second section shows the linking documents in ascending order of number of citations from the original document. Because documents from different hosts are most interesting here, they are listed at the top of the page as possible 'copypasta' source.	2013-06-12 15:02:49 +02:00
Michael Peter Christen	7754a1263b	switching back to the merge factor 10; the solr default.	2013-06-12 11:29:35 +02:00
Michael Peter Christen	1762911f57	added synchronizations and timeouts in solr api; missing synchronizations in index modification methods causes deadlocks inside solr.	2013-06-12 02:13:18 +02:00
Michael Peter Christen	959ccc4675	increased the solr merge factor because 4 was too much IO load for frequent index receiving and re-indexing after clickdepth/cr calculation.	2013-06-11 16:51:40 +02:00
Michael Peter Christen	20fab1feb6	allip net has greedy learning disabled	2013-06-11 14:52:46 +02:00
Michael Peter Christen	6115bef335	added a 'greedy learning' mechanism which causes a 'fresh' yacy to load linked web pages from search results until the total number of web pages reaches 15000. This shall give fresh peers a 'boost' to get a personalized search index faster.	2013-06-11 14:42:30 +02:00
Michael Peter Christen	856e5c42ae	the line "Web Search by the People, for the People" is more generic for P2P and portal search as default search string. Otherwise, if people switch to Portal mode, the "P2P Web Search" does not make sense.	2013-06-10 18:36:06 +02:00
Michael Peter Christen	713a6199ef	activated citation ranking by default	2013-06-07 14:26:14 +02:00
Michael Peter Christen	f7a4377812	usage of the new normalized link popularity CRn as default ranking function. This replaces the previous formula, which was bad. Before you update to this version, please check if you changed the ranking function yourself before, since it will be overwritten.	2013-06-07 13:22:22 +02:00
Michael Peter Christen	f7e77a21bf	Added a citation reference computation for intra-domain link structures. While the values for the reference evaluation are computed, also a backlink-structure can be discovered and written to the index as well. The host browser has been extended to show such backlinks to each presented link. The host browser therefore can now show an information where a document is linked. The new citation reference is computed as likelihood for a random click path with recursive usage of previously computed likelihood. This process is repeated until the likelihood converges to a specific number. This number is then normalized to a ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to rank popularity within intra-domain link structures.	2013-06-07 13:20:57 +02:00
reger	8a7fcb391d	enable use of solrcore.properties for property substitution of solrconfig.xml - move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties - add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties reason: on 32bit MMapDirectoryFactory may fail with..... Caused by: java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849) at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)	2013-06-01 05:43:08 +02:00
Michael Peter Christen	eb9d0ba5b1	ranking and boost function update, small bugfixes, better default search field for solr	2013-05-30 16:30:35 +02:00
Michael Peter Christen	a8dc4346e8	default configuration of MMapDirectoryFactory for solr, increased lock timeout, less documents from remote searches (too many results had easily blocked a peer)	2013-05-30 12:31:28 +02:00
Michael Peter Christen	0c1a018bbd	removed 'later' tactic because it used too much RAM, reduced number of soft commits, reduced caching size of search events, ensured that solr results are processed before connection is closed to keep that stuff not too long in RAM	2013-05-29 18:27:27 +02:00
Michael Peter Christen	536fd1450e	added new keys for update locations	2013-05-29 13:10:32 +02:00
orbiter	a83c2fe833	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-05-10 12:02:40 +02:00
orbiter	4baa0d4a97	Added a default keystore for ssl encryption of the YaCy web interface. This will enable https-access to YaCy, but this feature is disabled by default using the new server.https=false attribute. This has two purposes: - make it easier for everyone to use https (just set server.https=true) - provide the basis for secure yacy-to-yacy communication in the future	2013-05-10 12:02:31 +02:00
reger	da191c839d	reduce SolrConnectorLogging setting (from default ALL to INFO)	2013-05-10 05:54:07 +02:00
Michael Peter Christen	9bd2aee180	migrated to solr 4.3.0	2013-05-09 02:17:53 +02:00
Michael Peter Christen	cca19d94d4	re-declared some fields to be of type string rather than text which makes them more efficient and less large	2013-05-06 16:45:54 +02:00
Michael Peter Christen	cc90f82dbb	increased default proxy client timeout to one minute	2013-05-06 14:58:18 +02:00
Michael Peter Christen	50421171c3	added new schema fields: hreflang_url_sxt and hreflang_cc_sxt for http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077 navigation_url_sxt and navigation_type_sxt for http://googlewebmastercentral.blogspot.de/2011/09/pagination-with-relnext-and-relprev.html publisher_url_s for http://support.google.com/plus/answer/1713826?hl=de all fields are disabled by default and not written to the index.	2013-04-18 17:21:17 +02:00
Michael Peter Christen	d05dc07cff	setting of new default values for ranking	2013-04-16 15:02:00 +02:00
Michael Peter Christen	97775fbebc	fixed ranking for add-function queries: this did not work. The option was removed. All function queries are now boosts (multiplies the score according to a function). This is also the recommended way to boost rankings based on functions as explained in http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/	2013-04-16 14:45:14 +02:00
Michael Peter Christen	7ab5093321	added new solr title_exact_signature_l and description_exact_signature_l to be able to identify unique title and unique description fields.	2013-04-16 01:35:15 +02:00
Michael Peter Christen	27d6222880	added new field host_extent_i which, after a crawl and postprocessing, holds the number of documents for the host where the document is hosted. This is necessary for ranking and the norming of references per local host in the ranking computation.	2013-04-14 20:52:40 +02:00
Michael Peter Christen	ada3f27de7	added three new fields for a better ranking: references_internal_i, references_external_i and references_exthosts_i. These can be used to count and evaluate the number of external links to every web page. An experimental ranking function can be e.g.: div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))	2013-04-12 16:17:14 +02:00
reger	e89491271f	- fix opensearch discover err msg - webgraph not enabled - if no opensearchdescription link found in index - remove search2.net from sample config (is down)	2013-04-04 00:40:59 +02:00
orbiter	17ae51e741	increased number of links limitation from 1000 to 10000 for rss feeds and html documents	2013-03-17 22:13:56 +01:00
Michael Peter Christen	2d36a7eaf5	- do not create a new query for all remote peers - no document search this time - adjusted banner and network to not show 'WORDS' but DHT Chunks. This is to avoid confusion for robinson peers which do not create Word Entries	2013-03-15 00:14:28 +01:00
Michael Peter Christen	4af0839be2	use appropriate ranking for each search situation: - when using the /date modifier, a date ranking profile is used - when using a site: modifier, a ranking profile supporting longer urls is used	2013-03-14 21:13:12 +01:00
Michael Peter Christen	2080fc7406	removed unused tag fields	2013-03-14 10:35:21 +01:00
orbiter	6b13dd0d3d	added clickdepth field writing for webgraph core (unfinished)	2013-03-14 01:35:38 +01:00
Michael Peter Christen	addba047e2	changes in ranking computation - an existing ranking servlet for solr was extended. It is now possible to set boost values for fields, boost functions and boost queries. - The ranking can have different instances, but currently only the first one is used - added an abstraction layer for fields which can be used for search and those fields can be edited in the solr ranking configuration - the ranking value from solr within the field score is used to combine remote search requests, which all are created using the same locally defined boost values - reduced the number of fields which are used for search (makes it faster) - replaced some text fields by string fields (makes indexing faster) - removed classes which had no use - made a large number of experiments for a better ranking and created a temporary setting which prefers hits inside titles - adjusted also the RWI-based ranking computation to 'prefer title' - made special cases like for portal search where no post-processing and post-ranking is wanted: this keeps the original ranking order as done by Solr - fixed many bugs with old settings for ranking	2013-03-13 14:47:00 +01:00
Michael Peter Christen	25300913fa	fixes to search debugging after testing with the different search debugging options	2013-03-05 21:28:22 +01:00
orbiter	b1140e3d82	added debug switches for detailed search testing	2013-03-05 12:19:32 +01:00
Michael Peter Christen	0d7b4bc891	better protection against OOM during search flush and fixed missing result push	2013-03-03 23:45:47 +01:00
Michael Peter Christen	3b1d9dc884	made index storage from DHT search result concurrently. This prevents blocking by high CPU usage during search. Also: removed query from Solr for DHT search results; results are taken from the pending queue.	2013-03-02 10:25:52 +01:00
orbiter	0f7ea7ad9f	- enhanced solr.add procedure for mass adds - removed unused solr access classes - made snippet generation for documents from YaCy RWI/DHT concurrent (as it was before the search process removal) - reduced the number of remote results in settings file because the processing of such mass document adds is too CPU-intensive (in Solr)	2013-03-01 15:27:17 +01:00
Michael Peter Christen	089dee1770	- generalized SchemaConfiguration into super-class Configuration and adopted other classes which used the configuration-only access for that class - removed many warnings - adjusted logging	2013-02-25 00:09:41 +01:00
Michael Peter Christen	56d5946a59	- added flags in IndexFederated_p.html to switch on or off the webgraph index (new solr core webgraph) .. this is now off by default - completely redesigned this servlet - added description how to attach a remote solr - adjusted naming of servlet and menus - moved 'lazy initialization' attribute from IndexSchema to IndexFederated (this is a general option) back again.	2013-02-24 18:09:34 +01:00
Michael Peter Christen	461d46101d	- Removed log4j from libraries. This can be removed because the package log4j-over-slf4j is there. From slf4j all loggings are routed to the jdk logger. Now all loggings are consistently done to the jdk logger. - added some lines to the logging properties to suppress many solr logging statements. The number of the logging entries had already become a performance issue, therefore removing these from the log should increase performance.	2013-02-23 16:45:05 +01:00
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resulting search index has now the following properties: - webgraph size will have about 40 times as many entries as default index - the complete index size will increase and may be about the double size of current amount As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	2013-02-22 15:45:15 +01:00
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented to the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of a solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding on top of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; they must retrieve the actual connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	2013-02-21 13:23:55 +01:00
Michael Peter Christen	4111606654	removed the commitWithin attribute because that is not the right way to update the index for us. May also be superfluous with the solr 4.0 softcommit.	2013-02-13 02:29:47 +01:00
Michael Peter Christen	d70d99fab5	added more metadata fields and facets to OpensearchResponseWriter. This should make it possible to replace the original and enriched yacy opensearch result with a solr output in opensearch format.	2013-02-11 22:10:14 +01:00
Michael Peter Christen	8651ec35fe	turned author_s into the multi-valued field author_sxt	2013-01-24 18:24:31 +01:00
Michael Peter Christen	4735bd47f4	- changed solr commit call and added an optimize option. Since Solr 4.0.0 there is a new softcommit feature which implements a near-real-time (NRT) search option. The softcommit does not do IO and does not cause performance issues. YaCy has now an extension in its solr connectors to use the softcommit feature. The softcommit call now replaces all places where a hard commit was used. Furthermore the commit strategy when doing a search from the web interface was changed (it's done every time before a search is done). The softcommit feature was implemented because it was needed for the following changes (customer demands), which are also included in this git commit: - added a feature to identify all documents which have unique titles and/or unique descriptions. These unique flags are disabled by default. - added also a feature to set a flag when the url from a canonical tag is equal to the document url. This is also disabled by default. To support the new softcommit strategy, the commitWithinMs option was set to -1 to disable automatic commit based on document insert times. If documents are inserted permanently then also a commit would happen permanently whenever the commitWithinMs time is reached. This would conflict with the regular autocommit of 10 minutes and the new softcommit strategy.	2013-01-23 14:40:58 +01:00
Michael Peter Christen	db024a4e19	added new solr fields (unused yet; implementation will follow)	2013-01-21 18:02:29 +01:00
Michael Peter Christen	9b5bdae1b4	Reverted setting of MMapDirectoryFactory from solrconfig; see http://forum.yacy-websuche.de/viewtopic.php?p=27509#p27509 Instead, the start script checks if the host is a 64-bit host and, if so, sets -Dsolr.directoryFactory=solr.MMapDirectoryFactory as java option. Reverted the ramBufferSizeMB setting (this was not enabled anyway) because that may be too much memory for small peers and embedded systems. Activated the mergeFactor 4; this was commented out by mistake	2013-01-21 17:55:28 +01:00
orbiter	eb68a30947	solr performance settings: the target of these performance settings is the reduction of IO in general and during search in particular. - reduced mergeFactor to 4. This will increase the IO during indexing, but will reduce IO during search. It will also greatly reduce the number of open files which should make it possible to have overall larger indexes until the number of open files in an OS is reached. - increased ramBufferSizeMB to 256mb. This will reduce the number of commits. This change may compensate the reduction of the mergeFactor. - disabled updateLog. This is a real-time search feature which is available in YaCy anyway because a commit is forced if index.html is called. The updateLog feature causes a lot of IO during indexing and search and produced a lot of files in SEGMENTS/solr_40/data/tlog	2013-01-19 11:21:33 +01:00
Michael Peter Christen	f53703df62	using MMapDirectoryFactory as solution for ClosedChannelException given in https://issues.apache.org/jira/browse/SOLR-2247	2013-01-16 14:35:37 +01:00
Michael Peter Christen	22c694f906	activated the clickdepth_i attribute for solr again because the calculation of that value is not as extensive as expected and furthermore the value is very useful for ranking	2013-01-05 01:00:18 +01:00
Michael Peter Christen	5a0eb1b268	clickpath should not be active by default because it needs extensive computation - partly to be implemented	2013-01-03 01:30:05 +01:00
Michael Peter Christen	5c0c56cfe1	Preparations to produce a click depth attribute in the search index. This attribute can be used for ranking and for other purposes (demand by customer) The click depth is computed in two steps: - during indexing the current fill-state of the reverse link index is used to backtrack the current page to the root page. The length of that backtrack is the clickdepth. But this does not discover the shortest click depth. To get this, a second process to check again is needed - added a process tag that can be used to do operations on the existing index after a crawl; i.e. calculation of the shortest clickpath. Added a field to control this operation but not a method to operate on this. - added a visualization of the clickpath length in the host browser	2013-01-02 20:55:43 +01:00
Michael Peter Christen	295884fd54	- Merge commit '168b1d130d9d67b5e8855a0b50c4ba7ad4a416f8' - fixed conflict in htroot/yacysearch.java - removed nedres check because that causes that the remote server is not called at all in most cases (local index has already results but we want more) - fixed a regex bug (a '=' too much)	2013-01-02 15:08:07 +01:00
reger	168b1d130d	Adding heuristic to get search results from configured systems which support opensearch specification - any system supporting opensearch specification can be configured - search query is only forwarded to remote system if not enough results available on local peer - discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config - sample config file with some general search engines with opensearch support	2012-12-29 08:24:48 +01:00
reger	7761b60325	fix: Broken Link on Crawler_p.html - issue 218 http://bugs.yacy.net/view.php?id=218 - reduced Solr logging (/select)	2012-12-29 04:53:20 +01:00
reger	e9e0d63897	Add config option to show HostBrowser link in search result - ConfigPortal: added checkbox Host Browser - yacy.init: added search.result.show.hostbrowser as default = on (true) - fix HostBrowser: broken link to protected WebStructurePicture for public user	2012-12-27 10:01:10 +01:00
Michael Peter Christen	98819ec3d9	use solr boost configuration to select search fields. At this time it is possible to enter a negative boost value to switch that value off. This might be different in the future with a better input interface.	2012-12-27 03:17:45 +01:00
Michael Peter Christen	01200f06cc	using the author field as solr-native facet. this makes it necessary to introduce a copy-field for the author field to be copied to a string field. This field is then used to generate facets. Without this field, the facet would consist only of the words of the author names, not of the full author string.	2012-12-19 01:56:33 +01:00
Michael Peter Christen	eac9650b31	added another solr field clickdepth_i which reflects the number of clicks which are necessary to get from the portal of a host to a specific document. At this time, only the start document is flagged with clickdepth '0', all other with '-1'. To get the actual clickdepth, a process must use crawled information to collect the actual number of clicks. This will be added in another/next step.	2012-12-18 17:20:42 +01:00
Michael Peter Christen	1052263af3	- added a new solr field references_i which stores the number of INCOMING links to the corresponding web page. This information is taken from the reverse link index (a 'little sister' of the RWI index). - this field can be of use to enhance the ranking because a web page with more incoming links can be more important than others. But this is not true for typical link pages like menus. Therefore the number of outgoing links is needed. - added a new solr attribute 'bf' to solr queries which is a boost function extension. this field can contain a formula which computes the boost according to given field values. After some experiments the following formula is now default: div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4 This takes the number of references and the inbound links. Further experiments are needed to enhance that formula.	2012-12-18 14:42:35 +01:00
Michael Peter Christen	72f165d58b	added a Boost class which stores solr query boost values. The class can be configured using the yacy.init file. The boost information is taken from the configuration each time when a query to solr is done.	2012-12-02 16:54:29 +01:00
Michael Peter Christen	ea033f8f8e	added number of characters in url to default index to be able to use this field for ranking	2012-12-02 16:53:02 +01:00
Michael Peter Christen	efd2c4622d	added a new fail type attribute for the index to distinguish two separate fail types: network fail and forced exclusion (i.e. by robots or forwarding rules).	2012-11-23 14:00:30 +01:00
Michael Peter Christen	d6b82840f8	added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignature. As a result, a signature of the document is written to the solr search index. Additionally for each time when a signature is written, it is checked if the signature exists already in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because here the first time a long number is used as value in the Solr store, also the value naming in the YaCySchema had to be adapted and normalized. This caused that many files had to be changed.	2012-11-21 18:46:49 +01:00
reger	328ce0b297	fix: remove fixed individual testing IP (85.25.151.30 = server4you.de) from default/yacy.network.freeworld.unit	2012-11-11 21:19:18 +01:00
Michael Peter Christen	e2c4c3c7d3	migration to solr 4.0.0	2012-11-02 12:29:48 +01:00
sixcooler	2d972f289a	raise commitWithinMs to default-value from SwitchBoard (results in lower hd-io); no dots in memory-graph (there are too many of them)	2012-10-26 02:12:45 +02:00
Michael Peter Christen	1baf498d59	- show more lines in online log - reverse order is default now	2012-10-25 18:38:39 +02:00
sixcooler	206e7bcf94	whitelist yacyportalsearch aka search.yacy.net	2012-10-23 03:49:27 +02:00
Michael Peter Christen	43f3345c90	- removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code	2012-10-16 18:11:57 +02:00
Michael Peter Christen	7e3e45fd04	added Open Graph Metadata default fields, see http://ogp.me/ns#	2012-10-09 17:28:48 +02:00
Michael Peter Christen	c3e5f667a7	added schema.org breadcrumb counter to parser and solr schema	2012-10-09 13:02:43 +02:00
Michael Peter Christen	42e525ca9a	enhanced the host browser	2012-10-08 14:00:14 +02:00
sof	5cb244b79b	Merge remote branch 'origin/master'	2012-10-05 18:54:39 +02:00
apfelmaennchen	88b062210c	Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based on the jaudiotagger library. The parser is disabled by default as it needs to store temporary files for non file:// protocols, which might be disliked. For your local MP3-collection it loads nicely Artist, Title, Album etc. from the audio files meta data.	2012-10-05 18:54:26 +02:00
Michael Peter Christen	3d33a5bdf6	turned the synonyms_t Text field into a multi-valued String field synonyms_sxt	2012-10-02 11:13:06 +02:00
Michael Peter Christen	3b959ee002	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-10-02 10:14:09 +02:00
orbiter	3190347814	added a synonyms_t field to solr and a process to read synonym files. This can be used to add another stemming to solr using stemming files that are expressed as synonyms for grammatical alternatives. The synonym/stemming files must have the following form: - each line is a comma-separated list of synonyms - the list of synonyms may be enclosed with {} (like the GSA synonyms file) - the file may contain comments which are lines starting with a '#' The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and are activated by default whenever a synonym file is in place. Then, for each word that is found in a document all synonyms are added to a long text field which is stored into synonyms_t. Processes using the synonyms must query with that field as optional matcher.	2012-10-02 00:02:50 +02:00
Michael Peter Christen	411d0e839b	added an underline text field to solr to record all underlined texts	2012-10-01 14:16:49 +02:00
Michael Peter Christen	f45f7fc12e	added new Host Browser to main menu: this new search interface is something completely new for search, but completely common on desktops: browse a web space like one would browse a file system in a file browser. The file listing is created using the search index and a faceted restriction to specific domains.	2012-09-28 22:45:16 +02:00
Michael Peter Christen	80edd8ecd7	some more after-refactoring fixes	2012-09-28 10:24:57 +02:00
Michael Peter Christen	562183932b	- removed ip_s from default profile since that needs a DNS lookup to create a document entry. This makes remote search much slower. - removed synchronization of add method if ip_s is activated to prevent that a user configuration causes bad behavior. The disadvantage of that is that an index dump can cause data loss if an indexing is running during index dump - caught more exceptions and more NPE - better abstraction in MirrorSolrConnector - slight performance enhancement when only the index count is requested (rows=0 is sufficient to get a total count)	2012-09-26 13:38:04 +02:00
Michael Peter Christen	0504b01bdc	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-09-14 00:48:17 +02:00
orbiter	9413f77b65	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-09-13 23:54:26 +02:00
orbiter	a55e77a115	added twitter search heuristic	2012-09-13 23:53:53 +02:00
Michael Peter Christen	62add1d564	added the protocol and the file name extension to the solr fields since these fields are probably facets in file search	2012-09-11 22:46:39 +02:00
Michael Peter Christen	9db032664e	activate two solr fields which will be used by administration interface (later)	2012-09-11 20:15:54 +02:00
Michael Peter Christen	10b911eed4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-09-07 22:07:02 +02:00
Michael Peter Christen	be67c70a47	added Solr fields: inboundlinks_text_chars_val inboundlinks_text_words_val inboundlinks_alttag_txt outboundlinks_text_chars_val outboundlinks_text_words_val outboundlinks_alttag_txt	2012-09-07 22:06:51 +02:00
orbiter	d73fff0e0e	added solr field images_withalt_i	2012-09-07 21:33:45 +02:00
Michael Peter Christen	ee23fc7a32	added h1..h6 counter fields	2012-09-04 14:11:11 +02:00
Michael Peter Christen	b2b516cc3e	added a collection attribute to crawls and searches: - a solr field collection_sxt can be used to store a set of crawl tags - when this field is activated, a crawl tag can be assigned when crawls are started - the content of the collection field can be comma-separated, all of them are assigned to the documents when they are indexed as result of such a crawl start - a search result can be drilled down to a specific collection; this is currently only available in the solr interface and also in the gsa interface using the 'site' option - this adds a mandatory field for gsa queries (the google api demands that field all the time)	2012-09-03 15:26:08 +02:00
Michael Peter Christen	528d6763fa	- added new solr fields: title_count_i, title_chars_val, title_words_val description_count_i, description_chars_val, description_words_val - added many asserts to ensure data type correctness from YaCy to Solr and vice versa - made many fixes according to new findings from these asserts (!)	2012-08-31 10:30:43 +02:00
Michael Peter Christen	2ddc33646a	added new fields for solr: url_paths_sxt url_parameter_i url_parameter_key_sxt url_parameter_value_sxt url_chars_i	2012-08-29 16:11:23 +02:00
Michael Peter Christen	75d5e3475d	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-08-29 10:13:51 +02:00
cominch	dc468dad01	add content control features for custom filter lists	2012-08-29 09:04:28 +02:00
Michael Peter Christen	316b5fe116	- added a solr type definition verifier - fixed type definition found by the verifier - added multivalue-string fields for solr with extension 'sxt' - added multivalue-integer fields for solr with extension 'val' - renamed some solr attributes from txt to sxt - changed solr query line to an explicit AND/OR structure - added a country code second level domain list to Domains class; with parser - added a host string parser to get domain class name, country-code second-level domain and subdomain out of it - removed old coordinate attributes	2012-08-28 16:58:06 +02:00
Michael Peter Christen	4c79ddb91e	switched off some solr logging	2012-08-27 14:41:47 +02:00
Michael Peter Christen	e8acd542b5	- added faceted drill-down for host and geolocation to solr queries - added a new geolocation field to index schema, the old values are migrated if possible	2012-08-27 14:41:33 +02:00
Michael Peter Christen	af764c106c	re-activated audio and video search because they obviously work (!)	2012-08-22 01:56:13 +02:00
orbiter	716ea0cfe2	sorted the solr schema into mandatory and optional fields; reduced number of used fields to reduce solr index size	2012-08-21 23:52:56 +02:00
orbiter	db6863db77	reduced solr cache sizes to check if that solves memory problems a bit	2012-08-18 13:45:37 +02:00
Michael Peter Christen	23226676c6	FOR THE BRAVE.. this is a forced migration to solr which is now ready for production as a replacement of the metadata-db. This intermediate release 1.041 will switch on the previously optional solr index and the old metadata-db will still work as it did before. Solr+metadata are accessed in mixed mode, no migration is done yet. If this causes not a catastrophe until the end of the weekend, we will do a YaCy 1.1 main release containing this as default.	2012-08-16 18:17:47 +02:00
Michael Peter Christen	a1b2c9a67d	doctype2mime fix, influences metadata conversion between old metadata and solr	2012-08-16 17:49:35 +02:00
Michael Peter Christen	703f427303	fixed some peer-ping connection details - larger time-out - removed too old seedlist - fixed a bug in connection test	2012-08-16 17:11:54 +02:00
Michael Peter Christen	ea49a8aa8c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-08-14 12:40:44 +02:00
Michael Peter Christen	aab0b680c3	- added xslt support for solr result formats. try e.g. http://localhost:8090/solr/select?q=:&start=0&rows=10&wt=xslt&tr=json.xsl - added servlet-side mime-type configuration for streamed servlets. this is used for the result formatters in solr result formats	2012-08-14 11:12:50 +02:00
cominch	e2119f4e76	augmented browsing: replace htmlparser by jsoup, which is more stable and reliable	2012-08-14 10:06:12 +02:00
Michael Peter Christen	b51df6c7e8	- added coordinate storage in solr schema - fixed shutdown process - fixed some solr-to-metadata reading - added a large number of metadata attributes in ViewFile.html	2012-08-13 10:40:04 +02:00
Michael Peter Christen	f9c0e6e950	- Implemented and integrated the URIMetadataNode object which is a metadata representation from the solr index. This shall replace metadata from the built-in database in the future. - added the Solr-driven metadata into the search index of YaCy which makes it now possible to run YaCy without the old metadata index. This is a major step forward to a full migration to Solr.	2012-08-10 13:26:51 +02:00
Michael Peter Christen	bca4a16603	replaced the multivalue generic string field name suffix _ss by _txt because _ss is not part of the standard solr example schema.	2012-08-06 17:58:09 +02:00
orbiter	67edfd991c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2012-08-05 15:49:48 +02:00
orbiter	d9173ba7ed	added more solr fields to integrate values from URIMetadataRow. All writes to the Metadata-DB are now also made to solr. This includes metadata transfer during search and rwi transfer. The new/added solr fields are: load_date_dt (time when resource was loaded); fresh_date_dt (date until resource shall be considered as fresh); host_id_s (id of the host, a 6-byte hash that is part of the document id); referrer_id_ss (ids of referrer to this document); md5_s (the md5 of the raw source); publisher_t (the name of the publisher of the document); language_ss (the language used in the document, starts with primary language); ranking_i (an external ranking value); size_i (the size of the raw source); audiolinkscount_i (number of links to audio resources); videolinkscount_i (number of links to video resources); applinkscount_i (number of links to application resources)	2012-08-05 15:49:27 +02:00
Michael Peter Christen	3ce04cecf3	bad hack to prevent a bug appearing in solr	2012-07-31 23:49:07 +02:00
Michael Peter Christen	826967513b	changed options in IndexFederated_p to switch on/off parts of the index individually. The settings are experimental and the values of the settings will be overwritten when an index migration from urldb to solr starts.	2012-07-23 16:28:39 +02:00
Michael Peter Christen	1517a3b7b9	added webm mime-type	2012-07-08 17:59:20 +02:00
Michael Peter Christen	0301aba1e9	removed unused method parameters	2012-07-05 10:23:07 +02:00
Michael Peter Christen	4de50fe808	adding more principal peers for bootstrapping	2012-07-05 00:43:41 +02:00
reger	067728bccc	add search result heuristic. adding a crawl job with depth-1 for every displayed search result (crawling every external linked page of displayed search result pages)	2012-07-01 00:12:20 +02:00


712 Commits