yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented to the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of a solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding on top of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; they must retrieve the actual connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	2013-02-21 13:23:55 +01:00
Michael Peter Christen	b6de1f42dc	Full redesign of solr connection architecture. This was done to support multiple solr cores instead of just one. Therefore it is now necessary to distinguish between solr server connections (called an 'Instance') and a connection to a single solr core. One Instance may now have multiple connector classes assigned to it, each connecting to a single core. To support multiple cores it is also necessary to distinguish between the connection configuration and the configuration of the index schema. We will have multiple schema configurations in the future, one for each solr core. This required splitting the IndexFederated servlet into two parts; the new servlet for the schema editor is now the IndexSchema servlet.	2013-02-15 01:38:10 +01:00
Michael Peter Christen	0b6566a389	optimizations when starting large crawl requests with many start urls in one request: - allow larger match-fields in html interface - delete all host hashes at once from zurl - when deleting by host, do not count size of deleted entries since that was the reason it took so long	2013-01-31 13:15:28 +01:00
Michael Peter Christen	0fe7b6fd3b	migrated the index export methods from the old metadata to solr. Now exports are done using solr queries. removed superfluous methods and servlets.	2013-01-24 12:39:19 +01:00
Michael Peter Christen	9ccdd21d76	Merge remote-tracking branch 'aleksejs/fixtrans' Conflicts: locales/ru.lng Tried to merge this but I had to make this 'blind'. Sorry if I deleted something that was right.	2013-01-22 11:54:38 +01:00
Michael Peter Christen	af465cdca5	fix for wrong robots.txt loading for https protocol see also: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4579	2013-01-16 17:38:06 +01:00
Michael Peter Christen	38d3feae65	added separate delete commands for the local+remote solr index, the old metadata and old rwi and for the citation index. The important advancement is the separation of the citation index deletion because that index is responsible for the linkdepth calculation. Now a search index can be deleted without the citation index, so fewer clickdepths should need to be post-processed.	2013-01-04 16:39:34 +01:00
reger	276e63401e	small sanitary fixes - exclude unix shell scripts in NSIS windows install archive - replace link to env/grafics/yacy.gif to yacy.png (build.nsi) - remove unused code lines (Blacklist_p, Response, WordReferenceVars) - type & xhtml (RankingSolr_p.html)	2013-01-02 01:59:47 +01:00
Michael Peter Christen	8f3bd0c387	fix for smb crawl situation (lost too many urls)	2012-12-26 19:15:11 +01:00
Michael Peter Christen	7c3de8b4cd	- fix for localhost detection - added IPv6 patterns for localhost detection	2012-12-18 12:52:20 +01:00
orbiter	712cc37c40	if maxFileSize < 0 then the file size is unlimited.	2012-12-10 21:17:45 +01:00
Michael Peter Christen	a3cd3852ab	introduced a better place to update the lastacc time value in latency	2012-12-07 15:49:23 +01:00
Michael Peter Christen	864abcd33d	removed Latency update after URL selection because that causes a completely wrong behaviour when cache fresh cases appear. Makes re-crawling MUCH faster!	2012-12-07 15:35:44 +01:00
Michael Peter Christen	dd241d03bb	latency fix: only set last-visit time if access was actually by the robot	2012-12-07 02:00:12 +01:00
Michael Peter Christen	1e002ab18e	added another blacklist-cleaner into balancer	2012-12-07 01:27:24 +01:00
Michael Peter Christen	10527e28ae	fix for wrong display of error urls in HostBrowser	2012-12-07 00:31:10 +01:00
Michael Peter Christen	756772fbd3	fix for waitingtime computation for intranet configuration	2012-12-06 17:40:52 +01:00
Michael Peter Christen	fa27e5820f	- check blacklist (again) when taking urls from the crawl stack because the blacklist may get extended during crawling - removed debug output	2012-12-06 00:12:16 +01:00
Michael Peter Christen	72f165d58b	added a Boost class which stores solr query boost values. The class can be configured using the yacy.init file. The boost information is taken from the configuration each time when a query to solr is done.	2012-12-02 16:54:29 +01:00
Michael Peter Christen	3de784c8dd	replaced more split and replaceAll missing pattern pre-compilation with pre-compiled pattern	2012-11-26 13:40:53 +01:00
Michael Peter Christen	d48e9788d2	enhanced search result processing behavior - query less at one time; query more often - in between the small queries, evaluate results - remove fields from search results which are not needed	2012-11-26 12:24:35 +01:00
Michael Peter Christen	eca68fa197	added debug code to crawler monitor	2012-11-25 15:43:42 +01:00
orbiter	5aa5202adf	fixes for filesystem indexing	2012-11-24 10:27:29 +01:00
Michael Peter Christen	efd2c4622d	added a new fail type attribute for the index to distinguish two separate fail types: network fail and forced exclusion (i.e. by robots or forwarding rules).	2012-11-23 14:00:30 +01:00
Michael Peter Christen	5e182a566f	- added another enumeration method in kelondro data structure to get a more random access to data for the balancer - added random access inside the balancer	2012-11-23 13:58:39 +01:00
Michael Peter Christen	d6b82840f8	added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignature. As a result, a signature of the document is written to the solr search index. Additionally, each time a signature is written, it is checked if the signature exists already in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because this is the first time a long number is used as a value in the Solr store, the value naming in the YaCySchema also had to be adapted and normalized. As a result, many files had to be changed.	2012-11-21 18:46:49 +01:00
Michael Peter Christen	f5ca5cea44	- added field options to all solr queries. This can be used to restrict the actual data which is fetched from solr. - used the new field options to reduce generic options like getting the load date or the count of search results. should increase overall speed - used the new field options to reduce overhead in the host browser during acquisition of links. - used the field options to make checking of links in crawler faster - if the crawler is paused, the crawl queue is not cleaned	2012-11-19 17:24:34 +01:00
cominch	d2a94cc55e	refactor package	2012-11-09 16:22:24 +01:00
Michael Peter Christen	8b1c9cba3d	fixed a problem with non-terminating crawls	2012-11-07 15:05:44 +01:00
Michael Peter Christen	71ed8e5e07	bugfixes for crawler	2012-11-07 12:52:19 +01:00
Michael Peter Christen	158732af37	automatically delete entries from the crawl profile list if crawl is terminated.	2012-11-07 02:03:44 +01:00
Michael Peter Christen	d481abd087	added the visualization of error-urls to host browser - only visible for admins - a faceted search generates a huge list for all hosts in the host list - the faceted search algorithms had to be modified for that - within the browsing of the directory path, the error cause is written to the url which is presented as error-url - the errors are also accumulated for directory sums	2012-11-06 00:29:37 +01:00
Michael Peter Christen	791e1dcfdf	when a new crawl is started, delete all entries about error-urls for crawl-start domains	2012-11-05 22:14:27 +01:00
orbiter	354ef8000d	- added 'deleteold' option to crawler which causes that documents are deleted which are selected by a crawl filter (host or subpath) - site crawl uses this option by default now - added an option to run deleteDomain() concurrently	2012-11-04 02:58:26 +01:00
Michael Peter Christen	75dd706e1b	update to HostBrowser: - time-out after 3 seconds to speed up display (may be incomplete) - showing also all links from the balancer queue in the host list (after the '/') and in the result browser view with tag 'loading'	2012-11-02 13:57:43 +01:00
Michael Peter Christen	0716a24737	added more / all new crawl profile fields into crawl profile editor	2012-10-31 15:13:05 +01:00
Michael Peter Christen	4a14122ba7	in case that a crawl profile has a collection assigned, use the collection to show a name in the web interface. This should prevent overly long names from making the interface unusable.	2012-10-31 14:08:33 +01:00
Michael Peter Christen	0fe8be7981	enhanced data structures for balancer and latency computation which should produce a bit better prognosis about forced waiting times.	2012-10-30 17:30:24 +01:00
Michael Peter Christen	ac9540dfb6	removed options for stopwords which are not used	2012-10-30 12:36:36 +01:00
Michael Peter Christen	b2ffd49817	less latency	2012-10-30 12:26:32 +01:00
Michael Peter Christen	0833937c1c	better balancing and duetime-computation also for no-delay intranet hosts	2012-10-30 11:28:49 +01:00
Michael Peter Christen	c326aa8f67	disabled writing new entries to crawl stacks to prevent that a domain with many documents blocks refreshing of the crawl queue	2012-10-29 22:26:52 +01:00
Michael Peter Christen	c25d7bcb80	- added concurrency for robots.txt loading - changed data model for domain counter	2012-10-29 21:08:45 +01:00
Michael Peter Christen	a87811bc38	more auto-commit calls when a search interface is opened, but not when a search is done there to prevent blocking during search-time.	2012-10-29 11:27:13 +01:00
Michael Peter Christen	2d9e577ad0	replaced the custom robots.txt loader by the standard http loader	2012-10-28 22:48:11 +01:00
Michael Peter Christen	a33e2742cb	- removed unnecessary synchronized and deadlock in crawler - removed problem with monitoring object on Balancer.wait - added missing user agent settings	2012-10-28 19:56:02 +01:00
orbiter	8952153ecf	update to Balancer algorithm: - create a load list from the current list of known hosts - do not create this list for each Balancer.pop access - create the list from those hosts which have a zero-waiting time - select 1/3 from that list which have the most urls waiting - get hosts from the waiting list in random order - fixes for some delta-time computations - always load all urls from hosts which have never been loaded before	2012-10-28 13:24:49 +01:00
Michael Peter Christen	85ca07b90e	when a new crawl is started, an equal crawl, if still running, is terminated and the corresponding crawl profile is deleted (this also clears the crawl queue entries for that crawl profile)	2012-10-25 10:20:55 +02:00
Michael Peter Christen	ae6feb5610	showing the web structure graph as animation in the crawl monitor	2012-10-23 02:50:26 +02:00
Michael Peter Christen	ccc3760a47	Refactoring and redesign of data architecture to make URIMetadataRow superfluous. The target is to make a solr document the core of YaCy documents, which would allow many conversions to be removed. On the way to this target the equivalence of URIMetadataRow and URIMetadataNode had to be removed to expose the usage of the old URIMetadataRow data structure. This refactoring already removes unnecessary conversions and should make memory usage during indexing lower.	2012-10-18 14:29:11 +02:00
Michael Peter Christen	e5b3c172ff	removed hack which translated Solr documents to virtual RWI entries which had been then mixed with remote RWIs. Now these Solr documents are fed into the result set as they appear during local and remote search. That makes the search much faster.	2012-10-17 17:45:41 +02:00
Michael Peter Christen	43f3345c90	- removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code	2012-10-16 18:11:57 +02:00
Michael Peter Christen	21fe8339b4	- enhanced generation of url objects - enhanced computation of link structure graphics - enhanced collection of data for link structures	2012-10-15 13:17:13 +02:00
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	2012-10-10 11:46:22 +02:00
Michael Peter Christen	53789555b9	fix for crawl start filter	2012-10-10 10:40:32 +02:00
Michael Peter Christen	a06930662c	replaced some more .getBytes() with UTF8/ASCII.getBytes()	2012-10-09 12:14:28 +02:00
Michael Peter Christen	4b5e0c1500	added a url rewriter which can be used to remove session ids from urls	2012-10-09 11:24:48 +02:00
Michael Peter Christen	76d218fbef	fixes to crawl profiles	2012-10-08 10:50:40 +02:00
Michael Peter Christen	1533bfd63b	refactoring	2012-09-25 21:20:03 +02:00
Michael Peter Christen	872f83ebe0	refactoring	2012-09-25 21:04:58 +02:00
Michael Peter Christen	8219a445f3	refactoring	2012-09-21 16:46:57 +02:00
Michael Peter Christen	f879a344e7	fix for no depth limit default value	2012-09-21 16:05:17 +02:00
Michael Peter Christen	00c1c777fa	refactoring	2012-09-21 15:48:16 +02:00


113 Commits