yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	2013-07-17 18:31:30 +02:00
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based logger tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is a add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	2013-07-09 14:28:25 +02:00
sixcooler	e5abccdfe4	added optimize-option	2013-06-28 14:51:37 +02:00
Michael Peter Christen	281959a2d7	added option to re-boot the embedded solr during run-time. Added also API recording for this method so it can be repeated automatically. The index dump generation is now also available for API recording. Added some synchronization in backend which was necessary for this.	2013-05-29 13:09:34 +02:00
Michael Peter Christen	b24d1d18e4	removed synchronization and concurrency in Fulltext class, concurrent deletions are now handled in ConcurrentUpdateSolrConnector	2013-05-11 10:53:12 +02:00
Michael Peter Christen	f7f3e28c5e	prevent that the size of the index is computed too many times. Because the index size is now provided by solr, and the only way to do that is a match for [* TO *], a size computation is quite complex and time-consuming. Therefore this patch prevents that the method is called at all and if necessary puts a DOS-preventing barrier in front of it.	2013-05-08 11:50:46 +02:00
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr field in the core 'webgraph'. The default schema uses only some of them and the resting search index has now the following properties: - webgraph size will have about 40 times as much entries as default index - the complete index size will increase and may be about the double size of current amount As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also avaiable for p2p operation but client access is not yet implemented.	2013-02-22 15:45:15 +01:00
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented to the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of an solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding ontop of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; the must retrieve the actuall connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	2013-02-21 13:23:55 +01:00
Michael Peter Christen	0fe7b6fd3b	migrated the index export methods from the old metadata to solr. Now exports are done using solr queries. removed superfluous methods and servlets.	2013-01-24 12:39:19 +01:00
Michael Peter Christen	38d3feae65	added separate delete commands for the local+remote solr index, the old metadata and old rwi and for the citation index. The important advancement is the separation of the citation index deletion because that index is responsible for the linkdepth calculation. Now a search index can be deleted without the citation index and that should cause that less clickdepths must be post-processed.	2013-01-04 16:39:34 +01:00
Michael Peter Christen	4eab3aae60	removed overhead by preventing generation of full search results when only the url is requested	2012-11-23 01:35:28 +01:00
Michael Peter Christen	5fd3b93661	added deletion of hosts during crawl start if deleteold option was given	2012-11-13 16:54:28 +01:00
orbiter	354ef8000d	- added 'deleteold' option to crawler which causes that documents are deleted which are selected by a crawl filter (host or subpath) - site crawl used this option be default now - made option to deleteDomain() concurrency	2012-11-04 02:58:26 +01:00
Michael Peter Christen	ccc3760a47	Refactoring and redesign of data architecture to make URIMetadataRow superfluous. The target is to make a solr document as the core of YaCy documents which would cause that many conversions can be removed. On the way to this target the Equivalence of URIMetadataRow and URIMetadataNode had to be removed to expose the usage of the old URIMetadataRow data structure. This refactoring already removes unneccessary conversions and should make memory usage during indexing lower.	2012-10-18 14:29:11 +02:00
Michael Peter Christen	43f3345c90	- removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code	2012-10-16 18:11:57 +02:00
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	2012-10-10 11:46:22 +02:00
Michael Peter Christen	941873fba4	moved the index deletion functions from IndexControlRWIs to IndexControlURLs where it appears more naturally. Because the RWI administration is less important in the presence of Solr, the IndexControlURL is now the default servlet when the Index Administration button on the main menu is selected.	2012-10-10 00:09:27 +02:00
Michael Peter Christen	1533bfd63b	refactoring	2012-09-25 21:20:03 +02:00
Michael Peter Christen	1b474139dd	used the new zip writer/reader to add a solr dump process: the whole solr index can be written to a zip dump and also restored during runtime	2012-09-24 17:05:28 +02:00
Michael Peter Christen	8219a445f3	refactoring	2012-09-21 16:46:57 +02:00
Michael Peter Christen	00c1c777fa	refactoring	2012-09-21 15:48:16 +02:00
Michael Peter Christen	31d4d38804	- extended the solr interface by a references-by-word-count method - reduced danger that a non-existing RWI database causes NPEs - added Solr queries to did-you-mean: this makes it possible that our did-you-mean algorithm works together with only Solr and without RWIs	2012-08-31 13:03:00 +02:00
orbiter	780f8974e7	added ramaining iteration methods for solr in fulltext class	2012-08-18 15:39:14 +02:00
Michael Peter Christen	0cab06c47c	refactoring	2012-08-17 15:52:33 +02:00
Michael Peter Christen	40c0856489	refactoring	2012-08-17 15:33:02 +02:00
Michael Peter Christen	18f989dfb1	- refactoring (load -> getMetadata) - added getDocument to retrieve Solr documents which shall replace getMetadata	2012-08-17 01:34:38 +02:00
orbiter	69e743d9e3	- more abstraction for the RWI index as preparation for solr integration - added options in search index to switch parts of the index on or off	2012-07-22 13:18:45 +02:00
orbiter	0cbda0b2b8	- replaced all length() == 0 and size() == 0 with isEmpty() - replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be done automatically - implemented some isEmpty() methods	2012-07-10 22:59:03 +02:00
Michael Peter Christen	d3964253ae	- added @SuppressWarnings to unused servlet method parameters - removed unnecessary casts - removed unnecessary throw statements	2012-07-05 09:14:04 +02:00
Michael Peter Christen	03280fb161	removed segments-concept and the Segments class: the segments had been there to create a tenant-infrastructure but were never be used since that was all much too complex. There will be a replacement using a solr navigation using a segment field in the search index.	2012-06-28 14:27:29 +02:00
Michael Peter Christen	f294f2e295	bugfix to http://bugs.yacy.net/view.php?id=181 tried to make a bit less 'noise' to dns server also included: less processes in snippet fetch to reduce load during search on small computers	2012-05-19 01:06:33 +02:00
Michael Christen	9e5894c784	Removed handling of components objects for URIMetadataRows. This is a preparation to replace this rows with nodes from the node store.	2011-12-17 01:27:08 +01:00
orbiter	5a55397f99	some last-minute performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-25 11:23:52 +00:00
orbiter	d2ea250d99	refactoring: - moved many classes from de.anomic to net.yacy - made more sub-packages for search classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7973 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-25 16:59:06 +00:00
orbiter	fabda9ad31	added script that can be used to delete a single url from the index call: bin/deleteurl.sh <url> git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7969 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-21 23:33:44 +00:00
orbiter	115abc8917	- more attributes for search progress bar - moved cache strategy to cora package git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7778 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-06-13 21:44:03 +00:00
orbiter	4bea3f9714	hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources: used a ASCII String <-> byte[] conversion wherever possible. Many Strings in YaCy are hashes which are pure ASCII (base64 hashes). The new ASCII String <-> byte[] conversion method have less computation overhead than the UTF8 conversion. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7746 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-27 08:24:54 +00:00
orbiter	10e2f588f8	- enhanced ybr ranking computation - many speed/performance hacks - added solr charding and new charding web interface - added option to switch off the yacy index when using solr - added new fail-url categories which are used to make a distinction which fail-urls to be sent to solr - refactoring/renaming of some method names to distinguish host/url hashes better - a large number of bug/npe fixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7738 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-26 10:57:02 +00:00
orbiter	5b579e21a3	code cleanup git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7713 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-13 06:21:40 +00:00
orbiter	cb1f49d0f2	replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7558 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-03-07 20:36:40 +00:00
orbiter	5e186e0122	continuing the fight against deadlocks during time formatting: better caching. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7531 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-02-25 21:11:53 +00:00
lotus	9f6e3f9793	fix: same number format display before and after get/post git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7476 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-02-10 09:08:44 +00:00
orbiter	10ae8d961b	- cora package has now no dependencies to other yacy packages and becomes a 'base' package (refactoring) - cleaned up (removed special code and documentation for 27c3) - added remote search functions to be used within cora git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7420 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-01-03 20:52:54 +00:00
f1ori	7d8de34778	* add a bit documentation to DigestURI, use DigestURI(string) instead of DigestURI(string, null) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7276 6c8d7289-2bf4-0310-a012-ef5d649a1542	2010-10-26 16:10:20 +00:00
orbiter	9d080f387e	change in handling of the all-visible home path for storage in YaCy: the home path can now be distinguished between - data home; the path where the DATA directory is created - application home; everything else This will make it possible to store application data on Mac releases within the ~/Library/YaCy directory; a place where Mac applications write their data. Similar techniques will be possible for debian and windows. To use the new data path, YaCy can be started with -start <data path> or -gui <data path> git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7092 6c8d7289-2bf4-0310-a012-ef5d649a1542	2010-09-02 19:24:22 +00:00
orbiter	3197ca42ed	preparations to move the HTCache into cora: - move the header framework classes to cora - move the ARC caching classes to cora - refactoring of code to call these classes from cora git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7068 6c8d7289-2bf4-0310-a012-ef5d649a1542	2010-08-23 12:32:02 +00:00
orbiter	7bcfa033c9	more abstraction of the htcache when using the LoaderDispatcher: a cache access shall not made directly to the cache any more, all loading attempts shall use the LoaderDispatcher. To control the usage of the cache, a enum instance from CrawlProfile.CacheStrategy shall be used. Some direct loading methods without the usage of a cache strategy have been removed. This affects also the verify-option of the yacysearch servlet. If there is a 'verify=false' now after this commit this does not necessarily mean that no snippets are generated. Instead, all snippets that can be retrieved using the cache only are presented. This still means that the search hit was not verified because the snippet was generated using the cache. If a cache-based generation of snippets is not possible, then the verify=false causes that the link is not rejected. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6936 6c8d7289-2bf4-0310-a012-ef5d649a1542	2010-06-21 14:54:54 +00:00
orbiter	25aef069a6	continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6790 6c8d7289-2bf4-0310-a012-ef5d649a1542	2010-04-08 00:11:32 +00:00
orbiter	1e8e79b9ef	redesign of reference hash (URL-hash) parameter hand-over: pass value as byte[], not as String. This should cause that less byte[] <-> String conversions are made during time-critical tasks. This redesign is not yet complete, more to come .. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6775 6c8d7289-2bf4-0310-a012-ef5d649a1542	2010-03-26 18:33:20 +00:00
orbiter	3300930fc5	- (almost) fixed FTP crawler - integrated/fixed SMB crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6742 6c8d7289-2bf4-0310-a012-ef5d649a1542	2010-03-11 15:43:06 +00:00

1 2

99 Commits