yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	0ceeceb35e	more logic on Solr queries; usage of the query terms in postprocessing, saving one query for double document detection now per document	2014-08-04 02:35:38 +02:00
reger	2b8cc5832c	fix seek error for 0 file size records file by add extra check for file size = 0 in cleanlast() - (http://mantis.tokeek.de/view.php?id=411)	2014-07-06 20:49:01 +02:00
reger	2ba394333f	fix Crawler HostQueue release of stackfile - close stackfile inputstream at end of ChunkIterator This should solve startup delay while unfinished crawl jobs exist (maybe also too many open file situation)	2014-07-06 16:04:30 +02:00
Michael Peter Christen	501d55cd35	removed superfluous assert	2014-06-19 12:10:12 +02:00
Michael Peter Christen	f0db501630	better handling of ranking parameters and new default values for date navigation which is done using ranking in solr.	2014-05-22 03:01:07 +02:00
Michael Peter Christen	6634b5b737	debug code for index distribution testing	2014-05-21 18:20:16 +02:00
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	2014-05-20 21:50:16 +02:00
orbiter	88f4af90da	removed warnings	2014-05-13 22:27:31 +02:00
orbiter	89f76da24b	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-05-06 05:38:38 +02:00
sixcooler	b8cee9b7d8	remove tables from tabletracker on close to avoid lots of dead entries in /PerformanceMemory_p.html	2014-05-02 22:55:47 +02:00
orbiter	f15c832587	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-04-30 07:42:52 +02:00
reger	ffc5b75c73	optimize and fix lat / lon assignment	2014-04-27 20:52:06 +02:00
reger	9313447de2	reimplement tighter lat/lon calc in URIMetadataNode from old MetadataRow, considering http://mantis.tokeek.de/view.php?id=272	2014-04-27 18:20:33 +02:00
orbiter	a3542f29b4	npe fix	2014-04-25 09:26:20 +02:00
orbiter	c48d2a2a02	npe fix	2014-04-25 09:23:10 +02:00
orbiter	12ba890205	removed warnings	2014-04-22 19:35:15 +02:00
reger	727dfb5875	refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex	2014-04-20 01:41:30 +02:00
Michael Peter Christen	1aea01fe5b	fix for Table in case that requested file does not exist and paths also do not exist	2014-04-17 12:44:05 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create tens of thousands of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10,000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
Michael Peter Christen	17e0956312	refactoring of SystemLoad calls (only one backend tool)	2014-04-11 09:25:18 +02:00
reger	227c42bc96	eliminate obsolete URIMetaDataRow class by joining it with/into URIMetaDataNode.	2014-04-03 00:35:15 +02:00
Michael Peter Christen	62a36fa584	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-04-02 03:27:08 +02:00
reger	c9f92abddc	fix: application link count (URIMetadataNode)	2014-04-02 03:21:51 +02:00
Michael Peter Christen	5b83887da8	npe fix	2014-04-02 02:34:55 +02:00
Michael Peter Christen	56710ecb26	prevent opening of new files as that could be a cause for the latest too-many-open-files exception. The old file is just truncated if the table is cleaned.	2014-03-28 14:31:43 +01:00
Michael Peter Christen	8b44fcf0f4	added missing @Override annotation	2014-03-28 13:48:37 +01:00
Michael Peter Christen	1a764135be	one more Thread Dump fix for new bootstrap css style	2014-03-27 23:01:28 +01:00
Michael Peter Christen	bb21d825f9	fix for thread dump line spacing	2014-03-27 22:13:37 +01:00
Michael Peter Christen	5f4a6892c1	enhanced RowSet re-sort limit for small sets	2014-03-05 23:28:19 +01:00
Michael Peter Christen	6ed9c0164e	attaching names to all Threads to get a better view in profiling tools like VisualVM	2014-02-28 15:02:01 +01:00
Michael Peter Christen	fdaeac374a	- enhanced postprocessing speed and memory footprint (by using HashMaps instead of TreeMaps) - enhanced memory footprint of database indexes (by introduction of optimize calls) - optimize calls shrink the amount of used memory for index sets if they are not changed afterwards any more	2014-02-28 14:01:09 +01:00
Michael Peter Christen	9eb668e951	enhanced the resource observer The resource observer is now able to recognize free disk space AND available space for YaCy. The amount of space which is assigned for YaCy are defined in new settings in the configuration file. Furthermore, there is now a cleanup process which deletes files in case that an autodelete is activated. The autodelete is now BY DEFAULT ON if the disk space is low, which means that YaCy starts to delete documents when the disk is full!	2014-02-12 01:00:44 +01:00
Michael Peter Christen	fbee98c06f	fixed shortcut self-reference bug	2014-02-11 22:14:46 +01:00
Michael Peter Christen	acc8d7faa7	fixed setting of shortMemoryStatus in MemoryControl	2014-02-09 12:25:55 +01:00
Michael Peter Christen	94245ce0a8	fixed "Size in KBytes" calculation in PerformanceQueues_p.html, see http://bugs.yacy.net/view.php?id=362	2014-02-07 17:19:08 +01:00
Michael Peter Christen	ebfaf753b7	- faster initialization of index files - removal of not used space if index files shrink (rare, but possible)	2014-01-28 12:39:58 +01:00
reger	a3e2cca8e9	improve isOlder check to not overwrite node index with metadata on equal load date	2014-01-26 01:00:52 +01:00
orbiter	c351e47a84	fix for bad-formatted lonlat	2014-01-22 21:33:11 +01:00
Michael Peter Christen	c87cdfca2e	do not set a load prerequisite that prevents the start of one-time-jobs	2014-01-22 17:18:53 +01:00
Michael Peter Christen	6ada0daae9	making latency_factor and maximum number of same hosts in loader queue settings available in Crawler_p.html servlet for steering.	2014-01-21 19:28:00 +01:00
sixcooler	40a4030b55	configurable max-load values for YaCy-Threads: try lower values on small systems like a Pi	2014-01-21 17:04:22 +01:00
Michael Peter Christen	1ea17bd9f3	- removed old metadata database and all migration code - refactored all code which uses URIMetadataRow as standard for word hash length and word hash ordering and moved that to the class 'Word', because the class URIMetadataRow defined the old metadata data structure and should be superfluous in the future - removed unused methods from URIMetadataRow as preparation for further removal of that class	2014-01-20 18:31:46 +01:00
Michael Peter Christen	25a6c05008	experimental removal of synchronization. This should work for all cases where the size() and isEmpty() methods are used only for statistics, which happens at many locations in YaCy. If these methods are used for structural reasons (like accessing the last element in an array) then it may fail or cause other problems. As far as visible, this is not the case.	2014-01-19 14:47:11 +01:00
Michael Peter Christen	5695280edd	removed superfluous synchronization	2014-01-19 14:44:58 +01:00
Michael Peter Christen	a1977b7a75	removed debug code	2014-01-19 14:42:26 +01:00
Michael Peter Christen	ec10ed45bd	better logging in logger	2014-01-16 13:08:39 +01:00
Michael Peter Christen	c3dcbdc8d5	try to recover from an OOM during citation index reading and fail-over to second solr core in case of unrecoverable OOM.	2013-11-28 01:10:25 +01:00
Michael Peter Christen	2c39b65409	fixes for searches containing stopwords. The fix was done using a reconstruction of the search word set access method to protect that words are deleted from the sets from the outside of the QueryGoal class.	2013-11-26 02:24:47 +01:00
Michael Peter Christen	191fd3d7e7	added an optimization option to HandleSet mass data storage structure	2013-11-15 15:38:00 +01:00
Michael Peter Christen	1a4a69c226	set more logger to 'final static'	2013-11-13 06:18:48 +01:00
orbiter	3c3cb78555	- removed a lot of garbage and bloated code from GuiHandler. - transformed log lines to String before they are stored because the storage space is about 1:250 (45kb for one line before transformation, 180 bytes afterwards) - this saves up to 10MB RAM so we can increase the number of lines to 1000 again.	2013-10-24 20:42:34 +02:00
Michael Peter Christen	5afa6e3aee	Automatically flush the log cache if a short memory status is reached. For the default of 200 lines this can flush about 10MB.	2013-10-24 17:39:50 +02:00
Michael Peter Christen	6aabc4e5c8	reduced logging line memory, 10000 lines had filled up 450MB! grrr. (thank you, a bomb from the past)	2013-10-24 16:17:53 +02:00
Michael Peter Christen	1b4fa2947d	- fixed a problem which occurred when a document was not recognized with the right content domain (i.e. identifying that it is an image, text etc.) because it used the file extension and not an existing mime type assignment. - fixed the new setting that images shall be loaded for a better image search. - both fixes together make it now possible to crawl commons.wikimedia.org which makes use of 'funny' document names (i.e. ending with .jpg while the document is html)	2013-10-23 00:16:54 +02:00
Michael Peter Christen	7b69c438f7	more methods for the table class	2013-10-15 16:46:59 +02:00
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary to adapt a large portion of the parser and link processing classes to carry a different type of link collection, which carries property attributes that are attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI had been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	2013-09-15 00:30:23 +02:00
Michael Peter Christen	e8e558a9b7	fix for content domain classification in URIMetadataNode	2013-09-03 10:49:09 +02:00
Michael Peter Christen	cb85b22725	redesign of the image search process (with much better results, unfortunately the index schema has changed and p2p image search will not be much better until many people update)	2013-09-02 18:55:38 +02:00
orbiter	f106345eef	link strings should not be tokenized	2013-09-01 14:35:36 +02:00
Michael Peter Christen	0f3d8890db	removed an assert which causes a shortcut call circuit	2013-08-22 10:12:25 +02:00
Michael Peter Christen	47b1c81d08	- refactoring - generalized writing of url attributes to solr documents - added more url attributes to error documents	2013-08-20 15:46:04 +02:00
Michael Peter Christen	58fe986cca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2013-07-30 12:49:14 +02:00
Michael Peter Christen	cf12835f20	replaced the single-text description solr field with a multi-value description_txt text field	2013-07-30 12:48:57 +02:00
Roland Haeder	13433d41a1	Log this exception better Conflicts: source/net/yacy/kelondro/blob/Tables.java	2013-07-27 09:54:51 +02:00
orbiter	056b42f5aa	- added information about segment count to status_p.xml - also moved this information from the old index structure, which is still in use for the RWI/DHT index to that front-end	2013-07-23 18:03:33 +02:00
Michael Peter Christen	336f86394c	replaced StringBuffer with StringBuilder	2013-07-23 12:21:27 +02:00
Michael Peter Christen	aeac2fb763	replaced more containsKey() -> get() usages by a simple get(), followed by a test for NULL. This should increase the application speed and reduces the lookup time for the affected methods by 50%	2013-07-23 12:16:51 +02:00
Michael Peter Christen	735a66eff3	enhancements to crawler	2013-07-18 12:29:04 +02:00
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	2013-07-17 18:31:30 +02:00
Michael Peter Christen	5c6946dd5f	replaced usage of log4j by ConcurrentLog where possible	2013-07-09 14:42:39 +02:00
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based logger tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is a add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	2013-07-09 14:28:25 +02:00
reger	a6bf44212e	bugfix: location (lat/lon) meta data retrieval (Double.NaN check)	2013-06-30 03:50:07 +02:00
Michael Peter Christen	14186e815e	npe fix	2013-06-13 22:42:21 +02:00
Michael Peter Christen	f7e77a21bf	Added a citation reference computation for intra-domain link structures. While the values for the reference evaluation are computed, a backlink structure can also be discovered and written to the index as well. The host browser has been extended to show such backlinks for each presented link. The host browser therefore can now show where a document is linked. The new citation reference is computed as likelihood for a random click path with recursive usage of previously computed likelihood. This process is repeated until the likelihood converges to a specific number. This number is then normalized to a ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to rank popularity within intra-domain link structures.	2013-06-07 13:20:57 +02:00
Michael Peter Christen	e20450e798	patch in HTCache and CitationIndex loading in case that a file is broken: do not crash; instead ignore the file and delete it.	2013-06-07 12:52:03 +02:00
reger	7480e87386	- fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247 - append language setting specific stopword list - remove unused OVERHANG stack type	2013-06-06 22:07:54 +02:00
Michael Peter Christen	a1644ca0fd	new workflow processor in Segment to enqueue indexing documents to solr	2013-05-30 12:34:53 +02:00
Michael Peter Christen	5344a1c5f7	getting the trash out	2013-05-29 16:09:05 +02:00
orbiter	888a985dc6	set a higher limit for table copy usage	2013-05-27 15:23:12 +02:00
Michael Peter Christen	8dbc80da70	redesign of index.exist-test: this shall now not be done using a single id to be tested, but with a collection of ids. This will cause only a single call to solr instead of many. The result is a much better performance when testing the existence of many urls. The effect should cause much less IO during index transmission, both on sender and receiver side.	2013-05-17 13:59:37 +02:00
Michael Peter Christen	44e363f37f	refactoring of WorkflowProcessor, added process counter, update of process counter if a blocking thread dies. Added also a new column in PerformanceConcurrency_p servlet to show the actual number of concurrent processes.	2013-05-13 13:28:07 +02:00
orbiter	aeff31cd44	fix for workflow processor (cause: latest redesign for less threads)	2013-05-12 21:36:20 +02:00
orbiter	a1c989002b	fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4652 generate dht data even if dht receive and dht transmission is switched off	2013-05-08 16:48:45 +02:00
orbiter	7de5b9cfa0	fix for http://bugs.yacy.net/view.php?id=233 - check geolocation coordinates and accept only those, which are well-formed - the solr push process does not stop crawling any more if after 20 requests to Solr Solr does not accept the record. Instead, a severe log entry asks the user to create a bug request	2013-05-03 00:24:39 +02:00
Michael Peter Christen	bb4bf3d8fd	infinity timeout bug protection patch	2013-04-30 11:06:48 +02:00
orbiter	e1bfe9d07a	- reduction of the concurrently running processes to make YaCy more adjusted to smaller and 1-core devices. - the workflow processor now starts no process at all. these are started as soon as parser/condenser/indexing queues are filled. - better abstraction	2013-04-25 11:33:17 +02:00
Michael Peter Christen	c1a2175fbc	added transparency to gif image animation and the integration to the YaCy httpd for on-the-fly generated gifs (including animated gifs)	2013-04-21 12:29:05 +02:00
Michael Peter Christen	ada3f27de7	added three new field for a better ranking: references_internal_i, references_external_i and references_exthosts_i. These can be used to count and evaluate the number of external links to every web page. An experimental ranking function can be i.e.: div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))	2013-04-12 16:17:14 +02:00
Michael Peter Christen	342ba1049b	- callback fix - memory allocation problem in RowCollection: if memory is too low, do not try to increase by 1 because this leads to very long execution time and at the end to the same OOM as if we allocate the memory at the moment we need it even if the resource observer states that this memory is not there. To compensate this, the increase size is reduced.	2013-03-19 10:32:01 +01:00
orbiter	47114910d5	fix for possible memory leaks	2013-03-13 17:55:37 +01:00
Michael Peter Christen	addba047e2	changes in ranking computation - an existing ranking servlet for solr was extended. It is now possible to set boost values for fields, boost functions and boost queries. - The ranking can have different instances, but currently only the first one is used - added an abstraction layer for fields which can be used for search and those fields can be edited in the solr ranking configuration - the ranking value from solr within the field score is used to combine remote search requests, which all are created using the same locally defined boost values - reduced the number of fields which are used for search (makes it faster) - replaced some text fields by string fields (makes indexing faster) - removed classes which had no use - made a large number of experiments for a better ranking and created a temporary setting which prefers hits inside titles - adjusted also the RWI-based ranking computation to 'prefer title' - made special cases like for portal search where no post-processing and post-ranking is wanted: this keeps the original ranking order as done by Solr - fixed many bugs with old settings for ranking	2013-03-13 14:47:00 +01:00
Michael Peter Christen	2b6c79d347	in method exists() also use the new caching-stacks for documents/metadata	2013-03-04 01:13:17 +01:00
Michael Peter Christen	3b1d9dc884	made index storage from DHT search result concurrently. This prevents blocking by high CPU usage during search. Also: removed query from Solr for DHT search results; results are taken from the pending queue.	2013-03-02 10:25:52 +01:00
orbiter	d74472f562	corrected result counter	2013-02-27 22:40:23 +01:00
Michael Peter Christen	c95a84103a	complete redesign of search process: - removed 'worker' processes - no internal time-out behaviour: methods either are successful or return null - waiting is only done on top-level - removed snippet-production; this is replaced by solr snippets - removed statistics based on solr size queries (they had been VERY long); the statistics (like suggestions or tag cloud) are now again based on the old but very fast RWI index. In portal or intranet mode the RWI index is usually switched off; if you like to have statistics again then you must switch on the rwis again in this mode. - fixed many bugs regarding correct page counter	2013-02-26 17:16:31 +01:00
Michael Peter Christen	35fa718b77	testing to use solr for portalsearch caused some bugfixing but no full success: try to comment out the solr search request in yacy-portalsearch.js	2013-02-25 14:31:50 +01:00
Michael Peter Christen	089dee1770	- generalized SchemaConfiguration into super-class Configuration and adapted other classes which used the configuration-only access for that class - removed many warnings - adjusted logging	2013-02-25 00:09:41 +01:00
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resting search index has now the following properties: - webgraph size will have about 40 times as many entries as default index - the complete index size will increase and may be about double the current size As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	2013-02-22 15:45:15 +01:00
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented to the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of a solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding on top of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; they must retrieve the actual connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	2013-02-21 13:23:55 +01:00
Marc Nause	75f9568472	) only install files from the RELEASE directory ) minor changes	2013-02-05 21:02:32 +01:00
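
Commit ada3f27de7 in the list above suggests an experimental ranking function, div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1)). The following is only a sketch of the arithmetic that expression describes, written as plain Java rather than as a Solr function query. The class name, method name and sample values are hypothetical and are not taken from the YaCy source.

```java
public class ReferenceBoostSketch {

    // Mirrors div(add(internal, product(external, exthosts)), add(clickdepth, 1)).
    // The parameter names follow the Solr field names from the commit message.
    static double boost(int referencesInternal, int referencesExternal,
                        int referencesExthosts, int clickdepth) {
        double numerator = referencesInternal + (double) referencesExternal * referencesExthosts;
        double denominator = clickdepth + 1; // the +1 avoids division by zero at click depth 0
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Hypothetical page: 4 internal links, 3 external links from 2 hosts, click depth 1.
        System.out.println(boost(4, 3, 2, 1));
    }
}
```

With these sample values the numerator is 4 + 3 * 2 = 10 and the denominator is 2, so pages that are linked from many external hosts and sit close to the start page score higher.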

800 Commits
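
Commit aeac2fb763 in the list above describes replacing a containsKey() call followed by get() with a single get() and a test for null. A minimal sketch of that pattern, assuming a plain java.util.Map, is shown here; the class and method names are hypothetical and do not come from the YaCy code base.

```java
import java.util.HashMap;
import java.util.Map;

public class SingleLookupSketch {

    // Two lookups: containsKey() hashes the key once, get() hashes it again.
    static String lookupTwice(Map<String, String> map, String key) {
        if (map.containsKey(key)) {
            return map.get(key);
        }
        return "n/a";
    }

    // One lookup: get() followed by a null test.
    static String lookupOnce(Map<String, String> map, String key) {
        String value = map.get(key);
        return value != null ? value : "n/a";
    }

    public static void main(String[] args) {
        Map<String, String> hosts = new HashMap<>();
        hosts.put("yacy.net", "known");
        System.out.println(lookupTwice(hosts, "yacy.net") + " " + lookupOnce(hosts, "yacy.net"));
        System.out.println(lookupTwice(hosts, "example.org") + " " + lookupOnce(hosts, "example.org"));
    }
}
```

The two variants only behave identically when the map never stores null values, because get() returns null both for a missing key and for a key mapped to null.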