yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
luccioman	fa4399d5d2	Small perf improvement: initialize thread names early when possible. Initializing Thread names using the Thread constructor parameter is faster, as the constructor already sets a thread name even if no customized one is given, while an additional call to the Thread.setName() function internally does synchronized access, possibly runs an access check on the security manager, and performs a native call. Profiling a running YaCy server revealed that the total processing time spent on Thread.setName() for a typical p2p search was in the range of seconds.	2018-05-23 14:45:35 +02:00
luccioman	6cd3847d0a	Fixed NullPointerException case on Table init with relative file path. Can occur for example when running dbtest with relative test table file name (without explicit parent folder).	2018-01-29 14:00:43 +01:00
luccioman	0487336ec3	Prevent integer overflow in table statistics and use strong typing	2017-06-19 17:02:11 +02:00
luccioman	c53c58fa85	Ensure closing ChunkIterator stream in every possible use case. Also trace in logs any close failures instead of failing silently. This should help prevent holding too many unreleased system file handles, as in the case reported by eros on YaCy forum (http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988&sid=b00e7486c1bf7e48a0d63eb328ccca02 )	2017-06-02 09:47:45 +02:00
Michael Peter Christen	c40c302748	when many crawl queues are generated, this NPE can occur; probably caused by a concurrency issue: W 2015/09/05 14:09:10 ConcurrentLog java.lang.NullPointerException java.lang.NullPointerException at java.util.TreeMap.rotateRight(TreeMap.java:2239) at java.util.TreeMap.fixAfterInsertion(TreeMap.java:2271) at java.util.TreeMap.put(TreeMap.java:582) at net.yacy.kelondro.table.Table.<init>(Table.java:235) at net.yacy.crawler.HostQueue.openStack(HostQueue.java:229) at net.yacy.crawler.HostQueue.getStack(HostQueue.java:204) at net.yacy.crawler.HostQueue.push(HostQueue.java:397) at net.yacy.crawler.HostBalancer.push(HostBalancer.java:237) at net.yacy.crawler.data.NoticedURL.push(NoticedURL.java:184) at net.yacy.crawler.CrawlStacker.stackCrawl(CrawlStacker.java:355) at net.yacy.crawler.CrawlStacker.job(CrawlStacker.java:134) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at net.yacy.kelondro.workflow.InstantBlockingThread.job(InstantBlockingThread.java:101) at net.yacy.kelondro.workflow.AbstractBlockingThread.run(AbstractBlockingThread.java:82) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)	2015-09-05 14:12:17 +02:00
Michael Peter Christen	fed26f33a8	enhanced timezone management for indexed data: to support the new time parser and search functions in YaCy, a high-precision detection of date and time of day is necessary. That requires that the time zone of the document content and the time zone of the user doing a search are detected. The time zone of the search request is detected automatically using the browser's time zone offset, which is delivered with the search request automatically and invisibly to the user. The time zone for the content of web pages cannot be detected automatically and must be an attribute of crawl starts. The advanced crawl start now provides an input field to set the time zone in minutes as an offset number. All parsers must get a time zone offset passed, so this required a change of the parser Java API. Many other changes were made to correct the wrong handling of dates in YaCy, which was to add a correction based on the time zone of the server. Now no correction is added and all dates in YaCy are in the UTC/GMT time zone, a normalized time zone for all peers.	2015-04-15 13:17:23 +02:00
Michael Peter Christen	8aee7f940e	added missing class for latest changes	2014-11-13 01:30:12 +01:00
Michael Peter Christen	421ee64f33	another fix to ordering of table indexes; fixes also network stats graphics	2014-11-11 13:57:04 +01:00
Michael Peter Christen	ee27be3399	misc bugfixes (concurrency, memory protection)	2014-10-08 15:22:29 +02:00
reger	2ba394333f	fix Crawler HostQueue release of stackfile - close stackfile inputstream at end of ChunkIterator This should solve startup delay while unfinished crawl jobs exist (maybe also too many open file situation)	2014-07-06 16:04:30 +02:00
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	2014-05-20 21:50:16 +02:00
sixcooler	b8cee9b7d8	remove tables from tabletracker on close to avoid lots of dead entries in /PerformanceMemory_p.html	2014-05-02 22:55:47 +02:00
Michael Peter Christen	1aea01fe5b	fix for Table in case that requested file does not exist and paths also do not exist	2014-04-17 12:44:05 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
Michael Peter Christen	56710ecb26	prevent opening of new files as that could be a cause for the latest too-many-open-files exception. The old file is just truncated if the table is cleaned.	2014-03-28 14:31:43 +01:00
Michael Peter Christen	8b44fcf0f4	added missing @Override annotation	2014-03-28 13:48:37 +01:00
Michael Peter Christen	fdaeac374a	- enhanced postprocessing speed and memory footprint (by using HashMaps instead of TreeMaps) - enhanced memory footprint of database indexes (by introduction of optimize calls) - optimize calls shrink the amount of used memory for index sets if they are not changed afterwards any more	2014-02-28 14:01:09 +01:00
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary that a large portion of the parser and link processing classes be adapted to carry a different type of link collection which carries property attributes attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI had been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	2013-09-15 00:30:23 +02:00
Roland Haeder	841a28ae76	Added 'final' for all exception blocks as this helps the Java compiler to optimize memory usage Conflicts: source/net/yacy/search/Switchboard.java	2013-07-17 18:31:30 +02:00
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based loggers tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is an add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	2013-07-09 14:28:25 +02:00
orbiter	888a985dc6	set a higher limit for table copy usage	2013-05-27 15:23:12 +02:00
Michael Peter Christen	5e182a566f	- added another enumeration method in kelondro data structure to get a more random access to data for the balancer - added random access inside the balancer	2012-11-23 13:58:39 +01:00
orbiter	276dd6452b	removed warnings	2012-10-23 19:08:44 +02:00
Michael Peter Christen	a8167e6e5b	clean-up: removed unused methods in kelondro	2012-10-06 03:34:52 +02:00
Michael Peter Christen	8219a445f3	refactoring	2012-09-21 16:46:57 +02:00
orbiter	563d584420	removed more dependencies in cora from kelondro	2012-09-21 11:02:36 +02:00
Michael Peter Christen	e072632a54	no complaints about memory if the database is empty	2012-09-11 22:28:10 +02:00
Michael Peter Christen	e5ef840f40	- renamed DoubleSolrConnector to MirrorSolrConnector and added a hit/miss/document cache to the MirrorSolrConnector. - more abstraction to SolrDocument in Connector interface - bugfixes in Solr field reader	2012-08-13 13:32:32 +02:00
Michael Peter Christen	f9c0e6e950	- Implemented and integrated the URIMetadataNode object which is a metadata representation from the solr index. This shall replace metadata from the built-in database in the future. - added the Solr-driven metadata into the search index of YaCy which makes it now possible to run YaCy without the old metadata index. This is a major step forward to a full migration to Solr.	2012-08-10 13:26:51 +02:00
Michael Peter Christen	1687737771	Abstraction of HandleMap and HandleSet	2012-07-27 12:13:53 +02:00
orbiter	0cbda0b2b8	- replaced all length() == 0 and size() == 0 with isEmpty() - replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be done automatically - implemented some isEmpty() methods	2012-07-10 22:59:03 +02:00
Michael Peter Christen	132afaf687	removed unaccessible code	2012-07-05 11:09:44 +02:00
Michael Peter Christen	0301aba1e9	removed unused method parameters	2012-07-05 10:23:07 +02:00
Michael Peter Christen	8a82609360	- smaller caches to save memory - close cloneable iterators to free memory	2012-07-02 15:40:40 +02:00
Michael Peter Christen	0c345d1559	giving threads names so it's easier to see what's happening during debugging and within a thread dump	2012-07-02 09:51:43 +02:00
Michael Peter Christen	2280a7b276	- changed initialization order to prefer allocation of memory for table files first - bugfixes in memory amount calculation	2012-06-09 09:05:47 +02:00
Michael Peter Christen	0746308bc2	only the metadata tables shall be able to use the tail cache	2012-06-08 18:36:11 +02:00
Michael Peter Christen	7ec9bef0c3	fix for OOM	2012-06-08 17:14:09 +02:00
Michael Peter Christen	41c02cb10e	- less restrictions for usage of Table RAM copy - new limit to use the table copy (instead of flag): 400MB available. If less is available, then a copy is never used. If more is available, then it can be used if there is a remaining space of at least 200MB - flush caches more often: flush the Digest cache	2012-06-08 12:48:25 +02:00
Michael Peter Christen	00f2df1120	a variety of possible memory leak fixes	2012-06-06 18:23:18 +02:00
Michael Peter Christen	c15fcde1c8	add-on to latest commit	2012-05-21 17:52:30 +02:00
Michael Peter Christen	cf47d94888	performance hack to parse numbers inside of substrings without actually generating a substring. This avoids the allocation of a String object each time a substring is parsed. Should affect CPU load during RWI transmission.	2012-05-21 13:40:46 +02:00
Roland 'Quix0r' Haeder	fbb946f913	Made a method static (Eclipse suggested it), removed unused import, pk=null check does now output a warning in logfile	2012-05-17 05:55:44 +02:00
Roland 'Quix0r' Haeder	a093ccf5eb	Now used synchronization in all close() methods to make sure all objects are 'closed' in an ordered way Conflicts: source/de/anomic/http/server/ChunkedInputStream.java source/de/anomic/http/server/ChunkedOutputStream.java source/de/anomic/http/server/ContentLengthInputStream.java source/net/yacy/cora/protocol/Domains.java source/net/yacy/cora/services/federated/solr/SolrShardingConnector.java source/net/yacy/cora/services/federated/solr/SolrSingleConnector.java source/net/yacy/document/content/dao/PhpBB3Dao.java source/net/yacy/document/parser/html/AbstractTransformer.java source/net/yacy/kelondro/blob/BEncodedHeap.java source/net/yacy/kelondro/blob/HeapReader.java source/net/yacy/kelondro/index/RAMIndexCluster.java source/net/yacy/kelondro/io/ByteCountInputStream.java source/net/yacy/kelondro/logging/ConsoleOutErrHandler.java source/net/yacy/kelondro/table/SQLTable.java	2012-05-14 07:41:55 +02:00
Michael Peter Christen	0cf3d36eae	more tolerance in case of corrupted file	2012-05-11 20:46:50 +02:00
Michael Peter Christen	e3bb73c3d6	serialized some database access methods	2012-01-31 21:13:49 +01:00
Michael Peter Christen	49be60a7c8	WorkflowProcess is forced to make small pauses if shortMemoryStatus is reached.	2012-01-10 03:03:12 +01:00
Roland 'Quix0r' Haeder	fa08ed5ae5	Fixed a lot of CHMOD rights (no need for execute flag on .java/.html) and introduced local/remote crawl size ratio based check	2011-12-29 00:33:16 +01:00
Michael Christen	c04bfaa51b	refactoring	2011-12-16 23:59:29 +01:00
Michael Christen	404758698a	less io operations	2011-12-06 22:04:34 +01:00
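
The fa4399d5d2 entry above contrasts naming a thread through the Thread constructor with naming it afterwards through Thread.setName(). A minimal sketch of the two styles follows; the class ThreadNaming and its method names are hypothetical and are not taken from the YaCy sources. Both helpers yield a thread with the same name, and the sketch makes no attempt to measure the cost difference the commit reports.

```java
// Hypothetical illustration only; not code from YaCy.
final class ThreadNaming {

    // Preferred style per the commit message: the name is passed to the
    // constructor, so no separate setName() call is needed.
    static Thread namedViaConstructor(Runnable task, String name) {
        return new Thread(task, name);
    }

    // Style the commit moves away from: the constructor first assigns a
    // default name, then setName() replaces it in a second step.
    static Thread namedViaSetter(Runnable task, String name) {
        Thread t = new Thread(task);
        t.setName(name);
        return t;
    }

    public static void main(String[] args) {
        Runnable noop = () -> { };
        System.out.println(namedViaConstructor(noop, "search-worker").getName());
        System.out.println(namedViaSetter(noop, "search-worker").getName());
    }
}
```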


119 Commits
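
The cf47d94888 entry in the list describes parsing numbers inside a larger string without first creating a substring. The sketch below shows one way this idea can be written; the class SubstringNumbers and its parseInt helper are hypothetical and do not reproduce the YaCy implementation. It reads decimal digits directly from a CharSequence range, so no intermediate String is allocated.

```java
// Hypothetical illustration only; not code from YaCy.
final class SubstringNumbers {

    // Parses the decimal integer in s[start, end) without calling substring().
    static int parseInt(CharSequence s, int start, int end) {
        if (start < 0 || end > s.length() || start >= end) {
            throw new NumberFormatException("empty or invalid range");
        }
        int i = start;
        boolean negative = false;
        char first = s.charAt(i);
        if (first == '-' || first == '+') {
            negative = (first == '-');
            i++;
            if (i == end) throw new NumberFormatException("sign without digits");
        }
        long value = 0; // long accumulator makes the int range check simple
        for (; i < end; i++) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at index " + i);
            }
            value = value * 10 + digit;
            // 2147483648 is the largest magnitude an int can hold (as MIN_VALUE)
            if (value > 2147483648L) throw new NumberFormatException("overflow");
        }
        if (negative) return (int) -value;
        if (value > Integer.MAX_VALUE) throw new NumberFormatException("overflow");
        return (int) value;
    }

    public static void main(String[] args) {
        System.out.println(parseInt("id=1234;", 3, 7));
    }
}
```

On Java 9 and later, the standard library method Integer.parseInt(CharSequence, int, int, int) covers the same use case; the commit dates from 2012, before that method existed.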