yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	b893c42a0f	bugfix for image search	2014-06-26 12:56:33 +02:00
Michael Peter Christen	74c249288a	added a push api to make it possible to upload files directly without crawling to the YaCy indexer. Files are uploaded using POST multipart requests; multiple file uploads are possible as well. Each file has attached the file date and mime type which is used to get the right parser for the submitted data. Also an url is submitted which is assigned to the document. The CrawlSwitchboard has a new option for default Crawl Profiles which are assigned dynamically from the new push interface.	2014-06-12 18:10:07 +02:00
Michael Peter Christen	ba6ffddefc	refactoring	2014-06-12 05:23:26 +02:00
reger	92d1604a31	Crawler hostbalancer does not delete finished queue files, use alternative delete to fight the sympthom (and fix deletion of host dirs on startup) Root cause (which class holds a lock on .stack) not found. http://mantis.tokeek.de/view.php?id=404	2014-06-05 02:13:08 +02:00
orbiter	d7d38f9135	made number of open files in crawler configurable and increased default maximum number of open files from 100 to 1000. This number can be changed with the attribut crawler.onDemandLimit	2014-05-31 09:29:55 +02:00
reger	ca5437dd50	fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149 local files can be crawled (intranet mode) url parsing fixed according to RFC 1738 (for unix and windows) for win like file:///c:/tmp or file://localhost/c:/tmp for linux like file:///tmp or file://localhost/tmp Host is ignored and path must be absolute	2014-05-28 03:01:34 +02:00
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	2014-05-20 21:50:16 +02:00
reger	1600414450	fix NPE on continuing crawls after YaCy restart (Agent is then nulll)	2014-05-02 19:32:09 +02:00
Michael Peter Christen	c1c1be8f02	fix for slow crawling and better logging in balancer	2014-04-29 19:50:33 +02:00
Michael Peter Christen	3acf416335	npe fix	2014-04-29 19:24:05 +02:00
orbiter	2f63bd0261	enhanced Host Balancer strategy: fair round robin	2014-04-23 23:11:37 +02:00
Michael Peter Christen	8b32dd5f9e	special strategy for balancer: do not remove targets with zero wait time from the queue	2014-04-18 06:50:07 +02:00
Michael Peter Christen	9c6228d948	fix for deadlocks in crawler	2014-04-17 16:58:17 +02:00
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	2014-04-17 13:21:43 +02:00
Michael Peter Christen	06afb568e2	new Strategies in Balancer: - doublecheck cache now records the crawl depth as well - doublecheck cache is available from the outside (made static) - no more need to crawl hosts with lowest depth first, instead all hosts which have only singleton entries are preferred to reduce the number of files.	2014-04-17 12:52:54 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into it's own queue. The primary rule for urls taken from any queue is, that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermorem the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such on-demand file reader shall prevent that the number of file pointers is over the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adopted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adopted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
Michael Peter Christen	075b6f9278	refactoring of the crawl balancer: the balancer is turned into an interface and the old balancer class is moved into LegacyBalancer to make room for a fresh implementation of a crawl balancer.	2014-04-14 13:32:35 +02:00
Michael Peter Christen	6bd8c6f195	fix for wrong status codes of error pages	2014-04-10 09:08:59 +02:00
Michael Peter Christen	9e503b3376	also delete the robots.txt file from the cache when a new crawl is started	2014-04-09 21:59:54 +02:00
Michael Peter Christen	1c21b3256d	fix for robots.txt handling: delete old entry before starting a new crawl.	2014-04-09 18:33:48 +02:00
Michael Peter Christen	926d28dd3f	fixed a bug which prevented crawl starts after a network switch	2014-04-04 14:43:35 +02:00
Michael Peter Christen	d4b5c457e4	NPE fix	2014-04-04 12:34:34 +02:00
Michael Peter Christen	8b44fcf0f4	added missing @Override annotation	2014-03-28 13:48:37 +01:00
Michael Peter Christen	85a427ec54	support for multiple sitemaps in robots.txt	2014-03-14 13:33:23 +01:00
Michael Peter Christen	b08375da33	fix for bad/missing values of size_i	2014-03-11 09:51:04 +01:00
reger	dd5bf0b71b	cleanup old reference to HTTPDemon.setAlternativeResolver optimize .yacyh check in AbstractRemoteHandler	2014-03-06 03:08:04 +01:00
Michael Peter Christen	e485fbd0ce	- let crawl loader jobs die after 10 seconds without new jobs - corrected shutdown order t prevent a deadlock during shutdown	2014-03-04 00:33:13 +01:00
Michael Peter Christen	bcd9dd9e1d	enhanced concurrent loading by using a fixed set of concurrent loader processes in favor of throwaway-processes. The control mechanism does less often report a 'queue full' message to the busy loop which then does not perform a long busy waiting; instead all requests are queued and new loader processes are started if necessary up to a given limit (as set before)	2014-03-03 22:13:40 +01:00
Michael Peter Christen	6ed9c0164e	attaching names to all Threads to get a better view in profiling tools like VisualVM	2014-02-28 15:02:01 +01:00
Michael Peter Christen	fdaeac374a	- enhanced postprocessing speed and memory footprint (by using HashMaps instead of TreeMaps) - enhanced memory footprint of database indexes (by introduction of optimize calls) - optimize calls shrink the amount of used memory for index sets if they are not changed afterwards any more	2014-02-28 14:01:09 +01:00
orbiter	da5d4128bf	prevent npe	2014-02-25 03:26:20 +01:00
orbiter	a878c7982c	prevent npe	2014-02-25 03:19:41 +01:00
orbiter	ced1a96f9c	fixed error cache	2014-02-25 02:16:22 +01:00
Michael Peter Christen	69391e5d9e	changed strategy to test existence of documents in Solr: using the update time. The reason for that is a better caching for the crawler double-check, which needs the update time for crawler steering.	2014-02-19 04:03:45 +01:00
Michael Peter Christen	8b14e92ba4	added button in host browser to re-load 404/failed documents	2014-01-23 15:56:36 +01:00
Michael Peter Christen	6ada0daae9	making latency_factor and maximum number of same hosts in loader queue settings available in Crawler_p.html servlet for steering.	2014-01-21 19:28:00 +01:00
Michael Peter Christen	0168f80c28	new crawling factors can now be changed during runtime	2014-01-21 17:52:16 +01:00
Michael Peter Christen	77531850b5	reverted crawling strategy from latest commit.	2014-01-21 16:05:55 +01:00
Michael Peter Christen	c0da966dfa	enhanced crawler speed	2014-01-20 21:46:40 +01:00
Michael Peter Christen	0d235a565b	cleanup crawl loader jobs	2014-01-20 18:36:00 +01:00
Michael Peter Christen	1ea17bd9f3	- removed old metadata database and all migration code - refactored all code which uses URIMetadataRow as standard for word hash length and word hash ordering and moved that to the class 'Word', becuase the class URIMetadataRow defined the old metadata data structure and should be superfluous in the future - removed unused methods from URIMetadataRow as preparation for further removal of that class	2014-01-20 18:31:46 +01:00
Michael Peter Christen	022c6d3ce1	do YaCy p2p connections using a timeout-request which covers the http request into a separate thread and ignores the furthure result of a request if that does not answer within the requested time-out. This is a try to solve a problem with the peer-ping, which hangs whenever a peer appears to be dead or blocked.	2014-01-19 15:21:23 +01:00
reger	28eae57e8b	spend CrawlQueues a fremem routine - clears errorStack - will not get hit often (but better little than nothing on low mem)	2014-01-10 10:24:33 +01:00
reger	6932aa4d7a	use configured admin-username for api calls - the admin user name can be configured, in apiExec calls the default "admin" username is used. TODO: the bin/apicall.sh script should likely take that into account.	2014-01-07 21:26:50 +01:00
orbiter	3cb6c7861f	fixed shutdown authenticaton problem	2014-01-06 01:48:54 +01:00
orbiter	f3ac923a7e	ftp client shall be able to open non-anonymous ftp servers if login details are given	2013-12-28 22:42:02 +01:00
Michael Peter Christen	82c0525e71	wrong logger fix	2013-12-23 10:52:02 +01:00
Michael Peter Christen	552ef9f18e	fix for bad ErrorCache.exists test (bug from latest commit)	2013-12-12 10:38:32 +01:00
Michael Peter Christen	303f5694ba	avoid usage of existsByQuery. If a document can be loaded by the ID before testing other fields from the existsByQuery request, then a document cache fills and queries after that one can be avoided.	2013-12-12 03:36:30 +01:00
Michael Peter Christen	0db8e34625	enhanced webgraph processing	2013-12-04 01:54:45 +01:00

1 2 3 4

185 Commits