Commit Graph

62 Commits

Author SHA1 Message Date
Michael Peter Christen
67cd4c37bd activated the new apk parser, which was already complete but
not included in the parser initialization. To make the apk parser
usable, the handling of application type links had to be modified. Now
all documents which have no parser attached are placed in the
noload-queue, while all other documents are parsed using the associated
parser class. This may have side effects on other parsers and the
display of different file classes (images, apps, videos).
2014-09-24 13:32:58 +02:00
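
A minimal Java sketch of the dispatch described above (class and method names are illustrative, not the actual YaCy API): documents whose MIME type has no registered parser are put on the noload queue, all others are parsed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class ParserDispatch {
    interface Parser { void parse(byte[] content); }

    private final Map<String, Parser> parsersByMime = new HashMap<>();

    void register(String mime, Parser parser) {
        parsersByMime.put(mime, parser);
    }

    /** Returns true if parsed, false if the document was queued as noload. */
    boolean dispatch(String mime, byte[] content, Queue<String> noloadQueue, String url) {
        Parser p = parsersByMime.get(mime);
        if (p == null) {
            noloadQueue.add(url); // no parser attached: index metadata only
            return false;
        }
        p.parse(content);
        return true;
    }
}
```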
Michael Peter Christen
ebd0be2cea fixes and speed updates for search process 2014-09-10 14:24:03 +02:00
Michael Peter Christen
eca9380e3d bugfix for the crawler double-check: if a URL is redirected,
the redirect target was not double-checked. This is now done by placing
the redirect URL on the crawl queue again (where it is double-checked).
2014-08-06 12:35:12 +02:00
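
A hedged sketch of the fix (names are illustrative): the redirect target is not loaded directly but re-enqueued, so it passes the same double-check as every other URL.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class RedirectRequeue {
    private final Deque<String> crawlQueue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    void handleRedirect(String redirectTarget) {
        crawlQueue.add(redirectTarget); // double-checked again on dequeue
    }

    String nextUrl() {
        String url;
        while ((url = crawlQueue.poll()) != null) {
            if (seen.add(url)) return url; // the double-check: skip known URLs
        }
        return null;
    }
}
```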
orbiter
22ce4fb4dd better error handling for remote solr queries and exists-checks 2014-08-01 11:00:10 +02:00
orbiter
4b06adb751 fix for file urls 2014-07-23 17:54:31 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with a
rel="nofollow" attribute in the <a> tag for each crawl. This introduces
a lot of changes because it extends the usage of the AnchorURL object
type, which now also has a different toString method than the
underlying DigestURL.toString. It is therefore not advised to use
.toString at all for URLs; use toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
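
A self-contained sketch of the new per-crawl switch (the Anchor record and method names are assumptions, not the actual AnchorURL API): links carrying rel="nofollow" are skipped when 'obey nofollow' is enabled.

```java
import java.util.List;
import java.util.Queue;

class NofollowFilter {
    record Anchor(String url, String rel) {}

    static void enqueueLinks(List<Anchor> anchors, boolean obeyNofollow,
                             Queue<String> crawlQueue) {
        for (Anchor a : anchors) {
            if (obeyNofollow && a.rel() != null && a.rel().contains("nofollow")) {
                continue; // respect rel="nofollow" for this crawl
            }
            crawlQueue.add(a.url()); // in YaCy: use toNormalform(false), not toString()
        }
    }
}
```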
Michael Peter Christen
b893c42a0f bugfix for image search 2014-06-26 12:56:33 +02:00
Michael Peter Christen
ba6ffddefc refactoring 2014-06-12 05:23:26 +02:00
Michael Peter Christen
10cf8215bd added crawl depth for failed documents 2014-04-17 13:21:43 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all URLs to be loaded in separate queues for each host.
Each crawl depth for a host gets its own queue. The primary rule for
URLs taken from any queue is that the crawl depth is minimal. This
produces a crawl depth which is identical to the click depth.
Furthermore, the crawl is able to create a much better balancing over
all hosts, which is fair to all hosts in the queue.
This process will create a very large number of files for wide crawls
in the QUEUES folder: a directory for each host and, inside it, a file
for each crawl depth. A crawl with maxdepth = 4 will be able to create
tens of thousands of files. To be able to use that many file readers,
it was necessary to implement a new index data structure which opens
the file only when an access is wanted (OnDemandOpenFileIndex). The
usage of such on-demand file readers shall prevent the number of file
pointers from exceeding the system limit, which is usually about 10,000
open files. Some parts of YaCy had to be adapted to handle the crawl
depth number correctly. The logging and the IndexCreateQueues servlet
had to be adapted to show the crawl queues differently, because the
host name is attached to the port on the host to differentiate between
http, https, and ftp services.
2014-04-16 21:34:28 +02:00
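
A minimal sketch of the balancing rule described above (the class is illustrative, not the actual HostBalancer): each host keeps one queue per crawl depth, and URLs are always taken from the smallest non-empty depth, which keeps the crawl depth identical to the click depth.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

class HostQueuesSketch {
    // one queue per crawl depth; TreeMap keeps depths sorted ascending
    private final TreeMap<Integer, Queue<String>> byDepth = new TreeMap<>();

    void push(int depth, String url) {
        byDepth.computeIfAbsent(depth, d -> new ArrayDeque<>()).add(url);
    }

    /** Primary rule: take a URL with minimal crawl depth. */
    String pop() {
        for (Map.Entry<Integer, Queue<String>> e : byDepth.entrySet()) {
            String url = e.getValue().poll();
            if (url != null) return url;
        }
        return null;
    }
}
```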
Michael Peter Christen
6bd8c6f195 fix for wrong status codes of error pages 2014-04-10 09:08:59 +02:00
Michael Peter Christen
b08375da33 fix for bad/missing values of size_i 2014-03-11 09:51:04 +01:00
reger
dd5bf0b71b cleanup old reference to HTTPDemon.setAlternativeResolver
optimize .yacyh check in AbstractRemoteHandler
2014-03-06 03:08:04 +01:00
Michael Peter Christen
bcd9dd9e1d enhanced concurrent loading by using a fixed set of
concurrent loader processes instead of throwaway processes. The control
mechanism now reports a 'queue full' message to the busy loop less
often, so the loop does not perform long busy waiting; instead all
requests are queued and new loader processes are started if necessary,
up to a given limit (as set before).
2014-03-03 22:13:40 +01:00
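
A hedged sketch of the loader redesign using a standard fixed thread pool (the limit is an illustrative stand-in for the configured value): requests queue up instead of spawning throwaway threads, and worker threads are created lazily up to the limit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class LoaderPool {
    private static final int MAX_LOADERS = 32; // illustrative, "as set before"

    private final ExecutorService loaders = Executors.newFixedThreadPool(MAX_LOADERS);

    void submit(Runnable loadJob) {
        loaders.submit(loadJob); // queued if all loader threads are busy
    }
}
```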
Michael Peter Christen
69391e5d9e changed the strategy to test the existence of documents in
Solr: now using the update time. The reason is better caching for the
crawler double-check, which needs the update time for crawler steering.
2014-02-19 04:03:45 +01:00
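
A minimal sketch of the caching idea, assuming a hypothetical cache keyed by URL hash (the real crawler steering is more involved): the exists-check now yields the update time, so the double-check can reuse it without asking Solr again.

```java
import java.util.Date;
import java.util.concurrent.ConcurrentHashMap;

class ExistsCache {
    private final ConcurrentHashMap<String, Date> updateTimeByHash = new ConcurrentHashMap<>();

    /** null means: not known to exist in Solr. */
    Date getUpdateTime(String urlHash) {
        return updateTimeByHash.get(urlHash);
    }

    void put(String urlHash, Date updateTime) {
        updateTimeByHash.put(urlHash, updateTime); // filled from the Solr exists-check
    }
}
```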
Michael Peter Christen
8b14e92ba4 added button in host browser to re-load 404/failed documents 2014-01-23 15:56:36 +01:00
Michael Peter Christen
022c6d3ce1 do YaCy p2p connections using a timeout-request which wraps
the http request in a separate thread and ignores the further result of
a request if it does not answer within the requested time-out. This is
an attempt to solve a problem with the peer-ping, which hangs whenever
a peer appears to be dead or blocked.
2014-01-19 15:21:23 +01:00
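
A sketch of such a timeout-request with a plain Future (illustrative, not the YaCy implementation): the HTTP request runs in its own thread, and the caller gives up after the time-out instead of hanging on a dead peer.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutRequest {
    private static final ExecutorService pool = Executors.newCachedThreadPool();

    static String ping(Callable<String> httpRequest, long timeoutMillis) {
        Future<String> f = pool.submit(httpRequest);
        try {
            return f.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // ignore the further result; peer seems dead or blocked
            return null;
        } catch (InterruptedException | ExecutionException e) {
            return null;
        }
    }
}
```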
reger
6932aa4d7a use the configured admin-username for api calls
- the admin user name can be configured; previously, apiExec calls
always used the default "admin" username.

TODO: the bin/apicall.sh script should likely take that into account.
2014-01-07 21:26:50 +01:00
orbiter
3cb6c7861f fixed shutdown authentication problem 2014-01-06 01:48:54 +01:00
orbiter
f3ac923a7e ftp client shall be able to open non-anonymous ftp servers if login
details are given
2013-12-28 22:42:02 +01:00
reger
fd119deb00 fix NPE on modified-since check (Response.requestHeader is allowed to be null) 2013-09-30 02:50:53 +02:00
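
The shape of the fix, as a minimal sketch (the Map stands in for the real request header object): guard against a null header before reading the modified-since value.

```java
import java.util.Date;
import java.util.Map;

class ModifiedSinceCheck {
    /** Response.requestHeader may legitimately be null, so guard before access. */
    static Date ifModifiedSince(Map<String, Date> requestHeader) {
        if (requestHeader == null) return null; // the fix: no dereference on null
        return requestHeader.get("If-Modified-Since");
    }
}
```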
Michael Peter Christen
91a875dff5 self-healing of mistakenly deactivated crawl profiles. This fixes a bug
which can happen in rare cases when a crawl start and a cleanup process
happen at the same time.
2013-09-25 18:27:54 +02:00
Michael Peter Christen
2602be8d1e - removed the ZURL data structure; also removed the ZURL
data file
- replaced load failure logging with information which is stored in
Solr
- fixed a bug with crawling of feeds: the must-match pattern is now
applied to feed URLs to filter out URLs which shall not be in a wanted
domain
- delegatedURLs, which also used ZURLs, are now temporary objects in
memory
2013-09-17 15:27:02 +02:00
Michael Peter Christen
61c5e40687 - replaced the properties object in AnchorURL with distinct
variables for anchor attributes.
- this required large portions of the parser code to be adapted as well
- added a counter target_order_i for anchor links in webgraph
computation
2013-09-15 23:27:04 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web
page, not only the unique links! This made it necessary to adapt a
large portion of the parser and link processing classes to carry a
different type of link collection, whose property attributes are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other URL classes, DigestURI and MultiProtocolURI, have been
renamed and refactored to fit into a new document package schema,
document.id
- cleanup of the net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
dbef8ccfcb forced deletion of ZURL entries for each host that appears
in the crawl url list
2013-09-05 13:22:16 +02:00
orbiter
26366596d9 fix for a problem which occurs when a site is crawled whose
start url is redirected.
2013-09-04 16:00:47 +02:00
Michael Peter Christen
a88a62f7aa added a feature to set a collection for a crawl result based
on a regular expression on the URL: the collection attribute for a
crawl start may now be either a token or a list of tokens, separated by
',', where a token is either a string or a pair <string,pattern>, where
the string is separated from the pattern by a ':' and the string is
assigned to the document as collection only if the pattern matches the
URL.
2013-08-25 00:13:48 +02:00
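
An illustrative re-implementation of that parsing rule (not the YaCy source): split the attribute on ',', and for "name:pattern" tokens assign the name only if the pattern matches the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CollectionAssigner {
    static List<String> collectionsFor(String attribute, String url) {
        List<String> result = new ArrayList<>();
        for (String token : attribute.split(",")) {
            int colon = token.indexOf(':');
            if (colon < 0) {
                result.add(token.trim()); // plain token: always assigned
            } else {
                String name = token.substring(0, colon).trim();
                Pattern pattern = Pattern.compile(token.substring(colon + 1));
                if (pattern.matcher(url).matches()) result.add(name);
            }
        }
        return result;
    }
}
```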
Michael Peter Christen
765943a4b7 Redesign of crawler identification and robots steering. A
non-p2p user in intranets and the internet can now choose to appear as
Googlebot. This is an essential necessity to be able to compete in the
field of commercial search appliances, since most web pages are these
days optimized only for Google and no longer for any other search
platform. All commercial search engine providers have a built-in
fake-Google user agent to be able to get the same search index as
Google. Without the option to resist obeying robots.txt in this case,
no competition is possible any more. YaCy will always obey robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a search appliance (like a Google Search Appliance, GSA) it
is necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected, on a per-crawl-start basis. Every crawl
start can have a different user agent.
2013-08-22 14:23:47 +02:00
reger
2b7a38640a extend content type detection on file extension for .tif .tiff .htm 2013-07-21 22:57:21 +02:00
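
A minimal sketch of extension-based detection (the map contents follow the commit message; the method and map are illustrative):

```java
import java.util.Map;

class ExtToMime {
    private static final Map<String, String> MIME = Map.of(
        "tif", "image/tiff",
        "tiff", "image/tiff",
        "htm", "text/html"
    );

    static String mimeOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? null : MIME.get(fileName.substring(dot + 1).toLowerCase());
    }
}
```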
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
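
What the change looks like in practice, as a hedged illustration (not the actual diff): the exception variable in each catch block is declared final.

```java
import java.io.IOException;

class FinalCatchExample {
    static void run() {
        try {
            riskyIo();
        } catch (final IOException e) { // 'final': e is never reassigned
            System.err.println("io failed: " + e.getMessage());
        }
    }

    static void riskyIo() throws IOException { /* placeholder */ }
}
```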
Michael Peter Christen
89c0aa0e74 added collection_sxt to error documents 2013-07-17 15:20:56 +02:00
Michael Peter Christen
bcc623a843 refactoring of load_delay: this is a matter of client identification 2013-07-12 16:24:56 +02:00
Michael Peter Christen
5c6946dd5f replaced usage of log4j by ConcurrentLog where possible 2013-07-09 14:42:39 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based loggers tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a major performance issue. To
overcome this problem, this add-on to jdk logging puts log entries on
a concurrent message queue and logs the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
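
A sketch of the ConcurrentLog idea (names are illustrative): callers enqueue entries on a concurrent queue and return immediately, while a single background thread drains the queue and does the potentially blocking JDK logging.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.logging.Level;
import java.util.logging.Logger;

class ConcurrentLogSketch {
    private static final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private static final Logger backend = Logger.getLogger("yacy");

    static { // one consumer thread performs the blocking log calls
        Thread t = new Thread(() -> {
            try {
                while (true) backend.log(Level.INFO, queue.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-consumer");
        t.setDaemon(true);
        t.start();
    }

    static void info(String message) {
        queue.offer(message); // never blocks the caller
    }
}
```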
Michael Peter Christen
16d1d744fa added url_file_name_s in the default collection schema for
the file name without the file extension. This part of the file path is
removed from the multi-valued field url_paths_sxt, which now no longer
has the file name as the last part of the path list.

The same applies to the new fields source_file_name_s and
target_file_name_s in the webgraph schema.
2013-06-25 16:27:20 +02:00
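
An illustrative computation of the two values (the field names are from the commit message; the code is a sketch, not the YaCy implementation):

```java
import java.util.Arrays;
import java.util.List;

class UrlPathFields {
    /** url_file_name_s: the file name without its extension. */
    static String fileNameWithoutExtension(String path) {
        String last = path.substring(path.lastIndexOf('/') + 1);
        int dot = last.lastIndexOf('.');
        return dot < 0 ? last : last.substring(0, dot);
    }

    /** url_paths_sxt: the path elements, no longer including the file name. */
    static List<String> urlPaths(String path) {
        String[] parts = path.split("/");
        if (parts.length <= 2) return List.of(); // file at root: no path elements
        return Arrays.asList(parts).subList(1, parts.length - 1);
    }
}
```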
Michael Peter Christen
2fd7bbb450 reduced load on solr: no seed update in Status and no
exists-check in HTTPLoader in case of redirects; that can be done using
the htcache.
2013-06-12 00:14:55 +02:00
Michael Peter Christen
8f2d3ce2f9 reduced locking in the crawler: shifted the synchronized
location and reduced the time-out of the robots.txt load limit
2013-05-20 22:05:28 +02:00
Michael Peter Christen
06d3063dc9 - no downcase when using collection modifier
- removed warnings
2013-05-17 14:11:10 +02:00
Michael Peter Christen
8dbc80da70 redesign of the index.exist-test: this shall now not be done
using a single id to be tested, but with a collection of ids. This
causes only a single call to Solr instead of many. The result is much
better performance when testing the existence of many URLs. The effect
should also be much less IO during index transmission, on both the
sender and the receiver side.
2013-05-17 13:59:37 +02:00
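
A hedged sketch of the batched exists-test using the SolrJ client (YaCy's internal API differs; the id:(...) query shape is an assumption consistent with the commit message): one round-trip for a whole collection of ids.

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;

class BatchExistsCheck {
    static Set<String> existing(SolrClient solr, Set<String> ids) throws Exception {
        SolrQuery q = new SolrQuery("id:(\"" + String.join("\" OR \"", ids) + "\")");
        q.setFields("id");      // only the id is needed for an exists-test
        q.setRows(ids.size());
        Set<String> found = new HashSet<>();
        for (SolrDocument d : solr.query(q).getResults()) {
            found.add((String) d.getFieldValue("id"));
        }
        return found;
    }
}
```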
Michael Peter Christen
c091000165 added collection attribute also to the rss feed reader 2013-04-24 01:14:35 +02:00
Michael Peter Christen
252bb51f98 fix for wrong mime type in noload crawler 2013-03-07 15:31:00 +01:00
Michael Peter Christen
788288eb9e added the generation of 50 (!!) new Solr fields in the core
'webgraph'. The default schema uses only some of them and the resulting
search index now has the following properties:
- the webgraph index will have about 40 times as many entries as the
default index
- the complete index size will increase and may be about double the
current amount
As testing showed, not much indexing performance is lost. The default
index will be smaller (fields were moved out of it); thus searching
can be faster.
The new index will allow some old parts of YaCy to be removed, i.e. the
specialized webgraph data and the noload crawler. The new index will
make it possible to:
- search within link texts of linked but not indexed documents (about
20 times the document index in size!!)
- get a very detailed link graph
- enhance ranking using a complete link graph

To get full access to the new index, the API to Solr now has two access
points: one with the attribute core=collection1 for the default search
index and core=webgraph for the new webgraph search index. This is also
available for p2p operation, but client access is not yet implemented.
2013-02-22 15:45:15 +01:00
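
A sketch of calling the two access points over HTTP (host, port, and servlet path are illustrative assumptions; only the core attribute is from the commit message):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class WebgraphQuery {
    /** core is "collection1" for the default index or "webgraph" for the link index. */
    static String select(String core, String query) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(
                "http://localhost:8090/solr/select?core=" + core
                + "&wt=json&q=" + URLEncoder.encode(query, "UTF-8"))).build();
        return HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```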
reger
276e63401e small sanitary fixes
- exclude unix shell scripts from the NSIS windows install archive
- replace the link to env/grafics/yacy.gif with yacy.png (build.nsi)
- remove unused code lines (Blacklist_p, Response, WordReferenceVars)
- typo & xhtml fixes (RankingSolr_p.html)
2013-01-02 01:59:47 +01:00
orbiter
712cc37c40 if maxFileSize < 0 then there is no file size limit. 2012-12-10 21:17:45 +01:00
Michael Peter Christen
a3cd3852ab introduced a better place to update the lastacc time value in latency 2012-12-07 15:49:23 +01:00
Michael Peter Christen
dd241d03bb latency fix: only set last-visit time if access was actually by the
robot
2012-12-07 02:00:12 +01:00
Michael Peter Christen
10527e28ae fix for wrong display of error urls in HostBrowser 2012-12-07 00:31:10 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignature.
As a result, a signature of the document is written to the solr search
index. Additionally, each time a signature is written, it is checked
whether the signature already exists in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and push duplicates to the end
of a result list.
To enable this, a large portion of the search API to Solr had to be
changed. This affected mainly the caching of 'exists' searches, to
speed up the check for existing signatures without actually running a
Solr query.
Because this is the first time a long number is used as a value in the
Solr store, the value naming in the YaCySchema also had to be adapted
and normalized. This required changes in many files.
2012-11-21 18:46:49 +01:00
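
A minimal sketch of the uniqueness flow (the field names and the Doc interface are assumptions): write the signature, consult a cached set of known signatures instead of querying Solr, and flag the document.

```java
import java.util.Set;

class UniquenessMarker {
    interface Doc { void setField(String name, Object value); } // stand-in for a Solr document

    static void mark(Doc doc, long signature, Set<Long> knownSignatures) {
        doc.setField("fuzzy_signature_l", signature);     // field name is an assumption
        boolean unique = knownSignatures.add(signature);  // true if not seen before
        doc.setField("fuzzy_signature_unique_b", unique); // duplicates sort to the end
    }
}
```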
Michael Peter Christen
4a14122ba7 in case a crawl profile has a collection assigned, use the
collection to show a name in the web interface. This should prevent
overly long names from making the interface unusable.
2012-10-31 14:08:33 +01:00