Commit Graph

68 Commits

Author SHA1 Message Date
Michael Peter Christen
fed26f33a8 enhanced timezone management for indexed data:
to support the new time parser and search functions in YaCy, a
high-precision detection of the date and the time of day is necessary.
This requires that the time zone of the document content and the time
zone of the user doing a search are detected. The time zone of the
search request is determined automatically using the browser's time zone
offset, which is delivered with the search request automatically and
invisibly to the user. The time zone for the content of web pages cannot
be detected automatically and must be an attribute of crawl starts. The
advanced crawl start now provides an input field to set the time zone as
an offset number in minutes. All parsers must get a time zone offset
passed to them, which required a change of the parser Java API. Many
other changes were made to correct the formerly wrong handling of dates
in YaCy, which added a correction based on the time zone of the server.
Now no correction is added and all dates in YaCy are in the UTC/GMT time
zone, a normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
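A minimal sketch of the offset arithmetic this implies, assuming the content time zone is passed as an offset in minutes; the class and method names are illustrative, not the actual parser API:

```java
import java.util.Date;

// Hypothetical helper: normalize a date parsed from document content to UTC,
// given the crawl start's content time zone offset in minutes.
public class ContentTimezone {

    /**
     * @param parsedLocal   a date parsed from content, read as if it were UTC
     * @param offsetMinutes content time zone offset, e.g. 120 for UTC+2
     * @return the same instant expressed in UTC
     */
    public static Date toUTC(Date parsedLocal, int offsetMinutes) {
        // 14:00 at UTC+2 is 12:00 UTC, so subtract the offset
        return new Date(parsedLocal.getTime() - offsetMinutes * 60_000L);
    }
}
```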
reger
ff18129def ViewFile servlet: update the index if newer,
so that the viewed text and the (stored) metadata info match
- to achieve this, use a request with a profile that allows indexing (defaultglobaltext) and update the index
   (the resource is loaded and parsed anyway, so it is not an expensive operation)

Request: removed 2 unused init parameters
- number of anchors of the parent
- forkfactor: sum of anchors of all ancestors
2014-12-05 01:13:37 +01:00
Michael Peter Christen
6983dff334 explain crawl denial when not switched to intranet mode 2014-10-11 09:02:12 +02:00
Michael Peter Christen
6491270b3a large IPv6 redesign of the peer-ping methods!
removed the preferred-IPv4 start option and added a new field, IP6, in
the peer seeds, which will contain one or more IPv6 addresses. Now every
peer has one or more IP addresses assigned; even several IPv6 addresses
are possible. The peer-ping process must check all given and possible IP
addresses for a back-ping and return the one IP which was successful
when pinging the peer. The pinging peer must be able to recognize which
of the given IPs are available for outside access to the peer and store
this accordingly. If only one IPv6 address is available and no IPv4,
then the IPv6 is stored in the old IP field of the seed DNA.
Many methods in Seed.java are now marked as @deprecated because they had
been used for a single IP only. There is still a large construction site
left in YaCy where all these deprecated methods must be replaced with
new method calls. The 'extra' IPs used by cluster assignment have been
removed, since they can be replaced with IPv6 usage in p2p clusters. All
clusters must now use IPv6 if they want intranet routing.
2014-09-30 14:53:52 +02:00
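An illustrative sketch of the back-ping selection described here (the class and method are hypothetical, not YaCy's Seed API): all candidate addresses, IPv4 and IPv6 alike, are probed, and the first one that answers is the one to store.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.util.List;

// Hypothetical helper for picking the one reachable IP out of a peer's
// address list during the peer-ping process.
public class PeerAddressProbe {

    /** @return the first reachable address, or null if none answered */
    public static InetAddress firstReachable(List<InetAddress> candidates, int timeoutMillis) {
        for (InetAddress ip : candidates) { // may mix IPv4 and IPv6 entries
            try {
                if (ip.isReachable(timeoutMillis)) return ip;
            } catch (IOException e) {
                // this address is not usable from here; try the next one
            }
        }
        return null;
    }
}
```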
Michael Peter Christen
67cd4c37bd activated the new apk parser, which was already finished but not
included in the parser initialization. To make the apk parser usable,
the handling of application-type links had to be modified. Now all
documents which do not have a parser attached are placed on the
noload-queue, while all other documents are parsed using the associated
parser class. This may have side effects on other parsers and the
display of different file classes (images, apps, videos).
2014-09-24 13:32:58 +02:00
Michael Peter Christen
025516f682 fix for failing crawl limit on the number of pages 2014-09-20 13:06:46 +02:00
Michael Peter Christen
bf18a39d0e replaced warning with info 2014-09-16 14:41:04 +02:00
Michael Peter Christen
a7dd89c4de changed the method to write the citation index: do not gather
references during document parsing; instead use the same references that
would also be written into the webgraph. That should cause the webgraph
and the citation index to express exactly the same semantics.
2014-09-02 13:22:12 +02:00
Michael Peter Christen
eca9380e3d bugfix for the crawler double-check: if a URL is redirected, the
redirect target was not double-checked. This is now done by placing the
redirect target on the crawl queue again (where it is double-checked)
2014-08-06 12:35:12 +02:00
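A small sketch of the fix, with hypothetical names: the redirect target is not loaded directly but stacked again, so the regular double-check runs on it like on any other URL.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical crawl stack with a built-in double-check (already-seen test).
public class RedirectStacker {
    private final Deque<String> crawlStack = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** stack a URL only if the double-check passes */
    public void stack(String url) {
        if (seen.add(url)) crawlStack.push(url);
    }

    /** called when loading a URL answered with a redirect */
    public void onRedirect(String from, String target) {
        stack(target); // re-stack the target; the double-check happens in stack()
    }
}
```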
Michael Peter Christen
3c23b89823 less logging 2014-08-04 13:37:34 +02:00
Michael Peter Christen
1609763be5 toString fix 2014-08-04 12:58:39 +02:00
orbiter
22ce4fb4dd better error handling for remote Solr queries and exists-checks 2014-08-01 11:00:10 +02:00
Michael Peter Christen
4eec1a7452 refactoring (changed the Metadata name of the load-time data structure
to avoid confusion with the Node data, which is also called metadata)
2014-07-21 23:54:23 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with a rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL object type, which
now also has a different toString method than the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for URLs; use toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
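An illustrative check for the per-crawl 'obey nofollow' option; the helper below is an assumption for clarity, not the actual AnchorURL code.

```java
// Hypothetical filter: decide whether an anchor may be followed, given its
// rel attribute and the crawl profile's 'obey nofollow' setting.
public class NofollowFilter {

    /** @return true if the link should be followed */
    public static boolean follow(String relAttribute, boolean obeyNofollow) {
        if (!obeyNofollow || relAttribute == null) return true;
        for (String token : relAttribute.toLowerCase().split("\\s+")) {
            if ("nofollow".equals(token)) return false; // rel="nofollow" found
        }
        return true;
    }
}
```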
Michael Peter Christen
b5fc2b63ea removed the exist() retrieval functions from the error cache and
replaced them with metadata retrieval from the connectors directly. This
should make better use of the cache. The metadata cache is automatically
increased if more memory is available.
2014-07-11 19:52:25 +02:00
Michael Peter Christen
62c72360ee cleanup of checkAcceptanceInitially in CrawlStacker; should avoid
calling Solr twice
2014-07-11 18:36:04 +02:00
Michael Peter Christen
b5d78ba156 reduced the number of Solr queries during crawling 2014-07-11 18:05:11 +02:00
Michael Peter Christen
10cf8215bd added crawl depth for failed documents 2014-04-17 13:21:43 +02:00
Michael Peter Christen
b08375da33 fix for bad/missing values of size_i 2014-03-11 09:51:04 +01:00
orbiter
da5d4128bf prevent npe 2014-02-25 03:26:20 +01:00
orbiter
a878c7982c prevent npe 2014-02-25 03:19:41 +01:00
Michael Peter Christen
69391e5d9e changed the strategy to test the existence of documents in Solr: the
update time is now used. The reason for this is better caching for the
crawler double-check, which needs the update time for crawler steering.
2014-02-19 04:03:45 +01:00
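A sketch of the caching idea, with hypothetical names: one cached update time per document ID serves both the existence test and the crawler steering, so a single Solr lookup answers both questions.

```java
import java.util.Date;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache keyed by document ID, storing the Solr update time.
public class UpdateTimeCache {
    private final Map<String, Date> updateTime = new ConcurrentHashMap<>();

    public void put(String id, Date time) { updateTime.put(id, time); }

    /** existence test: a cached update time implies the document exists */
    public boolean exists(String id) { return updateTime.containsKey(id); }

    /** steering data: when was the document last updated? null if unknown */
    public Date lastUpdate(String id) { return updateTime.get(id); }
}
```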
orbiter
f3ac923a7e the FTP client shall be able to open non-anonymous FTP servers if login
details are given
2013-12-28 22:42:02 +01:00
Michael Peter Christen
552ef9f18e fix for bad ErrorCache.exists test (bug from latest commit) 2013-12-12 10:38:32 +01:00
Michael Peter Christen
303f5694ba avoid usage of existsByQuery. If a document can be loaded by its ID
before other fields from the existsByQuery request are tested, then a
document cache fills up and subsequent queries can be avoided.
2013-12-12 03:36:30 +01:00
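A sketch of this cache-first pattern, assuming a hypothetical Loader backend: loading by ID fills the cache, so later lookups on the same document can skip Solr entirely.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical document cache: get-by-ID fills the cache as a side effect.
public class DocumentCache<D> {
    private final Map<String, D> cache = new ConcurrentHashMap<>();

    public interface Loader<D> { D loadById(String id); }

    /** load by ID, filling the cache; null means the document does not exist */
    public D getById(String id, Loader<D> backend) {
        // computeIfAbsent stores nothing when the loader returns null
        return cache.computeIfAbsent(id, backend::loadById);
    }
}
```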
Michael Peter Christen
0db8e34625 enhanced webgraph processing 2013-12-04 01:54:45 +01:00
Michael Peter Christen
1a4a69c226 set more loggers to 'final static' 2013-11-13 06:18:48 +01:00
Michael Peter Christen
234a974955 load images only if their parser flag is activated 2013-11-04 11:59:28 +01:00
Michael Peter Christen
030d0776ff Enhanced crawl start for very, very large crawl lists (e.g. > 5000),
which had a problem because of badly used concurrency.
This fix also caused a redesign of the whole host deletion process.
This should fix bug http://bugs.yacy.net/view.php?id=250
2013-10-24 16:20:20 +02:00
Michael Peter Christen
1b4fa2947d - fixed a problem which occurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text,
etc.) because the file extension was used instead of an existing mime
type assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together make it now possible to crawl
commons.wikimedia.org, which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
2013-10-23 00:16:54 +02:00
Michael Peter Christen
101a6e6e14 Patch the citation index for links with canonical tags.
This shall fulfill the following requirement:
If a document A links to B and B contains a 'canonical C', then the
citation rank computation shall consider that A links to C and B does
not link to C.
To do so, we first must collect all canonical links, find all references
to them, get the anchor lists of those documents and patch the citation
references of these links.
2013-10-07 11:15:58 +02:00
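A sketch of the patch rule with hypothetical data structures: every edge A -> B is rewritten to A -> C when B declares a canonical C, and B's own edge to its canonical is dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical representation: edges as [source, target] pairs, canonical
// declarations as a map from document B to its canonical C.
public class CanonicalPatch {

    public static List<String[]> patch(List<String[]> edges, Map<String, String> canonical) {
        List<String[]> patched = new ArrayList<>();
        for (String[] edge : edges) {
            String source = edge[0], target = edge[1];
            if (target.equals(canonical.get(source))) {
                continue; // B -> C is dropped: B shall not count as citing C
            }
            String c = canonical.get(target);
            // A -> B becomes A -> C if B has a canonical; otherwise keep A -> B
            patched.add(new String[] { source, c != null ? c : target });
        }
        return patched;
    }
}
```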
Michael Peter Christen
b28d43decc added two more fields, source_cr_host_norm_i and target_cr_host_norm_i,
in the webgraph, and an addition to postprocessing to copy all CR
ranking attributes to the link edges associated with the postprocessed
documents
2013-09-27 16:57:05 +02:00
Michael Peter Christen
3bf0104199 fix for crawl domain counter limitation (limit was reached too early) 2013-09-26 13:41:52 +02:00
Michael Peter Christen
91a875dff5 self-healing of mistakenly deactivated crawl profiles. This fixes a bug
which can happen in rare cases when a crawl start and a cleanup process
happen at the same time.
2013-09-25 18:27:54 +02:00
Michael Peter Christen
e40671ddb7 better and consistent deletions for error URLs 2013-09-17 15:52:57 +02:00
Michael Peter Christen
2602be8d1e - removed the ZURL data structure; also removed the ZURL data file
- replaced load-failure logging with information which is stored in Solr
- fixed a bug with the crawling of feeds: added must-match pattern
application to feed URLs to filter out URLs which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs, are now temporary objects in
memory
2013-09-17 15:27:02 +02:00
Michael Peter Christen
61c5e40687 - replaced the properties object in AnchorURL with distinct variables
for anchor attributes.
- this meant that large portions of the parser code had to be adapted
as well
- added a counter target_order_i for anchor links in the webgraph
computation
2013-09-15 23:27:04 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page, not
only the unique links! This made it necessary to adapt a large portion
of the parser and link-processing classes to carry a different type of
link collection, whose entries carry property attributes that are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other URL classes, DigestURI and MultiProtocolURI, had been
renamed and refactored to fit into a new document package schema,
document.id
- cleanup and refactoring of the net.yacy.cora.document package
2013-09-15 00:30:23 +02:00
Michael Peter Christen
dbef8ccfcb forced deletion of ZURL entries for each host that appears in the crawl
URL list
2013-09-05 13:22:16 +02:00
Michael Peter Christen
e137ff4171 refactoring (in preparation for the new removeHost method) 2013-09-05 09:59:41 +02:00
Michael Peter Christen
69f85265e1 added an option to put image links on the crawl queue and handle them
like normal documents. Using this option (on by default at the moment;
this might change soon) it is possible to get the exif data into the
search index, to be used in image search.
2013-09-03 11:13:45 +02:00
Michael Peter Christen
e4cbe9232d fixed a crawler bug where a double-occurring URL was not re-crawled
because the double-check error was written to the error-db and never
deleted. Now the error-db is cleared on every start and these
double-check messages are not written to the error-db any more.
2013-08-22 15:56:09 +02:00
Michael Peter Christen
89c0aa0e74 added collection_sxt to error documents 2013-07-17 15:20:56 +02:00
Michael Peter Christen
c6a6f159e8 fix for crawl stack domain counter 2013-07-16 18:18:55 +02:00
Michael Peter Christen
5878c1d599 - refactoring of the log to ConcurrentLog:
JDK-based loggers tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a major performance issue. To overcome
this problem, this is an add-on to JDK logging which puts log entries on
a concurrent message queue and logs the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
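A minimal sketch of this queue-based decoupling (not the actual ConcurrentLog implementation): callers only enqueue, which does not block, and a single worker drains the queue and is the only caller of the JDK logger.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.logging.Logger;

// Hypothetical queue-backed logger facade.
public class QueuedLog {
    private static final Logger LOG = Logger.getLogger("QueuedLog");
    private static final BlockingQueue<String> QUEUE = new LinkedBlockingQueue<>();

    static {
        Thread worker = new Thread(() -> {
            try {
                while (true) LOG.info(QUEUE.take()); // only this thread logs
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-worker");
        worker.setDaemon(true);
        worker.start();
    }

    /** non-blocking for callers: just put the message on the queue */
    public static void info(String message) {
        QUEUE.offer(message);
    }
}
```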
Michael Peter Christen
a34e137e27 fix for citation index generation in the case that entry.referrerhash()
is null. This is especially the case when FTP sites are crawled
2013-07-08 16:26:11 +02:00
Michael Peter Christen
57ffdfad4c added a crawl option to obey html-meta-robots-noindex. This is on by
default.
2013-07-03 14:50:06 +02:00
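An illustrative test for the noindex token; this regex is a simplification for the sketch, not YaCy's HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical check: does the document carry <meta name="robots"> with a
// noindex token? If so, and the crawl option is active, keep it out of the index.
public class RobotsMetaCheck {
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=[\"']robots[\"']\\s+content=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    public static boolean noindex(String html) {
        Matcher m = META.matcher(html);
        return m.find() && m.group(1).toLowerCase().contains("noindex");
    }
}
```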
Michael Peter Christen
203921006a redesign of citation index storage 2013-06-30 02:11:46 +02:00
Michael Peter Christen
374d2e2a52 removed warning message during crawling 2013-06-13 13:03:56 +02:00
Michael Peter Christen
44e363f37f refactoring of WorkflowProcessor: added a process counter and an update
of the process counter if a blocking thread dies. Also added a new
column in the PerformanceConcurrency_p servlet to show the actual number
of concurrent processes.
2013-05-13 13:28:07 +02:00