yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	191ec8c82a	added concurrency to postprocess rewrite process	2014-08-04 15:28:58 +02:00
Michael Peter Christen	a1e8bdd5e9	log ppm instead of docs/second	2014-08-04 14:44:42 +02:00
Michael Peter Christen	cc0ded7abd	set process type of web graph according to fields as defined in the schema	2014-08-04 14:44:20 +02:00
Michael Peter Christen	338f574bdc	no sorting if http/www unique fields are not demanded (makes query faster) and some code restructuring	2014-08-04 12:59:38 +02:00
Michael Peter Christen	0ceeceb35e	more logic on Solr queries; usage of the query terms in postprocessing, saving one query for double document detection now per document	2014-08-04 02:35:38 +02:00
orbiter	4099296b45	added new classes which shall reduce call overhead to Solr (stub)	2014-08-03 22:44:22 +02:00
orbiter	3491ab4c38	removed unused images from webgraph edge computation	2014-08-01 13:21:16 +02:00
orbiter	2371d6b8db	target linktexts must be string to enable search facets on these fields	2014-08-01 13:20:25 +02:00
Michael Peter Christen	98f45c9032	fix for image alt attachment to AnchorURLs in html parser.	2014-08-01 12:04:15 +02:00
orbiter	1027f3d04a	fix for the usage of ready-prepared solr queries: some queries are formulated as edismax query but this was not set as query attribute. The defType=edismax property needs a qf-field, so this was added as well. Do not remove that field again! This fixes also a problem with title-unique computation.	2014-07-25 18:53:13 +02:00
Michael Peter Christen	6e1dc444c3	added a snippet test function in ViewFile: you can now search for a specific word on the document; the servlet returns the snippet in the same way as it would be shown in a search result.	2014-07-24 14:59:37 +02:00
Michael Peter Christen	b44626e55b	fixed target_alt_t in webgraph	2014-07-22 18:24:10 +02:00
Michael Peter Christen	504327b15c	fix for condition for writing the webgraph	2014-07-22 00:59:08 +02:00
Michael Peter Christen	4eec1a7452	refactoring (change Metadata name of load time data structure to avoid confusion with Node data which is also called metadata)	2014-07-21 23:54:23 +02:00
reger	f96cfdc84d	prevent array out of bound exception on getRankingProfile(x) on faulty &profileNr= query parameter	2014-07-21 00:04:54 +02:00
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead.	2014-07-18 12:43:01 +02:00
Michael Peter Christen	bf1b6b93e7	do not write CR values to webgraph if no CR values are computed	2014-07-16 18:13:29 +02:00
Michael Peter Christen	8514bffc22	enhanced postprocessing status report	2014-07-16 14:57:25 +02:00
Michael Peter Christen	b5d78ba156	reduced number of solr queries during crawling	2014-07-11 18:05:11 +02:00
Michael Peter Christen	fb3dd56b02	fix for processing of noindex flag in http header	2014-07-10 17:13:35 +02:00
Michael Peter Christen	b0d941626f	fixed bugs in canonical, robots and title/description unique calculation	2014-07-10 15:40:38 +02:00
Michael Peter Christen	1092e798a5	fixed double content postprocessing	2014-07-07 19:15:11 +02:00
Michael Peter Christen	36e623d8bf	enhanced metadata enrichment for media file type search: - Web servers may now deliver YaCy-specific http header fields with a title and keywords. The new http header fields are: X-YaCy-Media-Title - to be used for media (image, audio, video) titles; X-YaCy-Media-Keywords - to be used for media (image, audio, video) keywords - both fields are written to document fields title and keywords and are searched also during image search. - to make the usage of arbitrary http header fields (including these new fields) possible in the /api/push_p.json servlet, a new POST argument is also introduced to push http header fields. The new POST attribute is named "responseHeader-X" (where X is the counter). It is allowed to use this attribute as multi-attribute several times, each can be filled with a http header line. - see /api/push_p.html for examples	2014-06-26 13:02:35 +02:00
Michael Peter Christen	922979aae1	added option to prefer http over https in unique-protocol ranking	2014-06-02 17:40:56 +02:00
Michael Peter Christen	b3b174e2b8	fixed webgraph postprocessing and status display in Crawler_p servlet	2014-06-02 15:06:38 +02:00
Michael Peter Christen	8ad41a882c	fixed several problems with postprocessing: - unique-postprocessing was destroying results from other postprocessings; removed cross-updates as they were not necessary - unique-postprocessing did not restrict on same protocol - inefficient concurrent update cache was redesigned completely - increased limits for concurrent blocking queues to prevent early time-out	2014-05-29 13:24:24 +02:00
Michael Peter Christen	ff5b3ac84d	added new fields http_unique_b and www_unique_b which can be used for ranking to prefer urls containing a www subdomain or using the https protocol	2014-05-27 15:28:28 +02:00
Michael Peter Christen	53948da7d0	tried to make last_modified recognition smarter	2014-05-22 00:28:51 +02:00
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	2014-05-20 21:50:16 +02:00
orbiter	0d8072aa99	removed warnings	2014-05-13 22:29:05 +02:00
orbiter	ccb1864d55	catch IllegalArgumentException for wrong process types (that is needed for migrations when new process types are introduced or disappear)	2014-04-22 23:14:05 +02:00
orbiter	4ee4ba1576	fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of lazy value instantiation of 0-value in crawldepth_i	2014-04-22 19:48:49 +02:00
reger	727dfb5875	refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex	2014-04-20 01:41:30 +02:00
Michael Peter Christen	74ab5ef9fa	increased runtime for postprocessing query job	2014-04-18 06:51:10 +02:00
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	2014-04-17 13:21:43 +02:00
Michael Peter Christen	c2f62e783f	- better subgraph handling, less overhead for crawls without the webgraph - usage of crawler crawldepth cache for the linkgraph target depth computation	2014-04-17 12:54:18 +02:00
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	2014-04-16 22:16:20 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
Michael Peter Christen	8aeef73d49	fix for virtual root nodes	2014-04-11 15:12:34 +02:00
Michael Peter Christen	7c7fbb9818	find depth-matches also for edge targets	2014-04-11 12:27:21 +02:00
Michael Peter Christen	dd12dd392f	introduction of a data structure for HyperlinkEdges which should use less memory as it does no double-storage of source links for each edge of the graph.	2014-04-11 12:09:33 +02:00
Michael Peter Christen	6ea8bb7348	using MultiProtocolURL for edge data which is faster (hash computation is now much easier) and smaller in size	2014-04-11 10:58:37 +02:00
Michael Peter Christen	a37d067692	refactoring	2014-04-10 23:46:35 +02:00
orbiter	95780eed32	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-04-10 21:40:54 +02:00
Michael Peter Christen	67beef657f	strong redesign of html parser: object recursion is now made using a stack on html tag objects, not using a recursive parse-again method which may cause bad performance and huge memory allocation. The new method also produced better parsed image objects with exact anchor text references.	2014-04-10 18:58:03 +02:00
orbiter	67501c9dda	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	2014-04-09 19:58:54 +02:00
Michael Peter Christen	1c21b3256d	fix for robots.txt handling: delete old entry before starting a new crawl.	2014-04-09 18:33:48 +02:00
orbiter	c250fac9f4	linkstructure refactoring to get more options for clickdepth analysis	2014-04-09 17:52:51 +02:00
Michael Peter Christen	8068e68474	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-04-09 12:45:15 +02:00
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	2014-04-09 12:45:04 +02:00

161 Commits
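
The queueing rule described in commit da86f150ab (one queue per host and crawl depth, always taking a url of minimal crawl depth, and staying fair across hosts) can be illustrated with a small sketch. This is not YaCy's HostBalancer: the class name `HostQueueSketch` and its `push`/`pop` methods are hypothetical, everything is kept in memory instead of in files under the QUEUES folder, and the tie-breaking between hosts (round-robin by re-inserting a served host at the end) is an assumption made for the example.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

/** Hypothetical in-memory sketch of per-host, per-depth crawl queues. */
public class HostQueueSketch {

    // host -> (crawl depth -> FIFO queue of urls).
    // The TreeMap keeps depths sorted, so firstKey() is the minimal depth of a host.
    // Invariant: no host maps to an empty TreeMap, no depth maps to an empty queue.
    private final Map<String, TreeMap<Integer, Queue<String>>> hosts = new LinkedHashMap<>();

    /** Enqueue a url for the given host at the given crawl depth. */
    public void push(String host, int depth, String url) {
        hosts.computeIfAbsent(host, h -> new TreeMap<>())
             .computeIfAbsent(depth, d -> new ArrayDeque<>())
             .add(url);
    }

    /** Take the next url with globally minimal crawl depth, or null if nothing is queued. */
    public String pop() {
        String bestHost = null;
        int bestDepth = Integer.MAX_VALUE;
        // Scan hosts in insertion order; the strict '<' keeps the earliest host on ties.
        for (Map.Entry<String, TreeMap<Integer, Queue<String>>> e : hosts.entrySet()) {
            int depth = e.getValue().firstKey();
            if (depth < bestDepth) {
                bestDepth = depth;
                bestHost = e.getKey();
            }
        }
        if (bestHost == null) {
            return null;
        }
        // Remove the host and, if it still has urls, re-insert it at the end
        // so that other hosts with the same minimal depth are served first next time.
        TreeMap<Integer, Queue<String>> depths = hosts.remove(bestHost);
        Queue<String> queue = depths.get(bestDepth);
        String url = queue.poll();
        if (queue.isEmpty()) {
            depths.remove(bestDepth);
        }
        if (!depths.isEmpty()) {
            hosts.put(bestHost, depths);
        }
        return url;
    }
}
```

Because a url at depth d+1 is only handed out once no host has a url left at depth d or lower, the order in which urls are taken is breadth-first across all hosts. The linear scan over hosts in `pop` is chosen for brevity; it says nothing about how the real balancer selects a host.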