yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
reger	b5ca20de15	preserve content_type (mime) if supplied in preference to constructing it from file type. (this eventually can benefit image search by using mime only) reduce redundant field assignment for SolrDocuments created from URIMetadataNode (URIMetadataNode = SolrDocument with partially assigned fields)	2014-10-03 22:08:07 +02:00
reger	fb1fcc2b03	handle noarchive tag, skip writing page to cache http://mantis.tokeek.de/view.php?id=44	2014-10-01 04:35:34 +02:00
Michael Peter Christen	2645dc816a	added warning for not well-formed postprocessing queries	2014-09-18 14:36:57 +02:00
Michael Peter Christen	6d3d4c4ea6	changed the concurrent enumeration of query results in such a way that it is now possible to get the results in two steps: - first retrieve all IDs as given for a query - then retrieve each document individually. This was necessary for very large result sets where a query may run for hours and is possibly terminated by a solr-internal timeout. This occurs regularly during postprocessing and therefore this commit may fix unwanted postprocessing terminations.	2014-09-17 13:58:55 +02:00
Michael Peter Christen	e87dc08c0d	set the correct fail time in error docs	2014-09-05 14:46:11 +02:00
Michael Peter Christen	a7dd89c4de	changed method to write the citation index: do not catch up references during document parsing; instead use the same references that would also be written into the webgraph. That should cause that the webgraph and the citation index express the exact same semantic.	2014-09-02 13:22:12 +02:00
orbiter	d68438c3d9	make sure that the postprocessing background thread never dies by any exception	2014-08-23 10:35:38 +02:00
orbiter	927aaa95a6	concurrency bugfix	2014-08-13 00:59:11 +02:00
reger	f9db5dd6c5	reduce doublecontent check document (prevent out of memory) see http://mantis.tokeek.de/view.php?id=437 test result (concurrency=7) 2000 docs = eom always 1000 docs = eom always 100 docs = eom never chosen -> 200 docs (eom not encountered during test with 1GB mem setting)	2014-08-10 03:18:15 +02:00
reger	a8508417d1	catch NPE during crawl (OAI import) - condenseDocument mime=null (allowed) - collectionconfiguration responseheader = null (allowed)	2014-08-08 00:02:59 +02:00
Michael Peter Christen	6344718f8b	reducing the concurrent query stack size and reduced concurrency of postprocessing to avoid OOM situations	2014-08-06 12:36:59 +02:00
Michael Peter Christen	191ec8c82a	added concurrency to postprocess rewrite process	2014-08-04 15:28:58 +02:00
Michael Peter Christen	a1e8bdd5e9	log ppm instead of docs/second	2014-08-04 14:44:42 +02:00
Michael Peter Christen	cc0ded7abd	set process type of web graph according to fields as defined in the schema	2014-08-04 14:44:20 +02:00
Michael Peter Christen	338f574bdc	no sorting if http/www unique fields are not demanded (makes query faster) and some code restructuring	2014-08-04 12:59:38 +02:00
Michael Peter Christen	0ceeceb35e	more logic on Solr queries; usage of the query terms in postprocessing, saving one query for double document detection now per document	2014-08-04 02:35:38 +02:00
orbiter	4099296b45	added new classes which shall reduce call overhead to Solr (stub)	2014-08-03 22:44:22 +02:00
orbiter	3491ab4c38	removed unused images from webgraph edge computation	2014-08-01 13:21:16 +02:00
orbiter	2371d6b8db	target linktexts must be string to enable search facets on these fields	2014-08-01 13:20:25 +02:00
Michael Peter Christen	98f45c9032	fix for image alt attachment to AnchorURLs in html parser.	2014-08-01 12:04:15 +02:00
orbiter	1027f3d04a	fix for the usage of ready-prepared solr queries, some queries are formulated as edismax query but this was not set as query attribute. The defType=edismax property needs a qf-field, so this was added as well. Do not remove that field again! This fixes also a problem with title-unique computation.	2014-07-25 18:53:13 +02:00
Michael Peter Christen	6e1dc444c3	added a snippet test function in ViewFile: you can now search for a specific word on the document; the servlet returns the snippet in the same way as it would be shown in a search result.	2014-07-24 14:59:37 +02:00
Michael Peter Christen	b44626e55b	fixed target_alt_t in webgraph	2014-07-22 18:24:10 +02:00
Michael Peter Christen	504327b15c	fix for condition for writing the webgraph	2014-07-22 00:59:08 +02:00
Michael Peter Christen	4eec1a7452	refactoring (change Metadata name of load time data structure to avoid confusion with Node data which is also called metadata)	2014-07-21 23:54:23 +02:00
reger	f96cfdc84d	prevent array out of bound exception on getRankingProfile(x) on faulty &profileNr= query parameter	2014-07-21 00:04:54 +02:00
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead.	2014-07-18 12:43:01 +02:00
Michael Peter Christen	bf1b6b93e7	do not write CR values to webgraph if no CR values are computed	2014-07-16 18:13:29 +02:00
Michael Peter Christen	8514bffc22	enhanced postprocessing status report	2014-07-16 14:57:25 +02:00
Michael Peter Christen	b5d78ba156	reduced number of solr queries during crawling	2014-07-11 18:05:11 +02:00
Michael Peter Christen	fb3dd56b02	fix for processing of noindex flag in http header	2014-07-10 17:13:35 +02:00
Michael Peter Christen	b0d941626f	fixed bugs in canonical, robots and title/description unique calculation	2014-07-10 15:40:38 +02:00
Michael Peter Christen	1092e798a5	fixed double content postprocessing	2014-07-07 19:15:11 +02:00
Michael Peter Christen	36e623d8bf	enhanced metadata enrichment for media file type search: - Web servers may now deliver YaCy-specific http header fields with a title and keywords. The new http header fields are: X-YaCy-Media-Title - to be used for media (image, audio, video) titles X-YaCy-Media-Keywords - to be used for media (image, audio, video) keywords - both fields are written to document fields title and keywords and are searched also during image search. - to make the usage of arbitrary http header fields (including these new fields) possible in the /api/push_p.json servlet, a new POST argument is also introduced to push http header fields. The new POST attribute is named "responseHeader-X" (where X is the counter). It is allowed to use this attribute as multi-attribute several times, each can be filled with a http header line. - see /api/push_p.html for examples	2014-06-26 13:02:35 +02:00
Michael Peter Christen	922979aae1	added option to prefer http over https in unique-protocol ranking	2014-06-02 17:40:56 +02:00
Michael Peter Christen	b3b174e2b8	fixed webgraph postprocessing and status display in Crawler_p servlet	2014-06-02 15:06:38 +02:00
Michael Peter Christen	8ad41a882c	fixed several problems with postprocessing: - unique-postprocessing was destroying results from other postprocessings; removed cross-updates as they were not necessary - unique-postprocessing did not restrict on same protocol - inefficient concurrent update cache was redesigned completely - increased limits for concurrent blocking queues to prevent early time-out	2014-05-29 13:24:24 +02:00
Michael Peter Christen	ff5b3ac84d	added new fields http_unique_b and www_unique_b which can be used for ranking to prefer urls containing a www subdomain or using the https protocol	2014-05-27 15:28:28 +02:00
Michael Peter Christen	53948da7d0	tried to make last_modified recognition smarter	2014-05-22 00:28:51 +02:00
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	2014-05-20 21:50:16 +02:00
orbiter	0d8072aa99	removed warnings	2014-05-13 22:29:05 +02:00
orbiter	ccb1864d55	catch IllegalArgumentException for wrong process types (that is needed for migrations when new process types are introduced or disappear)	2014-04-22 23:14:05 +02:00
orbiter	4ee4ba1576	fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of lazy value instantiation of 0-value in crawldepth_i	2014-04-22 19:48:49 +02:00
reger	727dfb5875	refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex	2014-04-20 01:41:30 +02:00
Michael Peter Christen	74ab5ef9fa	increased runtime for postprocessing query job	2014-04-18 06:51:10 +02:00
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	2014-04-17 13:21:43 +02:00
Michael Peter Christen	c2f62e783f	- better subgraph handling, less overhead for crawls without the webgraph - usage of crawler crawldepth cache for the linkgraph target depth computation	2014-04-17 12:54:18 +02:00
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	2014-04-16 22:16:20 +02:00
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such on-demand file reader shall prevent that the number of file pointers is over the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	2014-04-16 21:34:28 +02:00
Michael Peter Christen	8aeef73d49	fix for virtual root nodes	2014-04-11 15:12:34 +02:00
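
The queue selection rule described in commit da86f150ab (one queue per host and crawl depth, always taking a URL of minimal crawl depth, spread fairly over hosts) can be illustrated with a small in-memory sketch. This is not YaCy's implementation, which is written in Java and backed by files in the QUEUES folder; the class name `HostBalancerSketch`, its methods, and the "avoid the host served last" fairness rule are assumptions made only for illustration.

```python
from collections import defaultdict, deque


class HostBalancerSketch:
    """Hypothetical toy model: one FIFO per (host, crawl depth)."""

    def __init__(self):
        # host -> crawl depth -> FIFO of URLs (stand-in for per-host directories
        # with one file per crawl depth; an assumption, not YaCy's data layout)
        self.queues = defaultdict(lambda: defaultdict(deque))
        self.last_host = None

    def push(self, host, depth, url):
        self.queues[host][depth].append(url)

    def pop(self):
        """Return (host, depth, url) with minimal crawl depth, or None if empty."""
        if not self.queues:
            return None
        # smallest pending crawl depth for every host
        min_depth = {host: min(depths) for host, depths in self.queues.items()}
        best = min(min_depth.values())
        hosts = sorted(h for h, d in min_depth.items() if d == best)
        # assumed fairness rule: prefer a host other than the one served last
        host = next((h for h in hosts if h != self.last_host), hosts[0])
        queue = self.queues[host][best]
        url = queue.popleft()
        # drop empty queues so that min() above only sees pending work
        if not queue:
            del self.queues[host][best]
            if not self.queues[host]:
                del self.queues[host]
        self.last_host = host
        return host, best, url
```

Because every pop takes the shallowest pending depth across all hosts, URLs are handed out in non-decreasing crawl depth as long as no shallower URL is pushed in between, which is the property the commit message relates to the click depth.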

172 Commits