yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	f901e7d3cf	fix for non-authorized view of IndexBrowser: show only the number of non-failure documents	2015-06-30 11:12:36 +02:00
Michael Peter Christen	fed26f33a8	enhanced timezone managament for indexed data: to support the new time parser and search functions in YaCy a high precision detection of date and time on the day is necessary. That requires that the time zone of the document content and the time zone of the user, doing a search, is detected. The time zone of the search request is done automatically using the browsers time zone offset which is delivered to the search request automatically and invisible to the user. The time zone for the content of web pages cannot be detected automatically and must be an attribute of crawl starts. The advanced crawl start now provides an input field to set the time zone in minutes as an offset number. All parsers must get a time zone offset passed, so this required the change of the parser java api. A lot of other changes had been made which corrects the wrong handling of dates in YaCy which was to add a correction based on the time zone of the server. Now no correction is added and all dates in YaCy are UTC/GMT time zone, a normalized time zone for all peers.	2015-04-15 13:17:23 +02:00
Michael Peter Christen	535f1ebe3b	added a new way of content browsing in search results: - date navigation The date is taken from the CONTENT of the documents / web pages, NOT from a date submitted in the context of metadata (i.e. http header or html head form). This makes it possible to search for documents in the future, i.e. when documents contain event descriptions for future events. The date is written to an index field which is now enabled by default. All documents are scanned for contained date mentions. To visualize the dates for a specific search results, a histogram showing the number of documents for each day is displayed. To render these histograms the morris.js library is used. Morris.js requires also raphael.js which is now also integrated in YaCy. The histogram is now also displayed in the index browser by default. To select a specific range from a search result, the following modifiers had been introduced: from:<date> to:<date> These modifiers can be used separately (i.e. only 'from' or only 'to') to describe an open interval or combined to have a closed interval. Both dates are inclusive. To select a specific single date only, use the 'to:' - modifier. The histogram shows blue and green lines; the green lines denot weekend days (saturday and sunday). Clicking on bars in the histogram has the following reaction: 1st click: add a from:<date> modifier for the date of the bar 2nd click: add a to:<date> modifier for the date of the bar 3rd click: remove from and date modifier and set a on:<date> for the bar When the on:<date> modifier is used, the histogram shows an unlimited time period. This makes it possible to click again (4th click) which is then interpreted as a 1st click again (sets a from modifier). The display feature is NOT switched on by default; to switch it on use the /ConfigSearchPage_p.html servlet.	2015-03-02 04:30:10 +01:00
reger	9e1ec5fec4	refactor: just some more useages of constant for term ":[* TO *]"	2015-02-01 04:26:33 +01:00
reger	0260d3d800	Allow to hide linkstructure graphic in crawl monitor using/setting the config param DECORATION_GRAFICS_LINKSTRUCTURE	2015-01-28 03:59:01 +01:00
reger	ff18129def	ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to archive it, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded, parsed anyway, so it's not a expensive operation) Request: remove 2 unused init parameter - number of anchors of the parent - forkfactor sum of anchors of all ancestors	2014-12-05 01:13:37 +01:00
Michael Peter Christen	a0b84e4def	use a LinkedHashMap for factes to maintain facet order as given by solr	2014-11-20 18:44:29 +01:00
Michael Peter Christen	95d87f00b3	fix for bad query generation in doublecheck in postprocessing	2014-11-07 18:11:23 +01:00
orbiter	dbafd4865e	enhanced debug code in host browser	2014-10-30 15:47:44 +01:00
Michael Peter Christen	26279b0993	added debug code for statistics about document attributes related to domains	2014-10-29 10:50:08 +01:00
Michael Peter Christen	75b5f24be4	make browsing of file://z: - paths in index browser easier - this will now show the root paths on a shared drive	2014-10-13 18:33:39 +02:00
Michael Peter Christen	8c1a89cb34	added another decoration flag to switch off network graphics in crawler monitor and index browser: decoration.grafics.linkstructure Please set this to false to remove the graphics from the interface.	2014-10-08 17:12:35 +02:00
Michael Peter Christen	9b1958e8ca	more ipv6 bugfixes	2014-10-08 15:21:49 +02:00
Michael Peter Christen	6d3d4c4ea6	changed the concurrent enumeration of query results in such a way that it is now possible to get the results in two steps: - first retrieve all IDs as given for a query - then retieve each document individually This was necessary for very large result sets where a query may run for hours and is possibly terminated by a solr-internal timeout. This occurs regulary during postprocessing and therefore this commit may fix unwanted postprocessing terminations.	2014-09-17 13:58:55 +02:00
orbiter	500e0b9927	fix for browsing of file paths in Index Browser	2014-08-27 00:03:24 +02:00
orbiter	22ce4fb4dd	better error handling for remote solr queries and exists-checks	2014-08-01 11:00:10 +02:00
Michael Peter Christen	b5fc2b63ea	removed exist() retrieval functions from error cache and replaced it with metadata retrieval from connectors directly. This should cause better usage of the cache. Automatically increase the metadata cache if more memory is available.	2014-07-11 19:52:25 +02:00
Michael Peter Christen	62c72360ee	cleanup of checkAcceptanceInitially in CrawlStacker, should avoid double-calling of solr	2014-07-11 18:36:04 +02:00
orbiter	4ee4ba1576	fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of lazy value instantiation of 0-value in crawldepth_i	2014-04-22 19:48:49 +02:00
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	2014-04-17 13:21:43 +02:00
Michael Peter Christen	b4b0d14c04	fix for display bug	2014-04-16 22:24:04 +02:00
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	2014-04-16 22:16:20 +02:00
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	2014-04-09 12:45:04 +02:00
Michael Peter Christen	a6bb9be97e	- added d3.js for visualizations using embedded svg - added a servlet api/linkstructure.json which generates a link graph information in json - added a javascript link graph renderer hypertree.js using d3 and the new servlet linkstructure.json - embedded the new link graph in the crawler monitor and the host browser	2014-04-03 14:51:19 +02:00
Michael Peter Christen	cca851a417	introduced new solr field crawldepth_i which records the crawl depth of a document. This is the upper limit for the clickdepth_i value which may be shorter in case that the crawler did not take the shortest path to the document.	2014-04-02 23:37:01 +02:00
Michael Peter Christen	ff82a80eb3	Integrated HostBrowser back to administration interface; it can appear with and without navigation bar.	2014-03-31 18:19:24 +02:00
Michael Peter Christen	fda591695c	fixed visibility of custom icon	2014-03-28 17:25:39 +01:00
Michael Peter Christen	8b44fcf0f4	added missing @Override annotation	2014-03-28 13:48:37 +01:00
orbiter	3c8d6e1eee	added adminAccount switch to ConfigAccounts_p servlet to switch on protection of all pages; some refactoring as well	2014-03-20 22:11:49 +01:00
Michael Peter Christen	92655c7fd9	- added bootstrap css framework - adopted all YaCy administration pages to new framework - created new search page layout (working, but still work in progress) - old skin files are fully appliable! (and looking good) - target is a new style based on bootstrap examples, see /test.html - icons in YaCy may be replaced by glyphicons (to be done)	2014-03-18 13:42:31 +01:00
Michael Peter Christen	b08375da33	fix for bad/missing values of size_i	2014-03-11 09:51:04 +01:00
Michael Peter Christen	51800007c4	- added concurrency to postprocessing of webgraph document - bundeled separate webgraph postprocesing steps into one	2014-03-06 01:43:48 +01:00
Michael Peter Christen	fdaeac374a	- enhanced postprocessing speed and memory footprint (by using HashMaps instead of TreeMaps) - enhanced memory footprint of database indexes (by introduction of optimize calls) - optimize calls shrink the amount of used memory for index sets if they are not changed afterwards any more	2014-02-28 14:01:09 +01:00
Michael Peter Christen	0f6b72f24b	do not use luke requests for remote solr servers if the result is different from normal requests. This happens if the remote solr is actually a solrCloud; in such cases the luke request returns only the result of the single solr peer, not the whole cloud. also done: some refactoring.	2014-02-26 14:30:48 +01:00
orbiter	da5d4128bf	prevent npe	2014-02-25 03:26:20 +01:00
orbiter	f6e441dd77	refactoring	2014-02-24 21:01:56 +01:00
orbiter	c3f6c06f2c	removed host increment on stored documents from crawler (that was wrong)	2014-02-24 20:02:15 +01:00
Michael Peter Christen	69391e5d9e	changed strategy to test existence of documents in Solr: using the update time. The reason for that is a better caching for the crawler double-check, which needs the update time for crawler steering.	2014-02-19 04:03:45 +01:00
Michael Peter Christen	8b14e92ba4	added button in host browser to re-load 404/failed documents	2014-01-23 15:56:36 +01:00
Michael Peter Christen	234ca720f5	only admins should be able to force a commit	2013-11-26 07:03:20 +01:00
Michael Peter Christen	64048ff217	fir for XSS	2013-11-22 09:53:32 +01:00
Michael Peter Christen	ffe8276063	replaced referrer link masking to 'pure' links to the referring page (that was more useful during testing)	2013-11-06 18:05:46 +01:00
Michael Peter Christen	434e13b46d	in host browser also show the properties of failed documents including referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)	2013-11-01 13:30:53 +01:00
orbiter	1ac504ae51	use html encoding for urls in metadata	2013-10-31 16:16:29 +01:00
Michael Peter Christen	2602be8d1e	- removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs are now temporary objects in memory	2013-09-17 15:27:02 +02:00
Michael Peter Christen	5e31bad711	- the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary, that a large portion of the parser and link processing classes must be adopted to carry a different type of link collection which carry a property attribute which are attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI had been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring	2013-09-15 00:30:23 +02:00
Michael Peter Christen	e137ff4171	refactoring (im preparation for new removeHost method)	2013-09-05 09:59:41 +02:00
orbiter	f106345eef	link strings should not be tokenized	2013-09-01 14:35:36 +02:00
Michael Peter Christen	76afcccaaf	fix for default boolean post values: the default value MUST NOT be TRUE, because it's normal that a boolean value is missing in the post argument if a checkbox is not selected. Added also some style enhancements to IndexFederated, removed the Solr attachment manual and replaced it with a link to the wiki which explains this in more detail.	2013-07-31 10:49:26 +02:00
Michael Peter Christen	4c242f9af9	always use a default value for boolean options to have transparency for the outcome if the attribute is missing in servlets	2013-07-25 12:17:29 +02:00

1 2

100 Commits