yacy_search_server/defaults/solr.collection.schema
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new Solr fields in the core 'webgraph'.
The default schema uses only some of them, and the resulting search index
now has the following properties:
- the webgraph core will have about 40 times as many entries as the default
index
- the complete index size will increase and may be about double the
current size
As testing showed, not much indexing performance is lost. The default
index will be smaller (fields were moved out of it); thus searching
can be faster.
The new index makes it possible to remove some old parts of YaCy,
i.e. the specialized webgraph data and the noload crawler. The new index
will also make it possible to:
- search within link texts of linked but not indexed documents (about 20
times the size of the document index!!)
- get a very detailed link graph
- enhance ranking using a complete link graph

To get full access to the new index, the Solr API now has two
access points: one with attribute core=collection1 for the default
search index and one with core=webgraph for the new webgraph search index.
This is also available for p2p operation, but client access is not yet
implemented.
2013-02-22 15:45:15 +01:00
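
As a rough illustration of the two access points named in the commit message, the sketch below builds a select URL that differs only in the `core` attribute. The host, port, path (`http://localhost:8090/solr/select`) and the `q` and `rows` parameters are assumptions made for this example; the commit message only states the `core=collection1` and `core=webgraph` attributes. The helper name `build_select_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed address of a local peer's Solr select endpoint (not given in the commit message).
BASE_URL = "http://localhost:8090/solr/select"

# The two cores named in the commit message.
KNOWN_CORES = ("collection1", "webgraph")


def build_select_url(core, query, rows=10, base_url=BASE_URL):
    """Return a select URL that addresses one core via the 'core' attribute."""
    if core not in KNOWN_CORES:
        raise ValueError("unknown core: %r" % (core,))
    # 'q' and 'rows' are standard Solr select parameters; urlencode escapes the query.
    params = [("core", core), ("q", query), ("rows", str(rows))]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Default fulltext index versus the new webgraph index.
    print(build_select_url("collection1", "text_t:yacy"))
    print(build_select_url("webgraph", "*:*"))
```

Keeping the core name as an explicit argument makes it hard to query the wrong index by accident, since an unknown core raises an error instead of silently falling back to a default.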

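
The schema file shown below states its own line syntax in its header comments: lines starting with `##` are comments, lines starting with a single `#` are commented-out keywords, and all other non-empty lines are active keywords. A minimal sketch of a reader for that syntax follows. It is not YaCy's own parser; the function name `parse_schema` is hypothetical, and ignoring a line that consists of a lone `#` is an assumption, since the file does not cover that case.

```python
def parse_schema(lines):
    """Split schema lines into (enabled, disabled) lists of field names."""
    enabled, disabled = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("##"):
            # Blank line or comment ('###' section headers also start with '##').
            continue
        if line.startswith("#"):
            # Commented-out keyword line: single '#' followed by the field name.
            name = line[1:].strip()
            if name:  # assumption: a lone '#' carries no keyword and is skipped
                disabled.append(name)
        else:
            enabled.append(line)
    return enabled, disabled


if __name__ == "__main__":
    sample = [
        "### mandatory values, do not disable them",
        "## primary key of document, the URL hash, string (mandatory field)",
        "id",
        "## flag shows if title is unique in the whole index",
        "#title_unique_b",
        "",
    ]
    enabled, disabled = parse_schema(sample)
    print("enabled:", enabled)
    print("disabled:", disabled)
```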

## this is a list of all solr keys for the default index 'collection1', the fulltext search index
## this complete list of keys can be changed; the actual schema is stored in:
## DATA/SETTINGS/solr.collection.schema
## the syntax of this file:
## - all lines beginning with '##' are comments
## - all non-empty lines not beginning with '#' are keyword lines
## - all lines beginning with '#' and where the second character is not '#' are commented-out keyword lines
### mandatory values, do not disable them, YaCy won't work without them
## primary key of document, the URL hash, string (mandatory field)
id
## url of document, string (mandatory field)
sku
## last-modified from http header, date (mandatory field)
last_modified
## mime-type of document, string (mandatory field)
content_type
## content of title tag, text (mandatory field)
title
## flag shows if title is unique in the whole index; if yes and another document appears with same title, the unique-flag is set to false, boolean
#title_unique_b
## id of the host, a 6-byte hash that is part of the document id (mandatory field)
host_id_s
## the md5 of the raw source (mandatory field)
md5_s
## the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of text_t
exact_signature_l
## flag shows if exact_signature_l is unique at the time of document creation, used for double-check during search
exact_signature_unique_b
## 64 bit of the Lookup3Signature from EnhancedTextProfileSignature of text_t
fuzzy_signature_l
## intermediate data produced in EnhancedTextProfileSignature: a list of word frequencies
#fuzzy_signature_text_t
## flag shows if fuzzy_signature_l is unique at the time of document creation, used for double-check during search
fuzzy_signature_unique_b
## the size of the raw source (mandatory field)
size_i
## fail reason if a page was not loaded. if the page was loaded then this field is empty, text (mandatory field)
failreason_t
## fail type if a page was not loaded. This field is either empty, 'excl' or 'fail'
failtype_s
## http status return code (e.g. "200" for ok), -1 if not loaded (see content of failreason_t for this case), int (mandatory field)
httpstatus_i
## redirect url if the error code is 299 < httpstatus_i < 310
#httpstatus_redirect_s
## number of unique http references; used for ranking
references_i
## depth of web page according to number of clicks from the 'main' page, which is the page that appears if only the host is entered as url
clickdepth_i
## needed (post-)processing steps on this metadata set
process_sxt
### optional but highly recommended values, part of the index distribution process
## time when resource was loaded
load_date_dt
## date until resource shall be considered as fresh
fresh_date_dt
## ids of referrer to this document
referrer_id_txt
## the name of the publisher of the document
publisher_t
## the language used in the document
language_s
## number of links to audio resources
audiolinkscount_i
## number of links to video resources
videolinkscount_i
## number of links to application resources
applinkscount_i
### optional but highly recommended values, not part of the index distribution process
## tags that are attached to crawls/index generation to separate the search result into user-defined subsets
collection_sxt
## geospatial point in degrees of latitude,longitude as declared in WGS84, location; this creates two additional subfields, coordinate_p_0_coordinate (latitude) and coordinate_p_1_coordinate (longitude)
coordinate_p
## content of author-tag, text
author
## content of description-tag, text
description
## flag shows if description is unique in the whole index; if yes and another document appears with same description, the unique-flag is set to false, boolean
#description_unique_b
## content of keywords tag; words are separated by space
keywords
## character encoding, string
charset_s
## number of words in visible area, int
wordcount_i
## total number of inbound links, int
inboundlinkscount_i
## number of inbound links with nofollow tag, int
inboundlinksnofollowcount_i
## total number of outbound (external) links, int
outboundlinkscount_i
## number of external links with nofollow tag, int
outboundlinksnofollowcount_i
## number of images, int
imagescount_i
## response time of target server in milliseconds, int
responsetime_i
## all visible text, text
text_t
## additional synonyms to the words in the text
synonyms_sxt
## h1 header
h1_txt
## h2 header
h2_txt
## h3 header
h3_txt
## h4 header
h4_txt
## h5 header
h5_txt
## h6 header
h6_txt
### optional values, not part of standard YaCy handling (but useful for external applications)
## ip of host of url (after DNS lookup), string
#ip_s
## tags of css entries, normalized with absolute URL
#css_tag_txt
## urls of css entries, normalized with absolute URL
#css_url_txt
## number of css entries, int
#csscount_i
## urls of script entries, normalized with absolute URL
#scripts_txt
## number of script entries, int
#scriptscount_i
## encoded as binary value into an integer:
## bit 0: "all" contained in html header meta
## bit 1: "index" contained in html header meta
## bit 2: "noindex" contained in html header meta
## bit 3: "nofollow" contained in html header meta
## bit 8: "noarchive" contained in http header properties
## bit 9: "nosnippet" contained in http header properties
## bit 10: "noindex" contained in http header properties
## bit 11: "nofollow" contained in http header properties
## bit 12: "unavailable_after" contained in http header properties
## content of <meta name="robots" content=#content#> tag and the "X-Robots-Tag" HTTP property
#robots_i
## content of <meta name="generator" content=#content#> tag, text
#metagenerator_t
## internal links, normalized (absolute URLs), as <a> - tag with anchor text and nofollow
#inboundlinks_tag_txt
## internal links, only the protocol
inboundlinks_protocol_sxt
## internal links, the url only without the protocol
inboundlinks_urlstub_txt
## external links, normalized (absolute URLs), as <a> - tag with anchor text and nofollow
#outboundlinks_tag_txt
## external links, only the protocol
outboundlinks_protocol_sxt
## external links, the url only without the protocol
outboundlinks_urlstub_txt
## all image tags, encoded as <img> tag including the alt and title properties
#images_tag_txt
## all image links without the protocol and '://'
#images_urlstub_txt
## all image link protocols
#images_protocol_sxt
## alt texts of all image links
#images_alt_txt
## number of image links with alt tag
#images_withalt_i
## binary pattern for the existence of h1..h6 headlines, int
#htags_i
## url inside the canonical link element, string
#canonical_t
## flag shows if the url in canonical_t is equal to sku, boolean
#canonical_equal_sku_b
## link from the url property inside the refresh link element, string
#refresh_s
## all texts in <li> tags
#li_txt
## number of <li> tags, int
#licount_i
## all texts inside of <b> or <strong> tags. no duplicates. listed in decreasing order of number of occurrences
bold_txt
## number of occurrences of texts in bold_txt
#bold_val
## total number of occurrences of <b> or <strong>, int
#boldcount_i
## all texts inside of <i> tags. no duplicates. listed in decreasing order of number of occurrences
italic_txt
## number of occurrences of texts in italic_txt
#italic_val
## total number of occurrences of <i>, int
#italiccount_i
## all texts inside of <u> tags. no duplicates. listed in decreasing order of number of occurrences
underline_txt
## number of occurrences of texts in underline_txt
#underline_val
## total number of occurrences of <u>, int
#underlinecount_i
## flag that shows if a swf file is linked, boolean
#flash_b
## list of all links to frames
#frames_txt
## number of attr_frames, int
#framesscount_i
## list of all links to iframes
#iframes_txt
## number of attr_iframes, int
#iframesscount_i
## the protocol of the url
url_protocol_s
## all path elements in the url
url_paths_sxt
## the file name extension
url_file_ext_s
## number of key-value pairs in search part of the url
#url_parameter_i
## the keys from key-value pairs in the search part of the url
#url_parameter_key_sxt
## the values from key-value pairs in the search part of the url
#url_parameter_value_sxt
## number of all characters in the url == length of sku field
url_chars_i
## host of the url, string
host_s
## the Domain Class Name, either the TLD or a combination of ccSLD+TLD if a ccSLD is used.
#host_dnc_s
## either the second level domain or, if a ccSLD is used, the third level domain
host_organization_s
## the organization and dnc concatenated with '.'
#host_organizationdnc_s
## the remaining part of the host without organizationdnc
#host_subdomain_s
## number of titles (counting the 'title' field) in the document
#title_count_i
## number of characters for each title
#title_chars_val
## number of words in each title
#title_words_val
## number of descriptions in the document. This does not count the 'description' field, since there is only one; it counts the number of descriptions that appear in the document (if any)
#description_count_i
## number of characters for each description
#description_chars_val
## number of words in each description
#description_words_val
## number of h1..h6 header lines
#h1_i
#h2_i
#h3_i
#h4_i
#h5_i
#h6_i
## breadcrumbs, see http://schema.org/WebPage; this counts how many itemprop="breadcrumb" properties in div tags appear within a page
#schema_org_breadcrumb_i
## Open Graph Metadata fields, see http://ogp.me/ns#
#opengraph_title_t
#opengraph_type_s
#opengraph_url_s
#opengraph_image_s
## names of cms attributes; if several are recognized then they are listed in decreasing order of number of matching criteria
#ext_cms_txt
## number of attributes that count for a specific cms in attr_cms
#ext_cms_val
## names of ad-servers/ad-services
#ext_ads_txt
## number of attribute counts in attr_ads
#ext_ads_val
## names of recognized community functions
#ext_community_txt
## number of attribute counts in attr_community
#ext_community_val
## names of map services
#ext_maps_txt
## number of attribute counts in attr_maps
#ext_maps_val
## names of tracker servers
#ext_tracker_txt
## number of attribute counts in attr_tracker
#ext_tracker_val
## names matching title expressions
#ext_title_txt
## number of matching title expressions
#ext_title_val
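
The bit layout documented for the robots_i field in this file (bits 0 to 3 for the html header meta tag, bits 8 to 12 for the http header properties) can be decoded with a few bit tests. The sketch below is illustrative only: the function name `decode_robots` and the `meta:`/`header:` label prefixes are hypothetical, and only the bit positions and directive names are taken from the comments of this schema.

```python
# Bit positions as listed in the robots_i comment block of the schema.
ROBOTS_BITS = {
    0: "meta:all",
    1: "meta:index",
    2: "meta:noindex",
    3: "meta:nofollow",
    8: "header:noarchive",
    9: "header:nosnippet",
    10: "header:noindex",
    11: "header:nofollow",
    12: "header:unavailable_after",
}


def decode_robots(value):
    """Return the labels of all bits set in a robots_i integer, lowest bit first."""
    return [label for bit, label in sorted(ROBOTS_BITS.items()) if value & (1 << bit)]


def encode_robots(labels):
    """Inverse helper: combine labels back into the integer bit pattern."""
    by_label = {label: bit for bit, label in ROBOTS_BITS.items()}
    value = 0
    for label in labels:
        value |= 1 << by_label[label]
    return value


if __name__ == "__main__":
    # 'noindex' in the meta tag (bit 2) plus 'nofollow' in the http header (bit 11).
    value = (1 << 2) | (1 << 11)
    print(value, decode_robots(value))
```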