Commit Graph

38 Commits

Author SHA1 Message Date
Michael Peter Christen
43f3345c90 - removed dependencies from URIMetadataRow and made direct access to
URIMetadataNode which creates the opportunity to access Solr objects
directly and use their information richness
- lazy initialization of the URIMetadataNode object - should cause less
computation and memory usage during search.
- removed dead code
2012-10-16 18:11:57 +02:00
Michael Peter Christen
7e3e45fd04 added Open Graph Metadata default fields, see http://ogp.me/ns# 2012-10-09 17:28:48 +02:00
Michael Peter Christen
c3e5f667a7 added schema.org breadcrumb counter to parser and solr schema 2012-10-09 13:02:43 +02:00
Michael Peter Christen
3d33a5bdf6 turned the synonyms_t Text field into a multi-valued String field
synonyms_sxt
2012-10-02 11:13:06 +02:00
Michael Peter Christen
3b959ee002 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-02 10:14:09 +02:00
orbiter
3190347814 added a synonyms_t field to solr and a process to read synonym files.
This can be used to add another stemming to solr using stemming files
that are expressed as synonyms for grammatical alternatives. The
synonym/stemming files must have the following form:
- each line is a comma-separated list of synonyms
- the list of synonyms may be enclosed with {} (like the GSA synonyms
file)
- the file may contain comments which are lines starting with a '#'
The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and
are activated by default whenever a synonym file is in place.
Then, for each word that is found in a document all synonyms are added
to a long text field which is stored into synonyms_t. Processes using
the synonyms must query with that field as optional matcher.
2012-10-02 00:02:50 +02:00
Michael Peter Christen
411d0e839b added an underline text field to solr to record all underlined texts 2012-10-01 14:16:49 +02:00
Michael Peter Christen
f45f7fc12e added new Host Browser to main menu:
this new search interface is something completely new for search, but
completely common on desktops: browser a web space like one would browse
a file system in a file browser. The file listing is created using the
search index and a faceted restriction to specific domains.
2012-09-28 22:45:16 +02:00
Michael Peter Christen
562183932b - removed ip_s from default profile since that needs a DNS lookup to
create an document entry. This makes remote search much slower.
- removed synchronization of add method if ip_s is activated to prevent
that a user configuration causes bad behavior. The disadvantage of that
is, that a index dump can cause data loss if an indexing is running
during index dump
- catched more exceptions and more NPE
- better abstraction in MirrorSolrConnector
- slight performance enhancement when only the index count is requested
(rows=0 is sufficient to get a total count)
2012-09-26 13:38:04 +02:00
Michael Peter Christen
62add1d564 added the protocol and the file name extension to the solr fields since
these fields are probably facets in file search
2012-09-11 22:46:39 +02:00
Michael Peter Christen
9db032664e activate two solr fields which will be used by administration interface
(later)
2012-09-11 20:15:54 +02:00
Michael Peter Christen
10b911eed4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-07 22:07:02 +02:00
Michael Peter Christen
be67c70a47 added Solr fields:
inboundlinks_text_chars_val
inboundlinks_text_words_val
inboundlinks_alttag_txt
outboundlinks_text_chars_val
outboundlinks_text_words_val
outboundlinks_alttag_txt
2012-09-07 22:06:51 +02:00
orbiter
d73fff0e0e added solr field images_withalt_i 2012-09-07 21:33:45 +02:00
Michael Peter Christen
ee23fc7a32 added h1..h6 counter fields 2012-09-04 14:11:11 +02:00
Michael Peter Christen
b2b516cc3e added a collection attribute to crawls and searches:
- a solr field collection_sxt can be used to store a set of crawl tags
- when this field is activated, a crawl tag can be assigned when crawls
are started
- the content of the collection field can be comma-separated, all of
them are assigned to the documents when they are indexed as result of
such a crawl start
- a search result can be drilled down to a specific collection; this is
currently only available in the solr interface and also in the gsa
interface using the 'site' option
- this adds a mandatory field for gsa queries (the google api demands
that field all the time)
2012-09-03 15:26:08 +02:00
Michael Peter Christen
528d6763fa - added new solr fields:
title_count_i, title_chars_val, title_words_val
description_count_i, description_chars_val, description_words_val
- added many asserts to ensure data type correctness from YaCy to Solr
and vice versa
- made many fixes according to new findings from these asserts (!)
2012-08-31 10:30:43 +02:00
Michael Peter Christen
2ddc33646a added new field for solr:
url_paths_sxt
url_parameter_i
url_parameter_key_sxt
url_parameter_value_sxt
url_chars_i
2012-08-29 16:11:23 +02:00
Michael Peter Christen
316b5fe116 - added a solr type definition verifier
- fixed type definition found by the verifier
- added multivalue-string fields for solr with extension 'sxt'
- added multivalue-integer fields for solr with extension 'val'
- renamed some solr attributes from txt to sxt
- changed solr query line to an explicit AND/OR structure
- added a country code second level domain list to Domains class; with
parser
- added a host string parser to get domain class name, country-code
second-level domain and subdomain out of it
- removed old coordinate attributes
2012-08-28 16:58:06 +02:00
Michael Peter Christen
e8acd542b5 - added faceted drill-down for host and geolocation to solr queries
- added a new geolocation field to index schema, the old values are
migrated if possible
2012-08-27 14:41:33 +02:00
orbiter
716ea0cfe2 sorted the solr schema into mandatory and optional fields; reduced
number of used field to reduce solr index size
2012-08-21 23:52:56 +02:00
Michael Peter Christen
f9c0e6e950 - Implemented and integrated the URIMetadataNode object which is a
metadata representation from the solr index. This shall replace metadata
from the built-in database in the future.
- added the Solr-driven metadata into the search index of YaCy which
makes it now possible to run YaCy without the old metadata index. This
is a major stept forward to a full migration to Solr.
2012-08-10 13:26:51 +02:00
Michael Peter Christen
bca4a16603 replaced the multivalue generic string field name suffix _ss by _txt
because _ss is not part of the standard solr example schema.
2012-08-06 17:58:09 +02:00
orbiter
d9173ba7ed added more solr fields to integrate values from URIMetadataRow. All
writings to the Metadata-DB are now also done to solr. This includes
metadata transfer during search and rwi transfer.

The new/added solr fields are:

## time when resource was loaded
load_date_dt

## date until resource shall be considered as fresh
fresh_date_dt

## id of the host, a 6-byte hash that is part of the document id
host_id_s

## ids of referrer to this document
referrer_id_ss

## the md5 of the raw source
md5_s

## the name of the publisher of the document
publisher_t

## the language used in the document; starts with primary language
language_ss

## an external ranking value
ranking_i

## the size of the raw source
size_i

## number of links to audio resources
audiolinkscount_i

## number of links to video resources
videolinkscount_i

## number of links to application resources
applinkscount_i
2012-08-05 15:49:27 +02:00
Michael Peter Christen
508a81b86c added solr field 'refresh_s' which stores the refresh url contained in
the meta-refresh html header field.
2012-06-28 13:27:45 +02:00
Michael Peter Christen
3fd4a01286 added option to record urls that are forwarded to the solr index 2012-06-26 13:54:48 +02:00
Michael Peter Christen
8bee1472c9 there is no noindex, only nofollow in links 2012-01-31 23:46:35 +01:00
Michael Peter Christen
987b412491 updated solr scheme: generic declaration of solr schemes 2012-01-13 11:25:15 +01:00
orbiter
1b45e33f04 added robots tag parser to solr scheme
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7986 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 13:39:01 +00:00
orbiter
2c3161b4ac refactoring:
RankingProcess -> RWIProcess
ResultFetcher -> SnippetProcess


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7974 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-26 21:42:28 +00:00
orbiter
1c007188ad bugfixes in html parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7912 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-31 16:02:06 +00:00
orbiter
5dd2efc9a2 - bugfixes in html parser
- new fields in solr
- extended file viewer to debug parser

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7897 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-08-25 15:52:25 +00:00
orbiter
e7c7598923 docfix
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7828 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-04 10:48:01 +00:00
orbiter
b84089ff04 fix for solr scheme list definition
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7826 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-03 22:59:43 +00:00
orbiter
2d4bb139d3 - added counting of links with noindex tag for solr index
- bugfixes for solr index

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7820 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-03 06:40:05 +00:00
orbiter
bda3eec0ff added parsing of canonical link element to html parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7812 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-07-01 16:38:01 +00:00
orbiter
b6f09a475d - added an index profile editor in the /indexFederated_p.html servlet for solr indexes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7811 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-30 15:49:21 +00:00
orbiter
6deef60bc0 added keyword list for solr index attributes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7807 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-29 15:33:27 +00:00