Commit Graph

373 Commits

Author SHA1 Message Date
Michael Peter Christen
eac9650b31 added another solr field clickdepth_i which reflects the number of
clicks which are necessary to get from the portal of a host to a
specific document. At this time, only the start document is flagged with
clickdepth '0', all other with '-1'. To get the actual clickdepth, a
process must use crawled information to collect the actual number of
clicks. This will be added in another/next step.
2012-12-18 17:20:42 +01:00
Michael Peter Christen
1052263af3 - added a new solr field references_i which stores the number of
INCOMING links to the corresponding web page. This information is taken
from the reverse link index (a 'little sister' of the RWI index).
- this field can be of use to enhance the ranking because a web page
with more incoming links can be more more important than others. But
this is not true for typical link pages like menues. Therefore the
number of outgoing links is needed.
- added a new solr attribute 'bf' to solr queries which is a boost
function extension. this field can contain a formula which comuptes the
boost according to given field values. After some experiments the
following forumla is now default:
div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4
This takes the number of references and the inbound links. Further
experiments are needed to enhance that forumula.
2012-12-18 14:42:35 +01:00
Michael Peter Christen
34f8786508 removed dependency of vocabulary navigation from Jena and it's
triplestore; the vocabulary search is now done using generic solr fields
which are created on-the-fly during runtime.
2012-12-18 02:29:03 +01:00
Michael Peter Christen
fb0fa9a102 - fixed 'delete from subpath' during crawl start which deleted nothing;
now works;
- changed some crawl start html design details
2012-12-11 13:38:28 +01:00
orbiter
a4a780b871 - fix for bad url conversion in bookmarks when using smb urls
- fix for localhost hosts in solr schema host handling
2012-12-10 07:22:42 +01:00
Michael Peter Christen
72f165d58b added a Boost class which stores solr query boost values. The class can
be configured using the yacy.init file. The boost information is taken
from the configuration each time when a query to solr is done.
2012-12-02 16:54:29 +01:00
Michael Peter Christen
8fc3679c66 using more pre-compile pattern for split methods 2012-11-26 13:11:55 +01:00
Michael Peter Christen
b7004043ea - added a field cache for solr queries which call only for a single
value
- fixed a version conflict exception within a solr add request
2012-11-24 22:30:05 +01:00
Michael Peter Christen
efd2c4622d added a new fail type attribute for the index to distinguish two
separate fail types: network fail and forced exclusion (i.e. by robots
or forwarding rules).
2012-11-23 14:00:30 +01:00
Michael Peter Christen
4eab3aae60 removed overhead by preventing generation of full search results when
only the url is requested
2012-11-23 01:35:28 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignatue.
As a result, a signature of the document is written to the solr search
index. Additionally for each time when a signature is written, it is
checked if the singature exists already in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because here the first time a long number is used as value in the Solr
store, also the value naming in the YaCySchema had to be adopted and
normalized. This caused that many files had to be changed.
2012-11-21 18:46:49 +01:00
Michael Peter Christen
f5ca5cea44 - added field options to all solr queries. This can be used to restrict
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during aquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
2012-11-19 17:24:34 +01:00
orbiter
5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the
query string parser. This shall be used to create a proper full-string
matching which is handled then by QueryGoal.
2012-11-18 01:22:41 +01:00
Michael Peter Christen
5fd3b93661 added deletion of hosts during crawl start if deleteold option was given 2012-11-13 16:54:28 +01:00
Michael Peter Christen
842faf96a2 fixed media search 2012-11-07 17:27:13 +01:00
Michael Peter Christen
93001586a0 removed warnings, removed too-fast pausing of crawls 2012-11-07 15:37:14 +01:00
Michael Peter Christen
12c0db20e5 fixed npe for surrogate import 2012-11-07 02:46:51 +01:00
Michael Peter Christen
52df6ee369 more logging 2012-11-07 02:04:08 +01:00
Michael Peter Christen
15d1460b40 added information about the reason of pausing of crawls 2012-11-06 15:21:56 +01:00
Michael Peter Christen
2371ef031c added solr faceted search support to YaCy search results
added solr highlighting / YaCy snippets to YaCy search results
- facets are now much more complete
- facets are computed and searched much faster
- snippet computation is done by solr if solr knows the snippet
2012-11-06 14:32:08 +01:00
Michael Peter Christen
d481abd087 added the visualization of error-urls to host browser
- only visible for admins
- a faceted search generates a huge list for all hosts in the host list
- the faceted search algorithms had to be modified for that
- within the browsing of the directory path, the error cause is written
to the url which is presented as error-url
- the errors are also accumulated for directory sums
2012-11-06 00:29:37 +01:00
Michael Peter Christen
97f82994a6 automatically pause the crawler if there is a problem with solr 2012-11-05 16:34:42 +01:00
Michael Peter Christen
8fb370d9f8 renovated the way how search results are count. should be correct now... 2012-11-05 03:19:28 +01:00
orbiter
354ef8000d - added 'deleteold' option to crawler which causes that documents are
deleted which are selected by a crawl filter (host or subpath)
- site crawl used this option be default now
- made option to deleteDomain() concurrency
2012-11-04 02:58:26 +01:00
Michael Peter Christen
75dd706e1b update to HostBrowser:
- time-out after 3 seconds to speed up display (may be incomplete)
- showing also all links from the balancer queue in the host list (after
the '/') and in the result browser view with tag 'loading'
2012-11-02 13:57:43 +01:00
Michael Peter Christen
e2c4c3c7d3 migration to solr 4.0.0 2012-11-02 12:29:48 +01:00
Michael Peter Christen
9330ad4838 - fixed the delete option in host browser
- added a delete method which can be used to delete a full subpath in
solr.
2012-11-02 01:22:31 +01:00
Michael Peter Christen
6629e37685 tried to clean up the search process mess 2012-11-01 17:16:43 +01:00
Michael Peter Christen
f8f05ecba7 - added a delete button in host browser to delete a complete subpath
- removed storage of default collection name - default is now "user"
- made stacking of crawl start points concurrently
2012-10-31 17:44:45 +01:00
Michael Peter Christen
c326aa8f67 disabled writing new entries to crawl stacks to prevent that a domain
with many documents block refreshing of the crawl queue
2012-10-29 22:26:52 +01:00
Michael Peter Christen
6905182d41 - fix for number of words log message
- adding meta:refresh also to crawler stack
2012-10-29 21:42:31 +01:00
Michael Peter Christen
799d71bc67 enhanced solr caching:
- increased cache size which is needed for longer solr commit time
- speed hacks on cache write code
2012-10-28 20:31:29 +01:00
Michael Peter Christen
8e1248ffe3 force a commit in advance of a search for the administrator to get most
recent results even if commit time is high and an indexing is ongoing.
2012-10-26 15:35:42 +02:00
Michael Peter Christen
3b48c78190 added an option to force a commit to solr.
may be used by a search front-end in case that the commitWithinMs time
is too short to get recently indexed documents.
2012-10-26 07:39:07 +02:00
Michael Peter Christen
ce0e5b1e17 - more refactoring / private methods
- fix for usage of custom solr field names
2012-10-18 15:09:04 +02:00
Michael Peter Christen
ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
superfluous. The target is to make a solr document as the core of YaCy
documents which would cause that many conversions can be removed. On the
way to this target the Equivalence of URIMetadataRow and URIMetadataNode
had to be removed to expose the usage of the old URIMetadataRow data
structure.
This refactoring already removes unneccessary conversions and should
make memory usage during indexing lower.
2012-10-18 14:29:11 +02:00
Michael Peter Christen
e5b3c172ff removed hack which translated Solr documents to virtual RWI entries
which had been then mixed with remote RWIs. Now these Solr documents are
feeded into the result set as they appear during local and remote
search. That makes the search much faster.
2012-10-17 17:45:41 +02:00
Michael Peter Christen
5d16c23a1f specified more URIMetadata as URIMetadataNode 2012-10-16 18:26:21 +02:00
Michael Peter Christen
43f3345c90 - removed dependencies from URIMetadataRow and made direct access to
URIMetadataNode which creates the opportunity to access Solr objects
directly and use their information richness
- lazy initialization of the URIMetadataNode object - should cause less
computation and memory usage during search.
- removed dead code
2012-10-16 18:11:57 +02:00
Michael Peter Christen
cc98496ff3 enhanced the HostBrowser:
- showing also outbound links to other domains if there are any
- the outbound links browser shows also the link structure image
- showing even inbound links if the web structure graph has information
about that
- removed the left menu and made the HostBrowser a part of the top menu
for search
- moved the file search also to the top menu
- added hover information in the HostBrowser to explain what the click
means
- because the HostBrowser also links to the Metadata viewer ViewFile,
there should be a button to switch back to the HostBrowser: added that
also.
2012-10-16 17:13:18 +02:00
Michael Peter Christen
21fe8339b4 - enhanced generation of url objects
- enhanced computation of link structure graphics
- enhanced collection of data for link structures
2012-10-15 13:17:13 +02:00
Michael Peter Christen
1b02408936 use less cache 2012-10-11 14:32:37 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of & parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
7e3e45fd04 added Open Graph Metadata default fields, see http://ogp.me/ns# 2012-10-09 17:28:48 +02:00
Michael Peter Christen
c3e5f667a7 added schema.org breadcrumb counter to parser and solr schema 2012-10-09 13:02:43 +02:00
Michael Peter Christen
bd769de604 since the solr index is now used for all pages that are indexed locally,
there is no need for the RWI index if the index is not transfered to
another peer. Therefore the creation of RWI index data is now suppressed
if DHT is disabled. This applies for all intranet and portal mode
configurations, but not for public robinson modes. A robinson may switch
back to public mode and then transmit its data. That means if someone
wants to switch never to DHT mode, it would be more appropriate to
choose the portal mode.
2012-10-09 11:48:55 +02:00
Michael Peter Christen
f8a3ab2d82 added the usage of synonyms to the GSA search interface 2012-10-02 14:29:45 +02:00
Michael Peter Christen
3d33a5bdf6 turned the synonyms_t Text field into a multi-valued String field
synonyms_sxt
2012-10-02 11:13:06 +02:00
Michael Peter Christen
3b959ee002 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-02 10:14:09 +02:00
orbiter
3190347814 added a synonyms_t field to solr and a process to read synonym files.
This can be used to add another stemming to solr using stemming files
that are expressed as synonyms for grammatical alternatives. The
synonym/stemming files must have the following form:
- each line is a comma-separated list of synonyms
- the list of synonyms may be enclosed with {} (like the GSA synonyms
file)
- the file may contain comments which are lines starting with a '#'
The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and
are activated by default whenever a synonym file is in place.
Then, for each word that is found in a document all synonyms are added
to a long text field which is stored into synonyms_t. Processes using
the synonyms must query with that field as optional matcher.
2012-10-02 00:02:50 +02:00
Michael Peter Christen
411d0e839b added an underline text field to solr to record all underlined texts 2012-10-01 14:16:49 +02:00
Michael Peter Christen
c4a3d8870f fixed computation of links in host browser which are not indexed but
knwon by the crawler. Such links are now displayed in grey color.
2012-09-29 02:13:11 +02:00
Michael Peter Christen
24d2ee3c52 - better date ranking
- more protection against NPE and time travel effects
2012-09-26 18:36:32 +02:00
Michael Peter Christen
ca313e404f - if a "/date" modifier is used, the solr remote query applies an
ordering by date (ascending)
- added also some 'anti-timetravel' protection (check if date is in the
future within any metadata date field)
2012-09-26 16:56:33 +02:00
Michael Peter Christen
a4214694df We assert that no other metadata storage than solr is used now.
Therefore a property like solrConnected() must be true all the time.
Removal of this method causes removal of all write operations to the old
metadata index.
2012-09-26 16:05:11 +02:00
Michael Peter Christen
562183932b - removed ip_s from default profile since that needs a DNS lookup to
create an document entry. This makes remote search much slower.
- removed synchronization of add method if ip_s is activated to prevent
that a user configuration causes bad behavior. The disadvantage of that
is, that a index dump can cause data loss if an indexing is running
during index dump
- catched more exceptions and more NPE
- better abstraction in MirrorSolrConnector
- slight performance enhancement when only the index count is requested
(rows=0 is sufficient to get a total count)
2012-09-26 13:38:04 +02:00
Michael Peter Christen
1533bfd63b refactoring 2012-09-25 21:20:03 +02:00
Michael Peter Christen
872f83ebe0 refactoring 2012-09-25 21:04:58 +02:00
Michael Peter Christen
fb9460f0a8 using the search filter to drill down search to file types.
A search like "mp3 filetype:mp3" will now maybe surprise you.
2012-09-25 17:52:33 +02:00
Michael Peter Christen
15ea053c3a - added xml output in IndexControlURLs to get the storage page of index
dump commands
- adjusted the apicall.sh script to get the downloaded text as output to
stdout which is necessary to parse the content out of it
- added indexdump.sh script which creates a solr dump and prints out the
storage path for the index dump
- added synchronization to the Fulltext class to prevent that data is
stored to a non-existing solr index while this index is disabled during
the storage of the dump
2012-09-25 00:19:52 +02:00
Michael Peter Christen
1b474139dd used the new zip writer/reader to add a solr dump process: the whole
solr index can be written to a zip dump and also restored during runtime
2012-09-24 17:05:28 +02:00
Michael Peter Christen
8219a445f3 refactoring 2012-09-21 16:46:57 +02:00
Michael Peter Christen
00c1c777fa refactoring 2012-09-21 15:48:16 +02:00
orbiter
563d584420 removed more dependencies in cora from kelondro 2012-09-21 11:02:36 +02:00
Michael Peter Christen
62add1d564 added the protocol and the file name extension to the solr fields since
these fields are probably facets in file search
2012-09-11 22:46:39 +02:00
Michael Peter Christen
9db032664e activate two solr fields which will be used by administration interface
(later)
2012-09-11 20:15:54 +02:00
Michael Peter Christen
4634f0e626 fix for images_withalt 2012-09-10 12:30:03 +02:00
Michael Peter Christen
10b911eed4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-07 22:07:02 +02:00
Michael Peter Christen
be67c70a47 added Solr fields:
inboundlinks_text_chars_val
inboundlinks_text_words_val
inboundlinks_alttag_txt
outboundlinks_text_chars_val
outboundlinks_text_words_val
outboundlinks_alttag_txt
2012-09-07 22:06:51 +02:00
orbiter
d73fff0e0e added solr field images_withalt_i 2012-09-07 21:33:45 +02:00
sixcooler
e78fe3f477 also do a clearcache on the solr-connector-caches 2012-09-06 22:07:07 +02:00
Michael Peter Christen
d8425e6809 added collections to crawl monitor 2012-09-04 14:47:53 +02:00
Michael Peter Christen
ee23fc7a32 added h1..h6 counter fields 2012-09-04 14:11:11 +02:00
Michael Peter Christen
b2b516cc3e added a collection attribute to crawls and searches:
- a solr field collection_sxt can be used to store a set of crawl tags
- when this field is activated, a crawl tag can be assigned when crawls
are started
- the content of the collection field can be comma-separated, all of
them are assigned to the documents when they are indexed as result of
such a crawl start
- a search result can be drilled down to a specific collection; this is
currently only available in the solr interface and also in the gsa
interface using the 'site' option
- this adds a mandatory field for gsa queries (the google api demands
that field all the time)
2012-09-03 15:26:08 +02:00
Michael Peter Christen
f75b3f8a47 added more patches to work without RWI data structure 2012-08-31 14:35:56 +02:00
Michael Peter Christen
31d4d38804 - extended the solr interface by a references-by-word-count method
- reduced danger that a non-existing RWI database causes NPEs
- added Solr queries to did-you-mean: this makes it possible that our
did-you-mean algorithm works together with only Solr and without RWIs
2012-08-31 13:03:00 +02:00
Michael Peter Christen
528d6763fa - added new solr fields:
title_count_i, title_chars_val, title_words_val
description_count_i, description_chars_val, description_words_val
- added many asserts to ensure data type correctness from YaCy to Solr
and vice versa
- made many fixes according to new findings from these asserts (!)
2012-08-31 10:30:43 +02:00
Michael Peter Christen
2ddc33646a added new field for solr:
url_paths_sxt
url_parameter_i
url_parameter_key_sxt
url_parameter_value_sxt
url_chars_i
2012-08-29 16:11:23 +02:00
Michael Peter Christen
316b5fe116 - added a solr type definition verifier
- fixed type definition found by the verifier
- added multivalue-string fields for solr with extension 'sxt'
- added multivalue-integer fields for solr with extension 'val'
- renamed some solr attributes from txt to sxt
- changed solr query line to an explicit AND/OR structure
- added a country code second level domain list to Domains class; with
parser
- added a host string parser to get domain class name, country-code
second-level domain and subdomain out of it
- removed old coordinate attributes
2012-08-28 16:58:06 +02:00
Michael Peter Christen
e8acd542b5 - added faceted drill-down for host and geolocation to solr queries
- added a new geolocation field to index schema, the old values are
migrated if possible
2012-08-27 14:41:33 +02:00
orbiter
29171e2f6c fixed generation of ontologies from index enumerations 2012-08-24 14:13:42 +02:00
orbiter
01a63ef595 redesign of YaCySchema and SolrDoc handling 2012-08-23 09:51:45 +02:00
orbiter
479bfca571 refctoring 2012-08-23 09:30:11 +02:00
Michael Peter Christen
4716546ef5 - reduced memory usage in index transmission using a transformation of
Node to Row objects
- removed peerDeparture in solr remote search in case that peer does not
answer (this may be normal because it is allowed to switch this off)
2012-08-22 16:30:33 +02:00
orbiter
716ea0cfe2 sorted the solr schema into mandatory and optional fields; reduced
number of used field to reduce solr index size
2012-08-21 23:52:56 +02:00
orbiter
9b8c8c0f47 fix from gaston in
http://forum.yacy-websuche.de/viewtopic.php?p=26909#p26909
2012-08-21 21:03:26 +02:00
orbiter
d7ea45f698 - get nice text_t values from metadata conversions that are stored into
solr as fulltext search index.
- added slow migration from old metadata to solr index entries: each
entry from the old metadata is removed from that data structure and
written into solr.
2012-08-18 19:36:21 +02:00
orbiter
780f8974e7 added ramaining iteration methods for solr in fulltext class 2012-08-18 15:39:14 +02:00
orbiter
ee01c12e56 fixes for putDocument and putMetadata 2012-08-18 13:05:27 +02:00
orbiter
cc47a0876e reverted bf55f69176
to have a fall-back option in case that memory problems as reported in
http://forum.yacy-websuche.de/viewtopic.php?p=26901#p26901
for full-solr installation are too strong and we have to work with an
'small memory footprint' peer system.
2012-08-18 10:28:40 +02:00
Michael Peter Christen
0cab06c47c refactoring 2012-08-17 15:52:33 +02:00
Michael Peter Christen
bf55f69176 removed write methods to old metadata file type; all metadata now goes
to solr
2012-08-17 15:46:26 +02:00
Michael Peter Christen
40c0856489 refactoring 2012-08-17 15:33:02 +02:00
Michael Peter Christen
06a78eecb7 code simplification 2012-08-17 14:43:32 +02:00
Michael Peter Christen
18f989dfb1 - refactoring (load -> getMetadata)
- added getDocument to retrieve Solr documents which shall replace
getMetadata
2012-08-17 01:34:38 +02:00
Michael Peter Christen
e5ef840f40 - renamed DoubleSolrConnector to MirrorSolrConnector and added a
hit/miss/document cache to the MirrorSolrConnector.
- more abstraction to SolrDocument in Connector interface
- bugfixes in Solr field reader
2012-08-13 13:32:32 +02:00
Michael Peter Christen
b51df6c7e8 - added coordinate storage in solr schema
- fixed shutdown process
- fixed some solr-to-metadata reading
- added a large number of metadata attributes in ViewFile.html
2012-08-13 10:40:04 +02:00
Michael Peter Christen
bd4f03bc85 removed unused class 2012-08-11 01:05:40 +02:00
orbiter
e816b88b55 changed behaviour of metadata storage: in case that any solr is
attached, the metadata is not written to the metadata-db, even if it is
enabled but instead to solr. This prevents that metadata is written in
two store systems at the same time. It is also the next step to migrate
the current metadata-db to solr.
2012-08-10 15:39:10 +02:00
orbiter
2571e0d47a removed unused classes 2012-08-10 14:47:44 +02:00
Michael Peter Christen
f9c0e6e950 - Implemented and integrated the URIMetadataNode object which is a
metadata representation from the solr index. This shall replace metadata
from the built-in database in the future.
- added the Solr-driven metadata into the search index of YaCy which
makes it now possible to run YaCy without the old metadata index. This
is a major stept forward to a full migration to Solr.
2012-08-10 13:26:51 +02:00
Michael Peter Christen
136fcb1ad9 refactoring 2012-08-10 06:47:13 +02:00
Michael Peter Christen
bca4a16603 replaced the multivalue generic string field name suffix _ss by _txt
because _ss is not part of the standard solr example schema.
2012-08-06 17:58:09 +02:00
orbiter
67edfd991c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-05 15:49:48 +02:00
orbiter
d9173ba7ed added more solr fields to integrate values from URIMetadataRow. All
writings to the Metadata-DB are now also done to solr. This includes
metadata transfer during search and rwi transfer.

The new/added solr fields are:

## time when resource was loaded
load_date_dt

## date until resource shall be considered as fresh
fresh_date_dt

## id of the host, a 6-byte hash that is part of the document id
host_id_s

## ids of referrer to this document
referrer_id_ss

## the md5 of the raw source
md5_s

## the name of the publisher of the document
publisher_t

## the language used in the document; starts with primary language
language_ss

## an external ranking value
ranking_i

## the size of the raw source
size_i

## number of links to audio resources
audiolinkscount_i

## number of links to video resources
videolinkscount_i

## number of links to application resources
applinkscount_i
2012-08-05 15:49:27 +02:00
Michael Peter Christen
24d9db1613 snippet retrieval loading processes may use a smaller minimum load time
value than crawling processes. This speeds up the search result
preparation dramatically.
2012-07-30 10:38:23 +02:00
Michael Peter Christen
1687737771 Abstraction of HandleMap and HandleSet 2012-07-27 12:13:53 +02:00
Michael Peter Christen
3bcd9d622b cleaned up classes and methods which are either superfluous at this time
or will be superfluous or subject of complete redesign after the
migration to solr. Removing these things now will make the transition to
solr more simple.
2012-07-25 14:31:54 +02:00
Michael Peter Christen
6f1ddb2519 Moved solr index-add method to the same method where the YaCy index is
written. Also done some code-cleanup.
2012-07-25 01:53:47 +02:00
Michael Peter Christen
76202f068e extended abstraction of local and remote solr index using one front-end
for index administration and querying.
2012-07-24 17:23:29 +02:00
Michael Peter Christen
826967513b changed options in IndexFederated_p to switch on/off parts of the index
individually. The settings are experimental and the values of the
settings will be overwritten when an index migration from urldb to solr
starts.
2012-07-23 16:28:39 +02:00
orbiter
69e743d9e3 - more abstraction for the RWI index as preparation for solr integration
- added options in search index to switch parts of the index on or off
2012-07-22 13:18:45 +02:00
orbiter
5a3c829872 embedded solr is only initiated if it is activated with
IndexFederated_p.html
2012-07-20 11:40:33 +02:00
Michael Peter Christen
97b7bcf2a6 added a solr search index
- by default, a (empty) solr storage instance is created at
SEGMENTS/solr_36
- the index is written if in /IndexFederated_p.html the flag "embedded
solr search index" is switched on
- a standard solr query interface is available now with a new servlet at
http://127.0.0.1:8090/solr/select

To test this, do the following:
- switch to webportal mode
- switch on the feature as described
- do a crawl. this fills the solr index. The normal YaCy search will NOT
work now!
- do a solr query, like:
http://127.0.0.1:8090/solr/select?q=*:*
http://127.0.0.1:8090/solr/select?q=text_t:Help
play with different search fields as you can see in
/IndexFederated_p.html
You can use the standard solr query attributes as described in
http://wiki.apache.org/solr/SearchHandler
2012-07-19 11:34:05 +02:00
orbiter
bbfa497a3c replaced more size() > 0 by !isEmpty() 2012-07-12 11:12:21 +02:00
orbiter
0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
- replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be
done automatically
- implemented some isEmpty() methods
2012-07-10 22:59:03 +02:00
orbiter
62202e2d71 refactoring of query attribute variable names for better consistency
with (next) stored query words
2012-07-09 11:14:50 +02:00
Michael Peter Christen
7c1ba99755 removed more unused method parameters 2012-07-05 10:44:30 +02:00
Michael Peter Christen
d3964253ae - added @SuppressWarnings to unused servlet method parameters
- removed unnecessary casts
- removed unnecessary throw statements
2012-07-05 09:14:04 +02:00
orbiter
d4291ac1f3 more tolerance when creating solar document 2012-07-04 21:15:38 +02:00
Michael Peter Christen
8a82609360 - smaller caches to save memory
- close cloneable iterators to free memory
2012-07-02 15:40:40 +02:00
Michael Peter Christen
1825f165b8 better integration of blacklist according to use case 2012-07-02 13:57:29 +02:00
Michael Peter Christen
03280fb161 removed segments-concept and the Segments class:
the segments had been there to create a tenant-infrastructure but were
never be used since that was all much too complex. There will be a
replacement using a solr navigation using a segment field in the search
index.
2012-06-28 14:27:29 +02:00
Michael Peter Christen
508a81b86c added solr field 'refresh_s' which stores the refresh url contained in
the meta-refresh html header field.
2012-06-28 13:27:45 +02:00
Michael Peter Christen
9116013c64 - allow lazy initialization of solr value (if using 'lazy', then no
0-values and no empty strings are written). This may save a lot of
memory (in ram and on disc) if excessive 0-values or empty strings
appear)
- do not allow default boolean values for checkboxes because that does
not make sense: browsers may omit the checkbox attribute name if the box
is not checked. A default value 'true' would not comply with the
semantic of the browsers response.
- add a checkbox in IndexFederated_p for the lazy initialization of solr
fields.
2012-06-27 12:17:58 +02:00
Michael Peter Christen
0294a53459 - add canonical field only if requested by solr schema
- remove canonical url from in/outbound urls if present
2012-06-26 14:51:57 +02:00
Michael Peter Christen
3fd4a01286 added option to record urls that are forwarded to the solr index 2012-06-26 13:54:48 +02:00
Michael Peter Christen
77f795756c fixing redirects and status codes: storing of status code in
ResponseHeader to make it available for late evaluations, like storage
in solr.
2012-06-25 18:17:31 +02:00
Michael Peter Christen
b9dfca4b0a - fixed IndexFederated Servlet / a embedded Solr can now be selected
- added code stub for an embedded Solr but generation of Solr store is
still commented out (it works but is not yet ready for usage)
2012-06-25 11:34:38 +02:00
Michael Peter Christen
786be7d175 better integration of RDFaParser 2012-06-20 16:39:04 +02:00
Michael Peter Christen
e89747bb67 - added automated generation of vocabularies from url stubs
- added clear of all terms for vocabularies
- added deletion of vocabularies
2012-06-13 15:53:18 +02:00
Roland 'Quix0r' Haeder
edaa09b9b1 Rewrote all String blacklist types to enum 'BlacklistType', closes bug
#143

Conflicts:
	htroot/Supporter.java
	htroot/yacy/crawlReceipt.java
	htroot/yacy/transferRWI.java
	htroot/yacy/transferURL.java
	source/de/anomic/crawler/CrawlStacker.java
	source/de/anomic/data/ListManager.java
	source/net/yacy/peers/Protocol.java
	source/net/yacy/repository/Blacklist.java
	source/net/yacy/repository/LoaderDispatcher.java
	source/net/yacy/search/Switchboard.java
	source/net/yacy/search/index/MetadataRepository.java
	source/net/yacy/search/index/Segment.java
	source/net/yacy/search/query/RWIProcess.java
	source/net/yacy/search/snippet/MediaSnippet.java
2012-06-11 00:17:30 +02:00
Michael Peter Christen
41c02cb10e - less restrictions for usage of Table RAM copy
- new limit to use the table copy (instead of flag): 400MB available. If
less is available, then a copy is never used. If more is available, then
it can be used if there is a remaining space of at least 200MB
- flush caches more often: flush the Digest cache
2012-06-08 12:48:25 +02:00
Michael Peter Christen
ab7107b34b fixed RWIProcess queue limits: now discovering hidden results for mass
result retrieval
2012-06-08 09:14:54 +02:00
Michael Peter Christen
e0d8643226 - performance hacks
- added log warnings in case that search processes run into time-out
situations
- better concurrency for Integer formatter (used a non-synchronized
formatter before)
- bugfix for search termination (a poison pill was missing)
- added timeout parameters for search (again) -> target is, that they
are never reached.
2012-06-04 15:37:39 +02:00
Michael Peter Christen
9b4c699526 ehanced location search:
- search request are now made using a map boundary
- search results are only computed for the map boundary
- the number of results is adopted to the results in the visible range
- added a double-buffering for the search result markers
- added a search query option for the search results:
/radius/<lat>/<lon>/<radius>
2012-05-31 22:39:53 +02:00
Michael Peter Christen
7c1feefb28 introduced a default 10 second time-out in rwi normalization time
uring search process to prevent endless deadlocks after a very long
running search
2012-05-30 16:26:05 +02:00
reger
ee553d971e correct typo in scripts_txt comment 2012-05-19 02:09:16 +02:00
Michael Peter Christen
acf8d521a2 fix for http://bugs.yacy.net/view.php?id=126 2012-05-19 00:21:03 +02:00
Roland 'Quix0r' Haeder
d10627d591 More sync in close() methods
Conflicts:
	source/net/yacy/kelondro/logging/GuiHandler.java
	source/net/yacy/kelondro/workflow/InstantBusyThread.java
2012-05-17 06:03:18 +02:00
Roland 'Quix0r' Haeder
b3ae2aa41f With or without 'final'? At least please try it in other methods
Conflicts:
	source/de/anomic/tools/tarTools.java
2012-05-17 06:00:49 +02:00
Roland 'Quix0r' Haeder
fbb946f913 Made a method static (Eclipse suggested it), removed unused import, pk=null check does now output a warning in logfile 2012-05-17 05:55:44 +02:00
Michael Peter Christen
52d307c735 prevent that the snippet fectch process removes catchall entries 2012-05-17 05:18:52 +02:00
Michael Peter Christen
5deebd02ea added serialization 2012-05-15 23:10:47 +02:00
reger
b2175ea4ef Add possibility to set custom Solr field names for the YaCy default Solr attributes.
- Changing the format of YaCy's solr.key.list while maintainig backward compatibility
  Federated index config screens adjusted accordingly
- modified the Solr update request to use a 3 min Solr autocommit intervall
2012-05-15 22:34:02 +02:00
Michael Peter Christen
2717c1b749 fixed bug in solr interface 2012-05-15 12:25:14 +02:00
Michael Peter Christen
f150bc218b fixed bug in solr error document 2012-05-14 14:56:21 +02:00
Roland 'Quix0r' Haeder
a093ccf5eb Now used synchronization in all close() methods to make sure all objects
are 'closed' in an ordered way

Conflicts:
	source/de/anomic/http/server/ChunkedInputStream.java
	source/de/anomic/http/server/ChunkedOutputStream.java
	source/de/anomic/http/server/ContentLengthInputStream.java
	source/net/yacy/cora/protocol/Domains.java
	source/net/yacy/cora/services/federated/solr/SolrShardingConnector.java
	source/net/yacy/cora/services/federated/solr/SolrSingleConnector.java
	source/net/yacy/document/content/dao/PhpBB3Dao.java
	source/net/yacy/document/parser/html/AbstractTransformer.java
	source/net/yacy/kelondro/blob/BEncodedHeap.java
	source/net/yacy/kelondro/blob/HeapReader.java
	source/net/yacy/kelondro/index/RAMIndexCluster.java
	source/net/yacy/kelondro/io/ByteCountInputStream.java
	source/net/yacy/kelondro/logging/ConsoleOutErrHandler.java
	source/net/yacy/kelondro/table/SQLTable.java
2012-05-14 07:41:55 +02:00
Michael Peter Christen
adeb33bb36 better abstraction for solr objects 2012-05-09 17:21:19 +02:00
Michael Peter Christen
8864141872 more abstraction in solr connection classes 2012-05-09 17:00:56 +02:00
Michael Peter Christen
c00efc2717 made the solr connection more generic 2012-05-09 16:46:45 +02:00
Michael Peter Christen
453010bd68 - solved problems with backpath normalization
- redesigned in/outbound link handover
- removed iframe links from inbound/outbound in solr scheme
2012-04-27 16:48:51 +02:00
Michael Peter Christen
5f5ed33ed8 patch for media search (audio, video apps) 2012-04-27 14:18:02 +02:00
Michael Peter Christen
659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
queue and not from virtual documents generated by the parser.
- The parser now generates nice description texts for NOLOAD entries
which shall make it possible to find media content using the search
index and not using the media prefetch algorithm during search (which
was costly)
- Removed the media-search prefetch process from image search
2012-04-24 16:07:03 +02:00
Michael Peter Christen
14f67f217c refactoring of ContentDomain: now subclass of Classification 2012-04-22 00:04:36 +02:00
Michael Christen
02e4dedff2 fix to url citation collection 2012-04-13 11:52:59 +02:00
Michael Christen
e32055aa15 added stub classes for
- a new database for url reference data ('seen links')
- a new database extending the references to the full url metadata
attributes set which shall replace the old metadata database if it is
finished
- migration help classes stub to use old and new metadata databases
simultanously
2012-04-13 07:09:15 +02:00
Michael Christen
8fc86fe397 added storage of full anchor link structure:
the links between all pages are now stored. The same index structure as
used for the word index is used to make a reverse link index.
The new file(s) in SEGMENT/default/citation.index.*.blob store the
citation index. This will be used to create much more detailed link
structures for the YaCy apis and to create a better ranking. A ranking
using the citation.index should provide better results especially for
portal indexes and initranets.
2012-03-29 17:20:14 +02:00
Michael Peter Christen
096c17e7cd added test code 2012-02-25 12:42:13 +01:00
Michael Peter Christen
e3bb73c3d6 serialized some database access methods 2012-01-31 21:13:49 +01:00
Michael Peter Christen
355ecf330f reduced target file site to 64mb 2012-01-29 20:35:48 +01:00
Michael Christen
eff966f396 fix for search process (it was aborted too early during remote search) 2012-01-09 03:02:35 +01:00
Marek Otahal
f40efb39af Blacklist loadList() remove duplicates by using Set
Signed-off-by: Marek Otahal <markotahal@gmail.com>
2012-01-09 01:18:01 +01:00
Michael Christen
0797b0de99 new handling of remote search processes: looking for seeds will now not
block the whole search process any more. A deadlock with a DHT selection
process may have been the cause for interface lockings in the past.
2011-12-21 00:32:03 +01:00
Michael Christen
9e5894c784 Removed handling of components objects for URIMetadataRows.
This is a preparation to replace this rows with nodes from the node
store.
2011-12-17 01:27:08 +01:00
Michael Christen
c04bfaa51b refactoring 2011-12-16 23:59:29 +01:00
Michael Christen
e9dc99fe15 added rules to set specific RWIs as private RWIs which are not
transmitted to remote peers. This will be used for private index copies
and phonetic indexes.
2011-12-14 22:15:51 +01:00
Michael Christen
f14faf503b better ranking because we wait a very little time during the search
process more to get better remote sear results into the ranking priority
stack
2011-12-06 02:24:51 +01:00
orbiter
5a55397f99 some last-minute performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-25 11:23:52 +00:00
orbiter
e58438c01c - added a new retry connector for solr (for cases where solr responses are slow)
- added a new exist property into the metadataRepository which includes solr entries

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8016 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-08 11:49:04 +00:00
orbiter
035ebfbf3b - performance hacks (should affect the crawl balancer and reduce CPU load during crawl stack re-fill)
- this may have also (good) performance side effects on other parts of YaCy


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7982 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 07:57:50 +00:00
orbiter
2c3161b4ac refactoring:
RankingProcess -> RWIProcess
ResultFetcher -> SnippetProcess


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7974 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-26 21:42:28 +00:00
orbiter
d2ea250d99 refactoring:
- moved many classes from de.anomic to net.yacy
- made more sub-packages for search classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7973 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-25 16:59:06 +00:00