Commit Graph

500 Commits

Author SHA1 Message Date
Michael Peter Christen
0833937c1c better balancing and duetime-cumputation also for no-delay intranet
hosts
2012-10-30 11:28:49 +01:00
Michael Peter Christen
c326aa8f67 disabled writing new entries to crawl stacks to prevent that a domain
with many documents block refreshing of the crawl queue
2012-10-29 22:26:52 +01:00
Michael Peter Christen
6905182d41 - fix for number of words log message
- adding meta:refresh also to crawler stack
2012-10-29 21:42:31 +01:00
Michael Peter Christen
c25d7bcb80 - added concurrency for robots.txt loading
- changed data model for domain counter
2012-10-29 21:08:45 +01:00
Michael Peter Christen
2d9e577ad0 replaced the custom robots.txt loader by the standard http loader 2012-10-28 22:48:11 +01:00
Michael Peter Christen
799d71bc67 enhanced solr caching:
- increased cache size which is needed for longer solr commit time
- speed hacks on cache write code
2012-10-28 20:31:29 +01:00
Michael Peter Christen
a33e2742cb - removed unnecessary synchronized and deadlock in crawler
- removed problem with monitoring object on Balancer.wait
- added missing user agent settings
2012-10-28 19:56:02 +01:00
Michael Peter Christen
8e1248ffe3 force a commit in advance of a search for the administrator to get most
recent results even if commit time is high and an indexing is ongoing.
2012-10-26 15:35:42 +02:00
Michael Peter Christen
3b48c78190 added an option to force a commit to solr.
may be used by a search front-end in case that the commitWithinMs time
is too short to get recently indexed documents.
2012-10-26 07:39:07 +02:00
orbiter
276dd6452b removed warnings 2012-10-23 19:08:44 +02:00
sixcooler
47ae7e322e smaller dhtDispatcher.cloudSize
@Orbiter: we talked about this times ago - please revert if I'm wrong
2012-10-21 20:05:28 +02:00
Michael Peter Christen
ce0e5b1e17 - more refactoring / private methods
- fix for usage of custom solr field names
2012-10-18 15:09:04 +02:00
Michael Peter Christen
ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow
superfluous. The target is to make a solr document as the core of YaCy
documents which would cause that many conversions can be removed. On the
way to this target the Equivalence of URIMetadataRow and URIMetadataNode
had to be removed to expose the usage of the old URIMetadataRow data
structure.
This refactoring already removes unneccessary conversions and should
make memory usage during indexing lower.
2012-10-18 14:29:11 +02:00
Michael Peter Christen
e5b3c172ff removed hack which translated Solr documents to virtual RWI entries
which had been then mixed with remote RWIs. Now these Solr documents are
feeded into the result set as they appear during local and remote
search. That makes the search much faster.
2012-10-17 17:45:41 +02:00
Michael Peter Christen
5d16c23a1f specified more URIMetadata as URIMetadataNode 2012-10-16 18:26:21 +02:00
Michael Peter Christen
43f3345c90 - removed dependencies from URIMetadataRow and made direct access to
URIMetadataNode which creates the opportunity to access Solr objects
directly and use their information richness
- lazy initialization of the URIMetadataNode object - should cause less
computation and memory usage during search.
- removed dead code
2012-10-16 18:11:57 +02:00
Michael Peter Christen
cc98496ff3 enhanced the HostBrowser:
- showing also outbound links to other domains if there are any
- the outbound links browser shows also the link structure image
- showing even inbound links if the web structure graph has information
about that
- removed the left menu and made the HostBrowser a part of the top menu
for search
- moved the file search also to the top menu
- added hover information in the HostBrowser to explain what the click
means
- because the HostBrowser also links to the Metadata viewer ViewFile,
there should be a button to switch back to the HostBrowser: added that
also.
2012-10-16 17:13:18 +02:00
Michael Peter Christen
21fe8339b4 - enhanced generation of url objects
- enhanced computation of link structure graphics
- enhanced collection of data for link structures
2012-10-15 13:17:13 +02:00
Michael Peter Christen
1b02408936 use less cache 2012-10-11 14:32:37 +02:00
Michael Peter Christen
36c13ed15b less solr prefetch 2012-10-11 10:17:05 +02:00
Michael Peter Christen
5f0ab25382 removed the option to prevent removal of & parts inside of the
MultiProtocolURI during normalform computation because that should
always be done and also be done during initialization of the
MultiProtocolURI Object. The new normalform method takes only one
argument which should be 'true' unless you know exactly what you are
doing.
2012-10-10 11:46:22 +02:00
Michael Peter Christen
7e3e45fd04 added Open Graph Metadata default fields, see http://ogp.me/ns# 2012-10-09 17:28:48 +02:00
Michael Peter Christen
c3e5f667a7 added schema.org breadcrumb counter to parser and solr schema 2012-10-09 13:02:43 +02:00
Michael Peter Christen
a06930662c replaced some more .getBytes() with UTF8/ASCII.getBytes() 2012-10-09 12:14:28 +02:00
Michael Peter Christen
bd769de604 since the solr index is now used for all pages that are indexed locally,
there is no need for the RWI index if the index is not transfered to
another peer. Therefore the creation of RWI index data is now suppressed
if DHT is disabled. This applies for all intranet and portal mode
configurations, but not for public robinson modes. A robinson may switch
back to public mode and then transmit its data. That means if someone
wants to switch never to DHT mode, it would be more appropriate to
choose the portal mode.
2012-10-09 11:48:55 +02:00
Michael Peter Christen
4b5e0c1500 added an url rewriter which can be used to remove session ids from urls 2012-10-09 11:24:48 +02:00
Michael Peter Christen
76d218fbef fixes to crawl profiles 2012-10-08 10:50:40 +02:00
Michael Peter Christen
584663ae8c - redesign of solr query construction
- fix for solr boosts and location search
- fix for number of search results in local search
2012-10-07 07:46:55 +02:00
sof
5cb244b79b Merge remote branch 'origin/master' 2012-10-05 18:54:39 +02:00
apfelmaennchen
88b062210c Added a parser for audio file tags (e.g. ID3 tags for MP3 files) based
on the jaudiotagger library. The parser is disabled by default as it
needs to store temporary files for non file:// protocols, which might be
disliked. For your local MP3-collection it loads nicely Artist,
Title, Album etc. from the audio files meta data.
2012-10-05 18:54:26 +02:00
orbiter
4fed4a86d8 another fix to location search 2012-10-04 22:44:44 +02:00
orbiter
0f7a54452d fix for location search query encoding 2012-10-04 14:46:40 +02:00
Michael Peter Christen
f8a3ab2d82 added the usage of synonyms to the GSA search interface 2012-10-02 14:29:45 +02:00
Michael Peter Christen
3d33a5bdf6 turned the synonyms_t Text field into a multi-valued String field
synonyms_sxt
2012-10-02 11:13:06 +02:00
Michael Peter Christen
3b959ee002 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-10-02 10:14:09 +02:00
orbiter
3190347814 added a synonyms_t field to solr and a process to read synonym files.
This can be used to add another stemming to solr using stemming files
that are expressed as synonyms for grammatical alternatives. The
synonym/stemming files must have the following form:
- each line is a comma-separated list of synonyms
- the list of synonyms may be enclosed with {} (like the GSA synonyms
file)
- the file may contain comments which are lines starting with a '#'
The synonym file(s) must be placed in DATA/DICTIONARIES/synonyms/ and
are activated by default whenever a synonym file is in place.
Then, for each word that is found in a document all synonyms are added
to a long text field which is stored into synonyms_t. Processes using
the synonyms must query with that field as optional matcher.
2012-10-02 00:02:50 +02:00
Michael Peter Christen
411d0e839b added an underline text field to solr to record all underlined texts 2012-10-01 14:16:49 +02:00
Michael Peter Christen
c4a3d8870f fixed computation of links in host browser which are not indexed but
knwon by the crawler. Such links are now displayed in grey color.
2012-09-29 02:13:11 +02:00
Michael Peter Christen
f45f7fc12e added new Host Browser to main menu:
this new search interface is something completely new for search, but
completely common on desktops: browser a web space like one would browse
a file system in a file browser. The file listing is created using the
search index and a faceted restriction to specific domains.
2012-09-28 22:45:16 +02:00
Michael Peter Christen
8556a3d521 extended solr connector with a method to retrieve a single facet. 2012-09-28 13:50:13 +02:00
Michael Peter Christen
23f68f2a69 force usage of default faceting mechanisms for search 2012-09-26 18:48:59 +02:00
Michael Peter Christen
24d2ee3c52 - better date ranking
- more protection against NPE and time travel effects
2012-09-26 18:36:32 +02:00
Michael Peter Christen
ca313e404f - if a "/date" modifier is used, the solr remote query applies an
ordering by date (ascending)
- added also some 'anti-timetravel' protection (check if date is in the
future within any metadata date field)
2012-09-26 16:56:33 +02:00
Michael Peter Christen
a4214694df We assert that no other metadata storage than solr is used now.
Therefore a property like solrConnected() must be true all the time.
Removal of this method causes removal of all write operations to the old
metadata index.
2012-09-26 16:05:11 +02:00
Michael Peter Christen
0cec7e761a enhanced snippet extractor to find snippets also inside of tokens of an
url
2012-09-26 15:33:37 +02:00
Michael Peter Christen
562183932b - removed ip_s from default profile since that needs a DNS lookup to
create an document entry. This makes remote search much slower.
- removed synchronization of add method if ip_s is activated to prevent
that a user configuration causes bad behavior. The disadvantage of that
is, that a index dump can cause data loss if an indexing is running
during index dump
- catched more exceptions and more NPE
- better abstraction in MirrorSolrConnector
- slight performance enhancement when only the index count is requested
(rows=0 is sufficient to get a total count)
2012-09-26 13:38:04 +02:00
Michael Peter Christen
5ac61591f3 better abstraction for solr query params 2012-09-25 23:59:30 +02:00
Michael Peter Christen
1533bfd63b refactoring 2012-09-25 21:20:03 +02:00
Michael Peter Christen
e49359cc95 removed tenant query attribute since it is not used any more and is
replaced by the site-operator in the GSA interface. This operator can
also be simulated in the Solr interface using the collections_sxt field.
2012-09-25 21:09:06 +02:00
Michael Peter Christen
872f83ebe0 refactoring 2012-09-25 21:04:58 +02:00
Michael Peter Christen
fb9460f0a8 using the search filter to drill down search to file types.
A search like "mp3 filetype:mp3" will now maybe surprise you.
2012-09-25 17:52:33 +02:00
Michael Peter Christen
15ea053c3a - added xml output in IndexControlURLs to get the storage page of index
dump commands
- adjusted the apicall.sh script to get the downloaded text as output to
stdout which is necessary to parse the content out of it
- added indexdump.sh script which creates a solr dump and prints out the
storage path for the index dump
- added synchronization to the Fulltext class to prevent that data is
stored to a non-existing solr index while this index is disabled during
the storage of the dump
2012-09-25 00:19:52 +02:00
Michael Peter Christen
1b474139dd used the new zip writer/reader to add a solr dump process: the whole
solr index can be written to a zip dump and also restored during runtime
2012-09-24 17:05:28 +02:00
Michael Peter Christen
e57bf2ca39 simplified DHT classes 2012-09-24 01:04:39 +02:00
Michael Peter Christen
8219a445f3 refactoring 2012-09-21 16:46:57 +02:00
Michael Peter Christen
00c1c777fa refactoring 2012-09-21 15:48:16 +02:00
orbiter
563d584420 removed more dependencies in cora from kelondro 2012-09-21 11:02:36 +02:00
orbiter
63762d8f89 removed kelondro dependencies from cora 2012-09-20 19:38:22 +02:00
orbiter
60b1e23f05 added new crawl options:
- indexUrlMustMatch and indexUrlMustNotMatch which can be used to select
loaded pages for indexing. Default patterns are in such a way that all
loaded pages are also indexed (as before) but when doing an expert crawl
start, then the user may select only specific urls to be indexed.
- crawlerNoDepthLimitMatch is a new pattern that can be used to remove
the crawl depth limitation. This filter a never-match by default (which
causes that the depth is used) but the user can select paths which will
be loaded completely even if a crawl depth is reached.
2012-09-16 21:27:55 +02:00
Michael Peter Christen
6ec02deec6 added new crawl attributes in crawl profile (not active yet) 2012-09-14 16:49:29 +02:00
Michael Peter Christen
0504b01bdc Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-14 00:48:17 +02:00
orbiter
9413f77b65 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-13 23:54:26 +02:00
orbiter
a55e77a115 added twitter search heuristic 2012-09-13 23:53:53 +02:00
Michael Peter Christen
e54ac38095 - some corrections in usage of getFile() and getFileName()
- added more attributes in json response writer according to yacy
servlet
2012-09-11 23:28:21 +02:00
Michael Peter Christen
62add1d564 added the protocol and the file name extension to the solr fields since
these fields are probably facets in file search
2012-09-11 22:46:39 +02:00
Michael Peter Christen
9db032664e activate two solr fields which will be used by administration interface
(later)
2012-09-11 20:15:54 +02:00
Michael Peter Christen
4634f0e626 fix for images_withalt 2012-09-10 12:30:03 +02:00
Michael Peter Christen
4d29f59a27 removed warnings 2012-09-10 07:15:52 +02:00
Michael Peter Christen
10b911eed4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-07 22:07:02 +02:00
Michael Peter Christen
be67c70a47 added Solr fields:
inboundlinks_text_chars_val
inboundlinks_text_words_val
inboundlinks_alttag_txt
outboundlinks_text_chars_val
outboundlinks_text_words_val
outboundlinks_alttag_txt
2012-09-07 22:06:51 +02:00
orbiter
d73fff0e0e added solr field images_withalt_i 2012-09-07 21:33:45 +02:00
sixcooler
e78fe3f477 also do a clearcache on the solr-connector-caches 2012-09-06 22:07:07 +02:00
Michael Peter Christen
d8425e6809 added collections to crawl monitor 2012-09-04 14:47:53 +02:00
Michael Peter Christen
ee23fc7a32 added h1..h6 counter fields 2012-09-04 14:11:11 +02:00
Michael Peter Christen
b2b516cc3e added a collection attribute to crawls and searches:
- a solr field collection_sxt can be used to store a set of crawl tags
- when this field is activated, a crawl tag can be assigned when crawls
are started
- the content of the collection field can be comma-separated, all of
them are assigned to the documents when they are indexed as result of
such a crawl start
- a search result can be drilled down to a specific collection; this is
currently only available in the solr interface and also in the gsa
interface using the 'site' option
- this adds a mandatory field for gsa queries (the google api demands
that field all the time)
2012-09-03 15:26:08 +02:00
Michael Peter Christen
f75b3f8a47 added more patches to work without RWI data structure 2012-08-31 14:35:56 +02:00
Michael Peter Christen
31d4d38804 - extended the solr interface by a references-by-word-count method
- reduced danger that a non-existing RWI database causes NPEs
- added Solr queries to did-you-mean: this makes it possible that our
did-you-mean algorithm works together with only Solr and without RWIs
2012-08-31 13:03:00 +02:00
Michael Peter Christen
528d6763fa - added new solr fields:
title_count_i, title_chars_val, title_words_val
description_count_i, description_chars_val, description_words_val
- added many asserts to ensure data type correctness from YaCy to Solr
and vice versa
- made many fixes according to new findings from these asserts (!)
2012-08-31 10:30:43 +02:00
Michael Peter Christen
2ddc33646a added new field for solr:
url_paths_sxt
url_parameter_i
url_parameter_key_sxt
url_parameter_value_sxt
url_chars_i
2012-08-29 16:11:23 +02:00
Michael Peter Christen
75d5e3475d Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-29 10:13:51 +02:00
cominch
dc468dad01 add content control features for custom filter lists 2012-08-29 09:04:28 +02:00
Michael Peter Christen
316b5fe116 - added a solr type definition verifier
- fixed type definition found by the verifier
- added multivalue-string fields for solr with extension 'sxt'
- added multivalue-integer fields for solr with extension 'val'
- renamed some solr attributes from txt to sxt
- changed solr query line to an explicit AND/OR structure
- added a country code second level domain list to Domains class; with
parser
- added a host string parser to get domain class name, country-code
second-level domain and subdomain out of it
- removed old coordinate attributes
2012-08-28 16:58:06 +02:00
orbiter
a3d5959981 Merge commit '65d49df865f60511d22d86fb15c33a082176e7ab' 2012-08-27 16:56:22 +02:00
Michael Peter Christen
4521d63c92 added boosts to solr search queries 2012-08-27 15:25:25 +02:00
Michael Peter Christen
e8acd542b5 - added faceted drill-down for host and geolocation to solr queries
- added a new geolocation field to index schema, the old values are
migrated if possible
2012-08-27 14:41:33 +02:00
reger
65d49df865 security fix: clear automtic password only if adminAccountForLocalhost=false to prevent remote access to protected pages after restart.
if adminAccountForLocalhost=true leave automatic password unchanged so access from local host is granted but remote access is preventet from the 1st second.
2012-08-26 22:28:14 +02:00
orbiter
29171e2f6c fixed generation of ontologies from index enumerations 2012-08-24 14:13:42 +02:00
orbiter
01a63ef595 redesign of YaCySchema and SolrDoc handling 2012-08-23 09:51:45 +02:00
orbiter
479bfca571 refctoring 2012-08-23 09:30:11 +02:00
Michael Peter Christen
48a82bc705 log queries anonymous from gsa+solr requests 2012-08-22 23:50:40 +02:00
Michael Peter Christen
ab6ec4ec52 added snippet computation to solr/rss and gsa result writer 2012-08-22 17:37:34 +02:00
Michael Peter Christen
4716546ef5 - reduced memory usage in index transmission using a transformation of
Node to Row objects
- removed peerDeparture in solr remote search in case that peer does not
answer (this may be normal because it is allowed to switch this off)
2012-08-22 16:30:33 +02:00
Michael Peter Christen
653645c1cf corrected solr query syntax 2012-08-22 00:48:03 +02:00
orbiter
716ea0cfe2 sorted the solr schema into mandatory and optional fields; reduced
number of used field to reduce solr index size
2012-08-21 23:52:56 +02:00
orbiter
9b8c8c0f47 fix from gaston in
http://forum.yacy-websuche.de/viewtopic.php?p=26909#p26909
2012-08-21 21:03:26 +02:00
Michael Peter Christen
a049761e0c fixed double-check 2012-08-20 14:16:37 +02:00
Michael Peter Christen
f42a57cd7d gsa format update 2012-08-20 12:50:51 +02:00
Michael Peter Christen
ff3eaa21b0 added remote search to solr on YaCy peers!
- when doing a remote search, node peers are selected for solr queries
- the solr query is done concurrently to the standard YaCy rwi search
- the solr search result is feeded into the same data structure that
prepares the rwi search result
- the same remote seach that is done to several outside peers is done to
the local solr index
- the search process works now also without any 'old' RWI data using
solr
2012-08-20 12:16:11 +02:00
Michael Peter Christen
a06123aec6 more abstraction and less parameter overhead for remote search 2012-08-20 01:29:15 +02:00
Michael Peter Christen
f00733186b code simplifications 2012-08-19 13:17:03 +02:00
Michael Peter Christen
db0d438709 fix for http://bugs.yacy.net/view.php?id=206 2012-08-19 08:43:56 +02:00
orbiter
404b0aab09 refactoring in remote search and stub for remote node peer selection 2012-08-18 23:59:25 +02:00
orbiter
d7ea45f698 - get nice text_t values from metadata conversions that are stored into
solr as fulltext search index.
- added slow migration from old metadata to solr index entries: each
entry from the old metadata is removed from that data structure and
written into solr.
2012-08-18 19:36:21 +02:00
orbiter
99ef57f103 reduced sleep times 2012-08-18 17:48:20 +02:00
orbiter
780f8974e7 added ramaining iteration methods for solr in fulltext class 2012-08-18 15:39:14 +02:00
orbiter
ee01c12e56 fixes for putDocument and putMetadata 2012-08-18 13:05:27 +02:00
orbiter
cc47a0876e reverted bf55f69176
to have a fall-back option in case that memory problems as reported in
http://forum.yacy-websuche.de/viewtopic.php?p=26901#p26901
for full-solr installation are too strong and we have to work with an
'small memory footprint' peer system.
2012-08-18 10:28:40 +02:00
Michael Peter Christen
0904afe8fb added concurrent iterator methods to the solr connectors 2012-08-17 18:22:56 +02:00
Michael Peter Christen
0cab06c47c refactoring 2012-08-17 15:52:33 +02:00
Michael Peter Christen
bf55f69176 removed write methods to old metadata file type; all metadata now goes
to solr
2012-08-17 15:46:26 +02:00
Michael Peter Christen
40c0856489 refactoring 2012-08-17 15:33:02 +02:00
Michael Peter Christen
06a78eecb7 code simplification 2012-08-17 14:43:32 +02:00
Michael Peter Christen
9bece5ac5f enhanced snippet fetch - removed a bug that caused documents to be
parsed even if a solr text was available
2012-08-17 14:22:07 +02:00
Michael Peter Christen
18f989dfb1 - refactoring (load -> getMetadata)
- added getDocument to retrieve Solr documents which shall replace
getMetadata
2012-08-17 01:34:38 +02:00
Michael Peter Christen
395b78a0d8 using the solr search index to concurrently search within solr and the
rwis during local search requests.
2012-08-17 01:21:56 +02:00
Michael Peter Christen
6197caf698 added clear-text search words in query params 2012-08-16 23:05:37 +02:00
Michael Peter Christen
23226676c6 FOR THE BRAVE.. this is a forced migration to solr which is now ready
for production as a replacement of the metadata-db.
This intermediate release 1.041 will switch on the previously optional
solr index and the old metadata-db will still work as it did before.
Solr+metadata are accessed in mixed mode, no migration is done yet.
If this causes not a catastrophe until the end of the weekend, we will
do a YaCy 1.1 main release containing this as default.
2012-08-16 18:17:47 +02:00
Michael Peter Christen
d988ba50cf added a very rudimentary, incomplete, non-verified GSA response writer
for solr. Try this:
http://localhost:8090/gsa/searchresult?q=pdf&site=col1&num=10
2012-08-14 12:40:26 +02:00
Michael Peter Christen
aab0b680c3 - added xslt support for solr result formats.
try i.e.
http://localhost:8090/solr/select?q=*:*&start=0&rows=10&wt=xslt&tr=json.xsl
- added servlet-side mime-type configuration for streamed servlets. this
is used for the result formatters in solr result formats
2012-08-14 11:12:50 +02:00
Michael Peter Christen
e5ef840f40 - renamed DoubleSolrConnector to MirrorSolrConnector and added a
hit/miss/document cache to the MirrorSolrConnector.
- more abstraction to SolrDocument in Connector interface
- bugfixes in Solr field reader
2012-08-13 13:32:32 +02:00
Michael Peter Christen
b51df6c7e8 - added coordinate storage in solr schema
- fixed shutdown process
- fixed some solr-to-metadata reading
- added a large number of metadata attributes in ViewFile.html
2012-08-13 10:40:04 +02:00
Michael Peter Christen
da851c6071 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-11 01:21:18 +02:00
Michael Peter Christen
bd4f03bc85 removed unused class 2012-08-11 01:05:40 +02:00
orbiter
39f8eb60c3 tried to prevent calls to bad-hack getSize() method and reduced overhead
of that method a bit.
2012-08-10 18:10:25 +02:00
orbiter
e816b88b55 changed behaviour of metadata storage: in case that any solr is
attached, the metadata is not written to the metadata-db, even if it is
enabled but instead to solr. This prevents that metadata is written in
two store systems at the same time. It is also the next step to migrate
the current metadata-db to solr.
2012-08-10 15:39:10 +02:00
orbiter
2571e0d47a removed unused classes 2012-08-10 14:47:44 +02:00
Michael Peter Christen
f9c0e6e950 - Implemented and integrated the URIMetadataNode object which is a
metadata representation from the solr index. This shall replace metadata
from the built-in database in the future.
- added the Solr-driven metadata into the search index of YaCy which
makes it now possible to run YaCy without the old metadata index. This
is a major stept forward to a full migration to Solr.
2012-08-10 13:26:51 +02:00
Michael Peter Christen
136fcb1ad9 refactoring 2012-08-10 06:47:13 +02:00
Michael Peter Christen
a12f693ec9 added two response writer for embedded solr interface:
a rss/opensearch writer and an enhanced solr xml writer.
The enhanced solr writer has less configuration overhead than the
original writer and should by slightly faster. The rss/opensearch writer
is at this time slightly incomplete compared with the already existing
rss search result form YaCy and also snippets are missing at this time.
To test the new interface, open for example:
http://localhost:8090/solr/select?wt=rss&q=olympia
The wt-code for the new result writers are=
wt=rss for opensearch
wt=exml for the enhanced solr xml writer.
Additionally, the SRU search parameters had been added to the solr
interface which can now also be used for a normal solr/xml search.
2012-08-09 18:06:48 +02:00
Michael Peter Christen
bca4a16603 replaced the multivalue generic string field name suffix _ss by _txt
because _ss is not part of the standard solr example schema.
2012-08-06 17:58:09 +02:00
orbiter
67edfd991c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-08-05 15:49:48 +02:00
orbiter
d9173ba7ed added more solr fields to integrate values from URIMetadataRow. All
writings to the Metadata-DB are now also done to solr. This includes
metadata transfer during search and rwi transfer.

The new/added solr fields are:

## time when resource was loaded
load_date_dt

## date until resource shall be considered as fresh
fresh_date_dt

## id of the host, a 6-byte hash that is part of the document id
host_id_s

## ids of referrer to this document
referrer_id_ss

## the md5 of the raw source
md5_s

## the name of the publisher of the document
publisher_t

## the language used in the document; starts with primary language
language_ss

## an external ranking value
ranking_i

## the size of the raw source
size_i

## number of links to audio resources
audiolinkscount_i

## number of links to video resources
videolinkscount_i

## number of links to application resources
applinkscount_i
2012-08-05 15:49:27 +02:00
Michael Peter Christen
3ce04cecf3 bad hack to prevent a bug appearing in solr 2012-07-31 23:49:07 +02:00
Michael Peter Christen
24d9db1613 snippet retrieval loading processes may use a smaller minimum load time
value than crawling processes. This speeds up the search result
preparation dramatically.
2012-07-30 10:38:23 +02:00
Michael Peter Christen
ef488a15f7 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-07-27 12:14:24 +02:00
Michael Peter Christen
1687737771 Abstraction of HandleMap and HandleSet 2012-07-27 12:13:53 +02:00
sixcooler
76b037a20a check content domain fix:
search image/media should not show pages containing image/media
search text should show all/text but image/media
2012-07-27 04:11:52 +02:00
Michael Peter Christen
3bcd9d622b cleaned up classes and methods which are either superfluous at this time
or will be superfluous or subject of complete redesign after the
migration to solr. Removing these things now will make the transition to
solr more simple.
2012-07-25 14:31:54 +02:00
Michael Peter Christen
6f1ddb2519 Moved solr index-add method to the same method where the YaCy index is
written. Also done some code-cleanup.
2012-07-25 01:53:47 +02:00
Michael Peter Christen
315d83cfa0 cleanup 2012-07-24 22:16:56 +02:00
Michael Peter Christen
76202f068e extended abstraction of local and remote solr index using one front-end
for index administration and querying.
2012-07-24 17:23:29 +02:00
Michael Peter Christen
826967513b changed options in IndexFederated_p to switch on/off parts of the index
individually. The settings are experimental and the values of the
settings will be overwritten when an index migration from urldb to solr
starts.
2012-07-23 16:28:39 +02:00
orbiter
69e743d9e3 - more abstraction for the RWI index as preparation for solr integration
- added options in search index to switch parts of the index on or off
2012-07-22 13:18:45 +02:00
orbiter
05a3ffd03a patches to ensure that solr connectors are active ony if they have a
solr object assigned and vice versa
2012-07-20 11:47:50 +02:00
orbiter
5a3c829872 embedded solr is only initiated if it is activated with
IndexFederated_p.html
2012-07-20 11:40:33 +02:00
Michael Peter Christen
97b7bcf2a6 added a solr search index
- by default, a (empty) solr storage instance is created at
SEGMENTS/solr_36
- the index is written if in /IndexFederated_p.html the flag "embedded
solr search index" is switched on
- a standard solr query interface is available now with a new servlet at
http://127.0.0.1:8090/solr/select

To test this, do the following:
- switch to webportal mode
- switch on the feature as described
- do a crawl. this fills the solr index. The normal YaCy search will NOT
work now!
- do a solr query, like:
http://127.0.0.1:8090/solr/select?q=*:*
http://127.0.0.1:8090/solr/select?q=text_t:Help
play with different search fields as you can see in
/IndexFederated_p.html
You can use the standard solr query attributes as described in
http://wiki.apache.org/solr/SearchHandler
2012-07-19 11:34:05 +02:00
orbiter
c00a3cf74d less usage of generic logger to avoid logger generation overhead 2012-07-12 19:54:54 +02:00
orbiter
e76159040b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-07-12 11:14:04 +02:00
orbiter
bbfa497a3c replaced more size() > 0 by !isEmpty() 2012-07-12 11:12:21 +02:00
Michael Peter Christen
58e7d1952f reduction of logging to prevent too much IO caused be logging 2012-07-12 02:08:11 +02:00
orbiter
0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
- replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be
done automatically
- implemented some isEmpty() methods
2012-07-10 22:59:03 +02:00
orbiter
c7afa8bc48 using SwitchboardConstants for solr attributes 2012-07-10 12:01:20 +02:00
orbiter
c6d8950651 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-07-09 14:33:11 +02:00
orbiter
62202e2d71 refactoring of query attribute variable names for better consistency
with (next) stored query words
2012-07-09 11:14:50 +02:00
Michael Peter Christen
d09d9f2364 filter old peers from bootstrap (now stronger: 60 minutes instead of
240).
2012-07-08 21:25:22 +02:00
Michael Peter Christen
b0c408788b made class methods static where possible 2012-07-05 12:38:41 +02:00
Michael Peter Christen
7c1ba99755 removed more unused method parameters 2012-07-05 10:44:30 +02:00
Michael Peter Christen
0301aba1e9 removed unused method parameters 2012-07-05 10:23:07 +02:00
Michael Peter Christen
241dd8410a removed snippet pattern filter - it was not used 2012-07-05 09:21:27 +02:00
Michael Peter Christen
d3964253ae - added @SuppressWarnings to unused servlet method parameters
- removed unnecessary casts
- removed unnecessary throw statements
2012-07-05 09:14:04 +02:00
Michael Peter Christen
ea10766bfd cleaned unnecessary nested code 2012-07-05 08:44:39 +02:00
orbiter
fc0f9543fe More SentenceReader cleanup 2012-07-05 00:20:58 +02:00
orbiter
d4291ac1f3 more tolerance when creating solar document 2012-07-04 21:15:38 +02:00
orbiter
78fc3cf8f8 refactoring and new usage of SentenceReader: this class appeared as one
of the major CPU users during snippet verification. The class was not
efficient for two reasons:
- it used a too complex input stream; generated from sources and UTF8
byte-conversions. The BufferedReader applied a strong overhead.
- to feed data into the SentenceReader, multiple toString/getBytes had
been applied until a buffered Reader from an input stream was possible.
These superfluous conversions had been removed.
- the best source for the Sentence Reader is a String. Therefore the
production of Strings had been forced inside the Document class.
2012-07-04 21:15:10 +02:00
Michael Peter Christen
613b45f604 - better data structures in secondary search
- fixed a big memory leak in secondary search
2012-07-03 07:12:20 +02:00
Michael Peter Christen
de903a53a0 parser refactoring & hacks 2012-07-03 06:06:38 +02:00
Michael Peter Christen
8a82609360 - smaller caches to save memory
- close cloneable iterators to free memory
2012-07-02 15:40:40 +02:00
Michael Peter Christen
7249d9c9de bugfix for concurrent seed loader 2012-07-02 14:37:57 +02:00
Michael Peter Christen
c72d3b12cd concurrently initialize the seed list during p2p network bootstrap 2012-07-02 14:27:37 +02:00
Michael Peter Christen
1825f165b8 better integration of blacklist according to use case 2012-07-02 13:57:29 +02:00
Michael Peter Christen
c18fa9fa75 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-07-02 12:20:57 +02:00
Michael Peter Christen
ce8d4b87d9 fixes for new eclipse 'Juno' warning 'Resource leak'. 2012-07-02 10:27:46 +02:00
Michael Peter Christen
0c345d1559 giving threads name so its easier to see whats happening during
debugging and within a thread dump
2012-07-02 09:51:43 +02:00
reger
067728bccc add search result heuristic. adding a crawl job with depth-1 for every displayed search result (crawling every external linked page of displayed search result pages) 2012-07-01 00:12:20 +02:00
Michael Peter Christen
03280fb161 removed segments-concept and the Segments class:
the segments had been there to create a tenant-infrastructure but were
never be used since that was all much too complex. There will be a
replacement using a solr navigation using a segment field in the search
index.
2012-06-28 14:27:29 +02:00
Michael Peter Christen
508a81b86c added solr field 'refresh_s' which stores the refresh url contained in
the meta-refresh html header field.
2012-06-28 13:27:45 +02:00
Michael Peter Christen
9116013c64 - allow lazy initialization of solr value (if using 'lazy', then no
0-values and no empty strings are written). This may save a lot of
memory (in ram and on disc) if excessive 0-values or empty strings
appear)
- do not allow default boolean values for checkboxes because that does
not make sense: browsers may omit the checkbox attribute name if the box
is not checked. A default value 'true' would not comply with the
semantic of the browsers response.
- add a checkbox in IndexFederated_p for the lazy initialization of solr
fields.
2012-06-27 12:17:58 +02:00
Michael Peter Christen
0294a53459 - add canonical field only if requested by solr schema
- remove canonical url from in/outbound urls if present
2012-06-26 14:51:57 +02:00
Michael Peter Christen
3fd4a01286 added option to record urls that are forwarded to the solr index 2012-06-26 13:54:48 +02:00
Michael Peter Christen
96aeb127e3 generalized localhost naming.
this is also a preparation for a better IPv6 implementation.
2012-06-26 00:08:25 +02:00
Michael Peter Christen
77f795756c fixing redirects and status codes: storing of status code in
ResponseHeader to make it available for late evaluations, like storage
in solr.
2012-06-25 18:17:31 +02:00
Michael Peter Christen
8dd469b9dd added option to configure the autocommit delay time of solr on-the-fly 2012-06-25 14:59:46 +02:00
Michael Peter Christen
b9dfca4b0a - fixed IndexFederated Servlet / a embedded Solr can now be selected
- added code stub for an embedded Solr but generation of Solr store is
still commented out (it works but is not yet ready for usage)
2012-06-25 11:34:38 +02:00
Michael Peter Christen
fad3b14813 added jetty libraries, needed for future use as web server and as
application server for the solr search interface
2012-06-22 15:31:17 +02:00
Michael Peter Christen
a38b0a2c46 extended embedded solr tests to ensure that it will be usable within a
jetty instance
2012-06-22 11:40:02 +02:00
Michael Peter Christen
b9d42fd9c8 using com.google.common.io.Files instead of homebrew methods 2012-06-22 11:39:17 +02:00
Michael Peter Christen
a5eb91fa60 refactoring 2012-06-22 00:49:32 +02:00
Michael Peter Christen
1be0025a9c - added test for EmbeddedSolrConnector
- added needed libraries for this test
this includes most (all) files needed for an embedded solr
2012-06-22 00:36:49 +02:00
Michael Peter Christen
e12bb254b4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-06-21 14:55:50 +02:00
Michael Peter Christen
3f55dc7c1e - added solr core and libraries that solr needs (lucene is missing, will
follow later)
- added embedded solr connector which can connect to solr
programmatically (without using a server in between)
2012-06-21 14:55:38 +02:00
Michael Peter Christen
786be7d175 better integration of RDFaParser 2012-06-20 16:39:04 +02:00
Michael Peter Christen
0752983fbd - automatic periodic saving of triplestore
- transaction-safe storage of triplestore
2012-06-17 10:50:12 +02:00
Michael Peter Christen
9264d8b4af removed old navigation practice using subject tags in favor of
triplestore-tags
2012-06-17 00:33:40 +02:00
Michael Peter Christen
64c0268b2b show triplestore metadata in yacydoc and viewfile 2012-06-16 17:40:15 +02:00
cominch
a95127c9af Triplestore: initalize per-user triplestores 2012-06-14 11:46:53 +02:00
Michael Peter Christen
e89747bb67 - added automated generation of vocabularies from url stubs
- added clear of all terms for vocabularies
- added deletion of vocabularies
2012-06-13 15:53:18 +02:00
Michael Peter Christen
8b53771db2 changed behavior of navigation processing:
- vocabulary annotation is not done any more into the metadata of urldb
- vocabularies are written into the jena triplestore using a rdf
vocabulary
- vocabularies for rdf tripel must be updated; refactoring done
- with the new navigation tags in the triplestore a faster
pre-urldb-lookup is possible: navigation is processed now within the RWI
during pre-ranking retrieval
- added also a Owl vocabulary stub to add the plain-text url to the
triplestore using the owl:sameas predicate
2012-06-11 23:49:30 +02:00
Michael Peter Christen
5fc6524ca8 - moved triple store to net.yacy.cora.lod (should be generalized there
later
- added abstract add, delete, get methods in the triplestore
- added generation of triples after auto-annotation
- migrated all MultiProtocolURI objects to DigestURI in the parser since
the url hash is needed as subject value in the triples in the triple
store
2012-06-11 16:48:53 +02:00
Michael Peter Christen
4ee6fb1de9 added missing blacklist dht cache storage (maybe due to mistakes in
cherry picking)
2012-06-11 00:38:02 +02:00
Roland 'Quix0r' Haeder
edaa09b9b1 Rewrote all String blacklist types to enum 'BlacklistType', closes bug
#143

Conflicts:
	htroot/Supporter.java
	htroot/yacy/crawlReceipt.java
	htroot/yacy/transferRWI.java
	htroot/yacy/transferURL.java
	source/de/anomic/crawler/CrawlStacker.java
	source/de/anomic/data/ListManager.java
	source/net/yacy/peers/Protocol.java
	source/net/yacy/repository/Blacklist.java
	source/net/yacy/repository/LoaderDispatcher.java
	source/net/yacy/search/Switchboard.java
	source/net/yacy/search/index/MetadataRepository.java
	source/net/yacy/search/index/Segment.java
	source/net/yacy/search/query/RWIProcess.java
	source/net/yacy/search/snippet/MediaSnippet.java
2012-06-11 00:17:30 +02:00