Commit Graph

1809 Commits

Author SHA1 Message Date
Michael Peter Christen
3502b4c697 refactoring (renaming) of yacy-solr api 2013-04-27 01:32:18 +02:00
Michael Peter Christen
3a0fcfbeda Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-04-26 10:50:08 +02:00
Michael Peter Christen
25499eead5 - added a new field for the regular expression in crawl start
- added the field in crawl profile
- adopted logging end error management
- adopted duplicate document detection
- added a new rule to the indexing process to reject non-matching
content
- full redesign of the expert crawl start servlet
The new filter field can now be seen in /CrawlStartExpert_p.html at
Section "Document Filter", subsection item "Filter on Content of
Document"
2013-04-26 10:49:55 +02:00
orbiter
e1bfe9d07a - reduction of the concurrently running processes to make YaCy more
adjusted to smaller and 1-core devices.
- the workflow processor now starts no process at all. these are started
as soon as parser/condenser/indexing queues are filled.
- better abstraction
2013-04-25 11:33:17 +02:00
Michael Peter Christen
c091000165 added collection attribute also to the rss feed reader 2013-04-24 01:14:35 +02:00
orbiter
f7571386a3 added a 'collection' property attribute in yacysearch.html which can be
used to select between different collections as defined during a crawl
start with the 'collection' attribute. This actually implements the
ability to prepare search tenants which restrict their search results to
a specific collection. The main use for this is to provide tenants to
the yaml4 interface (at this time).
2013-04-23 20:42:54 +02:00
orbiter
3e79bd4b1f Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-04-23 12:15:46 +02:00
Michael Peter Christen
d937c55204 extended limitation of dom export size from 100000 to 100000000 2013-04-22 22:33:13 +02:00
Michael Peter Christen
fc2095ac67 some extensions to raster plotter to transform a RGB picture to an
indexed color scheme. This is needed for gif animations
2013-04-22 14:33:04 +02:00
Michael Peter Christen
c1a2175fbc added transparency to gif image animation and the integration to the
YaCy httpd for on-the-fly generated gifs (including animated gifs)
2013-04-21 12:29:05 +02:00
orbiter
5d442dad82 avoid NPE in regex checker 2013-04-20 10:53:49 +02:00
Michael Peter Christen
50421171c3 added new schema fields:
hreflang_url_sxt and hreflang_cc_sxt
for
http://support.google.com/webmasters/bin/answer.py?hl=de&answer=189077

navigation_url_sxt and navigation_type_sxt
for
http://googlewebmastercentral.blogspot.de/2011/09/pagination-with-relnext-and-relprev.html

publisher_url_s
for http://support.google.com/plus/answer/1713826?hl=de

all fields are disabled by default and not written to the index.
2013-04-18 17:21:17 +02:00
Michael Peter Christen
566d6c980c checking of document signature for a double-document check now refers
only to documents within the same domain
2013-04-17 16:15:27 +02:00
Michael Peter Christen
1d30082446 added hindi translation configuration 2013-04-17 12:57:27 +02:00
Michael Peter Christen
d05dc07cff setting of new default values for ranking 2013-04-16 15:02:00 +02:00
Michael Peter Christen
97775fbebc fixed ranking for add-function queries: this did not work. The option
was removed. All function queries are now boosts (multiplies the score
according to a function). This is also the recommended way to boost
rankings based on functions as explained in
http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/
2013-04-16 14:45:14 +02:00
Michael Peter Christen
ac5fa9fe48 fix for result counter logging 2013-04-16 13:32:13 +02:00
Michael Peter Christen
7ab5093321 added new solr title_exact_signature_l and
description_exact_signature_l to be able to identify unique title and
unique description fields.
2013-04-16 01:35:15 +02:00
Michael Peter Christen
f24ac518e6 redesign of exists()-query (can now be called with query) and the
CachedSolrConnector which based its cache on the key value. This will be
used to correct the title_unique_b and description_unique_b field.
2013-04-15 14:08:30 +02:00
Michael Peter Christen
27d6222880 added new field host_extent_i which, after a crawl and postprocessing,
holds the number of documents for the host where the document is hosted.
This is necessary for ranking and the norming of references per local
host in the ranking computation.
2013-04-14 20:52:40 +02:00
reger
518b20147c skip postprocessing during document.store if no citation index connected (prevent null pointer exception) 2013-04-14 02:01:27 +02:00
Marc Nause
ac478384d3 *) did some long overdue refactoring 2013-04-13 23:04:44 +02:00
Michael Peter Christen
ada3f27de7 added three new field for a better ranking: references_internal_i,
references_external_i and references_exthosts_i. These can be used to
count and evaluate the number of external links to every web page. An
experimental ranking function can be i.e.:
div(add(references_internal_i,product(references_external_i,references_exthosts_i)),add(clickdepth_i,1))
2013-04-12 16:17:14 +02:00
Michael Peter Christen
082e3274d6 - setting the same default ranking in the solr interface as for YaCy
search interfaces if no other ranking attributes are given
- using the YaCy ranking in the GSA interface only if there was not
given a GSA-style sort attribute
- to avoid confusion about correct ranking attributes, only the default
'0'-ranking profile is used and not scenario-adopted (site, date)
because that should be configurable in the web interface before it is
used actually for ranking.
2013-04-12 10:48:41 +02:00
Michael Peter Christen
a20941c067 resume paused crawls on startup; user expects that restarts 'heal'
everything
2013-04-11 15:07:08 +02:00
Michael Peter Christen
edc0b33f6d - showing references count and clickdepth in host browser
- fixed generation and presentation of both values
2013-04-11 14:46:13 +02:00
reger
566a3b0294 fix: Index Administration > Reverse Word Index (IndexControlRWIs_p) corrected use of word search to word-hash search
- removed duplicate QueryParams.hashes2Handles , redundant  with .hashes2Set
2013-04-08 21:25:21 +02:00
Michael Peter Christen
cf0acd2cb4 upgrade to solr 4.2.1 2013-04-06 16:11:24 +02:00
reger
e89491271f - fix opensearch discover err msg - webgraph not enabled - if no opensearchdescription link found in index
- remove search2.net from sample config (is down)
2013-04-04 00:40:59 +02:00
reger
6a9d0b60a3 make sure configured port is reported on recreated mySeed.txt 2013-04-01 03:51:57 +02:00
Michael Peter Christen
870aedf3c6 fixes for better search interface integration in yaml templates 2013-03-20 16:19:49 +01:00
Michael Peter Christen
5512be6673 fix in GSA result writer which evaluates result context fields as
String. After the migration to Solr 4.1.0 'some' of these fields
suddenly are stored as String[]; this patch compensates this confusion.
2013-03-19 10:33:35 +01:00
Michael Peter Christen
342ba1049b - callback fix
- memory allocation problem in RowCollection: if memory is too low, do
not to try to increase by 1 because this leads to very long execution
time and at the end to the same OOM as if we allocate the memory at the
moment we need it even if the resource observer states that this memory
is not there. To compensate this, the increase size is reduced.
2013-03-19 10:32:01 +01:00
orbiter
65d73e5652 renamed callback function to 'callback' because that is a standard for
jsonp which is also used in backbone.js/jquery
2013-03-19 00:59:47 +01:00
orbiter
17ae51e741 increased number of links limitation from 1000 to 10000 for rss feeds
and html documents
2013-03-17 22:13:56 +01:00
orbiter
e4d26d1cb4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-03-17 10:52:42 +01:00
orbiter
940c6849ee enhanced did-you-mean (a bit): can now remember previously searched
words (plus small enhancements)
2013-03-17 10:52:31 +01:00
reger
d57b221921 add: reset Solr schema filed selection to default button in IndexSchema_p 2013-03-17 03:46:29 +01:00
Michael Peter Christen
9406a2e438 fixed NPE during index abstract computation 2013-03-15 10:04:27 +01:00
Michael Peter Christen
16e9d4d1dd added a restart hint 2013-03-15 10:00:06 +01:00
Michael Peter Christen
b3a54d5b1c fix for wrong class name in log 2013-03-15 09:35:57 +01:00
Michael Peter Christen
2d36a7eaf5 - do not create a new query for all remote peers
- no document search this time
- adjusted banner and network to not show 'WORDS' but DHT Chunks. This
is to avoid confusion for robinson peers which do not create Word
Entries
2013-03-15 00:14:28 +01:00
Michael Peter Christen
4af0839be2 use appropriate ranking for each search situation:
- when using the /date modifier, a date ranking profile is used
- when using a site: modifier, a ranking profile supporting longer urls
is used
2013-03-14 21:13:12 +01:00
Michael Peter Christen
b8ed66a55d added all clickdepth computations for source and target paths in
webstructure core
2013-03-14 17:54:33 +01:00
Michael Peter Christen
6300730d7f refactoring of clickdepth computation as preparation for clickdepth
computation of webgraph links
2013-03-14 12:13:02 +01:00
Michael Peter Christen
2080fc7406 removed unused tag fields 2013-03-14 10:35:21 +01:00
reger
230a12bfe2 adjust Opensearch discover function to new webgraph Solr schema 2013-03-14 03:10:54 +01:00
orbiter
6b13dd0d3d added clickdepth field writing for webgraph core (unfinished) 2013-03-14 01:35:38 +01:00
orbiter
47114910d5 fix for possible memory leaks 2013-03-13 17:55:37 +01:00
Michael Peter Christen
addba047e2 changes in ranking computation
- an existing ranking servlet for solr was extended. It is now possible
to set boost values for fields, boost functions and boost queries.
- The ranking can have different instances, but currently only the first
one is used
- added an abstraction layer for fields which can be used for search and
those fields can be edited in the solr ranking configruation
- the ranking value from solr within the field score is used to combine
remote search requests, which all are created using the same locally
defined boost values
- reduced the number of fields which are used for search (makes it
faster)
- replaced some text fields by string fields (makes indexing faster)
- removed classes which had no use
- made a large number of experiments for a better ranking and created a
temporary setting which prefers hits inside titles
- adjusted also the RWI-based ranking computation to 'prefer title'
- made special cases like for portal search where no post-processing and
post-ranking is wanted: this keeps the original ranking order as done by
Solr
- fixed many bugs with old settings for ranking
2013-03-13 14:47:00 +01:00