Commit Graph

590 Commits

Author SHA1 Message Date
reger
8410536f75 keep svnRevision in .init for convert of .conf until release >1.83 2016-03-20 18:12:55 +01:00
reger
726ebee65a include Version config string in yacy.init (replacing svnRevision) 2016-03-20 03:42:33 +01:00
Michael Peter Christen
f4591b1b51 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2016-03-11 18:12:38 +01:00
Michael Peter Christen
1ce38fdaed 0n - added experimental zeronet network which supports intranet peers
(still needs work)
2016-03-11 08:55:51 +01:00
Michael Peter Christen
d05ffa1c51 update to seed list 2016-03-11 07:20:38 +01:00
reger
16724c1283 remove unused proxyCookieWhiteList from yacy.init 2016-03-11 01:14:54 +01:00
Michael Peter Christen
5d635879f8 Merge pull request #40 from Scarfmonster/autocrawl
Automatic crawling
2016-01-14 22:19:55 +01:00
Ryszard Goń
7d6e0d8470 Add missing settings to autocrawl settings page 2016-01-14 03:27:33 +01:00
Ryszard Goń
a98c395023 Add the Autocrawl thread 2016-01-14 00:50:23 +01:00
reger
4765e374e6 altered calc. of search result items per page to display,
taking the existing limits into account but making it consistent with the search option screen for admin and public users
changes:
  - configured default number of items per page (ConfigPortal_p.html) is used as is (no hardcoded limit)
  - otherwise requests are limited to 100 results per page ( = search option, index.html)
      (this basically is the major change, inc. limit from 20 to 100 for public user)
P.S. - the older grant of more (1000), if no online snippet calculation, is kept (for the time being)

see http://mantis.tokeek.de/view.php?id=627
2016-01-13 01:30:49 +01:00
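The page-size rules described in commit 4765e374e6 can be sketched roughly as follows. This is an illustration only, not YaCy's code: the function name, its parameters, and the reading that the configured default applies when a request names no count are assumptions.

```python
from typing import Optional

MAX_ITEMS_PER_PAGE = 100            # limit named in the commit message
MAX_ITEMS_WITHOUT_SNIPPETS = 1000   # older, more generous limit the commit keeps


def items_per_page(requested: Optional[int], configured_default: int,
                   online_snippets: bool = True) -> int:
    """Hypothetical helper; name and signature are illustrative only."""
    if requested is None:
        # Assumption: without an explicit count the configured default
        # is used as is, with no hardcoded limit applied to it.
        return configured_default
    # Explicit requests are capped; the cap is higher when no online
    # snippet calculation is involved.
    limit = MAX_ITEMS_PER_PAGE if online_snippets else MAX_ITEMS_WITHOUT_SNIPPETS
    return min(requested, limit)
```

Under these assumptions a public request for 500 items is served 100, while an admin-configured default of 250 is passed through unchanged.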
Ryszard Goń
1728cd30c6 Create autocrawl profiles 2016-01-12 16:28:34 +01:00
reger
e8256bb3b1 remove blekko from opensearch config (not available)
see https://blekko.com/
http://searchengineland.com/goodbye-blekko-search-engine-joins-ibms-watson-team-217633
2016-01-04 04:49:10 +01:00
reger
a5faf73afa remove obsolete yacy.init entries interaction.*
(related to removed triplestore)
2015-12-29 15:41:19 +01:00
sixcooler
dce1cb65c4 Merge remote-tracking branch 'choose_remote_name/master' 2015-12-28 23:20:42 +01:00
reger
e84d94f8ca fix mime table for ms office / open office documents
(causing wrong parser detect in intranet mode)
2015-12-22 17:48:24 +01:00
reger
15e46b2bad exclude in/outboundlinksnofollowcount_i from default schema fields
(not used in any function)
2015-12-19 21:25:08 +01:00
luc
8c4ab9c76b Added an option to optionally limit the size of remote solr documents put to
local index. See mantis #626.
2015-12-16 02:20:03 +01:00
luc
55a4d15775 Added a note on deprecated default search field and operator. 2015-12-14 23:55:12 +01:00
reger
b2c8bc0ae6 remove md5_s from default index fields
it is not assigned a value / not used
Due to above also excluded from transfer protocol.
2015-11-27 02:41:02 +01:00
sixcooler
f5a9948860 do not store subfield *_coordinate 2015-11-10 20:32:42 +01:00
sixcooler
fca353e5eb set startuptype of most solr handlers to lazy 2015-11-10 20:32:05 +01:00
reger
c720b4c249 remove override of dynamicField coordinate_p in solr schema
(coordinate_p is not a mandatory field and as such doesn't need to be declared as schema.field)
2015-10-24 22:44:28 +02:00
reger
f0b5bc93a3 remove obsolete yacy.init entry "secureHttps"
not used anywhere
2015-10-19 03:47:28 +02:00
reger
5e45f1a460 enable Solr schema dynamicField _p (type=location) for YaCy coordinate_p field 2015-09-01 21:47:25 +02:00
sixcooler
87e4abe393 fight the fieldcache by using DocValues: in Solr-5.x the fieldcache has
moved and was not cleared anymore. This results in a huge fieldcache.
(http://lucene.apache.org/#highlights-of-the-lucene-release-include
https://issues.apache.org/jira/browse/LUCENE-5666)
Here I try to use DocValues where it is possible.
For this I used the Api-Scheme as the new basis for the Solr-Schema.
This needs at least a complete optimization of the Solr-Index to get a
smaller FieldCache.
Everything that is indexed with these settings will not use the
Fieldcache at all.
2015-08-31 20:24:41 +02:00
reger
250f6457f0 remove expired domain titan.deep-one.in from bootstrap.seedlist 2015-08-26 23:58:08 +02:00
Michael Peter Christen
df3314ac1a added a new facet type based on a probabilistic classifier using
bayesian filters. This can be used to classify documents during
indexing-time using a pre-defined bayesian filter.

New wordings:
- a context is a class where different categories are possible. The
context name is equal to a facet name.
- a category is a facet type within a facet navigation. Each context
must have several categories, at least one custom name (things you want
to discover) and one with the exact name "negative".

To use this, you must do:
- for each context, you must create a directory within
DATA/CLASSIFICATION with the name of the context (the facet name)
- within each context directory, you must create text files with one
document each per line for every category. One of these categories MUST
have the name 'negative.txt'.

Then, each new document is classified to match within one of the given
categories for each context.
2015-08-10 14:27:44 +02:00
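The directory layout described in commit df3314ac1a could be prepared with a small script like the one below. It is a minimal sketch: the function name, the sample context and category names, and the flattening of embedded newlines are assumptions, and only the DATA/CLASSIFICATION/<context>/<category>.txt layout and the mandatory "negative" category come from the commit message.

```python
from pathlib import Path
from typing import Dict, List


def write_context(data_dir: Path, context: str,
                  categories: Dict[str, List[str]]) -> Path:
    """Create <data_dir>/CLASSIFICATION/<context>/<category>.txt files.

    Hypothetical helper, not part of YaCy.
    """
    if "negative" not in categories:
        raise ValueError("each context needs a category named 'negative'")
    if len(categories) < 2:
        raise ValueError("each context needs at least one custom category")
    context_dir = data_dir / "CLASSIFICATION" / context
    context_dir.mkdir(parents=True, exist_ok=True)
    for category, documents in categories.items():
        # One training document per line, so whitespace runs (including
        # newlines) inside a document are collapsed to single spaces.
        lines = [" ".join(doc.split()) for doc in documents]
        (context_dir / f"{category}.txt").write_text(
            "\n".join(lines) + "\n", encoding="utf-8")
    return context_dir
```

For example, `write_context(Path("DATA"), "topic", {"sports": [...], "negative": [...]})` would produce DATA/CLASSIFICATION/topic/sports.txt and DATA/CLASSIFICATION/topic/negative.txt.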
Michael Peter Christen
e1cd9c0dba added another default network / commented out 2015-07-09 16:25:11 +02:00
reger
00d2062813 Remove deprecated AdminHandlers in solrconfig.xml
avoid warning log
W  org.apache.solr.handler.admin.AdminHandlers <requestHandler name="/admin/"  class="solr.admin.AdminHandlers" /> is deprecated . It is not required anymore
2015-07-01 00:58:23 +02:00
Michael Peter Christen
694b22f165 migration to Solr 5.2: huge benefits - this is a lot faster!
This is a very complex migration: many classes had been renamed or
removed, dependencies changed and the solr index type is now aligned to
be a solr cloud repository.
Together with the Solr 5.2 library update, one other dependent library
had been updated as well: httpclient 4.4->4.4.1

Older indexes are migrated from 4_10 to 5_2. However, the new index
structure is more efficient and we recommend to re-index everything.
Before the update, please export the index to a large
surrogate xml file. After the update, start with an empty index and then
initialize this with your dump.
2015-06-24 01:55:51 +02:00
Michael Peter Christen
9c12555be5 added link to Snapshots in search results if the snapshot exists and
option is set in ConfigSearchPage_p
(this is a stub: we also need a visualization of pdf files!)
2015-06-07 20:37:37 +02:00
reger
6bc8a9b11e make Quality of Service Servlet available to prioritize requests from local host
This assigns priorities to incoming requests. Higher priority numbers are served before lower.
(disabled by default in defaults/web.xml, 
uncomment or copy entry to DATA/Settings/web.xml)
2015-04-26 04:29:32 +02:00
Michael Peter Christen
b060ba900d added parsing of contentprop attribute in html tags for
content='startDate' and content='endDate'. The values of these fields are
now written to new solr fields startDates_dts and endDates_dts.
2015-04-13 16:20:00 +02:00
Michael Peter Christen
4cb4f67f38 added parsing of dd, dt and article html fields. The parsed result is
written to special solr fields which are deactivated by default.
2015-04-12 22:02:45 +02:00
Michael Peter Christen
36e9cdb376 testing switching off cold searchers; maybe this brings performance
enhancements when using large facets
2015-04-07 13:14:41 +02:00
Michael Peter Christen
535f1ebe3b added a new way of content browsing in search results:
- date navigation

The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.

The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search results, a histogram
showing the number of documents for each day is displayed. To render
these histograms the morris.js library is used. Morris.js requires also
raphael.js which is now also integrated in YaCy.

The histogram is now also displayed in the index browser by default.

To select a specific range from a search result, the following modifiers
have been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'to:' - modifier.

The histogram shows blue and green lines; the green lines denote weekend
days (saturday and sunday).

Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove the from and to modifiers and set an on:<date> for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).

The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
2015-03-02 04:30:10 +01:00
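The click cycle on histogram bars described in commit 535f1ebe3b behaves like a small state machine, sketched below. The function names, the dictionary representation of modifiers, and the ISO date strings are assumptions for illustration; the commit message does not state the date format, nor whether an existing on: modifier is dropped on the 4th click (assumed here).

```python
from typing import Dict


def click_bar(modifiers: Dict[str, str], date: str) -> Dict[str, str]:
    """Return the date modifiers after clicking the bar for `date`.

    Hypothetical model of the click sequence only, not YaCy's code.
    """
    if "from" in modifiers and "to" in modifiers:
        # 3rd click: replace the interval by a single-day selection.
        return {"on": date}
    if "from" in modifiers:
        # 2nd click: close the interval started by the 1st click.
        return {"from": modifiers["from"], "to": date}
    # 1st click, or 4th click when only on: is set: start a new interval.
    # Assumption: an existing on: modifier is dropped here.
    return {"from": date}


def to_query(terms: str, modifiers: Dict[str, str]) -> str:
    """Append the modifiers to the search terms as name:value tokens."""
    return " ".join([terms] + [f"{name}:{value}" for name, value in modifiers.items()])
```

Starting from no modifiers, three successive clicks yield a from: modifier, then from: plus to:, then a single on: modifier; a fourth click starts over with from:.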
reger
ba276d3e64 add description_txt to default query fields,
Dublin Core Metadata field extracted by most parsers.
2015-02-22 05:42:04 +01:00
reger
fe6f5a395d fix Umlaut handling in blekko heuristic search term
http://mantis.tokeek.de/view.php?id=169
observation: blekko seems to block xxxbot agents (=0 results)
2015-02-08 23:40:33 +01:00
Michael Peter Christen
97ba5ddbb7 configuration option for maxload limit for remote search 2015-02-04 01:12:25 +01:00
Michael Peter Christen
ac19690d30 refactoring with CommonPattern.COMMA 2015-01-29 01:35:28 +01:00
Michael Peter Christen
cf9b22ca5c do not reindex based on vocabulary fields (there are meanwhile many of
them) and some default settings
2015-01-29 01:22:28 +01:00
reger
24f68a4eb7 refactor opensearch heuristic
introduce FederateSearchManager handling search heuristics to external systems via specific FederateSearchConnectors,
which provide the query() functionality, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector.
The manager now enforces a min 15s delay between calls to external systems.
Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses an additional config file for fieldname translation.

default heuristicopensearch.conf: 
- openbdb.com removed - no longer seems to deliver results
- config via solrconnector to  datacite.org added (large technical library archive)
2015-01-19 03:30:35 +01:00
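The minimum delay between calls to external systems mentioned in commit 24f68a4eb7 can be illustrated with a simple throttle. This is a sketch under assumptions, not the FederateSearchManager implementation: the class and method names are hypothetical, and tracking the delay per target system (rather than globally) is a guess.

```python
import time
from typing import Callable, Dict

MIN_DELAY_SECONDS = 15.0  # minimum delay named in the commit message


class CallThrottle:
    """Hypothetical helper that rejects calls made too soon after the last one."""

    def __init__(self, min_delay: float = MIN_DELAY_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_delay = min_delay
        self.clock = clock  # injectable so the behaviour can be tested
        self.last_call: Dict[str, float] = {}

    def try_acquire(self, target: str) -> bool:
        """Return True and record the call if `target` may be queried now."""
        now = self.clock()
        last = self.last_call.get(target)
        if last is not None and now - last < self.min_delay:
            return False  # too early, skip this external system for now
        self.last_call[target] = now
        return True
```

A caller would check `try_acquire(...)` before each external query and skip the query when it returns False.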
reger
4eb89d7f15 revert clickservlet
(the default was indeed a mistake)
2015-01-05 09:10:20 +01:00
Michael Peter Christen
61ae9d2d11 do not use the clickservlet by default. From my personal view, this
technique should not be used at all! This project is about privacy, the
existence of a click servlet is one example why people should NOT use a
search portal if such exists.
2015-01-05 08:21:51 +01:00
sixcooler
5594c43d2e bump to Solr-/Lucene-4.10.3 2015-01-04 18:47:47 +01:00
reger
d44d8996d0 Added a “don't store remote search results” option
This is intended for peers who want to participate in the P2P network but don't wish to load/fill-up their index with metadata of every received search result. 
The DHT transfer is not affected by this option (and will work as usual, so that a peer disabling the new store to index switch still receives and holds the metadata according to DHT rules).
Downside for the local peer is that search speed will not improve if search terms are only avail. remote or by quick hits in local index.

To be able to improve the local index a Click-Servlet option was added additionally.
If switched on, all search result links point to this servlet, which forwards the users browser (by html header) to the desired page and feeds the page to the fulltext-index.
The servlet accepts a parameter defining the action to perform (see defaults/web.xml, index, crawl, crawllinks)

The option check-boxes are placed in ConfigPortal.html
2015-01-04 11:10:45 +01:00
reger
e177d69387 remove obsolete config footer option (ConfigPortal user.login)
no footer or footer-option in use

remove unused yacy.init item allowUnlimitedReceiveIndexFrom
2014-12-29 03:50:00 +01:00
reger
6a04563578 Init Jetty using setDefaultDescriptor (web.xml) to defaults/web.xml
so web.xml in defaults dir is applied first and optional DATA/SETTINGS/web.xml loaded on top.
By using this Jetty feature (default web.xml) we assure that changes to the default are applied to existing installations
and individual addition/changes are still respected.
2014-12-27 00:10:14 +01:00
Michael Peter Christen
eb78388a98 changed prefer strategy for http unique in such a way that http is
preferred over https. While this is a bad idea from the standpoint of
security it is more commonly applicable for environments where http and
https mix and for some domains https is not available. Then the
double-check is possible even if no postprocessing is performed.
2014-12-21 19:17:06 +01:00
Michael Peter Christen
d14114697c the miss cache does not seem to work, it sometimes contains urlhashes
from documents which actually are inside the index. This can be
reproduced using the crawl result table at 
http://localhost:8090/CrawlResults.html?process=5
The cache is temporarily disabled to remove the bad behaviour, however a
later reactivation of that feature may be possible.
2014-12-21 17:31:51 +01:00