Commit Graph

601 Commits

Author SHA1 Message Date
Michael Peter Christen
ac19690d30 refactoring with CommonPattern.COMMA 2015-01-29 01:35:28 +01:00
Michael Peter Christen
cf9b22ca5c do not reindex based on vocabulary fields (there are meanwhile many of
them) and some default settings
2015-01-29 01:22:28 +01:00
reger
24f68a4eb7 refactor opensearch heuristic
introduce FederateSearchManager handling search heuristic to external systems via specific FederateSearchConnectors,
which provide the query() functionallity, the translation to YaCy schema .toYaCySchema() and the search() routine to deliver results to searchevents, which is generally implemented in Abstract connector.
The manager enforces now a min 15s delay between calls to external systems.
Besides the OpensearchConnector a SolrFederateSearchConnector is available. It uses a additional config file for fieldname translation.

default heuristicopensearch.conf: 
- openbdb.com removed - seems not longer to deliver results
- config via solrconnector to  datacite.org added (large technical library archive)
2015-01-19 03:30:35 +01:00
reger
4eb89d7f15 revert clickservlet
(default was indeed a mistakenly)
2015-01-05 09:10:20 +01:00
Michael Peter Christen
61ae9d2d11 do not use the clickservlet by default. From my personal view, this
technique should not be used at all! This project is about privacy, the
existence of a click servlet is one example why people should NOT use a
search portal if such exists.
2015-01-05 08:21:51 +01:00
sixcooler
5594c43d2e bump to Solr-/Lucene-4.10.3 2015-01-04 18:47:47 +01:00
reger
d44d8996d0 Added a “don't store remote search results” option
This is intended for peers who want to participate in the P2P network but don't wish to load/fill-up their index with metadata of every received search result. 
The DHT transfer is not effected by this option (and will work as usual, so that a peer disabling the new store to index switch still receives and holds the metadata according to DHT rules).
Downside for the local peer is that search speed will not improve if search terms are only avail. remote or by quick hits in local index.

To be able to improve the local index a Click-Servlet option was added additionally.
If switched on, all search result links point to this servlet, which forwards the users browser (by html header) to the desired page and feeds the page to the fulltext-index.
The servlet accepts a parameter defining the action to perform (see defaults/web.xml, index, crawl, crawllinks)

The option check-boxes are placed in ConfigPortal.html
2015-01-04 11:10:45 +01:00
reger
e177d69387 remove obsolete config footer option (ConfigPortal user.login)
no footer or footer-option in use

remove unused yacy.init item allowUnlimitedReceiveIndexFrom
2014-12-29 03:50:00 +01:00
reger
6a04563578 Init Jetty using setDefaultDescriptor (web.xml) to defaults/web.xml
so web.xml in defaults dir is applied first and optional DATA/SETTINGS/web.xml loaded on top.
By using this Jetty feature (default web.xml) we assure that changes to the default are applied to existing installations
and individual addition/changes are still respected.
2014-12-27 00:10:14 +01:00
Michael Peter Christen
eb78388a98 changed prefer strategy for http unique in such a way that http is
preferred over https. While this is a bad idea from the standpoint of
security it is more common applicable for environments where http and
https mix and for some domains https is not available. Then the
double-check is possible even if no postprocessing is performed.
2014-12-21 19:17:06 +01:00
Michael Peter Christen
d14114697c the miss cache does not seem to work, it sometimes contains urlhashes
from documents which actually are inside the index. This can be
reproduced using the crawl result table at 
http://localhost:8090/CrawlResults.html?process=5
The cache is temporary disabled to remove the bad behaviour, however a
later reactivation of that feater may be possible.
2014-12-21 17:31:51 +01:00
reger
446f374ba9 fix yacy.init comment
http://mantis.tokeek.de/view.php?id=513
2014-12-14 19:12:18 +01:00
Michael Peter Christen
66b5a56976 Added and integrated new date detection class which can identify date
notions within the fulltext of a document. This class attempts to
identify also dates given abbreviated or with missing year or described
with names for special days, like 'Halloween'. In case that a date has
no year given, the current year and following years are considered.

This process is therefore able to identify a large set of dates to a
document, either because there are several dates given in the document
or the date is ambiguous. Four new Solr fields are used to store the
parsing result:

dates_in_content_sxt:
if date expressions can be found in the content, these dates are listed
here in order of the appearances

dates_in_content_count_i:
the number of entries in dates_in_content_sxt

date_in_content_min_dt:
if dates_in_content_sxt is filled, this contains the oldest date from
the list of available dates

#date_in_content_max_dt:
if dates_in_content_sxt is filled, this contains the youngest date from
the list of available dates, that may also be possibly in the future

These fields are deactiviated by default because the evaluation of
regular expressions to detect the date is yet too CPU intensive. Maybe
future enhancements will cause that this is switched on by default.

The purpose of these fields is the creation of calendar-like search
facets, to be implemented next.
2014-12-14 13:40:45 +01:00
Michael Peter Christen
114f0afc1e enable sku as anchor in html response writer 2014-12-14 04:02:13 +01:00
Michael Peter Christen
60f27bdf49 added the property timeoutrequests to configuration to disable
TimeoutRequests. The purpose is to test if YaCy runs better on VMs where
there is a limitation of concurrent processes;  see
/proc/user_beancounters in row numproc; this value is limited and should
be low. Try to set timeoutrequests to keep this low. (works only after
restart)
2014-12-01 15:20:10 +01:00
Michael Peter Christen
1d45d9405a security bugfix 2014-11-28 01:19:01 +01:00
Michael Peter Christen
c94c24638f disabled postprocessing by default. If you read this: please disable
postprocessing in your peer as well: open /IndexSchema_p.html, then
deselect field process_sxt
2014-11-27 12:13:20 +01:00
Michael Peter Christen
c0f9f6ac66 added option to change the navbar-default, i.e. usable for dark skins 2014-11-26 18:01:35 +01:00
Michael Peter Christen
84763126e0 added option to make the YaCy proxy act as the cache is never stale. If
set to 'Always Fresh' the cache is always used if the entry in the cache
exist. This is a good way to archive web content and access it without
going online again in case the documents exist.
To do so, open /Settings_p.html?page=ProxyAccess and check the "Always
Fresh" checkbox.
This is set do false which behave as set before.
If you set this to true, then you have your web archive in DATA/HTCACHE.
Copy this to carry around your private copy of the internet!
2014-11-24 20:28:52 +01:00
reger
ee277b9b3e allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/)
if file in DATA/SETTINGS it is loaded otherwise file in ./defaults is loaded
   (if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default)

move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory
2014-11-23 05:22:23 +01:00
Michael Peter Christen
c67c5c0709 added new solr schema fields which record the occurences of vocabulary
matchings. These matches can be used for result boosting, i.e. if a
document contains words from a specific vocabulary, boost it.
2014-11-18 15:02:34 +01:00
Michael Peter Christen
68e8039fd1 added high-precision scheduler for API processes. This allows also to
make the execution in dependency of available RAM or CPU load. The
default value for CPU load is 4.0 and the check runs once a minute.
2014-11-14 10:02:50 +01:00
sixcooler
725b206fb4 update to solr-/lucene-4.10.2 2014-11-07 18:51:31 +01:00
Michael Peter Christen
26279b0993 added debug code for statistics about document attributes related to
domains
2014-10-29 10:50:08 +01:00
Michael Peter Christen
2e5214eb21 added field postprocessing.partialUpdate to settings which can be used
to switch on or off partial updates. Both options should cause the same
result. Default is on.
2014-10-17 14:17:49 +02:00
Michael Peter Christen
b1cfbc4a04 added new solr field url_paths_count_i which can be used to enhance the
index browser and maybe also for ranking; possibly also for
SEO-with-YaCy applications.
2014-10-13 23:51:19 +02:00
Michael Peter Christen
8c1a89cb34 added another decoration flag to switch off network graphics in crawler
monitor and index browser: decoration.grafics.linkstructure
Please set this to false to remove the graphics from the interface.
2014-10-08 17:12:35 +02:00
Michael Peter Christen
bc221a0f9c less load and more ram prerequisite for crawl steps 2014-10-08 14:27:38 +02:00
Michael Peter Christen
2a052f446a Added an experimental audio feedback system.
This is the first element of a new 'decoration' component which may hold
switches for different external appearance parameters.
The first switch in that context is decoration.audio (as usual in
yacy.init). This value is set to false by default, that means the audio
feedback element is switched off by default. To switch it on, set
decoration.audio = true (using /ConfigProperties_p.html). You will then
hear sounds for the following events:
- remote searches
- incoming dht transmissions
- new documents from the crawler
Sound clips are stored in htroot/env/soundclips/ which is done so
because a future implementation will read these files using the http
client and with configurable urls which will make it very easy for the
user to replace the given sounds with own sounds.
2014-10-07 17:51:07 +02:00
Michael Peter Christen
f03dd0df24 updated seedlist 2014-09-16 15:49:03 +02:00
Michael Peter Christen
2b1cf26828 removed solr warning during startup 2014-09-02 13:25:30 +02:00
Michael Peter Christen
57ce7eeff3 fixed localhost authorization and replaced the adminRealm with an info
string which is visible in the browser. That makes it possible that the
browser instructs the user how to change a forgotten admin password
(during runtime).
2014-09-02 13:15:19 +02:00
orbiter
f318d7c285 enhanced date-ordered ranking 2014-09-01 13:01:30 +02:00
orbiter
b3ebd38079 removed the HTDOCS repository concept because the concept to host files
on the YaCy http server is obsolete; YaCy can index file:// and smb://
paths
2014-08-26 19:02:53 +02:00
reger
ec5b1d9e33 let NETWORK_WHITELIST take precedence over NETWORK_BLACKLIST
this makes it easier to config exception (for private networks),
like   blacklist= .*
        whitelist= 10\..*,127\..* .....     allows only listed ip pattern
2014-08-26 01:02:38 +02:00
orbiter
2371d6b8db target linktexts must be string to enable search facets on these fields 2014-08-01 13:20:25 +02:00
orbiter
161a11070c yacystats is gone :( 2014-07-29 11:12:01 +02:00
reger
7328c2883b fix type in .init description
http://mantis.tokeek.de/view.php?id=430
2014-07-26 00:38:53 +02:00
reger
94819f0797 set .ini default boost fields to same as assigned by button "reset to default"
(in RankingSolr_p)
- fix typo http://mantis.tokeek.de/view.php?id=430
2014-07-26 00:17:41 +02:00
reger
a2cb366b25 Combine /heuristic search modifier with opensearch configured targets
- with search modifier /heuristic a request is send to all configured opensearch target systems (old /heuristic/blekko modifier not longer valid)
- this allows to use opensearch heuristic on individual search request (in contrast to configuration HEURISTIC_OPENSEARCH=true which sends a osd request on all global searches
- the index.html searchoption text adjusted to be displayed only if option configured
- add Archive-It to predefined systems
2014-07-20 00:00:43 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL Object type which
now also has a different toString method that the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls, just just toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
Michael Peter Christen
1092e798a5 fixed double content postprocessing 2014-07-07 19:15:11 +02:00
Michael Peter Christen
09dcdb9b19 update to solr 4.9.0 2014-07-01 16:39:00 +02:00
orbiter
0bbb5040b8 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-06-15 12:38:52 +02:00
orbiter
9d5d86cd03 Added filter query options to the ranking servlet /RankingSolr_p.html.
Filter queries are not actually related to ranking, but user requests
have pointed out that specific boost queries to move results to the end
of the result list are not sufficient. Such boost filters may be better
executed as actual filter and therefore such a filter can now be
statically applied to every search request. A typical use could be the
expression "http_unique_b:true AND www_unique_b:true" which uses the
recently introduced fields http_unique_b and www_unique_b which are true
only for one of the alternatives with/without http(s) and with/without
prefix 'www.' in host names.
2014-06-15 12:38:30 +02:00
Michael Peter Christen
d2151857f1 Added collection navigation:
The collection field (can be filled i.e. in Crawl Start) can be used to
add categories to YaCy index entries. The usage of that field was
restricted to solr searches and post argument filters as implemented in
commit f7571386a3.
This commit extends collections to a full navigation option in the
standard YaCy search interface. The field is not active by default but
can be activated easily in the /ConfigSearchPage_p.html servlet (just
check the 'Collection' facet field). Collections can now be used for (at
least) two purposes:
- to provide search tenants (through post argument collection)
- to provide self-made category navigation
Search requests may now have (independently from switched on or off
collection facet) a "collection:<collection-name>" modifier attached;
firthermore collection names may use disjunctions using the '|' pipe
symbol. For example, this is a valid search request:
www collection:user|proxy
2014-06-15 12:11:23 +02:00
Michael Peter Christen
922979aae1 added option to prefer http over https in unique-protocol ranking 2014-06-02 17:40:56 +02:00
reger
d8d318233e fix logging settings
- add missing .level
- remove obsolete jena settings
- set default level=INFO to prevent debug logging of not explicite specified classes
2014-06-01 06:43:50 +02:00
Michael Peter Christen
698f053658 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-06-01 01:02:12 +02:00
Michael Peter Christen
f23c4142e0 added option to configure a custom user agent within allip networks 2014-06-01 01:02:03 +02:00
orbiter
d7d38f9135 made number of open files in crawler configurable and increased default
maximum number of open files from 100 to 1000. This number can be
changed with the attribut crawler.onDemandLimit
2014-05-31 09:29:55 +02:00
Michael Peter Christen
ff5b3ac84d added new fields http_unique_b and www_unique_b which can be used for
ranking to prefer urls containing a www subdomain or using the https
protocol
2014-05-27 15:28:28 +02:00
Michael Peter Christen
f0db501630 better handling of ranking parameters and new default values for date
navigation which is done using ranking in solr.
2014-05-22 03:01:07 +02:00
Michael Peter Christen
d4157184ec migration to Solr 4.8.1
This includes also an update to zookeeper 3.4.6 and a new library that
Solr initializes by default: org.restlet from
http://restlet.com/download/current#release=stable&edition=jse&distribution=zip
which is included in version 2.2.1 from may 6th 2014
2014-05-21 11:48:08 +02:00
orbiter
2944822bb0 updated bootstrap seed list 2014-05-20 13:27:40 +02:00
reger
e31493e139 "Use remote proxy for yacy" has no function, remove option and related config item
see/fix bug http://mantis.tokeek.de/view.php?id=23
http://mantis.tokeek.de/view.php?id=189
2014-05-17 23:36:59 +02:00
reger
f02203fb2f fix xml validation error on defaults/web.xml 2014-05-11 04:39:59 +02:00
Michael Peter Christen
229f2248b8 added configuration option for maxmimum load and minimum ram for
postprocessing
2014-04-30 13:26:32 +02:00
Michael Peter Christen
3d5e354471 small changes to search headline colour 2014-04-29 18:46:50 +02:00
Michael Peter Christen
71efc76170 new default skin pdbootstrap which keeps the design shapes but slightly
changes the colours to match with bootstrap colours
2014-04-29 16:23:42 +02:00
reger
d812f80784 add exit proxy link to UrlProxy
on proxied pages a link to exit proxy is added to top of page.
Link text can be configured in web.xml init-parameter (see default/web.xml). If missing no link is displayed.
2014-04-26 22:27:59 +02:00
reger
2dabe2009d - remove unused manual http KeepAlive config
(reducing references to obsolete httpdemon)
- add port info to settings_http
2014-04-18 19:57:35 +02:00
Michael Peter Christen
7a2f3e2353 increased resource.disk.used.max.steadystate and
resource.disk.used.max.overshot by 4 times because first users reached
that limit and wondered why the crawler was paused automatically :)

The crawler will now stop at 2TB disk usage :)
2014-04-17 16:19:38 +02:00
Michael Peter Christen
9a5ab4e2c1 removed clickdepth_i field and related postprocessing. This information
is now available in the crawldepth_i field which is identical to
clickdepth_i because of a specific crawler strategy.
2014-04-16 22:16:20 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into it's own queue. The primary
rule for urls taken from any queue is, that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermorem the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create 10.000s of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). The usage of
such on-demand file reader shall prevent that the number of file
pointers is over the system limit, which is usually about 10.000 open
files. Some parts of YaCy had to be adopted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adopted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
reger
46016fa153 autoupdate fails to download latest release (1.71) due to default release blacklist
- removed the default version blacklist regex from init (for future versions)

!!!  left existing update  blacklist setting untouched !!! 
(existing installation wanting autoupdate for 1.71 need to change blacklist in ConfigUpdate_p.html)

- moved old blacklist patch to migration.java
2014-04-13 07:32:32 +02:00
Michael Peter Christen
ebd44a7080 replaced solr 4.6.1 with solr 4.7.1 and added index migration to
lucene_47
2014-04-06 10:45:03 +02:00
Michael Peter Christen
ee92d748b5 test using compound file format, see UseCompoundFile in
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
This appears to be necessary as many times a
java.io.FileNotFoundException: (Too many open files) appears.
See also: https://issues.apache.org/jira/browse/SOLR-4 and desperate
users at
http://stackoverflow.com/questions/3828343/too-many-open-file-exception-while-indexin-using-solr
We cannot force users to do a "ulimit -n 1000000", so this action seems
to be required.
2014-04-06 00:35:35 +02:00
Michael Peter Christen
0a95fd27f3 update of seed list 2014-04-04 17:04:49 +02:00
Michael Peter Christen
cca851a417 introduced new solr field crawldepth_i which records the crawl depth of
a document. This is the upper limit for the clickdepth_i value which may
be shorter in case that the crawler did not take the shortest path to
the document.
2014-04-02 23:37:01 +02:00
Michael Peter Christen
39b641d6cd added tutorial mode - some menu items will only appear if you 'qualify'
for them. Thus, the first-time user will only see four menu items. The
other items will unfold as the user interacts.
2014-04-02 02:33:17 +02:00
reger
b12200cafe alternative UrlProxyServlet (for /proxy.html) using different url rewrite rules
- use JSoup parser for selective rewrite of html body <a href=  links only,
instead of regex which rewrites also header href/src links
- this improves display of pages which use header <base> tag
- tags with src attribute are taken from original location (like css) improving display and are not routed trough the indexer
Disadvantage: scripting links will drop out of proxy

Setting of the servlet through web.xml exclusivly (in case one would like to quickly switch back to the YaCyProxyServlet,
leaving the existing code of YaCyProxyServlet untouched available)
2014-03-30 04:04:02 +02:00
Michael Peter Christen
e515dd460d added linkscount_i and linksnofollowcount_i to the default solr schema 2014-03-27 23:36:08 +01:00
Michael Peter Christen
a7bc130e27 removed performance settings
- they are incomplete and buggy
- it was not easy to explain
- it did not comply with a KISS strategy
- setting a performance of low priority actually caused crashing of a
peer
- there was nobody who would maintain that functionality
2014-03-27 22:00:57 +01:00
Michael Peter Christen
a28fefba2d activated language facet by default 2014-03-23 11:39:46 +01:00
Michael Peter Christen
617dd9c97b - added new input field in index.html
- changed progress bar in yacysearch.html
- moved pagination navigation to page bottom
- moved search term input field to headline
2014-03-21 02:42:09 +01:00
orbiter
7d24bcb98d added flag to require that all web pages, even such without a "_p"
extension require authorization. (default off)
2014-03-20 19:09:47 +01:00
reger
1fe26550a0 remove AugmentedBrowsing_p.html augmented browsing switch
(has no function in code, previously used in conjuction with http://reflect.ws)
2014-03-19 22:40:35 +01:00
reger
e972b87a8a remove AugmentedBrowsingFilters_p.html as none of the settings are used currently
config settings frome the page also removed from yacy.init
augmentation.reflect
augmentation.addDoctype
augmentation.reparse
interaction.overlayinteraction.enabled
2014-03-17 20:27:04 +01:00
reger
a373fb717d remove more unused from legacy server.http
- triggerOnlineAction not used
- useTemplateCache not used
2014-03-14 03:12:04 +01:00
orbiter
f77afa9d1d add index on _val fields, this affects especially title length
an index on fields make search facets on that field possible
2014-03-04 11:24:04 +01:00
Michael Peter Christen
de8f7994ab as crawling has a low-cpu demand, we want it to run even if the CPU load
is VERY high. This applies also if the CPU load is high because of
in-cache crawling; in that case we want to experience a high-CPU load as
much as possible
2014-02-25 14:17:33 +01:00
Michael Peter Christen
9eb668e951 enhanced the resource observer
The resource observer is now able to recognize free disk space AND
available space for YaCy. The amount of space which is assigned for YaCy
are defined in new settings in the configuration file.
Furthermore, there is now a cleanup process which deletes files in case
that an autodelete is activated. The autodelete is now BY DEFAULT ON if
the disk space is low, which means that YaCy starts to delete documents
when the disk is full!
2014-02-12 01:00:44 +01:00
Michael Peter Christen
ca8b100f96 run the cleanup process even when load is high, do postprocessing even
if load > 1 (but < 2) but only if there is enough memory (now: 0.5 GB
RAM available). The memory amount of the postprocessing is the cause
that systems block because they run into a frequent-GC chain which
almost locks the peer. If running with enough memory, the postprocessing
is fast and not damaging to the system.
Because the required RAM of 0.5 GB is never available in default
setting, the postprocessing will not run if the peer is not reconfigured
to use more memory.
2014-02-10 12:59:30 +01:00
Michael Peter Christen
6e59ca4ebf removed jena library and all code that depended on jena. When jena was
introduced, it was also used for search facets. The generic search
facets are now deduced from generic solr fields which makes jena as tool
for facet semantics superfluous.
2014-02-07 01:20:06 +01:00
Michael Peter Christen
931541d198 re-inserted default value re-set button to performance queues and
patched missing values for recent new queues
2014-02-06 22:39:19 +01:00
Michael Peter Christen
4b7f2fcf38 updated bootstrap seedlist list 2014-01-27 13:55:06 +01:00
reger
a71718a459 add config value for ssl/https port (default=8443)
adjust server routines to use config
2014-01-27 01:09:56 +01:00
reger
cf553e5045 added hint to web.xml and for completeness the full set of hardcoded mappings 2014-01-23 23:56:45 +01:00
Michael Peter Christen
a8fdaace31 changed the web.xml as well to migrate the solr servlet 2014-01-23 18:41:45 +01:00
Michael Peter Christen
be5e808236 - removed hardcoded load-test which is now handled in BusyQueues
steering, see /PerformanceQueues_p.html
- changed default values for crawler queue load limit (high, because
these jobs are started upon user request)
2014-01-21 17:48:45 +01:00
sixcooler
40a4030b55 configurable max-load values for YaCy-Threads:
try lower values on smal systems like a Pi
2014-01-21 17:04:22 +01:00
Michael Peter Christen
77531850b5 reverted crawling strategy from latest commit. 2014-01-21 16:05:55 +01:00
reger
97e84439fb adjusted ConfigHeuristic and changed QueryGoal.getOriginalQueryString to .getQueryString
- since specific heuristic Twitter & Blekko is not longer available or redundant with OpenSearchHeuristic,
adjusted ConfigHeuristic to use OpensearchHeuristic settings only.
For this the default OSD search target list is made available (copied) by default and the other configs are removed.

- the return of QueryGoal.getOriginalQueryString includes the queryModifier, which are held separately in a modifier object,
but in most (all) cases just the query term is expected, clarified and renamed it to QueryGoal.getQueryString which returns
just the search term (if needed a .getOrigianlQueryString could be implemented in Queryparameters, adding the modifiers)

- started to adjust internal html href references from absolute to relative (currently it is mixed).
For future development we should prefer relative href targets (less trouble with context aware  servlets)
2014-01-20 00:58:17 +01:00
reger
d24a0ec32c upd heuristic default list (heuristicopensearch.conf)
- Faroo Web taken out (requires api key) http://www.faroo.com/hp/api/api.html#description
- update Faroo News to new url
- Twitter taken out (change to Api 1.1 not supporting rss) https://dev.twitter.com/discussions/24239
2014-01-20 00:03:55 +01:00
reger
0c754dd794 implemented DIGEST authentication, which is for remote login more secure
as BASIC were pwd is transmitted near clear text (B64enc).
This has some implication as RFC 2617 requires and recommends a password hash MD5(user:realm:pwd) for DIGEST.

!!! before activating DIGEST you have to reassign all passwords !!! to allow new calculation of the hash
- default authentication is still BASIC
- configuration at this time only manually in (DATA/settings) or  defaults/web.xml  (<auth-method>
- the realmname is in defaults/yacy.init  adminRealm=YaCy-AdminUI
- fyi: the realmname is shown on login screen
- changing the realm name invalidates all passwords - but for security you are encouraged to do so (as localhostadmin)
- implemented to support both, old hashes for BASIC and new hashes for BASIC and DIGEST
- to differentiate old / new hash the in Jetty used hash-prefix "MD5:" is used for new pwd-hashes (  "MD5:hash" )
2014-01-17 00:02:23 +01:00
Michael Peter Christen
f8ce7040ab remote search peer selection schema change:
- all non-dht targets (previously separated into 'robinson' for dht-like
queries and 'node' for solr queries) are non 'extra' peers, which are
queries using solr
- these extra-peers are now selected using a ranking on last-seen,
peer-tag-matches, node-peer flags, peer age, and link count. The ranking
is done using a weight and a random factor.
- the number of extra peers is 50% of the dht peers
- the dht peers now exclude too young peers to prevent bad results
during strong growth of the network
- the number of dht peers (and therefore extra-peers) is reduced when
the memory of the peer is low and/or some documents still appear in the
indexing-queue. This shall prevent a peer from deadlocks when p2p
queries are made in a fast sequence on weak hardware.
2014-01-16 17:27:14 +01:00
reger
f09dbbef96 make SecurityHandler webappcontext ready 2014-01-10 12:36:42 +01:00
reger
37f2a82a5d making root context (htroot) a WebAppContext
- this allows additional features, like servlet configuration via web.xml and many more things.
- currently the standard servlets are still configured in the code (so the supplied defaults/web.xml is not realy needed, yet),
  but could be expanded
- lookup for web.xml - 1. in /DATA/SETTINGS then in /defaults
2014-01-10 10:42:47 +01:00
reger
f6099b730d disabled unused fields in default Solr collection schema 2014-01-10 10:26:45 +01:00