Commit Graph

3936 Commits

Author SHA1 Message Date
reger
c77e43a391 Take out mailto collect in internal parsed document
As earlier plans to make use of mailto as separate webgraph entity didn't
materialize (see  http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5726&p=32493&hilit=mailto#p32493)
free the unused handling and resources.
2017-04-20 00:18:18 +02:00
Michael Peter Christen
335868edba Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-17 12:26:27 +02:00
reger
bec34d3546 Add url input field as source for WarcImporter
allowing to import warc from url without prior download.
2017-04-16 04:25:29 +02:00
luccioman
f66438442e Extended Mediawiki dump import to remote URLs.
When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file now is directly streamed and processed, allowing import of several
GB dumps even with a low memory remote peer, and without need to
manually download the dump file first.
2017-04-14 14:32:44 +02:00
luccioman
e5c3b16748 Improved http client close time on stream processing errors. 2017-04-14 14:23:50 +02:00
luccioman
23775e76e2 Fixed endless loop case in wikicode processing.
Detected when importing recent MediaWiki dumps containing some pages
with script content in plain text format (see Scribunto extension
https://www.mediawiki.org/wiki/Extension:Scribunto ).

Further improvement : modify the MediawikiImporter to prevent processing
revisions whose <model> is not wikitext.
2017-04-12 17:17:03 +02:00
luccioman
0bc868a819 Improved support for non ASCII chars in local file system URLs
Creating a MultiProtocolURL instance from a File object and then
retrieving a File with getFSFile() was inconsistent with file paths
containing space or non ASCII chars.
2017-04-12 09:23:10 +02:00
reger
7b80189bda Activate hosts navigator plugin. This includes rwi results in the navigator
count.
This might be tangential related to http://mantis.tokeek.de/view.php?id=736
as the example includes a local index search, while rwi results are not
counted.
2017-04-10 22:42:06 +02:00
Michael Peter Christen
f5ad29edb1 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-07 09:15:15 +02:00
Michael Peter Christen
76e9135526 added flatjson parser (stub, unfinished) 2017-04-07 09:15:05 +02:00
reger
b7417ac329 Introduce a Keyword search navigator using the index field keywords.
The keywords field string is split into words as navigator entries.

A keyword navigator facet is essential for search appliance usage were
documents and metadata use often specialized keyword vocabularies to 
filter search results. This navi can be used without custom index schema.

As we don't have defined a search query command to filter "keywords" yet,
the filtering is limited by adding the keyword to the search query.
2017-04-05 00:08:25 +02:00
luccioman
09e72eb0a4 Set Config Portal as a private administration page.
Consistently with its required action from submission credentials, and
because external unauthenticated users do not need to access these
settings.
2017-04-03 11:34:49 +02:00
reger
ba339a2a45 Add servlet to import warc file from filesystem IndexImportWarc_p.html.
Apply Importer interface to WarcImporter
2017-04-02 03:32:21 +02:00
Michael Peter Christen
1d81b8f102 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-01 01:04:27 +02:00
Michael Peter Christen
69081bce00 added export to elasticsearch. The export dump can easily be imported to
elasticsearch using the command
curl -XPOST localhost:9200/collection1/yacy/_bulk --data-binary
@yacy_dump_XXX.flatjson
2017-04-01 01:04:17 +02:00
reger
510f11d374 Implement surrogate import from Warc archives (as first option handle
warc = Web ARChive File Format.
Warc files with extension .warc or compressed warc.gz can be placed in the
DATA/surrogate/in and contained responses are imported to the index.
The used library is stream based so we can easily extend it later to use
and load warc's from the net.
2017-03-31 00:58:11 +02:00
luccioman
4b649b0a11 Fixed NPE case and API URL link on Solr HTML output for webgraph core. 2017-03-30 15:41:14 +02:00
luccioman
af28a07780 Updated API calls recording/replay with recent changes.
- enabled HTTP POST calls with Digest HTTP authentication
 - made API calls compatible with API newly restricted to HTTP POST only
with transaction token validation
 - ensured backward compatibility with older entries recorded as HTTP
GET
2017-03-30 09:22:28 +02:00
reger
81670c3484 One more use of SwitchboardConstants.SERVER_PORT constant,
apply standard servlet design pattern initialization of solrselectservlet
2017-03-26 20:05:48 +02:00
luccioman
cde237b687 Enforced access controls on some administrative actions.
- ensure use of HTTP POST method : HTTP GET should only be used for
information retrieval and not to perform server side effect operations
(see HTTP standard https://tools.ietf.org/html/rfc7231#section-4.2.1)
 - a transaction token is now required for these administrative form
submissions to ensure the request can not be included in an external
site and performed silently/by mistake by the user browser
2017-03-26 11:48:00 +02:00
luccioman
df5970df6d Extended Apache HTTP Digest Auth. for use of YaCy encoded password
When programmatically requesting the local peer with Apache http client,
authentication credentials must be passed as clear-text values. 
This extension to the apache org.apache.http.impl.auth.DigestScheme
permits use of the YaCy encoded password stored in the
adminAccountBase64MD5 configuration property.
2017-03-26 11:32:44 +02:00
reger
f05976c017 Display the local search word statistic in alphabetic order 2017-03-19 07:12:35 +01:00
reger
3dd23c178b Introduce the option to configure a shutdown port.
A port value of -1 will disable this option.

If set to a value greater 0, YaCy listens on this of on the local loopback 
address (127.0.0.1) for a shutdown or restart signal.
E.g. connect to http://localhost:8005/shutdown will stop the YaCy server.
http://localhost:8005/restart will restart it.
This option allows to stop YaCy locally independant from the web web 
frontend (which might be configured for password protected remote access).
2017-03-19 02:30:08 +01:00
reger
a2afb4bae0 add switchboardconstants for server ports config keys 2017-03-18 20:02:26 +01:00
reger
56d0a87a83 remove double occuance of geo:lat in rss tokens 2017-03-13 03:08:44 +01:00
reger
b4fa1141b8 implement RequestHeader getRequestURI, getRequestURL for legacy request 2017-03-12 01:54:56 +01:00
reger
209a7374bd remove unused import pdfParser 2017-03-09 22:57:51 +01:00
reger
de1c1c16db Improve pdf text extraction resource handling.
For sort pdf <= 3 pages use already extracted content,
only for long pdf > 3 pages reassign content and close internal writer (to direct free buffers)
2017-03-09 22:56:33 +01:00
reger
9b6d1abd9e eliminate some compiler unchecked and deprecation warnings
in nav plugins by explicite type declaration and replacing date.getYear
with Calendar.get
2017-03-09 01:42:36 +01:00
reger
18c7563dbe Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages
by using icu.ULocale for languages not already covered (ICU normalizes 
to ISO639-1 2 char codes).
Add test class
Use DublinCore vocabulary declarations in DCEntry and SurrogateReader 
for easier usage debugging, 
Init SurrogateReader.inputSource on first use.
2017-03-05 02:26:10 +01:00
reger
ce87025462 further avoid to set connect info properties as header value
following comment "use of properties as header values is discouraged"
in case where (proxy)HTTPClient overwrites values with supplied url.
Use defined request.referer procedure in response class.
2017-03-04 22:45:17 +01:00
reger
cd4d891ea4 use pre-defined "Connection" header key, replace depreceated 2017-03-04 19:41:31 +01:00
luccioman
0173b0bc32 Added an advanced settings page for referrer policy settings.
Feedback will be welcome, notably on the descriptive content of this
page.
2017-03-03 12:05:30 +01:00
reger
81963a89fe fix proxyservlet response url to respect http scheme if a relative
Location header is returned.
2017-03-03 00:21:56 +01:00
luccioman
cdcd923375 Privacy enhancement : added settings to control referrer policy.
HTTP "Referer" header sent by the browser when using YaCy can now be
controlled either with the referrer meta tag as a global policy, or only
for search result links by adding the attribute rel="noreferrer".

To improve privacy with the less possible regressions, the default is
set as meta tag with value "origin-when-cross-origin" : internal YaCy
links behavior is not affected, but when visiting external websites
referrer url is not empty but stripped from query parameters and path.

Older browsers, Safari, MS IE and Edge do not support the referrer meta
tag, so the standard but less flexible noreferrer link type can also be
enabled as an alternative.

User-friendly settings page to be implemented.
2017-02-28 18:11:54 +01:00
reger
86534a56f7 fixed ReindexSolrBusyThread new and unexpected repeat of same query with
low number of found documents - by adding additional end condition to 
remove processed query with number of found docs <= process-chunck-size.

Noticed on query h4_txt:[* TO *], found 21, process 21, call of commit happend
but on next cycle same query again 21 docs found (while h4_txt was removed 
from schema and committed inputdocuments).
2017-02-27 23:00:46 +01:00
reger
275c0cddd1 Adjust DefaultServlet test case to recent change,
depreciate unused CONNECTION_PROP_PROTOCOL (also as it might be 
misleading with getProtocol vs getScheme)
2017-02-26 02:39:52 +01:00
reger
41e2ee0eca Fix call parameter for ConnectionInfo in MonitorHandler
(expected scheme e.g. http, was protocol version).
Depreceate obsolete custom X-...-Scheme header constant.
Use existing FORMAT_ANSIC Dateformatter in HeaderFramework.
Correct htmlParserTest (del one not intended println)
2017-02-25 23:55:17 +01:00
luccioman
ac766327d3 Switched a few more Solr fields from strictly mandatory to optional 2017-02-24 11:08:18 +01:00
reger
f254fcfc67 fix htmlParser <script> text extraction on code containing expression
recognized as tag like 1<a
reported in https://github.com/yacy/yacy_search_server/issues/109

Script content is ignored by default, but the text is filtered for html
tags. Modified scraper to skip tag filtering while within a <script> 
section (until a closing tag is detected </script>. 
Possible side effect, missing </script> end-tag will truncate trailing 
content text.
2017-02-24 01:25:32 +01:00
luccioman
2f191e0e1c Improved MultiprocotolURL non ASCII characters support.
After @sinkuu Pull Request #108 added JUnit tests, updated some JavaDoc
and also improved URL tokenization to support non ASCII characters.
2017-02-23 11:09:43 +01:00
luccioman
18e8b3a220 Merge branch 'escape' of https://github.com/sinkuu/yacy_search_server 2017-02-23 11:03:05 +01:00
reger
7419989de3 Correct dublincore title property text to lowercase in htmlresponsewriter,
remove unused (carry over) local variable
Do the same for other responsewriter.
2017-02-23 00:27:56 +01:00
Burkhard
4fdc11cae8 Update SearchEvent.java
Fix NPE on disabled local SolrIndex, occuring on search moving to the 2nd result page.
The debug purpose only setting to disabeling local SolrIndex (System Admin -> Debug Settings) should long term probably be removed from production code.
2017-02-22 02:01:48 +01:00
luccioman
cdc7f3e431 Switched some Solr fields from mandatory to optional
These fields are default enabled but with no doubt not strictly
mandatory with the current code base.

As reported by @reger24, splitting between essential mandatory and
optional fields is still to be improved to reflect the current YaCy
needs.
2017-02-21 22:59:11 +01:00
luccioman
3475d8c1a9 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2017-02-20 10:48:44 +01:00
luccioman
c68a8be2d9 Refactored and enforced Solr mandatory fields for proper operation
- Added a new method to check activation of mandatory fields on
Collection Configuration commit, consistently with checks previously
performed in Switchboard startup and with mandatory fields in the
default schema.
- Reorganized default schema and CollectionConfiguration enumeration :
moved no more mandatory fields in a specific section, and moved fields
enabled at startup to the mandatory section. 
- Marked mandatory fields as required and with stronger font in the
IndexSchema_p.html page
2017-02-20 10:48:07 +01:00
reger
334c70c37a correct fromDate init value on missing param in api/timeline_p servlet
revert test modification from last commit in AccessTracker.main
2017-02-20 00:14:14 +01:00
reger
cc770512d5 add hint of query syntax in AccessTracker log (qs=normal querystring,
sq=solr-querystring) to allow to filter simple text queries for processing,
remove toString for counter parameter
use more predefined constants in solrservlet
2017-02-19 05:23:17 +01:00
luccioman
e5858bc8c8 Fixed a NullPointerException case possible on Index Export
As reported by Palulukas in YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=18&t=5944&sid=dcef5b899ab4aa9b40e3a3d158c13aed#p33454)
the Index Export operation can fails, notably when the Solr index
contains one or more documents with empty (despite required)
"load_date_dt" field.

This fixes the export failure when the situation finally occurs, but
more should be done to harden verifications on minimum required fields.
2017-02-17 11:09:30 +01:00