13091 Commits

Author SHA1 Message Date
reger
209a7374bd remove unused import pdfParser 2017-03-09 22:57:51 +01:00
reger
de1c1c16db Improve pdf text extraction resource handling.
For short PDFs (<= 3 pages) use the already extracted content;
only for long PDFs (> 3 pages) reassign the content and close the internal writer (to free buffers directly)
2017-03-09 22:56:33 +01:00
reger
52c9d0c858 upd to pdfbox-2.0.4.jar 2017-03-09 22:50:19 +01:00
reger
9b6d1abd9e eliminate some compiler unchecked and deprecation warnings
in nav plugins by explicit type declaration and replacing date.getYear
with Calendar.get
2017-03-09 01:42:36 +01:00
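
The date.getYear replacement described in commit 9b6d1abd9e can be illustrated with a minimal sketch. The class and method names below (YearOfDate, yearOf) are hypothetical and are not taken from the nav plugins; only the java.util calls are standard API.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

// Hypothetical helper, not YaCy code: reads the year without the deprecated Date.getYear().
class YearOfDate {

    /** Returns the full Gregorian calendar year of the given date in the default time zone. */
    static int yearOf(Date date) {
        Calendar cal = new GregorianCalendar();
        cal.setTime(date);
        // Calendar.YEAR is the full year; the deprecated Date.getYear() returns (year - 1900).
        return cal.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        Date date = new GregorianCalendar(2017, Calendar.MARCH, 9).getTime();
        System.out.println(yearOf(date)); // prints 2017
    }
}
```
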
reger
6eb7d27449 upd to httpclient v4.5.3 2017-03-08 22:35:48 +01:00
luccioman
8e77fe3860 Fixed unresolved pattern case in search results progress bar.
This is a fix for mantis 715 (http://mantis.tokeek.de/view.php?id=715).

A possible scenario that could lead to this case:
 - YaCy is running low in memory
 - a search is requested
 - before the end of search results rendering, the cleanup job runs and
deletes the running search event from the cache because of short memory
 - then yacysearchitem renders with "-UNRESOLVED_PATTERN-" parameter
values passed to the statistics() JavaScript function
2017-03-08 10:27:18 +01:00
luccioman
79df5bb20a Fixed settingsAck_p.html back link for case where referrer is stripped. 2017-03-07 12:27:27 +01:00
reger
18c7563dbe Extend DCEntry.getLanguage to convert to ISO 639-1 codes for more languages
by using icu.ULocale for languages not already covered (ICU normalizes
to ISO 639-1 two-character codes).
Add test class.
Use DublinCore vocabulary declarations in DCEntry and SurrogateReader
for easier usage and debugging.
Init SurrogateReader.inputSource on first use.
2017-03-05 02:26:10 +01:00
reger
ce87025462 further avoid setting connect info properties as header values,
following the comment "use of properties as header values is discouraged",
in cases where (proxy)HTTPClient overwrites values with the supplied url.
Use defined request.referer procedure in response class.
2017-03-04 22:45:17 +01:00
reger
cd4d891ea4 use pre-defined "Connection" header key, replace deprecated 2017-03-04 19:41:31 +01:00
luccioman
5b03feb776 Fixed unresolved pattern case on /yacysearchlatestinfo.json api 2017-03-03 13:46:44 +01:00
luccioman
0173b0bc32 Added an advanced settings page for referrer policy settings.
Feedback will be welcome, notably on the descriptive content of this
page.
2017-03-03 12:05:30 +01:00
reger
81963a89fe fix proxyservlet response url to respect http scheme if a relative
Location header is returned.
2017-03-03 00:21:56 +01:00
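
As a rough illustration of the issue addressed in commit 81963a89fe, a relative Location header can be resolved against the original request URL so that the scheme of the request is preserved. This is a sketch only: the class name, method name and example URLs are assumptions, and the actual proxy servlet code may work differently.

```java
import java.net.URI;

// Hypothetical helper, not the YaCy proxy servlet implementation.
class LocationResolver {

    /** Resolves a (possibly relative) Location header value against the request URL. */
    static String resolveLocation(String requestUrl, String location) {
        // URI.resolve keeps scheme and authority of the base when the reference is relative,
        // and returns an absolute reference unchanged.
        return URI.create(requestUrl).resolve(location).toString();
    }

    public static void main(String[] args) {
        String request = "https://example.org/a/b.html";
        System.out.println(resolveLocation(request, "/login"));                 // https://example.org/login
        System.out.println(resolveLocation(request, "c.html"));                 // https://example.org/a/c.html
        System.out.println(resolveLocation(request, "http://other.example/x")); // http://other.example/x
    }
}
```
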
luccioman
9d9f86dcdd Updated Archive-It heuristics URL.
The archive-it OpenSearch URL requested without restriction on
collections ("i" parameter) almost always ends up with timeout or fails.
2017-03-01 09:43:00 +01:00
luccioman
cdcd923375 Privacy enhancement : added settings to control referrer policy.
HTTP "Referer" header sent by the browser when using YaCy can now be
controlled either with the referrer meta tag as a global policy, or only
for search result links by adding the attribute rel="noreferrer".

To improve privacy with the fewest possible regressions, the default is
set as meta tag with value "origin-when-cross-origin" : internal YaCy
links behavior is not affected, but when visiting external websites
the referrer url is not empty but stripped of query parameters and path.

Older browsers, Safari, MS IE and Edge do not support the referrer meta
tag, so the standard but less flexible noreferrer link type can also be
enabled as an alternative.

User-friendly settings page to be implemented.
2017-02-28 18:11:54 +01:00
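
The effect of the "origin-when-cross-origin" value described in commit cdcd923375 can be sketched as follows. This only models what a supporting browser would send as the Referer header; it is not YaCy code. The class name, the example host and port, and the simplified origin comparison (default ports are not normalized) are assumptions made for illustration.

```java
import java.net.URI;

// Simplified model of the "origin-when-cross-origin" referrer policy (browser-side behavior).
class ReferrerPolicySketch {

    /** Serializes scheme, host and explicit port of a URL, e.g. "http://localhost:8090". */
    static String origin(URI url) {
        String base = url.getScheme() + "://" + url.getHost();
        return url.getPort() == -1 ? base : base + ":" + url.getPort();
    }

    /** Returns the referrer value sent when navigating from one URL to another. */
    static String referrer(String fromUrl, String toUrl) {
        URI from = URI.create(fromUrl);
        URI to = URI.create(toUrl);
        if (origin(from).equalsIgnoreCase(origin(to))) {
            return fromUrl; // same origin: the full URL is kept
        }
        return origin(from) + "/"; // cross origin: path and query parameters are stripped
    }

    public static void main(String[] args) {
        String search = "http://localhost:8090/yacysearch.html?query=secret";
        System.out.println(referrer(search, "http://localhost:8090/Status.html")); // full URL
        System.out.println(referrer(search, "https://example.org/page"));          // http://localhost:8090/
    }
}
```

With the alternative rel="noreferrer" link type mentioned in the same commit, the browser sends no Referer header at all for the affected links.
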
reger
86534a56f7 fixed ReindexSolrBusyThread new and unexpected repeat of same query with
low number of found documents - by adding additional end condition to 
remove processed query with number of found docs <= process-chunk-size.

Noticed on query h4_txt:[* TO *], found 21, process 21, call of commit happened
but on next cycle same query again 21 docs found (while h4_txt was removed 
from schema and committed inputdocuments).
2017-02-27 23:00:46 +01:00
reger
0aa0dd0b5b fix delta time calculation in PerformanceSearch_p for the first entry
(INITIALIZATION displayed absolute date, set delta to 0 for the first entry)
2017-02-27 01:04:31 +01:00
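
The delta calculation fixed in commit 0aa0dd0b5b can be sketched as below, assuming the events are available as an ordered array of millisecond timestamps. The class name, method name and data layout are hypothetical and do not reflect the PerformanceSearch_p servlet itself.

```java
import java.util.Arrays;

// Hypothetical sketch of per-entry delta times; the first entry gets delta 0.
class DeltaTimes {

    /** Returns, for each timestamp, the time elapsed since the previous one (0 for the first). */
    static long[] deltas(long[] timestamps) {
        long[] result = new long[timestamps.length];
        // result[0] keeps its default value 0 instead of showing an absolute date.
        for (int i = 1; i < timestamps.length; i++) {
            result[i] = timestamps[i] - timestamps[i - 1];
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(deltas(new long[] {1000L, 1005L, 1020L}))); // [0, 5, 15]
    }
}
```
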
luccioman
13c5c09518 Fixed datacite.org heuristics base url.
The datacite Solr search http URL was returning http status 301 in order
to redirect to its https version, thus making that YaCy heuristic always
fail.
2017-02-26 11:03:15 +01:00
reger
275c0cddd1 Adjust DefaultServlet test case to recent change,
deprecate unused CONNECTION_PROP_PROTOCOL (also as it might be
misleading with getProtocol vs getScheme)
2017-02-26 02:39:52 +01:00
reger
41e2ee0eca Fix call parameter for ConnectionInfo in MonitorHandler
(expected scheme e.g. http, was protocol version).
Deprecate obsolete custom X-...-Scheme header constant.
Use existing FORMAT_ANSIC Dateformatter in HeaderFramework.
Correct htmlParserTest (remove one unintended println)
2017-02-25 23:55:17 +01:00
luccioman
9e626f6b00 Added a hint title for required fields in the Solr Schema editor 2017-02-24 11:09:42 +01:00
luccioman
ac766327d3 Switched a few more Solr fields from strictly mandatory to optional 2017-02-24 11:08:18 +01:00
reger
f254fcfc67 fix htmlParser <script> text extraction on code containing an expression
recognized as a tag, like 1<a
reported in https://github.com/yacy/yacy_search_server/issues/109

Script content is ignored by default, but the text is filtered for html
tags. Modified scraper to skip tag filtering while within a <script>
section (until a closing </script> tag is detected).
Possible side effect: a missing </script> end tag will truncate trailing
content text.
2017-02-24 01:25:32 +01:00
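
The approach described in commit f254fcfc67 can be illustrated with a deliberately small text extractor. It is not the YaCy scraper: the class and method names are hypothetical, and the tag handling is reduced to the minimum needed to show why tag filtering must be suspended inside a <script> section.

```java
// Minimal sketch: strips tags, ignores script content, and does not treat '<' inside
// a <script> section as the start of a tag.
class ScriptAwareStripper {

    /** Case-insensitive indexOf that avoids lower-casing the whole input. */
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = from; i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    static String visibleText(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            if (html.regionMatches(true, i, "<script", 0, 7)) {
                // Inside a script section: skip everything up to the closing tag.
                int end = indexOfIgnoreCase(html, "</script>", i);
                if (end < 0) {
                    break; // missing end tag: trailing content is truncated
                }
                i = end + "</script>".length();
            } else if (html.charAt(i) == '<') {
                int end = html.indexOf('>', i);
                if (end < 0) {
                    break; // unterminated tag: stop
                }
                i = end + 1;
            } else {
                out.append(html.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>a</p><script>if (1<a) { b>2; }</script><p>c</p>";
        System.out.println(visibleText(html)); // prints ac
    }
}
```

The early break on a missing </script> end tag mirrors the side effect noted in the commit message.
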
luccioman
2f191e0e1c Improved MultiProtocolURL non ASCII characters support.
Following @sinkuu's Pull Request #108, added JUnit tests, updated some JavaDoc
and also improved URL tokenization to support non ASCII characters.
2017-02-23 11:09:43 +01:00
luccioman
18e8b3a220 Merge branch 'escape' of https://github.com/sinkuu/yacy_search_server 2017-02-23 11:03:05 +01:00
luccioman
562fc14eb9 Merge pull request #110 from goofy-bz/patch-1
Fixing some typos
2017-02-23 07:52:55 +01:00
goofy-bz
72a1bc0af1 Fixing some typos
up to line #1000 only
2017-02-23 01:13:31 +01:00
reger
7419989de3 Correct dublincore title property text to lowercase in htmlresponsewriter,
remove unused (carry over) local variable
Do the same for other responsewriter.
2017-02-23 00:27:56 +01:00
Burkhard
4fdc11cae8 Update SearchEvent.java
Fix NPE on disabled local SolrIndex, occurring on search when moving to the 2nd result page.
The debug-only setting for disabling the local SolrIndex (System Admin -> Debug Settings) should probably be removed from production code in the long term.
2017-02-22 02:01:48 +01:00
luccioman
cdc7f3e431 Switched some Solr fields from mandatory to optional
These fields are enabled by default but are undoubtedly not strictly
mandatory with the current code base.

As reported by @reger24, splitting between essential mandatory and
optional fields is still to be improved to reflect the current YaCy
needs.
2017-02-21 22:59:11 +01:00
reger
7c188ad092 Add extract of queries.log in form of top search word cloud (last 7 days)
to AccessTracker_p.html (Network Access -> Local Search Log page).
It displays top 20 words of search queries.
2017-02-20 23:27:33 +01:00
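
A top-words extract like the one described in commit 7c188ad092 can be sketched as a simple frequency count. The class name, method name and input format (one query string per list element) are assumptions; the parsing of queries.log and the 7-day window are left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: counts words across search queries and returns the most frequent ones.
class TopWords {

    static List<String> topWords(List<String> queries, int limit) {
        Map<String, Long> counts = queries.stream()
                .flatMap(q -> Arrays.stream(q.toLowerCase(Locale.ROOT).split("\\s+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return counts.entrySet().stream()
                // highest count first, ties broken alphabetically for a stable result
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> queries = Arrays.asList("yacy search", "yacy peer", "search engine yacy");
        System.out.println(topWords(queries, 2)); // [yacy, search]
    }
}
```
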
luccioman
3475d8c1a9 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2017-02-20 10:48:44 +01:00
luccioman
c68a8be2d9 Refactored and enforced Solr mandatory fields for proper operation
- Added a new method to check activation of mandatory fields on
Collection Configuration commit, consistently with checks previously
performed in Switchboard startup and with mandatory fields in the
default schema.
- Reorganized default schema and CollectionConfiguration enumeration :
moved fields that are no longer mandatory into a specific section, and moved fields
enabled at startup to the mandatory section. 
- Marked mandatory fields as required and with stronger font in the
IndexSchema_p.html page
2017-02-20 10:48:07 +01:00
reger
334c70c37a correct fromDate init value on missing param in api/timeline_p servlet
revert test modification from last commit in AccessTracker.main
2017-02-20 00:14:14 +01:00
reger
cc770512d5 add hint of query syntax in AccessTracker log (qs=normal querystring,
sq=solr-querystring) to allow to filter simple text queries for processing,
remove toString for counter parameter
use more predefined constants in solrservlet
2017-02-19 05:23:17 +01:00
luccioman
e5858bc8c8 Fixed a NullPointerException case possible on Index Export
As reported by Palulukas in YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=18&t=5944&sid=dcef5b899ab4aa9b40e3a3d158c13aed#p33454)
the Index Export operation can fail, notably when the Solr index
contains one or more documents with empty (despite required)
"load_date_dt" field.

This fixes the export failure when the situation finally occurs, but
more should be done to harden verifications on minimum required fields.
2017-02-17 11:09:30 +01:00
reger
7e53860fc7 fix NPE in HTMLResponseWriter on missing document title 2017-02-16 02:36:24 +01:00
reger
5e8879beb7 Reduce self generated content for text_t (visible text index field)
to avoid repeat of tokenized url as description,
continuation of 7e09bff4a1
1409cabe8b
Add some javadoc, and not needed remove of omitted fields in postprocessing.
2017-02-16 01:43:14 +01:00
reger
6ec6ab55ba removed faroo news from default opensearch config
As @luccioman informed, it's only useable with a free api key
http://www.faroo.com/hp/api/api.html
http://blog.faroo.com/2013/06/30/faroo-introduces-an-api-key/
2017-02-15 23:26:54 +01:00
luccioman
6e89d125f2 Added robots.txt support for heuristics federated search.
As noticed by @reger24, abusive use of OpenSearch systems should be
prevented, especially if allowing to parse and reuse HTML results.
robots.txt file is now checked before requesting an external OpenSearch
system to respect the host exclusions and any crawl-delay value.
The check is also performed when trying to add a new OpenSearch URL
template through the /ConfigHeuristics_p.html admin page.
2017-02-15 15:04:40 +01:00
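
The kind of check described in commit 6e89d125f2 can be sketched with a very reduced robots.txt reader. This is not YaCy's robots.txt handling: the class and member names are hypothetical, and only the "User-agent: *" group with Disallow prefixes and Crawl-delay is considered (no Allow rules, wildcards or agent-specific groups).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified robots.txt rules for the wildcard user agent only.
class RobotsSketch {
    final List<String> disallowed = new ArrayList<>();
    Integer crawlDelaySeconds; // null when no crawl-delay is declared

    static RobotsSketch parse(String robotsTxt) {
        RobotsSketch rules = new RobotsSketch();
        boolean inWildcardGroup = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            String line = rawLine.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (inWildcardGroup && key.equals("disallow") && !value.isEmpty()) {
                rules.disallowed.add(value);
            } else if (inWildcardGroup && key.equals("crawl-delay")) {
                try {
                    rules.crawlDelaySeconds = Integer.valueOf(value);
                } catch (NumberFormatException e) {
                    // malformed value: treated as not declared
                }
            }
        }
        return rules;
    }

    /** Returns false when the path starts with one of the disallowed prefixes. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsSketch rules = parse("User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n");
        System.out.println(rules.isAllowed("/search?q=x")); // true
        System.out.println(rules.isAllowed("/private/a"));  // false
        System.out.println(rules.crawlDelaySeconds);        // 5
    }
}
```

A caller would check isAllowed for the path of the OpenSearch URL and wait for the declared delay between requests before contacting the external system.
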
sinkuu
a46b232bf1 Use java.net.URLDecoder 2017-02-14 16:55:38 +09:00
reger
7e6e14a406 adjust translation to renamed configparser_p.html 2017-02-14 02:30:26 +01:00
reger
a011a97de9 make ConfigParser a protected page, for consistent behavior of locked
menu items.
2017-02-14 02:04:42 +01:00
reger
f85aaa7c76 update opensearch conf - remove suche.sueddeutsche.de
apparently they've revoked the participation in opensearch initiative.
2017-02-14 00:31:32 +01:00
luccioman
bf16de29c1 Added support for HTML OpenSearch results.
Many OpenSearch systems do not provide results as standard RSS/Atom
feeds but only as HTML. 

This modification adds some support for custom OpenSearch HTML results
through the use of mapping files (as already done for federated Solr
search) relying on CSS-like selectors to retrieve information from HTML
content.

An example mapping file is provided to map results from the
www.npmjs.com OpenSearch URL.
2017-02-13 19:11:17 +01:00
reger
a79194a102 upd to Jetty-9.2.21.v20170120 2017-02-11 19:53:27 +01:00
luccioman
4306f4d9a3 Upgraded Apache Ant to 1.10.1 in the Docker alpine flavor image
For a more reliable Docker image build, also switched to the ant archive
repository to fetch the needed binary as other repositories only provide
the latest versions.
2017-02-10 09:40:42 +01:00
luccioman
54405577aa Replaced absolute redirection locations by relative ones when possible.
This makes integration of YaCy behind a reverse proxy subfolder easier.
2017-02-09 16:42:21 +01:00
luccioman
1857651988 Added a new Debug/Analysis advanced settings subsection.
As discussed in PR #93 with @JeremyRand and @reger24 this new advanced
settings page includes:
 - a new setting to control remote Solr responses encoding
 - some existing debug settings which could not be set through the admin
user interface
2017-02-09 11:05:06 +01:00
luccioman
526f2d6a8b Fixed NPE case occurring when local solr index is disabled in search. 2017-02-09 10:59:41 +01:00