Commit Graph

1230 Commits

Author SHA1 Message Date
luccioman
169ffdd1c7 Finer control on max links to parse in the html parser. 2017-08-22 14:11:35 +02:00
Michael Peter Christen
30d71c6359 added usage of X-Real-IP http header
to identify request IPs which came through NGINX reverse proxy
configurations
2017-08-15 07:16:01 +02:00
Michael Peter Christen
7f395ef937 added image link in search results
This should be a help to make a preview of search results.
The image is computed from the list of embedded images, it is
always the first image in that list.
In rss-type results the image is presented like
<media:content medium="image" url="https://abc.xyz/logo.png"/>
as defined in
http://www.rssboard.org/media-rss#media-content
2017-08-14 20:12:09 +02:00
reger
2a07799ad1 Correction of d03e2c98ea
Fix Conjunction.addOperator to do nothing if term is empty
prevent to result in query string with repeated logical operator
like "field:term AND AND field:term"
possibliy causing out of mem in postprocessing_doublecontent
2017-08-14 01:03:15 +02:00
reger
d03e2c98ea Fix Conjunction.addOperator to do nothing if term is empty
prevent to result in query string with repeated logical operator
like "field:term AND AND field:term"
possibliy causing out of mem in postprocessing_doublecontent
2017-08-14 00:52:03 +02:00
luccioman
651fad6da5 Added RSS parser support for maximum content bytes parsing limit 2017-07-12 00:18:12 +02:00
luccioman
452a17a8d5 Finer control on bounded input streams with custom stream implementation 2017-07-12 00:13:24 +02:00
luccioman
e82eaee4b6 Apply consistent behavior on HTTP resource size exceeding limit.
On content size known from HTTP headers, terminates connection faster
and improves error reports quality by reporting relevant message
"Content to download exceed maximum value..." rather than previously "no
response (NULL) for url...".
2017-06-30 01:13:47 +02:00
luccioman
8da3174867 Ensure lower case conversion consistency with any default locale.
Especially for Turkish speaking users using "tr" as their system default
locale : strings for technical stuff (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' doesn't becomes
'i' like in other locales such as "en", but becomes 'ı'.
2017-06-27 06:42:33 +02:00
luccioman
64cec2790d Improved character encoding detection from Content-Type header
Also updated some related JavaDocs
2017-06-22 10:50:34 +02:00
Michael Peter Christen
c94a8c76bd re-added solr synchronization hack 2017-06-09 12:50:36 +02:00
Michael Peter Christen
6fe735945d migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8
Also: now Version 1.921
2017-06-09 12:25:23 +02:00
luccioman
8399275142 Properly close file output streams even on exceptions scenarios. 2017-06-08 07:19:16 +02:00
luccioman
4e4dc6c4e5 Removed unnecessary finalize implementation.
On such private classes with limited scope but with frequent instance
creations and removals within the application lifecycle, implementing
the finalize method is particularly unwanted as it decreases the garbage
collector performance.
What's more the Object.finalize() method is now deprecated in the JDK 9
and will eventually disappear from future releases (see
https://bugs.openjdk.java.net/browse/JDK-8177970)
2017-06-06 10:30:02 +02:00
luccioman
d98c04853d Ensure proper closing of file input streams. 2017-06-02 12:14:29 +02:00
luccioman
c226ded799 Fix unescape of URLs having some '%' chars but not percent-encoded 2017-05-30 12:32:14 +02:00
luccioman
88c062639b Added some JavaDoc 2017-04-28 11:36:48 +02:00
luccioman
527d494c1a Fixed "Unchecked conversion" compilation warnings. 2017-04-24 13:27:07 +02:00
luccioman
f66438442e Extended Mediawiki dump import to remote URLs.
When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file now is directly streamed and processed, allowing import of several
GB dumps even with a low memory remote peer, and without need to
manually download the dump file first.
2017-04-14 14:32:44 +02:00
luccioman
e5c3b16748 Improved http client close time on stream processing errors. 2017-04-14 14:23:50 +02:00
luccioman
0bc868a819 Improved support for non ASCII chars in local file system URLs
Creating a MultiProtocolURL instance from a File object and then
retrieving a File with getFSFile() was inconsistent with file paths
containing space or non ASCII chars.
2017-04-12 09:23:10 +02:00
Michael Peter Christen
1d81b8f102 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-01 01:04:27 +02:00
Michael Peter Christen
69081bce00 added export to elasticsearch. The export dump can easily be imported to
elasticsearch using the command
curl -XPOST localhost:9200/collection1/yacy/_bulk --data-binary
@yacy_dump_XXX.flatjson
2017-04-01 01:04:17 +02:00
luccioman
4b649b0a11 Fixed NPE case and API URL link on Solr HTML output for webgraph core. 2017-03-30 15:41:14 +02:00
luccioman
cde237b687 Enforced access controls on some administrative actions.
- ensure use of HTTP POST method : HTTP GET should only be used for
information retrieval and not to perform server side effect operations
(see HTTP standard https://tools.ietf.org/html/rfc7231#section-4.2.1)
 - a transaction token is now required for these administrative form
submissions to ensure the request can not be included in an external
site and performed silently/by mistake by the user browser
2017-03-26 11:48:00 +02:00
luccioman
df5970df6d Extended Apache HTTP Digest Auth. for use of YaCy encoded password
When programmatically requesting the local peer with Apache http client,
authentication credentials must be passed as clear-text values. 
This extension to the apache org.apache.http.impl.auth.DigestScheme
permits use of the YaCy encoded password stored in the
adminAccountBase64MD5 configuration property.
2017-03-26 11:32:44 +02:00
reger
f05976c017 Display the local search word statistic in alphabetic order 2017-03-19 07:12:35 +01:00
reger
56d0a87a83 remove double occuance of geo:lat in rss tokens 2017-03-13 03:08:44 +01:00
reger
b4fa1141b8 implement RequestHeader getRequestURI, getRequestURL for legacy request 2017-03-12 01:54:56 +01:00
reger
cd4d891ea4 use pre-defined "Connection" header key, replace depreceated 2017-03-04 19:41:31 +01:00
reger
275c0cddd1 Adjust DefaultServlet test case to recent change,
depreciate unused CONNECTION_PROP_PROTOCOL (also as it might be 
misleading with getProtocol vs getScheme)
2017-02-26 02:39:52 +01:00
reger
41e2ee0eca Fix call parameter for ConnectionInfo in MonitorHandler
(expected scheme e.g. http, was protocol version).
Depreceate obsolete custom X-...-Scheme header constant.
Use existing FORMAT_ANSIC Dateformatter in HeaderFramework.
Correct htmlParserTest (del one not intended println)
2017-02-25 23:55:17 +01:00
luccioman
2f191e0e1c Improved MultiprocotolURL non ASCII characters support.
After @sinkuu Pull Request #108 added JUnit tests, updated some JavaDoc
and also improved URL tokenization to support non ASCII characters.
2017-02-23 11:09:43 +01:00
luccioman
18e8b3a220 Merge branch 'escape' of https://github.com/sinkuu/yacy_search_server 2017-02-23 11:03:05 +01:00
reger
7419989de3 Correct dublincore title property text to lowercase in htmlresponsewriter,
remove unused (carry over) local variable
Do the same for other responsewriter.
2017-02-23 00:27:56 +01:00
luccioman
c68a8be2d9 Refactored and enforced Solr mandatory fields for proper operation
- Added a new method to check activation of mandatory fields on
Collection Configuration commit, consistently with checks previously
performed in Switchboard startup and with mandatory fields in the
default schema.
- Reorganized default schema and CollectionConfiguration enumeration :
moved no more mandatory fields in a specific section, and moved fields
enabled at startup to the mandatory section. 
- Marked mandatory fields as required and with stronger font in the
IndexSchema_p.html page
2017-02-20 10:48:07 +01:00
reger
7e53860fc7 fix NPE in HTMLResponseWriter on missing document title 2017-02-16 02:36:24 +01:00
luccioman
6e89d125f2 Added robots.txt support for heuristics federated search.
As noticed by @reger24, abusive use of OpenSearch systems should be
prevented, especially if allowing to parse and reuse HTML results.
robots.txt file is now checked before requesting an external OpenSearch
system to respect the host exclusions and eventual crawl-delay value.
The check is also performed when trying to add a new OpenSearch URL
template through the /ConfigHeuristics_p.html admin page.
2017-02-15 15:04:40 +01:00
sinkuu
a46b232bf1 Use java.net.URLDecoder 2017-02-14 16:55:38 +09:00
luccioman
bf16de29c1 Added support for HTML OpenSearch results.
Many OpenSearch systems do not provide results as standard RSS/Atom
feeds but only as HTML. 

This modification add some support for custom OpenSearch HTML results
through the use of mapping files (as already done for federated Solr
search) relying on CSS-like selectors to retrieve information from HTML
content.

An example mapping file is provided to map results from the
www.npmjs.com OpenSearch URL.
2017-02-13 19:11:17 +01:00
luccioman
1857651988 Added a new Debug/Analysis advanced settings subsection.
As discussed in PR #93 with @JeremyRand and @reger24 this new advanced
settings page includes:
 - a new setting to control remote Solr responses encoding
 - some existing debug settings which could not be set through the admin
user interface
2017-02-09 11:05:06 +01:00
luccioman
9a5a124bf2 Distinguished solr connectors thread names for easier monitoring. 2017-02-03 09:54:29 +01:00
luccioman
0da1e6ba16 Factored code re-implementing DigestURL.hosthash() method.
This ensure consistent implementation of the url host hash generation
and easier usage finding in source code.

Also added a unit test for this function.
2017-01-16 10:18:42 +01:00
luccioman
6a4d51d8f9 Cleaned up some Javadoc warnings. 2017-01-09 16:44:47 +01:00
luccioman
86dc198698 Fixed some JavaDocs broken links. 2017-01-09 09:57:53 +01:00
reger
16beb551ea fix DC.Elements namespace in DublinCore vocabulary class
delete redundant (unused) DCElements.
2017-01-07 18:24:29 +01:00
reger
68d4dc5cc5 Complete harmonization RequestHeader getCookie with std ServletRequest
to use javax.servlet.http.Cookie parameters.
Depreciate now obsolete getHeaderCookies.
Adjust setting of MaxAge to spec if >= 0 otherwise keep default.
2017-01-02 03:04:21 +01:00
reger
7bf2bcf504 fix and prevent exception on missing required cookie name
skip cookie creation if name is empty.
2016-12-22 19:52:38 +01:00
luccioman
3ca695390c FTP crawl start URLs : applied crawl profile depth control
Applied rules :
- when the FTP URL denotes a file resource, stack it as any start URL :
eventually embedded links can be followed applying the usual depth rules
- when the FTP URL denotes a directory, list files under this directory
and stack them for crawl, and repeat the process on sub folders until
crawl depth is reached
2016-12-22 16:25:09 +01:00
luccioman
128c8ef8d4 Fixed title rendering having non ASCII chars in QuickCrawlLink_p.html. 2016-12-21 08:19:09 +01:00