Commit Graph

4075 Commits

Author SHA1 Message Date
luccioman
ce89492319 Ensure system resource release by closing document stream. 2017-06-08 07:36:11 +02:00
luccioman
8399275142 Properly close file output streams even on exceptions scenarios. 2017-06-08 07:19:16 +02:00
luccioman
4e4dc6c4e5 Removed unnecessary finalize implementation.
On such private classes with limited scope but with frequent instance
creations and removals within the application lifecycle, implementing
the finalize method is particularly unwanted as it decreases the garbage
collector performance.
What's more the Object.finalize() method is now deprecated in the JDK 9
and will eventually disappear from future releases (see
https://bugs.openjdk.java.net/browse/JDK-8177970)
2017-06-06 10:30:02 +02:00
luccioman
a04feac064 Ensure file input streams proper closing in both success and failures
Also add when possible a warning level log message on input stream
closing error instead of failing silently. This could help understanding
some IO exceptions such as "too many files open".
2017-06-03 04:00:46 +02:00
luccioman
d98c04853d Ensure proper closing of file input streams. 2017-06-02 12:14:29 +02:00
luccioman
c53c58fa85 Unsure closing ChunkIterator stream in every possible use case.
Also trace in logs the eventual close failures instead of failing
silently.
This should help prevent holding too many unreleased system file
handlers, as in the case reported by eros on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988&sid=b00e7486c1bf7e48a0d63eb328ccca02
)
2017-06-02 09:47:45 +02:00
luccioman
29e52bda39 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2017-06-02 01:47:53 +02:00
luccioman
a9cb083fa1 Improved consistency between loader openInputStream and load functions 2017-06-02 01:46:06 +02:00
reger
a814f3d885 Introduce keyword query parameter
This enables keyword navigator to filter on keywords. Added search page
output and layout config for keywords, allowing e.g. in Intranet use
to display the keywords. No styling or links applied to the keyword
text (but is desirable possibly in combination with bootstrap-tagsinput
for future/intranet).
2017-06-02 01:00:21 +02:00
luccioman
c226ded799 Fix unescape of URLs having some '%' chars but not percent-encoded 2017-05-30 12:32:14 +02:00
luccioman
306a82dd71 Fixed scraper NullPointerException cases on malformed URLs. 2017-05-30 08:48:20 +02:00
luccioman
aa55d71cf5 Fixed a NullPointerException case on Digest authentication.
Could occur when upgrading from a Debian package configured with Basic
authentication (as in release 1.92.9000) to a more recent one with
Digest authentication, without having re-encoded the admin password (for
example with dpkg-reconfigure).

As reported by eros on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988#p33686).
2017-05-29 19:16:09 +02:00
luccioman
02ec0ed13c Quoted param value in Solr query to avoid unwanted traces in logs
When Webgraph Solr core is enabled, crawling and removing from index an
URL whose hash starts with the '-' character (example URL :
https://cs.wikipedia.org/ whose hash is "-2-HuTEndn4x") produced a full
ParseException stack trace in YaCy logs. This was not blocking because
the Solr query parser is able to escape itself the query and run it
successfully, but filled uselessly YaCy logs.
2017-05-24 08:43:03 +02:00
reger
1737af37cf Set request originator to own peer in warc importer
in addition to change in 039162fbf0
2017-05-22 01:56:11 +02:00
reger
039162fbf0 Change warc importer to use defaultsurrogate-crawl profile, as reported
by LA_FORGE http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5990 and
analysed by @luccioman (see comment 510f11d374)
it creates conflict using a other crawlprofile without setting originator.
2017-05-22 01:34:08 +02:00
Michael Peter Christen
3b1d640a3c enhanced debugging 2017-05-18 00:28:12 +02:00
Michael Peter Christen
7de7879f13 added a cache to prevent too many seed enumerations 2017-05-18 00:28:00 +02:00
luccioman
bd7411a53a Enable p2p and cluster communication when "Protection of all pages" on
As reported by paul89 on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5958 ), when setting
the "Protection of all pages" to "On" in the "ConfigAccounts_p.html"
page, the peer became completely unreachable by others, which is not the
purpose of this feature.
But the restriction still makes sense as a security enforcement and is
maintained in private "Robinson mode" where by the way any peer-to-peer
or cluster communication would be rejected.
2017-05-17 09:00:29 +02:00
luccioman
31ad043bb9 Added user interface feedback on results feeding termination status.
Added as an additional icon with title in the search progress bar, to
inform about background search feeder threads terminated or still
running. While giving a bit more information to users about the p2p
search process, this can help choosing whether or not wait a little bit
more time before going to the next page, in order to get results from
various sources sorted as best as possible (see #91 for a discussion
about sorting accuracy and network latency).

Other related modifications included :
 - regular updates to statistics in the progress bar until the
background feeders are completely terminated.
 - removed some uses of unsecure and discouraged JavaScript elements
2017-05-15 13:15:16 +02:00
sgaebel
ff6392215e added closing of lst-Tag in solr-Export 2017-05-13 20:38:25 +02:00
luccioman
d90b001e1b Improved previous merge "Show ranking in HTML UI".
- added the new setting as configurable in the "Debug/Analysis" settings
page. Debug/analysis is its main purpose for now as there is currently
no nice and "understansable" ranking score info servlet (see forum
discussion http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5884 ) 
- render in the "Search Page Layout" page preview when enabled
- added constants
2017-05-11 18:02:33 +02:00
luccioman
0f0f42b509 Added some JavaDoc 2017-05-11 08:33:19 +02:00
reger
077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
collection
2017-05-09 22:52:54 +02:00
luccioman
654801523e Fixed StringIndexOutOfBoundsException case.
Revealed by commit c77e43a : the exception was then thrown when indexing
pages containing mailto: scheme URL links with the Solr Webgraph core
enabled.
Fixed the error case and restored filtering on mailto links in
Document.resortLinks() as these URLs still should not appear in
Document.hyperlinks.
2017-05-09 18:32:47 +02:00
luccioman
522a268305 Improved new blacklist entries URL scheme detection. 2017-05-04 16:36:45 +02:00
luccioman
532981b363 Updated putHTML() JavaDoc 2017-05-04 11:21:27 +02:00
luccioman
58d23047dd Handle '?' and '+' chars as valid wild cards when adding to blacklist.
An entry such as "domain.com/[a-z]+" is a valid regular expression and
do not need additional ".*.*/.*" wildcards.
2017-05-04 11:19:59 +02:00
luccioman
a87281b498 Added MediaWiki dump import scheduling feature.
Checking the last modified date by default to prevent unnecessary long
running operations.
2017-05-03 18:53:01 +02:00
luccioman
edd7ccac40 Added some JavaDoc 2017-05-02 09:33:11 +02:00
luccioman
79fdf14b0a Fixed regression introduced by commit 9ad4d16
On MediaWiki dump imports, the SurrogateReader was trying to unread too
many bytes, then failing with the following exception :
"java.io.IOException: Push back buffer is full".
2017-05-02 09:32:04 +02:00
Michael Peter Christen
7678fd67e3 copied fix from yacy_grid_parser for wrong array type 2017-05-01 11:44:26 +02:00
Michael Peter Christen
200b100fb8 added patch to rewrite altered yacy grid schema into yacy schema
This generates the stub and protocol parts of an url for inboundlinks,
outboundlinks and images
2017-05-01 11:38:02 +02:00
reger
9ad4d16829 Add a responsHeader to the solr index export with a format identifier
and export parameter (in accordance with response xml format) for easier
format detection on import.
2017-04-30 23:53:52 +02:00
luccioman
9697209ef6 Fixed Index Export feature for compatibility with old indexed documents.
This is a fix for mantis 682 (http://mantis.tokeek.de/view.php?id=682)
and issue #116
2017-04-28 11:39:51 +02:00
luccioman
88c062639b Added some JavaDoc 2017-04-28 11:36:48 +02:00
luccioman
31fff2c986 Extended WikiCode template inclusion syntax support.
Wiki templates are not rendered but syntax support is improved, which
greatly enhance snippets rendering on search results coming from a
MediaWiki dump import.
Tested on various dumps from Wikimedia at
https://dumps.wikimedia.org/backup-index.html
See also Wikipedia transclusion documentation at
https://en.wikipedia.org/wiki/Wikipedia:Transclusion
2017-04-27 09:50:04 +02:00
Michael Peter Christen
973d74712f added yacy grid flatjson surrogate parser 2017-04-25 08:44:02 +02:00
luccioman
b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
This page was always empty, as described in mantis 740
(http://mantis.tokeek.de/view.php?id=740)
2017-04-24 18:24:26 +02:00
luccioman
527d494c1a Fixed "Unchecked conversion" compilation warnings. 2017-04-24 13:27:07 +02:00
reger
c77e43a391 Take out mailto collect in internal parsed document
As earlier plans to make use of mailto as separate webgraph entity didn't
materialize (see  http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5726&p=32493&hilit=mailto#p32493)
free the unused handling and resources.
2017-04-20 00:18:18 +02:00
Michael Peter Christen
335868edba Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-17 12:26:27 +02:00
reger
bec34d3546 Add url input field as source for WarcImporter
allowing to import warc from url without prior download.
2017-04-16 04:25:29 +02:00
luccioman
f66438442e Extended Mediawiki dump import to remote URLs.
When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file now is directly streamed and processed, allowing import of several
GB dumps even with a low memory remote peer, and without need to
manually download the dump file first.
2017-04-14 14:32:44 +02:00
luccioman
e5c3b16748 Improved http client close time on stream processing errors. 2017-04-14 14:23:50 +02:00
luccioman
23775e76e2 Fixed endless loop case in wikicode processing.
Detected when importing recent MediaWiki dumps containing some pages
with script content in plain text format (see Scribunto extension
https://www.mediawiki.org/wiki/Extension:Scribunto ).

Further improvement : modify the MediawikiImporter to prevent processing
revisions whose <model> is not wikitext.
2017-04-12 17:17:03 +02:00
luccioman
0bc868a819 Improved support for non ASCII chars in local file system URLs
Creating a MultiProtocolURL instance from a File object and then
retrieving a File with getFSFile() was inconsistent with file paths
containing space or non ASCII chars.
2017-04-12 09:23:10 +02:00
reger
7b80189bda Activate hosts navigator plugin. This includes rwi results in the navigator
count.
This might be tangential related to http://mantis.tokeek.de/view.php?id=736
as the example includes a local index search, while rwi results are not
counted.
2017-04-10 22:42:06 +02:00
Michael Peter Christen
f5ad29edb1 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-07 09:15:15 +02:00
Michael Peter Christen
76e9135526 added flatjson parser (stub, unfinished) 2017-04-07 09:15:05 +02:00
reger
b7417ac329 Introduce a Keyword search navigator using the index field keywords.
The keywords field string is split into words as navigator entries.

A keyword navigator facet is essential for search appliance usage were
documents and metadata use often specialized keyword vocabularies to 
filter search results. This navi can be used without custom index schema.

As we don't have defined a search query command to filter "keywords" yet,
the filtering is limited by adding the keyword to the search query.
2017-04-05 00:08:25 +02:00
luccioman
09e72eb0a4 Set Config Portal as a private administration page.
Consistently with its required action from submission credentials, and
because external unauthenticated users do not need to access these
settings.
2017-04-03 11:34:49 +02:00
reger
ba339a2a45 Add servlet to import warc file from filesystem IndexImportWarc_p.html.
Apply Importer interface to WarcImporter
2017-04-02 03:32:21 +02:00
Michael Peter Christen
1d81b8f102 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2017-04-01 01:04:27 +02:00
Michael Peter Christen
69081bce00 added export to elasticsearch. The export dump can easily be imported to
elasticsearch using the command
curl -XPOST localhost:9200/collection1/yacy/_bulk --data-binary
@yacy_dump_XXX.flatjson
2017-04-01 01:04:17 +02:00
reger
510f11d374 Implement surrogate import from Warc archives (as first option handle
warc = Web ARChive File Format.
Warc files with extension .warc or compressed warc.gz can be placed in the
DATA/surrogate/in and contained responses are imported to the index.
The used library is stream based so we can easily extend it later to use
and load warc's from the net.
2017-03-31 00:58:11 +02:00
luccioman
4b649b0a11 Fixed NPE case and API URL link on Solr HTML output for webgraph core. 2017-03-30 15:41:14 +02:00
luccioman
af28a07780 Updated API calls recording/replay with recent changes.
- enabled HTTP POST calls with Digest HTTP authentication
 - made API calls compatible with API newly restricted to HTTP POST only
with transaction token validation
 - ensured backward compatibility with older entries recorded as HTTP
GET
2017-03-30 09:22:28 +02:00
reger
81670c3484 One more use of SwitchboardConstants.SERVER_PORT constant,
apply standard servlet design pattern initialization of solrselectservlet
2017-03-26 20:05:48 +02:00
luccioman
cde237b687 Enforced access controls on some administrative actions.
- ensure use of HTTP POST method : HTTP GET should only be used for
information retrieval and not to perform server side effect operations
(see HTTP standard https://tools.ietf.org/html/rfc7231#section-4.2.1)
 - a transaction token is now required for these administrative form
submissions to ensure the request can not be included in an external
site and performed silently/by mistake by the user browser
2017-03-26 11:48:00 +02:00
luccioman
df5970df6d Extended Apache HTTP Digest Auth. for use of YaCy encoded password
When programmatically requesting the local peer with Apache http client,
authentication credentials must be passed as clear-text values. 
This extension to the apache org.apache.http.impl.auth.DigestScheme
permits use of the YaCy encoded password stored in the
adminAccountBase64MD5 configuration property.
2017-03-26 11:32:44 +02:00
reger
f05976c017 Display the local search word statistic in alphabetic order 2017-03-19 07:12:35 +01:00
reger
3dd23c178b Introduce the option to configure a shutdown port.
A port value of -1 will disable this option.

If set to a value greater 0, YaCy listens on this of on the local loopback 
address (127.0.0.1) for a shutdown or restart signal.
E.g. connect to http://localhost:8005/shutdown will stop the YaCy server.
http://localhost:8005/restart will restart it.
This option allows to stop YaCy locally independant from the web web 
frontend (which might be configured for password protected remote access).
2017-03-19 02:30:08 +01:00
reger
a2afb4bae0 add switchboardconstants for server ports config keys 2017-03-18 20:02:26 +01:00
reger
56d0a87a83 remove double occuance of geo:lat in rss tokens 2017-03-13 03:08:44 +01:00
reger
b4fa1141b8 implement RequestHeader getRequestURI, getRequestURL for legacy request 2017-03-12 01:54:56 +01:00
reger
209a7374bd remove unused import pdfParser 2017-03-09 22:57:51 +01:00
reger
de1c1c16db Improve pdf text extraction resource handling.
For sort pdf <= 3 pages use already extracted content,
only for long pdf > 3 pages reassign content and close internal writer (to direct free buffers)
2017-03-09 22:56:33 +01:00
reger
9b6d1abd9e eliminate some compiler unchecked and deprecation warnings
in nav plugins by explicite type declaration and replacing date.getYear
with Calendar.get
2017-03-09 01:42:36 +01:00
reger
18c7563dbe Extend DCEntry.getLanguage convert to ISO639-1 codes for more languages
by using icu.ULocale for languages not already covered (ICU normalizes 
to ISO639-1 2 char codes).
Add test class
Use DublinCore vocabulary declarations in DCEntry and SurrogateReader 
for easier usage debugging, 
Init SurrogateReader.inputSource on first use.
2017-03-05 02:26:10 +01:00
reger
ce87025462 further avoid to set connect info properties as header value
following comment "use of properties as header values is discouraged"
in case where (proxy)HTTPClient overwrites values with supplied url.
Use defined request.referer procedure in response class.
2017-03-04 22:45:17 +01:00
reger
cd4d891ea4 use pre-defined "Connection" header key, replace depreceated 2017-03-04 19:41:31 +01:00
luccioman
0173b0bc32 Added an advanced settings page for referrer policy settings.
Feedback will be welcome, notably on the descriptive content of this
page.
2017-03-03 12:05:30 +01:00
reger
81963a89fe fix proxyservlet response url to respect http scheme if a relative
Location header is returned.
2017-03-03 00:21:56 +01:00
luccioman
cdcd923375 Privacy enhancement : added settings to control referrer policy.
HTTP "Referer" header sent by the browser when using YaCy can now be
controlled either with the referrer meta tag as a global policy, or only
for search result links by adding the attribute rel="noreferrer".

To improve privacy with the less possible regressions, the default is
set as meta tag with value "origin-when-cross-origin" : internal YaCy
links behavior is not affected, but when visiting external websites
referrer url is not empty but stripped from query parameters and path.

Older browsers, Safari, MS IE and Edge do not support the referrer meta
tag, so the standard but less flexible noreferrer link type can also be
enabled as an alternative.

User-friendly settings page to be implemented.
2017-02-28 18:11:54 +01:00
reger
86534a56f7 fixed ReindexSolrBusyThread new and unexpected repeat of same query with
low number of found documents - by adding additional end condition to 
remove processed query with number of found docs <= process-chunck-size.

Noticed on query h4_txt:[* TO *], found 21, process 21, call of commit happend
but on next cycle same query again 21 docs found (while h4_txt was removed 
from schema and committed inputdocuments).
2017-02-27 23:00:46 +01:00
reger
275c0cddd1 Adjust DefaultServlet test case to recent change,
depreciate unused CONNECTION_PROP_PROTOCOL (also as it might be 
misleading with getProtocol vs getScheme)
2017-02-26 02:39:52 +01:00
reger
41e2ee0eca Fix call parameter for ConnectionInfo in MonitorHandler
(expected scheme e.g. http, was protocol version).
Depreceate obsolete custom X-...-Scheme header constant.
Use existing FORMAT_ANSIC Dateformatter in HeaderFramework.
Correct htmlParserTest (del one not intended println)
2017-02-25 23:55:17 +01:00
luccioman
ac766327d3 Switched a few more Solr fields from strictly mandatory to optional 2017-02-24 11:08:18 +01:00
reger
f254fcfc67 fix htmlParser <script> text extraction on code containing expression
recognized as tag like 1<a
reported in https://github.com/yacy/yacy_search_server/issues/109

Script content is ignored by default, but the text is filtered for html
tags. Modified scraper to skip tag filtering while within a <script> 
section (until a closing tag is detected </script>. 
Possible side effect, missing </script> end-tag will truncate trailing 
content text.
2017-02-24 01:25:32 +01:00
luccioman
2f191e0e1c Improved MultiprocotolURL non ASCII characters support.
After @sinkuu Pull Request #108 added JUnit tests, updated some JavaDoc
and also improved URL tokenization to support non ASCII characters.
2017-02-23 11:09:43 +01:00
luccioman
18e8b3a220 Merge branch 'escape' of https://github.com/sinkuu/yacy_search_server 2017-02-23 11:03:05 +01:00
reger
7419989de3 Correct dublincore title property text to lowercase in htmlresponsewriter,
remove unused (carry over) local variable
Do the same for other responsewriter.
2017-02-23 00:27:56 +01:00
Burkhard
4fdc11cae8 Update SearchEvent.java
Fix NPE on disabled local SolrIndex, occuring on search moving to the 2nd result page.
The debug purpose only setting to disabeling local SolrIndex (System Admin -> Debug Settings) should long term probably be removed from production code.
2017-02-22 02:01:48 +01:00
luccioman
cdc7f3e431 Switched some Solr fields from mandatory to optional
These fields are default enabled but with no doubt not strictly
mandatory with the current code base.

As reported by @reger24, splitting between essential mandatory and
optional fields is still to be improved to reflect the current YaCy
needs.
2017-02-21 22:59:11 +01:00
luccioman
3475d8c1a9 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2017-02-20 10:48:44 +01:00
luccioman
c68a8be2d9 Refactored and enforced Solr mandatory fields for proper operation
- Added a new method to check activation of mandatory fields on
Collection Configuration commit, consistently with checks previously
performed in Switchboard startup and with mandatory fields in the
default schema.
- Reorganized default schema and CollectionConfiguration enumeration :
moved no more mandatory fields in a specific section, and moved fields
enabled at startup to the mandatory section. 
- Marked mandatory fields as required and with stronger font in the
IndexSchema_p.html page
2017-02-20 10:48:07 +01:00
reger
334c70c37a correct fromDate init value on missing param in api/timeline_p servlet
revert test modification from last commit in AccessTracker.main
2017-02-20 00:14:14 +01:00
reger
cc770512d5 add hint of query syntax in AccessTracker log (qs=normal querystring,
sq=solr-querystring) to allow to filter simple text queries for processing,
remove toString for counter parameter
use more predefined constants in solrservlet
2017-02-19 05:23:17 +01:00
luccioman
e5858bc8c8 Fixed a NullPointerException case possible on Index Export
As reported by Palulukas in YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=18&t=5944&sid=dcef5b899ab4aa9b40e3a3d158c13aed#p33454)
the Index Export operation can fails, notably when the Solr index
contains one or more documents with empty (despite required)
"load_date_dt" field.

This fixes the export failure when the situation finally occurs, but
more should be done to harden verifications on minimum required fields.
2017-02-17 11:09:30 +01:00
reger
7e53860fc7 fix NPE in HTMLResponseWriter on missing document title 2017-02-16 02:36:24 +01:00
reger
5e8879beb7 Reduce self generated content for text_t (visible text index field)
to avoid repeat of tokenized url as description,
continuation of 7e09bff4a1
1409cabe8b
Add some javadoc, and not needed remove of omitted fields in postprocessing.
2017-02-16 01:43:14 +01:00
luccioman
6e89d125f2 Added robots.txt support for heuristics federated search.
As noticed by @reger24, abusive use of OpenSearch systems should be
prevented, especially if allowing to parse and reuse HTML results.
robots.txt file is now checked before requesting an external OpenSearch
system to respect the host exclusions and eventual crawl-delay value.
The check is also performed when trying to add a new OpenSearch URL
template through the /ConfigHeuristics_p.html admin page.
2017-02-15 15:04:40 +01:00
sinkuu
a46b232bf1 Use java.net.URLDecoder 2017-02-14 16:55:38 +09:00
luccioman
bf16de29c1 Added support for HTML OpenSearch results.
Many OpenSearch systems do not provide results as standard RSS/Atom
feeds but only as HTML. 

This modification add some support for custom OpenSearch HTML results
through the use of mapping files (as already done for federated Solr
search) relying on CSS-like selectors to retrieve information from HTML
content.

An example mapping file is provided to map results from the
www.npmjs.com OpenSearch URL.
2017-02-13 19:11:17 +01:00
luccioman
54405577aa Replaced absolute redirection locations by relative ones when possible.
This makes integration of YaCy behind a reverse proxy subfolder easier.
2017-02-09 16:42:21 +01:00
luccioman
1857651988 Added a new Debug/Analysis advanced settings subsection.
As discussed in PR #93 with @JeremyRand and @reger24 this new advanced
settings page includes:
 - a new setting to control remote Solr responses encoding
 - some existing debug settings which could not be set through the admin
user interface
2017-02-09 11:05:06 +01:00
luccioman
526f2d6a8b Fixed NPE case occurring when local solr index is disabled in search. 2017-02-09 10:59:41 +01:00
luccioman
def55ec166 Improved termination of timed out remote solr requests to peers.
On timeout, closing remote Solr requests is proper than simply using
Thread.interrupt() that is not effective in most cases. Closing does not
ask commit on remote solr, but release http connections resources and is
more likely to end those threads that can else wait indefinitely.

Other related improvements included :
 - no more marking remote peer as not available when remote search is
interrupted before timeout by the cleanup job.
 - added a short fine log level trace of failing remote solr requests
2017-02-06 12:41:24 +01:00
luccioman
08de58b6d3 Named a Thread without name for easier monitoring 2017-02-03 09:55:08 +01:00
luccioman
9a5a124bf2 Distinguished solr connectors thread names for easier monitoring. 2017-02-03 09:54:29 +01:00
reger
1f497ccad5 Add consistency check for related index fields upon load and save of
index schema.
To assemble the original link url for out-/inboundlinks, icons and pictures
the *_protocol_sxt and *_urlstub_sxt is needed (due to the used data-reduced
storage methode). Auto-enable *_protocol_sxt if *_urlstub_sxt is enabled.
to be able to correctly assemble the original link url.
2017-01-28 00:36:03 +01:00
luccioman
68afe900d0 Added user-friendly controls over disk usage configuration settings.
As mentioned in issue #103, control settings over YaCy disk usage
already existed but lacked a user-friendly way to set them.

I added it to the Performance_p.html administration page with a little
refactoring on the "Resource Observer" fieldset for improved
accessibility and HTML standards respect.
Also added the possibility to enable/disable the autoregulation fonction
from this page.
2017-01-27 15:47:15 +01:00
reger
95d2a28599 adjust the Field-Reindex Thread to verify and update the document id
in case hash (ID) doesn't match document url (sku field).
2017-01-26 23:49:15 +01:00
luccioman
fc01b69eca Fixed local image search pagination regression.
As reported by @tglman on issue #90, when searching images on the local
index only, pages next to the first were always empty. This was a
regression from commit c25e48e969.
2017-01-25 09:54:39 +01:00
Michael Peter Christen
02d0b3172c Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2017-01-24 15:56:37 +01:00
Michael Peter Christen
d4f45cf05e added dc.date.modified and dc.date.created to date parser 2017-01-24 15:56:29 +01:00
reger
f9180fabc4 assure that RWI Index.Segment IODispatcher is not blocking on shudown
waiting on a semaphore permit.
see desc. http://mantis.tokeek.de/view.php?id=723
2017-01-24 01:51:28 +01:00
reger
e61ee180a7 Group all proxy settings on System Administration by adding settings of
UrlProxyAccss page (moved from deleted AugmentedBrowsing_p), adjust
submenu (remove Augmented Browsing) and translation files.
2017-01-22 23:58:46 +01:00
luccioman
39e081ef38 Fixed display of crawler pending URLs counts in HostBrowser.html page.
As described in mantis 722 (http://mantis.tokeek.de/view.php?id=722)

Also updated some Javadoc.
2017-01-22 12:31:14 +01:00
reger
df80c57842 add ukr and pol to DCEntry.getLanguage ISO639-2 3-char language code
conversion to deliver uk, pl 2-char code
and use if else to return on match
2017-01-22 00:01:18 +01:00
luccioman
e048e74072 Added an optional parameter to webstructure.xml api.
This new "documentStructure" parameter can be set to false to only get
hosts accumulated references on a resource and thus prevent scraping the
specified URL and getting citations references.

Also set WebStructureGraph constants as final and updated the Javadoc
with example api call URLs.
2017-01-19 12:30:44 +01:00
reger
581b00cc20 remove obsolete lastmodified calculation in WebgraphConfig 2017-01-17 23:45:56 +01:00
luccioman
5c8958bcea Updated Javadoc and Junit tests for the WebStructureGraph class. 2017-01-17 17:01:56 +01:00
luccioman
d9766ca981 Fixed WatchWebStructure_p.html render to include https URLs.
As described in mantis 721 (http://mantis.tokeek.de/view.php?id=721)
WatchWebStructure_p.html failed to include in its structure view https
and other protocols and ports than default http.
2017-01-16 18:41:58 +01:00
luccioman
ed3dd5e31a Fixed webstructure.xml API used with a domain name 'about' parameter.
As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720),
when requesting this API with a domain name instead of a complete URL
only HTTP references on default port were listed.
2017-01-16 16:41:06 +01:00
luccioman
0da1e6ba16 Factored code re-implementing DigestURL.hosthash() method.
This ensure consistent implementation of the url host hash generation
and easier usage finding in source code.

Also added a unit test for this function.
2017-01-16 10:18:42 +01:00
luccioman
86adfef30f Added automated unit tests and perfs test for WebStructureGraph class.
Fixed references count when multiple links target the same domain name
in one document.
2017-01-13 16:10:59 +01:00
luccioman
9cea7cbb10 Detailed some Javadoc related to /api/webstructure.xml usage. 2017-01-12 17:52:47 +01:00
luccioman
6a4d51d8f9 Cleaned up some Javadoc warnings. 2017-01-09 16:44:47 +01:00
luccioman
86dc198698 Fixed some JavaDocs broken links. 2017-01-09 09:57:53 +01:00
reger
16beb551ea fix DC.Elements namespace in DublinCore vocabulary class
delete redundant (unused) DCElements.
2017-01-07 18:24:29 +01:00
luccioman
339f005ced Blacklist import and update performance improvements.
Measurement sample : import from blacklist local file containing about
15000 entries
 - before refactoring : several minutes
 - after refactoring : a few seconds!
2017-01-06 12:24:31 +01:00
luccioman
e3892b0957 Added some JavaDoc. 2017-01-06 11:23:40 +01:00
reger
4c9be29a55 fix concurrency issue with htmlParser using not current scraper data
resulting in incorrect data for some html index metadata.
Details see http://mantis.tokeek.de/view.php?id=717
2017-01-06 03:01:52 +01:00
reger
eedee6eabb fix exception on URIMetadataNote instantiation with corrected id hash on
host_id_s. Use Solr setField instead of addField to prevent
java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
	at net.yacy.kelondro.data.meta.URIMetadataNode.hosthash(URIMetadataNode.java:247)
	at net.yacy.search.query.SearchEvent.addNodes(SearchEvent.java:966)
	at net.yacy.peers.Protocol.solrQuery(Protocol.java:1242)
	at net.yacy.peers.RemoteSearch$2.run(RemoteSearch.java:349)
2017-01-05 00:24:37 +01:00
luccioman
c1401d821e Adjusted crawl depth control for FTP crawl start URLs. 2017-01-02 10:24:17 +01:00
reger
68d4dc5cc5 Complete harmonization RequestHeader getCookie with std ServletRequest
to use javax.servlet.http.Cookie parameters.
Depreciate now obsolete getHeaderCookies.
Adjust setting of MaxAge to spec if >= 0 otherwise keep default.
2017-01-02 03:04:21 +01:00
reger
a1e5f7dbca fix of fulltext.remove() by id of webgraph document
webgraph has document hash in source_id_s
2017-01-01 23:53:44 +01:00
luccioman
1df558a6c6 Fixed YaCy proper shutdown triggered by SIGTERM signal.
The main shutdown hook thread was not properly waiting for the main
thread termination which consequently could not properly close resources
and threads. After terminating a running YaCy peer this way (Ctrl+C in
console, or kill <pid> for example), you could see the still existing
DATA/yacy.running file.

Tested with :
 - Debian Jessie openjdk 7 and 8 : regular shutdown, Ctrl+C, kill
command, system restart while yacy is running
 - Windows 10 Oracle JDK 7 and 8 : non regression on regular shutdown
2016-12-28 09:47:27 +01:00
reger
b522d540b9 Include itemprop latitude/longitude (see schema.org) in attribute
parsing for lat/lon.
Harmonize number parsing for lat/lon to parseDouble.
Fix endDate_dts value assignment.
2016-12-25 23:39:55 +01:00
reger
083df255e4 fix html tag attribute parsing containing attribute w/o value
e.g. itemscope or autofocus (in such case the next key was not properly
recognized).
2016-12-24 06:57:11 +01:00
reger
cb95b7339a include html5 <time> tag in content scraper,
add "datetime" property of <time> tag to scrapers startdate list.
Datetime is parsed as iso8601 (xml) date, html5 allows partial as well
as duration (not handled by this)
2016-12-24 03:11:35 +01:00
reger
7bf2bcf504 fix and prevent exception on missing required cookie name
skip cookie creation if name is empty.
2016-12-22 19:52:38 +01:00
luccioman
3ca695390c FTP crawl start URLs : applied crawl profile depth control
Applied rules :
- when the FTP URL denotes a file resource, stack it as any start URL :
eventually embedded links can be followed applying the usual depth rules
- when the FTP URL denotes a directory, list files under this directory
and stack them for crawl, and repeat the process on sub folders until
crawl depth is reached
2016-12-22 16:25:09 +01:00
luccioman
128c8ef8d4 Fixed title rendering having non ASCII chars in QuickCrawlLink_p.html. 2016-12-21 08:19:09 +01:00
reger
8eb6fba59c activate filetype navigator plugin and restrict config (append) of navs
to not already actives.
Dht results are now included in count this might over shoot on redundant
dht and solr, while the previous solr facet based was always low.
2016-12-21 02:04:13 +01:00
luccioman
c25e48e969 Enabled displaying results after 14th page for local search queries.
Fixes issue #90 for local queries only: Stealth mode, Portal mode or
Intranet mode. 
For P2p mode, the issue would probably be difficult to solve with
reasonable performance. This is still to dig.

Also switched some InterreputedException catch log messages to warn
level as this is normal behavior when shutting down a peer.

Fixed yacysearch buttons navbar behavior to deal correctly with total
results count or offset over 1000. Also improved the buttons navbar to
be able to navigate over 10th page for local queries.
2016-12-20 14:52:33 +01:00
luccioman
a3886c6adb Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-12-20 13:41:13 +01:00
luccioman
feaa87005e Improved indentation for easier debugging steps. 2016-12-20 08:27:17 +01:00
reger
bab4804d11 add FileTypeNavigator plugin 2016-12-19 23:56:03 +01:00
reger
d35c47090c remove obsolete put of HttpServletRequest attributes to YaCy servlet
parameters on SSI (server side includes).
Query parameters are already merged by dispatcher.include, making copy
of parameter (RequestDispatcher.INCLUDE_QUERY_STRING) obsolete.
All other parameter are not used as YaCy servlet arguments.
2016-12-19 02:30:55 +01:00
reger
0959038624 correct DefaultServlet resource pathinContext calculation
exclude servletPath option as resources are always relative to htroot 
or htdocs, the change reflects this.
Theoretically it and the recent adjustments arcording relative urls 
allows to configure the instance to be configurable in a path other as 
root (/)
2016-12-18 21:11:00 +01:00
reger
c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
is acceptable (less for garbage collection).
2016-12-18 02:38:43 +01:00
reger
87f6631a2a adjust Cache getHeader to prev. changes/commit 2016-12-18 01:02:56 +01:00
reger
6be7339b1d remove the overhead of unused reverseMappingCache of HeaderFramewor / RequestHeader 2016-12-18 00:57:47 +01:00
reger
c702eb6786 del dead menu link to /repository
(directory not created in current distribution -> old)
2016-12-17 02:38:52 +01:00
reger
baa5d9b9e3 adjust DomainHandler working on resolved .yacy domain
(remove obsolete check for path on hostname)
2016-12-17 01:33:00 +01:00
luccioman
1ba705c23d Use loaderDispatcher instead of HTTPClient to download releases.
The default redirection strategy when using directly HTTPClient is
incorrect when redirection is cross host (the original Host header is
still sent when requesting the redirected location).

YaCy LoaderDispatcher handles redirections properly, thus release
archive files using redirected URLs (such as the URLs on a GitHub
Release page) are successfully downloaded.
2016-12-16 20:38:54 +01:00
luccioman
467650c042 Hardened system update checks.
When a downloaded archive release is corrupted, empty, or can not be
opened for any reason, the update script must not be launched because it
erases the existing lib/*.jar libraries.
2016-12-16 11:03:09 +01:00
luccioman
b5711b8fe1 Added some Javadocs. 2016-12-16 10:43:00 +01:00