13369 Commits

Author SHA1 Message Date
reger
588c6e96fb upd version for typeahead.jquery.js in jslicense.html 2017-07-16 23:35:56 +02:00
luccioman
5a646540cc Support parsing gzip files from servers with redundant headers.
Some web servers provide both 'Content-Encoding : "gzip"' and
'Content-Type : "application/x-gzip"' HTTP headers on their ".gz" files.
Failing on such resources was annoying: they are not so uncommon,
although non-conforming (see RFC 7231 section 3.1.2.2 for the
"Content-Encoding" header specification:
https://tools.ietf.org/html/rfc7231#section-3.1.2.2).
2017-07-16 14:46:46 +02:00
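A minimal sketch of one way to tolerate such servers, assuming the HTTP client may already have decoded the body once because of the "Content-Encoding" header: peek at the first two bytes and only decompress when the gzip magic number is still present. The class and method names (GzipSniffer, maybeGunzip) are hypothetical and are not taken from the YaCy code base.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not YaCy's actual implementation.
final class GzipSniffer {

    /**
     * Returns a decompressing stream when the source starts with the gzip
     * magic bytes (0x1f 0x8b), otherwise returns the source content unchanged.
     */
    static InputStream maybeGunzip(InputStream source) throws IOException {
        BufferedInputStream in = new BufferedInputStream(source);
        in.mark(2);              // remember the position before peeking
        int first = in.read();
        int second = in.read();
        in.reset();              // rewind so no byte is lost
        boolean looksGzipped = first == 0x1f && second == 0x8b;
        return looksGzipped ? new GZIPInputStream(in) : in;
    }
}
```

With this approach a ".gz" resource that was already decoded by the transport layer is passed through as is instead of failing in the gzip parser.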
luccioman
11a7f923d4 Distinguish response parsing failures from unexpected exceptions. 2017-07-16 14:39:53 +02:00
luccioman
8100c033a2 URL Viewer : apply crawler size limits when adding to local index.
This allows parsing and previewing large files, while preventing unwanted
OutOfMemory errors which are likely to occur when adding to the Solr
index resources larger than the configured crawler limits.
2017-07-16 14:37:06 +02:00
luccioman
eda7b0aeb6 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2017-07-15 08:49:25 +02:00
reger
3005be7349 Clean up unmaintained and unused AugmentParser trail. 2017-07-15 00:19:23 +02:00
reger
e5cff062b5 Clean up redundant but obsolete jquery.rdfquery-core-1.0.js script lib 2017-07-14 23:41:39 +02:00
luccioman
cb4f1358e1 Added gzip parser support for max content bytes limit 2017-07-13 08:18:40 +02:00
luccioman
5216c681a9 Added HTML parser support for maximum content bytes parsing limit 2017-07-13 08:12:10 +02:00
luccioman
4aafebc014 Merge pull request #122 from Scarfmonster/patch-1
I also reproduced the issue, and the fix is working fine.

Thanks @Scarfmonster
2017-07-12 16:03:23 +02:00
luccioman
651fad6da5 Added RSS parser support for maximum content bytes parsing limit 2017-07-12 00:18:12 +02:00
luccioman
452a17a8d5 Finer control on bounded input streams with custom stream implementation 2017-07-12 00:13:24 +02:00
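As an illustration of the idea behind a bounded input stream, here is a minimal sketch of a wrapper that stops delivering bytes once a maximum count has been read. The class name BoundedStream and its constructor are assumptions made for this example; the actual YaCy stream implementation may differ.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: reports end of stream after maxBytes bytes.
final class BoundedStream extends FilterInputStream {

    private long remaining;

    BoundedStream(InputStream in, long maxBytes) {
        super(in);
        this.remaining = maxBytes;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // limit reached: behave like end of stream
        }
        int b = super.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1;
        }
        // Never ask the underlying stream for more than the remaining budget.
        int n = super.read(b, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would bypass the byte accounting
    }
}
```

A parser reading from such a wrapper simply sees a shorter stream, so it needs no size checks of its own.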
luccioman
f8f1959ebb Added parsing within bounds implementation to the generic parser. 2017-07-11 09:07:48 +02:00
luccioman
e0f400a0bd Support trying multiple parsers even when streaming on large resources. 2017-07-11 09:06:37 +02:00
luccioman
1e84956721 Support loading local files with a per request specified maximum size.
Consistently with the HTTP loader implementation.
2017-07-11 09:04:23 +02:00
luccioman
f369679d1c Fixed read/copy on input streams reading sometimes less than expected. 2017-07-11 09:00:27 +02:00
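For context, a single InputStream.read(byte[], int, int) call may return fewer bytes than requested even when more data will follow, so code that expects a full buffer has to loop. The sketch below shows that general pattern; the StreamReads class and readFully method are hypothetical names for this example and not the code changed by the commit.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper illustrating the read loop.
final class StreamReads {

    /**
     * Reads until the buffer is full or the stream ends.
     * Returns the number of bytes actually stored in the buffer.
     */
    static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break; // end of stream before the buffer was filled
            }
            total += n;
        }
        return total;
    }
}
```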
reger
23bda133d2 Fix css conflict of YMarks.html to make it viewable.
yacy-ymarks.css sidebar conflicts with bootstraps sidebar (different
overlay settings). Simply renamed it to ymark-sidebar.
2017-07-09 23:08:54 +02:00
reger
af32d291c2 upd to commons-fileupload-1.3.3.jar 2017-07-08 23:46:10 +02:00
reger
a21789d4e7 Fix unresolved pattern in api/share.html by init some display var's 2017-07-08 22:46:15 +02:00
luccioman
bf55f1d6e5 Started support of partial parsing on large streamed resources.
This enables the getpageinfo_p API to return something in a reasonable
amount of time on resources in the megabyte size range and above.
Support is added first for the generic XML parser; for other formats,
regular crawler limits apply as usual.
2017-07-08 09:04:03 +02:00
luccioman
2a87b08cea Removed temporary html parser test code 2017-07-03 14:53:36 +02:00
luccioman
1b3c169a9c URL Viewer : decode raw text using the response charset when provided.
Otherwise decode as UTF-8, as previously done.
2017-07-03 13:51:14 +02:00
luccioman
90a7c1affa HTML parser : removed unnecessary remaining recursive processing
Recursive processing was removed in commit
67beef657f, but one remained for anchors
content (likely omitted from refactoring). It is no longer necessary:
other links such as images embedded in anchors are currently correctly
detected by the parser.

More annoying : that remaining recursive processing could lead to almost
endless processing when encountering some (invalid) HTML structures
involving nested anchors, as detected and reported by lucipher on YaCy
forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).
2017-07-03 10:00:53 +02:00
reger
e6e20dab52 upd to Jetty 9.4.6.v20170531
Modify loginservice to the changes in Jetty, partially based on pull
request #101 https://github.com/yacy/yacy_search_server/pull/101 by @automenta
2017-07-01 23:58:28 +02:00
luccioman
e4c730b99f Updated PerformanceQueues_p.xml API with last related servlet changes 2017-06-30 11:41:48 +02:00
luccioman
dcc56318bb Made remote search max system load limits configurable from UI.
As reported by davide on YaCy forums (
http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6004 ), when the
system is under high load, it could be difficult to understand why
remote search results are not fetched without carefully reading the
YaCy configuration file.
2017-06-30 11:30:54 +02:00
reger
ddd13b776d Add keyword constraint to rwi query result filter
To discard rwi results not matching query keyword: parameter
2017-06-30 02:11:18 +02:00
luccioman
e82eaee4b6 Apply consistent behavior on HTTP resource size exceeding limit.
On content size known from HTTP headers, terminates connection faster
and improves error reports quality by reporting relevant message
"Content to download exceed maximum value..." rather than previously "no
response (NULL) for url...".
2017-06-30 01:13:47 +02:00
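A rough sketch of the kind of early check described above, assuming the declared size is available as the raw "Content-Length" header value. The class name, method name, and error message text are placeholders chosen for this example, not the ones used by YaCy.

```java
import java.io.IOException;

// Hypothetical guard: reject a response before reading its body.
final class ContentLengthGuard {

    /**
     * Throws when the declared length exceeds maxBytes.
     * A missing or malformed header is let through, so a streaming
     * limit still has to be enforced while reading the body.
     */
    static void check(String contentLengthHeader, long maxBytes) throws IOException {
        if (contentLengthHeader == null) {
            return;
        }
        long declared;
        try {
            declared = Long.parseLong(contentLengthHeader.trim());
        } catch (NumberFormatException e) {
            return; // unusable header: fall back to limits applied while reading
        }
        if (declared > maxBytes) {
            throw new IOException("Declared content length " + declared
                    + " exceeds limit of " + maxBytes + " bytes");
        }
    }
}
```

Failing at this point lets the caller close the connection without downloading the body and report a message that names the real cause.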
luccioman
0b75e92ac2 Do not wrap unnecessarily loader IOExceptions in IOExceptions 2017-06-30 01:06:17 +02:00
luccioman
433bdb7c0d Respect maxFileSize limit also when streaming HTTP and when relevant.
Constraint applied consistently with HTTP content full load in byte
array.
2017-06-30 00:30:54 +02:00
luccioman
4b72b29ea2 Added an informative title on the crawl start robots.txt status icon 2017-06-29 11:36:47 +02:00
luccioman
d08f31c3a8 Crawl start Ajax request : properly handle possible XML parsing errors
Otherwise, on a malformed getpageinfo_p XML response (from the browser
point of view), JavaScript errors were thrown and the ajax status
steering wheel remained displayed indefinitely.
2017-06-29 11:25:27 +02:00
luccioman
9b1bb2545e Refactored plain-text URLs detection implementation.
For faster processing (measured about 2 times faster on many real-world
examples) and more advanced detection (previous algorithm detected only
URLs separated from the rest of the text by a space character).
2017-06-27 19:30:40 +02:00
luccioman
8da3174867 Ensure lower case conversion consistency with any default locale.
Especially for Turkish-speaking users using "tr" as their system default
locale : strings for technical stuff (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' doesn't become
'i' like in other locales such as "en", but becomes 'ı'.
2017-06-27 06:42:33 +02:00
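The effect is easy to reproduce with the standard library. The sketch below contrasts lower casing with the Turkish locale and with Locale.ROOT; the LocaleLowerCase class and its normalize method are hypothetical names for this example, and using Locale.ROOT is one common way to get locale-independent results rather than a statement about the exact change made in YaCy.

```java
import java.util.Locale;

// Hypothetical demo of locale-sensitive lower casing.
final class LocaleLowerCase {

    /** Lower cases a technical token independently of the default locale. */
    static String normalize(String technicalToken) {
        return technicalToken.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // With Turkish rules, 'I' becomes the dotless 'ı' (U+0131).
        System.out.println("TITLE".toLowerCase(turkish)); // tıtle
        // With Locale.ROOT the result is the same on every system.
        System.out.println(normalize("TITLE"));           // title
    }
}
```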
luccioman
286f3018bd Made mime type and extension normalization locale independent.
Previously, upper cased mime type was incorrectly normalized when the
default locale is Turkish.
2017-06-26 17:33:56 +02:00
luccioman
319231a458 Added a generic XML parser, able to parse elements text and URLs.
This parser adds support for any XML based format other than already
supported XML vocabularies such as XHTML, RSS/Atom feeds... It will
eventually be used as a fallback if one of these specific parsers fails,
before falling back to the existing genericParser which extracts little
useful information besides URL tokens.
2017-06-26 16:30:21 +02:00
reger
aeeb8a7dd5 upd to jwat-warc-1.0.6.jar 2017-06-25 20:05:37 +02:00
reger
f0ba828627 remove unused Solr optional extra handler lib solr-dataimporthandler-6.6.0.jar 2017-06-24 23:15:25 +02:00
reger
1773b61b3e upd to jsoup-1.10.3.jar 2017-06-24 22:54:43 +02:00
Ryszard Goń
3cedbbd4ed Wrong password was removed after the SSL certificate import
Removing the keystore password will prevent ssl from working after the next restart. The certificate password should be removed instead.
Fixes http://mantis.tokeek.de/view.php?id=687
2017-06-23 02:23:49 +02:00
luccioman
64cec2790d Improved character encoding detection from Content-Type header
Also updated some related JavaDocs
2017-06-22 10:50:34 +02:00
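To make the idea concrete, here is a minimal sketch of extracting the charset parameter from a Content-Type header value, with a caller-supplied fallback. The ContentTypeCharset class and charsetOf method are assumptions for illustration; they do not reproduce the detection logic in YaCy.

```java
import java.nio.charset.Charset;

// Hypothetical helper for reading the charset parameter of a Content-Type value.
final class ContentTypeCharset {

    private static final String PREFIX = "charset=";

    /** Returns the declared charset, or the fallback when absent or unusable. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters follow.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                String name = param.substring(PREFIX.length()).trim();
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1); // strip quotes
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }
}
```

For example, charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8) returns ISO-8859-1, while a header without a charset parameter yields the UTF-8 fallback.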
luccioman
1acb7005d0 Added a basic JUnit test with test gz files for the gzip parser 2017-06-21 09:14:50 +02:00
luccioman
1e2fb76720 Properly close test files in htmlParser unit test 2017-06-21 09:11:17 +02:00
luccioman
c41b31dcb3 Cleaned up memory usage page HTML
- fixed validation errors
- removed deprecated attributes
- improved accessibility with richer table semantics (headers and
caption elements) and language declaration
2017-06-20 09:21:55 +02:00
luccioman
0487336ec3 Prevent integer overflow in table statistics and use strong typing 2017-06-19 17:02:11 +02:00
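The general pitfall can be shown in a few lines: multiplying two int values overflows silently before the result is widened, so one operand must be converted to long first. The TableStats class and its methods are hypothetical and only illustrate the arithmetic, not the statistics code touched by the commit.

```java
// Hypothetical example of int overflow in size computations.
final class TableStats {

    /** Overflows when rowCount * rowSizeBytes exceeds Integer.MAX_VALUE. */
    static long naiveMemoryBytes(int rowCount, int rowSizeBytes) {
        return rowCount * rowSizeBytes; // multiplied as int, then widened
    }

    /** Widens one operand first so the multiplication is done in long. */
    static long memoryBytes(int rowCount, int rowSizeBytes) {
        return (long) rowCount * rowSizeBytes;
    }

    public static void main(String[] args) {
        System.out.println(naiveMemoryBytes(3_000_000, 1_000)); // -1294967296
        System.out.println(memoryBytes(3_000_000, 1_000));      // 3000000000
    }
}
```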
luccioman
0f80c978d6 Limit the number of initially previewed links in crawl start pages.
This prevents rendering a big and inconvenient scrollbar on resources
containing many links.
If really needed, preview of all links is still available with a "Show
all links" button.

Doesn't affect the number of links used once the crawl is effectively
started, as the list is then loaded again server-side.
2017-06-17 09:33:14 +02:00
luccioman
d2a4a27f52 Improved stream-oriented parsing entering conditions. 2017-06-17 09:26:37 +02:00
luccioman
32288a8999 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2017-06-17 08:16:55 +02:00
luccioman
e9b4b29f90 Limit scope of some local JavaScript variables. 2017-06-16 08:50:57 +02:00
Michael Peter Christen
369b8e0e0b added json(p) endpoint for crawl start 2017-06-16 08:44:40 +02:00