Commit Graph

13289 Commits

Author SHA1 Message Date
reger
2a07799ad1 Correction of d03e2c98ea
Fix Conjunction.addOperator to do nothing if term is empty
This prevents query strings with a repeated logical operator,
like "field:term AND AND field:term",
possibly causing out of memory in postprocessing_doublecontent
2017-08-14 01:03:15 +02:00
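
The guard described in this commit message can be illustrated with a minimal sketch. The class `ConjunctionSketch` and its method `addOperand` are hypothetical names chosen for illustration; they are not YaCy's actual `Conjunction` code, which is not shown in this log.

```java
// Hypothetical stand-in for a query conjunction builder; not YaCy's actual class.
public class ConjunctionSketch {
    private final StringBuilder query = new StringBuilder();
    private final String operator;

    public ConjunctionSketch(String operator) {
        this.operator = operator;
    }

    // Appends a term, inserting the operator only between two non-empty terms.
    public void addOperand(String term) {
        if (term == null || term.trim().isEmpty()) {
            return; // empty term: do nothing, so no dangling operator is emitted
        }
        if (query.length() > 0) {
            query.append(' ').append(operator).append(' ');
        }
        query.append(term.trim());
    }

    @Override
    public String toString() {
        return query.toString();
    }

    public static void main(String[] args) {
        ConjunctionSketch c = new ConjunctionSketch("AND");
        c.addOperand("field:term");
        c.addOperand(""); // ignored
        c.addOperand("field:term");
        System.out.println(c);
    }
}
```

Skipping the empty term before the operator is written is what keeps a sequence like "AND AND" from ever appearing in the output.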
reger
d03e2c98ea Fix Conjunction.addOperator to do nothing if term is empty
This prevents query strings with a repeated logical operator,
like "field:term AND AND field:term",
possibly causing out of memory in postprocessing_doublecontent
2017-08-14 00:52:03 +02:00
reger
b6a41df4f7 Remove deprecated YaCyProxyServlet
was replaced by UrlProxyServlet
2017-08-12 21:53:04 +02:00
luccioman
8a94fef9e0 Prevent unwanted cached bytes duplication on stream parsing. 2017-08-12 09:43:49 +02:00
luccioman
ed678186a8 Updated xml parser limited parsing test for use with the latest JDK. 2017-08-12 09:42:06 +02:00
luccioman
366ceae35a Fixed missing transitive dependency to commons-collections4-4.1
Dependency required by poi-3.16. 

Dependency was not provided in YaCy but already defined on previous poi
versions. This only became problematic since upgrade from poi-3.15 to
poi-3.16 (commit dedc6552d3). Indeed in
this new poi release, a poi component used in some YaCy parsers code
paths now explicitly needs a class from the commons-collections4
library : org.apache.poi.hpsf.Section now uses
org.apache.commons.collections4.bidimap.TreeBidiMap.

Impacted YaCy parsers : xlsParser, pptParser, docParser.

Issue detected by the following JUnit tests failing :
ParserTest.testpptParsers(), ParserTest.testdocParsers(),
xlsParserTest.testParse()
2017-08-11 20:50:36 +02:00
luccioman
bf72cbffa3 Updated debian package configuration to match new Java 1.8 target
Following migration from Java 1.7 to Java 1.8 in commit
6fe735945d
2017-08-11 20:34:59 +02:00
reger
119b65389d upd to icu4j-59_1.jar 2017-08-10 23:57:37 +02:00
reger
4979439e87 Skip public post of jre version.
It was added to determine when to switch to Java 8 (596b5dfa59)
2017-08-06 23:41:53 +02:00
reger
e918ec199e Replace deprecated ConcurrentHashSet with recommended Java8
ConcurrentHashMap.newKeySet() in postprocessDocuments()
2017-08-06 23:26:27 +02:00
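
For readers unfamiliar with the replacement API named in this commit, here is a short sketch of `ConcurrentHashMap.newKeySet()` (standard since Java 8) used as a thread-safe set. The class `KeySetSketch` and the method `countDistinct` are made up for this example and do not reflect the actual code of `postprocessDocuments()`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class KeySetSketch {

    // Several threads add the same ids concurrently; the set keeps each id once.
    public static int countDistinct(String[] ids, int threads) {
        Set<String> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (String id : ids) {
                    seen.add(id); // safe without external synchronization
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinct(new String[] {"a", "b", "a"}, 4));
    }
}
```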
reger
fb71994342 Harmonizing use of xml reader / sax parser in XMLBlacklistImporter
eliminating the need for lib/xercesImpl.jar
2017-08-05 23:47:27 +02:00
reger
275d65fffe Patch last_modified date with internal FirstSeenTime() if no date provided
to make sure updated documents are indexed with their last-modified
date as provided in current crawl. 
(always patching moddate with firstseen might risk missing
actual updates).
2017-08-05 22:30:06 +02:00
reger
d1b23afed6 Remove obsolete Protocol parameter ttl (time to live)
not interpreted in target yacy/query.html
also Protocol.querySeed() not used and parameter not interpreted in 
target servlet yacy/query.html
2017-08-01 00:59:53 +02:00
reger
dedc6552d3 upd to poi-3.16.jar 2017-07-31 23:38:10 +02:00
reger
15d78b1064 Replace deprecated getIP with getIPs in Protocol transferURL() and
getProfile().
Remember used ip for error handling and departInterface
2017-07-31 01:55:01 +02:00
reger
ed36b47bec Replace one more deprecated peerDeparture in Protocol.transferIndex()
by moving/using interfaceDeparture() in transferRWI()
2017-07-30 23:02:15 +02:00
reger
37f44941fb upd to pdfbox-2.0.7.jar 2017-07-30 20:09:06 +02:00
reger
41616de0b8 Add SolrConfig ClassicIndexSchemaFactory to prevent Solr startup warning.
This overrides Solr default to use managed schema. As we don't use
programmatic schema changes this directs Solr to use schema.xml, eliminating
the warning.
2017-07-23 03:55:56 +02:00
luccioman
0ee8c030c4 Log an error when Solr folder migration fails for some reason. 2017-07-17 15:35:10 +02:00
reger
44d455dfed upd to jwat-warc-1.1.0.jar 2017-07-16 23:37:28 +02:00
reger
588c6e96fb upd version for typeahead.jquery.js in jslicense.html 2017-07-16 23:35:56 +02:00
luccioman
5a646540cc Support parsing gzip files from servers with redundant headers.
Some web servers provide both 'Content-Encoding : "gzip"' and
'Content-Type : "application/x-gzip"' HTTP headers on their ".gz" files.
Failing on such resources was annoying, as they are not so uncommon,
although non-conforming (see RFC 7231 section 3.1.2.2 for
"Content-Encoding" header specification
https://tools.ietf.org/html/rfc7231#section-3.1.2.2)
2017-07-16 14:46:46 +02:00
luccioman
11a7f923d4 Distinguish response parsing failures from unexpected exceptions. 2017-07-16 14:39:53 +02:00
luccioman
8100c033a2 URL Viewer : apply crawler size limits when adding to local index.
This allows parsing and previewing large files, while preventing unwanted
OutOfMemory errors which are likely to occur when adding to the Solr
Index resources larger than configured crawler limits.
2017-07-16 14:37:06 +02:00
luccioman
eda7b0aeb6 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2017-07-15 08:49:25 +02:00
reger
3005be7349 Clean up unmaintained and unused AugmentParser trail. 2017-07-15 00:19:23 +02:00
reger
e5cff062b5 Clean up redundant but obsolete jquery.rdfquery-core-1.0.js script lib 2017-07-14 23:41:39 +02:00
luccioman
cb4f1358e1 Added gzip parser support for max content bytes limit 2017-07-13 08:18:40 +02:00
luccioman
5216c681a9 Added HTML parser support for maximum content bytes parsing limit 2017-07-13 08:12:10 +02:00
luccioman
4aafebc014 Merge pull request #122 from Scarfmonster/patch-1
I also reproduced the issue, and the fix is working fine.

Thanks @Scarfmonster
2017-07-12 16:03:23 +02:00
luccioman
651fad6da5 Added RSS parser support for maximum content bytes parsing limit 2017-07-12 00:18:12 +02:00
luccioman
452a17a8d5 Finer control on bounded input streams with custom stream implementation 2017-07-12 00:13:24 +02:00
luccioman
f8f1959ebb Added parsing within bounds implementation to the generic parser. 2017-07-11 09:07:48 +02:00
luccioman
e0f400a0bd Support trying multiple parsers even when streaming on large resources. 2017-07-11 09:06:37 +02:00
luccioman
1e84956721 Support loading local files with a per request specified maximum size.
Consistently with the HTTP loader implementation.
2017-07-11 09:04:23 +02:00
luccioman
f369679d1c Fixed read/copy on input streams reading sometimes less than expected. 2017-07-11 09:00:27 +02:00
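
As background for this fix: `InputStream.read(byte[], int, int)` may legitimately return fewer bytes than requested, so a single call is not enough to fill a buffer. The sketch below shows the usual loop that handles this; `ReadFullySketch`, `readFully`, `trickle` and `demo` are hypothetical names, and this is not the code changed by the commit.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ReadFullySketch {

    // Reads until buf is full or the stream ends; returns the number of bytes read.
    public static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break; // end of stream reached before the buffer was filled
            }
            total += n;
        }
        return total;
    }

    // Demo helper: a stream that hands out at most one byte per read call.
    static InputStream trickle(byte[] data) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 1));
            }
        };
    }

    // Fills a 5-byte buffer from a 7-byte trickling stream.
    public static int demo() {
        byte[] buf = new byte[5];
        try {
            return readFully(trickle(new byte[] {1, 2, 3, 4, 5, 6, 7}), buf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```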
reger
23bda133d2 Fix css conflict of YMarks.html to make it viewable.
yacy-ymarks.css sidebar conflicts with bootstraps sidebar (different
overlay settings). Simply renamed it to ymark-sidebar.
2017-07-09 23:08:54 +02:00
reger
af32d291c2 upd to commons-fileupload-1.3.3.jar 2017-07-08 23:46:10 +02:00
reger
a21789d4e7 Fix unresolved pattern in api/share.html by init some display var's 2017-07-08 22:46:15 +02:00
luccioman
bf55f1d6e5 Started support of partial parsing on large streamed resources.
This enables the getpageinfo_p API to return something in a reasonable amount
of time on resources in the megabytes size range and above.
Support added first with the generic XML parser, for other formats
regular crawler limits apply as usual.
2017-07-08 09:04:03 +02:00
luccioman
2a87b08cea Removed temporary html parser test code 2017-07-03 14:53:36 +02:00
luccioman
1b3c169a9c URL Viewer : decode raw text using the response charset, if any.
Otherwise decode as UTF-8, as previously done.
2017-07-03 13:51:14 +02:00
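
The fallback described in this commit can be sketched as follows, assuming the charset name arrives as a plain string (for instance taken from a Content-Type header). `CharsetFallbackSketch` and `decode` are hypothetical names; the URL Viewer's real code is not shown in this log.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallbackSketch {

    // Decodes raw bytes with the given charset name, or with UTF-8 when the
    // name is absent, malformed, or not supported by this JVM.
    public static String decode(byte[] raw, String charsetName) {
        Charset cs = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                cs = Charset.forName(charsetName.trim());
            } catch (IllegalArgumentException e) {
                // illegal or unsupported charset name: keep the UTF-8 default
            }
        }
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // "é" encoded as ISO-8859-1
        System.out.println(decode(latin1, "ISO-8859-1"));
    }
}
```

Both exceptions thrown by `Charset.forName` for bad names extend `IllegalArgumentException`, so one catch clause covers the malformed and the unsupported case.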
luccioman
90a7c1affa HTML parser : removed unnecessary remaining recursive processing
Recursive processing was removed in commit
67beef657f, but one remained for anchors
content (likely omitted from the refactoring). It is no longer necessary :
other links such as images embedded in anchors are currently correctly
detected by the parser.

More annoying : that remaining recursive processing could lead to almost
endless processing when encountering some (invalid) HTML structures
involving nested anchors, as detected and reported by lucipher on YaCy
forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).
2017-07-03 10:00:53 +02:00
reger
e6e20dab52 upd to Jetty 9.4.6.v20170531
Modify loginservice to the changes in Jetty, partially based on pull 
request #101 https://github.com/yacy/yacy_search_server/pull/101 by @automenta
2017-07-01 23:58:28 +02:00
luccioman
e4c730b99f Updated PerformanceQueues_p.xml API with last related servlet changes 2017-06-30 11:41:48 +02:00
luccioman
dcc56318bb Made remote search max system load limits configurable from UI.
As reported by davide on YaCy forums (
http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6004 ) when the
system is under high load, unless one carefully reads the YaCy configuration
file, it can be difficult to understand why remote search results are
not fetched.
2017-06-30 11:30:54 +02:00
reger
ddd13b776d Add keyword constraint to rwi query result filter
To discard rwi results not matching query keyword: parameter
2017-06-30 02:11:18 +02:00
luccioman
e82eaee4b6 Apply consistent behavior on HTTP resource size exceeding limit.
On content size known from HTTP headers, terminates connection faster
and improves error reports quality by reporting relevant message
"Content to download exceed maximum value..." rather than previously "no
response (NULL) for url...".
2017-06-30 01:13:47 +02:00
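
A rough illustration of the early check described here, assuming the declared length comes from the Content-Length header and that a negative value means "unknown". `SizeLimitSketch`, `ensureWithinLimit` and the exact message text are assumptions made for this example, not YaCy's loader code.

```java
import java.io.IOException;

public class SizeLimitSketch {

    // Fails fast when the size declared in the HTTP headers already exceeds the
    // limit, before any of the body is downloaded.
    // A negative declaredLength means the size is unknown; a negative maxBytes
    // means no limit is configured. Both cases pass this check.
    public static void ensureWithinLimit(long declaredLength, long maxBytes) throws IOException {
        if (declaredLength >= 0 && maxBytes >= 0 && declaredLength > maxBytes) {
            throw new IOException("Content to download exceeds maximum value: "
                    + declaredLength + " > " + maxBytes + " bytes");
        }
    }

    public static void main(String[] args) {
        try {
            ensureWithinLimit(2_000_000L, 1_000_000L);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

When the length is unknown, a check like this cannot help, and the limit has to be enforced while streaming instead.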
luccioman
0b75e92ac2 Do not wrap unnecessarily loader IOExceptions in IOExceptions 2017-06-30 01:06:17 +02:00
luccioman
433bdb7c0d Respect maxFileSize limit also when streaming HTTP and when relevant.
Constraint applied consistently with HTTP content full load in byte
array.
2017-06-30 00:30:54 +02:00