Commit Graph

4087 Commits

Author SHA1 Message Date
luccioman
a1a0515312 Added a button to manually refresh sorting of p2p search results.
As a server-side oriented alternative to the JavaScript realtime
resorting feature proposed in PR #104.
The goal is the same as in this PR : having the possibility compensate
the network latency of various peers results fetching and obtain once
possible a consistently ranked result set.
2017-08-28 19:03:51 +02:00
luccioman
4eba88f2ff Removed some unnecessary uses of java.lang.reflect api.
This improves code browsing and readability, making search by references
or call hierarchy IDE features more accurate.
2017-08-24 18:47:18 +02:00
luccioman
da3dbf9ea1 Use Javadoc style comments on SearchEvent properties.
For better code readability and understanding.
2017-08-23 08:20:37 +02:00
luccioman
c6ae87168a Added unit tests on the gzip parser. 2017-08-22 14:13:00 +02:00
luccioman
169ffdd1c7 Finer control on max links to parse in the html parser. 2017-08-22 14:11:35 +02:00
luccioman
e41d046a9d Improved parsing support for OOXML spreadsheets (.xlsx)
As reported edycop in mantis 765 (
http://mantis.tokeek.de/view.php?id=765 ), parsing of xlsx files was
quite incomplete.
Now properly support "Shared String Table" entry in Office Open XML
spreadsheets, an also detect embedded URLs.

Integrating the Apache poi-ooxml library could be an option for finer
OOXML formats support, but their SAX style parsing example (
http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api ) tends to
show that a custom SAX handler is still efficient for lightweight and
low memory footprint processing.
2017-08-21 09:38:20 +02:00
reger
51a4e03c93 Allow to stop currently running warc import (stop button) 2017-08-20 22:17:27 +02:00
luccioman
6cec2cdcb5 Use unredirected robots.txt URL when adding an entry to the table. 2017-08-16 14:21:07 +02:00
luccioman
3f0446f14b Ensure proper synchronous robots entry retrieval on first check.
Previously, when checking for the first time the robots.txt policy on a
unknown host (not cached in the robots table), result was always empty
in the /getpageinfo_p.xml api and in the /CrawlCheck_p.html page. Next
calls returned however the correct information.
2017-08-16 09:30:33 +02:00
luccioman
b23a563065 Prevent search result failure on incomplete images information.
Complements the recent modification related to images in commit 7f395ef.

Unfortunately many documents metadata fetched from the freeworld p2p
network have only partial information about embedded images. Without
proper error handling, this made many searches in p2p mode to fail
completely.
2017-08-15 10:11:05 +02:00
Michael Peter Christen
30d71c6359 added usage of X-Real-IP http header
to identify request IPs which came through NGINX reverse proxy
configurations
2017-08-15 07:16:01 +02:00
Michael Peter Christen
f45378c11c Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2017-08-14 20:12:26 +02:00
Michael Peter Christen
7f395ef937 added image link in search results
This should be a help to make a preview of search results.
The image is computed from the list of embedded images, it is
always the first image in that list.
In rss-type results the image is presented like
<media:content medium="image" url="https://abc.xyz/logo.png"/>
as defined in
http://www.rssboard.org/media-rss#media-content
2017-08-14 20:12:09 +02:00
luccioman
780173008e Implemented partial stream parsing of tar archives.
Also added JUnit tests for the tar parser and fixed unwanted use of the
tar parser as a fallback on files included in a tar archive.
2017-08-14 14:57:58 +02:00
luccioman
acab6a6def Also handle text content when parsing XML within limits. 2017-08-14 14:47:01 +02:00
reger
2a07799ad1 Correction of d03e2c98ea
Fix Conjunction.addOperator to do nothing if term is empty
prevent to result in query string with repeated logical operator
like "field:term AND AND field:term"
possibliy causing out of mem in postprocessing_doublecontent
2017-08-14 01:03:15 +02:00
reger
d03e2c98ea Fix Conjunction.addOperator to do nothing if term is empty
prevent to result in query string with repeated logical operator
like "field:term AND AND field:term"
possibliy causing out of mem in postprocessing_doublecontent
2017-08-14 00:52:03 +02:00
reger
b6a41df4f7 Remove deprecated YaCyProxyServlet
was replaced by UrlProxyServlet
2017-08-12 21:53:04 +02:00
luccioman
8a94fef9e0 Prevent unwanted cached bytes duplication on stream parsing. 2017-08-12 09:43:49 +02:00
reger
4979439e87 Skip public post of jre version.
Added to determine switch to java8  596b5dfa59
2017-08-06 23:41:53 +02:00
reger
e918ec199e Replace deprecated ConcurrentHashSet with recommended Java8
ConcurrentHashMap.newKeySet() in postprocessDocuments()
2017-08-06 23:26:27 +02:00
reger
fb71994342 Harmonizing use of xml reader / sax parser in XMLBlacklistImporter
eliminating the need for lib/xercesImpl.jar
2017-08-05 23:47:27 +02:00
reger
275d65fffe Patch last_modified date with internal FirstSeenTime() if no date provided
to make sure updated documents are indexed with their last-modified
date as provided in current crawl. 
(to patch moddate always with firstseen might bear the risk of miss 
actual updates).
2017-08-05 22:30:06 +02:00
reger
d1b23afed6 Remove obsolete Protocol parameter ttl (time to live)
not interpreted in target yacy/query.html
also Protocol.querySeed() not used and parameter not interpreted in 
target servlet yacy/query.html
2017-08-01 00:59:53 +02:00
reger
15d78b1064 Replace deprecated getIP with getIPs in Protocol transferURL() and
getProfile().
Remember used ip for error handling and departInterface
2017-07-31 01:55:01 +02:00
reger
ed36b47bec Replace one more deprecated peerDeparture in Protocol.transferIndex()
by moving/using interfaceDeparture() in transferRWI()
2017-07-30 23:02:15 +02:00
luccioman
0ee8c030c4 Log an error when Solr folder migration fails for some reason. 2017-07-17 15:35:10 +02:00
luccioman
5a646540cc Support parsing gzip files from servers with redundant headers.
Some web servers provide both 'Content-Encoding : "gzip"' and
'Content-Type : "application/x-gzip"' HTTP headers on their ".gz" files.
This was annoying to fail on such resources which are not so uncommon,
while non conforming (see RFC 7231 section 3.1.2.2 for
"Content-Encoding" header specification
https://tools.ietf.org/html/rfc7231#section-3.1.2.2)
2017-07-16 14:46:46 +02:00
luccioman
11a7f923d4 Distinguish response parsing failures from unexpected exceptions. 2017-07-16 14:39:53 +02:00
luccioman
eda7b0aeb6 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2017-07-15 08:49:25 +02:00
reger
3005be7349 Clean up unmaintained and unused AugmentParser trail. 2017-07-15 00:19:23 +02:00
luccioman
cb4f1358e1 Added gzip parser support for max content bytes limit 2017-07-13 08:18:40 +02:00
luccioman
5216c681a9 Added HTML parser support for maximum content bytes parsing limit 2017-07-13 08:12:10 +02:00
luccioman
4aafebc014 Merge pull request #122 from Scarfmonster/patch-1
I also reproduced the issue, and the fix is working fine.

Thanks @Scarfmonster
2017-07-12 16:03:23 +02:00
luccioman
651fad6da5 Added RSS parser support for maximum content bytes parsing limit 2017-07-12 00:18:12 +02:00
luccioman
452a17a8d5 Finer control on bounded input streams with custom stream implementation 2017-07-12 00:13:24 +02:00
luccioman
f8f1959ebb Added parsing within bounds implementation to the generic parser. 2017-07-11 09:07:48 +02:00
luccioman
e0f400a0bd Support trying multiple parsers even when streaming on large resources. 2017-07-11 09:06:37 +02:00
luccioman
1e84956721 Support loading local files with a per request specified maximum size.
Consistently with the HTTP loader implementation.
2017-07-11 09:04:23 +02:00
luccioman
f369679d1c Fixed read/copy on input streams reading sometimes less than expected. 2017-07-11 09:00:27 +02:00
luccioman
bf55f1d6e5 Started support of partial parsing on large streamed resources.
Thus enable getpageinfo_p API to return something in a reasonable amount
of time on resources over MegaBytes size range.
Support added first with the generic XML parser, for other formats
regular crawler limits apply as usual.
2017-07-08 09:04:03 +02:00
luccioman
90a7c1affa HTML parser : removed unnecessary remaining recursive processing
Recursive processing was removed in commit
67beef657f, but one remained for anchors
content(likely omitted from refactoring). It is no more necessary :
other links such as images embedded in anchors are currently correctly
detected by the parser.

More annoying : that remaining recursive processing could lead to almost
endless processing when encountering some (invalid) HTML structures
involving nested anchors, as detected and reported by lucipher on YaCy
forum ( http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6005 ).
2017-07-03 10:00:53 +02:00
reger
e6e20dab52 upd to Jetty 9.4.6.v20170531
Modify loginservice to the changes in Jetty, partially based on pull 
request #101 https://github.com/yacy/yacy_search_server/pull/101 bu @automenta
2017-07-01 23:58:28 +02:00
luccioman
dcc56318bb Made remote search max system load limits configurable from UI.
As reported by davide on YaCy forums (
http://forum.yacy-websuche.de/viewtopic.php?f=23&t=6004 ) when the
system is on high load, unless reading carefully YaCy configuration
file, it could be difficult to understand why remote search results are
not fetched.
2017-06-30 11:30:54 +02:00
reger
ddd13b776d Add keyword constraint to rwi query result filter
To discard rwi results not matching query keyword: parameter
2017-06-30 02:11:18 +02:00
luccioman
e82eaee4b6 Apply consistent behavior on HTTP resource size exceeding limit.
On content size known from HTTP headers, terminates connection faster
and improves error reports quality by reporting relevant message
"Content to download exceed maximum value..." rather than previously "no
response (NULL) for url...".
2017-06-30 01:13:47 +02:00
luccioman
0b75e92ac2 Do not wrap unnecessarily loader IOExceptions in IOExceptions 2017-06-30 01:06:17 +02:00
luccioman
433bdb7c0d Respect maxFileSize limit also when streaming HTTP and when relevant.
Constraint applied consistently with HTTP content full load in byte
array.
2017-06-30 00:30:54 +02:00
luccioman
9b1bb2545e Refactored plain-text URLs detection implementation.
For faster processing (measured about 2 times faster on many real-world
examples) and more advanced detection (previous algorithm detected only
URLs separated from the rest of the text by a space character).
2017-06-27 19:30:40 +02:00
luccioman
8da3174867 Ensure lower case conversion consistency with any default locale.
Especially for Turkish speaking users using "tr" as their system default
locale : strings for technical stuff (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' doesn't becomes
'i' like in other locales such as "en", but becomes 'ı'.
2017-06-27 06:42:33 +02:00
luccioman
286f3018bd Made mime type and extension normalization locale independent.
Previously, upper cased mime type was incorrectly normalized when the
default locale is Turkish.
2017-06-26 17:33:56 +02:00
luccioman
319231a458 Added a generic XML parser, able to parse elements text and URLs.
This parser adds support for any XML based format other than already
supported XML vocabularies such XHTML, RSS/Atom feeds... It will
eventually be used as a fallback if one of these specific parsers fail,
before falling back to the existing genericParser which extracts not
that much useful information except URL tokens.
2017-06-26 16:30:21 +02:00
Ryszard Goń
3cedbbd4ed Wrong password was removed after the SSL certificate import
Removing the keystore password will prevent ssl from working after the next restart. The certificate password should be removed instead.
Fixes http://mantis.tokeek.de/view.php?id=687
2017-06-23 02:23:49 +02:00
luccioman
64cec2790d Improved character encoding detection from Content-Type header
Also updated some related JavaDocs
2017-06-22 10:50:34 +02:00
luccioman
0487336ec3 Prevent integer overflow in table statistics and use strong typing 2017-06-19 17:02:11 +02:00
luccioman
d2a4a27f52 Improved stream-oriented parsing entering conditions. 2017-06-17 09:26:37 +02:00
luccioman
9dd790087d Added HT Cache basic statistics (hit rate) 2017-06-15 09:50:02 +02:00
luccioman
5fdd5d16b1 Use volatile to ensure concurrent threads use up to date property value 2017-06-15 09:48:22 +02:00
luccioman
28b451a0b3 Made Cache compression level and lock timeout user configurable 2017-06-14 19:02:08 +02:00
luccioman
a7394b479b Limit the synchronization blocking time on some Cache operations.
Using a Reentrant lock instead of the intrinsic synchronization lock
permits limiting the blocking time to acquire a lock.

Useful on a very busy Cache concurrently accessed by many threads : when
the time to acquire a lock is too high, getting/storing content on the
cache becomes inefficient, and it is then better to fall back to loading
remote resources.

Illustrated by the CacheTest stress test and some traces reported in
mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )
2017-06-14 09:13:50 +02:00
Michael Peter Christen
c94a8c76bd re-added solr synchronization hack 2017-06-09 12:50:36 +02:00
Michael Peter Christen
6fe735945d migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8
Also: now Version 1.921
2017-06-09 12:25:23 +02:00
luccioman
ce89492319 Ensure system resource release by closing document stream. 2017-06-08 07:36:11 +02:00
luccioman
8399275142 Properly close file output streams even on exceptions scenarios. 2017-06-08 07:19:16 +02:00
luccioman
4e4dc6c4e5 Removed unnecessary finalize implementation.
On such private classes with limited scope but with frequent instance
creations and removals within the application lifecycle, implementing
the finalize method is particularly unwanted as it decreases the garbage
collector performance.
What's more the Object.finalize() method is now deprecated in the JDK 9
and will eventually disappear from future releases (see
https://bugs.openjdk.java.net/browse/JDK-8177970)
2017-06-06 10:30:02 +02:00
luccioman
a04feac064 Ensure file input streams proper closing in both success and failures
Also add when possible a warning level log message on input stream
closing error instead of failing silently. This could help understanding
some IO exceptions such as "too many files open".
2017-06-03 04:00:46 +02:00
luccioman
d98c04853d Ensure proper closing of file input streams. 2017-06-02 12:14:29 +02:00
luccioman
c53c58fa85 Unsure closing ChunkIterator stream in every possible use case.
Also trace in logs the eventual close failures instead of failing
silently.
This should help prevent holding too many unreleased system file
handlers, as in the case reported by eros on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988&sid=b00e7486c1bf7e48a0d63eb328ccca02
)
2017-06-02 09:47:45 +02:00
luccioman
29e52bda39 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2017-06-02 01:47:53 +02:00
luccioman
a9cb083fa1 Improved consistency between loader openInputStream and load functions 2017-06-02 01:46:06 +02:00
reger
a814f3d885 Introduce keyword query parameter
This enables keyword navigator to filter on keywords. Added search page
output and layout config for keywords, allowing e.g. in Intranet use
to display the keywords. No styling or links applied to the keyword
text (but is desirable possibly in combination with bootstrap-tagsinput
for future/intranet).
2017-06-02 01:00:21 +02:00
luccioman
c226ded799 Fix unescape of URLs having some '%' chars but not percent-encoded 2017-05-30 12:32:14 +02:00
luccioman
306a82dd71 Fixed scraper NullPointerException cases on malformed URLs. 2017-05-30 08:48:20 +02:00
luccioman
aa55d71cf5 Fixed a NullPointerException case on Digest authentication.
Could occur when upgrading from a Debian package configured with Basic
authentication (as in release 1.92.9000) to a more recent one with
Digest authentication, without having re-encoded the admin password (for
example with dpkg-reconfigure).

As reported by eros on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988#p33686).
2017-05-29 19:16:09 +02:00
luccioman
02ec0ed13c Quoted param value in Solr query to avoid unwanted traces in logs
When Webgraph Solr core is enabled, crawling and removing from index an
URL whose hash starts with the '-' character (example URL :
https://cs.wikipedia.org/ whose hash is "-2-HuTEndn4x") produced a full
ParseException stack trace in YaCy logs. This was not blocking because
the Solr query parser is able to escape itself the query and run it
successfully, but filled uselessly YaCy logs.
2017-05-24 08:43:03 +02:00
reger
1737af37cf Set request originator to own peer in warc importer
in addition to change in 039162fbf0
2017-05-22 01:56:11 +02:00
reger
039162fbf0 Change warc importer to use defaultsurrogate-crawl profile, as reported
by LA_FORGE http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5990 and
analysed by @luccioman (see comment 510f11d374)
it creates conflict using a other crawlprofile without setting originator.
2017-05-22 01:34:08 +02:00
Michael Peter Christen
3b1d640a3c enhanced debugging 2017-05-18 00:28:12 +02:00
Michael Peter Christen
7de7879f13 added a cache to prevent too many seed enumerations 2017-05-18 00:28:00 +02:00
luccioman
bd7411a53a Enable p2p and cluster communication when "Protection of all pages" on
As reported by paul89 on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5958 ), when setting
the "Protection of all pages" to "On" in the "ConfigAccounts_p.html"
page, the peer became completely unreachable by others, which is not the
purpose of this feature.
But the restriction still makes sense as a security enforcement and is
maintained in private "Robinson mode" where by the way any peer-to-peer
or cluster communication would be rejected.
2017-05-17 09:00:29 +02:00
luccioman
31ad043bb9 Added user interface feedback on results feeding termination status.
Added as an additional icon with title in the search progress bar, to
inform about background search feeder threads terminated or still
running. While giving a bit more information to users about the p2p
search process, this can help choosing whether or not wait a little bit
more time before going to the next page, in order to get results from
various sources sorted as best as possible (see #91 for a discussion
about sorting accuracy and network latency).

Other related modifications included :
 - regular updates to statistics in the progress bar until the
background feeders are completely terminated.
 - removed some uses of unsecure and discouraged JavaScript elements
2017-05-15 13:15:16 +02:00
sgaebel
ff6392215e added closing of lst-Tag in solr-Export 2017-05-13 20:38:25 +02:00
luccioman
d90b001e1b Improved previous merge "Show ranking in HTML UI".
- added the new setting as configurable in the "Debug/Analysis" settings
page. Debug/analysis is its main purpose for now as there is currently
no nice and "understansable" ranking score info servlet (see forum
discussion http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5884 ) 
- render in the "Search Page Layout" page preview when enabled
- added constants
2017-05-11 18:02:33 +02:00
luccioman
0f0f42b509 Added some JavaDoc 2017-05-11 08:33:19 +02:00
reger
077d062be3 Adjust mergeDocuments to keep youngest last-modified date of document
collection
2017-05-09 22:52:54 +02:00
luccioman
654801523e Fixed StringIndexOutOfBoundsException case.
Revealed by commit c77e43a : the exception was then thrown when indexing
pages containing mailto: scheme URL links with the Solr Webgraph core
enabled.
Fixed the error case and restored filtering on mailto links in
Document.resortLinks() as these URLs still should not appear in
Document.hyperlinks.
2017-05-09 18:32:47 +02:00
luccioman
522a268305 Improved new blacklist entries URL scheme detection. 2017-05-04 16:36:45 +02:00
luccioman
532981b363 Updated putHTML() JavaDoc 2017-05-04 11:21:27 +02:00
luccioman
58d23047dd Handle '?' and '+' chars as valid wild cards when adding to blacklist.
An entry such as "domain.com/[a-z]+" is a valid regular expression and
do not need additional ".*.*/.*" wildcards.
2017-05-04 11:19:59 +02:00
luccioman
a87281b498 Added MediaWiki dump import scheduling feature.
Checking the last modified date by default to prevent unnecessary long
running operations.
2017-05-03 18:53:01 +02:00
luccioman
edd7ccac40 Added some JavaDoc 2017-05-02 09:33:11 +02:00
luccioman
79fdf14b0a Fixed regression introduced by commit 9ad4d16
On MediaWiki dump imports, the SurrogateReader was trying to unread too
many bytes, then failing with the following exception :
"java.io.IOException: Push back buffer is full".
2017-05-02 09:32:04 +02:00
Michael Peter Christen
7678fd67e3 copied fix from yacy_grid_parser for wrong array type 2017-05-01 11:44:26 +02:00
Michael Peter Christen
200b100fb8 added patch to rewrite altered yacy grid schema into yacy schema
This generates the stub and protocol parts of an url for inboundlinks,
outboundlinks and images
2017-05-01 11:38:02 +02:00
reger
9ad4d16829 Add a responsHeader to the solr index export with a format identifier
and export parameter (in accordance with response xml format) for easier
format detection on import.
2017-04-30 23:53:52 +02:00
luccioman
9697209ef6 Fixed Index Export feature for compatibility with old indexed documents.
This is a fix for mantis 682 (http://mantis.tokeek.de/view.php?id=682)
and issue #116
2017-04-28 11:39:51 +02:00
luccioman
88c062639b Added some JavaDoc 2017-04-28 11:36:48 +02:00
luccioman
31fff2c986 Extended WikiCode template inclusion syntax support.
Wiki templates are not rendered but syntax support is improved, which
greatly enhance snippets rendering on search results coming from a
MediaWiki dump import.
Tested on various dumps from Wikimedia at
https://dumps.wikimedia.org/backup-index.html
See also Wikipedia transclusion documentation at
https://en.wikipedia.org/wiki/Wikipedia:Transclusion
2017-04-27 09:50:04 +02:00
Michael Peter Christen
973d74712f added yacy grid flatjson surrogate parser 2017-04-25 08:44:02 +02:00
luccioman
b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
This page was always empty, as described in mantis 740
(http://mantis.tokeek.de/view.php?id=740)
2017-04-24 18:24:26 +02:00