Commit Graph

13238 Commits

Author SHA1 Message Date
luccioman
4b72b29ea2 Added an informative title on the crawl start robots.txt status icon 2017-06-29 11:36:47 +02:00
luccioman
d08f31c3a8 Crawl start Ajax request : properly handle eventual XML parsing errors
Otherwise on a malformed getpageinfo_p XML response (from the browser
point of view), JavaScript errors were thrown and the ajax status
steering wheel remained displayed indefinitely.
2017-06-29 11:25:27 +02:00
luccioman
9b1bb2545e Refactored plain-text URLs detection implementation.
For faster processing (measured about 2 times faster on many real-world
examples) and more advanced detection (previous algorithm detected only
URLs separated from the rest of the text by a space character).
2017-06-27 19:30:40 +02:00
luccioman
8da3174867 Ensure lower case conversion consistency with any default locale.
Especially for Turkish speaking users using "tr" as their system default
locale : strings for technical stuff (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' doesn't become
'i' like in other locales such as "en", but becomes 'ı'.
2017-06-27 06:42:33 +02:00
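
The locale issue described in the commit above can be reproduced with a minimal sketch. The class and method names here are hypothetical and are not taken from the YaCy code base; only String.toLowerCase(Locale) and Locale.ROOT are standard JDK API.

```java
import java.util.Locale;

final class LocaleSafeLowerCase {

    /**
     * Lower-cases a technical token (URL scheme, tag name, constant...)
     * with the locale-neutral Locale.ROOT instead of the system default.
     */
    static String lowerTechnical(final String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    public static void main(final String[] args) {
        final Locale turkish = Locale.forLanguageTag("tr");
        final String tag = "TITLE";
        // Turkish casing rules map 'I' to the dotless 'ı' (U+0131)
        System.out.println(tag.toLowerCase(turkish)); // tıtle
        // Locale.ROOT gives the same result whatever the default locale is
        System.out.println(lowerTechnical(tag)); // title
    }
}
```

Passing an explicit locale rather than calling toLowerCase() without arguments is what makes the result independent of the machine the code runs on.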
luccioman
286f3018bd Made mime type and extension normalization locale independent.
Previously, an upper-cased mime type was incorrectly normalized when the
default locale was Turkish.
2017-06-26 17:33:56 +02:00
luccioman
319231a458 Added a generic XML parser, able to parse elements text and URLs.
This parser adds support for any XML based format other than already
supported XML vocabularies such as XHTML, RSS/Atom feeds... It will
eventually be used as a fallback if one of these specific parsers fails,
before falling back to the existing genericParser, which extracts little
useful information apart from URL tokens.
2017-06-26 16:30:21 +02:00
reger
aeeb8a7dd5 upd to jwat-warc-1.0.6.jar 2017-06-25 20:05:37 +02:00
reger
f0ba828627 remove unused Solr optional extra handler lib solr-dataimporthandler-6.6.0.jar 2017-06-24 23:15:25 +02:00
reger
1773b61b3e upd to jsoup-1.10.3.jar 2017-06-24 22:54:43 +02:00
luccioman
64cec2790d Improved character encoding detection from Content-Type header
Also updated some related JavaDocs
2017-06-22 10:50:34 +02:00
luccioman
1acb7005d0 Added a basic JUnit test with test gz files for the gzip parser 2017-06-21 09:14:50 +02:00
luccioman
1e2fb76720 Properly close test files in htmlParser unit test 2017-06-21 09:11:17 +02:00
luccioman
c41b31dcb3 Cleaned up memory usage page HTML
- fixed validation errors
- removed deprecated attributes
- improved accessibility with richer table semantics (headers and
caption elements) and language declaration
2017-06-20 09:21:55 +02:00
luccioman
0487336ec3 Prevent integer overflow in table statistics and use strong typing 2017-06-19 17:02:11 +02:00
luccioman
0f80c978d6 Limit the number of initially previewed links in crawl start pages.
This prevents rendering a big and inconvenient scrollbar on resources
containing many links.
If really needed, preview of all links is still available with a "Show
all links" button.

Doesn't affect the number of links used once the crawl is effectively
started, as the list is then loaded again server-side.
2017-06-17 09:33:14 +02:00
luccioman
d2a4a27f52 Improved stream-oriented parsing entering conditions. 2017-06-17 09:26:37 +02:00
luccioman
32288a8999 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2017-06-17 08:16:55 +02:00
luccioman
e9b4b29f90 Limit scope of some local JavaScript variables. 2017-06-16 08:50:57 +02:00
Michael Peter Christen
369b8e0e0b added json(p) endpoint for crawl start 2017-06-16 08:44:40 +02:00
reger
83ba45ebae make nsis build script require java 8 2017-06-16 06:31:45 +02:00
reger
cf70081cfc update nsi installer java autodl bundleid to use jre-8u131 2017-06-16 02:17:49 +02:00
reger
9220ccbec7 remove reference to velocityresponsewriter in solrconfig.xml
it is no longer part of the solr-core api
http://lucene.apache.org/solr/6_6_0/index.html
2017-06-16 00:12:09 +02:00
reger
4be4bfbba6 remove sample path setting in solrconfig.xml not valid in YaCy
resulting in startup stop exception after fresh switch to 1.921
2017-06-15 21:02:18 +02:00
reger
510859bcce update maven pom setting to YaCy version 1.921
java 1.8 and solr 6.6
2017-06-15 20:24:53 +02:00
luccioman
f6e8d71718 Prevent high CPU load at startup, caused by the Solr suggester build.
Reported by Collision on mantis 758 (
http://mantis.tokeek.de/view.php?id=758 ).
Introduced by the new YaCy Solr configuration for Solr 6.6.0 (see commit
6fe735945d), which now includes a Suggester configuration.
2017-06-15 14:13:46 +02:00
luccioman
9dd790087d Added HT Cache basic statistics (hit rate) 2017-06-15 09:50:02 +02:00
luccioman
5fdd5d16b1 Use volatile to ensure concurrent threads use up to date property value 2017-06-15 09:48:22 +02:00
luccioman
28b451a0b3 Made Cache compression level and lock timeout user configurable 2017-06-14 19:02:08 +02:00
luccioman
a7394b479b Limit the synchronization blocking time on some Cache operations.
Using a Reentrant lock instead of the intrinsic synchronization lock
permits limiting the blocking time to acquire a lock.

Useful on a very busy Cache concurrently accessed by many threads : when
the time to acquire a lock is too high, getting/storing content on the
cache becomes inefficient, and it is then better to fall back to loading
remote resources.

Illustrated by the CacheTest stress test and some traces reported in
mantis 751 ( http://mantis.tokeek.de/view.php?id=751 )
2017-06-14 09:13:50 +02:00
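
The bounded lock acquisition described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the actual YaCy Cache implementation: the class name, the in-memory map and the timeout parameter are hypothetical, while ReentrantLock.tryLock(long, TimeUnit) is standard JDK API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class TimedLockCache {

    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, byte[]> entries = new HashMap<>();
    private final long lockTimeoutMs; // hypothetical setting, in milliseconds

    TimedLockCache(final long lockTimeoutMs) {
        this.lockTimeoutMs = lockTimeoutMs;
    }

    /** @return false when the lock could not be acquired within the timeout */
    boolean store(final String url, final byte[] content) throws InterruptedException {
        if (!this.lock.tryLock(this.lockTimeoutMs, TimeUnit.MILLISECONDS)) {
            return false; // give up instead of blocking indefinitely
        }
        try {
            this.entries.put(url, content);
            return true;
        } finally {
            this.lock.unlock();
        }
    }

    /** @return the content, or null when absent or when the lock timed out */
    byte[] getContent(final String url) throws InterruptedException {
        if (!this.lock.tryLock(this.lockTimeoutMs, TimeUnit.MILLISECONDS)) {
            return null; // the caller can then load the remote resource
        }
        try {
            return this.entries.get(url);
        } finally {
            this.lock.unlock();
        }
    }
}
```

A synchronized block offers no way to stop waiting for the monitor, which is why an explicit lock is needed to cap the blocking time.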
luccioman
73ab4a7b3a Prevent log pollution from unwanted Solr warnings.
Many non-blocking "java.nio.file.NoSuchFileException" traces with
warning log level can be logged by Solr, especially when heavily
crawling. This issue has been known since Solr 5.x but is still unresolved in
Solr 6.x ( https://issues.apache.org/jira/browse/SOLR-9120 )

Consequently raised the default log level of the related internal Solr
class to "SEVERE".

See also mantis 727 ( http://mantis.tokeek.de/view.php?id=727 )
2017-06-14 08:56:11 +02:00
Michael Peter Christen
c94a8c76bd re-added solr synchronization hack 2017-06-09 12:50:36 +02:00
Michael Peter Christen
6fe735945d migrated Solr 5.5 -> Solr 6.6 and from Java 1.7 -> 1.8
Also: now Version 1.921
2017-06-09 12:25:23 +02:00
luccioman
ce89492319 Ensure system resource release by closing document stream. 2017-06-08 07:36:11 +02:00
luccioman
8399275142 Properly close file output streams even in exception scenarios. 2017-06-08 07:19:16 +02:00
luccioman
4e4dc6c4e5 Removed unnecessary finalize implementation.
On such private classes with limited scope but with frequent instance
creations and removals within the application lifecycle, implementing
the finalize method is particularly unwanted as it decreases the garbage
collector performance.
What's more, the Object.finalize() method is now deprecated in JDK 9
and will eventually disappear from future releases (see
https://bugs.openjdk.java.net/browse/JDK-8177970)
2017-06-06 10:30:02 +02:00
reger
632354e2ff Tokenize result entry keywords and add some styling for display 2017-06-04 01:50:40 +02:00
reger
c42d17f607 upd to commons-compress-1.14.jar 2017-06-03 21:58:04 +02:00
luccioman
a04feac064 Ensure file input streams are properly closed on both success and failure
Also add when possible a warning level log message on input stream
closing error instead of failing silently. This could help understanding
some IO exceptions such as "too many files open".
2017-06-03 04:00:46 +02:00
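
A minimal sketch of the closing pattern described in the commit above, assuming a hypothetical helper class and logger name (neither comes from the YaCy sources). The stream is closed on both the success and the failure path, and a failure to close is logged at warning level instead of being swallowed.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.Logger;

final class SafeStreamClose {

    private static final Logger LOG = Logger.getLogger("SafeStreamClose");

    /** Reads the first byte of a file, always releasing the file handle. */
    static int firstByte(final File file) throws IOException {
        InputStream in = null;
        try {
            in = new FileInputStream(file);
            return in.read();
        } finally {
            if (in != null) {
                try {
                    in.close();
                } catch (final IOException e) {
                    // Log instead of failing silently, so that leaked
                    // handles can be traced back to their origin
                    LOG.log(Level.WARNING, "Could not close input stream on file " + file, e);
                }
            }
        }
    }
}
```

A try-with-resources statement would also guarantee the close, but it propagates a close failure to the caller; the explicit finally block is used here because the commit describes logging that failure.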
luccioman
d98c04853d Ensure proper closing of file input streams. 2017-06-02 12:14:29 +02:00
luccioman
c53c58fa85 Ensure closing ChunkIterator stream in every possible use case.
Also trace in logs any close failures instead of failing
silently.
This should help prevent holding too many unreleased system file
handles, as in the case reported by eros on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988&sid=b00e7486c1bf7e48a0d63eb328ccca02
)
2017-06-02 09:47:45 +02:00
luccioman
29e52bda39 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2017-06-02 01:47:53 +02:00
luccioman
a9cb083fa1 Improved consistency between loader openInputStream and load functions 2017-06-02 01:46:06 +02:00
reger
a814f3d885 Introduce keyword query parameter
This enables the keyword navigator to filter on keywords. Added search page
output and layout config for keywords, allowing the keywords to be
displayed, e.g. in Intranet use. No styling or links are applied to the
keyword text (though this would be desirable, possibly in combination with
bootstrap-tagsinput, for future/intranet use).
2017-06-02 01:00:21 +02:00
luccioman
cbccf97361 Added JavaDoc to the getpageinfo_p API servlet. 2017-05-30 17:38:16 +02:00
luccioman
c226ded799 Fix unescape of URLs having some '%' chars but not percent-encoded 2017-05-30 12:32:14 +02:00
luccioman
bd88fd303e Deprecated duplicated and internally unused getpageinfo servlet.
Redirects are set up to ease the transition for any external users:
 - /api/getpageinfo.xml to /api/getpageinfo_p.xml
 - /api/getpageinfo.json to /api/getpageinfo_p.json
2017-05-30 09:29:28 +02:00
luccioman
306a82dd71 Fixed scraper NullPointerException cases on malformed URLs. 2017-05-30 08:48:20 +02:00
luccioman
aa55d71cf5 Fixed a NullPointerException case on Digest authentication.
Could occur when upgrading from a Debian package configured with Basic
authentication (as in release 1.92.9000) to a more recent one with
Digest authentication, without having re-encoded the admin password (for
example with dpkg-reconfigure).

As reported by eros on YaCy forum
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5988#p33686).
2017-05-29 19:16:09 +02:00
reger
b65a04087b upd to pdfbox-2.0.6.jar 2017-05-24 22:13:42 +02:00
luccioman
02ec0ed13c Quoted param value in Solr query to avoid unwanted traces in logs
When the Webgraph Solr core is enabled, crawling and removing from the index a
URL whose hash starts with the '-' character (example URL:
https://cs.wikipedia.org/ whose hash is "-2-HuTEndn4x") produced a full
ParseException stack trace in YaCy logs. This was not blocking because
the Solr query parser is able to escape the query itself and run it
successfully, but it needlessly filled the YaCy logs.
2017-05-24 08:43:03 +02:00
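
The quoting described in the commit above can be illustrated with a short sketch. The helper and the field name "id" are hypothetical and do not reflect the actual YaCy query code. The sketch assumes that, inside a double-quoted phrase, only the backslash and the double quote need escaping, so a leading '-' is read as part of the value rather than as an operator.

```java
final class SolrQuoting {

    /** Builds a field:"value" clause, escaping backslashes and quotes in the value. */
    static String quotedTerm(final String field, final String value) {
        final String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    public static void main(final String[] args) {
        // URL hash quoted in the commit message; "id" is a placeholder field name
        System.out.println(quotedTerm("id", "-2-HuTEndn4x")); // id:"-2-HuTEndn4x"
    }
}
```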