4257 Commits

Author SHA1 Message Date
luccioman
a997133260 Fixed gzip decompression regression on index transfer APIs
Processing of gzip encoded incoming requests (on /yacy/transferRWI.html
and /yacy/transferURL.html) had stopped working since the upgrade to
Jetty 9.4.12 (see commit 51f4be1).

To prevent any conflicting behavior with Jetty internals, now use the
GzipHandler provided by Jetty to decompress incoming gzip encoded
requests rather than the previously used custom GZIPRequestWrapper.

Fixes issue #249
2018-11-07 14:52:42 +01:00
luccioman
e85f231bdf Fixed termination of Host browser and link structure Solr query threads
Under some conditions (especially when reaching timeout), concurrent Solr
query tasks used by the /HostBrowser.html and /api/linkstructure.json
never terminated, thus leaking resources, as reported by @Vort in issue
#246
2018-11-06 10:10:09 +01:00
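
The leak described above can be illustrated with a minimal sketch that uses only `java.util.concurrent`. It is not the code from this commit: the class `TimedQuery` and the method `runWithTimeout` are hypothetical names. The idea is that a task whose caller has timed out must be cancelled explicitly, otherwise it keeps running.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of YaCy: runs a task and interrupts it on timeout. */
final class TimedQuery {

    /** Returns the task result, or null when the task did not finish within timeoutMs. */
    static <T> T runWithTimeout(ExecutorService executor, Callable<T> task, long timeoutMs)
            throws InterruptedException, ExecutionException {
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Without this call the task would keep running after the caller gave up
            future.cancel(true);
            return null;
        }
    }
}
```

Cancelling with `true` interrupts the worker thread, so the task itself still has to react to interruption (blocking calls such as `Thread.sleep` do).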
luccioman
fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
The new "Media Type detection" section in the advanced crawl start page
allows choosing between:
- not loading URLs with an unknown or unsupported file extension without
checking the actual Media Type (relying on the Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross-checking the URL file extension against the actual Media Type.
This allows properly parsing URLs ending with an apparently odd file
extension, but which actually have a supported Media Type such as
text/html.

Sample URLs with misleading file extensions added as documentation in
the crawl start page.

fixes issue #244
2018-10-25 10:42:12 +02:00
luccioman
a83a56473e Added support for PDF snapshots generation when running on MS Windows 2018-10-18 12:41:57 +02:00
luccioman
8852c97cee Added basic styling for cleaner rendering of missing image snapshots
For the output of the Solr snapshots writer
2018-10-15 18:19:57 +02:00
luccioman
746e0e788d Render a relevant HTTP status code on snapshot image rendering error
Instead of a null response body which is not very helpful.
2018-10-14 10:30:30 +02:00
luccioman
50b6edfcf5 Updated Solr snapshots writer for a cleaner html head 2018-10-13 10:36:39 +02:00
luccioman
f366f43d6b Made snapshots size customizable in Solr snapshots response writer 2018-10-13 10:22:47 +02:00
luccioman
7a62fc0e66 Fixed concurrency issue in custom classloader used for template classes
As reported in issue #241, the problem is only critical (random but
complete crash of the JVM) when upgrading to JDK11.
2018-10-11 18:34:39 +02:00
luccioman
61c337f29a Decode blacklist entries for easier editing of non-ASCII chars
Not using the JDK URLDecoder.decode() function, as it strips '+'
characters when they occur after '?' (both characters having regular
expression semantics when used in blacklist path patterns)
2018-10-04 09:33:58 +02:00
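
As background for the note above, `java.net.URLDecoder` implements form decoding and treats '+' as an encoded space, so it cannot be applied to patterns where '+' is a regular expression quantifier. The following is a minimal sketch of a percent-decoder that leaves '+' untouched. It is an illustration only: `PercentDecoder` is a hypothetical name, and the actual YaCy implementation may differ.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not YaCy's actual implementation. */
final class PercentDecoder {

    /** Returns the value of an ASCII hex digit, or -1 when c is not one. */
    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Decodes %XX sequences as UTF-8 and copies every other character, including '+', unchanged. */
    static String decode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()
                    && hex(encoded.charAt(i + 1)) >= 0 && hex(encoded.charAt(i + 2)) >= 0) {
                // Valid escape: emit the decoded byte
                out.write((hex(encoded.charAt(i + 1)) << 4) | hex(encoded.charAt(i + 2)));
                i += 3;
            } else {
                // Anything else (including a malformed '%') is copied as-is
                int cp = encoded.codePointAt(i);
                byte[] raw = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(raw, 0, raw.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```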
luccioman
ed93221fa1 Improved normalization of blacklist path patterns having non-ASCII chars
Normalize blacklist path patterns using percent-encoding, both when a
pattern is edited in the web interface and when it is loaded from
configuration files.

Fixes issue #237
2018-10-02 14:36:13 +02:00
luccioman
2a73b63d9e Use a constant default target file name for seed SCP upload method
To make seed upload (in /Settings_p.html?page=seed page) with SCP easier
when the user specifies a remote target directory path.

See report by @vikulin in issue #227
2018-09-16 10:37:47 +02:00
luccioman
b5eabb626f Removed some dead code 2018-09-14 14:02:32 +02:00
luccioman
db7ad76366 Improved support for Java logs file pattern options
- support of "%h" and "%t" pattern components
- more proper initialization of file handler when the data folder is not
the default one, notably to prevent a non blocking but ugly error stack
trace reported by the log manager at startup with that kind of setup
2018-09-13 12:17:02 +02:00
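
For reference, `java.util.logging.FileHandler` documents "%h" as the value of the "user.home" system property and "%t" as the system temporary directory. The sketch below shows one way such components could be expanded before checking or creating the log folder. It is a simplified illustration, not the code from this commit: `LogPatternResolver` and its parameters are hypothetical, and other components such as "%g" and "%u" are deliberately left untouched.

```java
/** Hypothetical helper, not part of YaCy: expands "%h" and "%t" in a log file pattern. */
final class LogPatternResolver {

    /**
     * Replaces "%h" with userHome and "%t" with tmpDir.
     * An escaped percent ("%%") and any other component are copied unchanged.
     */
    static String resolve(String pattern, String userHome, String tmpDir) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '%' && i + 1 < pattern.length()) {
                char next = pattern.charAt(i + 1);
                if (next == 'h') { sb.append(userHome); i++; continue; }
                if (next == 't') { sb.append(tmpDir); i++; continue; }
                if (next == '%') { sb.append("%%"); i++; continue; } // keep escaped percent
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

A caller would typically pass `System.getProperty("user.home")` and `System.getProperty("java.io.tmpdir")` as the two directory arguments.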
luccioman
7adbd1f87d Fixed raw IPv6 addresses snapshots read/write on FAT32 and NTFS fs
Fixes issue #225
2018-09-12 17:34:40 +02:00
luccioman
9b1c87033b Fixed logs folder checking and creation
Previously, if the YaCy log folder was for example at
`/home/user/yacy/DATA/LOG`, because of improper truncation of the log path,
an unnecessary directory creation was attempted at `/home/us`.
2018-08-31 08:34:28 +02:00
luccioman
c29588dd6a Made it possible to provide an absolute data root path for start script
Previously, only a path relative to the user home folder could be
provided
2018-08-30 18:16:22 +02:00
luccioman
d03c098b54 Removed deprecated warning comments about imports and Debian installer
Deprecated by commit be5d3a1066, as
classpath is now defined in yacycore.jar Manifest file.
2018-08-22 22:35:00 +02:00
luccioman
5b60b4225f Fixed encoding of '+' character on search pages links
As revealed by issue #216
2018-08-20 18:44:04 +02:00
luccioman
54fbe166ba Updated pdf cache clear steps consistently with current pdfbox version
- Removed calls to no longer existing clearResources functions (on PDFont
class and its children) since upgrade to pdfbox 2.n.n
- Removed hacky usage of protected internal ClassLoader function. This
removes the warnings displayed when running with JDK9 or JDK10:

     [java] WARNING: Illegal reflective access by
net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to
method java.lang.ClassLoader.findLoadedClass(java.lang.String)
     [java] WARNING: Please consider reporting this to the maintainers
of net.yacy.document.parser.pdfParser$ResourceCleaner
     [java] WARNING: Use --illegal-access=warn to enable warnings of
further illegal reflective access operations
     [java] WARNING: All illegal access operations will be denied in a
future release

Crawling thousands of pdf documents from various sources after the
modifications were applied revealed no new memory leak related to pdfbox
(measurements done with JVisualVM).
2018-08-16 18:23:42 +02:00
luccioman
685122363d Added a parser for XZ compressed archives.
As suggested by LA_FORGE on mantis 781
(http://mantis.tokeek.de/view.php?id=781)
2018-08-15 10:07:39 +02:00
luccioman
4ee14ff3c5 Fixed NullPointerException case on malformed crawl queue folder name 2018-08-13 14:35:26 +02:00
luccioman
21ad9435ec Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems
As reported by @vikulin in issue #187, crawling websites using a raw
IPv6 address as host name in their URL failed when running on Microsoft
Windows platforms (FAT32 or NTFS filesystems) when YaCy crawler created
the crawl queue folder, as the ':' character which is part of an IPv6
address is forbidden on these filesystems.
2018-08-11 10:02:26 +02:00
luccioman
8a29551c54 Upgraded the OpenGeoDB dump URL
The status of the library in the DictionaryLoader_p.html page now also
informs the user that an upgrade can be applied when an older dump is
already loaded.

Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter
chat.
2018-08-03 18:39:41 +02:00
luccioman
373edf9eac Adjusted yjson Solr writer to support responses from an external Solr
Worked previously only with responses from YaCy embedded Solr, now able
to render the response when YaCy is configured to use an external Solr
index.
2018-07-31 16:22:21 +02:00
luccioman
87bd17b1cf Simplified the RSS OpenSearch Solr writer a little bit 2018-07-31 16:02:50 +02:00
luccioman
dc49ca9c27 Fixed a NPE case on the Solr OpenSearch response writer
Occurred when omitHeader parameter is set to true
2018-07-29 16:30:37 +02:00
luccioman
f4267ed247 Made Solr OpenSearch RSS writer compatible with external Solr index
Worked previously only with responses from YaCy embedded Solr, now able
to render the response when YaCy is configured to use an external Solr
index.
2018-07-28 11:03:31 +02:00
luccioman
b1410f593a Fixed stylesheet relative URLs rendering in Solr html writer
Relative URLs to CSS stylesheets were not properly rendered when using
the Solr html response writer and the "/solr/collection1/select" entry
point instead of "/solr/select".
2018-07-25 08:03:25 +02:00
luccioman
89c59814da Improved rendering of the Solr api relative url in the html writer
In order to have a consistent relative url when using either
/solr/select or /solr/collection1/select entry point.
2018-07-24 10:13:55 +02:00
luccioman
bf4f320b16 Optionally render the response header when using the Solr html writer
With params rendered as html input fields for conveniently modifying
params values and refreshing results.
2018-07-23 18:36:57 +02:00
luccioman
313204ae2c Override qf and df Solr params with defaults only when they are not set 2018-07-23 13:50:24 +02:00
luccioman
bdafb14336 Removed redundant synchronization lock on network switch function
Was useless as it was done in an already synchronized block, the lock
object was assigned a new value in that same block, and no lock is
requested on that same object anywhere else.
2018-07-16 09:20:23 +02:00
luccioman
d5f44ea216 Removed unnecessary synchronization lock from serverSwitch constructor
Lock was useless here as it was set on an object instance attribute
while the object itself is not yet constructed and no other threads can
access it.
2018-07-16 09:13:50 +02:00
luccioman
dcad393fe5 Fixed exceeding max size of failreason_s Solr field on large link list
When using the 'From Link-List of URL' as a crawl start, with lists on
the order of one or more thousands of links, the failreason_s Solr field
maximum size (32kb) was exceeded by the string representation of the URL
must-match filter when a crawl URL was rejected because it did not match.
2018-07-11 08:13:29 +02:00
luccioman
f467601561 Properly lock solrInstances for reboot and restoration of embedded Solr
Putting a synchronization lock directly on the solrInstances property
was ineffective as it is assigned a new (unlocked) instance in these
operations.
2018-07-08 08:57:59 +02:00
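
The pitfall described above can be sketched in a few lines: synchronizing on a field that is reassigned inside the guarded block means later threads lock a different object. A common remedy is a dedicated, final lock object. This is a generic illustration, not the code from this commit, and `InstanceHolder` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical example, not YaCy code: guards a reassignable field with a dedicated lock. */
final class InstanceHolder {

    /** Never reassigned, so all threads synchronize on the same monitor. */
    private final Object lock = new Object();

    /** Reassigned on reboot, hence unsuitable as a lock object itself. */
    private List<String> instances = new ArrayList<>();

    /** Swaps in a new list while holding the dedicated lock. */
    void reboot(List<String> fresh) {
        synchronized (this.lock) {
            this.instances = new ArrayList<>(fresh);
        }
    }

    /** Reads the current list size under the same lock. */
    int size() {
        synchronized (this.lock) {
            return this.instances.size();
        }
    }
}
```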
luccioman
9630f81306 Fixed small unnecessary lines of code 2018-07-08 08:15:26 +02:00
luccioman
876bcd2f54 Fixed useless comparison between int parameter and Long.MAX_VALUE 2018-07-08 08:11:01 +02:00
luccioman
c726154a59 Fixed removal of URLs from the delegatedURL remote crawl stack
URLs were removed from the stack using their hash as a bytes array,
whereas the hash is stored in the stack as a String instance.
2018-07-05 09:36:36 +02:00
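
This kind of bug compiles silently because `Map.remove(Object)` accepts any argument type. The following sketch reproduces the mismatch with a plain `HashMap`. It is an illustration under assumptions: `HashKeyDemo` is a hypothetical name, the stack is modeled as a simple map, and hashes are assumed to be ASCII strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical example, not YaCy code: key type mismatch on removal. */
final class HashKeyDemo {

    /** Never removes anything: a byte[] is never equal to a String key. */
    static boolean removeByBytes(Map<String, String> stack, byte[] hash) {
        return stack.remove((Object) hash) != null;
    }

    /** Converts the hash to the key type actually stored before removing. */
    static boolean removeByString(Map<String, String> stack, byte[] hash) {
        return stack.remove(new String(hash, StandardCharsets.US_ASCII)) != null;
    }
}
```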
luccioman
2bdd71de60 Added server side columns sorting on the Process Scheduler table
For easier usage of large tables in the Table_API_p.html page.
2018-07-04 10:28:32 +02:00
luccioman
bb51555830 Removed remaining unsafe accesses to SimpleDateFormat instances.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formatting dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Now prefer DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-07-02 10:00:40 +02:00
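
A minimal sketch of the replacement pattern described above: a single immutable `DateTimeFormatter` shared by all threads for both formatting and parsing. The class name `SharedDateFormat` and the date pattern are assumptions chosen for illustration, not the ones used in YaCy.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical example, not YaCy code: one formatter instance shared without locking. */
final class SharedDateFormat {

    /** Immutable and thread-safe, so it can be a shared constant. */
    static final DateTimeFormatter UTC_SECONDS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    /** Formats an instant as a UTC date and time. */
    static String format(Instant instant) {
        return UTC_SECONDS.format(instant);
    }

    /** Parses a UTC date and time back to an instant. */
    static Instant parse(String text) {
        return UTC_SECONDS.parse(text, Instant::from);
    }
}
```

With `SimpleDateFormat`, the same shared constant would require either a `synchronized` block around each call or one instance per thread.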
luccioman
f895745e1c Removed more unsafe concurrent accesses to SimpleDateFormat instances.
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formatting dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Now prefer DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-29 15:49:55 +02:00
luccioman
e97580dfc7 Fixed unsafe concurrent access to generic SimpleDateFormat instances
SimpleDateFormat must not be used by concurrent threads without
synchronization for parsing or formatting dates as it is not thread-safe
(internally holds a calendar instance that is not synchronized).

Now prefer DateTimeFormatter when possible as it is thread-safe without
concurrent access performance bottleneck (does not internally use
synchronization locks).
2018-06-28 14:59:23 +02:00
luccioman
8811700e2e Upgraded Jetty dependency from 9.4.9 to 9.4.11 2018-06-20 09:33:26 +02:00
luccioman
d53c33e4ef Fixed potential infinite loop case (does not occur in current code base) 2018-06-20 07:51:59 +02:00
luccioman
a15ac8e0ca Made CrawlProfile loading tolerant to malformed json string attribute 2018-06-19 12:53:17 +02:00
luccioman
a715bb7876 Fixed rendering of solr mustNoMatch value on CrawlProfileEditor_p.xml 2018-06-19 12:50:28 +02:00
luccioman
0b302c5004 Do not block whole server startup on persisted crawl profile load error 2018-06-19 12:48:17 +02:00
luccioman
4d9aa4ed1e Fixed default crawl profile solr mustnotmatch query from previous commit 2018-06-19 11:58:47 +02:00
luccioman
cced94298a Added a new crawler document filter type using Solr syntax
This makes it possible to set up much more advanced document crawl filters,
by filtering on one or more document indexed fields before inserting in
the index.
2018-06-19 10:12:20 +02:00