8720 Commits

Author SHA1 Message Date
luccioman
e90405b6f0 Support parsing audio URLs without file extension
Also added a JUnit test for the audio tag parser
2019-04-09 11:40:21 +02:00
luccioman
a8316c79da Allow JS resorting of search results by unauthenticated users
Access rate limitations to this search mode by unauthenticated users are
set low by default to prevent unwanted server overload but can be
customized through the SearchAccessRate_p.html configuration page

Fixes #291
2019-04-03 14:21:53 +02:00
luccioman
0ab2b49c31 Made /yacysearch access rate limitations user configurable
With a new admin page at /SearchAccessRate_p.html in menu Network Access
> Local Search > Access Rate Limitations
2019-04-02 17:42:50 +02:00
luccioman
5b7e41202a Added Solr GSA writer support for responses from remote instances 2019-03-27 18:23:41 +01:00
luccioman
4d8a948455 Properly close PDF snapshots loaded with pdfbox library 2019-03-22 09:50:30 +01:00
luccioman
74e6d6e984 Added Solr GrepHTML writer support for responses from remote instances 2019-03-20 18:24:16 +01:00
luccioman
5e6501974d Added Solr snapshots writer support for responses from remote instances 2019-03-19 11:25:44 +01:00
luccioman
384c37102c Improve accuracy of total results count on latest pages in Stealth mode
Previously, when mixing results from local RWI and local Solr (Stealth
mode), total local Solr count could be ignored on last result pages,
when the page offset was higher than local Solr count but lower than
total RWI count.
2019-03-04 10:05:47 +01:00
luccioman
5e9a08355a Improved logging for federated search
- Do not use spaces in logger identifier name so the log level can be
configured in yacy.logging
- Hold the logger instance so the logging system does not have to look it
up by name at each appended log message
2019-02-02 09:59:24 +01:00
luccioman
9782a98a9c Added the possibility to customize facets sort type and direction
Previously search navigators/facets elements were sorted only by counts.
Now from the ConfigSearchPage_p.html admin page, sort direction
(ascending/descending) and type (on counts or labels) can be customized
independently for each navigator.
2019-01-24 18:43:06 +01:00
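The two options described in commit 9782a98a9c (sort type on counts or labels, direction ascending or descending) can be illustrated with a small comparator sketch. The class, enum, and method names below are hypothetical and are not taken from the YaCy code base; only the two options themselves come from the commit message.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class FacetSortSketch {

    /** Hypothetical sort types mirroring the "counts or labels" option. */
    public enum SortType { COUNT, LABEL }

    /** Builds a comparator for (label, count) facet entries. */
    public static Comparator<Map.Entry<String, Integer>> comparator(SortType type, boolean descending) {
        Comparator<Map.Entry<String, Integer>> cmp = (type == SortType.COUNT)
                ? Map.Entry.<String, Integer>comparingByValue()
                : Map.Entry.<String, Integer>comparingByKey();
        return descending ? cmp.reversed() : cmp;
    }

    /** Returns the labels of the given facet in the requested order. */
    public static List<String> sortedLabels(Map<String, Integer> facet, SortType type, boolean descending) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(facet.entrySet());
        entries.sort(comparator(type, descending));
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            labels.add(entry.getKey());
        }
        return labels;
    }
}
```

Keeping type and direction as separate parameters matches the commit's statement that both can be customized independently for each navigator.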
sgaebel
c2398fd890 remove warnings: 'Statement unnecessarily nested within else clause' 2019-01-10 20:02:57 +01:00
sgaebel
811d40a6c4 taking care of closing inputstreams, HTTPClient 2019-01-04 18:58:49 +01:00
sgaebel
8d2e7262d9 Recrawl:
- set the chunksize to 100 to meet the max of the embedded solr
- re-enable sorting (the case where we switched it off should be gone)
- enable recrawling on remote-solr
2019-01-04 18:46:59 +01:00
sgaebel
8f58c1dcfa extend the SolrServlet to be usable as remote solr (incl. update)
this feature needs to be enabled by uncommenting the url-pattern
2019-01-04 18:27:44 +01:00
luccioman
7223a2fdb1 Removed usage of now deprecated Jetty function 2018-12-22 14:42:22 +01:00
luccioman
440d9f2fa0 Exclude peers with empty or disabled RWI from remote RWI search 2018-12-20 14:53:01 +01:00
luccioman
08ea0b0397 Added a configurable timeout to wkhtmltopdf calls for pdf snapshots
Necessary to prevent blocking the indexing workflow when some
wkhtmltopdf renderings fail without terminating
2018-12-11 22:31:31 +01:00
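The general technique behind commit 08ea0b0397, bounding the time spent waiting for an external renderer, can be sketched with the standard `Process.waitFor(long, TimeUnit)` call. This is a minimal sketch and not YaCy's actual code: the class and method names are hypothetical, and `ProcessBuilder.Redirect.DISCARD` assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ProcessTimeoutSketch {

    /**
     * Runs an external command and waits at most timeoutSeconds for it to end.
     * Returns true when the process terminated in time, false when it was
     * killed after the timeout or when the waiting thread was interrupted.
     */
    public static boolean runWithTimeout(List<String> command, long timeoutSeconds) {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Discard the output so a chatty process cannot block on a full pipe buffer
        builder.redirectErrorStream(true);
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process;
        try {
            process = builder.start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            if (process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag for the caller
            Thread.currentThread().interrupt();
        }
        // The process did not terminate in time: kill it so the caller is not blocked
        process.destroyForcibly();
        return false;
    }
}
```

A caller would pass the renderer command line and treat a `false` result as a failed snapshot instead of waiting indefinitely.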
luccioman
3fb449b3b6 Properly resolve relative URLs against document URL in html base tags
Fixes issue #256
2018-12-06 20:18:00 +01:00
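One reading of commit 3fb449b3b6 is that a relative `href` in an html base tag must first be resolved against the document URL before it is used as the base for other links. The sketch below shows that two-step resolution with `java.net.URI`; it is an illustration under that assumption, with hypothetical names, and not the parser code from YaCy.

```java
import java.net.URI;

public class BaseHrefSketch {

    /**
     * Resolves a link found in a document.
     * baseHref is the href of the base tag, or null when the document has none.
     */
    public static URI resolveLink(String documentUrl, String baseHref, String link) {
        URI document = URI.create(documentUrl);
        // Step 1: a relative base href is itself resolved against the document URL
        URI base = (baseHref == null || baseHref.isEmpty()) ? document : document.resolve(baseHref);
        // Step 2: the link is resolved against the effective base
        return base.resolve(link);
    }
}
```

For example, with the document `http://example.org/docs/index.html` and the base href `/assets/`, the link `img/logo.png` resolves to `http://example.org/assets/img/logo.png`.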
luccioman
73a6e45524 Extended detection of external tools used for Snapshots generation
This enables detecting the wkhtmltopdf and ImageMagick convert executables
when they are on the system PATH, in addition to common installation paths.
2018-12-06 09:53:08 +01:00
luccioman
7dc1f60619 Fixed detection of absolute data folder path on MS Windows 2018-11-18 10:08:20 +01:00
luccioman
595e144797 Trace a message on incomplete proper server finish when killing process 2018-11-15 17:32:22 +01:00
luccioman
9daeea823b Fixed concurrency issue on cache used for circles rendering
Without synchronization lock, concurrent rendering of images including
circles could lead to glitches as reported in issue #248
2018-11-10 22:00:49 +01:00
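The kind of fix described in commit 9daeea823b, guarding a shared cache with a synchronization lock, can be sketched as follows. The cache content (half-widths of each circle row) and all names are hypothetical and chosen only to keep the example self-contained; the actual YaCy cache layout is not shown in the commit message.

```java
import java.util.HashMap;
import java.util.Map;

public class CircleCacheSketch {

    /** Shared cache: radius -> half-width of each circle row (hypothetical layout). */
    private static final Map<Integer, int[]> CACHE = new HashMap<>();

    /** Returns the cached row half-widths; callers must treat the array as read-only. */
    public static int[] halfWidths(int radius) {
        // HashMap is not thread-safe, so lookups and insertions share one lock
        synchronized (CACHE) {
            int[] widths = CACHE.get(radius);
            if (widths == null) {
                widths = new int[radius + 1];
                for (int y = 0; y <= radius; y++) {
                    widths[y] = (int) Math.sqrt((double) radius * radius - (double) y * y);
                }
                CACHE.put(radius, widths);
            }
            return widths;
        }
    }
}
```

Computing the entry inside the lock keeps concurrent renderers from observing a partially filled cache entry, at the cost of serializing cache misses.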
Michael Peter Christen
c347e7d3f8 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2018-11-08 14:42:52 +01:00
Michael Peter Christen
848e9304d9 evil bots may crawl harder 2018-11-08 14:42:40 +01:00
luccioman
a997133260 Fixed gzip decompression regression on index transfer APIs
Processing of gzip encoded incoming requests (on /yacy/transferRWI.html
and /yacy/transferURL.html) has no longer worked since the upgrade to Jetty
9.4.12 (see commit 51f4be1).

To prevent any conflicting behavior with Jetty internals, now use the
GzipHandler provided by Jetty to decompress incoming gzip encoded
requests rather than the previously used custom GZIPRequestWrapper.

Fixes issue #249
2018-11-07 14:52:42 +01:00
luccioman
e85f231bdf Fixed termination of Host browser and link structure Solr query threads
Under some conditions (especially when reaching timeout), concurrent Solr
query tasks used by the /HostBrowser.html and /api/linkstructure.json
never terminated, thus leaking resources, as reported by @Vort in issue
#246
2018-11-06 10:10:09 +01:00
luccioman
fcf6b16db4 Added new crawler attribute for finer control over Media Type detection
The new "Media Type detection" section in the advanced crawl start page
allows choosing between:
- not loading URLs with unknown or unsupported file extension without
checking the actual Media Type (relying on the Content-Type header for now).
This was the old default behavior, faster, but not really accurate.
- always cross-checking the URL file extension against the actual Media Type.
This allows proper parsing of URLs that end with an apparently odd file
extension but actually have a supported Media Type such as
text/html.

Sample URLs with misleading file extensions added as documentation in
the crawl start page.

Fixes issue #244
2018-10-25 10:42:12 +02:00
luccioman
a83a56473e Added support for PDF snapshots generation when running on MS Windows 2018-10-18 12:41:57 +02:00
luccioman
8852c97cee Added basic styling for cleaner rendering of missing image snapshots
For the output of the Solr snapshots writer
2018-10-15 18:19:57 +02:00
luccioman
746e0e788d Render a relevant HTTP status code on snapshot image rendering error
Instead of a null response body which is not very helpful.
2018-10-14 10:30:30 +02:00
luccioman
50b6edfcf5 Updated Solr snapshots writer for a cleaner html head 2018-10-13 10:36:39 +02:00
luccioman
f366f43d6b Made snapshots size customizable in Solr snapshots response writer 2018-10-13 10:22:47 +02:00
luccioman
7a62fc0e66 Fixed concurrency issue in custom classloader used for template classes
As reported in issue #241, the problem is only critical (random but
complete crash of the JVM) when upgrading to JDK11.
2018-10-11 18:34:39 +02:00
luccioman
61c337f29a Decode blacklist entries for easier editing of non-ASCII chars
Not using the JDK URLDecoder.decode() function, as it strips '+'
characters when they occur after '?' (both characters having regular
expression semantics when used in blacklist path patterns)
2018-10-04 09:33:58 +02:00
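The JDK behavior that commit 61c337f29a works around is easy to reproduce: `URLDecoder.decode()` applies form decoding, so each `+` comes back as a space. The sketch below shows that behavior and one possible way to keep `+` intact while still decoding percent-escapes. The helper `decodeKeepingPlus` is hypothetical and is not the decoder YaCy uses.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PlusDecodingSketch {

    /** Plain JDK decoding: every '+' comes back as a space. */
    public static String jdkDecode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical alternative: escape '+' first so it survives the decoding. */
    public static String decodeKeepingPlus(String encoded) {
        return jdkDecode(encoded.replace("+", "%2B"));
    }
}
```

With these helpers, `jdkDecode("a+b")` yields `a b`, while `decodeKeepingPlus("a+b")` yields `a+b`, which preserves the regular expression meaning of `+` in a blacklist path pattern.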
luccioman
ed93221fa1 Improved normalization of blacklist path patterns having non-ASCII chars
Normalize blacklist path patterns using percent-encoding, both when patterns
are edited in the web interface and when they are loaded from configuration files.

Fixes issue #237
2018-10-02 14:36:13 +02:00
luccioman
2a73b63d9e Use a constant default target file name for seed SCP upload method
To make seed upload (in /Settings_p.html?page=seed page) with SCP easier
when the user specifies a remote target directory path.

See report by @vikulin in issue #227
2018-09-16 10:37:47 +02:00
luccioman
b5eabb626f Removed some dead code 2018-09-14 14:02:32 +02:00
luccioman
db7ad76366 Improved support for Java logs file pattern options
- support of "%h" and "%t" pattern components
- more proper initialization of file handler when the data folder is not
the default one, notably to prevent a non blocking but ugly error stack
trace reported by the log manager at startup with that kind of setup
2018-09-13 12:17:02 +02:00
luccioman
7adbd1f87d Fixed raw IPV6 addresses snapshots read/write on FAT32 and NTFS fs
Fixes issue #225
2018-09-12 17:34:40 +02:00
luccioman
9b1c87033b Fixed logs folder checking and creation
Previously, if YaCy log folder was for example at
`/home/user/yacy/DATA/LOG`, because of improper truncation of log path,
an unnecessary directory creation was attempted at `/home/us`.
2018-08-31 08:34:28 +02:00
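Commit 9b1c87033b does not show how the log path was truncated, but the class of bug (cutting a path string at the wrong index) can be avoided by letting `java.nio.file.Path` compute the parent folder. The sketch below assumes a log file pattern such as `yacy%u%g.log` inside the log folder; that pattern and all names are illustrative assumptions, not values taken from YaCy.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class LogFolderSketch {

    /** Returns the folder that should contain the log files for the given file pattern. */
    public static Path logFolderOf(String logFilePattern) {
        Path parent = Paths.get(logFilePattern).getParent();
        // A bare file name has no parent: fall back to the current directory
        return parent != null ? parent : Paths.get(".");
    }
}
```

The returned folder could then be passed to `Files.createDirectories`, which creates missing parents and does nothing when the directory already exists.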
luccioman
c29588dd6a Made it possible to provide an absolute data root path for start script
Previously, only a path relative to the user home folder could be
provided
2018-08-30 18:16:22 +02:00
luccioman
d03c098b54 Removed deprecated warning comments about imports and Debian installer
Deprecated by commit be5d3a1066, as the
classpath is now defined in yacycore.jar Manifest file.
2018-08-22 22:35:00 +02:00
luccioman
5b60b4225f Fixed encoding of '+' character on search pages links
As revealed by issue #216
2018-08-20 18:44:04 +02:00
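The encoding issue named in commit 5b60b4225f comes from the double role of `+` in query strings: a literal `+` must be written as `%2B`, because a bare `+` is read back as a space. The sketch below shows this with `URLEncoder`; the `/yacysearch.html?query=` link format and the class name are assumptions made for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchLinkSketch {

    /** Builds a search link whose query parameter is safely form-encoded. */
    public static String searchLink(String query) {
        try {
            // URLEncoder writes spaces as '+' and literal '+' characters as "%2B"
            return "/yacysearch.html?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform
            throw new IllegalStateException(e);
        }
    }
}
```

For the query `c++ tutorial`, this produces `/yacysearch.html?query=c%2B%2B+tutorial`.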
luccioman
54fbe166ba Updated pdf cache clear steps consistently with current pdfbox version
- Removed calls to the no longer existing clearResources functions (on PDFont
class and its children) since the upgrade to pdfbox 2.n.n
- Removed hacky usage of protected internal ClassLoader function. This
removes the warnings displayed when running with JDK9 or JDK10 :

     [java] WARNING: Illegal reflective access by
net.yacy.document.parser.pdfParser$ResourceCleaner (file:<path>) to
method java.lang.ClassLoader.findLoadedClass(java.lang.String)
     [java] WARNING: Please consider reporting this to the maintainers
of net.yacy.document.parser.pdfParser$ResourceCleaner
     [java] WARNING: Use --illegal-access=warn to enable warnings of
further illegal reflective access operations
     [java] WARNING: All illegal access operations will be denied in a
future release

Crawling thousands of pdf documents from various sources after the
modifications were applied revealed no new memory leak related to pdfbox
(measurements done with JVisualVM).
2018-08-16 18:23:42 +02:00
luccioman
685122363d Added a parser for XZ compressed archives.
As suggested by LA_FORGE on mantis 781
(http://mantis.tokeek.de/view.php?id=781)
2018-08-15 10:07:39 +02:00
luccioman
4ee14ff3c5 Fixed NullPointerException case on malformed crawl queue folder name 2018-08-13 14:35:26 +02:00
luccioman
21ad9435ec Fixed crawl queue folder naming for IPv6 hosts on MS Windows filesystems
As reported by @vikulin in issue #187, crawling websites using a raw
IPv6 address as host name in their URL failed when running on Microsoft
Windows platforms (FAT32 or NTFS filesystems) when YaCy crawler created
the crawl queue folder, as the ':' character which is part of an IPv6
address is forbidden on these filesystems.
2018-08-11 10:02:26 +02:00
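The constraint described in commit 21ad9435ec (no ':' in FAT32 or NTFS file names) can be illustrated with a minimal substitution sketch. The replacement character and the names below are hypothetical; the commit message does not say which mapping YaCy actually applies.

```java
public class HostFolderNameSketch {

    /** Maps a host name to a folder name without the ':' forbidden on FAT32 and NTFS. */
    public static String folderNameFor(String host) {
        // Raw IPv6 literals such as [2001:db8::1] are the only hosts containing ':'
        return host.replace(':', '_');
    }
}
```

Regular host names pass through unchanged, so only raw IPv6 hosts get a different folder name. A real implementation would also need a way to map the folder name back to the host if the queue is reloaded from disk.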
luccioman
8a29551c54 Upgraded the OpenGeoDB dump URL
The status of the library in the DictionaryLoader_p.html page now also
notifies the user that an upgrade can be applied when an older dump is
already loaded.

Upgrade applied as suggested by Niklas Andrus @fapth_gitlab on Gitter
chat.
2018-08-03 18:39:41 +02:00
luccioman
373edf9eac Adjusted yjson Solr writer to support responses from an external Solr
Previously this worked only with responses from YaCy embedded Solr, now able
to render the response when YaCy is configured to use an external Solr
index.
2018-07-31 16:22:21 +02:00
luccioman
87bd17b1cf Simplified a little bit the RSS OpenSearch Solr writer 2018-07-31 16:02:50 +02:00