3839 Commits

Author SHA1 Message Date
reger
cb95b7339a include the HTML5 <time> tag in the content scraper:
add the "datetime" attribute of the <time> tag to the scraper's startdate list.
The value is parsed as an ISO 8601 (XML) date; HTML5 also allows partial dates
and durations, which are not handled here.
2016-12-24 03:11:35 +01:00
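The parsing rule in cb95b7339a can be illustrated with a minimal sketch based on java.time. The class and method names are hypothetical and are not taken from the YaCy scraper; the sketch only shows that full ISO 8601 values are accepted while partial dates and durations are skipped.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper, not part of YaCy: interprets the "datetime" attribute of a <time> tag.
final class TimeTagDates {

    /** Returns the calendar date of a full ISO 8601 value, or empty for anything else. */
    static Optional<LocalDate> parseDatetimeAttribute(String value) {
        if (value == null) {
            return Optional.empty();
        }
        String v = value.trim();
        try {
            // Full date-time with offset, e.g. 2016-12-24T03:11:35+01:00
            return Optional.of(OffsetDateTime.parse(v).toLocalDate());
        } catch (DateTimeParseException e) {
            // Not a date-time with offset; try a plain date next.
        }
        try {
            // Plain date, e.g. 2016-12-24
            return Optional.of(LocalDate.parse(v));
        } catch (DateTimeParseException e) {
            // Partial dates ("2016-12") and durations ("PT4H") end up here and are ignored.
            return Optional.empty();
        }
    }
}
```

A caller would add the returned date to its list of start dates only when the Optional is present.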
reger
7bf2bcf504 fix and prevent exception on missing required cookie name
skip cookie creation if name is empty.
2016-12-22 19:52:38 +01:00
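The guard described in 7bf2bcf504 could look roughly like the following sketch. It uses java.net.HttpCookie as a stand-in for the servlet cookie class, and the class and method names are hypothetical.

```java
import java.net.HttpCookie;
import java.util.Optional;

// Hypothetical helper, not YaCy code: java.net.HttpCookie stands in for the servlet cookie class.
final class CookieGuard {

    /** Returns a cookie, or empty when the required name is missing, so that no exception is raised. */
    static Optional<HttpCookie> createIfNamed(String name, String value) {
        if (name == null || name.trim().isEmpty()) {
            return Optional.empty(); // skip cookie creation instead of failing
        }
        return Optional.of(new HttpCookie(name.trim(), value));
    }
}
```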
luccioman
3ca695390c FTP crawl start URLs : applied crawl profile depth control
Applied rules :
- when the FTP URL denotes a file resource, stack it as any start URL :
any embedded links can then be followed, applying the usual depth rules
- when the FTP URL denotes a directory, list files under this directory
and stack them for crawl, and repeat the process on sub folders until
crawl depth is reached
2016-12-22 16:25:09 +01:00
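The two rules of 3ca695390c can be sketched over an in-memory directory listing. Everything here is an assumption made for illustration: the class name, the convention that directory paths end with "/", and the way depth is counted (files directly under the start directory are always stacked, and sub folders are entered while the current depth is below the limit).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not YaCy code: a directory is any path ending with "/".
final class FtpDepthSketch {

    /** Collects the file URLs to stack, starting from a file or a directory URL. */
    static List<String> stack(String start, Map<String, List<String>> listing, int maxDepth) {
        List<String> stacked = new ArrayList<>();
        if (!start.endsWith("/")) {
            stacked.add(start); // a file resource is stacked like any other start URL
        } else {
            collect(start, listing, 0, maxDepth, stacked);
        }
        return stacked;
    }

    private static void collect(String dir, Map<String, List<String>> listing,
                                int depth, int maxDepth, List<String> out) {
        for (String child : listing.getOrDefault(dir, List.of())) {
            if (!child.endsWith("/")) {
                out.add(child); // file under the current directory
            } else if (depth < maxDepth) {
                collect(child, listing, depth + 1, maxDepth, out); // descend until the depth limit
            }
        }
    }
}
```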
luccioman
128c8ef8d4 Fixed title rendering having non ASCII chars in QuickCrawlLink_p.html. 2016-12-21 08:19:09 +01:00
reger
8eb6fba59c activate filetype navigator plugin and restrict config (append) of navs
to those not already active.
DHT results are now included in the count; this might overshoot on redundant
DHT and Solr results, while the previous Solr-facet-based count was always low.
2016-12-21 02:04:13 +01:00
luccioman
c25e48e969 Enabled displaying results after 14th page for local search queries.
Fixes issue #90 for local queries only: Stealth mode, Portal mode or
Intranet mode. 
For P2P mode, the issue would probably be difficult to solve with
reasonable performance. This still needs investigation.

Also switched some InterruptedException catch log messages to warn
level as this is normal behavior when shutting down a peer.

Fixed yacysearch buttons navbar behavior to deal correctly with a total
results count or offset over 1000. Also improved the buttons navbar to
allow navigating beyond the 10th page for local queries.
2016-12-20 14:52:33 +01:00
luccioman
a3886c6adb Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-12-20 13:41:13 +01:00
luccioman
feaa87005e Improved indentation for easier debugging steps. 2016-12-20 08:27:17 +01:00
reger
bab4804d11 add FileTypeNavigator plugin 2016-12-19 23:56:03 +01:00
reger
d35c47090c remove obsolete put of HttpServletRequest attributes to YaCy servlet
parameters on SSI (server side includes).
Query parameters are already merged by dispatcher.include, making the copy
of the parameter (RequestDispatcher.INCLUDE_QUERY_STRING) obsolete.
All other parameters are not used as YaCy servlet arguments.
2016-12-19 02:30:55 +01:00
reger
0959038624 correct DefaultServlet resource pathinContext calculation
exclude the servletPath option, as resources are always relative to htroot
or htdocs; the change reflects this.
In theory, this and the recent adjustments regarding relative URLs
allow the instance to be configured under a path other than
root (/)
2016-12-18 21:11:00 +01:00
reger
c50e23c495 reduce creation of empty legacy RequestHeader() in situations where null
is acceptable (less work for garbage collection).
2016-12-18 02:38:43 +01:00
reger
87f6631a2a adjust Cache getHeader to prev. changes/commit 2016-12-18 01:02:56 +01:00
reger
6be7339b1d remove the overhead of unused reverseMappingCache of HeaderFramework / RequestHeader 2016-12-18 00:57:47 +01:00
reger
c702eb6786 del dead menu link to /repository
(directory not created in current distribution -> old)
2016-12-17 02:38:52 +01:00
reger
baa5d9b9e3 adjust DomainHandler working on resolved .yacy domain
(remove obsolete check for path on hostname)
2016-12-17 01:33:00 +01:00
luccioman
1ba705c23d Use loaderDispatcher instead of HTTPClient to download releases.
The default redirection strategy when using directly HTTPClient is
incorrect when redirection is cross host (the original Host header is
still sent when requesting the redirected location).

YaCy LoaderDispatcher handles redirections properly, thus release
archive files using redirected URLs (such as the URLs on a GitHub
Release page) are successfully downloaded.
2016-12-16 20:38:54 +01:00
luccioman
467650c042 Hardened system update checks.
When a downloaded archive release is corrupted, empty, or can not be
opened for any reason, the update script must not be launched because it
erases the existing lib/*.jar libraries.
2016-12-16 11:03:09 +01:00
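A check in the spirit of 467650c042 can be sketched as follows. It assumes the release archive is gzip-compressed and only verifies that the stream can be read to the end; the class and method names are hypothetical and the commit's actual checks may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical pre-update check, not YaCy code.
final class ReleaseArchiveCheck {

    /** Returns true only if the file exists, is not empty and decompresses without error. */
    static boolean isReadableGzip(Path archive) {
        try {
            if (!Files.isRegularFile(archive) || Files.size(archive) == 0) {
                return false;
            }
            try (InputStream in = new GZIPInputStream(Files.newInputStream(archive))) {
                byte[] buffer = new byte[8192];
                long total = 0;
                int read;
                // Reading to the end also triggers the gzip trailer (CRC) verification.
                while ((read = in.read(buffer)) != -1) {
                    total += read;
                }
                return total > 0; // an archive with no payload is treated as unusable
            }
        } catch (IOException e) {
            return false; // corrupted, truncated or unreadable: do not launch the update script
        }
    }
}
```

The update script would then be started only when this method returns true, so the existing lib/*.jar libraries are left untouched otherwise.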
luccioman
b5711b8fe1 Added some Javadocs. 2016-12-16 10:43:00 +01:00
reger
0d2964cf2b expanded error message on rejected crawl url due to failed DNS lookup
close of http://mantis.tokeek.de/view.php?id=678
2016-12-15 23:59:50 +01:00
luccioman
00e81fcc15 Check HTTP status when downloading a release, and report any error. 2016-12-15 15:30:36 +01:00
reger
0758c868c9 add HostNavigator plugin 2016-12-13 22:14:16 +01:00
reger
60160877f5 bundle initialization of search navigation plugins in separate handler
class to allow to use navigator map in config servlets 
(without need to create a search event)
2016-12-11 21:46:29 +01:00
reger
3151cda3a5 catch ip-format exception on wrong server access setting ip filter
as reported in http://mantis.tokeek.de/view.php?id=713
to prevent abort of initialization.
This jetty/whitelist ipaccesshandler currently accepts only IPv4
2016-12-11 04:43:36 +01:00
reger
b32bcdf344 list entries in outgoing cookie monitor one per line
for easier readability.
For this adjust outgoingCookies entry to use Cookie[] instead of String[]
2016-12-10 22:08:09 +01:00
reger
3f32262654 enable getCookies for HeaderFramework reusing Jetty CookieCutter 2016-12-09 00:33:20 +01:00
reger
4186ee6fc0 add other custom response header entries set by servlets to the response
sent to the client (not only cookies). Some servlets use this mainly to
set the "Access-Control-Allow-Origin" header. Added a contains check to make
sure no header set by DefaultServlet is overwritten.
2016-12-07 00:51:07 +01:00
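The contains check mentioned in 4186ee6fc0 amounts to merging custom headers without replacing existing ones. A minimal sketch, assuming headers are held in plain maps and compared case-insensitively (the class name is hypothetical):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not YaCy code: headers are modelled as simple name/value maps.
final class HeaderMerge {

    /** Adds custom headers to the response headers, keeping every header that is already set. */
    static Map<String, String> merge(Map<String, String> response, Map<String, String> custom) {
        // HTTP header names are case-insensitive, hence the case-insensitive ordering.
        Map<String, String> result = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        result.putAll(response);
        for (Map.Entry<String, String> entry : custom.entrySet()) {
            result.putIfAbsent(entry.getKey(), entry.getValue()); // never overwrite an existing header
        }
        return result;
    }
}
```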
luccioman
d27adc2b92 Fixed language detector initialization and NullPointerException cases.
NullPointerException occurred when using an Identificator instance
which encountered an error in its constructor.
This error could be caused by a missing "langdetect" folder in the
current folder of the main process, or by simultaneous first calls to
the constructor, initializing concurrently the DetectorFactory.langlist.

Fixes the mantis 714 (http://mantis.tokeek.de/view.php?id=714)
2016-12-05 18:12:21 +01:00
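The concurrency hazard described in d27adc2b92 (simultaneous first calls initializing shared state) is commonly avoided with guarded one-time initialization. The sketch below uses double-checked locking on a volatile field; it is a generic illustration with hypothetical names, not necessarily the approach taken in the commit.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not YaCy code: shared language profiles loaded exactly once.
final class LazyProfiles {

    /** Counts how often the expensive load actually ran (for demonstration only). */
    static final AtomicInteger LOADS = new AtomicInteger();

    private static volatile List<String> profiles;

    static List<String> get() {
        List<String> current = profiles;
        if (current == null) {
            synchronized (LazyProfiles.class) {
                current = profiles; // re-check: another thread may have finished loading meanwhile
                if (current == null) {
                    current = load();
                    profiles = current;
                }
            }
        }
        return current;
    }

    private static List<String> load() {
        LOADS.incrementAndGet();
        return List.of("de", "en", "fr"); // placeholder for reading profile files
    }
}
```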
luccioman
a1f922b34a Reduced locations vocabulary memory footprint.
Reduced this vocabulary memory usage :
 - by using only one map term2entries instead of two maps having the
same key set
 - by generating the location object links on the fly using the
GeoLocation data instead of storing many duplicates of string prefix
"http://www.openstreetmap.org/?lat="
 
Measurements with VisualVM and GeoNames 0 enabled (cities with a
population > 1000) :
 - AutotaggingLibrary retained size :
  - initial : 309 718 763 bytes
  - after refactoring : 159 224 641 bytes
2016-12-05 10:57:37 +01:00
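The second point of a1f922b34a, building the link on demand instead of storing one string per location, can be sketched like this. The class name and the "&lon=" part of the URL are assumptions; only the prefix appears in the commit message.

```java
// Hypothetical sketch, not YaCy code: only coordinates are stored per location.
final class LocationLink {

    // Shared once for all instances instead of being duplicated inside every stored URL.
    private static final String PREFIX = "http://www.openstreetmap.org/?lat=";

    private final double lat;
    private final double lon;

    LocationLink(double lat, double lon) {
        this.lat = lat;
        this.lon = lon;
    }

    /** Builds the link on the fly; the "&lon=" parameter name is an assumption. */
    String url() {
        return PREFIX + lat + "&lon=" + lon;
    }
}
```

The trade-off is a small amount of string concatenation per access in exchange for not retaining one long string per location.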
reger
9c06e752e4 allow request.setAttribute w/o "not implemented" exception by default
skip unused CONNECTION_PROP_ARGS check in getQueryString
2016-12-05 03:15:24 +01:00
reger
59ab42e7d6 add UserDB lastaccess update calls on login 2016-12-04 00:58:45 +01:00
luccioman
bf8a6d9848 Reduced GeoNames locations memory footprint.
Using String instead of StringBuilder instances in GeonamesLocation
allows reusing the same immutable objects in the Tagging class.

Measurements with VisualVM and GeoNames 0 enabled (cities with a
population > 1000) :
 - OverarchingLocation retained size :
  - initial : 164 666 830 bytes
  - after refactoring : 97 736 804 bytes
 - AutotaggingLibrary retained size :
  - initial : 354 713 633 bytes
  - after refactoring : 309 718 763 bytes
2016-12-03 09:05:19 +01:00
luccioman
3f561c1635 Fixed a NullPointerException case.
Could occur when a search request was performed just after peer startup,
and the Switchboard Thread "LibraryProvider.initialize" had not yet completed,
thus requesting a ProbabilisticClassifier not completely initialized
(and having a null contexts property).
2016-12-02 13:45:45 +01:00
luccioman
6bc2bf1aa4 Small memory footprint reduction for GeonamesLocation.
Reusing the same geonameid Integer instance between `id2loc` and
`name2ids` maps reduces (a little) memory footprint.
Measured OverarchingLocation class retained memory with VisualVM on
openJDK 8 :
 - initial : 183 439 490 bytes
 - after refactoring : 164 666 830 bytes
2016-12-02 13:12:47 +01:00
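The idea in 6bc2bf1aa4 is that autoboxing the same int twice can create two distinct Integer objects, so boxing once and sharing the reference saves one object per entry. A minimal sketch, reusing the map names from the commit message but with hypothetical value types:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not YaCy code: value types are simplified to String.
final class GeoIndexSketch {

    final Map<Integer, String> id2loc = new HashMap<>();
    final Map<String, List<Integer>> name2ids = new HashMap<>();

    void add(int geonameId, String name) {
        Integer id = Integer.valueOf(geonameId); // boxed once
        id2loc.put(id, name);
        // The same Integer instance is shared instead of boxing geonameId a second time.
        name2ids.computeIfAbsent(name, key -> new ArrayList<>()).add(id);
    }
}
```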
luccioman
7f846ef674 Small complementary memory footprint improvement for synonyms.
Memory footprint measured with VisualVM and all synonyms enabled :
 - before : 195 015 914 bytes
 - after : 192 548 826 bytes
2016-11-30 17:49:51 +01:00
luccioman
568e3dde6a Improved synonyms memory footprint.
The idea is to avoid unnecessary duplication of String objects for the same
words. Particularly efficient with the large moby thesaurus.

Memory footprint measurements with VisualVM :
 - openthesaurus_de_yacy :
 	- initial : 19 443 796 bytes
 	- after refactoring : 18 012 606 bytes

 - mobythesaurus_en_yacy :
 	- initial : 343 453 904 bytes
 	- after refactoring : 173 843 780 bytes

 - thesaurus_ru_yacy :
 	- initial : 3 800 706 bytes
 	- after refactoring : 3 466 612 bytes

 - de + en + ru : 
 	- initial : 366 603 450 bytes
 	- after refactoring : 195 015 914 bytes
2016-11-30 16:50:25 +01:00
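Avoiding duplicate String objects for equal words, as described in 568e3dde6a, is typically done with a canonicalizing pool. The sketch below is a generic version with hypothetical names; the commit's actual data structures are not shown in the log.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not YaCy code: keeps one String instance per distinct word.
final class WordPool {

    private final Map<String, String> pool = new HashMap<>();

    /** Returns the pooled instance equal to the given word, adding it on first sight. */
    String canonical(String word) {
        String previous = pool.putIfAbsent(word, word);
        return previous == null ? word : previous;
    }
}
```

Every synonym list then stores the canonical reference, so a word that appears in many lists, as is common in a large thesaurus, is held in memory only once.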
reger
60b3adfb43 fix ext2mime to return given default on input=null 2016-11-29 23:32:20 +01:00
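The behavior fixed in 60b3adfb43 can be illustrated with a small lookup that returns the caller's default for a null extension. The lookup table and class name are hypothetical; only the method name comes from the commit message.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not YaCy code: a tiny illustrative table instead of the real mapping.
final class MimeLookup {

    private static final Map<String, String> TABLE = Map.of(
            "html", "text/html",
            "png", "image/png");

    /** Returns the MIME type for the extension, or the given default for null or unknown input. */
    static String ext2mime(String ext, String dflt) {
        if (ext == null) {
            return dflt; // null input yields the given default
        }
        return TABLE.getOrDefault(ext.toLowerCase(Locale.ROOT), dflt);
    }
}
```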
reger
f7e9f9be5f move Digest auth checks from DefaultServlet to adminAuthenticated,
eliminating the need to modify http header on Servlet container handled 
Digest authentication, to simulate Basic auth for YaCy servlets.
2016-11-29 03:20:33 +01:00
luccioman
cca3417b87 Fixed image and favicon viewing for unauthenticated local requests.
As reported by @reger24, image and favicon viewing was broken with
unauthenticated requests on peers configured to require authentication
even from localhost.

So I unified viewing rights check in a single new function on
ImageViewer class.
2016-11-28 22:10:05 +01:00
reger
02092de3d8 remove login cookie generation for static admin in User servlet
cookieAuth is never successful for static admin, making the creation and
handling of login cookies for static admin obsolete.
2016-11-26 23:28:30 +01:00
luccioman
fc575fc760 Fixed a NullPointerException case. 2016-11-25 11:07:50 +01:00
reger
9a8691129f fix typing error from commit 60ba5c117c 2016-11-25 04:18:36 +01:00
reger
f9328f07e2 completing the usage of CONNECTION_PROP_CLIENT_HTTPSERVLETREQUEST in
HTTPDProxyHandlers logging facility.
2016-11-25 03:05:53 +01:00
reger
8e3e3ed191 update the older ResponseHeader patch to handle cookies,
to work directly with javax.servlet.http.Cookie (rename headerProps to
cookieStore as is only used for this).
(Re)implement set-cookie in DefaultServlet to make cookieAuthentication
work as designed.
2016-11-25 02:00:20 +01:00
reger
866d3a1960 make RequestHeader login succeed (without throwing exception by default)
correct getAuthType to return Auth Scheme only after authentication
2016-11-25 01:09:42 +01:00
reger
44a6a4e795 fix authentication by hit in userdb (wrong parameter) 2016-11-24 00:16:22 +01:00
luccioman
aa9ddf3c23 Added control over Robots.txt active threads maximum number.
When starting a crawl from a file containing thousands of links,
configuration setting "crawler.MaxActiveThreads" is effective to prevent
saturating the system with too many outgoing HTTP connections threads
launched by the crawler.
But robots.txt was not affected by this setting and was indefinitely
increasing the number of concurrently loading threads until most of the
connections timed out.

To improve performance control, added a pool of threads for Robots.txt,
consistently used in its ensureExist() and massCrawlCheck() methods.
The Robots.txt threads pool max size can now be configured in the
/PerformanceQueues_p.html page, or with the new
"robots.txt.MaxActiveThreads" setting, initialized with the same default
value as the crawler.
2016-11-23 18:13:05 +01:00
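The effect of a bounded pool, as introduced for Robots.txt in aa9ddf3c23, can be sketched with a fixed-size executor from java.util.concurrent. The class and method names are hypothetical; the sketch also records the peak number of simultaneously running tasks to show that the limit holds.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not YaCy code: runs load tasks on a pool with a fixed upper bound.
final class BoundedLoader {

    /** Runs all tasks on at most maxActiveThreads threads and returns the peak concurrency seen. */
    static int runAll(List<Runnable> tasks, int maxActiveThreads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(maxActiveThreads);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (Runnable task : tasks) {
            // Tasks beyond the pool size wait in the queue instead of spawning new threads.
            pool.execute(() -> {
                int now = active.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    task.run();
                } finally {
                    active.decrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return peak.get();
    }
}
```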
luccioman
3092a8ced5 Fixed thread name consistency for improved monitoring.
Some tasks were modifying the current thread name without restoring it
once finished, as is done elsewhere.
2016-11-23 17:59:52 +01:00
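Restoring the thread name after a task, as 3092a8ced5 describes, is usually done in a finally block so that it also happens when the task throws. A minimal sketch with hypothetical names:

```java
// Hypothetical sketch, not YaCy code.
final class ThreadNames {

    /** Runs the task under a temporary thread name and always restores the previous name. */
    static void runNamed(String temporaryName, Runnable task) {
        Thread current = Thread.currentThread();
        String previousName = current.getName();
        current.setName(temporaryName);
        try {
            task.run();
        } finally {
            current.setName(previousName); // restored even if the task fails
        }
    }
}
```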
luccioman
eec5779889 Added a name prefix to pooled threads for easier monitoring.
Using JVM monitoring tools, it is then easier to identify tasks running
inside a thread pool with a custom prefix rather than the generic one :
"pool-".
2016-11-23 11:21:14 +01:00
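A name prefix for pooled threads, as in eec5779889, can be supplied through a custom java.util.concurrent.ThreadFactory. The class name and the numbering scheme below are assumptions made for illustration.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not YaCy code: names threads "<prefix>-<n>" instead of "pool-<n>-thread-<m>".
final class NamedThreadFactory implements ThreadFactory {

    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger();

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        return new Thread(task, prefix + "-" + counter.incrementAndGet());
    }
}
```

Such a factory can be passed to, for example, Executors.newFixedThreadPool(int, ThreadFactory), so that the pool's threads show up under the chosen prefix in monitoring tools.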
reger
59130777a6 add high scored items first to YearNavigator (to make sure to be included
in sorted view)
2016-11-22 01:17:33 +01:00