Commit Graph

8315 Commits

Author SHA1 Message Date
reger
6be7339b1d remove the overhead of unused reverseMappingCache of HeaderFramewor / RequestHeader 2016-12-18 00:57:47 +01:00
reger
c702eb6786 del dead menu link to /repository
(directory not created in current distribution -> old)
2016-12-17 02:38:52 +01:00
reger
baa5d9b9e3 adjust DomainHandler working on resolved .yacy domain
(remove obsolete check for path on hostname)
2016-12-17 01:33:00 +01:00
luccioman
1ba705c23d Use loaderDispatcher instead of HTTPClient to download releases.
The default redirection strategy when using directly HTTPClient is
incorrect when redirection is cross host (the original Host header is
still sent when requesting the redirected location).

YaCy LoaderDispatcher handles redirections properly, thus release
archive files using redirected URLs (such as the URLs on a GitHub
Release page) are successfully downloaded.
2016-12-16 20:38:54 +01:00
luccioman
467650c042 Hardened system update checks.
When a downloaded archive release is corrupted, empty, or can not be
opened for any reason, the update script must not be launched because it
erases the existing lib/*.jar libraries.
2016-12-16 11:03:09 +01:00
luccioman
b5711b8fe1 Added some Javadocs. 2016-12-16 10:43:00 +01:00
reger
0d2964cf2b expanded error message on rejected crawl url due to faile dns lookup
close of http://mantis.tokeek.de/view.php?id=678
2016-12-15 23:59:50 +01:00
luccioman
00e81fcc15 Check HTTP status when downloading a release, and report eventual error. 2016-12-15 15:30:36 +01:00
reger
0758c868c9 add HostNavigator plugin 2016-12-13 22:14:16 +01:00
reger
60160877f5 bundle initialization of search navigation plugins in separate handler
class to allow to use navigator map in config servlets 
(without need to create a search event)
2016-12-11 21:46:29 +01:00
reger
3151cda3a5 catch ip-format exception on wrong server access setting ip filter
as reported in http://mantis.tokeek.de/view.php?id=713
to prevent abort of initialization.
This jetty/whitelist ipaccesshandler accepts currently only ipv4
2016-12-11 04:43:36 +01:00
reger
b32bcdf344 list entries in outgoing cookie monitor one per line
for easier readability.
For this adjust outgoingCookies entry to use Cookie[] instead of String[]
2016-12-10 22:08:09 +01:00
reger
3f32262654 enable getCookies for HeaderFramework reusing Jetty CookieCutter 2016-12-09 00:33:20 +01:00
reger
4186ee6fc0 add other custom response header entries set by servlets to the response
to the client (not cookies only). This is used by some servlets to mainly 
set "Access-Control-Allow-Origin" header. Added a contains check to be
sure no header set by Defaultservlet is overwritten.
2016-12-07 00:51:07 +01:00
luccioman
d27adc2b92 Fixed language detector initialization and NullPointerException cases.
NullPointerException occurred when using and Identificator instance
which encountered and error in its constructor.
This error could be caused by a missing "langdetect" folder in the
current folder of the main process, or by simultaneous first calls to
the constructor, initializing concurrently the DetectorFactory.langlist.

Fixes the mantis 714 (http://mantis.tokeek.de/view.php?id=714)
2016-12-05 18:12:21 +01:00
luccioman
a1f922b34a Reduced locations vocabulary memory footprint.
Reduced this vocabulary memory usage :
 - by using only one map term2entries instead of two maps having the
same key set
 - by generating the location object links on the fly using the
GeoLocation data instead of storing many duplicates of string prefix
"http://www.openstreetmap.org/?lat="
 
Measurements with VisualVM and GeoNames 0 enabled (cities with a
population > 1000) :
 - AutotaggingLibrary retained size :
  - initial : 309 718 763 bytes
  - after refactoring : 159 224 641 bytes
2016-12-05 10:57:37 +01:00
reger
9c06e752e4 allow request.setAttribute w/o "not implemented" exception by default
skip unused CONNECTION_PROP_ARGS check in getQueryString
2016-12-05 03:15:24 +01:00
reger
59ab42e7d6 add UserDB lastaccess update calls on login 2016-12-04 00:58:45 +01:00
luccioman
bf8a6d9848 Reduced GeoNames locations memory footprint.
Using String instead of StringBuilder instances in GeonamesLocation
allows to reuse the same immutable objects in the Tagging class.

Measurements with VisualVM and GeoNames 0 enabled (cities with a
population > 1000) :
 - OverArchingLocation retained size :
  - initial : 164 666 830 bytes
  - after refactoring : 97 736 804 bytes
 - AutotaggingLibrary retained size :
  - initial : 354 713 633 bytes
   - after refactoring : 309 718 763 bytes
2016-12-03 09:05:19 +01:00
luccioman
3f561c1635 Fixed a NullPointerException case.
Could occur when a search request was performed just after peer startup,
and the Switchboard Thread "LibraryProvider.initialize" had completed,
thus requesting a ProbabilisticClassifier not completely initialized
(and having a null contexts property).
2016-12-02 13:45:45 +01:00
luccioman
6bc2bf1aa4 Small memory footprint reduction for GeonamesLocation.
Reusing the same geonameid Integer instance between `id2loc` and
`name2ids` maps reduces (a little) memory footprint.
Measured OverarchigLocation class retained memory with VisualVM on
openJDK 8 :
 - initial : 183 439 490 bytes
 - after refactoring : 164 666 830 bytes
2016-12-02 13:12:47 +01:00
luccioman
7f846ef674 Small complementary memory footprint improvement for synonyms.
Memory footprint measured with VisualVM and all synonyms enabled :
 - before : 195 015 914 bytes
 - after : 192 548 826 bytes
2016-11-30 17:49:51 +01:00
luccioman
568e3dde6a Improved synonyms memory footprint.
The idea is to avoid unnecessary String objects duplication for the same
words. Particularly efficient with the large moby thesaurus.

Memory footprint measurements with VisualVM :
 - openthesaurus_de_yacy :
 	- initial : 19 443 796 bytes
 	- after refactoring : 18 012 606 bytes

 - mobythesaurus_en_yacy :
 	- initial : 343 453 904 bytes
 	- after refactoring : 173 843 780 bytes

 - thesaurus_ru_yacy :
 	- initial : 3 800 706 bytes
 	- after refactoring : 3 466 612 bytes

 - de + en + ru : 
 	- initial : 366 603 450 bytes
 	- after refactoring : 195 015 914 bytes
2016-11-30 16:50:25 +01:00
reger
60b3adfb43 fix ext2mime to return given default on input=null 2016-11-29 23:32:20 +01:00
reger
f7e9f9be5f move Digest auth checks from DefaultServlet to adminAuthenticated,
eliminating the need to modify http header on Servlet container handled 
Digest authentication, to simulate Basic auth for YaCy servlets.
2016-11-29 03:20:33 +01:00
luccioman
cca3417b87 Fixed image and favicon viewing for unauthenticated local requests.
As reported by @reger24, image and favicon viewing was broken with
unauthenticated requests on peers configured to require authentication
even from localhost.

So I unified viewing rights check in a single new function on
ImageViewer class.
2016-11-28 22:10:05 +01:00
reger
02092de3d8 remove login cookie generation for static admin ind User servlet
cookieAuth is never successful for static admin, leaving the creation and
handling for login cookies for static admin obsolete.
2016-11-26 23:28:30 +01:00
luccioman
fc575fc760 Fixed a NullPointerException case. 2016-11-25 11:07:50 +01:00
reger
9a8691129f fix typing error from commit 60ba5c117c 2016-11-25 04:18:36 +01:00
reger
f9328f07e2 completing the usage of CONNECTION_PROP_CLIENT_HTTPSERVLETREQUEST in
HTTPDProxyHandlers logging facility.
2016-11-25 03:05:53 +01:00
reger
8e3e3ed191 update the older ResponseHeader patch to handle cookies,
to work directly with javax.servlet.http.Cookie (rename headerProps to
cookieStore as is only used for this).
(Re)implement set-cookie in DefaultServlet to make cookieAuthentication
work as designed.
2016-11-25 02:00:20 +01:00
reger
866d3a1960 make RequestHeader login succeed (without throwing exception by default)
correct getAuthType to return Auth Scheme only after authentication
2016-11-25 01:09:42 +01:00
reger
44a6a4e795 fix authentication by hit in userdb (wrong parameter) 2016-11-24 00:16:22 +01:00
luccioman
aa9ddf3c23 Added control over Robots.txt active threads maximum number.
When starting a crawl from a file containing thousands of links,
configuration setting "crawler.MaxActiveThreads" is effective to prevent
saturating the system with too many outgoing HTTP connections threads
launched by the crawler.
But robots.txt was not affected by this setting and was indefinitely
increasing the number of concurrently loading threads until most ot the
connections timed out.

To improve performance control, added a pool of threads for Robots.txt,
consistently used in its ensureExist() and massCrawlCheck() methods.
The Robots.txt threads pool max size can now be configured in the
/PerformanceQueus_p.html page, or with the new
"robots.txt.MaxActiveThreads" setting, initialized with the same default
value as the crawler.
2016-11-23 18:13:05 +01:00
luccioman
3092a8ced5 Fixed thread name consistency for improved monitoring.
Some tasks were modifying the current thread name without restoring it
once finished as it is effectively done elsewhere.
2016-11-23 17:59:52 +01:00
luccioman
eec5779889 Added a name prefix to pooled threads for easier monitoring.
Using JVM monitoring tools, it is then easier to identify tasks running
inside thread pool with a custom prefix rather than the generic one :
"pool-".
2016-11-23 11:21:14 +01:00
reger
59130777a6 add high scored items first to YearNavigator (to make sure to be included
in sorted view)
2016-11-22 01:17:33 +01:00
luccioman
0ba5a838f7 Added charset meta to Solr HTML writers.
Non-ASCII characters are thus correctly rendered in browsers.
This is a fix ro mantis 706 : http://mantis.tokeek.de/view.php?id=706
2016-11-21 19:55:13 +01:00
reger
08a0acc35d make a YearNavigator availabel, useable as SearchEvent.naviator plugin.
It can take any Date field of the index and displays a list of year strings
in reverse order by the year (not the score/count).
To allow to define the index field to use, the fieldname (and title can be 
appended to the navi's name "year" e.g. year:load_date_dt:LoadDate
It works also with dates_in_content_dts field (from the graphical date
navigator). Here the query parameter from: to: are used on selection as
Query modifier (for other dates currently no query parameter available, so
selection won't work to filter search results).
Not included in the UI Searchpage layout config so far (for experiment with
it manual change to conf needed).
2016-11-21 16:52:53 +01:00
reger
7742579ca4 make a LanguageNavigator availabel, useable for the SearchEvent.naviator
plugin (not activated yet).
2016-11-21 00:19:11 +01:00
reger
0d3bef659b implement RequestHeader.setCharacterEncoding for legacy header,
make sure .getProtocol returns a http version
move the patch for Set-Cookie to ResponseHeader (applies only here)
2016-11-20 19:30:38 +01:00
Michael Peter Christen
5320209963 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-11-20 12:13:07 +01:00
Michael Peter Christen
83f5e3d715 added+disabled a federate search experiment 2016-11-20 12:13:00 +01:00
reger
4eeb448eb3 use DigestURL in UrlProxyServlet as parameter to pass requested url to
handler.
UrlProxyServlet splits url in parts to pass it on as parameter and 
HeaderFramework constructs a url from param parts. This is obsolete if
already created url is used (makes HeaderFramework.getRequestURL obsolete
= removed)
2016-11-20 05:27:33 +01:00
reger
bad8f87998 remove old/obsolete clear text "adminAccount" credential entry from init
and setConfig (.,empty) from servlets/code
2016-11-20 00:20:47 +01:00
reger
811cf637f8 fix Jetty9YaCySecurityHandler, length check of Basic credential,
add comment to SwitchboardConstants.AdminAccount const
2016-11-20 00:15:21 +01:00
reger
fdcf33f08f fix Domain.stripToHostName for some IPv6 cases
add unit test for it
2016-11-19 16:37:16 +01:00
reger
ac6e198bd1 add unit test for Domains.stripToPort,
simplify ipv6 check
2016-11-19 06:22:55 +01:00
reger
f27531f5ec fix Domains.stripToPort, make ipv6 save 2016-11-19 05:56:48 +01:00
reger
67744a8038 fix HeaderFramework.getRequestURL on host with port considering ipv6 host 2016-11-19 05:08:11 +01:00
reger
66cc0dd173 refactor: move GSA specific date formatter to GSAservlet
adjust return type to String for HeaderFrameWork.getSingle
2016-11-19 03:29:23 +01:00
reger
d525967999 refactor: move convertHeaderFromJetty to ProxHandler (only used with active proxy
not needed for standard servlets)
2016-11-19 02:10:21 +01:00
reger
60ba5c117c fix legacy getHeaderCookies to work with cookies from original
HttpServletRequest, by moving to RequestHeader.
2016-11-18 21:44:16 +01:00
reger
30f8d1e2d7 let RequestHeader.logout succeed w/o throwing exception by default 2016-11-18 03:00:30 +01:00
reger
28afd3a2f8 fix UserDB.proxyAuth from header string
(take care of prefix "Basic" in header entry)
2016-11-18 01:12:14 +01:00
luccioman
0806de8fdc Ensure file input stream are closed in both normal and error cases. 2016-11-16 15:13:58 +01:00
luccioman
a0dfbaca6a FileUtils : added some JavaDocs and unit test cases 2016-11-16 15:12:21 +01:00
reger
59448461d3 make use of userInRole for quick login verification 2016-11-16 01:42:53 +01:00
reger
2a4d826d9e adjust servlet RequestHeader.getLocale
init jvm defaultLocale matching UI language
2016-11-15 02:37:03 +01:00
reger
9db68acb4f remove obsolete X_YACY... header declarations
not in use (no writes, only remove and try to read).
Obsolete parameter setupHttpClient
2016-11-14 22:53:26 +01:00
reger
8e9aece786 more use of RequestHeader constant referer, authorization
in Jetty9YaCySecurityHandler
2016-11-14 04:49:30 +01:00
reger
d631fbc019 make more use of the new ServletRequest interface methodes
getScheme, getServerPort (in QuickCrawlLink_p & YaCyDefaultServlet)
2016-11-14 03:01:15 +01:00
reger
395f2e8946 Make ServletRequest implement the standardized HttpServletRequest interface,
to make all readily available information from the original ServletRequest
available to YaCy servlets (without converting data to internal structures).
The implementation of the common interface allows easier integration of
YaCy servlets with the servlet standard (e.g. shared login service with
the servlet container etc.)
2016-11-14 01:37:16 +01:00
luccioman
74fec066f4 Converted more URLs to pure relative ones.
Easier YaCy peer configuration behind a reverse proxy subfolder : no
need for the reverse proxy to rewrite HTML links or URLs in css files.

Tested on Debian Jessie with an apache2 reverse proxy.

See related mantis issues http://mantis.tokeek.de/view.php?id=106 and
http://mantis.tokeek.de/view.php?id=701
2016-11-12 10:51:54 +01:00
luccioman
0f0393e5e3 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-11-09 02:40:52 +01:00
luccioman
7296e3884f Switched even more URLs to pure relative ones.
Thus a YaCy peer can run behind a reverse proxy subfolder without need
for the reverse proxy to rewrite HTML links (a CPU costly operation).

Tested on Debian Jessie with an apache2 reverse proxy.

See related mantis issues http://mantis.tokeek.de/view.php?id=106 and
http://mantis.tokeek.de/view.php?id=701
2016-11-09 02:40:33 +01:00
reger
49eae79c01 fix Tables.hasIndex check for tablename = key
apply same functionality to hasHeap (to not create new table on call hasHeap)
2016-11-09 02:33:42 +01:00
luccioman
84b81c1af0 Switched more URLs to relative ones when possible.
This permits an easier and more flexible reverse proxy configuration.
Some related mantis issues : http://mantis.tokeek.de/view.php?id=106 and
http://mantis.tokeek.de/view.php?id=701
2016-11-08 03:05:51 +01:00
luccioman
731684105a Improved absolute URLs rendering in OpenSearch desc and RSS feeds.
When the peer is behind a reverse proxy providing SSL/TLS encryption,
the rendered absolute URLs should start with https when the user browser
requested https : added limited support to the X-Forwarded-Proto HTTP
header notably provided on Heroku platform.
Also added some unit tests.
2016-11-08 02:39:45 +01:00
reger
669f60223e upd Column.toString to output encoder "{bytes}"
used for String and binary Column types
2016-11-06 21:02:58 +01:00
reger
c9e81d2fa0 fix Column parsing from celldefinition string, without cellwidth def.
(outofbound exception)
2016-11-06 03:34:24 +01:00
reger
e0816ef2e5 use human readable date format in CrawlStacker error message
"double in: local index, oldDate = "
2016-11-05 19:40:14 +01:00
luccioman
54d879a9b3 Generate HTML relative (to each peer) links from hosted WikiCode.
When WikiCode inserted in a peer hosted Blog, Wiki, Messages or Profile
contains relative links (images or any content, hosted in DATA/HTDOCS),
it is more reliable to keep these links relative, especially when the
peer is behind any kind of reverse Proxy.
2016-11-04 11:21:20 +01:00
luccioman
2da5f339f8 Fixed /News.html and /Wiki.html pages in Search Portal mode (issue #87).
Also fixes theses pages rendering when the peer is not online.

Re-factored code in common with /opensearchdescription.xml and
ConfigPortal.html.
2016-11-03 02:33:36 +01:00
reger
8fe28a83f2 harmonize used lastmodified date for rwi and fulltext in storeDocument 2016-11-02 03:43:39 +01:00
reger
3d1d297308 refactor namespace navigator as part of navigatorplugin map, this allows
the navigator to include counts all matches (rwi+fulltext).
Fixing also unresolved_pattern in navigators title (of the counter)
The use of inurl: query modifier as filter has not been changed keeping
it as soft (unsharp) filter facet.

Upd StringNavigator to prevent empty string form multivalued solr fields,
removed date value conversion (better handled elsewhere, not need here).
2016-11-01 04:38:47 +01:00
reger
67f660523b Make navigators underlaying indexfield name accessible in interface
use interface in declaration and extend facet check to include navigator
field.
2016-10-31 18:42:23 +01:00
reger
5eb3ee4e20 Add search navigator interface to allow for additional navigators (plugins)
Prepared the first basic navigators (for authors and collections) for the
list of SearchEvent.navigatorPlugins and adjusted servlet to use these.
- this allows to configure display order of these navigators (by ordering config string)
- eventually allows for additional and/or custom navigators using any
available index field without need for changing servlets
- the Collection navigation has been adjusted to exclude the internal, 
default robot_*  and dht collections from displaying
- rwi results are now also checked for navigatior by the refactored navi's

So far no config options were added to customize or add navigators (may
come later if route of upcoming modularization/plugin system is defined).
2016-10-31 02:17:43 +01:00
reger
fd3f58fcaa improve query modifier parsing of "collection:" and possible collision
with "on:" in case multiple collection modifier were entered (by mistake)
http://mantis.tokeek.de/view.php?id=702
2016-10-31 00:43:01 +01:00
reger
af39a76bf6 Reduce number of default max. search navigator lines (from 10000)
to 100 + make it configurable
2016-10-29 04:19:46 +02:00
reger
20a1b29ed3 add simple test case for ReferenceContainer helpful for debugging
calculated ranking parameter
2016-10-26 01:38:40 +02:00
reger
3c7220bc7b Refacture rwi reference word position and word distance calculation
used for rwi ranking.
Main changes:  
- introduce a  posintext() to access the stored value. This reduces also mem alloc of position array for WordReferenceRow (index access)
- use the positions() array for joined references on multi-word queries if needed (otherwise allow positions() to be null
- adjust assignments and the min() max() and distance() calculation accordingly
2016-10-23 19:40:02 +02:00
luccioman
f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
This makes threads monitoring easier to read.
2016-10-22 17:17:21 +02:00
luccioman
db3b9db9c2 Crawl from local file : faster task end when manually terminating crawl. 2016-10-22 09:11:20 +02:00
reger
4c67ed3f8d catch rwi ranking div by zero exception
during rwi search result processing worddistance calculation is effected 
by concurrent update (normalization) of min/max ranking parameter for
wordpositions. On update of min/max the exception is raised in distance calc
and now catched. 
This concurrent update and change of ranking results is needed for speed
but should be further checked for optimization
2016-10-22 00:53:47 +02:00
luccioman
47af33a04c Advanced Crawl from local file : better processing of large files.
Applied strategy : when there is no restriction on domains or
sub-path(s), stack anchor links once discovered by the content scraper
instead of waiting the complete parsing of the file. 

This makes it possible to handle a crawling start file with thousands of
links in a reasonable amount of time.

Performance limitation : even if the crawl start faster with a large
file, the content of the parsed file still is fully loaded in memory.
2016-10-21 13:03:31 +02:00
luccioman
ee92082a3b Updated javadocs : warning about closing stream responsibility. 2016-10-21 12:48:36 +02:00
luccioman
6f49ece22f Fixed redirected URLs processing as crawl start point.
See mantis 699 (http://mantis.tokeek.de/view.php?id=699) for details.
2016-10-20 12:12:26 +02:00
reger
68217465fe div by null in word distance calculation
(again, description in http://mantis.tokeek.de/view.php?id=698)
as root cause was not seen, added just workaround reducing in favour over a 
try catch (for easier followup).
2016-10-19 22:55:36 +02:00
luccioman
7263d17436 Removed mentions of deprecated LURL-db.
Thanks to LA_FORGE asking about if on YaCy forum (
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5895 )
2016-10-19 14:56:25 +02:00
reger
8b74a6bf57 fix min/max calculation of WordReferenceVars.distance()
Issue was the calculation in AbstractReference with positions.clear() call,
this made distance result always 0 (distance needs min 2 positions) and created concurrency issues.
+ unit test of changes
2016-10-17 23:58:28 +02:00
luccioman
da362628fb Added fine log level for too long blacklist matching processing. 2016-10-17 22:32:19 +02:00
reger
aaae7c6462 adjust ConcurrentScoreMap internal value map to interface and use parameter
Long -> Integer (saves some bytes)
2016-10-16 06:31:48 +02:00
reger
31d2a5645e remove obsolete query variable
leftover from 8fb370d9f8 (diff-1d4259005ebfddc11083387857a86175)
harmonize ranking shift parameter to 0xFF
correct addresult weight parameter to long
2016-10-15 19:29:19 +02:00
luccioman
a588ed7628 Applied image headers customization to the new ViewFavicon servlet. 2016-10-14 14:05:38 +02:00
luccioman
7717a3d43d Fixed license headers on files created to improve favicon management. 2016-10-14 11:55:49 +02:00
luccioman
6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
	htroot/yacysearchitem.java
	source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
	source/net/yacy/search/schema/CollectionConfiguration.java
	source/net/yacy/server/serverObjects.java
2016-10-14 11:29:55 +02:00
reger
685d8e86bf Avoid frequent data type casting (float/long) for rwi score
refactor to using long in URIMetadataNode too (and related call parameters)
As remote rwi score's are not used (since v1.83) skip reading float-score ,
but keep in toString() for communication with older versions.
2016-10-14 01:17:34 +02:00
luccioman
3ccd89e274 Fixed MultiProtocolURL.resolveBackpath to handle remaining '..' segments 2016-10-13 16:18:24 +02:00
luccioman
4b699c469a Blacklist refactoring : extracted a function for easier unit testing 2016-10-13 15:33:31 +02:00