Commit Graph

11518 Commits

Author SHA1 Message Date
Michael Peter Christen
ecb6a59e9e do not translate gif images into png images for thumbnails. Instead,
stream the original to the search result thumb viewer. This has two
reasons:
- animated gifs cause 100% cpu and deadlocks in the jvm gif parser; a
known bug which is obviously not yet fixed
- animated gifs now appear in the search result also as animation
2014-12-28 14:53:55 +01:00
Michael Peter Christen
d9603039ff automatically set the Q flag for smb/ftp start urls (split pdf support) 2014-12-28 14:36:43 +01:00
Michael Peter Christen
8600ea01dd automatically switch on the query option when intranet protocols (smb/ftp)
are used. This supports the new split-pdf option.
2014-12-28 14:27:42 +01:00
Ryszard Goń
3144313974 Postprocessing progress bar fix
(Make it work as [probably] actually intended)
2014-12-27 03:02:18 +01:00
reger
6a04563578 Init Jetty using setDefaultDescriptor (web.xml) to defaults/web.xml
so web.xml in defaults dir is applied first and optional DATA/SETTINGS/web.xml loaded on top.
By using this Jetty feature (default web.xml) we assure that changes to the default are applied to existing installations
and individual addition/changes are still respected.
2014-12-27 00:10:14 +01:00
reger
51ec9c1f44 fix "null" title in response writer for documents with multivalued title 2014-12-26 18:23:26 +01:00
reger
73ba5d8ef7 adjust fieldtype and description of field httpstatus_redirect_s in CollectionSchema
- the field is not used (delete candidate)
2014-12-26 18:21:35 +01:00
reger
1f9389396a fix NPE related 500 (Bad Request) response of UrlProxy on blacklisted urls,
by adding parameter HTTPDeamon and removing unused hostAddress lookup code in sendRespondError
2014-12-25 02:21:45 +01:00
reger
7e4e9f7e32 improve yacysearchitem,
prevent allocation of String (modifyURL) if feature not used
2014-12-25 02:16:19 +01:00
reger
61f75d6019 add xmpcore as direct dependency to pom
(otherwise it's looked up at pdfbox archive path and not found there)
2014-12-25 02:13:44 +01:00
Michael Peter Christen
8ef56eda90 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-12-24 12:24:15 +01:00
Michael Peter Christen
9fce8bf2a5 crawling of multi-page pdfs with artificial post part on smb or ftp
shares is not possible with the disabled setting; this is now temporarily
disabled until a better solution is at hand.
2014-12-24 12:23:59 +01:00
reger
682dd94925 fix div by 0 in hello
Caused by: java.lang.ArithmeticException: / by zero
	at hello.respond(hello.java:159)
2014-12-24 00:04:35 +01:00
reger
17808898c6 update to SLF4J 1.7.9 2014-12-23 19:11:21 +01:00
reger
f856edecb6 fix proxy redirect (http status 302) response
fixes http://mantis.tokeek.de/view.php?id=517

The url given in the bug report uses a gzip input stream, which causes HTTPClient.writeto() to throw an IOException due to an incomplete input stream. This in turn prevents the 302 response from reaching the client browser.
Serving target content only on httpstatus=200 lets the proxy pass the header response through, so the client browser's redirect settings can be honored.
2014-12-23 02:01:03 +01:00
Michael Peter Christen
cc090bcb01 enhanced initialization of autotagging 2014-12-23 00:37:51 +01:00
Michael Peter Christen
003ec43bee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-23 00:33:20 +01:00
Michael Peter Christen
bef689d0a2 NPE fix 2014-12-23 00:30:34 +01:00
reger
1de33c6a53 add hint to Heuristics Config on "Greedy Learning Mode" in portal config,
to point to an option to make this setting permanent.
2014-12-22 20:36:29 +01:00
reger
5332c9df21 update to commons-fileupload-1.3.1.jar
(includes a security fix)
2014-12-22 20:34:13 +01:00
Michael Peter Christen
a0576ec737 fix for pdf sub-page result preparation 2014-12-22 14:32:09 +01:00
Michael Peter Christen
6ad43c4a8b removed debug code 2014-12-22 14:24:09 +01:00
Michael Peter Christen
407cfff010 fix to wkhtmltopdf usage 2014-12-22 02:01:55 +01:00
Michael Peter Christen
5d321d3dc5 fixes to wkhtmltopdf call 2014-12-21 20:11:39 +01:00
Michael Peter Christen
eb78388a98 changed prefer strategy for http unique in such a way that http is
preferred over https. While this is a bad idea from the standpoint of
security it is more commonly applicable for environments where http and
https mix and for some domains https is not available. Then the
double-check is possible even if no postprocessing is performed.
2014-12-21 19:17:06 +01:00
Michael Peter Christen
84e2cccab4 fix to prevent assertion error in ranking servlet if no vocabularies are
present that could be evaluated
2014-12-21 19:08:28 +01:00
Michael Peter Christen
9e588944fa prevent NPE during initialization of very large vocabularies 2014-12-21 19:02:36 +01:00
Michael Peter Christen
aaf7d4775a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-21 18:10:25 +01:00
Michael Peter Christen
8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
parsing into individual pages and add them all using different URLs.
These constructed urls are generated from the source url with an
appended page=<pagenumber> attribute to the url get/post properties.
This will distinguish the different page entries. The search result list
will then replace the post parameter with a url anchor # mark so that
the original url is presented in the search result. These
URLs can be opened directly on the correct page using pdf.js which is
now built into firefox. That means: if you find a search hit on page
5 and click on the search result, firefox will open the pdf viewer and
show page 5.
2014-12-21 18:10:15 +01:00
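A minimal sketch of the URL scheme described in commit 8c3e5b7b6d, for illustration only. The class and method names (PdfPageUrls, pageUrl, displayUrl) are hypothetical and not taken from the YaCy sources. The sketch assumes the page number is appended as a plain page=<n> query parameter and that the search result replaces it with a #page=<n> anchor, the fragment form that pdf.js understands. The actual implementation may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of YaCy: illustrates the split-pdf URL scheme.
final class PdfPageUrls {

    // Matches a trailing "page=<number>" query parameter, e.g. "?page=5" or "&page=5".
    private static final Pattern PAGE_PARAM = Pattern.compile("[?&]page=(\\d+)$");

    /** Builds the per-page index URL by appending page=<n> to the query part. */
    static String pageUrl(String sourceUrl, int page) {
        String separator = sourceUrl.contains("?") ? "&" : "?";
        return sourceUrl + separator + "page=" + page;
    }

    /** Turns a per-page URL back into the original URL plus a #page=<n> anchor. */
    static String displayUrl(String pageUrl) {
        Matcher m = PAGE_PARAM.matcher(pageUrl);
        if (!m.find()) {
            return pageUrl; // not a per-page URL, leave it untouched
        }
        return pageUrl.substring(0, m.start()) + "#page=" + m.group(1);
    }

    public static void main(String[] args) {
        String indexed = pageUrl("http://example.org/doc.pdf", 5);
        System.out.println(indexed);             // URL used to distinguish the page entry
        System.out.println(displayUrl(indexed)); // URL presented in the search result
    }
}
```

Keeping the page number in the query part makes every page a distinct index entry, while the anchor form leaves the request sent to the server unchanged, because browsers do not transmit the fragment.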
Michael Peter Christen
85773ebd4f removed debug lines 2014-12-21 17:53:06 +01:00
Michael Peter Christen
d14114697c the miss cache does not seem to work, it sometimes contains urlhashes
from documents which actually are inside the index. This can be
reproduced using the crawl result table at 
http://localhost:8090/CrawlResults.html?process=5
The cache is temporarily disabled to remove the bad behaviour, however a
later reactivation of that feature may be possible.
2014-12-21 17:31:51 +01:00
reger
deb75a1dbe fix refactored size() -> filesize() in YMarkMetadata 2014-12-21 14:02:06 +01:00
reger
198102304b refactor size() -> filesize() of URIMetadataNode
(harmonize with ResultEntry and to not get confused with Collection.size())
2014-12-21 06:05:35 +01:00
reger
c6f634a4f2 remove redundant caching of urlhash in URIMetadataNode
(is already cached in underlying DigestURL .url)

upd pom keyword for maven-antrun-plugin
2014-12-21 03:45:54 +01:00
Michael Peter Christen
445fafeb7c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-20 15:38:15 +01:00
Michael Peter Christen
0d69089c61 fix for division by zero 2014-12-20 15:11:06 +01:00
reger
ac61a39828 use peeraddress for link in remote crawl list
to make link work without enabled proxy

upd pom for Jetty (missing in last commit)
2014-12-20 01:59:00 +01:00
reger
fe5d4e6c7b update to Jetty 9.2.6 2014-12-19 21:54:17 +01:00
Michael Peter Christen
5516819354 preventing the use of no-cache and expires in case that images are
generated dynamically which will stay static in the future. This applies
mainly to the search result favicon in front of search hits. These icons
will now be generated once, but then cached in the browser. There is
also a YaCy-internal cache for these icons which had prevented the
re-generation of the icons in YaCy, but this cache is now superfluous
since the browser should not call the servlet ViewImage again.
2014-12-19 17:41:38 +01:00
Michael Peter Christen
d3e71ed070 fixes for searches when initialization of large autotagging libraries
has not been finished
2014-12-19 17:38:58 +01:00
Michael Peter Christen
28683530cd fixes to usage of no-cache: use and recognize also the no-store
directive
2014-12-19 17:37:58 +01:00
Michael Peter Christen
c9c700b510 reduction of http requests to YaCy using the correct cache-control,
expires and last-modified headers in http response.
2014-12-19 11:51:14 +01:00
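The caching commits above (5516819354, 28683530cd, c9c700b510) can be illustrated with a small sketch. This is not YaCy's servlet code: the class StaticImageHeaders, its methods and the max-age value are assumptions chosen for the example. It shows the general idea of sending cache-control, expires and last-modified for images that will not change, and of treating no-store like no-cache when deciding whether a response may be cached.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of YaCy: response headers for images that stay static.
final class StaticImageHeaders {

    // HTTP date format (IMF-fixdate), always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    /** Headers that let the browser reuse the image instead of requesting it again. */
    static Map<String, String> headers(Instant lastModified, Instant now, long maxAgeSeconds) {
        Map<String, String> h = new LinkedHashMap<>();
        h.put("Cache-Control", "public, max-age=" + maxAgeSeconds);
        h.put("Expires", HTTP_DATE.format(now.plusSeconds(maxAgeSeconds)));
        h.put("Last-Modified", HTTP_DATE.format(lastModified));
        return h;
    }

    /** True if a Cache-Control value contains the no-cache or the no-store directive. */
    static boolean forbidsCaching(String cacheControl) {
        if (cacheControl == null) {
            return false;
        }
        for (String directive : cacheControl.split(",")) {
            String d = directive.trim().toLowerCase(Locale.ROOT);
            if (d.equals("no-cache") || d.equals("no-store")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // One day is an arbitrary example lifetime, not a value taken from the commits.
        headers(now, now, 86_400).forEach((k, v) -> System.out.println(k + ": " + v));
        System.out.println(forbidsCaching("private, no-store"));
    }
}
```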
reger
eca578a5fa update to PDFBox 1.8.8 2014-12-19 02:54:38 +01:00
reger
13cca2b114 fix missing AppPath
upd Maven plugin versionid
2014-12-19 01:58:37 +01:00
Michael Peter Christen
d7e2f08a89 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-18 14:56:18 +01:00
reger
0f7d4c42e9 include xmpcore.jar in classpath
used by metadata-extractor
2014-12-16 21:12:37 +01:00
malykhin.dmitry
bd39e009ac Update russian translation 2014-12-16 23:10:53 +03:00
Michael Peter Christen
65125439fe added query modifier 'on'. This makes it possible to search for date
occurrences within the (web) page documents (not the document
last-modified!). This works only if the solr field dates_in_content_sxt
is enabled. A search request may then have the form "term on:<date>",
like
gift on:24.12.2014
gift on:2014/12/24
* on:2014/12/31
For the date format you may use any kind of human-readable date
representation(!yes!) - the on:<date> parser tries to identify language
and also knows event names, like:
bunny on:eastern
.. as long as the date term has no spaces inside (use a dot). Further
enhancement will be made to accept also strings encapsulated with
quotes.
2014-12-16 13:53:12 +01:00
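To make the "term on:<date>" form from commit 65125439fe concrete, here is a deliberately small sketch. It is not the parser YaCy uses: the class OnModifier and its methods are hypothetical, and it only understands the two numeric formats shown in the examples (24.12.2014 and 2014/12/24). Language detection and event names such as on:eastern are outside its scope and yield no date.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of YaCy: extracts an on:<date> modifier from a query.
final class OnModifier {

    // Assumption: only the two numeric formats from the commit message examples.
    private static final DateTimeFormatter[] FORMATS = {
        DateTimeFormatter.ofPattern("d.M.uuuu"),
        DateTimeFormatter.ofPattern("uuuu/M/d")
    };

    /** Tries each known format in turn and returns the first successful parse. */
    static Optional<LocalDate> parseDate(String text) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDate.parse(text, format));
            } catch (DateTimeParseException e) {
                // not this format, try the next one
            }
        }
        return Optional.empty();
    }

    /** Returns the date of the first parseable on:<date> token, if there is one. */
    static Optional<LocalDate> onDate(String query) {
        for (String token : query.trim().split("\\s+")) {
            if (token.startsWith("on:")) {
                Optional<LocalDate> date = parseDate(token.substring(3));
                if (date.isPresent()) {
                    return date;
                }
            }
        }
        return Optional.empty();
    }

    /** Returns the query with all on:<...> tokens removed. */
    static String searchTerms(String query) {
        List<String> kept = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (!token.startsWith("on:")) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String query = "gift on:24.12.2014";
        System.out.println(searchTerms(query));
        System.out.println(onDate(query).map(LocalDate::toString).orElse("no date"));
    }
}
```

Splitting on whitespace mirrors the restriction stated in the commit message that the date term must not contain spaces.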
Michael Peter Christen
1cfddea578 added (very experimental) Solr response writer for snapshot image
results
2014-12-16 13:18:49 +01:00
Michael Peter Christen
7287dd764e added url, date, time and page number on pdf snapshot footer 2014-12-16 12:39:10 +01:00