Commit Graph

11600 Commits

Author SHA1 Message Date
reger
24f68a4eb7 refactor opensearch heuristic
Introduce FederateSearchManager, which handles search heuristics against external systems via specific FederateSearchConnectors.
The connectors provide the query() functionality, the translation to the YaCy schema (.toYaCySchema()) and the search() routine that delivers results to searchevents; search() is generally implemented in the abstract connector.
The manager now enforces a minimum delay of 15 s between calls to external systems.
Besides the OpensearchConnector, a SolrFederateSearchConnector is available. It uses an additional config file for fieldname translation.

default heuristicopensearch.conf:
- openbdb.com removed - no longer seems to deliver results
- config via solrconnector to datacite.org added (large technical library archive)
2015-01-19 03:30:35 +01:00
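The minimum delay between calls to external systems described in the commit above could be sketched roughly as follows. This is only an illustration: the class name `MinDelayGate`, the method `tryAcquire()` and the injectable clock are hypothetical and not taken from the YaCy sources. The commit message also does not say whether the manager skips or postpones a call that arrives too early; this sketch simply rejects it.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper (not YaCy code): lets a call through only if the minimum delay has passed. */
final class MinDelayGate {
    private final long minDelayMillis;
    private final LongSupplier clock;   // injectable time source, so the gate can be tested without sleeping
    private long lastCallMillis;
    private boolean calledBefore = false;

    MinDelayGate(long minDelayMillis, LongSupplier clock) {
        this.minDelayMillis = minDelayMillis;
        this.clock = clock;
    }

    MinDelayGate(long minDelayMillis) {
        this(minDelayMillis, System::currentTimeMillis);
    }

    /** Returns true and records the call time if the call is allowed, false if it came too early. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (calledBefore && now - lastCallMillis < minDelayMillis) {
            return false;               // too soon after the previous permitted call
        }
        calledBefore = true;
        lastCallMillis = now;
        return true;
    }
}
```

A caller would create one gate per external system, for example `new MinDelayGate(15_000L)`, and query the remote system only when `tryAcquire()` returns true.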
Michael Peter Christen
3b51636ecb fix for mediawiki import 2015-01-12 00:35:47 +01:00
Michael Peter Christen
b07afbc115 a test with http://validator.w3.org/feed/#validate_by_input shows that
the time format was wrong; we must use RFC-822
2015-01-09 16:45:43 +01:00
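For illustration, a feed date in the RFC-822 style mentioned in the commit above can be produced in Java along these lines. This is a minimal sketch, not the code of the commit: the class name `FeedDate` is hypothetical, and the four-digit year and the fixed GMT zone are assumptions.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper (not YaCy code): formats a date in RFC-822 style for feeds. */
final class FeedDate {
    static String format(Date date) {
        // Locale.US keeps day and month names in English regardless of the system locale.
        // A new formatter is created per call because SimpleDateFormat is not thread-safe.
        SimpleDateFormat f = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss Z", Locale.US);
        f.setTimeZone(TimeZone.getTimeZone("GMT"));
        return f.format(date);
    }
}
```

With this pattern, `FeedDate.format(new Date(0L))` yields `Thu, 01 Jan 1970 00:00:00 +0000`.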
Michael Peter Christen
8cafdb989a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2015-01-09 11:00:02 +01:00
reger
66839f73fa remove debug limit from commit before 2015-01-09 02:52:18 +01:00
reger
4214f250d0 Add option for extended search (Autosearch) to Bookmark.html asking all connected peers for the searchterm added as description to the bookmark created by the bookmark icon.
Intended for searches/research projects with insufficient results from local and DHT selected remote target peers.

Function: the process checks newly created bookmarks for a description starting with "query=..." and uses it to ask every peer for 20 search results, which are added to the local index in a background job.
A link to start/stop the process was added to /Bookmarks.html
2015-01-09 02:06:30 +01:00
reger
bb37cb32e4 Add title import for bookmark icon
if available in index
2015-01-09 01:33:45 +01:00
reger
8e751d754a - add javadoc to busythread with hint about the init parameter usage
- remove obsolete 10_httpd config parameter
2015-01-09 01:31:57 +01:00
Michael Peter Christen
3e6c3e2237 documents pushed over the api/push_p.html interface will have their
unique flag set by default
2015-01-06 15:22:59 +01:00
Michael Peter Christen
0871e43fcc better scale 2015-01-06 14:22:43 +01:00
Michael Peter Christen
35c24608cc fix for division by zero (rare cases) 2015-01-06 14:21:20 +01:00
Michael Peter Christen
4144c7cc52 do not write frame links to webgraph 2015-01-06 14:14:25 +01:00
reger
4eb89d7f15 revert clickservlet
(the default was indeed a mistake)
2015-01-05 09:10:20 +01:00
Michael Peter Christen
61ae9d2d11 do not use the clickservlet by default. From my personal view, this
technique should not be used at all! This project is about privacy, the
existence of a click servlet is one example why people should NOT use a
search portal if such exists.
2015-01-05 08:21:51 +01:00
Michael Peter Christen
c9e2128260 please commit new files under your own name, this file was not created
by me.
2015-01-05 08:18:19 +01:00
reger
ebe5faeb01 added url to bookmark icon link
the url is needed anyway, saves an index lookup and works w/o committed url.
Removed unused order parameter
2015-01-05 06:55:53 +01:00
sixcooler
5594c43d2e bump to Solr-/Lucene-4.10.3 2015-01-04 18:47:47 +01:00
reger
d44d8996d0 Added a “don't store remote search results” option
This is intended for peers who want to participate in the P2P network but don't wish to load/fill-up their index with metadata of every received search result. 
The DHT transfer is not affected by this option (and will work as usual, so that a peer disabling the new store to index switch still receives and holds the metadata according to DHT rules).
The downside for the local peer is that search speed will not improve if search terms are only available remotely or by quick hits in the local index.

To be able to improve the local index, a Click-Servlet option was added as well.
If switched on, all search result links point to this servlet, which forwards the user's browser (by html header) to the desired page and feeds the page to the fulltext-index.
The servlet accepts a parameter defining the action to perform (see defaults/web.xml, index, crawl, crawllinks)

The option check-boxes are placed in ConfigPortal.html
2015-01-04 11:10:45 +01:00
reger
d729386787 fix NPE in viewimage
Caused by: java.lang.NullPointerException
	at net.yacy.peers.graphics.EncodedImage.<init>(EncodedImage.java:73)
	at ViewImage.respond(ViewImage.java:156)
2015-01-04 09:12:30 +01:00
reger
4ff018c9e4 fix ConfigPortal jumps to iframe focus
add focus parameter to yacysearch.html too
2015-01-04 06:57:13 +01:00
reger
c156548efe add info text to metadata page (htmlresponsewriter) on no documents found 2015-01-04 02:59:21 +01:00
reger
3ac1d14a21 improve TexParser.mimeOf( fileextension ) by returning 1st defined in supported list.
This prevents unusual mapping of supported fileextension -> mimetype
(like htm=application/x-tex)
2015-01-02 04:20:02 +01:00
Michael Peter Christen
d2792a43fd do not write iframe and embed links into webgraph, but use them anyway
for crawling
2015-01-02 02:44:03 +01:00
Michael Peter Christen
5b810f6d70 Merge branch 'master' of gitorious.org:yacy/whitrs-rc1 2015-01-02 00:57:37 +01:00
Ryszard Goń
3cdbd5f5c6 Fix for progress table background not resizing
when the post-processing started/ended.
2015-01-02 00:11:32 +01:00
reger
0dfeee154a adjustments for Bookmark icon to act on BookmarkDB,
it acts on YMarks, but the YMark interface does not seem to be maintained;
for future features (e.g. query memory) BookmarkDB is the likely choice to expand, besides the crawlstart bookmark also the result bookmark icon now adds to BookmarkDB.
The YMark related code is (for now) left untouched so both tables are updated.
2015-01-01 02:41:20 +01:00
Michael Peter Christen
513e9259f5 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-12-30 02:36:17 +01:00
reger
e177d69387 remove obsolete config footer option (ConfigPortal user.login)
no footer or footer-option in use

remove unused yacy.init item allowUnlimitedReceiveIndexFrom
2014-12-29 03:50:00 +01:00
Michael Peter Christen
5d4167f977 reactivated clear stacks code for termination of all crawls because this
did not work without that part of the code
2014-12-28 15:52:43 +01:00
Michael Peter Christen
3cd7deb3b8 do not flush non-errors to stdout because this is a concurrency issue.
the flush-call appeared very often in thread dumps with high load, so
this hopefully improves performance
2014-12-28 15:48:37 +01:00
Michael Peter Christen
4e3e2acc69 Merge branch 'master' of gitorious.org:yacy/rc1-fixed_percent-encoding 2014-12-28 15:01:40 +01:00
Michael Peter Christen
ecb6a59e9e do not translate gif images into png images for thumbnails. Instead,
stream the original to the search result thumb viewer. This has two
reasons:
- animated gifs cause 100% cpu and deadlocks in the jvm gif parser; a
known bug which is obviously not yet fixed
- animated gifs now appear in the search result also as animation
2014-12-28 14:53:55 +01:00
Michael Peter Christen
d9603039ff automatically set the Q flag for smb/ftp start urls (split pdf support) 2014-12-28 14:36:43 +01:00
Michael Peter Christen
8600ea01dd automatically switch on the query option in case intranet protocols (smb/ftp)
are used. This supports the new split-pdf option.
2014-12-28 14:27:42 +01:00
arucard21
3e9871291f Applied URL-decoding prior to HTML-encoding.
This removes percent-encoding from text shown in HTML
2014-12-27 09:52:34 +01:00
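The ordering described in the commit above (URL-decode first, HTML-encode second) can be sketched like this. It is an illustrative example only: the class `DisplayText` and its method `forHtml` are hypothetical, the escaping covers just the most common characters, and `URLDecoder` follows form-encoding rules, so it also turns `+` into a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical helper (not YaCy code): prepares a percent-encoded string for display in HTML. */
final class DisplayText {
    static String forHtml(String raw) {
        String decoded;
        try {
            // Step 1: remove percent-encoding, e.g. "%3C" becomes "<".
            decoded = URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            decoded = raw;  // malformed escape sequence: fall back to the original text
        }
        // Step 2: HTML-encode the decoded text so that it is safe to embed in a page.
        StringBuilder sb = new StringBuilder(decoded.length() + 16);
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

The order matters: decoding after HTML-encoding would let an encoded `%3C` reach the page as a raw `<`. Here `DisplayText.forHtml("a%3Cb%26c")` returns `a&lt;b&amp;c`.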
Ryszard Goń
3144313974 Postprocessing progress bar fix
(Make it work as [probably] actually intended)
2014-12-27 03:02:18 +01:00
reger
6a04563578 Init Jetty using setDefaultDescriptor (web.xml) to defaults/web.xml
so web.xml in defaults dir is applied first and optional DATA/SETTINGS/web.xml loaded on top.
By using this Jetty feature (default web.xml) we ensure that changes to the default are applied to existing installations
and individual additions/changes are still respected.
2014-12-27 00:10:14 +01:00
reger
51ec9c1f44 fix "null" title in response writer for documents with multivalued title 2014-12-26 18:23:26 +01:00
reger
73ba5d8ef7 adjust fieldtype and description of field httpstatus_redirect_s in CollectionSchema
- the field is not used (delete candidate)
2014-12-26 18:21:35 +01:00
reger
1f9389396a fix NPE related 500 (Bad Request) response of UrlProxy on blacklisted urls,
by adding parameter HTTPDeamon and removing unused hostAddress lookup code in sendRespondError
2014-12-25 02:21:45 +01:00
reger
7e4e9f7e32 improve yacysearchitem,
prevent allocation of String (modifyURL) if feature not used
2014-12-25 02:16:19 +01:00
reger
61f75d6019 add xmpcore as direct dependency to pom
(otherwise it's looked up at pdfbox archive path and not found there)
2014-12-25 02:13:44 +01:00
Michael Peter Christen
8ef56eda90 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-12-24 12:24:15 +01:00
Michael Peter Christen
9fce8bf2a5 crawling of multi-page pdfs with artificial post part on smb or ftp
shares is not possible with the disabled setting; this is now temporarily
disabled until a better solution is at hand.
2014-12-24 12:23:59 +01:00
reger
682dd94925 fix div by 0 in hello
Caused by: java.lang.ArithmeticException: / by zero
	at hello.respond(hello.java:159)
2014-12-24 00:04:35 +01:00
reger
17808898c6 update to SLF4J 1.7.9 2014-12-23 19:11:21 +01:00
reger
f856edecb6 fix proxy redirect (http status 302) response
fixes http://mantis.tokeek.de/view.php?id=517

The url given in the bug report uses a gzip input stream, which causes HTTPClient.writeto() to throw an IOException due to an incomplete input stream. This in turn prevents the 302 response from reaching the client browser.
Serving the target content only on httpstatus=200 lets the proxy pass the header response through, so the client browser's redirect settings can be honored.
2014-12-23 02:01:03 +01:00
Michael Peter Christen
cc090bcb01 enhanced initialization of autotagging 2014-12-23 00:37:51 +01:00
Michael Peter Christen
003ec43bee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-23 00:33:20 +01:00
Michael Peter Christen
bef689d0a2 NPE fix 2014-12-23 00:30:34 +01:00