yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-21 00:00:13 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	8cafdb989a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2015-01-09 11:00:02 +01:00
reger	66839f73fa	remove debug limit from commit before	2015-01-09 02:52:18 +01:00
reger	4214f250d0	Add option for extended search (Autosearch) to Bookmark.html asking all connected peers for the searchterm added as description to the bookmark created by the bookmark icon. Intended for searches/research projects with insufficient results from local and DHT selected remote target peers. Function: the process checks newly created bookmarks for a description starting with "query=...", uses it to ask every peer for 20 search results, and adds them to the local index in a background job. Link to start/stop the process added to /Bookmarks.html	2015-01-09 02:06:30 +01:00
reger	8e751d754a	- add javadoc to busythread with hint about the init parameter usage - remove obsolete 10_httpd config parameter	2015-01-09 01:31:57 +01:00
Michael Peter Christen	3e6c3e2237	documents pushed over the api/push_p.html interface will have their unique flag set by default	2015-01-06 15:22:59 +01:00
Michael Peter Christen	35c24608cc	fix for division by zero (rare cases)	2015-01-06 14:21:20 +01:00
Michael Peter Christen	4144c7cc52	do not write frame links to webgraph	2015-01-06 14:14:25 +01:00
reger	4eb89d7f15	revert clickservlet (default was indeed a mistake)	2015-01-05 09:10:20 +01:00
Michael Peter Christen	c9e2128260	please commit new files under your own name, this file was not created by me.	2015-01-05 08:18:19 +01:00
reger	d44d8996d0	Added a “don't store remote search results” option. This is intended for peers who want to participate in the P2P network but don't wish to fill up their index with metadata of every received search result. The DHT transfer is not affected by this option (and will work as usual, so that a peer disabling the new store-to-index switch still receives and holds the metadata according to DHT rules). The downside for the local peer is that search speed will not improve if search terms are only available remotely or by quick hits in the local index. To be able to improve the local index, a Click-Servlet option was added as well. If switched on, all search result links point to this servlet, which forwards the user's browser (by html header) to the desired page and feeds the page to the fulltext index. The servlet accepts a parameter defining the action to perform (see defaults/web.xml: index, crawl, crawllinks). The option check-boxes are placed in ConfigPortal.html	2015-01-04 11:10:45 +01:00
reger	c156548efe	add info text to metadata page (htmlresponsewriter) on no documents found	2015-01-04 02:59:21 +01:00
reger	3ac1d14a21	improve TexParser.mimeOf( fileextension ) by returning 1st defined in supported list. This prevents unusual mapping of supported fileextension -> mimetype (like htm=application/x-tex)	2015-01-02 04:20:02 +01:00
Michael Peter Christen	d2792a43fd	do not write iframe and embed links into webgraph, but use them anyway for crawling	2015-01-02 02:44:03 +01:00
Michael Peter Christen	3cd7deb3b8	do not flush non-errors to stdout because this is a concurrency issue. The flush call appeared very often in thread dumps under high load, so this hopefully improves performance	2014-12-28 15:48:37 +01:00
Michael Peter Christen	4e3e2acc69	Merge branch 'master' of gitorious.org:yacy/rc1-fixed_percent-encoding	2014-12-28 15:01:40 +01:00
Michael Peter Christen	ecb6a59e9e	do not translate gif images into png images for thumbnails. Instead, stream the original to the search result thumb viewer. This has two reasons: - animated gifs cause 100% cpu and deadlocks in the jvm gif parser; a known bug which is obviously not yet fixed - animated gifs now appear in the search result also as animation	2014-12-28 14:53:55 +01:00
arucard21	3e9871291f	Applied URL-decoding prior to HTML-encoding. This removes percent-encoding from text shown in HTML	2014-12-27 09:52:34 +01:00
reger	6a04563578	Init Jetty using setDefaultDescriptor (web.xml) to defaults/web.xml so web.xml in defaults dir is applied first and optional DATA/SETTINGS/web.xml loaded on top. By using this Jetty feature (default web.xml) we assure that changes to the default are applied to existing installations and individual addition/changes are still respected.	2014-12-27 00:10:14 +01:00
reger	51ec9c1f44	fix "null" title in response writer for documents with multivalued title	2014-12-26 18:23:26 +01:00
reger	73ba5d8ef7	adjust fieldtype and description of field httpstatus_redirect_s in CollectionSchema - the field is not used (delete candidate)	2014-12-26 18:21:35 +01:00
reger	1f9389396a	fix NPE related 500 (Bad Request) response of UrlProxy on blacklisted urls, by adding parameter HTTPDeamon and removing unused hostAddress lookup code in sendRespondError	2014-12-25 02:21:45 +01:00
reger	f856edecb6	fix proxy redirect (http status 302) response; fixes http://mantis.tokeek.de/view.php?id=517 The url given in the bug report uses a gzip input stream, which causes HTTPClient.writeto() to throw an IOException due to an incomplete input stream. This in turn prevents the 302 response from reaching the client browser. Serving target content only on httpstatus=200 lets the proxy pass the header response through, so the client browser's redirect settings can be honored.	2014-12-23 02:01:03 +01:00
Michael Peter Christen	cc090bcb01	enhanced initialization of autotagging	2014-12-23 00:37:51 +01:00
Michael Peter Christen	a0576ec737	fix for pdf sub-page result preparation	2014-12-22 14:32:09 +01:00
Michael Peter Christen	6ad43c4a8b	removed debug code	2014-12-22 14:24:09 +01:00
Michael Peter Christen	407cfff010	fix to wkhtmltopdf usage	2014-12-22 02:01:55 +01:00
Michael Peter Christen	5d321d3dc5	fixes to wkhtmltopdf call	2014-12-21 20:11:39 +01:00
Michael Peter Christen	eb78388a98	changed prefer strategy for http unique in such a way that http is preferred over https. While this is a bad idea from the standpoint of security, it is more commonly applicable for environments where http and https mix and where https is not available for some domains. Then the double-check is possible even if no postprocessing is performed.	2014-12-21 19:17:06 +01:00
Michael Peter Christen	9e588944fa	prevent NPE during initialization of very large vocabularies	2014-12-21 19:02:36 +01:00
Michael Peter Christen	aaf7d4775a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-12-21 18:10:25 +01:00
Michael Peter Christen	8c3e5b7b6d	added experimental pdf splitting which enables YaCy to split pdfs during parsing into individual pages and add them all using different URLs. These constructed urls are generated from the source url with an appended page=<pagenumber> attribute to the url get/post properties. This will distinguish the different page entries. The search result list will then replace the post parameter with a url anchor # mark, so that the original url is presented in the search result. These URLs can be opened directly on the correct page using pdf.js, which is now built into firefox. That means: if you find a search hit on page 5 and click on the search result, firefox will open the pdf viewer and show page 5.	2014-12-21 18:10:15 +01:00
Michael Peter Christen	d14114697c	the miss cache does not seem to work, it sometimes contains urlhashes from documents which actually are inside the index. This can be reproduced using the crawl result table at http://localhost:8090/CrawlResults.html?process=5 The cache is temporarily disabled to remove the bad behaviour; however, a later reactivation of that feature may be possible.	2014-12-21 17:31:51 +01:00
reger	deb75a1dbe	fix refactored size() -> filesize() in YMarkMetadata	2014-12-21 14:02:06 +01:00
reger	198102304b	refactor size() -> filesize() of URIMetadataNode (harmonize with ResultEntry and to not get confused with Collection.size())	2014-12-21 06:05:35 +01:00
reger	c6f634a4f2	remove redundant caching of urlhash in URIMetadataNode (is already cached in underlying DigestURL .url) upd pom keyword for maven-antrun-plugin	2014-12-21 03:45:54 +01:00
Michael Peter Christen	5516819354	preventing the use of no-cache and expires in case that images are generated dynamically which will stay static in the future. This applies mainly to the search result favicon in front of search hits. These icons will now be generated once, but then cached in the browser. There is also a YaCy-internal cache for these icons which had prevented the re-generation of the icons in YaCy, but this cache is now superfluous since the browser should not call the servlet ViewImage again.	2014-12-19 17:41:38 +01:00
Michael Peter Christen	d3e71ed070	fixes for searches when initialization of large autotagging libraries has not finished	2014-12-19 17:38:58 +01:00
Michael Peter Christen	28683530cd	fixes to usage of no-cache: use and recognize also the no-store directive	2014-12-19 17:37:58 +01:00
Michael Peter Christen	c9c700b510	reduction of http requests to YaCy using the correct cache-control, expires and last-modified headers in http response.	2014-12-19 11:51:14 +01:00
reger	13cca2b114	fix missing AppPath upd Maven plugin versionid	2014-12-19 01:58:37 +01:00
Michael Peter Christen	65125439fe	added query modifier 'on'. This makes it possible to search for date occurrences within the (web) page documents (not the document last-modified!). This works only if the solr field dates_in_content_sxt is enabled. A search request may then have the form "term on:<date>", like: "gift on:24.12.2014", "gift on:2014/12/24", "* on:2014/12/31". For the date format you may use any kind of human-readable date representation(!yes!) - the on:<date> parser tries to identify the language and also knows event names, like: "bunny on:eastern", as long as the date term has no spaces inside (use a dot). Further enhancements will be made to also accept strings enclosed in quotes.	2014-12-16 13:53:12 +01:00
Michael Peter Christen	1cfddea578	added (very experimental) Solr response writer for snapshot image results	2014-12-16 13:18:49 +01:00
Michael Peter Christen	7287dd764e	added url, date, time and page number on pdf snapshot footer	2014-12-16 12:39:10 +01:00
Michael Peter Christen	8b5d074715	fix for image parser (there is a class missing!)	2014-12-16 12:10:15 +01:00
Michael Peter Christen	932faafffe	reactivated on-demand snapshot loading	2014-12-16 12:09:57 +01:00
Michael Peter Christen	2362ad7c34	fix for a count issue in snapshot api	2014-12-16 11:33:30 +01:00
Michael Peter Christen	3354cd63be	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-12-15 23:32:57 +01:00
Michael Peter Christen	9971e197e0	Added a transaction interface to the snapshots: all documents in the snapshots can now be processed with transactions using commit and rollback commands. Furthermore, a large number of monitoring methods have been added to check the success of transactions. The transactions for snapshots have two main components: an rss search API to get information about the latest/oldest entries and a commit/rollback API to move entries away from the rss results. This is done by using two storage locations for the snapshots, INVENTORY and ARCHIVE. New snapshots are placed in INVENTORY, committed snapshots move to ARCHIVE, rolled-back snapshots move to INVENTORY again.

Normal Workflow: Besides all the options below, it is usually sufficient to process data like this:
- call http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
- process the rss result and use the <guid> value as <urlhash> (see next command)
- for each processed result call http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
- then you can call the rss feed again and the committed urls are omitted from the next set of items.

These are the commands to control this:

The rss feed:
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
The feed will return a <urlhash> in the <guid> field of the rss. This must be used for commit/rollback.

Commit/Rollback:
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash>
The json will return a property list containing the property "result" with possible values "success" or "fail", according to the outcome. If a "fail" occurs, please look into the log for further info.

Monitoring:
http://localhost:8090/api/snapshot.json?command=status
This shows the total number of entries in the INVENTORY and the ARCHIVE.
http://localhost:8090/api/snapshot.json?command=list
This returns a list of all hosts which have snapshots and the number of entries for each host. Counts for INVENTORY and ARCHIVE are listed in the properties "count.INVENTORY" and "count.ARCHIVE".
http://localhost:8090/api/snapshot.json?command=list&depth=2
The list can be restricted to entries which have a specific depth. The list then contains the same host names, but the count values change because only documents at that specific crawl depth are counted.
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80
This lists all urlhashes for the given host, not only an accumulated count of the entries.
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0
This restricts the list of urlhashes for that host to the given depth.
http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY
http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE
This selects either the INVENTORY or the ARCHIVE for all list commands. The default is ALL, which means that the host information is collected from both snapshot directories and combined. You can use the state option for all the commands listed above.

Detailed Information:
http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ
This collects metadata information for the given urlhash. This can also be restricted with state=INVENTORY and state=ARCHIVE to test whether the document is in one of these snapshot directories. If a urlhash is not found, an empty result is returned. If an entry was found and the state was not restricted, then the result contains a state property containing the name of the location where the document is, either INVENTORY or ARCHIVE.

Hint: If a very large number of documents is inside INVENTORY, then it could be better to call the rss feed with http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY because that is very efficient.	2014-12-15 23:32:46 +01:00
reger	63846ddb89	add final SolrQueryRequest.close to SolrServlet	2014-12-15 22:54:49 +01:00
reger	9edc7308aa	update to metadata-extractor-2.7.0.jar add 2 simple JUnit test cases for jpeg and tif parsing	2014-12-15 20:45:05 +01:00
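
The snapshot workflow described in commit 9971e197e0 (read the rss feed, take each <guid> as the urlhash, then commit it) could be scripted roughly as follows. This is only a sketch: the helper names feed_url, command_url and guids are hypothetical, and it assumes the feed is plain RSS 2.0 with un-namespaced <guid> elements, which the commit message does not spell out. It builds the URLs and parses a feed body; the actual HTTP calls are left to the caller.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Host and port as used in the commit message; adjust for your own peer.
BASE = "http://localhost:8090"


def feed_url(state="INVENTORY", order="LATESTFIRST", base=BASE):
    """Build the rss feed URL for a snapshot location and ordering."""
    return f"{base}/api/snapshot.rss?" + urlencode({"state": state, "order": order})


def command_url(command, urlhash, base=BASE):
    """Build a commit or rollback URL for one urlhash."""
    if command not in ("commit", "rollback"):
        raise ValueError("command must be 'commit' or 'rollback'")
    return f"{base}/api/snapshot.json?" + urlencode(
        {"command": command, "urlhash": urlhash}
    )


def guids(rss_text):
    """Return the text of every <guid> element (assumed to hold the urlhash)."""
    root = ET.fromstring(rss_text)
    return [g.text.strip() for g in root.iter("guid") if g.text]


if __name__ == "__main__":
    # Stand-in for the body fetched from feed_url(); not real peer output.
    sample = (
        '<rss version="2.0"><channel>'
        "<item><guid>upiFJ7Fh1hyQ</guid></item>"
        "</channel></rss>"
    )
    print(feed_url())
    for urlhash in guids(sample):
        # After processing the document, this URL would be requested.
        print(command_url("commit", urlhash))
```

Per the commit message, each commit call answers with a "result" property of "success" or "fail", so a caller would check that value before fetching the feed again.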


3130 Commits
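
The per-page URLs from commit 8c3e5b7b6d (experimental pdf splitting) can be illustrated with a small sketch. It is not YaCy's code: the function names are hypothetical, and the fragment form "#page=<n>" is an assumption, since the commit message only says the parameter is replaced by a url anchor.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def page_index_url(source_url, page):
    """Append page=<pagenumber> to the query so each pdf page gets its own URL."""
    parts = urlsplit(source_url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


def page_result_url(index_url):
    """Turn the trailing page=<n> parameter into a #page=<n> anchor (assumed form)."""
    parts = urlsplit(index_url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Remove only the last 'page' parameter, which is the one appended above.
    for i in range(len(pairs) - 1, -1, -1):
        if pairs[i][0] == "page":
            page = pairs.pop(i)[1]
            break
    else:
        return index_url  # no page parameter, nothing to rewrite
    return urlunsplit(parts._replace(query=urlencode(pairs), fragment=f"page={page}"))


if __name__ == "__main__":
    indexed = page_index_url("http://example.org/doc.pdf", 5)
    print(indexed)
    print(page_result_url(indexed))
```

The point of the round trip is that the indexed URL is unique per page, while the URL shown to the user is the original document address plus an anchor that a pdf viewer can use to jump to the page.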