Commit Graph

11630 Commits

Author SHA1 Message Date
Michael Peter Christen
5516819354 prevent the use of no-cache and expires for images that are
generated dynamically but will stay static afterwards. This applies
mainly to the search result favicons in front of search hits. These icons
will now be generated once, but then cached in the browser. There is
also a YaCy-internal cache for these icons which had prevented the
re-generation of the icons in YaCy, but this cache is now superfluous
since the browser should not call the ViewImage servlet again.
2014-12-19 17:41:38 +01:00
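As a hedged illustration of the header change, the following Java sketch shows how such once-generated images could be delivered with long-lived cache headers from a servlet. The class name, the url parameter and the generateIcon() helper are hypothetical; this is not the actual ViewImage code.

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Illustrative servlet: the icon bytes are generated once and may then be
    // kept by the browser, so no-cache / expires-in-the-past must not be sent.
    public class FaviconServlet extends HttpServlet {
        private static final long ONE_DAY_MS = 24L * 60L * 60L * 1000L;

        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            byte[] icon = generateIcon(request.getParameter("url")); // hypothetical generator
            response.setContentType("image/png");
            // allow the browser to keep the image instead of re-requesting the servlet
            response.setHeader("Cache-Control", "public, max-age=86400");
            response.setDateHeader("Expires", System.currentTimeMillis() + ONE_DAY_MS);
            response.setDateHeader("Last-Modified", System.currentTimeMillis()); // time the icon was generated (here: now)
            response.getOutputStream().write(icon);
        }

        private byte[] generateIcon(String url) {
            return new byte[0]; // placeholder for the actual favicon rendering
        }
    }

With Cache-Control and Expires set this way the browser keeps the icon locally and should not call the servlet again while the response is fresh.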
Michael Peter Christen
d3e71ed070 fixes for searches when the initialization of large autotagging libraries
has not yet finished
2014-12-19 17:38:58 +01:00
Michael Peter Christen
28683530cd fixes to the usage of no-cache: also use and recognize the no-store
directive
2014-12-19 17:37:58 +01:00
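A minimal sketch of such a cacheability check on the client side, assuming a simplified rule: no-store forbids storing, and no-cache is treated conservatively as not cacheable as well. The class and method names are illustrative, not YaCy's actual response header code.

    import java.util.Arrays;
    import java.util.Locale;

    // Illustrative check of the Cache-Control directives of a response before
    // a client-side cache stores the document.
    public final class CacheDirectives {

        public static boolean isCacheable(String cacheControlHeader) {
            if (cacheControlHeader == null) return true; // no directive given, caching allowed
            String cc = cacheControlHeader.toLowerCase(Locale.ROOT);
            // no-store forbids storing; no-cache requires revalidation and is
            // treated here, conservatively, as not cacheable either
            return !(cc.contains("no-store") || cc.contains("no-cache"));
        }

        public static void main(String[] args) {
            for (String header : Arrays.asList("public, max-age=600", "no-cache", "private, no-store", null)) {
                System.out.println(header + " -> cacheable=" + isCacheable(header));
            }
        }
    }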
Michael Peter Christen
c9c700b510 reduction of http requests to YaCy by using correct cache-control,
expires and last-modified headers in the http response.
2014-12-19 11:51:14 +01:00
reger
eca578a5fa update to PDFBox 1.8.8 2014-12-19 02:54:38 +01:00
reger
13cca2b114 fix missing AppPath
upd Maven plugin versionid
2014-12-19 01:58:37 +01:00
Michael Peter Christen
d7e2f08a89 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-18 14:56:18 +01:00
reger
0f7d4c42e9 include xmpcore.jar in classpath
used by metadata-extractor
2014-12-16 21:12:37 +01:00
malykhin.dmitry
bd39e009ac Update russian translation 2014-12-16 23:10:53 +03:00
Michael Peter Christen
65125439fe added query modifier 'on'. This makes it possible to search for date
occurrences within the (web) page documents (not the document's
last-modified date!). This works only if the solr field dates_in_content_sxt
is enabled. A search request may then have the form "term on:<date>",
like
gift on:24.12.2014
gift on:2014/12/24
* on:2014/12/31
For the date format you may use almost any kind of human-readable date
representation; the on:<date> parser tries to identify the language
and also knows event names, like:
bunny on:eastern
.. as long as the date term has no spaces inside (use a dot). A further
enhancement will be made to also accept strings enclosed in
quotes.
2014-12-16 13:53:12 +01:00
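A rough sketch of how an on:<date> term without spaces could be resolved by trying several date patterns in turn; the pattern list and class are illustrative and much simpler than the real parser, which also detects languages and event names.

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Arrays;
    import java.util.Date;
    import java.util.List;
    import java.util.Locale;

    // Illustrative parser for the value of an "on:" query modifier: several
    // common, space-free date formats are tried until one matches.
    public final class OnModifierDate {

        private static final List<String> PATTERNS = Arrays.asList(
                "dd.MM.yyyy", "yyyy/MM/dd", "yyyy-MM-dd", "MM/dd/yyyy");

        public static Date parse(String dateTerm) {
            for (String pattern : PATTERNS) {
                SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
                format.setLenient(false); // reject values that do not match the pattern exactly
                try {
                    return format.parse(dateTerm);
                } catch (ParseException e) {
                    // try the next pattern
                }
            }
            return null; // not recognized; the real parser also knows event names
        }

        public static void main(String[] args) {
            System.out.println(parse("24.12.2014"));
            System.out.println(parse("2014/12/24"));
        }
    }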
Michael Peter Christen
1cfddea578 added (very experimental) Solr response writer for snapshot image
results
2014-12-16 13:18:49 +01:00
Michael Peter Christen
7287dd764e added url, date, time and page number on pdf snapshot footer 2014-12-16 12:39:10 +01:00
Michael Peter Christen
8b5d074715 fix for image parser (there is a class missing!) 2014-12-16 12:10:15 +01:00
Michael Peter Christen
932faafffe reactivated on-demand snapshot loading 2014-12-16 12:09:57 +01:00
Michael Peter Christen
2362ad7c34 fix for a count issue in snapshot api 2014-12-16 11:33:30 +01:00
Michael Peter Christen
3354cd63be Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-15 23:32:57 +01:00
Michael Peter Christen
9971e197e0 Added a transaction interface to the snapshots: all documents in the
snapshots can now be processed with transactions using commit and
rollback commands. Furthermore, a large number of monitoring methods have
been added to check the success of transactions.

The transactions for snapshots have two main components: an rss search
API to get information about the latest/oldest entries and a commit/rollback
API to move entries away from the rss results. This is done by using
two storage locations for the snapshots, INVENTORY and ARCHIVE. New
snapshots are placed into INVENTORY, committed snapshots move to ARCHIVE,
and rolled-back snapshots move back to INVENTORY.

Normal Workflow:
Besides all the options below, it is usually sufficient to process data
like this:
- call
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
- process the rss result and use the <guid> value as <urlhash> (see next
command)
- for each processed result call
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
- then you can call the rss feed again and the committed urls are omitted
from the next set of items.

These are the commands to control this:
The rss feed:
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=ANY

The feed will return a <urlhash> in the <guid> field of the rss. This
must be used for commit/rollback:

Commit/Rollback:
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash>
The json will return a property list containing the property "result"
with the possible values "success" or "fail". If a
"fail" occurs, please look into the log for further information.

Monitoring:
http://localhost:8090/api/snapshot.json?command=status
This shows the total number of entries in the INVENTORY and the ARCHIVE 
http://localhost:8090/api/snapshot.json?command=list
This returns a list of all hosts which have snapshots and the number
of entries for each host. Counts for INVENTORY and ARCHIVE are listed in
the properties "count.INVENTORY" and "count.ARCHIVE"
http://localhost:8090/api/snapshot.json?command=list&depth=2
The list can be restricted to snapshots with a specific depth. The list
then contains the same host names, but the count values change because
only documents at that specific crawl depth are counted
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80
This lists all urlhashes for the given host instead of only an
accumulated count of entries
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0
This restricts the list of urlhashes for that host to the given depth
http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY
http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE
This selects either the INVENTORY or the ARCHIVE for all list commands;
the default is ALL, which means that the host information is collected
and combined from both snapshot directories. The state option can be
used with all the commands listed above

Detailed Information:
http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ
This collects metadata information for the given urlhash. The request can also
be restricted with state=INVENTORY or state=ARCHIVE to test whether the
document is in one of these snapshot directories. If a urlhash
is not found, an empty result is returned. If an entry was found and the
state was not restricted, then the result contains a state property
with the name of the location where the document is stored, either
INVENTORY or ARCHIVE.

Hint:
If a very large number of documents is inside INVENTORY, it may be
better to call the rss feed with
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
because that is very efficient.
2014-12-15 23:32:46 +01:00
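The normal workflow above can be scripted roughly as in the following sketch. It assumes a peer on localhost:8090 that accepts the API calls without further credentials, and it extracts the <guid> values with a simple regular expression instead of a full rss parser.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative client for the snapshot transaction API: read the INVENTORY rss
    // feed, process each entry, then commit it so it disappears from the feed.
    public class SnapshotCommitClient {

        private static final String PEER = "http://localhost:8090";

        public static void main(String[] args) throws Exception {
            String rss = get(PEER + "/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST");
            List<String> urlhashes = new ArrayList<>();
            Matcher m = Pattern.compile("<guid[^>]*>([^<]+)</guid>").matcher(rss);
            while (m.find()) urlhashes.add(m.group(1));

            for (String urlhash : urlhashes) {
                // ... process the snapshot document here ...
                String json = get(PEER + "/api/snapshot.json?command=commit&urlhash=" + urlhash);
                System.out.println("committed " + urlhash + ": " + json);
            }
        }

        private static String get(String url) throws Exception {
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }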
reger
63846ddb89 add final SolrQueryRequest.close to SolrServlet 2014-12-15 22:54:49 +01:00
reger
9edc7308aa update to metadata-extractor-2.7.0.jar
add 2 simple JUnit test cases for jpeg and tif parsing
2014-12-15 20:45:05 +01:00
Michael Peter Christen
578ae29f1e added a note that the servlet is linked using web.xml 2014-12-15 05:56:12 +01:00
reger
6c3f36def1 - fix path to default heuristic.cfg
- deprecate unused ProxyServlet
2014-12-14 21:27:45 +01:00
reger
00113dcfbd add chardet.jar to Maven dependencies 2014-12-14 19:17:13 +01:00
reger
446f374ba9 fix yacy.init comment
http://mantis.tokeek.de/view.php?id=513
2014-12-14 19:12:18 +01:00
Michael Peter Christen
bbf0ac40c3 add the actual DateDetection class... (missed in latest commit) 2014-12-14 13:43:30 +01:00
Michael Peter Christen
66b5a56976 Added and integrated a new date detection class which can identify date
notions within the fulltext of a document. This class also attempts to
identify dates given in abbreviated form, with a missing year, or described
with names of special days, like 'Halloween'. If a date has
no year given, the current year and following years are considered.

This process is therefore able to assign a large set of dates to a
document, either because several dates are given in the document
or because a date is ambiguous. Four new Solr fields are used to store the
parsing result:

dates_in_content_sxt:
if date expressions can be found in the content, these dates are listed
here in order of their appearance

dates_in_content_count_i:
the number of entries in dates_in_content_sxt

date_in_content_min_dt:
if dates_in_content_sxt is filled, this contains the oldest date from
the list of available dates

date_in_content_max_dt:
if dates_in_content_sxt is filled, this contains the youngest date from
the list of available dates, which may possibly also be in the future

These fields are deactivated by default because the evaluation of
regular expressions to detect the dates is still too CPU intensive.
Future enhancements may allow this to be switched on by default.

The purpose of these fields is the creation of calendar-like search
facets, to be implemented next.
2014-12-14 13:40:45 +01:00
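Once the fields are enabled, they can be queried like any other Solr date fields. The sketch below only builds an example select URL; the /solr/select path, the text_t fulltext field and the fl field list are assumptions about the peer's Solr interface, while the date field names are the ones introduced in this commit.

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    // Illustrative construction of a Solr query that uses the new date fields:
    // find documents whose detected date range overlaps 2014-12-24.
    public class DateFieldQuery {
        public static void main(String[] args) throws Exception {
            String q = "text_t:gift"; // assumed fulltext field
            String fq = "date_in_content_min_dt:[* TO 2014-12-24T23:59:59Z]"
                      + " AND date_in_content_max_dt:[2014-12-24T00:00:00Z TO *]";
            // the endpoint path and returned fields are assumptions; adapt to the peer
            String url = "http://localhost:8090/solr/select?q="
                    + URLEncoder.encode(q, StandardCharsets.UTF_8.name())
                    + "&fq=" + URLEncoder.encode(fq, StandardCharsets.UTF_8.name())
                    + "&fl=sku,title,dates_in_content_sxt";
            System.out.println(url);
        }
    }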
Michael Peter Christen
c3c2b6999b fixes on wkhtmltopdf 2014-12-14 04:03:20 +01:00
Michael Peter Christen
114f0afc1e enable sku as anchor in html response writer 2014-12-14 04:02:13 +01:00
Michael Peter Christen
aa80cb1159 enhanced tagging preparation speed which reduces initialization time for
very large vocabularies
2014-12-13 09:54:41 +01:00
Michael Peter Christen
6a1865f507 refactoring date -> lastModified 2014-12-11 23:37:41 +01:00
Michael Peter Christen
ab6cc3c88c added concurrent generation of snapshot pdfs 2014-12-10 14:10:05 +01:00
Michael Peter Christen
ff035a20e7 fix for vocabulary import (double term detection) 2014-12-10 14:09:34 +01:00
Michael Peter Christen
e6650050fe fix for Is Facet checkbox 2014-12-10 13:14:39 +01:00
Michael Peter Christen
bd3ed5cae5 added charset detection to vocabulary reader 2014-12-10 13:11:51 +01:00
Michael Peter Christen
413eeefed4 added character set detection library from
http://www-archive.mozilla.org/projects/intl/chardet.html
2014-12-10 13:08:29 +01:00
Michael Peter Christen
7bfc5b80cb added new options to the vocabulary editor:
- new switch 'isFacet' which enables or disables the usage of the vocabulary
for search facets. This should be used for large
vocabularies since searches in solr are extremely slow if facets for a
large set of alternative terms are generated
- new option to disable auto-enrichment from synonyms
- new option to add synonyms from another column when importing from csv
- automatically recognize double occurrences in synonyms and bundle
terms for such synonyms
2014-12-10 12:20:27 +01:00
Michael Peter Christen
87b53b3572 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-09 16:20:44 +01:00
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory alongside
the pdf and jpg images
- a transaction layer was placed on top of the snapshot directory to
distinguish snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments between peers using archived solr
search results. This is currently unfinished; we need a protocol to move
snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and now contains two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only by peers
running on a server with wkhtmltopdf installed. The expert crawl start
provides the snapshot option to everyone. PDF snapshots are now
optional and the option is only shown if wkhtmltopdf is installed.
- the snapshot api now provides a request for historised xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such an xml file is identical to a solr search result with
only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down that process a lot and
a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
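A short sketch of fetching one of those historised xml files and storing it locally; the urlhash is the example value from above, and an unprotected peer on localhost:8090 is assumed.

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    // Illustrative download of one snapshot xml document (a single-hit solr result).
    public class SnapshotXmlFetch {
        public static void main(String[] args) throws Exception {
            String urlhash = "Q3dQopFh1hyQ"; // example hash from the api description above
            URL api = new URL("http://localhost:8090/api/snapshot.xml?urlhash=" + urlhash);
            try (InputStream in = api.openStream()) {
                Files.copy(in, Paths.get(urlhash + ".xml"), StandardCopyOption.REPLACE_EXISTING);
            }
            System.out.println("stored " + urlhash + ".xml");
        }
    }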
reger
5d67e165d9 remove redundant null check in ResponseHeader.lastModified
added a JUnit testcase for ResponseHeader dates (using age()),
adjusted age() to pass all tests
2014-12-09 00:58:08 +01:00
Michael Peter Christen
4111d42c81 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-08 12:40:12 +01:00
Michael Peter Christen
793ce6d13b added confirmation dialogs for row deletion 2014-12-08 11:41:28 +01:00
Michael Peter Christen
cdc21d43b1 more robustness for broken table data in Table_API_p.html -- see bug
report http://mantis.tokeek.de/view.php?id=495
2014-12-08 11:35:40 +01:00
reger
1d3ea35d69 prevent NPE on host link for too short HeuristicCfg.OpenSearchURL 2014-12-08 01:35:37 +01:00
Michael Peter Christen
a95af11050 enhancement for clearing the crawl queue 2014-12-07 23:43:38 +01:00
reger
5f0bb1214f modified FieldReIndex to reindex queries with a low number of documents first
by internally using a score map with the number of documents as score
and working through the list from low to high.
2014-12-07 04:31:09 +01:00
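The "low document count first" ordering can be pictured with a small sketch like the following; the queries, counts and class name are made up for illustration and this is not the actual FieldReIndex code.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative ordering for reindex jobs: queries are scored by their document
    // count and processed from the lowest count to the highest.
    public class ReindexOrdering {
        public static void main(String[] args) {
            Map<String, Long> documentCounts = new LinkedHashMap<>();
            documentCounts.put("host_s:example.org", 120000L);
            documentCounts.put("host_s:yacy.net", 350L);
            documentCounts.put("host_s:blog.example.com", 4200L);

            List<Map.Entry<String, Long>> jobs = new ArrayList<>(documentCounts.entrySet());
            jobs.sort(Map.Entry.comparingByValue()); // low document counts first
            for (Map.Entry<String, Long> job : jobs) {
                System.out.println("reindex " + job.getKey() + " (" + job.getValue() + " documents)");
            }
        }
    }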
reger
8055ed5b2a update to commons-logging-1.2 2014-12-06 22:32:24 +01:00
reger
e52370728a fix startup stop on missing HTCACHE/SNAPSHOT directory 2014-12-06 02:25:24 +01:00
reger
e5236aa7ca Merge origin/master 2014-12-06 01:44:03 +01:00
reger
70cf7060a4 coding fixes suggested in
http://mantis.tokeek.de/view.php?id=509
http://mantis.tokeek.de/view.php?id=510
2014-12-06 01:42:24 +01:00
Michael Peter Christen
d97deb5555 npe fix 2014-12-06 00:43:12 +01:00
Michael Peter Christen
4fe4bf29ad added rss feed output to the snapshot servlet which can be used to get a
list of the latest/oldest entries in the snapshot database. This is an
example:
http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100

The properties depth, order, host and maxcount can be omitted. The
meanings of the fields are:
host: select only urls from this host, or all if not given
depth: select only urls at that crawl depth, or all if not given
maxcount: select at most the given number of urls, or 10 if not given
order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to
select the oldest entries, or ANY to select any

The rss feed needs administration rights to work; a call to this servlet
with the rss extension must attach login credentials.
2014-12-06 00:25:05 +01:00
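A hedged example of attaching such login credentials with java.net.Authenticator, which answers whatever authentication challenge the peer sends; the host, port and admin password are placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Authenticator;
    import java.net.PasswordAuthentication;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // Illustrative authenticated call to the snapshot rss feed; the Authenticator
    // answers the peer's authentication challenge with the admin account.
    public class SnapshotRssClient {
        public static void main(String[] args) throws Exception {
            Authenticator.setDefault(new Authenticator() {
                @Override
                protected PasswordAuthentication getPasswordAuthentication() {
                    return new PasswordAuthentication("admin", "changeme".toCharArray()); // placeholder credentials
                }
            });
            URL feed = new URL("http://localhost:8090/api/snapshot.rss"
                    + "?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(feed.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) System.out.println(line);
            }
        }
    }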