yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
reger	4e0892962a	fix NPE in citation servlet on empty text field	2016-05-14 03:51:13 +02:00
reger	d9adc2c255	load handler for Transparent Proxy on startup only if feature is activated to save the resources and keep handler chain small if the feature is not used. +add a warning message on settingsack_p page to restart on first activation	2016-03-25 05:26:48 +01:00
Michael Peter Christen	b89465d952	0N - basic dump upload servlet infrastructure, to share index dumps within an experimental new sharing model	2016-03-11 18:12:13 +01:00
Michael Peter Christen	f12a900f3e	harmonization of http post of files for one and several files - this had been differently - and wrong for several files. also: base64-encoding for gzipped push files because our data structures currently only supports ASCII POST pushes..	2016-03-11 08:59:33 +01:00
luc	571bc55937	Refactoring : use StandardCharsets constants instead of hard-coded charset names.	2016-01-05 23:37:05 +01:00
luc	55a4d15775	Added a note on deprecated default search field and operator.	2015-12-14 23:55:12 +01:00
reger	52a9040ae6	Sort out double keywords (dc_subject) early in parsed documents - by direct using Set vs. List - remove not neede String[] getter	2015-11-13 01:48:28 +01:00
reger	a60b1fb6c2	differentiate api call getLocalPort() from getConfigInt()	2015-10-31 23:09:03 +01:00
sixcooler	87e4abe393	fight the fieldcache by usind DocValues: in Solr-5.x the fieldcache has moved and was not cleared anymore. This results in an huge fieldcache. (http://lucene.apache.org/#highlights-of-the-lucene-release-include https://issues.apache.org/jira/browse/LUCENE-5666) Here I try to use DovValues where it is possible. For this I used the Api-Scheme as new basis für the Solr-Schema. This needs at least a complete optimization of the Solr-Index to get a smaller FieldCache. Everything that is indexed with these setting will not use the Fieldcache at all.	2015-08-31 20:24:41 +02:00
Michael Peter Christen	b43811d38c	added surrogate import process for exported solr dumps. Just throw your solr dump file into DATA/SURROGATES/in/ and it will be imported!	2015-05-30 13:19:59 +02:00
reger	3e742d1e34	Init remote crawler on demand If remote crawl option is not activated, skip init of remoteCrawlJob to save the resources of queue and ideling thread. Deploy of the remoteCrawlJob deferred on activation of the option.	2015-05-23 02:06:39 +02:00
reger	609c52e987	refactor getBookmark to consistenly check existance by != null (w/o throwing exception on not found)	2015-05-11 00:37:04 +02:00
reger	8a5b8f8789	on bookmaring of search result, remember orig. query in separate bookmark property (instead of using the description field) - adjust display and autosearch - don't overwrite existing bookmark but combine info	2015-05-03 02:31:50 +02:00
Michael Peter Christen	fed26f33a8	enhanced timezone managament for indexed data: to support the new time parser and search functions in YaCy a high precision detection of date and time on the day is necessary. That requires that the time zone of the document content and the time zone of the user, doing a search, is detected. The time zone of the search request is done automatically using the browsers time zone offset which is delivered to the search request automatically and invisible to the user. The time zone for the content of web pages cannot be detected automatically and must be an attribute of crawl starts. The advanced crawl start now provides an input field to set the time zone in minutes as an offset number. All parsers must get a time zone offset passed, so this required the change of the parser java api. A lot of other changes had been made which corrects the wrong handling of dates in YaCy which was to add a correction based on the time zone of the server. Now no correction is added and all dates in YaCy are UTC/GMT time zone, a normalized time zone for all peers.	2015-04-15 13:17:23 +02:00
reger	7fcf0d0b71	fix missing display of CrawlerMonitor -> robots.txt Monitor revert delete of file api/table_p.html see `3ffe19b85c` (still used in this menu)	2015-03-24 00:13:05 +01:00
Michael Peter Christen	710a0efa1b	generalized time period computations	2015-03-02 12:55:31 +01:00
Michael Peter Christen	974d58b01f	IPv6 Fix for push interface	2015-02-04 15:03:34 +01:00
Michael Peter Christen	69eacdf4eb	applying precompiled CommonPattern.COMMA.split to all places where split(",") was used	2015-01-29 01:46:22 +01:00
Michael Peter Christen	bee5ee7cce	removed some warnings	2015-01-27 17:00:20 +01:00
Michael Peter Christen	6390454652	fix for vocabulary on/off setting	2015-01-27 16:24:27 +01:00
Michael Peter Christen	7db2888336	fixed font size and print page generation in pdf snapshots	2015-01-20 17:14:14 +01:00
reger	198102304b	refactor size() -> filesize() of URIMetadataNode (harmonize with ResultEntry and to not get confused with Collection.size())	2014-12-21 06:05:35 +01:00
Michael Peter Christen	5516819354	preventing the use of no-cache and expires in case that images are generated dynamically which will stay static in the future. This applies mainly to the search result favicon in front of search hits. These icons will now be generated once, but then caches in the browser. There is also a YaCy-internal cache for these icons which had prevented the re-generation of the icons in YaCy, but this cache is now superfluous since the browser should not call the servlet ViewImage again.	2014-12-19 17:41:38 +01:00
Michael Peter Christen	932faafffe	reactivated on-demand snapshot loading	2014-12-16 12:09:57 +01:00
Michael Peter Christen	2362ad7c34	fix for a count issue in snapshot api	2014-12-16 11:33:30 +01:00
Michael Peter Christen	9971e197e0	Added a transaction interface to the snapshots: all documents in the snapshots can now be processed with transactions using commit and rollback commands. Furthermore, a large number of monitoring methods had been added to check the success of transactions. The transactions for snapshots have two main components: a rss search API to get information about latest/oldest entries and a commit/rollback API to move entries away from the rss results. This is done by usage of two storage locations for the snapshots, INVENTORY and ARCHIVE. New snapshots are placed to INVENTORY, commited snapshots move to ARCHIVE, rollback snapshots move to INVENTORY again. Normal Workflow: Beside all these options below, usually it is sufficient to process data like this: - call http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST - process the rss result and use the <guid> value as <urlhash> (see next command) - for each processed result call http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash> - then you can call the rss feed again and the commited urls are omited from the next set of items. These are the commands to control this: The rss feed: http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST The feed will return a <urlhash> in the <guid> - field of the rss. This must be used for commit/rollback: Commit/Rollback: http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash> http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash> The json will return a property list containing the property "result" with possible values "success" or "fail", according of the result. If an "fail" occurs, please look into the log for further info. Monitoring: http://localhost:8090/api/snapshot.json?command=status This shows the total number of entries in the INVENTORY and the ARCHIVE http://localhost:8090/api/snapshot.json?command=list This will result a list of all hosts which have snapshots and the number of entries for the hosts. Counts for INVENTORY and ARCHIVE are listed in the porperties for "count.INVENTORY" and "count.ARCHIVE" http://localhost:8090/api/snapshot.json?command=list&depth=2 The list can be restricted to such which have a specific depth. The list contains then the same host names, but the count values change because only documents at that specific crawl depth are listed http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80 This lists all urlhashes for the given host, not only an accumulated list of the number of entries http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0 This restricts the list of urlhashes for that host for the given depth http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE This selects either the INVENTORY or ARCHIVE for all list commands, default is ALL which means that from both snapshot directories the host information is collected and combined. You can use the state option for all the commands as listed above Detailed Information: http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ This collects metadata information for the given urlhash. This can also be restricted with state=INVENTORY and state=ARCHIVE to test if the document is either in one of these snapshot directories. If an urlhash is not found, an empty result is returned. If an entry was found and the state was not restricted, then the result contains a state property containing the name of the location where the document is, either INVENTORY or ARCHIVE. Hint: If a very large number of documents is inside of INVENTORY, then it could be better to call the rss feed with http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY because that is very efficient.	2014-12-15 23:32:46 +01:00
Michael Peter Christen	c3c2b6999b	fixes on wkhtmltopdf	2014-12-14 04:03:20 +01:00
Michael Peter Christen	8df8ffbb6d	enhanced the snapshot functionality: - snapshots can now also be xml files which are extracted from the solr index and stored as individual xml files in the snapshot directory along the pdf and jpg images - a transaction layer was placed above of the snapshot directory to distinguish snapshots into 'inventory' and 'archive'. This may be used to do transactions of index fragments using archived solr search results between peers. This is currently unfinished, we need a protocol to move snapshots from inventory to archive - the SNAPSHOT directory was renamed to snapshot and contains now two snapshot subdirectories: inventory and archive - snapshots may now be generated by everyone, not only such peers running on a server with xkhtml2pdf installed. The expert crawl starts provides the option for snapshots to everyone. PDF snapshots are now optional and the option is only shown if xkhtml2pdf is installed. - the snapshot api now provides the request for historised xml files, i.e. call: http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ The result of such xml files is identical with solr search results with only one hit. The pdf generation has been moved from the http loading process to the solr document storage process. This may slow down the process a lot and a different version of the process may be needed.	2014-12-09 16:20:34 +01:00
Michael Peter Christen	d97deb5555	npe fix	2014-12-06 00:43:12 +01:00
Michael Peter Christen	4fe4bf29ad	added rss feed output to snapshot servlet which can be used to get a list of latest/oldest entries in the snapshot database. This is an example: http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100 The properties depth, order, host and maxcount can be omited. The meaning of the fields are: host: select only urls from this host or all, if not given depth: select only urls at that crawl depth or all, if not given maxcount: select at most the given number of urls or 10, if not given order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to select the first entries or ANY to select any The rss feed needs administration rights to work, a call to this servlet with rss extension must attach login credentials.	2014-12-06 00:25:05 +01:00
reger	ff18129def	ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to archive it, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded, parsed anyway, so it's not a expensive operation) Request: remove 2 unused init parameter - number of anchors of the parent - forkfactor sum of anchors of all ancestors	2014-12-05 01:13:37 +01:00
Michael Peter Christen	226aea5914	added a servlet which can create preview images, preview tumbnails and preview pdfs from web pages, i.e.: http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/ This supports also an on-the-fly generation of the preview documents if the user is an administrator. Otherwise, the servlet fails. To enable this, you must add wkhtmltopdf, imagemagick and (on headless servers) xvfb to your operation system. for detailed instructions, see `97f6089a41`	2014-12-03 11:45:48 +01:00
Michael Peter Christen	0550b54d56	added fix to postprocessing: avoid caching of postprocessing collection to always get fresh lists of documents. This is necessary since the postprocessing changes the same documents which the postprocessing-collection query selects.	2014-11-14 16:34:55 +01:00
Michael Peter Christen	1db476c67e	fix for bad table iteration	2014-11-10 18:52:01 +01:00
orbiter	3ffe19b85c	replaced old /api/table_p.xml servlet with /Tables_p.xml to avoid double code	2014-10-29 17:23:58 +01:00
Michael Peter Christen	07c5b57953	removed warnings	2014-10-15 11:19:25 +02:00
reger	f5967dfedf	add filter to citation page and a on/off button to display only sentences with citations, while maintaining the sentence number. Make the filtered list the default in search result citation link	2014-10-12 06:32:13 +02:00
Michael Peter Christen	0bfc69b29b	more ipv6 bugfixes	2014-10-08 12:38:56 +02:00
Marc Nause	1e6e69bc40	Finished implementation of UPNP: ) will try other ports if YaCy standard ports are not available ) distinguish between internal and external port (not sure if this works 100%) Still to add: propery in config to enter own external port (in case of manually configured NAT)	2014-10-07 13:10:06 +02:00
orbiter	3ac31614a3	added option to reverse-sort YaCy tables (internal API change only)	2014-09-18 11:11:09 +02:00
Michael Peter Christen	2a52c6f0f1	using htroot/api/blacklists as source folder: removed package declaration of some classes in that folder	2014-09-08 11:36:28 +02:00
reger	6654d314f1	add rss version to api/feed.rss IE11 reports error without	2014-08-27 02:31:21 +02:00
orbiter	2371d6b8db	target linktexts must be string to enable search facets on these fields	2014-08-01 13:20:25 +02:00
orbiter	22ce4fb4dd	better error handling for remote solr queries and exists-checks	2014-08-01 11:00:10 +02:00
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method that the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just just toNormalform(false) instead.	2014-07-18 12:43:01 +02:00
Michael Peter Christen	8514bffc22	enhanced postprocessing status report	2014-07-16 14:57:25 +02:00
orbiter	59160984cc	timeline performance update	2014-07-03 13:06:29 +02:00
orbiter	2073e69034	fix for long periods in timeline	2014-07-02 11:29:50 +02:00
Michael Peter Christen	8c52f0651b	refactoring of AccessTracker events & timeline fix	2014-07-01 16:06:01 +02:00
Michael Peter Christen	74206a10c7	refactoring	2014-06-27 14:40:36 +02:00

1 2 3 4 5 ...

427 Commits