Commit Graph

451 Commits

reger
a21789d4e7 Fix unresolved pattern in api/share.html by initializing some display vars 2017-07-08 22:46:15 +02:00
luccioman
bf55f1d6e5 Started support of partial parsing on large streamed resources.
This enables the getpageinfo_p API to return a result in a reasonable
amount of time on resources in the megabyte size range (a byte-budget
sketch follows below).
Support was added first for the generic XML parser; for other formats the
regular crawler limits apply as usual.
2017-07-08 09:04:03 +02:00
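A minimal sketch of the idea above (illustrative Java, not the YaCy parser code; LimitedInputStream is a hypothetical helper): cap the number of bytes handed to the parser so a very large streamed resource still yields a partial result quickly.

import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch: an InputStream that ends after a byte budget,
 *  so a parser only sees the head of a very large streamed resource. */
class LimitedInputStream extends InputStream {
    private final InputStream in;
    private long remaining;

    LimitedInputStream(final InputStream in, final long limit) {
        this.in = in;
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (this.remaining <= 0) return -1; // budget exhausted: report end of stream
        final int b = this.in.read();
        if (b >= 0) this.remaining--;
        return b;
    }
}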
luccioman
8da3174867 Ensure lower case conversion consistency with any default locale.
Especially for Turkish-speaking users with "tr" as their system default
locale: strings for technical identifiers (URLs, tag names, constants...)
must not be lower cased with the default locale, as 'I' does not become
'i' as in other locales such as "en", but becomes 'ı' (see the sketch below).
2017-06-27 06:42:33 +02:00
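The pitfall, sketched in plain Java (illustration only, not YaCy code):

import java.util.Locale;

public class LowerCaseDemo {
    public static void main(String[] args) {
        String tag = "TITLE";
        // Under a Turkish locale, 'I' lower-cases to the dotless 'ı' (U+0131):
        System.out.println(tag.toLowerCase(new Locale("tr"))); // "tıtle"
        // Locale.ROOT gives the locale-independent result expected for
        // URLs, tag names and other technical constants:
        System.out.println(tag.toLowerCase(Locale.ROOT));      // "title"
    }
}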
luccioman
0f80c978d6 Limit the number of initially previewed links in crawl start pages.
This prevents rendering a large and inconvenient scrollbar on resources
containing many links.
If really needed, a preview of all links is still available with a "Show
all links" button.

This does not affect the number of links used once the crawl is
effectively started, as the list is then loaded again server-side.
2017-06-17 09:33:14 +02:00
luccioman
cbccf97361 Added JavaDoc to the getpageinfo_p API servlet. 2017-05-30 17:38:16 +02:00
luccioman
bd88fd303e Deprecated the duplicated and internally unused getpageinfo servlet.
Redirects are set for the transition of any remaining external uses:
 - /api/getpageinfo.xml to /api/getpageinfo_p.xml
 - /api/getpageinfo.json to /api/getpageinfo_p.json
2017-05-30 09:29:28 +02:00
reger
a2afb4bae0 add SwitchboardConstants for the server port config keys 2017-03-18 20:02:26 +01:00
reger
334c70c37a correct fromDate init value on missing param in api/timeline_p servlet
revert test modification from last commit in AccessTracker.main
2017-02-20 00:14:14 +01:00
luccioman
e048e74072 Added an optional parameter to the webstructure.xml API.
The new "documentStructure" parameter can be set to false to only get
hosts' accumulated references on a resource, and thus prevent scraping
the specified URL and collecting citation references.

Also set the WebStructureGraph constants as final and updated the Javadoc
with example API call URLs (two hedged examples follow below).
2017-01-19 12:30:44 +01:00
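Two hedged example calls (assuming a peer on localhost:8090, as in the snapshot examples further down; the exact parameter combinations are illustrative):

http://localhost:8090/api/webstructure.xml?about=yacy.net
http://localhost:8090/api/webstructure.xml?about=yacy.net&documentStructure=false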
luccioman
17b7c92009 Made sure the webstructure.xml API produces valid XML.
Host names should not contain XML special characters such as quotation
marks, but at this stage the WebGraph may have mistakenly recorded a host
name with such characters. What's more, the DigestURL constructor does
not prevent this.
By using serverObjects.putXML to encode host names we ensure here that
the rendered XML is well formed and can be parsed by external tools
even if a structure entry is incorrect.
2017-01-17 15:59:55 +01:00
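A generic sketch of the kind of escaping involved (plain Java; this is not the serverObjects.putXML implementation):

/** Sketch: escape the five XML special characters so that even a
 *  mistakenly recorded host name cannot break the rendered XML. */
static String escapeXml(final String s) {
    return s.replace("&", "&amp;")   // must be replaced first
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;")
            .replace("'", "&apos;");
}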
luccioman
ed3dd5e31a Fixed webstructure.xml API used with a domain name 'about' parameter.
As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720),
when requesting this API with a domain name instead of a complete URL,
only HTTP references on the default port were listed.
2017-01-16 16:41:06 +01:00
luccioman
f793d97e56 Factored common code with DigestURL.hosthash() 2017-01-13 16:05:46 +01:00
luccioman
9cea7cbb10 Detailed some Javadoc related to /api/webstructure.xml usage. 2017-01-12 17:52:47 +01:00
reger
c50e23c495 reduce creation of empty legacy RequestHeader() in situations where null
is acceptable (less work for garbage collection).
2016-12-18 02:38:43 +01:00
reger
f45945cada increase use of the header constant for the custom "EXT" header 2016-11-13 01:39:14 +01:00
luccioman
812abfc868 Converted one more set of URLs to pure relative ones.
Easier YaCy peer configuration behind a reverse proxy subfolder: no
need for the reverse proxy to rewrite HTML links or URLs in CSS files.

Tested on Debian Jessie with an apache2 reverse proxy.

See related mantis issues http://mantis.tokeek.de/view.php?id=106 and
http://mantis.tokeek.de/view.php?id=701
2016-11-12 15:54:35 +01:00
luccioman
74fec066f4 Converted more URLs to pure relative ones.
Easier YaCy peer configuration behind a reverse proxy subfolder: no
need for the reverse proxy to rewrite HTML links or URLs in CSS files.

Tested on Debian Jessie with an apache2 reverse proxy.

See related mantis issues http://mantis.tokeek.de/view.php?id=106 and
http://mantis.tokeek.de/view.php?id=701
2016-11-12 10:51:54 +01:00
luccioman
734340c128 Fixed errors in Search portal mode or when the peer is not reachable.
Same case as reported in issue #87.
2016-11-04 14:31:22 +01:00
luccioman
6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
	htroot/yacysearchitem.java
	source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
	source/net/yacy/search/schema/CollectionConfiguration.java
	source/net/yacy/server/serverObjects.java
2016-10-14 11:29:55 +02:00
reger
7c81160f45 correct blacklist export as text URL to blacklists_p.txt
was using the servlet for network access and missing network.unit.name
fix for http://mantis.tokeek.de/view.php?id=694
+ prevent unresolved_pattern in the yacy/list servlet
2016-10-07 03:03:41 +02:00
reger
91ab8a526a add an error msg to api/share.html
and skip display of the URL when nothing was uploaded
2016-08-17 03:07:26 +02:00
luccioman
6e96c7341a Merge remote-tracking branch 'origin/master'
Conflicts:
	htroot/Load_MediawikiWiki.java
	htroot/Load_PHPBB3.java
	htroot/ViewImage.java
2016-07-03 18:59:00 +02:00
reger
4e0892962a fix NPE in citation servlet on empty text field 2016-05-14 03:51:13 +02:00
reger
d9adc2c255 load the handler for the Transparent Proxy on startup only if the feature is activated,
to save resources and keep the handler chain small if the feature is not used.
+ add a warning message on the settingsack_p page to restart on first activation
2016-03-25 05:26:48 +01:00
Michael Peter Christen
b89465d952 0N - basic dump upload servlet infrastructure, to share index dumps
within an experimental new sharing model
2016-03-11 18:12:13 +01:00
Michael Peter Christen
f12a900f3e harmonization of HTTP POST of files for one and several files - this had
been handled differently - and wrongly for several files. Also:
base64-encoding for gzipped push files, because our data structures
currently only support ASCII POST pushes (encoding step sketched below).
2016-03-11 08:59:33 +01:00
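The encoding step, sketched with the plain JDK (an assumption for illustration, not the actual push code):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

/** Sketch: gzip a payload, then base64-encode it so the POST body is
 *  pure ASCII, matching data structures that only support ASCII pushes. */
static String gzipAndBase64(final byte[] payload) throws IOException {
    final ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
        gz.write(payload);
    }
    return Base64.getEncoder().encodeToString(buf.toByteArray());
}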
luc
8682dfbd5e Updated getpageinfo outputs to return page icons list. 2016-02-10 09:02:21 +01:00
luc
3cc5619d93 Improved HTML icons indexing and rendering in search results.
See http://mantis.tokeek.de/view.php?id=629
2016-02-02 09:57:54 +01:00
luc
571bc55937 Refactoring: use StandardCharsets constants instead of hard-coded
charset names.
2016-01-05 23:37:05 +01:00
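The pattern in brief (illustration only): the constant avoids both a checked exception and typos in charset name strings.

import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        // Instead of getBytes("UTF-8"), which throws the checked
        // UnsupportedEncodingException and invites typos in the name:
        byte[] bytes = "example".getBytes(StandardCharsets.UTF_8);
        String text = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(text);
    }
}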
luc
55a4d15775 Added a note on deprecated default search field and operator. 2015-12-14 23:55:12 +01:00
reger
52a9040ae6 Sort out duplicate keywords (dc_subject) early in parsed documents
- by directly using a Set instead of a List
- remove the no longer needed String[] getter
2015-11-13 01:48:28 +01:00
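A small illustration of the change (not the parser code itself):

import java.util.LinkedHashSet;
import java.util.Set;

public class KeywordDedupDemo {
    public static void main(String[] args) {
        // Duplicate dc_subject keywords vanish on insert; LinkedHashSet
        // also keeps the original order:
        Set<String> keywords = new LinkedHashSet<>();
        keywords.add("yacy");
        keywords.add("search");
        keywords.add("yacy"); // ignored: already present
        System.out.println(keywords); // prints [yacy, search]
    }
}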
reger
a60b1fb6c2 differentiate api call getLocalPort() from getConfigInt() 2015-10-31 23:09:03 +01:00
sixcooler
87e4abe393 fight the fieldcache by using DocValues: in Solr 5.x the fieldcache has
moved and was not cleared anymore, which results in a huge fieldcache.
(http://lucene.apache.org/#highlights-of-the-lucene-release-include
https://issues.apache.org/jira/browse/LUCENE-5666)
Here I try to use DocValues where it is possible.
For this I used the API scheme as the new basis for the Solr schema.
This needs at least a complete optimization of the Solr index to get a
smaller FieldCache.
Everything that is indexed with these settings will not use the
FieldCache at all.
2015-08-31 20:24:41 +02:00
Michael Peter Christen
b43811d38c added surrogate import process for exported solr dumps.
Just throw your solr dump file into DATA/SURROGATES/in/ and it will be
imported!
2015-05-30 13:19:59 +02:00
reger
3e742d1e34 Init remote crawler on demand
If the remote crawl option is not activated, skip the init of the remoteCrawlJob to save the resources of its queue and idling thread.
Deployment of the remoteCrawlJob is deferred until activation of the option.
2015-05-23 02:06:39 +02:00
reger
609c52e987 refactor getBookmark
to consistently check existence by != null (without throwing an exception on not found; see the sketch below)
2015-05-11 00:37:04 +02:00
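The calling pattern this enables, sketched with hypothetical names (bookmarksDB and Bookmark stand in for the actual types):

// Sketch: existence check by null test, no exception handling for the
// not-found case.
Bookmark bookmark = bookmarksDB.getBookmark(urlHash); // null if not found
if (bookmark != null) {
    // the bookmark exists, work with it
}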
reger
8a5b8f8789 on bookmarking of a search result, remember the original query in a separate bookmark property
(instead of using the description field)
- adjust display and autosearch
- don't overwrite an existing bookmark but combine the info
2015-05-03 02:31:50 +02:00
Michael Peter Christen
fed26f33a8 enhanced timezone management for indexed data:
to support the new time parser and search functions in YaCy, a high
precision detection of date and time on the day is necessary. That
requires that the time zone of the document content and the time zone of
the user doing a search are detected. The time zone of the search
request is determined automatically using the browser's time zone offset,
which is delivered with the search request automatically and invisibly to
the user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must get a time zone offset passed, so
this required a change of the parser Java API. A lot of other changes
have been made to correct the wrong handling of dates in YaCy, which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are in the UTC/GMT time zone, a
normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
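The offset arithmetic, sketched with the plain JDK (an illustration; the actual parser API differs):

import java.util.Date;

/** Sketch: normalize a parsed local timestamp to UTC using a time zone
 *  offset given in minutes, as entered at the advanced crawl start. */
static Date toUTC(final Date localTime, final int timezoneOffsetMinutes) {
    return new Date(localTime.getTime() - timezoneOffsetMinutes * 60000L);
}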
reger
7fcf0d0b71 fix missing display of CrawlerMonitor -> robots.txt Monitor
revert delete of file api/table_p.html, see 3ffe19b85c
(still used in this menu)
2015-03-24 00:13:05 +01:00
Michael Peter Christen
710a0efa1b generalized time period computations 2015-03-02 12:55:31 +01:00
Michael Peter Christen
974d58b01f IPv6 Fix for push interface 2015-02-04 15:03:34 +01:00
Michael Peter Christen
69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
split(",") was used
2015-01-29 01:46:22 +01:00
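The idea in brief (sketch; CommonPattern is the commit's holder for such constants): a shared precompiled Pattern avoids re-evaluating the regex argument on every split call.

import java.util.regex.Pattern;

public class SplitDemo {
    // Sketch of the shared constant the commit applies everywhere:
    static final Pattern COMMA = Pattern.compile(",");

    public static void main(String[] args) {
        // Reuses one compiled Pattern instead of evaluating the regex
        // argument of "a,b,c".split(",") on each call:
        String[] parts = COMMA.split("a,b,c");
        System.out.println(parts.length); // 3
    }
}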
Michael Peter Christen
bee5ee7cce removed some warnings 2015-01-27 17:00:20 +01:00
Michael Peter Christen
6390454652 fix for vocabulary on/off setting 2015-01-27 16:24:27 +01:00
Michael Peter Christen
7db2888336 fixed font size and print page generation in pdf snapshots 2015-01-20 17:14:14 +01:00
reger
198102304b refactor size() -> filesize() of URIMetadataNode
(harmonize with ResultEntry and avoid confusion with Collection.size())
2014-12-21 06:05:35 +01:00
Michael Peter Christen
5516819354 preventing the use of no-cache and expires in cases where images are
generated dynamically but will stay static in the future. This applies
mainly to the search result favicon in front of search hits. These icons
will now be generated once, but are then cached in the browser. There is
also a YaCy-internal cache for these icons which had prevented the
re-generation of the icons in YaCy, but this cache is now superfluous
since the browser should not call the ViewImage servlet again.
2014-12-19 17:41:38 +01:00
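A hedged sketch of the servlet-side effect (generic servlet API, not the ViewImage code):

import javax.servlet.http.HttpServletResponse;

/** Sketch: let the browser cache a generated but effectively static image
 *  instead of sending no-cache/expires headers that force re-requests. */
static void allowCaching(final HttpServletResponse response, final long maxAgeSeconds) {
    response.setHeader("Cache-Control", "public, max-age=" + maxAgeSeconds);
    response.setDateHeader("Expires", System.currentTimeMillis() + maxAgeSeconds * 1000L);
}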
Michael Peter Christen
932faafffe reactivated on-demand snapshot loading 2014-12-16 12:09:57 +01:00
Michael Peter Christen
2362ad7c34 fix for a count issue in snapshot api 2014-12-16 11:33:30 +01:00
Michael Peter Christen
9971e197e0 Added a transaction interface to the snapshots: all documents in the
snapshots can now be processed with transactions using commit and
rollback commands. Furthermore, a large number of monitoring methods have
been added to check the success of transactions.

The transactions for snapshots have two main components: an RSS search
API to get information about the latest/oldest entries, and a
commit/rollback API to move entries away from the RSS results. This is
done by using two storage locations for the snapshots, INVENTORY and
ARCHIVE. New snapshots are placed in INVENTORY, committed snapshots move
to ARCHIVE, and rolled-back snapshots move to INVENTORY again.

Normal Workflow:
Besides all the options below, it is usually sufficient to process data
like this (a client sketch of this loop follows):
- call
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
- process the rss result and use the <guid> value as <urlhash> (see next
command)
- for each processed result call
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
- then you can call the rss feed again and the committed urls are omitted
from the next set of items.
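A minimal client sketch of this loop (plain JDK, Java 9+; naive guid extraction, assumes the URLs above):

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SnapshotDrain {
    public static void main(String[] args) throws Exception {
        String base = "http://localhost:8090/api/snapshot";
        // Fetch the INVENTORY feed, newest entries first:
        String rss = readAll(new URL(base + ".rss?state=INVENTORY&order=LATESTFIRST"));
        Matcher m = Pattern.compile("<guid[^>]*>([^<]+)</guid>").matcher(rss);
        while (m.find()) {
            String urlhash = m.group(1);
            // ... process the snapshot here, then move it to ARCHIVE:
            readAll(new URL(base + ".json?command=commit&urlhash=" + urlhash));
        }
    }

    private static String readAll(URL url) throws Exception {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}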

These are the commands to control this:
The rss feed:
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=ANY

The feed will return a <urlhash> in the <guid> field of the rss. This
must be used for commit/rollback:

Commit/Rollback:
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash>
The json will return a property list containing the property "result"
with possible values "success" or "fail", according to the result. If a
"fail" occurs, please look into the log for further info.

Monitoring:
http://localhost:8090/api/snapshot.json?command=status
This shows the total number of entries in the INVENTORY and the ARCHIVE.
http://localhost:8090/api/snapshot.json?command=list
This returns a list of all hosts which have snapshots and the number
of entries for each host. Counts for INVENTORY and ARCHIVE are listed in
the properties "count.INVENTORY" and "count.ARCHIVE".
http://localhost:8090/api/snapshot.json?command=list&depth=2
The list can be restricted to hosts with snapshots at a specific depth.
The list then contains the same host names, but the count values change
because only documents at that specific crawl depth are counted.
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80
This lists all urlhashes for the given host, rather than only the
accumulated number of entries.
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0
This restricts the list of urlhashes for that host for the given depth
http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY
http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE
This selects either INVENTORY or ARCHIVE for all list commands; the
default is ALL, which means that the host information is collected and
combined from both snapshot directories. You can use the state option
with all the commands listed above.

Detailed Information:
http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ
This collects metadata information for the given urlhash. The query can
also be restricted with state=INVENTORY or state=ARCHIVE to test whether
the document is in one of these snapshot directories. If a urlhash is
not found, an empty result is returned. If an entry was found and the
state was not restricted, then the result contains a state property
naming the location where the document is, either INVENTORY or ARCHIVE.

Hint:
If a very large number of documents is inside INVENTORY, it can be
better to call the rss feed with
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
because that is very efficient.
2014-12-15 23:32:46 +01:00