yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	85773ebd4f	removed debug lines	2014-12-21 17:53:06 +01:00
reger	198102304b	refactor size() -> filesize() of URIMetadataNode (harmonize with ResultEntry and to not get confused with Collection.size())	2014-12-21 06:05:35 +01:00
Michael Peter Christen	445fafeb7c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-12-20 15:38:15 +01:00
Michael Peter Christen	0d69089c61	fix for division by zero	2014-12-20 15:11:06 +01:00
reger	ac61a39828	use peeraddress for link in remote crawl list to make link work without enabled proxy; upd pom for Jetty (missing in last commit)	2014-12-20 01:59:00 +01:00
Michael Peter Christen	5516819354	preventing the use of no-cache and expires in case that images are generated dynamically which will stay static in the future. This applies mainly to the search result favicon in front of search hits. These icons will now be generated once, but then cached in the browser. There is also a YaCy-internal cache for these icons which had prevented the re-generation of the icons in YaCy, but this cache is now superfluous since the browser should not call the servlet ViewImage again.	2014-12-19 17:41:38 +01:00
Michael Peter Christen	d3e71ed070	fixes for searches when initialization of large autotagging libraries has not finished	2014-12-19 17:38:58 +01:00
Michael Peter Christen	28683530cd	fixes to usage of no-cache: use and recognize also the no-store directive	2014-12-19 17:37:58 +01:00
Michael Peter Christen	932faafffe	reactivated on-demand snapshot loading	2014-12-16 12:09:57 +01:00
Michael Peter Christen	2362ad7c34	fix for a count issue in snapshot api	2014-12-16 11:33:30 +01:00
Michael Peter Christen	9971e197e0	Added a transaction interface to the snapshots: all documents in the snapshots can now be processed with transactions using commit and rollback commands. Furthermore, a large number of monitoring methods have been added to check the success of transactions. The transactions for snapshots have two main components: an rss search API to get information about latest/oldest entries and a commit/rollback API to move entries away from the rss results. This is done by using two storage locations for the snapshots, INVENTORY and ARCHIVE. New snapshots are placed in INVENTORY, committed snapshots move to ARCHIVE, rolled-back snapshots move to INVENTORY again. Normal Workflow: Besides all the options below, it is usually sufficient to process data like this: - call http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST - process the rss result and use the <guid> value as <urlhash> (see next command) - for each processed result call http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash> - then you can call the rss feed again and the committed urls are omitted from the next set of items. These are the commands to control this: The rss feed: http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=ANY The feed will return a <urlhash> in the <guid> field of the rss.
This must be used for commit/rollback: Commit/Rollback: http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash> http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash> The json will return a property list containing the property "result" with possible values "success" or "fail", according to the result. If a "fail" occurs, please look into the log for further info. Monitoring: http://localhost:8090/api/snapshot.json?command=status This shows the total number of entries in the INVENTORY and the ARCHIVE http://localhost:8090/api/snapshot.json?command=list This returns a list of all hosts which have snapshots and the number of entries for each host. Counts for INVENTORY and ARCHIVE are listed in the properties "count.INVENTORY" and "count.ARCHIVE" http://localhost:8090/api/snapshot.json?command=list&depth=2 The list can be restricted to entries at a specific depth. The list then contains the same host names, but the count values change because only documents at that specific crawl depth are counted http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80 This lists all urlhashes for the given host, not only an accumulated count of entries http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0 This restricts the list of urlhashes for that host to the given depth http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE This selects either the INVENTORY or ARCHIVE for all list commands, default is ALL which means that the host information is collected from both snapshot directories and combined. You can use the state option for all the commands listed above Detailed Information: http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ This collects metadata information for the given urlhash.
This can also be restricted with state=INVENTORY and state=ARCHIVE to test whether the document is in one of these snapshot directories. If a urlhash is not found, an empty result is returned. If an entry was found and the state was not restricted, then the result contains a state property containing the name of the location where the document is, either INVENTORY or ARCHIVE. Hint: If a very large number of documents is inside INVENTORY, then it could be better to call the rss feed with http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY because that is very efficient.	2014-12-15 23:32:46 +01:00
reger	6c3f36def1	- fix path to default heuristic.cfg - deprecate unused ProxyServlet	2014-12-14 21:27:45 +01:00
Michael Peter Christen	c3c2b6999b	fixes on wkhtmltopdf	2014-12-14 04:03:20 +01:00
Michael Peter Christen	ff035a20e7	fix for vocabulary import (double term detection)	2014-12-10 14:09:34 +01:00
Michael Peter Christen	e6650050fe	fix for Is Facet checkbox	2014-12-10 13:14:39 +01:00
Michael Peter Christen	bd3ed5cae5	added charset detection to vocabulary reader	2014-12-10 13:11:51 +01:00
Michael Peter Christen	7bfc5b80cb	added new options to vocabulary editor: - new switch 'isFacet' which enables or disables the usage of the vocabulary for search facets. This shall be used for large vocabularies since searches in solr are extremely slow if facets for a large set of alternative terms are generated - new option to disable auto-enrichment from synonyms - new option to add synonyms from another column when importing from csv - automatically recognize double-occurrences in synonyms and bundling terms for such synonyms	2014-12-10 12:20:27 +01:00
Michael Peter Christen	8df8ffbb6d	enhanced the snapshot functionality: - snapshots can now also be xml files which are extracted from the solr index and stored as individual xml files in the snapshot directory along the pdf and jpg images - a transaction layer was placed above the snapshot directory to distinguish snapshots into 'inventory' and 'archive'. This may be used to do transactions of index fragments using archived solr search results between peers. This is currently unfinished, we need a protocol to move snapshots from inventory to archive - the SNAPSHOT directory was renamed to snapshot and now contains two snapshot subdirectories: inventory and archive - snapshots may now be generated by everyone, not only by peers running on a server with wkhtmltopdf installed. The expert crawl start provides the option for snapshots to everyone. PDF snapshots are now optional and the option is only shown if wkhtmltopdf is installed. - the snapshot api now provides the request for historised xml files, i.e. call: http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ The result of such xml files is identical with solr search results with only one hit. The pdf generation has been moved from the http loading process to the solr document storage process. This may slow down the process a lot and a different version of the process may be needed.	2014-12-09 16:20:34 +01:00
Michael Peter Christen	4111d42c81	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-12-08 12:40:12 +01:00
Michael Peter Christen	793ce6d13b	added confirmation dialogs for row deletion	2014-12-08 11:41:28 +01:00
Michael Peter Christen	cdc21d43b1	more robustness for broken table data in Table_API_p.html -- see bug report http://mantis.tokeek.de/view.php?id=495	2014-12-08 11:35:40 +01:00
reger	1d3ea35d69	prevent NPE on host link for too short HeuristicCfg.OpenSearchURL	2014-12-08 01:35:37 +01:00
Michael Peter Christen	a95af11050	enhancement for clearing the crawl queue	2014-12-07 23:43:38 +01:00
reger	5f0bb1214f	modified FieldReIndex to reindex queries with low number of documents first by internally using a score map with number of documents as score and working through the list from low to high.	2014-12-07 04:31:09 +01:00
Michael Peter Christen	d97deb5555	npe fix	2014-12-06 00:43:12 +01:00
Michael Peter Christen	4fe4bf29ad	added rss feed output to snapshot servlet which can be used to get a list of latest/oldest entries in the snapshot database. This is an example: http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100 The properties depth, order, host and maxcount can be omitted. The meanings of the fields are: host: select only urls from this host or all, if not given depth: select only urls at that crawl depth or all, if not given maxcount: select at most the given number of urls or 10, if not given order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to select the first entries or ANY to select any The rss feed needs administration rights to work, a call to this servlet with rss extension must attach login credentials.	2014-12-06 00:25:05 +01:00
reger	d6539ba597	Merge origin/master	2014-12-05 01:15:41 +01:00
reger	ff18129def	ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to achieve it, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded, parsed anyway, so it's not an expensive operation) Request: remove 2 unused init parameters - number of anchors of the parent - forkfactor sum of anchors of all ancestors	2014-12-05 01:13:37 +01:00
Michael Peter Christen	d83de9ecf5	added another path for the convert command because on older Macs ImageMagick has a different installation location	2014-12-03 18:07:05 +01:00
Michael Peter Christen	226aea5914	added a servlet which can create preview images, preview thumbnails and preview pdfs from web pages, i.e.: http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/ This also supports an on-the-fly generation of the preview documents if the user is an administrator. Otherwise, the servlet fails. To enable this, you must add wkhtmltopdf, imagemagick and (on headless servers) xvfb to your operating system. For detailed instructions, see `97f6089a41`	2014-12-03 11:45:48 +01:00
Michael Peter Christen	181911376c	showing list of all threads in threaddump using the ThreadMXBean counter (this obviously shows more threads than before?)	2014-12-02 16:21:06 +01:00
Michael Peter Christen	64887f6b21	show number of threads on status page	2014-12-02 16:04:11 +01:00
Michael Peter Christen	6f0167fac1	get cloned crawl start parameter for snapshots	2014-12-02 12:52:05 +01:00
Michael Peter Christen	97f6089a41	YaCy can now create web page snapshots as pdf documents which can later be transcoded into jpg for image previews. To create such pdfs you must do: Add wkhtmltopdf and imagemagick to your OS, which you can do: On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from http://wkhtmltopdf.org/downloads.html and download http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip In Debian do "apt-get install wkhtmltopdf imagemagick" Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and "Always Fresh" - this is used by wkhtmltopdf to fetch web pages using the YaCy proxy. Using "Always Fresh" it is possible to get all pages from the proxy cache. Finally, you will see a new option when starting an expert web crawl. You can set a maximum depth for crawling which should cause a pdf generation. The resulting pdfs are then available in DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf	2014-12-01 15:03:09 +01:00
Michael Peter Christen	41d00350e4	moved network configuration to Use Case submenu; this is necessary because the definition of portal peers within the YaCy freeworld network is otherwise split into two different main menus.	2014-12-01 01:12:51 +01:00
reger	221f86dd5e	position api icon (ViewFile.html)	2014-11-30 01:58:14 +01:00
Michael Peter Christen	ad0da5f246	added new web page snapshot infrastructure which will lead to the ability to have web page previews in the search results. (This is a stub, no function available with this yet...)	2014-11-29 11:56:32 +01:00
reger	c475be2937	fix (enable) error msg on empty query	2014-11-28 22:44:33 +01:00
reger	f709132961	remove obsolete alternate link fix api link	2014-11-28 01:40:46 +01:00
Michael Peter Christen	3c71e1c872	show vocabularies in search result (in case of debugging)	2014-11-28 01:19:31 +01:00
Michael Peter Christen	2fce2e2697	larger boost fields for ranking	2014-11-27 12:11:54 +01:00
Michael Peter Christen	6c03ff8355	bold words in snippets should not be coloured black in the base style because there are styles with dark backgrounds which make the bold word invisible	2014-11-27 08:08:05 +01:00
Michael Peter Christen	c0f9f6ac66	added option to change the navbar-default, i.e. usable for dark skins	2014-11-26 18:01:35 +01:00
Michael Peter Christen	84763126e0	added option to make the YaCy proxy act as if the cache is never stale. If set to 'Always Fresh' the cache is always used if the entry exists in the cache. This is a good way to archive web content and access it without going online again in case the documents exist. To do so, open /Settings_p.html?page=ProxyAccess and check the "Always Fresh" checkbox. This is set to false by default, which behaves as before. If you set this to true, then you have your web archive in DATA/HTCACHE. Copy this to carry around your private copy of the internet!	2014-11-24 20:28:52 +01:00
Michael Peter Christen	5bb52f79be	reduce number of calls to queue.size() because that may be a bottleneck during crawling	2014-11-23 20:09:32 +01:00
Michael Peter Christen	092d97d7ac	when importing vocabulary csv files, accept also files without semicolon and truncate quotes from literals	2014-11-21 12:42:29 +01:00
Michael Peter Christen	ee9ec40048	added hints to ranking to make ranking boosts using vocabularies easier	2014-11-20 18:46:06 +01:00
Michael Peter Christen	70f03f7c8e	do not cache search requests to Solr if the result is used for doublechecking. If a double-check comes from cached results the doublecheck fails.	2014-11-20 18:45:27 +01:00
Michael Peter Christen	a0b84e4def	use a LinkedHashMap for facets to maintain facet order as given by solr	2014-11-20 18:44:29 +01:00
Michael Peter Christen	0dc6e0a5f2	added option to enrich vocabularies with synonyms from synonym database	2014-11-19 18:12:43 +01:00
Michael Peter Christen	6a2a669db4	added loading of the synonyms file from addon/synonyms into the knowledge loader	2014-11-19 17:36:56 +01:00
Michael Peter Christen	fdba8e2fa0	fix for 2-day network stats table: showing 48 instead of 24 hours from peer history	2014-11-17 14:23:21 +01:00
Michael Peter Christen	ec9d021568	added option in vocabulary editor to import CSV files with different encodings (preselected windows-type character encoding which is typical for CSV files). Fixed also other problems with character encoding in dictionary files. Automatically generated vocabularies are now also noted in the API steering.	2014-11-17 14:22:40 +01:00
reger	b558433211	adjust tag cloud font size calculation to limit max font size to ~ TOPWORDS_MAXSIZE	2014-11-17 01:24:30 +01:00
Michael Peter Christen	0550b54d56	added fix to postprocessing: avoid caching of postprocessing collection to always get fresh lists of documents. This is necessary since the postprocessing changes the same documents which the postprocessing-collection query selects.	2014-11-14 16:34:55 +01:00
Michael Peter Christen	68e8039fd1	added high-precision scheduler for API processes. This also allows the execution to depend on available RAM or CPU load. The default value for CPU load is 4.0 and the check runs once a minute.	2014-11-14 10:02:50 +01:00
Michael Peter Christen	0a879c98e7	added new 'firstSeen' database table and necessary data structures which hold a date for each URL to record when a url was first seen. This is then used to overwrite the modification date for urls upon recrawl in case that the first-seen date is before the latest document date. This behaviour is necessary due to the common behaviour of content management systems which attach always the current date to all documents. Using the firstSeen database it is possible to approximate a real first document creation date in case that the crawler starts frequently for the same domain. As a result the search results ordered by date have a much better quality and the usage of YaCy as search agent for latest news has a better quality.	2014-11-13 00:58:58 +01:00
Michael Peter Christen	487a733c99	fix for catchall handling in search	2014-11-12 22:48:33 +01:00
sixcooler	33b0234454	added an input field for setting 'fileHost'. Set this to avoid error messages like 'proxy use not allowed / granted' on accessing your Peer by its hostname.	2014-11-12 21:32:34 +01:00
Michael Peter Christen	1db476c67e	fix for bad table iteration	2014-11-10 18:52:01 +01:00
Michael Peter Christen	e05b7332b9	html fix	2014-11-10 02:18:44 +01:00
reger	c1ad265efd	remove not used accordion javascript call for facet navs	2014-11-09 22:06:00 +01:00
Michael Peter Christen	ecdfb35f09	added long variables to debug output in index browser	2014-11-07 18:12:09 +01:00
Michael Peter Christen	95d87f00b3	fix for bad query generation in doublecheck in postprocessing	2014-11-07 18:11:23 +01:00
orbiter	a2b5cfb3cf	added reverse button to tables, by default on now (to see latest entries first)	2014-11-02 20:30:49 +01:00
orbiter	fceac5d2d4	added (missing) Tables_p.xml for table xml api	2014-11-02 20:10:32 +01:00
orbiter	dbafd4865e	enhanced debug code in host browser	2014-10-30 15:47:44 +01:00
Michael Peter Christen	8f6587e87b	fix for broken protocol navigation	2014-10-30 12:41:04 +01:00
Michael Peter Christen	5c962dd009	better scaling of network statistic graphs	2014-10-29 21:41:41 +01:00
orbiter	3ffe19b85c	replaced old /api/table_p.xml servlet with /Tables_p.xml to avoid double code	2014-10-29 17:23:58 +01:00
Michael Peter Christen	b4585e9546	added new index size history image in /Status.html page	2014-10-29 13:37:44 +01:00
Michael Peter Christen	9aebbbebc0	added network history in /Network.html?page=5	2014-10-29 13:21:35 +01:00
Michael Peter Christen	26279b0993	added debug code for statistics about document attributes related to domains	2014-10-29 10:50:08 +01:00
reger	d65e3f2b53	RankingSolr: display only available or configured boost fields	2014-10-26 23:33:21 +01:00
Michael Peter Christen	4e56d79fc8	replaced input text field with text field for index deletion with query and replaced GET with POST method. This should make it possible to submit very large queries for deletion here.	2014-10-24 12:57:37 +02:00
orbiter	6f707b4305	removed spaces in seedlist.xml to reduce data	2014-10-20 18:05:37 +02:00
orbiter	78c9d31388	fix for bad json	2014-10-17 21:32:07 +02:00
Michael Peter Christen	8098a86f1d	ipv6 fix for api /yacy/seedlist.[json|xml], multiple IPs are now attached to the seed info. API clients must be adapted. Documentation will be fixed in http://www.yacy-websuche.de/wiki/index.php/Dev:APIseedlist Also added a new retrieval option for seeds, they can now be retrieved by their name with the get parameter name=<name>	2014-10-17 12:44:28 +02:00
Michael Peter Christen	07c5b57953	removed warnings	2014-10-15 11:19:25 +02:00
Michael Peter Christen	509eba2484	automatically zoom to location/POI	2014-10-15 11:07:08 +02:00
orbiter	fa2ad101ec	enhanced graphics computation (avoiding long string parsing for colours)	2014-10-15 10:31:24 +02:00
orbiter	ef813cec91	added proper copyright notice to OSM tiles presented at the search result page	2014-10-15 09:13:23 +02:00
Michael Peter Christen	1269e77dfa	enhanced location search	2014-10-15 00:55:57 +02:00
Michael Peter Christen	75b5f24be4	make browsing of file://z: - paths in index browser easier - this will now show the root paths on a shared drive	2014-10-13 18:33:39 +02:00
Michael Peter Christen	8ac3e9f890	fix for api icon in yacysearch_location.html	2014-10-13 16:53:00 +02:00
Michael Peter Christen	a1dd0ae62c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	2014-10-12 23:43:32 +02:00
reger	f5967dfedf	add filter to citation page and an on/off button to display only sentences with citations, while maintaining the sentence number. Make the filtered list the default in search result citation link	2014-10-12 06:32:13 +02:00
Michael Peter Christen	f818f84adb	more ipv6 fixes	2014-10-11 00:34:07 +02:00
Michael Peter Christen	2c2b50e65d	refactoring (class name should start with uppercase letter)	2014-10-10 14:32:21 +02:00
Michael Peter Christen	14385057c2	added also the NetworkHistory servlet...	2014-10-10 14:16:16 +02:00
Michael Peter Christen	d8beafba3a	fix for values in CrawlProfileEditor table and xml; now the full profile is available in the xml.	2014-10-09 13:27:20 +02:00
Michael Peter Christen	ec95dfa2e6	fixed crawl profile xml result which did not show the correct crawl status.	2014-10-08 18:48:57 +02:00
Michael Peter Christen	8c1a89cb34	added another decoration flag to switch off network graphics in crawler monitor and index browser: decoration.grafics.linkstructure Please set this to false to remove the graphics from the interface.	2014-10-08 17:12:35 +02:00
Michael Peter Christen	764e4ed673	fixed appearance of RSS icon on search result page	2014-10-08 15:48:45 +02:00
Michael Peter Christen	9b1958e8ca	more ipv6 bugfixes	2014-10-08 15:21:49 +02:00
Michael Peter Christen	7817fc50c9	added a high cpu cycle monitor to PerformanceQueues	2014-10-08 15:20:43 +02:00
Michael Peter Christen	5082feb103	less volume for effect sounds	2014-10-08 15:04:35 +02:00
Michael Peter Christen	0bfc69b29b	more ipv6 bugfixes	2014-10-08 12:38:56 +02:00
Michael Peter Christen	a27563e5c3	removed the atmo sound clips because they had been too large	2014-10-07 23:42:41 +02:00
Michael Peter Christen	ae58b22f5b	ipv6 fixes for Network.html front page	2014-10-07 21:57:41 +02:00


5141 Commits