Commit Graph

366 Commits

Author SHA1 Message Date
luccioman
a9cb083fa1 Improved consistency between loader openInputStream and load functions 2017-06-02 01:46:06 +02:00
luccioman
b1da92648e Fixed surrogates import monitoring page (/CrawlResults.html?process=7)
This page was always empty, as described in mantis 740
(http://mantis.tokeek.de/view.php?id=740)
2017-04-24 18:24:26 +02:00
luccioman
f66438442e Extended Mediawiki dump import to remote URLs.
When using a public HTTP URL in /IndexImportMediawiki_p.html, the remote
file is now directly streamed and processed, allowing the import of several
GB dumps even with a low-memory remote peer, and without the need to
manually download the dump file first.
2017-04-14 14:32:44 +02:00
reger
ce87025462 further avoid setting connect info properties as header values
following the comment "use of properties as header values is discouraged"
in cases where the (proxy) HTTPClient overwrites values with the supplied url.
Use the defined request.referer procedure in the response class.
2017-03-04 22:45:17 +01:00
luccioman
39e081ef38 Fixed display of crawler pending URLs counts in HostBrowser.html page.
As described in mantis 722 (http://mantis.tokeek.de/view.php?id=722)

Also updated some Javadoc.
2017-01-22 12:31:14 +01:00
luccioman
0da1e6ba16 Factored code re-implementing DigestURL.hosthash() method.
This ensures a consistent implementation of the url host hash generation
and makes usages easier to find in the source code.

Also added a unit test for this function.
2017-01-16 10:18:42 +01:00
luccioman
c1401d821e Adjusted crawl depth control for FTP crawl start URLs. 2017-01-02 10:24:17 +01:00
luccioman
3ca695390c FTP crawl start URLs : applied crawl profile depth control
Applied rules :
- when the FTP URL denotes a file resource, stack it like any start URL :
any embedded links can then be followed applying the usual depth rules
- when the FTP URL denotes a directory, list the files under this directory
and stack them for crawl, and repeat the process on subfolders until the
crawl depth is reached
2016-12-22 16:25:09 +01:00
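A minimal sketch of the rules described in the commit above; FtpClient, CrawlStacker and all method names here are illustrative stand-ins, not YaCy's actual classes:

    // Illustrative sketch only: a file URL is stacked like any start URL,
    // a directory URL is listed and the listing is recursed until the
    // crawl profile depth is reached.
    import java.util.List;

    public class FtpCrawlStartSketch {

        interface FtpClient {
            boolean isDirectory(String url);
            List<String> list(String directoryUrl); // entries below the directory
        }

        interface CrawlStacker {
            void stack(String url, int depth);      // enqueue a URL at the given depth
        }

        static void stackFtp(String url, int depth, int maxDepth,
                             FtpClient ftp, CrawlStacker stacker) {
            if (depth > maxDepth) return;           // profile depth reached
            if (!ftp.isDirectory(url)) {
                stacker.stack(url, depth);          // file: embedded links follow the usual depth rules
                return;
            }
            for (String entry : ftp.list(url)) {    // directory: stack files, recurse into subfolders
                stackFtp(entry, depth + 1, maxDepth, ftp, stacker);
            }
        }
    }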
reger
c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
is acceptable (less for garbage collection).
2016-12-18 02:38:43 +01:00
reger
87f6631a2a adjust Cache getHeader to prev. changes/commit 2016-12-18 01:02:56 +01:00
reger
0d2964cf2b expanded error message on rejected crawl url due to failed dns lookup
close of http://mantis.tokeek.de/view.php?id=678
2016-12-15 23:59:50 +01:00
luccioman
aa9ddf3c23 Added control over Robots.txt active threads maximum number.
When starting a crawl from a file containing thousands of links,
the configuration setting "crawler.MaxActiveThreads" is effective to prevent
saturating the system with too many outgoing HTTP connection threads
launched by the crawler.
But robots.txt loading was not affected by this setting and was indefinitely
increasing the number of concurrent loading threads until most of the
connections timed out.

To improve performance control, added a pool of threads for Robots.txt,
consistently used in its ensureExist() and massCrawlCheck() methods.
The Robots.txt threads pool max size can now be configured in the
/PerformanceQueues_p.html page, or with the new
"robots.txt.MaxActiveThreads" setting, initialized with the same default
value as the crawler.
2016-11-23 18:13:05 +01:00
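A minimal sketch of the pool described above, assuming a plain java.util.concurrent fixed-size executor; the setting name robots.txt.MaxActiveThreads is from the commit, while the class, default value and loading code are illustrative:

    // Sketch: bound concurrent robots.txt fetches with a fixed-size thread pool
    // instead of spawning an unbounded number of loader threads.
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class RobotsPoolSketch {
        public static void main(String[] args) {
            Properties config = new Properties();
            // illustrative default; the commit initializes it with the crawler's default value
            int maxThreads = Integer.parseInt(
                    config.getProperty("robots.txt.MaxActiveThreads", "200"));

            ExecutorService robotsPool = Executors.newFixedThreadPool(maxThreads);
            robotsPool.submit(() -> {
                // load and parse robots.txt for one host (omitted)
            });
            robotsPool.shutdown();
        }
    }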
reger
e0816ef2e5 use human readable date format in CrawlStacker error message
"double in: local index, oldDate = "
2016-11-05 19:40:14 +01:00
luccioman
f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
This makes thread monitoring easier to read.
2016-10-22 17:17:21 +02:00
luccioman
db3b9db9c2 Crawl from local file : faster task end when manually terminating crawl. 2016-10-22 09:11:20 +02:00
luccioman
47af33a04c Advanced Crawl from local file : better processing of large files.
Applied strategy : when there is no restriction on domains or
sub-path(s), stack anchor links as soon as they are discovered by the content
scraper instead of waiting for the complete parsing of the file.

This makes it possible to handle a crawling start file with thousands of
links in a reasonable amount of time.

Performance limitation : even if the crawl starts faster with a large
file, the content of the parsed file is still fully loaded into memory.
2016-10-21 13:03:31 +02:00
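A sketch of the strategy described above; the listener shape and names are assumptions, not YaCy's scraper API:

    // Illustrative sketch: push anchors to the crawl stacker as soon as the
    // scraper finds them, instead of stacking after the whole file is parsed.
    import java.net.URI;
    import java.util.List;
    import java.util.function.Consumer;

    public class StreamingAnchorSketch {

        // very reduced stand-in for a content scraper that reports anchors as it goes
        static void scrape(Iterable<String> hrefs, Consumer<URI> onAnchor) {
            for (String href : hrefs) {
                onAnchor.accept(URI.create(href)); // fire immediately, no buffering
            }
        }

        public static void main(String[] args) {
            // with no domain or sub-path restriction, every anchor can be stacked right away
            scrape(List.of("http://example.org/a", "http://example.org/b"),
                   uri -> System.out.println("stacking " + uri));
        }
    }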
luccioman
6f49ece22f Fixed redirected URLs processing as crawl start point.
See mantis 699 (http://mantis.tokeek.de/view.php?id=699) for details.
2016-10-20 12:12:26 +02:00
luccioman
7263d17436 Removed mentions of deprecated LURL-db.
Thanks to LA_FORGE asking about it on the YaCy forum (
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5895 )
2016-10-19 14:56:25 +02:00
luccioman
54cfcc3f56 CrawlCheck_p.html : also display info about disallowed URLs. 2016-10-12 11:26:59 +02:00
luccioman
8b341e9818 Robots : properly handle URLs including non-ASCII characters
This fixes GitHub issue 80 (
https://github.com/yacy/yacy_search_server/issues/80 ) reported by
Lord-Protector.
2016-10-12 11:25:36 +02:00
luccioman
dcdea2d02f Fixed shutdown for crawler.MaxActiveThreads value greater than 200
Shutdown was hanging in CrawlQueues.close() at
this.workerQueue.put(POISON_REQUEST) when config value
crawler.MaxActiveThreads was greater than 200.

Revealed by "Collision" Threads dumps in mantis 689
(http://mantis.tokeek.de/view.php?id=689#c1312)

Fixed consistency between this.worker.length and this.workerQueue
capacity, and made the process more reliable using the non-blocking offer()
function.
2016-09-29 10:33:11 +02:00
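A sketch of the idea behind the fix (queue size, worker count and timeout are illustrative): a blocking put() of poison requests into a bounded queue can hang when the queue is smaller than the number of workers, while offer() with a timeout cannot:

    // Sketch: shutting down workers by sending poison requests. With a bounded
    // queue shorter than the worker count, put() can block forever once the
    // queue is full; offer() with a timeout gives up instead of hanging.
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class PoisonRequestShutdownSketch {
        static final Object POISON_REQUEST = new Object();

        static void close(BlockingQueue<Object> workerQueue, int workerCount)
                throws InterruptedException {
            for (int i = 0; i < workerCount; i++) {
                // bounded wait instead of workerQueue.put(POISON_REQUEST)
                if (!workerQueue.offer(POISON_REQUEST, 1, TimeUnit.SECONDS)) {
                    break; // queue full and nobody consuming: stop instead of hanging
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            close(new ArrayBlockingQueue<>(200), 300); // returns instead of blocking forever
        }
    }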
luccioman
3ee4f56c39 Improved ErrorCache behavior when switching networks
Even after a network switch, ErrorCache was still holding a reference to
the previous Solr cores, thus becoming useless until the next YaCy restart.

Initial error cache filling with recent errors from the index was also
missing after the switch.
2016-09-22 09:07:07 +02:00
Michael Peter Christen
5e165a8150 removed unused imports 2016-09-06 18:46:24 +02:00
reger
7ab41d4ff1 use the directory's original last-modified date in the file- & smb-loader response 2016-07-09 19:55:47 +02:00
reger
708bcbb042 one more replacement to use cached hosthash vs. calculated 2016-07-07 02:50:57 +02:00
reger
22db449f2a to prevent the crawler from concurrently accessing and altering the same crawl queue
after restart, put the hosthash in the queue's filename (which is used as the primary
key for the crawl queue). Hint: the initial hosthash from the url and the hosthash
recalculated from just hostname:port are not the same.
fixes http://mantis.tokeek.de/view.php?id=668 (partially)
2016-07-05 23:22:35 +02:00
reger
8d58a48029 remove wrong log line in CrawlSwitchboard
+ don't allow CrawlSwitchboard to exit application
making network param unused
2016-07-02 20:33:23 +02:00
reger
a6ba1faa80 introduce a translation edit servlet Translator_p.html for YaCy's UI text translation
This is the 1st rudimentary approach to support the translation utilities.
It allows currently to edit untranslated text and save it in a local translation file
in the DATA/LOCALE directory.
+ refactor Translator (fewer statics) to leverage class overrides and support garbage collection for this one-time routine
+ adjust TranslatorXliff to check for local translations in DATA/LOCALE,
  this includes storing manually downloaded translation files in DATA as well 
  (to keep default untouched)
+ on 1st call of Translator_p a master translation file is generated, checking
the supported languages for missing translation text (later this masterfile is planned to be part of the distribution, to harmonize translation key text between the languages)
Outlook: the local modifications (possibly as translation fragments instead of a complete file) are to be shared with the maintainer using xliff features.
2016-06-03 01:46:30 +02:00
reger
eb2a00b1d8 fix NPE on missing crawldepth_i 2016-05-15 01:26:38 +02:00
reger
7be1c7a05a fix logger name 2016-04-17 03:20:14 +02:00
reger
7789c32c82 delete crawl queue on init exception
(happens occasionally on path name violation and will never get resolved)
2016-04-16 00:22:48 +02:00
reger
379e9b330d use the supplied url port to get robots.txt in the crawler's hostqueue 2016-03-02 00:12:34 +01:00
reger
06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if the result is a full index doc. Otherwise use the default loader method.
- The above brought up that the parser start url parameter, declared as AnchorURL, uses only methods of the parent object DigestURL (changed the parameter declaration accordingly).
2016-02-16 02:05:58 +01:00
sixcooler
5cb7ba0dc4 fix for connections not getting closed to get favicon.ico during search 2016-01-19 20:57:22 +01:00
Ryszard Goń
a98c395023 Add the Autocrawl thread 2016-01-14 00:50:23 +01:00
Ryszard Goń
1728cd30c6 Create autocrawl profiles 2016-01-12 16:28:34 +01:00
luc
571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
charset names.
2016-01-05 23:37:05 +01:00
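A small before/after example of this refactoring pattern; the constant avoids both the checked UnsupportedEncodingException path and typos in charset names:

    import java.io.UnsupportedEncodingException;
    import java.nio.charset.StandardCharsets;

    public class CharsetRefactorExample {
        public static void main(String[] args) throws UnsupportedEncodingException {
            byte[] before = "YaCy".getBytes("UTF-8");                // old style: string literal, checked exception
            byte[] after = "YaCy".getBytes(StandardCharsets.UTF_8);  // new style: compile-time safe constant
            System.out.println(before.length == after.length);
        }
    }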
reger
b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
otherwise use header.mime() differentiated in prev. commit.
2015-12-20 15:49:24 +01:00
Michael Peter Christen
d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	.classpath
2015-11-30 13:34:10 +01:00
reger
b5371ea8c1 read/init crawl queue in a thread
to speed up YaCy start with large existing crawler queues
2015-11-29 05:19:39 +01:00
reger
90686a75a2 fix flux factor (additional crawl delay by access count) calculation 2015-11-25 01:34:41 +01:00
luc
4af27289e5 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-23 09:01:25 +01:00
reger
297fdb60d3 throw exception if crawler hostqueue can't create hostpath directory.
In rare cases the hostname may not be a valid filesystem directory name,
which can't be created (e.g. containing a '*' char). To prevent the crawl queue
from looping on this invalid entry, a MalformedURLException is thrown.
2015-11-22 21:26:18 +01:00
luc
755efac17d Use same max file size when loading all resource bytes or opening stream
content
2015-11-20 19:35:39 +01:00
luc
f01d49c37a Process large or local file images dealing directly with content
InputStream.
2015-11-18 10:15:38 +01:00
luc
5bbb2e1730 Ensure resource is closed when reading a full file InputStream 2015-11-18 10:08:06 +01:00
reger
7a64bebb86 init Recrawl job chunk size to max crawl loader during job start, to use some system preferences
and to allow injection of recrawl urls before the queue is empty.
During recrawl the balancer often hangs on the very last urls, on hosts with a huge delay time;
by allowing earlier injection, progress is more balanced. The max number of crawl urls injected by the recrawl job is 2 * max loader.
2015-10-16 03:05:39 +02:00
reger
fb75fea446 use recrawljob w/o sort results by date
This is a workaround for an existing index (not fully reindexed since the intro of the schema with docvalues)
to prevent a solr exception causing the recrawljob to fail with
org.apache.solr.core.SolrCore java.lang.IllegalStateException: unexpected docvalues type NONE for field 'load_date_dt' (expected=NUMERIC). Use UninvertingReader or index with docvalues.
2015-10-04 05:43:40 +02:00
reger
43c27aa550 upd to solr/lucene 5.3.1 2015-10-03 23:20:33 +02:00
reger
98ab655917 on reindex delete index document with invalid url
if discovered
2015-09-12 23:06:13 +02:00
reger
367fe388b9 fix exception thrown after sendError in DefaultServlet
- reduce debug exception logs in crawler
2015-09-05 01:57:30 +02:00
Michael Peter Christen
8f90767889 fix for filesystem crawl 2015-08-11 00:42:26 +02:00
Michael Peter Christen
dbbad23e12 removed warnings 2015-08-03 05:37:34 +02:00
reger
fa08ca207e ! finish running crawls before applying !
Allow crawl urls up to 2048 characters
fix for http://mantis.tokeek.de/view.php?id=575
2015-08-03 00:49:24 +02:00
Michael Peter Christen
fbeae20b3a try a healing of the cache if the index file is corrupted 2015-07-27 15:16:08 +02:00
Michael Peter Christen
3c4c69adea fix for
- bad regex computation for crawl start from file (limitation on domain
did not work)
- servlet error when starting crawl from a large list of urls
2015-06-29 02:02:01 +02:00
Michael Peter Christen
9c12555be5 added link to Snapshots in search results if the snapshot exists and
option is set in ConfigSearchPage_p
(this is a stub: we also need a visualization of pdf files!)
2015-06-07 20:37:37 +02:00
reger
72f6a0b0b2 enhance recrawl job
- allow modifying the query to select documents to process (after the job has started)
- allow including failed urls (httpstatus <> 200)
2015-06-06 18:45:39 +02:00
Michael Peter Christen
197f7449e5 All entities of crawl profiles are now editable in the crawl profile
editor.
2015-05-28 16:07:40 +02:00
reger
3e742d1e34 Init remote crawler on demand
If the remote crawl option is not activated, skip the init of the remoteCrawlJob to save the resources of the queue and an idling thread.
Deployment of the remoteCrawlJob is deferred until the option is activated.
2015-05-23 02:06:39 +02:00
reger
cd7c0e0aae detail optimization of RecrawlThread 2015-05-17 00:13:00 +02:00
reger
ace71a8877 Initial (experimental) implementation of index update/re-crawl job
added to IndexReIndexMonitor_p.html
Selects existing documents from the index and feeds them to the crawler.
Currently only the field fresh_date_dt is used to determine documents for recrawl (fresh_date_dt:[* TO NOW-1DAY]).
Documents are added in small chunks (200) to the crawler, only if no other crawl is running.
2015-05-16 01:23:08 +02:00
reger
141cd80456 correct log msg text 2015-05-16 00:01:54 +02:00
Michael Peter Christen
97930a6aad added must-not-match filter to snapshot generation.
also: fixed some bugs
2015-05-08 13:46:27 +02:00
Ryszard Goń
ca1a70aec8 fix for Accept '?' URLs column in Crawl Profile List 2015-04-19 15:55:49 +02:00
Michael Peter Christen
fed26f33a8 enhanced timezone management for indexed data:
to support the new time parser and search functions in YaCy a high
precision detection of date and time on the day is necessary. That
requires that the time zone of the document content and the time zone of
the user, doing a search, is detected. The time zone of the search
request is done automatically using the browsers time zone offset which
is delivered to the search request automatically and invisible to the
user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must get a time zone offset passed, so
this required a change of the parser java api. A lot of other changes
have been made which correct the wrong handling of dates in YaCy, which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are UTC/GMT time zone, a
normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
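A sketch of what the offset in minutes means for normalization; the method names below are not YaCy's parser api, they only illustrate how a crawl-start offset turns a local timestamp found in a document into the UTC instant that every peer stores:

    // Sketch: normalize a document's local date to UTC using a crawl-start
    // time zone offset given in minutes.
    import java.time.Instant;
    import java.time.LocalDateTime;
    import java.time.ZoneOffset;

    public class TimezoneOffsetSketch {
        static Instant toUtc(LocalDateTime localFromDocument, int offsetMinutes) {
            ZoneOffset offset = ZoneOffset.ofTotalSeconds(offsetMinutes * 60);
            return localFromDocument.toInstant(offset); // stored value is UTC/GMT on all peers
        }

        public static void main(String[] args) {
            // a page stating "2015-04-15 13:17" crawled with a +120 minute offset is 11:17 UTC
            System.out.println(toUtc(LocalDateTime.of(2015, 4, 15, 13, 17), 120));
        }
    }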
Michael Peter Christen
3288489fd2 more logging during start-up 2015-04-11 13:00:32 +02:00
Michael Peter Christen
535f1ebe3b added a new way of content browsing in search results:
- date navigation

The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.

The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search results, a histogram
showing the number of documents for each day is displayed. To render
these histograms the morris.js library is used. Morris.js also requires
raphael.js, which is now also integrated in YaCy.

The histogram is now also displayed in the index browser by default.

To select a specific range from a search result, the following modifiers
had been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'to:' - modifier.

The histogram shows blue and green lines; the green lines denote weekend
days (Saturday and Sunday).

Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove the from and to modifiers and set an on:<date> for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).

The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
2015-03-02 04:30:10 +01:00
Michael Peter Christen
b5ac29c9a5 added a html field scraper which reads text from html entities of a
given css class and extends a given vocabulary with a term consisting
of the text content of the html class tag. Additionally, the term is
included into the semantic facet of the document. This allows the
creation of faceted search to documents without the pre-creation of
vocabularies; instead, the vocabulary is created on-the-fly, possibly
for use in other crawls. If any of the term scraping for a specific
vocabulary is successful on a document, this vocabulary is excluded for
auto-annotation on the page.

To use this feature, do the following:
- create a vocabulary on /Vocabulary_p.html (if not existent)
- in /CrawlStartExpert.html you will now see the vocabularies as column
in a table. The second column provides text fields where you can name
the class of html entities where the literal of the corresponding
vocabulary shall be scraped out
- when doing a search, you will see the content of the scraped fields in
a navigation facet for the given vocabulary
2015-01-30 13:20:56 +01:00
Michael Peter Christen
69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
split(",") was used
2015-01-29 01:46:22 +01:00
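The pattern behind this change, as a small example: the precompiled Pattern (CommonPattern.COMMA in YaCy) is compiled once and reused, whereas String.split(",") has no shared precompiled pattern and, for more complex separators, recompiles a regex on every call:

    import java.util.regex.Pattern;

    public class CommaSplitExample {
        // compiled once and reused, analogous to CommonPattern.COMMA
        private static final Pattern COMMA = Pattern.compile(",");

        public static void main(String[] args) {
            String csv = "a,b,c";
            String[] viaLiteral = csv.split(",");   // no shared precompiled pattern
            String[] viaPattern = COMMA.split(csv); // reuses the precompiled pattern
            System.out.println(viaLiteral.length + " " + viaPattern.length);
        }
    }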
Michael Peter Christen
bee5ee7cce removed some warnings 2015-01-27 17:00:20 +01:00
Michael Peter Christen
783cf6fbc7 the LinkedBlockingQueue is much faster than the ArrayBlockingQueue
(strange but this is the result of a test:
ArrayBlockingQueue: 39461 lines / second;
LinkedBlockingQueue: 60774 lines / second)
2015-01-27 16:53:09 +01:00
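A rough single-producer/single-consumer sketch of the kind of comparison behind those numbers; this is not the original test, and absolute throughput will differ per machine:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class QueueThroughputSketch {
        // push 'lines' items through the queue and return lines per second
        static long measure(BlockingQueue<String> queue, int lines) throws InterruptedException {
            Thread producer = new Thread(() -> {
                try {
                    for (int i = 0; i < lines; i++) queue.put("line " + i);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            long start = System.nanoTime();
            producer.start();
            for (int i = 0; i < lines; i++) queue.take();
            producer.join();
            return lines * 1_000_000_000L / (System.nanoTime() - start);
        }

        public static void main(String[] args) throws InterruptedException {
            int lines = 1_000_000;
            System.out.println("ArrayBlockingQueue:  " + measure(new ArrayBlockingQueue<>(1024), lines) + " lines/s");
            System.out.println("LinkedBlockingQueue: " + measure(new LinkedBlockingQueue<>(), lines) + " lines/s");
        }
    }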
Michael Peter Christen
7db2888336 fixed font size and print page generation in pdf snapshots 2015-01-20 17:14:14 +01:00
Michael Peter Christen
3e6c3e2237 documents pushed over the api/push_p.html interface will have their
unique flag set by default
2015-01-06 15:22:59 +01:00
Michael Peter Christen
8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
parsing into individual pages and add them all using different URLs.
These constructed urls are generated from the source url with an
appended page=<pagenumber> attribute to the url get/post properties.
This will distinguish the different page entries. The search result list
will then replace the post parameter with a url anchor # mark which
causes that the original url is presented in the search result. These
URLs can be opened directly on the correct page using pdf.js which is
now built-in into firefox. That means: if you find a search hit on page
5 and click on the search result, firefox will open the pdf viewer and
shows page 5.
2014-12-21 18:10:15 +01:00
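A sketch of the URL handling described above; the exact anchor form YaCy uses is not spelled out in the commit, so the #page=<n> fragment (which pdf.js understands) is an assumption:

    // Sketch: each pdf page gets its own pseudo-URL with a page=<n> parameter for
    // indexing; for display the parameter is turned into a # anchor so the original
    // url is shown and the viewer can jump to that page.
    public class PdfPageUrlSketch {
        static String indexUrl(String sourceUrl, int page) {
            return sourceUrl + (sourceUrl.contains("?") ? "&" : "?") + "page=" + page;
        }

        static String displayUrl(String indexUrl) {
            return indexUrl.replaceFirst("[?&]page=(\\d+)$", "#page=$1");
        }

        public static void main(String[] args) {
            String u = indexUrl("http://example.org/doc.pdf", 5);
            System.out.println(u);             // http://example.org/doc.pdf?page=5
            System.out.println(displayUrl(u)); // http://example.org/doc.pdf#page=5
        }
    }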
Michael Peter Christen
28683530cd fixes to usage of no-cache: use and recognize also the no-store
directive
2014-12-19 17:37:58 +01:00
Michael Peter Christen
932faafffe reactivated on-demand snapshot loading 2014-12-16 12:09:57 +01:00
Michael Peter Christen
2362ad7c34 fix for a count issue in snapshot api 2014-12-16 11:33:30 +01:00
Michael Peter Christen
9971e197e0 Added a transaction interface to the snapshots: all documents in the
snapshots can now be processed with transactions using commit and
rollback commands. Furthermore, a large number of monitoring methods had
been added to check the success of transactions.

The transactions for snapshots have two main components: a rss search
API to get information about latest/oldest entries and a commit/rollback
API to move entries away from the rss results. This is done by usage of
two storage locations for the snapshots, INVENTORY and ARCHIVE. New
snapshots are placed in INVENTORY, committed snapshots move to ARCHIVE,
rollback snapshots move to INVENTORY again.

Normal Workflow:
Beside all these options below, usually it is sufficient to process data
like this:
- call
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
- process the rss result and use the <guid> value as <urlhash> (see next
command)
- for each processed result call
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
- then you can call the rss feed again and the committed urls are omitted
from the next set of items.

These are the commands to control this:
The rss feed:
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST

The feed will return a <urlhash> in the <guid> - field of the rss. This
must be used for commit/rollback:

Commit/Rollback:
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash>
The json will return a property list containing the property "result"
with possible values "success" or "fail", according to the result. If a
"fail" occurs, please look into the log for further info.

Monitoring:
http://localhost:8090/api/snapshot.json?command=status
This shows the total number of entries in the INVENTORY and the ARCHIVE 
http://localhost:8090/api/snapshot.json?command=list
This will return a list of all hosts which have snapshots and the number
of entries for the hosts. Counts for INVENTORY and ARCHIVE are listed in
the properties "count.INVENTORY" and "count.ARCHIVE"
http://localhost:8090/api/snapshot.json?command=list&depth=2
The list can be restricted to those which have a specific depth. The list
then contains the same host names, but the count values change because
only documents at that specific crawl depth are listed
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80
This lists all urlhashes for the given host, not only an accumulated
list of the number of entries
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0
This restricts the list of urlhashes for that host for the given depth
http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY
http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE
This selects either the INVENTORY or ARCHIVE for all list commands,
default is ALL which means that from both snapshot directories the host
information is collected and combined. You can use the state option for
all the commands as listed above

Detailed Information:
http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ
This collects metadata information for the given urlhash. This can also
be restricted with state=INVENTORY and state=ARCHIVE to test if the
document is either in one of these snapshot directories. If an urlhash
is not found, an empty result is returned. If an entry was found and the
state was not restricted, then the result contains a state property
containing the name of the location where the document is, either
INVENTORY or ARCHIVE.

Hint:
If a very large number of documents is inside of INVENTORY, then it
could be better to call the rss feed with
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
because that is very efficient.
2014-12-15 23:32:46 +01:00
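A sketch of the "Normal Workflow" above as a client loop, using the URLs listed in the commit; the regex-based guid extraction is only for illustration (a real client would use an XML parser), and any authentication the servlet may require is omitted:

    // Sketch of the workflow described above: read the INVENTORY rss feed, treat
    // each <guid> as a urlhash, process it, then commit it so it disappears from
    // the next feed result.
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SnapshotCommitSketch {
        static final String BASE = "http://localhost:8090/api/";

        static String fetch(String url) throws Exception {
            try (InputStream in = new URL(url).openStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }

        public static void main(String[] args) throws Exception {
            String rss = fetch(BASE + "snapshot.rss?state=INVENTORY&order=LATESTFIRST");
            Matcher guid = Pattern.compile("<guid[^>]*>([^<]+)</guid>").matcher(rss);
            while (guid.find()) {
                String urlhash = guid.group(1);
                // ... process the snapshot for this urlhash ...
                String result = fetch(BASE + "snapshot.json?command=commit&urlhash=" + urlhash);
                System.out.println(urlhash + " -> " + result);
            }
        }
    }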
Michael Peter Christen
66b5a56976 Added and integrated new date detection class which can identify date
notions within the fulltext of a document. This class attempts to
also identify dates given abbreviated, with a missing year, or described
with names for special days, like 'Halloween'. In case that a date has
no year given, the current year and following years are considered.

This process is therefore able to identify a large set of dates for a
document, either because there are several dates given in the document
or the date is ambiguous. Four new Solr fields are used to store the
parsing result:

dates_in_content_sxt:
if date expressions can be found in the content, these dates are listed
here in order of the appearances

dates_in_content_count_i:
the number of entries in dates_in_content_sxt

date_in_content_min_dt:
if dates_in_content_sxt is filled, this contains the oldest date from
the list of available dates

#date_in_content_max_dt:
if dates_in_content_sxt is filled, this contains the youngest date from
the list of available dates, that may also be possibly in the future

These fields are deactivated by default because the evaluation of
regular expressions to detect the date is yet too CPU intensive. Maybe
future enhancements will allow this to be switched on by default.

The purpose of these fields is the creation of calendar-like search
facets, to be implemented next.
2014-12-14 13:40:45 +01:00
Michael Peter Christen
ab6cc3c88c added concurrent generation of snapshot pdfs 2014-12-10 14:10:05 +01:00
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory along
the pdf and jpg images
- a transaction layer was placed above of the snapshot directory to
distinguish snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments using archived solr search results
between peers. This is currently unfinished, we need a protocol to move
snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and contains now two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only by peers
running on a server with wkhtmltopdf installed. The expert crawl start
provides the option for snapshots to everyone. PDF snapshots are now
optional and the option is only shown if wkhtmltopdf is installed.
- the snapshot api now provides the request for historised xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such xml files is identical with solr search results with
only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down the process a lot and
a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
Michael Peter Christen
4fe4bf29ad added rss feed output to snapshot servlet which can be used to get a
list of latest/oldest entries in the snapshot database. This is an
example:
http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100

The properties depth, order, host and maxcount can be omitted. The
meaning of the fields are:
host: select only urls from this host or all, if not given
depth: select only urls at that crawl depth or all, if not given
maxcount: select at most the given number of urls or 10, if not given
order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to
select the first entries or ANY to select any

The rss feed needs administration rights to work; a call to this servlet
with the rss extension must attach login credentials.
2014-12-06 00:25:05 +01:00
reger
568c991405 remove the unused Request variable
(fix of prev. commit)
2014-12-05 03:03:28 +01:00
reger
ff18129def ViewFile servlet: update index if newer,
so viewed text and metadata (stored) info is similar
- to archive it, use a request with a profile that allows indexing (defaultglobaltext) and update the index
   (the resource is loaded and parsed anyway, so it's not an expensive operation)

Request: remove 2 unused init parameters
- number of anchors of the parent
- forkfactor sum of anchors of all ancestors
2014-12-05 01:13:37 +01:00
Michael Peter Christen
226aea5914 added a servlet which can create preview images, preview thumbnails and
preview pdfs from web pages, i.e.:
http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/

This also supports on-the-fly generation of the preview documents if
the user is an administrator. Otherwise, the servlet fails.
To enable this, you must add wkhtmltopdf, imagemagick and (on headless
servers) xvfb to your operating system.

for detailed instructions, see
97f6089a41
2014-12-03 11:45:48 +01:00
Michael Peter Christen
e586e423aa in case that loading from the cache fails, load from wkhtmltopdf without
cache using the user agent string given in the crawl profile
2014-12-02 13:35:19 +01:00
Michael Peter Christen
25a64c51b3 moved snapshot generation out of the html handler to prevent that
existing cache entries cause that the handler is not executed
2014-12-01 17:37:25 +01:00
Michael Peter Christen
97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
be transcoded into jpg for image previews. To create such pdfs you must
do:

Add wkhtmltopdf and imagemagick to your OS, which you can do:
On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from
http://wkhtmltopdf.org/downloads.html and download
http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip
In Debian do "apt-get install wkhtmltopdf imagemagick"

Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and
"Always Fresh" - this is used by wkhtmltopdf to fetch web pages using
the YaCy proxy. Using "Always Fresh" it is possible to get all pages
from the proxy cache.

Finally, you will see a new option when starting an expert web crawl.
You can set a maximum depth for crawling which should cause a pdf
generation. The resulting pdfs are then available in
DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf
2014-12-01 15:03:09 +01:00
Michael Peter Christen
ad0da5f246 added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
2014-11-29 11:56:32 +01:00
Michael Peter Christen
84763126e0 added option to make the YaCy proxy act as if the cache is never stale. If
set to 'Always Fresh' the cache is always used if the entry in the cache
exist. This is a good way to archive web content and access it without
going online again in case the documents exist.
To do so, open /Settings_p.html?page=ProxyAccess and check the "Always
Fresh" checkbox.
This is set to false by default, which behaves as before.
If you set this to true, then you have your web archive in DATA/HTCACHE.
Copy this to carry around your private copy of the internet!
2014-11-24 20:28:52 +01:00
Michael Peter Christen
a39419f2ef more stacks shall be considered for on-demand loading, not only
deep-depth stacks, to prevent the "too many open files" problem
2014-11-23 20:11:23 +01:00
Michael Peter Christen
5bb52f79be reduce number of calls to queue.size() because that may be a bottleneck
during crawling
2014-11-23 20:09:32 +01:00
Michael Peter Christen
a34f837592 better delete all files in path when removing host crawl stack 2014-11-22 12:09:07 +01:00
Michael Peter Christen
10b1db430a if we have many hosts, use on-demand earlier 2014-11-22 12:04:04 +01:00
Michael Peter Christen
6983dff334 explain crawl denial when not switched to intranet mode 2014-10-11 09:02:12 +02:00
Michael Peter Christen
d8beafba3a fix for values in CrawlProfileEditor table and xml; now the full profile
is available in the xml.
2014-10-09 13:27:20 +02:00
Michael Peter Christen
ec95dfa2e6 fixed crawl profile xml result which did not show the correct crawl
status.
2014-10-08 18:48:57 +02:00
Michael Peter Christen
9b1958e8ca more ipv6 bugfixes 2014-10-08 15:21:49 +02:00
Michael Peter Christen
e1bc768f9d more IPv6 bugfixes 2014-10-06 17:44:27 +02:00