Commit Graph

290 Commits

Author SHA1 Message Date
reger
8d58a48029 remove wrong log line in CrawlSwitchboard
+ don't allow CrawlSwitchboard to exit the application,
making the network parameter unused
2016-07-02 20:33:23 +02:00
reger
a6ba1faa80 introduce a translation edit servlet Translator_p.html for YaCy's UI text translation
This is a first rudimentary approach to supporting the translation utilities.
It currently allows editing untranslated text and saving it in a local translation file
in the DATA/LOCALE directory.
+ refactor Translator (fewer statics) to leverage class overrides and support garbage collection for this one-time routine
+ adjust TranslatorXliff to check for local translations in DATA/LOCALE;
  this includes storing manually downloaded translation files in DATA as well 
  (to keep the defaults untouched)
+ on the first call of Translator_p a master translation file is generated, checking
the supported languages for missing translation text (later this master file is planned to be part of the distribution, to harmonize translation key text between the languages)
Outlook: the local modifications (possibly as translation fragments instead of complete files) are to be shared with maintainers using XLIFF features.
2016-06-03 01:46:30 +02:00
reger
eb2a00b1d8 fix NPE on missing crawldepth_i 2016-05-15 01:26:38 +02:00
reger
7be1c7a05a fix logger name 2016-04-17 03:20:14 +02:00
reger
7789c32c82 delete crawl queue on init exception
(happens occasionally on a path name violation and will never get resolved)
2016-04-16 00:22:48 +02:00
reger
379e9b330d use the supplied URL port to get robots.txt in the crawler's hostqueue 2016-03-02 00:12:34 +01:00
reger
06d0e2aeb9 result heuristic (also used in greedy learning mode) now uses outbound links if the result is a full index doc; otherwise the default loader method is used.
- The above revealed that the parser start URL parameter, declared as AnchorURL, uses only methods of the parent class DigestURL (the parameter declaration was changed accordingly).
2016-02-16 02:05:58 +01:00
sixcooler
5cb7ba0dc4 fix for connections used to get favicon.ico not getting closed during search 2016-01-19 20:57:22 +01:00
Ryszard Goń
a98c395023 Add the Autocrawl thread 2016-01-14 00:50:23 +01:00
Ryszard Goń
1728cd30c6 Create autocrawl profiles 2016-01-12 16:28:34 +01:00
luc
571bc55937 Refactoring: use StandardCharsets constants instead of hard-coded
charset names.
2016-01-05 23:37:05 +01:00
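
A minimal before/after sketch of this kind of refactoring (the call site is a generic example, not one of the actual YaCy ones):

    import java.nio.charset.StandardCharsets;

    public class CharsetExample {
        public static void main(String[] args) {
            // Before: "Hello YaCy".getBytes("UTF-8") -- typo-prone and forces
            // handling of the checked UnsupportedEncodingException.
            // After: the compile-time constant needs no exception handling.
            byte[] bytes = "Hello YaCy".getBytes(StandardCharsets.UTF_8);
            System.out.println(new String(bytes, StandardCharsets.UTF_8));
        }
    }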
reger
b7e8358645 make use of header.getContentType() where possible (the MIME type is normalized afterwards);
otherwise use header.mime(), as differentiated in the previous commit.
2015-12-20 15:49:24 +01:00
Michael Peter Christen
d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	.classpath
2015-11-30 13:34:10 +01:00
reger
b5371ea8c1 read/init crawl queue in a thread
to speed up YaCy startup on large existing crawler queues
2015-11-29 05:19:39 +01:00
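
A minimal sketch of the pattern this describes, deferring the slow queue I/O to a background thread so startup continues immediately; the names are illustrative, not YaCy's:

    public class QueueLoaderSketch {
        public static void main(String[] args) {
            Thread loader = new Thread(() -> {
                // slow disk I/O: read and initialize the crawl queue here
                System.out.println("crawl queue initialized");
            }, "crawlQueueLoader");
            loader.setDaemon(true); // don't block JVM shutdown
            loader.start();
            // startup continues here while the queue loads in the background
        }
    }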
reger
90686a75a2 fix flux factor (additional crawl delay by access count) calculation 2015-11-25 01:34:41 +01:00
luc
4af27289e5 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-23 09:01:25 +01:00
reger
297fdb60d3 throw an exception if the crawler hostqueue can't create the hostpath directory.
In rare cases the hostname may not be a valid filesystem directory name,
which can't be created (e.g. when it contains a '*' char). Throwing a
MalformedURLException prevents the crawl queue from looping on this invalid entry.
2015-11-22 21:26:18 +01:00
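
A hedged sketch of such a guard; the directory layout and method name are assumptions, not the actual HostQueue code:

    import java.io.File;
    import java.net.MalformedURLException;

    public class HostPathGuard {
        static File hostPathOrThrow(File queueBase, String hostName) throws MalformedURLException {
            File hostPath = new File(queueBase, hostName);
            // A hostname such as "foo*bar" is not a valid directory name on
            // some filesystems, so mkdirs() fails forever; fail fast instead
            // of letting the crawl queue loop on the entry.
            if (!hostPath.exists() && !hostPath.mkdirs()) {
                throw new MalformedURLException("cannot create host queue directory: " + hostPath);
            }
            return hostPath;
        }
    }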
luc
755efac17d Use same max file size when loading all resource bytes or opening stream
content
2015-11-20 19:35:39 +01:00
luc
f01d49c37a Process large or local image files by dealing directly with the content
InputStream.
2015-11-18 10:15:38 +01:00
luc
5bbb2e1730 Ensure resource is closed when reading a full file InputStream 2015-11-18 10:08:06 +01:00
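
The standard Java idiom for this is try-with-resources, which closes the stream even when reading fails midway; a generic sketch, not the exact YaCy code:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ReadAllSketch {
        static byte[] readAll(String path) throws IOException {
            try (InputStream in = Files.newInputStream(Paths.get(path))) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray(); // stream is closed on every exit path
            }
        }
    }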
reger
7a64bebb86 init recrawl job chunk size to the max crawl loader setting during job start, to use some system preferences,
and allow injection of recrawl URLs before the queue is empty.
During recrawl the balancer often hangs on the very last URLs on hosts with huge delay times;
allowing earlier injection makes progress more balanced. The max number of crawl URLs injected by the recrawl job is 2 * max loader.
2015-10-16 03:05:39 +02:00
reger
fb75fea446 use recrawl job w/o sorting results by date
This is a workaround for an existing index (not fully reindexed since the introduction of the schema with docvalues),
to prevent a Solr exception causing the recrawl job to fail with
org.apache.solr.core.SolrCore java.lang.IllegalStateException: unexpected docvalues type NONE for field 'load_date_dt' (expected=NUMERIC). Use UninvertingReader or index with docvalues.
2015-10-04 05:43:40 +02:00
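
A minimal SolrJ sketch of the workaround; only the field name and exception text come from the commit, the rest is illustrative:

    import org.apache.solr.client.solrj.SolrQuery;

    public class RecrawlQuerySketch {
        public static void main(String[] args) {
            SolrQuery q = new SolrQuery("*:*");
            // q.addSort("load_date_dt", SolrQuery.ORDER.desc);
            // ^ on an index without docvalues for load_date_dt this raises
            //   "unexpected docvalues type NONE"; the workaround is simply
            //   to omit the sort and take documents in index order.
            System.out.println(q);
        }
    }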
reger
43c27aa550 upd to solr/lucene 5.3.1 2015-10-03 23:20:33 +02:00
reger
98ab655917 on reindex, delete index documents with invalid URLs
if discovered
2015-09-12 23:06:13 +02:00
reger
367fe388b9 fix exception thrown after sendError in DefaultServlet
- reduce debug exception logs in the crawler
2015-09-05 01:57:30 +02:00
Michael Peter Christen
8f90767889 fix for filesystem crawl 2015-08-11 00:42:26 +02:00
Michael Peter Christen
dbbad23e12 removed warnings 2015-08-03 05:37:34 +02:00
reger
fa08ca207e ! finish running crawls before applying !
Allow crawl URLs up to 2048 characters 
fix for http://mantis.tokeek.de/view.php?id=575
2015-08-03 00:49:24 +02:00
Michael Peter Christen
fbeae20b3a try a healing of the cache if the index file is corrupted 2015-07-27 15:16:08 +02:00
Michael Peter Christen
3c4c69adea fix for
- bad regex computation for crawl start from file (limitation on domain
did not work)
- servlet error when starting crawl from a large list of urls
2015-06-29 02:02:01 +02:00
Michael Peter Christen
9c12555be5 added link to Snapshots in search results if the snapshot exists and
option is set in ConfigSearchPage_p
(this is a stub: we also need a visualization of pdf files!)
2015-06-07 20:37:37 +02:00
reger
72f6a0b0b2 enhance recrawl job
- allow modifying the query that selects documents to process (after the job has started)
- allow including failed URLs (HTTP status <> 200)
2015-06-06 18:45:39 +02:00
Michael Peter Christen
197f7449e5 All entities of crawl profiles are now editable in the crawl profile
editor.
2015-05-28 16:07:40 +02:00
reger
3e742d1e34 Init remote crawler on demand
If the remote crawl option is not activated, skip init of the remoteCrawlJob to save the resources of its queue and idling thread.
Deployment of the remoteCrawlJob is deferred until the option is activated.
2015-05-23 02:06:39 +02:00
reger
cd7c0e0aae detail optimization of RecrawlThread 2015-05-17 00:13:00 +02:00
reger
ace71a8877 Initial (experimental) implementation of an index update/re-crawl job
added to IndexReIndexMonitor_p.html.
Selects existing documents from the index and feeds them to the crawler.
Currently only the field fresh_date_dt is used to determine documents for recrawl (fresh_date_dt:[* TO NOW-1DAY]).
Documents are added to the crawler in small chunks (200), and only if no other crawl is running.
2015-05-16 01:23:08 +02:00
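
A minimal SolrJ sketch of the described selection, reusing the query string and chunk size from the commit message:

    import org.apache.solr.client.solrj.SolrQuery;

    public class RecrawlSelectSketch {
        public static void main(String[] args) {
            // Select documents whose fresh_date_dt is older than one day
            // and fetch them in chunks of 200 for the crawler.
            SolrQuery q = new SolrQuery("fresh_date_dt:[* TO NOW-1DAY]");
            q.setRows(200);
            System.out.println(q);
        }
    }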
reger
141cd80456 correct log msg text 2015-05-16 00:01:54 +02:00
Michael Peter Christen
97930a6aad added must-not-match filter to snapshot generation.
also: fixed some bugs
2015-05-08 13:46:27 +02:00
Ryszard Goń
ca1a70aec8 fix for Accept '?' URLs column in Crawl Profile List 2015-04-19 15:55:49 +02:00
Michael Peter Christen
fed26f33a8 enhanced timezone management for indexed data:
to support the new time parser and search functions in YaCy, a high-precision
detection of date and time on the day is necessary. That
requires that the time zone of the document content and the time zone of
the user doing a search are detected. The time zone of the search
request is detected automatically using the browser's time zone offset, which
is delivered with the search request automatically and invisibly to the
user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must be passed a time zone offset, so
this required a change of the parser Java API. A lot of other changes
were made to correct the wrong handling of dates in YaCy, which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are in the UTC/GMT time zone, a
normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
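
A hedged sketch of normalizing a parsed date with a minute-based offset, as the new crawl start input field provides; the helper and its sign convention are illustrative, not the actual parser API:

    import java.util.Date;

    public class TimezoneSketch {
        // Hypothetical helper: shift a locally parsed time to UTC given an
        // offset in minutes. Note the sign convention is an assumption; the
        // browser's getTimezoneOffset(), e.g., reports UTC+2 as -120.
        static Date toUTC(Date local, int offsetMinutesAheadOfUTC) {
            return new Date(local.getTime() - offsetMinutesAheadOfUTC * 60000L);
        }

        public static void main(String[] args) {
            System.out.println(toUTC(new Date(), 120)); // a UTC+2 document time
        }
    }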
Michael Peter Christen
3288489fd2 more logging during start-up 2015-04-11 13:00:32 +02:00
Michael Peter Christen
535f1ebe3b added a new way of content browsing in search results:
- date navigation

The date is taken from the CONTENT of the documents / web pages, NOT
from a date submitted in the context of metadata (i.e. http header or
html head form). This makes it possible to search for documents in the
future, i.e. when documents contain event descriptions for future
events.

The date is written to an index field which is now enabled by default.
All documents are scanned for contained date mentions.
To visualize the dates for a specific search result, a histogram
showing the number of documents for each day is displayed. To render
these histograms the morris.js library is used. Morris.js also requires
raphael.js, which is now integrated in YaCy as well.

The histogram is now also displayed in the index browser by default.

To select a specific range from a search result, the following modifiers
had been introduced:
from:<date>
to:<date>
These modifiers can be used separately (i.e. only 'from' or only 'to')
to describe an open interval or combined to have a closed interval. Both
dates are inclusive. To select a specific single date only, use the
'to:' modifier.

The histogram shows blue and green lines; the green lines denote weekend
days (Saturday and Sunday).

Clicking on bars in the histogram has the following reaction:
1st click: add a from:<date> modifier for the date of the bar
2nd click: add a to:<date> modifier for the date of the bar
3rd click: remove the from and to modifiers and set an on:<date> for the bar
When the on:<date> modifier is used, the histogram shows an unlimited
time period. This makes it possible to click again (4th click) which is
then interpreted as a 1st click again (sets a from modifier).

The display feature is NOT switched on by default; to switch it on use
the /ConfigSearchPage_p.html servlet.
2015-03-02 04:30:10 +01:00
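
Example queries using these modifiers (the exact date syntax is an assumption; the commit does not show it):

    conference from:2015-01-01 to:2015-03-31   (closed interval, both days inclusive)
    conference from:2015-01-01                 (open interval)
    conference on:2015-02-14                   (one single day)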
Michael Peter Christen
b5ac29c9a5 added an html field scraper which reads text from html entities of a
given css class and extends a given vocabulary with a term consisting
of the text content of the html class tag. Additionally, the term is
included in the semantic facet of the document. This allows the
creation of faceted search on documents without the pre-creation of
vocabularies; instead, the vocabulary is created on-the-fly, possibly
for use in other crawls. If any of the term scraping for a specific
vocabulary is successful on a document, this vocabulary is excluded from
auto-annotation on the page.

To use this feature, do the following:
- create a vocabulary on /Vocabulary_p.html (if not existent)
- in /CrawlStartExpert.html you will now see the vocabularies as columns
in a table. The second column provides text fields where you can name
the class of html entities from which the literal of the corresponding
vocabulary shall be scraped
- when doing a search, you will see the content of the scraped fields in
a navigation facet for the given vocabulary
2015-01-30 13:20:56 +01:00
Michael Peter Christen
69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
split(",") was used
2015-01-29 01:46:22 +01:00
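
A minimal sketch of the pattern; CommonPattern.COMMA is YaCy's constant, the class around it here is illustrative:

    import java.util.regex.Pattern;

    public class SplitSketch {
        // One shared, precompiled Pattern instead of passing the regex
        // string to String.split() at every call site.
        private static final Pattern COMMA = Pattern.compile(",");

        public static void main(String[] args) {
            String[] parts = COMMA.split("a,b,c");
            System.out.println(parts.length); // prints 3
        }
    }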
Michael Peter Christen
bee5ee7cce removed some warnings 2015-01-27 17:00:20 +01:00
Michael Peter Christen
783cf6fbc7 the LinkedBlockingQueue is much faster than the ArrayBlockingQueue
(strange, but this is the result of a test:
ArrayBlockingQueue: 39461 lines / second;
LinkedBlockingQueue: 60774 lines / second)
2015-01-27 16:53:09 +01:00
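
A rough micro-benchmark sketch of such a comparison; the commit does not show the original test setup, so this single-threaded put/take loop is only an assumed shape:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class QueueBench {
        static long linesPerSecond(BlockingQueue<String> q, int n) throws InterruptedException {
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) {
                q.put("line"); // enqueue one line ...
                q.take();      // ... and dequeue it again
            }
            return n * 1_000_000_000L / (System.nanoTime() - start);
        }

        public static void main(String[] args) throws InterruptedException {
            int n = 1_000_000;
            System.out.println("ArrayBlockingQueue:  " + linesPerSecond(new ArrayBlockingQueue<String>(1024), n));
            System.out.println("LinkedBlockingQueue: " + linesPerSecond(new LinkedBlockingQueue<String>(), n));
        }
    }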
Michael Peter Christen
7db2888336 fixed font size and print page generation in pdf snapshots 2015-01-20 17:14:14 +01:00
Michael Peter Christen
3e6c3e2237 documents pushed over the api/push_p.html interface will have their
unique flag set by default
2015-01-06 15:22:59 +01:00
Michael Peter Christen
8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during
parsing into individual pages and add them all using different URLs.
These constructed URLs are generated from the source URL with an
appended page=<pagenumber> attribute in the URL get/post properties.
This distinguishes the different page entries. The search result list
then replaces the post parameter with a URL anchor # mark, which
causes the original URL to be presented in the search result. These
URLs can be opened directly on the correct page using pdf.js, which is
now built into Firefox. That means: if you find a search hit on page
5 and click on the search result, Firefox will open the pdf viewer and
show page 5.
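
A hedged sketch of the URL construction described, appending a page=<n> GET parameter; the helper is hypothetical, not the parser's actual code:

    import java.net.MalformedURLException;
    import java.net.URL;

    public class PdfPageUrlSketch {
        static URL pageUrl(URL source, int page) throws MalformedURLException {
            String sep = source.getQuery() == null ? "?" : "&";
            return new URL(source.toExternalForm() + sep + "page=" + page);
        }

        public static void main(String[] args) throws MalformedURLException {
            URL doc = new URL("http://example.com/doc.pdf");
            System.out.println(pageUrl(doc, 5)); // http://example.com/doc.pdf?page=5
        }
    }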
Michael Peter Christen
28683530cd fixes to the usage of no-cache: also use and recognize the no-store
directive
2014-12-19 17:37:58 +01:00
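
A hedged sketch of the recognition logic; the helper is hypothetical, not YaCy's actual cache decision code:

    import java.util.Locale;

    public class CacheControlSketch {
        // Treat no-store like no-cache when deciding whether a response
        // may be cached (illustrative only).
        static boolean isCacheable(String cacheControlHeader) {
            if (cacheControlHeader == null) return true;
            String cc = cacheControlHeader.toLowerCase(Locale.ROOT);
            return !cc.contains("no-cache") && !cc.contains("no-store");
        }

        public static void main(String[] args) {
            System.out.println(isCacheable("no-store")); // false
        }
    }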