yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
Michael Peter Christen	6244b084cd	fixed wrong order of result count values	2012-11-07 02:29:33 +01:00
Michael Peter Christen	15d1460b40	added information about the reason of pausing of crawls	2012-11-06 15:21:56 +01:00
Michael Peter Christen	2371ef031c	added solr faceted search support to YaCy search results; added solr highlighting / YaCy snippets to YaCy search results - facets are now much more complete - facets are computed and searched much faster - snippet computation is done by solr if solr knows the snippet	2012-11-06 14:32:08 +01:00
Michael Peter Christen	791e1dcfdf	when a new crawl is started, delete all entries about error-urls for crawl-start domains	2012-11-05 22:14:27 +01:00
Michael Peter Christen	5e77801aac	update to web interface structure	2012-11-05 15:23:03 +01:00
orbiter	354ef8000d	- added 'deleteold' option to crawler which causes documents selected by a crawl filter (host or subpath) to be deleted - site crawl uses this option by default now - made option to deleteDomain() concurrency	2012-11-04 02:58:26 +01:00
Michael Peter Christen	f8f05ecba7	- added a delete button in host browser to delete a complete subpath - removed storage of default collection name - default is now "user" - made stacking of crawl start points concurrently	2012-10-31 17:44:45 +01:00
Michael Peter Christen	ac9540dfb6	removed options for stopwords which are not used	2012-10-30 12:36:36 +01:00
Michael Peter Christen	85ca07b90e	when a new crawl is started, an equal crawl, if still running, is terminated and the corresponding crawl profile is deleted (this also clears the crawl queue entries for that crawl profile)	2012-10-25 10:20:55 +02:00
Michael Peter Christen	ae6feb5610	showing the web structure graph as animation in the crawl monitor	2012-10-23 02:50:26 +02:00
Michael Peter Christen	21fe8339b4	- enhanced generation of url objects - enhanced computation of link structure graphics - enhanced collection of data for link structures	2012-10-15 13:17:13 +02:00
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	2012-10-10 11:46:22 +02:00
Michael Peter Christen	53789555b9	fix for crawl start filter	2012-10-10 10:40:32 +02:00
Michael Peter Christen	abebb3b124	added a crawl start checker which makes a simple analysis on the list of all given urls: shows if the url can be loaded and if there is a robots and/or a sitemap.	2012-10-10 02:02:17 +02:00
orbiter	ae246c30c3	fixed interpretation of directDocByURL attribute during crawl start	2012-10-09 23:11:31 +02:00
sixcooler	c65b576a6f	added filename for missing crawlname when crawling from file	2012-09-26 14:05:33 +02:00
Michael Peter Christen	1533bfd63b	refactoring	2012-09-25 21:20:03 +02:00
Michael Peter Christen	00c1c777fa	refactoring	2012-09-21 15:48:16 +02:00
orbiter	60b1e23f05	added new crawl options: - indexUrlMustMatch and indexUrlMustNotMatch which can be used to select loaded pages for indexing. Default patterns are set so that all loaded pages are also indexed (as before) but when doing an expert crawl start, the user may select only specific urls to be indexed. - crawlerNoDepthLimitMatch is a new pattern that can be used to remove the crawl depth limitation. This filter is a never-match by default (which means that the depth limit is used) but the user can select paths which will be loaded completely even if the crawl depth is reached.	2012-09-16 21:27:55 +02:00
Michael Peter Christen	6ec02deec6	added new crawl attributes in crawl profile (not active yet)	2012-09-14 16:49:29 +02:00
Michael Peter Christen	a13e5153ac	- added the possibility to have not one but a list of crawl start urls - the list of urls is entered in the expert crawl start in a textfield; the one-line input field was replaced with a text box - start urls can also be given in one single line where the urls are separated by a '|'-character - as an effect, the crawl profile cannot carry a single start url for identification because it is possible to have more. Therefore the url was removed from the crawl profile - this affects all servlets which display a crawl profile: removed the url field from all these servlets - to work consistently with several start urls and the other crawl starts which computed crawl start url lists from sitelists or sitemaps, the crawl start servlet was restructured completely - new rules for must-match patterns were created to make it possible that site crawl starts also work with several crawl starts at once	2012-09-14 12:25:46 +02:00
Michael Peter Christen	9644c186a4	added search functionality to ViewFile.html servlet	2012-09-11 02:03:14 +02:00
Michael Peter Christen	b2b516cc3e	added a collection attribute to crawls and searches: - a solr field collection_sxt can be used to store a set of crawl tags - when this field is activated, a crawl tag can be assigned when crawls are started - the content of the collection field can be comma-separated, all of them are assigned to the documents when they are indexed as result of such a crawl start - a search result can be drilled down to a specific collection; this is currently only available in the solr interface and also in the gsa interface using the 'site' option - this adds a mandatory field for gsa queries (the google api demands that field all the time)	2012-09-03 15:26:08 +02:00
Michael Peter Christen	0cab06c47c	refactoring	2012-08-17 15:52:33 +02:00
Michael Peter Christen	24d9db1613	snippet retrieval loading processes may use a smaller minimum load time value than crawling processes. This speeds up the search result preparation dramatically.	2012-07-30 10:38:23 +02:00
Michael Peter Christen	1687737771	Abstraction of HandleMap and HandleSet	2012-07-27 12:13:53 +02:00
Michael Peter Christen	e3aa05b9dd	added creation of subpath pattern when crawl start is 'from file'	2012-07-11 23:18:57 +02:00
orbiter	0cbda0b2b8	- replaced all length() == 0 and size() == 0 with isEmpty() - replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be done automatically - implemented some isEmpty() methods	2012-07-10 22:59:03 +02:00
Michael Peter Christen	7c1ba99755	removed more unused method parameters	2012-07-05 10:44:30 +02:00
Michael Peter Christen	0301aba1e9	removed unused method parameters	2012-07-05 10:23:07 +02:00
Michael Peter Christen	d3964253ae	- added @SuppressWarnings to unused servlet method parameters - removed unnecessary casts - removed unnecessary throw statements	2012-07-05 09:14:04 +02:00
Michael Peter Christen	276a66a793	Adding a limit of 1000 links that a parser shall store during indexing. A limit was necessary because some web pages have such huge numbers of links that they can easily cause an OOM just by the number of links. The question of whether the number of 1000 links is sufficient or too low must be answered with the result of testing this feature.	2012-07-03 17:06:20 +02:00
Michael Peter Christen	1825f165b8	better integration of blacklist according to use case	2012-07-02 13:57:29 +02:00
Michael Peter Christen	03280fb161	removed segments-concept and the Segments class: the segments had been there to create a tenant-infrastructure but were never used since that was all much too complex. There will be a replacement using a solr navigation using a segment field in the search index.	2012-06-28 14:27:29 +02:00
Michael Peter Christen	9116013c64	- allow lazy initialization of solr value (if using 'lazy', then no 0-values and no empty strings are written). This may save a lot of memory (in ram and on disc) if excessive 0-values or empty strings appear - do not allow default boolean values for checkboxes because that does not make sense: browsers may omit the checkbox attribute name if the box is not checked. A default value 'true' would not comply with the semantics of the browser's response. - add a checkbox in IndexFederated_p for the lazy initialization of solr fields.	2012-06-27 12:17:58 +02:00
Michael Peter Christen	77f795756c	fixing redirects and status codes: storing of status code in ResponseHeader to make it available for late evaluations, like storage in solr.	2012-06-25 18:17:31 +02:00
Michael Peter Christen	d7eb18cdf2	accept also file names beginning with "file://" for crawl start from file.	2012-06-06 14:27:18 +02:00
Michael Peter Christen	16b21f7a5b	Added more steering in Crawler_p.html interface	2012-05-23 18:00:37 +02:00
Michael Peter Christen	19efbf1b0f	- apply directDocByURL to NOLOAD Queue - choose pushing to NOLOAD as default for site crawl	2012-04-26 00:23:18 +02:00
Michael Peter Christen	ef5192f8c9	using the generic document parser for crawl starts instead of the html parser. This makes it possible that every type of document can be a crawl start point, not only text documents or html documents. Tested this with a pdf document.	2012-01-23 17:27:29 +01:00
Michael Peter Christen	992dbdf4bb	added noload statistic to servlets	2012-01-05 18:33:05 +01:00
orbiter	11729061f2	added an option in the bookmark import process to put everything into the crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8134 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-12-03 00:27:01 +00:00
orbiter	5a55397f99	some last-minute performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-25 11:23:52 +00:00
orbiter	da55a359e9	addon to http://bugs.yacy.net/view.php?id=72 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8095 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-24 16:43:26 +00:00
apfelmaennchen	564374d1fe	- included YMarks in addition to old bookmarks in yacysearchitem.html; don't get confused by the old bookmark dialog, the ymark is automatically added silently beforehand. - reworked bookmark creation on crawlstart - many smaller adjustments to ymarks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8072 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-22 23:50:49 +00:00
orbiter	d449547023	fix for http://bugs.yacy.net/view.php?id=72 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8064 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-22 00:22:09 +00:00
orbiter	c93f10417a	add a bookmark automatically each time a new crawl is started git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8063 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-22 00:03:20 +00:00
orbiter	e4a82ddd8b	produce a bookmark entry from every crawl start. these bookmarks are always private. these bookmarks will be used to get a source reference for the search in case of intranet or portal searches. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8062 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-21 23:10:29 +00:00
orbiter	42425c8003	fixed directDocByURL (has now effect if switched off) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8022 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-09 15:54:01 +00:00
orbiter	a7df70221e	refactoring git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7987 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-10-04 09:06:24 +00:00
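
Commit 60b1e23f05 above describes three crawl patterns: indexUrlMustMatch, indexUrlMustNotMatch and crawlerNoDepthLimitMatch. How such patterns could interact is sketched below in Java. This is only an illustration under stated assumptions, not YaCy's actual code: the class `CrawlFilterSketch`, its methods `shouldIndex` and `mayLoad`, the default pattern strings and the example URLs are all hypothetical; only the three option names come from the commit message.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not YaCy's implementation: only the three option
// names are taken from the commit message, everything else is assumed.
final class CrawlFilterSketch {
    private final Pattern indexUrlMustMatch;
    private final Pattern indexUrlMustNotMatch;
    private final Pattern crawlerNoDepthLimitMatch;
    private final int maxDepth;

    CrawlFilterSketch(String mustMatch, String mustNotMatch, String noDepthLimit, int maxDepth) {
        this.indexUrlMustMatch = Pattern.compile(mustMatch);
        this.indexUrlMustNotMatch = Pattern.compile(mustNotMatch);
        this.crawlerNoDepthLimitMatch = Pattern.compile(noDepthLimit);
        this.maxDepth = maxDepth;
    }

    // Assumed defaults: match-all for must-match, never-match for the others.
    // "(?!)" is an empty negative lookahead, which can never succeed.
    static CrawlFilterSketch withDefaults(int maxDepth) {
        return new CrawlFilterSketch(".*", "(?!)", "(?!)", maxDepth);
    }

    // A loaded page is indexed only if it passes both index patterns.
    boolean shouldIndex(String url) {
        return indexUrlMustMatch.matcher(url).matches()
            && !indexUrlMustNotMatch.matcher(url).matches();
    }

    // A url is loaded if it is within the depth limit, or if the
    // no-depth-limit pattern lifts the limit for this url.
    boolean mayLoad(String url, int depth) {
        return depth <= maxDepth || crawlerNoDepthLimitMatch.matcher(url).matches();
    }

    public static void main(String[] args) {
        CrawlFilterSketch expert =
            new CrawlFilterSketch(".*/docs/.*", ".*\\.pdf", ".*/archive/.*", 1);
        System.out.println(expert.shouldIndex("http://example.org/docs/a.html"));
        System.out.println(expert.mayLoad("http://example.org/archive/2012/x", 5));
    }
}
```

With the assumed defaults, every loaded page is indexed and the depth limit always applies, which mirrors the "as before" behaviour the commit message describes.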

99 Commits