yacy_search_server

mirror of https://github.com/yacy/yacy_search_server.git synced 2024-09-19 00:01:41 +02:00

Author	SHA1	Message	Date
orbiter	60b1e23f05	added new crawl options: - indexUrlMustMatch and indexUrlMustNotMatch which can be used to select loaded pages for indexing. Default patterns are set so that all loaded pages are also indexed (as before) but when doing an expert crawl start, the user may select only specific urls to be indexed. - crawlerNoDepthLimitMatch is a new pattern that can be used to remove the crawl depth limitation. This filter is a never-match by default (so the crawl depth is applied) but the user can select paths which will be loaded completely even if the crawl depth is reached.	2012-09-16 21:27:55 +02:00
Michael Peter Christen	6ec02deec6	added new crawl attributes in crawl profile (not active yet)	2012-09-14 16:49:29 +02:00
Michael Peter Christen	a13e5153ac	- added the possibility to have not one but a list of crawl start urls - the list of urls is entered in the expert crawl start in a textfield; the one-line input field was replaced with a text box - start urls can also be given in one single line where the urls are separated by a '|'-character - as an effect, the crawl profile cannot carry a single start url for identification because it is possible to have more. Therefore the url was removed from the crawl profile - this affects all servlets which display a crawl profile: removed the url field from all these servlets - to work consistently with several start urls and the other crawl starts which computed crawl start url lists from sitelists or sitemaps, the crawl start servlet was restructured completely - new rules for must-match patterns were created to make it possible that site crawl starts also work with several crawl starts at once	2012-09-14 12:25:46 +02:00
Michael Peter Christen	9644c186a4	added search functionality to ViewFile.html servlet	2012-09-11 02:03:14 +02:00
Michael Peter Christen	b2b516cc3e	added a collection attribute to crawls and searches: - a solr field collection_sxt can be used to store a set of crawl tags - when this field is activated, a crawl tag can be assigned when crawls are started - the content of the collection field can be comma-separated, all of them are assigned to the documents when they are indexed as result of such a crawl start - a search result can be drilled down to a specific collection; this is currently only available in the solr interface and also in the gsa interface using the 'site' option - this adds a mandatory field for gsa queries (the google api demands that field all the time)	2012-09-03 15:26:08 +02:00
Michael Peter Christen	0cab06c47c	refactoring	2012-08-17 15:52:33 +02:00
Michael Peter Christen	24d9db1613	snippet retrieval loading processes may use a smaller minimum load time value than crawling processes. This speeds up the search result preparation dramatically.	2012-07-30 10:38:23 +02:00
Michael Peter Christen	1687737771	Abstraction of HandleMap and HandleSet	2012-07-27 12:13:53 +02:00
Michael Peter Christen	e3aa05b9dd	added creation of subpath pattern when crawl start is 'from file'	2012-07-11 23:18:57 +02:00
orbiter	0cbda0b2b8	- replaced all length() == 0 and size() == 0 with isEmpty() - replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be done automatically - implemented some isEmpty() methods	2012-07-10 22:59:03 +02:00
Michael Peter Christen	7c1ba99755	removed more unused method parameters	2012-07-05 10:44:30 +02:00
Michael Peter Christen	0301aba1e9	removed unused method parameters	2012-07-05 10:23:07 +02:00
Michael Peter Christen	d3964253ae	- added @SuppressWarnings to unused servlet method parameters - removed unnecessary casts - removed unnecessary throw statements	2012-07-05 09:14:04 +02:00
Michael Peter Christen	276a66a793	Adding a limit of 1000 links that a parser shall store during indexing. A limit was necessary because some web pages have such huge numbers of links that it can easily cause an OOM just by the number of links. The question whether the limit of 1000 links is sufficient or too weak must be answered with the result of testing this feature.	2012-07-03 17:06:20 +02:00
Michael Peter Christen	1825f165b8	better integration of blacklist according to use case	2012-07-02 13:57:29 +02:00
Michael Peter Christen	03280fb161	removed segments-concept and the Segments class: the segments had been there to create a tenant-infrastructure but were never used since that was all much too complex. There will be a replacement using a solr navigation using a segment field in the search index.	2012-06-28 14:27:29 +02:00
Michael Peter Christen	9116013c64	- allow lazy initialization of solr value (if using 'lazy', then no 0-values and no empty strings are written). This may save a lot of memory (in ram and on disc) if excessive 0-values or empty strings appear - do not allow default boolean values for checkboxes because that does not make sense: browsers may omit the checkbox attribute name if the box is not checked. A default value 'true' would not comply with the semantic of the browser's response. - add a checkbox in IndexFederated_p for the lazy initialization of solr fields.	2012-06-27 12:17:58 +02:00
Michael Peter Christen	77f795756c	fixing redirects and status codes: storing of status code in ResponseHeader to make it available for late evaluations, like storage in solr.	2012-06-25 18:17:31 +02:00
Michael Peter Christen	d7eb18cdf2	accept also file names beginning with "file://" for crawl start from file.	2012-06-06 14:27:18 +02:00
Michael Peter Christen	16b21f7a5b	Added more steering in Crawler_p.html interface	2012-05-23 18:00:37 +02:00
Michael Peter Christen	19efbf1b0f	- apply directDocByURL to NOLOAD Queue - choose pushing to NOLOAD as default for site crawl	2012-04-26 00:23:18 +02:00
Michael Peter Christen	ef5192f8c9	using the generic document parser for crawl starts instead of the html parser. This makes it possible that every type of document can be a crawl start point, not only text documents or html documents. Tested this with a pdf document.	2012-01-23 17:27:29 +01:00
Michael Peter Christen	992dbdf4bb	added noload statistic to servlets	2012-01-05 18:33:05 +01:00
orbiter	11729061f2	added an option in the bookmark import process to put everything into the crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8134 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-12-03 00:27:01 +00:00
orbiter	5a55397f99	some last-minute performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-25 11:23:52 +00:00
orbiter	da55a359e9	addon to http://bugs.yacy.net/view.php?id=72 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8095 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-24 16:43:26 +00:00
apfelmaennchen	564374d1fe	- included YMarks in addition to old bookmarks in yacysearchitem.html; don't get confused by the old bookmark dialog, the ymark is automatically added silently beforehand. - reworked bookmark creation on crawlstart - many smaller adjustments to ymarks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8072 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-22 23:50:49 +00:00
orbiter	d449547023	fix for http://bugs.yacy.net/view.php?id=72 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8064 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-22 00:22:09 +00:00
orbiter	c93f10417a	add a bookmark automatically each time a new crawl is started git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8063 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-22 00:03:20 +00:00
orbiter	e4a82ddd8b	produce a bookmark entry from every crawl start. these bookmarks are always private. these bookmarks will be used to get a source reference for the search in case of intranet or portal searches. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8062 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-21 23:10:29 +00:00
orbiter	42425c8003	fixed directDocByURL (has now effect if switched off) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8022 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-11-09 15:54:01 +00:00
orbiter	a7df70221e	refactoring git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7987 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-10-04 09:06:24 +00:00
orbiter	cf4fd525ee	added directDocByURL attribute in crawl profile git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7985 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-30 12:38:28 +00:00
orbiter	b250e6466d	implemented crawl restrictions for IP pattern and country lists git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7980 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-29 15:17:39 +00:00
orbiter	5ad7f9612b	added crawl settings for three new filters for each crawl: must-match for IPs (IPs that are known after DNS resolving for each URL in the crawl queue) must-not-match for IPs must-match against a list of country codes (allows only loading from hosts that are hosted in given countries) note: the settings and input environment are there with that commit, but the values are not yet evaluated git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7976 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-27 21:58:18 +00:00
orbiter	d2ea250d99	refactoring: - moved many classes from de.anomic to net.yacy - made more sub-packages for search classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7973 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-09-25 16:59:06 +00:00
orbiter	115abc8917	- more attributes for search progress bar - moved cache strategy to cora package git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7778 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-06-13 21:44:03 +00:00
orbiter	10e2f588f8	- enhanced ybr ranking computation - many speed/performance hacks - added solr sharding and new sharding web interface - added option to switch off the yacy index when using solr - added new fail-url categories which are used to make a distinction which fail-urls to be sent to solr - refactoring/renaming of some method names to distinguish host/url hashes better - a large number of bug/npe fixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7738 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-26 10:57:02 +00:00
orbiter	6fa439c82b	- refactoring of robots - added option to crawler to send error-URLs to solr - changed solr scheme slightly (no multi-value fields where no multi values are) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7693 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-05-02 14:05:51 +00:00
orbiter	b77b8cac0c	- enhanced html parser: recognized much more details in the content - added more properties to solr index - refactoring - more constants in switchboard - fix for some NPEs - recognition of more images - removed synchronization in HandleMap (obviously not necessary?) - added a nolocal configuration to remove excessive dns lookup (works only on allip - default off). Indexes produced with this setting are all flagged with 'local' and are (on purpose) not usable for freeworld because they will be rejected as being local. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7672 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-04-21 13:58:49 +00:00
orbiter	3d5104d357	- fixed a bug in crawl start with file name (npe in new url) - added deletion of solr index in IndexControlRWIs - added asynchronous adding of large url lists (happens when crawls are started with file) - fixed npe in Image display - replaced language warning with fine logging - added a domain name cache in Domains that helps to speed up the isLocal property (less DNS lookups) - added a new storage class for this new cache: KeyList. The domain key list is stored in DATA/WORK/globalhosts.list - added concurrent solr updates and chunked transfers (50 documents until a commit is done) for high-speed feeding (> 40000 ppm) - fixed a bug in content scraper that chopped off large parts of crawl lists (using crawl start from file) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7666 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-04-18 16:11:16 +00:00
orbiter	156cf02703	- added an index constraint 'has location' to the condenser - added evaluation of the 'has location' constraint to search using the /location operator git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7633 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-03-31 09:41:30 +00:00
orbiter	43e1660512	fix/enhancement in Crawler: do not generate domain match pattern if crawl depth is 0 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7607 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-03-17 21:07:44 +00:00
low012	2861d0888a	*) simplified code *) fixed potential NumberFormatExceptions git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7600 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-03-15 01:03:35 +00:00
orbiter	7962d35425	- removed file upload function in crawl start and replaced it with an input field for a file path where the crawl start file is loaded. This was necessary to support the API steering for file crawl starts, for two reasons: 1) if the file is changed for a re-crawl this is not reflected in the steering because it would take the previously uploaded crawl start file 2) browsers do not submit the full path of the selected file even if this path is shown in the input field because of security reasons. There is no work-around or hack to make the submission of the full path possible - fixed deletion of crawl start point urls in crawl stack and balancer double-check - fixed a problem with steering self-call (no resolving of localhost) - added more logging for the crawler to supervise why crawl urls are not taken by the loader - added a javascript onload-function to select domain restriction in all cases where a crawl is started from a file or from a url - fixed the restrict-to-domain pattern computation, added a 'www.'-prefix and added this functionality also to a crawl start from file git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7574 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-03-09 12:50:39 +00:00
orbiter	4588b5a291	- fixed document number limitation for crawls that restrict the number of documents per domain - some restructuring of the document counting and logging structures was necessary - better abstraction of CrawlProfiles - added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation - more refactoring to get the LibraryProvider more clean - some refactoring of the Condenser class git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-02-12 00:01:40 +00:00
low012	ae10ed5613	*) added a Set to which filter elements are written before mustmatch-filter is created to avoid huge lists of double elements in mustmatch-filter when starting a crawl from a "Link-List of URL" on CrawlStartSite_p.html git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7456 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-01-28 16:24:33 +00:00
orbiter	c93f4dda72	- cleaned up yacy news - removed unused methods - avoid news generation in case that the peer runs in robinson mode git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7431 6c8d7289-2bf4-0310-a012-ef5d649a1542	2011-01-12 00:00:14 +00:00
orbiter	0769f4caa6	added search suggestions for interactive search: is only shown if there are no search results git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7411 6c8d7289-2bf4-0310-a012-ef5d649a1542	2010-12-29 14:30:25 +00:00
orbiter	58b59f9bc8	- a collection of bug fixes and some redesign of the Scanner class - fixed smb crawling - added smbget to download script generation git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7381 6c8d7289-2bf4-0310-a012-ef5d649a1542	2010-12-16 23:37:21 +00:00

81 Commits