Commit Graph

62 Commits

Author SHA1 Message Date
Michael Peter Christen
16b21f7a5b Added more steering in Crawler_p.html interface 2012-05-23 18:00:37 +02:00
Michael Peter Christen
19efbf1b0f - apply directDocByURL to NOLOAD Queue
- choose pushing to NOLOAD as default for site crawl
2012-04-26 00:23:18 +02:00
Michael Peter Christen
ef5192f8c9 using the generic document parser for crawl starts instead of the html
parser. This makes it possible that every type of document can be a
crawl start point, not only text documents or html documents. Tested
this with a pdf document.
2012-01-23 17:27:29 +01:00
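
For illustration: the mime-agnostic crawl-start dispatch described above could look roughly like the following sketch. All class and method names here are hypothetical, not YaCy's actual parser API.

```java
// Sketch: pick a parser for the crawl-start document by MIME type instead of
// assuming HTML, so any parsable document type can seed a crawl.
import java.util.HashMap;
import java.util.Map;

interface DocumentParser {
    Iterable<String> extractLinks(byte[] content);
}

class CrawlStartDispatchSketch {
    private final Map<String, DocumentParser> parsers = new HashMap<String, DocumentParser>();

    void register(String mime, DocumentParser p) { parsers.put(mime, p); }

    // Any registered type (text/html, application/pdf, ...) can now be a crawl start point.
    Iterable<String> linksFromCrawlStart(String mime, byte[] content) {
        DocumentParser p = parsers.get(mime);
        if (p == null) throw new IllegalArgumentException("no parser for " + mime);
        return p.extractLinks(content);
    }
}
```
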
Michael Peter Christen
992dbdf4bb added noload statistic to servlets 2012-01-05 18:33:05 +01:00
orbiter
11729061f2 added an option in the bookmark import process to put everything into the crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8134 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-12-03 00:27:01 +00:00
orbiter
5a55397f99 some last-minute performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-25 11:23:52 +00:00
orbiter
da55a359e9 addon to http://bugs.yacy.net/view.php?id=72
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8095 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-24 16:43:26 +00:00
apfelmaennchen
564374d1fe - included YMarks in addition to old bookmarks in yacysearchitem.html; don't get confused by the old bookmark dialog, the ymark is automatically added silently beforehand.
- reworked bookmark creation on crawlstart
- many smaller adjustments to ymarks

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8072 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-22 23:50:49 +00:00
orbiter
d449547023 fix for http://bugs.yacy.net/view.php?id=72
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8064 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-22 00:22:09 +00:00
orbiter
c93f10417a add a bookmark automatically each time a new crawl is started
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8063 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-22 00:03:20 +00:00
orbiter
e4a82ddd8b produce a bookmark entry from every crawl start. these bookmarks are always private.
these bookmarks will be used to get a source reference for the search in case of intranet or portal searches.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8062 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-21 23:10:29 +00:00
orbiter
42425c8003 fixed directDocByURL (the option now takes effect when switched off)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8022 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-09 15:54:01 +00:00
orbiter
a7df70221e refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7987 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-10-04 09:06:24 +00:00
orbiter
cf4fd525ee added directDocByURL attribute in crawl profile
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7985 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-30 12:38:28 +00:00
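
As a rough sketch, such a profile attribute could be stored as a plain key/value entry; the key name mirrors the commit message, while the map layout is an assumption:

```java
// Illustrative only: a crawl profile as a string map carrying the new attribute.
import java.util.HashMap;
import java.util.Map;

class CrawlProfileSketch {
    static final String DIRECT_DOC_BY_URL = "directDocByURL"; // key name from the commit message

    public static void main(String[] args) {
        Map<String, String> profile = new HashMap<String, String>();
        profile.put(DIRECT_DOC_BY_URL, "true");
        boolean direct = Boolean.parseBoolean(profile.get(DIRECT_DOC_BY_URL));
        System.out.println("index document URLs without loading: " + direct);
    }
}
```
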
orbiter
b250e6466d implemented crawl restrictions for IP pattern and country lists
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7980 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-29 15:17:39 +00:00
orbiter
5ad7f9612b added crawl settings for three new filters for each crawl:
must-match for IPs (IPs that are known after DNS resolving for each URL in the crawl queue)
must-not-match for IPs
must-match against a list of country codes (allows loading only from hosts that are hosted in the given countries)

note: the settings and input environment are there with this commit, but the values are not yet evaluated

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7976 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-27 21:58:18 +00:00
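
A sketch of how these three filters might be evaluated per URL once DNS resolution is done (illustrative names; as the note above says, the values were not yet evaluated in this commit):

```java
// Sketch: evaluate must-match / must-not-match IP patterns and a country whitelist
// for a resolved crawl URL. All names are illustrative.
import java.util.Set;
import java.util.regex.Pattern;

class CrawlIpFilterSketch {
    private final Pattern ipMustMatch;           // e.g. Pattern.compile("192\\.168\\..*")
    private final Pattern ipMustNotMatch;        // e.g. Pattern.compile("10\\..*")
    private final Set<String> allowedCountries;  // e.g. {"DE", "FR"}; empty = no restriction

    CrawlIpFilterSketch(Pattern mm, Pattern mnm, Set<String> countries) {
        this.ipMustMatch = mm;
        this.ipMustNotMatch = mnm;
        this.allowedCountries = countries;
    }

    boolean accept(String resolvedIp, String countryCode) {
        if (!ipMustMatch.matcher(resolvedIp).matches()) return false;
        if (ipMustNotMatch.matcher(resolvedIp).matches()) return false;
        return allowedCountries.isEmpty() || allowedCountries.contains(countryCode);
    }
}
```
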
orbiter
d2ea250d99 refactoring:
- moved many classes from de.anomic to net.yacy
- made more sub-packages for search classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7973 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-09-25 16:59:06 +00:00
orbiter
115abc8917 - more attributes for search progress bar
- moved cache strategy to cora package

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7778 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-06-13 21:44:03 +00:00
orbiter
10e2f588f8 - enhanced ybr ranking computation
- many speed/performance hacks
- added solr sharding and a new sharding web interface
- added option to switch off the yacy index when using solr
- added new fail-url categories which are used to distinguish which fail-urls should be sent to solr
- refactoring/renaming of some method names to distinguish host/url hashes better
- a large number of bug/npe fixes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7738 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-26 10:57:02 +00:00
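
The sharding idea reduces to picking a target Solr instance from a hash of the document's URL; a generic sketch, not YaCy's actual sharding code:

```java
// Generic hash-based shard selection: the same URL hash always lands on the same Solr instance.
import java.util.List;

class SolrShardingSketch {
    static int shardFor(String urlHash, int shardCount) {
        return (urlHash.hashCode() & Integer.MAX_VALUE) % shardCount; // stable, non-negative bucket
    }

    static String targetFor(String urlHash, List<String> solrUrls) {
        return solrUrls.get(shardFor(urlHash, solrUrls.size()));
    }
}
```
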
orbiter
6fa439c82b - refactoring of robots
- added option to crawler to send error-URLs to solr
- changed solr scheme slightly (no multi-value fields where there are no multiple values)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7693 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-05-02 14:05:51 +00:00
orbiter
b77b8cac0c - enhanced html parser: recognizes many more details in the content
- added more properties to solr index
- refactoring
- more constants in switchboard
- fix for some NPEs
- recognition of more images
- removed synchronization in HandleMap (obviously not necessary?)
- added a nolocal configuration to remove excessive DNS lookups (works only on allip - default off). Indexes produced with this setting are all flagged with 'local' and are (on purpose) not usable for freeworld because they will be rejected as being local.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7672 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-21 13:58:49 +00:00
orbiter
3d5104d357 - fixed a bug in crawl start with file name (NPE in new url)
- added deletion of solr index in IndexControlRWIs
- added asynchronous adding of large url lists (happens when crawls are started with file)
- fixed an NPE in Image display
- replaced language warning with fine logging
- added a domain name cache in Domains that helps to speed up the isLocal property (less DNS lookups)
- added a new storage class for this new cache: KeyList. The domain key list is stored in DATA/WORK/globalhosts.list
- added concurrent solr updates and chunked transfers (50 documents until a commit is done) for high-speed feeding (> 40000 ppm)
- fixed a bug in content scraper that chopped off large parts of crawl lists (using crawl start from file)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7666 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-18 16:11:16 +00:00
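
The chunked transfer described above amounts to buffering documents and committing once per chunk; a minimal sketch, with a stand-in interface instead of the real Solr connector:

```java
// Sketch of chunked feeding: buffer documents and commit every 50 adds.
// The SolrFeed interface below is a stand-in, not SolrJ's real API.
import java.util.ArrayList;
import java.util.List;

interface SolrFeed {
    void add(List<Object> docs);
    void commit();
}

class ChunkedFeederSketch {
    private static final int CHUNK = 50;          // commit after this many documents
    private final List<Object> buffer = new ArrayList<Object>();
    private final SolrFeed solr;

    ChunkedFeederSketch(SolrFeed solr) { this.solr = solr; }

    synchronized void put(Object doc) {
        buffer.add(doc);
        if (buffer.size() >= CHUNK) flush();
    }

    synchronized void flush() {
        if (buffer.isEmpty()) return;
        solr.add(new ArrayList<Object>(buffer));
        solr.commit();                             // one commit per chunk, not per document
        buffer.clear();
    }
}
```

Committing once per 50 documents rather than per document is what makes the high feeding rate plausible: the expensive commit cost is amortized over the whole chunk.
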
orbiter
156cf02703 - added an index constraint 'has location' to the condenser
- added evaluation of the 'has location' constraint to search using the /location operator

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7633 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-31 09:41:30 +00:00
orbiter
43e1660512 fix/enhancement in Crawler: do not generate a domain match pattern if crawl depth is 0
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7607 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-17 21:07:44 +00:00
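
The logic of the fix in a few lines (hypothetical method name, illustrative pattern syntax): a depth-0 crawl loads only the start document, so a domain restriction pattern is pointless there.

```java
import java.util.regex.Pattern;

class DomainPatternSketch {
    // Sketch: only restrict the crawl to the start host when links will be followed.
    static String mustMatchFor(String startHost, int crawlDepth) {
        if (crawlDepth == 0) return ".*";   // depth 0: nothing beyond the start URL is loaded
        return "https?://(www\\.)?" + Pattern.quote(startHost) + "/.*";
    }
}
```
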
low012
2861d0888a *) simplified code
*) fixed potential NumberFormatExceptions
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7600 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-15 01:03:35 +00:00
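
A typical guard for the kind of NumberFormatException fixed here, as a generic sketch:

```java
class SafeParseSketch {
    // Generic sketch: parse a numeric form field without letting a bad value throw.
    static int parseIntOr(String value, int fallback) {
        if (value == null) return fallback;
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return fallback; // malformed input falls back to a safe default
        }
    }
}
```
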
orbiter
7962d35425 - removed file upload function in crawl start and replaced it with an input field for a file path where the crawl start file is loaded. This was necessary to support the API steering for file crawl starts, for two reasons:
1) if the file is changed for a re-crawl this is not reflected in the steering because it would take the previously uploaded crawl start file
2) browsers do not submit the full path of the selected file, even if this path is shown in the input field, for security reasons. There is no work-around or hack to make the submission of the full path possible

- fixed deletion of crawl start point urls in crawl stack and balancer double-check
- fixed a problem with steering self-call (no resolving of localhost)
- added more logging for the crawler to supervise why crawl urls are not taken by the loader
- added a javascript onload-function to select domain restriction in all cases where a crawl is started from a file or from a url
- fixed the restrict-to-domain pattern computation, added a 'www.'-prefix and added this functionality also to a crawl start from file 

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7574 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-09 12:50:39 +00:00
orbiter
4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
- some restructuring of the document counting and logging structures was necessary
- better abstraction of CrawlProfiles
- added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation
- more refactoring to get the LibraryProvider more clean
- some refactoring of the Condenser class

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-12 00:01:40 +00:00
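
The per-domain limitation needs a counter per host; a minimal sketch (illustrative, not the actual CrawlProfiles code):

```java
// Sketch: count fetched documents per host and stop accepting URLs past a limit.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class DomainCountSketch {
    private final ConcurrentHashMap<String, AtomicInteger> counts =
            new ConcurrentHashMap<String, AtomicInteger>();
    private final int maxPerDomain;

    DomainCountSketch(int maxPerDomain) { this.maxPerDomain = maxPerDomain; }

    boolean mayLoad(String host) {
        AtomicInteger c = counts.get(host);
        if (c == null) {
            AtomicInteger fresh = new AtomicInteger(0);
            c = counts.putIfAbsent(host, fresh);
            if (c == null) c = fresh;
        }
        return c.incrementAndGet() <= maxPerDomain;
    }

    void reset() { counts.clear(); } // cleared when the index (and its logs) is deleted
}
```
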
low012
ae10ed5613 *) added a Set to which filter elements are written before the must-match filter is created, to avoid huge lists of duplicate elements in the must-match filter when starting a crawl from a "Link-List of URL" on CrawlStartSite_p.html
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7456 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-28 16:24:33 +00:00
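
The deduplication idea reads roughly like this sketch (hypothetical helper; the real filter syntax may differ):

```java
// Sketch: collect hosts in a Set first so the must-match regex has no duplicate alternatives.
import java.util.LinkedHashSet;
import java.util.Set;

class FilterBuildSketch {
    static String buildMustMatch(Iterable<String> urlsFromLinkList) {
        Set<String> hosts = new LinkedHashSet<String>(); // drops doubles, keeps insertion order
        for (String u : urlsFromLinkList) {
            int sep = u.indexOf("://");
            if (sep < 0) continue;                       // skip malformed entries
            int start = sep + 3;
            int end = u.indexOf('/', start);
            hosts.add(end < 0 ? u.substring(start) : u.substring(start, end));
        }
        StringBuilder filter = new StringBuilder();
        for (String h : hosts) {
            if (filter.length() > 0) filter.append('|');
            filter.append("https?://").append(h.replace(".", "\\.")).append("/.*");
        }
        return filter.toString();
    }
}
```
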
orbiter
c93f4dda72 - cleaned up yacy news
- removed unused methods
- avoid news generation in case that the peer runs in robinson mode

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7431 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-01-12 00:00:14 +00:00
orbiter
0769f4caa6 added search suggestions for interactive search: they are shown only if there are no search results
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7411 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-29 14:30:25 +00:00
orbiter
58b59f9bc8 - a collection of bug fixes and some redesign of the Scanner class
- fixed smb crawling
- added smbget to download script generation

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7381 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-16 23:37:21 +00:00
orbiter
a563b05b60 enhanced crawler:
- added a new queue 'noload' which can be filled with urls where it is already known that the content cannot be loaded. This may be because there is no parser available or the file is too big
- the noload queue is emptied with the parser process which indexes the file names only
- the 'start from file' functionality now also reads from ftp crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7368 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-11 00:31:57 +00:00
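
The routing decision for the noload queue can be sketched as follows (names and limits are illustrative):

```java
// Sketch: send a URL to the 'noload' queue when its content cannot be parsed anyway;
// those entries are later indexed by file name only.
import java.util.Set;

class QueueRouterSketch {
    private final Set<String> parsableExtensions; // e.g. {"html", "pdf", "txt"}
    private final long maxFileSize;               // e.g. 10 * 1024 * 1024

    QueueRouterSketch(Set<String> exts, long maxFileSize) {
        this.parsableExtensions = exts;
        this.maxFileSize = maxFileSize;
    }

    String queueFor(String url, long contentLength) {
        // Extension heuristic only; a URL without a dot falls through to 'noload'.
        String ext = url.substring(url.lastIndexOf('.') + 1).toLowerCase();
        boolean parsable = parsableExtensions.contains(ext);
        boolean tooBig = contentLength > maxFileSize;
        return (!parsable || tooBig) ? "noload" : "load";
    }
}
```
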
orbiter
c36da90261 added a very fast ftp file list generator to site crawler:
- when a site-crawl for ftp sites is now started, a special directory-tree harvester gets the complete directory structure of an ftp server at once
- the harvester runs concurrently and feeds into the normal crawl queue

also in this:
- fixed the 'start from file' crawl function
- added a link detector for the html parser. The html parser can now also extract links that are not included in <a> tags.
- as a result, a crawl start is now also possible from plain-text link files

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7367 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-09 17:17:25 +00:00
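
Such a concurrent tree harvester could look like this sketch, here assuming Apache Commons Net as the FTP client (an assumption; YaCy's actual client may differ):

```java
// Sketch: walk an FTP server's directory tree in a worker thread and feed every
// file URL into a queue that a crawler could consume.
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;

class FtpHarvesterSketch implements Runnable {
    private final String host;
    private final BlockingQueue<String> crawlQueue;

    FtpHarvesterSketch(String host, BlockingQueue<String> crawlQueue) {
        this.host = host;
        this.crawlQueue = crawlQueue;
    }

    public void run() {
        FTPClient ftp = new FTPClient();
        try {
            ftp.connect(host);
            ftp.login("anonymous", "anonymous");
            ftp.enterLocalPassiveMode();
            walk(ftp, "/");
            ftp.logout();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void walk(FTPClient ftp, String path) throws IOException, InterruptedException {
        for (FTPFile f : ftp.listFiles(path)) {
            String p = path.endsWith("/") ? path + f.getName() : path + "/" + f.getName();
            if (f.isDirectory()) walk(ftp, p);
            else crawlQueue.put("ftp://" + host + p); // feeds the normal crawl queue
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> q = new LinkedBlockingQueue<String>();
        new Thread(new FtpHarvesterSketch("ftp.example.org", q)).start();
    }
}
```

Running the harvester in its own thread is what lets it feed the normal crawl queue concurrently while the crawler is already loading the first discovered files.
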
low012
e7552bd719 *) cleaning up the code a little bit
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7343 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-27 00:54:59 +00:00
low012
38fdf43587 *) renamed classes according to standard Java coding conventions
*) String.isEmpty() was introduced in Java 1.6, but we still use Java 1.5

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7330 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-21 01:29:32 +00:00
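
Under Java 1.5 the equivalent of String.isEmpty() is a length check; a tiny sketch:

```java
class StringUtilSketch {
    // Java 1.5-compatible replacement for String.isEmpty() (which was added in Java 1.6).
    // The null guard is a convenience beyond what isEmpty() itself does.
    static boolean isEmpty(String s) {
        return s == null || s.length() == 0;
    }
}
```
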
f1ori
7d8de34778 * add a bit of documentation to DigestURI, use DigestURI(string) instead of DigestURI(string, null)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7276 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-26 16:10:20 +00:00
orbiter
fcd40cd30f - disabled domZones (buggy, must think about better solution)
- increased time-out for dns resolver and isLocal property

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7233 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-09 10:17:50 +00:00
orbiter
c3bf17a3a1 fixed must-match filter for smb crawling
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7222 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-05 00:05:08 +00:00
orbiter
574346f8ce better must-match pattern for intranet file-crawls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7218 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-04 12:55:39 +00:00
sixcooler
aa6075402a small fix for crawling from 'sitelist' after the changes from r7214
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7216 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-01 22:41:28 +00:00
orbiter
2c549ae341 fixed a number of small bugs:
- better crawl start for file paths and smb paths
- added a time-out wrapper for dns resolving and reverse resolving to prevent blocking
- fixed intranet scanner result list check boxes
- prevented htcache usage in case of file and smb crawling (not necessary, documents are locally available)
- fixed rss feed loader
- fixed sitemap loader, which had not been restricted to single files (crawl depth must be zero)
- clearing of crawl result lists when a network switch was done
- higher maximum file size for crawler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7214 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-30 23:57:58 +00:00
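
The time-out wrapper for DNS resolving can be sketched with a Future that is abandoned after a deadline (a generic sketch, not the actual Domains code):

```java
// Sketch: DNS lookup with a hard time-out so a hanging resolver cannot block the crawler.
import java.net.InetAddress;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class DnsTimeoutSketch {
    private static final ExecutorService pool = Executors.newCachedThreadPool();

    static InetAddress resolve(final String hostname, long timeoutMillis) {
        Future<InetAddress> f = pool.submit(new Callable<InetAddress>() {
            public InetAddress call() throws Exception {
                return InetAddress.getByName(hostname);
            }
        });
        try {
            return f.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            f.cancel(true);  // give up; treat the host as unresolvable
            return null;
        }
    }
}
```

The lookup thread may keep running after the cancel, but the caller is no longer blocked, which is the point of the wrapper.
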
orbiter
f6eebb6f99 replaced auto-dom filter with easy-to-understand Site Link-List crawler option
- nobody understands the auto-dom filter without a lengthy introduction about the function of a crawler
- nobody ever used the auto-dom filter other than with a crawl depth of 1
- the auto-dom filter was buggy since the filter did not survive a restart and then a search index contained waste
- the function of the auto-dom filter was in fact to just load a link list from the given start url and then start separate crawls for all these urls restricted by their domain
- the new Site Link-List option shows the target urls in real-time during input of the start url (like the robots check) and gives transparent feedback on what it does before it is used
- the new option also fits into the easy site-crawl start menu

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7213 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-30 12:50:34 +00:00
orbiter
114bdd8ba7 fixed old sitemap importer which was not able to parse urls containing post elements
- removed old parser
- removed old importer framework (was only used by removed old parser)
- added a new sitemap parser in parser framework
- linked new parser with parser access in old sitemap processing routines

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7126 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-09-08 14:13:15 +00:00
orbiter
65eaf30f77 redesign of crawl profiles data structure. target will be:
- permanent storage of auto-dom statistics in profile
- storage of profiles in WorkTable data structure
not finished yet. No functional change yet.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7088 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-31 15:47:47 +00:00
orbiter
3197ca42ed preparations to move the HTCache into cora:
- move the header framework classes to cora
- move the ARC caching classes to cora
- refactoring of code to call these classes from cora

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7068 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-23 12:32:02 +00:00
orbiter
933dc1a600 removed old rss parser (will be replaced with parser from cora package)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7052 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-20 07:42:38 +00:00
orbiter
70dd26ec95 added the new crawl scheduling function to the crawl start menu:
- the scheduler extends the option for re-crawl timing. Many people misunderstood the re-crawl timing feature because it was just a criterion for the url double-check and not a scheduler. Now the scheduler setting is combined with the re-crawl setting and people have the choice between no re-crawl, re-crawl as it was possible so far, and a scheduled re-crawl. The 'classic' re-crawl time is set automatically when the scheduling function is selected
- removed the bookmark-based scheduler. This scheduler was not able to transport all attributes of a crawl start and therefore did not support special crawl starts, e.g. for forums and wikis
- since the old scheduler was not able to crawl special forums and wikis, the must-not-match filter was statically fixed to all bad pages for these special use cases. Since the new scheduler can handle these filters, it is possible to remove the default settings for the filters
- removed the busy thread that was used to trigger the bookmark-based scheduler
- removed the crontab for the bookmark-based scheduler

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7051 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-19 23:52:38 +00:00
orbiter
dcd01698b4 added a 'transition feature' that shall lower the barrier to move from g**gle to yacy (yes!):
Here a new concept called 'search heuristics' is introduced. A heuristic is a kind of 'shortcut' to good results in IT, here for good search results. In this case it will be used to get a very transparent way to compare what YaCy is able to produce as a search result and what g**gle produces as a search result. Here is what you can do now:
- add the phrase 'heuristic:scroogle' to your search query, like 'oil spill heuristic:scroogle' and then a call to scroogle is made to get anonymous search results from g**gle.
- these results are _not_ taken as meta-search results, but are used to instantly feed a crawling and indexing process. This happens very fast; here, 20 results from scroogle are taken and loaded all simultaneously, parsed and indexed immediately, and from the results of the parsed content the search result is fed, alongside the normal p2p search
- when new results from that heuristic (more to come) become part of the search results, it is verified whether such results are redundant to existing ones (they had been part of the normal YaCy search result anyway) or completely new to YaCy.
- in the search results, the new search results from heuristics are marked with a 'H ++' and search results from heuristics that had already been found by YaCy are marked with a 'H ='. That means:
- you can now see YaCy and Scroogle search results on one result page, but you also see that you would not have 'missed' the g**gle results if you had used only YaCy.

- to make it short: YaCy now subsumes g**gle results. If you use only YaCy, you miss nothing.

to come: a configuration page that lets you configure the usage of heuristics and enables this feature by default.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6944 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-25 16:44:57 +00:00
orbiter
56ff9d5fd4 - extended news size from 512 to 1024 characters
- a new news db will be created (news1024.db), the old one (news.db) can be deleted
- peers with too large a news payload are not ignored any more (they may have been invisible because their news payload was too large!)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6917 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-15 10:43:47 +00:00
orbiter
11639aef35 - added new protocol loader for 'file'-type URLs
- it is now possible to crawl the local file system with an intranet peer
- redesign of URL handling
- refactoring: created LGPLed package cora: 'content retrieval api' which may be used externally by other applications without yacy core elements because it has no dependencies to other parts of yacy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6902 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-25 12:54:57 +00:00
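
Loading a 'file'-type URL reduces to opening the local path it denotes; a minimal sketch (assumes files small enough to buffer in memory, unlike a real loader):

```java
// Minimal sketch of a 'file' protocol loader: map the URL to a local path and read it.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URISyntaxException;

class FileLoaderSketch {
    static byte[] load(String fileUrl) throws IOException, URISyntaxException {
        File f = new File(new URI(fileUrl));   // e.g. "file:///home/user/doc.txt"
        InputStream in = new FileInputStream(f);
        try {
            byte[] content = new byte[(int) f.length()];
            int off = 0;
            while (off < content.length) {
                int n = in.read(content, off, content.length - off);
                if (n < 0) break;              // unexpected end of file
                off += n;
            }
            return content;
        } finally {
            in.close();
        }
    }
}
```
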