<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>YaCy '#[clientname]#': Crawl Start</title>
  #%env/templates/metas.template%#
  <script type="text/javascript" src="js/ajax.js"></script>
  <script type="text/javascript" src="js/IndexCreate.js"></script>
  <script type="text/javascript">
    // Select the radio button with the given id (called when its text field gains focus).
    function check(key){
      document.getElementById(key).checked = true;
    }
  </script>
  <style type="text/css">
    .nobr {
      white-space: nowrap;
    }
  </style>
</head>
<body id="IndexCreate">
  #%env/templates/header.template%#
  #%env/templates/submenuCrawler.template%#
  <div id="api"></div>
  <h2>Site Crawling</h2>
  <p id="startCrawling">
    <strong>Site Crawler:</strong>
    Download all web pages from a given domain or base URL.
  </p>
  <fieldset>
    <legend>Site Crawl Start</legend>
    <form id="Crawler" method="post" action="Crawler_p.html" enctype="multipart/form-data" accept-charset="UTF-8">
      <dl>
        <dt><label>Site</label></dt>
        <dd>
          <table border="0" cellpadding="0" cellspacing="0"><tr valign="top">
            <td valign="top"><input type="radio" name="crawlingMode" id="url" value="url" checked="checked"
              onmousedown="document.getElementById('rangeDomain').disabled=false;document.getElementById('rangeSubpath').disabled=false;document.getElementById('crawlingDomMaxCheck').disabled=false;document.getElementById('crawlingDomMaxPages').disabled=false;"/>Start URL (must start with<br/>http:// https:// ftp:// smb:// file://)</td>
            <td valign="top">
              <input name="crawlingURL" id="crawlingURL" type="text" size="50" maxlength="256" value="#[starturl]#" onkeypress="changed()" onfocus="check('url')" style="font-size:16px" /><br/>
              <input name="bookmarkTitle" id="bookmarkTitle" type="text" size="50" maxlength="256" value="" readonly="readonly" style="background:transparent; border:0px" />
            </td>
            <td>
              <span id="robotsOK"></span>
              <img id="ajax" src="env/grafics/empty.gif" alt="empty" style="vertical-align: top;" />
            </td>
          </tr><tr>
            <td><input type="radio" name="crawlingMode" id="sitelist" value="sitelist" disabled="disabled" />Link-List of URL</td>
            <td><div id="sitelistURLs"></div></td>
          </tr><tr>
            <td><input type="radio" name="crawlingMode" id="sitemap" value="sitemap" disabled="disabled"
              onmousedown="document.getElementById('rangeDomain').disabled=true;document.getElementById('rangeSubpath').disabled=true;document.getElementById('crawlingDomMaxCheck').disabled=true;document.getElementById('crawlingDomMaxPages').disabled=true;"/>Sitemap URL</td>
            <td><input name="sitemapURL" type="text" size="41" maxlength="256" value="" readonly="readonly" style="background:transparent; border:0px" /></td>
          </tr>
          </table><br/>
          <input type="hidden" name="crawlingDepth" id="crawlingDepth" value="99" />
        </dd>
        <dt><label>Path</label></dt>
        <dd>
          <input type="radio" name="range" id="rangeDomain" value="domain" checked="checked" />load all files in domain<br/>
          <input type="radio" name="range" id="rangeSubpath" value="subpath" />load only files in a sub-path of given url
          <input type="hidden" name="mustnotmatch" id="mustnotmatch" value="" />
          <input type="hidden" name="crawlingDomFilterCheck" id="crawlingDomFilterCheck" value="off" />
          <input type="hidden" name="crawlingDomFilterDepth" id="crawlingDomFilterDepth" value="#[crawlingDomFilterDepth]#" />
        </dd>
        <dt><label>Limitation</label></dt>
        <dd><table border="0" cellpadding="0" cellspacing="0"><tr valign="top">
          <td valign="top"><input type="checkbox" name="crawlingDomMaxCheck" id="crawlingDomMaxCheck" #(crawlingDomMaxCheck)#::checked="checked"#(/crawlingDomMaxCheck)# /> not more than </td>
          <td valign="top"><input name="crawlingDomMaxPages" id="crawlingDomMaxPages" type="text" size="6" maxlength="6" value="#[crawlingDomMaxPages]#" /></td>
          <td valign="top">documents</td>
        </tr></table>
        </dd>
        <dt><label>Collection</label></dt>
        <dd>
          <input name="collection" id="collection" type="text" size="60" maxlength="100" value="#[collection]#" #(collectionEnabled)#disabled="disabled"::#(/collectionEnabled)# />
        </dd>
        <dt><label>Start</label></dt>
        <dd>
          <input type="hidden" name="directDocByURL" id="directDocByURL" value="off" />
          <input type="hidden" name="recrawl" id="recrawl" value="reload" />
          <input type="hidden" name="reloadIfOlderNumber" id="reloadIfOlderNumber" value="3" />
          <input type="hidden" name="reloadIfOlderUnit" id="reloadIfOlderUnit" value="day" />
          <input type="hidden" name="deleteold" id="deleteold" value="on" />
          <input type="hidden" name="storeHTCache" id="storeHTCache" value="on" />
          <input type="hidden" name="cachePolicy" id="cachePolicy" value="iffresh" />
          <input type="hidden" name="crawlingQ" id="crawlingQ" value="on" />
          <input type="hidden" name="followFrames" id="followFrames" value="on" />
          <input type="hidden" name="obeyHtmlRobotsNoindex" id="obeyHtmlRobotsNoindex" value="on" />
          <input type="hidden" name="indexText" id="indexText" value="on" />
          <input type="hidden" name="indexMedia" id="indexMedia" value="on" />
          <input type="hidden" name="intention" id="intention" value="" />
          <input id="timezoneOffset" type="hidden" name="timezoneOffset" value="" /><script type="text/javascript">document.getElementById("timezoneOffset").value = new Date().getTimezoneOffset();</script>
          <input type="submit" name="crawlingstart" value="Start New Crawl" class="btn btn-primary" />
        </dd>
      </dl>
    </form>
  </fieldset>
  <h3>Hints</h3>
  <ul>
    <li><h4>Crawl Speed Limitation</h4>No more than two pages per second are loaded from the same host (at most 120 documents per minute) to limit the load on the target server.</li>
    <li><h4>Target Balancer</h4>A second crawl for a different host increases the throughput to a maximum of 240 documents per minute since the crawler balances the load over all hosts.</li>
    <li><h4>High Speed Crawling</h4>A 'shallow crawl' which is not limited to a single host (or site)
      can raise the pages per minute (ppm) rate without a fixed upper limit when the number of target hosts is high.
      This can be done using the <a href="CrawlStartExpert.html">Expert Crawl Start</a> servlet.</li>
    <li><h4>Scheduler Steering</h4>The scheduler on crawls can be changed or removed using the <a href="Table_API_p.html">API Steering</a>.</li>
  </ul>
  #%env/templates/footer.template%#
</body>
</html>