yacy_search_server/htroot/IndexImportWikimedia_p.html
2009-06-19 14:03:28 +00:00

71 lines
3.5 KiB
HTML

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>YaCy '#[clientname]#': Wikimedia Dump Import</title>
#%env/templates/metas.template%#
#(import)#::<meta http-equiv="REFRESH" content="10" />#(/import)#
</head>
<body id="IndexImportWikimedia">
#%env/templates/header.template%#
#%env/templates/submenuIntegration.template%#
<h2>Wikimedia Dump Import</h2>
#(import)#
<p>#(status)#No import thread is running, you can start a new thread here::Bad input data: #[message]# #(/status)#</p>
<form action="IndexImportWikimedia_p.html" method="get">
<!-- no post method here, we don't want to transmit the whole file, only the path-->
<fieldset>
<legend>Wikimedia Dump File Selection: select a 'bz2' file</legend>
You can import Wikipedia dumps here. An example is the file
<a href="http://download.wikimedia.org/dewiki/20090311/dewiki-20090311-pages-articles.xml.bz2">
http://download.wikimedia.org/dewiki/20090311/dewiki-20090311-pages-articles.xml.bz2</a>.
<br />
Dumps must be in XML format and must be encoded in bz2. Do not decompress the file after downloading!
<br />
<input name="file" type="text" value="DATA/HTCACHE/dewiki-20090311-pages-articles.xml.bz2" size="80" />
<input name="submit" type="submit" value="Import Wikimedia Dump" />
</fieldset>
</form>
<p>
When the import is started, the following happens:
</p><ul>
<li>The dump is extracted on the fly and wiki entries are translated into Dublin Core data format. The output looks like this:
<pre>
&lt;?xml version="1.0" encoding="utf-8"?&gt;
&lt;surrogates xmlns:dc="http://purl.org/dc/elements/1.1/"&gt;
&lt;record&gt;
&lt;dc:Title&gt;&lt;![CDATA[Alan Smithee]]&gt;&lt;/dc:Title&gt;
&lt;dc:Identifier&gt;http://de.wikipedia.org/wiki/Alan%20Smithee&lt;/dc:Identifier&gt;
&lt;dc:Description&gt;&lt;![CDATA[Der als Filmregisseur oft genannte Alan Smithee ist ein Anagramm]]&gt;&lt;/dc:Description&gt;
&lt;dc:Language&gt;de&lt;/dc:Language&gt;
&lt;dc:Date&gt;2009-05-07T06:03:48Z&lt;/dc:Date&gt;
&lt;/record&gt;
&lt;record&gt;
...
&lt;/record&gt;
&lt;/surrogates&gt;
</pre>
</li>
<li>Each 10000 wiki records are combined in one output file which is written to /DATA/SURROGATES/in into a temporary file.</li>
<li>When each of the generated output file is finished, it is renamed to a .xml file</li>
<li>Each time a xml surrogate file appears in /DATA/SURROGATES/in, the YaCy indexer fetches the file and indexes the record entries.</li>
<li>When a surrogate file is finished with indexing, it is moved to /DATA/SURROGATES/out</li>
<li>You can recycle processed surrogate files by moving them from /DATA/SURROGATES/out to /DATA/SURROGATES/in</li>
</ul>
<p></p>
::
<form><fieldset><legend>Import Process</legend>
<dl>
<dt>Thread:</dt><dd>#[thread]#</dd>
<dt>Dump:</dt><dd>#[dump]#</dd>
<dt>Processed:</dt><dd>#[count]# Wiki Entries</dd>
<dt>Speed:</dt><dd>#[speed]# articles per second</dd>
<dt>Running Time:</dt><dd>#[runningHours]# hours, #[runningMinutes]# minutes</dd>
<dt>Remaining Time:</dt><dd>#[remainingHours]# hours, #[remainingMinutes]# minutes</dd>
</dl>
</fieldset></form>
#(/import)#
#%env/templates/footer.template%#
</body>
</html>