
<html>
<head>
<title>YaCy: FAQ</title>
<meta http-equiv="content-type" content="text/html;charset=iso-8859-1">
<!-- <meta name="Content-Language" content="German, Deutsch, de, at, ch"> -->
<meta name="Content-Language" content="English, Englisch">
<meta name="keywords" content="YaCy HTTP Proxy search engine spider indexer java network open free download Mac Windows Software development">
<meta name="description" content="YaCy Software HTTP Proxy Freeware Home Page">
<meta name="copyright" content="Michael Christen">
<script src="navigation.js" type="text/javascript"></script>
<link rel="stylesheet" media="all" href="style.css">
<!-- Realisation: Michael Christen; Contact: mc<at>anomic.de-->
</head>
<body bgcolor="#fefefe" marginheight="0" marginwidth="0" leftmargin="0" topmargin="0">
<SCRIPT LANGUAGE="JavaScript1.1"><!--
globalheader();
//--></SCRIPT>
<NOSCRIPT>
<table border="0" cellspacing="0" cellpadding="0" width="100%">
<tr><td></td></tr>
<tr><td height="1" bgcolor="#000000"></td></tr>
<tr><td>
<!-- start headline -->
<table bgcolor="#4070A0" border="0" cellspacing="0" cellpadding="0" width="100%">
<tr><td width="180" height="80" rowspan="3"><a href="http://www.yacy.net"><img border="0" src="grafics/yacy.gif" align="top"></a></td>
<td></td><td width="120"></td></tr>
</table>
<!-- end headline -->
</td></tr>
<tr><td height="2"></td></tr>
<tr><td>
<table border="0" cellspacing="0" cellpadding="0" width="100%">
<tr>
<td width="100" valign="top">
<!-- start lmenue -->
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr><td height="2"></td></tr>
<tr><td height="20" class="white" bgcolor="#BDCDD4" valign="middle">&nbsp;<a href="index.html" class="dark">Main Index</a></td></tr>
</table>
<!-- end lmenue -->
</td>
<td width="10" valign="top"></td>
<td valign="top">
<table border="0" cellspacing="0" cellpadding="0" width="100%">
<tr><td height="2"></td></tr>
<tr><td><br>
</NOSCRIPT>
<!-- ----- HERE STARTS CONTENT PART ----- -->

<h2>FAQ</h2>

<p>YaCy is not only a distributed search engine, but also a caching HTTP proxy.
Both application parts benefit from each other.</p>

<h3>Can I Crawl The Web With YaCy?</h3>
<p>Yes! You can start your own crawl, and you can also trigger distributed crawling,
in which your YaCy peer asks other peers to perform specific crawl tasks.
Many parameters let you restrict your crawl to a limited set of web pages.</p>

<h3>What do you mean by 'Global Search Engine'?</h3>
<p>The integrated indexing and search service can be used not only locally, but also <i>globally</i>.
Each YaCy peer distributes some contact information to all other peers that can be reached on the internet,
and peers exchange <i>but do not copy</i> their indexes with each other.
This is done in such a way that each <i>peer</i> knows which other
<i>peer</i> to address to retrieve a specific search index.
The community of all peers therefore spans a <i>distributed hash table</i> (DHT),
which is used to share the <i>reverse word index</i> (RWI) with all operators and users of the peers.
The logic for distributing and retrieving RWIs on the DHT combines all participating peers into
a <i>Distributed Search Engine</i>.
To point out that this is in contrast to local indexing and searching,
we call it a <i>Global Search Engine</i>.
</p>
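<p>How a peer might find the <i>peer</i> that holds the index for a given word can be sketched as follows.
This is only an illustration, not YaCy's actual code: the hash function (SHA-256), the ring layout,
the rule "first peer at or after the word's hash" and the peer names are all assumptions made for the example.</p>

```python
import bisect
import hashlib


def word_hash(word: str) -> int:
    """Map a word to a position on the hash ring (SHA-256 is an assumption)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class PeerRing:
    """Toy DHT: a word's index lives on the first peer at or after its hash."""

    def __init__(self, peer_names):
        # Place every peer on the ring at the hash of its (hypothetical) name.
        self._ring = sorted((word_hash(name), name) for name in peer_names)
        self._positions = [position for position, _ in self._ring]

    def responsible_peer(self, word: str) -> str:
        # Find the first peer position >= the word's hash, wrapping around.
        i = bisect.bisect_left(self._positions, word_hash(word))
        return self._ring[i % len(self._ring)][1]


ring = PeerRing(["peer-a", "peer-b", "peer-c"])
print(ring.responsible_peer("search"))
```

<p>Every peer that knows the same list of peers computes the same answer, so a query can go directly to the index-owning peer.</p>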

<h3>Is there a central server? Does the search engine network need one?</h3>
<p>No. The network architecture does not need a central server, and there is none.
There is a root server, which is the 'first' peer, but every other peer has the same rights and tasks.
We still distinguish three different <i>classes</i> of peers:</p>
<ul>
<li><i>junior</i> peers cannot be reached from the internet because of routing problems or firewall settings;</li>
<li><i>senior</i> peers can be accessed by other peers; and</li>
<li><i>principal</i> peers are like senior peers but can also upload network bootstrap information to ftp/http sites; this is necessary for network bootstrapping.</li>
</ul>
<p>Junior peers can contribute to the network by submitting index files to senior/principal peers without being asked. (This function is currently very limited.)
</p>
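<p>The three classes can be summarised in a few lines of code. This is a sketch only:
the names <i>PeerClass</i>, <i>classify</i> and <i>can_upload_bootstrap</i> are invented for this example
and do not come from the YaCy source.</p>

```python
from enum import Enum


class PeerClass(Enum):
    JUNIOR = "junior"
    SENIOR = "senior"
    PRINCIPAL = "principal"


def classify(reachable: bool, can_upload_bootstrap: bool = False) -> PeerClass:
    """Derive the peer class from the two properties named in the list above."""
    if not reachable:
        # Not reachable from the internet: always junior.
        return PeerClass.JUNIOR
    if can_upload_bootstrap:
        # Reachable and able to publish bootstrap information.
        return PeerClass.PRINCIPAL
    return PeerClass.SENIOR


print(classify(reachable=True).value)
```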

<h3>Search Engines need a lot of terabytes of space, don't they? How much space do I need on my machine?</h3>
<p>The global index is <i>shared</i> among the peers, but not <i>copied</i> to each of them.
If you run YaCy, you need on average about as much disk space for the index as for the cache.
The global index as a whole may grow to terabytes, but not all of it is stored on your machine!</p>

<h3>Search Engines must do crawling, don't they? Do you?</h3>
<p>No. They <i>can</i> crawl, but we collect information simply by using the pages that pass through the proxy.
If you <i>want</i> to crawl, you can start your own crawl job with a chosen crawl depth.</p>

<h3>Do I need a fast machine? Search Engines need big server farms, don't they?</h3>
<p>You don't need a fast machine to run YaCy. You also don't need a lot of space.
You can configure how many megabytes you want to spend on the cache.
Any time-consuming task is delayed automatically and takes place
while you are not actively browsing (this works only if you use YaCy as HTTP proxy).
Whenever internet pages pass through YaCy in proxy mode,
any indexing (or, if enabled, prefetch crawling) is interrupted and delayed.
</p>

<h3>I don't want to wait very long for search results. How long does a search take?</h3>
<p>Our architecture does not do peer-hopping, and we don't have a TTL (time to live).
We expect search results to be returned to the requester <i>instantly</i>.
This can be done by asking the index-owning peer <i>directly</i>, which is
possible by using a DHT (distributed hash table).
Because we need some redundancy to compensate for missing peers, we ask
several peers simultaneously. To collect their responses, we wait
at most 6 seconds. If this is not enough, the user may repeat the search
to catch up on 'late' responses from other peers.</p>
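<p>The "ask several peers at once, wait a bounded time" pattern might look like the sketch below.
It is not taken from YaCy: <i>ask_peer</i> merely simulates a remote request with a sleep, and the peer names
and delays are made up. Only the default timeout of 6 seconds comes from the answer above.</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def ask_peer(peer: str, delay: float, word: str) -> list:
    """Stand-in for a remote search request; `delay` simulates network latency."""
    time.sleep(delay)
    return [f"{peer}: hit for '{word}'"]


def global_search(word: str, peers: dict, timeout: float = 6.0) -> list:
    """Ask all peers at once and keep whatever arrived before the timeout."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(peers)))
    futures = [pool.submit(ask_peer, name, delay, word)
               for name, delay in peers.items()]
    done, _late = wait(futures, timeout=timeout)
    # Do not block on late peers; a repeated search could pick up their answers.
    pool.shutdown(wait=False)
    results = []
    for future in done:
        results.extend(future.result())
    return sorted(results)


# Short delays and a short timeout keep the demonstration quick.
print(global_search("yacy", {"peer-a": 0.01, "peer-b": 0.5}, timeout=0.1))
```

<p>Answers from peers that miss the deadline are simply left out of this result list.</p>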

<h3>I am worried that the browsing results are distributed.
What about privacy?</h3>
<p>None of the words that are indexed from the
pages you have crawled or seen is stored in clear text on your computer.
Instead, a hash is used which cannot be computed back into the original word.
Because index files travel among peers, nobody can tell whether a specific link was
visited by you or by another peer's user, so this frees you from being responsible for
the index files on your machine.</p>
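<p>The idea of indexing under a hash instead of the word itself can be illustrated like this.
The sketch assumes SHA-256 shortened to a 12-character Base64 string; the hash YaCy really uses
may differ, and the URL is a placeholder.</p>

```python
import base64
import hashlib

index = {}  # maps word hash -> set of URLs; no clear-text word is stored


def word_hash(word: str) -> str:
    """One-way hash of a word (algorithm and length are assumptions)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    # 9 bytes encode to exactly 12 Base64 characters, without padding.
    return base64.urlsafe_b64encode(digest[:9]).decode("ascii")


def add_to_index(word: str, url: str) -> None:
    index.setdefault(word_hash(word), set()).add(url)


def lookup(word: str) -> set:
    # A searcher hashes the query word the same way and looks up the hash.
    return index.get(word_hash(word), set())


add_to_index("privacy", "http://example.org/a")
print(lookup("Privacy"))
print(list(index.keys()))
```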

<h3>Do I need to set up and run a separate database?</h3>
<p>No. YaCy contains its own database engine, which does not need any extra set-up
or configuration.</p>

<h3>What kind of database do you use? Is it fast enough?</h3>
<p>The database stores tables in files with the
structure of AVL trees (height-balanced binary search trees).
Such a search tree guarantees lookups in logarithmic time.
We compared the YaCy database engine ('kelondro') with MySQL, and YaCy
was as fast as MySQL with up to millions of entries in one table.</p>
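<p>To get a feeling for what "logarithmic" means here, the following sketch computes the largest height
an AVL tree with a given number of entries can have. It uses the standard AVL property that the sparsest
tree of height h has N(h) = N(h-1) + N(h-2) + 1 nodes; it says nothing about how kelondro lays out its files.</p>

```python
def max_avl_height(entries: int) -> int:
    """Largest height (counted in nodes on the longest path) of an AVL tree
    holding `entries` nodes, for entries >= 1."""
    # prev and curr are the minimum node counts for heights h-1 and h.
    prev, curr = 0, 1
    height = 1
    while True:
        needed_for_next = prev + curr + 1
        if needed_for_next > entries:
            return height
        prev, curr = curr, needed_for_next
        height += 1


print(max_avl_height(1_000_000))
```

<p>For a table with one million entries the result is 28, so a lookup never has to visit more than 28 nodes.</p>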

<h3>Why do you use your own database? Why not use MySQL or OpenLDAP?</h3>
<p>The database structure we need is very special.
One requirement is that entries can be retrieved in logarithmic time <i>and</i>
can be enumerated in a chosen order. Enumeration in a specific order is needed to
compute conjunctions of tables very quickly, which happens whenever someone searches
for several words. We implement the search word conjunction by pairwise,
simultaneous enumeration and comparison of index trees/sequences.
This forces us to use binary trees as the data structure. Another requirement is that we
need to be able to have many index tables, maybe <i>millions of tables</i>.
The tables may not be big on average, but we need many of them.
This is in contrast to the organization of relational databases, where the focus
is on managing very large tables, but not many of them. A third requirement is
ease of installation and maintenance: the user should not be forced to install
an RDBMS first and take care of tablespaces and the like.
The integrated database is completely maintenance-free.</p>
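<p>The pairwise, simultaneous enumeration described above can be sketched as a merge-style intersection
of two sorted sequences. The example works on plain Python lists of made-up URL hashes rather than on
index trees, and the function names are chosen for illustration only.</p>

```python
def conjunction(urls_a, urls_b):
    """Intersect two ascending sequences by walking both at the same time."""
    result = []
    it_a, it_b = iter(urls_a), iter(urls_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        if a == b:
            # Present in both sequences: part of the conjunction.
            result.append(a)
            a, b = next(it_a, None), next(it_b, None)
        elif a < b:
            a = next(it_a, None)  # advance the sequence that is behind
        else:
            b = next(it_b, None)
    return result


def search(index, words):
    """Combine the entries of all query words pairwise."""
    sequences = [index.get(word, []) for word in words]
    if not sequences:
        return []
    result = sequences[0]
    for sequence in sequences[1:]:
        result = conjunction(result, sequence)
    return result


# Hypothetical index: word -> ascending list of URL hashes.
toy_index = {
    "free": ["u01", "u04", "u07", "u09"],
    "search": ["u02", "u04", "u09"],
    "engine": ["u04", "u05", "u09"],
}
print(search(toy_index, ["free", "search", "engine"]))
```

<p>Each step advances at least one of the two sequences, so one conjunction takes time proportional to their combined length.
This only works because both sequences are enumerated in the same order.</p>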

<h3>What does Senior Mode mean? What is Junior Mode?</h3>
<p><i>Junior</i> peers are peers that cannot be reached by other peers, while <i>Senior</i> peers can be contacted.
If your peer is reachable from the internet, it runs in Senior Mode. If it is hidden from others, it is in Junior Mode.
If your peer is in Senior Mode, it is an access point for index sharing and distribution. It can be contacted for search requests and it collects index files from other peers. If your peer is in Junior Mode, it collects index files from your browsing and distributes them only to Senior peers, but does not collect index files from other peers.
</p>

<h3>Why should I run my YaCy peer in Senior Mode?</h3>
<p>Some p2p-based file sharing programs assign very low priority to non-contributing peers. We think that this is not always fair, since sometimes the operator does not have the choice of opening the firewall or configuring the router accordingly. Our idea of 'information wares' and their exchange can also be applied to junior peers: they must contribute to the global index by submitting their index <i>actively</i>, while senior peers contribute <i>passively</i>.
Therefore we don't need to give junior peers low priority: they contribute equally, so they may participate equally.
But enough senior peers are needed to make this architecture work. Since every peer contributes almost equally, either actively or passively, you should run in Senior Mode if you can.
</p>

<h3>Why is this Search Engine also a Proxy?</h3>
<p>
We wanted to avoid a situation in which you start the search service only for the moment when you submit a search query. This would give the Search Engine too little online time. So we looked for a reason why you would want to run the Search Engine the whole time you are online. The additional value of a caching proxy is that reason. The built-in blacklist (URL filter, useful e.g. to block ads) for the proxy adds further value.
</p>

<h3>Why is this Proxy also a Search Engine?</h3>
<p>YaCy has a built-in <i>caching</i> proxy, which means that YaCy gets a lot of indexing information
'for free' without crawling. This may not be a usual function of a proxy, but it is a very useful one:
you see a lot of information when you browse the internet, and you may want to search
exactly what you have seen. Besides this feature, you can use YaCy to index an intranet
simply by using the proxy; you don't need to set up a separate search/indexing process or database.
YaCy gives you an 'instant' database and an 'instant' search service.</p>

<h3>My YaCy says it runs in 'Junior Mode'. How can I run it in Senior Mode?</h3>
<p>Open your firewall for port 8080 (or the port you configured) or program your router to act as a <i>virtual server</i>.</p>
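<p>A quick way to see whether a port accepts connections is a plain TCP connection attempt, as sketched below.
This is a generic check and not a YaCy tool. To learn anything about Senior Mode it would have to be run
from a machine <i>outside</i> your network against your public address; the address 127.0.0.1 in the example
is only a placeholder.</p>

```python
import socket


def port_is_open(host: str, port: int = 8080, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable: treat as closed.
        return False


# Replace 127.0.0.1 with the public address of your peer.
print(port_is_open("127.0.0.1", 8080, timeout=1.0))
```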

<h3>How can I help?</h3>
<p>First of all: run YaCy in Senior Mode. This helps to enrich the global index and makes YaCy more attractive.
If you want to add your own code, you are welcome; but please contact the author first and discuss your idea to see how it may fit into the overall architecture.
You can help a lot by simply giving us feedback or telling us about new ideas. You can also help by telling other people about this software.
And if you find an error or see an exception, we welcome your bug report. Any feedback is welcome.</p>

<!-- ----- HERE ENDS CONTENT PART ----- -->
<SCRIPT LANGUAGE="JavaScript1.1"><!--
globalfooter();
//--></SCRIPT>
<NOSCRIPT>
<br><br></td></tr></table>
</td>
<td width="10" valign="top">
</td>
</tr></table>
</td></tr></table>
</NOSCRIPT>
</body>
</html>