// plasmaSwitchboard.java
// -----------------------
// part of YaCy
// (C) by Michael Peter Christen; mc@anomic.de
// first published on http://www.anomic.de
// Frankfurt, Germany, 2004, 2005
//
// $LastChangedDate$
// $LastChangedRevision$
// $LastChangedBy$
//
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//
// Using this software in any meaning (reading, learning, copying, compiling,
// running) means that you agree that the Author(s) is (are) not responsible
// for cost, loss of data or any harm that may be caused directly or indirectly
// by usage of this software or this documentation. The usage of this software
// is at your own risk. The installation and usage (starting/running) of this
// software may allow other people or applications to access your computer and
// any attached devices and is highly dependent on the configuration of the
// software which must be done by the user of the software; the author(s) is
// (are) also not responsible for proper configuration and usage of the
// software, even if provoked by documentation provided together with
// the software.
//
// Any changes to this file according to the GPL as documented in the file
// gpl.txt aside this file in the shipment you received can be done to the
// lines that follow this copyright notice here, but changes must not be
// done inside the copyright notice above. A re-distribution must contain
// the intact and unchanged copyright notice.
// Contributions and changes to the program code must be marked as such.
/*
  This class holds the run-time environment of the plasma
  Search Engine. Its data forms a blackboard which can be used
  to organize running jobs around the indexing algorithm.
  The blackboard consists of the following entities:
  - storage: one plasmaStore object with the url-based database
  - configuration: initialized by properties once, then by external functions
  - job queues: for parsing, condensing, indexing
  - black/blue/whitelists: controls input and output to the index

  This class is also the core of the http crawling.
  There are some items that need to be respected when crawling the web:
  1) respect robots.txt
  2) do not access one domain too frequently, wait between accesses
  3) remember crawled URLs and do not access them again too early
  4) prioritization of specific links should be possible (hot-lists)
  5) attributes for crawling (depth, filters, hot/black-lists, priority)
  6) different crawling jobs with different attributes ('Orders') simultaneously

  We implement some specific tasks and use different databases to achieve these goals:
  - a database 'crawlerDisallow.db' contains all URLs that shall not be crawled
  - a database 'crawlerDomain.db' holds all domains and access times, where we loaded the disallow tables;
    this table contains the following entities:
    <flag: robots exist/not exist, last access of robots.txt, last access of domain (for access scheduling)>
  - four databases for scheduled access: crawlerScheduledHotText.db, crawlerScheduledColdText.db,
    crawlerScheduledHotMedia.db and crawlerScheduledColdMedia.db
  - two stacks for new URLs: newText.stack and newMedia.stack
  - two databases for URL double-check: knownText.db and knownMedia.db
  - one database with crawling orders: crawlerOrders.db

  The information flow of a single URL that is crawled is as follows:
  - a html file is loaded from a specific URL within the module httpdProxyServlet as
    a process of the proxy.
  - the file is passed to httpdProxyCache. Here its processing is delayed until the proxy is idle.
  - The cache entry is passed on to the plasmaSwitchboard. There the URL is stored into plasmaLURL
    under a specific hash. The URLs from the content are stripped off, stored in plasmaLURL
    with a 'wrong' date (the dates of the URLs are not known at this time, only after fetching) and stacked with
    plasmaCrawlerTextStack. The content is read and split into rated words in plasmaCondenser.
    The split words are then integrated into the index with plasmaSearch.
  - In plasmaSearch the words are indexed by reversing the relation between URL and words: one URL points
    to many words, the words within the document at the URL. After reversing, one word points
    to many URLs, all the URLs where the word occurs. One single word->URL-hash relation is stored in
    plasmaIndexEntry. A set of plasmaIndexEntries is a reverse word index.
    This reverse word index is stored temporarily in plasmaIndexCache.
  - In plasmaIndexCache the single plasmaIndexEntry'ies are collected and stored into a plasmaIndex entry.
    These plasmaIndex objects are the true reverse word indexes.
  - In plasmaIndex the plasmaIndexEntry objects are stored in a kelondroTree; an indexed file in the file system.

  The information flow of a search request is as follows:
  - in httpdFileServlet the user enters a search query, which is passed to plasmaSwitchboard
  - in plasmaSwitchboard, the query is passed to plasmaSearch.
  - in plasmaSearch, the plasmaSearch.result object is generated by simultaneous enumeration of
    URL hashes in the reverse word indexes plasmaIndex
  - (future: the plasmaSearch.result object is used to identify more key words for a new search)
*/
package de.anomic.plasma;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.lang.reflect.Constructor;
import java.net.InetAddress;
import java.net.MalformedURLException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Iterator;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

import de.anomic.data.URLLicense;
import de.anomic.data.blogBoard;
import de.anomic.data.blogBoardComments;
import de.anomic.data.bookmarksDB;
import de.anomic.data.listManager;
import de.anomic.data.messageBoard;
import de.anomic.data.userDB;
import de.anomic.data.wikiBoard;
import de.anomic.data.wiki.wikiParser;
import de.anomic.htmlFilter.htmlFilterContentScraper;
import de.anomic.http.httpHeader;
import de.anomic.http.httpRemoteProxyConfig;
import de.anomic.http.httpc;
import de.anomic.http.httpd;
import de.anomic.http.httpdRobotsTxtConfig;
import de.anomic.index.indexContainer;
import de.anomic.index.indexRWIEntry;
import de.anomic.index.indexURLEntry;
import de.anomic.kelondro.kelondroBitfield;
import de.anomic.kelondro.kelondroCache;
import de.anomic.kelondro.kelondroCachedRecords;
import de.anomic.kelondro.kelondroException;
import de.anomic.kelondro.kelondroMSetTools;
import de.anomic.kelondro.kelondroMapTable;
import de.anomic.kelondro.kelondroNaturalOrder;
import de.anomic.plasma.dbImport.dbImportManager;
import de.anomic.plasma.parser.ParserException;
import de.anomic.plasma.urlPattern.defaultURLPattern;
import de.anomic.plasma.urlPattern.plasmaURLPattern;
import de.anomic.server.serverAbstractSwitch;
import de.anomic.server.serverDomains;
import de.anomic.server.serverFileUtils;
import de.anomic.server.serverInstantThread;
import de.anomic.server.serverObjects;
import de.anomic.server.serverSemaphore;
import de.anomic.server.serverSwitch;
import de.anomic.server.serverThread;
import de.anomic.server.logging.serverLog;
import de.anomic.tools.crypt;
import de.anomic.yacy.yacyURL;
import de.anomic.yacy.yacyVersion;
import de.anomic.yacy.yacyClient;
import de.anomic.yacy.yacyCore;
import de.anomic.yacy.yacyNewsPool;
import de.anomic.yacy.yacyNewsRecord;
import de.anomic.yacy.yacySeed;

public final class plasmaSwitchboard extends serverAbstractSwitch implements serverSwitch {

    // load slots
    public static int crawlSlots = 10;
    public static int indexingSlots = 30;
    public static int stackCrawlSlots = 1000000;

    private int dhtTransferIndexCount = 100;

    // we must distinguish the following cases: resource-load was initiated by
    // 1) global crawling: the index is extern, not here (not possible here)
    // 2) result of search queries, some indexes are here (not possible here)
    // 3) result of index transfer, some of them are here (not possible here)
    // 4) proxy-load (initiator is "------------")
    // 5) local prefetch/crawling (initiator is own seedHash)
    // 6) local fetching for global crawling (other known or unknown initiator)
    public static final int PROCESSCASE_0_UNKNOWN = 0;
    public static final int PROCESSCASE_1_GLOBAL_CRAWLING = 1;
    public static final int PROCESSCASE_2_SEARCH_QUERY_RESULT = 2;
    public static final int PROCESSCASE_3_INDEX_TRANSFER_RESULT = 3;
    public static final int PROCESSCASE_4_PROXY_LOAD = 4;
    public static final int PROCESSCASE_5_LOCAL_CRAWLING = 5;
    public static final int PROCESSCASE_6_GLOBAL_CRAWLING = 6;
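
    // A minimal illustrative sketch (not part of the actual indexing code) of the
    // "reverse word index" described in the class comment above: a forward map of
    // url -> set of words is inverted into word -> set of urls. plasmaWordIndex and
    // plasmaIndexCache do this persistently and per word hash; this toy version only
    // shows the direction of the inversion.
    private static Map invertToReverseWordIndexSketch(Map urlToWords) {
        Map wordToUrls = new HashMap(); // word (String) -> TreeSet of url strings
        Iterator urls = urlToWords.entrySet().iterator();
        while (urls.hasNext()) {
            Map.Entry entry = (Map.Entry) urls.next();
            Iterator words = ((Set) entry.getValue()).iterator();
            while (words.hasNext()) {
                String word = (String) words.next();
                TreeSet postings = (TreeSet) wordToUrls.get(word);
                if (postings == null) { postings = new TreeSet(); wordToUrls.put(word, postings); }
                postings.add(entry.getKey());
            }
        }
        return wordToUrls;
    }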

    // coloured list management
    public static TreeSet badwords = null;
    public static TreeSet blueList = null;
    public static TreeSet stopwords = null;
    public static plasmaURLPattern urlBlacklist;

    public static wikiParser wikiParser = null;

    // storage management
    public File htCachePath;
    private File plasmaPath;
    public File indexPrimaryPath, indexSecondaryPath;
    public File listsPath;
    public File htDocsPath;
    public File rankingPath;
    public File workPath;
    public File releasePath;
    public HashMap rankingPermissions;
    public plasmaCrawlNURL noticeURL;
    public plasmaCrawlZURL errorURL, delegatedURL;
    public plasmaWordIndex wordIndex;
    public plasmaCrawlLoader cacheLoader;
    public plasmaSwitchboardQueue sbQueue;
    public plasmaCrawlStacker sbStackCrawlThread;
    public messageBoard messageDB;
    public wikiBoard wikiDB;
    public blogBoard blogDB;
    public blogBoardComments blogCommentDB;
    public static plasmaCrawlRobotsTxt robots;
    public plasmaCrawlProfile profiles;
    public plasmaCrawlProfile.entry defaultProxyProfile;
    public plasmaCrawlProfile.entry defaultRemoteProfile;
    public plasmaCrawlProfile.entry defaultTextSnippetProfile;
    public plasmaCrawlProfile.entry defaultMediaSnippetProfile;
    public boolean rankingOn;
    public plasmaRankingDistribution rankingOwnDistribution;
    public plasmaRankingDistribution rankingOtherDistribution;
    public HashMap outgoingCookies, incomingCookies;
    public kelondroMapTable facilityDB;
    public plasmaParser parser;
    public long proxyLastAccess;
    public yacyCore yc;
    public HashMap indexingTasksInProcess;
    public userDB userDB;
    public bookmarksDB bookmarksDB;
    public plasmaWebStructure webStructure;
    public dbImportManager dbImportManager;
    public plasmaDHTFlush transferIdxThread = null;
    private plasmaDHTChunk dhtTransferChunk = null;
    public ArrayList localSearches, remoteSearches; // arrays of search result properties as HashMaps
    public HashMap localSearchTracker, remoteSearchTracker; // mappings from requesting host to a TreeSet of Long(access time)
    public long lastseedcheckuptime = -1;
    public long indexedPages = 0;
    public long lastindexedPages = 0;
    public double requestedQueries = 0d;
    public double lastrequestedQueries = 0d;
    public int totalPPM = 0;
    public double totalQPM = 0d;
    public TreeMap clusterhashes; // map of peerhash(String)/alternative-local-address as ip:port or only ip (String) or null if address in seed should be used
    public boolean acceptLocalURLs, acceptGlobalURLs;
    public URLLicense licensedURLs;

    /*
     * Remote Proxy configuration
     */
    // public boolean remoteProxyUse;
    // public boolean remoteProxyUse4Yacy;
    // public String remoteProxyHost;
    // public int remoteProxyPort;
    // public String remoteProxyNoProxy = "";
    // public String[] remoteProxyNoProxyPatterns = null;
    public httpRemoteProxyConfig remoteProxyConfig = null;

    public httpdRobotsTxtConfig robotstxtConfig = null;

    /*
     * Some constants
     */
    private static final String STR_REMOTECRAWLTRIGGER = "REMOTECRAWLTRIGGER: REMOTE CRAWL TO PEER ";

    private serverSemaphore shutdownSync = new serverSemaphore(0);
    private boolean terminate = false;

    //private Object crawlingPausedSync = new Object();
    //private boolean crawlingIsPaused = false;
    private static final int CRAWLJOB_SYNC = 0;
    private static final int CRAWLJOB_STATUS = 1;
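
    // CRAWLJOB_SYNC and CRAWLJOB_STATUS presumably index the two slots of the Object[]
    // entries kept per crawl job in crawlJobsStatus below: a synchronization object and
    // the paused flag (replacing the commented-out single-job fields above).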

    //////////////////////////////////////////////////////////////////////////////////////////////
    // Thread settings
    //////////////////////////////////////////////////////////////////////////////////////////////

    // 20_dhtdistribution
    /**
     * <p><code>public static final String <strong>INDEX_DIST</strong> = "20_dhtdistribution"</code></p>
     * <p>Name of the DHT distribution thread, which selects index chunks and transfers them to other peers
     * according to the global DHT rules</p>
     */
    public static final String INDEX_DIST = "20_dhtdistribution";
    public static final String INDEX_DIST_METHOD_START = "dhtTransferJob";
    public static final String INDEX_DIST_METHOD_JOBCOUNT = null;
    public static final String INDEX_DIST_METHOD_FREEMEM = null;
    public static final String INDEX_DIST_MEMPREREQ = "20_dhtdistribution_memprereq";
    public static final String INDEX_DIST_IDLESLEEP = "20_dhtdistribution_idlesleep";
    public static final String INDEX_DIST_BUSYSLEEP = "20_dhtdistribution_busysleep";

    // 30_peerping
    /**
     * <p><code>public static final String <strong>PEER_PING</strong> = "30_peerping"</code></p>
     * <p>Name of the Peer Ping thread which publishes the own peer and retrieves information about other peers
     * connected to the YaCy network</p>
     */
    public static final String PEER_PING = "30_peerping";
    public static final String PEER_PING_METHOD_START = "peerPing";
    public static final String PEER_PING_METHOD_JOBCOUNT = null;
    public static final String PEER_PING_METHOD_FREEMEM = null;
    public static final String PEER_PING_IDLESLEEP = "30_peerping_idlesleep";
    public static final String PEER_PING_BUSYSLEEP = "30_peerping_busysleep";

    // 40_peerseedcycle
    /**
     * <p><code>public static final String <strong>SEED_UPLOAD</strong> = "40_peerseedcycle"</code></p>
     * <p>Name of the seed upload thread, providing the so-called seed-lists needed during bootstrapping</p>
     */
    public static final String SEED_UPLOAD = "40_peerseedcycle";
    public static final String SEED_UPLOAD_METHOD_START = "publishSeedList";
    public static final String SEED_UPLOAD_METHOD_JOBCOUNT = null;
    public static final String SEED_UPLOAD_METHOD_FREEMEM = null;
    public static final String SEED_UPLOAD_IDLESLEEP = "40_peerseedcycle_idlesleep";
    public static final String SEED_UPLOAD_BUSYSLEEP = "40_peerseedcycle_busysleep";

    // 50_localcrawl
    /**
     * <p><code>public static final String <strong>CRAWLJOB_LOCAL_CRAWL</strong> = "50_localcrawl"</code></p>
     * <p>Name of the local crawler thread, popping one entry off the Local Crawl Queue, and passing it to the
     * proxy cache enqueue thread to download and further process it</p>
     *
     * @see plasmaSwitchboard#PROXY_CACHE_ENQUEUE
     */
    public static final String CRAWLJOB_LOCAL_CRAWL = "50_localcrawl";
    public static final String CRAWLJOB_LOCAL_CRAWL_METHOD_START = "coreCrawlJob";
    public static final String CRAWLJOB_LOCAL_CRAWL_METHOD_JOBCOUNT = "coreCrawlJobSize";
    public static final String CRAWLJOB_LOCAL_CRAWL_METHOD_FREEMEM = null;
    public static final String CRAWLJOB_LOCAL_CRAWL_IDLESLEEP = "50_localcrawl_idlesleep";
    public static final String CRAWLJOB_LOCAL_CRAWL_BUSYSLEEP = "50_localcrawl_busysleep";

    // 61_globalcrawltrigger
    /**
     * <p><code>public static final String <strong>CRAWLJOB_GLOBAL_CRAWL_TRIGGER</strong> = "61_globalcrawltrigger"</code></p>
     * <p>Name of the global crawl trigger thread, popping one entry off its queue and sending it to a non-busy peer to
     * crawl it</p>
     *
     * @see plasmaSwitchboard#CRAWLJOB_REMOTE_TRIGGERED_CRAWL
     */
    public static final String CRAWLJOB_GLOBAL_CRAWL_TRIGGER = "61_globalcrawltrigger";
    public static final String CRAWLJOB_GLOBAL_CRAWL_TRIGGER_METHOD_START = "limitCrawlTriggerJob";
    public static final String CRAWLJOB_GLOBAL_CRAWL_TRIGGER_METHOD_JOBCOUNT = "limitCrawlTriggerJobSize";
    public static final String CRAWLJOB_GLOBAL_CRAWL_TRIGGER_METHOD_FREEMEM = null;
    public static final String CRAWLJOB_GLOBAL_CRAWL_TRIGGER_IDLESLEEP = "61_globalcrawltrigger_idlesleep";
    public static final String CRAWLJOB_GLOBAL_CRAWL_TRIGGER_BUSYSLEEP = "61_globalcrawltrigger_busysleep";

    // 62_remotetriggeredcrawl
    /**
     * <p><code>public static final String <strong>CRAWLJOB_REMOTE_TRIGGERED_CRAWL</strong> = "62_remotetriggeredcrawl"</code></p>
     * <p>Name of the remote triggered crawl thread, responsible for processing a remote crawl received from another peer</p>
     */
    public static final String CRAWLJOB_REMOTE_TRIGGERED_CRAWL = "62_remotetriggeredcrawl";
    public static final String CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_START = "remoteTriggeredCrawlJob";
    public static final String CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_JOBCOUNT = "remoteTriggeredCrawlJobSize";
    public static final String CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_FREEMEM = null;
    public static final String CRAWLJOB_REMOTE_TRIGGERED_CRAWL_IDLESLEEP = "62_remotetriggeredcrawl_idlesleep";
    public static final String CRAWLJOB_REMOTE_TRIGGERED_CRAWL_BUSYSLEEP = "62_remotetriggeredcrawl_busysleep";

    // 70_cachemanager
    /**
     * <p><code>public static final String <strong>PROXY_CACHE_ENQUEUE</strong> = "70_cachemanager"</code></p>
     * <p>Name of the proxy cache enqueue thread which fetches a given website and saves the site itself as well as its
     * HTTP headers in the HTCACHE</p>
     *
     * @see plasmaSwitchboard#HTCACHE_PATH
     */
    public static final String PROXY_CACHE_ENQUEUE = "70_cachemanager";
    public static final String PROXY_CACHE_ENQUEUE_METHOD_START = "htEntryStoreJob";
    public static final String PROXY_CACHE_ENQUEUE_METHOD_JOBCOUNT = "htEntrySize";
    public static final String PROXY_CACHE_ENQUEUE_METHOD_FREEMEM = null;
    public static final String PROXY_CACHE_ENQUEUE_IDLESLEEP = "70_cachemanager_idlesleep";
    public static final String PROXY_CACHE_ENQUEUE_BUSYSLEEP = "70_cachemanager_busysleep";

    // 80_indexing
    /**
     * <p><code>public static final String <strong>INDEXER</strong> = "80_indexing"</code></p>
     * <p>Name of the indexer thread, performing the actual indexing of a website</p>
     */
    public static final String INDEXER = "80_indexing";
    public static final String INDEXER_CLUSTER = "80_indexing_cluster";
    public static final String INDEXER_MEMPREREQ = "80_indexing_memprereq";
    public static final String INDEXER_IDLESLEEP = "80_indexing_idlesleep";
    public static final String INDEXER_BUSYSLEEP = "80_indexing_busysleep";
    public static final String INDEXER_METHOD_START = "deQueue";
    public static final String INDEXER_METHOD_JOBCOUNT = "queueSize";
    public static final String INDEXER_METHOD_FREEMEM = "deQueueFreeMem";
    public static final String INDEXER_SLOTS = "indexer.slots";

    // 82_crawlstack
    /**
     * <p><code>public static final String <strong>CRAWLSTACK</strong> = "82_crawlstack"</code></p>
     * <p>Name of the crawl stacker thread, performing several checks on new URLs to crawl, i.e. double-check</p>
     */
    public static final String CRAWLSTACK = "82_crawlstack";
    public static final String CRAWLSTACK_METHOD_START = "job";
    public static final String CRAWLSTACK_METHOD_JOBCOUNT = "size";
    public static final String CRAWLSTACK_METHOD_FREEMEM = null;
    public static final String CRAWLSTACK_IDLESLEEP = "82_crawlstack_idlesleep";
    public static final String CRAWLSTACK_BUSYSLEEP = "82_crawlstack_busysleep";

    // 90_cleanup
    /**
     * <p><code>public static final String <strong>CLEANUP</strong> = "90_cleanup"</code></p>
     * <p>The cleanup thread which is responsible for pending cleanup jobs, news/ranking distribution, etc.</p>
     */
    public static final String CLEANUP = "90_cleanup";
    public static final String CLEANUP_METHOD_START = "cleanupJob";
    public static final String CLEANUP_METHOD_JOBCOUNT = "cleanupJobSize";
    public static final String CLEANUP_METHOD_FREEMEM = null;
    public static final String CLEANUP_IDLESLEEP = "90_cleanup_idlesleep";
    public static final String CLEANUP_BUSYSLEEP = "90_cleanup_busysleep";
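
    // Each of the busy-loop threads above is identified by its "NN_name" constant; the
    // per-thread sleep intervals are read from configuration keys formed by appending
    // "_idlesleep" or "_busysleep" to that thread name, e.g. "50_localcrawl_busysleep".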

    //////////////////////////////////////////////////////////////////////////////////////////////
    // RAM Cache settings
    //////////////////////////////////////////////////////////////////////////////////////////////
    /**
     * <p><code>public static final String <strong>RAM_CACHE_LURL_TIME</strong> = "ramCacheLURL_time"</code></p>
     * <p>Name of the setting how much memory in bytes should be assigned to the Loaded URLs DB for caching purposes</p>
     */
    public static final String RAM_CACHE_LURL_TIME = "ramCacheLURL_time";
    /**
     * <p><code>public static final String <strong>RAM_CACHE_NURL_TIME</strong> = "ramCacheNURL_time"</code></p>
     * <p>Name of the setting how much memory in bytes should be assigned to the Noticed URLs DB for caching purposes</p>
     */
    public static final String RAM_CACHE_NURL_TIME = "ramCacheNURL_time";
    /**
     * <p><code>public static final String <strong>RAM_CACHE_EURL_TIME</strong> = "ramCacheEURL_time"</code></p>
     * <p>Name of the setting how much memory in bytes should be assigned to the Erroneous URLs DB for caching purposes</p>
     */
    public static final String RAM_CACHE_EURL_TIME = "ramCacheEURL_time";
    /**
     * <p><code>public static final String <strong>RAM_CACHE_RWI_TIME</strong> = "ramCacheRWI_time"</code></p>
     * <p>Name of the setting how much memory in bytes should be assigned to the RWIs DB for caching purposes</p>
     */
    public static final String RAM_CACHE_RWI_TIME = "ramCacheRWI_time";
    /**
     * <p><code>public static final String <strong>RAM_CACHE_HTTP_TIME</strong> = "ramCacheHTTP_time"</code></p>
     * <p>Name of the setting how much memory in bytes should be assigned to the HTTP Headers DB for caching purposes</p>
     */
    public static final String RAM_CACHE_HTTP_TIME = "ramCacheHTTP_time";
    /**
     * <p><code>public static final String <strong>RAM_CACHE_MESSAGE_TIME</strong> = "ramCacheMessage_time"</code></p>
     * <p>Name of the setting how much memory in bytes should be assigned to the Message DB for caching purposes</p>
     */
    public static final String RAM_CACHE_MESSAGE_TIME = "ramCacheMessage_time";
    /**
     * <p><code>public static final String <strong>RAM_CACHE_ROBOTS_TIME</strong> = "ramCacheRobots_time"</code></p>
     * <p>Name of the setting how much memory in bytes should be assigned to the robots.txt DB for caching purposes</p>
     */
    public static final String RAM_CACHE_ROBOTS_TIME = "ramCacheRobots_time";
    /**
     * <p><code>public static final String <strong>RAM_CACHE_PROFILES_TIME</strong> = "ramCacheProfiles_time"</code></p>
     * <p>Name of the setting how much memory in bytes should be assigned to the Crawl Profiles DB for caching purposes</p>
     */
    public static final String RAM_CACHE_PROFILES_TIME = "ramCacheProfiles_time";
    /**
     * <p><code>public static final String <strong>RAM_CACHE_PRE_NURL_TIME</strong> = "ramCachePreNURL_time"</code></p>
     * <p>Name of the setting how much memory in bytes should be assigned to the Pre-Noticed URLs DB for caching purposes</p>
     */
    public static final String RAM_CACHE_PRE_NURL_TIME = "ramCachePreNURL_time";
    /**
     * <p><code>public static final String <strong>RAM_CACHE_WIKI_TIME</strong> = "ramCacheWiki_time"</code></p>
     * <p>Name of the setting how much memory in bytes should be assigned to the Wiki DB for caching purposes</p>
     */
    public static final String RAM_CACHE_WIKI_TIME = "ramCacheWiki_time";
    /**
     * <p><code>public static final String <strong>RAM_CACHE_BLOG_TIME</strong> = "ramCacheBlog_time"</code></p>
     * <p>Name of the setting how much memory in bytes should be assigned to the Blog DB for caching purposes</p>
     */
    public static final String RAM_CACHE_BLOG_TIME = "ramCacheBlog_time";

    //////////////////////////////////////////////////////////////////////////////////////////////
    // DHT settings
    //////////////////////////////////////////////////////////////////////////////////////////////
    /**
     * <p><code>public static final String <strong>INDEX_DIST_DHT_RECEIPT_LIMIT</strong> = "indexDistribution.dhtReceiptLimit"</code></p>
     * <p>Name of the setting how many words the DHT-In cache may contain at most before new DHT receipts
     * will be rejected</p>
     */
    public static final String INDEX_DIST_DHT_RECEIPT_LIMIT = "indexDistribution.dhtReceiptLimit";
    /**
     * <p><code>public static final String <strong>INDEX_DIST_CHUNK_SIZE_START</strong> = "indexDistribution.startChunkSize"</code></p>
     * <p>Name of the setting specifying how many words the very first chunk will contain when the DHT thread starts</p>
     */
    public static final String INDEX_DIST_CHUNK_SIZE_START = "indexDistribution.startChunkSize";
    /**
     * <p><code>public static final String <strong>INDEX_DIST_CHUNK_SIZE_MIN</strong> = "indexDistribution.minChunkSize"</code></p>
     * <p>Name of the setting specifying how many words the smallest chunk may contain</p>
     */
    public static final String INDEX_DIST_CHUNK_SIZE_MIN = "indexDistribution.minChunkSize";
    /**
     * <p><code>public static final String <strong>INDEX_DIST_CHUNK_SIZE_MAX</strong> = "indexDistribution.maxChunkSize"</code></p>
     * <p>Name of the setting specifying how many words the largest chunk may contain</p>
     */
    public static final String INDEX_DIST_CHUNK_SIZE_MAX = "indexDistribution.maxChunkSize";
    public static final String INDEX_DIST_CHUNK_FAILS_MAX = "indexDistribution.maxChunkFails";
    /**
     * <p><code>public static final String <strong>INDEX_DIST_TIMEOUT</strong> = "indexDistribution.timeout"</code></p>
     * <p>Name of the setting how long the timeout for an Index Distribution shall be, in milliseconds</p>
     */
    public static final String INDEX_DIST_TIMEOUT = "indexDistribution.timeout";
    /**
     * <p><code>public static final String <strong>INDEX_DIST_GZIP_BODY</strong> = "indexDistribution.gzipBody"</code></p>
     * <p>Name of the setting whether DHT chunks shall be transferred in gzip-encoded form</p>
     */
    public static final String INDEX_DIST_GZIP_BODY = "indexDistribution.gzipBody";
    /**
     * <p><code>public static final String <strong>INDEX_DIST_ALLOW</strong> = "allowDistributeIndex"</code></p>
     * <p>Name of the setting whether Index Distribution shall be allowed (and the DHT thread therefore started) or not</p>
     *
     * @see plasmaSwitchboard#INDEX_DIST_ALLOW_WHILE_CRAWLING
     */
    public static final String INDEX_DIST_ALLOW = "allowDistributeIndex";
    public static final String INDEX_RECEIVE_ALLOW = "allowReceiveIndex";
    /**
     * <p><code>public static final String <strong>INDEX_DIST_ALLOW_WHILE_CRAWLING</strong> = "allowDistributeIndexWhileCrawling"</code></p>
     * <p>Name of the setting whether Index Distribution shall be allowed while crawling is in progress, i.e.
     * the Local Crawler Queue is filled.</p>
     * <p>This setting only has effect if {@link #INDEX_DIST_ALLOW} is enabled</p>
     *
     * @see plasmaSwitchboard#INDEX_DIST_ALLOW
     */
    public static final String INDEX_DIST_ALLOW_WHILE_CRAWLING = "allowDistributeIndexWhileCrawling";
    public static final String INDEX_DIST_ALLOW_WHILE_INDEXING = "allowDistributeIndexWhileIndexing";
    public static final String INDEX_TRANSFER_TIMEOUT = "indexTransfer.timeout";
    public static final String INDEX_TRANSFER_GZIP_BODY = "indexTransfer.gzipBody";

    //////////////////////////////////////////////////////////////////////////////////////////////
    // Ranking settings
    //////////////////////////////////////////////////////////////////////////////////////////////
    public static final String RANKING_DIST_ON = "CRDistOn";
    public static final String RANKING_DIST_0_PATH = "CRDist0Path";
    public static final String RANKING_DIST_0_METHOD = "CRDist0Method";
    public static final String RANKING_DIST_0_PERCENT = "CRDist0Percent";
    public static final String RANKING_DIST_0_TARGET = "CRDist0Target";
    public static final String RANKING_DIST_1_PATH = "CRDist1Path";
    public static final String RANKING_DIST_1_METHOD = "CRDist1Method";
    public static final String RANKING_DIST_1_PERCENT = "CRDist1Percent";
    public static final String RANKING_DIST_1_TARGET = "CRDist1Target";

    //////////////////////////////////////////////////////////////////////////////////////////////
    // Parser settings
    //////////////////////////////////////////////////////////////////////////////////////////////
    public static final String PARSER_MIMETYPES_REALTIME = "parseableRealtimeMimeTypes";
    public static final String PARSER_MIMETYPES_PROXY = "parseableMimeTypes.PROXY";
    public static final String PARSER_MIMETYPES_CRAWLER = "parseableMimeTypes.CRAWLER";
    public static final String PARSER_MIMETYPES_ICAP = "parseableMimeTypes.ICAP";
    public static final String PARSER_MIMETYPES_URLREDIRECTOR = "parseableMimeTypes.URLREDIRECTOR";
    public static final String PARSER_MEDIA_EXT = "mediaExt";
    public static final String PARSER_MEDIA_EXT_PARSEABLE = "parseableExt";

    //////////////////////////////////////////////////////////////////////////////////////////////
    // Proxy settings
    //////////////////////////////////////////////////////////////////////////////////////////////
    /**
     * <p><code>public static final String <strong>PROXY_ONLINE_CAUTION_DELAY</strong> = "onlineCautionDelay"</code></p>
     * <p>Name of the setting how long indexing should pause after the last time the proxy was used, in milliseconds</p>
     */
    public static final String PROXY_ONLINE_CAUTION_DELAY = "onlineCautionDelay";
    /**
     * <p><code>public static final String <strong>PROXY_PREFETCH_DEPTH</strong> = "proxyPrefetchDepth"</code></p>
     * <p>Name of the setting how deep URLs fetched by proxy usage shall be followed</p>
     */
    public static final String PROXY_PREFETCH_DEPTH = "proxyPrefetchDepth";
    public static final String PROXY_CRAWL_ORDER = "proxyCrawlOrder";

    public static final String PROXY_INDEXING_REMOTE = "proxyIndexingRemote";
    public static final String PROXY_INDEXING_LOCAL_TEXT = "proxyIndexingLocalText";
    public static final String PROXY_INDEXING_LOCAL_MEDIA = "proxyIndexingLocalMedia";

    public static final String PROXY_CACHE_SIZE = "proxyCacheSize";
    /**
     * <p><code>public static final String <strong>PROXY_CACHE_LAYOUT</strong> = "proxyCacheLayout"</code></p>
     * <p>Name of the setting which file-/folder-layout the proxy cache shall use. Possible values are {@link #PROXY_CACHE_LAYOUT_TREE}
     * and {@link #PROXY_CACHE_LAYOUT_HASH}</p>
     *
     * @see plasmaSwitchboard#PROXY_CACHE_LAYOUT_TREE
     * @see plasmaSwitchboard#PROXY_CACHE_LAYOUT_HASH
     */
    public static final String PROXY_CACHE_LAYOUT = "proxyCacheLayout";
    /**
     * <p><code>public static final String <strong>PROXY_CACHE_LAYOUT_TREE</strong> = "tree"</code></p>
     * <p>Setting the file-/folder-structure for {@link #PROXY_CACHE_LAYOUT}. Websites are stored in a folder layout
     * mirroring the structure of the URL. The first folder is either <code>http</code> or <code>https</code>,
     * depending on the protocol used to fetch the website; below it follow the hostname and the sub-folders on the
     * website, down to the actual file itself.</p>
     * <p>When using <code>tree</code>, be aware that inconsistencies between folders and files with the same name
     * may occur, which prevent proper storage of the fetched site. Below is an example how files are stored:</p>
     * <pre>
     * /html/
     * /html/www.example.com/
     * /html/www.example.com/index/
     * /html/www.example.com/index/en/
     * /html/www.example.com/index/en/index.html</pre>
     */
    public static final String PROXY_CACHE_LAYOUT_TREE = "tree";
    /**
     * <p><code>public static final String <strong>PROXY_CACHE_LAYOUT_HASH</strong> = "hash"</code></p>
     * <p>Setting the file-/folder-structure for {@link #PROXY_CACHE_LAYOUT}. Websites are stored using the MD5-sum of
     * their respective URLs. This method prevents collisions on some websites caused by using the {@link #PROXY_CACHE_LAYOUT_TREE}
     * layout.</p>
     * <p>Similarly to {@link #PROXY_CACHE_LAYOUT_TREE}, the top folder's name is given by the protocol used to fetch the site,
     * followed by either <code>www</code> or &ndash; if the hostname does not start with "www" &ndash; <code>other</code>.
     * The next folder is named after the rest of the hostname, followed by a folder <code>hash</code> which contains
     * a folder named after the first two letters of the hash. Another folder named after the 3rd and 4th letters of the
     * hash follows, which finally contains the file named after the full 18-character hash.
     * Below is an example how files are stored:</p>
     * <pre>
     * /html/
     * /html/www/
     * /html/www/example.com/
     * /html/www/example.com/hash/
     * /html/www/example.com/hash/0d/
     * /html/www/example.com/hash/0d/f8/
     * /html/www/example.com/hash/0d/f8/0df83a8444f48317d8</pre>
     */
    public static final String PROXY_CACHE_LAYOUT_HASH = "hash";
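
    // A minimal sketch (not the actual cache implementation) of the "hash" layout
    // described above: given the protocol-derived top folder, the hostname and a
    // precomputed URL hash (an MD5-based, 18-character value computed elsewhere),
    // derive the relative cache path. Method name and parameters are illustrative only.
    private static String hashLayoutPathSketch(String topFolder, String host, String urlHash) {
        boolean www = host.startsWith("www.");
        String hostPart = www ? host.substring(4) : host;
        return topFolder + "/" + (www ? "www" : "other") + "/" + hostPart + "/hash/" +
               urlHash.substring(0, 2) + "/" + urlHash.substring(2, 4) + "/" + urlHash;
        // e.g. hashLayoutPathSketch("html", "www.example.com", "0df83a8444f48317d8")
        //      yields "html/www/example.com/hash/0d/f8/0df83a8444f48317d8"
    }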
    public static final String PROXY_CACHE_MIGRATION = "proxyCacheMigration";

    //////////////////////////////////////////////////////////////////////////////////////////////
    // Cluster settings
    //////////////////////////////////////////////////////////////////////////////////////////////
    public static final String CLUSTER_MODE = "cluster.mode";
    public static final String CLUSTER_MODE_PUBLIC_CLUSTER = "publiccluster";
    public static final String CLUSTER_MODE_PRIVATE_CLUSTER = "privatecluster";
    public static final String CLUSTER_MODE_PUBLIC_PEER = "publicpeer";
    public static final String CLUSTER_PEERS_IPPORT = "cluster.peers.ipport";

    //////////////////////////////////////////////////////////////////////////////////////////////
    // Miscellaneous settings
    //////////////////////////////////////////////////////////////////////////////////////////////
    public static final String CRAWL_PROFILE_PROXY = "proxy";
    public static final String CRAWL_PROFILE_REMOTE = "remote";
    public static final String CRAWL_PROFILE_SNIPPET_TEXT = "snippetText";
    public static final String CRAWL_PROFILE_SNIPPET_MEDIA = "snippetMedia";
    /**
     * <p><code>public static final String <strong>CRAWLER_THREADS_ACTIVE_MAX</strong> = "crawler.MaxActiveThreads"</code></p>
     * <p>Name of the setting how many crawler threads may be active at the same time at most</p>
     */
    public static final String CRAWLER_THREADS_ACTIVE_MAX = "crawler.MaxActiveThreads";
    public static final String OWN_SEED_FILE = "yacyOwnSeedFile";
    /**
     * <p><code>public static final String <strong>STORAGE_PEER_HASH</strong> = "storagePeerHash"</code></p>
     * <p>Name of the setting holding the peer hash where indexes shall be transferred to after indexing a webpage. If this setting
     * is empty, the Storage Peer function is disabled</p>
     */
    public static final String STORAGE_PEER_HASH = "storagePeerHash";
    public static final String YACY_MODE_DEBUG = "yacyDebugMode";
    public static final String WORDCACHE_INIT_COUNT = "wordCacheInitCount";
    /**
     * <p><code>public static final String <strong>WORDCACHE_MAX_COUNT</strong> = "wordCacheMaxCount"</code></p>
     * <p>Name of the setting how many words the word cache (or DHT-Out cache) may contain at most. Indexing pages while the
     * cache has reached this limit will slow down the indexing process by flushing some of its entries</p>
     */
    public static final String WORDCACHE_MAX_COUNT = "wordCacheMaxCount";
    public static final String HTTPC_NAME_CACHE_CACHING_PATTERNS_NO = "httpc.nameCacheNoCachingPatterns";

    public static final String ROBOTS_TXT = "httpd.robots.txt";
    public static final String ROBOTS_TXT_DEFAULT = httpdRobotsTxtConfig.LOCKED + "," + httpdRobotsTxtConfig.DIRS;

    public static final String WIKIPARSER_CLASS = "wikiParser.class";
    public static final String WIKIPARSER_CLASS_DEFAULT = "de.anomic.data.wikiCode";

    //////////////////////////////////////////////////////////////////////////////////////////////
    // Lists
    //////////////////////////////////////////////////////////////////////////////////////////////
    /**
     * <p><code>public static final String <strong>BLACKLIST_CLASS</strong> = "BlackLists.class"</code></p>
     * <p>Name of the setting which blacklist backend shall be used. Due to different requirements of users, the
     * {@link plasmaURLPattern} interface has been created to support blacklist engines different from YaCy's default</p>
     * <p>Attention is required when the backend is changed, because different engines may have different syntaxes</p>
     */
    public static final String BLACKLIST_CLASS = "BlackLists.class";
    /**
     * <p><code>public static final String <strong>BLACKLIST_CLASS_DEFAULT</strong> = "de.anomic.plasma.urlPattern.defaultURLPattern"</code></p>
     * <p>Package and name of YaCy's {@link defaultURLPattern default} blacklist implementation</p>
     *
     * @see defaultURLPattern for a detailed overview about the syntax of the default implementation
     */
    public static final String BLACKLIST_CLASS_DEFAULT = "de.anomic.plasma.urlPattern.defaultURLPattern";
    public static final String LIST_BLUE = "plasmaBlueList";
    public static final String LIST_BLUE_DEFAULT = null;
    public static final String LIST_BADWORDS_DEFAULT = "yacy.badwords";
    public static final String LIST_STOPWORDS_DEFAULT = "yacy.stopwords";

    //////////////////////////////////////////////////////////////////////////////////////////////
    // DB Paths
    //////////////////////////////////////////////////////////////////////////////////////////////
    /**
     * <p><code>public static final String <strong>DBPATH</strong> = "dbPath"</code></p>
     * <p>Name of the setting specifying the folder, beginning from the YaCy installation's top folder, where all
     * databases containing queues are stored</p>
     */
    public static final String DBPATH = "dbPath";
    public static final String DBPATH_DEFAULT = "DATA/PLASMADB";
    /**
     * <p><code>public static final String <strong>HTCACHE_PATH</strong> = "proxyCache"</code></p>
     * <p>Name of the setting specifying the folder, beginning from the YaCy installation's top folder, where all
     * downloaded webpages and their respective resources and HTTP headers are stored. It is the location containing
     * the proxy cache</p>
     *
     * @see plasmaSwitchboard#PROXY_CACHE_LAYOUT for details on the file layout in this path
     */
    public static final String HTCACHE_PATH = "proxyCache";
    public static final String HTCACHE_PATH_DEFAULT = "DATA/HTCACHE";

    public static final String RELEASE_PATH = "releases";
    public static final String RELEASE_PATH_DEFAULT = "DATA/RELEASE";

    /**
     * <p><code>public static final String <strong>HTDOCS_PATH</strong> = "htDocsPath"</code></p>
     * <p>Name of the setting specifying the folder, beginning from the YaCy installation's top folder, where all
     * user resources (i.e. for the fileshare or the contents displayed on <code>www.peername.yacy</code>) lie.
     * The translated templates of the web interface will also be put in here</p>
     */
    public static final String HTDOCS_PATH = "htDocsPath";
    public static final String HTDOCS_PATH_DEFAULT = "DATA/HTDOCS";
    /**
     * <p><code>public static final String <strong>HTROOT_PATH</strong> = "htRootPath"</code></p>
     * <p>Name of the setting specifying the folder, beginning from the YaCy installation's top folder, where all
     * original servlets, their stylesheets, scripts, etc. lie. It is also home of the XML interface to YaCy</p>
     */
    public static final String HTROOT_PATH = "htRootPath";
    public static final String HTROOT_PATH_DEFAULT = "htroot";
    /**
     * <p><code>public static final String <strong>INDEX_PRIMARY_PATH</strong> = "indexPrimaryPath"</code></p>
     * <p>Name of the setting specifying the folder, beginning from the YaCy installation's top folder, where the
     * whole database of known RWIs and URLs as well as dumps of the DHT-In and DHT-Out caches are stored</p>
     */
    public static final String INDEX_PRIMARY_PATH = "indexPrimaryPath"; // this is a relative path to the data root
    public static final String INDEX_SECONDARY_PATH = "indexSecondaryPath"; // this is an absolute path to any location
    public static final String INDEX_PATH_DEFAULT = "DATA/INDEX";
    /**
     * <p><code>public static final String <strong>LISTS_PATH</strong> = "listsPath"</code></p>
     * <p>Name of the setting specifying the folder, beginning from the YaCy installation's top folder, where all
     * user lists like blacklists, etc. are stored</p>
     */
    public static final String LISTS_PATH = "listsPath";
    public static final String LISTS_PATH_DEFAULT = "DATA/LISTS";
    /**
     * <p><code>public static final String <strong>RANKING_PATH</strong> = "rankingPath"</code></p>
     * <p>Name of the setting specifying the folder, beginning from the YaCy installation's top folder, where all
     * ranking files are stored, self-generated as well as received ranking files</p>
     *
     * @see plasmaSwitchboard#RANKING_DIST_0_PATH
     * @see plasmaSwitchboard#RANKING_DIST_1_PATH
     */
    public static final String RANKING_PATH = "rankingPath";
    public static final String RANKING_PATH_DEFAULT = "DATA/RANKING";
    /**
     * <p><code>public static final String <strong>WORK_PATH</strong> = "workPath"</code></p>
     * <p>Name of the setting specifying the folder, beginning from the YaCy installation's top folder, where all
     * DBs containing "work" of the user are saved. These include bookmarks, messages, wiki, blog</p>
     *
     * @see plasmaSwitchboard#DBFILE_BLOG
     * @see plasmaSwitchboard#DBFILE_BOOKMARKS
     * @see plasmaSwitchboard#DBFILE_BOOKMARKS_DATES
     * @see plasmaSwitchboard#DBFILE_BOOKMARKS_TAGS
     * @see plasmaSwitchboard#DBFILE_MESSAGE
     * @see plasmaSwitchboard#DBFILE_WIKI
     * @see plasmaSwitchboard#DBFILE_WIKI_BKP
     */
    public static final String WORK_PATH = "workPath";
    public static final String WORK_PATH_DEFAULT = "DATA/WORK";

    //////////////////////////////////////////////////////////////////////////////////////////////
    // DB files
    //////////////////////////////////////////////////////////////////////////////////////////////
    /**
     * <p><code>public static final String <strong>DBFILE_MESSAGE</strong> = "message.db"</code></p>
     * <p>Name of the file containing the database holding the user's peer messages</p>
     *
     * @see plasmaSwitchboard#WORK_PATH for the folder this file lies in
     */
    public static final String DBFILE_MESSAGE = "message.db";
    /**
     * <p><code>public static final String <strong>DBFILE_WIKI</strong> = "wiki.db"</code></p>
     * <p>Name of the file containing the database holding the whole wiki of this peer</p>
     *
     * @see plasmaSwitchboard#WORK_PATH for the folder this file lies in
     * @see plasmaSwitchboard#DBFILE_WIKI_BKP for the file previous versions of wiki pages lie in
     */
    public static final String DBFILE_WIKI = "wiki.db";
    /**
     * <p><code>public static final String <strong>DBFILE_WIKI_BKP</strong> = "wiki-bkp.db"</code></p>
     * <p>Name of the file containing the database holding all versions but the latest of the wiki pages of this peer</p>
     *
     * @see plasmaSwitchboard#WORK_PATH for the folder this file lies in
     * @see plasmaSwitchboard#DBFILE_WIKI for the file the latest version of wiki pages lies in
     */
    public static final String DBFILE_WIKI_BKP = "wiki-bkp.db";
    /**
     * <p><code>public static final String <strong>DBFILE_BLOG</strong> = "blog.db"</code></p>
     * <p>Name of the file containing the database holding all blog entries available on this peer</p>
     *
     * @see plasmaSwitchboard#WORK_PATH for the folder this file lies in
     */
    public static final String DBFILE_BLOG = "blog.db";
    /**
     * <p><code>public static final String <strong>DBFILE_BLOGCOMMENTS</strong> = "blogComment.db"</code></p>
     * <p>Name of the file containing the database holding all blog comment entries available on this peer</p>
     *
     * @see plasmaSwitchboard#WORK_PATH for the folder this file lies in
     */
    public static final String DBFILE_BLOGCOMMENTS = "blogComment.db";
    /**
     * <p><code>public static final String <strong>DBFILE_BOOKMARKS</strong> = "bookmarks.db"</code></p>
     * <p>Name of the file containing the database holding all bookmarks available on this peer</p>
     *
     * @see plasmaSwitchboard#WORK_PATH for the folder this file lies in
     * @see bookmarksDB for a more detailed overview of the bookmarks structure
     */
    public static final String DBFILE_BOOKMARKS = "bookmarks.db";
    /**
     * <p><code>public static final String <strong>DBFILE_BOOKMARKS_TAGS</strong> = "bookmarkTags.db"</code></p>
     * <p>Name of the file containing the database holding all tag-&gt;bookmark relations</p>
     *
     * @see plasmaSwitchboard#WORK_PATH for the folder this file lies in
     * @see bookmarksDB for a more detailed overview of the bookmarks structure
     */
    public static final String DBFILE_BOOKMARKS_TAGS = "bookmarkTags.db";
    /**
     * <p><code>public static final String <strong>DBFILE_BOOKMARKS_DATES</strong> = "bookmarkDates.db"</code></p>
     * <p>Name of the file containing the database holding all date-&gt;bookmark relations</p>
     *
     * @see plasmaSwitchboard#WORK_PATH for the folder this file lies in
     * @see bookmarksDB for a more detailed overview of the bookmarks structure
     */
    public static final String DBFILE_BOOKMARKS_DATES = "bookmarkDates.db";
    /**
     * <p><code>public static final String <strong>DBFILE_OWN_SEED</strong> = "mySeed.txt"</code></p>
     * <p>Name of the file containing the database holding this peer's seed</p>
     */
    public static final String DBFILE_OWN_SEED = "mySeed.txt";
    /**
     * <p><code>public static final String <strong>DBFILE_CRAWL_PROFILES</strong> = "crawlProfiles0.db"</code></p>
     * <p>Name of the file containing the database holding all recent crawl profiles</p>
     *
     * @see plasmaSwitchboard#DBPATH for the folder this file lies in
     */
    public static final String DBFILE_CRAWL_PROFILES = "crawlProfiles0.db";
    /**
     * <p><code>public static final String <strong>DBFILE_CRAWL_ROBOTS</strong> = "crawlRobotsTxt.db"</code></p>
     * <p>Name of the file containing the database holding all <code>robots.txt</code> entries of the recently crawled domains</p>
     *
     * @see plasmaSwitchboard#DBPATH for the folder this file lies in
     */
    public static final String DBFILE_CRAWL_ROBOTS = "crawlRobotsTxt.db";
    /**
     * <p><code>public static final String <strong>DBFILE_USER</strong> = "DATA/SETTINGS/user.db"</code></p>
     * <p>Path to the user DB, beginning from the YaCy installation's top folder. It holds all rights the created
     * users have as well as all other needed data about them</p>
     */
    public static final String DBFILE_USER = "DATA/SETTINGS/user.db";
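
    // The *_PATH settings above are resolved relative to the YaCy root folder and
    // combined with these file names; with default settings, for example,
    // new File(new File(rootPath, getConfig(WORK_PATH, WORK_PATH_DEFAULT)), DBFILE_MESSAGE)
    // points to <root>/DATA/WORK/message.db (see the path setup in the constructor below).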
2006-02-21 12:18:48 +01:00
private Hashtable crawlJobsStatus = new Hashtable ( ) ;
2005-11-11 00:48:20 +01:00
private static plasmaSwitchboard sb ;
2006-01-18 03:18:23 +01:00
2007-06-16 16:11:52 +02:00
public plasmaSwitchboard ( String rootPath , String initPath , String configPath , boolean applyPro ) {
super ( rootPath , initPath , configPath , applyPro ) ;
2007-04-30 00:05:34 +02:00
sb = this ;
2005-11-11 00:48:20 +01:00
2005-05-11 16:58:03 +02:00
// set loglevel and log
2005-08-30 14:50:30 +02:00
setLog ( new serverLog ( " PLASMA " ) ) ;
2007-06-16 16:11:52 +02:00
if ( applyPro ) this . log . logInfo ( " This is the pro-version of YaCy " ) ;
2007-06-22 16:29:14 +02:00
// remote proxy configuration
this . remoteProxyConfig = httpRemoteProxyConfig . init ( this ) ;
this . log . logConfig ( " Remote proxy configuration: \ n " + this . remoteProxyConfig . toString ( ) ) ;
// load network configuration into settings
String networkUnitDefinition = getConfig ( " network.unit.definition " , " yacy.network.unit " ) ;
String networkGroupDefinition = getConfig ( " network.group.definition " , " yacy.network.group " ) ;
// include additional network definition properties into our settings
// note that these properties cannot be set in the application because they are
// _always_ overwritten each time with the default values. This is done so on purpose.
// the network definition should be made either consistent for all peers,
// or independently using a bootstrap URL
Map initProps ;
if ( networkUnitDefinition . startsWith ( " http:// " ) ) {
try {
2007-09-05 11:01:35 +02:00
this . setConfig ( httpc . loadHashMap ( new yacyURL ( networkUnitDefinition , null ) , remoteProxyConfig ) ) ;
2007-06-22 16:29:14 +02:00
} catch ( MalformedURLException e ) {
}
} else {
File networkUnitDefinitionFile = new File ( rootPath , networkUnitDefinition ) ;
if ( networkUnitDefinitionFile . exists ( ) ) {
initProps = serverFileUtils . loadHashMap ( networkUnitDefinitionFile ) ;
this . setConfig ( initProps ) ;
}
}
if ( networkGroupDefinition . startsWith ( " http:// " ) ) {
try {
2007-09-05 11:01:35 +02:00
this . setConfig ( httpc . loadHashMap ( new yacyURL ( networkGroupDefinition , null ) , remoteProxyConfig ) ) ;
2007-06-22 16:29:14 +02:00
} catch ( MalformedURLException e ) {
}
} else {
File networkGroupDefinitionFile = new File ( rootPath , networkGroupDefinition ) ;
if ( networkGroupDefinitionFile . exists ( ) ) {
initProps = serverFileUtils . loadHashMap ( networkGroupDefinitionFile ) ;
this . setConfig ( initProps ) ;
}
}

        // set release locations
        int i = 0;
        String location;
        while (true) {
            location = getConfig("network.unit.update.location" + i, "");
            if (location.length() == 0) break;
            try {
                yacyVersion.latestReleaseLocations.add(new yacyURL(location, null));
            } catch (MalformedURLException e) {
                break;
            }
            i++;
        }

        // initiate url license object
        licensedURLs = new URLLicense(8);

        // set URL domain acceptance
        this.acceptGlobalURLs = "global.any".indexOf(getConfig("network.unit.domain", "global")) >= 0;
        this.acceptLocalURLs = "local.any".indexOf(getConfig("network.unit.domain", "global")) >= 0;
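        // (the indexOf test above accepts the domain values "global" and "any" for
        // acceptGlobalURLs, and "local" and "any" for acceptLocalURLs)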

        // load values from configs
        this.plasmaPath = new File(rootPath, getConfig(DBPATH, DBPATH_DEFAULT));
        this.log.logConfig("Plasma DB Path: " + this.plasmaPath.toString());
        this.indexPrimaryPath = new File(rootPath, getConfig(INDEX_PRIMARY_PATH, INDEX_PATH_DEFAULT));
        this.log.logConfig("Index Primary Path: " + this.indexPrimaryPath.toString());
        this.indexSecondaryPath = (getConfig(INDEX_SECONDARY_PATH, "").length() == 0) ? indexPrimaryPath : new File(getConfig(INDEX_SECONDARY_PATH, ""));
        this.log.logConfig("Index Secondary Path: " + this.indexSecondaryPath.toString());
        this.listsPath = new File(rootPath, getConfig(LISTS_PATH, LISTS_PATH_DEFAULT));
        this.log.logConfig("Lists Path: " + this.listsPath.toString());
        this.htDocsPath = new File(rootPath, getConfig(HTDOCS_PATH, HTDOCS_PATH_DEFAULT));
        this.log.logConfig("HTDOCS Path: " + this.htDocsPath.toString());
        this.rankingPath = new File(rootPath, getConfig(RANKING_PATH, RANKING_PATH_DEFAULT));
        this.log.logConfig("Ranking Path: " + this.rankingPath.toString());
        this.rankingPermissions = new HashMap(); // mapping of permission to filename
        this.workPath = new File(rootPath, getConfig(WORK_PATH, WORK_PATH_DEFAULT));
        this.log.logConfig("Work Path: " + this.workPath.toString());

        // set up local robots.txt
        this.robotstxtConfig = httpdRobotsTxtConfig.init(this);

        // set timestamp of last proxy access
        this.proxyLastAccess = System.currentTimeMillis() - 60000;
        this.webStructure = new plasmaWebStructure(log, rankingPath, "LOCAL/010_cr/", getConfig("CRDist0Path", plasmaRankingDistribution.CR_OWN), new File(plasmaPath, "webStructure.map"));

        // configuring list path
        if (!(listsPath.exists())) listsPath.mkdirs();

        // load coloured lists
        if (blueList == null) {
            // read only once upon first instantiation of this class
            String f = getConfig(LIST_BLUE, LIST_BLUE_DEFAULT);
            File plasmaBlueListFile = new File(f);
            if (f != null) blueList = kelondroMSetTools.loadList(plasmaBlueListFile, kelondroNaturalOrder.naturalOrder); else blueList = new TreeSet();
            this.log.logConfig("loaded blue-list from file " + plasmaBlueListFile.getName() + ", " +
                    blueList.size() + " entries, " +
                    ppRamString(plasmaBlueListFile.length() / 1024));
        }

        // load the black-list / inspired by [AS]
        File blacklistsPath = new File(getRootPath(), getConfig(LISTS_PATH, LISTS_PATH_DEFAULT));
        String blacklistClassName = getConfig(BLACKLIST_CLASS, BLACKLIST_CLASS_DEFAULT);
        this.log.logConfig("Starting blacklist engine ...");
        try {
            Class blacklistClass = Class.forName(blacklistClassName);
            Constructor blacklistClassConstr = blacklistClass.getConstructor(new Class[] { File.class });
            urlBlacklist = (plasmaURLPattern) blacklistClassConstr.newInstance(new Object[] { blacklistsPath });
            this.log.logFine("Used blacklist engine class: " + blacklistClassName);
            this.log.logConfig("Using blacklist engine: " + urlBlacklist.getEngineInfo());
        } catch (Exception e) {
            this.log.logSevere("Unable to load the blacklist engine", e);
            System.exit(-1);
        } catch (Error e) {
            this.log.logSevere("Unable to load the blacklist engine", e);
            System.exit(-1);
        }
        this.log.logConfig("Loading blacklist data ...");
        listManager.switchboard = this;
        listManager.listsPath = blacklistsPath;
        listManager.reloadBlacklists();

        // load badwords (to filter the topwords)
        if (badwords == null) {
            File badwordsFile = new File(rootPath, LIST_BADWORDS_DEFAULT);
            badwords = kelondroMSetTools.loadList(badwordsFile, kelondroNaturalOrder.naturalOrder);
            this.log.logConfig("loaded badwords from file " + badwordsFile.getName() +
                    ", " + badwords.size() + " entries, " +
                    ppRamString(badwordsFile.length() / 1024));
        }

        // load stopwords
        if (stopwords == null) {
            File stopwordsFile = new File(rootPath, LIST_STOPWORDS_DEFAULT);
            stopwords = kelondroMSetTools.loadList(stopwordsFile, kelondroNaturalOrder.naturalOrder);
            this.log.logConfig("loaded stopwords from file " + stopwordsFile.getName() + ", " +
                    stopwords.size() + " entries, " +
                    ppRamString(stopwordsFile.length() / 1024));
        }

        // load ranking tables
        File YBRPath = new File(rootPath, "ranking/YBR");
        if (YBRPath.exists()) {
            plasmaSearchPreOrder.loadYBR(YBRPath, 15);
        }

        // read preload times for the database caches
        long ramLURL_time = getConfigLong(RAM_CACHE_LURL_TIME, 1000);
        long ramNURL_time = getConfigLong(RAM_CACHE_NURL_TIME, 1000);
        long ramEURL_time = getConfigLong(RAM_CACHE_EURL_TIME, 1000);
        long ramRWI_time = getConfigLong(RAM_CACHE_RWI_TIME, 1000);
        long ramHTTP_time = getConfigLong(RAM_CACHE_HTTP_TIME, 1000);
        long ramMessage_time = getConfigLong(RAM_CACHE_MESSAGE_TIME, 1000);
        long ramRobots_time = getConfigLong(RAM_CACHE_ROBOTS_TIME, 1000);
        long ramProfiles_time = getConfigLong(RAM_CACHE_PROFILES_TIME, 1000);
        long ramPreNURL_time = getConfigLong(RAM_CACHE_PRE_NURL_TIME, 1000);
        long ramWiki_time = getConfigLong(RAM_CACHE_WIKI_TIME, 1000);
        long ramBlog_time = getConfigLong(RAM_CACHE_BLOG_TIME, 1000);
        this.log.logConfig("LURL preloadTime = " + ramLURL_time);
        this.log.logConfig("NURL preloadTime = " + ramNURL_time);
        this.log.logConfig("EURL preloadTime = " + ramEURL_time);
        this.log.logConfig("RWI preloadTime = " + ramRWI_time);
        this.log.logConfig("HTTP preloadTime = " + ramHTTP_time);
        this.log.logConfig("Message preloadTime = " + ramMessage_time);
        this.log.logConfig("Wiki preloadTime = " + ramWiki_time);
        this.log.logConfig("Blog preloadTime = " + ramBlog_time);
        this.log.logConfig("Robots preloadTime = " + ramRobots_time);
        this.log.logConfig("Profiles preloadTime = " + ramProfiles_time);
        this.log.logConfig("PreNURL preloadTime = " + ramPreNURL_time);

        // make crawl profiles database and default profiles
        this.log.logConfig("Initializing Crawl Profiles");
        File profilesFile = new File(this.plasmaPath, DBFILE_CRAWL_PROFILES);
        this.profiles = new plasmaCrawlProfile(profilesFile, ramProfiles_time);
        initProfiles();
        log.logConfig("Loaded profiles from file " + profilesFile.getName() +
                ", " + this.profiles.size() + " entries" +
                ", " + ppRamString(profilesFile.length() / 1024));

        // loading the robots.txt db
        this.log.logConfig("Initializing robots.txt DB");
        File robotsDBFile = new File(this.plasmaPath, DBFILE_CRAWL_ROBOTS);
        robots = new plasmaCrawlRobotsTxt(robotsDBFile, ramRobots_time);
        this.log.logConfig("Loaded robots.txt DB from file " + robotsDBFile.getName() +
                ", " + robots.size() + " entries" +
                ", " + ppRamString(robotsDBFile.length() / 1024));

        // start a cache manager
        log.logConfig("Starting HT Cache Manager");

        // create the cache directory
        String cache = getConfig(HTCACHE_PATH, HTCACHE_PATH_DEFAULT);
        cache = cache.replace('\\', '/');
        if (cache.endsWith("/")) { cache = cache.substring(0, cache.length() - 1); }
        if (new File(cache).isAbsolute()) {
            htCachePath = new File(cache); // don't use rootPath
        } else {
            htCachePath = new File(rootPath, cache);
        }
        this.log.logInfo("HTCACHE Path = " + htCachePath.getAbsolutePath());
        long maxCacheSize = 1024 * 1024 * Long.parseLong(getConfig(PROXY_CACHE_SIZE, "2")); // this is megabytes
        String cacheLayout = getConfig(PROXY_CACHE_LAYOUT, PROXY_CACHE_LAYOUT_TREE);
        boolean cacheMigration = getConfigBool(PROXY_CACHE_MIGRATION, true);
        plasmaHTCache.init(htCachePath, maxCacheSize, ramHTTP_time, cacheLayout, cacheMigration);

        // create the release download directory
        String release = getConfig(RELEASE_PATH, RELEASE_PATH_DEFAULT);
        release = release.replace('\\', '/');
        if (release.endsWith("/")) { release = release.substring(0, release.length() - 1); }
        if (new File(release).isAbsolute()) {
            releasePath = new File(release); // don't use rootPath
        } else {
            releasePath = new File(rootPath, release);
        }
        releasePath.mkdirs();
        this.log.logInfo("RELEASE Path = " + releasePath.getAbsolutePath());

        // starting message board
        initMessages(ramMessage_time);

        // starting wiki
        initWiki(ramWiki_time);

        // starting blog
        initBlog(ramBlog_time);

        // init user DB
        this.log.logConfig("Loading User DB");
        File userDbFile = new File(getRootPath(), DBFILE_USER);
        this.userDB = new userDB(userDbFile, 2000);
        this.log.logConfig("Loaded User DB from file " + userDbFile.getName() +
                ", " + this.userDB.size() + " entries" +
                ", " + ppRamString(userDbFile.length() / 1024));

        // init bookmarks DB
        initBookmarks();

        // start indexing management
        log.logConfig("Starting Indexing Management");
        noticeURL = new plasmaCrawlNURL(plasmaPath);
        //errorURL = new plasmaCrawlZURL(); // fresh error DB each startup; can be held in RAM, which reduces IO
        errorURL = new plasmaCrawlZURL(plasmaPath, "urlError.db", true);
        delegatedURL = new plasmaCrawlZURL(plasmaPath, "urlDelegated.db", false);
        wordIndex = new plasmaWordIndex(indexPrimaryPath, indexSecondaryPath, ramRWI_time, log);

        // set a high maximum cache size to current size; this is adopted later automatically
        int wordCacheMaxCount = Math.max((int) getConfigLong(WORDCACHE_INIT_COUNT, 30000),
                (int) getConfigLong(WORDCACHE_MAX_COUNT, 20000));
        setConfig(WORDCACHE_MAX_COUNT, Integer.toString(wordCacheMaxCount));
        wordIndex.setMaxWordCount(wordCacheMaxCount);
        wordIndex.setWordFlushSize((int) getConfigLong("wordFlushSize", 10000));

        // set a maximum amount of memory for the caches
        long memprereq = Math.max(getConfigLong(INDEXER_MEMPREREQ, 0), wordIndex.minMem());
        // setConfig(INDEXER_MEMPREREQ, memprereq);
        //setThreadPerformance(INDEXER, getConfigLong(INDEXER_IDLESLEEP, 0), getConfigLong(INDEXER_BUSYSLEEP, 0), memprereq);
        kelondroCachedRecords.setCacheGrowStati(memprereq + 4 * 1024 * 1024, memprereq + 2 * 1024 * 1024);
        kelondroCache.setCacheGrowStati(memprereq + 4 * 1024 * 1024, memprereq + 2 * 1024 * 1024);

        // make parser
        log.logConfig("Starting Parser");
        this.parser = new plasmaParser();

        /* ======================================================================
         * initialize switchboard queue
         * ====================================================================== */
        // create queue
        this.sbQueue = new plasmaSwitchboardQueue(this.wordIndex.loadedURL, new File(this.plasmaPath, "switchboardQueue1.stack"), this.profiles);

        // setting the indexing queue slots
        indexingSlots = (int) getConfigLong(INDEXER_SLOTS, 30);

        // create in-process list
        this.indexingTasksInProcess = new HashMap();

        // going through the sbQueue entries and registering all content files as in use
        int count = 0;
        plasmaSwitchboardQueue.Entry queueEntry;
        Iterator i1 = sbQueue.entryIterator(true);
        while (i1.hasNext()) {
            queueEntry = (plasmaSwitchboardQueue.Entry) i1.next();
            if ((queueEntry != null) && (queueEntry.url() != null) && (queueEntry.cacheFile().exists())) {
                plasmaHTCache.filesInUse.add(queueEntry.cacheFile());
                count++;
            }
        }
        this.log.logConfig(count + " files in htcache reported to the cachemanager as in use.");

        // define an extension-blacklist
        log.logConfig("Parser: Initializing Extension Mappings for Media/Parser");
        plasmaParser.initMediaExt(plasmaParser.extString2extList(getConfig(PARSER_MEDIA_EXT, "")));
        plasmaParser.initSupportedRealtimeFileExt(plasmaParser.extString2extList(getConfig(PARSER_MEDIA_EXT_PARSEABLE, "")));

        // define a realtime parsable mimetype list
        log.logConfig("Parser: Initializing Mime Types");
        plasmaParser.initRealtimeParsableMimeTypes(getConfig(PARSER_MIMETYPES_REALTIME, "application/xhtml+xml,text/html,text/plain"));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_PROXY, getConfig(PARSER_MIMETYPES_PROXY, null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_CRAWLER, getConfig(PARSER_MIMETYPES_CRAWLER, null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_ICAP, getConfig(PARSER_MIMETYPES_ICAP, null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_URLREDIRECTOR, getConfig(PARSER_MIMETYPES_URLREDIRECTOR, null));

        // start a loader
        log.logConfig("Starting Crawl Loader");
        crawlSlots = Integer.parseInt(getConfig(CRAWLER_THREADS_ACTIVE_MAX, "10"));
        plasmaCrawlLoader.switchboard = this;
        this.cacheLoader = new plasmaCrawlLoader(this.log);

        /*
         * Creating sync objects and loading status for the crawl jobs
         * a) local crawl
         * b) remote triggered crawl
         * c) global crawl trigger
         */
        this.crawlJobsStatus.put(CRAWLJOB_LOCAL_CRAWL, new Object[] {
                new Object(),
                Boolean.valueOf(getConfig(CRAWLJOB_LOCAL_CRAWL + "_isPaused", "false")) });
        this.crawlJobsStatus.put(CRAWLJOB_REMOTE_TRIGGERED_CRAWL, new Object[] {
                new Object(),
                Boolean.valueOf(getConfig(CRAWLJOB_REMOTE_TRIGGERED_CRAWL + "_isPaused", "false")) });
        this.crawlJobsStatus.put(CRAWLJOB_GLOBAL_CRAWL_TRIGGER, new Object[] {
                new Object(),
                Boolean.valueOf(getConfig(CRAWLJOB_GLOBAL_CRAWL_TRIGGER + "_isPaused", "false")) });
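        // (each crawlJobsStatus entry is an Object[] holding { synchronization object, Boolean isPaused })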

        // init cookie monitor
        this.log.logConfig("Starting Cookie Monitor");
        this.outgoingCookies = new HashMap();
        this.incomingCookies = new HashMap();

        // init search history trackers
        this.localSearchTracker = new HashMap(); // String:TreeSet - IP:set of Long(accessTime)
        this.remoteSearchTracker = new HashMap();
        this.localSearches = new ArrayList(); // contains search result properties as HashMaps
        this.remoteSearches = new ArrayList();

        // init messages: clean up message symbol
        File notifierSource = new File(getRootPath(), getConfig(HTROOT_PATH, HTROOT_PATH_DEFAULT) + "/env/grafics/empty.gif");
        File notifierDest = new File(getConfig(HTDOCS_PATH, HTDOCS_PATH_DEFAULT), "notifier.gif");
        try {
            serverFileUtils.copy(notifierSource, notifierDest);
        } catch (IOException e) {
        }

        // clean up profiles
        this.log.logConfig("Cleaning Profiles");
        try { cleanProfiles(); } catch (InterruptedException e) { /* Ignore this here */ }

        // init ranking transmission
        /*
        CRDistOn       = true/false
        CRDist0Path    = GLOBAL/010_owncr
        CRDist0Method  = 1
        CRDist0Percent = 0
        CRDist0Target  =
        CRDist1Path    = GLOBAL/014_othercr/1
        CRDist1Method  = 9
        CRDist1Percent = 30
        CRDist1Target  = kaskelix.de:8080,yacy.dyndns.org:8000,suma-lab.de:8080
        */
        rankingOn = getConfig(RANKING_DIST_ON, "true").equals("true") && getConfig("network.unit.name", "").equals("freeworld");
        rankingOwnDistribution = new plasmaRankingDistribution(log, new File(rankingPath, getConfig(RANKING_DIST_0_PATH, plasmaRankingDistribution.CR_OWN)), (int) getConfigLong(RANKING_DIST_0_METHOD, plasmaRankingDistribution.METHOD_ANYSENIOR), (int) getConfigLong(RANKING_DIST_0_METHOD, 0), getConfig(RANKING_DIST_0_TARGET, ""));
        rankingOtherDistribution = new plasmaRankingDistribution(log, new File(rankingPath, getConfig(RANKING_DIST_1_PATH, plasmaRankingDistribution.CR_OTHER)), (int) getConfigLong(RANKING_DIST_1_METHOD, plasmaRankingDistribution.METHOD_MIXEDSENIOR), (int) getConfigLong(RANKING_DIST_1_METHOD, 30), getConfig(RANKING_DIST_1_TARGET, "kaskelix.de:8080,yacy.dyndns.org:8000,suma-lab.de:8080"));

        // init facility DB
        /*
        log.logSystem("Starting Facility Database");
        File facilityDBpath = new File(getRootPath(), "DATA/SETTINGS/");
        facilityDB = new kelondroTables(facilityDBpath);
        facilityDB.declareMaps("backlinks", 250, 500, new String[] { "date" }, null);
        log.logSystem("..opened backlinks");
        facilityDB.declareMaps("zeitgeist", 40, 500);
        log.logSystem("..opened zeitgeist");
        facilityDB.declareTree("statistik", new int[] { 11, 8, 8, 8, 8, 8, 8 }, 0x400);
        log.logSystem("..opened statistik");
        facilityDB.update("statistik", (new serverDate()).toShortString(false).substring(0, 11), new long[] { 1, 2, 3, 4, 5, 6 });
        long[] testresult = facilityDB.selectLong("statistik", "yyyyMMddHHm");
        testresult = facilityDB.selectLong("statistik", (new serverDate()).toShortString(false).substring(0, 11));
        */

        /*
         * Initializing httpc
         */
        // initializing yacyDebugMode
        httpc.yacyDebugMode = getConfig(YACY_MODE_DEBUG, "false").equals("true");

        // init nameCacheNoCachingList
        String noCachingList = getConfig(HTTPC_NAME_CACHE_CACHING_PATTERNS_NO, "");
        String[] noCachingEntries = noCachingList.split(",");
        for (i = 0; i < noCachingEntries.length; i++) {
            String entry = noCachingEntries[i].trim();
            serverDomains.nameCacheNoCachingPatterns.add(entry);
        }

        // generate snippets cache
        log.logConfig("Initializing Snippet Cache");
        plasmaSnippetCache.init(parser, log);

        // start yacy core
        log.logConfig("Starting YaCy Protocol Core");
        //try{Thread.currentThread().sleep(5000);} catch (InterruptedException e) {} // for profiler
        this.yc = new yacyCore(this);
        //log.logSystem("Started YaCy Protocol Core");
        // System.gc(); try{Thread.currentThread().sleep(5000);} catch (InterruptedException e) {} // for profiler
        serverInstantThread.oneTimeJob(yc, "loadSeeds", yacyCore.log, 3000);

        // load the wiki parser
        String wikiParserClassName = getConfig(WIKIPARSER_CLASS, WIKIPARSER_CLASS_DEFAULT);
        this.log.logConfig("Loading wiki parser " + wikiParserClassName + " ...");
        try {
            Class wikiParserClass = Class.forName(wikiParserClassName);
            Constructor wikiParserClassConstr = wikiParserClass.getConstructor(new Class[] { plasmaSwitchboard.class });
            wikiParser = (wikiParser) wikiParserClassConstr.newInstance(new Object[] { this });
        } catch (Exception e) {
            this.log.logSevere("Unable to load wiki parser, the wiki won't work", e);
        }

        // initializing the stackCrawlThread
        this.sbStackCrawlThread = new plasmaCrawlStacker(this, this.plasmaPath, ramPreNURL_time, (int) getConfigLong("tableTypeForPreNURL", 0));
        //this.sbStackCrawlThread = new plasmaStackCrawlThread(this, this.plasmaPath, ramPreNURL);
        //this.sbStackCrawlThread.start();

        // initializing dht chunk generation
        this.dhtTransferChunk = null;
        this.dhtTransferIndexCount = (int) getConfigLong(INDEX_DIST_CHUNK_SIZE_START, 50);

        // init robinson cluster
        this.clusterhashes = yacyCore.seedDB.clusterHashes(getConfig("cluster.peers.yacydomain", ""));

        // deploy threads
        log.logConfig("Starting Threads");
        // System.gc(); // help for profiler
        int indexing_cluster = Integer.parseInt(getConfig(INDEXER_CLUSTER, "1"));
        if (indexing_cluster < 1) indexing_cluster = 1;
        deployThread(CLEANUP, "Cleanup", "simple cleaning process for monitoring information", null,
                new serverInstantThread(this, CLEANUP_METHOD_START, CLEANUP_METHOD_JOBCOUNT, CLEANUP_METHOD_FREEMEM), 10000); // every 5 minutes
        deployThread(CRAWLSTACK, "Crawl URL Stacker", "process that checks urls for double-occurrences and for allowance/disallowance by robots.txt", null,
                new serverInstantThread(sbStackCrawlThread, CRAWLSTACK_METHOD_START, CRAWLSTACK_METHOD_JOBCOUNT, CRAWLSTACK_METHOD_FREEMEM), 8000);
        deployThread(INDEXER, "Parsing/Indexing", "thread that performs document parsing and indexing", "/IndexCreateIndexingQueue_p.html",
                new serverInstantThread(this, INDEXER_METHOD_START, INDEXER_METHOD_JOBCOUNT, INDEXER_METHOD_FREEMEM), 10000);
        for (i = 1; i < indexing_cluster; i++) {
            setConfig((i + 80) + "_indexing_idlesleep", getConfig(INDEXER_IDLESLEEP, ""));
            setConfig((i + 80) + "_indexing_busysleep", getConfig(INDEXER_BUSYSLEEP, ""));
            deployThread((i + 80) + "_indexing", "Parsing/Indexing (cluster job)", "thread that performs document parsing and indexing", null,
                    new serverInstantThread(this, INDEXER_METHOD_START, INDEXER_METHOD_JOBCOUNT, INDEXER_METHOD_FREEMEM), 10000 + (i * 1000),
                    Long.parseLong(getConfig(INDEXER_IDLESLEEP, "5000")),
                    Long.parseLong(getConfig(INDEXER_BUSYSLEEP, "0")),
                    Long.parseLong(getConfig(INDEXER_MEMPREREQ, "1000000")));
        }
        deployThread(PROXY_CACHE_ENQUEUE, "Proxy Cache Enqueue", "job takes new input files from RAM stack, stores them, and hands them over to the Indexing Stack", null,
                new serverInstantThread(this, PROXY_CACHE_ENQUEUE_METHOD_START, PROXY_CACHE_ENQUEUE_METHOD_JOBCOUNT, PROXY_CACHE_ENQUEUE_METHOD_FREEMEM), 10000);
        deployThread(CRAWLJOB_REMOTE_TRIGGERED_CRAWL, "Remote Crawl Job", "thread that performs a single crawl/indexing step triggered by a remote peer", null,
                new serverInstantThread(this, CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_START, CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_JOBCOUNT, CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_FREEMEM), 30000);
        deployThread(CRAWLJOB_GLOBAL_CRAWL_TRIGGER, "Global Crawl Trigger", "thread that triggers remote peers for crawling", "/IndexCreateWWWGlobalQueue_p.html",
                new serverInstantThread(this, CRAWLJOB_GLOBAL_CRAWL_TRIGGER_METHOD_START, CRAWLJOB_GLOBAL_CRAWL_TRIGGER_METHOD_JOBCOUNT, CRAWLJOB_GLOBAL_CRAWL_TRIGGER_METHOD_FREEMEM), 30000); // error here?
        deployThread(CRAWLJOB_LOCAL_CRAWL, "Local Crawl", "thread that performs a single crawl step from the local crawl queue", "/IndexCreateWWWLocalQueue_p.html",
                new serverInstantThread(this, CRAWLJOB_LOCAL_CRAWL_METHOD_START, CRAWLJOB_LOCAL_CRAWL_METHOD_JOBCOUNT, CRAWLJOB_LOCAL_CRAWL_METHOD_FREEMEM), 10000);
        deployThread(SEED_UPLOAD, "Seed-List Upload", "task that a principal peer performs to generate and upload a seed-list to an ftp account", null,
                new serverInstantThread(yc, SEED_UPLOAD_METHOD_START, SEED_UPLOAD_METHOD_JOBCOUNT, SEED_UPLOAD_METHOD_FREEMEM), 180000);
        serverInstantThread peerPing = null;
        deployThread(PEER_PING, "YaCy Core", "this is the p2p-control and peer-ping task", null,
                peerPing = new serverInstantThread(yc, PEER_PING_METHOD_START, PEER_PING_METHOD_JOBCOUNT, PEER_PING_METHOD_FREEMEM), 2000);
        peerPing.setSyncObject(new Object());

        deployThread(INDEX_DIST, "DHT Distribution", "selection, transfer and deletion of index entries that are not searched on your peer, but on others", null,
                new serverInstantThread(this, INDEX_DIST_METHOD_START, INDEX_DIST_METHOD_JOBCOUNT, INDEX_DIST_METHOD_FREEMEM), 60000,
                Long.parseLong(getConfig(INDEX_DIST_IDLESLEEP, "5000")),
                Long.parseLong(getConfig(INDEX_DIST_BUSYSLEEP, "0")),
                Long.parseLong(getConfig(INDEX_DIST_MEMPREREQ, "1000000")));

        // test routine for snippet fetch
        //Set query = new HashSet();
        //query.add(plasmaWordIndexEntry.word2hash("Weitergabe"));
        //query.add(plasmaWordIndexEntry.word2hash("Zahl"));
        //plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/mobil/newsticker/meldung/mail/54980"), query, true);
        //plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/security/news/foren/go.shtml?read=1&msg_id=7301419&forum_id=72721"), query, true);
        //plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/kiosk/archiv/ct/2003/4/20"), query, true, 260);

        this.dbImportManager = new dbImportManager(this);

        log.logConfig("Finished Switchboard Initialization");
    }
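
    /**
     * Opens the message board database from {@link #DBFILE_MESSAGE} in the work path.
     * @param ramMessage_time preload time for the message DB cache
     */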
    public void initMessages(long ramMessage_time) {
        this.log.logConfig("Starting Message Board");
        File messageDbFile = new File(workPath, DBFILE_MESSAGE);
        this.messageDB = new messageBoard(messageDbFile, ramMessage_time);
        this.log.logConfig("Loaded Message Board DB from file " + messageDbFile.getName() +
                ", " + this.messageDB.size() + " entries" +
                ", " + ppRamString(messageDbFile.length() / 1024));
    }
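
    /**
     * Opens the wiki board database ({@link #DBFILE_WIKI}) together with its backup file
     * ({@link #DBFILE_WIKI_BKP}) in the work path.
     * @param ramWiki_time preload time for the wiki DB cache
     */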
    public void initWiki(long ramWiki_time) {
        this.log.logConfig("Starting Wiki Board");
        File wikiDbFile = new File(workPath, DBFILE_WIKI);
        this.wikiDB = new wikiBoard(wikiDbFile, new File(workPath, DBFILE_WIKI_BKP), ramWiki_time);
        this.log.logConfig("Loaded Wiki Board DB from file " + wikiDbFile.getName() +
                ", " + this.wikiDB.size() + " entries" +
                ", " + ppRamString(wikiDbFile.length() / 1024));
    }
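
    /**
     * Opens the blog database ({@link #DBFILE_BLOG}) and the blog comment database
     * ({@link #DBFILE_BLOGCOMMENTS}) in the work path.
     * @param ramBlog_time preload time for both blog DB caches
     */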
    public void initBlog(long ramBlog_time) {
        this.log.logConfig("Starting Blog");
        File blogDbFile = new File(workPath, DBFILE_BLOG);
        this.blogDB = new blogBoard(blogDbFile, ramBlog_time);
        this.log.logConfig("Loaded Blog DB from file " + blogDbFile.getName() +
                ", " + this.blogDB.size() + " entries" +
                ", " + ppRamString(blogDbFile.length() / 1024));
        File blogCommentDbFile = new File(workPath, DBFILE_BLOGCOMMENTS);
        this.blogCommentDB = new blogBoardComments(blogCommentDbFile, ramBlog_time);
        this.log.logConfig("Loaded Blog-Comment DB from file " + blogCommentDbFile.getName() +
                ", " + this.blogCommentDB.size() + " entries" +
                ", " + ppRamString(blogCommentDbFile.length() / 1024));
    }
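
    /**
     * Opens the bookmark, tag and date databases ({@link #DBFILE_BOOKMARKS},
     * {@link #DBFILE_BOOKMARKS_TAGS}, {@link #DBFILE_BOOKMARKS_DATES}) in the work path.
     */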
    public void initBookmarks() {
        this.log.logConfig("Loading Bookmarks DB");
        File bookmarksFile = new File(workPath, DBFILE_BOOKMARKS);
        File tagsFile = new File(workPath, DBFILE_BOOKMARKS_TAGS);
        File datesFile = new File(workPath, DBFILE_BOOKMARKS_DATES);
        this.bookmarksDB = new bookmarksDB(bookmarksFile, tagsFile, datesFile, 2000);
        this.log.logConfig("Loaded Bookmarks DB from files " + bookmarksFile.getName() + ", " + tagsFile.getName());
        this.log.logConfig(this.bookmarksDB.tagsSize() + " Tag, " + this.bookmarksDB.bookmarksSize() + " Bookmarks");
    }
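
    /**
     * Returns the singleton instance of this class that was stored by the constructor.
     */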
    public static plasmaSwitchboard getSwitchboard() {
        return sb;
    }
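
    /**
     * Tells whether this peer runs in Robinson mode, i.e. neither sends nor receives
     * index entries by DHT distribution (see the comments below).
     */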
    public boolean isRobinsonMode() {
        // we are in Robinson mode if we do not exchange the index by dht distribution.
        // if we run a robinson cluster, we need to take care that search requests and
        // remote indexing requests go only to the peers in the same cluster
        return !getConfigBool(plasmaSwitchboard.INDEX_DIST_ALLOW, false) && !getConfigBool(plasmaSwitchboard.INDEX_RECEIVE_ALLOW, false);
    }

    public boolean isPublicRobinson() {
        // robinson peers may be members of robinson clusters, which can be public or private.
        // this does not check the robinson attribute, only the specific subtype of the cluster
        String clustermode = getConfig(CLUSTER_MODE, CLUSTER_MODE_PUBLIC_PEER);
        return (clustermode.equals(CLUSTER_MODE_PUBLIC_CLUSTER)) || (clustermode.equals(CLUSTER_MODE_PUBLIC_PEER));
    }

    public boolean isInMyCluster(String peer) {
        // check if the given peer is in our own network, if this is a robinson cluster.
        // depending on the robinson cluster type, the peer String may be a peer hash (b64-hash),
        // an ip:port String or simply an ip String.
        // if this robinson mode does not define a cluster membership, false is returned
        if (peer == null) return false;
        if (!isRobinsonMode()) return false;
        String clustermode = getConfig(CLUSTER_MODE, CLUSTER_MODE_PUBLIC_PEER);
        if (clustermode.equals(CLUSTER_MODE_PRIVATE_CLUSTER)) {
            // check if we got the request from a peer in the private cluster
            String network = getConfig(CLUSTER_PEERS_IPPORT, "");
            return network.indexOf(peer) >= 0;
        } else if (clustermode.equals(CLUSTER_MODE_PUBLIC_CLUSTER)) {
            // check if we got the request from a peer in the public cluster
            return this.clusterhashes.containsKey(peer);
        } else {
            return false;
        }
    }

    public boolean isInMyCluster(yacySeed seed) {
        // check if the given peer is in our own network, if this is a robinson cluster.
        // if this robinson mode does not define a cluster membership, false is returned
        if (seed == null) return false;
        if (!isRobinsonMode()) return false;
        String clustermode = getConfig(CLUSTER_MODE, CLUSTER_MODE_PUBLIC_PEER);
        if (clustermode.equals(CLUSTER_MODE_PRIVATE_CLUSTER)) {
            // check if we got the request from a peer in the private cluster
            String network = getConfig(CLUSTER_PEERS_IPPORT, "");
            return network.indexOf(seed.getPublicAddress()) >= 0;
        } else if (clustermode.equals(CLUSTER_MODE_PUBLIC_CLUSTER)) {
            // check if we got the request from a peer in the public cluster
            return this.clusterhashes.containsKey(seed.hash);
        } else {
            return false;
        }
    }
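
    /**
     * Decides whether a URL may be loaded and indexed by this peer, according to the
     * <code>network.unit.domain</code> setting: the host is resolved by DNS and then
     * checked against the configured global/local domain acceptance.
     */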
    public boolean acceptURL(yacyURL url) {
        // returns true if the url can be accepted according to network.unit.domain
        if (url == null) return false;
        String host = url.getHost();
        if (host == null) return false;
        return acceptURL(serverDomains.dnsResolve(host));
    }

    public boolean acceptURL(InetAddress hostAddress) {
        // returns true if the url can be accepted according to network.unit.domain
        if (hostAddress == null) return false; // if we don't know the host, we cannot load that resource anyway
        if (this.acceptGlobalURLs && this.acceptLocalURLs) return true; // fast shortcut
        boolean local = hostAddress.isSiteLocalAddress() || hostAddress.isLoopbackAddress();
        return (this.acceptGlobalURLs && !local) || (this.acceptLocalURLs && local);
    }
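
    /**
     * Tests in which of the URL databases a URL hash occurs.
     * @return the name of the database ("loaded", "crawler", "delegated" or "errors"),
     *         or <code>null</code> if the hash is not known
     */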
    public String urlExists(String hash) {
        // tests if the hash occurs in any database;
        // if it exists, the name of the database is returned,
        // if it does not exist, null is returned
        if (wordIndex.loadedURL.exists(hash)) return "loaded";
        if (noticeURL.existsInStack(hash)) return "crawler";
        if (delegatedURL.exists(hash)) return "delegated";
        if (errorURL.exists(hash)) return "errors";
        return null;
    }

    public void urlRemove(String hash) {
        wordIndex.loadedURL.remove(hash);
        noticeURL.remove(hash);
        delegatedURL.remove(hash);
        errorURL.remove(hash);
    }
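
    /**
     * Looks up the URL belonging to a URL hash in the notice, loaded, delegated and
     * error databases, in that order.
     * @return the URL, or <code>null</code> if the hash is unknown or the dummy hash
     */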
    public yacyURL getURL(String urlhash) throws IOException {
        if (urlhash.equals(yacyURL.dummyHash)) return null;
        plasmaCrawlEntry ne = noticeURL.get(urlhash);
        if (ne != null) return ne.url();
        indexURLEntry le = wordIndex.loadedURL.load(urlhash, null);
        if (le != null) return le.comp().url();
        plasmaCrawlZURL.Entry ee = delegatedURL.getEntry(urlhash);
        if (ee != null) return ee.url();
        ee = errorURL.getEntry(urlhash);
        if (ee != null) return ee.url();
        return null;
    }

    /**
     * This method changes the HTCache size.<br>
     * @param newCacheSize in MB
     */
    public final void setCacheSize(long newCacheSize) {
        plasmaHTCache.setCacheSize(1048576 * newCacheSize);
    }

    public boolean onlineCaution() {
        try {
            return System.currentTimeMillis() - proxyLastAccess < Integer.parseInt(getConfig(PROXY_ONLINE_CAUTION_DELAY, "30000"));
        } catch (NumberFormatException e) {
            return false;
        }
    }
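
    /**
     * Pretty-prints a size given in KBytes as a human-readable string;
     * e.g. <code>ppRamString(2048)</code> yields "2 MByte". Callers pass
     * <code>file.length() / 1024</code> as the argument.
     */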
    private static String ppRamString(long bytes) {
        if (bytes < 1024) return bytes + " KByte";
        bytes = bytes / 1024;
        if (bytes < 1024) return bytes + " MByte";
        bytes = bytes / 1024;
        if (bytes < 1024) return bytes + " GByte";
        return (bytes / 1024) + " TByte";
    }
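
    /**
     * Loads the default crawl profiles (proxy, remote, text snippet and media snippet)
     * from the profile DB, creating any of them that do not exist yet.
     */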
    private void initProfiles() {
        this.defaultProxyProfile = null;
        this.defaultRemoteProfile = null;
        this.defaultTextSnippetProfile = null;
        this.defaultMediaSnippetProfile = null;
        Iterator i = this.profiles.profiles(true);
        plasmaCrawlProfile.entry profile;
        String name;
        while (i.hasNext()) {
            profile = (plasmaCrawlProfile.entry) i.next();
            name = profile.name();
            if (name.equals(CRAWL_PROFILE_PROXY)) this.defaultProxyProfile = profile;
            if (name.equals(CRAWL_PROFILE_REMOTE)) this.defaultRemoteProfile = profile;
            if (name.equals(CRAWL_PROFILE_SNIPPET_TEXT)) this.defaultTextSnippetProfile = profile;
            if (name.equals(CRAWL_PROFILE_SNIPPET_MEDIA)) this.defaultMediaSnippetProfile = profile;
        }
        if (this.defaultProxyProfile == null) {
            // generate new default entry for proxy crawling
            this.defaultProxyProfile = this.profiles.newEntry("proxy", "", ".*", ".*",
                    Integer.parseInt(getConfig(PROXY_PREFETCH_DEPTH, "0")),
                    Integer.parseInt(getConfig(PROXY_PREFETCH_DEPTH, "0")),
                    60 * 24, -1, -1, false,
                    getConfigBool(PROXY_INDEXING_LOCAL_TEXT, true),
                    getConfigBool(PROXY_INDEXING_LOCAL_MEDIA, true),
                    true, true,
                    getConfigBool(PROXY_INDEXING_REMOTE, false), true, true, true);
        }
        if (this.defaultRemoteProfile == null) {
            // generate new default entry for remote crawling
            defaultRemoteProfile = this.profiles.newEntry(CRAWL_PROFILE_REMOTE, "", ".*", ".*", 0, 0,
                    -1, -1, -1, true, true, true, false, true, false, true, true, false);
        }
        if (this.defaultTextSnippetProfile == null) {
            // generate new default entry for snippet fetch and optional crawling
            defaultTextSnippetProfile = this.profiles.newEntry(CRAWL_PROFILE_SNIPPET_TEXT, "", ".*", ".*", 0, 0,
                    60 * 24 * 30, -1, -1, true, true, true, true, true, false, true, true, false);
        }
        if (this.defaultMediaSnippetProfile == null) {
            // generate new default entry for snippet fetch and optional crawling
            defaultMediaSnippetProfile = this.profiles.newEntry(CRAWL_PROFILE_SNIPPET_MEDIA, "", ".*", ".*", 0, 0,
                    60 * 24 * 30, -1, -1, true, false, true, true, true, false, true, true, false);
        }
    }

    private void resetProfiles() {
        final File pdb = new File(plasmaPath, DBFILE_CRAWL_PROFILES);
        if (pdb.exists()) pdb.delete();
        long ramProfiles_time = getConfigLong(RAM_CACHE_PROFILES_TIME, 1000);
        profiles = new plasmaCrawlProfile(pdb, ramProfiles_time);
        initProfiles();
    }

    /**
     * {@link plasmaCrawlProfile Crawl Profiles} are saved independently from the queues themselves
     * and therefore have to be cleaned up from time to time. This method only performs the clean-up
     * if - and only if - the {@link plasmaSwitchboardQueue switchboard},
     * {@link plasmaCrawlLoader loader} and {@link plasmaCrawlNURL local crawl} queues are all empty.
     * <p>
     * Then it iterates through all existing {@link plasmaCrawlProfile crawl profiles} and removes
     * all profiles which are not hardcoded.
     * </p>
     * <p>
     * <i>If this method encounters DB failures, the profile DB will be reset and</i>
     * <code>true</code> <i>will be returned.</i>
     * </p>
     * @see #CRAWL_PROFILE_PROXY hardcoded
     * @see #CRAWL_PROFILE_REMOTE hardcoded
     * @see #CRAWL_PROFILE_SNIPPET_TEXT hardcoded
     * @see #CRAWL_PROFILE_SNIPPET_MEDIA hardcoded
     * @return whether this method has done something or not (i.e. because the queues are filled
     *         or there are no profiles left to clean up)
     * @throws <b>InterruptedException</b> if the current thread has been interrupted, i.e. by the
     *         shutdown procedure
     */
    public boolean cleanProfiles() throws InterruptedException {
        if ((sbQueue.size() > 0) || (cacheLoader.size() > 0) || (noticeURL.notEmpty())) return false;
        final Iterator iter = profiles.profiles(true);
        plasmaCrawlProfile.entry entry;
        boolean hasDoneSomething = false;
        try {
            while (iter.hasNext()) {
                // check for interruption
                if (Thread.currentThread().isInterrupted()) throw new InterruptedException("Shutdown in progress");
                // getting next profile
                entry = (plasmaCrawlProfile.entry) iter.next();
                if (!((entry.name().equals(CRAWL_PROFILE_PROXY)) ||
                      (entry.name().equals(CRAWL_PROFILE_REMOTE)) ||
                      (entry.name().equals(CRAWL_PROFILE_SNIPPET_TEXT)) ||
                      (entry.name().equals(CRAWL_PROFILE_SNIPPET_MEDIA)))) {
                    iter.remove();
                    hasDoneSomething = true;
                }
            }
        } catch (kelondroException e) {
            resetProfiles();
            hasDoneSomething = true;
        }
        return hasDoneSomething;
    }

    /*
    synchronized public void htEntryStoreEnqueued(plasmaHTCache.Entry entry) {
        if (plasmaHTCache.full())
            htEntryStoreProcess(entry);
        else
            plasmaHTCache.push(entry);
    }
    */
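
    /**
     * Works off one entry from the HTCache: checks parser support and the
     * X-YACY-Index-Control header, stores the response in the cache if configured
     * or needed for indexing, and enqueues the entry on the indexing queue.
     */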
    synchronized public boolean htEntryStoreProcess(plasmaHTCache.Entry entry) {

        if (entry == null) return false;

        /* =========================================================================
         * PARSER SUPPORT
         *
         * Testing if the content type is supported by the available parsers
         * ========================================================================= */
        boolean isSupportedContent = plasmaParser.supportedContent(entry.url(), entry.getMimeType());

        /* =========================================================================
         * INDEX CONTROL HEADER
         *
         * With the X-YACY-Index-Control header set to "no-index" a client can
         * disallow yacy to index the response returned as answer to a request
         * ========================================================================= */
        boolean doIndexing = true;
        if (entry.requestProhibitsIndexing()) {
            doIndexing = false;
            if (this.log.isFine())
                this.log.logFine("Crawling of " + entry.url() + " prohibited by request.");
        }

        /* =========================================================================
         * LOCAL IP ADDRESS CHECK
         *
         * check if ip is a local ip address // TODO: remove this protocol specific code here
         * ========================================================================= */
        InetAddress hostAddress = serverDomains.dnsResolve(entry.url().getHost());
        if (hostAddress == null) {
            if (this.remoteProxyConfig == null || !this.remoteProxyConfig.useProxy()) {
                this.log.logFine("Unknown host in URL '" + entry.url() + "'. Will not be indexed.");
                doIndexing = false;
            }
        } else if (!acceptURL(hostAddress)) {
            this.log.logFine("Host in URL '" + entry.url() + "' has private ip address. Will not be indexed.");
            doIndexing = false;
        }

        /* =========================================================================
         * STORING DATA
         *
         * Now we store the response header and response content if
         * a) the user has configured to use the htcache or
         * b) the content should be indexed
         * ========================================================================= */
        if (
                (entry.profile().storeHTCache()) ||
                (doIndexing && isSupportedContent)
           ) {
            // store response header
            if (entry.writeResourceInfo()) {
                this.log.logInfo("WROTE HEADER for " + entry.cacheFile());
            }
            // work off unwritten files
            if (entry.cacheArray() == null) {
                //this.log.logFine("EXISTING FILE (" + entry.cacheFile.length() + " bytes) for " + entry.cacheFile);
            } else {
                String error = entry.shallStoreCacheForProxy();
                if (error == null) {
                    plasmaHTCache.writeResourceContent(entry.url(), entry.cacheArray());
                    this.log.logFine("WROTE FILE (" + entry.cacheArray().length + " bytes) for " + entry.cacheFile());
                } else {
                    this.log.logFine("WRITE OF FILE " + entry.cacheFile() + " FORBIDDEN: " + error);
                }
            }
        }

        /* =========================================================================
         * INDEXING
         * ========================================================================= */
        if (doIndexing && isSupportedContent) {

            // registering the cachefile as in use
            if (entry.cacheFile().exists()) {
                plasmaHTCache.filesInUse.add(entry.cacheFile());
            }

            // enqueue for further crawling
            enQueue(this.sbQueue.newEntry(
                    entry.url(),
                    (entry.referrerURL() == null) ? null : entry.referrerURL().hash(),
                    entry.ifModifiedSince(),
                    entry.requestWithCookie(),
                    entry.initiator(),
                    entry.depth(),
                    entry.profile().handle(),
                    entry.name()
            ));
        } else {
            if (!entry.profile().storeHTCache() && entry.cacheFile().exists()) {
                plasmaHTCache.deleteFile(entry.url());
            }
        }

        return true;
    }

    public boolean htEntryStoreJob() {
        if (plasmaHTCache.empty()) return false;
        return htEntryStoreProcess(plasmaHTCache.pop());
    }

    public int htEntrySize() {
        return plasmaHTCache.size();
    }
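
    /**
     * Shuts the switchboard down in three steps: terminate all managed threads,
     * close the indexing and board databases, then close the URL databases and
     * the word index.
     */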
    public void close() {
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 1: sending termination signal to managed threads:");
        terminateAllThreads(true);
        if (transferIdxThread != null) stopTransferWholeIndex(false);
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 2: sending termination signal to threaded indexing");
        // closing all still running db importer jobs
        this.dbImportManager.close();
        cacheLoader.close();
        wikiDB.close();
        blogDB.close();
        blogCommentDB.close();
        userDB.close();
        bookmarksDB.close();
        messageDB.close();
        if (facilityDB != null) facilityDB.close();
        sbStackCrawlThread.close();
        profiles.close();
        robots.close();
        parser.close();
        plasmaHTCache.close();
        sbQueue.close();
        webStructure.flushCitationReference("crg");
        webStructure.close();
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 3: sending termination signal to database manager (stand by...)");
        noticeURL.close();
        delegatedURL.close();
        errorURL.close();
        wordIndex.close();
        yc.close();
        log.logConfig("SWITCHBOARD SHUTDOWN TERMINATED");
    }

    public int queueSize() {
        return sbQueue.size();
        //return processStack.size() + cacheLoader.size() + noticeURL.stackSize();
    }

    public void enQueue(Object job) {
        if (!(job instanceof plasmaSwitchboardQueue.Entry)) {
            System.out.println("Internal error at plasmaSwitchboard.enQueue: wrong job type");
            System.exit(0);
        }
        try {
            sbQueue.push((plasmaSwitchboardQueue.Entry) job);
        } catch (IOException e) {
            log.logSevere("IOError in plasmaSwitchboard.enQueue: " + e.getMessage(), e);
        }
    }
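
    /**
     * Frees memory by flushing some entries from the RAM cache of the word index and
     * shrinking the maximum word cache count towards the current cache size.
     */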

    public void deQueueFreeMem() {
        // flush some entries from the RAM cache
        wordIndex.flushCacheSome();
        // adapt the maximum cache size to the current size to prevent further OutOfMemoryErrors
        int newMaxCount = Math.max(1200, Math.min((int) getConfigLong(WORDCACHE_MAX_COUNT, 1200), wordIndex.dhtOutCacheSize()));
        setConfig(WORDCACHE_MAX_COUNT, Integer.toString(newMaxCount));
        wordIndex.setMaxWordCount(newMaxCount);
    }
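
    // Illustrative arithmetic for the clamp above (assumed values): with a
    // configured WORDCACHE_MAX_COUNT of 20000 but only 3000 entries currently
    // in the DHT-out cache,
    //   newMaxCount = max(1200, min(20000, 3000)) = 3000
    // i.e. the cache ceiling shrinks to what is actually in use, but never
    // below the 1200-entry floor.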

    public boolean deQueue() {
        try {
            // work off fresh entries from the proxy or from the crawler
            if (onlineCaution()) {
                log.logFine("deQueue: online caution, omitting resource stack processing");
                return false;
            }

            // flush some entries from the RAM cache
            if (sbQueue.size() == 0) wordIndex.flushCacheSome(); // permanent flushing only if we are not busy
            wordIndex.loadedURL.flushCacheSome();

            boolean doneSomething = false;

            // possibly delete entries from last chunk
            if ((this.dhtTransferChunk != null) &&
                (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_COMPLETE)) {
                String deletedURLs = this.dhtTransferChunk.deleteTransferIndexes();
                this.log.logFine("Deleted from " + this.dhtTransferChunk.containers().length + " transferred RWIs locally, removed " + deletedURLs + " URL references");
                this.dhtTransferChunk = null;
            }

            // generate a dht chunk
            if (
                (dhtShallTransfer() == null) &&
                (
                    (this.dhtTransferChunk == null) ||
                    (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_UNDEFINED) ||
                    // (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_COMPLETE) ||
                    (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_FAILED)
                )
            ) {
                // generate new chunk
                int minChunkSize = (int) getConfigLong(INDEX_DIST_CHUNK_SIZE_MIN, 30);
                dhtTransferChunk = new plasmaDHTChunk(this.log, wordIndex, minChunkSize, dhtTransferIndexCount, 5000);
                doneSomething = true;
            }

            // check for interruption
            checkInterruption();

            // getting the next entry from the indexing queue
            synchronized (sbQueue) {
                if (sbQueue.size() == 0) {
                    //log.logFine("deQueue: nothing to do, queue is empty");
                    return doneSomething; // nothing to do
                }
                /*
                if (wordIndex.wordCacheRAMSize() + 1000 > (int) getConfigLong("wordCacheMaxLow", 8000)) {
                    log.logFine("deQueue: word index ram cache too full (" + ((int) getConfigLong("wordCacheMaxLow", 8000) - wordIndex.wordCacheRAMSize()) + " slots left); dismissed to omit ram flush lock");
                    return false;
                }
                */
                int stackCrawlQueueSize;
                if ((stackCrawlQueueSize = sbStackCrawlThread.size()) >= stackCrawlSlots) {
                    log.logFine("deQueue: too many processes in stack crawl thread queue, dismissed to protect emergency case (stackCrawlQueue=" + stackCrawlQueueSize + ")");
                    return doneSomething;
                }
                plasmaSwitchboardQueue.Entry nextentry;
                // if we were interrupted we should return now
                if (Thread.currentThread().isInterrupted()) {
                    log.logFine("deQueue: thread was interrupted");
                    return false;
                }
                // do one processing step
                log.logFine("DEQUEUE: sbQueueSize=" + sbQueue.size() +
                        ", coreStackSize=" + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) +
                        ", limitStackSize=" + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT) +
                        ", overhangStackSize=" + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_OVERHANG) +
                        ", remoteStackSize=" + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE));
                try {
                    nextentry = sbQueue.pop();
                    if (nextentry == null) {
                        log.logFine("deQueue: null entry on queue stack");
                        return false;
                    }
                } catch (IOException e) {
                    log.logSevere("IOError in plasmaSwitchboard.deQueue: " + e.getMessage(), e);
                    return doneSomething;
                }
                synchronized (this.indexingTasksInProcess) {
                    this.indexingTasksInProcess.put(nextentry.urlHash(), nextentry);
                }
                // parse and index the resource
                processResourceStack(nextentry);
            }

            // ready & finished
            return true;
        } catch (InterruptedException e) {
            log.logInfo("DEQUEUE: Shutdown detected.");
            return false;
        }
    }
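
    // Sketch of the dhtTransferChunk life cycle as driven by deQueue() above
    // (an explanatory summary of the branches; the actual shipping of a chunk
    // to a remote peer is assumed to happen in a separate transfer job):
    //   null / UNDEFINED / FAILED --deQueue()--> new plasmaDHTChunk, filled
    //                                            from wordIndex (minChunkSize..dhtTransferIndexCount entries)
    //   ... transfer job ships the chunk ...
    //   COMPLETE --deQueue()--> deleteTransferIndexes() removes the shipped
    //                           RWIs locally, then the chunk reference is cleared.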
    public int cleanupJobSize() {
        int c = 0;
        if ((delegatedURL.stackSize() > 1000)) c++;
        if ((errorURL.stackSize() > 1000)) c++;
        for (int i = 1; i <= 6; i++) {
            if (wordIndex.loadedURL.getStackSize(i) > 1000) c++;
        }
        return c;
    }
    public boolean cleanupJob() {
        try {
            boolean hasDoneSomething = false;

            // do transmission of cr-files
            checkInterruption();
            int count = rankingOwnDistribution.size() / 100;
            if (count == 0) count = 1;
            if (count > 5) count = 5;
            if (rankingOn) {
                rankingOwnDistribution.transferRanking(count);
                rankingOtherDistribution.transferRanking(1);
            }

            // clean up delegated stack
            checkInterruption();
            if ((delegatedURL.stackSize() > 1000)) {
                log.logFine("Cleaning Delegated-URLs report stack, " + delegatedURL.stackSize() + " entries on stack");
                delegatedURL.clearStack();
                hasDoneSomething = true;
            }

            // clean up error stack
            checkInterruption();
            if ((errorURL.stackSize() > 1000)) {
                log.logFine("Cleaning Error-URLs report stack, " + errorURL.stackSize() + " entries on stack");
                errorURL.clearStack();
                hasDoneSomething = true;
            }

            // clean up loadedURL stack
            for (int i = 1; i <= 6; i++) {
                checkInterruption();
                if (wordIndex.loadedURL.getStackSize(i) > 1000) {
                    log.logFine("Cleaning Loaded-URLs report stack, " + wordIndex.loadedURL.getStackSize(i) + " entries on stack " + i);
                    wordIndex.loadedURL.clearStack(i);
                    hasDoneSomething = true;
                }
            }

            // clean up profiles
            checkInterruption();
            if (cleanProfiles()) hasDoneSomething = true;

            // clean up news
            checkInterruption();
            try {
                log.logFine("Cleaning Incoming News, " + yacyCore.newsPool.size(yacyNewsPool.INCOMING_DB) + " entries on stack");
                if (yacyCore.newsPool.automaticProcess() > 0) hasDoneSomething = true;
            } catch (IOException e) {}
            if (getConfigBool("cleanup.deletionProcessedNews", true)) {
                yacyCore.newsPool.clear(yacyNewsPool.PROCESSED_DB);
            }
            if (getConfigBool("cleanup.deletionPublishedNews", true)) {
                yacyCore.newsPool.clear(yacyNewsPool.PUBLISHED_DB);
            }

            // clean up seed-dbs
            if (getConfigBool("routing.deleteOldSeeds.permission", true)) {
                final long deleteOldSeedsTime = getConfigLong("routing.deleteOldSeeds.time", 7) * 24 * 3600000;
                Enumeration e = yacyCore.seedDB.seedsSortedDisconnected(true, yacySeed.LASTSEEN);
                yacySeed seed = null;
                ArrayList deleteQueue = new ArrayList();
                checkInterruption();
                // clean passive seeds
                while (e.hasMoreElements()) {
                    seed = (yacySeed) e.nextElement();
                    if (seed != null) {
                        // list is sorted -> break when peers are too young to delete
                        if (seed.getLastSeenUTC() > (System.currentTimeMillis() - deleteOldSeedsTime))
                            break;
                        deleteQueue.add(seed.hash);
                    }
                }
                for (int i = 0; i < deleteQueue.size(); ++i) yacyCore.seedDB.removeDisconnected((String) deleteQueue.get(i));
                deleteQueue.clear();
                e = yacyCore.seedDB.seedsSortedPotential(true, yacySeed.LASTSEEN);
                checkInterruption();
                // clean potential seeds
                while (e.hasMoreElements()) {
                    seed = (yacySeed) e.nextElement();
                    if (seed != null) {
                        // list is sorted -> break when peers are too young to delete
                        if (seed.getLastSeenUTC() > (System.currentTimeMillis() - deleteOldSeedsTime))
                            break;
                        deleteQueue.add(seed.hash);
                    }
                }
                for (int i = 0; i < deleteQueue.size(); ++i) yacyCore.seedDB.removePotential((String) deleteQueue.get(i));
            }

            // check if an update is available and, if auto-update is activated,
            // perform an automatic installation and restart
            yacyVersion updateVersion = yacyVersion.rulebasedUpdateInfo(false);
            if (updateVersion != null) try {
                // there is a version that is more recent. Load it and re-start with it
                log.logInfo("AUTO-UPDATE: downloading more recent release " + updateVersion.url);
                yacyVersion.downloadRelease(updateVersion);
                File releaseFile = new File(sb.getRootPath(), "DATA/RELEASE/" + updateVersion.name);
                boolean devenvironment = yacyVersion.combined2prettyVersion(sb.getConfig("version", "0.1")).startsWith("dev");
                if (devenvironment) {
                    log.logInfo("AUTO-UPDATE: omitting update because this is a development environment");
                } else if ((!releaseFile.exists()) || (releaseFile.length() == 0)) {
                    log.logInfo("AUTO-UPDATE: omitting update because download failed (file cannot be found or is too small)");
                } else {
                    yacyVersion.deployRelease(updateVersion.name);
                    terminate(5000);
                    log.logInfo("AUTO-UPDATE: deploy and restart initiated");
                }
            } catch (IOException e) {
                log.logSevere("AUTO-UPDATE: could not download and install release " + updateVersion.url + ": " + e.getMessage());
            }

            // initiate broadcast about peer startup to spread supporter url
            if (yacyCore.newsPool.size(yacyNewsPool.OUTGOING_DB) == 0) {
                // read profile
                final Properties profile = new Properties();
                FileInputStream fileIn = null;
                try {
                    fileIn = new FileInputStream(new File("DATA/SETTINGS/profile.txt"));
                    profile.load(fileIn);
                } catch (IOException e) {
                } finally {
                    if (fileIn != null) try { fileIn.close(); } catch (Exception e) {}
                }
                String homepage = (String) profile.get("homepage");
                if ((homepage != null) && (homepage.length() > 10)) {
                    Properties news = new Properties();
                    news.put("homepage", profile.get("homepage"));
                    yacyCore.newsPool.publishMyNews(yacyNewsRecord.newRecord(yacyNewsPool.CATEGORY_PROFILE_BROADCAST, news));
                }
            }

            // set a maximum amount of memory for the caches
            long memprereq = Math.max(getConfigLong(INDEXER_MEMPREREQ, 0), wordIndex.minMem());
            // setConfig(INDEXER_MEMPREREQ, memprereq);
            // setThreadPerformance(INDEXER, getConfigLong(INDEXER_IDLESLEEP, 0), getConfigLong(INDEXER_BUSYSLEEP, 0), memprereq);
            kelondroCachedRecords.setCacheGrowStati(memprereq + 4 * 1024 * 1024, memprereq + 2 * 1024 * 1024);
            kelondroCache.setCacheGrowStati(memprereq + 4 * 1024 * 1024, memprereq + 2 * 1024 * 1024);

            // update the cluster set
            this.clusterhashes = yacyCore.seedDB.clusterHashes(getConfig("cluster.peers.yacydomain", ""));

            return hasDoneSomething;
        } catch (InterruptedException e) {
            this.log.logInfo("cleanupJob: Shutdown detected");
            return false;
        }
    }
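
    // Worked example for the seed clean-up above (assumed default config): with
    // routing.deleteOldSeeds.time = 7, deleteOldSeedsTime = 7 * 24 * 3600000 ms
    // = 604800000 ms, i.e. one week. A disconnected peer last seen 10 days ago
    // satisfies lastSeen <= now - deleteOldSeedsTime and is queued for removal;
    // because the enumeration is sorted by LASTSEEN, the loop can stop at the
    // first peer younger than one week.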

    /**
     * Creates a new File instance with the absolute path of our own seed file.<br>
     * @return a new File instance
     */
    public File getOwnSeedFile() {
        return new File(getRootPath(), getConfig(OWN_SEED_FILE, DBFILE_OWN_SEED));
    }

    /**
     * With this function the crawling process can be paused
     */
    public void pauseCrawlJob(String jobType) {
        Object[] status = (Object[]) this.crawlJobsStatus.get(jobType);
        synchronized (status[CRAWLJOB_SYNC]) {
            status[CRAWLJOB_STATUS] = Boolean.TRUE;
        }
        setConfig(jobType + "_isPaused", "true");
    }

    /**
     * Continue the previously paused crawling
     */
    public void continueCrawlJob(String jobType) {
        Object[] status = (Object[]) this.crawlJobsStatus.get(jobType);
        synchronized (status[CRAWLJOB_SYNC]) {
            if (((Boolean) status[CRAWLJOB_STATUS]).booleanValue()) {
                status[CRAWLJOB_STATUS] = Boolean.FALSE;
                status[CRAWLJOB_SYNC].notifyAll();
            }
        }
        setConfig(jobType + "_isPaused", "false");
    }

    /**
     * @return <code>true</code> if crawling was paused or <code>false</code> otherwise
     */
    public boolean crawlJobIsPaused(String jobType) {
        Object[] status = (Object[]) this.crawlJobsStatus.get(jobType);
        synchronized (status[CRAWLJOB_SYNC]) {
            return ((Boolean) status[CRAWLJOB_STATUS]).booleanValue();
        }
    }
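
    // Minimal usage sketch for the pause API above (hypothetical caller code;
    // sb stands for the running plasmaSwitchboard instance):
    //   sb.pauseCrawlJob(CRAWLJOB_LOCAL_CRAWL);
    //   // a coreCrawlJob() worker entering its synchronized block now wait()s
    //   sb.continueCrawlJob(CRAWLJOB_LOCAL_CRAWL); // notifyAll() wakes it again
    // The flag is additionally mirrored into the config as "<jobType>_isPaused",
    // apparently so that the pause state can be read back elsewhere.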

    public int coreCrawlJobSize() {
        return noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE);
    }

    public boolean coreCrawlJob() {
        if (noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) == 0) {
            //log.logDebug("CoreCrawl: queue is empty");
            return false;
        }
        if (sbQueue.size() >= indexingSlots) {
            log.logFine("CoreCrawl: too many processes in indexing queue, dismissed (sbQueueSize=" + sbQueue.size() + ")");
            return false;
        }
        if (cacheLoader.size() >= crawlSlots) {
            log.logFine("CoreCrawl: too many processes in loader queue, dismissed (cacheLoader=" + cacheLoader.size() + ")");
            return false;
        }
        if (onlineCaution()) {
            log.logFine("CoreCrawl: online caution, omitting processing");
            return false;
        }
        // if the server is busy, we do crawling more slowly
        //if (!(cacheManager.idle())) try {Thread.currentThread().sleep(2000);} catch (InterruptedException e) {}

        // if crawling was paused we have to wait until we were notified to continue
        Object[] status = (Object[]) this.crawlJobsStatus.get(CRAWLJOB_LOCAL_CRAWL);
        synchronized (status[CRAWLJOB_SYNC]) {
            if (((Boolean) status[CRAWLJOB_STATUS]).booleanValue()) {
                try {
                    status[CRAWLJOB_SYNC].wait();
                } catch (InterruptedException e) { return false; }
            }
        }

        // do a local crawl
        plasmaCrawlEntry urlEntry = null;
        while (urlEntry == null && noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) > 0) {
            String stats = "LOCALCRAWL[" + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) + ", " + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT) + ", " + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_OVERHANG) + ", " + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE) + "]";
            try {
                urlEntry = noticeURL.pop(plasmaCrawlNURL.STACK_TYPE_CORE, true);
                String profileHandle = urlEntry.profileHandle();
                // System.out.println("DEBUG plasmaSwitchboard.processCrawling:
                // profileHandle = " + profileHandle + ", urlEntry.url = " + urlEntry.url());
                if (profileHandle == null) {
                    log.logSevere(stats + ": NULL PROFILE HANDLE '" + urlEntry.profileHandle() + "' for URL " + urlEntry.url());
                    return true;
                }
                plasmaCrawlProfile.entry profile = profiles.getEntry(profileHandle);
                if (profile == null) {
                    log.logWarning(stats + ": LOST PROFILE HANDLE '" + urlEntry.profileHandle() + "' for URL " + urlEntry.url());
                    return true;
                }
                log.logFine("LOCALCRAWL: URL=" + urlEntry.url() + ", initiator=" + urlEntry.initiator() + ", crawlOrder=" + ((profile.remoteIndexing()) ? "true" : "false") + ", depth=" + urlEntry.depth() + ", crawlDepth=" + profile.generalDepth() + ", filter=" + profile.generalFilter()
                        + ", permission=" + ((yacyCore.seedDB == null) ? "undefined" : (((yacyCore.seedDB.mySeed.isSenior()) || (yacyCore.seedDB.mySeed.isPrincipal())) ? "true" : "false")));
                processLocalCrawling(urlEntry, profile, stats);
                return true;
            } catch (IOException e) {
                log.logSevere(stats + ": CANNOT FETCH ENTRY: " + e.getMessage(), e);
                if (e.getMessage().indexOf("hash is null") > 0) noticeURL.clear(plasmaCrawlNURL.STACK_TYPE_CORE);
            }
        }
        return true;
    }
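
    // Note (explanatory, derived from the code order above): coreCrawlJob()
    // evaluates all of its back-pressure gates -- stack emptiness, indexing
    // queue, loader queue, online caution, pause flag -- before popping a URL,
    // so a dismissed cycle never consumes an entry from the notice-URL stack.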
    public int limitCrawlTriggerJobSize() {
        return noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT);
    }

    public boolean limitCrawlTriggerJob() {
        if (noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT) == 0) {
            //log.logDebug("LimitCrawl: queue is empty");
            return false;
        }

        boolean robinsonPrivateCase = ((isRobinsonMode()) &&
                (!getConfig(CLUSTER_MODE, "").equals(CLUSTER_MODE_PUBLIC_CLUSTER)) &&
                (!getConfig(CLUSTER_MODE, "").equals(CLUSTER_MODE_PRIVATE_CLUSTER)));

        if ((robinsonPrivateCase) || ((coreCrawlJobSize() <= 20) && (limitCrawlTriggerJobSize() > 10))) {
            // it is not efficient if the core crawl job is empty and we have too much to do
            // move some tasks to the core crawl job
            int toshift = 10; // this cannot be a big number because the balancer makes a forced waiting if it cannot balance
            if (toshift > limitCrawlTriggerJobSize()) toshift = limitCrawlTriggerJobSize();
            for (int i = 0; i < toshift; i++) {
                noticeURL.shift(plasmaCrawlNURL.STACK_TYPE_LIMIT, plasmaCrawlNURL.STACK_TYPE_CORE);
            }
            log.logInfo("shifted " + toshift + " jobs from global crawl to local crawl (coreCrawlJobSize()=" + coreCrawlJobSize() + ", limitCrawlTriggerJobSize()=" + limitCrawlTriggerJobSize() + ", cluster.mode=" + getConfig(CLUSTER_MODE, "") + ", robinsonMode=" + ((isRobinsonMode()) ? "on" : "off"));
            if (robinsonPrivateCase) return false;
        }

        // if the server is busy, we do crawling more slowly
        //if (!(cacheManager.idle())) try {Thread.currentThread().sleep(2000);} catch (InterruptedException e) {}

        // if crawling was paused we have to wait until we were notified to continue
        Object[] status = (Object[]) this.crawlJobsStatus.get(CRAWLJOB_GLOBAL_CRAWL_TRIGGER);
        synchronized (status[CRAWLJOB_SYNC]) {
            if (((Boolean) status[CRAWLJOB_STATUS]).booleanValue()) {
                try {
                    status[CRAWLJOB_SYNC].wait();
                } catch (InterruptedException e) { return false; }
            }
        }

        // start a global crawl, if possible
        String stats = "REMOTECRAWLTRIGGER[" + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) + ", " + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT) + ", " + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_OVERHANG) + ", "
                + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE) + "]";
        try {
            plasmaCrawlEntry urlEntry = noticeURL.pop(plasmaCrawlNURL.STACK_TYPE_LIMIT, true);
            String profileHandle = urlEntry.profileHandle();
            // System.out.println("DEBUG plasmaSwitchboard.processCrawling:
            // profileHandle = " + profileHandle + ", urlEntry.url = " + urlEntry.url());
            plasmaCrawlProfile.entry profile = profiles.getEntry(profileHandle);
            if (profile == null) {
                log.logWarning(stats + ": LOST PROFILE HANDLE '" + urlEntry.profileHandle() + "' for URL " + urlEntry.url());
                return true;
            }
            log.logFine("plasmaSwitchboard.limitCrawlTriggerJob: url=" + urlEntry.url() + ", initiator=" + urlEntry.initiator() + ", crawlOrder=" + ((profile.remoteIndexing()) ? "true" : "false") + ", depth=" + urlEntry.depth() + ", crawlDepth=" + profile.generalDepth() + ", filter="
                    + profile.generalFilter() + ", permission=" + ((yacyCore.seedDB == null) ? "undefined" : (((yacyCore.seedDB.mySeed.isSenior()) || (yacyCore.seedDB.mySeed.isPrincipal())) ? "true" : "false")));

            boolean tryRemote = ((noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) != 0) || (sbQueue.size() != 0)) &&
                    (profile.remoteIndexing()) &&
                    (urlEntry.initiator() != null) &&
                    // (!(urlEntry.initiator().equals(indexURL.dummyHash))) &&
                    ((yacyCore.seedDB.mySeed.isSenior()) || (yacyCore.seedDB.mySeed.isPrincipal()));
            if (tryRemote) {
                boolean success = processRemoteCrawlTrigger(urlEntry);
                if (success) return true;
            }

            processLocalCrawling(urlEntry, profile, stats); // emergency case
            if (sbQueue.size() >= indexingSlots) {
                log.logFine("LimitCrawl: too many processes in indexing queue, delayed to protect emergency case (sbQueueSize=" + sbQueue.size() + ")");
                return false;
            }
            if (cacheLoader.size() >= crawlSlots) {
                log.logFine("LimitCrawl: too many processes in loader queue, delayed to protect emergency case (cacheLoader=" + cacheLoader.size() + ")");
                return false;
            }
            return true;
        } catch (IOException e) {
            log.logSevere(stats + ": CANNOT FETCH ENTRY: " + e.getMessage(), e);
            if (e.getMessage().indexOf("hash is null") > 0) noticeURL.clear(plasmaCrawlNURL.STACK_TYPE_LIMIT);
            return true; // if we return a false here we will block everything
        }
    }
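
    // Explanatory note on the shift heuristic above: entries on the LIMIT stack
    // are candidates for delegation to remote peers. They are moved back to the
    // local CORE stack when either (a) the peer runs in robinson mode outside a
    // public or private cluster, so no remote peer may crawl for it, or (b) the
    // local crawl is nearly idle (<= 20 entries) while more than 10 delegation
    // candidates are waiting.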

    public int remoteTriggeredCrawlJobSize() {
        return noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE);
    }

    public boolean remoteTriggeredCrawlJob() {
        // work off crawl requests that had been placed by other peers on our crawl stack
        // do nothing if either there are private processes to be done
        // or there is no global crawl on the stack
        if (noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE) == 0) {
            //log.logDebug("GlobalCrawl: queue is empty");
            return false;
        }
        if (sbQueue.size() >= indexingSlots) {
            log.logFine("GlobalCrawl: too many processes in indexing queue, dismissed (sbQueueSize=" + sbQueue.size() + ")");
            return false;
        }
        if (cacheLoader.size() >= crawlSlots) {
            log.logFine("GlobalCrawl: too many processes in loader queue, dismissed (cacheLoader=" + cacheLoader.size() + ")");
            return false;
        }
        if (onlineCaution()) {
            log.logFine("GlobalCrawl: online caution, omitting processing");
            return false;
        }

        // if crawling was paused we have to wait until we were notified to continue
        Object[] status = (Object[]) this.crawlJobsStatus.get(CRAWLJOB_REMOTE_TRIGGERED_CRAWL);
        synchronized (status[CRAWLJOB_SYNC]) {
            if (((Boolean) status[CRAWLJOB_STATUS]).booleanValue()) {
                try {
                    status[CRAWLJOB_SYNC].wait();
                } catch (InterruptedException e) { return false; }
            }
        }

        // we don't want to crawl a global URL globally, since WE are the global part. (from this point of view)
        String stats = "REMOTETRIGGEREDCRAWL[" + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) + ", " + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT) + ", " + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_OVERHANG) + ", "
                + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE) + "]";
        try {
            plasmaCrawlEntry urlEntry = noticeURL.pop(plasmaCrawlNURL.STACK_TYPE_REMOTE, true);
            String profileHandle = urlEntry.profileHandle();
            // System.out.println("DEBUG plasmaSwitchboard.processCrawling:
            // profileHandle = " + profileHandle + ", urlEntry.url = " + urlEntry.url());
            plasmaCrawlProfile.entry profile = profiles.getEntry(profileHandle);
            if (profile == null) {
                log.logWarning(stats + ": LOST PROFILE HANDLE '" + urlEntry.profileHandle() + "' for URL " + urlEntry.url());
                return false;
            }
            log.logFine("plasmaSwitchboard.remoteTriggeredCrawlJob: url=" + urlEntry.url() + ", initiator=" + urlEntry.initiator() + ", crawlOrder=" + ((profile.remoteIndexing()) ? "true" : "false") + ", depth=" + urlEntry.depth() + ", crawlDepth=" + profile.generalDepth() + ", filter="
                    + profile.generalFilter() + ", permission=" + ((yacyCore.seedDB == null) ? "undefined" : (((yacyCore.seedDB.mySeed.isSenior()) || (yacyCore.seedDB.mySeed.isPrincipal())) ? "true" : "false")));
            processLocalCrawling(urlEntry, profile, stats);
            return true;
        } catch (IOException e) {
            log.logSevere(stats + ": CANNOT FETCH ENTRY: " + e.getMessage(), e);
            if (e.getMessage().indexOf("hash is null") > 0) noticeURL.clear(plasmaCrawlNURL.STACK_TYPE_REMOTE);
            return true;
        }
    }

    private plasmaParserDocument parseResource(plasmaSwitchboardQueue.Entry entry, String initiatorHash) throws InterruptedException, ParserException {

        // the mimetype of this entry
        String mimeType = entry.getMimeType();
        String charset = entry.getCharacterEncoding();

        // the parser logger
        //serverLog parserLogger = parser.getLogger();

        // parse the document
        return parseResource(entry.url(), mimeType, charset, entry.cacheFile());
    }

    public plasmaParserDocument parseResource(yacyURL location, String mimeType, String documentCharset, File sourceFile) throws InterruptedException, ParserException {
        plasmaParserDocument doc = parser.parseSource(location, mimeType, documentCharset, sourceFile);
        assert (doc != null) : "Unexpected error. Parser returned null.";
        return doc;
    }
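
    // Hypothetical caller sketch for the public parseResource() variant above;
    // the URL, mime type, charset and cache-file path are invented example
    // values, and sb stands for a plasmaSwitchboard instance:
    //   yacyURL location = new yacyURL("http://example.net/page.html", null);
    //   plasmaParserDocument doc = sb.parseResource(location, "text/html", "UTF-8",
    //           new File("DATA/HTCACHE/example.net/page.html"));
    //   Map links = doc.getHyperlinks(); // anchors found by the parser
    //   doc.close();                     // release parser resources when done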
    private void processResourceStack(plasmaSwitchboardQueue.Entry entry) throws InterruptedException {
        plasmaParserDocument document = null;
        try {
            // work off one stack entry with a fresh resource
            long stackStartTime = 0, stackEndTime = 0,
                 parsingStartTime = 0, parsingEndTime = 0,
                 indexingStartTime = 0, indexingEndTime = 0,
                 storageStartTime = 0, storageEndTime = 0;

            // we must distinguish the following cases: resource-load was initiated by
            // 1) global crawling: the index is extern, not here (not possible here)
            // 2) result of search queries, some indexes are here (not possible here)
            // 3) result of index transfer, some of them are here (not possible here)
            // 4) proxy-load (initiator is "------------")
            // 5) local prefetch/crawling (initiator is own seedHash)
            // 6) local fetching for global crawling (other known or unknown initiator)
            int processCase = PROCESSCASE_0_UNKNOWN;
            yacySeed initiatorPeer = null;
            String initiatorPeerHash = (entry.proxy()) ? yacyURL.dummyHash : entry.initiator();
            if (initiatorPeerHash.equals(yacyURL.dummyHash)) {
                // proxy-load
                processCase = PROCESSCASE_4_PROXY_LOAD;
            } else if (initiatorPeerHash.equals(yacyCore.seedDB.mySeed.hash)) {
                // normal crawling
                processCase = PROCESSCASE_5_LOCAL_CRAWLING;
            } else {
                // this was done for a remote peer (a global crawl)
                initiatorPeer = yacyCore.seedDB.getConnected(initiatorPeerHash);
                processCase = PROCESSCASE_6_GLOBAL_CRAWLING;
            }

            log.logFine("processResourceStack processCase=" + processCase +
                    ", depth=" + entry.depth() +
                    ", maxDepth=" + ((entry.profile() == null) ? "null" : Integer.toString(entry.profile().generalDepth())) +
                    ", filter=" + ((entry.profile() == null) ? "null" : entry.profile().generalFilter()) +
                    ", initiatorHash=" + initiatorPeerHash +
                    //", responseHeader=" + ((entry.responseHeader() == null) ? "null" : entry.responseHeader().toString()) +
                    ", url=" + entry.url()); // DEBUG

            /* =========================================================================
             * PARSE CONTENT
             * ========================================================================= */
            parsingStartTime = System.currentTimeMillis();
            try {
                document = this.parseResource(entry, initiatorPeerHash);
                if (document == null) return;
            } catch (ParserException e) {
                this.log.logInfo("Unable to parse the resource '" + entry.url() + "'. " + e.getMessage());
                addURLtoErrorDB(entry.url(), entry.referrerHash(), initiatorPeerHash, entry.anchorName(), e.getErrorCode(), new kelondroBitfield());
                if (document != null) {
                    document.close();
                    document = null;
                }
                return;
            }
            parsingEndTime = System.currentTimeMillis();

            // getting the document date
            Date docDate = entry.getModificationDate();

            /* =========================================================================
             * PUT ANCHORS ON CRAWL STACK
             * ========================================================================= */
            stackStartTime = System.currentTimeMillis();
            if (
                ((processCase == PROCESSCASE_4_PROXY_LOAD) || (processCase == PROCESSCASE_5_LOCAL_CRAWLING)) &&
                ((entry.profile() == null) || (entry.depth() < entry.profile().generalDepth()))
            ) {
                Map hl = document.getHyperlinks();
                Iterator i = hl.entrySet().iterator();
                String nextUrlString;
                yacyURL nextUrl;
                Map.Entry nextEntry;
                while (i.hasNext()) {
                    // check for interruption
                    checkInterruption();

                    // fetching the next hyperlink
                    nextEntry = (Map.Entry) i.next();
                    nextUrlString = (String) nextEntry.getKey();
                    try {
                        nextUrl = new yacyURL(nextUrlString, null);
                        // enqueue the hyperlink into the pre-notice-url db
                        sbStackCrawlThread.enqueue(nextUrl, entry.urlHash(), initiatorPeerHash, (String) nextEntry.getValue(), docDate, entry.depth() + 1, entry.profile());
                    } catch (MalformedURLException e1) {}
                }
                log.logInfo("CRAWL: ADDED " + hl.size() + " LINKS FROM " + entry.url().toNormalform(false, true) +
                        ", NEW CRAWL STACK SIZE IS " + noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE));
            }
            stackEndTime = System.currentTimeMillis();

            /* =========================================================================
             * CREATE INDEX
             * ========================================================================= */
            String docDescription = document.getTitle();
            yacyURL referrerURL = entry.referrerURL();

            String noIndexReason = plasmaCrawlEURL.DENIED_UNSPECIFIED_INDEXING_ERROR;
            if (processCase == PROCESSCASE_4_PROXY_LOAD) {
                // proxy-load
                noIndexReason = entry.shallIndexCacheForProxy();
            } else {
                // normal crawling
                noIndexReason = entry.shallIndexCacheForCrawler();
            }

            if (noIndexReason == null) {
                // strip out words
                indexingStartTime = System.currentTimeMillis();
                checkInterruption();
                log.logFine("Condensing for '" + entry.url().toNormalform(false, true) + "'");
                plasmaCondenser condenser = new plasmaCondenser(document, entry.profile().indexText(), entry.profile().indexMedia());

                // generate citation reference
                Integer[] ioLinks = webStructure.generateCitationReference(entry.url(), entry.urlHash(), docDate, document, condenser); // [outlinksSame, outlinksOther]

                try {
                    // check for interruption
                    checkInterruption();

                    // create a new loaded URL db entry
                    long ldate = System.currentTimeMillis();
                    indexURLEntry newEntry = new indexURLEntry(
                            entry.url(),                   // URL
                            docDescription,                // document description
                            document.getAuthor(),          // author
                            document.getKeywords(' '),     // tags
                            "",                            // ETag
                            docDate,                       // modification date
                            new Date(),                    // loaded date
                            new Date(ldate + Math.max(0, ldate - docDate.getTime()) / 2), // freshdate, computed with the Proxy-TTL formula
                            (referrerURL == null) ? null : referrerURL.hash(), // referrer hash
                            new byte[0],                   // md5
                            (int) entry.size(),            // size
                            condenser.RESULT_NUMB_WORDS,   // word count
                            plasmaHTCache.docType(document.getMimeType()), // doctype
                            condenser.RESULT_FLAGS,        // flags
                            yacyURL.language(entry.url()), // language
                            ioLinks[0].intValue(),         // llocal
                            ioLinks[1].intValue(),         // lother
                            document.getAudiolinks().size(), // laudio
                            document.getImages().size(),     // limage
                            document.getVideolinks().size(), // lvideo
                            document.getApplinks().size()    // lapp
                    );

                    /* ========================================================================
                     * STORE URL TO LOADED-URL-DB
                     * ======================================================================== */
                    wordIndex.loadedURL.store(newEntry);
                    wordIndex.loadedURL.stack(
                            newEntry,                    // loaded url db entry
                            initiatorPeerHash,           // initiator peer hash
                            yacyCore.seedDB.mySeed.hash, // executor peer hash
                            processCase                  // process case
                    );
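
                    // Worked example for the freshdate above (assumed numbers): if the
                    // document was last modified 10 days before loading, its age is
                    // ldate - docDate.getTime() = 10 days, so the entry is considered
                    // fresh until 5 days after the load time -- the same "half the age"
                    // heuristic that HTTP proxies use to estimate a missing TTL.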

                    // check for interruption
                    checkInterruption();

                    /* ========================================================================
                     * STORE WORD INDEX
                     * ======================================================================== */
                    if (
                        (
                            (processCase == PROCESSCASE_4_PROXY_LOAD) ||
                            (processCase == PROCESSCASE_5_LOCAL_CRAWLING) ||
                            (processCase == PROCESSCASE_6_GLOBAL_CRAWLING)
                        ) &&
                        ((entry.profile().indexText()) || (entry.profile().indexMedia()))
                    ) {
                        String urlHash = newEntry.hash();

                        // remove stopwords
                        log.logInfo("Excluded " + condenser.excludeWords(stopwords) + " words in URL " + entry.url());
                        indexingEndTime = System.currentTimeMillis();

                        storageStartTime = System.currentTimeMillis();
                        int words = 0;
                        String storagePeerHash;
                        yacySeed seed;
                        if (
                            ((storagePeerHash = getConfig(STORAGE_PEER_HASH, null)) == null) ||
                            (storagePeerHash.trim().length() == 0) ||
                            ((seed = yacyCore.seedDB.getConnected(storagePeerHash)) == null)
                        ) {
                            /* ========================================================================
                             * STORE PAGE INDEX INTO WORD INDEX DB
                             * ======================================================================== */
                            words = wordIndex.addPageIndex(
                                    entry.url(),        // document url
                                    docDate,            // document mod date
                                    (int) entry.size(), // document size
                                    document,           // document content
                                    condenser,          // document condenser
                                    yacyURL.language(entry.url()),                 // document language
                                    plasmaHTCache.docType(document.getMimeType()), // document type
                                    ioLinks[0].intValue(), // outlinkSame
                                    ioLinks[1].intValue()  // outlinkOthers
                            );
                        } else {
                            /* ========================================================================
                             * SEND PAGE INDEX TO STORAGE PEER
                             * ======================================================================== */
                            HashMap urlCache = new HashMap(1);
                            urlCache.put(newEntry.hash(), newEntry);

                            ArrayList tmpContainers = new ArrayList(condenser.words().size());

                            String language = yacyURL.language(entry.url());
                            char doctype = plasmaHTCache.docType(document.getMimeType());
                            indexURLEntry.Components comp = newEntry.comp();
                            int urlLength = comp.url().toNormalform(true, true).length();
                            int urlComps = htmlFilterContentScraper.urlComps(comp.url().toNormalform(true, true)).length;

                            // iterate over all words
                            Iterator i = condenser.words().entrySet().iterator();
                            Map.Entry wentry;
                            plasmaCondenser.wordStatProp wordStat;
                            while (i.hasNext()) {
                                wentry = (Map.Entry) i.next();
                                String word = (String) wentry.getKey();
                                wordStat = (plasmaCondenser.wordStatProp) wentry.getValue();
                                String wordHash = plasmaCondenser.word2hash(word);
                                indexRWIEntry wordIdxEntry = new indexRWIEntry(
                                        urlHash,
                                        urlLength, urlComps,
                                        wordStat.count,
                                        document.getTitle().length(),
                                        condenser.words().size(),
                                        condenser.sentences().size(),
                                        wordStat.posInText,
                                        wordStat.posInPhrase,
                                        wordStat.numOfPhrase,
                                        0,
                                        newEntry.size(),
                                        docDate.getTime(),
                                        System.currentTimeMillis(),
                                        language,
                                        doctype,
                                        ioLinks[0].intValue(),
                                        ioLinks[1].intValue(),
                                        condenser.RESULT_FLAGS
                                );
                                indexContainer wordIdxContainer = plasmaWordIndex.emptyContainer(wordHash, 1);
                                wordIdxContainer.add(wordIdxEntry);
                                tmpContainers.add(wordIdxContainer);
                            }
                            //System.out.println("DEBUG: plasmaSearch.addPageIndex: added " + condenser.getWords().size() + " words, flushed " + c + " entries");
                            words = condenser.words().size();

                            // transferring the index to the storage peer
                            indexContainer[] indexData = (indexContainer[]) tmpContainers.toArray(new indexContainer[tmpContainers.size()]);
                            HashMap resultObj = yacyClient.transferIndex(
                                    seed,      // target seed
                                    indexData, // word index data
                                    urlCache,  // urls
                                    true,      // gzip body
                                    120000     // transfer timeout
                            );

                            // check for interruption
                            checkInterruption();

                            // if the transfer failed we try to store the index locally
                            String error = (String) resultObj.get("result");
                            if (error != null) {
                                words = wordIndex.addPageIndex(
                                        entry.url(),
                                        docDate,
                                        (int) entry.size(),
                                        document,
                                        condenser,
                                        yacyURL.language(entry.url()),
                                        plasmaHTCache.docType(document.getMimeType()),
                                        ioLinks[0].intValue(),
                                        ioLinks[1].intValue()
                                );
                            }
                            tmpContainers = null;
                        } //end: SEND PAGE INDEX TO STORAGE PEER
                        storageEndTime = System.currentTimeMillis();

                        // increment number of indexed urls
                        indexedPages++;

                        if (log.isInfo()) {
                            // TODO: UTF-8 docDescription seems not to be displayed correctly because
                            // of string concatenation
                            log.logInfo("*Indexed " + words + " words in URL " + entry.url() +
                                    " [" + entry.urlHash() + "]" +
                                    "\n\tDescription: " + docDescription +
                                    "\n\tMimeType: " + document.getMimeType() + " | Charset: " + document.getCharset() + " | " +
                                    "Size: " + document.getTextLength() + " bytes | " +
                                    "Anchors: " + ((document.getAnchors() == null) ? 0 : document.getAnchors().size()) +
                                    "\n\tStackingTime: " + (stackEndTime - stackStartTime) + " ms | " +
                                    "ParsingTime: " + (parsingEndTime - parsingStartTime) + " ms | " +
                                    "IndexingTime: " + (indexingEndTime - indexingStartTime) + " ms | " +
                                    "StorageTime: " + (storageEndTime - storageStartTime) + " ms");
                        }

                        // check for interruption
                        checkInterruption();

                        // if this was performed for a remote crawl request, notify the requester
                        if ((processCase == PROCESSCASE_6_GLOBAL_CRAWLING) && (initiatorPeer != null)) {
                            log.logInfo("Sending crawl receipt for '" + entry.url().toNormalform(false, true) + "' to " + initiatorPeer.getName());
                            if (clusterhashes != null) initiatorPeer.setAlternativeAddress((String) clusterhashes.get(initiatorPeer.hash));
                            yacyClient.crawlReceipt(initiatorPeer, "crawl", "fill", "indexed", newEntry, "");
                        }
                    } else {
                        log.logFine("Not Indexed Resource '" + entry.url().toNormalform(false, true) + "': process case=" + processCase);
                        addURLtoErrorDB(entry.url(), (referrerURL == null) ? null : referrerURL.hash(), initiatorPeerHash, docDescription, plasmaCrawlEURL.DENIED_UNKNOWN_INDEXING_PROCESS_CASE, new kelondroBitfield());
                    }
                } catch (Exception ee) {
                    if (ee instanceof InterruptedException) throw (InterruptedException) ee;

                    // check for interruption
                    checkInterruption();
                    log.logSevere("Could not index URL " + entry.url() + ": " + ee.getMessage(), ee);
                    if ((processCase == PROCESSCASE_6_GLOBAL_CRAWLING) && (initiatorPeer != null)) {
                        if (clusterhashes != null) initiatorPeer.setAlternativeAddress((String) clusterhashes.get(initiatorPeer.hash));
                        yacyClient.crawlReceipt(initiatorPeer, "crawl", "exception", ee.getMessage(), null, "");
                    }
                    addURLtoErrorDB(entry.url(), (referrerURL == null) ? null : referrerURL.hash(), initiatorPeerHash, docDescription, plasmaCrawlEURL.DENIED_UNSPECIFIED_INDEXING_ERROR, new kelondroBitfield());
                }
            } else {
                // check for interruption
                checkInterruption();
                log.logInfo("Not indexed any word in URL " + entry.url() + "; cause: " + noIndexReason);
                addURLtoErrorDB(entry.url(), (referrerURL == null) ? null : referrerURL.hash(), initiatorPeerHash, docDescription, noIndexReason, new kelondroBitfield());
                if ((processCase == PROCESSCASE_6_GLOBAL_CRAWLING) && (initiatorPeer != null)) {
                    if (clusterhashes != null) initiatorPeer.setAlternativeAddress((String) clusterhashes.get(initiatorPeer.hash));
                    yacyClient.crawlReceipt(initiatorPeer, "crawl", "rejected", noIndexReason, null, "");
                }
            }
            document.close();
            document = null;
        } catch (Exception e) {
            if (e instanceof InterruptedException) throw (InterruptedException) e;
            this.log.logSevere("Unexpected exception while parsing/indexing URL ", e);
        } catch (Error e) {
            this.log.logSevere("Unexpected exception while parsing/indexing URL ", e);
        } finally {
            checkInterruption();

            // The following code must be in the finally block, otherwise it will not be executed
            // on errors!

            // removing the current entry from the in-process list
            synchronized (this.indexingTasksInProcess) {
                this.indexingTasksInProcess.remove(entry.urlHash());
            }

            // removing the current entry from the notice URL queue
            /*
            boolean removed = noticeURL.remove(entry.urlHash()); // worked-off
            if (!removed) {
                log.logFinest("Unable to remove indexed URL " + entry.url() + " from Crawler Queue. This could be because of an URL redirect.");
            }
            */

            // explicitly delete/free resources
            if ((entry != null) && (entry.profile() != null) && (!(entry.profile().storeHTCache()))) {
                plasmaHTCache.filesInUse.remove(entry.cacheFile());
                plasmaHTCache.deleteFile(entry.url());
            }
            entry = null;
            if (document != null) try { document.close(); } catch (Exception e) { /* ignore this */ }
        }
    }
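
    // Explanatory summary of the receipt protocol in processResourceStack()
    // above (derived from the calls, not additional behaviour): for a crawl
    // that a remote peer delegated to us (process case 6), the initiator always
    // receives a yacyClient.crawlReceipt() -- result "indexed" with the new URL
    // entry on success, "exception" on an indexing error, and "rejected" with
    // the noIndexReason if the cache entry was not allowed to be indexed.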

    private void processLocalCrawling(plasmaCrawlEntry urlEntry, plasmaCrawlProfile.entry profile, String stats) {
        // work off one crawl stack entry
        if ((urlEntry == null) || (urlEntry.url() == null)) {
            log.logInfo(stats + ": urlEntry=null");
            return;
        }

        // convert the referrer hash into the corresponding URL
        yacyURL refererURL = null;
        String refererHash = urlEntry.referrerhash();
        if ((refererHash != null) && (!refererHash.equals(yacyURL.dummyHash))) try {
            refererURL = this.getURL(refererHash);
        } catch (IOException e) {
            refererURL = null;
        }
        cacheLoader.loadAsync(urlEntry.url(), urlEntry.name(), (refererURL != null) ? refererURL.toString() : null, urlEntry.initiator(), urlEntry.depth(), profile, -1, false);
        log.logInfo(stats + ": enqueued for load " + urlEntry.url() + " [" + urlEntry.url().hash() + "]");
        return;
    }
2007-03-16 14:25:56 +01:00
private boolean processRemoteCrawlTrigger ( plasmaCrawlEntry urlEntry ) {
// if this returns true, then the urlEntry is considered as stored somewhere and the case is finished
// if this returns false, the urlEntry will be enqueued to the local crawl again
2005-09-06 16:17:53 +02:00
2007-03-16 14:25:56 +01:00
// wrong access
2005-04-07 21:19:42 +02:00
if ( urlEntry = = null ) {
2006-12-05 03:47:51 +01:00
log . logInfo ( " REMOTECRAWLTRIGGER[ " + noticeURL . stackSize ( plasmaCrawlNURL . STACK_TYPE_CORE ) + " , " + noticeURL . stackSize ( plasmaCrawlNURL . STACK_TYPE_REMOTE ) + " ]: urlEntry=null " ) ;
2007-03-16 14:25:56 +01:00
return true ; // superfluous request; true correct in this context because the urlEntry shall not be tracked any more
2005-04-07 21:19:42 +02:00
}
2005-11-11 00:48:20 +01:00
2005-04-07 21:19:42 +02:00
// check url
if ( urlEntry . url ( ) = = null ) {
2005-08-30 23:10:39 +02:00
log . logFine ( " ERROR: plasmaSwitchboard.processRemoteCrawlTrigger - url is null. name= " + urlEntry . name ( ) ) ;
2007-03-16 14:25:56 +01:00
return true ; // same case as above: no more consideration
2005-04-07 21:19:42 +02:00
}
2007-03-16 14:25:56 +01:00
// are we qualified for a remote crawl?
if ( ( yacyCore . seedDB . mySeed = = null ) | | ( yacyCore . seedDB . mySeed . isJunior ( ) ) ) {
log . logFine ( " plasmaSwitchboard.processRemoteCrawlTrigger: no permission " ) ;
return false ; // no, we must crawl this page ourselves
}
2005-06-17 03:26:51 +02:00
2007-03-16 14:25:56 +01:00
// check if peer for remote crawl is available
2007-04-30 02:39:53 +02:00
yacySeed remoteSeed = ( ( this . isPublicRobinson ( ) ) & & ( getConfig ( " cluster.mode " , " " ) . equals ( " publiccluster " ) ) ) ?
2007-09-05 11:01:35 +02:00
yacyCore . dhtAgent . getPublicClusterCrawlSeed ( urlEntry . url ( ) . hash ( ) , this . clusterhashes ) :
yacyCore . dhtAgent . getGlobalCrawlSeed ( urlEntry . url ( ) . hash ( ) ) ;
2005-04-07 21:19:42 +02:00
if ( remoteSeed = = null ) {
2005-08-30 23:10:39 +02:00
log . logFine ( " plasmaSwitchboard.processRemoteCrawlTrigger: no remote crawl seed available " ) ;
2005-11-11 00:48:20 +01:00
return false ;
2005-04-07 21:19:42 +02:00
}
2005-06-17 03:26:51 +02:00
// do the request
2007-03-16 14:25:56 +01:00
HashMap page = null ;
2005-12-11 01:25:02 +01:00
try {
2007-03-16 14:25:56 +01:00
page = yacyClient.crawlOrder(remoteSeed, urlEntry.url(), getURL(urlEntry.referrerhash()), 6000);
} catch (IOException e1) {
log.logSevere(STR_REMOTECRAWLTRIGGER + remoteSeed.getName() + " FAILED. URL CANNOT BE RETRIEVED from referrer hash: " + urlEntry.referrerhash(), e1);
return false;
}
2005-12-11 01:25:02 +01:00
2007-03-16 14:25:56 +01:00
// check if we made contact with the peer and whether it responded
if ((page == null) || (page.get("delay") == null)) {
log.logInfo("CRAWL: REMOTE CRAWL TO PEER " + remoteSeed.getName() + " FAILED. CAUSE: unknown (URL=" + urlEntry.url().toString() + "). Removed peer.");
yacyCore.peerActions.peerDeparture(remoteSeed);
return false; // no response from the peer, we will crawl this ourselves
}
2007-04-30 02:39:53 +02:00
String response = (String) page.get("response");
2007-03-16 14:25:56 +01:00
log . logFine ( " plasmaSwitchboard.processRemoteCrawlTrigger: remoteSeed= "
+ remoteSeed . getName ( ) + " , url= " + urlEntry . url ( ) . toString ( )
+ " , response= " + page . toString ( ) ) ; // DEBUG
// we received an answer and we are told to wait a specific time until we shall ask again for another crawl
int newdelay = Integer . parseInt ( ( String ) page . get ( " delay " ) ) ;
yacyCore . dhtAgent . setCrawlDelay ( remoteSeed . hash , newdelay ) ;
if ( response . equals ( " stacked " ) ) {
// success, the remote peer accepted the crawl
log . logInfo ( STR_REMOTECRAWLTRIGGER + remoteSeed . getName ( )
+ " PLACED URL= " + urlEntry . url ( ) . toString ( )
+ " ; NEW DELAY= " + newdelay ) ;
// track this remote crawl
this . delegatedURL . newEntry ( urlEntry , remoteSeed . hash , new Date ( ) , 0 , response ) . store ( ) ;
return true ;
}
// check other cases: the remote peer may respond that it already knows that url
if ( response . equals ( " double " ) ) {
// in case the peer answers double, it transmits the complete lurl data
String lurl = ( String ) page . get ( " lurl " ) ;
if ( ( lurl ! = null ) & & ( lurl . length ( ) ! = 0 ) ) {
String propStr = crypt . simpleDecode ( lurl , ( String ) page . get ( " key " ) ) ;
indexURLEntry entry = wordIndex . loadedURL . newEntry ( propStr ) ;
2005-12-11 01:25:02 +01:00
try {
2007-03-16 14:25:56 +01:00
wordIndex.loadedURL.store(entry);
wordIndex.loadedURL.stack(entry, yacyCore.seedDB.mySeed.hash, remoteSeed.hash, 1); // *** superfluous/duplicate?
// noticeURL.remove(entry.hash());
} catch (IOException e) {
// storing the transmitted lurl entry failed; log it and continue, the url is still treated as loaded
log.logSevere("processRemoteCrawlTrigger: storing the transmitted lurl entry failed", e);
2005-04-07 21:19:42 +02:00
}
2007-03-16 14:25:56 +01:00
log.logInfo(STR_REMOTECRAWLTRIGGER + remoteSeed.getName()
+ " SUPERFLUOUS. CAUSE: " + page.get("reason")
+ " (URL=" + urlEntry.url().toString()
+ "). URL IS CONSIDERED AS 'LOADED!'");
return true;
} else {
log.logInfo(STR_REMOTECRAWLTRIGGER + remoteSeed.getName()
+ " REJECTED. CAUSE: bad lurl response / " + page.get("reason") + " (URL="
+ urlEntry.url().toString() + ")");
remoteSeed.setFlagAcceptRemoteCrawl(false);
yacyCore.seedDB.update(remoteSeed.hash, remoteSeed);
return false;
}
2005-04-07 21:19:42 +02:00
}
2007-03-16 14:25:56 +01:00
log.logInfo(STR_REMOTECRAWLTRIGGER + remoteSeed.getName()
+ " DENIED. RESPONSE=" + response + ", CAUSE="
+ page.get("reason") + ", URL=" + urlEntry.url().toString());
remoteSeed.setFlagAcceptRemoteCrawl(false);
yacyCore.seedDB.update(remoteSeed.hash, remoteSeed);
return false;
2005-04-07 21:19:42 +02:00
}
2005-04-15 16:18:14 +02:00
2005-04-07 21:19:42 +02:00
private static SimpleDateFormat DateFormatter = new SimpleDateFormat("EEE, dd MMM yyyy");
public static String dateString(Date date) {
2005-11-11 00:48:20 +01:00
if (date == null) return ""; else return DateFormatter.format(date);
2005-04-07 21:19:42 +02:00
}
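// Caution: SimpleDateFormat is not thread-safe, so concurrent callers of
// dateString() could corrupt each other's output. A minimal synchronized variant
// (illustrative sketch only; dateStringSafe is not referenced elsewhere) would be:
private static synchronized String dateStringSafe(Date date) {
return (date == null) ? "" : DateFormatter.format(date);
}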
2005-04-15 16:18:14 +02:00
2005-04-07 21:19:42 +02:00
public serverObjects action(String actionName, serverObjects actionInput) {
2006-01-20 16:14:21 +01:00
// perform an action. (not used)
2005-11-11 00:48:20 +01:00
return null;
2005-04-07 21:19:42 +02:00
}
2005-11-11 00:48:20 +01:00
2005-04-07 21:19:42 +02:00
public String toString() {
2005-11-11 00:48:20 +01:00
// it is possible to use this method in the cgi pages;
// actually it is used there for testing purposes
return "PROPS: " + super.toString() + "; QUEUE: " + sbQueue.toString();
2005-04-07 21:19:42 +02:00
}
// methods for index deletion
2007-09-05 11:01:35 +02:00
public int removeAllUrlReferences(yacyURL url, boolean fetchOnline) {
return removeAllUrlReferences(url.hash(), fetchOnline);
2005-04-07 21:19:42 +02:00
}
2005-12-15 11:31:00 +01:00
public int removeAllUrlReferences(String urlhash, boolean fetchOnline) {
2005-04-07 21:19:42 +02:00
// find all the words in a specific resource and remove the url reference from every word index
// finally, delete the url entry
// determine the url string
2006-12-05 03:47:51 +01:00
indexURLEntry entry = wordIndex.loadedURL.load(urlhash, null);
2006-09-07 20:24:39 +02:00
if (entry == null) return 0;
2006-11-08 17:17:47 +01:00
indexURLEntry.Components comp = entry.comp();
2006-10-19 00:25:07 +02:00
if (comp.url() == null) return 0;
2006-09-20 14:25:07 +02:00
2006-10-03 13:05:48 +02:00
InputStream resourceContent = null;
2006-09-20 14:25:07 +02:00
try {
2006-10-03 13:05:48 +02:00
// get the resource content
2007-08-15 13:36:59 +02:00
Object[] resource = plasmaSnippetCache.getResource(comp.url(), fetchOnline, 10000, true);
2006-10-03 13:05:48 +02:00
resourceContent = (InputStream) resource[0];
Long resourceContentLength = (Long) resource[1];
// parse the resource
2007-08-15 13:36:59 +02:00
plasmaParserDocument document = plasmaSnippetCache.parseDocument(comp.url(), resourceContentLength.longValue(), resourceContent);
2006-10-03 13:05:48 +02:00
2006-12-06 14:13:55 +01:00
// get the word set
Set words = null;
2006-11-28 16:00:15 +01:00
try {
2006-12-19 04:10:46 +01:00
words = new plasmaCondenser(document, true, true).words().keySet();
2006-11-28 16:00:15 +01:00
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
2006-10-03 13:05:48 +02:00
2006-09-20 14:25:07 +02:00
// delete all word references
2006-11-28 16:00:15 +01:00
int count = 0;
2007-03-13 23:18:36 +01:00
if (words != null) count = wordIndex.removeWordReferences(words, urlhash);
2006-10-03 13:05:48 +02:00
2006-09-20 14:25:07 +02:00
// finally delete the url entry itself
2006-12-05 03:47:51 +01:00
wordIndex.loadedURL.remove(urlhash);
2006-09-20 14:25:07 +02:00
return count;
} catch (ParserException e) {
return 0;
2006-10-03 13:05:48 +02:00
} finally {
if (resourceContent != null) try { resourceContent.close(); } catch (Exception e) { /* ignore this */ }
2006-09-20 14:25:07 +02:00
}
2005-04-07 21:19:42 +02:00
}
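// Usage sketch for the two removal methods above (illustrative; the receiver
// name sb is an assumption): remove a document and all of its word references,
// re-fetching the resource online if it is no longer cached locally:
//   int removed = sb.removeAllUrlReferences(url, true);
//   sb.getLog().logInfo("removed " + removed + " word references for " + url);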
2006-01-22 01:07:00 +01:00
2005-04-07 21:19:42 +02:00
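// adminAuthenticated computes an access level that verifyAuthentication below
// maps to a boolean: 0 = wrong password, 1 = no password given, 2 = no password
// stored, 3 = soft-authenticated (localhost), 4 = hard-authenticated (userDB);
// the intermediate levels come from the fallback httpd.staticAdminAuthenticated call.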
public int adminAuthenticated(httpHeader header) {
2006-06-16 10:04:02 +02:00
2007-02-05 20:46:50 +01:00
String adminAccountBase64MD5 = getConfig(httpd.ADMIN_ACCOUNT_B64MD5, "");
2005-05-17 10:25:04 +02:00
String authorization = ((String) header.get(httpHeader.AUTHORIZATION, "xxxxxx")).trim().substring(6);
2006-07-20 13:30:10 +02:00
// security check against too long authorization strings
if (authorization.length() > 256) return 0;
// authorization by encoded password, only for localhost access
if ((((String) header.get("CLIENTIP", "")).equals("localhost")) && (adminAccountBase64MD5.equals(authorization))) return 3; // soft-authenticated for localhost
// authorization by hit in userDB
2006-06-16 10:04:02 +02:00
if (userDB.hasAdminRight((String) header.get(httpHeader.AUTHORIZATION, "xxxxxx"), ((String) header.get("CLIENTIP", "")), header.getHeaderCookies())) return 4; // return, because 4 = max
2006-07-20 13:30:10 +02:00
// authorization with admin keyword in configuration
2007-02-05 20:46:50 +01:00
return httpd.staticAdminAuthenticated(authorization, this);
2005-04-07 21:19:42 +02:00
}
2005-04-24 23:24:53 +02:00
2005-12-07 01:36:05 +01:00
public boolean verifyAuthentication(httpHeader header, boolean strict) {
// handle access rights
switch (adminAuthenticated(header)) {
case 0: // wrong password given
try { Thread.sleep(3000); } catch (InterruptedException e) { } // prevent brute-force
return false;
case 1: // no password given
return false;
case 2: // no password stored
return !strict;
case 3: // soft-authenticated for localhost only
return true;
case 4: // hard-authenticated, all ok
return true;
}
return false;
}
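// Typical call site (illustrative sketch): a servlet guards an admin page with
//   if (!switchboard.verifyAuthentication(header, true)) { /* request login */ }
// where strict = true also rejects installations without a stored password (case 2).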
2007-03-12 10:06:57 +01:00
public void setPerformance(int wantedPPM) {
// we consider 3 cases here
// wantedPPM <= 10: low performance
// 10 < wantedPPM < 1000: custom performance
// 1000 <= wantedPPM: maximum performance
if (wantedPPM <= 10) wantedPPM = 10;
if (wantedPPM >= 1000) wantedPPM = 1000;
2007-03-12 17:24:28 +01:00
int newBusySleep = 60000 / wantedPPM; // for wantedPPM = 10: 6000 ms; for wantedPPM = 1000: 60 ms
serverThread thread;
thread = getThread(INDEX_DIST);
2007-04-26 16:28:57 +02:00
if (thread != null) {
setConfig(INDEX_DIST_BUSYSLEEP, thread.setBusySleep(Math.max(2000, newBusySleep * 2)));
thread.setIdleSleep(30000);
}
2007-03-12 17:24:28 +01:00
thread = getThread(CRAWLJOB_LOCAL_CRAWL);
2007-04-26 16:28:57 +02:00
if (thread != null) {
setConfig(CRAWLJOB_LOCAL_CRAWL_BUSYSLEEP, thread.setBusySleep(newBusySleep));
thread.setIdleSleep(1000);
}
2007-03-12 17:24:28 +01:00
thread = getThread(CRAWLJOB_GLOBAL_CRAWL_TRIGGER);
2007-04-26 16:28:57 +02:00
if (thread != null) {
setConfig(CRAWLJOB_GLOBAL_CRAWL_TRIGGER_BUSYSLEEP, thread.setBusySleep(Math.max(1000, newBusySleep * 3)));
thread.setIdleSleep(10000);
}
/*
thread = getThread(CRAWLJOB_REMOTE_TRIGGERED_CRAWL);
if (thread != null) {
setConfig(CRAWLJOB_REMOTE_TRIGGERED_CRAWL_BUSYSLEEP, thread.setBusySleep(newBusySleep * 10));
thread.setIdleSleep(10000);
}
*/
2007-03-12 17:24:28 +01:00
thread = getThread(PROXY_CACHE_ENQUEUE);
2007-04-26 16:28:57 +02:00
if (thread != null) {
setConfig(PROXY_CACHE_ENQUEUE_BUSYSLEEP, thread.setBusySleep(0));
thread.setIdleSleep(1000);
}
2007-03-12 10:06:57 +01:00
2007-03-12 17:24:28 +01:00
thread = getThread(INDEXER);
2007-04-26 16:28:57 +02:00
if (thread != null) {
setConfig(INDEXER_BUSYSLEEP, thread.setBusySleep(newBusySleep / 2));
thread.setIdleSleep(1000);
}
2007-03-12 10:06:57 +01:00
2007-03-12 17:24:28 +01:00
thread = getThread(CRAWLSTACK);
2007-04-26 16:28:57 +02:00
if (thread != null) {
setConfig(CRAWLSTACK_BUSYSLEEP, thread.setBusySleep(0));
thread.setIdleSleep(5000);
}
2007-03-12 10:06:57 +01:00
}
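// Illustrative helper mirroring the arithmetic above (the name ppmToBusySleep is
// an assumption, not part of the original API): clamps the wanted PPM and converts
// it to the base busy-sleep pause. For example, wantedPPM = 120 gives 500 ms, so
// the local crawl pauses 500 ms, the indexer 250 ms, the global crawl trigger
// max(1000, 1500) = 1500 ms and the index distribution max(2000, 1000) = 2000 ms.
private static int ppmToBusySleep(int wantedPPM) {
if (wantedPPM <= 10) wantedPPM = 10; // lower bound: 10 pages per minute
if (wantedPPM >= 1000) wantedPPM = 1000; // upper bound: 1000 pages per minute
return 60000 / wantedPPM; // e.g. 60000 / 120 = 500 ms between jobs
}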
2007-06-11 00:02:17 +02:00
public static int accessFrequency(HashMap tracker, String host) {
// returns the access frequency in queries per hour for a given host and a specific tracker
long timeInterval = 1000 * 60 * 60;
TreeSet accessSet = (TreeSet) tracker.get(host);
if (accessSet == null) return 0;
return accessSet.tailSet(new Long(System.currentTimeMillis() - timeInterval)).size();
}
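// Minimal usage sketch (illustrative; assumes, as the casts above imply, that a
// tracker maps a host name to a TreeSet of Long access timestamps):
//   HashMap tracker = new HashMap();
//   TreeSet accesses = new TreeSet();
//   accesses.add(new Long(System.currentTimeMillis()));
//   tracker.put("example.net", accesses);
//   accessFrequency(tracker, "example.net"); // -> 1 query within the last hour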
2006-02-19 22:54:46 +01:00
public void startTransferWholeIndex(yacySeed seed, boolean delete) {
if (transferIdxThread == null) {
2006-02-20 00:47:45 +01:00
this.transferIdxThread = new plasmaDHTFlush(this.log, this.wordIndex, seed, delete,
2007-01-30 15:18:35 +01:00
" true " . equalsIgnoreCase ( getConfig ( INDEX_TRANSFER_GZIP_BODY , " false " ) ) ,
( int ) getConfigLong ( INDEX_TRANSFER_TIMEOUT , 60000 ) ) ;
2006-02-19 22:54:46 +01:00
this.transferIdxThread.start();
}
}
public void stopTransferWholeIndex(boolean wait) {
if ((transferIdxThread != null) && (transferIdxThread.isAlive()) && (!transferIdxThread.isFinished())) {
try {
this.transferIdxThread.stopIt(wait);
} catch (InterruptedException e) { }
}
}
public void abortTransferWholeIndex(boolean wait) {
if (transferIdxThread != null) {
if (!transferIdxThread.isFinished())
try {
this.transferIdxThread.stopIt(wait);
} catch (InterruptedException e) { }
transferIdxThread = null;
}
}
2006-02-19 22:54:46 +01:00
2006-02-22 00:08:07 +01:00
public String dhtShallTransfer() {
2006-02-21 00:27:11 +01:00
if (yacyCore.seedDB == null) {
2006-02-22 00:08:07 +01:00
return " no DHT distribution: seedDB == null " ;
2006-02-21 00:27:11 +01:00
}
if (yacyCore.seedDB.mySeed == null) {
2006-02-22 00:08:07 +01:00
return " no DHT distribution: mySeed == null " ;
2006-02-21 00:27:11 +01:00
}
if (yacyCore.seedDB.mySeed.isVirgin()) {
2006-02-22 00:08:07 +01:00
return " no DHT distribution: status is virgin " ;
2006-02-21 00:27:11 +01:00
}
2007-08-27 00:35:26 +02:00
if (yacyCore.seedDB.noDHTActivity()) {
return "no DHT distribution: network too small";
}
2007-01-30 15:18:35 +01:00
if (getConfig(INDEX_DIST_ALLOW, "false").equalsIgnoreCase("false")) {
2006-02-22 00:08:07 +01:00
return " no DHT distribution: not enabled " ;
2006-02-21 00:27:11 +01:00
}
2006-12-05 03:47:51 +01:00
if (wordIndex.loadedURL.size() < 10) {
return "no DHT distribution: loadedURL.size() = " + wordIndex.loadedURL.size();
2006-02-21 00:27:11 +01:00
}
if (wordIndex.size() < 100) {
2006-02-22 00:08:07 +01:00
return " no DHT distribution: not enough words - wordIndex.size() = " + wordIndex . size ( ) ;
2006-02-21 00:27:11 +01:00
}
2007-08-03 14:21:46 +02:00
if ((getConfig(INDEX_DIST_ALLOW_WHILE_CRAWLING, "false").equalsIgnoreCase("false")) && (noticeURL.notEmpty())) {
2007-08-03 13:44:58 +02:00
return " no DHT distribution: crawl in progress: noticeURL.stackSize() = " + noticeURL . size ( ) + " , sbQueue.size() = " + sbQueue . size ( ) ;
2006-02-22 00:08:07 +01:00
}
2007-06-28 17:25:33 +02:00
if ((getConfig(INDEX_DIST_ALLOW_WHILE_INDEXING, "false").equalsIgnoreCase("false")) && (sbQueue.size() > 1)) {
2007-08-03 13:44:58 +02:00
return " no DHT distribution: indexing in progress: noticeURL.stackSize() = " + noticeURL . size ( ) + " , sbQueue.size() = " + sbQueue . size ( ) ;
2007-06-28 17:25:33 +02:00
}
2006-02-22 00:08:07 +01:00
return null;
}
public boolean dhtTransferJob() {
String rejectReason = dhtShallTransfer();
if (rejectReason != null) {
log.logFine(rejectReason);
2006-02-21 00:27:11 +01:00
return false;
}
2006-02-21 00:57:50 +01:00
if (this.dhtTransferChunk == null) {
log.logFine("no DHT distribution: no transfer chunk defined");
return false;
}
if (this.dhtTransferChunk.getStatus() != plasmaDHTChunk.chunkStatus_FILLED) {
2006-02-22 00:08:07 +01:00
log . logFine ( " no DHT distribution: index distribution is in progress, status= " + this . dhtTransferChunk . getStatus ( ) ) ;
2006-02-21 00:57:50 +01:00
return false;
}
2006-02-21 00:27:11 +01:00
// do the transfer
2007-07-24 02:46:17 +02:00
int peerCount = Math.max(1, (yacyCore.seedDB.mySeed.isJunior()) ?
(int) getConfigLong("network.unit.dhtredundancy.junior", 1) :
(int) getConfigLong("network.unit.dhtredundancy.senior", 1)); // set redundancy factor
2006-02-21 00:27:11 +01:00
long starttime = System.currentTimeMillis();
2006-02-21 00:57:50 +01:00
2006-02-22 00:08:07 +01:00
boolean ok = dhtTransferProcess(dhtTransferChunk, peerCount);
2006-02-21 00:27:11 +01:00
2006-02-22 00:08:07 +01:00
if (ok) {
dhtTransferChunk.setStatus(plasmaDHTChunk.chunkStatus_COMPLETE);
log.logFine("DHT distribution: transfer COMPLETE");
// adapt transfer count
if ((System.currentTimeMillis() - starttime) > (10000 * peerCount)) {
dhtTransferIndexCount--;
} else {
2006-05-07 19:44:33 +02:00
if (dhtTransferChunk.indexCount() >= dhtTransferIndexCount) dhtTransferIndexCount++;
2006-02-22 00:08:07 +01:00
}
2007-01-30 15:18:35 +01:00
int minChunkSize = (int) getConfigLong(INDEX_DIST_CHUNK_SIZE_MIN, 30);
int maxChunkSize = (int) getConfigLong(INDEX_DIST_CHUNK_SIZE_MAX, 3000);
2006-05-09 13:43:10 +02:00
if (dhtTransferIndexCount < minChunkSize) dhtTransferIndexCount = minChunkSize;
if (dhtTransferIndexCount > maxChunkSize) dhtTransferIndexCount = maxChunkSize;
2006-02-22 00:08:07 +01:00
// show success
return true;
} else {
2006-11-02 22:32:59 +01:00
dhtTransferChunk.incTransferFailedCounter();
2007-01-30 15:18:35 +01:00
int maxChunkFails = (int) getConfigLong(INDEX_DIST_CHUNK_FAILS_MAX, 1);
2006-11-02 22:32:59 +01:00
if (dhtTransferChunk.getTransferFailedCounter() >= maxChunkFails) {
2007-02-26 15:36:01 +01:00
//System.out.println("DEBUG: " + dhtTransferChunk.getTransferFailedCounter() + " of " + maxChunkFails + " sendings failed for this chunk, aborting!");
2006-11-02 22:32:59 +01:00
dhtTransferChunk.setStatus(plasmaDHTChunk.chunkStatus_FAILED);
log.logFine("DHT distribution: transfer FAILED");
}
else {
2007-02-26 15:36:01 +01:00
//System.out.println("DEBUG: " + dhtTransferChunk.getTransferFailedCounter() + " of " + maxChunkFails + " sendings failed for this chunk, retrying!");
2006-11-02 22:32:59 +01:00
log . logFine ( " DHT distribution: transfer FAILED, sending this chunk again " ) ;
}
2006-02-21 00:27:11 +01:00
return false;
}
}
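// Illustrative trace of the adaptive chunk sizing above (values assumed): with
// peerCount = 3 a chunk must be delivered within 10000 * 3 = 30 s; a faster
// transfer lets dhtTransferIndexCount grow by one (if the chunk was full-sized),
// a slower one shrinks it by one, and the result is clamped to the configured
// [INDEX_DIST_CHUNK_SIZE_MIN, INDEX_DIST_CHUNK_SIZE_MAX] range (defaults 30..3000).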
2006-02-22 00:08:07 +01:00
public boolean dhtTransferProcess(plasmaDHTChunk dhtChunk, int peerCount) {
2006-02-21 00:27:11 +01:00
if ((yacyCore.seedDB == null) || (yacyCore.seedDB.sizeConnected() == 0)) return false;
2006-05-13 17:28:57 +02:00
try {
// find a list of DHT-peers
2007-05-14 14:52:39 +02:00
double maxDist = 0.2;
ArrayList seeds = yacyCore.dhtAgent.getDHTTargets(log, peerCount, Math.min(8, (int) (yacyCore.seedDB.sizeConnected() * maxDist)), dhtChunk.firstContainer().getWordHash(), dhtChunk.lastContainer().getWordHash(), maxDist);
2006-05-13 17:28:57 +02:00
if (seeds.size() < peerCount) {
2007-02-26 15:36:01 +01:00
log.logWarning("did not find enough (" + seeds.size() + ") peers for distribution of dhtchunk [" + dhtChunk.firstContainer().getWordHash() + " .. " + dhtChunk.lastContainer().getWordHash() + "]");
2006-05-13 17:28:57 +02:00
return false;
}
2006-02-21 00:27:11 +01:00
2006-05-13 17:28:57 +02:00
// send away the indexes to all these peers
int hc1 = 0;
2006-02-21 00:27:11 +01:00
2006-05-13 17:28:57 +02:00
// getting distribution configuration values
2007-01-30 15:18:35 +01:00
boolean gzipBody = getConfig(INDEX_DIST_GZIP_BODY, "false").equalsIgnoreCase("true");
int timeout = (int) getConfigLong(INDEX_DIST_TIMEOUT, 60000);
2006-05-13 17:28:57 +02:00
int retries = 0;
// starting up multiple DHT transfer threads
Iterator seedIter = seeds.iterator();
ArrayList transfer = new ArrayList(peerCount);
2006-05-15 12:03:24 +02:00
while (hc1 < peerCount && (transfer.size() > 0 || seedIter.hasNext())) {
2006-09-03 16:59:00 +02:00
2006-05-13 17:28:57 +02:00
// starting up some transfer threads
int transferThreadCount = transfer.size();
for (int i = 0; i < peerCount - hc1 - transferThreadCount; i++) {
2006-09-03 16:59:00 +02:00
// check for interruption
checkInterruption();
2006-05-13 17:28:57 +02:00
if (seedIter.hasNext()) {
plasmaDHTTransfer t = new plasmaDHTTransfer(log, (yacySeed) seedIter.next(), dhtChunk, gzipBody, timeout, retries);
t.start();
transfer.add(t);
} else {
break;
}
}
// waiting for the transfer threads to finish
Iterator transferIter = transfer.iterator();
while (transferIter.hasNext()) {
2006-09-03 16:59:00 +02:00
// check for interruption
checkInterruption();
2006-05-13 17:28:57 +02:00
plasmaDHTTransfer t = (plasmaDHTTransfer) transferIter.next();
if (!t.isAlive()) {
// remove finished thread from the list
transferIter.remove();
// count successful transfers
if (t.getStatus() == plasmaDHTChunk.chunkStatus_COMPLETE) {
this.log.logInfo("DHT distribution: transfer to peer " + t.getSeed().getName() + " finished.");
hc1++;
}
}
}
if (hc1 < peerCount) Thread.sleep(100);
2006-02-21 00:27:11 +01:00
}
2006-03-15 12:27:43 +01:00
2006-02-21 00:27:11 +01:00
2006-05-13 17:28:57 +02:00
// clean up and finish with deletion of indexes
if ( hc1 > = peerCount ) {
// success
return true ;
}
this . log . logSevere ( " Index distribution failed. Too few peers ( " + hc1 + " ) received the index, not deleted locally. " ) ;
return false ;
} catch ( InterruptedException e ) {
return false ;
2006-02-21 00:27:11 +01:00
}
}
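// The loop above is a small hand-rolled thread pool: it keeps at most peerCount
// plasmaDHTTransfer threads alive, reaps finished ones, counts successful
// deliveries in hc1 and refills from the seed list until either enough peers
// confirmed the chunk or no candidate seeds remain.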
2006-03-15 12:27:43 +01:00
2006-08-07 17:11:14 +02:00
private void addURLtoErrorDB(
2007-09-05 11:01:35 +02:00
yacyURL url,
2006-08-07 17:11:14 +02:00
String referrerHash,
String initiator,
String name,
String failreason,
2006-11-23 03:16:30 +01:00
kelondroBitfield flags
2006-08-07 17:11:14 +02:00
) {
// create a new errorURL DB entry
2007-03-16 14:25:56 +01:00
plasmaCrawlEntry bentry = new plasmaCrawlEntry(
initiator,
url,
referrerHash,
(name == null) ? "" : name,
new Date(),
null,
0,
0,
0);
plasmaCrawlZURL.Entry ee = this.errorURL.newEntry(
bentry, initiator, new Date(),
0, failreason);
2006-08-07 17:11:14 +02:00
// store the entry
ee.store();
// push it onto the stack
2006-12-05 03:47:51 +01:00
this.errorURL.stackPushEntry(ee);
2007-03-16 14:25:56 +01:00
}
2006-08-07 17:11:14 +02:00
2006-09-03 16:59:00 +02:00
public void checkInterruption() throws InterruptedException {
Thread curThread = Thread.currentThread();
if ((curThread instanceof serverThread) && ((serverThread) curThread).shutdownInProgress()) throw new InterruptedException("Shutdown in progress ...");
else if (this.terminate || curThread.isInterrupted()) throw new InterruptedException("Shutdown in progress ...");
}
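// Long-running loops call checkInterruption() periodically (see dhtTransferProcess
// above) so that a shutdown request can abort them cooperatively instead of
// blocking the shutdown until the current work unit completes.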
2006-05-22 10:02:35 +02:00
public void terminate(long delay) {
if (delay <= 0) throw new IllegalArgumentException("The shutdown delay must be greater than 0.");
(new delayedShutdown(this, delay)).start();
}
2005-04-24 23:24:53 +02:00
public void terminate() {
this.terminate = true;
this.shutdownSync.V();
}
public boolean isTerminated() {
return this.terminate;
}
public boolean waitForShutdown() throws InterruptedException {
this.shutdownSync.P();
return this.terminate;
}
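// Shutdown handshake: terminate() raises the terminate flag and V()s the
// shutdownSync semaphore; the main thread blocks in waitForShutdown() on P()
// until that signal arrives and then returns the flag, so a false return value
// means the semaphore was released without an actual termination request.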
2005-04-07 21:19:42 +02:00
}
2006-05-22 10:02:35 +02:00
class delayedShutdown extends Thread {
private plasmaSwitchboard sb;
private long delay;
public delayedShutdown(plasmaSwitchboard sb, long delay) {
this.sb = sb;
this.delay = delay;
}
public void run() {
try {
Thread.sleep(delay);
} catch (InterruptedException e) {
2007-03-09 09:48:47 +01:00
sb.getLog().logInfo("interrupted delayed shutdown");
2006-05-22 10:02:35 +02:00
}
this.sb.terminate();
}
}