// plasmaSwitchboard.java
// (C) 2004-2007 by Michael Peter Christen; mc@yacy.net, Frankfurt a. M., Germany
// first published 2004 on http://yacy.net
//
// This is a part of YaCy, a peer-to-peer based web search engine
//
// $LastChangedDate$
// $LastChangedRevision$
// $LastChangedBy$
//
// LICENSE
//
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
/*
  This class holds the run-time environment of the plasma
  Search Engine. Its data forms a blackboard which can be used
  to organize running jobs around the indexing algorithm.
  The blackboard consists of the following entities:
  - storage: one plasmaStore object with the url-based database
  - configuration: initialized by properties once, then by external functions
  - job queues: for parsing, condensing, indexing
  - black/blue/whitelists: controls input and output to the index

  This class is also the core of the http crawling.
  There are some items that need to be respected when crawling the web:
  1) respect robots.txt
  2) do not access one domain too frequently, wait between accesses
  3) remember crawled URLs and do not access them again too early
  4) prioritization of specific links should be possible (hot-lists)
  5) attributes for crawling (depth, filters, hot/black-lists, priority)
  6) different crawling jobs with different attributes ('Orders') simultaneously

  We implement some specific tasks and use different databases to achieve these goals:
  - a database 'crawlerDisallow.db' contains all URLs that shall not be crawled
  - a database 'crawlerDomain.db' holds all domains and access times, where we loaded the disallow tables;
    this table contains the following entities:
    <flag: robots exist/not exist, last access of robots.txt, last access of domain (for access scheduling)>
  - four databases for scheduled access: crawlerScheduledHotText.db, crawlerScheduledColdText.db,
    crawlerScheduledHotMedia.db and crawlerScheduledColdMedia.db
  - two stacks for new URLs: newText.stack and newMedia.stack
  - two databases for URL double-check: knownText.db and knownMedia.db
  - one database with crawling orders: crawlerOrders.db

  The information flow of a single URL that is crawled is as follows:
  - a html file is loaded from a specific URL within the module httpdProxyServlet as
    a process of the proxy.
  - the file is passed to httpdProxyCache. Here its processing is delayed until the proxy is idle.
  - The cache entry is passed on to the plasmaSwitchboard. There the URL is stored into plasmaLURL where
    the URL is stored under a specific hash. The URLs from the content are stripped off, stored in plasmaLURL
    with a 'wrong' date (the dates of the URLs are not known at this time, only after fetching) and stacked with
    plasmaCrawlerTextStack. The content is read and split into rated words in plasmaCondenser.
    The split words are then integrated into the index with plasmaSearch.
  - In plasmaSearch the words are indexed by reversing the relation between URL and words: one URL points
    to many words, the words within the document at the URL. After reversing, one word points
    to many URLs, all the URLs where the word occurs. One single word->URL-hash relation is stored in
    plasmaIndexEntry. A set of plasmaIndexEntries is a reverse word index.
    This reverse word index is stored temporarily in plasmaIndexCache.
  - In plasmaIndexCache the single plasmaIndexEntries are collected and stored into a plasmaIndex entry.
    These plasmaIndex objects are the true reverse word indexes.
  - In plasmaIndex the plasmaIndexEntry objects are stored in a kelondroTree; an indexed file in the file system.

  The information flow of a search request is as follows:
  - in httpdFileServlet the user enters a search query, which is passed to plasmaSwitchboard
  - in plasmaSwitchboard, the query is passed to plasmaSearch.
  - in plasmaSearch, the plasmaSearch.result object is generated by simultaneous enumeration of
    URL hashes in the reverse word indexes plasmaIndex
  - (future: the plasmaSearch.result object is used to identify more key words for a new search)
*/
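The reversal step described above (one URL pointing to many words becomes, after inversion, one word pointing to many URL hashes) can be sketched as a plain map inversion. This is a minimal illustration under assumed names: `ReverseIndexSketch` and `invert` are hypothetical, not the actual plasmaIndexEntry/plasmaIndexCache API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the forward-to-reverse index inversion;
// not the real plasmaIndexEntry / plasmaIndexCache classes.
public class ReverseIndexSketch {

    // Invert a forward map (url-hash -> words in that document) into a
    // reverse word index (word -> url-hashes where the word occurs).
    public static Map<String, Set<String>> invert(final Map<String, Set<String>> urlToWords) {
        final Map<String, Set<String>> wordToUrls = new HashMap<String, Set<String>>();
        for (final Map.Entry<String, Set<String>> doc : urlToWords.entrySet()) {
            for (final String word : doc.getValue()) {
                // each single word -> url-hash pair corresponds to one index entry
                Set<String> urls = wordToUrls.get(word);
                if (urls == null) {
                    urls = new HashSet<String>();
                    wordToUrls.put(word, urls);
                }
                urls.add(doc.getKey());
            }
        }
        return wordToUrls;
    }
}
```

In the flow described above this inversion happens incrementally, document by document, rather than over a whole forward map at once.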
package de.anomic.plasma;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.lang.reflect.Constructor;
import java.net.MalformedURLException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Iterator;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.Timer;
import java.util.TimerTask;
import java.util.TreeMap;
import java.util.TreeSet;

import de.anomic.crawler.CrawlEntry;
import de.anomic.crawler.CrawlProfile;
import de.anomic.crawler.CrawlQueues;
import de.anomic.crawler.CrawlStacker;
import de.anomic.crawler.ErrorURL;
import de.anomic.crawler.HTTPLoader;
import de.anomic.crawler.ImporterManager;
import de.anomic.crawler.IndexingStack;
import de.anomic.crawler.NoticedURL;
import de.anomic.crawler.ResourceObserver;
import de.anomic.crawler.ResultImages;
import de.anomic.crawler.ResultURLs;
import de.anomic.crawler.RobotsTxt;
import de.anomic.crawler.ZURL;
import de.anomic.crawler.CrawlProfile.entry;
import de.anomic.data.URLLicense;
import de.anomic.data.blogBoard;
import de.anomic.data.blogBoardComments;
import de.anomic.data.bookmarksDB;
import de.anomic.data.listManager;
import de.anomic.data.messageBoard;
import de.anomic.data.userDB;
import de.anomic.data.wikiBoard;
import de.anomic.data.wiki.wikiParser;
import de.anomic.http.HttpClient;
import de.anomic.http.JakartaCommonsHttpClient;
import de.anomic.http.httpRemoteProxyConfig;
import de.anomic.http.httpRequestHeader;
import de.anomic.http.httpResponseHeader;
import de.anomic.http.httpd;
import de.anomic.http.httpdRobotsTxtConfig;
import de.anomic.index.indexDocumentMetadata;
import de.anomic.index.indexReferenceBlacklist;
import de.anomic.index.indexURLReference;
import de.anomic.kelondro.kelondroCache;
import de.anomic.kelondro.kelondroCachedRecords;
import de.anomic.kelondro.kelondroMSetTools;
import de.anomic.kelondro.kelondroNaturalOrder;
import de.anomic.plasma.parser.ParserException;
import de.anomic.server.serverAbstractSwitch;
import de.anomic.server.serverBusyThread;
import de.anomic.server.serverCodings;
import de.anomic.server.serverCore;
import de.anomic.server.serverDate;
import de.anomic.server.serverDomains;
import de.anomic.server.serverFileUtils;
import de.anomic.server.serverInstantBusyThread;
import de.anomic.server.serverMemory;
import de.anomic.server.serverObjects;
import de.anomic.server.serverProcessor;
import de.anomic.server.serverProcessorJob;
import de.anomic.server.serverProfiling;
import de.anomic.server.serverSemaphore;
import de.anomic.server.serverSwitch;
import de.anomic.server.serverSystem;
import de.anomic.server.serverThread;
import de.anomic.server.logging.serverLog;
import de.anomic.tools.crypt;
import de.anomic.tools.nxTools;
import de.anomic.yacy.yacyClient;
import de.anomic.yacy.yacyCore;
import de.anomic.yacy.yacyNewsPool;
import de.anomic.yacy.yacyNewsRecord;
import de.anomic.yacy.yacySeed;
import de.anomic.yacy.yacyTray;
import de.anomic.yacy.yacyURL;
import de.anomic.yacy.yacyVersion;
public final class plasmaSwitchboard extends serverAbstractSwitch<IndexingStack.QueueEntry> implements serverSwitch<IndexingStack.QueueEntry> {

    // load slots
    public static int xstackCrawlSlots = 2000;

    private int dhtTransferIndexCount = 100;
    public static long lastPPMUpdate = System.currentTimeMillis() - 30000;

    // coloured list management
    public static TreeSet<String> badwords = null;
    public static TreeSet<String> blueList = null;
    public static TreeSet<String> stopwords = null;
    public static indexReferenceBlacklist urlBlacklist = null;

    public static wikiParser wikiParser = null;
    public yacyTray yacytray;

    // storage management
    public File htCachePath;
    public File plasmaPath;
    public File listsPath;
    public File htDocsPath;
    public File rankingPath;
    public File workPath;
    public File releasePath;
    public Map<String, String> rankingPermissions;
    public plasmaWordIndex webIndex;
    public CrawlQueues crawlQueues;
    public ResultURLs crawlResults;
    public CrawlStacker crawlStacker;
    public messageBoard messageDB;
    public wikiBoard wikiDB;
    public blogBoard blogDB;
    public blogBoardComments blogCommentDB;
    public RobotsTxt robots;
    public boolean rankingOn;
    public plasmaRankingDistribution rankingOwnDistribution;
    public plasmaRankingDistribution rankingOtherDistribution;
    public HashMap<String, Object[]> outgoingCookies, incomingCookies;
    public plasmaParser parser;
    public volatile long proxyLastAccess, localSearchLastAccess, remoteSearchLastAccess;
    public yacyCore yc;
    public ResourceObserver observer;
    public userDB userDB;
    public bookmarksDB bookmarksDB;
    public plasmaWebStructure webStructure;
    public ImporterManager dbImportManager;
    public plasmaDHTFlush transferIdxThread = null;
    private plasmaDHTChunk dhtTransferChunk = null;
    public ArrayList<plasmaSearchQuery> localSearches;  // array of search result properties as HashMaps
    public ArrayList<plasmaSearchQuery> remoteSearches; // array of search result properties as HashMaps
    public HashMap<String, TreeSet<Long>> localSearchTracker, remoteSearchTracker; // mappings from requesting host to a TreeSet of Long(access time)
    public long lastseedcheckuptime = -1;
    public long indexedPages = 0;
    public long lastindexedPages = 0;
    public double requestedQueries = 0d;
    public double lastrequestedQueries = 0d;
    public int totalPPM = 0;
    public double totalQPM = 0d;
    public TreeMap<String, String> clusterhashes; // map of peerhash(String)/alternative-local-address as ip:port or only ip (String) or null if address in seed should be used
    public boolean acceptLocalURLs, acceptGlobalURLs;
    public URLLicense licensedURLs;
    public Timer moreMemory;

    public serverProcessor<indexingQueueEntry> indexingDocumentProcessor;
    public serverProcessor<indexingQueueEntry> indexingCondensementProcessor;
    public serverProcessor<indexingQueueEntry> indexingAnalysisProcessor;
    public serverProcessor<indexingQueueEntry> indexingStorageProcessor;

    public httpdRobotsTxtConfig robotstxtConfig = null;

    private final serverSemaphore shutdownSync = new serverSemaphore(0);
    private boolean terminate = false;

    //private Object crawlingPausedSync = new Object();
    //private boolean crawlingIsPaused = false;

    public Hashtable<String, Object[]> crawlJobsStatus = new Hashtable<String, Object[]>();

    private static plasmaSwitchboard sb = null;
    public plasmaSwitchboard(final File rootPath, final String initPath, final String configPath, final boolean applyPro) {
        super(rootPath, initPath, configPath, applyPro);
        serverProfiling.startSystemProfiling();
        sb = this;

        // set loglevel and log
        setLog(new serverLog("PLASMA"));
        if (applyPro) this.log.logInfo("This is the pro-version of YaCy");

        initSystemTray();

        // remote proxy configuration
        httpRemoteProxyConfig.init(this);

        // load the network definition
        overwriteNetworkDefinition(this);

        // load values from configs
        this.plasmaPath = getConfigPath(plasmaSwitchboardConstants.PLASMA_PATH, plasmaSwitchboardConstants.PLASMA_PATH_DEFAULT);
        this.log.logConfig("Plasma DB Path: " + this.plasmaPath.toString());
        final File indexPrimaryPath = getConfigPath(plasmaSwitchboardConstants.INDEX_PRIMARY_PATH, plasmaSwitchboardConstants.INDEX_PATH_DEFAULT);
        this.log.logConfig("Index Primary Path: " + indexPrimaryPath.toString());
        final File indexSecondaryPath = (getConfig(plasmaSwitchboardConstants.INDEX_SECONDARY_PATH, "").length() == 0) ? indexPrimaryPath : new File(getConfig(plasmaSwitchboardConstants.INDEX_SECONDARY_PATH, ""));
        this.log.logConfig("Index Secondary Path: " + indexSecondaryPath.toString());
        this.listsPath = getConfigPath(plasmaSwitchboardConstants.LISTS_PATH, plasmaSwitchboardConstants.LISTS_PATH_DEFAULT);
        this.log.logConfig("Lists Path: " + this.listsPath.toString());
        this.htDocsPath = getConfigPath(plasmaSwitchboardConstants.HTDOCS_PATH, plasmaSwitchboardConstants.HTDOCS_PATH_DEFAULT);
        this.log.logConfig("HTDOCS Path: " + this.htDocsPath.toString());
        this.rankingPath = getConfigPath(plasmaSwitchboardConstants.RANKING_PATH, plasmaSwitchboardConstants.RANKING_PATH_DEFAULT);
        this.log.logConfig("Ranking Path: " + this.rankingPath.toString());
        this.rankingPermissions = new HashMap<String, String>(); // mapping of permission - to filename.
        this.workPath = getConfigPath(plasmaSwitchboardConstants.WORK_PATH, plasmaSwitchboardConstants.WORK_PATH_DEFAULT);
        this.log.logConfig("Work Path: " + this.workPath.toString());

        // set a high maximum cache size to current size; this is adopted later automatically
        final int wordCacheMaxCount = Math.max((int) getConfigLong(plasmaSwitchboardConstants.WORDCACHE_INIT_COUNT, 30000),
                                               (int) getConfigLong(plasmaSwitchboardConstants.WORDCACHE_MAX_COUNT, 20000));
        setConfig(plasmaSwitchboardConstants.WORDCACHE_MAX_COUNT, Integer.toString(wordCacheMaxCount));

        // start indexing management
        log.logConfig("Starting Indexing Management");
        final String networkName = getConfig("network.unit.name", "");
        webIndex = new plasmaWordIndex(networkName, log, indexPrimaryPath, indexSecondaryPath, wordCacheMaxCount);
        crawlResults = new ResultURLs();

        // start yacy core
        log.logConfig("Starting YaCy Protocol Core");
        this.yc = new yacyCore(this);
        serverInstantBusyThread.oneTimeJob(this, "loadSeedLists", yacyCore.log, 0);
        final long startedSeedListAquisition = System.currentTimeMillis();

        // set up local robots.txt
        this.robotstxtConfig = httpdRobotsTxtConfig.init(this);

        // setting timestamp of last proxy access
        this.proxyLastAccess = System.currentTimeMillis() - 10000;
        this.localSearchLastAccess = System.currentTimeMillis() - 10000;
        this.remoteSearchLastAccess = System.currentTimeMillis() - 10000;
        this.webStructure = new plasmaWebStructure(log, rankingPath, "LOCAL/010_cr/", getConfig("CRDist0Path", plasmaRankingDistribution.CR_OWN), new File(plasmaPath, "webStructure.map"));
        // configuring list path
        if (!(listsPath.exists())) listsPath.mkdirs();

        // load coloured lists
        if (blueList == null) {
            // read only once upon first instantiation of this class
            final String f = getConfig(plasmaSwitchboardConstants.LIST_BLUE, plasmaSwitchboardConstants.LIST_BLUE_DEFAULT);
            final File plasmaBlueListFile = new File(f);
            if (f != null) blueList = kelondroMSetTools.loadList(plasmaBlueListFile, kelondroNaturalOrder.naturalComparator); else blueList = new TreeSet<String>();
            this.log.logConfig("loaded blue-list from file " + plasmaBlueListFile.getName() + ", " +
                               blueList.size() + " entries, " +
                               ppRamString(plasmaBlueListFile.length() / 1024));
        }

        // load the black-list / inspired by [AS]
        final File blacklistsPath = getConfigPath(plasmaSwitchboardConstants.LISTS_PATH, plasmaSwitchboardConstants.LISTS_PATH_DEFAULT);
        String blacklistClassName = getConfig(plasmaSwitchboardConstants.BLACKLIST_CLASS, plasmaSwitchboardConstants.BLACKLIST_CLASS_DEFAULT);
        if (blacklistClassName.equals("de.anomic.plasma.urlPattern.defaultURLPattern")) {
            // patch old class location
            blacklistClassName = plasmaSwitchboardConstants.BLACKLIST_CLASS_DEFAULT;
            setConfig(plasmaSwitchboardConstants.BLACKLIST_CLASS, blacklistClassName);
        }

        this.log.logConfig("Starting blacklist engine ...");
        try {
            final Class<?> blacklistClass = Class.forName(blacklistClassName);
            final Constructor<?> blacklistClassConstr = blacklistClass.getConstructor(new Class[] { File.class });
            urlBlacklist = (indexReferenceBlacklist) blacklistClassConstr.newInstance(new Object[] { blacklistsPath });
            this.log.logFine("Used blacklist engine class: " + blacklistClassName);
            this.log.logConfig("Using blacklist engine: " + urlBlacklist.getEngineInfo());
        } catch (final Exception e) {
            this.log.logSevere("Unable to load the blacklist engine", e);
            System.exit(-1);
        } catch (final Error e) {
            this.log.logSevere("Unable to load the blacklist engine", e);
            System.exit(-1);
        }

        this.log.logConfig("Loading blacklist data ...");
        listManager.switchboard = this;
        listManager.listsPath = blacklistsPath;
        listManager.reloadBlacklists();
        // load badwords (to filter the topwords)
        if (badwords == null) {
            final File badwordsFile = new File(rootPath, plasmaSwitchboardConstants.LIST_BADWORDS_DEFAULT);
            badwords = kelondroMSetTools.loadList(badwordsFile, kelondroNaturalOrder.naturalComparator);
            this.log.logConfig("loaded badwords from file " + badwordsFile.getName() +
                               ", " + badwords.size() + " entries, " +
                               ppRamString(badwordsFile.length() / 1024));
        }

        // load stopwords
        if (stopwords == null) {
            final File stopwordsFile = new File(rootPath, plasmaSwitchboardConstants.LIST_STOPWORDS_DEFAULT);
            stopwords = kelondroMSetTools.loadList(stopwordsFile, kelondroNaturalOrder.naturalComparator);
            this.log.logConfig("loaded stopwords from file " + stopwordsFile.getName() + ", " +
                               stopwords.size() + " entries, " +
                               ppRamString(stopwordsFile.length() / 1024));
        }

        // load ranking tables
        final File YBRPath = new File(rootPath, "ranking/YBR");
        if (YBRPath.exists()) {
            plasmaSearchRankingProcess.loadYBR(YBRPath, 15);
        }

        // loading the robots.txt db
        this.log.logConfig("Initializing robots.txt DB");
        final File robotsDBFile = new File(this.plasmaPath, plasmaSwitchboardConstants.DBFILE_CRAWL_ROBOTS);
        robots = new RobotsTxt(robotsDBFile);
        this.log.logConfig("Loaded robots.txt DB from file " + robotsDBFile.getName() +
                           ", " + robots.size() + " entries" +
                           ", " + ppRamString(robotsDBFile.length() / 1024));
        // start a cache manager
        log.logConfig("Starting HT Cache Manager");
        // create the cache directory
        htCachePath = getConfigPath(plasmaSwitchboardConstants.HTCACHE_PATH, plasmaSwitchboardConstants.HTCACHE_PATH_DEFAULT);
        this.log.logInfo("HTCACHE Path = " + htCachePath.getAbsolutePath());
        final long maxCacheSize = 1024 * 1024 * Long.parseLong(getConfig(plasmaSwitchboardConstants.PROXY_CACHE_SIZE, "2")); // this is megabytes
        plasmaHTCache.init(htCachePath, maxCacheSize);

        // create the release download directory
        releasePath = getConfigPath(plasmaSwitchboardConstants.RELEASE_PATH, plasmaSwitchboardConstants.RELEASE_PATH_DEFAULT);
        releasePath.mkdirs();
        this.log.logInfo("RELEASE Path = " + releasePath.getAbsolutePath());

        // starting message board
        initMessages();

        // starting wiki
        initWiki();

        // starting blog
        initBlog();

        // Init User DB
        this.log.logConfig("Loading User DB");
        final File userDbFile = new File(getRootPath(), plasmaSwitchboardConstants.DBFILE_USER);
        this.userDB = new userDB(userDbFile);
        this.log.logConfig("Loaded User DB from file " + userDbFile.getName() +
                           ", " + this.userDB.size() + " entries" +
                           ", " + ppRamString(userDbFile.length() / 1024));

        // Init bookmarks DB
        initBookmarks();

        // set a maximum amount of memory for the caches
        // long memprereq = Math.max(getConfigLong(INDEXER_MEMPREREQ, 0), wordIndex.minMem());
        // setConfig(INDEXER_MEMPREREQ, memprereq);
        // setThreadPerformance(INDEXER, getConfigLong(INDEXER_IDLESLEEP, 0), getConfigLong(INDEXER_BUSYSLEEP, 0), memprereq);
        kelondroCachedRecords.setCacheGrowStati(40 * 1024 * 1024, 20 * 1024 * 1024);
        kelondroCache.setCacheGrowStati(40 * 1024 * 1024, 20 * 1024 * 1024);
        // make parser
        log.logConfig("Starting Parser");
        this.parser = new plasmaParser();

        // define an extension-blacklist
        log.logConfig("Parser: Initializing Extension Mappings for Media/Parser");
        plasmaParser.initMediaExt(plasmaParser.extString2extList(getConfig(plasmaSwitchboardConstants.PARSER_MEDIA_EXT, "")));
        plasmaParser.initSupportedHTMLFileExt(plasmaParser.extString2extList(getConfig(plasmaSwitchboardConstants.PARSER_MEDIA_EXT_PARSEABLE, "")));

        // define a realtime parsable mimetype list
        log.logConfig("Parser: Initializing Mime Types");
        plasmaParser.initHTMLParsableMimeTypes(getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_HTML, "application/xhtml+xml,text/html,text/plain"));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_PROXY, getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_PROXY, null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_CRAWLER, getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_CRAWLER, null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_ICAP, getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_ICAP, null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_URLREDIRECTOR, getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_URLREDIRECTOR, null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_IMAGE, getConfig(plasmaSwitchboardConstants.PARSER_MIMETYPES_IMAGE, null));

        // start a loader
        log.logConfig("Starting Crawl Loader");
        this.crawlQueues = new CrawlQueues(this, plasmaPath);
        this.crawlQueues.noticeURL.setMinimumLocalDelta(this.getConfigLong("minimumLocalDelta", this.crawlQueues.noticeURL.getMinimumLocalDelta()));
        this.crawlQueues.noticeURL.setMinimumGlobalDelta(this.getConfigLong("minimumGlobalDelta", this.crawlQueues.noticeURL.getMinimumGlobalDelta()));
        /*
         * Creating sync objects and loading status for the crawl jobs
         * a) local crawl
         * b) remote triggered crawl
         * c) global crawl trigger
         */
        this.crawlJobsStatus.put(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL, new Object[] {
                new Object(),
                Boolean.valueOf(getConfig(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL + "_isPaused", "false")) });
        this.crawlJobsStatus.put(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL, new Object[] {
                new Object(),
                Boolean.valueOf(getConfig(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL + "_isPaused", "false")) });
        this.crawlJobsStatus.put(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER, new Object[] {
                new Object(),
                Boolean.valueOf(getConfig(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER + "_isPaused", "false")) });

        // init cookie-Monitor
        this.log.logConfig("Starting Cookie Monitor");
        this.outgoingCookies = new HashMap<String, Object[]>();
        this.incomingCookies = new HashMap<String, Object[]>();
// init search history trackers
2008-01-11 01:12:01 +01:00
this . localSearchTracker = new HashMap < String , TreeSet < Long > > ( ) ; // String:TreeSet - IP:set of Long(accessTime)
this . remoteSearchTracker = new HashMap < String , TreeSet < Long > > ( ) ;
2008-02-18 00:35:48 +01:00
this . localSearches = new ArrayList < plasmaSearchQuery > ( ) ; // contains search result properties as HashMaps
this . remoteSearches = new ArrayList < plasmaSearchQuery > ( ) ;
2007-01-15 02:50:57 +01:00
2006-12-23 05:26:05 +01:00
// init messages: clean up message symbol
2008-08-02 15:57:00 +02:00
final File notifierSource = new File ( getRootPath ( ) , getConfig ( plasmaSwitchboardConstants . HTROOT_PATH , plasmaSwitchboardConstants . HTROOT_PATH_DEFAULT ) + " /env/grafics/empty.gif " ) ;
final File notifierDest = new File ( getConfigPath ( plasmaSwitchboardConstants . HTDOCS_PATH , plasmaSwitchboardConstants . HTDOCS_PATH_DEFAULT ) , " notifier.gif " ) ;
2006-12-23 05:26:05 +01:00
try {
serverFileUtils . copy ( notifierSource , notifierDest ) ;
2008-08-02 14:12:04 +02:00
} catch ( final IOException e ) {
2006-12-23 05:26:05 +01:00
}
// clean up profiles
this.log.logConfig("Cleaning Profiles");
try { cleanProfiles(); } catch (final InterruptedException e) { /* Ignore this here */ }
// init ranking transmission
/*
CRDistOn       = true/false
CRDist0Path    = GLOBAL/010_owncr
CRDist0Method  = 1
CRDist0Percent = 0
CRDist0Target  =
CRDist1Path    = GLOBAL/014_othercr/1
CRDist1Method  = 9
CRDist1Percent = 30
CRDist1Target  = kaskelix.de:8080,yacy.dyndns.org:8000,suma-lab.de:8080
*/
rankingOn = getConfig(plasmaSwitchboardConstants.RANKING_DIST_ON, "true").equals("true") && networkName.equals("freeworld");
rankingOwnDistribution = new plasmaRankingDistribution(log, webIndex.seedDB, new File(rankingPath, getConfig(plasmaSwitchboardConstants.RANKING_DIST_0_PATH, plasmaRankingDistribution.CR_OWN)), (int) getConfigLong(plasmaSwitchboardConstants.RANKING_DIST_0_METHOD, plasmaRankingDistribution.METHOD_ANYSENIOR), (int) getConfigLong(plasmaSwitchboardConstants.RANKING_DIST_0_PERCENT, 0), getConfig(plasmaSwitchboardConstants.RANKING_DIST_0_TARGET, ""));
rankingOtherDistribution = new plasmaRankingDistribution(log, webIndex.seedDB, new File(rankingPath, getConfig(plasmaSwitchboardConstants.RANKING_DIST_1_PATH, plasmaRankingDistribution.CR_OTHER)), (int) getConfigLong(plasmaSwitchboardConstants.RANKING_DIST_1_METHOD, plasmaRankingDistribution.METHOD_MIXEDSENIOR), (int) getConfigLong(plasmaSwitchboardConstants.RANKING_DIST_1_PERCENT, 30), getConfig(plasmaSwitchboardConstants.RANKING_DIST_1_TARGET, "kaskelix.de:8080,yacy.dyndns.org:8000"));
// init facility DB
/*
log.logSystem("Starting Facility Database");
File facilityDBpath = new File(getRootPath(), "DATA/SETTINGS/");
facilityDB = new kelondroTables(facilityDBpath);
facilityDB.declareMaps("backlinks", 250, 500, new String[] {"date"}, null);
log.logSystem("..opened backlinks");
facilityDB.declareMaps("zeitgeist", 40, 500);
log.logSystem("..opened zeitgeist");
facilityDB.declareTree("statistik", new int[]{11, 8, 8, 8, 8, 8, 8}, 0x400);
log.logSystem("..opened statistik");
facilityDB.update("statistik", (new serverDate()).toShortString(false).substring(0, 11), new long[]{1, 2, 3, 4, 5, 6});
long[] testresult = facilityDB.selectLong("statistik", "yyyyMMddHHm");
testresult = facilityDB.selectLong("statistik", (new serverDate()).toShortString(false).substring(0, 11));
*/
// init nameCacheNoCachingList
final String noCachingList = getConfig(plasmaSwitchboardConstants.HTTPC_NAME_CACHE_CACHING_PATTERNS_NO, "");
final String[] noCachingEntries = noCachingList.split(",");
for (int i = 0; i < noCachingEntries.length; i++) {
    final String entry = noCachingEntries[i].trim();
    serverDomains.nameCacheNoCachingPatterns.add(entry);
}

// generate snippets cache
log.logConfig("Initializing Snippet Cache");
plasmaSnippetCache.init(parser, log);
final String wikiParserClassName = getConfig(plasmaSwitchboardConstants.WIKIPARSER_CLASS, plasmaSwitchboardConstants.WIKIPARSER_CLASS_DEFAULT);
this.log.logConfig("Loading wiki parser " + wikiParserClassName + " ...");
try {
    final Class<?> wikiParserClass = Class.forName(wikiParserClassName);
    final Constructor<?> wikiParserClassConstr = wikiParserClass.getConstructor(new Class[] { plasmaSwitchboard.class });
    wikiParser = (wikiParser) wikiParserClassConstr.newInstance(new Object[] { this });
} catch (final Exception e) {
    this.log.logSevere("Unable to load wiki parser, the wiki won't work", e);
}
// initializing the resourceObserver
this.observer = new ResourceObserver(this);
// run the observer here a first time
this.observer.resourceObserverJob();

// initializing the stackCrawlThread
this.crawlStacker = new CrawlStacker(this, this.plasmaPath, (int) getConfigLong("tableTypeForPreNURL", 0), (((int) getConfigLong("tableTypeForPreNURL", 0) == 0) && (getConfigLong(plasmaSwitchboardConstants.CRAWLSTACK_BUSYSLEEP, 0) <= 100)));
//this.sbStackCrawlThread = new plasmaStackCrawlThread(this, this.plasmaPath, ramPreNURL);
//this.sbStackCrawlThread.start();
// initializing dht chunk generation
this.dhtTransferChunk = null;
this.dhtTransferIndexCount = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_CHUNK_SIZE_START, 50);

// init robinson cluster
// before we do that, we wait some time until the seed list is loaded.
while (((System.currentTimeMillis() - startedSeedListAquisition) < 8000) && (this.webIndex.seedDB.sizeConnected() == 0)) try { Thread.sleep(1000); } catch (final InterruptedException e) {}
try { Thread.sleep(1000); } catch (final InterruptedException e) {}
this.clusterhashes = this.webIndex.seedDB.clusterHashes(getConfig("cluster.peers.yacydomain", ""));

// deploy blocking threads
indexingStorageProcessor      = new serverProcessor<indexingQueueEntry>(this, "storeDocumentIndex", 1, null);
indexingAnalysisProcessor     = new serverProcessor<indexingQueueEntry>(this, "webStructureAnalysis", serverProcessor.useCPU + 1, indexingStorageProcessor);
indexingCondensementProcessor = new serverProcessor<indexingQueueEntry>(this, "condenseDocument", serverProcessor.useCPU + 1, indexingAnalysisProcessor);
indexingDocumentProcessor     = new serverProcessor<indexingQueueEntry>(this, "parseDocument", serverProcessor.useCPU + 1, indexingCondensementProcessor);
// deploy busy threads
log.logConfig("Starting Threads");
serverMemory.gc(1000, "plasmaSwitchboard, help for profiler"); // help for profiler - thq
moreMemory = new Timer(); // init GC Thread - thq
moreMemory.schedule(new MoreMemory(), 300000, 600000);

deployThread(plasmaSwitchboardConstants.CLEANUP, "Cleanup", "simple cleaning process for monitoring information", null,
        new serverInstantBusyThread(this, plasmaSwitchboardConstants.CLEANUP_METHOD_START, plasmaSwitchboardConstants.CLEANUP_METHOD_JOBCOUNT, plasmaSwitchboardConstants.CLEANUP_METHOD_FREEMEM), 600000); // every 5 minutes, wait 10 minutes until first run
deployThread(plasmaSwitchboardConstants.CRAWLSTACK, "Crawl URL Stacker", "process that checks url for double-occurrences and for allowance/disallowance by robots.txt", null,
        new serverInstantBusyThread(crawlStacker, plasmaSwitchboardConstants.CRAWLSTACK_METHOD_START, plasmaSwitchboardConstants.CRAWLSTACK_METHOD_JOBCOUNT, plasmaSwitchboardConstants.CRAWLSTACK_METHOD_FREEMEM), 8000);
deployThread(plasmaSwitchboardConstants.INDEXER, "Indexing", "thread that either initiates a parsing/indexing queue, distributes the index into the DHT, stores parsed documents or flushes the index cache", "/IndexCreateIndexingQueue_p.html",
        new serverInstantBusyThread(this, plasmaSwitchboardConstants.INDEXER_METHOD_START, plasmaSwitchboardConstants.INDEXER_METHOD_JOBCOUNT, plasmaSwitchboardConstants.INDEXER_METHOD_FREEMEM), 10000);
deployThread(plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE, "Proxy Cache Enqueue", "job takes new input files from RAM stack, stores them, and hands over to the Indexing Stack", null,
        new serverInstantBusyThread(this, plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE_METHOD_START, plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE_METHOD_JOBCOUNT, plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE_METHOD_FREEMEM), 10000);
deployThread(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL, "Remote Crawl Job", "thread that performs a single crawl/indexing step triggered by a remote peer", null,
        new serverInstantBusyThread(crawlQueues, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_START, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_JOBCOUNT, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL_METHOD_FREEMEM), 30000);
deployThread(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER, "Remote Crawl URL Loader", "thread that loads remote crawl lists from other peers", "",
        new serverInstantBusyThread(crawlQueues, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER_METHOD_START, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER_METHOD_JOBCOUNT, plasmaSwitchboardConstants.CRAWLJOB_REMOTE_CRAWL_LOADER_METHOD_FREEMEM), 30000); // error here?
deployThread(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL, "Local Crawl", "thread that performs a single crawl step from the local crawl queue", "/IndexCreateWWWLocalQueue_p.html",
        new serverInstantBusyThread(crawlQueues, plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL_METHOD_START, plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL_METHOD_JOBCOUNT, plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL_METHOD_FREEMEM), 10000);
deployThread(plasmaSwitchboardConstants.SEED_UPLOAD, "Seed-List Upload", "task that a principal peer performs to generate and upload a seed-list to an ftp account", null,
        new serverInstantBusyThread(yc, plasmaSwitchboardConstants.SEED_UPLOAD_METHOD_START, plasmaSwitchboardConstants.SEED_UPLOAD_METHOD_JOBCOUNT, plasmaSwitchboardConstants.SEED_UPLOAD_METHOD_FREEMEM), 180000);
deployThread(plasmaSwitchboardConstants.PEER_PING, "YaCy Core", "this is the p2p-control and peer-ping task", null,
        new serverInstantBusyThread(yc, plasmaSwitchboardConstants.PEER_PING_METHOD_START, plasmaSwitchboardConstants.PEER_PING_METHOD_JOBCOUNT, plasmaSwitchboardConstants.PEER_PING_METHOD_FREEMEM), 2000);
deployThread(plasmaSwitchboardConstants.INDEX_DIST, "DHT Distribution", "selection, transfer and deletion of index entries that are not searched on your peer, but on others", null,
        new serverInstantBusyThread(this, plasmaSwitchboardConstants.INDEX_DIST_METHOD_START, plasmaSwitchboardConstants.INDEX_DIST_METHOD_JOBCOUNT, plasmaSwitchboardConstants.INDEX_DIST_METHOD_FREEMEM), 60000,
        Long.parseLong(getConfig(plasmaSwitchboardConstants.INDEX_DIST_IDLESLEEP, "5000")),
        Long.parseLong(getConfig(plasmaSwitchboardConstants.INDEX_DIST_BUSYSLEEP, "0")),
        Long.parseLong(getConfig(plasmaSwitchboardConstants.INDEX_DIST_MEMPREREQ, "1000000")));
// test routine for snippet fetch
//Set query = new HashSet();
//query.add(plasmaWordIndexEntry.word2hash("Weitergabe"));
//query.add(plasmaWordIndexEntry.word2hash("Zahl"));
//plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/mobil/newsticker/meldung/mail/54980"), query, true);
//plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/security/news/foren/go.shtml?read=1&msg_id=7301419&forum_id=72721"), query, true);
//plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/kiosk/archiv/ct/2003/4/20"), query, true, 260);

this.dbImportManager = new ImporterManager();

log.logConfig("Finished Switchboard Initialization");
}
/**
 * Initialize the system tray icon, if enabled and supported on this platform.
 */
private void initSystemTray() {
    // make system tray
    // TODO: make tray on linux
    try {
        final boolean trayIcon = getConfig("trayIcon", "false").equals("true");
        if (trayIcon && serverSystem.isWindows) {
            System.setProperty("java.awt.headless", "false");
            yacytray = new yacyTray(this, false);
        }
    } catch (final Exception e) {
        System.setProperty("java.awt.headless", "true");
    }
}
public static void overwriteNetworkDefinition(final plasmaSwitchboard sb) {

    // load network configuration into settings
    String networkUnitDefinition = sb.getConfig("network.unit.definition", "defaults/yacy.network.freeworld.unit");
    final String networkGroupDefinition = sb.getConfig("network.group.definition", "yacy.network.group");

    // patch old values
    if (networkUnitDefinition.equals("yacy.network.unit")) {
        networkUnitDefinition = "defaults/yacy.network.freeworld.unit";
        sb.setConfig("network.unit.definition", networkUnitDefinition);
    }

    // remove old release locations
    int i = 0;
    String location;
    while (true) {
        location = sb.getConfig("network.unit.update.location" + i, "");
        if (location.length() == 0) break;
        sb.removeConfig("network.unit.update.location" + i);
        i++;
    }

    // include additional network definition properties into our settings
    // note that these properties cannot be set in the application because they are
    // _always_ overwritten each time with the default values. This is done on purpose.
    // The network definition should be made either consistent for all peers,
    // or independently using a bootstrap URL.
    Map<String, String> initProps;
    if (networkUnitDefinition.startsWith("http://")) {
        try {
            sb.setConfig(plasmaSwitchboard.loadHashMap(new yacyURL(networkUnitDefinition, null)));
        } catch (final MalformedURLException e) { }
    } else {
        final File networkUnitDefinitionFile = (networkUnitDefinition.startsWith("/")) ? new File(networkUnitDefinition) : new File(sb.getRootPath(), networkUnitDefinition);
        if (networkUnitDefinitionFile.exists()) {
            initProps = serverFileUtils.loadHashMap(networkUnitDefinitionFile);
            sb.setConfig(initProps);
        }
    }
    if (networkGroupDefinition.startsWith("http://")) {
        try {
            sb.setConfig(plasmaSwitchboard.loadHashMap(new yacyURL(networkGroupDefinition, null)));
        } catch (final MalformedURLException e) { }
    } else {
        final File networkGroupDefinitionFile = new File(sb.getRootPath(), networkGroupDefinition);
        if (networkGroupDefinitionFile.exists()) {
            initProps = serverFileUtils.loadHashMap(networkGroupDefinitionFile);
            sb.setConfig(initProps);
        }
    }

    // set release locations
    while (true) {
        location = sb.getConfig("network.unit.update.location" + i, "");
        if (location.length() == 0) break;
        try {
            yacyVersion.latestReleaseLocations.add(new yacyURL(location, null));
        } catch (final MalformedURLException e) {
            break;
        }
        i++;
    }

    // initiate url license object
    sb.licensedURLs = new URLLicense(8);

    // set URL domain acceptance
    sb.acceptGlobalURLs = "global.any".indexOf(sb.getConfig("network.unit.domain", "global")) >= 0;
    sb.acceptLocalURLs = "local.any".indexOf(sb.getConfig("network.unit.domain", "global")) >= 0;
}
public void switchNetwork(final String networkDefinition) {
    // pause crawls
    final boolean lcp = crawlJobIsPaused(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL);
    if (!lcp) pauseCrawlJob(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL);
    final boolean rcp = crawlJobIsPaused(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL);
    if (!rcp) pauseCrawlJob(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL);

    // trigger online caution
    proxyLastAccess = System.currentTimeMillis() + 10000; // at least 10 seconds online caution to prevent unnecessary action on database meanwhile

    // clean search events which have cached relations to the old index
    plasmaSearchEvent.cleanupEvents(true);

    // switch the networks
    synchronized (this) {
        synchronized (this.webIndex) {
            this.webIndex.close();
        }
        setConfig("network.unit.definition", networkDefinition);
        overwriteNetworkDefinition(this);
        final File indexPrimaryPath = getConfigPath(plasmaSwitchboardConstants.INDEX_PRIMARY_PATH, plasmaSwitchboardConstants.INDEX_PATH_DEFAULT);
        final File indexSecondaryPath = (getConfig(plasmaSwitchboardConstants.INDEX_SECONDARY_PATH, "").length() == 0) ? indexPrimaryPath : new File(getConfig(plasmaSwitchboardConstants.INDEX_SECONDARY_PATH, ""));
        final int wordCacheMaxCount = (int) getConfigLong(plasmaSwitchboardConstants.WORDCACHE_MAX_COUNT, 20000);
        this.webIndex = new plasmaWordIndex(getConfig("network.unit.name", ""), getLog(), indexPrimaryPath, indexSecondaryPath, wordCacheMaxCount);
    }

    // start up crawl jobs
    continueCrawlJob(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL);
    continueCrawlJob(plasmaSwitchboardConstants.CRAWLJOB_REMOTE_TRIGGERED_CRAWL);
    this.log.logInfo("switched network to " + networkDefinition);

    // check status of account configuration: when local url crawling is allowed, it is not allowed
    // that an automatic authorization of localhost is done, because in this case crawls from local
    // addresses are blocked to prevent attack scenarios where remote pages contain links to localhost
    // addresses that can steer a YaCy peer
    if ((this.acceptLocalURLs) && (getConfigBool("adminAccountForLocalhost", false))) {
        setConfig("adminAccountForLocalhost", false);
        if (getConfig(httpd.ADMIN_ACCOUNT_B64MD5, "").startsWith("0000")) {
            // the password was set automatically with a random value.
            // We must remove that here to prevent that a user cannot log in any more
            setConfig(httpd.ADMIN_ACCOUNT_B64MD5, "");
            // after this a message must be generated to alert the user to set a new password
            log.logInfo("RANDOM PASSWORD REMOVED! User must set a new password");
        }
    }
}
public void initMessages() {
    this.log.logConfig("Starting Message Board");
    final File messageDbFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_MESSAGE);
    this.messageDB = new messageBoard(messageDbFile);
    this.log.logConfig("Loaded Message Board DB from file " + messageDbFile.getName() +
            ", " + this.messageDB.size() + " entries" +
            ", " + ppRamString(messageDbFile.length() / 1024));
}

public void initWiki() {
    this.log.logConfig("Starting Wiki Board");
    final File wikiDbFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_WIKI);
    this.wikiDB = new wikiBoard(wikiDbFile, new File(workPath, plasmaSwitchboardConstants.DBFILE_WIKI_BKP));
    this.log.logConfig("Loaded Wiki Board DB from file " + wikiDbFile.getName() +
            ", " + this.wikiDB.size() + " entries" +
            ", " + ppRamString(wikiDbFile.length() / 1024));
}

public void initBlog() {
    this.log.logConfig("Starting Blog");
    final File blogDbFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_BLOG);
    this.blogDB = new blogBoard(blogDbFile);
    this.log.logConfig("Loaded Blog DB from file " + blogDbFile.getName() +
            ", " + this.blogDB.size() + " entries" +
            ", " + ppRamString(blogDbFile.length() / 1024));

    final File blogCommentDbFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_BLOGCOMMENTS);
    this.blogCommentDB = new blogBoardComments(blogCommentDbFile);
    this.log.logConfig("Loaded Blog-Comment DB from file " + blogCommentDbFile.getName() +
            ", " + this.blogCommentDB.size() + " entries" +
            ", " + ppRamString(blogCommentDbFile.length() / 1024));
}

public void initBookmarks() {
    this.log.logConfig("Loading Bookmarks DB");
    final File bookmarksFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_BOOKMARKS);
    final File tagsFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_BOOKMARKS_TAGS);
    final File datesFile = new File(workPath, plasmaSwitchboardConstants.DBFILE_BOOKMARKS_DATES);
    this.bookmarksDB = new bookmarksDB(bookmarksFile, tagsFile, datesFile);
    this.log.logConfig("Loaded Bookmarks DB from files " + bookmarksFile.getName() + ", " + tagsFile.getName());
    this.log.logConfig(this.bookmarksDB.tagsSize() + " Tag, " + this.bookmarksDB.bookmarksSize() + " Bookmarks");
}
public static plasmaSwitchboard getSwitchboard() {
    return sb;
}

public boolean isRobinsonMode() {
    // we are in robinson mode, if we do not exchange index by dht distribution
    // we need to take care that search requests and remote indexing requests go only
    // to the peers in the same cluster, if we run a robinson cluster.
    return !getConfigBool(plasmaSwitchboardConstants.INDEX_DIST_ALLOW, false) && !getConfigBool(plasmaSwitchboardConstants.INDEX_RECEIVE_ALLOW, false);
}
public boolean isPublicRobinson() {
    // robinson peers may be member of robinson clusters, which can be public or private
    // this does not check the robinson attribute, only the specific subtype of the cluster
    final String clustermode = getConfig(plasmaSwitchboardConstants.CLUSTER_MODE, plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_PEER);
    return (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_CLUSTER)) || (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_PEER));
}

public boolean isInMyCluster(final String peer) {
    // check if the given peer is in the own network, if this is a robinson cluster
    // depending on the robinson cluster type, the peer String may be a peerhash (b64-hash)
    // or an ip:port String or simply an ip String
    // if this robinson mode does not define a cluster membership, false is returned
    if (peer == null) return false;
    if (!isRobinsonMode()) return false;
    final String clustermode = getConfig(plasmaSwitchboardConstants.CLUSTER_MODE, plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_PEER);
    if (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PRIVATE_CLUSTER)) {
        // check if we got the request from a peer in the private cluster
        final String network = getConfig(plasmaSwitchboardConstants.CLUSTER_PEERS_IPPORT, "");
        return network.indexOf(peer) >= 0;
    } else if (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_CLUSTER)) {
        // check if we got the request from a peer in the public cluster
        return this.clusterhashes.containsKey(peer);
    } else {
        return false;
    }
}

public boolean isInMyCluster(final yacySeed seed) {
    // check if the given peer is in the own network, if this is a robinson cluster
    // if this robinson mode does not define a cluster membership, false is returned
    if (seed == null) return false;
    if (!isRobinsonMode()) return false;
    final String clustermode = getConfig(plasmaSwitchboardConstants.CLUSTER_MODE, plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_PEER);
    if (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PRIVATE_CLUSTER)) {
        // check if we got the request from a peer in the private cluster
        final String network = getConfig(plasmaSwitchboardConstants.CLUSTER_PEERS_IPPORT, "");
        return network.indexOf(seed.getPublicAddress()) >= 0;
    } else if (clustermode.equals(plasmaSwitchboardConstants.CLUSTER_MODE_PUBLIC_CLUSTER)) {
        // check if we got the request from a peer in the public cluster
        return this.clusterhashes.containsKey(seed.hash);
    } else {
        return false;
    }
}
/**
 * Test a url if it can be used for crawling/indexing.
 * This mainly checks if the url is in the declared domain (local/global).
 * @param url
 * @return null if the url can be accepted, a string containing a rejection reason if the url cannot be accepted
 */
public String acceptURL(final yacyURL url) {
    // returns null if the url can be accepted according to network.unit.domain
    if (url == null) return "url is null";
    final String host = url.getHost();
    if (host == null) return "url.host is null";
    if (this.acceptGlobalURLs && this.acceptLocalURLs) return null; // fast shortcut to avoid dnsResolve
    /*
    InetAddress hostAddress = serverDomains.dnsResolve(host);
    // if we don't know the host, we cannot load that resource anyway.
    // But in case we use a proxy, it is possible that we don't have a DNS service.
    final httpRemoteProxyConfig remoteProxyConfig = httpdProxyHandler.getRemoteProxyConfig();
    if (hostAddress == null) {
        if ((remoteProxyConfig != null) && (remoteProxyConfig.useProxy())) return null; else return "the dns of the host '" + host + "' cannot be resolved";
    }
    */
    // check if this is a local address and we are allowed to index local pages:
    //boolean local = hostAddress.isSiteLocalAddress() || hostAddress.isLoopbackAddress();
    final boolean local = url.isLocal();
    //assert local == yacyURL.isLocalDomain(url.hash()); // TODO: remove the dnsResolve above!
    if ((this.acceptGlobalURLs && !local) || (this.acceptLocalURLs && local)) return null;
    return (local) ?
            ("the host '" + host + "' is local, but local addresses are not accepted") :
            ("the host '" + host + "' is global, but global addresses are not accepted");
}
public String urlExists(final String hash) {
    // tests if hash occurs in any database
    // if it exists, the name of the database is returned,
    // if it does not exist, null is returned
    if (webIndex.existsURL(hash)) return "loaded";
    return this.crawlQueues.urlExists(hash);
}

public void urlRemove(final String hash) {
    webIndex.removeURL(hash);
    crawlResults.remove(hash);
    crawlQueues.urlRemove(hash);
}

public yacyURL getURL(final String urlhash) {
    if (urlhash == null) return null;
    if (urlhash.length() == 0) return null;
    final yacyURL ne = crawlQueues.getURL(urlhash);
    if (ne != null) return ne;
    final indexURLReference le = webIndex.getURL(urlhash, null, 0);
    if (le != null) return le.comp().url();
    return null;
}
public plasmaSearchRankingProfile getRanking() {
    return (getConfig("rankingProfile", "").length() == 0) ?
            new plasmaSearchRankingProfile(plasmaSearchQuery.CONTENTDOM_TEXT) :
            new plasmaSearchRankingProfile("", crypt.simpleDecode(sb.getConfig("rankingProfile", ""), null));
}

    /**
     * This method changes the HTCache size.<br>
     * @param newCacheSize in MB
     */
    public final void setCacheSize(final long newCacheSize) {
        plasmaHTCache.setCacheSize(1048576 * newCacheSize);
    }

    public boolean onlineCaution() {
        return
            (System.currentTimeMillis() - this.proxyLastAccess < Integer.parseInt(getConfig(plasmaSwitchboardConstants.PROXY_ONLINE_CAUTION_DELAY, "30000"))) ||
            (System.currentTimeMillis() - this.localSearchLastAccess < Integer.parseInt(getConfig(plasmaSwitchboardConstants.LOCALSEACH_ONLINE_CAUTION_DELAY, "30000"))) ||
            (System.currentTimeMillis() - this.remoteSearchLastAccess < Integer.parseInt(getConfig(plasmaSwitchboardConstants.REMOTESEARCH_ONLINE_CAUTION_DELAY, "30000")));
    }

    private static String ppRamString(long bytes) {
        if (bytes < 1024) return bytes + " KByte";
        bytes = bytes / 1024;
        if (bytes < 1024) return bytes + " MByte";
        bytes = bytes / 1024;
        if (bytes < 1024) return bytes + " GByte";
        return (bytes / 1024) + " TByte";
    }

    /**
     * {@link CrawlProfile Crawl Profiles} are saved independently from the queues themselves
     * and therefore have to be cleaned up from time to time. This method only performs the clean-up
     * if - and only if - the {@link IndexingStack switchboard},
     * {@link ProtocolLoader loader} and {@link plasmaCrawlNURL local crawl} queues are all empty.
     * <p>
     * Then it iterates through all existing {@link CrawlProfile crawl profiles} and removes
     * all profiles which are not hardcoded.
     * </p>
     * <p>
     * <i>If this method encounters DB-failures, the profile DB will be reset and</i>
     * <code>true</code> <i>will be returned</i>
     * </p>
     * @see #CRAWL_PROFILE_PROXY hardcoded
     * @see #CRAWL_PROFILE_REMOTE hardcoded
     * @see #CRAWL_PROFILE_SNIPPET_TEXT hardcoded
     * @see #CRAWL_PROFILE_SNIPPET_MEDIA hardcoded
     * @return whether this method has done something or not (i.e. because the queues have been filled
     * or there are no profiles left to clean up)
     * @throws InterruptedException if the current thread has been interrupted, i.e. by the
     * shutdown procedure
     */
    public boolean cleanupProfiles() throws InterruptedException {
        if ((crawlQueues.size() > 0) ||
            (crawlStacker != null && crawlStacker.size() > 0) ||
            (crawlQueues.noticeURL.notEmpty()))
            return false;
        return this.webIndex.cleanProfiles();
    }

    public boolean htEntryStoreProcess(final indexDocumentMetadata entry) {

        if (entry == null) return false;

        /* =========================================================================
         * PARSER SUPPORT
         *
         * Testing if the content type is supported by the available parsers
         * ========================================================================= */
        final boolean isSupportedContent = plasmaParser.supportedContent(entry.url(), entry.getMimeType());
        if (log.isFinest()) log.logFinest("STORE " + entry.url() + " content of type " + entry.getMimeType() + " is supported: " + isSupportedContent);

        /* =========================================================================
         * INDEX CONTROL HEADER
         *
         * With the X-YACY-Index-Control header set to "no-index" a client can disallow
         * yacy to index the response returned as answer to a request
         * ========================================================================= */
        boolean doIndexing = true;
        if (entry.requestProhibitsIndexing()) {
            doIndexing = false;
            if (this.log.isFine()) this.log.logFine("Crawling of " + entry.url() + " prohibited by request.");
        }

        /* =========================================================================
         * LOCAL IP ADDRESS CHECK
         *
         * check if the ip is a local ip address // TODO: remove this protocol specific code here
         * ========================================================================= */
        final String urlRejectReason = acceptURL(entry.url());
        if (urlRejectReason != null) {
            if (this.log.isFine()) this.log.logFine("Rejected URL '" + entry.url() + "': " + urlRejectReason);
            doIndexing = false;
        }

        synchronized (webIndex.queuePreStack) {
            /* =========================================================================
             * STORING DATA
             *
             * Now we store the response header and response content if
             * a) the user has configured to use the htcache or
             * b) the content should be indexed
             * ========================================================================= */
            if (((entry.profile() != null) && (entry.profile().storeHTCache())) || (doIndexing && isSupportedContent)) {
                // store response header
                /*
                if (entry.writeResourceInfo()) {
                    this.log.logInfo("WROTE HEADER for " + entry.cacheFile());
                }
                */

                // work off unwritten files
                if (entry.cacheArray() != null) {
                    final String error = entry.shallStoreCacheForProxy();
                    if (error == null) {
                        plasmaHTCache.writeResourceContent(entry.url(), entry.cacheArray());
                        if (this.log.isFine()) this.log.logFine("WROTE FILE (" + entry.cacheArray().length + " bytes) for " + entry.cacheFile());
                    } else {
                        if (this.log.isFine()) this.log.logFine("WRITE OF FILE " + entry.cacheFile() + " FORBIDDEN: " + error);
                    }
                //} else {
                    //this.log.logFine("EXISTING FILE (" + entry.cacheFile.length() + " bytes) for " + entry.cacheFile);
                }
            }

            /* =========================================================================
             * INDEXING
             * ========================================================================= */
            if (doIndexing && isSupportedContent) {

                // enqueue for further crawling
                enQueue(this.webIndex.queuePreStack.newEntry(
                        entry.url(),
                        (entry.referrerURL() == null) ? null : entry.referrerURL().hash(),
                        entry.ifModifiedSince(),
                        entry.requestWithCookie(),
                        entry.initiator(),
                        entry.depth(),
                        entry.profile().handle(),
                        entry.name()
                        ));
            } else {
                if (!entry.profile().storeHTCache() && entry.cacheFile().exists()) {
                    plasmaHTCache.deleteURLfromCache(entry.url(), false);
                }
            }
        }
        return true;
    }

    public boolean htEntryStoreJob() {
        if (plasmaHTCache.empty()) return false;
        return htEntryStoreProcess(plasmaHTCache.pop());
    }

    public int htEntrySize() {
        return plasmaHTCache.size();
    }

    public void close() {
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 1: sending termination signal to managed threads:");
        serverProfiling.stopSystemProfiling();
        moreMemory.cancel();
        terminateAllThreads(true);
        if (transferIdxThread != null) stopTransferWholeIndex(false);
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 2: sending termination signal to threaded indexing");
        // closing all still running db importer jobs
        indexingDocumentProcessor.shutdown(4000);
        indexingCondensementProcessor.shutdown(3000);
        indexingAnalysisProcessor.shutdown(2000);
        indexingStorageProcessor.shutdown(1000);
        this.dbImportManager.close();
        JakartaCommonsHttpClient.closeAllConnections();
        wikiDB.close();
        blogDB.close();
        blogCommentDB.close();
        userDB.close();
        bookmarksDB.close();
        messageDB.close();
        crawlStacker.close();
        robots.close();
        parser.close();
        plasmaHTCache.close();
        webStructure.flushCitationReference("crg");
        webStructure.close();
        crawlQueues.close();
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 3: sending termination signal to database manager (stand by...)");
        webIndex.close();
        if (yacyTray.isShown) yacyTray.removeTray();
        log.logConfig("SWITCHBOARD SHUTDOWN TERMINATED");
    }

    public int queueSize() {
        return webIndex.queuePreStack.size();
    }

    public void enQueue(final IndexingStack.QueueEntry job) {
        assert job != null;
        try {
            webIndex.queuePreStack.push(job);
        } catch (final IOException e) {
            log.logSevere("IOError in plasmaSwitchboard.enQueue: " + e.getMessage(), e);
        }
    }

    public void deQueueFreeMem() {
        // flush some entries from the RAM cache
        webIndex.flushCacheSome();
        // empty some caches
        webIndex.clearCache();
        plasmaSearchEvent.cleanupEvents(true);
        // adapt the maximum cache size to the current size to prevent further OutOfMemoryErrors
        /* int newMaxCount = Math.max(1200, Math.min((int) getConfigLong(WORDCACHE_MAX_COUNT, 1200), wordIndex.dhtOutCacheSize()));
        setConfig(WORDCACHE_MAX_COUNT, Integer.toString(newMaxCount));
        wordIndex.setMaxWordCount(newMaxCount); */
    }

    public IndexingStack.QueueEntry deQueue() {
        // getting the next entry from the indexing queue
        IndexingStack.QueueEntry nextentry = null;
        synchronized (webIndex.queuePreStack) {
            // do one processing step
            if (this.log.isFine()) log.logFine("DEQUEUE: sbQueueSize=" + webIndex.queuePreStack.size() +
                    ", coreStackSize=" + crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_CORE) +
                    ", limitStackSize=" + crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_LIMIT) +
                    ", overhangStackSize=" + crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_OVERHANG) +
                    ", remoteStackSize=" + crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_REMOTE));
            try {
                final int sizeBefore = webIndex.queuePreStack.size();
                nextentry = webIndex.queuePreStack.pop();
                if (nextentry == null) {
                    log.logWarning("deQueue: null entry on queue stack.");
                    if (webIndex.queuePreStack.size() == sizeBefore) {
                        // this is a severe problem: because a null is returned this time, this status would last forever;
                        // to re-enable use of the sbQueue, it must be emptied completely
                        log.logSevere("deQueue: does not shrink after pop() == null. Emergency reset.");
                        webIndex.queuePreStack.clear();
                    }
                    return null;
                }
            } catch (final IOException e) {
                log.logSevere("IOError in plasmaSwitchboard.deQueue: " + e.getMessage(), e);
                return null;
            }
            return nextentry;
        }
    }

    public boolean deQueueProcess() {
        try {
            // work off fresh entries from the proxy or from the crawler
            if (onlineCaution()) {
                if (this.log.isFine()) log.logFine("deQueue: online caution, omitting resource stack processing");
                return false;
            }

            boolean doneSomething = false;

            // flush some entries from the RAM cache
            if (webIndex.queuePreStack.size() == 0) {
                doneSomething = webIndex.flushCacheSome() > 0; // permanent flushing only if we are not busy
            }

            // possibly delete entries from last chunk
            if ((this.dhtTransferChunk != null) && (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_COMPLETE)) {
                final String deletedURLs = this.dhtTransferChunk.deleteTransferIndexes();
                if (this.log.isFine()) this.log.logFine("Deleted from " + this.dhtTransferChunk.containers().length + " transferred RWIs locally, removed " + deletedURLs + " URL references");
                this.dhtTransferChunk = null;
            }

            // generate a dht chunk
            if ((dhtShallTransfer() == null) && (
                    (this.dhtTransferChunk == null) ||
                    (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_UNDEFINED) ||
                    // (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_COMPLETE) ||
                    (this.dhtTransferChunk.getStatus() == plasmaDHTChunk.chunkStatus_FAILED)
                )) {
                // generate new chunk
                final int minChunkSize = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_CHUNK_SIZE_MIN, 30);
                dhtTransferChunk = new plasmaDHTChunk(this.log, webIndex, minChunkSize, dhtTransferIndexCount, 5000);
                doneSomething = true;
            }

            // check for interruption
            checkInterruption();

            // getting the next entry from the indexing queue
            if (webIndex.queuePreStack.size() == 0) {
                //log.logFine("deQueue: nothing to do, queue is empty");
                return doneSomething; // nothing to do
            }

            if (crawlStacker.size() >= getConfigLong(plasmaSwitchboardConstants.CRAWLSTACK_SLOTS, 2000)) {
                if (this.log.isFine()) log.logFine("deQueue: too many processes in stack crawl thread queue (" + "stackCrawlQueue=" + crawlStacker.size() + ")");
                return doneSomething;
            }

            // if we were interrupted we should return now
            if (Thread.currentThread().isInterrupted()) {
                if (this.log.isFine()) log.logFine("deQueue: thread was interrupted");
                return false;
            }

            // get next queue entry and start a queue processing
            final IndexingStack.QueueEntry queueEntry = deQueue();
            assert queueEntry != null;
            if (queueEntry == null) return true;
            if (queueEntry.profile() == null) {
                queueEntry.close();
                return true;
            }
            webIndex.queuePreStack.enQueueToActive(queueEntry);

            // check for interruption
            checkInterruption();

            this.indexingDocumentProcessor.enQueue(new indexingQueueEntry(queueEntry, null, null));
            /*
            // THE FOLLOWING CAN BE CONCURRENT ->
            // parse and index the resource
            indexingQueueEntry document = parseDocument(new indexingQueueEntry(queueEntry, null, null));
            // do condensing
            indexingQueueEntry condensement = condenseDocument(document);
            // do a web structure analysis
            indexingQueueEntry analysis = webStructureAnalysis(condensement);
            // <- CONCURRENT UNTIL HERE, THEN SERIALIZE AGAIN
            // store the result
            storeDocumentIndex(analysis);
            */
            return true;
        } catch (final InterruptedException e) {
            log.logInfo("DEQUEUE: Shutdown detected.");
            return false;
        }
    }

    public static class indexingQueueEntry extends serverProcessorJob {
        public IndexingStack.QueueEntry queueEntry;
        public plasmaParserDocument document;
        public plasmaCondenser condenser;

        public indexingQueueEntry(
                final IndexingStack.QueueEntry queueEntry,
                final plasmaParserDocument document,
                final plasmaCondenser condenser) {
            super();
            this.queueEntry = queueEntry;
            this.document = document;
            this.condenser = condenser;
        }
    }

    public int cleanupJobSize() {
        int c = 0;
        if (crawlQueues.delegatedURL.stackSize() > 1000) c++;
        if (crawlQueues.errorURL.stackSize() > 1000) c++;
        for (int i = 1; i <= 6; i++) {
            if (crawlResults.getStackSize(i) > 1000) c++;
        }
        return c;
    }

    public boolean cleanupJob() {
        try {
            boolean hasDoneSomething = false;

            // clear caches if necessary
            if (!serverMemory.request(8000000L, false)) {
                webIndex.clearCache();
                plasmaSearchEvent.cleanupEvents(true);
            }

            // set a random password if no password is configured
            if (!this.acceptLocalURLs && getConfigBool("adminAccountForLocalhost", false) && getConfig(httpd.ADMIN_ACCOUNT_B64MD5, "").length() == 0) {
                // make a 'random' password
                setConfig(httpd.ADMIN_ACCOUNT_B64MD5, "0000" + serverCodings.encodeMD5Hex(System.getProperties().toString() + System.currentTimeMillis()));
                setConfig("adminAccount", "");
            }

            // refresh recrawl dates
            try {
                final Iterator<CrawlProfile.entry> it = webIndex.profilesActiveCrawls.profiles(true);
                CrawlProfile.entry selentry;
                while (it.hasNext()) {
                    selentry = it.next();
                    if (selentry.name().equals(plasmaWordIndex.CRAWL_PROFILE_PROXY))
                        webIndex.profilesActiveCrawls.changeEntry(selentry, CrawlProfile.entry.RECRAWL_IF_OLDER,
                                Long.toString(webIndex.profilesActiveCrawls.getRecrawlDate(plasmaWordIndex.CRAWL_PROFILE_PROXY_RECRAWL_CYCLE)));
                    // if (selentry.name().equals(plasmaWordIndex.CRAWL_PROFILE_REMOTE));
                    if (selentry.name().equals(plasmaWordIndex.CRAWL_PROFILE_SNIPPET_LOCAL_TEXT))
                        webIndex.profilesActiveCrawls.changeEntry(selentry, CrawlProfile.entry.RECRAWL_IF_OLDER,
                                Long.toString(webIndex.profilesActiveCrawls.getRecrawlDate(plasmaWordIndex.CRAWL_PROFILE_SNIPPET_LOCAL_TEXT_RECRAWL_CYCLE)));
                    if (selentry.name().equals(plasmaWordIndex.CRAWL_PROFILE_SNIPPET_GLOBAL_TEXT))
                        webIndex.profilesActiveCrawls.changeEntry(selentry, CrawlProfile.entry.RECRAWL_IF_OLDER,
                                Long.toString(webIndex.profilesActiveCrawls.getRecrawlDate(plasmaWordIndex.CRAWL_PROFILE_SNIPPET_GLOBAL_TEXT_RECRAWL_CYCLE)));
                    if (selentry.name().equals(plasmaWordIndex.CRAWL_PROFILE_SNIPPET_LOCAL_MEDIA))
                        webIndex.profilesActiveCrawls.changeEntry(selentry, CrawlProfile.entry.RECRAWL_IF_OLDER,
                                Long.toString(webIndex.profilesActiveCrawls.getRecrawlDate(plasmaWordIndex.CRAWL_PROFILE_SNIPPET_LOCAL_MEDIA_RECRAWL_CYCLE)));
                    if (selentry.name().equals(plasmaWordIndex.CRAWL_PROFILE_SNIPPET_GLOBAL_MEDIA))
                        webIndex.profilesActiveCrawls.changeEntry(selentry, CrawlProfile.entry.RECRAWL_IF_OLDER,
                                Long.toString(webIndex.profilesActiveCrawls.getRecrawlDate(plasmaWordIndex.CRAWL_PROFILE_SNIPPET_GLOBAL_MEDIA_RECRAWL_CYCLE)));
                }
            } catch (final IOException e) {}

            // close unused connections
            JakartaCommonsHttpClient.cleanup();

            // clean up too old connection information
            super.cleanupAccessTracker(1000 * 60 * 60);

            // do transmission of CR-files
            checkInterruption();
            int count = rankingOwnDistribution.size() / 100;
            if (count == 0) count = 1;
            if (count > 5) count = 5;
            if (rankingOn) {
                rankingOwnDistribution.transferRanking(count);
                rankingOtherDistribution.transferRanking(1);
            }

            // clean up delegated stack
            checkInterruption();
            if (crawlQueues.delegatedURL.stackSize() > 1000) {
                if (this.log.isFine()) log.logFine("Cleaning Delegated-URLs report stack, " + crawlQueues.delegatedURL.stackSize() + " entries on stack");
                crawlQueues.delegatedURL.clearStack();
                hasDoneSomething = true;
            }

            // clean up error stack
            checkInterruption();
            if (crawlQueues.errorURL.stackSize() > 1000) {
                if (this.log.isFine()) log.logFine("Cleaning Error-URLs report stack, " + crawlQueues.errorURL.stackSize() + " entries on stack");
                crawlQueues.errorURL.clearStack();
                hasDoneSomething = true;
            }

            // clean up loadedURL stack
            for (int i = 1; i <= 6; i++) {
                checkInterruption();
                if (crawlResults.getStackSize(i) > 1000) {
                    if (this.log.isFine()) log.logFine("Cleaning Loaded-URLs report stack, " + crawlResults.getStackSize(i) + " entries on stack " + i);
                    crawlResults.clearStack(i);
                    hasDoneSomething = true;
                }
            }

            // clean up image stack
            ResultImages.clearQueues();

            // clean up profiles
            checkInterruption();
            if (cleanProfiles()) hasDoneSomething = true;

            // clean up news
            checkInterruption();
            try {
                if (this.log.isFine()) log.logFine("Cleaning Incoming News, " + this.webIndex.newsPool.size(yacyNewsPool.INCOMING_DB) + " entries on stack");
                if (this.webIndex.newsPool.automaticProcess(webIndex.seedDB) > 0) hasDoneSomething = true;
            } catch (final IOException e) {}
            if (getConfigBool("cleanup.deletionProcessedNews", true)) {
                this.webIndex.newsPool.clear(yacyNewsPool.PROCESSED_DB);
            }
            if (getConfigBool("cleanup.deletionPublishedNews", true)) {
                this.webIndex.newsPool.clear(yacyNewsPool.PUBLISHED_DB);
            }

            // clean up seed-dbs
            if (getConfigBool("routing.deleteOldSeeds.permission", true)) {
                final long deleteOldSeedsTime = getConfigLong("routing.deleteOldSeeds.time", 7) * 24 * 3600000;
                Iterator<yacySeed> e = this.webIndex.seedDB.seedsSortedDisconnected(true, yacySeed.LASTSEEN);
                yacySeed seed = null;
                final ArrayList<String> deleteQueue = new ArrayList<String>();
                checkInterruption();
                // clean passive seeds
                while (e.hasNext()) {
                    seed = e.next();
                    if (seed != null) {
                        // list is sorted -> break when peers are too young to delete
                        if (seed.getLastSeenUTC() > (System.currentTimeMillis() - deleteOldSeedsTime))
                            break;
                        deleteQueue.add(seed.hash);
                    }
                }
                for (int i = 0; i < deleteQueue.size(); ++i) this.webIndex.seedDB.removeDisconnected(deleteQueue.get(i));
                deleteQueue.clear();
                e = this.webIndex.seedDB.seedsSortedPotential(true, yacySeed.LASTSEEN);
                checkInterruption();
                // clean potential seeds
                while (e.hasNext()) {
                    seed = e.next();
                    if (seed != null) {
                        // list is sorted -> break when peers are too young to delete
                        if (seed.getLastSeenUTC() > (System.currentTimeMillis() - deleteOldSeedsTime))
                            break;
                        deleteQueue.add(seed.hash);
                    }
                }
                for (int i = 0; i < deleteQueue.size(); ++i) this.webIndex.seedDB.removePotential(deleteQueue.get(i));
            }

            // check if an update is available and,
            // if auto-update is activated, perform an automatic installation and restart
            final yacyVersion updateVersion = yacyVersion.rulebasedUpdateInfo(false);
            if (updateVersion != null) {
                // there is a version that is more recent. Load it and re-start with it
                log.logInfo("AUTO-UPDATE: downloading more recent release " + updateVersion.url);
                final File downloaded = yacyVersion.downloadRelease(updateVersion);
                final boolean devenvironment = yacyVersion.combined2prettyVersion(sb.getConfig("version", "0.1")).startsWith("dev");
                if (devenvironment) {
                    log.logInfo("AUTO-UPDATE: omitting update because this is a development environment");
                } else if ((downloaded == null) || (!downloaded.exists()) || (downloaded.length() == 0)) {
                    log.logInfo("AUTO-UPDATE: omitting update because download failed (file cannot be found or is too small)");
                } else {
                    yacyVersion.deployRelease(downloaded);
                    terminate(5000);
                    log.logInfo("AUTO-UPDATE: deploy and restart initiated");
                }
            }

            // initiate broadcast about peer startup to spread supporter url
            if (this.webIndex.newsPool.size(yacyNewsPool.OUTGOING_DB) == 0) {
                // read profile
                final Properties profile = new Properties();
                FileInputStream fileIn = null;
                try {
                    fileIn = new FileInputStream(new File("DATA/SETTINGS/profile.txt"));
                    profile.load(fileIn);
                } catch (final IOException e) {
                } finally {
                    if (fileIn != null) try { fileIn.close(); } catch (final Exception e) {}
                }
                final String homepage = (String) profile.get("homepage");
                if ((homepage != null) && (homepage.length() > 10)) {
                    final Properties news = new Properties();
                    news.put("homepage", profile.get("homepage"));
                    this.webIndex.newsPool.publishMyNews(yacyNewsRecord.newRecord(webIndex.seedDB.mySeed(), yacyNewsPool.CATEGORY_PROFILE_BROADCAST, news));
                }
            }
/*
// set a maximum amount of memory for the caches
// long memprereq = Math.max(getConfigLong(INDEXER_MEMPREREQ, 0), wordIndex.minMem());
// setConfig(INDEXER_MEMPREREQ, memprereq);
// setThreadPerformance(INDEXER, getConfigLong(INDEXER_IDLESLEEP, 0), getConfigLong(INDEXER_BUSYSLEEP, 0), memprereq);
kelondroCachedRecords.setCacheGrowStati(40 * 1024 * 1024, 20 * 1024 * 1024);
kelondroCache.setCacheGrowStati(40 * 1024 * 1024, 20 * 1024 * 1024);
*/
// update the cluster set
this.clusterhashes = this.webIndex.seedDB.clusterHashes(getConfig("cluster.peers.yacydomain", ""));

// after all clean up is done, check the resource usage
observer.resourceObserverJob();

return hasDoneSomething;
} catch (final InterruptedException e) {
    this.log.logInfo("cleanupJob: Shutdown detected");
    return false;
}
}
/**
 * With this function the crawling process can be paused.
 * @param jobType the crawl job type to pause
 */
public void pauseCrawlJob(final String jobType) {
    final Object[] status = this.crawlJobsStatus.get(jobType);
    synchronized (status[plasmaSwitchboardConstants.CRAWLJOB_SYNC]) {
        status[plasmaSwitchboardConstants.CRAWLJOB_STATUS] = Boolean.TRUE;
    }
    setConfig(jobType + "_isPaused", "true");
}
/**
 * Continue the previously paused crawling.
 * @param jobType the crawl job type to continue
 */
public void continueCrawlJob(final String jobType) {
    final Object[] status = this.crawlJobsStatus.get(jobType);
    synchronized (status[plasmaSwitchboardConstants.CRAWLJOB_SYNC]) {
        if (((Boolean) status[plasmaSwitchboardConstants.CRAWLJOB_STATUS]).booleanValue()) {
            status[plasmaSwitchboardConstants.CRAWLJOB_STATUS] = Boolean.FALSE;
            status[plasmaSwitchboardConstants.CRAWLJOB_SYNC].notifyAll();
        }
    }
    setConfig(jobType + "_isPaused", "false");
}
/**
 * @param jobType the crawl job type to check
 * @return <code>true</code> if crawling was paused or <code>false</code> otherwise
 */
public boolean crawlJobIsPaused(final String jobType) {
    final Object[] status = this.crawlJobsStatus.get(jobType);
    synchronized (status[plasmaSwitchboardConstants.CRAWLJOB_SYNC]) {
        return ((Boolean) status[plasmaSwitchboardConstants.CRAWLJOB_STATUS]).booleanValue();
    }
}
public indexingQueueEntry parseDocument(final indexingQueueEntry in) {
    in.queueEntry.updateStatus(IndexingStack.QUEUE_STATE_PARSING);

    // debug
    if (log.isFinest()) log.logFinest("PARSE " + in.queueEntry.toString());

    plasmaParserDocument document = null;
    try {
        document = parseDocument(in.queueEntry);
    } catch (final InterruptedException e) {
        document = null;
    }
    if (document == null) {
        in.queueEntry.close();
        return null;
    }
    return new indexingQueueEntry(in.queueEntry, document, null);
}
private plasmaParserDocument parseDocument(final IndexingStack.QueueEntry entry) throws InterruptedException {
    plasmaParserDocument document = null;
    final int processCase = entry.processCase();

    if (this.log.isFine()) log.logFine("processResourceStack processCase=" + processCase +
            ", depth=" + entry.depth() +
            ", maxDepth=" + ((entry.profile() == null) ? "null" : Integer.toString(entry.profile().generalDepth())) +
            ", filter=" + ((entry.profile() == null) ? "null" : entry.profile().generalFilter()) +
            ", initiatorHash=" + entry.initiator() +
            //", responseHeader=" + ((entry.responseHeader() == null) ? "null" : entry.responseHeader().toString()) +
            ", url=" + entry.url()); // DEBUG

    // PARSE CONTENT
    final long parsingStartTime = System.currentTimeMillis();

    try {
        // parse the document
        document = parser.parseSource(entry.url(), entry.getMimeType(), entry.getCharacterEncoding(), entry.cacheFile());
        assert (document != null) : "Unexpected error. Parser returned null.";
        if (document == null) return null;
    } catch (final ParserException e) {
        this.log.logInfo("Unable to parse the resource '" + entry.url() + "'. " + e.getMessage());
        addURLtoErrorDB(entry.url(), entry.referrerHash(), entry.initiator(), entry.anchorName(), e.getErrorCode());
        if (document != null) {
            document.close();
            document = null;
        }
        return null;
    }
    final long parsingEndTime = System.currentTimeMillis();

    // get the document date
    final Date docDate = entry.getModificationDate();

    // put anchors on crawl stack
    final long stackStartTime = System.currentTimeMillis();
    if (
        ((processCase == plasmaSwitchboardConstants.PROCESSCASE_4_PROXY_LOAD) || (processCase == plasmaSwitchboardConstants.PROCESSCASE_5_LOCAL_CRAWLING)) &&
        ((entry.profile() == null) || (entry.depth() < entry.profile().generalDepth()))
    ) {
        final Map<yacyURL, String> hl = document.getHyperlinks();
        final Iterator<Map.Entry<yacyURL, String>> i = hl.entrySet().iterator();
        yacyURL nextUrl;
        Map.Entry<yacyURL, String> nextEntry;
        while (i.hasNext()) {
            // check for interruption
            checkInterruption();
            // fetching the next hyperlink
            nextEntry = i.next();
            nextUrl = nextEntry.getKey();
            // enqueue the hyperlink into the pre-notice-url db
            crawlStacker.enqueueEntry(nextUrl, entry.urlHash(), entry.initiator(), nextEntry.getValue(), docDate, entry.depth() + 1, entry.profile());
        }
        final long stackEndTime = System.currentTimeMillis();
        if (log.isInfo()) log.logInfo("CRAWL: ADDED " + hl.size() + " LINKS FROM " + entry.url().toNormalform(false, true) +
                ", NEW CRAWL STACK SIZE IS " + crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_CORE) +
                ", STACKING TIME = " + (stackEndTime - stackStartTime) +
                ", PARSING TIME = " + (parsingEndTime - parsingStartTime));
    }
    return document;
}
public indexingQueueEntry condenseDocument(final indexingQueueEntry in) {
    in.queueEntry.updateStatus(IndexingStack.QUEUE_STATE_CONDENSING);

    // debug
    if (log.isFinest()) log.logFinest("CONDENSE " + in.queueEntry.toString());

    plasmaCondenser condenser = null;
    try {
        condenser = condenseDocument(in.queueEntry, in.document);
    } catch (final InterruptedException e) {
        condenser = null;
    }
    if (condenser == null) {
        in.queueEntry.close();
        return null;
    }

    // update image result list statistics;
    // it is good to do this concurrently here, because it needs a DNS lookup
    // to compute a URL hash which is necessary for a double-check
    final CrawlProfile.entry profile = in.queueEntry.profile();
    ResultImages.registerImages(in.document, (profile == null) ? true : !profile.remoteIndexing());

    return new indexingQueueEntry(in.queueEntry, in.document, condenser);
}
private plasmaCondenser condenseDocument(final IndexingStack.QueueEntry entry, plasmaParserDocument document) throws InterruptedException {
    // CREATE INDEX
    final String dc_title = document.dc_title();
    final yacyURL referrerURL = entry.referrerURL();
    final int processCase = entry.processCase();

    String noIndexReason = ErrorURL.DENIED_UNSPECIFIED_INDEXING_ERROR;
    if (processCase == plasmaSwitchboardConstants.PROCESSCASE_4_PROXY_LOAD) {
        // proxy-load
        noIndexReason = entry.shallIndexCacheForProxy();
    } else {
        // normal crawling
        noIndexReason = entry.shallIndexCacheForCrawler();
    }
    if (noIndexReason != null) {
        // check for interruption
        checkInterruption();
        log.logFine("Not indexed any word in URL " + entry.url() + "; cause: " + noIndexReason);
        addURLtoErrorDB(entry.url(), (referrerURL == null) ? "" : referrerURL.hash(), entry.initiator(), dc_title, noIndexReason);
        /*
        if ((processCase == PROCESSCASE_6_GLOBAL_CRAWLING) && (initiatorPeer != null)) {
            if (clusterhashes != null) initiatorPeer.setAlternativeAddress((String) clusterhashes.get(initiatorPeer.hash));
            yacyClient.crawlReceipt(initiatorPeer, "crawl", "rejected", noIndexReason, null, "");
        }
        */
        document.close();
        document = null;
        return null;
    }
    // strip out words
    checkInterruption();
    if (this.log.isFine()) log.logFine("Condensing for '" + entry.url().toNormalform(false, true) + "'");
    plasmaCondenser condenser;
    try {
        condenser = new plasmaCondenser(document, entry.profile().indexText(), entry.profile().indexMedia());
    } catch (final UnsupportedEncodingException e) {
        return null;
    }
    return condenser;
}
public indexingQueueEntry webStructureAnalysis(final indexingQueueEntry in) {
    in.queueEntry.updateStatus(IndexingStack.QUEUE_STATE_STRUCTUREANALYSIS);
    in.document.notifyWebStructure(webStructure, in.condenser, in.queueEntry.getModificationDate());
    return in;
}
public void storeDocumentIndex(final indexingQueueEntry in) {
    in.queueEntry.updateStatus(IndexingStack.QUEUE_STATE_INDEXSTORAGE);
    storeDocumentIndex(in.queueEntry, in.document, in.condenser);
    in.queueEntry.updateStatus(IndexingStack.QUEUE_STATE_FINISHED);
    in.queueEntry.close();
}
private void storeDocumentIndex(final IndexingStack.QueueEntry queueEntry, final plasmaParserDocument document, final plasmaCondenser condenser) {
    // CREATE INDEX
    final String dc_title = document.dc_title();
    final yacyURL referrerURL = queueEntry.referrerURL();
    final int processCase = queueEntry.processCase();

    // remove stopwords
    log.logInfo("Excluded " + condenser.excludeWords(stopwords) + " words in URL " + queueEntry.url());

    // STORE URL TO LOADED-URL-DB
    indexURLReference newEntry = null;
    try {
        newEntry = webIndex.storeDocument(queueEntry, document, condenser);
    } catch (final IOException e) {
        if (this.log.isFine()) log.logFine("Not Indexed Resource '" + queueEntry.url().toNormalform(false, true) + "': process case=" + processCase);
        addURLtoErrorDB(queueEntry.url(), referrerURL.hash(), queueEntry.initiator(), dc_title, "error storing url: " + e.getMessage());
        return;
    }

    // update url result list statistics
    crawlResults.stack(
            newEntry,                           // loaded url db entry
            queueEntry.initiator(),             // initiator peer hash
            this.webIndex.seedDB.mySeed().hash, // executor peer hash
            processCase                         // process case
    );

    // STORE WORD INDEX
    if ((!queueEntry.profile().indexText()) && (!queueEntry.profile().indexMedia())) {
        if (this.log.isFine()) log.logFine("Not Indexed Resource '" + queueEntry.url().toNormalform(false, true) + "': process case=" + processCase);
        addURLtoErrorDB(queueEntry.url(), referrerURL.hash(), queueEntry.initiator(), dc_title, ErrorURL.DENIED_UNKNOWN_INDEXING_PROCESS_CASE);
        return;
    }

    // increment number of indexed urls
    indexedPages++;

    // update profiling info
    if (System.currentTimeMillis() - lastPPMUpdate > 30000) {
        // we don't want to do this too often
        updateMySeed();
        serverProfiling.update("ppm", Long.valueOf(currentPPM()));
        lastPPMUpdate = System.currentTimeMillis();
    }
    serverProfiling.update("indexed", queueEntry.url().toNormalform(true, false));

    // if this was performed for a remote crawl request, notify requester
    final yacySeed initiatorPeer = queueEntry.initiatorPeer();
    if ((processCase == plasmaSwitchboardConstants.PROCESSCASE_6_GLOBAL_CRAWLING) && (initiatorPeer != null)) {
        log.logInfo("Sending crawl receipt for '" + queueEntry.url().toNormalform(false, true) + "' to " + initiatorPeer.getName());
        if (clusterhashes != null) initiatorPeer.setAlternativeAddress(clusterhashes.get(initiatorPeer.hash));
        // start a thread for receipt sending to avoid blocking here
        new Thread(new receiptSending(initiatorPeer, newEntry)).start();
    }
}
public class receiptSending implements Runnable {
    yacySeed initiatorPeer;
    indexURLReference reference;

    public receiptSending(final yacySeed initiatorPeer, final indexURLReference reference) {
        this.initiatorPeer = initiatorPeer;
        this.reference = reference;
    }

    public void run() {
        yacyClient.crawlReceipt(webIndex.seedDB.mySeed(), initiatorPeer, "crawl", "fill", "indexed", reference, "");
    }
}
private static SimpleDateFormat DateFormatter = new SimpleDateFormat("EEE, dd MMM yyyy");
public static String dateString(final Date date) {
    if (date == null) return "";
    return DateFormatter.format(date);
}

// we need locale independent RFC-822 dates at some places
private static SimpleDateFormat DateFormatter822 = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss Z", Locale.US);
public static String dateString822(final Date date) {
    if (date == null) return "";
    return DateFormatter822.format(date);
}
public serverObjects action(final String actionName, final serverObjects actionInput) {
    // perform an action. (not used)
    return null;
}
public String toString() {
    // it is possible to use this method in the cgi pages;
    // actually it is used there for testing purposes
    return "PROPS: " + super.toString() + "; QUEUE: " + webIndex.queuePreStack.toString();
}
// method for index deletion
public int removeAllUrlReferences(final yacyURL url, final boolean fetchOnline) {
    return removeAllUrlReferences(url.hash(), fetchOnline);
}
public int removeAllUrlReferences(final String urlhash, final boolean fetchOnline) {
    // find all the words in a specific resource and remove the url reference from every word index;
    // finally, delete the url entry

    // determine the url string
    final indexURLReference entry = webIndex.getURL(urlhash, null, 0);
    if (entry == null) return 0;
    final indexURLReference.Components comp = entry.comp();
    if (comp.url() == null) return 0;

    InputStream resourceContent = null;
    try {
        // get the resource content
        final Object[] resource = plasmaSnippetCache.getResource(comp.url(), fetchOnline, 10000, true, false);
        resourceContent = (InputStream) resource[0];
        final Long resourceContentLength = (Long) resource[1];

        // parse the resource
        final plasmaParserDocument document = plasmaSnippetCache.parseDocument(comp.url(), resourceContentLength.longValue(), resourceContent);

        // get the word set
        Set<String> words = null;
        try {
            words = new plasmaCondenser(document, true, true).words().keySet();
        } catch (final UnsupportedEncodingException e) {
            e.printStackTrace();
        }

        // delete all word references
        int count = 0;
        if (words != null) count = webIndex.removeWordReferences(words, urlhash);

        // finally delete the url entry itself
        webIndex.removeURL(urlhash);
        return count;
    } catch (final ParserException e) {
        return 0;
    } finally {
        if (resourceContent != null) try { resourceContent.close(); } catch (final Exception e) { /* ignore this */ }
    }
}
public int adminAuthenticated(final httpRequestHeader requestHeader) {

    // authorization for localhost, only if flag is set to grant localhost access as admin
    final String clientIP = (String) requestHeader.get(httpRequestHeader.CONNECTION_PROP_CLIENTIP, "");
    final String refererHost = requestHeader.refererHost();
    final boolean accessFromLocalhost = serverCore.isLocalhost(clientIP) && (refererHost.length() == 0 || serverCore.isLocalhost(refererHost));
    if (getConfigBool("adminAccountForLocalhost", false) && accessFromLocalhost) return 3; // soft-authenticated for localhost

    // get the authorization string from the header
    final String authorization = ((String) requestHeader.get(httpRequestHeader.AUTHORIZATION, "xxxxxx")).trim().substring(6);

    // security check against too long authorization strings
    if (authorization.length() > 256) return 0;

    // authorization by encoded password, only for localhost access
    final String adminAccountBase64MD5 = getConfig(httpd.ADMIN_ACCOUNT_B64MD5, "");
    if (accessFromLocalhost && (adminAccountBase64MD5.equals(authorization))) return 3; // soft-authenticated for localhost

    // authorization by hit in userDB
    if (userDB.hasAdminRight((String) requestHeader.get(httpRequestHeader.AUTHORIZATION, "xxxxxx"), ((String) requestHeader.get(httpRequestHeader.CONNECTION_PROP_CLIENTIP, "")), requestHeader.getHeaderCookies())) return 4; // return, because 4 = max

    // authorization with admin keyword in configuration
    return httpd.staticAdminAuthenticated(authorization, this);
}
public boolean verifyAuthentication(final httpRequestHeader header, final boolean strict) {
    // handle access rights
    switch (adminAuthenticated(header)) {
    case 0: // wrong password given
        try { Thread.sleep(3000); } catch (final InterruptedException e) {} // prevent brute-force
        return false;
    case 1: // no password given
        return false;
    case 2: // no password stored
        return !strict;
    case 3: // soft-authenticated for localhost only
        return true;
    case 4: // hard-authenticated, all ok
        return true;
    }
    return false;
}
public void setPerformance(int wantedPPM) {
    // we consider 3 cases here:
    //   wantedPPM <= 10:        low performance
    //   10 < wantedPPM < 6000:  custom performance
    //   6000 <= wantedPPM:      maximum performance
    if (wantedPPM <= 10) wantedPPM = 10;
    if (wantedPPM >= 6000) wantedPPM = 6000;
    final int newBusySleep = 60000 / wantedPPM; // for wantedPPM = 10: 6000; for wantedPPM = 1000: 60

    serverBusyThread thread;

    thread = getThread(plasmaSwitchboardConstants.INDEX_DIST);
    if (thread != null) {
        setConfig(plasmaSwitchboardConstants.INDEX_DIST_BUSYSLEEP, thread.setBusySleep(Math.max(2000, thread.setBusySleep(newBusySleep * 2))));
        thread.setIdleSleep(30000);
    }

    thread = getThread(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL);
    if (thread != null) {
        setConfig(plasmaSwitchboardConstants.CRAWLJOB_LOCAL_CRAWL_BUSYSLEEP, thread.setBusySleep(newBusySleep));
        thread.setIdleSleep(2000);
    }

    thread = getThread(plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE);
    if (thread != null) {
        setConfig(plasmaSwitchboardConstants.PROXY_CACHE_ENQUEUE_BUSYSLEEP, thread.setBusySleep(0));
        thread.setIdleSleep(2000);
    }

    thread = getThread(plasmaSwitchboardConstants.INDEXER);
    if (thread != null) {
        setConfig(plasmaSwitchboardConstants.INDEXER_BUSYSLEEP, thread.setBusySleep(newBusySleep / 8));
        thread.setIdleSleep(2000);
    }
}
public static int accessFrequency(final HashMap<String, TreeSet<Long>> tracker, final String host) {
    // returns the access frequency in queries per hour for a given host and a specific tracker
    final long timeInterval = 1000 * 60 * 60;
    final TreeSet<Long> accessSet = tracker.get(host);
    if (accessSet == null) return 0;
    return accessSet.tailSet(Long.valueOf(System.currentTimeMillis() - timeInterval)).size();
}
public void startTransferWholeIndex(final yacySeed seed, final boolean delete) {
    if (transferIdxThread == null) {
        this.transferIdxThread = new plasmaDHTFlush(this.log, this.webIndex, seed, delete,
                "true".equalsIgnoreCase(getConfig(plasmaSwitchboardConstants.INDEX_TRANSFER_GZIP_BODY, "false")),
                (int) getConfigLong(plasmaSwitchboardConstants.INDEX_TRANSFER_TIMEOUT, 60000));
        this.transferIdxThread.start();
    }
}
public void stopTransferWholeIndex(final boolean wait) {
    if ((transferIdxThread != null) && (transferIdxThread.isAlive()) && (!transferIdxThread.isFinished())) {
        try {
            this.transferIdxThread.stopIt(wait);
        } catch (final InterruptedException e) {}
    }
}
public void abortTransferWholeIndex(final boolean wait) {
    if (transferIdxThread != null) {
        if (!transferIdxThread.isFinished())
            try {
                this.transferIdxThread.stopIt(wait);
            } catch (final InterruptedException e) {}
        transferIdxThread = null;
    }
}
public String dhtShallTransfer() {
    if (this.webIndex.seedDB == null) {
        return "no DHT distribution: seedDB == null";
    }
    if (this.webIndex.seedDB.mySeed() == null) {
        return "no DHT distribution: mySeed == null";
    }
    if (this.webIndex.seedDB.mySeed().isVirgin()) {
        return "no DHT distribution: status is virgin";
    }
    if (this.webIndex.seedDB.noDHTActivity()) {
        return "no DHT distribution: network too small";
    }
    if (!this.getConfigBool("network.unit.dht", true)) {
        return "no DHT distribution: disabled by network.unit.dht";
    }
    if (getConfig(plasmaSwitchboardConstants.INDEX_DIST_ALLOW, "false").equalsIgnoreCase("false")) {
        return "no DHT distribution: not enabled (user setting)";
    }
    if (webIndex.countURL() < 10) {
        return "no DHT distribution: loadedURL.size() = " + webIndex.countURL();
    }
    if (webIndex.size() < 100) {
        return "no DHT distribution: not enough words - wordIndex.size() = " + webIndex.size();
    }
    if ((getConfig(plasmaSwitchboardConstants.INDEX_DIST_ALLOW_WHILE_CRAWLING, "false").equalsIgnoreCase("false")) && (crawlQueues.noticeURL.notEmpty())) {
        return "no DHT distribution: crawl in progress: noticeURL.stackSize() = " + crawlQueues.noticeURL.size() + ", sbQueue.size() = " + webIndex.queuePreStack.size();
    }
    if ((getConfig(plasmaSwitchboardConstants.INDEX_DIST_ALLOW_WHILE_INDEXING, "false").equalsIgnoreCase("false")) && (webIndex.queuePreStack.size() > 1)) {
        return "no DHT distribution: indexing in progress: noticeURL.stackSize() = " + crawlQueues.noticeURL.size() + ", sbQueue.size() = " + webIndex.queuePreStack.size();
    }
    return null; // this means: yes, please do dht transfer
}
public boolean dhtTransferJob ( ) {
2008-08-02 14:12:04 +02:00
final String rejectReason = dhtShallTransfer ( ) ;
2006-02-22 00:08:07 +01:00
if ( rejectReason ! = null ) {
2008-04-08 16:44:39 +02:00
if ( this . log . isFine ( ) ) log . logFine ( rejectReason ) ;
2006-02-21 00:27:11 +01:00
return false ;
}
2006-02-21 00:57:50 +01:00
if ( this . dhtTransferChunk = = null ) {
2008-04-08 16:44:39 +02:00
if ( this . log . isFine ( ) ) log . logFine ( " no DHT distribution: no transfer chunk defined " ) ;
2006-02-21 00:57:50 +01:00
return false ;
}
if ( ( this . dhtTransferChunk ! = null ) & & ( this . dhtTransferChunk . getStatus ( ) ! = plasmaDHTChunk . chunkStatus_FILLED ) ) {
2008-04-08 16:44:39 +02:00
if ( this . log . isFine ( ) ) log . logFine ( " no DHT distribution: index distribution is in progress, status= " + this . dhtTransferChunk . getStatus ( ) ) ;
2006-02-21 00:57:50 +01:00
return false ;
}
        // do the transfer
        final int peerCount = Math.max(1, (this.webIndex.seedDB.mySeed().isJunior())
                ? (int) getConfigLong("network.unit.dhtredundancy.junior", 1)
                : (int) getConfigLong("network.unit.dhtredundancy.senior", 1)); // set redundancy factor
        final long starttime = System.currentTimeMillis();

        final boolean ok = dhtTransferProcess(dhtTransferChunk, peerCount);

        final boolean success;
        if (ok) {
            dhtTransferChunk.setStatus(plasmaDHTChunk.chunkStatus_COMPLETE);
            if (this.log.isFine()) log.logFine("DHT distribution: transfer COMPLETE");
            // adapt the transfer count: shrink the chunk if the transfer was slow, otherwise grow it
            if ((System.currentTimeMillis() - starttime) > (10000 * peerCount)) {
                dhtTransferIndexCount--;
            } else {
                if (dhtTransferChunk.indexCount() >= dhtTransferIndexCount) dhtTransferIndexCount++;
            }
            final int minChunkSize = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_CHUNK_SIZE_MIN, 30);
            final int maxChunkSize = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_CHUNK_SIZE_MAX, 3000);
            if (dhtTransferIndexCount < minChunkSize) dhtTransferIndexCount = minChunkSize;
            if (dhtTransferIndexCount > maxChunkSize) dhtTransferIndexCount = maxChunkSize;
            // show success
            success = true;
        } else {
            dhtTransferChunk.incTransferFailedCounter();
            final int maxChunkFails = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_CHUNK_FAILS_MAX, 1);
            if (dhtTransferChunk.getTransferFailedCounter() >= maxChunkFails) {
                //System.out.println("DEBUG: " + dhtTransferChunk.getTransferFailedCounter() + " of " + maxChunkFails + " sendings failed for this chunk, aborting!");
                dhtTransferChunk.setStatus(plasmaDHTChunk.chunkStatus_FAILED);
                if (this.log.isFine()) log.logFine("DHT distribution: transfer FAILED");
            } else {
                //System.out.println("DEBUG: " + dhtTransferChunk.getTransferFailedCounter() + " of " + maxChunkFails + " sendings failed for this chunk, retrying!");
                if (this.log.isFine()) log.logFine("DHT distribution: transfer FAILED, sending this chunk again");
            }
            success = false;
        }
        return success;
    }

    public boolean dhtTransferProcess(final plasmaDHTChunk dhtChunk, final int peerCount) {
        if ((this.webIndex.seedDB == null) || (this.webIndex.seedDB.sizeConnected() == 0)) return false;

        try {
            // find a list of DHT-peers
            final double maxDist = 0.2;
            final ArrayList<yacySeed> seeds = webIndex.peerActions.dhtAction.getDHTTargets(
                    webIndex.seedDB, log, peerCount,
                    Math.min(8, (int) (this.webIndex.seedDB.sizeConnected() * maxDist)),
                    dhtChunk.firstContainer().getWordHash(),
                    dhtChunk.lastContainer().getWordHash(),
                    maxDist);
            if (seeds.size() < peerCount) {
                log.logWarning("did not find enough (" + seeds.size() + ") peers for distribution of dhtchunk [" + dhtChunk.firstContainer().getWordHash() + " .. " + dhtChunk.lastContainer().getWordHash() + "]");
                return false;
            }

            // send away the indexes to all these peers
            int hc1 = 0;

            // getting distribution configuration values
            final boolean gzipBody = getConfig(plasmaSwitchboardConstants.INDEX_DIST_GZIP_BODY, "false").equalsIgnoreCase("true");
            final int timeout = (int) getConfigLong(plasmaSwitchboardConstants.INDEX_DIST_TIMEOUT, 60000);
            final int retries = 0;

            // starting up multiple DHT transfer threads
            final Iterator<yacySeed> seedIter = seeds.iterator();
            final ArrayList<plasmaDHTTransfer> transfer = new ArrayList<plasmaDHTTransfer>(peerCount);
            while (hc1 < peerCount && (transfer.size() > 0 || seedIter.hasNext())) {

                // starting up some transfer threads
                final int transferThreadCount = transfer.size();
                for (int i = 0; i < peerCount - hc1 - transferThreadCount; i++) {
                    // check for interruption
                    checkInterruption();
                    if (seedIter.hasNext()) {
                        final plasmaDHTTransfer t = new plasmaDHTTransfer(log, webIndex.seedDB, webIndex.peerActions, seedIter.next(), dhtChunk, gzipBody, timeout, retries);
                        t.start();
                        transfer.add(t);
                    } else {
                        break;
                    }
                }

                // waiting for the transfer threads to finish
                final Iterator<plasmaDHTTransfer> transferIter = transfer.iterator();
                while (transferIter.hasNext()) {
                    // check for interruption
                    checkInterruption();
                    final plasmaDHTTransfer t = transferIter.next();
                    if (!t.isAlive()) {
                        // remove finished thread from the list
                        transferIter.remove();
                        // count successful transfers
                        if (t.getStatus() == plasmaDHTChunk.chunkStatus_COMPLETE) {
                            this.log.logInfo("DHT distribution: transfer to peer " + t.getSeed().getName() + " finished.");
                            hc1++;
                        }
                    }
                }
                if (hc1 < peerCount) Thread.sleep(100);
            }

            // clean up and finish with deletion of indexes
            if (hc1 >= peerCount) {
                // success
                return true;
            }
            this.log.logSevere("Index distribution failed. Too few peers (" + hc1 + ") received the index; it was not deleted locally.");
            return false;
        } catch (final InterruptedException e) {
            return false;
        }
    }

    private void addURLtoErrorDB(
            final yacyURL url,
            final String referrerHash,
            final String initiator,
            final String name,
            final String failreason) {
        assert initiator != null;
        // create a new errorURL DB entry
        final CrawlEntry bentry = new CrawlEntry(
                initiator,
                url,
                referrerHash,
                (name == null) ? "" : name,
                new Date(),
                null,
                0,
                0,
                0);
        final ZURL.Entry ee = crawlQueues.errorURL.newEntry(
                bentry, initiator, new Date(),
                0, failreason);
        // store the entry
        ee.store();
        // push it onto the stack
        crawlQueues.errorURL.push(ee);
    }

    public int currentPPM() {
        final long uptime = (System.currentTimeMillis() - serverCore.startupTime) / 1000;
        final long uptimediff = uptime - lastseedcheckuptime;
        final long indexedcdiff = indexedPages - lastindexedPages;
        totalPPM = (int) (indexedPages * 60 / Math.max(uptime, 1)); // overall pages per minute since startup
        return Math.round(Math.max(indexedcdiff, 0f) * 60f / Math.max(uptimediff, 1f)); // pages per minute since the last seed check
    }

    public void updateMySeed() {
        if (getConfig("peerName", "anomic").equals("anomic")) {
            // generate new peer name
            setConfig("peerName", yacySeed.makeDefaultPeerName());
        }
        webIndex.seedDB.mySeed().put(yacySeed.NAME, getConfig("peerName", "nameless"));
        webIndex.seedDB.mySeed().put(yacySeed.PORT, Integer.toString(serverCore.getPortNr(getConfig("port", "8080"))));
        final long uptime = (System.currentTimeMillis() - serverCore.startupTime) / 1000;
        final long uptimediff = uptime - lastseedcheckuptime;
        final long indexedcdiff = indexedPages - lastindexedPages;
        //double requestcdiff = requestedQueries - lastrequestedQueries;
        if (uptimediff > 300 || uptimediff <= 0 || lastseedcheckuptime == -1) {
            lastseedcheckuptime = uptime;
            lastindexedPages = indexedPages;
            lastrequestedQueries = requestedQueries;
        }

        // the speed of indexing (pages/minute) of the peer
        totalPPM = (int) (indexedPages * 60 / Math.max(uptime, 1));
        webIndex.seedDB.mySeed().put(yacySeed.ISPEED, Long.toString(Math.round(Math.max(indexedcdiff, 0f) * 60f / Math.max(uptimediff, 1f))));
        totalQPM = requestedQueries * 60d / Math.max(uptime, 1d);
        webIndex.seedDB.mySeed().put(yacySeed.RSPEED, Double.toString(totalQPM /*Math.max((float) requestcdiff, 0f) * 60f / Math.max((float) uptimediff, 1f)*/));
        webIndex.seedDB.mySeed().put(yacySeed.UPTIME, Long.toString(uptime / 60)); // the number of minutes that the peer has been up
        webIndex.seedDB.mySeed().put(yacySeed.LCOUNT, Integer.toString(webIndex.countURL())); // the number of links that the peer has stored (LURLs)
        webIndex.seedDB.mySeed().put(yacySeed.NCOUNT, Integer.toString(crawlQueues.noticeURL.size())); // the number of links that the peer has noticed, but not yet loaded (NURLs)
        webIndex.seedDB.mySeed().put(yacySeed.RCOUNT, Integer.toString(crawlQueues.noticeURL.stackSize(NoticedURL.STACK_TYPE_LIMIT))); // the number of links that the peer provides for remote crawling
        webIndex.seedDB.mySeed().put(yacySeed.ICOUNT, Integer.toString(webIndex.size())); // the minimum number of words that the peer has indexed (as it says)
        webIndex.seedDB.mySeed().put(yacySeed.SCOUNT, Integer.toString(webIndex.seedDB.sizeConnected())); // the number of seeds that the peer has stored
        webIndex.seedDB.mySeed().put(yacySeed.CCOUNT, Double.toString(((int) ((webIndex.seedDB.sizeConnected() + webIndex.seedDB.sizeDisconnected() + webIndex.seedDB.sizePotential()) * 60.0 / (uptime + 1.01)) * 100) / 100.0)); // the number of clients that the peer connects to (as connects/hour)
        webIndex.seedDB.mySeed().put(yacySeed.VERSION, getConfig("version", ""));
        webIndex.seedDB.mySeed().setFlagDirectConnect(true);
        webIndex.seedDB.mySeed().setLastSeenUTC();
        webIndex.seedDB.mySeed().put(yacySeed.UTC, serverDate.UTCDiffString());
        webIndex.seedDB.mySeed().setFlagAcceptRemoteCrawl(getConfig("crawlResponse", "").equals("true"));
        webIndex.seedDB.mySeed().setFlagAcceptRemoteIndex(getConfig("allowReceiveIndex", "").equals("true"));
        //mySeed.setFlagAcceptRemoteIndex(true);
    }

    public void loadSeedLists() {
        // uses the superseed to initialize the database with known seeds
        yacySeed ys;
        String seedListFileURL;
        yacyURL url;
        ArrayList<String> seedList;
        Iterator<String> enu;
        int lc;
        final int sc = webIndex.seedDB.sizeConnected();
        httpResponseHeader header;

        yacyCore.log.logInfo("BOOTSTRAP: " + sc + " seeds known from previous run");
        // use the superseed to further fill up the seedDB
        int ssc = 0, c = 0;
        while (true) {
            if (Thread.currentThread().isInterrupted()) break;
            seedListFileURL = sb.getConfig("network.unit.bootstrap.seedlist" + c, "");
            if (seedListFileURL.length() == 0) break;
            c++;
            if (seedListFileURL.startsWith("http://") || seedListFileURL.startsWith("https://")) {
                // load the seed list
                try {
                    final httpRequestHeader reqHeader = new httpRequestHeader();
                    reqHeader.put(httpRequestHeader.PRAGMA, "no-cache");
                    reqHeader.put(httpRequestHeader.CACHE_CONTROL, "no-cache");
                    reqHeader.put(httpRequestHeader.USER_AGENT, HTTPLoader.yacyUserAgent);
                    url = new yacyURL(seedListFileURL, null);
                    final long start = System.currentTimeMillis();
                    header = HttpClient.whead(url.toString(), reqHeader);
                    final long loadtime = System.currentTimeMillis() - start;
                    if (header == null) {
                        if (loadtime > getConfigLong("bootstrapLoadTimeout", 6000)) {
                            yacyCore.log.logWarning("BOOTSTRAP: seed-list URL " + seedListFileURL + " not available, time-out after " + loadtime + " milliseconds");
                        } else {
                            yacyCore.log.logWarning("BOOTSTRAP: seed-list URL " + seedListFileURL + " not available, no content");
                        }
                    } else if (header.lastModified() == null) {
                        yacyCore.log.logWarning("BOOTSTRAP: seed-list URL " + seedListFileURL + " not usable, last-modified is missing");
                    } else if ((header.age() > 86400000) && (ssc > 0)) {
                        yacyCore.log.logInfo("BOOTSTRAP: seed-list URL " + seedListFileURL + " too old (" + (header.age() / 86400000) + " days)");
                    } else {
                        ssc++;
                        final byte[] content = HttpClient.wget(url.toString(), reqHeader, (int) getConfigLong("bootstrapLoadTimeout", 20000));
                        seedList = nxTools.strings(content, "UTF-8");
                        enu = seedList.iterator();
                        lc = 0;
                        while (enu.hasNext()) {
                            ys = yacySeed.genRemoteSeed(enu.next(), null, false);
                            if ((ys != null) &&
                                ((!webIndex.seedDB.mySeedIsDefined()) || !webIndex.seedDB.mySeed().hash.equals(ys.hash))) {
                                if (webIndex.peerActions.connectPeer(ys, false)) lc++;
                                //seedDB.writeMap(ys.hash, ys.getMap(), "init");
                                //System.out.println("BOOTSTRAP: received peer " + ys.get(yacySeed.NAME, "anonymous") + "/" + ys.getAddress());
                                //lc++;
                            }
                        }
                        yacyCore.log.logInfo("BOOTSTRAP: " + lc + " seeds from seed-list URL " + seedListFileURL + ", AGE=" + (header.age() / 3600000) + " h");
                    }
                } catch (final IOException e) {
                    // this is when wget fails, commonly because of a timeout
                    yacyCore.log.logWarning("BOOTSTRAP: failed (1) to load seeds from seed-list URL " + seedListFileURL + ": " + e.getMessage());
                } catch (final Exception e) {
                    // this is when wget fails; may be because of a missing internet connection
                    yacyCore.log.logSevere("BOOTSTRAP: failed (2) to load seeds from seed-list URL " + seedListFileURL + ": " + e.getMessage(), e);
                }
            }
        }
        yacyCore.log.logInfo("BOOTSTRAP: " + (webIndex.seedDB.sizeConnected() - sc) + " new seeds while bootstrapping.");
    }

    public void checkInterruption() throws InterruptedException {
        final Thread curThread = Thread.currentThread();
        if ((curThread instanceof serverThread) && ((serverThread) curThread).shutdownInProgress()) throw new InterruptedException("Shutdown in progress ...");
        else if (this.terminate || curThread.isInterrupted()) throw new InterruptedException("Shutdown in progress ...");
    }

    public void terminate(final long delay) {
        if (delay <= 0) throw new IllegalArgumentException("The shutdown delay must be greater than 0.");
        (new delayedShutdown(this, delay)).start();
    }

    public void terminate() {
        this.terminate = true;
        this.shutdownSync.V();
    }

    public boolean isTerminated() {
        return this.terminate;
    }

    public boolean waitForShutdown() throws InterruptedException {
        this.shutdownSync.P();
        return this.terminate;
    }

    /**
     * loads the content of the given URL as a Map
     *
     * Lines like abc=123 are parsed as the pair: abc => 123
     *
     * @param url the URL of the key=value list to load
     * @return the parsed map; empty if loading or parsing fails
     */
    public static Map<String, String> loadHashMap(final yacyURL url) {
        try {
            // sending request
            final httpRequestHeader reqHeader = new httpRequestHeader();
            reqHeader.put(httpRequestHeader.USER_AGENT, HTTPLoader.yacyUserAgent);
            final HashMap<String, String> result = nxTools.table(HttpClient.wget(url.toString(), reqHeader, 10000), "UTF-8");
            if (result == null) return new HashMap<String, String>();
            return result;
        } catch (final Exception e) {
            return new HashMap<String, String>();
        }
    }
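
    /*
     * Usage sketch (editorial addition, not from the original sources): since
     * loadHashMap() swallows all errors and falls back to an empty map, callers
     * can read keys directly without a null check on the map itself, e.g.
     *
     *     final Map<String, String> profile = plasmaSwitchboard.loadHashMap(peerProfileURL);
     *     final String name = profile.get("name"); // null only if the key is absent
     *
     * Here peerProfileURL is a hypothetical yacyURL pointing at a key=value list.
     */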
}

class MoreMemory extends TimerTask {
    public final void run() {
        serverMemory.gc(10000, "MoreMemory()");
    }
}

class delayedShutdown extends Thread {
    private final plasmaSwitchboard sb;
    private final long delay;

    public delayedShutdown(final plasmaSwitchboard sb, final long delay) {
        this.sb = sb;
        this.delay = delay;
    }

    public void run() {
        try {
            Thread.sleep(delay);
        } catch (final InterruptedException e) {
            sb.getLog().logInfo("interrupted delayed shutdown");
        }
        this.sb.terminate();
    }
}