// plasmaSwitchboard.java
// -----------------------
// part of YaCy
// (C) by Michael Peter Christen; mc@anomic.de
// first published on http://www.anomic.de
// Frankfurt, Germany, 2004, 2005
//
// $LastChangedDate$
// $LastChangedRevision$
// $LastChangedBy$
//
// This program is free software; you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation; either version 2 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//
// Using this software in any meaning (reading, learning, copying, compiling,
// running) means that you agree that the Author(s) is (are) not responsible
// for cost, loss of data or any harm that may be caused directly or indirectly
// by usage of this software or this documentation. The usage of this software
// is at your own risk. The installation and usage (starting/running) of this
// software may allow other people or applications to access your computer and
// any attached devices and is highly dependent on the configuration of the
// software which must be done by the user of the software; the author(s) is
// (are) also not responsible for proper configuration and usage of the
// software, even if provoked by documentation provided together with
// the software.
//
// Any changes to this file according to the GPL as documented in the file
// gpl.txt aside this file in the shipment you received can be done to the
// lines that follow this copyright notice here, but changes must not be
// done inside the copyright notice above. A re-distribution must contain
// the intact and unchanged copyright notice.
// Contributions and changes to the program code must be marked as such.
/*
This class holds the run-time environment of the plasma
Search Engine. Its data forms a blackboard which can be used
to organize running jobs around the indexing algorithm.
The blackboard consists of the following entities:
- storage: one plasmaStore object with the url-based database
- configuration: initialized by properties once, then by external functions
- job queues: for parsing, condensing, indexing
- black/blue/whitelists: controls input and output to the index

This class is also the core of the http crawling.
There are some items that need to be respected when crawling the web:
1) respect robots.txt
2) do not access one domain too frequently, wait between accesses
3) remember crawled URLs and do not access them again too early
4) prioritization of specific links should be possible (hot-lists)
5) attributes for crawling (depth, filters, hot/black-lists, priority)
6) different crawling jobs with different attributes ('Orders') simultaneously
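
A minimal sketch of points 2) and 3) above: a per-domain politeness gate
with a revisit window. The class name and delay values are illustrative
assumptions, not part of YaCy, and it is written with generics for brevity
(unlike the 1.4-era raw types used elsewhere in this file):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: tracks last access per domain and per URL so a
// crawler can wait between accesses to one host (2) and skip URLs it
// fetched recently (3).
class PolitenessGate {
    private final Map<String, Long> domainLastAccess = new HashMap<>();
    private final Map<String, Long> urlLastSeen = new HashMap<>();
    private final long minDomainDelayMillis; // e.g. 500 ms between hits on one host
    private final long minRevisitMillis;     // e.g. one day before re-fetching a URL

    PolitenessGate(long minDomainDelayMillis, long minRevisitMillis) {
        this.minDomainDelayMillis = minDomainDelayMillis;
        this.minRevisitMillis = minRevisitMillis;
    }

    // returns true if the URL may be fetched now, and records the access;
    // a rejected URL is not recorded, so it stays eligible for a later retry
    synchronized boolean mayFetch(String domain, String url, long now) {
        Long lastDomain = domainLastAccess.get(domain);
        if (lastDomain != null && now - lastDomain.longValue() < minDomainDelayMillis) return false;
        Long lastUrl = urlLastSeen.get(url);
        if (lastUrl != null && now - lastUrl.longValue() < minRevisitMillis) return false;
        domainLastAccess.put(domain, Long.valueOf(now));
        urlLastSeen.put(url, Long.valueOf(now));
        return true;
    }
}
```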

We implement some specific tasks and use different databases to achieve these goals:
- a database 'crawlerDisallow.db' contains all urls that shall not be crawled
- a database 'crawlerDomain.db' holds all domains and access times, where we loaded the disallow tables
  this table contains the following entities:
  <flag: robots exist/not exist, last access of robots.txt, last access of domain (for access scheduling)>
- four databases for scheduled access: crawlerScheduledHotText.db, crawlerScheduledColdText.db,
  crawlerScheduledHotMedia.db and crawlerScheduledColdMedia.db
- two stacks for new URLs: newText.stack and newMedia.stack
- two databases for URL double-check: knownText.db and knownMedia.db
- one database with crawling orders: crawlerOrders.db

The information flow of a single URL that is crawled is as follows:
- a html file is loaded from a specific URL within the module httpdProxyServlet as
  a process of the proxy.
- the file is passed to httpdProxyCache. Here its processing is delayed until the proxy is idle.
- the cache entry is passed on to the plasmaSwitchboard. There the URL is stored into plasmaLURL where
  the URL is stored under a specific hash. The URLs from the content are stripped off, stored in plasmaLURL
  with a 'wrong' date (the dates of the URLs are not known at this time, only after fetching) and stacked with
  plasmaCrawlerTextStack. The content is read and split into rated words in plasmaCondenser.
  The split words are then integrated into the index with plasmaSearch.
- in plasmaSearch the words are indexed by reversing the relation between URL and words: one URL points
  to many words, the words within the document at the URL. After reversing, one word points
  to many URLs, all the URLs where the word occurs. One single word->URL-hash relation is stored in
  plasmaIndexEntry. A set of plasmaIndexEntries is a reverse word index.
  This reverse word index is stored temporarily in plasmaIndexCache.
- in plasmaIndexCache the single plasmaIndexEntries are collected and stored into a plasmaIndex entry.
  These plasmaIndex objects are the true reverse word indexes.
- in plasmaIndex the plasmaIndexEntry objects are stored in a kelondroTree; an indexed file in the file system.
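
The reversal described above can be sketched as a toy in-memory index
(class and method names are illustrative; YaCy's plasmaIndexEntry is more
elaborate, and this sketch uses generics rather than the 1.4-era raw types
used in this file):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy reverse word index: turns (urlHash -> words of that document)
// into (word -> set of urlHashes where the word occurs).
class ToyReverseIndex {
    private final Map<String, Set<String>> wordToUrls = new HashMap<>();

    // index one document: one URL points to many words ...
    void addDocument(String urlHash, Set<String> words) {
        for (String word : words) {
            // ... after reversing, one word points to many URL hashes
            wordToUrls.computeIfAbsent(word, w -> new HashSet<>()).add(urlHash);
        }
    }

    Set<String> urlsFor(String word) {
        return wordToUrls.getOrDefault(word, new HashSet<>());
    }
}
```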

The information flow of a search request is as follows:
- in httpdFileServlet the user enters a search query, which is passed to plasmaSwitchboard
- in plasmaSwitchboard, the query is passed to plasmaSearch.
- in plasmaSearch, the plasmaSearch.result object is generated by simultaneous enumeration of
  URL hashes in the reverse word indexes plasmaIndex
- (future: the plasmaSearch.result object is used to identify more key words for a new search)
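
For a multi-word query, the simultaneous enumeration amounts to
intersecting the URL-hash sets of the query words. A minimal version
(the class and method names are illustrative, not YaCy API):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ToyQueryJoin {
    // Intersect the URL-hash sets of all query words: a URL qualifies
    // only if every word of the query occurs in its document.
    static Set<String> conjunction(List<Set<String>> urlSetsPerWord) {
        if (urlSetsPerWord.isEmpty()) return new HashSet<>();
        Set<String> result = new HashSet<>(urlSetsPerWord.get(0));
        for (int i = 1; i < urlSetsPerWord.size(); i++) {
            result.retainAll(urlSetsPerWord.get(i));
        }
        return result;
    }
}
```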
*/

package de.anomic.plasma;

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.net.URLEncoder;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.logging.Level;

import de.anomic.data.messageBoard;
import de.anomic.data.userDB;
import de.anomic.data.wikiBoard;
import de.anomic.htmlFilter.htmlFilterContentScraper;
import de.anomic.http.httpHeader;
import de.anomic.http.httpRemoteProxyConfig;
import de.anomic.http.httpc;
import de.anomic.kelondro.kelondroException;
import de.anomic.kelondro.kelondroMSetTools;
import de.anomic.kelondro.kelondroTables;
import de.anomic.server.serverAbstractSwitch;
import de.anomic.server.serverCodings;
import de.anomic.server.serverCore;
import de.anomic.server.serverDate;
import de.anomic.server.serverFileUtils;
import de.anomic.server.serverInstantThread;
import de.anomic.server.serverObjects;
import de.anomic.server.serverSemaphore;
import de.anomic.server.serverSwitch;
import de.anomic.server.logging.serverLog;
import de.anomic.tools.bitfield;
import de.anomic.tools.crypt;
import de.anomic.tools.nxTools;
import de.anomic.yacy.yacyClient;
import de.anomic.yacy.yacyCore;
import de.anomic.yacy.yacyNewsPool;
import de.anomic.yacy.yacySeed;

public final class plasmaSwitchboard extends serverAbstractSwitch implements serverSwitch {

    // load slots
    public static int crawlSlots = 10;
    public static int indexingSlots = 100;
    public static int stackCrawlSlots = 10000;

    public static int maxCRLDump = 500000;
    public static int maxCRGDump = 200000;

    // coloured list management
    public static TreeSet blueList = null;
    public static TreeSet stopwords = null;
    public static plasmaURLPattern urlBlacklist;

    // storage management
    public File htCachePath;
    private File plasmaPath;
    public File listsPath;
    public File htDocsPath;
    public File rankingPath;
    public HashMap rankingPermissions;
    public plasmaURLPool urlPool;
    public plasmaWordIndex wordIndex;
    public plasmaHTCache cacheManager;
    public plasmaSnippetCache snippetCache;
    public plasmaCrawlLoader cacheLoader;
    public plasmaSwitchboardQueue sbQueue;
    public plasmaCrawlStacker sbStackCrawlThread;
    public messageBoard messageDB;
    public wikiBoard wikiDB;
    public static plasmaCrawlRobotsTxt robots;
    public plasmaCrawlProfile profiles;
    public plasmaCrawlProfile.entry defaultProxyProfile;
    public plasmaCrawlProfile.entry defaultRemoteProfile;
    public plasmaWordIndexDistribution indexDistribution;
    public plasmaRankingDistribution rankingOwnDistribution;
    public plasmaRankingDistribution rankingOtherDistribution;
    public HashMap outgoingCookies, incomingCookies;
    public kelondroTables facilityDB;
    public plasmaParser parser;
    public long proxyLastAccess;
    public yacyCore yc;
    public HashMap indexingTasksInProcess;
    public userDB userDB;

    //public StringBuffer crl; // local citation references
    public StringBuffer crg; // global citation references

    /*
     * Remote proxy configuration
     */
    // public boolean remoteProxyUse;
    // public boolean remoteProxyUse4Yacy;
    // public String remoteProxyHost;
    // public int remoteProxyPort;
    // public String remoteProxyNoProxy = "";
    // public String[] remoteProxyNoProxyPatterns = null;
    public httpRemoteProxyConfig remoteProxyConfig = null;

    /*
     * Some constants
     */
    private static final String STR_PROXYPROFILE = "defaultProxyProfile";
    private static final String STR_REMOTEPROFILE = "defaultRemoteProfile";
    private static final String STR_REMOTECRAWLTRIGGER = "REMOTECRAWLTRIGGER: REMOTE CRAWL TO PEER ";

    private serverSemaphore shutdownSync = new serverSemaphore(0);
    private boolean terminate = false;

    private Object crawlingPausedSync = new Object();
    private boolean crawlingIsPaused = false;
    private static plasmaSwitchboard sb;

    public plasmaSwitchboard(String rootPath, String initPath, String configPath) throws IOException {
        super(rootPath, initPath, configPath);

        // set loglevel and log
        setLog(new serverLog("PLASMA"));

        // load values from configs
        this.plasmaPath = new File(rootPath, getConfig("dbPath", "DATA/PLASMADB"));
        this.log.logConfig("Plasma DB Path: " + this.plasmaPath.toString());
        this.listsPath = new File(rootPath, getConfig("listsPath", "DATA/LISTS"));
        this.log.logConfig("Lists Path: " + this.listsPath.toString());
        this.htDocsPath = new File(rootPath, getConfig("htDocsPath", "DATA/HTDOCS"));
        this.log.logConfig("HTDOCS Path: " + this.htDocsPath.toString());
        this.rankingPath = new File(rootPath, getConfig("rankingPath", "DATA/RANKING"));
        this.log.logConfig("Ranking Path: " + this.rankingPath.toString());
        this.rankingPermissions = new HashMap(); // mapping of permission - to filename.

        /* ============================================================================
         * Remote proxy configuration
         * ============================================================================ */
        this.remoteProxyConfig = httpRemoteProxyConfig.init(this);
        this.log.logConfig("Remote proxy configuration:\n" + this.remoteProxyConfig.toString());

        // set timestamp of last proxy access
        this.proxyLastAccess = System.currentTimeMillis() - 60000;
        crg = new StringBuffer(maxCRGDump);
        //crl = new StringBuffer(maxCRLDump);

        // configure list path
        if (!(listsPath.exists())) listsPath.mkdirs();

        // load coloured lists
        if (blueList == null) {
            // read only once upon first instantiation of this class
            String f = getConfig("plasmaBlueList", null);
            if (f != null) {
                // check f for null before constructing the File;
                // new File((String) null) would throw a NullPointerException
                File plasmaBlueListFile = new File(f);
                blueList = kelondroMSetTools.loadList(plasmaBlueListFile);
                this.log.logConfig("loaded blue-list from file " + plasmaBlueListFile.getName() + ", " +
                        blueList.size() + " entries, " +
                        ppRamString(plasmaBlueListFile.length() / 1024));
            } else {
                blueList = new TreeSet();
            }
        }

        // load the black-list / inspired by [AS]
        File urlBlackListFile = new File(getRootPath(), getConfig("listsPath", "DATA/LISTS"));
        urlBlacklist = new plasmaURLPattern(urlBlackListFile);
        String f = getConfig("proxyBlackListsActive", null);
        if (f != null) {
            urlBlacklist.loadList(f, "/");
            this.log.logConfig("loaded black-list from file " + urlBlackListFile.getName() + ", " +
                    urlBlacklist.size() + " entries, " +
                    ppRamString(urlBlackListFile.length() / 1024));
        }

        // load stopwords
        if (stopwords == null) {
            File stopwordsFile = new File(rootPath, "yacy.stopwords");
            stopwords = kelondroMSetTools.loadList(stopwordsFile);
            this.log.logConfig("loaded stopwords from file " + stopwordsFile.getName() + ", " +
                    stopwords.size() + " entries, " +
                    ppRamString(stopwordsFile.length() / 1024));
        }

        // load ranking tables
        File YBRPath = new File(rootPath, "ranking/YBR");
        if (YBRPath.exists()) {
            plasmaSearchPreOrder.loadYBR(YBRPath, 15);
        }

        // read memory amounts
        int ramLURL = (int) getConfigLong("ramCacheLURL", 1024) / 1024;
        int ramNURL = (int) getConfigLong("ramCacheNURL", 1024) / 1024;
        int ramEURL = (int) getConfigLong("ramCacheEURL", 1024) / 1024;
        int ramRWI = (int) getConfigLong("ramCacheRWI", 1024) / 1024;
        int ramHTTP = (int) getConfigLong("ramCacheHTTP", 1024) / 1024;
        int ramMessage = (int) getConfigLong("ramCacheMessage", 1024) / 1024;
        int ramWiki = (int) getConfigLong("ramCacheWiki", 1024) / 1024;
        int ramRobots = (int) getConfigLong("ramCacheRobots", 1024) / 1024;
        int ramProfiles = (int) getConfigLong("ramCacheProfiles", 1024) / 1024;
        int ramPreNURL = (int) getConfigLong("ramCachePreNURL", 1024) / 1024;
        this.log.logConfig("LURL Cache memory = " + ppRamString(ramLURL));
        this.log.logConfig("NURL Cache memory = " + ppRamString(ramNURL));
        this.log.logConfig("EURL Cache memory = " + ppRamString(ramEURL));
        this.log.logConfig("RWI Cache memory = " + ppRamString(ramRWI));
        this.log.logConfig("HTTP Cache memory = " + ppRamString(ramHTTP));
        this.log.logConfig("Message Cache memory = " + ppRamString(ramMessage));
        this.log.logConfig("Wiki Cache memory = " + ppRamString(ramWiki));
        this.log.logConfig("Robots Cache memory = " + ppRamString(ramRobots));
        this.log.logConfig("Profiles Cache memory = " + ppRamString(ramProfiles));
        this.log.logConfig("PreNURL Cache memory = " + ppRamString(ramPreNURL));

        // make crawl profiles database and default profiles
        this.log.logConfig("Initializing Crawl Profiles");
        File profilesFile = new File(this.plasmaPath, "crawlProfiles0.db");
        this.profiles = new plasmaCrawlProfile(profilesFile, ramProfiles);
        initProfiles();
        log.logConfig("Loaded profiles from file " + profilesFile.getName() +
                ", " + this.profiles.size() + " entries" +
                ", " + ppRamString(profilesFile.length() / 1024));

        // load the robots.txt db
        this.log.logConfig("Initializing robots.txt DB");
        File robotsDBFile = new File(this.plasmaPath, "crawlRobotsTxt.db");
        robots = new plasmaCrawlRobotsTxt(robotsDBFile, ramRobots);
        this.log.logConfig("Loaded robots.txt DB from file " + robotsDBFile.getName() +
                ", " + robots.size() + " entries" +
                ", " + ppRamString(robotsDBFile.length() / 1024));

        // start indexing management
        log.logConfig("Starting Indexing Management");
        urlPool = new plasmaURLPool(plasmaPath, ramLURL, ramNURL, ramEURL);
        wordIndex = new plasmaWordIndex(plasmaPath, ramRWI, log);
        int wordCacheMaxLow = (int) getConfigLong("wordCacheMaxLow", 8000);
        int wordCacheMaxHigh = (int) getConfigLong("wordCacheMaxHigh", 10000);
        wordIndex.setMaxWords(wordCacheMaxLow, wordCacheMaxHigh);

        // start a cache manager
        log.logConfig("Starting HT Cache Manager");

        // create the cache directory
        String cache = getConfig("proxyCache", "DATA/HTCACHE");
        cache = cache.replace('\\', '/');
        if (cache.endsWith("/")) { cache = cache.substring(0, cache.length() - 1); }
        if (new File(cache).isAbsolute()) {
            htCachePath = new File(cache); // don't use rootPath
        } else {
            htCachePath = new File(rootPath, cache);
        }
        this.log.logInfo("HTCACHE Path = " + htCachePath.getAbsolutePath());
        long maxCacheSize = 1024 * 1024 * Long.parseLong(getConfig("proxyCacheSize", "2")); // config value is in megabytes
        this.cacheManager = new plasmaHTCache(htCachePath, maxCacheSize, ramHTTP);

        // make parser
        log.logConfig("Starting Parser");
        this.parser = new plasmaParser();

        /* ======================================================================
         * initialize switchboard queue
         * ====================================================================== */
        // create queue
        this.sbQueue = new plasmaSwitchboardQueue(this.cacheManager, this.urlPool.loadedURL, new File(this.plasmaPath, "switchboardQueue1.stack"), this.profiles);

        // set the indexing queue slots
        indexingSlots = (int) getConfigLong("indexer.slots", 100);
        // create in-process list
        this.indexingTasksInProcess = new HashMap();

        // go through the sbQueue entries and register all content files as in use
        int count = 0;
        ArrayList sbQueueEntries = this.sbQueue.list();
        for (int i = 0; i < sbQueueEntries.size(); i++) {
            plasmaSwitchboardQueue.Entry entry = (plasmaSwitchboardQueue.Entry) sbQueueEntries.get(i);
            if ((entry != null) && (entry.url() != null) && (entry.cacheFile().exists())) {
                plasmaHTCache.filesInUse.add(entry.cacheFile());
                count++;
            }
        }
        this.log.logConfig(count + " files in htcache reported to the cachemanager as in use.");

        // define an extension blacklist
        log.logConfig("Parser: Initializing Extension Mappings for Media/Parser");
        plasmaParser.initMediaExt(plasmaParser.extString2extList(getConfig("mediaExt", "")));
        plasmaParser.initSupportedRealtimeFileExt(plasmaParser.extString2extList(getConfig("parseableExt", "")));

        // define a realtime parsable mimetype list
        log.logConfig("Parser: Initializing Mime Types");
        plasmaParser.initRealtimeParsableMimeTypes(getConfig("parseableRealtimeMimeTypes", "application/xhtml+xml,text/html,text/plain"));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_PROXY, getConfig("parseableMimeTypes.PROXY", null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_CRAWLER, getConfig("parseableMimeTypes.CRAWLER", null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_ICAP, getConfig("parseableMimeTypes.ICAP", null));
        plasmaParser.initParseableMimeTypes(plasmaParser.PARSER_MODE_URLREDIRECTOR, getConfig("parseableMimeTypes.URLREDIRECTOR", null));

        // start a loader
        log.logConfig("Starting Crawl Loader");
        crawlSlots = Integer.parseInt(getConfig("crawler.MaxActiveThreads", "10"));
        this.crawlingIsPaused = Boolean.valueOf(getConfig("crawler.isPaused", "false")).booleanValue();
        plasmaCrawlLoader.switchboard = this;
        this.cacheLoader = new plasmaCrawlLoader(this.cacheManager, this.log);

        // start message board
        this.log.logConfig("Starting Message Board");
        File messageDbFile = new File(getRootPath(), "DATA/SETTINGS/message.db");
        this.messageDB = new messageBoard(messageDbFile, ramMessage);
        this.log.logConfig("Loaded Message Board DB from file " + messageDbFile.getName() +
                ", " + this.messageDB.size() + " entries" +
                ", " + ppRamString(messageDbFile.length() / 1024));

        // start wiki board
        this.log.logConfig("Starting Wiki Board");
        File wikiDbFile = new File(getRootPath(), "DATA/SETTINGS/wiki.db");
        this.wikiDB = new wikiBoard(wikiDbFile,
                new File(getRootPath(), "DATA/SETTINGS/wiki-bkp.db"), ramWiki);
        this.log.logConfig("Loaded Wiki Board DB from file " + wikiDbFile.getName() +
                ", " + this.wikiDB.size() + " entries" +
                ", " + ppRamString(wikiDbFile.length() / 1024));

        // init user DB
        this.log.logConfig("Loading User DB");
        File userDbFile = new File(getRootPath(), "DATA/SETTINGS/user.db");
        this.userDB = new userDB(userDbFile, 512);
        this.log.logConfig("Loaded User DB from file " + userDbFile.getName() +
                ", " + this.userDB.size() + " entries" +
                ", " + ppRamString(userDbFile.length() / 1024));

        // init cookie monitor
        this.log.logConfig("Starting Cookie Monitor");
        this.outgoingCookies = new HashMap();
        this.incomingCookies = new HashMap();

        // clean up profiles
        this.log.logConfig("Cleaning Profiles");
        cleanProfiles();

        // init ranking transmission
        /*
        CRDist0Path = GLOBAL/010_owncr
        CRDist0Method = 1
        CRDist0Percent = 0
        CRDist0Target =
        CRDist1Path = GLOBAL/014_othercr/1
        CRDist1Method = 9
        CRDist1Percent = 30
        CRDist1Target = kaskelix.de:8080,yacy.dyndns.org:8000,suma-lab.de:8080
        */
        rankingOwnDistribution = new plasmaRankingDistribution(log, new File(rankingPath, getConfig("CRDist0Path", plasmaRankingDistribution.CR_OWN)), (int) getConfigLong("CRDist0Method", plasmaRankingDistribution.METHOD_ANYSENIOR), (int) getConfigLong("CRDist0Percent", 0), getConfig("CRDist0Target", ""));
        rankingOtherDistribution = new plasmaRankingDistribution(log, new File(rankingPath, getConfig("CRDist1Path", plasmaRankingDistribution.CR_OTHER)), (int) getConfigLong("CRDist1Method", plasmaRankingDistribution.METHOD_MIXEDSENIOR), (int) getConfigLong("CRDist1Percent", 30), getConfig("CRDist1Target", "kaskelix.de:8080,yacy.dyndns.org:8000,suma-lab.de:8080"));

        // init facility DB
        /*
        log.logSystem("Starting Facility Database");
        File facilityDBpath = new File(getRootPath(), "DATA/SETTINGS/");
        facilityDB = new kelondroTables(facilityDBpath);
        facilityDB.declareMaps("backlinks", 250, 500, new String[] {"date"}, null);
        log.logSystem("..opened backlinks");
        facilityDB.declareMaps("zeitgeist", 40, 500);
        log.logSystem("..opened zeitgeist");
        facilityDB.declareTree("statistik", new int[] {11, 8, 8, 8, 8, 8, 8}, 0x400);
        log.logSystem("..opened statistik");
        facilityDB.update("statistik", (new serverDate()).toShortString(false).substring(0, 11), new long[] {1, 2, 3, 4, 5, 6});
        long[] testresult = facilityDB.selectLong("statistik", "yyyyMMddHHm");
        testresult = facilityDB.selectLong("statistik", (new serverDate()).toShortString(false).substring(0, 11));
        */

        /*
         * Initializing httpc
         */
        // initializing yacyDebugMode
        httpc.yacyDebugMode = getConfig("yacyDebugMode", "false").equals("true");

        // init nameCacheNoCachingList
        String noCachingList = getConfig("httpc.nameCacheNoCachingPatterns", "");
        String[] noCachingEntries = noCachingList.split(",");
        for (int i = 0; i < noCachingEntries.length; i++) {
            String entry = noCachingEntries[i].trim();
            httpc.nameCacheNoCachingPatterns.add(entry);
        }

        // generate snippets cache
        log.logConfig("Initializing Snippet Cache");
        snippetCache = new plasmaSnippetCache(this, cacheManager, parser, log);

        // start yacy core
        log.logConfig("Starting YaCy Protocol Core");
        //try{Thread.currentThread().sleep(5000);} catch (InterruptedException e) {} // for profiler
        this.yc = new yacyCore(this);
        //log.logSystem("Started YaCy Protocol Core");
        // System.gc(); try{Thread.currentThread().sleep(5000);} catch (InterruptedException e) {} // for profiler
        serverInstantThread.oneTimeJob(yc, "loadSeeds", yacyCore.log, 3000);

        // initialize the stackCrawlThread
        this.sbStackCrawlThread = new plasmaCrawlStacker(this, this.plasmaPath, ramPreNURL);
        //this.sbStackCrawlThread = new plasmaStackCrawlThread(this, this.plasmaPath, ramPreNURL);
        //this.sbStackCrawlThread.start();
        // deploy threads
        log.logConfig("Starting Threads");
        // System.gc(); // help for profiler

        int indexing_cluster = Integer.parseInt(getConfig("80_indexing_cluster", "1"));
        if (indexing_cluster < 1) indexing_cluster = 1;
        deployThread("90_cleanup", "Cleanup", "simple cleaning process for monitoring information", null,
                     new serverInstantThread(this, "cleanupJob", "cleanupJobSize"), 10000); // every 5 minutes
        deployThread("82_crawlstack", "Crawl URL Stacker", "process that checks urls for double-occurrences and for allowance/disallowance by robots.txt", null,
                     new serverInstantThread(sbStackCrawlThread, "job", "size"), 8000);
        deployThread("80_indexing", "Parsing/Indexing", "thread that performs document parsing and indexing", "/IndexCreateIndexingQueue_p.html",
                     new serverInstantThread(this, "deQueue", "queueSize"), 10000);
        for (int i = 1; i < indexing_cluster; i++) {
            setConfig((i + 80) + "_indexing_idlesleep", getConfig("80_indexing_idlesleep", ""));
            setConfig((i + 80) + "_indexing_busysleep", getConfig("80_indexing_busysleep", ""));
            deployThread((i + 80) + "_indexing", "Parsing/Indexing (cluster job)", "thread that performs document parsing and indexing", null,
                         new serverInstantThread(this, "deQueue", "queueSize"), 10000 + (i * 1000),
                         Long.parseLong(getConfig("80_indexing_idlesleep", "5000")),
                         Long.parseLong(getConfig("80_indexing_busysleep", "0")),
                         Long.parseLong(getConfig("80_indexing_memprereq", "1000000")));
        }
        deployThread("70_cachemanager", "Proxy Cache Enqueue", "job takes new proxy files from the RAM stack, stores them and hands them over to the indexing stack", null,
                     new serverInstantThread(this, "htEntryStoreJob", "htEntrySize"), 10000);
        deployThread("62_remotetriggeredcrawl", "Remote Crawl Job", "thread that performs a single crawl/indexing step triggered by a remote peer", null,
                     new serverInstantThread(this, "remoteTriggeredCrawlJob", "remoteTriggeredCrawlJobSize"), 30000);
        deployThread("61_globalcrawltrigger", "Global Crawl Trigger", "thread that triggers remote peers for crawling", "/IndexCreateWWWGlobalQueue_p.html",
                     new serverInstantThread(this, "limitCrawlTriggerJob", "limitCrawlTriggerJobSize"), 30000);
        deployThread("50_localcrawl", "Local Crawl", "thread that performs a single crawl step from the local crawl queue", "/IndexCreateWWWLocalQueue_p.html",
                     new serverInstantThread(this, "coreCrawlJob", "coreCrawlJobSize"), 10000);
        deployThread("40_peerseedcycle", "Seed-List Upload", "task that a principal peer performs to generate and upload a seed-list to an ftp account", null,
                     new serverInstantThread(yc, "publishSeedList", null), 180000);
        serverInstantThread peerPing = null;
        deployThread("30_peerping", "YaCy Core", "this is the p2p-control and peer-ping task", null,
                     peerPing = new serverInstantThread(yc, "peerPing", null), 2000);
        peerPing.setSyncObject(new Object());
        this.indexDistribution = new plasmaWordIndexDistribution(
                this.urlPool,
                this.wordIndex,
                this.log,
                getConfig("allowDistributeIndex", "false").equalsIgnoreCase("true"),
                getConfig("allowDistributeIndexWhileCrawling", "false").equalsIgnoreCase("true"),
                getConfig("indexDistribution.gzipBody", "false").equalsIgnoreCase("true"),
                (int) getConfigLong("indexDistribution.timeout", 60000),
                (int) getConfigLong("indexDistribution.maxOpenFiles", 800)
        );
        indexDistribution.setCounts(150, 1, 3, 10000);
        deployThread("20_dhtdistribution", "DHT Distribution", "selection, transfer and deletion of index entries that are not searched on your peer, but on others", null,
                     new serverInstantThread(indexDistribution, "job", null), 600000);

        // test routine for snippet fetch
        //Set query = new HashSet();
        //query.add(plasmaWordIndexEntry.word2hash("Weitergabe"));
        //query.add(plasmaWordIndexEntry.word2hash("Zahl"));
        //plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/mobil/newsticker/meldung/mail/54980"), query, true);
        //plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/security/news/foren/go.shtml?read=1&msg_id=7301419&forum_id=72721"), query, true);
        //plasmaSnippetCache.result scr = snippetCache.retrieve(new URL("http://www.heise.de/kiosk/archiv/ct/2003/4/20"), query, true, 260);

        sb = this;
        log.logConfig("Finished Switchboard Initialization");
    }
    public static plasmaSwitchboard getSwitchboard() {
        return sb;
    }

    /**
     * This method changes the HTCache size.<br>
     * @param newCacheSize the new cache size in MB
     */
    public final void setCacheSize(long newCacheSize) {
        this.cacheManager.setCacheSize(1048576 * newCacheSize);
    }
    public boolean onlineCaution() {
        try {
            return System.currentTimeMillis() - proxyLastAccess < Integer.parseInt(getConfig("onlineCautionDelay", "30000"));
        } catch (NumberFormatException e) {
            return false;
        }
    }
    // pretty-print a memory amount; the argument is interpreted in KBytes,
    // which is why the smallest returned unit is "KByte"
    private static String ppRamString(long kBytes) {
        if (kBytes < 1024) return kBytes + " KByte";
        kBytes = kBytes / 1024;
        if (kBytes < 1024) return kBytes + " MByte";
        kBytes = kBytes / 1024;
        if (kBytes < 1024) return kBytes + " GByte";
        return (kBytes / 1024) + " TByte";
    }
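    /*
     * The successive-division formatting above can be exercised in isolation.
     * The following is a minimal standalone sketch; the class name is
     * illustrative and not part of YaCy.
     */

```java
// Standalone sketch of the successive-division pretty-printer used by
// ppRamString; interprets its argument in KBytes, as the original does.
public class RamStringDemo {
    public static String ppRamString(long kBytes) {
        if (kBytes < 1024) return kBytes + " KByte";
        kBytes /= 1024;
        if (kBytes < 1024) return kBytes + " MByte";
        kBytes /= 1024;
        if (kBytes < 1024) return kBytes + " GByte";
        return (kBytes / 1024) + " TByte";
    }

    public static void main(String[] args) {
        System.out.println(ppRamString(512));              // 512 KByte
        System.out.println(ppRamString(4096));             // 4 MByte
        System.out.println(ppRamString(3L * 1024 * 1024)); // 3 GByte
    }
}
```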
    private void initProfiles() throws IOException {
        if ((profiles.size() == 0) ||
            (getConfig(STR_PROXYPROFILE, "").length() == 0) ||
            (profiles.getEntry(getConfig(STR_PROXYPROFILE, "")) == null)) {
            // generate new default entry for proxy crawling
            defaultProxyProfile = profiles.newEntry("proxy", "", ".*", ".*", Integer.parseInt(getConfig("proxyPrefetchDepth", "0")), Integer.parseInt(getConfig("proxyPrefetchDepth", "0")), false, true, true, true, false, true, true, true);
            setConfig(STR_PROXYPROFILE, defaultProxyProfile.handle());
        } else {
            defaultProxyProfile = profiles.getEntry(getConfig(STR_PROXYPROFILE, ""));
        }
        if ((profiles.size() == 1) ||
            (getConfig(STR_REMOTEPROFILE, "").length() == 0) ||
            (profiles.getEntry(getConfig(STR_REMOTEPROFILE, "")) == null)) {
            // generate new default entry for remote crawling
            defaultRemoteProfile = profiles.newEntry("remote", "", ".*", ".*", 0, 0, true, false, true, true, false, true, true, false);
            setConfig(STR_REMOTEPROFILE, defaultRemoteProfile.handle());
        } else {
            defaultRemoteProfile = profiles.getEntry(getConfig(STR_REMOTEPROFILE, ""));
        }
    }
    private void resetProfiles() {
        final File pdb = new File(plasmaPath, "crawlProfiles0.db");
        if (pdb.exists()) pdb.delete();
        try {
            int ramProfiles = (int) getConfigLong("ramCacheProfiles", 1024) / 1024;
            profiles = new plasmaCrawlProfile(pdb, ramProfiles);
            initProfiles();
        } catch (IOException e) {}
    }
    public boolean cleanProfiles() {
        if ((sbQueue.size() > 0) || (cacheLoader.size() > 0) || (urlPool.noticeURL.stackSize() > 0)) return false;
        final Iterator iter = profiles.profiles(true);
        plasmaCrawlProfile.entry entry;
        boolean hasDoneSomething = false;
        try {
            while (iter.hasNext()) {
                entry = (plasmaCrawlProfile.entry) iter.next();
                if (!((entry.name().equals("proxy")) || (entry.name().equals("remote")))) {
                    iter.remove();
                    hasDoneSomething = true;
                }
            }
        } catch (kelondroException e) {
            resetProfiles();
            hasDoneSomething = true;
        }
        return hasDoneSomething;
    }

    public plasmaHTCache getCacheManager() {
        return cacheManager;
    }
    synchronized public void htEntryStoreEnqueued(plasmaHTCache.Entry entry) throws IOException {
        if (cacheManager.full())
            htEntryStoreProcess(entry);
        else
            cacheManager.push(entry);
    }

    synchronized public boolean htEntryStoreProcess(plasmaHTCache.Entry entry) throws IOException {
        if (entry == null) return false;

        // store response header
        if (entry.responseHeader != null) {
            this.cacheManager.storeHeader(entry.nomalizedURLHash, entry.responseHeader);
            this.log.logInfo("WROTE HEADER for " + entry.cacheFile);
        }

        /*
         * Evaluating the request header:
         * With the X-YACY-Index-Control header set to "no-index" a client can disallow
         * YaCy to index the response returned as answer to a request
         */
        boolean doIndexing = true;
        if (entry.requestHeader != null) {
            if ((entry.requestHeader.containsKey(httpHeader.X_YACY_INDEX_CONTROL)) &&
                (((String) entry.requestHeader.get(httpHeader.X_YACY_INDEX_CONTROL)).toUpperCase().equals("NO-INDEX"))) {
                doIndexing = false;
            }
        }

        // work off unwritten files
        if (entry.cacheArray == null) {
            this.log.logInfo("EXISTING FILE (" + entry.cacheFile.length() + " bytes) for " + entry.cacheFile);
        } else {
            String error = entry.shallStoreCacheForProxy();
            if (error == null) {
                this.cacheManager.writeFile(entry.url, entry.cacheArray);
                this.log.logInfo("WROTE FILE (" + entry.cacheArray.length + " bytes) for " + entry.cacheFile);
            } else {
                this.log.logInfo("WRITE OF FILE " + entry.cacheFile + " FORBIDDEN: " + error);
            }
        }

        if ((doIndexing) && plasmaParser.supportedContent(entry.url, entry.responseHeader.mime())) {

            // registering the cachefile as in use
            if (entry.cacheFile.exists()) {
                plasmaHTCache.filesInUse.add(entry.cacheFile);
            }

            // enqueue for further crawling
            enQueue(this.sbQueue.newEntry(entry.url, plasmaURL.urlHash(entry.referrerURL()),
                    entry.requestHeader.ifModifiedSince(), entry.requestHeader.containsKey(httpHeader.COOKIE),
                    entry.initiator(), entry.depth, entry.profile.handle(),
                    entry.name()
            ));
        }
        return true;
    }

    public boolean htEntryStoreJob() {
        if (cacheManager.empty()) return false;
        try {
            return htEntryStoreProcess(cacheManager.pop());
        } catch (IOException e) {
            return false;
        }
    }

    public int htEntrySize() {
        return cacheManager.size();
    }
    public void close() {
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 1: sending termination signal to managed threads:");
        terminateAllThreads(true);
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 2: sending termination signal to threaded indexing (stand by..)");
        int waitingBoundSeconds = Integer.parseInt(getConfig("maxWaitingWordFlush", "120"));
        wordIndex.close(waitingBoundSeconds);
        log.logConfig("SWITCHBOARD SHUTDOWN STEP 3: sending termination signal to database manager");
        try {
            // closing all still running db importer jobs
            plasmaDbImporter.close();
            indexDistribution.close();
            cacheLoader.close();
            wikiDB.close();
            userDB.close();
            messageDB.close();
            if (facilityDB != null) facilityDB.close();
            sbStackCrawlThread.close();
            urlPool.close();
            profiles.close();
            robots.close();
            parser.close();
            cacheManager.close();
            sbQueue.close();
            //flushCitationReference(crl, "crl");
            flushCitationReference(crg, "crg");
        } catch (IOException e) {}
        log.logConfig("SWITCHBOARD SHUTDOWN TERMINATED");
    }
    public int queueSize() {
        return sbQueue.size();
        //return processStack.size() + cacheLoader.size() + noticeURL.stackSize();
    }

    public int cacheSizeMin() {
        return wordIndex.size();
    }

    public void enQueue(Object job) {
        if (!(job instanceof plasmaSwitchboardQueue.Entry)) {
            System.out.println("internal error at plasmaSwitchboard.enQueue: wrong job type");
            System.exit(0);
        }
        try {
            sbQueue.push((plasmaSwitchboardQueue.Entry) job);
        } catch (IOException e) {
            log.logSevere("IOError in plasmaSwitchboard.enQueue: " + e.getMessage(), e);
        }
    }
    public boolean deQueue() {
        // work off fresh entries from the proxy or from the crawler
        if (onlineCaution()) {
            log.logFiner("deQueue: online caution, omitting resource stack processing");
            return false;
        }

        if (wordIndex.wordCacheRAMSize() + 1000 > (int) getConfigLong("wordCacheMaxLow", 8000)) {
            log.logFine("deQueue: word index ram cache too full (" + ((int) getConfigLong("wordCacheMaxLow", 8000) - wordIndex.wordCacheRAMSize()) +
                        " slots left); dismissed to omit ram flush lock");
            return false;
        }

        int stackCrawlQueueSize;
        if ((stackCrawlQueueSize = sbStackCrawlThread.size()) >= stackCrawlSlots) {
            log.logFine("deQueue: too many processes in stack crawl thread queue, dismissed to protect emergency case (" +
                        "stackCrawlQueue=" + stackCrawlQueueSize + ")");
            return false;
        }

        plasmaSwitchboardQueue.Entry nextentry;
        synchronized (sbQueue) {
            if (sbQueue.size() == 0) {
                //log.logDebug("DEQUEUE: queue is empty");
                return false; // nothing to do
            }

            // if we were interrupted we should return now
            if (Thread.currentThread().isInterrupted()) return false;

            // do one processing step
            log.logFine("DEQUEUE: sbQueueSize=" + sbQueue.size() +
                        ", coreStackSize=" + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) +
                        ", limitStackSize=" + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT) +
                        ", overhangStackSize=" + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_OVERHANG) +
                        ", remoteStackSize=" + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE));
            try {
                nextentry = sbQueue.pop();
                if (nextentry == null) return false;
            } catch (IOException e) {
                log.logSevere("IOError in plasmaSwitchboard.deQueue: " + e.getMessage(), e);
                return false;
            }
        }

        synchronized (this.indexingTasksInProcess) {
            this.indexingTasksInProcess.put(nextentry.urlHash(), nextentry);
        }

        processResourceStack(nextentry);
        return true;
    }
    public int cleanupJobSize() {
        int c = 0;
        if ((urlPool.errorURL.stackSize() > 1000)) c++;
        for (int i = 1; i <= 6; i++) {
            if (urlPool.loadedURL.getStackSize(i) > 1000) c++;
        }
        return c;
    }

    public boolean cleanupJob() {
        boolean hasDoneSomething = false;

        // do transmission of cr-files
        int count = rankingOwnDistribution.size() / 100;
        if (count == 0) count = 1;
        if (count > 5) count = 5;
        rankingOwnDistribution.transferRanking(count);
        rankingOtherDistribution.transferRanking(1);

        // clean up error stack
        if ((urlPool.errorURL.stackSize() > 1000)) {
            log.logFine("Cleaning Error-URLs report stack, " + urlPool.errorURL.stackSize() + " entries on stack");
            urlPool.errorURL.clearStack();
            hasDoneSomething = true;
        }
        // clean up loadedURL stack
        for (int i = 1; i <= 6; i++) {
            if (urlPool.loadedURL.getStackSize(i) > 1000) {
                log.logFine("Cleaning Loaded-URLs report stack, " + urlPool.loadedURL.getStackSize(i) + " entries on stack " + i);
                urlPool.loadedURL.clearStack(i);
                hasDoneSomething = true;
            }
        }
        // clean up profiles
        if (cleanProfiles()) hasDoneSomething = true;
        // clean up news
        try {
            log.logFine("Cleaning Incoming News, " + yacyCore.newsPool.size(yacyNewsPool.INCOMING_DB) + " entries on stack");
            if (yacyCore.newsPool.automaticProcess() > 0) hasDoneSomething = true;
        } catch (IOException e) {}
        return hasDoneSomething;
    }
    /**
     * Creates a new File instance with the absolute path of our seed file.<br>
     * @return a new File instance
     */
    public File getOwnSeedFile() {
        return new File(getRootPath(), getConfig("yacyOwnSeedFile", "mySeed.txt"));
    }

    /**
     * With this function the crawling process can be paused.
     */
    public void pauseCrawling() {
        synchronized (this.crawlingPausedSync) {
            this.crawlingIsPaused = true;
        }
        setConfig("crawler.isPaused", "true");
    }

    /**
     * Continues the previously paused crawling.
     */
    public void continueCrawling() {
        synchronized (this.crawlingPausedSync) {
            if (this.crawlingIsPaused) {
                this.crawlingIsPaused = false;
                this.crawlingPausedSync.notifyAll();
            }
        }
        setConfig("crawler.isPaused", "false");
    }

    /**
     * @return <code>true</code> if crawling was paused or <code>false</code> otherwise
     */
    public boolean crawlingIsPaused() {
        synchronized (this.crawlingPausedSync) {
            return this.crawlingIsPaused;
        }
    }
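    /*
     * The pause/continue mechanism above is a classic guarded-wait monitor:
     * crawl jobs wait() on crawlingPausedSync while paused, and
     * continueCrawling()'s notifyAll() wakes them. The following standalone
     * sketch shows the same pattern; all names in it are illustrative only.
     */

```java
// Minimal standalone sketch of the pause/continue monitor used by the
// switchboard: workers block in awaitIfPaused() while paused, and resume()
// wakes them with notifyAll(). Names are illustrative, not YaCy's.
public class PauseMonitorDemo {
    private final Object pausedSync = new Object();
    private boolean paused = false;

    public void pause() {
        synchronized (pausedSync) { paused = true; }
    }

    public void resume() {
        synchronized (pausedSync) {
            if (paused) {
                paused = false;
                pausedSync.notifyAll(); // wake all workers blocked below
            }
        }
    }

    // called by a worker before each crawl step, as the crawl jobs do
    public boolean awaitIfPaused() {
        synchronized (pausedSync) {
            if (paused) {
                try {
                    pausedSync.wait();
                } catch (InterruptedException e) {
                    return false; // treat interruption as "skip this step"
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        final PauseMonitorDemo monitor = new PauseMonitorDemo();
        monitor.pause();
        Thread worker = new Thread(() -> {
            monitor.awaitIfPaused(); // blocks until resume() is called
            System.out.println("worker resumed");
        });
        worker.start();
        Thread.sleep(100); // demo only: give the worker time to block
        monitor.resume();
        worker.join();
    }
}
```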
    public int coreCrawlJobSize() {
        return urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE);
    }

    public boolean coreCrawlJob() {
        if (urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) == 0) {
            //log.logDebug("CoreCrawl: queue is empty");
            return false;
        }
        if (sbQueue.size() >= indexingSlots) {
            log.logFine("CoreCrawl: too many processes in indexing queue, dismissed (" +
                        "sbQueueSize=" + sbQueue.size() + ")");
            return false;
        }
        if (cacheLoader.size() >= crawlSlots) {
            log.logFine("CoreCrawl: too many processes in loader queue, dismissed (" +
                        "cacheLoader=" + cacheLoader.size() + ")");
            return false;
        }
        if (onlineCaution()) {
            log.logFine("CoreCrawl: online caution, omitting processing");
            return false;
        }
        // if the server is busy, we do crawling more slowly
        //if (!(cacheManager.idle())) try {Thread.currentThread().sleep(2000);} catch (InterruptedException e) {}

        // if crawling was paused we have to wait until we were notified to continue
        synchronized (this.crawlingPausedSync) {
            if (this.crawlingIsPaused) {
                try {
                    this.crawlingPausedSync.wait();
                } catch (InterruptedException e) {
                    return false;
                }
            }
        }

        // do a local crawl
        plasmaCrawlNURL.Entry urlEntry = urlPool.noticeURL.pop(plasmaCrawlNURL.STACK_TYPE_CORE);
        String stats = "LOCALCRAWL[" + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) + ", " + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT) + ", " + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_OVERHANG) + ", " + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE) + "]";
        if ((urlEntry.url() == null) || (urlEntry.url().toString().length() < 10)) {
            log.logInfo(stats + ": URL with hash " + ((urlEntry.hash() == null) ? "Unknown" : urlEntry.hash()) + " already removed from queue.");
            return true;
        }
        String profileHandle = urlEntry.profileHandle();
        //System.out.println("DEBUG plasmaSwitchboard.processCrawling: profileHandle = " + profileHandle + ", urlEntry.url = " + urlEntry.url());
        if (profileHandle == null) {
            log.logSevere(stats + ": NULL PROFILE HANDLE '" + urlEntry.profileHandle() + "' (must be internal error) for URL " + urlEntry.url());
            return true;
        }
        plasmaCrawlProfile.entry profile = profiles.getEntry(profileHandle);
        if (profile == null) {
            log.logSevere(stats + ": LOST PROFILE HANDLE '" + urlEntry.profileHandle() + "' (must be internal error) for URL " + urlEntry.url());
            return true;
        }
        log.logFine("LOCALCRAWL: URL=" + urlEntry.url() + ", initiator=" + urlEntry.initiator() +
                    ", crawlOrder=" + ((profile.remoteIndexing()) ? "true" : "false") + ", depth=" + urlEntry.depth() + ", crawlDepth=" + profile.generalDepth() + ", filter=" + profile.generalFilter() +
                    ", permission=" + ((yacyCore.seedDB == null) ? "undefined" : (((yacyCore.seedDB.mySeed.isSenior()) || (yacyCore.seedDB.mySeed.isPrincipal())) ? "true" : "false")));
        processLocalCrawling(urlEntry, profile, stats);
        return true;
    }
    public int limitCrawlTriggerJobSize() {
        return urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT);
    }

    public boolean limitCrawlTriggerJob() {
        if (urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT) == 0) {
            //log.logDebug("LimitCrawl: queue is empty");
            return false;
        }

        if ((coreCrawlJobSize() <= 20) && (limitCrawlTriggerJobSize() > 100)) {
            // it is not efficient if the core crawl job is empty and we have too much to do
            // move some tasks to the core crawl job
            int toshift = limitCrawlTriggerJobSize() / 5;
            if (toshift > 1000) toshift = 1000;
            if (toshift > limitCrawlTriggerJobSize()) toshift = limitCrawlTriggerJobSize();
            for (int i = 0; i < toshift; i++) {
                urlPool.noticeURL.shift(plasmaCrawlNURL.STACK_TYPE_LIMIT, plasmaCrawlNURL.STACK_TYPE_CORE);
            }
            log.logInfo("shifted " + toshift + " jobs from global crawl to local crawl");
        }

        if (sbQueue.size() >= indexingSlots) {
            log.logFine("LimitCrawl: too many processes in indexing queue, dismissed to protect emergency case (" +
                        "sbQueueSize=" + sbQueue.size() + ")");
            return false;
        }
        if (cacheLoader.size() >= crawlSlots) {
            log.logFine("LimitCrawl: too many processes in loader queue, dismissed to protect emergency case (" +
                        "cacheLoader=" + cacheLoader.size() + ")");
            return false;
        }
        // if the server is busy, we do crawling more slowly
        //if (!(cacheManager.idle())) try {Thread.currentThread().sleep(2000);} catch (InterruptedException e) {}

        // if crawling was paused we have to wait until we were notified to continue
        synchronized (this.crawlingPausedSync) {
            if (this.crawlingIsPaused) {
                try {
                    this.crawlingPausedSync.wait();
                } catch (InterruptedException e) {
                    return false;
                }
            }
        }

        // start a global crawl, if possible
        plasmaCrawlNURL.Entry urlEntry = urlPool.noticeURL.pop(plasmaCrawlNURL.STACK_TYPE_LIMIT);
        String stats = "REMOTECRAWLTRIGGER[" + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) + ", " + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT) + ", " + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_OVERHANG) + ", " + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE) + "]";
        if (urlEntry.url() == null) {
            log.logSevere(stats + ": urlEntry.url() == null");
            return true;
        }
        String profileHandle = urlEntry.profileHandle();
        //System.out.println("DEBUG plasmaSwitchboard.processCrawling: profileHandle = " + profileHandle + ", urlEntry.url = " + urlEntry.url());
        plasmaCrawlProfile.entry profile = profiles.getEntry(profileHandle);
        if (profile == null) {
            log.logSevere(stats + ": LOST PROFILE HANDLE '" + urlEntry.profileHandle() + "' (must be internal error) for URL " + urlEntry.url());
            return true;
        }
        log.logFine("plasmaSwitchboard.limitCrawlTriggerJob: url=" + urlEntry.url() + ", initiator=" + urlEntry.initiator() +
                    ", crawlOrder=" + ((profile.remoteIndexing()) ? "true" : "false") + ", depth=" + urlEntry.depth() + ", crawlDepth=" + profile.generalDepth() + ", filter=" + profile.generalFilter() +
                    ", permission=" + ((yacyCore.seedDB == null) ? "undefined" : (((yacyCore.seedDB.mySeed.isSenior()) || (yacyCore.seedDB.mySeed.isPrincipal())) ? "true" : "false")));
        boolean tryRemote =
            ((urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) != 0) || (sbQueue.size() != 0)) /* should do ourself */ &&
            (profile.remoteIndexing()) /* granted */ &&
            (urlEntry.initiator() != null) && (!(urlEntry.initiator().equals(plasmaURL.dummyHash))) /* not proxy */ &&
            ((yacyCore.seedDB.mySeed.isSenior()) ||
             (yacyCore.seedDB.mySeed.isPrincipal())) /* qualified */;
        if (tryRemote) {
            boolean success = processRemoteCrawlTrigger(urlEntry);
            if (success) return true;
        }
        processLocalCrawling(urlEntry, profile, stats);
        return true;
    }
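    /*
     * The queue-balancing heuristic at the top of limitCrawlTriggerJob can be
     * exercised on its own: when the local (core) queue runs dry while the
     * global (limit) queue is long, a fifth of the global queue -- capped at
     * 1000 entries -- is shifted over. The sketch below is standalone and its
     * names are illustrative only.
     */

```java
// Standalone sketch of the shift heuristic used in limitCrawlTriggerJob.
public class ShiftHeuristicDemo {
    // returns the number of jobs to move from the global to the local queue
    public static int jobsToShift(int coreQueueSize, int limitQueueSize) {
        // only balance when the core queue is nearly empty and the limit queue is long
        if (coreQueueSize > 20 || limitQueueSize <= 100) return 0;
        int toshift = limitQueueSize / 5; // a fifth of the global queue
        if (toshift > 1000) toshift = 1000; // hard cap per invocation
        if (toshift > limitQueueSize) toshift = limitQueueSize;
        return toshift;
    }

    public static void main(String[] args) {
        System.out.println(jobsToShift(100, 500));  // 0: core queue is busy enough
        System.out.println(jobsToShift(5, 500));    // 100: one fifth of the global queue
        System.out.println(jobsToShift(5, 20000));  // 1000: capped
    }
}
```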
2005-05-29 13:56:40 +02:00
public int remoteTriggeredCrawlJobSize ( ) {
2005-06-16 02:31:13 +02:00
return urlPool . noticeURL . stackSize ( plasmaCrawlNURL . STACK_TYPE_REMOTE ) ;
2005-04-07 21:19:42 +02:00
}
    public boolean remoteTriggeredCrawlJob() {
        // work off crawl requests that had been placed by other peers on our crawl stack

        // do nothing if either there are private processes to be done
        // or there is no global crawl on the stack
        if (urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE) == 0) {
            //log.logDebug("GlobalCrawl: queue is empty");
            return false;
        }
        if (onlineCaution()) {
            log.logFine("GlobalCrawl: online caution, omitting processing");
            return false;
        }

        // if crawling was paused we have to wait until we were notified to continue
        synchronized (this.crawlingPausedSync) {
            if (this.crawlingIsPaused) {
                try {
                    this.crawlingPausedSync.wait();
                } catch (InterruptedException e) {
                    return false;
                }
            }
        }

        // we don't want to crawl a global URL globally, since WE are the global part. (from this point of view)
        plasmaCrawlNURL.Entry urlEntry = urlPool.noticeURL.pop(plasmaCrawlNURL.STACK_TYPE_REMOTE);
        String stats = "REMOTETRIGGEREDCRAWL[" + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) + ", " + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_LIMIT) + ", " + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_OVERHANG) + ", " + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE) + "]";
        if (urlEntry.url() == null) {
            log.logSevere(stats + ": urlEntry.url() == null");
            return false;
        }
        String profileHandle = urlEntry.profileHandle();
        //System.out.println("DEBUG plasmaSwitchboard.processCrawling: profileHandle = " + profileHandle + ", urlEntry.url = " + urlEntry.url());
        plasmaCrawlProfile.entry profile = profiles.getEntry(profileHandle);

        if (profile == null) {
            log.logSevere(stats + ": LOST PROFILE HANDLE '" + urlEntry.profileHandle() + "' (must be internal error) for URL " + urlEntry.url());
            return false;
        }
        log.logFine("plasmaSwitchboard.remoteTriggeredCrawlJob: url=" + urlEntry.url() + ", initiator=" + urlEntry.initiator() +
                    ", crawlOrder=" + ((profile.remoteIndexing()) ? "true" : "false") + ", depth=" + urlEntry.depth() + ", crawlDepth=" + profile.generalDepth() + ", filter=" + profile.generalFilter() +
                    ", permission=" + ((yacyCore.seedDB == null) ? "undefined" : (((yacyCore.seedDB.mySeed.isSenior()) || (yacyCore.seedDB.mySeed.isPrincipal())) ? "true" : "false")));

        processLocalCrawling(urlEntry, profile, stats);
        return true;
    }
    private void processResourceStack(plasmaSwitchboardQueue.Entry entry) {
        // work off one stack entry with a fresh resource
        long stackStartTime = 0, stackEndTime = 0,
             parsingStartTime = 0, parsingEndTime = 0,
             indexingStartTime = 0, indexingEndTime = 0,
             storageStartTime = 0, storageEndTime = 0;

        // we must distinguish the following cases: resource-load was initiated by
        // 1) global crawling: the index is extern, not here (not possible here)
        // 2) result of search queries, some indexes are here (not possible here)
        // 3) result of index transfer, some of them are here (not possible here)
        // 4) proxy-load (initiator is "------------")
        // 5) local prefetch/crawling (initiator is own seedHash)
        // 6) local fetching for global crawling (other known or unknown initiator)
        int processCase = 0;
        yacySeed initiator = null;
        String initiatorHash = (entry.proxy()) ? plasmaURL.dummyHash : entry.initiator();
        if (initiatorHash.equals(plasmaURL.dummyHash)) {
            // proxy-load
            processCase = 4;
        } else if (initiatorHash.equals(yacyCore.seedDB.mySeed.hash)) {
            // normal crawling
            processCase = 5;
        } else {
            // this was done for a remote peer (a global crawl)
            initiator = yacyCore.seedDB.getConnected(initiatorHash);
            processCase = 6;
        }

        log.logFine("processResourceStack processCase=" + processCase +
                    ", depth=" + entry.depth() +
                    ", maxDepth=" + ((entry.profile() == null) ? "null" : Integer.toString(entry.profile().generalDepth())) +
                    ", filter=" + ((entry.profile() == null) ? "null" : entry.profile().generalFilter()) +
                    ", initiatorHash=" + initiatorHash +
                    ", responseHeader=" + ((entry.responseHeader() == null) ? "null" : entry.responseHeader().toString()) +
                    ", url=" + entry.url()); // DEBUG
        // parse content
        parsingStartTime = System.currentTimeMillis();
        plasmaParserDocument document = null;
        httpHeader entryRespHeader = entry.responseHeader();
        String mimeType = (entryRespHeader == null) ? null : entryRespHeader.mime();
        if (plasmaParser.supportedContent(entry.url(), mimeType)) {
            if ((entry.cacheFile().exists()) && (entry.cacheFile().length() > 0)) {
                log.logFine("(Parser) '" + entry.normalizedURLString() + "' is not parsed yet, parsing now from File");
                document = parser.parseSource(entry.url(), mimeType, entry.cacheFile());
            } else {
                log.logFine("(Parser) '" + entry.normalizedURLString() + "' cannot be parsed, no resource available");
                return;
            }
            if (document == null) {
                log.logSevere("(Parser) '" + entry.normalizedURLString() + "' parse failure");
                return;
            }
        } else {
            log.logFine("(Parser) '" + entry.normalizedURLString() + "'. Unsupported mimeType '" + ((mimeType == null) ? "null" : mimeType) + "'.");
            return;
        }
        parsingEndTime = System.currentTimeMillis();
        // determine the document date: prefer Last-Modified, then Date, then now
        Date docDate = null;
        if (entry.responseHeader() != null) {
            docDate = entry.responseHeader().lastModified();
            if (docDate == null) docDate = entry.responseHeader().date();
        }
        if (docDate == null) docDate = new Date();
        // put anchors on crawl stack
        stackStartTime = System.currentTimeMillis();
        if (((processCase == 4) || (processCase == 5)) &&
            ((entry.profile() == null) || (entry.depth() < entry.profile().generalDepth()))) {
            Map hl = document.getHyperlinks();
            Iterator i = hl.entrySet().iterator();
            String nexturlstring;
            //String rejectReason;
            Map.Entry e;
            while (i.hasNext()) {
                e = (Map.Entry) i.next();
                nexturlstring = (String) e.getKey();
                nexturlstring = htmlFilterContentScraper.urlNormalform(null, nexturlstring);

                sbStackCrawlThread.enqueue(nexturlstring, entry.url().toString(), initiatorHash, (String) e.getValue(), docDate, entry.depth() + 1, entry.profile());

                // rejectReason = stackCrawl(nexturlstring, entry.normalizedURLString(), initiatorHash, (String) e.getValue(), loadDate, entry.depth() + 1, entry.profile());
                // if (rejectReason == null) {
                //     c++;
                // } else {
                //     urlPool.errorURL.newEntry(new URL(nexturlstring), entry.normalizedURLString(), entry.initiator(), yacyCore.seedDB.mySeed.hash,
                //                               (String) e.getValue(), rejectReason, new bitfield(plasmaURL.urlFlagLength), false);
                // }
            }
            log.logInfo("CRAWL: ADDED " + hl.size() + " LINKS FROM " + entry.normalizedURLString() +
                        ", NEW CRAWL STACK SIZE IS " + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE));
        }
        stackEndTime = System.currentTimeMillis();
        // create index
        String descr = document.getMainLongTitle();
        URL referrerURL = entry.referrerURL();
        String referrerHash = (referrerURL == null) ? plasmaURL.dummyHash : plasmaURL.urlHash(referrerURL);
        String noIndexReason = "unspecified";
        if (processCase == 4) {
            // proxy-load
            noIndexReason = entry.shallIndexCacheForProxy();
        } else {
            // normal crawling
            noIndexReason = entry.shallIndexCacheForCrawler();
        }
        if (noIndexReason == null) {
            // strip out words
            indexingStartTime = System.currentTimeMillis();
            log.logFine("Condensing for '" + entry.normalizedURLString() + "'");
            plasmaCondenser condenser = new plasmaCondenser(new ByteArrayInputStream(document.getText()));

            // generate citation reference
            generateCitationReference(entry.urlHash(), docDate, document, condenser);

            //log.logInfo("INDEXING HEADLINE:" + descr);
            try {
                //log.logDebug("Create LURL-Entry for '" + entry.normalizedURLString() + "', " +
                //             "responseHeader=" + entry.responseHeader().toString());
                plasmaCrawlLURL.Entry newEntry = urlPool.loadedURL.addEntry(
                        entry.url(), descr, docDate, new Date(),
                        initiatorHash,
                        yacyCore.seedDB.mySeed.hash,
                        referrerHash,
                        0, true,
                        condenser.RESULT_INFORMATION_VALUE,
                        plasmaWordIndexEntry.language(entry.url()),
                        plasmaWordIndexEntry.docType(document.getMimeType()),
                        entry.size(),
                        condenser.RESULT_NUMB_WORDS,
                        processCase);
                String urlHash = newEntry.hash();
                if (((processCase == 4) || (processCase == 5) || (processCase == 6)) && (entry.profile().localIndexing())) {
                    // remove stopwords
                    log.logInfo("Excluded " + condenser.excludeWords(stopwords) + " words in URL " + entry.url());
                    indexingEndTime = System.currentTimeMillis();

                    // do indexing
                    //log.logDebug("Create Index for '" + entry.normalizedURLString() + "'");
                    storageStartTime = System.currentTimeMillis();
                    int words = 0;
                    String storagePeerHash;
                    yacySeed seed;
                    if (((storagePeerHash = getConfig("storagePeerHash", null)) == null) ||
                        (storagePeerHash.trim().length() == 0) ||
                        ((seed = yacyCore.seedDB.getConnected(storagePeerHash)) == null)) {
                        words = wordIndex.addPageIndex(entry.url(), urlHash, docDate, condenser, plasmaWordIndexEntry.language(entry.url()), plasmaWordIndexEntry.docType(document.getMimeType()));
                    } else {
                        HashMap urlCache = new HashMap(1);
                        urlCache.put(newEntry.hash(), newEntry);
                        ArrayList tmpEntities = new ArrayList(condenser.getWords().size());
                        int quality = 0;
                        try {
                            quality = condenser.RESULT_INFORMATION_VALUE;
                        } catch (NumberFormatException e) {
                            System.out.println("INTERNAL ERROR WITH CONDENSER.INFORMATION_VALUE: " + e.toString() + ": in URL " + newEntry.url().toString());
                        }
                        String language = plasmaWordIndexEntry.language(entry.url());
                        char doctype = plasmaWordIndexEntry.docType(document.getMimeType());
                        // iterate over all words
                        Iterator i = condenser.getWords().iterator();
                        int p = 0;
                        while (i.hasNext()) {
                            String word = (String) i.next();
                            int count = condenser.wordCount(word);
                            String wordHash = plasmaWordIndexEntry.word2hash(word);
                            plasmaWordIndexEntity wordIdxEntity = new plasmaWordIndexEntity(wordHash);
                            plasmaWordIndexEntry wordIdxEntry = new plasmaWordIndexEntry(urlHash, count, p++, 0, 0,
                                    plasmaWordIndex.microDateDays(docDate), quality, language, doctype, true);
                            wordIdxEntity.addEntry(wordIdxEntry);
                            tmpEntities.add(wordIdxEntity);
                            // wordIndex.addEntries(plasmaWordIndexEntryContainer.instantContainer(wordHash, System.currentTimeMillis(), entry));
                        }
                        //System.out.println("DEBUG: plasmaSearch.addPageIndex: added " + condenser.getWords().size() + " words, flushed " + c + " entries");
                        words = condenser.getWords().size();
                        // transferring the index to the storage peer
                        String error = yacyClient.transferIndex(seed, (plasmaWordIndexEntity[]) tmpEntities.toArray(new plasmaWordIndexEntity[tmpEntities.size()]), urlCache, true, 120000);
                        if (error != null) {
                            // transfer failed: fall back to local indexing
                            words = wordIndex.addPageIndex(entry.url(), urlHash, docDate, condenser, plasmaWordIndexEntry.language(entry.url()), plasmaWordIndexEntry.docType(document.getMimeType()));
                        }
                        // cleanup
                        for (int j = 0; j < tmpEntities.size(); j++) {
                            plasmaWordIndexEntity tmpEntity = (plasmaWordIndexEntity) tmpEntities.get(j);
                            try { tmpEntity.close(); } catch (Exception e) {}
                        }
                    }
                    storageEndTime = System.currentTimeMillis();
                    if (log.isLoggable(Level.INFO)) {
                        log.logInfo("*Indexed " + words + " words in URL " + entry.url() +
                                    " [" + entry.urlHash() + "]" +
                                    "\n\tDescription: " + descr +
                                    "\n\tMimeType: " + document.getMimeType() + " | " +
                                    "Size: " + document.text.length + " bytes | " +
                                    "Anchors: " + ((document.anchors == null) ? 0 : document.anchors.size()) +
                                    "\n\tStackingTime: " + (stackEndTime - stackStartTime) + " ms | " +
                                    "ParsingTime: " + (parsingEndTime - parsingStartTime) + " ms | " +
                                    "IndexingTime: " + (indexingEndTime - indexingStartTime) + " ms | " +
                                    "StorageTime: " + (storageEndTime - storageStartTime) + " ms");
                    }
                    // if this was performed for a remote crawl request, notify the requester
                    if ((processCase == 6) && (initiator != null)) {
                        log.logInfo("Sending crawl receipt for '" + entry.normalizedURLString() + "' to " + initiator.getName());
                        yacyClient.crawlReceipt(initiator, "crawl", "fill", "indexed", newEntry, "");
                    }
                } else {
                    log.logFine("Not Indexed Resource '" + entry.normalizedURLString() + "': process case=" + processCase);
                }
            } catch (Exception ee) {
                log.logSevere("Could not index URL " + entry.url() + ": " + ee.getMessage(), ee);
                if ((processCase == 6) && (initiator != null)) {
                    yacyClient.crawlReceipt(initiator, "crawl", "exception", ee.getMessage(), null, "");
                }
            }
        } else {
            log.logInfo("Not indexed any word in URL " + entry.url() + "; cause: " + noIndexReason);
            urlPool.errorURL.newEntry(entry.url(), referrerHash,
                    ((entry.proxy()) ? plasmaURL.dummyHash : entry.initiator()),
                    yacyCore.seedDB.mySeed.hash,
                    descr, noIndexReason, new bitfield(plasmaURL.urlFlagLength), true);
            if ((processCase == 6) && (initiator != null)) {
                yacyClient.crawlReceipt(initiator, "crawl", "rejected", noIndexReason, null, "");
            }
        }
        document = null;

        // removing current entry from in-process list
        synchronized (this.indexingTasksInProcess) {
            this.indexingTasksInProcess.remove(entry.urlHash());
        }

        // removing current entry from notice URL queue
        boolean removed = urlPool.noticeURL.remove(entry.urlHash()); // worked-off
        if (!removed) {
            log.logFinest("Unable to remove indexed URL " + entry.url() + " from Crawler Queue. This could be because of a URL redirect.");
        }

        // explicitly delete/free resources
        if ((entry != null) && (entry.profile() != null) && (!(entry.profile().storeHTCache()))) {
            plasmaHTCache.filesInUse.remove(entry.cacheFile());
            cacheManager.deleteFile(entry.url());
        }
        entry = null;
    }
    private void generateCitationReference(String baseurlhash, Date docDate, plasmaParserDocument document, plasmaCondenser condenser) {
        // generate citation reference
        Map hl = document.getHyperlinks();
        Iterator it = hl.entrySet().iterator();
        String nexturlhash;
        StringBuffer cpg = new StringBuffer(12 * (hl.size() + 1) + 1);
        StringBuffer cpl = new StringBuffer(12 * (hl.size() + 1) + 1);
        String lhp = baseurlhash.substring(6); // local hash part
        int GCount = 0;
        int LCount = 0;
        while (it.hasNext()) {
            nexturlhash = plasmaURL.urlHash((String) ((Map.Entry) it.next()).getKey());
            if (nexturlhash != null) {
                if (nexturlhash.substring(6).equals(lhp)) {
                    // link to a local resource: store only the distinguishing hash prefix
                    cpl.append(nexturlhash.substring(0, 6));
                    LCount++;
                } else {
                    // link to a global resource: store the full hash
                    cpg.append(nexturlhash);
                    GCount++;
                }
            }
        }

        // append this reference to the buffer
        // generate header info
        String head = baseurlhash + "=" +
            plasmaWordIndex.microDateHoursStr(docDate.getTime()) +          // latest update timestamp of the URL
            plasmaWordIndex.microDateHoursStr(System.currentTimeMillis()) + // last visit timestamp of the URL
            serverCodings.enhancedCoder.encodeBase64LongSmart(LCount, 2) +  // count of links to local resources
            serverCodings.enhancedCoder.encodeBase64LongSmart(GCount, 2) +  // count of links to global resources
            serverCodings.enhancedCoder.encodeBase64LongSmart(document.getImages().size(), 2) + // count of images in document
            serverCodings.enhancedCoder.encodeBase64LongSmart(0, 2) +       // count of links to other documents
            serverCodings.enhancedCoder.encodeBase64LongSmart(document.getText().length, 3) +   // length of plain text in bytes
            serverCodings.enhancedCoder.encodeBase64LongSmart(condenser.RESULT_NUMB_WORDS, 3) + // count of all appearing words
            serverCodings.enhancedCoder.encodeBase64LongSmart(condenser.RESULT_SIMI_WORDS, 3) + // count of all unique words
            serverCodings.enhancedCoder.encodeBase64LongSmart(0, 1);        // Flags (update, popularity, attention, vote)

        //crl.append(head); crl.append('|'); crl.append(cpl); crl.append((char) 13); crl.append((char) 10);
        crg.append(head); crg.append('|'); crg.append(cpg); crg.append((char) 13); crg.append((char) 10);

        // if the buffer is full, flush it
        /*
        if (crl.length() > maxCRLDump) {
            flushCitationReference(crl, "crl");
            crl = new StringBuffer(maxCRLDump);
        }
        */
        if (crg.length() > maxCRGDump) {
            flushCitationReference(crg, "crg");
            crg = new StringBuffer(maxCRGDump);
        }
    }
    private void flushCitationReference(StringBuffer cr, String type) {
        if (cr.length() < 12) return;
        String filename = type.toUpperCase() + "-A-" + new serverDate().toShortString(true) + "." + cr.substring(0, 12) + ".cr.gz";
        File path = new File(rankingPath, (type.equals("crl")) ? "LOCAL/010_cr/" : getConfig("CRDist0Path", plasmaRankingDistribution.CR_OWN));
        path.mkdirs();
        File file = new File(path, filename);

        // generate header
        StringBuffer header = new StringBuffer(200);
        header.append("# Name=YaCy " + ((type.equals("crl")) ? "Local" : "Global") + " Citation Reference Ticket"); header.append((char) 13); header.append((char) 10);
        header.append("# Created=" + System.currentTimeMillis()); header.append((char) 13); header.append((char) 10);
        header.append("# Structure=<Referee-12>,'=',<UDate-3>,<VDate-3>,<LCount-2>,<GCount-2>,<ICount-2>,<DCount-2>,<TLength-3>,<WACount-3>,<WUCount-3>,<Flags-1>,'|',*<Anchor-" + ((type.equals("crl")) ? "6" : "12") + ">"); header.append((char) 13); header.append((char) 10);
        header.append("# ---"); header.append((char) 13); header.append((char) 10);
        cr.insert(0, header.toString());
        try {
            serverFileUtils.writeAndZip(cr.toString().getBytes(), file);
            log.logFine("wrote citation reference dump " + file.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    private void processLocalCrawling(plasmaCrawlNURL.Entry urlEntry, plasmaCrawlProfile.entry profile, String stats) {
        // work off one crawl stack entry
        if ((urlEntry == null) || (urlEntry.url() == null)) {
            log.logInfo(stats + ": urlEntry=null");
            return;
        }
        URL refererURL = null;
        String refererHash = urlEntry.referrerHash();
        if ((refererHash != null) && (!refererHash.equals(plasmaURL.dummyHash))) {
            refererURL = this.urlPool.getURL(refererHash);
        }
        cacheLoader.loadParallel(urlEntry.url(), urlEntry.name(), (refererURL != null) ? refererURL.toString() : null, urlEntry.initiator(), urlEntry.depth(), profile);
        log.logInfo(stats + ": enqueued for load " + urlEntry.url() + " [" + urlEntry.hash() + "]");
    }
    private boolean processRemoteCrawlTrigger(plasmaCrawlNURL.Entry urlEntry) {
        // returns true iff another peer has indexed (or will index) the URL
        if (urlEntry == null) {
            log.logInfo("REMOTECRAWLTRIGGER[" + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_CORE) + ", " + urlPool.noticeURL.stackSize(plasmaCrawlNURL.STACK_TYPE_REMOTE) + "]: urlEntry=null");
            return true; // superfluous request; true is correct in this context
        }

        // are we qualified?
        if ((yacyCore.seedDB.mySeed == null) ||
            (yacyCore.seedDB.mySeed.isJunior())) {
            log.logFine("plasmaSwitchboard.processRemoteCrawlTrigger: no permission");
            return false;
        }

        // check url
        if (urlEntry.url() == null) {
            log.logFine("ERROR: plasmaSwitchboard.processRemoteCrawlTrigger - url is null. name=" + urlEntry.name());
            return true;
        }
        String urlhash = plasmaURL.urlHash(urlEntry.url());

        // check remote crawl
        yacySeed remoteSeed = yacyCore.dhtAgent.getCrawlSeed(urlhash);
        if (remoteSeed == null) {
            log.logFine("plasmaSwitchboard.processRemoteCrawlTrigger: no remote crawl seed available");
            return false;
        }

        // do the request
        HashMap page = yacyClient.crawlOrder(remoteSeed, urlEntry.url(), urlPool.getURL(urlEntry.referrerHash()));

        // check success
        /*
         * the 'response' value can have one of the following values:
         *
         * negative cases, no retry
         *   denied    - the peer does not want to crawl that
         *   exception - an exception occurred
         *
         * negative case, retry possible
         *   rejected  - the peer has rejected the processing, but a retry should be possible
         *
         * positive case with crawling
         *   stacked   - the resource is processed asap
         *
         * positive case without crawling
         *   double    - the resource is already in the database, believed to be fresh and not reloaded;
         *               the resource is also returned in lurl
         */
        if ((page == null) || (page.get("delay") == null)) {
            log.logInfo("CRAWL: REMOTE CRAWL TO PEER " + remoteSeed.getName() + " FAILED. CAUSE: unknown (URL=" + urlEntry.url().toString() + ")");
            if (remoteSeed != null) yacyCore.peerActions.peerDeparture(remoteSeed);
            return false;
        } else try {
            log.logFine("plasmaSwitchboard.processRemoteCrawlTrigger: remoteSeed=" + remoteSeed.getName() + ", url=" + urlEntry.url().toString() + ", response=" + page.toString()); // DEBUG

            int newdelay = Integer.parseInt((String) page.get("delay"));
            yacyCore.dhtAgent.setCrawlDelay(remoteSeed.hash, newdelay);
            String response = (String) page.get("response");
            if (response.equals("stacked")) {
                log.logInfo(STR_REMOTECRAWLTRIGGER + remoteSeed.getName() + " PLACED URL=" + urlEntry.url().toString() + "; NEW DELAY=" + newdelay);
                return true;
            } else if (response.equals("double")) {
                String lurl = (String) page.get("lurl");
                if ((lurl != null) && (lurl.length() != 0)) {
                    String propStr = crypt.simpleDecode(lurl, (String) page.get("key"));
                    plasmaCrawlLURL.Entry entry = urlPool.loadedURL.addEntry(
                            urlPool.loadedURL.newEntry(propStr, true),
                            yacyCore.seedDB.mySeed.hash, remoteSeed.hash, 1);
                    urlPool.noticeURL.remove(entry.hash());
                    log.logInfo(STR_REMOTECRAWLTRIGGER + remoteSeed.getName() + " SUPERFLUOUS. CAUSE: " + page.get("reason") + " (URL=" + urlEntry.url().toString() + "). URL IS CONSIDERED AS 'LOADED!'");
                    return true;
                } else {
                    log.logInfo(STR_REMOTECRAWLTRIGGER + remoteSeed.getName() + " REJECTED. CAUSE: " + page.get("reason") + " (URL=" + urlEntry.url().toString() + ")");
                    return false;
                }
            } else {
                log.logInfo(STR_REMOTECRAWLTRIGGER + remoteSeed.getName() + " DENIED. RESPONSE=" + response + ", CAUSE=" + page.get("reason") + ", URL=" + urlEntry.url().toString());
                return false;
            }
        } catch (Exception e) {
            // wrong values
            log.logSevere(STR_REMOTECRAWLTRIGGER + remoteSeed.getName() + " FAILED. CLIENT RETURNED: " + page.toString(), e);
            return false;
        }
    }
    // note: SimpleDateFormat is not thread-safe; concurrent callers of dateString() share this instance
    private static SimpleDateFormat DateFormatter = new SimpleDateFormat("EEE, dd MMM yyyy");
    public static String dateString(Date date) {
        if (date == null) return ""; else return DateFormatter.format(date);
    }
2005-04-15 16:18:14 +02:00
2005-10-10 02:32:15 +02:00
public serverObjects searchFromLocal ( plasmaSearchQuery query ) {
2005-04-07 21:19:42 +02:00
2005-11-11 00:48:20 +01:00
// tell all threads to do nothing for a specific time
wordIndex . intermission ( 2 * query . maximumTime ) ;
intermissionAllThreads ( 2 * query . maximumTime ) ;
2005-04-07 21:19:42 +02:00
serverObjects prop = new serverObjects ( ) ;
2005-11-11 00:48:20 +01:00
try {
2005-04-07 21:19:42 +02:00
// filter out words that appear in bluelist
2005-10-23 19:50:27 +02:00
//log.logInfo("E");
2005-10-12 14:28:49 +02:00
query . filterOut ( blueList ) ;
2005-04-07 21:19:42 +02:00
// log
2005-10-12 14:28:49 +02:00
log . logInfo ( " INIT WORD SEARCH: " + query . queryWords + " : " + query . queryHashes + " - " + query . wantedResults + " links, " + ( query . maximumTime / 1000 ) + " seconds " ) ;
2005-04-07 21:19:42 +02:00
long timestamp = System . currentTimeMillis ( ) ;
2005-06-24 09:41:07 +02:00
// start a presearch, which makes only sense if we idle afterwards.
// this is especially the case if we start a global search and idle until search
2005-10-12 14:28:49 +02:00
//if (query.domType == plasmaSearchQuery.SEARCHDOM_GLOBALDHT) {
// Thread preselect = new presearch(query.queryHashes, order, query.maximumTime / 10, query.urlMask, 10, 3);
// preselect.start();
//}
2005-06-08 02:52:24 +02:00
2005-10-13 15:57:15 +02:00
// create a new search event
2005-10-12 14:28:49 +02:00
plasmaSearchEvent theSearch = new plasmaSearchEvent ( query , log , wordIndex , urlPool . loadedURL , snippetCache ) ;
2005-10-13 15:57:15 +02:00
plasmaSearchResult acc = theSearch . search ( ) ;
2005-04-07 21:19:42 +02:00
2005-10-13 15:57:15 +02:00
// fetch snippets
2005-10-23 19:50:27 +02:00
//if (query.domType != plasmaSearchQuery.SEARCHDOM_GLOBALDHT) snippetCache.fetch(acc.cloneSmart(), query.queryHashes, query.urlMask, 10, 1000);
2005-08-30 23:10:39 +02:00
log . logFine ( " SEARCH TIME AFTER ORDERING OF SEARCH RESULT: " + ( ( System . currentTimeMillis ( ) - timestamp ) / 1000 ) + " seconds " ) ;
            // the result is a List of urlEntry elements: prepare the answer
            if (acc == null) {
                prop.put("totalcount", "0");
                prop.put("orderedcount", "0");
                prop.put("linkcount", "0");
            } else {
                prop.put("globalresults", acc.globalContributions);
                prop.put("totalcount", acc.globalContributions + acc.localContributions);
                prop.put("orderedcount", Integer.toString(acc.sizeOrdered()));
                int i = 0;
                int p;
                URL url;
                plasmaCrawlLURL.Entry urlentry;
                String urlstring, urlname, filename, urlhash;
                String host, hash, address, descr = "";
                yacySeed seed;
                plasmaSnippetCache.result snippet;
                String formerSearch = query.words(" ");
                long targetTime = timestamp + query.maximumTime;
                if (targetTime < System.currentTimeMillis()) targetTime = System.currentTimeMillis() + 5000;
                while ((acc.hasMoreElements()) && (i < query.wantedResults) && (System.currentTimeMillis() < targetTime)) {
                    urlentry = acc.nextElement();
                    url = urlentry.url();
                    urlhash = urlentry.hash();
                    host = url.getHost();
                    if (host.endsWith(".yacyh")) {
                        // translate the host into the current IP
                        p = host.indexOf(".");
                        hash = yacySeed.hexHash2b64Hash(host.substring(p + 1, host.length() - 6));
                        seed = yacyCore.seedDB.getConnected(hash);
                        filename = url.getFile();
                        if ((seed == null) || ((address = seed.getAddress()) == null)) {
                            // the seed is not known from here
                            removeReferences(urlentry.hash(), plasmaCondenser.getWords(("yacyshare " + filename.replace('?', ' ') + " " + urlentry.descr()).getBytes()));
                            urlPool.loadedURL.remove(urlentry.hash()); // clean up
                            continue; // next result
                        }
                        url = new URL("http://" + address + "/" + host.substring(0, p) + filename);
                        urlname = "http://share." + seed.getName() + ".yacy" + filename;
                        if ((p = urlname.indexOf("?")) > 0) urlname = urlname.substring(0, p);
                        urlstring = htmlFilterContentScraper.urlNormalform(url);
                    } else {
                        urlstring = htmlFilterContentScraper.urlNormalform(url);
                        urlname = urlstring;
                    }
                    descr = urlentry.descr();

                    // check the bluelist again: filter out all links where any bluelisted word
                    // appears either in the url, the url's description, or the search word
                    // (the search word was sorted out earlier)
                    /*
                    String s = descr.toLowerCase() + url.toString().toLowerCase();
                    for (int c = 0; c < blueList.length; c++) {
                        if (s.indexOf(blueList[c]) >= 0) return;
                    }
                    */
                    //addScoreForked(ref, gs, descr.split(" "));
                    //addScoreForked(ref, gs, urlstring.split("/"));
                    if (urlstring.matches(query.urlMask)) { //.* is default
                        snippet = snippetCache.retrieve(url, query.queryHashes, false, 260);
                        if (snippet.source == plasmaSnippetCache.ERROR_NO_MATCH) {
                            // suppress this line: there is no match in that resource
                        } else {
                            prop.put("results_" + i + "_delete", "/index.html?search=" + formerSearch + "&Enter=Search&count=" + query.wantedResults + "&order=" + query.orderString() + "&resource=local&time=3&deleteref=" + urlhash + "&urlmaskfilter=.*");
                            prop.put("results_" + i + "_description", descr);
                            prop.put("results_" + i + "_url", urlstring);
                            prop.put("results_" + i + "_urlhash", urlhash);
                            prop.put("results_" + i + "_urlhexhash", yacySeed.b64Hash2hexHash(urlhash));
                            prop.put("results_" + i + "_urlname", nxTools.cutUrlText(urlname, 120));
                            prop.put("results_" + i + "_date", dateString(urlentry.moddate()));
                            prop.put("results_" + i + "_ybr", plasmaSearchPreOrder.ybr(urlentry.hash()));
                            prop.put("results_" + i + "_size", Long.toString(urlentry.size()));
                            prop.put("results_" + i + "_words", URLEncoder.encode(query.queryWords.toString(), "UTF-8"));
                            // add the snippet if available
                            if (snippet.line == null) {
                                prop.put("results_" + i + "_snippet", 0);
                                prop.put("results_" + i + "_snippet_text", "");
                            } else {
                                prop.put("results_" + i + "_snippet", 1);
                                prop.put("results_" + i + "_snippet_text", snippet.line.trim());
                            }
                            i++;
                        }
                    }
                }
                log.logFine("SEARCH TIME AFTER RESULT PREPARATION: " + ((System.currentTimeMillis() - timestamp) / 1000) + " seconds");

                // calculate some more cross-references
                long remainingTime = query.maximumTime - (System.currentTimeMillis() - timestamp);
                if (remainingTime < 0) remainingTime = 1000;
                /*
                while ((acc.hasMoreElements()) && (((time + timestamp) < System.currentTimeMillis()))) {
                    urlentry = acc.nextElement();
                    urlstring = htmlFilterContentScraper.urlNormalform(urlentry.url());
                    descr = urlentry.descr();
                    addScoreForked(ref, gs, descr.split(" "));
                    addScoreForked(ref, gs, urlstring.split("/"));
                }
                */
                //Object[] ws = ref.getScores(16, false, 2, Integer.MAX_VALUE);
                Object[] ws = acc.getReferences(16);
                log.logFine("SEARCH TIME AFTER XREF PREPARATION: " + ((System.currentTimeMillis() - timestamp) / 1000) + " seconds");
                /*
                System.out.print("DEBUG WORD-SCORE: ");
                for (int ii = 0; ii < ws.length; ii++) System.out.print(ws[ii] + ", ");
                System.out.println(" all words = " + ref.getElementCount() + ", total count = " + ref.getTotalCount());
                */
                prop.put("references", ws);
                prop.put("linkcount", Integer.toString(i));
                prop.put("results", Integer.toString(i));
            }
            // log
            log.logInfo("EXIT WORD SEARCH: " + query.queryWords + " - " +
                        prop.get("totalcount", "0") + " links found, " +
                        prop.get("orderedcount", "0") + " links ordered, " +
                        prop.get("linkcount", "?") + " links selected, " +
                        ((System.currentTimeMillis() - timestamp) / 1000) + " seconds");
            return prop;
        } catch (IOException e) {
            return null;
        }
    }
    public serverObjects searchFromRemote(plasmaSearchQuery query) {
        // tell all threads to do nothing for a specific time
        wordIndex.intermission(2 * query.maximumTime);
        intermissionAllThreads(2 * query.maximumTime);

        query.domType = plasmaSearchQuery.SEARCHDOM_LOCAL;
        serverObjects prop = new serverObjects();
        try {
            log.logInfo("INIT HASH SEARCH: " + query.queryHashes + " - " + query.wantedResults + " links");
            long timestamp = System.currentTimeMillis();
            plasmaSearchEvent theSearch = new plasmaSearchEvent(query, log, wordIndex, urlPool.loadedURL, snippetCache);
            int idxc = theSearch.localSearch();
            plasmaSearchResult acc = theSearch.order();

            // the result is a List of urlEntry elements
            if (acc == null) {
                prop.put("totalcount", "0");
                prop.put("linkcount", "0");
                prop.put("references", "");
            } else {
                prop.put("totalcount", Integer.toString(acc.sizeOrdered()));
                int i = 0;
                StringBuffer links = new StringBuffer();
                String resource = "";
                //plasmaIndexEntry pie;
                plasmaCrawlLURL.Entry urlentry;
                plasmaSnippetCache.result snippet;
                while ((acc.hasMoreElements()) && (i < query.wantedResults)) {
                    urlentry = acc.nextElement();
                    snippet = snippetCache.retrieve(urlentry.url(), query.queryHashes, false, 260);
                    if (snippet.source == plasmaSnippetCache.ERROR_NO_MATCH) {
                        // suppress this line: there is no match in that resource
                    } else {
                        if (snippet.line == null) {
                            resource = urlentry.toString();
                        } else {
                            resource = urlentry.toString(snippet.line);
                        }
                        if (resource != null) {
                            links.append("resource").append(i).append("=").append(resource).append(serverCore.crlfString);
                            i++;
                        }
                    }
                }
                prop.put("links", links.toString());
                prop.put("linkcount", Integer.toString(i));

                // prepare reference hints
                Object[] ws = acc.getReferences(16);
                StringBuffer refstr = new StringBuffer();
                for (int j = 0; j < ws.length; j++) refstr.append(",").append((String) ws[j]);
                prop.put("references", (refstr.length() > 0) ? refstr.substring(1) : refstr.toString());
            }
            // add information about forward peers
            prop.put("fwhop", ""); // hops (depth) of forwards that had been performed to construct this result
            prop.put("fwsrc", ""); // peers that helped to construct this result
            prop.put("fwrec", ""); // peers that would have helped to construct this result (recommendations)
            // log
            log.logInfo("EXIT HASH SEARCH: " + query.queryHashes + " - " + idxc + " links found, " +
                        prop.get("linkcount", "?") + " links selected, " +
                        ((System.currentTimeMillis() - timestamp) / 1000) + " seconds");
            return prop;
        } catch (IOException e) {
            return null;
        }
    }
    public serverObjects action(String actionName, serverObjects actionInput) {
        // perform an action. (not used)
        return null;
    }

    public String toString() {
        // it is possible to use this method in the cgi pages;
        // actually it is used there for testing purposes
        return "PROPS: " + super.toString() + "; QUEUE: " + sbQueue.toString();
    }
    // methods for index deletion

    public int removeAllUrlReferences(URL url, boolean fetchOnline) {
        return removeAllUrlReferences(plasmaURL.urlHash(url), fetchOnline);
    }

    public int removeAllUrlReferences(String urlhash, boolean fetchOnline) {
        // find all the words in a specific resource and remove the url reference from every word index;
        // finally, delete the url entry

        // determine the url string
        plasmaCrawlLURL.Entry entry = urlPool.loadedURL.getEntry(urlhash);
        URL url = entry.url();
        if (url == null) return 0;
        // get the set of words
        //Set words = plasmaCondenser.getWords(getText(getResource(url, fetchOnline)));
        Set words = plasmaCondenser.getWords(snippetCache.parseDocument(url, snippetCache.getResource(url, fetchOnline)).getText());
        // delete all word references
        int count = removeReferences(urlhash, words);
        // finally delete the url entry itself
        urlPool.loadedURL.remove(urlhash);
        return count;
    }
    public int removeReferences(URL url, Set words) {
        return removeReferences(plasmaURL.urlHash(url), words);
    }

    public int removeReferences(final String urlhash, final Set words) {
        // sequentially delete all word references;
        // returns the number of deletions
        Iterator iter = words.iterator();
        String word;
        final String[] urlEntries = new String[] { urlhash };
        int count = 0;
        while (iter.hasNext()) {
            word = (String) iter.next();
            // delete the URL reference in this word index
            count += wordIndex.removeEntries(plasmaWordIndexEntry.word2hash(word), urlEntries, true);
        }
        return count;
    }
    public int adminAuthenticated(httpHeader header) {
        String adminAccountBase64MD5 = getConfig("adminAccountBase64MD5", "");
        if (adminAccountBase64MD5.length() == 0) return 2; // not necessary
        String authorization = ((String) header.get(httpHeader.AUTHORIZATION, "xxxxxx")).trim().substring(6);
        if (authorization.length() == 0) return 1; // no authentication information given
        if ((((String) header.get("CLIENTIP", "")).equals("localhost")) && (adminAccountBase64MD5.equals(authorization))) return 3; // soft-authenticated for localhost
        if (adminAccountBase64MD5.equals(serverCodings.encodeMD5Hex(authorization))) return 4; // hard-authenticated, all ok
        userDB.Entry entry = this.userDB.proxyAuth((String) header.get(httpHeader.AUTHORIZATION, "xxxxxx"));
        if ((entry != null) && (entry.hasAdminRight())) return 4;
        return 0; // wrong password
    }
    public void terminate() {
        this.terminate = true;
        this.shutdownSync.V();
    }

    public boolean isTerminated() {
        return this.terminate;
    }

    public boolean waitForShutdown() throws InterruptedException {
        this.shutdownSync.P();
        return this.terminate;
    }
}