Commit Graph

47 Commits

Author SHA1 Message Date
orbiter
59d52fb4a9 fixed some problems with crawl profiles
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1967 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-26 14:52:01 +00:00
orbiter
0c9b61820e enhanced re-crawl settings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1960 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-24 13:45:01 +00:00
orbiter
708cc6c8d9 fixed some bugs for auto-filter and added monitor in profile list
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1959 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-24 00:38:40 +00:00
orbiter
63f39ac7b5 added 3 new crawling steering options:
- re-crawl by age of page (enter in minutes)
- auto-domain-filter
- maximum number of pages per domain
NOT YET TESTED!

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1949 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-23 16:05:16 +00:00
orbiter
1fc3b34be6 some pre-work (without function yet) to implement:
- re-crawl (by age of last crawl)
- auto-crawl-filter by crawl depth (to be explained..)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1948 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-03-22 15:28:17 +00:00
theli
2336f0f013 *) allow pausing/resuming of crawlJob Threads separately
   - pausing/resuming localCrawls
   - pausing/resuming remoteTriggeredCrawls
   - pausing/resuming globalCrawlTrigger
   See: http://www.yacy-forum.de/viewtopic.php?t=1591

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1723 6c8d7289-2bf4-0310-a012-ef5d649a1542
2006-02-21 11:18:48 +00:00
orbiter
37f88b4017 code cleanup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1176 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-12-06 23:51:29 +00:00
orbiter
548f0c6aff first Try with Eclipse / cleaned sources
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1157 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-12-04 23:14:30 +00:00
theli
444a5a9368 *) Bugfix for Entries with null url in GlobalQueue
See: http://www.yacy-forum.de/viewtopic.php?p=12675#12675

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1069 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-13 14:59:38 +00:00
orbiter
d2731418bf added creation of global ranking files and changed url normal form usage
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1046 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-07 12:33:02 +00:00
hydrox
cb69047b91 *) cleanup access static methods and fields
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1016 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-02 17:56:26 +00:00
hydrox
56b9f34411 *) removed unused imports
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1015 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-11-02 16:30:45 +00:00
theli
a2fa75e688 *) Asynchronous queuing of crawl job URLs (stackCrawl)
   various checks like the blacklist check or the robots.txt disallow check are now
   done by a separate thread to unburden the indexer thread(s)
   TODO: maybe we have to introduce a threadpool here if it turns out that this single
         thread is a bottleneck because of the time-consuming robots.txt downloads

*) improved index transfer
   The index selection and transmission is done in parallel now to improve index 
   transfer performance.
   TODO: maybe we could speed up performance by using multiple transmission threads in
         parallel instead of only a single one.

*) gzip encoded post requests
   it is now configurable if a gzip encoded post request should be sent on
   index transfer/distribution

*) storage Peer (very experimental and not optimized yet)
   Now it's possible to send the result of the yacy indexer thread to a remote peer
   instead of storing the indexed words locally.
   This can be done by setting the property "storagePeerHash" in the yacy config file
   - Please note that if the index transfer fails, the index is stored locally.
   - TODO: currently this index transfer is done by the indexer thread.
     To speed up the indexer
     a) this transmission should be done in parallel and
     b) multiple chunks should be bundled and transferred together


*) general performance improvements  
   - better memory cleanup after http request processing has finished
   - replacing some string concatenations with stringBuffers
   - replacing BufferedInputStreams with serverByteBuffer
   - replacing vectors with arraylists wherever possible
   - replacing hashtables with hashmaps wherever possible
   This was done because function calls to vector or hashtable functions
   take 3 times longer than calls to functions of arraylists or hashmaps.
   TODO: we should take a look at the class serverObject which is inherited from hashmap
         Do we really need a synchronization for this class?
   TODO: replace arraylists with linkedLists if random access to the list elements is not needed

*) Robots Parser supports if-modified-since downloads now
   If the downloaded robots.txt file is older than 7 days the robots parser tries to
   download the robots.txt with the if-modified-since header to avoid unnecessary downloads
   if the file was not changed. Additionally the ETag header is used to detect changes.

*) Crawler: better handling of unsupported mimeTypes + FileExtension

*) Bugfix: plasmaWordIndexEntity was not closed correctly in 
   - query.java
   - plasmaswitchboard.java

*) function minimizeUrlDB added to yacy.java 
   this function tests the current urlHashDB for unused urls
   ATTENTION: please don't use this function at the moment because
              it causes the wordIndexDB to flush all words into the
              word directory!

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@853 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-10-05 10:45:33 +00:00
low012
4dbc871524 *) Trying to get rid of possibility of exploits in IndexCreate* through HTML and JavaScript in peernames, URLs, <title>-tags etc. (see http://www.yacy-forum.de/viewtopic.php?t=1181) I hope I got them all and did not overdo it.
*) Just a tiny bit of cleaning up in News.java. (I messed it up myself some time ago.)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@749 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-19 20:36:29 +00:00
theli
e6338b4390 *) Bugfix for "Error with request: GET http://localpeer:80/IndexDelete_p.ht"
See: http://www.yacy-forum.de/viewtopic.php?p=8906

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@678 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-07 13:27:38 +00:00
theli
bead8a32aa *) IndexCreate_p.java:
   Crawler StartURLs will now also be added to the errorURL-DB if an error occurs on this url
*) kelondroStack.java, plasmaSwitchboardQueue.java
   Adding method which returns a list of all entries in the queue. This list is used by IndexCreate_p.java 
   instead of an iterator to display the indexing-list. 
   Advantages: avoid concurrent modifications of the list while displaying it. 
               Speedup because now we have to access only one sync function instead of multiple ones 
               (one for each entry)
*) IndexCreateIndexingQueue_p.java
   Using new list() function of plasmaSwitchboardQueue
*) httpdFileHandler.java
   If a servlet returns the special value "LOCATION" the httpdFileHandler does a Redirection of
   the Browser to the URL specified by the servlet. This can e.g. be used when a http get request is
   used instead of a post request, but a refresh should not be allowed.
*) IndexCreateWWWLocalQueue_p.html
   Now it's possible to delete single entries of the local crawler queue

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@626 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-09-01 07:52:46 +00:00
theli
330eae7cf3 *) Normalizing CrawlerStartURL now before crawling is started
*) CrawlWorker also does a URL normalization now before following the redirection URL
*) CrawlWorker removes redirection URL correctly from noticeURL stack now

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@571 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-21 22:52:46 +00:00
orbiter
bb3e897baf more minor changes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@488 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-03 13:43:55 +00:00
orbiter
2d8557cb10 minor changes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@487 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-03 02:02:39 +00:00
orbiter
e84a177c49 many bugfixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@475 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-02 02:18:01 +00:00
orbiter
9ee8a5ba6c fixed bug in yacynews
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@474 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-02 00:50:20 +00:00
orbiter
d34eb23e4e fixed news; added news appearance on Network and IndexCreate page; added intention string to global crawl
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@466 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-08-01 01:12:02 +00:00
orbiter
1022fbeb65 many YaCyNews fixes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@461 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-31 01:54:46 +00:00
orbiter
13abd8b6e7 added news-creation at crawl start
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@460 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-30 11:57:19 +00:00
orbiter
81e564edb8 faster crawl profile list cleanup
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@442 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-27 14:16:47 +00:00
orbiter
3470a72d48 fixed div by zero, set default delays, fixed release number format and display
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@435 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-26 11:47:50 +00:00
orbiter
be1f324fca performance setting for remote indexing configuration and latest changes for 0.39
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@424 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-22 13:56:19 +00:00
theli
5c3822d5f4 *) adding experimental support for parsing of bookmark files
See: http://www.yacy-forum.de/viewtopic.php?t=177

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@388 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-07 13:41:56 +00:00
orbiter
858cd94299 replaced indexing ram-queue by file-based stack-queue
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@381 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-06 14:48:41 +00:00
orbiter
252c6e4869 added crawl queue monitor for global crawls
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@372 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-04 15:07:33 +00:00
orbiter
9a3f80403e redesigned IndexCreate menu -- introduced submenus to enable more crawl queue control pages
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@370 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-07-04 12:42:29 +00:00
orbiter
a25b5b4986 fixed possible memory leak in htmlScraper: be aware that now links can get lost; further work necessary
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@288 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-06-16 18:31:28 +00:00
orbiter
a1ffc27041 preparations for image/movie/music indexing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@280 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-06-16 00:31:13 +00:00
orbiter
33f9315e58 implemented multithreading of indexing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@221 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-06-08 13:19:05 +00:00
orbiter
3d8a2ff937 enhanced parallelization of local/global/remote crawling
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@197 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-05-29 11:56:40 +00:00
theli
361f05978d Multiple updates regarding the yacy seedUpload facility,
optional content parsers, thread pool configuration ...

Please help me test whether everything works correctly.

*) Migration of yacy seedUpload functionality
See: http://www.yacy-forum.de/viewtopic.php?t=256
- new uploaders can now be easily introduced because of a new modular uploader system
- default uploaders are: none, file, ftp
- adding optional uploader for scp
- each uploader provides its own configuration file that will be 
  included into the settings page using the new template include feature
- Each uploader can define its libx dependencies. If not all needed libs are
  available, the uploader is deactivated automatically.

*) Migration of optional parsers
See: http://www.yacy-forum.de/viewtopic.php?t=198
- Parsers can now also define their libx dependencies
- adding parser for bzip compressed content
- adding parser for gzip compressed content
- adding parser for zip files
- adding parser for tar files
- adding parser to detect the mime-type of a file
  this is needed by the bzip/gzip Parser.java
- adding parser for rtf files
- removing extra configuration file yacy.parser
  the list of enabled parsers is now stored in the main config file

*) Adding configuration option in the performance dialog to configure
See: http://www.yacy-forum.de/viewtopic.php?t=267
- maxActive / maxIdle / minIdle values for httpd-session-threadpool
- maxActive / maxIdle / minIdle values for crawler-threadpool

*) Changing Crawling Filter behaviour
See: http://www.yacy-forum.de/viewtopic.php?p=2631

*) Replacing some hardcoded strings with the proper constants of the httpHeader class

*) Adding new libs to libx directory. These libs are
- needed by new content parsers
- needed by new optional seed uploader
- needed by SOAP API (which will be committed later)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@126 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-05-17 08:25:04 +00:00
theli
014b139c73 *) Bugfix of "There are xxx entries in the crawler queue. Showing 0 most recent entries" Bug.
see: http://www.yacy-forum.de/viewtopic.php?t=338
see: http://www.yacy-forum.de/viewtopic.php?p=2552

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@122 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-05-15 10:09:15 +00:00
theli
1d38599598 *) changing comment
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@119 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-05-14 09:50:22 +00:00
theli
d2c4e9a55e *) Implementing yacy forum wishlist item: "Pause Crawling"
see: http://www.yacy-forum.de/viewtopic.php?t=48



git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@118 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-05-14 09:41:05 +00:00
rramthun
85c2f3be8a Fixed spelling mistakes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@110 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-05-12 17:50:45 +00:00
theli
e7f7aa0bb9 *) Import statements reorganized
Now it's easier to determine which class really uses which other class

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@83 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-05-05 05:36:42 +00:00
theli
c9c0a1f11c *) Trying to speedup local crawling
- introduction of a threadpool for crawling
- introduction of a job queue to avoid busy waiting for a free crawler slot

*) New classes added
- queue for receiving of crawler jobs
- semaphore class to do reader/writer synchronization (mutual exclusion)
- message object to hold all needed data about a crawler job

*) Trying to solve session-thread shutdown problem
- session thread stopped variable is now set from outside before interrupting the
  session thread.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@39 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-04-21 10:31:40 +00:00
(no author)
d5ff81c636 *) Undoing last changes. Sorry.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@25 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-04-19 06:52:04 +00:00
(no author)
0a6cf3f5e7 *) Bugfix: Reference to plasmaHTCache.Entry.urlString was not set correctly
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@23 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-04-19 06:33:53 +00:00
orbiter
b9203bdb50 bug fixes and code cleaning
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@22 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-04-15 14:18:14 +00:00
orbiter
e7d055b98e very experimental integration of the new generic parser and optional disabling of bluelist filtering in proxy. Does not yet work properly. To disable the disable-feature, the presence of a non-empty bluelist is necessary
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@17 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-04-13 15:52:00 +00:00
orbiter
248077d3f0 initial load with yacy 0.36
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@1 6c8d7289-2bf4-0310-a012-ef5d649a1542
2005-04-07 19:19:42 +00:00