102 Commits

Author SHA1 Message Date
orbiter
b2fe4b7b1a added handling of yacybot entries in robots.txt: if such an entry addresses the yacy peer
(directly or indirectly) and grants a crawl-delay of 0, then all forced pause mechanisms in YaCy are switched off and the domain is crawled at full speed.
crawl-delay values can be assigned to either
- all yacy peers using the user-agent yacybot
- a specific peer with peer name <peer-name>.yacy or
- a specific peer with peer hash <peer-hash>.yacyh


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7639 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-04-03 23:39:45 +00:00
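
The matching rule described in the commit above can be sketched as follows. This is a minimal illustration, not YaCy's actual robots.txt parser: the class name, the method names, and the choice to compare the peer hash case-sensitively are all assumptions made for the example. Only the three agent forms (`yacybot`, `<peer-name>.yacy`, `<peer-hash>.yacyh`) and the meaning of a crawl-delay of 0 come from the commit message.

```java
import java.util.Locale;
import java.util.OptionalInt;

/** Hypothetical sketch; names and parsing details are assumptions, not YaCy code. */
class RobotsCrawlDelaySketch {

    /** True if a User-agent value addresses this peer directly or indirectly. */
    static boolean addressesPeer(String agent, String peerName, String peerHash) {
        String a = agent.trim();
        return a.equalsIgnoreCase("yacybot")              // all yacy peers
            || a.equalsIgnoreCase(peerName + ".yacy")     // a peer by name
            || a.equals(peerHash + ".yacyh");             // a peer by hash (case-sensitive here by assumption)
    }

    /** Returns the first Crawl-delay found in a group that addresses this peer. */
    static OptionalInt crawlDelayFor(String robotsTxt, String peerName, String peerHash) {
        boolean inMatchingGroup = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw;
            int commentStart = line.indexOf('#');
            if (commentStart >= 0) line = line.substring(0, commentStart);
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue; // blank or malformed line
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                boolean match = addressesPeer(value, peerName, peerHash);
                // consecutive User-agent lines belong to the same group
                inMatchingGroup = lastWasAgent ? (inMatchingGroup || match) : match;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (inMatchingGroup && key.equals("crawl-delay")) {
                    try {
                        return OptionalInt.of(Integer.parseInt(value));
                    } catch (NumberFormatException e) {
                        // non-integer delay: skip it in this sketch
                    }
                }
            }
        }
        return OptionalInt.empty();
    }

    /** A crawl-delay of 0 for this peer means forced pauses can be switched off. */
    static boolean fullSpeedGranted(String robotsTxt, String peerName, String peerHash) {
        OptionalInt delay = crawlDelayFor(robotsTxt, peerName, peerHash);
        return delay.isPresent() && delay.getAsInt() == 0;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nCrawl-delay: 10\n\nUser-agent: yacybot\nCrawl-delay: 0\n";
        System.out.println(fullSpeedGranted(robots, "mypeer", "examplehash"));
    }
}
```

Note that a `Crawl-delay: 0` under `User-agent: *` does not count in this sketch, since the commit only mentions entries that address yacybot or a specific peer.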
orbiter
f3baaca920 - enhancements to DNS IP caching and crawler speed
- bugfixes (NPEs)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7619 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-22 09:34:10 +00:00
orbiter
2b5f8585bf performance hack for Balancer and ip address parsing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7608 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-17 21:09:18 +00:00
orbiter
3820525464 more memory protection: auto-flush of caches in case of memory shortage
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7575 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-09 16:32:34 +00:00
orbiter
7962d35425 - removed file upload function in crawl start and replaced it with an input field for a file path where the crawl start file is loaded. This was necessary to support the API steering for file crawl starts, for two reasons:
1) if the file is changed for a re-crawl this is not reflected in the steering because it would take the previously uploaded crawl start file
2) for security reasons, browsers do not submit the full path of the selected file even if this path is shown in the input field. There is no work-around or hack to make the submission of the full path possible

- fixed deletion of crawl start point urls in crawl stack and balancer double-check
- fixed a problem with steering self-call (no resolving of localhost)
- added more logging for the crawler to supervise why crawl urls are not taken by the loader
- added a javascript onload-function to select domain restriction in all cases where a crawl is started from a file or from a url
- fixed the restrict-to-domain pattern computation, added a 'www.'-prefix and added this functionality also to a crawl start from file 

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7574 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-09 12:50:39 +00:00
orbiter
cb1f49d0f2 replaced all 'new String' calls that used the default encoding (encoding argument missing) or a UTF-8 encoding name with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache lookup for the Charset object, which is done by hashing the String 'UTF-8'.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7558 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-03-07 20:36:40 +00:00
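
The idea in the commit above can be illustrated roughly like this. The class and method names are hypothetical, and the sketch uses `StandardCharsets.UTF_8`, which only became available with Java 7; the original change most likely held a constant obtained once via `Charset.forName("UTF-8")`, which is an assumption here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not the actual YaCy utility class. */
class Utf8Sketch {

    /** Resolved once, so no charset-name lookup is needed per conversion. */
    static final Charset UTF8 = StandardCharsets.UTF_8;

    /** Replaces new String(bytes) and new String(bytes, "UTF-8"). */
    static String utf8String(byte[] bytes) {
        return new String(bytes, UTF8);
    }

    /** Counterpart for the opposite direction. */
    static byte[] utf8Bytes(String s) {
        return s.getBytes(UTF8);
    }

    public static void main(String[] args) {
        byte[] encoded = utf8Bytes("Gr\u00fc\u00dfe");
        System.out.println(utf8String(encoded) + " (" + encoded.length + " bytes)");
    }
}
```

Besides skipping the name lookup, the `Charset` overloads do not declare `UnsupportedEncodingException`, so callers need no try/catch around the conversion.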
low012
ce012e11aa *) deleted LogStatistics since the page did not work anymore and it seemed to be obsolete, tell me if you miss it and I will add it again
*) a few minor changes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7494 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-18 01:46:07 +00:00
low012
c5051c4020 *) fixed bug which caused entries to not be deleted when deleting by URL on IndexCreateWWWLocalQueue_p.html (I hope this did not break anything else)
*) cleaned up code a little bit

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7493 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-18 01:25:46 +00:00
orbiter
4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
- some restructuring of the document counting and logging structures was necessary
- better abstraction of CrawlProfiles
- added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation
- more refactoring to make the LibraryProvider cleaner
- some refactoring of the Condenser class

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-02-12 00:01:40 +00:00
orbiter
fffb91447a fixed crawl queue delete function
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7357 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-12-01 14:55:40 +00:00
orbiter
93c535d111 fixed http://forum.yacy-websuche.de/viewtopic.php?p=21113#p21113
fixed a concurrent modification exception during search and a time-out problem

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7298 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-11-03 20:58:50 +00:00
orbiter
e3e3b49d52 - enhanced main release recognition
- yacybot user agent now includes the yacy network name (not the peer name!)
- refactoring and clean-up (mostly turned tabs into spaces)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7266 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-10-24 21:43:01 +00:00
orbiter
65eaf30f77 redesign of crawl profiles data structure. target will be:
- permanent storage of auto-dom statistics in profile
- storage of profiles in WorkTable data structure
not finished yet. No functional change yet.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7088 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-31 15:47:47 +00:00
orbiter
7fdb17bb96 redirect uncaught exceptions to logging + small other changes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7042 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-16 12:33:06 +00:00
orbiter
87b1684211 additional double-check in balancer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7035 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-12 17:18:15 +00:00
orbiter
a82a93f2fc - better url double check in crawler
- more logging for error urls

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7032 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-11 09:54:18 +00:00
orbiter
5924a0d851 - enhanced concurrency in database index access for multicore
- added statistics about database index caches in PerformanceMemory_p.html
- adapted many classes to use the new statistics
- added missing close statements

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7018 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-08-03 04:58:48 +00:00
orbiter
a83772c71b fixes and enhancements for balancer:
- crawl lists for each domain now use a HandleSet which should use less memory than LinkedLists
- but: fill more entries into the domain lists (all available entries)
- fixes to selection criteria (best domain selection)

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6909 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-06-01 09:30:23 +00:00
orbiter
9cde05418f fixed url crawl list display
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6908 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-31 00:27:00 +00:00
orbiter
30b337fa9f fixes to balancer when crawling filesystem (problem was: host == null)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6906 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-30 11:17:38 +00:00
orbiter
844853243a fixed balancer time guessing
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6905 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-30 10:28:42 +00:00
orbiter
3f93a0cc8f redesign of remote proxy settings
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6903 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-26 00:01:16 +00:00
orbiter
2126c03a62 - removed download-limit that can be given for the crawler for non-crawler download tasks. This was necessary because the same procedure was used for other downloads like for the download of dictionary files where a limit is not useful. The limit still stays for the indexer
- migrated the opengeodb downloader to a new version of the opengeodb-dump


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6873 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-05-14 18:30:11 +00:00
orbiter
40a8d132d9 tried to fix 100% CPU when calling Balancer.top()
see also: http://forum.yacy-websuche.de/viewtopic.php?p=19978#p19978

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6844 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-27 22:37:50 +00:00
orbiter
90c3e5d6f6 - cleanup, removed unused imports
- added crawling queue sizes to /api/status_p.xml, syntax same as in queues_p.html
- fixed a bug in queue enumeration that caused an out-of-bounds exception

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6842 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-27 21:47:41 +00:00
orbiter
8c40f1cb8e self-healing for broken table files (may cause other problems, but better than nothing)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6826 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-21 11:29:27 +00:00
orbiter
8b8107b2a3 reduced IO-load and synchronization/blocking
- enhanced the Balancer performance when building new domain stacks using a new Table buffer
- added the new Table buffer BufferedObjectIndex class
- changed order of access to LURL-read (preferring segment over Crawl Queues), which will reduce blocking time on balancer
- fixed PPM setting in Crawler_p servlet (had doubled values)
- reduced synchronization in IndexCell because it is not necessary: reduced blocking during indexing/merging/dumping
- removed did-you-mean cache in IndexCell because that caused too much overhead and more memory usage but was not very useful. This also reduces deadlocks that could be caused when searches are performed during indexing.




git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6819 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-18 21:55:20 +00:00
orbiter
1a8a134e0c continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775 and continued in SVN 6790
The result should be less usage of new String() and less memory usage (since a String-encapsulated byte[] has 40 bytes overhead)


git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6815 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-15 13:22:59 +00:00
orbiter
48b9371735 changed balancer re-load counter. causes less blocking here when doing intranet indexing.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6812 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-14 20:57:26 +00:00
orbiter
25aef069a6 continuing String-hash - to - byte[]-hash redesign that was started in SVN 6775
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6790 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-04-08 00:11:32 +00:00
orbiter
ba51d140e1 added more info in assert in balancer
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6782 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-28 22:59:19 +00:00
orbiter
1e8e79b9ef redesign of reference hash (URL-hash) parameter hand-over:
pass value as byte[], not as String. As a result, fewer
byte[] <-> String conversions should be made during time-critical tasks.
This redesign is not yet complete, more to come ..

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6775 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-03-26 18:33:20 +00:00
orbiter
46c4f8b68a better look-ahead into the crawl queue: show more on crawl monitor
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6699 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-02-24 23:11:58 +00:00
orbiter
dd459281c8 applied code changes that are recommended by PMD
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6563 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-10 23:09:48 +00:00
orbiter
d77a8f3b3e added some modifications recommended by PMD for better performance
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6560 6c8d7289-2bf4-0310-a012-ef5d649a1542
2010-01-10 01:40:26 +00:00
orbiter
362b7a929b added extensive memory protection logic to avoid out of memory errors that may be caused by the RowCollection memory allocation function
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6521 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-09 23:27:26 +00:00
orbiter
4a5100789f replaced _all_ size() == 0 with isEmpty() and all size() > 0 with !isEmpty(). The isEmpty() method is much faster in some cases, especially when used to access badly balanced hashtables where a size() operation becomes a large iteration.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6510 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-12-02 00:37:59 +00:00
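
The replacement described in the commit above can be illustrated as below. The sketch uses `java.util.concurrent.ConcurrentLinkedQueue` as a stand-in, because its documentation states that `size()` is not a constant-time operation; the data structures the commit actually refers to are YaCy's own and are not shown here. The class and method names are hypothetical.

```java
import java.util.Collection;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical illustration; the collection type is a stand-in, not YaCy code. */
class IsEmptySketch {

    /** Old pattern: may traverse the whole collection just to compare with 0. */
    static boolean hasWorkViaSize(Collection<String> queue) {
        return queue.size() > 0;
    }

    /** New pattern: only needs to know whether a first element exists. */
    static boolean hasWork(Collection<String> queue) {
        return !queue.isEmpty();
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 100_000; i++) {
            queue.add("http://example.org/page" + i);
        }
        // Both calls agree on the answer; they differ only in cost.
        System.out.println(hasWorkViaSize(queue) == hasWork(queue));
    }
}
```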
orbiter
4431b9767e added about 450 replacements for printStackTrace() methods to pipe such traces into the log at DATA/LOG/
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6458 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-11-05 20:28:37 +00:00
orbiter
bea3b99aff moved table and util classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6397 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-10 01:14:19 +00:00
orbiter
194da25a2f moved kelondro index
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6393 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-09 23:32:08 +00:00
orbiter
4446acc8cd moved kelondro order
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6392 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-09 23:22:22 +00:00
orbiter
f677d534b1 start of a really extensive refactoring which will produce a hierarchical package structure with the domain yacy.net as package root
- moved here the logging classes as part of the new net.yacy.kelondro package

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6391 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-10-09 23:13:30 +00:00
orbiter
2e6bdce086 - added more logging to balancer
- changed balancer logic slightly

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6350 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-27 22:35:22 +00:00
hermens
62a7341c4d Fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2204
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6341 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-24 11:38:15 +00:00
orbiter
573d03c7d7 added configuration to enable ram table copy
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6304 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-09-07 20:30:57 +00:00
orbiter
cf739edc2e fix for possible deadlock, see
http://forum.yacy-websuche.de/viewtopic.php?p=17017#p17017

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6252 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-08-07 12:11:22 +00:00
orbiter
c0e17de2fb - fixes for some problems with the new crawling/caching strategies
- speed enhancements for the cache-only cache policy by using special no-delay rules in the balancer
- fixed some deadlock- and 100% CPU problems in the balancer

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6243 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-25 21:38:57 +00:00
orbiter
c6c97f23ad - added cache usage properties to crawl start
- added special rule to balancer to omit forced delays if cache is used exclusively
- extended the htCache size by default to 32GB

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6241 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-24 11:54:04 +00:00
orbiter
b332dfad67 - inserted request object into response object which carries this now instead of generating new objects
- fixed a problem with the crawler introduced in SVN 6216

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6222 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-15 23:08:35 +00:00
orbiter
ca72ed7526 - removed superfluous crawl cache
- refactoring of crawler classes

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6221 6c8d7289-2bf4-0310-a012-ef5d649a1542
2009-07-15 21:07:46 +00:00