Commit Graph

498 Commits

Author SHA1 Message Date
Michael Peter Christen
461a0ce052 removed warnings 2012-06-05 20:03:43 +02:00
Michael Peter Christen
9b4c699526 enhanced location search:
- search requests are now made using a map boundary
- search results are only computed for the map boundary
- the number of results is adapted to the results in the visible range
- added a double-buffering for the search result markers
- added a search query option for the search results:
/radius/<lat>/<lon>/<radius>
2012-05-31 22:39:53 +02:00
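A minimal sketch of how such a /radius/<lat>/<lon>/<radius> modifier could be split out of a query string; the class and pattern below are illustrative assumptions, not the actual YaCy servlet code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// hypothetical helper: extracts the /radius/<lat>/<lon>/<radius> modifier from a query
public class RadiusModifier {
    private static final Pattern RADIUS =
        Pattern.compile("/radius/(-?\\d+(?:\\.\\d+)?)/(-?\\d+(?:\\.\\d+)?)/(\\d+(?:\\.\\d+)?)");

    public final double lat, lon, radius;

    private RadiusModifier(final double lat, final double lon, final double radius) {
        this.lat = lat; this.lon = lon; this.radius = radius;
    }

    // returns null if the query does not contain the modifier
    public static RadiusModifier parse(final String query) {
        final Matcher m = RADIUS.matcher(query);
        if (!m.find()) return null;
        return new RadiusModifier(Double.parseDouble(m.group(1)),
                                  Double.parseDouble(m.group(2)),
                                  Double.parseDouble(m.group(3)));
    }

    public static void main(String[] args) {
        RadiusModifier r = RadiusModifier.parse("hotel /radius/52.52/13.40/10");
        System.out.println(r.lat + " " + r.lon + " " + r.radius); // 52.52 13.4 10.0
    }
}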
Michael Peter Christen
16b21f7a5b Added more steering in Crawler_p.html interface 2012-05-23 18:00:37 +02:00
Michael Peter Christen
acc19e190d hack against 100% cpu during crawl delete 2012-05-23 15:45:07 +02:00
Michael Peter Christen
c15fcde1c8 add-on to latest commit 2012-05-21 17:52:30 +02:00
Michael Peter Christen
cf47d94888 performance hack to parse numbers inside of substrings without actually
generating a substring. This avoids the allocation of a String object
each time a substring is parsed. Should reduce CPU load during RWI
transmission.
2012-05-21 13:40:46 +02:00
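The technique, sketched as a hypothetical helper (the actual method name and location in the YaCy sources may differ): parse the digits directly from a character range instead of allocating a substring first:

// hypothetical example of the technique: parse an int from s[from..to)
// without allocating an intermediate String via substring()
public final class SubstringParse {
    public static int parseInt(final String s, final int from, final int to) {
        int result = 0;
        boolean negative = false;
        int i = from;
        if (s.charAt(i) == '-') { negative = true; i++; }
        for (; i < to; i++) {
            final char c = s.charAt(i);
            if (c < '0' || c > '9') throw new NumberFormatException(s.substring(from, to));
            result = result * 10 + (c - '0');
        }
        return negative ? -result : result;
    }

    public static void main(String[] args) {
        System.out.println(parseInt("id=-4711;", 3, 8)); // -4711
    }
}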
Michael Peter Christen
7e0ddbd275 added a "fromCache" flag in Response object to omit one cache.has()
check during snippet generation. This should cause less blocking
2012-05-21 03:03:47 +02:00
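Roughly the idea, with assumed field and constructor names rather than the real Response class:

// assumed sketch: the loader records whether the document came from the cache,
// so the snippet generator can skip an extra cache.has(url) lookup
class Response {
    final byte[] content;
    final boolean fromCache;

    Response(final byte[] content, final boolean fromCache) {
        this.content = content;
        this.fromCache = fromCache;
    }
}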
Michael Peter Christen
f294f2e295 bugfix to http://bugs.yacy.net/view.php?id=181
tried to make a bit less 'noise' towards the DNS server

also included: less processes in snippet fetch to reduce load during
search on small computers
2012-05-19 01:06:33 +02:00
Michael Peter Christen
e7e381d110 added configuration to switch off redirection following in crawler 2012-05-15 12:25:46 +02:00
Michael Peter Christen
70505107ca enhanced crawler/balancer: better estimation of the remaining waiting time 2012-05-15 12:24:54 +02:00
Michael Peter Christen
f150bc218b fixed bug in solr error document 2012-05-14 14:56:21 +02:00
Roland 'Quix0r' Haeder
a093ccf5eb Now using synchronization in all close() methods to make sure all objects
are 'closed' in an ordered way

Conflicts:
	source/de/anomic/http/server/ChunkedInputStream.java
	source/de/anomic/http/server/ChunkedOutputStream.java
	source/de/anomic/http/server/ContentLengthInputStream.java
	source/net/yacy/cora/protocol/Domains.java
	source/net/yacy/cora/services/federated/solr/SolrShardingConnector.java
	source/net/yacy/cora/services/federated/solr/SolrSingleConnector.java
	source/net/yacy/document/content/dao/PhpBB3Dao.java
	source/net/yacy/document/parser/html/AbstractTransformer.java
	source/net/yacy/kelondro/blob/BEncodedHeap.java
	source/net/yacy/kelondro/blob/HeapReader.java
	source/net/yacy/kelondro/index/RAMIndexCluster.java
	source/net/yacy/kelondro/io/ByteCountInputStream.java
	source/net/yacy/kelondro/logging/ConsoleOutErrHandler.java
	source/net/yacy/kelondro/table/SQLTable.java
2012-05-14 07:41:55 +02:00
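The pattern applied here, sketched on a generic stream rather than one of the listed classes: a synchronized, idempotent close() so that concurrent callers cannot interleave with the shutdown sequence:

import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

// generic illustration of the pattern: synchronized, idempotent close()
class GuardedStream implements Closeable {
    private InputStream in;

    GuardedStream(final InputStream in) { this.in = in; }

    @Override
    public synchronized void close() throws IOException {
        if (this.in == null) return; // already closed
        try {
            this.in.close();
        } finally {
            this.in = null;
        }
    }
}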
Michael Peter Christen
ba6aaabc51 refactoring + parser bugfixes 2012-05-04 17:28:27 +02:00
Michael Peter Christen
659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
queue and not from virtual documents generated by the parser.
- The parser now generates nice description texts for NOLOAD entries
which shall make it possible to find media content using the search
index and not using the media prefetch algorithm during search (which
was costly)
- Removed the media-search prefetch process from image search
2012-04-24 16:07:03 +02:00
Michael Peter Christen
f5efdb21fd refactoring 2012-04-24 12:54:41 +02:00
Michael Peter Christen
f8cd57c92f new indexing strategy: ALL links that appear anywhere are indexed, not
only links where the content can be parsed. All non-parseable links are
placed into the noload queue. The search process must therefore be able
to filter out non-text search results.
- This fixes the problem that image search results appeared in the text
search.
- The interactive search can retrieve now ALL types of links
- The p2p interface is now extended to retrieve only certain types of
links (text, image, video, apps)
- The search process has an extension to filter the right document type
according to the search query
2012-04-22 02:05:17 +02:00
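An illustrative sketch of such a document-type filter; the enum and extension table below are assumptions, the real logic lives in YaCy's Classification code in the cora package:

import java.util.Locale;

// hypothetical illustration of filtering search results by document type
enum DocType {
    TEXT, IMAGE, VIDEO, APPS;

    static DocType fromExtension(final String url) {
        final String ext = url.substring(url.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "png": case "jpg": case "gif": return IMAGE;
            case "avi": case "mp4": return VIDEO;
            case "apk": case "exe": return APPS;
            default: return TEXT;
        }
    }
}

class ResultFilter {
    static boolean accept(final String url, final DocType wanted) {
        return DocType.fromExtension(url) == wanted;
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/a.jpg", DocType.IMAGE)); // true
        System.out.println(accept("http://example.com/a.jpg", DocType.TEXT));  // false
    }
}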
Michael Peter Christen
a1a5b015d8 refactoring: moved document Classification to cora package 2012-04-21 21:31:13 +02:00
Michael Peter Christen
a5d7da68a0 refactoring: removed dependency from switchboard in Balancer/CrawlQueues 2012-04-21 13:47:48 +02:00
Michael Peter Christen
33d1062c79 refactoring: the cache belongs to the crawler 2012-04-21 13:34:07 +02:00
Michael Christen
22f05c83ff fixed default must-match filter for full domain crawls - the old filter
was too restrictive and did not allow intranet crawls
2012-03-28 21:50:00 +02:00
Michael Peter Christen
0cc0290978 bugfix for a must-not-match pattern check. This bug did not make the
check semantically wrong, but a trick that should have prevented an IP lookup
when the filter was not used did not take effect. The bugfix gives
crawling a huge speed boost for noload URLs!
2012-02-27 00:52:44 +01:00
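The trick in question, sketched with assumed names (the placeholder constant and method below are not the actual crawler API): resolve the host's IP only if a real must-not-match filter for IPs is configured:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.regex.Pattern;

// hypothetical sketch of the short-circuit: if no must-not-match filter is set,
// the (expensive) DNS lookup of the host IP is skipped entirely
class IpFilterCheck {
    static final String MATCH_NEVER = ""; // assumed placeholder for "no filter configured"

    static boolean deniedByIpFilter(final String host, final String mustNotMatchIPs) throws UnknownHostException {
        if (mustNotMatchIPs == null || mustNotMatchIPs.equals(MATCH_NEVER)) {
            return false; // no filter: do not even resolve the IP
        }
        final String ip = InetAddress.getByName(host).getHostAddress(); // DNS lookup only when needed
        return Pattern.matches(mustNotMatchIPs, ip);
    }
}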
Michael Peter Christen
2fc8ecee36 ConcurrentLinkedQueue has a VERY long return time on the .size() method.
See
http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html

and the following test program:

import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueLengthTimeTest {

    // add c elements, calling size() before every add, and return the elapsed time
    public static long countTest(Queue<Integer> q, int c) {
        long t = System.currentTimeMillis();
        for (int i = 0; i < c; i++) {
            q.add(q.size());
        }
        return System.currentTimeMillis() - t;
    }

    public static void main(String[] args) {
        int c = 1;
        for (int i = 0; i < 100; i++) {
            Runtime.getRuntime().gc();
            long t1 = countTest(new ArrayBlockingQueue<Integer>(c), c);
            Runtime.getRuntime().gc();
            long t2 = countTest(new LinkedBlockingQueue<Integer>(), c);
            Runtime.getRuntime().gc();
            long t3 = countTest(new ConcurrentLinkedQueue<Integer>(), c);

            System.out.println("count = " + c + ": ArrayBlockingQueue = " + t1
                    + ", LinkedBlockingQueue = " + t2 + ", ConcurrentLinkedQueue = " + t3);
            c = c * 2;
        }
    }
}
2012-02-27 00:42:32 +01:00
Michael Peter Christen
c6c61be3f0 fix for http://bugs.yacy.net/view.php?id=148 2012-02-24 00:38:57 +01:00
Michael Peter Christen
0d148c3353 more logging in resource observer 2012-02-23 01:20:42 +01:00
Michael Peter Christen
2fa037ae1d enhanced crawler 2012-02-23 01:20:24 +01:00
Lotus
ee89cf5ae5 fix must match filter for full domain crawl
allow:
http://www.example.com
http://www.example.com/
http://www.example.com/abc.html?xyz=q
block:
http://www.example.com.cn
http://www.example.com.cn/dsf
2012-02-07 16:13:13 +01:00
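A hedged example of a must-match pattern with the behaviour described above; the exact regex generated by the crawler may differ, but anchoring the host on a path separator or end of string gives the listed allow/block results:

import java.util.regex.Pattern;

// illustrative test of a full-domain must-match filter
public class DomainFilterTest {
    public static void main(String[] args) {
        final Pattern mustMatch = Pattern.compile("https?://www\\.example\\.com(/.*)?");
        final String[] urls = {
            "http://www.example.com",                // allowed
            "http://www.example.com/",               // allowed
            "http://www.example.com/abc.html?xyz=q", // allowed
            "http://www.example.com.cn",             // blocked
            "http://www.example.com.cn/dsf"          // blocked
        };
        for (final String u : urls) {
            System.out.println(u + " -> " + mustMatch.matcher(u).matches());
        }
    }
}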
Michael Peter Christen
9ad1d8dde2 complete redesign of crawl queue monitoring: do not look at a
ready-prepared crawl list but at the stacks of the domains that are
stored for balanced crawling. This also affects the balancer, since it
no longer needs to prepare the pre-selected crawl list for monitoring. As
an effect:
- it is no longer possible to see the exact order of the next to-be-crawled
links, since that depends on the actual state of the balancer stack the
next time another url is requested for loading
- the balancer works better since the next url can be selected according
to the current situation and not according to a pre-selected order.
2012-02-02 21:33:42 +01:00
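A rough sketch of the idea with a hypothetical data structure (not the actual Balancer code): keep one URL stack per host and pick the next URL only at request time, from a host whose politeness delay has expired:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// hypothetical illustration: per-host URL stacks; the next URL is chosen
// lazily at fetch time instead of from a pre-computed crawl list
class HostStacks {
    private static final long DELAY_MS = 500; // assumed politeness delay
    private final Map<String, Deque<String>> stacks = new HashMap<String, Deque<String>>();
    private final Map<String, Long> nextAllowed = new HashMap<String, Long>();

    public synchronized void push(final String host, final String url) {
        Deque<String> stack = this.stacks.get(host);
        if (stack == null) {
            stack = new ArrayDeque<String>();
            this.stacks.put(host, stack);
        }
        stack.push(url);
    }

    // returns null if no host may be crawled at the moment
    public synchronized String pop() {
        final long now = System.currentTimeMillis();
        for (final Map.Entry<String, Deque<String>> e : this.stacks.entrySet()) {
            if (e.getValue().isEmpty()) continue;
            final Long allowed = this.nextAllowed.get(e.getKey());
            if (allowed != null && allowed.longValue() > now) continue;
            this.nextAllowed.put(e.getKey(), now + DELAY_MS);
            return e.getValue().pop();
        }
        return null;
    }
}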
Michael Peter Christen
1f4f60654a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/document/parser/pdfParser.java
2012-01-24 20:42:30 +01:00
Michael Peter Christen
2ee8cbeb2c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/search/Switchboard.java
2012-01-05 18:37:46 +01:00
Michael Peter Christen
992dbdf4bb added noload statistic to servlets 2012-01-05 18:33:05 +01:00
Michael Christen
c21966bb43 fix 2012-01-04 23:02:12 +01:00
Michael Christen
13b05f9c08 fix 2012-01-04 23:01:04 +01:00
Michael Christen
e5d878c59e Merge branch 'master' of ssh://gitorious.org/yacy/rc1
Conflicts:
	source/de/anomic/crawler/CrawlQueues.java
2012-01-04 22:08:17 +01:00
Michael Christen
ec26b2bea4 Merge commit 'fa08ed5ae5d72bddc3cc6a662b23103579e86109' into quix0r
Conflicts:
	source/de/anomic/crawler/CrawlQueues.java
2012-01-04 20:32:42 +01:00
Michael Christen
216a287a85 Merge commit '6d4e08ed06c5cd28c45981b2ebe31c7f7ec6fd83' into quix0r
Conflicts:
	source/de/anomic/crawler/CrawlQueues.java
2012-01-04 20:16:37 +01:00
stbrumm
d18095dc48 Patch for Issue 0000102
and fixes to Patch (private peer status is a property of a peer, not a
status)
2012-01-03 17:49:37 +01:00
Roland 'Quix0r' Haeder
901f37d608 Also this ... :( #2 2011-12-29 00:36:56 +01:00
Roland 'Quix0r' Haeder
a985717ed2 Also this ... :( 2011-12-29 00:35:51 +01:00
Roland 'Quix0r' Haeder
5f490de554 Fix for ported fix from my old days ... 2011-12-29 00:34:46 +01:00
Roland 'Quix0r' Haeder
fa08ed5ae5 Fixed a lot of CHMOD rights (no need for the execute flag on *.java/*.html) and introduced a local/remote crawl size ratio based check 2011-12-29 00:33:16 +01:00
Michael Christen
9e5894c784 Removed handling of components objects for URIMetadataRows.
This is a preparation to replace these rows with nodes from the node
store.
2011-12-17 01:27:08 +01:00
Michael Christen
c04bfaa51b refactoring 2011-12-16 23:59:29 +01:00
Michael Christen
6e66c9d7f1 fix for http://bugs.yacy.net/view.php?id=87 2011-12-05 23:46:42 +01:00
Michael Christen
e7e429705a - less automatic indexing after a search (needs to reset the default
crawl profiles)
- fix for concurrency problem in storage of serverSwitch Properties
- markup update
2011-12-05 16:22:11 +01:00
orbiter
11729061f2 added an option in the bookmark import process to put everything into the crawler
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8134 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-12-03 00:27:01 +00:00
orbiter
8895d8c1cd removed unnecessary log entries
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8117 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-27 16:54:48 +00:00
orbiter
5a55397f99 some last-minute performance hacks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-25 11:23:52 +00:00
orbiter
e4a82ddd8b produce a bookmark entry from every crawl start. these bookmarks are always private.
these bookmarks will be used to get a source reference for the search in case of intranet or portal searches.

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8062 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-21 23:10:29 +00:00
orbiter
aa322bc6d0 fix
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8050 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 15:36:30 +00:00
orbiter
97d1347adb also added a default Accept field to robots.txt downloads
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8049 6c8d7289-2bf4-0310-a012-ef5d649a1542
2011-11-16 15:33:55 +00:00