Commit Graph

4777 Commits

Author SHA1 Message Date
Roland 'Quix0r' Haeder
edaa09b9b1 Rewrote all String blacklist types to enum 'BlacklistType', closes bug
#143

Conflicts:
	htroot/Supporter.java
	htroot/yacy/crawlReceipt.java
	htroot/yacy/transferRWI.java
	htroot/yacy/transferURL.java
	source/de/anomic/crawler/CrawlStacker.java
	source/de/anomic/data/ListManager.java
	source/net/yacy/peers/Protocol.java
	source/net/yacy/repository/Blacklist.java
	source/net/yacy/repository/LoaderDispatcher.java
	source/net/yacy/search/Switchboard.java
	source/net/yacy/search/index/MetadataRepository.java
	source/net/yacy/search/index/Segment.java
	source/net/yacy/search/query/RWIProcess.java
	source/net/yacy/search/snippet/MediaSnippet.java
2012-06-11 00:17:30 +02:00
cominch
7a4dab6d1d - removed unused variables
- do not replace malformed or invalid URLs in urlproxy

git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7835
6c8d7289-2bf4-0310-a012-ef5d649a1542

Conflicts:
	source/de/anomic/http/server/HTTPDFileHandler.java
2012-06-10 23:33:09 +02:00
Michael Peter Christen
ca93835713 removed usage of deprecated methods 2012-06-10 23:17:21 +02:00
cominch
6b32f7c1f6 re-enable augmented proxy 2012-06-10 13:04:13 +02:00
cominch
b5a8fb5fd8 Catch malformed URL when submitted in encoded style 2012-06-10 12:45:49 +02:00
cominch
8e80894812 create virtual web folder /currentyacypeer/ which always points to local
peer, even when using the urlproxy

Conflicts:
	source/de/anomic/http/server/HTTPDProxyHandler.java
2012-06-10 12:31:49 +02:00
cominch
ae8adb0e58 Small changes 2012-06-10 10:44:16 +02:00
cominch
b21048892b augmentedParser add features and integrate external html parser to
modify existing web pages

Conflicts:
	addon/YaCy.app/Contents/Info.plist
	build.xml
2012-06-10 10:23:35 +02:00
cominch
9cbfc1a1c0 augmentedProxy, which forwards every proxy request to a
rewrite engine to customize existing webpages. originally implemented by
Florian Richter.

Conflicts:
	source/de/anomic/http/server/HTTPDProxyHandler.java
2012-06-10 10:15:34 +02:00
Michael Peter Christen
b0095c8d3c flush the compressor cache when a cleanup is done 2012-06-07 19:42:33 +02:00
Michael Peter Christen
96e9d77270 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/cora/sorting/WeakPriorityBlockingQueue.java
2012-06-06 20:13:28 +02:00
Michael Peter Christen
3dd8376825 added automatic cleaning of the cache if the metadata and
file database sizes are not equal. These can differ because one of the
caches is cleaned after a while or when it grows too big, while the
metadata is not cleaned at that time. The metadata is now wiped after a
checkup process at every application start. This should cause a bit
less memory usage.
2012-06-06 14:15:24 +02:00
Michael Peter Christen
461a0ce052 removed warnings 2012-06-05 20:03:43 +02:00
Michael Peter Christen
407fdf6968 more bug fixes and performance hacks for search process 2012-06-05 15:04:23 +02:00
Michael Peter Christen
a1fe65b115 performance hacks 2012-06-05 12:06:26 +02:00
Michael Peter Christen
2fe207f813 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-06-04 23:44:38 +02:00
Michael Peter Christen
0284a4d88f more fixes for double precision of coordinates 2012-06-04 23:37:41 +02:00
Michael Peter Christen
e0d8643226 - performance hacks
- added log warnings in case search processes run into time-out
situations
- better concurrency for Integer formatter (used a non-synchronized
formatter before)
- bugfix for search termination (a poison pill was missing)
- added timeout parameters for search (again); the goal is that they
are never reached.
2012-06-04 15:37:39 +02:00
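
The "poison pill" mentioned in the commit above is a sentinel element that a producer puts on a queue to tell the consumer that no more results will arrive. Without it, a consumer blocked in `take()` never terminates. The following is a minimal sketch of the general technique, not the YaCy search code; the class `PoisonPillDemo`, the method `collect` and the sentinel `POISON` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of the poison-pill pattern (not YaCy code).
public final class PoisonPillDemo {

    // Unique sentinel object. It is compared by identity, so a real
    // result with the text "POISON" is not mistaken for the pill.
    private static final String POISON = new String("POISON");

    public static List<String> collect(final List<String> results) {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();

        Thread producer = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    for (String r : results) {
                        queue.put(r);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    // Always enqueue the pill, even after an interruption;
                    // otherwise the consumer below would block forever.
                    queue.add(POISON);
                }
            }
        });
        producer.start();

        List<String> consumed = new ArrayList<String>();
        try {
            String item;
            // Consume until the pill arrives.
            while ((item = queue.take()) != POISON) {
                consumed.add(item);
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        }
        return consumed;
    }
}
```

Placing the pill in a `finally` block is the design choice that matters here: termination of the consumer should not depend on the producer finishing normally.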
Michael Peter Christen
9b4c699526 enhanced location search:
- search requests are now made using a map boundary
- search results are only computed for the map boundary
- the number of results is adapted to the results in the visible range
- added double-buffering for the search result markers
- added a search query option for the search results:
/radius/<lat>/<lon>/<radius>
2012-05-31 22:39:53 +02:00
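
As an illustration of the `/radius/<lat>/<lon>/<radius>` query option named in the commit above, here is a hedged sketch of how such a modifier could be extracted from a query string. The class `RadiusModifier`, the regular expression and the accepted number format are assumptions made for this example; the commit message does not state the unit of the radius or how YaCy parses the option.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for a "/radius/<lat>/<lon>/<radius>" query modifier.
public final class RadiusModifier {

    // Assumed syntax: decimal numbers, optional sign for lat/lon only.
    private static final Pattern RADIUS = Pattern.compile(
            "/radius/(-?\\d+(?:\\.\\d+)?)/(-?\\d+(?:\\.\\d+)?)/(\\d+(?:\\.\\d+)?)");

    /**
     * Returns {lat, lon, radius} if the query contains the modifier,
     * or null if it does not.
     */
    public static double[] parse(String query) {
        Matcher m = RADIUS.matcher(query);
        if (!m.find()) {
            return null;
        }
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2)),
            Double.parseDouble(m.group(3))
        };
    }
}
```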
Michael Peter Christen
43c2c6e588 better logging 2012-05-30 15:27:45 +02:00
Michael Peter Christen
20e0cc0822 fix for bad location evaluation 2012-05-29 14:46:13 +02:00
Michael Peter Christen
eff7667554 fix for http://bugs.yacy.net/view.php?id=188 2012-05-25 16:21:44 +02:00
Michael Peter Christen
8b974905ee changed log-in text for all servlets with authentication:
- added hint how to set the password using a shell script
- added a shell script to change the password
2012-05-24 13:24:31 +02:00
Michael Peter Christen
16b21f7a5b Added more steering in Crawler_p.html interface 2012-05-23 18:00:37 +02:00
Michael Peter Christen
acc19e190d hack against 100% cpu during crawl delete 2012-05-23 15:45:07 +02:00
Michael Peter Christen
c15fcde1c8 add-on to latest commit 2012-05-21 17:52:30 +02:00
Michael Peter Christen
cf47d94888 performance hack to parse numbers inside of substrings without actually
generating a substring. This avoids the allocation of a String object
each time a substring is parsed. Should affect CPU load during RWI
transmission.
2012-05-21 13:40:46 +02:00
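
The idea in the commit above, parsing a number directly from a character range instead of calling `substring()` first, can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the class `SubstringNumbers` and its method are hypothetical, and only decimal `int` values are handled.

```java
// Hypothetical illustration: parse an int from s[start, end) without
// allocating an intermediate String via s.substring(start, end).
public final class SubstringNumbers {

    public static int parseInt(CharSequence s, int start, int end) {
        if (start < 0 || end > s.length() || start >= end) {
            throw new NumberFormatException("empty or invalid range");
        }
        int i = start;
        boolean negative = s.charAt(i) == '-';
        if (negative || s.charAt(i) == '+') {
            i++;
        }
        if (i == end) {
            throw new NumberFormatException("sign without digits");
        }
        long value = 0; // long accumulator so overflow can be detected
        for (; i < end; i++) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at index " + i);
            }
            value = value * 10 + digit;
            if (value > (long) Integer.MAX_VALUE + 1) {
                throw new NumberFormatException("out of int range");
            }
        }
        if (negative) {
            value = -value;
        }
        if (value > Integer.MAX_VALUE) {
            throw new NumberFormatException("out of int range");
        }
        return (int) value;
    }
}
```

For example, `SubstringNumbers.parseInt("x=1234;y", 2, 6)` reads the digits `1234` in place. On Java 9 and later, `Integer.parseInt(CharSequence, int, int, int)` offers the same range-based parsing in the standard library; it did not exist when this commit was made.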
Michael Peter Christen
7e0ddbd275 added a "fromCache" flag in Response object to omit one cache.has()
check during snippet generation. This should cause less blocking
2012-05-21 03:03:47 +02:00
Michael Peter Christen
125d47b3c1 added more interruptions in DidYouMean because that was the cause of
some blocking during search
2012-05-21 00:59:41 +02:00
Michael Peter Christen
f294f2e295 bugfix to http://bugs.yacy.net/view.php?id=181
tried to make a bit less 'noise' to dns server

also included: fewer processes in snippet fetch to reduce load during
search on small computers
2012-05-19 01:06:33 +02:00
Michael Peter Christen
3e1bc9477f Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-05-17 13:58:09 +02:00
Roland 'Quix0r' Haeder
b3ae2aa41f With or without 'final'? At least please try it in other methods
Conflicts:
	source/de/anomic/tools/tarTools.java
2012-05-17 06:00:49 +02:00
Michael Peter Christen
5b3acc12cd Pattern.quote() replaces \\Q and \\E according to publication in
http://www.cs.washington.edu/homes/mernst/pubs/regex-types-ftfjp2012.pdf
2012-05-17 03:55:10 +02:00
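
To illustrate why `Pattern.quote()` is preferable to wrapping a string in `\Q` and `\E` by hand: the library method also handles input that itself contains `\E`, which would end a hand-written quote early. The sketch below uses only `java.util.regex.Pattern`; the class `QuoteDemo` and the method `containsLiteral` are hypothetical names for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper showing literal matching with Pattern.quote().
public final class QuoteDemo {

    /** True if text contains literal verbatim, with no regex interpretation. */
    public static boolean containsLiteral(String text, String literal) {
        // Pattern.quote() returns a pattern string that matches the
        // literal exactly, even if the literal contains "\E".
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }
}
```

With this helper, `containsLiteral("11", "1+1")` is false because `+` is treated as a plain character, while the unquoted regex `1+1` would match `11`.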
Michael Peter Christen
89142d1e8d removed (not all) warnings 2012-05-16 13:42:32 +02:00
Michael Peter Christen
e7e381d110 added configuration to switch off redirection following in crawler 2012-05-15 12:25:46 +02:00
Michael Peter Christen
70505107ca enhanced crawler/balancer: better remaining waiting-time guessing 2012-05-15 12:24:54 +02:00
Michael Peter Christen
f150bc218b fixed bug in solr error document 2012-05-14 14:56:21 +02:00
Roland 'Quix0r' Haeder
a093ccf5eb Now used synchronization in all close() methods to make sure all objects
are 'closed' in an ordered way

Conflicts:
	source/de/anomic/http/server/ChunkedInputStream.java
	source/de/anomic/http/server/ChunkedOutputStream.java
	source/de/anomic/http/server/ContentLengthInputStream.java
	source/net/yacy/cora/protocol/Domains.java
	source/net/yacy/cora/services/federated/solr/SolrShardingConnector.java
	source/net/yacy/cora/services/federated/solr/SolrSingleConnector.java
	source/net/yacy/document/content/dao/PhpBB3Dao.java
	source/net/yacy/document/parser/html/AbstractTransformer.java
	source/net/yacy/kelondro/blob/BEncodedHeap.java
	source/net/yacy/kelondro/blob/HeapReader.java
	source/net/yacy/kelondro/index/RAMIndexCluster.java
	source/net/yacy/kelondro/io/ByteCountInputStream.java
	source/net/yacy/kelondro/logging/ConsoleOutErrHandler.java
	source/net/yacy/kelondro/table/SQLTable.java
2012-05-14 07:41:55 +02:00
Michael Peter Christen
ba6aaabc51 refactoring + parser bugfixes 2012-05-04 17:28:27 +02:00
Michael Peter Christen
659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
queue and not from virtual documents generated by the parser.
- The parser now generates nice description texts for NOLOAD entries
which shall make it possible to find media content using the search
index and not using the media prefetch algorithm during search (which
was costly)
- Removed the media-search prefetch process from image search
2012-04-24 16:07:03 +02:00
Michael Peter Christen
f5efdb21fd refactoring 2012-04-24 12:54:41 +02:00
Michael Peter Christen
f8cd57c92f new indexing strategy: ALL links that appear anywhere are indexed, not
only links where the content can be parsed. All non-parseable links are
placed into the noload queue. The search process must therefore be able
to filter out non-text search results.
- This fixes the problem that image search results appeared in the text
search.
- The interactive search can now retrieve ALL types of links
- The p2p interface is now extended to retrieve only certain types of
links (text, image, video, apps)
- The search process has an extension to filter the right document type
according to the search query
2012-04-22 02:05:17 +02:00
Michael Peter Christen
a1a5b015d8 refactoring: moved document Classification to cora package 2012-04-21 21:31:13 +02:00
Michael Peter Christen
a5d7da68a0 refactoring: removed dependency from switchboard in Balancer/CrawlQueues 2012-04-21 13:47:48 +02:00
Michael Peter Christen
33d1062c79 refactoring: the cache belongs to the crawler 2012-04-21 13:34:07 +02:00
Michael Peter Christen
046f3a7e8d check if httpc has decompressed the release file and rename the file
from .tar.gz to .tar if that happened
2012-04-16 09:50:55 +02:00
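
One way to detect that a downloaded `.tar.gz` file has already been decompressed, as described in the commit above, is to look for the gzip magic number (bytes `0x1f 0x8b`) at the start of the file. The following is a sketch of that check only, under the assumption that this is an acceptable test; the commit message does not say how YaCy detects the condition. The class `ReleaseFileName` and its methods are hypothetical.

```java
// Hypothetical illustration: decide whether a file named *.tar.gz is
// still gzip-compressed by inspecting its first two bytes.
public final class ReleaseFileName {

    /** True if the bytes start with the gzip magic number 0x1f 0x8b. */
    public static boolean isGzip(byte[] head) {
        return head != null && head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }

    /**
     * Returns the name the file should carry: "x.tar" if the name ends
     * in ".tar.gz" but the content is not gzip data, else the name as is.
     */
    public static String correctedName(String name, byte[] head) {
        if (name.endsWith(".tar.gz") && !isGzip(head)) {
            return name.substring(0, name.length() - ".gz".length());
        }
        return name;
    }
}
```

The caller would read the first two bytes of the file, pass them as `head`, and rename the file if the returned name differs.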
Michael Christen
22f05c83ff fixed default must-match filter for full domain crawls - the old filter
was too restrictive and did not allow intranet crawls
2012-03-28 21:50:00 +02:00
Michael Peter Christen
0cc0290978 bugfix for a must-not-match pattern check. The bug did not make the
check semantically wrong, but it broke a trick that avoids an IP lookup
when the filter is not used. With the fix, crawling gets a huge speed
boost for noload urls!
2012-02-27 00:52:44 +01:00
Michael Peter Christen
2fc8ecee36 ConcurrentLinkedQueue has a VERY long return time on the .size() method.
See
http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html

and the following test program:

import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueLengthTimeTest {

    // Adds c elements to q, calling q.size() before each add,
    // and returns the elapsed time in milliseconds.
    public static long countTest(Queue<Integer> q, int c) {
        long t = System.currentTimeMillis();
        for (int i = 0; i < c; i++) {
            q.add(q.size());
        }
        return System.currentTimeMillis() - t;
    }

    public static void main(String[] args) {
        int c = 1;
        // 16 rounds (c = 1 .. 32768); doubling c 100 times would overflow int.
        for (int i = 0; i < 16; i++) {
            Runtime.getRuntime().gc();
            long t1 = countTest(new ArrayBlockingQueue<Integer>(c), c);
            Runtime.getRuntime().gc();
            long t2 = countTest(new LinkedBlockingQueue<Integer>(), c);
            Runtime.getRuntime().gc();
            long t3 = countTest(new ConcurrentLinkedQueue<Integer>(), c);

            System.out.println("count = " + c
                    + ": ArrayBlockingQueue = " + t1
                    + ", LinkedBlockingQueue = " + t2
                    + ", ConcurrentLinkedQueue = " + t3);
            c = c * 2;
        }
    }
}
2012-02-27 00:42:32 +01:00
Michael Peter Christen
8aba045ba1 if a new pop-up page is set in config portal, then this page applies
also to the default page configuration for the httpd if no path is
given.
2012-02-26 20:53:32 +01:00