8443 Commits

Author SHA1 Message Date
Michael Peter Christen
659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
queue and not from virtual documents generated by the parser.
- The parser now generates nice description texts for NOLOAD entries
which shall make it possible to find media content using the search
index and not using the media prefetch algorithm during search (which
was costly)
- Removed the media-search prefetch process from image search
2012-04-24 16:07:03 +02:00
Michael Peter Christen
3bea25c513 increased image preview size 2012-04-24 16:04:13 +02:00
Michael Peter Christen
a3badd3205 changed search process for images: no more media snippet load process,
show only links from index which had been on the text search page
before. This creates a superfast search process for images!
2012-04-24 12:55:58 +02:00
Michael Peter Christen
f5efdb21fd refactoring 2012-04-24 12:54:41 +02:00
Michael Peter Christen
4aa0eedead one more scroogle... 2012-04-24 12:05:37 +02:00
Michael Peter Christen
347612ddd4 removed scroogle parser 2012-04-24 12:04:44 +02:00
reger
c1f6b4fb52 lookupByIP: prevent comparing of port parameter if called with port -1 (=unknown) 2012-04-24 00:05:01 +02:00
Michael Peter Christen
f8cd57c92f new indexing strategy: ALL links that appear anywhere are indexed, not
only links where the content can be parsed. All non-parseable links are
placed into the noload queue. The search process must therefore be able
to filter out non-text search results.
- This fixes the problem that image search results appeared in the text
search.
- The interactive search can now retrieve ALL types of links
- The p2p interface is now extended to retrieve only certain types of
links (text, image, video, apps)
- The search process has an extension to filter the right document type
according to the search query
2012-04-22 02:05:17 +02:00
Michael Peter Christen
14f67f217c refactoring of ContentDomain: now subclass of Classification 2012-04-22 00:04:36 +02:00
Michael Peter Christen
8a08c96a82 removed dependency from logging 2012-04-21 21:32:31 +02:00
Michael Peter Christen
a1a5b015d8 refactoring: moved document Classification to cora package 2012-04-21 21:31:13 +02:00
Michael Peter Christen
a5d7da68a0 refactoring: removed dependency from switchboard in Balancer/CrawlQueues 2012-04-21 13:47:48 +02:00
Michael Peter Christen
33d1062c79 refactoring: the cache belongs to the crawler 2012-04-21 13:34:07 +02:00
Michael Peter Christen
8429967ea7 no more SVN 2012-04-19 13:29:08 +02:00
Michael Peter Christen
0466bb0ddf no more SVN.. 2012-04-19 13:28:12 +02:00
Michael Peter Christen
4844e124b1 one more warning in case that crawling is paused because of low disk
space
2012-04-19 12:35:11 +02:00
Michael Peter Christen
0ec2713af8 'download' 2012-04-19 11:50:24 +02:00
Michael Peter Christen
2be327b5ab update location update 2012-04-19 11:49:43 +02:00
Michael Peter Christen
f30c577fdb add hint to speed up search results 2012-04-19 11:11:14 +02:00
Michael Peter Christen
6b133de3e9 add hint for consulting support 2012-04-19 11:10:48 +02:00
Michael Peter Christen
4d5da75814 fix for parser problem if an <a> tag is 'within' html tags with unclosed
tags. That prevented the <a> tags from being recognized. This is a fix
for http://forum.yacy-websuche.de/viewtopic.php?p=25516#p25516
2012-04-18 10:30:04 +02:00
Michael Peter Christen
eb2c8ffa62 display is not used any more 2012-04-17 12:30:14 +02:00
Michael Peter Christen
91a86f0b06 fixed to network graph testing 2012-04-17 11:46:14 +02:00
Michael Peter Christen
f31ad84d98 automatic generation of blacklist pattern, see
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2685&p=25305#p25305
2012-04-17 11:22:19 +02:00
Michael Peter Christen
7b5b9baee0 added citation rank to ranking profile 2012-04-16 23:43:50 +02:00
Michael Peter Christen
046f3a7e8d check if httpc has decompressed the release file and rename the file
from .tar.gz to .tar if that happened
2012-04-16 09:50:55 +02:00
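The check described in the entry above can be sketched as follows. This is a minimal illustration, not the YaCy implementation: the class `ReleaseFileCheck` and its methods are hypothetical names, and the sketch assumes that "decompressed" can be detected by the absence of the two gzip magic bytes (0x1f 0x8b) at the start of the file.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not taken from the YaCy source.
class ReleaseFileCheck {

    // Returns true if the file starts with the gzip magic bytes 0x1f 0x8b.
    static boolean isGzip(File f) throws IOException {
        try (InputStream in = new FileInputStream(f)) {
            int b1 = in.read();
            int b2 = in.read();
            return b1 == 0x1f && b2 == 0x8b;
        }
    }

    // If a file named *.tar.gz is no longer gzip-compressed, rename it to *.tar.
    // Returns the file under its (possibly new) name.
    static File fixName(File f) throws IOException {
        String name = f.getName();
        if (!name.endsWith(".tar.gz") || isGzip(f)) {
            return f; // nothing to do
        }
        // strip the trailing ".gz" (3 characters)
        File tar = new File(f.getParentFile(), name.substring(0, name.length() - 3));
        if (!f.renameTo(tar)) {
            throw new IOException("cannot rename " + f + " to " + tar);
        }
        return tar;
    }
}
```

Calling `ReleaseFileCheck.fixName(downloadedFile)` after a download would leave a genuine `.tar.gz` untouched and rename an already-decompressed one.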
reger
06951ef751 remove heuristic scroogle from search option help text in index.html 2012-04-16 04:00:04 +02:00
Michael Peter Christen
e377092198 fix to xml output format 2012-04-13 09:02:18 +02:00
Michael Christen
41be98dc9d extended webstructure api to show together with incoming links also
outgoing links
2012-04-13 11:53:34 +02:00
Michael Christen
02e4dedff2 fix to url citation collection 2012-04-13 11:52:59 +02:00
Michael Christen
e32055aa15 added stub classes for
- a new database for url reference data ('seen links')
- a new database extending the references to the full url metadata
attributes set which shall replace the old metadata database if it is
finished
- migration help classes stub to use old and new metadata databases
simultaneously
2012-04-13 07:09:15 +02:00
Michael Christen
ac5d124ee0 experimental implementation of a citation ranking as post-ranking
method. (ranking coefficient fixed, needs to be made configurable)
2012-04-13 06:47:33 +02:00
Michael Christen
8f89c8ef07 added information about inbound, outbound and citation links into
yacydoc api servlet
2012-03-31 07:38:49 +02:00
Michael Christen
71649a1296 added an api to retrieve the new citation.index with the
webstructure.xml api. This api will respond with details about a single
URL if requested with 'webstructure.xml?about=[url|urlhash|host]'.
2012-03-29 17:22:31 +02:00
Michael Christen
8fc86fe397 added storage of full anchor link structure:
the links between all pages are now stored. The same index structure as
used for the word index is used to make a reverse link index.
The new file(s) in SEGMENT/default/citation.index.*.blob store the
citation index. This will be used to create much more detailed link
structures for the YaCy apis and to create a better ranking. A ranking
using the citation.index should provide better results especially for
portal indexes and intranets.
2012-03-29 17:20:14 +02:00
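The idea of a reverse link index from the entry above can be illustrated with a small in-memory sketch. It is only an assumption-laden illustration: the class `CitationIndexSketch` and its methods are hypothetical, it keys on plain URL strings, and it does not reproduce the on-disk citation.index.*.blob format or YaCy's actual index classes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory illustration of a reverse link ("citation") index.
class CitationIndexSketch {

    // target URL -> set of source URLs that link to it
    private final Map<String, Set<String>> inbound = new HashMap<>();

    // Record all anchors found on one page: each target learns about its source.
    public void addPage(String sourceUrl, Iterable<String> anchorTargets) {
        for (String target : anchorTargets) {
            Set<String> sources = inbound.get(target);
            if (sources == null) {
                sources = new TreeSet<>();
                inbound.put(target, sources);
            }
            sources.add(sourceUrl);
        }
    }

    // All pages known to link to the given URL (empty if none).
    public Set<String> citedBy(String targetUrl) {
        Set<String> sources = inbound.get(targetUrl);
        if (sources == null) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(sources);
    }
}
```

The number of entries in `citedBy(url)` is the kind of signal a citation-based ranking could use; how YaCy weighs it is not stated in the commit message.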
Michael Christen
22f05c83ff fixed default must-match filter for full domain crawls - the old filter
was too restrictive and did not allow intranet crawls
2012-03-28 21:50:00 +02:00
Lotus
3e61287326 some better feedback on properties change 2012-03-25 22:21:42 +02:00
Lotus
96ac95cff9 added hint how to change integration options 2012-03-23 17:02:50 +01:00
Thomas
4f61b8fd82 Fixes for compare-search 2012-03-21 21:43:47 +01:00
Thomas
e0680de7b3 Remove Scroogle from compare-search, Scroogle is dead 2012-03-20 23:00:06 +01:00
Lotus
78f0d8f046 no focus on preview frames for search integration
fixes bug http://bugs.yacy.net/view.php?id=161
2012-03-17 21:10:29 +01:00
Lotus
0b3f39136e allow custom ppm lower than minimum button on /Crawler_p.html
fixes http://bugs.yacy.net/view.php?id=166
2012-03-17 20:43:19 +01:00
Lotus
e14eb9de82 checkalive.sh: try to fetch only once (default: 20) 2012-03-12 09:30:44 +01:00
Lotus
7792ac6406 fix links & bug #163 2012-03-10 10:59:56 +01:00
Michael Peter Christen
532c7cf827 added physics experiment to the graph plotter. not active by default 2012-02-28 13:18:46 +01:00
Michael Peter Christen
aba9b1bfa0 better names for elements of a linked graph 2012-02-27 21:27:17 +01:00
Michael Peter Christen
0cc0290978 bugfix for a must-not-match pattern check. This bug did not make the
check semantically wrong, but a trick that skips the IP lookup when
the filter is not used did not work. With this fix,
crawling gets a huge speed boost for noload urls!
2012-02-27 00:52:44 +01:00
Michael Peter Christen
2fc8ecee36 ConcurrentLinkedQueue has a VERY long return time on the .size() method.
See
http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html

and the following test program:

import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueLengthTimeTest {

    // Adds c elements, calling q.size() before each add; returns elapsed milliseconds.
    public static long countTest(Queue<Integer> q, int c) {
        long t = System.currentTimeMillis();
        for (int i = 0; i < c; i++) {
            q.add(q.size());
        }
        return System.currentTimeMillis() - t;
    }

    public static void main(String[] args) {
        int c = 1;
        // c doubles on every pass
        for (int i = 0; i < 100; i++) {
            Runtime.getRuntime().gc();
            long t1 = countTest(new ArrayBlockingQueue<Integer>(c), c);
            Runtime.getRuntime().gc();
            long t2 = countTest(new LinkedBlockingQueue<Integer>(), c);
            Runtime.getRuntime().gc();
            long t3 = countTest(new ConcurrentLinkedQueue<Integer>(), c);

            System.out.println("count = " + c + ": ArrayBlockingQueue = " + t1
                    + ", LinkedBlockingQueue = " + t2 + ", ConcurrentLinkedQueue = " + t3);
            c = c * 2;
        }
    }
}
2012-02-27 00:42:32 +01:00
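The entry above reports that ConcurrentLinkedQueue.size() is slow; the JDK documentation notes that it is not a constant-time operation because it traverses the queue. One common workaround, sketched here as an assumption and not as the change this commit actually made, is to keep a separate atomic counter next to the queue. The class `CountedQueue` is a hypothetical name.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper, not taken from the YaCy source.
class CountedQueue<E> {

    private final Queue<E> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger count = new AtomicInteger();

    public void add(E e) {
        // Increment before adding so the counter never drops below zero
        // when another thread polls the element right away.
        count.incrementAndGet();
        queue.add(e);
    }

    public E poll() {
        E e = queue.poll();
        if (e != null) {
            count.decrementAndGet();
        }
        return e;
    }

    // Constant time; under concurrent use this may briefly overestimate the queue length.
    public int size() {
        return count.get();
    }

    public static void main(String[] args) {
        CountedQueue<Integer> q = new CountedQueue<>();
        for (int i = 0; i < 3; i++) {
            q.add(i);
        }
        System.out.println("size = " + q.size());
    }
}
```

The trade-off is that the counter and the queue are updated in two steps, so `size()` is an approximation while other threads are adding or polling. Where an exact size matters, a LinkedBlockingQueue, which maintains its own count, may be the simpler choice.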
Michael Peter Christen
8aba045ba1 if a new pop-up page is set in config portal, then this page applies
also to the default page configuration for the httpd if no path is
given.
2012-02-26 20:53:32 +01:00
Michael Peter Christen
fa7b3481b3 better navigation in file search: fewer results on the first try, but much
faster. After the first search is done, buttons appear to get more
results for the same search
2012-02-26 17:32:45 +01:00