Commit Graph

93 Commits

Author SHA1 Message Date
Michael Peter Christen
ba6aaabc51 refactoring + parser bugfixes 2012-05-04 17:28:27 +02:00
Michael Peter Christen
453010bd68 - solved problems with backpath normalization
- redesigned in/outbound link handover
- removed iframe links from inbound/outbound in solr scheme
2012-04-27 16:48:51 +02:00
Michael Peter Christen
5f5ed33ed8 patch for media search (audio, video apps) 2012-04-27 14:18:02 +02:00
Michael Peter Christen
19efbf1b0f - apply directDocByURL to NOLOAD Queue
- choose pushing to NOLOAD as default for site crawl
2012-04-26 00:23:18 +02:00
Michael Peter Christen
659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
queue and not from virtual documents generated by the parser.
- The parser now generates nice description texts for NOLOAD entries
which shall make it possible to find media content using the search
index and not using the media prefetch algorithm during search (which
was costly)
- Removed the media-search prefetch process from image search
2012-04-24 16:07:03 +02:00
Michael Peter Christen
a3badd3205 changed search process for images: no more media snippet load process,
show only links from index which had been on the text search page
before. This creates a superfast search process for images!
2012-04-24 12:55:58 +02:00
Michael Peter Christen
f8cd57c92f new indexing strategy: ALL links that appear anywhere are indexed, not
only links where the content can be parsed. All non-parseable links are
placed into the noload queue. The search process must therefore be able
to filter out non-text search results.
- This fixes the problem that image search results appeared in the text
search.
- The interactive search can now retrieve ALL types of links
- The p2p interface is now extended to retrieve only certain types of
links (text, image, video, apps)
- The search process has an extension to filter the right document type
according to the search query
2012-04-22 02:05:17 +02:00
Michael Peter Christen
14f67f217c refactoring of ContentDomain: now subclass of Classification 2012-04-22 00:04:36 +02:00
Michael Peter Christen
a1a5b015d8 refactoring: moved document Classification to cora package 2012-04-21 21:31:13 +02:00
Michael Peter Christen
33d1062c79 refactoring: the cache belongs to the crawler 2012-04-21 13:34:07 +02:00
Michael Peter Christen
7b5b9baee0 added citation rank to ranking profile 2012-04-16 23:43:50 +02:00
Michael Christen
02e4dedff2 fix to url citation collection 2012-04-13 11:52:59 +02:00
Michael Christen
e32055aa15 added stub classes for
- a new database for url reference data ('seen links')
- a new database extending the references to the full url metadata
attributes set which shall replace the old metadata database if it is
finished
- migration help classes stub to use old and new metadata databases
simultaneously
2012-04-13 07:09:15 +02:00
Michael Christen
ac5d124ee0 experimental implementation of a citation ranking as post-ranking
method. (ranking coefficient fixed, need to be made configurable)
2012-04-13 06:47:33 +02:00
Michael Christen
8fc86fe397 added storage of full anchor link structure:
the links between all pages are now stored. The same index structure as
used for the word index is used to make a reverse link index.
The new file(s) in SEGMENT/default/citation.index.*.blob store the
citation index. This will be used to create much more detailed link
structures for the YaCy apis and to create a better ranking. A ranking
using the citation.index should provide better results especially for
portal indexes and intranets.
2012-03-29 17:20:14 +02:00
Lotus
0b3f39136e allow custom ppm lower than minimum button on /Crawler_p.html
fixes http://bugs.yacy.net/view.php?id=166
2012-03-17 20:43:19 +01:00
Michael Peter Christen
8aba045ba1 if a new pop-up page is set in config portal, then this page applies
also to the default page configuration for the httpd if no path is
given.
2012-02-26 20:53:32 +01:00
Michael Peter Christen
36e4d82b27 changed ranking 2012-02-25 12:58:12 +01:00
Michael Peter Christen
096c17e7cd added test code 2012-02-25 12:42:13 +01:00
Michael Peter Christen
9ad1d8dde2 complete redesign of crawl queue monitoring: do not look at a
ready-prepared crawl list but at the stacks of the domains that are
stored for balanced crawling. This affects also the balancer since that
does not need to prepare the pre-selected crawl list for monitoring. As
a result:
- it is no longer possible to see the correct order of next to-be-crawled
links, since that depends on the actual state of the balancer stack the
next time another url is requested for loading
- the balancer works better since the next url can be selected according
to the current situation and not according to a pre-selected order.
2012-02-02 21:33:42 +01:00
Michael Peter Christen
e2f8f263e8 changed storage of search words: keep order 2012-02-01 18:13:31 +01:00
Michael Peter Christen
2e5cd6a1b2 fixed parser extension deny list generation and usage 2012-02-01 00:15:59 +01:00
Michael Peter Christen
3cd6dcd352 do not add new solr fields as activated fields 2012-01-31 22:21:48 +01:00
Michael Peter Christen
e3bb73c3d6 serialized some database access methods 2012-01-31 21:13:49 +01:00
Michael Peter Christen
355ecf330f reduced target file size to 64 MB 2012-01-29 20:35:48 +01:00
Michael Peter Christen
2ea585d616 fix for host navigator 2012-01-26 18:10:34 +01:00
Michael Peter Christen
4c5edab1ec added option to have exception search result windows 2012-01-26 15:32:30 +01:00
Michael Peter Christen
ef78f22ee1 performance hack 2012-01-25 12:48:48 +01:00
Michael Peter Christen
41536eb4a2 performance hack 2012-01-25 12:28:56 +01:00
Michael Peter Christen
f91487fc50 added delete-button for host navigation 2012-01-25 11:19:18 +01:00
Michael Peter Christen
e8d24fd802 author navigator can be switched off 2012-01-25 11:11:42 +01:00
Michael Peter Christen
558ab7bd4e made the protocol navigator reversible 2012-01-25 02:54:52 +01:00
Michael Peter Christen
96cb75f1d4 made the filetype navigator be able to deselect the search constraint 2012-01-25 02:50:06 +01:00
Lotus
c73af39e54 refactoring of tray icon class,
now uses Java 6 methods natively
2012-01-18 20:47:09 +01:00
Michael Peter Christen
4eff0e26f1 npe bugfix 2012-01-17 23:39:57 +01:00
Michael Peter Christen
1a0b6b3913 get more navigation details to search results 2012-01-17 16:44:30 +01:00
Michael Peter Christen
83009d86f7 added the vocabulary navigator. It can be very simply tested by
switching on the locale dictionaries.
2012-01-17 01:53:08 +01:00
Michael Peter Christen
254adea51c small fixes 2012-01-13 11:24:08 +01:00
Michael Peter Christen
c602eaaf46 enhanced search process 2012-01-10 03:00:55 +01:00
Michael Christen
eff966f396 fix for search process (it was aborted too early during remote search) 2012-01-09 03:02:35 +01:00
Marek Otahal
72adbeae90 !Important: move from Hashtable to HashMap
Hashtable is an obsolete collection v1, now since v2 offers HashMap with same or better
functionality. Please review, almost all code was already moved, so only a few changes. That is not the issue,
but I found notices that some (ugly big) helper classes had to be created in past
to compensate missing Hashtable's functionality. I'd like input if we can remove some of them.
look for //FIX: in these commits

Signed-off-by: Marek Otahal <markotahal@gmail.com>
2012-01-09 01:29:18 +01:00
Marek Otahal
f40efb39af Blacklist loadList() remove duplicates by using Set
Signed-off-by: Marek Otahal <markotahal@gmail.com>
2012-01-09 01:18:01 +01:00
Michael Peter Christen
2ee8cbeb2c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Conflicts:
	source/net/yacy/search/Switchboard.java
2012-01-05 18:37:46 +01:00
Michael Peter Christen
992dbdf4bb added noload statistic to servlets 2012-01-05 18:33:05 +01:00
Michael Christen
216a287a85 Merge commit '6d4e08ed06c5cd28c45981b2ebe31c7f7ec6fd83' into quix0r
Conflicts:
	source/de/anomic/crawler/CrawlQueues.java
2012-01-04 20:16:37 +01:00
stbrumm
d18095dc48 Patch for Issue 0000102
and fixes to Patch (private peer status is a property of a peer, not a
status)
2012-01-03 17:49:37 +01:00
Michael Christen
585a8f3c44 fixed a bug in search sequence (caused empty results) 2012-01-02 02:10:39 +01:00
Roland 'Quix0r' Haeder
a3083d13bf Blacklist checks are now always turned on, in media searches (e.g. image search) images matching blacklist entries are no longer shown to the user 2011-12-28 20:09:17 +01:00
Michael Christen
52184a1170 fix for search process 2011-12-27 23:43:44 +01:00
Michael Christen
0797b0de99 new handling of remote search processes: looking for seeds will now not
block the whole search process any more. A deadlock with a DHT selection
process may have been the cause for interface lockings in the past.
2011-12-21 00:32:03 +01:00