Commit Graph

9317 Commits

Author SHA1 Message Date
Michael Peter Christen
01200f06cc using the author field as solr-native facet. this makes it necessary to
introduce a copy-field for the author field to be copied to a string
field. This field is then used to generate facets. Without this field,
the facet would consist only of the words of the author names, not of
the full author string.
2012-12-19 01:56:33 +01:00
Michael Peter Christen
2a4c064c89 using the publisher information for the author field if no author is
given. This applies to cases where only the copyright field in the html
header is filled but not the author field
2012-12-19 01:54:35 +01:00
Michael Peter Christen
bab573361f - using a filter query for facet restriction
- calculating the whole search result in at most two sub-queries from
solr
2012-12-19 01:00:57 +01:00
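A minimal sketch of what "using a filter query for facet restriction" in bab573361f could look like on the request side. It assumes the standard Solr select parameters `q`, `fq`, `facet` and `facet.field`. The function name and the field name `host_s` used in the usage note are hypothetical illustrations, not taken from the YaCy source.

```python
from urllib.parse import urlencode


def build_facet_query(user_query, facet_field, facet_value=None):
    """Build a Solr query string that facets on one field.

    Hypothetical helper: if facet_value is given, the result set is
    restricted with a filter query (fq) instead of rewriting q.
    """
    params = [
        ("q", user_query),
        ("facet", "true"),
        ("facet.field", facet_field),
    ]
    if facet_value is not None:
        # Escape backslashes and double quotes so the value stays a single phrase.
        escaped = facet_value.replace("\\", "\\\\").replace('"', '\\"')
        # fq narrows the result set without changing relevance scoring.
        params.append(("fq", f'{facet_field}:"{escaped}"'))
    return urlencode(params)
```

Calling `build_facet_query("yacy", "host_s", "example.org")` produces a query string with the user query in `q` and the facet restriction in a separate `fq` parameter.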
Michael Peter Christen
7ad5457db0 using the solr facets as navigation in yacyinteractive.html instead of
counting locally result types
2012-12-19 00:59:40 +01:00
Michael Peter Christen
eac9650b31 added another solr field clickdepth_i which reflects the number of
clicks which are necessary to get from the portal of a host to a
specific document. At this time, only the start document is flagged with
clickdepth '0', all others with '-1'. To get the actual clickdepth, a
process must use crawled information to collect the actual number of
clicks. This will be added in another/next step.
2012-12-18 17:20:42 +01:00
Michael Peter Christen
1052263af3 - added a new solr field references_i which stores the number of
INCOMING links to the corresponding web page. This information is taken
from the reverse link index (a 'little sister' of the RWI index).
- this field can be of use to enhance the ranking because a web page
with more incoming links can be more important than others. But
this is not true for typical link pages like menus. Therefore the
number of outgoing links is needed.
- added a new solr attribute 'bf' to solr queries which is a boost
function extension. This field can contain a formula which computes the
boost according to given field values. After some experiments the
following formula is now default:
div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4
This takes the number of references and the inbound links. Further
experiments are needed to enhance that formula.
2012-12-18 14:42:35 +01:00
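The default boost formula from 1052263af3 can be evaluated by hand with a small sketch. It assumes that the trailing `^0.4` is a weight applied to the function value, as in Solr's dismax `bf` syntax, rather than an exponent. The function name and parameter names are hypothetical.

```python
def boost(references: int, inbound_links: int, weight: float = 0.4) -> float:
    """Evaluate div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6)).

    Assumption: the '^0.4' suffix is a multiplicative weight on the result.
    """
    value = (1 + references) / (1 + inbound_links) ** 1.6
    return weight * value


# More references raise the boost; a higher inboundlinkscount_i lowers it.
few_refs = boost(references=2, inbound_links=5)
many_refs = boost(references=50, inbound_links=5)
```

With both fields at zero the function value is 1, so the boost equals the weight itself.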
Michael Peter Christen
7c3de8b4cd - fix for localhost detection
- added IPv6 patterns for localhost detection
2012-12-18 12:52:20 +01:00
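As an illustration of localhost detection that also covers IPv6 (7c3de8b4cd), the following sketch uses Python's standard `ipaddress` module instead of patterns. This is a swapped-in technique for illustration only, and the function name `is_localhost` is hypothetical.

```python
import ipaddress


def is_localhost(host: str) -> bool:
    """Return True if host names the local machine (hypothetical helper)."""
    name = host.strip().lower()
    # IPv6 literals in URLs are wrapped in brackets, e.g. [::1].
    if name.startswith("[") and name.endswith("]"):
        name = name[1:-1]
    if name in ("localhost", "localhost.localdomain"):
        return True
    try:
        # Covers 127.0.0.0/8 as well as ::1 in any textual form.
        return ipaddress.ip_address(name).is_loopback
    except ValueError:
        # Not an IP literal, and not a known localhost name.
        return False
```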
Michael Peter Christen
34f8786508 removed dependency of vocabulary navigation from Jena and its
triplestore; the vocabulary search is now done using generic solr fields
which are created on-the-fly during runtime.
2012-12-18 02:29:03 +01:00
reger
664499bb10 PerformanceQueues: disable input for hardcoded httpd performance values 2012-12-16 21:01:13 +01:00
reger
ad71747525 fix: set default language to "en" 2012-12-16 20:53:45 +01:00
Michael Peter Christen
9319b90d8a - fixes for host navigation
- fixes for filetype navigation
- removed unused code
2012-12-15 09:14:49 +01:00
Michael Peter Christen
cb5cbec14d distinguishing modified query string and original query string 2012-12-15 00:05:46 +01:00
Michael Peter Christen
fb0fa9a102 - fixed 'delete from subpath' during crawl start which deleted nothing;
now works;
- changed some crawl start html design details
2012-12-11 13:38:28 +01:00
Aleksej
23d4a62345 fixes in the Russian translation, chmod a-x cn.lng 2012-12-11 13:44:25 +04:00
orbiter
899fd8b62d Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-12-10 21:18:56 +01:00
orbiter
712cc37c40 if maxFileSize < 0 then there is no file size limit. 2012-12-10 21:17:45 +01:00
reger
3f26aabfb3 quickfix for translated link containing word "browse" in ru & uk, see http://bugs.yacy.net/view.php?id=213 2012-12-10 21:08:04 +01:00
orbiter
f86d469973 more search command tools 2012-12-10 21:01:14 +01:00
orbiter
54e193a2b8 you can now search for '*' to get just ALL entries in the search index
as result list. This makes sense if you intend to search just by using
the navigation tools to cut the data set into navigation 'slices'.
2012-12-10 21:00:30 +01:00
orbiter
7f5526e6ef allow larger no-proxy expressions 2012-12-10 20:59:43 +01:00
orbiter
1228a5798d you can now search for '*' to get just ALL entries in the search index
as result list. This makes sense if you intend to search just by using
the navigation tools to cut the data set into navigation 'slices'.
2012-12-10 20:55:11 +01:00
orbiter
1f33c30d7b re-integrating useForHost method (lost sometime?) to get the noProxy
pattern working again. Without using this method all remote urls
including the localhost had been accessed through the configured proxy
2012-12-10 20:44:29 +01:00
reger
f1a9c2e604 fix Servlet template on conditional file include with use of conditional template pattern in included template file (example IndexCreateQueues_p.html)
see bug http://bugs.yacy.net/view.php?id=215
2012-12-10 20:02:35 +01:00
orbiter
a4a780b871 - fix for bad url conversion in bookmarks when using smb urls
- fix for localhost hosts in solr schema host handling
2012-12-10 07:22:42 +01:00
reger
e80dfeca23 - making blacklist path part case insensitive (solving http://bugs.yacy.net/view.php?id=171)
- blacklist test adding explicit response text "not blocked" if no blacklist match
2012-12-08 06:34:48 +01:00
reger
e2d499be9e remove NOT NEEDED reference to solr.YaCySchema from ConfigurationSet to be able to use ConfigurationSet for other conf files (than solr.keys.default.list). 2012-12-08 00:19:20 +01:00
Michael Peter Christen
a3cd3852ab introduced a better place to update the lastacc time value in latency 2012-12-07 15:49:23 +01:00
Michael Peter Christen
864abcd33d removed Latency update after URL selection because that causes
a completely wrong behaviour when cache fresh cases appear. Makes
re-crawling MUCH faster!
2012-12-07 15:35:44 +01:00
Michael Peter Christen
4491072256 - clear the search cache when altering the solr boosts
- better positions for submit buttons
2012-12-07 14:56:34 +01:00
Michael Peter Christen
2b7d46bc1f using a filter query for the site parameter in GSA api 2012-12-07 14:54:49 +01:00
Michael Peter Christen
dd241d03bb latency fix: only set last-visit time if access was actually by the
robot
2012-12-07 02:00:12 +01:00
Michael Peter Christen
118233a7e6 fix for bad xml in gsa result when doing a query with quotes 2012-12-07 01:35:02 +01:00
Michael Peter Christen
1e002ab18e added another blacklist-cleaner into balancer 2012-12-07 01:27:24 +01:00
Michael Peter Christen
10527e28ae fix for wrong display of error urls in HostBrowser 2012-12-07 00:31:10 +01:00
Michael Peter Christen
756772fbd3 fix for waitingtime computation for intranet configuration 2012-12-06 17:40:52 +01:00
Michael Peter Christen
fa27e5820f - check blacklist (again) when taking urls from the crawl stack because
the blacklist may get extended during crawling
- removed debug output
2012-12-06 00:12:16 +01:00
Michael Peter Christen
5f5d66921e patch for funny symbols in url paths (like tilde) 2012-12-05 22:05:49 +01:00
Michael Peter Christen
adfecc6ba8 more robustness during shutdown 2012-12-05 18:20:43 +01:00
Michael Peter Christen
d4bfe9339e Brute-force attempt to start solr in case of a memory problem.
I don't actually know if this is correct. It is a desperate try to get
YaCy running on production servers which must get alive even with
strange hacks like this. This is also related to a forum posting in
http://forum.yacy-websuche.de/viewtopic.php?t=4528&p=27135#p27135
2012-12-05 18:16:06 +01:00
Michael Peter Christen
8aa08261a7 update to Solr Boost handling 2012-12-05 12:26:42 +01:00
Michael Peter Christen
908ad2f174 Added a new servlet to configure the solr ranking using field boosts 2012-12-03 17:01:19 +01:00
Michael Peter Christen
a598fb6227 renamed Ranking_p.html to RankingRWI_p.html
because another Ranking servlet will be added next
2012-12-03 00:01:41 +01:00
Michael Peter Christen
a01e47b992 enhanced exists()-method for solr; should reduce a lot of IO during DHT
target selection
2012-12-02 17:29:37 +01:00
Michael Peter Christen
72f165d58b added a Boost class which stores solr query boost values. The class can
be configured using the yacy.init file. The boost information is taken
from the configuration each time when a query to solr is done.
2012-12-02 16:54:29 +01:00
Michael Peter Christen
ea033f8f8e added number of characters in url to default index to be able to use
this field for ranking
2012-12-02 16:53:02 +01:00
Michael Peter Christen
b5ee88c6af added more logging to get info which url causes performance problems 2012-12-02 16:52:12 +01:00
reger
1faa045dc1 fix: prevent regex pattern compile error for blacklist import for path '*' (extend it to '.*') 2012-12-01 22:41:21 +01:00
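The compile error addressed in 1faa045dc1 comes from the fact that a bare `*` is a quantifier with nothing to repeat. The same behaviour can be shown with Python's `re` module as a stand-in for the Java regex engine; the helper name `compile_path_pattern` is hypothetical.

```python
import re


def compile_path_pattern(path: str):
    """Compile a blacklist path as a regex (hypothetical helper).

    A bare '*' is not a valid regular expression, so it is
    extended to '.*', which matches any path.
    """
    if path == "*":
        path = ".*"
    return re.compile(path)


# Without the extension, compiling the bare wildcard fails.
try:
    re.compile("*")
    bare_star_compiles = True
except re.error:
    bare_star_compiles = False
```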
reger
bb20691d4f fix: respect config setting of "show Nav Top-Menu" in HostBrowser.html for public users (as hostbrowser is now available in search results) 2012-12-01 01:14:29 +01:00
reger
6cf33f899c prevent Solr "version conflict" on update by set Solr "_version_" field to 0 (=no version check) 2012-11-28 00:09:53 +01:00
Michael Peter Christen
acd98bebb7 improvements in GSA result writer 2012-11-26 15:18:51 +01:00