Commit Graph

4135 Commits

Author SHA1 Message Date
Michael Peter Christen
1052263af3 - added a new solr field references_i which stores the number of
INCOMING links to the corresponding web page. This information is taken
from the reverse link index (a 'little sister' of the RWI index).
- this field can be of use to enhance the ranking because a web page
with more incoming links can be more important than others. But
this is not true for typical link pages like menus. Therefore the
number of outgoing links is needed.
- added a new solr attribute 'bf' to solr queries which is a boost
function extension. This field can contain a formula which computes the
boost according to given field values. After some experiments the
following formula is now default:
div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4
This takes the number of references and the inbound links. Further
experiments are needed to enhance that formula.
2012-12-18 14:42:35 +01:00
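For illustration, the default boost function from this commit can be evaluated outside Solr. This is a minimal sketch, assuming the trailing `^0.4` acts as a multiplicative weight on the function result (as with Solr's dismax `bf` parameter) rather than as an exponent. The function name `boost` and the sample values are hypothetical and not part of YaCy.

```python
def boost(references: int, inbound_links: int, weight: float = 0.4) -> float:
    """Evaluate div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4.

    Assumption: the trailing ^0.4 is a weight multiplied onto the result.
    """
    numerator = 1 + references                 # add(1,references_i)
    denominator = (1 + inbound_links) ** 1.6   # pow(add(1,inboundlinkscount_i),1.6)
    return weight * numerator / denominator    # div(...) scaled by the weight


if __name__ == "__main__":
    # Hypothetical documents: (references_i, inboundlinkscount_i)
    for refs, links in [(0, 0), (9, 0), (9, 9)]:
        print(refs, links, round(boost(refs, links), 4))
```

Under this reading, two pages with the same `references_i` value are ranked apart by `inboundlinkscount_i`: the page with the higher link count receives the smaller boost.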
Michael Peter Christen
34f8786508 removed dependency of vocabulary navigation from Jena and its
triplestore; the vocabulary search is now done using generic solr fields
which are created on-the-fly during runtime.
2012-12-18 02:29:03 +01:00
reger
664499bb10 PerformanceQueues: disable input for hardcoded httpd performance values 2012-12-16 21:01:13 +01:00
Michael Peter Christen
9319b90d8a - fixes for host navigation
- fixes for filetype navigation
- removed unused code
2012-12-15 09:14:49 +01:00
Michael Peter Christen
cb5cbec14d distinguishing modified query string and original query string 2012-12-15 00:05:46 +01:00
Michael Peter Christen
fb0fa9a102 - fixed 'delete from subpath' during crawl start which deleted nothing;
now works;
- changed some crawl start html design details
2012-12-11 13:38:28 +01:00
orbiter
54e193a2b8 you can now search for '*' to get just ALL entries in the search index
as result list. This makes sense if you intend to search just by using
the navigation tools to cut the data set into navigation 'slices'.
2012-12-10 21:00:30 +01:00
orbiter
7f5526e6ef allow larger no-proxy expressions 2012-12-10 20:59:43 +01:00
reger
e80dfeca23 - making blacklist path part case insensitive (solving http://bugs.yacy.net/view.php?id=171)
- blacklist test adding explicit response text "not blocked" if no blacklist match
2012-12-08 06:34:48 +01:00
Michael Peter Christen
4491072256 - clear the search cache when altering the solr boosts
- better positions for submit buttons
2012-12-07 14:56:34 +01:00
Michael Peter Christen
2b7d46bc1f using a filter query for the site parameter in GSA api 2012-12-07 14:54:49 +01:00
Michael Peter Christen
10527e28ae fix for wrong display of error urls in HostBrowser 2012-12-07 00:31:10 +01:00
Michael Peter Christen
5f5d66921e patch for funny symbols in url paths (like tilde) 2012-12-05 22:05:49 +01:00
Michael Peter Christen
8aa08261a7 update to Solr Boost handling 2012-12-05 12:26:42 +01:00
Michael Peter Christen
908ad2f174 Added a new servlet to configure the solr ranking using field boosts 2012-12-03 17:01:19 +01:00
Michael Peter Christen
a598fb6227 renamed Ranking_p.html to RankingRWI_p.html
because another Ranking servlet will be added next
2012-12-03 00:01:41 +01:00
Michael Peter Christen
72f165d58b added a Boost class which stores solr query boost values. The class can
be configured using the yacy.init file. The boost information is taken
from the configuration each time when a query to solr is done.
2012-12-02 16:54:29 +01:00
reger
bb20691d4f fix: respect config setting of "show Nav Top-Menu" in HostBrowser.html for public users (as hostbrowser is now available in search results) 2012-12-01 01:14:29 +01:00
Michael Peter Christen
3de784c8dd replaced more split and replaceAll calls that lacked pattern
pre-compilation with pre-compiled patterns
2012-11-26 13:40:53 +01:00
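The change above concerns Java, where `String.replaceAll` compiles its regular expression on every call and `String.split` does so for anything beyond simple single-character separators. The same idea can be sketched in Python: compile the pattern once and reuse the compiled object. The names `COMMA`, `WHITESPACE`, `split_list` and `collapse_whitespace` are hypothetical. Note that Python's `re` module keeps an internal cache of compiled patterns, so the gain is smaller there than in Java.

```python
import re
from typing import List

# Compiled once at import time and reused for every call.
COMMA = re.compile(r"\s*,\s*")
WHITESPACE = re.compile(r"\s+")


def split_list(value: str) -> List[str]:
    """Split a comma-separated string using the pre-compiled pattern."""
    return [part for part in COMMA.split(value.strip()) if part]


def collapse_whitespace(value: str) -> str:
    """Replace every run of whitespace with a single space."""
    return WHITESPACE.sub(" ", value).strip()


if __name__ == "__main__":
    print(split_list(" a , b,c "))
    print(collapse_whitespace("  a \t b\n"))
```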
Michael Peter Christen
8fc3679c66 using more pre-compiled patterns for split methods 2012-11-26 13:11:55 +01:00
Michael Peter Christen
d48e9788d2 enhanced search result processing behavior
- query less at one time; query more often
- in between the small queries, evaluate results
- remove fields from search results which are not needed
2012-11-26 12:24:35 +01:00
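The three points of this commit (small queries, evaluation between queries, fewer fields) can be sketched as a generator. This is only an illustration under stated assumptions: `search_in_chunks`, the `fetch` callback signature and `fake_backend` are hypothetical and do not reflect YaCy's actual search classes.

```python
from typing import Callable, Dict, Iterator, List, Sequence

Doc = Dict[str, str]


def search_in_chunks(
    fetch: Callable[[int, int, List[str]], List[Doc]],
    wanted: int,
    chunk_size: int = 10,
    fields: Sequence[str] = ("url", "title"),
) -> Iterator[Doc]:
    """Yield up to `wanted` results, querying the backend in small chunks."""
    offset = 0
    delivered = 0
    while delivered < wanted:
        # Query less at one time, but query more often.
        chunk = fetch(offset, chunk_size, list(fields))
        if not chunk:
            break  # the backend has no more results
        for doc in chunk:
            # Evaluate between queries: keep only the fields that are needed.
            yield {name: doc[name] for name in fields if name in doc}
            delivered += 1
            if delivered >= wanted:
                return
        offset += len(chunk)


def fake_backend(offset: int, rows: int, fields: List[str]) -> List[Doc]:
    """Hypothetical stand-in for a search index holding 25 documents."""
    index = [
        {"url": f"http://example.org/{i}", "title": f"Page {i}", "text": "..."}
        for i in range(25)
    ]
    return index[offset:offset + rows]


if __name__ == "__main__":
    for hit in search_in_chunks(fake_backend, wanted=12, chunk_size=5):
        print(hit["url"], hit["title"])
```

Because the function is a generator, a caller can stop consuming results early and no further queries are issued.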
Michael Peter Christen
eca68fa197 added debug code to crawler monitor 2012-11-25 15:43:42 +01:00
Michael Peter Christen
205f8b222b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-11-25 14:41:49 +01:00
orbiter
c54cb85422 added link to
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
to the /RegexTest.html servlet
2012-11-25 12:20:41 +01:00
Michael Peter Christen
b7004043ea - added a field cache for solr queries which call only for a single
value
- fixed a version conflict exception within a solr add request
2012-11-24 22:30:05 +01:00
Michael Peter Christen
bf42179982 introduced more structure in HostBrowser, table view, better counting,
distinguishing of error cases (fail/excluded)
2012-11-23 14:09:48 +01:00
Michael Peter Christen
4eab3aae60 removed overhead by preventing generation of full search results when
only the url is requested
2012-11-23 01:35:28 +01:00
Michael Peter Christen
a114bb23bb - using edismax in gsa interface
- generating less field data for gsa search results
- using a boost query in gsa interface to move double content to the end
of the result list
2012-11-22 13:03:33 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignature.
As a result, a signature of the document is written to the solr search
index. Additionally, each time a signature is written, it is
checked whether the signature already exists in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because this is the first time a long number is used as a value in the Solr
store, the value naming in the YaCySchema also had to be adapted and
normalized. As a result, many files had to be changed.
2012-11-21 18:46:49 +01:00
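The mechanism described in this commit can be sketched as follows. This is a simplified profile signature in the spirit of Nutch's TextProfileSignature (tokenize, count, quantize frequencies, hash the profile), not the enhanced version used in YaCy. The names `profile_signature` and `mark_unique` are hypothetical, and the in-memory `seen` set only stands in for the cached 'exists' check against the index.

```python
import hashlib
import re
from collections import Counter
from typing import Set

TOKEN = re.compile(r"[a-z0-9]+")


def profile_signature(text: str, min_token_len: int = 2,
                      quant_rate: float = 0.01) -> int:
    """Return a 64-bit signature that tolerates small textual differences."""
    tokens = [t for t in TOKEN.findall(text.lower()) if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return 0
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    # Quantize frequencies and drop tokens that fall below one quantum.
    profile = []
    for token, count in counts.items():
        quantized = (count // quant) * quant
        if quantized >= quant:
            profile.append((token, quantized))
    profile.sort(key=lambda item: (-item[1], item[0]))
    digest = hashlib.md5(
        "\n".join(f"{token} {count}" for token, count in profile).encode("utf-8")
    ).digest()
    # Use the first 8 bytes as a signed long value.
    return int.from_bytes(digest[:8], "big", signed=True)


def mark_unique(signature: int, seen: Set[int]) -> bool:
    """Return True if the signature is new; record it as a side effect."""
    if signature in seen:
        return False
    seen.add(signature)
    return True


if __name__ == "__main__":
    seen: Set[int] = set()
    docs = [
        "solr search solr index search engine",
        "Solr search, solr index search crawler",  # near-duplicate of the first
        "peer peer network network",
    ]
    for doc in docs:
        print(mark_unique(profile_signature(doc), seen), doc)
```

Documents flagged as not unique could then be sorted to the end of a result list, as the commit message describes.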
Michael Peter Christen
f5ca5cea44 - added field options to all solr queries. This can be used to restrict
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during acquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
2012-11-19 17:24:34 +01:00
Michael Peter Christen
46be4af5b9 Merge commit '2bb8f045cc92f31fc7e720cc30b38af417563890' 2012-11-18 22:11:04 +01:00
Michael Peter Christen
952e143580 FINALLY YaCy can now search for full strings using double- or
single-quoted strings in the search query line!!!
2012-11-18 16:03:34 +01:00
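A query line with double- or single-quoted strings can be split into full-string phrases and single terms roughly as shown below. This is a minimal sketch: `parse_query` and the `PART` pattern are hypothetical and are not the actual QueryGoal parser.

```python
import re
from typing import List, Tuple

# A double-quoted phrase, a single-quoted phrase, or a bare term.
PART = re.compile(r'"([^"]*)"|\'([^\']*)\'|(\S+)')


def parse_query(query: str) -> Tuple[List[str], List[str]]:
    """Split a query line into (phrases, terms)."""
    phrases: List[str] = []
    terms: List[str] = []
    for double, single, bare in PART.findall(query):
        phrase = (double or single).strip()
        if phrase:
            phrases.append(phrase)  # must match as a full string
        elif bare:
            terms.append(bare)      # ordinary single-word term
    return phrases, terms


if __name__ == "__main__":
    print(parse_query('yacy "peer to peer" \'open source\' search'))
```

In this sketch, an unmatched quote is kept as part of a bare term and an empty pair of quotes is ignored.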
orbiter
5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the
query string parser. This shall be used to create a proper full-string
matching which is handled then by QueryGoal.
2012-11-18 01:22:41 +01:00
Michael Peter Christen
5fd3b93661 added deletion of hosts during crawl start if deleteold option was given 2012-11-13 16:54:28 +01:00
Michael Peter Christen
d64445c3cb because we have the inurl:<term> - searchmodifier, we don't actually
need regular expressions as search attributes. They have now been removed
from the advanced search page while they are still created internally.
The filter is then expressed against solr as regular expression filter
query. If the expression points out a selection of a specific protocol,
host or filetype this is then translated into a faceted query.
2012-11-13 11:45:56 +01:00
orbiter
b55ea2197f - redesign of crawl start servlet
- for domain-limited crawls, the domain is deleted now by default before
the crawl is started
2012-11-13 10:54:21 +01:00
orbiter
1c66de4bd4 - removed scheduled crawling options in crawl start because it is
superfluous there; it can be changed in the scheduler servlet. It's also
confusing in the presence of the delete-option, which will be
implemented next.
- removed unused crawl start servlet
- some refactoring to make the time parser reusable
2012-11-12 11:19:39 +01:00
Michael Peter Christen
2e7219f9fd removed highlighting of search results within collections in GSA
interface
2012-11-09 16:25:24 +01:00
Michael Peter Christen
074dfd297b added icons and a selection for hosts with urls pending for crawler or
with errors
2012-11-09 16:24:56 +01:00
cominch
21df1ad9e0 update and generalization of the SMW import and content control routines 2012-11-09 13:48:40 +01:00
Michael Peter Christen
4c4e0eece2 added new submenu 'Target Analysis' with three servlets which are useful
to analyse the target servers: robots.txt table, mass target analysis
and a regex tester
2012-11-07 21:26:01 +01:00
Michael Peter Christen
61995d508e do the commit anyway before calling a search interface 2012-11-07 17:27:50 +01:00
Michael Peter Christen
86ec199126 using a better file name 2012-11-07 16:39:49 +01:00
Michael Peter Christen
5105256927 update to search result logging (this was a remaining issue from the
solr 4.0.0 migration)
2012-11-07 14:15:27 +01:00
Michael Peter Christen
570e42c4e3 fix for filetype navigator 2012-11-07 13:53:29 +01:00
Michael Peter Christen
71ed8e5e07 bugfixes for crawler 2012-11-07 12:52:19 +01:00
Michael Peter Christen
29fbbb49dc better colors for host browser and corrected document count 2012-11-07 12:23:21 +01:00
Michael Peter Christen
6244b084cd fixed wrong order of result count values 2012-11-07 02:29:33 +01:00
Michael Peter Christen
631b08e7e2 update to HostBrowser 2012-11-07 02:17:24 +01:00
Michael Peter Christen
51f420e4f5 removed location search because it is only working in special cases 2012-11-07 02:04:41 +01:00