Commit Graph

388 Commits

Author SHA1 Message Date
Michael Peter Christen
b7004043ea - added a field cache for solr queries which call only for a single
value
- fixed a version conflict exception within a solr add request
2012-11-24 22:30:05 +01:00
Michael Peter Christen
efd2c4622d added a new fail type attribute for the index to distinguish two
separate fail types: network fail and forced exclusion (i.e. by robots
or forwarding rules).
2012-11-23 14:00:30 +01:00
Michael Peter Christen
4eab3aae60 removed overhead by preventing generation of full search results when
only the url is requested
2012-11-23 01:35:28 +01:00
Michael Peter Christen
d6b82840f8 added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignature.
As a result, a signature of the document is written to the solr search
index. Additionally, each time a signature is written, it is
checked whether the signature already exists in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because this is the first time a long number is used as a value in the Solr
store, the value naming in the YaCySchema also had to be adapted and
normalized. As a result, many files had to be changed.
2012-11-21 18:46:49 +01:00
Michael Peter Christen
f5ca5cea44 - added field options to all solr queries. This can be used to restrict
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during acquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
2012-11-19 17:24:34 +01:00
Michael Peter Christen
46be4af5b9 Merge commit '2bb8f045cc92f31fc7e720cc30b38af417563890' 2012-11-18 22:11:04 +01:00
Michael Peter Christen
952e143580 FINALLY YaCy can now search for full strings using double- or
single-quoted strings in the search query line!!!
2012-11-18 16:03:34 +01:00
orbiter
5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the
query string parser. This shall be used to create a proper full-string
matching which is handled then by QueryGoal.
2012-11-18 01:22:41 +01:00
cominch
2bb8f045cc content control: use up-to-date definitions 2012-11-13 17:32:19 +01:00
Michael Peter Christen
5fd3b93661 added deletion of hosts during crawl start if deleteold option was given 2012-11-13 16:54:28 +01:00
Michael Peter Christen
d64445c3cb because we have the inurl:<term> search modifier, we don't actually
need regular expressions as search attributes. They have now been removed
from the advanced search page, although they are still created internally.
The filter is then expressed against solr as a regular expression filter
query. If the expression selects a specific protocol,
host or filetype, it is translated into a faceted query.
2012-11-13 11:45:56 +01:00
cominch
d2a94cc55e refactor package 2012-11-09 16:22:24 +01:00
cominch
21df1ad9e0 update and generalization of the SMW import and content control routines 2012-11-09 13:48:40 +01:00
Michael Peter Christen
842faf96a2 fixed media search 2012-11-07 17:27:13 +01:00
Michael Peter Christen
93001586a0 removed warnings, removed too-fast pausing of crawls 2012-11-07 15:37:14 +01:00
Michael Peter Christen
8041742e48 added matching of path to query pattern 2012-11-07 15:06:13 +01:00
Michael Peter Christen
570e42c4e3 fix for filetype navigator 2012-11-07 13:53:29 +01:00
Michael Peter Christen
71ed8e5e07 bugfixes for crawler 2012-11-07 12:52:19 +01:00
Michael Peter Christen
12c0db20e5 fixed npe for surrogate import 2012-11-07 02:46:51 +01:00
Michael Peter Christen
52df6ee369 more logging 2012-11-07 02:04:08 +01:00
Michael Peter Christen
158732af37 automatically delete entries from the crawl profile list if crawl is
terminated.
2012-11-07 02:03:44 +01:00
Michael Peter Christen
15d1460b40 added information about the reason of pausing of crawls 2012-11-06 15:21:56 +01:00
Michael Peter Christen
2371ef031c added solr faceted search support to YaCy search results
added solr highlighting / YaCy snippets to YaCy search results
- facets are now much more complete
- facets are computed and searched much faster
- snippet computation is done by solr if solr knows the snippet
2012-11-06 14:32:08 +01:00
Michael Peter Christen
d481abd087 added the visualization of error-urls to host browser
- only visible for admins
- a faceted search generates a huge list for all hosts in the host list
- the faceted search algorithms had to be modified for that
- within the browsing of the directory path, the error cause is written
to the url which is presented as error-url
- the errors are also accumulated for directory sums
2012-11-06 00:29:37 +01:00
Michael Peter Christen
791e1dcfdf when a new crawl is started, delete all entries about error-urls for
crawl-start domains
2012-11-05 22:14:27 +01:00
Michael Peter Christen
619bf7e875 fixed filetype modifier for media types in text search 2012-11-05 18:08:00 +01:00
Michael Peter Christen
97f82994a6 automatically pause the crawler if there is a problem with solr 2012-11-05 16:34:42 +01:00
Michael Peter Christen
8fb370d9f8 renovated the way search results are counted. should be correct now... 2012-11-05 03:19:28 +01:00
orbiter
354ef8000d - added 'deleteold' option to crawler which causes documents
selected by a crawl filter (host or subpath) to be deleted
- site crawl uses this option by default now
- made option to deleteDomain() concurrency
2012-11-04 02:58:26 +01:00
Michael Peter Christen
75dd706e1b update to HostBrowser:
- time-out after 3 seconds to speed up display (may be incomplete)
- showing also all links from the balancer queue in the host list (after
the '/') and in the result browser view with tag 'loading'
2012-11-02 13:57:43 +01:00
Michael Peter Christen
e2c4c3c7d3 migration to solr 4.0.0 2012-11-02 12:29:48 +01:00
Michael Peter Christen
b764de424a code cleanup 2012-11-02 10:28:32 +01:00
Michael Peter Christen
9330ad4838 - fixed the delete option in host browser
- added a delete method which can be used to delete a full subpath in
solr.
2012-11-02 01:22:31 +01:00
Michael Peter Christen
1168d09de8 more refactoring - integrated the code of SnippetProcess into
SearchEvent
2012-11-01 17:40:06 +01:00
Michael Peter Christen
6629e37685 tried to clean up the search process mess 2012-11-01 17:16:43 +01:00
Michael Peter Christen
c5f67a5d6d fixed a problem with local search from solr results: now all results
from solr are shown (again)
2012-11-01 10:22:22 +01:00
Michael Peter Christen
f8f05ecba7 - added a delete button in host browser to delete a complete subpath
- removed storage of default collection name - default is now "user"
- made stacking of crawl start points concurrent
2012-10-31 17:44:45 +01:00
Michael Peter Christen
4a14122ba7 if a crawl profile has a collection assigned, use the
collection to show a name in the web interface. This should prevent
overly long names from making the interface unusable.
2012-10-31 14:08:33 +01:00
Michael Peter Christen
0833937c1c better balancing and duetime-computation also for no-delay intranet
hosts
2012-10-30 11:28:49 +01:00
Michael Peter Christen
c326aa8f67 disabled writing new entries to crawl stacks to prevent a domain
with many documents from blocking refreshing of the crawl queue
2012-10-29 22:26:52 +01:00
Michael Peter Christen
6905182d41 - fix for number of words log message
- adding meta:refresh also to crawler stack
2012-10-29 21:42:31 +01:00
Michael Peter Christen
c25d7bcb80 - added concurrency for robots.txt loading
- changed data model for domain counter
2012-10-29 21:08:45 +01:00
Michael Peter Christen
2d9e577ad0 replaced the custom robots.txt loader by the standard http loader 2012-10-28 22:48:11 +01:00
Michael Peter Christen
799d71bc67 enhanced solr caching:
- increased cache size which is needed for longer solr commit time
- speed hacks on cache write code
2012-10-28 20:31:29 +01:00
Michael Peter Christen
a33e2742cb - removed unnecessary synchronized and deadlock in crawler
- removed problem with monitoring object on Balancer.wait
- added missing user agent settings
2012-10-28 19:56:02 +01:00
Michael Peter Christen
8e1248ffe3 force a commit in advance of a search for the administrator to get most
recent results even if commit time is high and an indexing is ongoing.
2012-10-26 15:35:42 +02:00
Michael Peter Christen
3b48c78190 added an option to force a commit to solr.
may be used by a search front-end in case that the commitWithinMs time
is too short to get recently indexed documents.
2012-10-26 07:39:07 +02:00
orbiter
276dd6452b removed warnings 2012-10-23 19:08:44 +02:00
sixcooler
47ae7e322e smaller dhtDispatcher.cloudSize
@Orbiter: we talked about this times ago - please revert if I'm wrong
2012-10-21 20:05:28 +02:00
Michael Peter Christen
ce0e5b1e17 - more refactoring / private methods
- fix for usage of custom solr field names
2012-10-18 15:09:04 +02:00