9315 Commits

Author SHA1 Message Date
Michael Peter Christen
631b08e7e2 update to HostBrowser 2012-11-07 02:17:24 +01:00
Michael Peter Christen
51f420e4f5 removed location search because it is only working in special cases 2012-11-07 02:04:41 +01:00
Michael Peter Christen
52df6ee369 more logging 2012-11-07 02:04:08 +01:00
Michael Peter Christen
158732af37 automatically delete entries from the crawl profile list if crawl is
terminated.
2012-11-07 02:03:44 +01:00
Michael Peter Christen
15d1460b40 added information about the reason of pausing of crawls 2012-11-06 15:21:56 +01:00
Michael Peter Christen
2371ef031c added solr faceted search support to YaCy search results
added solr highlighting / YaCy snippets to YaCy search results
- facets are now much more complete
- facets are computed and searched much faster
- snippet computation is done by solr if solr knows the snippet
2012-11-06 14:32:08 +01:00
Michael Peter Christen
b30a7162fa added more thread-renaming for search processes 2012-11-06 12:31:23 +01:00
Michael Peter Christen
900445d8e9 set the thread name during solr queries to the solr query to get better
debugging options
2012-11-06 11:48:04 +01:00
Michael Peter Christen
d481abd087 added the visualization of error-urls to host browser
- only visible for admins
- a faceted search generates a huge list for all hosts in the host list
- the faceted search algorithms had to be modified for that
- within the browsing of the directory path, the error cause is written
to the url which is presented as error-url
- the errors are also accumulated for directory sums
2012-11-06 00:29:37 +01:00
Michael Peter Christen
a15819fbec fix for some interface problems 2012-11-05 22:14:52 +01:00
Michael Peter Christen
791e1dcfdf when a new crawl is started, delete all entries about error-urls for
crawl-start domains
2012-11-05 22:14:27 +01:00
Michael Peter Christen
c6a6f4c4e6 added a hack which makes the HostBrowser more performant when the given
host has a lot of urls. If the number of urls is > 1000, then the list
of documents is restricted to such which have no subpath, if the root
path is selected. However, this can cause a problem if no documents on
the root path exist but only on paths below that root path.
2012-11-05 18:57:21 +01:00
Michael Peter Christen
619bf7e875 fixed filetype modifier for media types in text search 2012-11-05 18:08:00 +01:00
Michael Peter Christen
97f82994a6 automatically pause the crawler if there is a problem with solr 2012-11-05 16:34:42 +01:00
Michael Peter Christen
64ac2b7b7d new submenu template 2012-11-05 15:36:42 +01:00
Michael Peter Christen
5e77801aac update to web interface structure 2012-11-05 15:23:03 +01:00
Michael Peter Christen
8fb370d9f8 renovated the way search results are counted. should be correct now... 2012-11-05 03:19:28 +01:00
Michael Peter Christen
7bec253bb0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-11-04 09:21:58 +01:00
Michael Peter Christen
d88eb657fd Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 2012-11-04 09:21:21 +01:00
orbiter
354ef8000d - added 'deleteold' option to crawler which causes that documents are
deleted which are selected by a crawl filter (host or subpath)
- site crawl uses this option by default now
- made option to deleteDomain() concurrency
2012-11-04 02:58:26 +01:00
reger
633fbe9188 Fix Metadata handling
- language default on missing lang property to "uk" (fix set to nothing)
-  language set to TLD (added call to existing language calculation from TLD)
-  coordinate number exception on possible lat/lon content of "NaN,NaN"

adjust Netbeans IDE classpath (for Solr/Lucene 4.0.0 jars)
2012-11-04 02:07:59 +01:00
Michael Peter Christen
19d1f474ce host browser now shows also number of pending files per subdirectory +
bugfixes
2012-11-02 14:40:02 +01:00
Michael Peter Christen
75dd706e1b update to HostBrowser:
- time-out after 3 seconds to speed up display (may be incomplete)
- showing also all links from the balancer queue in the host list (after
the '/') and in the result browser view with tag 'loading'
2012-11-02 13:57:43 +01:00
Michael Peter Christen
e2c4c3c7d3 migration to solr 4.0.0 2012-11-02 12:29:48 +01:00
Michael Peter Christen
b764de424a code cleanup 2012-11-02 10:28:32 +01:00
Michael Peter Christen
69aa39d664 update to libraries required by solr 4.0.0 2012-11-02 10:27:44 +01:00
Michael Peter Christen
9330ad4838 - fixed the delete option in host browser
- added a delete method which can be used to delete a full subpath in
solr.
2012-11-02 01:22:31 +01:00
Michael Peter Christen
a63179f3f9 added the MIME attribute for the R tag in GSA search result writer 2012-11-02 00:14:29 +01:00
Michael Peter Christen
40df2fd193 added the host browser as link to search results. that means you can
select a browsing position after a search is done on the search results.
2012-11-01 21:38:05 +01:00
Michael Peter Christen
1168d09de8 more refactoring - integrated the code of SnippetProcess into
SearchEvent
2012-11-01 17:40:06 +01:00
Michael Peter Christen
6629e37685 tried to clean up the search process mess 2012-11-01 17:16:43 +01:00
Michael Peter Christen
c5f67a5d6d fixed a problem with local search from solr results: now all results
from solr are shown (again)
2012-11-01 10:22:22 +01:00
sixcooler
02957d5982 missing license-files
(sorry I didn't commit these files by mistake)
2012-10-31 23:47:08 +01:00
Michael Peter Christen
16216c2344 added missing libraries 2012-10-31 23:29:47 +01:00
sixcooler
9d062873d2 bump to httpclient-4.2.2 2012-10-31 19:09:48 +01:00
Michael Peter Christen
f8f05ecba7 - added a delete button in host browser to delete a complete subpath
- removed storage of default collection name - default is now "user"
- made stacking of crawl start points concurrent
2012-10-31 17:44:45 +01:00
Michael Peter Christen
0716a24737 added more / all new crawl profile fields into crawl profile editor 2012-10-31 15:13:05 +01:00
Michael Peter Christen
4a14122ba7 in case that a crawl profile has a collection assigned, use the
collection to show a name in the web interface. This should prevent that
much too long names make the interface unusable.
2012-10-31 14:08:33 +01:00
Michael Peter Christen
0fe8be7981 enhanced data structures for balancer and latency computation which
should produce a bit better prognosis about forced waiting times.
2012-10-30 17:30:24 +01:00
Michael Peter Christen
ac9540dfb6 removed options for stopwords which are not used 2012-10-30 12:36:36 +01:00
Michael Peter Christen
ce3fed8882 added the Google Search Appliance (GSA) api interface to the main menu.
See:
https://developers.google.com/search-appliance/documentation/68/xml_reference#request_overview
2012-10-30 12:27:22 +01:00
Michael Peter Christen
b2ffd49817 less latency 2012-10-30 12:26:32 +01:00
Michael Peter Christen
0833937c1c better balancing and duetime-computation also for no-delay intranet
hosts
2012-10-30 11:28:49 +01:00
Michael Peter Christen
c326aa8f67 disabled writing new entries to crawl stacks to prevent that a domain
with many documents blocks refreshing of the crawl queue
2012-10-29 22:26:52 +01:00
Michael Peter Christen
6905182d41 - fix for number of words log message
- adding meta:refresh also to crawler stack
2012-10-29 21:42:31 +01:00
Michael Peter Christen
c25d7bcb80 - added concurrency for robots.txt loading
- changed data model for domain counter
2012-10-29 21:08:45 +01:00
Michael Peter Christen
a94c537afc fixed getSize() which can use the cache size while the crawl is running 2012-10-29 11:56:07 +01:00
Michael Peter Christen
96912c9471 enhancement to solr caching: consider that during a get() the document
is not in solr but the cache points out that a commit is needed to get
the document.
2012-10-29 11:35:24 +01:00
Michael Peter Christen
a87811bc38 more auto-commit calls when a search interface is opened, but not when a
search is done there to prevent blocking during search-time.
2012-10-29 11:27:13 +01:00
Michael Peter Christen
3d3d654e88 if a network configuration is chosen which does not allow DHT and no
P2P communication (is in robinson mode) then some menu entries are
disabled which have no use in this mode.
2012-10-29 01:51:19 +01:00