Commit Graph

9007 Commits

Author SHA1 Message Date
Michael Peter Christen
ca313e404f - if a "/date" modifier is used, the solr remote query applies an
ordering by date (ascending)
- added also some 'anti-timetravel' protection (check if date is in the
future within any metadata date field)
2012-09-26 16:56:33 +02:00
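The 'anti-timetravel' check mentioned in this commit can be illustrated with a minimal sketch. This is not YaCy's code: the class `DateSanity` and both method names are hypothetical, and replacing a future date with the current time is an assumed policy (the commit message only says that the date is checked).

```java
import java.util.Date;

// Hypothetical helper; names are illustrative and not taken from the YaCy code base.
final class DateSanity {

    // Returns true if the candidate date lies strictly after the reference time.
    static boolean isInFuture(Date candidate, Date now) {
        return candidate != null && candidate.after(now);
    }

    // Assumed policy: a date from the future is replaced by the reference time.
    static Date clampToNow(Date candidate, Date now) {
        return isInFuture(candidate, now) ? now : candidate;
    }
}
```

Passing the reference time as a parameter, rather than calling `new Date()` inside the helper, keeps the check deterministic and easy to test.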
Michael Peter Christen
a4214694df We assert that no other metadata storage than solr is used now.
Therefore a property like solrConnected() must be true all the time.
Removal of this method causes removal of all write operations to the old
metadata index.
2012-09-26 16:05:11 +02:00
Michael Peter Christen
abab291162 made the index schema retrieval public and allow cross-domain retrieval 2012-09-26 15:44:50 +02:00
Michael Peter Christen
0cec7e761a enhanced snippet extractor to also find snippets inside the tokens of a
URL
2012-09-26 15:33:37 +02:00
sixcooler
c65b576a6f added filename for missing crawlname when crawling from file 2012-09-26 14:05:33 +02:00
sixcooler
6c50d016ed pdf- and zipParser should not use forced Memory-Limits 2012-09-26 14:03:51 +02:00
Michael Peter Christen
562183932b - removed ip_s from default profile since that needs a DNS lookup to
create a document entry. This makes remote search much slower.
- removed synchronization of add method if ip_s is activated to prevent
a user configuration from causing bad behavior. The disadvantage of that
is that an index dump can cause data loss if indexing is running
during the index dump
- caught more exceptions and more NPEs
- better abstraction in MirrorSolrConnector
- slight performance enhancement when only the index count is requested
(rows=0 is sufficient to get a total count)
2012-09-26 13:38:04 +02:00
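The rows=0 remark in this commit refers to standard Solr behavior: a select request with `rows=0` returns no documents, but the response still reports the total hit count in `numFound`. A minimal sketch of building such a request URL follows. The class `CountQuery`, the method `countUrl`, and the base URL in the usage note are assumptions for illustration, not part of YaCy.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; it only builds the URL and does not contact a server.
final class CountQuery {

    // Builds a Solr select URL that asks for zero result rows.
    // The total count is then read from "numFound" in the response.
    static String countUrl(String solrBase, String query) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return solrBase + "/select?q=" + q + "&rows=0";
    }
}
```

For example, `CountQuery.countUrl("http://localhost:8983/solr", "*:*")` builds a match-all count request against an assumed local Solr instance.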
Michael Peter Christen
24f4ca4d85 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-26 12:01:34 +02:00
apfelmaennchen
7efe9eb37b adding CORS access header for Network.xml to overcome cross domain
restriction (e.g. necessary to build a JavaScript YaCy
client).
2012-09-26 10:36:09 +02:00
apfelmaennchen
116f429e35 fix for java.lang.RuntimeException: TableColumnIndex not available... 2012-09-26 09:56:16 +02:00
Michael Peter Christen
5ac61591f3 better abstraction for solr query params 2012-09-25 23:59:30 +02:00
Michael Peter Christen
c913b2ba77 - fix for NPEs during remote solr configuration
- fixed remote solr setting switch
- added more logging
2012-09-25 23:59:09 +02:00
Michael Peter Christen
b5192e03d7 fixed bad output in stopYACY.sh 2012-09-25 23:20:09 +02:00
Michael Peter Christen
882d54067a added dummy update servlet 2012-09-25 23:09:32 +02:00
Michael Peter Christen
1533bfd63b refactoring 2012-09-25 21:20:03 +02:00
Michael Peter Christen
e49359cc95 removed tenant query attribute since it is not used any more and is
replaced by the site-operator in the GSA interface. This operator can
also be simulated in the Solr interface using the collections_sxt field.
2012-09-25 21:09:06 +02:00
Michael Peter Christen
872f83ebe0 refactoring 2012-09-25 21:04:58 +02:00
Michael Peter Christen
fb9460f0a8 using the search filter to drill down search to file types.
A search like "mp3 filetype:mp3" will now maybe surprise you.
2012-09-25 17:52:33 +02:00
Michael Peter Christen
bc865ab816 more cleaning (yacy-cora) 2012-09-25 12:19:24 +02:00
Michael Peter Christen
640339ee21 added the indexrestore.sh script which must be called with the path of
the index dump. This is the reverse of indexdump.sh: it takes the
output of indexdump.sh as input to restore an index.
Now it should be possible to transfer a complete YaCy Solr index from
one peer yacy1 to another peer yacy2 with the following command:
yacy2/bin/indexrestore.sh `yacy1/bin/indexdump.sh`
2012-09-25 00:28:20 +02:00
Michael Peter Christen
15ea053c3a - added xml output in IndexControlURLs to get the storage path of index
dump commands
- adjusted the apicall.sh script to get the downloaded text as output to
stdout which is necessary to parse the content out of it
- added indexdump.sh script which creates a solr dump and prints out the
storage path for the index dump
- added synchronization to the Fulltext class to prevent that data is
stored to a non-existing solr index while this index is disabled during
the storage of the dump
2012-09-25 00:19:52 +02:00
Michael Peter Christen
1b474139dd used the new zip writer/reader to add a solr dump process: the whole
solr index can be written to a zip dump and also restored during runtime
2012-09-24 17:05:28 +02:00
Michael Peter Christen
4a3e684f8c added a directory-to-zip writer and zip-to-directory reader 2012-09-24 17:04:37 +02:00
Michael Peter Christen
d9ebf4a40f a bit more logging 2012-09-24 15:01:44 +02:00
Michael Peter Christen
5683162bd3 simplifications in DHT Distribution class and more documentation 2012-09-24 12:01:09 +02:00
Michael Peter Christen
e57bf2ca39 simplified DHT classes 2012-09-24 01:04:39 +02:00
orbiter
a053b356ee added new classes to renovate the YaCy protocol based on simple data
structures in cora:
- added the Peer object, which is a fresh version of Seed
- added the Peers object, which is a fresh version of Network
- added the Network api access class to retrieve a list of peers based
on the Network.xml servlet in all YaCy peers.
2012-09-22 11:10:11 +02:00
orbiter
14897d4bfc fixed mistake in wt-option which caused that the yacy json format
overlapped the solr built-in json format
2012-09-21 21:38:50 +02:00
Michael Peter Christen
8219a445f3 refactoring 2012-09-21 16:46:57 +02:00
Michael Peter Christen
f879a344e7 fix for no depth limit default value 2012-09-21 16:05:17 +02:00
Michael Peter Christen
fa7f6f0be8 added HostBrowser servlet (stub) 2012-09-21 15:48:40 +02:00
Michael Peter Christen
00c1c777fa refactoring 2012-09-21 15:48:16 +02:00
orbiter
563d584420 removed more dependencies in cora from kelondro 2012-09-21 11:02:36 +02:00
orbiter
aa65282259 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-21 10:27:30 +02:00
orbiter
63762d8f89 removed kelondro dependencies from cora 2012-09-20 19:38:22 +02:00
orbiter
39564fddbd more ignore 2012-09-20 18:45:51 +02:00
orbiter
6e0f4557f8 added ftp to getName 2012-09-20 18:29:04 +02:00
cominch
23204d2245 change parameter to support the smw extension for list import 2012-09-20 15:02:57 +02:00
Michael Peter Christen
c235d5c0f1 fixed size parsing in RSS message parser (for YaCy size parameter) 2012-09-19 06:36:07 +02:00
orbiter
089a03114e full memory usage for debian and when changing the size: debian seems to
dislike the big difference between xmx and xms (I have crashes here
which stop if both values are same)
2012-09-18 22:31:01 +02:00
Michael Peter Christen
5bc8f34150 fix for success query counter 2012-09-18 11:06:36 +02:00
orbiter
60b1e23f05 added new crawl options:
- indexUrlMustMatch and indexUrlMustNotMatch which can be used to select
loaded pages for indexing. Default patterns are in such a way that all
loaded pages are also indexed (as before) but when doing an expert crawl
start, then the user may select only specific urls to be indexed.
- crawlerNoDepthLimitMatch is a new pattern that can be used to remove
the crawl depth limitation. This filter is a never-match by default (so
the depth limit is applied) but the user can select paths which will
be loaded completely even if the crawl depth is reached.
2012-09-16 21:27:55 +02:00
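How the three patterns from this commit might interact can be sketched as follows. Only the pattern names come from the commit message. The class `CrawlFilters`, its methods, the `maxDepth` field, and the way the patterns are combined are assumptions, and `(?!)` is simply one way to write a regular expression that never matches.

```java
import java.util.regex.Pattern;

// Hypothetical model of the new crawl options; not YaCy's actual classes.
final class CrawlFilters {
    private final Pattern indexUrlMustMatch;
    private final Pattern indexUrlMustNotMatch;
    private final Pattern crawlerNoDepthLimitMatch;
    private final int maxDepth;

    CrawlFilters(String mustMatch, String mustNotMatch, String noDepthLimit, int maxDepth) {
        this.indexUrlMustMatch = Pattern.compile(mustMatch);
        this.indexUrlMustNotMatch = Pattern.compile(mustNotMatch);
        this.crawlerNoDepthLimitMatch = Pattern.compile(noDepthLimit);
        this.maxDepth = maxDepth;
    }

    // A loaded page is indexed only if it passes both index filters.
    boolean shouldIndex(String url) {
        return indexUrlMustMatch.matcher(url).matches()
            && !indexUrlMustNotMatch.matcher(url).matches();
    }

    // A URL may be loaded within the depth limit, or at any depth
    // if it matches the no-depth-limit pattern.
    boolean mayLoad(String url, int depth) {
        return depth <= maxDepth || crawlerNoDepthLimitMatch.matcher(url).matches();
    }
}
```

With `new CrawlFilters(".*", "(?!)", "(?!)", 2)` every loaded page is indexed and the depth limit always applies, which mirrors the default behavior the commit describes.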
orbiter
4987921d3d fixed the size() method which counted also failed pages (which are also
inside the solr index)
2012-09-16 21:22:56 +02:00
Michael Peter Christen
6ec02deec6 added new crawl attributes in crawl profile (not active yet) 2012-09-14 16:49:29 +02:00
Michael Peter Christen
a13e5153ac - added the possibility to have not one but a list of crawl start urls
- the list of urls is entered in the expert crawl start in a textfield;
the one-line input field was replaced with a text box
- start urls can also be given in one single line where the urls are
separated by a '|'-character
- as an effect, the crawl profile cannot carry a single start url for
identification because it is possible to have more. Therefore the url was
removed from the crawl profile
- this affects all servlets which display a crawl profile: removed the
url field from all these servlets
- to work consistently with several start urls and the other crawl
starts which computed crawl start url lists from sitelists or sitemaps,
the crawl start servlet was restructured completely
- new rules for must-match patterns were created to make it possible
that site crawl starts also work with several crawl starts at once
2012-09-14 12:25:46 +02:00
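The two input forms described in this commit (one url per line in the text box, or several urls in a single line separated by '|') could be normalized into one list roughly as follows. The class `StartUrls` and the method `parse` are hypothetical names, and treating line breaks and '|' as interchangeable separators is an assumption about the intended behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the crawl start input; not YaCy's actual code.
final class StartUrls {

    // Splits the input on line breaks and '|' and drops empty entries.
    static List<String> parse(String input) {
        List<String> urls = new ArrayList<>();
        for (String part : input.split("[\\r\\n|]+")) {
            String url = part.trim();
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }
}
```

This sketch does not validate the entries as urls. It only shows how both input forms lead to the same list of crawl starts.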
Michael Peter Christen
975bc95ddf added default facet fields for json response format (stub) 2012-09-14 12:09:20 +02:00
Michael Peter Christen
2f218df55d added missing license headers 2012-09-14 12:06:06 +02:00
Michael Peter Christen
a30653a864 added a regular expression test servlet which is linked within the
parser/crawler error page whenever a problem with a regular expression
occurs.
This makes it easy to correct and enhance the must-match and
must-not-match patterns just by trying out which pattern could be
correct.
2012-09-14 12:04:54 +02:00
Michael Peter Christen
0504b01bdc Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-14 00:48:17 +02:00
orbiter
9413f77b65 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2012-09-13 23:54:26 +02:00