Commit Graph

9864 Commits

Author SHA1 Message Date
reger
f7f86d8a5d update to Jetty 9 jars
- include javax.servlet 3.0
2013-09-14 20:49:05 +02:00
reger
bd71b14d25 add mandatory p2p parameter to templatePattern 2013-09-12 22:49:09 +02:00
reger
b8da176c5d adjust setHandled to request of call parameter 2013-09-12 22:04:10 +02:00
reger
127adbf5cf remove references to 10_http thread (legacy http server)
and add needed get/set function to jetty http server wrapper
2013-09-12 22:02:11 +02:00
reger
36b7159282 - remove double initialization of jetty
- refactor some var assignments
2013-09-11 02:24:47 +02:00
reger
8e52271491 - delete not needed old jetty jars from libt
- add jetty to Info.plist
2013-09-10 20:55:03 +02:00
reger
63ed04260a Merge remote-tracking branch 'origin/master' into jetty 2013-09-10 20:42:38 +02:00
reger
fe87fb638a adjust test/ParserTest to dc_description data type 2013-09-10 20:05:10 +02:00
Michael Peter Christen
35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
html meta fields to get a correct (or: better) date timestamp. The
http:last-modified mostly does not work because it is set to the current
date from most CMS.
2013-09-10 10:31:57 +02:00
reger
2ee68f76f6 added read parameter from multi-part form fields (to nasty quick-fix) 2013-09-10 01:42:08 +02:00
Michael Peter Christen
9cc8468b30 added tools to visualize image generation (i.e. during testing) 2013-09-09 12:58:26 +02:00
reger
105cf8f593 changes to adjust jetty to recent code changes 2013-09-09 02:37:29 +02:00
reger
aafef72a8a merged current rc1/master into jetty branch to allow further development with latest version
ServerSideIncludes and servlet return values need further work (for working jetty integration)
- TODO: added nasty quickfix to allow SSI -  needs further work
- TODO: YaCy servlet return values/parameters are not handled
2013-09-09 02:36:06 +02:00
Michael Peter Christen
dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that
appears in the crawl url list
2013-09-05 13:22:16 +02:00
Michael Peter Christen
e137ff4171 refactoring (im preparation for new removeHost method) 2013-09-05 09:59:41 +02:00
Michael Peter Christen
7a5574cd51 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-04 23:12:04 +02:00
Michael Peter Christen
85456f46b2 added two new fields, exact_signature_copycount_i and
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assigns this to each document. Thus, each
document there is a number assigned which shows how many copies of this
document exists.
These fields are disabled by default.
2013-09-04 23:11:53 +02:00
orbiter
26366596d9 fix for a problem which ocurres when a site is crawled where the start
url is redirected.
2013-09-04 16:00:47 +02:00
Michael Peter Christen
a2511b5600 turned images_alt_txt back to images_alt_sxt because it is not necessary
to index the alt text. Indexed image Text is in images_text_t
2013-09-04 10:47:18 +02:00
Michael Peter Christen
85b1922244 activated image type navigation for image search 2013-09-03 13:34:01 +02:00
Michael Peter Christen
9e12fdff23 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-03 12:22:57 +02:00
Michael Peter Christen
ab1201fdfd fixed wrong facet count 2013-09-03 12:22:29 +02:00
Michael Peter Christen
049c3b3f2e added an option to exclude image search results from text search. This
is on by default.
2013-09-03 11:14:23 +02:00
Michael Peter Christen
69f85265e1 added an option to put image links to the crawl queue and handle these
like normal documents. Using this option (by default on at this moment;
this might change soon) it is possible to get the exif data into the
search index to be used in image search.
2013-09-03 11:13:45 +02:00
Michael Peter Christen
e8e558a9b7 fix for content domain classification in URIMetadataNode 2013-09-03 10:49:09 +02:00
Michael Peter Christen
a8c5bfcf58 avoid to create unnecessary objects 2013-09-03 09:48:05 +02:00
Michael Peter Christen
5a0de1b77d moving image description text to image text field 2013-09-03 09:47:27 +02:00
Michael Peter Christen
dc179bd61f fix for catchall query goal for image search 2013-09-03 07:55:21 +02:00
Michael Peter Christen
5d71a4c8bc fix for dc:description field 2013-09-03 07:54:49 +02:00
reger
392174de8c remove all_words, all_strings lists from QueryGoal
- only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only
2013-09-02 23:09:43 +02:00
Michael Peter Christen
169ef8963d one more fix for image search 2013-09-02 20:02:26 +02:00
Michael Peter Christen
cb85b22725 redesign of the image search process (with much better results,
unfortunately the index schema has changed and p2p image search will not
be muchmuch better until many people update)
2013-09-02 18:55:38 +02:00
Michael Peter Christen
6184fd9d9a fix for solr/gsa result logging 2013-09-02 08:05:42 +02:00
reger
29967102a2 optimized QueryGoal (reducing mem and computation by removing all_hashes)
- all_hashes used for text highlighting and word distance computation which can be done with include_hashes only
2013-09-02 04:19:53 +02:00
orbiter
f106345eef link strings should not be tokenized 2013-09-01 14:35:36 +02:00
orbiter
deadeb406e image alt tag strings should be tokenized 2013-09-01 13:48:10 +02:00
orbiter
5b14bdfffd npe fix 2013-09-01 13:28:37 +02:00
orbiter
3e5f8e29e2 next development release step to reflect the extension of the solr api
with javabin format capability
2013-09-01 13:12:36 +02:00
orbiter
1ca4b9612c added special handling of the BinaryResponseWriter in the solr interface
which makes it possible to use solrj with the javabin format which is
much better (compressed, no xml overhead, java object streams) and
faster. Furthermore, this enables the 'shards' option in the solr
interface which connects one solr (YaCy) to another solr (YaCy) ad-hoc.
2013-09-01 13:11:40 +02:00
reger
d0e78082d1 return field names in index instead of in schema for SolrServerConnector.getFields 2013-08-31 06:25:12 +02:00
Michael Peter Christen
1a3e42eca4 index migration to lucene 4.4 2013-08-26 12:49:39 +02:00
Michael Peter Christen
a88a62f7aa added a feature to set a collection for a crawl result based on a
regular expression on th url: the collection attribut for a crawl start
may be now either a token or a list of tokens, seperated by ',' where a
token is either a string or a pair <string,pattern> where the string is
separated to the pattern with a ':' and the string is assigned to the
document as collection only if the pattern matches with the url.
2013-08-25 00:13:48 +02:00
Michael Peter Christen
3c5abedabf NPE during shutdown fix 2013-08-24 23:36:50 +02:00
Michael Peter Christen
e4cbe9232d fixed a crawler bug where a double-occurring url was not re-crawled
because the double-check error was written to the error-db and never
deleted. No the error-db is cleared on every start and these
double-messages are not written to the error-db any more.
2013-08-22 15:56:09 +02:00
Michael Peter Christen
765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
in intranets and the internet can now choose to appear as Googlebot.
This is an essential necessity to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no other search platform any more. All
commercial search engine providers have a built-in fake-Google User
Agent to be able to get the same search index as Google can do. Without
the resistance against obeying to robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on per-crawl-start basis. Every crawl start
can have a different user agent.
2013-08-22 14:23:47 +02:00
Michael Peter Christen
0f3d8890db removed an assert which causes a shortcut call circuit 2013-08-22 10:12:25 +02:00
Michael Peter Christen
6d5fefe060 added missing files :( 2013-08-20 16:31:34 +02:00
Michael Peter Christen
554c0351dd fix for http://bugs.yacy.net/view.php?id=286 2013-08-20 16:10:26 +02:00
Michael Peter Christen
47b1c81d08 - refactoring
- generalized writing of url attributes to solr documents
- added more url attributes to error documents
2013-08-20 15:46:04 +02:00
Michael Peter Christen
e6b423c4d9 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-08-19 22:02:41 +02:00