Commit Graph

2099 Commits

Author SHA1 Message Date
reger
561cbc7ee2 use more YaCy HeaderFramework constants (instead of Jetty's) 2013-09-22 04:23:42 +02:00
reger
5c4ba9b5db merge rc1 master 2013-09-22 02:21:24 +02:00
reger
70c51775ae Merge remote-tracking branch 'origin/master' into jetty 2013-09-22 02:09:02 +02:00
reger
4b77733e59 implement a YaCyDefaultServlet to handle YaCy-servlets within Jetty server
- the implementation is inspired by Jetty's DefaultServlet
- handles static html content and YaCy servlets
- translates between standard servlet request/response and YaCy request/response specification
Implementing YaCy-servlets as a servlet instead of via a Jetty handler brings the code closer to the servlet standard and carries fewer Jetty-specific dependencies.
2013-09-22 01:57:32 +02:00
orbiter
828603e4f1 fix for 100% CPU problem in error cache cleaning process 2013-09-21 10:20:13 +02:00
orbiter
c64b51134e hack to add all tokens from the url to text_t. This was working for the
RWI index (and still is working) but not for solr-only search indexes.
Maybe we should find a solution using a separate search field instead.
2013-09-21 08:57:43 +02:00
orbiter
6e8377b8ad do not check all words with synonym library if the library is empty 2013-09-21 08:56:24 +02:00
orbiter
70ba74b23a disabled ipv4 preference to enable ipv6-only networks like freifunk 2013-09-20 16:52:37 +02:00
orbiter
f3be1930cb CPU problem when pushing to the error cache; wrong class,
ConcurrentHashMap needed for concurrency
2013-09-20 16:51:50 +02:00
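A minimal sketch of the idea behind this fix, assuming the error cache is a map that several crawler threads write to at once. The commit does not name the class it replaced, and `ErrorCacheSketch`, `push` and `reasonFor` are hypothetical names, not YaCy's API. A plain non-thread-safe map can corrupt its internal structure under concurrent writes, whereas `ConcurrentHashMap` is designed for that access pattern.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an error cache; not YaCy's actual class.
final class ErrorCacheSketch {
    // ConcurrentHashMap tolerates concurrent put/get without external locking.
    private final Map<String, String> failReasons = new ConcurrentHashMap<>();

    // Record why loading the URL with the given hash failed.
    void push(String urlHash, String reason) {
        failReasons.put(urlHash, reason);
    }

    // Returns the stored reason, or null if the URL is not in the cache.
    String reasonFor(String urlHash) {
        return failReasons.get(urlHash);
    }

    int size() {
        return failReasons.size();
    }
}
```

Any number of threads may call `push` on the same instance without additional synchronization.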
Michael Peter Christen
e40671ddb7 better and consistent deletions for error urls 2013-09-17 15:52:57 +02:00
Michael Peter Christen
2602be8d1e - removed ZURL data structure; removed also the ZURL data file
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: added must-match pattern
application to feed urls to filter out such urls which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs, are now temporary objects in
memory
2013-09-17 15:27:02 +02:00
Michael Peter Christen
31920385f7 set anchor rel attribute of all links to "nofollow" if the html meta
contains a robots:nofollow or if the http header contains a
"X-Robots-Tag: nofollow"
2013-09-16 16:14:56 +02:00
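The rule in this commit can be sketched as follows, assuming the robots meta content and the `X-Robots-Tag` header value are available as plain comma-separated strings. `NoFollowSketch` and its method names are hypothetical and do not reflect YaCy's parser classes.

```java
import java.util.Locale;

// Hypothetical helper; illustrates the decision only, not YaCy's parser code.
final class NoFollowSketch {
    // True if a comma-separated directive list contains "nofollow" (case-insensitive).
    static boolean hasNoFollow(String directives) {
        if (directives == null) return false;
        for (String token : directives.toLowerCase(Locale.ROOT).split(",")) {
            if (token.trim().equals("nofollow")) return true;
        }
        return false;
    }

    // All links of a page are treated as rel="nofollow" if either source says so.
    static boolean pageIsNoFollow(String metaRobotsContent, String xRobotsTagHeader) {
        return hasNoFollow(metaRobotsContent) || hasNoFollow(xRobotsTagHeader);
    }
}
```

A caller would pass `null` for whichever of the two values is absent from the page or the response.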
reger
9619b8743c add Solr Servlet 2013-09-16 03:01:18 +02:00
Michael Peter Christen
57e00baf26 fix for parsing of image links inside of anchor links (image-links) 2013-09-15 23:54:46 +02:00
Michael Peter Christen
61c5e40687 - replaced the properties object in AnchorURL with distinct variables
for anchor attributes.
- as a result, large portions of the parser code had to be adapted
as well
- added a counter target_order_i for anchor links in webgraph
computation
2013-09-15 23:27:04 +02:00
Michael Peter Christen
3ea9bb4427 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-15 00:30:41 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
only the unique links! This made it necessary to adapt a large portion of
the parser and link processing classes to carry a different type of link
collection, whose entries carry the property attributes which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI, have been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
reger
13fc86c960 Merge remote-tracking branch 'origin/master' into jetty 2013-09-14 21:10:24 +02:00
reger
f7f86d8a5d update to Jetty 9 jars
- include javax.servlet 3.0
2013-09-14 20:49:05 +02:00
reger
603368fc3e remove redundant declaration of USER_AGENT 2013-09-14 18:29:44 +02:00
reger
bd71b14d25 add mandatory p2p parameter to templatePattern 2013-09-12 22:49:09 +02:00
reger
b8da176c5d adjust setHandled to request of call parameter 2013-09-12 22:04:10 +02:00
reger
127adbf5cf remove references to 10_http thread (legacy http server)
and add needed get/set function to jetty http server wrapper
2013-09-12 22:02:11 +02:00
Michael Peter Christen
1a8c64117f decreased the responseHeaderDB database which is now flushed more
frequently. This will preserve more documents in the cache in case of a
crash.
2013-09-11 13:03:58 +02:00
reger
36b7159282 - remove double initialization of jetty
- refactor some var assignments
2013-09-11 02:24:47 +02:00
reger
63ed04260a Merge remote-tracking branch 'origin/master' into jetty 2013-09-10 20:42:38 +02:00
Michael Peter Christen
35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
html meta fields to get a correct (or: better) date timestamp. The
http:last-modified mostly does not work because most CMS set it to the
current date.
2013-09-10 10:31:57 +02:00
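A rough sketch of such a lookup, assuming the meta fields arrive as a map with lower-case keys and that values use the ISO `yyyy-MM-dd` form. Both are assumptions, as is the priority order of the fields. `MetaDateSketch` is a hypothetical name, and real pages use many more date formats than this handles.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; field names come from the commit message, the rest is assumed.
final class MetaDateSketch {
    private static final String[] DATE_FIELDS = {"date", "dc:date", "dc.date", "last-modified"};

    // Returns the first meta date that parses as ISO yyyy-MM-dd, if any.
    static Optional<LocalDate> fromMeta(Map<String, String> meta) {
        for (String field : DATE_FIELDS) {
            String value = meta.get(field);
            if (value == null) continue;
            try {
                return Optional.of(LocalDate.parse(value.trim()));
            } catch (DateTimeParseException e) {
                // Value is not in the assumed format; try the next field.
            }
        }
        return Optional.empty();
    }
}
```

When the result is empty, a caller could still fall back to the HTTP `Last-Modified` header.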
reger
2ee68f76f6 added reading of parameters from multi-part form fields (to the nasty quick-fix) 2013-09-10 01:42:08 +02:00
Michael Peter Christen
9cc8468b30 added tools to visualize image generation (i.e. during testing) 2013-09-09 12:58:26 +02:00
reger
105cf8f593 changes to adjust jetty to recent code changes 2013-09-09 02:37:29 +02:00
reger
aafef72a8a merged current rc1/master into jetty branch to allow further development with latest version
ServerSideIncludes and servlet return values need further work (for working jetty integration)
- TODO: added nasty quickfix to allow SSI -  needs further work
- TODO: YaCy servlet return values/parameters are not handled
2013-09-09 02:36:06 +02:00
Michael Peter Christen
dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that
appears in the crawl url list
2013-09-05 13:22:16 +02:00
Michael Peter Christen
e137ff4171 refactoring (in preparation for new removeHost method) 2013-09-05 09:59:41 +02:00
Michael Peter Christen
7a5574cd51 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-04 23:12:04 +02:00
Michael Peter Christen
85456f46b2 added two new fields, exact_signature_copycount_i and
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assign this to each document. Thus, each
document is assigned a number which shows how many copies of this
document exist.
These fields are disabled by default.
2013-09-04 23:11:53 +02:00
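One way to picture the copy count, assuming each document has a numeric signature and that "copies" means the other documents sharing that signature. That reading is an assumption: the commit does not say whether the document itself is counted. `CopyCountSketch` is a hypothetical name, and the sketch works on an in-memory map rather than on a Solr index.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; shows the counting idea only.
final class CopyCountSketch {
    // Maps each document id to the number of OTHER documents with the same signature.
    static Map<String, Integer> copyCounts(Map<String, Long> signatureByDocId) {
        // First pass: how many documents share each signature.
        Map<Long, Integer> docsPerSignature = new HashMap<>();
        for (Long signature : signatureByDocId.values()) {
            docsPerSignature.merge(signature, 1, Integer::sum);
        }
        // Second pass: assign the count (minus the document itself) to each document.
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Long> e : signatureByDocId.entrySet()) {
            result.put(e.getKey(), docsPerSignature.get(e.getValue()) - 1);
        }
        return result;
    }
}
```

The same two-pass scheme would apply to both the exact and the fuzzy signature, each producing its own count.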
orbiter
26366596d9 fix for a problem which occurs when a site is crawled where the start
url is redirected.
2013-09-04 16:00:47 +02:00
Michael Peter Christen
a2511b5600 turned images_alt_txt back to images_alt_sxt because it is not necessary
to index the alt text. Indexed image text is in images_text_t
2013-09-04 10:47:18 +02:00
Michael Peter Christen
85b1922244 activated image type navigation for image search 2013-09-03 13:34:01 +02:00
Michael Peter Christen
9e12fdff23 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-03 12:22:57 +02:00
Michael Peter Christen
ab1201fdfd fixed wrong facet count 2013-09-03 12:22:29 +02:00
Michael Peter Christen
049c3b3f2e added an option to exclude image search results from text search. This
is on by default.
2013-09-03 11:14:23 +02:00
Michael Peter Christen
69f85265e1 added an option to put image links to the crawl queue and handle these
like normal documents. Using this option (by default on at this moment;
this might change soon) it is possible to get the exif data into the
search index to be used in image search.
2013-09-03 11:13:45 +02:00
Michael Peter Christen
e8e558a9b7 fix for content domain classification in URIMetadataNode 2013-09-03 10:49:09 +02:00
Michael Peter Christen
a8c5bfcf58 avoid creating unnecessary objects 2013-09-03 09:48:05 +02:00
Michael Peter Christen
5a0de1b77d moving image description text to image text field 2013-09-03 09:47:27 +02:00
Michael Peter Christen
dc179bd61f fix for catchall query goal for image search 2013-09-03 07:55:21 +02:00
reger
392174de8c remove all_words, all_strings lists from QueryGoal
- only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only
2013-09-02 23:09:43 +02:00
Michael Peter Christen
169ef8963d one more fix for image search 2013-09-02 20:02:26 +02:00
Michael Peter Christen
cb85b22725 redesign of the image search process (with much better results;
unfortunately the index schema has changed and p2p image search will not
be much better until many people update)
2013-09-02 18:55:38 +02:00
reger
29967102a2 optimized QueryGoal (reducing mem and computation by removing all_hashes)
- all_hashes used for text highlighting and word distance computation which can be done with include_hashes only
2013-09-02 04:19:53 +02:00