Commit Graph

2259 Commits

Author SHA1 Message Date
reger
a44eede8b8 merge rc1/master 2013-10-11 01:50:25 +02:00
sixcooler
d9a02ed277 NPE fix for my last commit 2013-10-11 00:44:04 +02:00
reger
54a0272338 searchpage javascript (latestinfo) causes reset of search statistic after moving to next page
- disabled call via setTimeout in yacysearch.html
2013-10-10 23:23:58 +02:00
sixcooler
61f627eb85 fix for ssl-connections from proxy-usage staying in close-wait-state
+ some extra 'close' in HttpClient
2013-10-10 20:57:37 +02:00
Michael Peter Christen
d328cc4a83 fix for didyoumean, added also more asian alphabets 2013-10-09 16:17:50 +02:00
Michael Peter Christen
90c8577840 enhanced ranking; patches to replace old ranking 2013-10-09 15:10:03 +02:00
reger
e74f548551 make legacy http server (serverCore) implement YaCyHttpServer interface 2013-10-09 01:07:22 +02:00
reger
71d2655c02 downgrade to Jetty 8 to assure support of JRE 1.6
- introduce a YaCyHttp interface to modularize/separate http server
- adjust the Jetty version specific implementation part (in package net.yacy.http)
     - putting the version specific code in classes starting with Jetty8xxxx
     - moved existing Jetty9xxx implementation into a test class (to keep the code)
- adjust build to the changed jars
- make use of the introduced YaCyHttpServer interface in related htroot servlets

- adjust other test cases/classes
2013-10-09 00:40:48 +02:00
Michael Peter Christen
1b61bd40ed - Added new solr field url_file_name_tokens_t which stores the file name
tokens. This can be used to enhance the ranking.
- Added also a rating_i field as basis for later usage.
- enhanced the tokenization process.
2013-10-08 23:48:13 +02:00
orbiter
6efa7532d2 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-10-08 19:04:57 +02:00
orbiter
5f5a97bafc added the anchor text within web pages to the searcheable entities of a
web page. This can be of benefit for the ranking if these fields are
used for boosts.
2013-10-08 18:41:07 +02:00
orbiter
705b3338ee list more fields available for search and for ranking boosts 2013-10-08 18:15:35 +02:00
sixcooler
d536092fe4 fix for wrongly filling the NAME_CACHE_MISS-DNS-Cache in case of a timeout,
e.g. caused by massive requests when crawling from a file
2013-10-08 18:02:42 +02:00
Michael Peter Christen
78e7aadb26 removed unused initialization method 2013-10-07 23:51:28 +02:00
Michael Peter Christen
4fbc4740df removed warnings 2013-10-07 23:41:50 +02:00
Michael Peter Christen
21aa6a0321 migration to Solr 4.5.0 2013-10-07 17:09:40 +02:00
Michael Peter Christen
ef31d0f279 fix for rss reader, see http://bugs.yacy.net/view.php?id=294 2013-10-07 12:59:54 +02:00
Michael Peter Christen
101a6e6e14 Patch the citation index for links with canonical tags.
This shall fulfill the following requirement:
If a document A links to B and B contains a 'canonical C', then the
citation rank computation shall consider that A links to C and B does
not link to C.
To do so, we first must collect all canonical links, find all references
to them, get the anchor list of the documents and patch the citation
reference of these links.
2013-10-07 11:15:58 +02:00
reger
daebeb93aa add call to AccessTracker to jetty security handler 2013-10-04 01:16:17 +02:00
reger
172aefaeeb adjust YaCySecurityHandler to Jetty 9 conventions
- mainly adjust prepareConstraintInfo to use the RoleInfo.setChecked as in Jetty Source distribution
- use constraint check behavior as in ConstraintSecurityHandler
  see http://git.eclipse.org/c/jetty/org.eclipse.jetty.project.git/tree/jetty-security/src/main/java/org/eclipse/jetty/security/ConstraintSecurityHandler.java?id=jetty-9.0.5.v20130813
2013-10-03 19:38:03 +02:00
reger
6f9ed439d3 - expand localHostName check of AbstractRemoteHandler
to prevent the request from being handled as a proxy request
- make domain handler not rely on included path in resolved .yacy address
2013-10-01 03:04:32 +02:00
reger
561ea135af fix : forgot adding security handler 2013-09-30 04:35:17 +02:00
reger
c7c706fd9f merge with rc1/master 2013-09-30 03:46:39 +02:00
reger
272b196d05 update Jetty server init() to activate yacy-domain and transparent proxy handler
- adding domain & proxy handler to a context (as it was in initial design)
     (context required for dispatcher)
- make handler context and servlet context parallel available 
     (to allow use of YaCyDefaultServlet to handle legacyServlets)
- set transparent proxy request handled after dispatch.forward to skip further handling for .yacy domain requests
2013-09-30 03:12:52 +02:00
reger
fd119deb00 fix NPE on modified since check ( Response.requestHeader allowed to be null) 2013-09-30 02:50:53 +02:00
reger
66145a0410 - add welcome file (index.html) support to YaCyDefaultServlet
- change SolrServlet default search field (&df) to text_t
2013-09-29 03:34:00 +02:00
Michael Peter Christen
b28d43decc added two more fields source_cr_host_norm_i,target_cr_host_norm_i in
webgraph and an addition to postprocessing to copy all cr ranking
attributes to the link edges associated to the postprocessing documents
2013-09-27 16:57:05 +02:00
Michael Peter Christen
a52f3a597e fix for canonical-from-http-header feature 2013-09-27 15:09:04 +02:00
Michael Peter Christen
2dd7c5be44 added parsing of http-canonical tags (untested, could not find an
example page)
2013-09-27 13:17:50 +02:00
Michael Peter Christen
4476dea5ba do not fail if a wrong boost key is used; instead, print only a warning
See also: http://bugs.yacy.net/view.php?id=293
2013-09-27 12:28:09 +02:00
reger
ab9583d429 add default field (&df) to SolrServlet query if missing 2013-09-26 22:20:35 +02:00
Michael Peter Christen
3bf0104199 fix for crawl domain counter limitation (limit was reached too early) 2013-09-26 13:41:52 +02:00
Michael Peter Christen
82bfd9e00a - crawl profiles shall be deleted from active and passive stacks if they
are deleted to terminate the crawl because otherwise the crawl will go
on after the load-from-passive stack policy.
- better check if a crawl is terminated using the loader queue.
2013-09-26 10:22:31 +02:00
Michael Peter Christen
1b3d26dd23 hack to remove most of the warning: deprecated messages (but not all,
one is left)
2013-09-25 21:14:52 +02:00
Michael Peter Christen
a496313248 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-25 20:41:02 +02:00
sixcooler
3c48fc65fd reverted RemoteInstance to deprecated methods of httpClient-4.2
this should work with current remote-Solr-Instances
2013-09-25 18:45:16 +02:00
Michael Peter Christen
91a875dff5 self-healing of mistakenly deactivated crawl profiles. This fixes a bug
which can happen in rare cases when a crawl start and a cleanup process
happen at the same time.
2013-09-25 18:27:54 +02:00
Michael Peter Christen
095053a9b4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-25 17:32:52 +02:00
sixcooler
0cae420d8e some dns-timing changes:
since httpclient uses the domain-cache it is useful not to clean the
domain cache until crawling is running (domains are filled into this
cache)
On huge crawl-starts (eg. from file) my DNS did not follow the high
rates - so I reduced the rate and give some more time(-out)
2013-09-25 15:01:28 +02:00
sixcooler
15b1bb2513 bump to httpClient-4.3 2013-09-25 14:48:37 +02:00
Michael Peter Christen
4f83d5f18c added the new field harvestkey_s to the collection index and the
webgraph index which is temporarily filled with the crawl profile key.
This is used to select a set of documents for post-processing as soon as
a crawl is finished. Now the postprocessing for a specific crawl is
started when that specific crawl is finished and not at the end of all
post-processing steps.
2013-09-25 14:38:24 +02:00
orbiter
14442efa6d when profiles are cleaned, there shall be first a callback showing which
profiles are cleaned. This shall enable a profile-termination-driven
postprocessing. To do this, index writings must carry the profile key
which will be implemented in another (next) step.
2013-09-25 11:04:12 +02:00
orbiter
0013d0d0bb removed superfluous class 2013-09-24 21:18:37 +02:00
orbiter
f90d5296cb Added new data structure to be used by the balancer (not used yet).
These data structures will enable the balancer to store the crawl queue
into individual queues, one each for a single host.
2013-09-24 21:08:40 +02:00
orbiter
0e8d752462 refactoring 2013-09-24 19:55:59 +02:00
orbiter
8ac2e8c8c9 added location navigator which causes that the image to the map search
is visible whenever a location is available in the search result.
To activate this, the search.navigation property in yacy.conf must be
modified to the new default values.
2013-09-24 11:26:51 +02:00
orbiter
d86d2be5c3 automatically removed Places autotagging if no location library is
wanted
2013-09-24 11:23:45 +02:00
orbiter
214a087cdf Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-23 20:59:03 +02:00
Michael Peter Christen
96ed0c980e - added hosthash to all documents (also fail documents which is needed
there for deletion), this fixes a problem for the deletion of old
documents for new crawl starts
- added clickdepth and citation computation for fail documents
2013-09-23 18:09:42 +02:00
Michael Peter Christen
179ad281f9 close include byte buffer after usage 2013-09-23 12:19:51 +02:00
reger
52dd491c04 fix not necessary use of DigestURL 2013-09-23 03:05:09 +02:00
reger
6b9a624808 remove double declaration of TLD_any_zone_filter 2013-09-23 03:01:08 +02:00
reger
5111841e5b - reduce Jetty debug logging
- fix Context path initialization
2013-09-23 01:30:45 +02:00
reger
bc6ebb3c06 adjust to DigestURI changes from master to DigestURL 2013-09-22 20:57:50 +02:00
reger
561cbc7ee2 use more YaCy HeaderFramework constants (instead of Jetty's) 2013-09-22 04:23:42 +02:00
reger
5c4ba9b5db merge rc1 master 2013-09-22 02:21:24 +02:00
reger
70c51775ae Merge remote-tracking branch 'origin/master' into jetty 2013-09-22 02:09:02 +02:00
reger
4b77733e59 implement a YaCyDefaultServlet to handle YaCy-servlets within Jetty server
- the implementation is inspired by Jetty's DefaultServlet
- handles static html content and YaCy servlets
- translates between standard servlet request/response and YaCy request/response specification
With the implementation of YaCy-servlets as servlet instead via a jetty handler it's closer to servlet standard and carries less jetty specific dependencies.
2013-09-22 01:57:32 +02:00
orbiter
d2effd21db fix for npe during location search 2013-09-21 21:03:58 +02:00
orbiter
828603e4f1 fix for 100%CPU problem in error cache cleaning process 2013-09-21 10:20:13 +02:00
orbiter
c64b51134e hack to add all tokens from the url to text_t. This was working for the
RWI index (and still is working) but not for solr-only search indexes.
Maybe we should find a solution using a separate search field instead.
2013-09-21 08:57:43 +02:00
orbiter
6e8377b8ad do not check all words with synonym library if the library is empty 2013-09-21 08:56:24 +02:00
orbiter
70ba74b23a disabled ipv4 preference to enable ipv6-only networks like freifunk 2013-09-20 16:52:37 +02:00
orbiter
f3be1930cb CPU problem when pushing to the error cache; wrong class,
ConcurrentHashMap needed for concurrency
2013-09-20 16:51:50 +02:00
Michael Peter Christen
e40671ddb7 better and consistent deletions for error urls 2013-09-17 15:52:57 +02:00
Michael Peter Christen
2602be8d1e - removed ZURL data structure; removed also the ZURL data file
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: added must-match pattern
application to feed urls to filter out such urls which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs are now temporary objects in
memory
2013-09-17 15:27:02 +02:00
Michael Peter Christen
31920385f7 set anchor rel attribute of all links to "nofollow" if the html meta
contains a robots:nofollow or if the http header contains a
"X-Robots-Tag: nofollow"
2013-09-16 16:14:56 +02:00
reger
9619b8743c add Solr Servlet 2013-09-16 03:01:18 +02:00
Michael Peter Christen
57e00baf26 fix for parsing of image links inside of anchor links (image-links) 2013-09-15 23:54:46 +02:00
Michael Peter Christen
61c5e40687 - replaced the properties object in AnchorURL with distinct variables
for anchor attributes.
- as a result, large portions of the parser code had to be adapted
as well
- added a counter target_order_i for anchor links in webgraph
computation
2013-09-15 23:27:04 +02:00
Michael Peter Christen
3ea9bb4427 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-15 00:30:41 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
only the unique links! This made it necessary to adapt a large portion of
the parser and link processing classes to carry a different type of link
collection, which carries property attributes that are attached to web
anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
reger
13fc86c960 Merge remote-tracking branch 'origin/master' into jetty 2013-09-14 21:10:24 +02:00
reger
f7f86d8a5d update to Jetty 9 jars
- include javax.servlet 3.0
2013-09-14 20:49:05 +02:00
reger
603368fc3e remove redundant declaration of USER_AGENT 2013-09-14 18:29:44 +02:00
reger
bd71b14d25 add mandatory p2p parameter to templatePattern 2013-09-12 22:49:09 +02:00
reger
b8da176c5d adjust setHandled to request of call parameter 2013-09-12 22:04:10 +02:00
reger
127adbf5cf remove references to 10_http thread (legacy http server)
and add needed get/set function to jetty http server wrapper
2013-09-12 22:02:11 +02:00
Michael Peter Christen
1a8c64117f decreased the responseHeaderDB database which is now flushed more
frequently. This will preserve more documents in the cache in case of a
crash.
2013-09-11 13:03:58 +02:00
reger
36b7159282 - remove double initialization of jetty
- refactor some var assignments
2013-09-11 02:24:47 +02:00
reger
63ed04260a Merge remote-tracking branch 'origin/master' into jetty 2013-09-10 20:42:38 +02:00
Michael Peter Christen
35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
html meta fields to get a correct (or: better) date timestamp. The
http:last-modified mostly does not work because it is set to the current
date from most CMS.
2013-09-10 10:31:57 +02:00
reger
2ee68f76f6 added read parameter from multi-part form fields (to nasty quick-fix) 2013-09-10 01:42:08 +02:00
Michael Peter Christen
9cc8468b30 added tools to visualize image generation (i.e. during testing) 2013-09-09 12:58:26 +02:00
reger
105cf8f593 changes to adjust jetty to recent code changes 2013-09-09 02:37:29 +02:00
reger
aafef72a8a merged current rc1/master into jetty branch to allow further development with latest version
ServerSideIncludes and servlet return values need further work (for working jetty integration)
- TODO: added nasty quickfix to allow SSI -  needs further work
- TODO: YaCy servlet return values/parameters are not handled
2013-09-09 02:36:06 +02:00
Michael Peter Christen
dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that
appears in the crawl url list
2013-09-05 13:22:16 +02:00
Michael Peter Christen
e137ff4171 refactoring (in preparation for new removeHost method) 2013-09-05 09:59:41 +02:00
Michael Peter Christen
7a5574cd51 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-04 23:12:04 +02:00
Michael Peter Christen
85456f46b2 added two new fields, exact_signature_copycount_i and
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assigns this to each document. Thus, each
document is assigned a number which shows how many copies of this
document exist.
These fields are disabled by default.
2013-09-04 23:11:53 +02:00
orbiter
26366596d9 fix for a problem which occurs when a site is crawled where the start
url is redirected.
2013-09-04 16:00:47 +02:00
Michael Peter Christen
a2511b5600 turned images_alt_txt back to images_alt_sxt because it is not necessary
to index the alt text. Indexed image Text is in images_text_t
2013-09-04 10:47:18 +02:00
Michael Peter Christen
85b1922244 activated image type navigation for image search 2013-09-03 13:34:01 +02:00
Michael Peter Christen
9e12fdff23 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-09-03 12:22:57 +02:00
Michael Peter Christen
ab1201fdfd fixed wrong facet count 2013-09-03 12:22:29 +02:00
Michael Peter Christen
049c3b3f2e added an option to exclude image search results from text search. This
is on by default.
2013-09-03 11:14:23 +02:00
Michael Peter Christen
69f85265e1 added an option to put image links to the crawl queue and handle these
like normal documents. Using this option (by default on at this moment;
this might change soon) it is possible to get the exif data into the
search index to be used in image search.
2013-09-03 11:13:45 +02:00
Michael Peter Christen
e8e558a9b7 fix for content domain classification in URIMetadataNode 2013-09-03 10:49:09 +02:00
Michael Peter Christen
a8c5bfcf58 avoid to create unnecessary objects 2013-09-03 09:48:05 +02:00
Michael Peter Christen
5a0de1b77d moving image description text to image text field 2013-09-03 09:47:27 +02:00
Michael Peter Christen
dc179bd61f fix for catchall query goal for image search 2013-09-03 07:55:21 +02:00
reger
392174de8c remove all_words, all_strings lists from QueryGoal
- only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only
2013-09-02 23:09:43 +02:00
Michael Peter Christen
169ef8963d one more fix for image search 2013-09-02 20:02:26 +02:00
Michael Peter Christen
cb85b22725 redesign of the image search process (with much better results,
unfortunately the index schema has changed and p2p image search will not
be much better until many people update)
2013-09-02 18:55:38 +02:00
reger
29967102a2 optimized QueryGoal (reducing mem and computation by removing all_hashes)
- all_hashes used for text highlighting and word distance computation which can be done with include_hashes only
2013-09-02 04:19:53 +02:00
orbiter
f106345eef link strings should not be tokenized 2013-09-01 14:35:36 +02:00
orbiter
deadeb406e image alt tag strings should be tokenized 2013-09-01 13:48:10 +02:00
reger
d0e78082d1 return field names in index instead of in schema for SolrServerConnector.getFields 2013-08-31 06:25:12 +02:00
Michael Peter Christen
1a3e42eca4 index migration to lucene 4.4 2013-08-26 12:49:39 +02:00
Michael Peter Christen
a88a62f7aa added a feature to set a collection for a crawl result based on a
regular expression on the url: the collection attribute for a crawl start
may now be either a token or a list of tokens, separated by ',' where a
token is either a string or a pair <string,pattern> where the string is
separated from the pattern by a ':' and the string is assigned to the
document as collection only if the pattern matches the url.
2013-08-25 00:13:48 +02:00
Michael Peter Christen
3c5abedabf NPE during shutdown fix 2013-08-24 23:36:50 +02:00
Michael Peter Christen
e4cbe9232d fixed a crawler bug where a double-occurring url was not re-crawled
because the double-check error was written to the error-db and never
deleted. Now the error-db is cleared on every start and these
double-messages are not written to the error-db any more.
2013-08-22 15:56:09 +02:00
Michael Peter Christen
765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user
in intranets and the internet can now choose to appear as Googlebot.
This is an essential necessity to be able to compete in the field of
commercial search appliances, since most web pages are these days
optimized only for Google and no other search platform any more. All
commercial search engine providers have a built-in fake-Google User
Agent to be able to get the same search index as Google can do. Without
the resistance against obeying to robots.txt in this case, no
competition is possible any more. YaCy will always obey the robots.txt
when it is used for crawling the web in a peer-to-peer network, but to
establish a Search Appliance (like a Google Search Appliance, GSA) it is
necessary to be able to behave exactly like a Google crawler.
With this change, you will be able to switch the user agent when portal
or intranet mode is selected on per-crawl-start basis. Every crawl start
can have a different user agent.
2013-08-22 14:23:47 +02:00
Michael Peter Christen
0f3d8890db removed an assert which causes a shortcut call circuit 2013-08-22 10:12:25 +02:00
Michael Peter Christen
6d5fefe060 added missing files :( 2013-08-20 16:31:34 +02:00
Michael Peter Christen
554c0351dd fix for http://bugs.yacy.net/view.php?id=286 2013-08-20 16:10:26 +02:00
Michael Peter Christen
47b1c81d08 - refactoring
- generalized writing of url attributes to solr documents
- added more url attributes to error documents
2013-08-20 15:46:04 +02:00
Michael Peter Christen
1c62fa7698 fix for bad snippets in gsa api 2013-08-18 10:37:25 +02:00
Michael Peter Christen
697613170d less logging for postprocessing (this was a debugging logging with high
CPU load)
2013-08-17 09:25:32 +02:00
reger
b4016ff324 - remove possible double initialization of rdfa parser
- use ordered list to use preferred parser for mime/extension first (relates to html, rdfa, argument parser)
- harmonize xhtml extension config for the 3 html base parsers
2013-08-14 21:12:10 +02:00
reger
f0575bd44b FieldReIndex: omit active vocabulary fields from reindex detection 2013-08-14 00:00:30 +02:00
reger
a5019bc470 make Vocabulary Navigator tags a hard result entry filter
by checking vocabulary tags also for rwi results (currently a filter is applied to the solr query)

TODO: as vocabularies are only locally valid, auto-switch to Searchdom.LOCAL could be considered.
2013-08-13 03:07:25 +02:00
reger
a67a4b7d86 improve tld: query modifier filter pattern (to prevent tld:net accepting www.abcinet.org) 2013-08-12 21:20:23 +02:00
reger
02fe8b43ba Field Re-Indexing: display list of fields in reindex queue
change servlet to display statistic on 1st click (instead after refresh)
2013-08-11 04:51:29 +02:00
sixcooler
7f501b7c38 clear some caches before reporting low Memory
do not break lines in Network-table-rows
2013-08-08 14:38:26 +02:00
reger
b355dd52c6 Index Administration - Field Re-Indexing: exclude internal Solr _version_ field from obsolete field check 2013-08-08 00:55:21 +02:00
sixcooler
8a96140f92 fix / workaround for
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4750
+ Seed.hash should be final
2013-08-01 16:40:58 +02:00
Michael Peter Christen
2857499467 fix to collection schema; bug appeared for _txt fields with empty String
as content
2013-07-31 13:32:05 +02:00
Michael Peter Christen
dbfa865700 added a stub of a class for crawler redesign 2013-07-31 13:16:32 +02:00
Michael Peter Christen
76afcccaaf fix for default boolean post values: the default value MUST NOT be TRUE,
because it's normal that a boolean value is missing in the post argument
if a checkbox is not selected.
Added also some style enhancements to IndexFederated, removed the Solr
attachment manual and replaced it with a link to the wiki which explains
this in more detail.
2013-07-31 10:49:26 +02:00
orbiter
252c525709 fixed feed api servlet and enhanced RSSReader class 2013-07-31 06:18:30 +02:00
orbiter
d38c3c14d8 fix for CGI test 2013-07-31 05:43:58 +02:00
Michael Peter Christen
31902f54df fix for NPE which happens within solr code at MultiMapSolrParams.java,
line 52 in case that the array arr.length == 0
2013-07-30 14:32:59 +02:00
Michael Peter Christen
f13df9dbb6 migration to solr 4.4.0 2013-07-30 14:01:16 +02:00
Michael Peter Christen
58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2013-07-30 12:49:14 +02:00
Michael Peter Christen
cf12835f20 replaced the single-text description solr field with a multi-value
description_txt text field
2013-07-30 12:48:57 +02:00
sixcooler
7d53ac86a3 fix for Blacklist (-Administration) 2013-07-29 19:09:28 +02:00
reger
f2d99053ed Field Re-Indexing: prevent endless error loop in ReindexSolrBusyThread on Solr exception (by skipping query causing the exception)
(occurred during testing while working on q=store:[* TO *])
2013-07-29 01:32:02 +02:00
reger
92d3f71b16 htmlParser: closes input stream -> changed it to leave it open for a reset (used by AugmentParser - even if this is practically not used),
note: stream.close is done by caller (Textparser.parseSource)
- removed unnecessary reset in AugmentParser
- added stream.mark in tdfatripleimpl. to make stream.reset work here
2013-07-28 03:41:09 +02:00
orbiter
87cfeaa4f3 fix for npe 2013-07-27 15:20:09 +02:00
orbiter
268a36aaff emergency fix for crawler: this will otherwise cause loss of complete
crawl queue if latency of remote system is too low
2013-07-27 11:59:07 +02:00
orbiter
d05e0c5368 wait a bit longer before doing the first peer ping 2013-07-27 11:00:35 +02:00
orbiter
b8f57f7703 don't be noisy when doing background tasks that may be allowed to fail 2013-07-27 10:51:58 +02:00
Roland Haeder
0343f0668c Fix for NPE:
E 2013/07/26 20:29:29 BUSYTHREAD Runtime Error in
serverInstantThread.job, thread
'net.yacy.search.Switchboard.cleanupJob': null; target exception: null
java.lang.NullPointerException
        at net.yacy.search.schema.CollectionConfiguration.convergenceStep(CollectionConfiguration.java:1116)
        at net.yacy.search.schema.CollectionConfiguration.postprocessing(CollectionConfiguration.java:897)
        at net.yacy.search.Switchboard.cleanupJob(Switchboard.java:2296)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:107)
        at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:165)

Conflicts:
	source/net/yacy/search/schema/CollectionConfiguration.java
2013-07-27 10:19:46 +02:00
Roland Haeder
b58ca8622d Some cleanups:
- added SKINS_PATH_DEFAULT in the same way as LISTS_PATH_DEFAULT was added
- Added 'final' keyword to a string
2013-07-27 10:13:57 +02:00
Roland Haeder
7263bb82fb Fix for NPE on shutdown:
java.lang.NullPointerException
        at net.yacy.search.Switchboard.storeDocumentIndex(Switchboard.java:2732)
        at net.yacy.search.Switchboard.access00(Switchboard.java:207)
        at net.yacy.search.Switchboard.run(Switchboard.java:3049)
2013-07-27 09:55:43 +02:00
Roland Haeder
13433d41a1 Log this exception better
Conflicts:
	source/net/yacy/kelondro/blob/Tables.java
2013-07-27 09:54:51 +02:00
orbiter
080d80c9de do not write an empty failreason in case that there is no fail. Because
of the lazy instantiation rule this value was not actually written, but
if lazy instantiation is switched on, then this causes that all crawl
starts delete all crawl-start-hosts completely because this looks for
filled error reasons.
2013-07-26 17:53:28 +02:00
Michael Peter Christen
4c242f9af9 always use a default value for boolean options to have transparency for
the outcome if the attribute is missing in servlets
2013-07-25 12:17:29 +02:00
Michael Peter Christen
61e015268b fix in forced deletion: forced commit needed 2013-07-25 09:53:19 +02:00