Commit Graph

7305 Commits

Author SHA1 Message Date
Michael Peter Christen
fd87fa1613 removed more unnecessary exist-checks in ErrorCache 2014-07-11 16:48:08 +02:00
Michael Peter Christen
f2b476e08b don't do a double check to solr for failed documents if they are not
written to solr
2014-07-11 16:26:52 +02:00
Michael Peter Christen
06ab72d1af enhanced crawler host round-robin strategy 2014-07-11 16:01:42 +02:00
orbiter
dab9a0786a Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-07-11 04:04:34 +02:00
orbiter
51bf5c85b0 Renamed the transmission cloud to buffer in dispatcher since the name
'cloud' was a bad idea. Changed also the accumulation process for peer
targets so that every dht chunk is not assigned the set of redundant
targets but they are assigned to redundant targets individually. This
enhances the granularity of the target accumulation and should enhance
the efficiency of the process. Finally the dht protocol client was
enriched with the ability to remove the 'accept remote index' flag from
peers or remove peers completely if they do not answer at all.
2014-07-11 04:04:09 +02:00
Michael Peter Christen
a694b6a8fc another fix for unique field computation 2014-07-10 17:25:33 +02:00
Michael Peter Christen
fb3dd56b02 fix for processing of noindex flag in http header 2014-07-10 17:13:35 +02:00
Michael Peter Christen
b0d941626f fixed bugs in canonical, robots and title/description unique calculation 2014-07-10 15:40:38 +02:00
reger
d9472d043a cleanup older unused classes 2014-07-10 02:20:01 +02:00
reger
665e12f88e move startup time from old serverCore to switchboard (most used here)
to make servercore eventually obsolete.
2014-07-10 02:17:56 +02:00
reger
336425912a remove unused localSearchThread from SearchEvent 2014-07-10 02:14:03 +02:00
reger
32bd2a61c1 add local ip to AbstractRemoteHandler local hostname cache 2014-07-10 02:09:26 +02:00
Michael Peter Christen
f3a6b6e21e fix for bad URL decoding 2014-07-10 01:59:29 +02:00
Michael Peter Christen
1092e798a5 fixed double content postprocessing 2014-07-07 19:15:11 +02:00
Michael Peter Christen
aee5b108e5 added linkScraperParser, a parser which ignores the text like the
generic parser but extracts links like the htmlParser. This should be
used for ASCII documents without known text format annotation like
source code files or json documents. Probably also good for xml files
without known schema.
2014-07-07 13:37:17 +02:00
reger
2b8cc5832c fix seek error for 0 file size records file
by adding an extra check for file size = 0 in cleanlast()
- (http://mantis.tokeek.de/view.php?id=411)
2014-07-06 20:49:01 +02:00
reger
2ba394333f fix Crawler HostQueue release of stackfile
- close stackfile inputstream at end of ChunkIterator
This should solve startup delay while unfinished crawl jobs exist (maybe also too many open file situation)
2014-07-06 16:04:30 +02:00
reger
40133ba2d0 fix NPE in Condenser,
discovered by calling IndexControlRWI, "Word Deletion" with "for every resolvable and deleted URL reference"
2014-07-06 13:24:36 +02:00
orbiter
59160984cc timeline performance update 2014-07-03 13:06:29 +02:00
orbiter
54bea96e67 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-07-02 23:23:34 +02:00
Michael Peter Christen
841cc77391 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-07-02 14:35:02 +02:00
Michael Peter Christen
e09218129c remove check for local solr. This check was made during a time when Solr
was optional and another alternative metadata store was available. Since
that store is now removed, Solr is always available (internally or
externally)
2014-07-02 14:34:48 +02:00
orbiter
2073e69034 fix for long periods in timeline 2014-07-02 11:29:50 +02:00
reger
1f94df29e7 fix NPE in solr rss where snippet contains only the title text
and adjusted xslt, for solr snippets (&hl=true) to decode the xml encoded html <b> tag by adding disable-output-escaping
(still open item description may be double as dc: tag and rss.description tag)
2014-07-01 23:24:26 +02:00
Michael Peter Christen
09dcdb9b19 update to solr 4.9.0 2014-07-01 16:39:00 +02:00
Michael Peter Christen
1cd4b2e8be Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-07-01 16:06:12 +02:00
Michael Peter Christen
8c52f0651b refactoring of AccessTracker events & timeline fix 2014-07-01 16:06:01 +02:00
reger
431a5f9c4e added test case for TextSnippet,
removed obsolete/unused parameter and reference to MediaSnippet
2014-06-30 05:36:48 +02:00
Michael Peter Christen
5b94a257ce no timeout for large reference collections 2014-06-29 22:26:22 +02:00
Michael Peter Christen
f5b817bac4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-06-29 22:25:08 +02:00
reger
cb2c17d236 extract author and keywords in .doc and .ppt parser 2014-06-29 02:54:09 +02:00
reger
a5707cd2eb enable proper Author navigator
- author facet is based on omitted author_sxt field
- adjust to make author nav available on exist of author field but keep using author_sxt to construct the facet (why!?)
- add check for querymodifier author in searchevent
2014-06-27 23:05:06 +02:00
Michael Peter Christen
74206a10c7 refactoring 2014-06-27 14:40:36 +02:00
orbiter
fec673c9d1 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-06-27 10:15:37 +02:00
orbiter
4a66af716d added apkParser stub (work in progress) 2014-06-27 10:15:01 +02:00
orbiter
c59da9fe7a added access tracker log reader stub 2014-06-27 10:14:36 +02:00
reger
2d67f29244 adjust mergeDocument after parsing to
- preserve charset and languages
- fix merge of author
2014-06-26 22:16:15 +02:00
Michael Peter Christen
0d29b972cc Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-06-26 13:02:56 +02:00
Michael Peter Christen
36e623d8bf enhanced metadata enrichment for media file type search:
- Web servers may now deliver YaCy-specific http header field with a
title and keywords. The new http header fields are:
X-YaCy-Media-Title - to be used for media (image, audio, video) titles
X-YaCy-Media-Keywords - to be used for media (image, audio, video)
keywords
- both fields are written to document fields title and keywords and are
searched also during image search.
- to make the usage of arbitrary http header fields (including these new
fields) possible in the /api/push_p.json servlet, a new POST argument is
also introduced to push http header fields. The new POST attribute is
named "responseHeader-X" (where X is the counter). This attribute may be
used several times as a multi-attribute; each occurrence can be filled
with one http header line.
- see /api/push_p.html for examples
2014-06-26 13:02:35 +02:00
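A minimal sketch of how a client might assemble the "responseHeader-X" POST fields described above. The class and method names are hypothetical and not part of YaCy, and the sketch assumes that X is the counter of the pushed document and that the field is simply repeated once per header line; /api/push_p.html remains the reference for the real parameters.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not YaCy code): builds repeated responseHeader-X form fields. */
final class PushHeaderFields {

    /**
     * Returns one (name, value) pair per header line. The name is always
     * "responseHeader-" + counter, so the same field name may occur several times.
     */
    static List<Map.Entry<String, String>> build(int counter, List<String> headerLines) {
        List<Map.Entry<String, String>> fields = new ArrayList<>();
        for (String line : headerLines) {
            fields.add(new SimpleEntry<>("responseHeader-" + counter, line));
        }
        return fields;
    }
}
```

Called with counter 0 and the lines "X-YaCy-Media-Title: ..." and "X-YaCy-Media-Keywords: ...", this yields two fields that are both named responseHeader-0.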
Michael Peter Christen
49886fab08 enhanced debugging 2014-06-26 12:57:01 +02:00
Michael Peter Christen
b893c42a0f bugfix for image search 2014-06-26 12:56:33 +02:00
Michael Peter Christen
c7995d3e2a increased fixed limit for http POST request sizes to 100MB 2014-06-26 11:58:07 +02:00
reger
7847a93558 fix AbstractParser.singleList not adding null strings
- prevents null titles in oo... parser  (as detected by ParserTest)
- correct ParserTest dc_description check (dc_description allowed to return 0 length array)
2014-06-26 02:56:45 +02:00
Michael Peter Christen
8acae852a0 write <em>-tagged texts also into the bold_txt field 2014-06-25 11:51:11 +02:00
reger
90c4576361 add a link to recrawl index entry to metadata html page
- to allow manually renew index content for this url (e.g. in case it is a remote search result with metadata only)
- use simply a  QuickCrawlLink_p javascript snippet (minimalistic 1st solution)
2014-06-21 04:21:29 +02:00
Michael Peter Christen
2626c8f6db using concurrency to do base64 encoding in file POST commands 2014-06-20 13:55:15 +02:00
Michael Peter Christen
e132689818 fixed and enhanced Base64 (en)coder (again) 2014-06-20 13:54:18 +02:00
Michael Peter Christen
2415e3db43 enhanced ASCII byte[] -> String conversion 2014-06-20 13:53:22 +02:00
Michael Peter Christen
4751ed974f enhanced base64 encoding 2014-06-19 12:11:02 +02:00
Michael Peter Christen
e949071160 removed superfluous date method 2014-06-19 12:10:42 +02:00
Michael Peter Christen
501d55cd35 removed superfluous assert 2014-06-19 12:10:12 +02:00
orbiter
0bbb5040b8 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-06-15 12:38:52 +02:00
orbiter
9d5d86cd03 Added filter query options to the ranking servlet /RankingSolr_p.html.
Filter queries are not actually related to ranking, but user requests
have pointed out that specific boost queries to move results to the end
of the result list are not sufficient. Such boost filters may be better
executed as actual filter and therefore such a filter can now be
statically applied to every search request. A typical use could be the
expression "http_unique_b:true AND www_unique_b:true" which uses the
recently introduced fields http_unique_b and www_unique_b which are true
only for one of the alternatives with/without http(s) and with/without
prefix 'www.' in host names.
2014-06-15 12:38:30 +02:00
Michael Peter Christen
d2151857f1 Added collection navigation:
The collection field (can be filled e.g. in Crawl Start) can be used to
add categories to YaCy index entries. The usage of that field was
restricted to solr searches and post argument filters as implemented in
commit f7571386a3.
This commit extends collections to a full navigation option in the
standard YaCy search interface. The field is not active by default but
can be activated easily in the /ConfigSearchPage_p.html servlet (just
check the 'Collection' facet field). Collections can now be used for (at
least) two purposes:
- to provide search tenants (through post argument collection)
- to provide self-made category navigation
Search requests may now have (independently from switched on or off
collection facet) a "collection:<collection-name>" modifier attached;
furthermore collection names may use disjunctions using the '|' pipe
symbol. For example, this is a valid search request:
www collection:user|proxy
2014-06-15 12:11:23 +02:00
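The "collection:" modifier syntax from the commit above could be read roughly as follows. This is an illustrative sketch only; the class and method names are assumptions and this is not the parser YaCy uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not YaCy code): reads a "collection:a|b" search modifier. */
final class CollectionModifier {

    private static final String PREFIX = "collection:";

    /** Returns the collection names of the first "collection:" token, or an empty list. */
    static List<String> collections(String query) {
        List<String> result = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.startsWith(PREFIX)) {
                // the '|' pipe symbol separates alternative collection names
                for (String name : token.substring(PREFIX.length()).split("\\|")) {
                    if (!name.isEmpty()) result.add(name);
                }
                return result;
            }
        }
        return result;
    }

    /** Returns the query terms with every "collection:" token removed. */
    static String withoutModifier(String query) {
        StringBuilder sb = new StringBuilder();
        for (String token : query.trim().split("\\s+")) {
            if (token.startsWith(PREFIX)) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append(token);
        }
        return sb.toString();
    }
}
```

For the request "www collection:user|proxy" the sketch separates the search term "www" from the two collection names "user" and "proxy".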
Michael Peter Christen
74c249288a added a push api to make it possible to upload files directly without
crawling to the YaCy indexer. Files are uploaded using POST multipart
requests; multiple file uploads are possible as well. Each file has
attached the file date and mime type which is used to get the right
parser for the submitted data. Also an url is submitted which is
assigned to the document.
The CrawlSwitchboard has a new option for default Crawl Profiles which
are assigned dynamically from the new push interface.
2014-06-12 18:10:07 +02:00
Michael Peter Christen
f13c8aa7dd re-implementation of file push option in the context of POST http
requests. The internal representation of post-arguments is String and
therefore not appropriate for byte[] object as submitted by file pushes.
Therefore all pushed files are encoded to base64 _after_ uploading with
an http form (you do not need to do that encoding yourself) to hand-over
the byte[] as string in the post argument.
Servlets which read such files must decode the base64 data to get the
original byte[] array.
This is considered a temporary solution for file uploads; a proper
implementation would need to hand over all attributes as Objects with
either String or byte[] instances. This would be a major code change and
is not done at this time. The feature was submitted to realize a feature
pushed with the next commit.
2014-06-12 18:06:22 +02:00
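The base64 hand-over described above amounts to a round trip like the following. The sketch uses java.util.Base64 (Java 8 and later) as a stand-in for YaCy's own Base64 coder, and the class name is hypothetical.

```java
import java.util.Base64;

/** Hypothetical sketch (not YaCy code): carries a byte[] upload inside a String argument. */
final class FilePostArgument {

    /** Encodes the uploaded bytes so they fit into a String-typed post argument. */
    static String encode(byte[] fileContent) {
        return Base64.getEncoder().encodeToString(fileContent);
    }

    /** What a servlet would do to get the original byte[] back. */
    static byte[] decode(String postArgument) {
        return Base64.getDecoder().decode(postArgument);
    }
}
```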
Michael Peter Christen
ba6ffddefc refactoring 2014-06-12 05:23:26 +02:00
reger
982601017e crawling of filenames with + fails due to url decoding
modified UTF8.decodeURL to apply x-www-form-urlencoded ( space -> + ) to the query part of the url only.
2014-06-11 04:13:55 +02:00
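A rough sketch of the idea in the commit above: '+' is only treated as an encoded space after the '?' that starts the query part, so a '+' in a file name survives. This is not the actual UTF8.decodeURL code; the class name is hypothetical and fragments ('#') are not handled specially.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch (not YaCy code): '+' means space in the query part only. */
final class UrlDecodeSketch {

    static String decode(String url) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean inQuery = false;
        int i = 0;
        while (i < url.length()) {
            char c = url.charAt(i);
            if (c == '?') inQuery = true;
            if (c == '+' && inQuery) {
                out.write(' ');                       // form-encoded space
                i++;
            } else if (c == '%' && i + 2 < url.length()
                    && Character.digit(url.charAt(i + 1), 16) >= 0
                    && Character.digit(url.charAt(i + 2), 16) >= 0) {
                // %XX escape: emit the encoded byte
                out.write(Character.digit(url.charAt(i + 1), 16) * 16
                        + Character.digit(url.charAt(i + 2), 16));
                i += 3;
            } else {
                // any other character is copied as its UTF-8 bytes
                int cp = url.codePointAt(i);
                byte[] raw = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(raw, 0, raw.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```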
reger
3b559e7846 optimize pdfParser
skip starting reader thread if all content already read
2014-06-10 04:25:20 +02:00
reger
09f73b790f fix pdfParser not closed warning from pdfbox
for encrypted pdf on exit due to missing permission to extract
2014-06-08 08:20:30 +02:00
reger
92d1604a31 Crawler hostbalancer does not delete finished queue files,
use alternative delete to fight the symptom (and fix deletion of host dirs on startup)
Root cause (which class holds a lock on .stack) not found.
http://mantis.tokeek.de/view.php?id=404
2014-06-05 02:13:08 +02:00
Michael Peter Christen
0c324d735c NPE fix for postprocessing without term index 2014-06-04 12:28:28 +02:00
Michael Peter Christen
922979aae1 added option to prefer http over https in unique-protocol ranking 2014-06-02 17:40:56 +02:00
Michael Peter Christen
b3b174e2b8 fixed webgraph postprocessing and status display in Crawler_p servlet 2014-06-02 15:06:38 +02:00
Michael Peter Christen
e6b28f5958 removed check on protocol for double content (user request) 2014-06-02 13:11:44 +02:00
reger
d8d318233e fix logging settings
- add missing .level
- remove obsolete jena settings
- set default level=INFO to prevent debug logging of classes not explicitly specified
2014-06-01 06:43:50 +02:00
Michael Peter Christen
698f053658 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-06-01 01:02:12 +02:00
Michael Peter Christen
f23c4142e0 added option to configure a custom user agent within allip networks 2014-06-01 01:02:03 +02:00
reger
8e233e2eb4 - fix typo in Message_p (defaultpath)
- use more existing switchboardconstants for getproperties
- replace deprecated call defaultservlet
2014-06-01 00:20:25 +02:00
orbiter
d7d38f9135 made number of open files in crawler configurable and increased default
maximum number of open files from 100 to 1000. This number can be
changed with the attribute crawler.onDemandLimit
2014-05-31 09:29:55 +02:00
Michael Peter Christen
8ad41a882c fixed several problems with postprocessing:
- unique-postprocessing was destroying results from other
postprocessings; removed cross-updates as they were not necessary
- unique-postprocessing did not restrict on same protocol
- inefficient concurrent update cache was redesigned completely
- increased limits for concurrent blocking queues to prevent early
time-out
2014-05-29 13:24:24 +02:00
reger
ca5437dd50 fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149
local files can be crawled (intranet mode) url parsing fixed according to  RFC 1738 (for unix and windows)
for win like file:///c:/tmp   or file://localhost/c:/tmp
for linux like file:///tmp  or file://localhost/tmp
Host is ignored and path must be absolute
2014-05-28 03:01:34 +02:00
Michael Peter Christen
ff5b3ac84d added new fields http_unique_b and www_unique_b which can be used for
ranking to prefer urls containing a www subdomain or using the https
protocol
2014-05-27 15:28:28 +02:00
sixcooler
5b1c4ef191 Monitoring and limit connection-count for Jetty 2014-05-22 22:16:39 +02:00
Michael Peter Christen
f0db501630 better handling of ranking parameters and new default values for date
navigation which is done using ranking in solr.
2014-05-22 03:01:07 +02:00
Michael Peter Christen
53948da7d0 tried to make last_modified recognition smarter 2014-05-22 00:28:51 +02:00
Michael Peter Christen
2d03037965 'Last-Modified', not 'Last-modified' according to
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
2014-05-21 23:21:31 +02:00
Michael Peter Christen
3dc5fb0050 fix for operator precedence bug (cast binds stronger than bitwise AND)
in peer hash hashing. This should not change anything if java casts long
to int by masking with 0xFFFFFFFFL but you never know. The important
thing is, that the hashCode() should not return numbers that have the
same order as the hash code order because hashing of seeds is used to
remove the order in some places.
2014-05-21 18:37:52 +02:00
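The precedence issue named in the commit above can be illustrated as follows. The commit does not show the actual expression from the peer hash code, so the two methods below are only a generic demonstration with hypothetical names.

```java
/** Hypothetical sketch (not YaCy code): a cast binds stronger than bitwise AND. */
final class CastPrecedence {

    /** Parsed as ((int) x) & 0xFFFFFFFFL: truncate first, then AND; the result is a long. */
    static long castThenMask(long x) {
        return (int) x & 0xFFFFFFFFL;
    }

    /** Parentheses force the AND first, then the truncating cast; the result is an int. */
    static int maskThenCast(long x) {
        return (int) (x & 0xFFFFFFFFL);
    }
}
```

For x = 0x180000001L the first form gives 2147483649 (a long) and the second gives -2147483647 (an int). Both carry the same low 32 bits, which matches the commit's remark that the fix should not change anything once the value ends up in an int.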
Michael Peter Christen
6634b5b737 debug code for index distribution testing 2014-05-21 18:20:16 +02:00
orbiter
49e344e8d9 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-05-21 09:28:55 +02:00
orbiter
7705e36703 fix for latest generic warning fix 2014-05-21 09:28:23 +02:00
sixcooler
10326892a8 avoid errors from ConnectHandler, correction for #6d16fa9 2014-05-21 03:04:07 +02:00
orbiter
97983ba89f fixed generics warnings for generic array instantiation that appeared
after migration to Java 7
2014-05-20 21:50:16 +02:00
sixcooler
830057d788 lower Segment-size (hope to get Segments of 10GB)
see:
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5216&p=30036#p30034
2014-05-19 17:55:03 +02:00
orbiter
c028ae9b09 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-05-18 21:21:17 +02:00
reger
e31493e139 "Use remote proxy for yacy" has no function, remove option and related config item
see/fix bug http://mantis.tokeek.de/view.php?id=23
http://mantis.tokeek.de/view.php?id=189
2014-05-17 23:36:59 +02:00
orbiter
181784a5cb Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-05-15 08:06:59 +02:00
reger
0587077d06 cleanup obsolete and not used serverswitch Authentify code
as auth is mostly delegated to Jetty container.
2014-05-14 23:13:49 +02:00
orbiter
c9f66be20b move unnecessary nested else out of condition 2014-05-13 22:31:12 +02:00
orbiter
0d8072aa99 removed warnings 2014-05-13 22:29:05 +02:00
orbiter
88f4af90da removed warnings 2014-05-13 22:27:31 +02:00
orbiter
0f425e01ca another circle computation enhancement 2014-05-13 21:30:47 +02:00
reger
a8d162810c Exclude = from percent-encoding in MultiProtocolURL
fix http://mantis.tokeek.de/view.php?id=185 and http://mantis.tokeek.de/view.php?id=280
2014-05-13 02:33:35 +02:00
reger
024f8e9b33 fix truncated urls containing ","
addressing http://mantis.tokeek.de/view.php?id=58

Exclude comma from percent-encoding in MultiProtocolURL (see  RFC 1738 2.2  and  RFC 3986 2.2)
2014-05-13 01:50:15 +02:00
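A simplified sketch of percent-encoding that leaves ',' and '=' untouched, as the two commits above describe. The set of characters kept here is an assumption chosen for illustration and is not the set used by MultiProtocolURL; the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch (not YaCy code): percent-encoding that keeps ',' and '='. */
final class PercentEncodeSketch {

    // illustrative set of characters left as-is, besides ASCII letters and digits
    private static final String KEEP = "-._~/,=";

    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean alnum = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9');
            if (alnum || KEEP.indexOf(v) >= 0) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }
}
```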
Michael Peter Christen
9112f0a2df enhanced circle tool initialization 2014-05-12 16:21:24 +02:00
Michael Peter Christen
a1ac4c3b76 automatically clear graphics cache 2014-05-12 15:45:25 +02:00
Michael Peter Christen
505f58c79c enhanced circle computation time and memory footprint 2014-05-12 15:34:56 +02:00
reger
cd8c0dbda9 assign serialVersionUID for proxyservlet, too. 2014-05-11 03:51:47 +02:00
reger
b300d7f4ce set serialVersionUID on urlproxyservlet to skip compiler warning
- remove commented out code
2014-05-11 03:31:07 +02:00
reger
e9060d31bd update to Jetty 9
besides adjustments in code it makes the servlet settings in web.xml significant.
This applies to solr, gsa and proxy servlet. There is no longer a default setup in code during init (as jetty 9 checks for double definition).
2014-05-11 01:53:11 +02:00
reger
1432a817dd respect "index media" switched off in CrawlStartExpert.html
fix http://mantis.tokeek.de/view.php?id=64
2014-05-08 22:21:24 +02:00
orbiter
39e1913585 next development step: migration to java 1.7
This includes also a small code change to test generic type inference, a
java 1.7 feature
2014-05-08 07:41:11 +02:00
Michael Peter Christen
4e734815e8 enhanced snippets: remove lines which are identical to the title and
choose longer versions if possible. Prefer the description part.
2014-05-06 16:48:50 +02:00
Michael Peter Christen
e84e07399a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-05-06 14:51:57 +02:00
orbiter
89f76da24b Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-05-06 05:38:38 +02:00
sixcooler
390f03e041 do not check for segments-count on optimize:
this is also done in Solr and our getSegmentsCount() does not return
up-to-date values
2014-05-05 13:24:41 +02:00
reger
8a7c68e4c7 content of surrogates/out never accessed (remove)
After import the content is never accessed but may take up a lot of disk space,
also the getLoadedOAIServer (which lists the files in surrogate out) is not used.
Making the surrogate.out obsolete. Removed keeping of xmls after import.
2014-05-04 09:29:07 +02:00
sixcooler
b8cee9b7d8 remove tables from tabletracker on close to avoid lots of dead entries in
/PerformanceMemory_p.html
2014-05-02 22:55:47 +02:00
reger
1600414450 fix NPE on continuing crawls after YaCy restart
(Agent is then null)
2014-05-02 19:32:09 +02:00
Michael Peter Christen
229f2248b8 added configuration option for maximum load and minimum ram for
postprocessing
2014-04-30 13:26:32 +02:00
orbiter
f15c832587 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-30 07:42:52 +02:00
Marc Nause
c97da1a0d8 First draft of a blacklist API. 2014-04-30 00:48:38 +02:00
Michael Peter Christen
d4f65833a1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-29 19:51:01 +02:00
Michael Peter Christen
c1c1be8f02 fix for slow crawling and better logging in balancer 2014-04-29 19:50:33 +02:00
Michael Peter Christen
3acf416335 npe fix 2014-04-29 19:24:05 +02:00
reger
2eb7682772 add html5 audio/video <source> tag to html content scraper
- <source src=.. type=..> tag content is added to embed collection
2014-04-29 00:41:29 +02:00
reger
0b6db04e40 fix contentscraper img height/width parsing
prevent numberformat exception on common "100px" property

- include in test case
2014-04-28 04:59:47 +02:00
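One way to avoid a NumberFormatException on values such as "100px" is to parse only the leading digits. This is a sketch under that assumption, with a hypothetical class name; the actual content scraper fix may differ.

```java
/** Hypothetical sketch (not YaCy code): tolerant parsing of img width/height values. */
final class DimensionParser {

    /** Parses the leading digits of the attribute value; returns -1 if there are none. */
    static int parse(String value) {
        if (value == null) return -1;
        String s = value.trim();
        int end = 0;
        while (end < s.length() && s.charAt(end) >= '0' && s.charAt(end) <= '9') end++;
        if (end == 0) return -1;
        try {
            return Integer.parseInt(s.substring(0, end));
        } catch (NumberFormatException e) {
            return -1; // more digits than fit into an int
        }
    }
}
```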
reger
ffc5b75c73 optimize and fix lat / lon assignment 2014-04-27 20:52:06 +02:00
reger
9313447de2 reimplement tighter lat/lon calc in URIMetadataNode
from old MetadataRow, considering http://mantis.tokeek.de/view.php?id=272
2014-04-27 18:20:33 +02:00
reger
d812f80784 add exit proxy link to UrlProxy
on proxied pages a link to exit proxy is added to top of page.
Link text can be configured in web.xml init-parameter (see default/web.xml). If missing no link is displayed.
2014-04-26 22:27:59 +02:00
reger
78d08998db throw MalformedURLException on unknown protocol
on other than the supported   http https ftp file smb \\  mailto
2014-04-26 01:30:51 +02:00
reger
bb8181b2be fix: resolve url without path but searchpart
e.g. http://yacy.net?q=test was resolved as host "yacy.net?q=test" now host="yacy.net" path="/"
fixes http://mantis.tokeek.de/view.php?id=47

added test case for getHost
2014-04-25 20:15:55 +02:00
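The host/path split from the commit above can be sketched like this: the host ends at the first '/', '?' or '#', and a missing path defaults to "/". The class is hypothetical, ignores ports and user info, and is not the MultiProtocolURL implementation.

```java
/** Hypothetical sketch (not YaCy code): host ends at the first '/', '?' or '#'. */
final class HostPathSketch {

    /** Returns {host, path}; the path defaults to "/" when the url has none. */
    static String[] split(String url) {
        int start = url.indexOf("://");
        String rest = start < 0 ? url : url.substring(start + 3);
        int end = cut(rest, new char[] {'/', '?', '#'});
        String host = rest.substring(0, end);
        String tail = rest.substring(end); // starts with '/', '?', '#' or is empty
        String path = "/";
        if (tail.startsWith("/")) {
            path = tail.substring(0, cut(tail, new char[] {'?', '#'}));
        }
        return new String[] {host, path};
    }

    /** Index of the first stop character in s, or s.length() if none occurs. */
    private static int cut(String s, char[] stops) {
        int end = s.length();
        for (char stop : stops) {
            int p = s.indexOf(stop);
            if (p >= 0 && p < end) end = p;
        }
        return end;
    }
}
```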
orbiter
a3542f29b4 npe fix 2014-04-25 09:26:20 +02:00
orbiter
c48d2a2a02 npe fix 2014-04-25 09:23:10 +02:00
reger
121d25be38 recover sax fatal error on OAI-PMH import of xml with entity error
this allows to continue loading next resumptionToken even if import file caused sax parser error
fix http://mantis.tokeek.de/view.php?id=63
2014-04-25 01:05:28 +02:00
reger
81dc2aa536 add current css to HTMLResponseWriter to fix metadata view
(using css from metas.template except js links)
2014-04-23 23:41:10 +02:00
orbiter
2fd8a0ead6 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-23 23:13:23 +02:00
orbiter
8e5ce7cd51 fixed a situation where finished crawls had not been detected. 2014-04-23 23:13:07 +02:00
orbiter
2f63bd0261 enhanced Host Balancer strategy: fair round robin 2014-04-23 23:11:37 +02:00
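The commit message does not describe the strategy in detail. As a generic illustration of fair round robin over per-host queues, one URL is taken from each host in turn. The class below is hypothetical and much simpler than the crawler's Host Balancer; it ignores wait times and crawl depth.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not YaCy code): one URL per host in turn. */
final class RoundRobinBalancer {

    private final Map<String, Deque<String>> queues = new HashMap<>();
    private final Deque<String> order = new ArrayDeque<>(); // hosts with pending URLs

    void push(String host, String url) {
        Deque<String> q = queues.get(host);
        if (q == null) {
            q = new ArrayDeque<>();
            queues.put(host, q);
            order.addLast(host); // a new host joins the end of the rotation
        }
        q.addLast(url);
    }

    /** Returns the next URL, or null if nothing is queued. */
    String pop() {
        String host = order.pollFirst();
        if (host == null) return null;
        Deque<String> q = queues.get(host);
        String url = q.pollFirst();
        if (q.isEmpty()) {
            queues.remove(host);  // host is finished
        } else {
            order.addLast(host);  // host goes to the back of the rotation
        }
        return url;
    }
}
```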
orbiter
0c88a32c36 do not apply lazy value instantiation for numeric or boolean values
because that is misleading and confusing in case of 0- or false-values
and may cause NPEs in retrieval functions.
2014-04-23 08:41:36 +02:00
orbiter
8e04030596 in case of short memory, do not cut down robinson peers to 1, just
reduce by 50%
2014-04-23 08:37:19 +02:00
reger
86f6975edc exclude html tags in in/outboundlinks_anchortext_txt parsed text
- some outboundlinks_anchortext_txt in index contain e.g. <span>text</span> or more tags,
remove all tags for text property (inline img tags are still parsed)
- added test case for above (to htmlParserTest)
- fix solr test case
2014-04-23 00:55:16 +02:00
orbiter
ccb1864d55 catch IllegalArgumentException for wrong process types (that is needed
for migrations when new process types are introduced or disappear)
2014-04-22 23:14:05 +02:00
orbiter
4ee4ba1576 fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of
lazy value instantiation of 0-value in crawldepth_i
2014-04-22 19:48:49 +02:00
orbiter
12ba890205 removed warnings 2014-04-22 19:35:15 +02:00
reger
d51f9cc863 add custom Jetty errorhandler
to provide custom error page footer line
- remove redundant mime check in UrlProxyServlet
2014-04-21 17:28:21 +02:00
reger
c193a02023 defer creation of new ArrayList after possible early return
(to skip not used object allocation)
2014-04-21 17:16:06 +02:00
reger
727dfb5875 refactor URIMetadataNode to further unify interaction with index
-  URIMetadataNode extending SolrDocument
- use language as stored (String), reducing conversion to string
- optimize debug code in transferIndex
2014-04-20 01:41:30 +02:00
reger
79e7947442 - remove empty http0_9 status text array
and unused default_charset = ISO-8859-1
2014-04-18 22:03:16 +02:00
reger
2dabe2009d - remove unused manual http KeepAlive config
(reducing references to obsolete httpdemon)
- add port info to settings_http
2014-04-18 19:57:35 +02:00
Michael Peter Christen
5746aae3db add canonical links to the same crawldepth, not the next crawldepth 2014-04-18 06:51:46 +02:00
Michael Peter Christen
74ab5ef9fa increased runtime for postprocessing query job 2014-04-18 06:51:10 +02:00
Michael Peter Christen
8b32dd5f9e special strategy for balancer: do not remove targets with zero wait time
from the queue
2014-04-18 06:50:07 +02:00
Michael Peter Christen
9c6228d948 fix for deadlocks in crawler 2014-04-17 16:58:17 +02:00
Michael Peter Christen
10cf8215bd added crawl depth for failed documents 2014-04-17 13:21:43 +02:00
Michael Peter Christen
7fefebaeca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-17 12:55:38 +02:00
Michael Peter Christen
c2f62e783f - better subgraph handling, less overhead for crawls without the
webgraph
- usage of crawler crawldepth cache for the linkgraph target depth
computation
2014-04-17 12:54:18 +02:00
Michael Peter Christen
06afb568e2 new Strategies in Balancer:
- doublecheck cache now records the crawl depth as well
- doublecheck cache is available from the outside (made static)
- no more need to crawl hosts with lowest depth first, instead all hosts
which have only singleton entries are preferred to reduce the number of
files.
2014-04-17 12:52:54 +02:00
Michael Peter Christen
1aea01fe5b fix for Table in case that requested file does not exist and paths also
do not exist
2014-04-17 12:44:05 +02:00
reger
710054bb37 implement gzip input handling directly in defaultservlet
(making reference to legacy httpdemon obsolete)
2014-04-17 03:20:29 +02:00