Commit Graph

2945 Commits

Author SHA1 Message Date
orbiter
927aaa95a6 concurrency bugfix 2014-08-13 00:59:11 +02:00
orbiter
c9e593cf78 removed warnings 2014-08-11 23:53:12 +02:00
reger
7584352e7b use more predefined Solr query parameter constants
- use CommonParams and DisMaxParams constants
- fix typo in get sort parameter
- removed getDocumentCountByParams from the interface: it was a redundant implementation with the risk of a non-optimized call (row parameter unspecified) and was only used from getCountByQuery
2014-08-10 22:33:10 +02:00
reger
f9db5dd6c5 reduce doublecontent check document (prevent out of memory)
see http://mantis.tokeek.de/view.php?id=437

test result (concurrency=7)
2000 docs = eom always
1000 docs = eom always
100 docs = eom never

chosen -> 200 docs (eom not encountered during test with 1GB mem setting)
2014-08-10 03:18:15 +02:00
reger
e9eae45b55 simplify rssreader and improve atom feed link extraction
- type detection (rss/atom) 
    - init type parameter overwritten during parse, parameter obsolete
    - detection by endtag changed to simpler first-tag evaluation
- channel image not used, removed related extra parser handling
    - remove unused code (set/getImage) in rssfeed
- atom link extraction to account for possible multiple link tags
   - spec limits link to one with rel="alternate" or one without rel attribute
     not accounting for the following type & hreflang exception yet:

   o  atom:entry elements MUST NOT contain more than one atom:link
      element with a rel attribute value of "alternate" that has the
      same combination of type and hreflang attribute values.
2014-08-10 01:29:16 +02:00
reger
a8508417d1 catch NPE during crawl (OAI import)
- condenseDocument mime=null (allowed)
- collectionconfiguration responseheader = null (allowed)
2014-08-08 00:02:59 +02:00
reger
3dde94422f center searchevent lines on network graph
(PerformanceSearch_p.html)
2014-08-06 23:04:42 +02:00
Michael Peter Christen
3860711aef fix for possible interruption of concurrent queries 2014-08-06 12:55:18 +02:00
Michael Peter Christen
6344718f8b reducing the concurrent query stack size and reduced concurrency of
postprocessing to avoid OOM situations
2014-08-06 12:36:59 +02:00
Michael Peter Christen
eca9380e3d bugfix for crawler double-check: if an url is redirected, the
redirect-target was not double-checked. This is now done by replacing
the redirect-URL on the crawl queue again (where it is double-checked)
2014-08-06 12:35:12 +02:00
Michael Peter Christen
9ac0c93f17 fix for subpath crawl filter 2014-08-06 01:33:24 +02:00
Michael Peter Christen
66106bdaf0 fix for crawler attribute maxdompages 2014-08-05 21:32:25 +02:00
Michael Peter Christen
49d91b94c3 npe fix in crawler 2014-08-05 21:31:59 +02:00
Michael Peter Christen
b7183a7321 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-08-05 09:54:18 +02:00
reger
ea2e627662 fix ConfigAccounts del user with uppercase letter in name
(usernames are case sensitive, userdb.delete used toLower)
2014-08-05 01:27:27 +02:00
Michael Peter Christen
c465b791af typo 2014-08-04 16:13:39 +02:00
Michael Peter Christen
191ec8c82a added concurrency to postprocess rewrite process 2014-08-04 15:28:58 +02:00
Michael Peter Christen
a1e8bdd5e9 log ppm instead of docs/second 2014-08-04 14:44:42 +02:00
Michael Peter Christen
cc0ded7abd set process type of web graph according to fields as defined in the
schema
2014-08-04 14:44:20 +02:00
Michael Peter Christen
12fb9d7cd1 log postprocessing constraints in case that postprocessing is not
performed
2014-08-04 14:19:37 +02:00
Michael Peter Christen
3c23b89823 less logging 2014-08-04 13:37:34 +02:00
Michael Peter Christen
a0c53174c5 better solr query logging to detect unnecessary sort requests for more
performance profiling
2014-08-04 13:00:45 +02:00
Michael Peter Christen
338f574bdc no sorting if http/www unique fields are not demanded (makes query
faster) and some code restructuring
2014-08-04 12:59:38 +02:00
Michael Peter Christen
1609763be5 toString fix 2014-08-04 12:58:39 +02:00
Michael Peter Christen
b983e68254 more retries, less sleep 2014-08-04 08:29:35 +02:00
Michael Peter Christen
1503ba7794 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-08-04 08:24:31 +02:00
reger
8f77719091 fix "Ljava.lang.String" in crawl queue anchor name
(e.g. IndexCreateQueues_p.html?stack=LOCAL with images in queue)
2014-08-04 02:38:58 +02:00
Michael Peter Christen
0ceeceb35e more logic on Solr queries; usage of the query terms in postprocessing,
saving one query for double document detection now per document
2014-08-04 02:35:38 +02:00
orbiter
38864ae004 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-08-03 22:44:49 +02:00
orbiter
4099296b45 added new classes which shall reduce call overhead to Solr (stub) 2014-08-03 22:44:22 +02:00
reger
d0c02e1de7 adjust rss lat/lon to double
(common format across other classes)
2014-08-03 20:09:23 +02:00
orbiter
3491ab4c38 removed unused images from webgraph edge computation 2014-08-01 13:21:16 +02:00
orbiter
2371d6b8db target linktexts must be string to enable search facets on these fields 2014-08-01 13:20:25 +02:00
Michael Peter Christen
001e05bb80 do not store failure of loading of robots.txt into the index as a fail
document
2014-08-01 12:15:14 +02:00
Michael Peter Christen
05d58e4df0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-08-01 12:04:25 +02:00
Michael Peter Christen
98f45c9032 fix for image alt attachment to AnchorURLs in html parser. 2014-08-01 12:04:15 +02:00
orbiter
22ce4fb4dd better error handling for remote solr queries and exists-checks 2014-08-01 11:00:10 +02:00
Marc Nause
9df14fc126 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-07-29 21:26:43 +02:00
Marc Nause
477be17c51 Replaced old UPNP library with Weupnp. UPNP should
work now, at least it does on my network. UPNP code in YaCy can still
be improved though (see TODO comment: make port on gateway configurable
or find free one).

*) removed old code
*) added new lib
*) changed code to work with new lib
2014-07-29 21:26:27 +02:00
orbiter
738989aab7 reverted commit f94c91315b because the
webgraph has not enough performance for that
2014-07-29 18:49:42 +02:00
orbiter
e9163e7e10 fix for malformed hostpath names in crawl balancer 2014-07-29 11:18:45 +02:00
Michael Peter Christen
c115f3869c enhanced snippet computation and test method in ViewFile 2014-07-28 15:42:57 +02:00
reger
6c10b59f3e move bootstrap peers test systems to its test class
var assignment not needed  elsewhere.
2014-07-27 04:13:07 +02:00
orbiter
1027f3d04a fix for the usage of ready-prepared solr queries, some queries are
formulated as edismax query but this was not set as query attribute. The
defType=edismax property needs a qf-field, so this was added as well. Do
not remove that field again! This fixes also a problem with title-unique
computation.
2014-07-25 18:53:13 +02:00
Michael Peter Christen
f94c91315b if the webgraph is used, then use it also for reference computation to
avoid contradictions with references_i in the collection index.
2014-07-24 15:35:53 +02:00
Michael Peter Christen
6e1dc444c3 added a snippet test function in ViewFile: you can now search for a
specific word on the document; the servlet returns the snippet in the
same way as it would be shown in a search result.
2014-07-24 14:59:37 +02:00
orbiter
4b06adb751 fix for file urls 2014-07-23 17:54:31 +02:00
orbiter
08409ec680 no idea why the words max was an ordered one. This change increases speed
during document processing a bit
2014-07-23 17:54:16 +02:00
reger
e5854a5cdb fix localhost link to opensearchdescription.xml 2014-07-22 21:57:38 +02:00
Michael Peter Christen
b44626e55b fixed target_alt_t in webgraph 2014-07-22 18:24:10 +02:00
Michael Peter Christen
504327b15c fix for condition for writing the webgraph 2014-07-22 00:59:08 +02:00
Michael Peter Christen
542c20a597 changed handling of crawl profile field crawlingIfOlder: this should be
filled with the date, when the url is recognized as to be outdated. That
field was partly misinterpreted and the time interval was filled in. In
case that all the urls which are in the index shall be treated as
outdated, the field is filled now with Long.MAX_VALUE because then all
crawl dates are before that date and therefore outdated.
2014-07-22 00:23:17 +02:00
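The crawlingIfOlder semantics described in commit 542c20a597 can be sketched as follows. This is only an illustration under the assumption that dates are held as epoch milliseconds; the class and method names are hypothetical and are not taken from the YaCy code base.

```java
// Hypothetical sketch, not YaCy code: crawlingIfOlder holds a cutoff DATE,
// not a time interval. A url is outdated if it was loaded before that date.
final class RecrawlCutoffSketch {

    // Convert a maximum age (an interval) into the cutoff date that the field should hold.
    static long cutoffFromMaxAge(long nowMillis, long maxAgeMillis) {
        return nowMillis - maxAgeMillis;
    }

    // A document is outdated if its load date lies before the cutoff date.
    static boolean isOutdated(long lastLoadMillis, long crawlingIfOlder) {
        return lastLoadMillis < crawlingIfOlder;
    }

    // Treat every indexed url as outdated: all real load dates are before Long.MAX_VALUE.
    static long cutoffForAlwaysRecrawl() {
        return Long.MAX_VALUE;
    }
}
```

Storing the interval itself in the field (the misinterpretation the commit mentions) would make isOutdated compare a date against a small number, so almost no url would ever be considered outdated.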
Michael Peter Christen
4eec1a7452 refactoring (change Metadata name of load time data structure to avoid
confusion with Node data which is also called metadata)
2014-07-21 23:54:23 +02:00
reger
c95ba52cf0 improve logexception info
- log a message or class name instead of msgtxt "null"
2014-07-21 22:13:34 +02:00
orbiter
e441831a24 reverted toString() change in AnchorURL to prevent mistakenly used
toString(). This fixes also the update link bug.
2014-07-21 15:58:29 +02:00
reger
47f201a6b8 Add Solr default query fields (&qf) to select servlet
according to the ranking profiles boost fields defined by the peer (if df/qf is not specified in query).
This allows for pretty simple queries ( q=word) without the need to know about the specific index configuration.
Making sure all relevant fields (as determined by the index owner) are searched, still maintaining the option to query specific fields
and does not rely on the duplication of text to text_t.
- add author to reset-default boost fields (support results for author nav)
2014-07-21 00:47:14 +02:00
reger
f96cfdc84d prevent array out of bound exception on getRankingProfile(x)
on faulty &profileNr=  query parameter
2014-07-21 00:04:54 +02:00
reger
5f5fb4ecdc remove unused static (RSS)search from protocol 2014-07-20 02:49:49 +02:00
reger
7c1706d83a use CRLF in generated bat command scripts for windows
- for easier viewing with standard viewers
2014-07-20 00:06:22 +02:00
reger
a2cb366b25 Combine /heuristic search modifier with opensearch configured targets
- with search modifier /heuristic a request is sent to all configured opensearch target systems (old /heuristic/blekko modifier no longer valid)
- this allows to use opensearch heuristic on individual search requests (in contrast to configuration HEURISTIC_OPENSEARCH=true which sends an osd request on all global searches)
- the index.html searchoption text adjusted to be displayed only if option configured
- add Archive-It to predefined systems
2014-07-20 00:00:43 +02:00
Michael Peter Christen
2de159719b added an option to set 'obey nofollow' for links with rel="nofollow"
attribute in the <a> tag for each crawl. This introduces a lot of
changes because it extends the usage of the AnchorURL Object type which
now also has a different toString method than the underlying
DigestURL.toString. It is therefore not advised to use .toString at all
for urls, just use toNormalform(false) instead.
2014-07-18 12:43:01 +02:00
Michael Peter Christen
bf1b6b93e7 do not write CR values to webgraph if no CR values are computed 2014-07-16 18:13:29 +02:00
Michael Peter Christen
e039e78210 small bugfixes 2014-07-16 16:04:38 +02:00
Michael Peter Christen
32a2ff925c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-07-16 14:58:27 +02:00
Michael Peter Christen
d07cdd8c3b added SolrCloud access mode and configuration 2014-07-16 14:57:51 +02:00
Michael Peter Christen
8514bffc22 enhanced postprocessing status report 2014-07-16 14:57:25 +02:00
reger
b24572f304 fix GSA filter query assignment
- use more parameter constants
2014-07-13 00:11:17 +02:00
Michael Peter Christen
b5fc2b63ea removed exist() retrieval functions from error cache and replaced it
with metadata retrieval from connectors directly. This should cause
better usage of the cache. Automatically increase the metadata cache if
more memory is available.
2014-07-11 19:52:25 +02:00
Michael Peter Christen
62c72360ee cleanup of checkAcceptanceInitially in CrawlStacker, should avoid
double-calling of solr
2014-07-11 18:36:04 +02:00
Michael Peter Christen
dd5cdfe212 reverted filter query hack, it did not work 2014-07-11 18:15:35 +02:00
Michael Peter Christen
b5d78ba156 reduced number of solr queries during crawling 2014-07-11 18:05:11 +02:00
Michael Peter Christen
5326970d6c enhanced solr queries for single document extraction 2014-07-11 18:04:55 +02:00
Michael Peter Christen
525575bd97 added debugging of filter queries in thread dump thread names 2014-07-11 17:34:41 +02:00
Michael Peter Christen
f319ef268f testing filter queries instead of queries to retrieve documents by id 2014-07-11 17:09:46 +02:00
Michael Peter Christen
fd87fa1613 removed more unnecessary exist-checks in ErrorCache 2014-07-11 16:48:08 +02:00
Michael Peter Christen
f2b476e08b don't do a double check to solr for failed documents if they are not
written to solr
2014-07-11 16:26:52 +02:00
Michael Peter Christen
06ab72d1af enhanced crawler host round-robin strategy 2014-07-11 16:01:42 +02:00
orbiter
dab9a0786a Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-07-11 04:04:34 +02:00
orbiter
51bf5c85b0 Renamed the transmission cloud to buffer in dispatcher since the name
'cloud' was a bad idea. Changed also the accumulation process for peer
targets so that every dht chunk is not assigned the set of redundant
targets but they are assigned to redundant targets individually. This
enhances the granularity of the target accumulation and should enhance
the efficiency of the process. Finally the dht protocol client was
enriched with the ability to remove the 'accept remote index' flag from
peers or remove peers completely if they do not answer at all.
2014-07-11 04:04:09 +02:00
Michael Peter Christen
a694b6a8fc another fix for unique field computation 2014-07-10 17:25:33 +02:00
Michael Peter Christen
fb3dd56b02 fix for processing of noindex flag in http header 2014-07-10 17:13:35 +02:00
Michael Peter Christen
b0d941626f fixed bugs in canonical, robots and title/description unique calculation 2014-07-10 15:40:38 +02:00
reger
d9472d043a cleanup older unused classes 2014-07-10 02:20:01 +02:00
reger
665e12f88e move startup time from old serverCore to switchboard (most used here)
to make servercore eventually obsolete.
2014-07-10 02:17:56 +02:00
reger
336425912a remove unused localSearchThread from SearchEvent 2014-07-10 02:14:03 +02:00
reger
32bd2a61c1 add local ip to AbstractRemoteHandler local hostname cache 2014-07-10 02:09:26 +02:00
Michael Peter Christen
f3a6b6e21e fix for bad URL decoding 2014-07-10 01:59:29 +02:00
Michael Peter Christen
1092e798a5 fixed double content postprocessing 2014-07-07 19:15:11 +02:00
Michael Peter Christen
aee5b108e5 added linkScraperParser, a parser which ignores the text like the
generic parser but extracts links like the htmlParser. This should be
used for ASCII documents without known text format annotation like
source code files or json documents. Probably also good for xml files
without known schema.
2014-07-07 13:37:17 +02:00
reger
2b8cc5832c fix seek error for 0 file size records file
by adding an extra check for file size = 0 in cleanlast()
- (http://mantis.tokeek.de/view.php?id=411)
2014-07-06 20:49:01 +02:00
reger
2ba394333f fix Crawler HostQueue release of stackfile
- close stackfile inputstream at end of ChunkIterator
This should solve startup delay while unfinished crawl jobs exist (maybe also too many open file situation)
2014-07-06 16:04:30 +02:00
reger
40133ba2d0 fix NPE in Condenser,
discovered by calling IndexControlRWI, "Word Deletion" with "for every resolvable and deleted URL reference"
2014-07-06 13:24:36 +02:00
orbiter
59160984cc timeline performance update 2014-07-03 13:06:29 +02:00
orbiter
54bea96e67 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-07-02 23:23:34 +02:00
Michael Peter Christen
841cc77391 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-07-02 14:35:02 +02:00
Michael Peter Christen
e09218129c remove check for local solr. This check was made during a time when Solr
was optional and another alternative metadata store was available. Since
that store is now removed, Solr is always available (internally or
externally)
2014-07-02 14:34:48 +02:00
orbiter
2073e69034 fix for long periods in timeline 2014-07-02 11:29:50 +02:00
reger
1f94df29e7 fix NPE in solr rss where snippet contains only the title text
and adjusted xslt, for solr snippets (&hl=true) to decode the xml encoded html <b> tag by adding disable-output-escaping
(still open item description may be double as dc: tag and rss.description tag)
2014-07-01 23:24:26 +02:00
Michael Peter Christen
09dcdb9b19 update to solr 4.9.0 2014-07-01 16:39:00 +02:00
Michael Peter Christen
1cd4b2e8be Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-07-01 16:06:12 +02:00
Michael Peter Christen
8c52f0651b refactoring of AccessTracker events & timeline fix 2014-07-01 16:06:01 +02:00
reger
431a5f9c4e added test case for TextSnippet,
removed obsolete/unused parameter and reference to MediaSnippet
2014-06-30 05:36:48 +02:00
Michael Peter Christen
5b94a257ce no timeout for large reference collections 2014-06-29 22:26:22 +02:00
Michael Peter Christen
f5b817bac4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-06-29 22:25:08 +02:00
reger
cb2c17d236 extract author and keywords in .doc and .ppt parser 2014-06-29 02:54:09 +02:00
reger
a5707cd2eb enable proper Author navigator
- author facet is based on omitted author_sxt field
- adjust to make author nav available if the author field exists, but keep using author_sxt to construct the facet (why!?)
- add check for querymodifier author in searchevent
2014-06-27 23:05:06 +02:00
Michael Peter Christen
74206a10c7 refactoring 2014-06-27 14:40:36 +02:00
orbiter
fec673c9d1 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-06-27 10:15:37 +02:00
orbiter
4a66af716d added apkParser stub (work in progress) 2014-06-27 10:15:01 +02:00
orbiter
c59da9fe7a added access tracker log reader stub 2014-06-27 10:14:36 +02:00
reger
2d67f29244 adjust mergeDocument after parsing to
- preserve charset and languages
- fix merge of author
2014-06-26 22:16:15 +02:00
Michael Peter Christen
0d29b972cc Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-06-26 13:02:56 +02:00
Michael Peter Christen
36e623d8bf enhanced metadata enrichment for media file type search:
- Web servers may now deliver YaCy-specific http header field with a
title and keywords. The new http header fields are:
X-YaCy-Media-Title - to be used for media (image, audio, video) titles
X-YaCy-Media-Keywords - to be used for media (image, audio, video)
keywords
- both fields are written to document fields title and keywords and are
searched also during image search.
- to make the usage of arbitrary http header fields (including these new
fields) possible in the /api/push_p.json servlet, a new POST argument is
also introduced to push http header fields. The new POST attribute is
named "responseHeader-X" (where X is the counter). It is allowed to use
this attribute as multi-attribute several times, each can be filled with
a http header line.
- see /api/push_p.html for examples
2014-06-26 13:02:35 +02:00
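A minimal sketch of how a client might prepare the "responseHeader-X" POST arguments from commit 36e623d8bf. The counter start (0) and the "Name: value" line format are assumptions made for illustration; /api/push_p.html is the authoritative source for the actual form layout. The class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side helper, not part of YaCy.
final class PushHeaderArgsSketch {

    // Maps each http header line to one "responseHeader-X" POST argument.
    // Assumption: X counts upwards from 0; check /api/push_p.html for the real convention.
    static Map<String, String> toPostArgs(List<String> headerLines) {
        Map<String, String> args = new LinkedHashMap<>();
        for (int i = 0; i < headerLines.size(); i++) {
            args.put("responseHeader-" + i, headerLines.get(i));
        }
        return args;
    }
}
```

For example, the two lines "X-YaCy-Media-Title: Sunset" and "X-YaCy-Media-Keywords: sea, sky" would become the arguments responseHeader-0 and responseHeader-1 under these assumptions.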
Michael Peter Christen
49886fab08 enhanced debugging 2014-06-26 12:57:01 +02:00
Michael Peter Christen
b893c42a0f bugfix for image search 2014-06-26 12:56:33 +02:00
Michael Peter Christen
c7995d3e2a increased fixed limit for http POST request sizes to 100MB 2014-06-26 11:58:07 +02:00
reger
7847a93558 fix AbstractParser.singleList not adding null strings
- prevents null titles in oo... parser  (as detected by ParserTest)
- correct ParserTest dc_description check (dc_description allowed to return 0 length array)
2014-06-26 02:56:45 +02:00
Michael Peter Christen
8acae852a0 write <em>-tagged texts also into the bold_txt field 2014-06-25 11:51:11 +02:00
reger
90c4576361 add a link to recrawl index entry to metadata html page
- to allow manually renewing the index content for this url (e.g. in case it is a remote search result with metadata only)
- use simply a  QuickCrawlLink_p javascript snippet (minimalistic 1st solution)
2014-06-21 04:21:29 +02:00
Michael Peter Christen
2626c8f6db using concurrency to do base64 encoding in file POST commands 2014-06-20 13:55:15 +02:00
Michael Peter Christen
e132689818 fixed and enhanced Base64 (en)coder (again) 2014-06-20 13:54:18 +02:00
Michael Peter Christen
2415e3db43 enhanced ASCII byte[] -> String conversion 2014-06-20 13:53:22 +02:00
Michael Peter Christen
4751ed974f enhanced base64 encoding 2014-06-19 12:11:02 +02:00
Michael Peter Christen
e949071160 removed superfluous date method 2014-06-19 12:10:42 +02:00
Michael Peter Christen
501d55cd35 removed superfluous assert 2014-06-19 12:10:12 +02:00
orbiter
0bbb5040b8 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-06-15 12:38:52 +02:00
orbiter
9d5d86cd03 Added filter query options to the ranking servlet /RankingSolr_p.html.
Filter queries are not actually related to ranking, but user requests
have pointed out that specific boost queries to move results to the end
of the result list are not sufficient. Such boost filters may be better
executed as actual filter and therefore such a filter can now be
statically applied to every search request. A typical use could be the
expression "http_unique_b:true AND www_unique_b:true" which uses the
recently introduced fields http_unique_b and www_unique_b which are true
only for one of the alternatives with/without http(s) and with/without
prefix 'www.' in host names.
2014-06-15 12:38:30 +02:00
Michael Peter Christen
d2151857f1 Added collection navigation:
The collection field (can be filled e.g. in Crawl Start) can be used to
add categories to YaCy index entries. The usage of that field was
restricted to solr searches and post argument filters as implemented in
commit f7571386a3.
This commit extends collections to a full navigation option in the
standard YaCy search interface. The field is not active by default but
can be activated easily in the /ConfigSearchPage_p.html servlet (just
check the 'Collection' facet field). Collections can now be used for (at
least) two purposes:
- to provide search tenants (through post argument collection)
- to provide self-made category navigation
Search requests may now have (independently from switched on or off
collection facet) a "collection:<collection-name>" modifier attached;
furthermore collection names may use disjunctions using the '|' pipe
symbol. For example, this is a valid search request:
www collection:user|proxy
2014-06-15 12:11:23 +02:00
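The "collection:<collection-name>" modifier from commit d2151857f1 could be separated from the search terms roughly as follows. This is a sketch of the syntax only, assuming whitespace-separated tokens; the class and method names are hypothetical and do not reflect how YaCy's query modifier parsing is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not YaCy code.
final class CollectionModifierSketch {

    private static final String PREFIX = "collection:";

    // Returns the collection names of all collection: modifiers; '|' separates alternatives.
    static List<String> collections(String query) {
        List<String> names = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.startsWith(PREFIX)) {
                for (String name : token.substring(PREFIX.length()).split("\\|")) {
                    if (!name.isEmpty()) names.add(name);
                }
            }
        }
        return names;
    }

    // Returns the query with the collection: modifiers removed.
    static String searchTerms(String query) {
        StringBuilder terms = new StringBuilder();
        for (String token : query.trim().split("\\s+")) {
            if (token.startsWith(PREFIX)) continue;
            if (terms.length() > 0) terms.append(' ');
            terms.append(token);
        }
        return terms.toString();
    }
}
```

With the example request from the commit message, "www collection:user|proxy", this yields the search term "www" and the two alternative collections "user" and "proxy".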
Michael Peter Christen
74c249288a added a push api to make it possible to upload files directly without
crawling to the YaCy indexer. Files are uploaded using POST multipart
requests; multiple file uploads are possible as well. Each file has
attached the file date and mime type which is used to get the right
parser for the submitted data. Also an url is submitted which is
assigned to the document.
The CrawlSwitchboard has a new option for default Crawl Profiles which
are assigned dynamically from the new push interface.
2014-06-12 18:10:07 +02:00
Michael Peter Christen
f13c8aa7dd re-implementation of file push option in the context of POST http
requests. The internal representation of post-arguments is String and
therefore not appropriate for byte[] object as submitted by file pushes.
Therefore all pushed files are encoded to base64 _after_ uploading with
an http form (you do not need to do that encoding yourself) to hand-over
the byte[] as string in the post argument.
Servlets which read such files must decode the base64 data to get the
original byte[] array.
This is considered a temporary solution for file uploads; a proper
implementation would need to consider all attributes as handed over as
Objects with either String or byte[] Object instances. This would be a
major code change and is not done at this time. The feature was
submitted to realize a feature that is pushed with the next commit.
2014-06-12 18:06:22 +02:00
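The hand-over described in commit f13c8aa7dd (byte[] encoded to base64 so it fits a String post argument, then decoded again in the servlet) can be illustrated with a round trip. This sketch uses java.util.Base64 from the Java standard library as a stand-in; it is not the Base64 coder that YaCy itself ships, and the class and method names are hypothetical.

```java
import java.util.Base64;

// Hypothetical sketch using the JDK coder, not YaCy's own Base64 implementation.
final class FilePushBase64Sketch {

    // Upload side: wrap the raw file bytes into a String-typed post argument.
    static String toPostArgument(byte[] fileContent) {
        return Base64.getEncoder().encodeToString(fileContent);
    }

    // Servlet side: recover the original byte[] from the post argument.
    static byte[] fromPostArgument(String postArgument) {
        return Base64.getDecoder().decode(postArgument);
    }
}
```

The round trip is lossless for arbitrary binary content, which is the property a plain byte[] to String conversion does not guarantee.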
Michael Peter Christen
ba6ffddefc refactoring 2014-06-12 05:23:26 +02:00
reger
982601017e crawling of filenames with + fails due to url decoding
modified UTF8.decodeURL to apply x-www-form-urlencoded ( space -> + ) to the query part of the url only.
2014-06-11 04:13:55 +02:00
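The behaviour described in commit 982601017e (treat '+' as an encoded space only in the query part) can be sketched like this. It is an illustration built on java.net.URLDecoder, not the actual UTF8.decodeURL implementation; the class name is hypothetical and fragment handling is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical sketch, not the YaCy UTF8.decodeURL code.
final class UrlDecodeSketch {

    static String decode(String url) {
        int q = url.indexOf('?');
        String path = q < 0 ? url : url.substring(0, q);
        String query = q < 0 ? null : url.substring(q + 1);
        try {
            // URLDecoder always maps '+' to space, so protect literal '+' in the path first.
            String decodedPath = URLDecoder.decode(path.replace("+", "%2B"), "UTF-8");
            if (query == null) return decodedPath;
            // In the query part x-www-form-urlencoded applies: '+' means space.
            return decodedPath + "?" + URLDecoder.decode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

A file name such as a+b.txt therefore keeps its plus sign, while q=x+y in the query part still decodes to "x y".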
reger
3b559e7846 optimize pdfParser
skip starting reader thread if all content already read
2014-06-10 04:25:20 +02:00
reger
09f73b790f fix pdfParser not closed warning from pdfbox
for encrypted pdf on exit due to missing permission to extract
2014-06-08 08:20:30 +02:00
reger
92d1604a31 Crawler hostbalancer does not delete finished queue files,
use alternative delete to fight the symptom (and fix deletion of host dirs on startup)
Root cause (which class holds a lock on .stack) not found.
http://mantis.tokeek.de/view.php?id=404
2014-06-05 02:13:08 +02:00
Michael Peter Christen
0c324d735c NPE fix for postprocessing without term index 2014-06-04 12:28:28 +02:00
Michael Peter Christen
922979aae1 added option to prefer http over https in unique-protocol ranking 2014-06-02 17:40:56 +02:00
Michael Peter Christen
b3b174e2b8 fixed webgraph postprocessing and status display in Crawler_p servlet 2014-06-02 15:06:38 +02:00
Michael Peter Christen
e6b28f5958 removed check on protocol for double content (user request) 2014-06-02 13:11:44 +02:00
reger
d8d318233e fix logging settings
- add missing .level
- remove obsolete jena settings
- set default level=INFO to prevent debug logging of not explicitly specified classes
2014-06-01 06:43:50 +02:00
Michael Peter Christen
698f053658 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-06-01 01:02:12 +02:00
Michael Peter Christen
f23c4142e0 added option to configure a custom user agent within allip networks 2014-06-01 01:02:03 +02:00
reger
8e233e2eb4 - fix typo in Message_p (defaultpath)
- use more existing switchboardconstants for getproperties
- replace deprecated call defaultservlet
2014-06-01 00:20:25 +02:00
orbiter
d7d38f9135 made number of open files in crawler configurable and increased default
maximum number of open files from 100 to 1000. This number can be
changed with the attribute crawler.onDemandLimit
2014-05-31 09:29:55 +02:00
Michael Peter Christen
8ad41a882c fixed several problems with postprocessing:
- unique-postprocessing was destroying results from other
postprocessings; removed cross-updates as they were not necessary
- unique-postprocessing did not restrict on same protocol
- inefficient concurrent update cache was redesigned completely
- increased limits for concurrent blocking queues to prevent early
time-out
2014-05-29 13:24:24 +02:00
reger
ca5437dd50 fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149
local files can be crawled (intranet mode); url parsing fixed according to RFC 1738 (for unix and windows)
for win like file:///c:/tmp   or file://localhost/c:/tmp
for linux like file:///tmp  or file://localhost/tmp
Host is ignored and path must be absolute
2014-05-28 03:01:34 +02:00
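The four file url forms listed in commit ca5437dd50 can be mapped to local paths as in the sketch below. It only illustrates the stated rules (host ignored, path must be absolute) with plain string handling; the class and method names are hypothetical and this is not the parser YaCy uses.

```java
// Hypothetical sketch, not YaCy code.
final class FileUrlSketch {

    // Extracts the local path from a file:// url; the host part is ignored.
    static String localPath(String url) {
        if (!url.startsWith("file://")) {
            throw new IllegalArgumentException("not a file url: " + url);
        }
        String rest = url.substring("file://".length()); // e.g. "/tmp" or "localhost/c:/tmp"
        int slash = rest.indexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("path must be absolute: " + url);
        }
        String path = rest.substring(slash); // drops an optional host such as "localhost"
        // Windows form "/c:/tmp": strip the leading slash in front of the drive letter.
        if (path.length() >= 3 && Character.isLetter(path.charAt(1)) && path.charAt(2) == ':') {
            return path.substring(1);
        }
        return path;
    }
}
```

Under these rules file:///c:/tmp and file://localhost/c:/tmp both resolve to c:/tmp, and file:///tmp and file://localhost/tmp both resolve to /tmp.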
Michael Peter Christen
ff5b3ac84d added new fields http_unique_b and www_unique_b which can be used for
ranking to prefer urls containing a www subdomain or using the https
protocol
2014-05-27 15:28:28 +02:00
sixcooler
5b1c4ef191 Monitoring and limit connection-count for Jetty 2014-05-22 22:16:39 +02:00
Michael Peter Christen
f0db501630 better handling of ranking parameters and new default values for date
navigation which is done using ranking in solr.
2014-05-22 03:01:07 +02:00
Michael Peter Christen
53948da7d0 tried to make last_modified recognition smarter 2014-05-22 00:28:51 +02:00