3582 Commits

Author SHA1 Message Date
reger
b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
otherwise use header.mime() differentiated in prev. commit.
2015-12-20 15:49:24 +01:00
reger
7a8c077838 fix HeaderFramework.mime() to strip charset parameter.
Differentiate mime() and getContentType() which gives the raw header field.
This improves parser detection if charsets are included in http content-type field.
2015-12-20 06:44:16 +01:00
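The difference between the raw Content-Type field and a normalized mime type can be illustrated with a small sketch. The names `MimeSketch` and `stripParameters` are hypothetical and are not taken from YaCy's `HeaderFramework`; the sketch only assumes that normalizing means dropping everything after the first `;` and lower-casing the rest.

```java
import java.util.Locale;

// Hypothetical helper, not YaCy code: shows one way to reduce a raw
// Content-Type header value to a bare mime type.
final class MimeSketch {

    // Returns the mime type without parameters such as "charset=...",
    // or null if no header value was supplied.
    static String stripParameters(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String mime = semicolon < 0 ? contentType : contentType.substring(0, semicolon);
        return mime.trim().toLowerCase(Locale.ROOT);
    }
}
```

With such a helper, a header value like `text/html; charset=UTF-8` is matched against parser mime types as plain `text/html`, while the raw field stays available for callers that need the charset.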
reger
b4b6910d60 fix (todo): correct doc.id of remote search result if no match with newly
calculated doc hash if different.
Testing showed that in some cases delivered url doesn't match the local
calculated hash. In this case replace doc.id (and host_id_s) with calculation
from url.
2015-12-20 02:10:49 +01:00
reger
dec3e6ad96 fix: adjust urlstub for mailto links
(skip protocol)
2015-12-19 20:13:33 +01:00
reger
cb83e65f89 drop returning document language "en" if unknown (fix todo)
which also harmonizes handling of query.modifier for rwi and solr results
(the result must match a given language filter)
2015-12-19 01:42:35 +01:00
reger
0c5548a7ff fix (todo) remove redundant holding of email link name property in parser document 2015-12-18 02:35:44 +01:00
reger
71c416f383 show mailto links in ViewFile.html linklist 2015-12-18 01:11:55 +01:00
reger
6b7c10cef8 fix dc:date in mediawikiimporter/document.writexml to use lastmodified 2015-12-17 02:53:10 +01:00
reger
14803d58cd let html scraper accept html5 <link rel="icon"> for favicon links 2015-12-17 00:36:08 +01:00
luc
b4cdacee76 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-12-16 03:26:06 +01:00
luc
ba0a293f5c Corrected another case of
"org.apache.lucene.store.AlreadyClosedException" occurring when
SearchEvent.cleanup() was called while committing local solr index.
2015-12-16 03:25:07 +01:00
reger
4d2b934487 prevent mailto links getting into parser result document's in/outbound link collection
by checking mailto scheme early.
- fix upper case mailto protocol assignment
- add test case for getProtocol
2015-12-16 03:01:17 +01:00
luc
8c4ab9c76b Added an option to limit the size of remote solr documents put to
local index. See mantis #626.
2015-12-16 02:20:03 +01:00
luc
a2c08402af Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-12-15 23:30:30 +01:00
luc
70595d05d0 Modified MemoryControl.main() test to properly end for better results
displaying.
2015-12-14 23:49:28 +01:00
sixcooler
1be67d9ab6 CachedSolrConnector was replaced by ConcurrentUpdateSolrConnector years
ago - time to let it go
Commented out unused table of cache-objects
2015-12-14 21:33:27 +01:00
reger
28b8bc290a fix use of NETWORK_SEARCHVERIFY for rwi verification
was not used to set the searchevent parameter (done in SearchEventCache.getEvent)
- remove unused corresponding QueryParams.filterfailurls param.
2015-12-13 20:01:49 +01:00
reger
020630efd8 remove unused network scanner parameter from queryparameter
Search event is not using networkscanner 
(removed filterscannerfail param always init to false)
2015-12-13 02:50:08 +01:00
luc
ad5586f8f6 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-12-08 03:35:36 +01:00
luc
8ebefa4233 Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was
failing. Looks like it was broken since Commit
b43811d38c
2015-12-08 03:34:03 +01:00
luc
7736ee5a42 Updated MediawikiImporter main() : display usage in console and stop
properly without calling System.exit
2015-12-08 03:30:51 +01:00
reger
cdb8f3b10d make current ranking score value avail. to search interface / api
Update the result score result field with the result queue ranking value to reflect
the actual calculated/used score,
for rwi & solr stack results.
(calc. etc. is unchanged, it's just that result entry carries the latest val
as api retrieves the number from it)
2015-12-08 03:17:32 +01:00
luc
27d11f8671 Fixed isSolrDump function : PushBackInputStream was not unread when
returning false (for example with a WikiMedia dump).
2015-12-07 21:58:36 +01:00
Michael Peter Christen
135a123a77 less logging in new language detection 2015-12-03 00:39:15 +01:00
Michael Peter Christen
ef8cd80593 fix for npe 2015-12-03 00:33:13 +01:00
reger
0347bfa71f Apply collection query constraint/modifier to rwi result stack.
Collection is not available in pure rwi entries (but in local solr metadata)
But if user wishes to filter by query constraint also rwi shall adhere to this
(even if only rwi entries with parsed or solr received metadata may fit)
2015-12-02 22:57:59 +01:00
luc
2a67d2ba6f Corrected error management for unsupported image formats, parsing
errors, and unavailable resources : avoid logging too many exceptions as
these errors easily occur when searching images.
2015-12-01 01:06:01 +01:00
Michael Peter Christen
d6e9834040 Merge branch 'master' of
https://github.com/Scarfmonster/yacy_search_server

# Conflicts:
#	.classpath
#	build.xml
2015-11-30 16:54:54 +01:00
Michael Peter Christen
d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	.classpath
2015-11-30 13:34:10 +01:00
reger
b5371ea8c1 read/init crawl queue in a thread
to speed-up YaCy start on large existing crawler queues
2015-11-29 05:19:39 +01:00
reger
1160b13172 remove unused md5 from ViewFile servlet params 2015-11-28 23:09:15 +01:00
reger
e163ea88f6 fix vsdParser (Visio) parser return statement
(final block un-necessary throw)
2015-11-28 02:43:38 +01:00
reger
b2c8bc0ae6 remove md5_s from default index fields
it is not assigned a value / not used
Due to above also excluded from transfer protocol.
2015-11-27 02:41:02 +01:00
luc
e40ae0943b - No max dimensions specified : render raw image data when source and
target image format are the same.
- Corrected scaling condition.
2015-11-26 09:30:43 +01:00
reger
90686a75a2 fix flux factor (additional crawl delay by access count) calculation 2015-11-25 01:34:41 +01:00
luc
4af27289e5 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-23 09:01:25 +01:00
reger
297fdb60d3 throw exception if crawler hostqueue can't create hostpath directory.
In rare cases hostname may not be a valid filesystem directory name,
which can't be created (e.g. containing '*' char). Prevent the crawl queue
from looping on this invalid entry by throwing a MalformedURLException.
2015-11-22 21:26:18 +01:00
luc
755efac17d Use same max file size when loading all resource bytes or opening stream
content
2015-11-20 19:35:39 +01:00
luc
bc6c79fc12 Corrected scaling function for non RGB images. 2015-11-20 14:35:36 +01:00
luc
1565559df8 Refactoring : extracted write InputStream method. 2015-11-20 09:42:24 +01:00
luc
f0478bb14d BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
imageio-bmp-3.2 library.

 - better BMP format flavours support
 - handle PNG encoded icons
 - handle transparency
 
Added some javadoc url references to .classpath
2015-11-20 09:38:16 +01:00
luc
07437986e7 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-20 08:15:24 +01:00
reger
97cc03ef6a start using a template for urlproxy header
It is included as iframe  /proxmsg/urlproxyheader.html
to allow full servlet functionality and flexibility to display some
index/meta data in future.
2015-11-20 01:49:56 +01:00
luc
f01d49c37a Process large or local file images dealing directly with content
InputStream.
2015-11-18 10:15:38 +01:00
luc
3c4c77099d If available, check content length before downloading. Check also
content length is not over Integer.MAX_VALUE.
2015-11-18 10:11:38 +01:00
luc
5bbb2e1730 Ensure resource is closed when reading a full file InputStream 2015-11-18 10:08:06 +01:00
luc
6291a57300 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-18 08:49:31 +01:00
reger
0d3c5b223e have psParser cleanup temp file 2015-11-17 23:45:29 +01:00
reger
7d0d19cb8e avoid File.deleteOnExit() on temp files
JVM registers each file in a list regardless of already deleted and never
cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir 
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
2015-11-17 22:27:07 +01:00
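The approach described above (one private temp directory, removed by code on shutdown, instead of one `deleteOnExit()` registration per file) can be sketched as follows. `TempDirSketch` and its methods are hypothetical names, not YaCy classes. The commit redirects the `java.io.tmpdir` property; this sketch instead passes the directory explicitly, which is an assumption made to keep the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch, not YaCy code: temp files live in one private
// directory that is deleted as a whole, so nothing is registered with
// File.deleteOnExit().
final class TempDirSketch {
    private final Path dir;

    TempDirSketch(String prefix) {
        try {
            // created below the default temp location (java.io.tmpdir)
            this.dir = Files.createTempDirectory(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Path dir() {
        return dir;
    }

    // Creates a temp file inside the private directory.
    Path newTempFile(String prefix, String suffix) {
        try {
            return Files.createTempFile(dir, prefix, suffix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes the directory and everything in it; meant to be called once
    // during shutdown. Children are removed before their parents.
    void deleteAll() {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would create one `TempDirSketch` at startup, hand out files through `newTempFile`, and call `deleteAll()` from its shutdown routine.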
luc
bfe51001e3 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-17 08:30:32 +01:00
reger
02e4489a23 set tmpfile.deleteOnExit by default,
to make sure files are removed on shutdown.
2015-11-16 21:37:45 +01:00
reger
2985baaa01 Exclude repetitive protocol part in tokenized url
used as description if none is avail. from parser.
2015-11-16 01:06:20 +01:00
reger
ca3d26a401 harmonize wordsintitle & CollectionSchema.title_words_val calculation,
remove obsolete partial init of wordreference from urimetadata
2015-11-15 06:06:37 +01:00
reger
52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
- by direct using Set vs. List
- remove not needed String[] getter
2015-11-13 01:48:28 +01:00
luc
49331dc523 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-12 08:21:56 +01:00
reger
47d70732f6 improve locale translator
- skip empty line
- robustness file section detection (space independent)
2015-11-11 00:57:51 +01:00
sixcooler
646afe9183 do not store subfield *_coordinate + make all num-fields being docvalues 2015-11-10 20:45:33 +01:00
sixcooler
194df613de not using 'location' as defaultfacetfield - since we removed it being
default.
2015-11-10 20:43:58 +01:00
sixcooler
d3b9349b6f simplification / speedup of GenerationMemoryStrategy 2015-11-10 20:39:46 +01:00
sixcooler
4a905ec134 fix to not let the AccessTracker-Log grow too much, but have enough data
to monitor.
(+gitignore-correction)
2015-11-10 20:27:17 +01:00
reger
20e18d79f8 harmonize document title for archive parsers 2015-11-10 01:29:13 +01:00
luc
f11b5e8309 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-09 08:13:12 +01:00
reger
112ae013f4 update bzip and bzip parser process,
to return one document for the file with combined parser results of the
containing file and registers it with supplied url and mime of the archive.
2015-11-07 19:13:18 +01:00
reger
e76a90837b update zip and tar parser process,
to return one document for the file with combined parser results of the
containing files.
2015-11-06 23:58:55 +01:00
luc
4e673ffc9a Ensure closing of InputStream even when an exception occurs. 2015-11-05 09:40:24 +01:00
luc
10696b53f7 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-05 08:26:52 +01:00
reger
8532565c7d optimize order of parsers to try
- start with a parser matching the remote supplied mime
2015-11-04 21:52:02 +01:00
reger
681889ae64 use current tar library for untar files
- remove old source copy
2015-11-04 02:57:00 +01:00
reger
5d71fc70e3 fix tarParser early exit on looping content
- adjust check of data available according to doc 
- return null on no recognized content (to not exit TextParser next parser try)
- use commons.compress directly
2015-11-03 22:14:14 +01:00
luc
bcc2e7cb5b Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-03 09:29:57 +01:00
reger
2fcf6f104c fix bzipParser recognition
- Bzip2Inputstream checks magic byte itself to identify bz2 (leave it in input)
- try to supply fitting mime for parsing bz2 content
2015-11-03 03:35:01 +01:00
luc
745e97a575 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-02 08:10:11 +01:00
reger
a60b1fb6c2 differentiate api call getLocalPort() from getConfigInt() 2015-10-31 23:09:03 +01:00
reger
11f3666660 increase use of pre.defined CATCHALL_QUERY string 2015-10-31 19:44:31 +01:00
reger
a58ee49307 Optimize internal imagequery focus on using content_type to select images
(in favor of url file extension)
2015-10-31 19:18:46 +01:00
luc
fc3294382e Updated javadocs for warning on target encoding format potential errors. 2015-10-30 16:19:05 +01:00
luc
aa70ff4ff6 Corrected images alpha channel rendering 2015-10-30 05:18:16 +01:00
reger
d223cf0ae4 adjust MediaWiki importer geo coordinate calculation
- allow lat/long 0.xxx
- south / west assignment
include test class
2015-10-26 21:19:35 +01:00
reger
2b775d5be6 fix typo in WikiCode coordinate calculation 2015-10-25 19:38:42 +01:00
reger
bbe9df2bb3 fix MediawikiImporter for bz2 dump
skip reading bz2 file magicbyte to identify bz2 format as inputstream reset would be required. Common compress reads and checks the magicbytes internally and throws ioexception if wrong, making preread obsolete.
2015-10-25 03:06:15 +01:00
reger
c6687dd560 fix a system.out to log.fine
in bmpParser
2015-10-25 00:26:45 +02:00
reger
e53c6bbd51 fix init of peer flags
(remove hiding of ssl flag)
2015-10-24 19:36:33 +02:00
Michael Peter Christen
ac034db8bc Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	htroot/js/highslide/highslide.js
#	source/net/yacy/document/ImageParser.java
2015-10-24 11:22:35 +08:00
reger
826f14f37f fix unnecessary set null of peer flags, causing reread
remove obsolete version flags
2015-10-22 02:35:58 +02:00
luc
5902ce032e Corrected NullPointerException case when ImageIO reader is not found for
image format.
2015-10-19 14:11:26 +02:00
reger
c6495a5b62 add a log entry on parsing ajax crawling scheme snapshot
(prev. commit 9252e36aeb)
2015-10-18 06:19:12 +02:00
reger
9252e36aeb implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content
see freshly deprecated https://developers.google.com/webmasters/ajax-crawling/
Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in header and parses supplied html snapshot instead of mostly empty ajax/scripted page.
Implementation supports also hash-bang urls (url with anchor starting with ! like  ...path#!hashfragment) but our crawler filters it
(use of hash-bang is controversially discussed and proposal is deprecated, makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time).
Quick - how does it work
- if metatag fragment with content "!" is found
   - htmlparser tries to get content of html snapshot (using a different url)
   - htmlparser returns 2 documents (original url and snapshot content - but using same original url)
- after parsing result documents are joined (and stored to index containing content also from snapshot page... as the original ajax page contains typically no parseable html content)
2015-10-18 05:51:01 +02:00
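The "different url" used for the html snapshot can be sketched from the (deprecated) Google AJAX crawling proposal linked above, which maps a `#!fragment` to an `_escaped_fragment_` query parameter. This is a reading of that proposal, not of YaCy's htmlParser; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not YaCy code: derives the snapshot url defined by
// the AJAX crawling proposal from a page url.
final class AjaxSnapshotUrlSketch {

    // "...path#!state" becomes "...path?_escaped_fragment_=state"; a page
    // that only announces <meta name="fragment" content="!"> gets an empty
    // parameter value.
    static String snapshotUrl(String url) {
        int bang = url.indexOf("#!");
        String base;
        String fragment;
        if (bang >= 0) {
            base = url.substring(0, bang);
            fragment = url.substring(bang + 2);
        } else {
            int hash = url.indexOf('#');
            base = hash >= 0 ? url.substring(0, hash) : url;
            fragment = "";
        }
        char separator = base.indexOf('?') >= 0 ? '&' : '?';
        return base + separator + "_escaped_fragment_=" + escape(fragment);
    }

    // Percent-encodes the bytes the proposal lists as special:
    // %00..20, %23, %25..26, %2B and %7F..FF.
    private static String escape(String fragment) {
        StringBuilder sb = new StringBuilder();
        for (byte raw : fragment.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            boolean special = b <= 0x20 || b == 0x23 || b == 0x25
                    || b == 0x26 || b == 0x2B || b >= 0x7F;
            if (special) {
                sb.append('%').append(String.format("%02X", b));
            } else {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }
}
```

A parser following this scheme would load the snapshot url, parse it as ordinary html, and register the result under the original url.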
Michael Peter Christen
d1ae999ef9 replaced HashMap with LinkedHashMap to preserve the object order 2015-10-16 23:30:51 +02:00
Michael Peter Christen
7d075a1d76 added log lines 2015-10-16 23:30:04 +02:00
Michael Peter Christen
092dac086e Merge branch 'master' of https://github.com/luccioman/yacy_search_server 2015-10-16 23:22:30 +02:00
reger
7a64bebb86 init Recrawl job chunk size to max crawl loader during job start, to use some system preferences
and allow injection of recrawl urls before queue is empty
During recrawl the balancer hangs on the very last urls often on hosts with huge delay time,
by allowing injection earlier progress is more balanced. Max number of injected crawl urls by recrawl job is 2 * max loader.
2015-10-16 03:05:39 +02:00
luc
d6522fa4a2 Integrated haraldk/TwelveMonkeys library to first add TIF image format
support.
2015-10-15 10:06:51 +02:00
Michael Peter Christen
9244694e64 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-10-14 15:17:23 +02:00
Michael Peter Christen
151ccd50a9 fix for image size field values (must be multi-valued) 2015-10-14 15:16:16 +02:00
reger
c9937973e3 unescape MultiProtocolURL getAttributes() return values.
use getAttributes() to get query parameters as clear text (w/o url encoding)
use getSearchpartMap() to get in internal format (url encoded)

fix for http://mantis.tokeek.de/view.php?id=606
2015-10-13 02:43:18 +02:00
reger
78e8c6f3e5 refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES
not used for genericImageParser
2015-10-11 01:23:52 +02:00
reger
d54c5d310a add links with image extension not automatically to image links.
With the wide spread use e.g. of Wikimedia the url file extension of links with image extension often point to html.
2015-10-10 23:49:58 +02:00
reger
851e8f6c8a check jpeg file signature in genericImageParser
to fail early without further object allocation if source is not a jpeg.
2015-10-05 01:58:31 +02:00
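A signature check of this kind can be very small. The sketch below is hypothetical (it is not the code in genericImageParser) and relies only on the fact that a JPEG stream starts with the SOI marker bytes `0xFF 0xD8`.

```java
// Hypothetical sketch, not YaCy code: rejects non-JPEG input by looking
// at the first two bytes before any decoder objects are allocated.
final class JpegSignatureSketch {

    // True if the data starts with the JPEG start-of-image marker FF D8.
    static boolean looksLikeJpeg(byte[] head) {
        return head != null
                && head.length >= 2
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xD8;
    }
}
```

A parser could call `looksLikeJpeg` on the first bytes of the source and fail immediately when it returns false.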
reger
fb75fea446 use recrawljob w/o sort results by date
This is a workaround for existing index (not fully reindexed) since intro of schema with docvalues
to prevent solr exception causing recrawljob to fail with
org.apache.solr.core.SolrCore java.lang.IllegalStateException: unexpected docvalues type NONE for field 'load_date_dt' (expected=NUMERIC). Use UninvertingReader or index with docvalues.
2015-10-04 05:43:40 +02:00
reger
43c27aa550 upd to solr/lucene 5.3.1 2015-10-03 23:20:33 +02:00
reger
688f7b2a5c allow/display svg images in image results previews
svg is not supported by awt but by most browser. Image content is delivered as received (without size adjustment)
2015-10-02 01:48:48 +02:00
reger
d5330391de remove some unused var allocation in parser 2015-10-01 23:11:58 +02:00
Michael Peter Christen
3d7dd9d3aa follow-up to latest commit: also flush the search cache if all crawls
had been terminated.
2015-10-01 13:21:28 +02:00
Michael Peter Christen
c737ff235d in case that the include_string contains several entries including
1-char tokens and also more-than-1-char tokens, then remove the 1-char
tokens to prevent that we are too strict. This will make it possible to
be a bit more fuzzy in the search where it is appropriate.
2015-10-01 13:09:33 +02:00
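The rule described above can be sketched as a small filter. `TokenFilterSketch` is a hypothetical name and the token list stands in for whatever the include_string is split into; the sketch assumes that 1-char tokens are dropped only when at least one longer token is present.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not YaCy code: relaxes a token list by removing
// 1-char tokens when longer tokens are also present.
final class TokenFilterSketch {

    static List<String> dropSingleCharTokens(List<String> tokens) {
        boolean hasLonger = false;
        for (String token : tokens) {
            if (token.length() > 1) {
                hasLonger = true;
                break;
            }
        }
        // Only 1-char tokens: keep them all, they are the whole query.
        if (!hasLonger) {
            return new ArrayList<>(tokens);
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() > 1) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

A query made up only of 1-char tokens is left unchanged, so such searches still work.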
Michael Peter Christen
8e555d79a3 add also 1-character tokens to the token list because that could be also
searched for. A full-string search for a filename may fail if those
1-char tokens are omitted
2015-10-01 13:03:22 +02:00
reger
7c82cd4415 add a end condition to svgParser for wrong content
(if parser chosen just by file extension)
2015-09-29 22:57:33 +02:00
reger
356d4d1301 remove rdfParser from init (current function identical with genericParser) 2015-09-26 17:30:34 +02:00
reger
c647d899e3 add svgParser to parse metadata from svg images
Reads document level included title and description and skips the graphic content to save bandwidth.
svg metadata element is not interpreted
- remove rdfParser from init (current function identical with genericParser)
2015-09-26 17:27:33 +02:00
reger
bad34804fe optimize parseInt for <img> tag attribute parsing
Performs better than using Numberformat.parse or parseInt(substring())
2015-09-26 15:42:23 +02:00
Michael Peter Christen
6ebc2451a9 Merge pull request #14 from luccioman/master
Translator refactoring : no more regular expression processing
2015-09-24 13:50:23 +02:00
reger
2f51baff4f check for loading error (includes unsupported formats)
to prevent blank thumbnail display in image search because of not handled source which don't load on click.
Now the cross icon indicates the problem (including not supported format)
2015-09-24 01:58:19 +02:00
luc
5578886f6f Merge branch 'master' of https://github.com/luccioman/yacy_search_server.git 2015-09-23 21:04:20 +02:00
luc
c38d6c1f37 Correction for mantis 535: inurl: parameter doesn't work on URLs with
upper-case letters
2015-09-23 21:01:51 +02:00
reger
52e3eb4ce8 harmonize/correct assignment to Ymarkmeta.mime
replace use of deprecated
2015-09-23 00:13:10 +02:00
Michael Peter Christen
87f358058e Fix for index entries which have id's not computed as hash from the url.
This makes it possible to operate with outside-computed url hashes in
enterprise environments not using the built-in crawler from YaCy.
2015-09-22 11:56:17 +02:00
reger
3f2b8ab5e5 optionally include mime in p2p url exchange string
if doctype decodes to ambiguous mime and default conversion is not equal to original
2015-09-22 00:12:31 +02:00
reger
a3195d78ae add Portuguese month names to date recognition 2015-09-20 23:28:42 +02:00
reger
d2cc11ea8f fix html parser taking <style> content as text.
Noticed some result description contain css content from style tag.
Added <style> to tag list to scrape its content not as text
+ test case included
2015-09-19 05:30:55 +02:00
Michael Peter Christen
5f706797cb patch for a bug inside of solr since solr 5.0 when using a boost
function with a numeric date field:
"unexpected docvalues type NUMERIC for field 'last_modified' (expected
one of [SORTED, SORTED_SET]). Use UninvertingReader or index with
docvalues."
This is a well-known bug inside solr which prevents the 'sort
by date' in the YaCy search interface from being used. Without this patch no
results at all are displayed (since the exception prevents that). Now
there is at least a result but it is not ordered properly.
2015-09-18 02:25:44 +02:00
reger
7889fc2389 Hack to prevent Solr issue on partial update on a document containing multivalued date field
(regardless if these fields part of update).
Switch partial update option off in postprocessing if schema contains *_dts (multivalued date field).
see http://mantis.tokeek.de/view.php?id=601
2015-09-13 20:23:15 +02:00
reger
b4cbdea1e7 adapt SolrServerConnector.add to handle error on partial update input document.
In case of error we deleted the original document and added the new doc to the index.
This is not valid for partial update documents (which contain only a subset of the fields).
Remove the "delete" error handling step.
2015-09-13 20:19:50 +02:00
reger
98ab655917 on reindex delete index document with invalid url
if discovered
2015-09-12 23:06:13 +02:00
reger
1e8369e18b use a parsed date in Document.toString 2015-09-12 22:00:40 +02:00
luccioman
199b2ce52d Translator refactoring : to simplify locale files writing, process keys
as simple string and no more as regular expressions.
Updated all locale files to adapt to refactored Translator : removed
useless escaped characters and did minor corrections.
Performed minor syntax corrections on some html source files.
Added an util to translate all html source files with all locales
without launching full YaCy application.
Corrected main arguments parsing on other translation utils.
2015-09-11 17:20:11 +02:00
luccioman
4dd9c0d5d9 Merge from main repository 2015-09-08 08:54:48 +02:00
reger
3428b6f13b improve filtering by filetype navigator.
The used url-filter for filetype doesn't require ".ext" resulting in too many matches,
add a sort-out filter for RWI results.
2015-09-07 02:36:22 +02:00
reger
e37a4f0b3d prevent metadata records in index w/o valid url
by throwing MalformedURL exception on URIMetadataNode creation
2015-09-06 22:19:05 +02:00
reger
41c4eade51 extract modification date from vCard (vcfParser) 2015-09-06 04:28:27 +02:00
reger
8768896975 extract lastmodified from openoffice doc
set lastmod date in office document parsers
2015-09-06 00:04:54 +02:00
Michael Peter Christen
c40c302748 when many crawl queues are generated, this NPE can occur; probably
caused as concurrency issue:
W 2015/09/05 14:09:10 ConcurrentLog java.lang.NullPointerException
java.lang.NullPointerException
	at java.util.TreeMap.rotateRight(TreeMap.java:2239)
	at java.util.TreeMap.fixAfterInsertion(TreeMap.java:2271)
	at java.util.TreeMap.put(TreeMap.java:582)
	at net.yacy.kelondro.table.Table.<init>(Table.java:235)
	at net.yacy.crawler.HostQueue.openStack(HostQueue.java:229)
	at net.yacy.crawler.HostQueue.getStack(HostQueue.java:204)
	at net.yacy.crawler.HostQueue.push(HostQueue.java:397)
	at net.yacy.crawler.HostBalancer.push(HostBalancer.java:237)
	at net.yacy.crawler.data.NoticedURL.push(NoticedURL.java:184)
	at net.yacy.crawler.CrawlStacker.stackCrawl(CrawlStacker.java:355)
	at net.yacy.crawler.CrawlStacker.job(CrawlStacker.java:134)
	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at net.yacy.kelondro.workflow.InstantBlockingThread.job(InstantBlockingThread.java:101)
	at net.yacy.kelondro.workflow.AbstractBlockingThread.run(AbstractBlockingThread.java:82)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2015-09-05 14:12:17 +02:00
reger
367fe388b9 fix exception throw after sendError in DefaultServlet
- reduce debug exception logs in crawler
2015-09-05 01:57:30 +02:00
luccioman
9752bd5f88 Added utils to help translation without launching full YaCy application:
- translate all source files with a locale
- list all non translated files with a locale
2015-09-04 13:44:44 +02:00
luccioman
2f0f0180e2 Added a function to list files recursively. 2015-09-04 13:42:57 +02:00
luccioman
7e4c1d2282 Translator refactoring :
- deleted useless new StringBuilder allocation
- use of a new reusable FileNameFilter
- added javadoc
2015-09-04 13:42:10 +02:00
reger
802ccaead6 fix init of error cache, use latest faildates => load_date_dt 2015-09-02 02:36:31 +02:00
reger
dba7f15073 apply same size constraint on result image from doc
as for linked images
see 19f1308bf0
2015-09-01 23:22:48 +02:00
reger
4cf875336c complete TODO: getFileExtension handle dot in query part
+ testcase
2015-08-31 23:28:03 +02:00
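The pitfall addressed here is that a dot inside the query part must not be mistaken for the start of a file extension. A minimal sketch follows, with the hypothetical names `FileExtensionSketch` and `fileExtension`; it is not YaCy's `getFileExtension`, and it assumes the input is the path plus optional query and fragment, without scheme and host.

```java
import java.util.Locale;

// Hypothetical sketch, not YaCy code: extracts a file extension from
// "path?query#fragment" while ignoring dots outside the file name.
final class FileExtensionSketch {

    static String fileExtension(String pathAndQuery) {
        // Cut off query and fragment first, whichever starts earlier.
        int end = pathAndQuery.length();
        int query = pathAndQuery.indexOf('?');
        if (query >= 0) {
            end = query;
        }
        int fragment = pathAndQuery.indexOf('#');
        if (fragment >= 0 && fragment < end) {
            end = fragment;
        }
        String path = pathAndQuery.substring(0, end);
        // Only the last path segment can carry the extension.
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

For `/docs/file.PDF?v=1.2` this yields `pdf`, and for `/search?file=a.b` it yields an empty string because the only dot is in the query part.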
sixcooler
87e4abe393 fight the fieldcache by using DocValues: in Solr-5.x the fieldcache has
moved and was not cleared anymore. This results in a huge fieldcache.
(http://lucene.apache.org/#highlights-of-the-lucene-release-include
https://issues.apache.org/jira/browse/LUCENE-5666)
Here I try to use DocValues where it is possible.
For this I used the Api-Scheme as new basis for the Solr-Schema.
This needs at least a complete optimization of the Solr-Index to get a
smaller FieldCache.
Everything that is indexed with these settings will not use the
Fieldcache at all.
2015-08-31 20:24:41 +02:00
reger
eaf0e8ff2c start recording/indexing pixel size for image document
as for linked images
2015-08-31 01:58:36 +02:00
reger
c33229fc0c check mime prior to ext for metadata modification for images 2015-08-30 23:02:19 +02:00
reger
19f1308bf0 enforce the result images limit to > 16x16px
for linked images
http://mantis.tokeek.de/view.php?id=594
2015-08-30 02:19:52 +02:00
reger
0e4ba0360b fix NPE on .yacyh result url of disconnected peer
(cleanup yacyshare remaining)
2015-08-25 23:26:17 +02:00
reger
7ed812a2bf log missing seed.port
in favour of exception to prevent repeating throws
2015-08-25 02:19:00 +02:00
reger
206883f80d fix: Preserve protocol in url proxy
to connect to http/https. Display warning if https target is viewed over http
2015-08-25 01:16:41 +02:00
reger
f7b0b3b7b3 avoid runtime exception by earlier testing for seed.ip=null 2015-08-23 23:01:20 +02:00
Michael Peter Christen
906b5fd742 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-08-11 00:42:46 +02:00
Michael Peter Christen
8f90767889 fix for filesystem crawl 2015-08-11 00:42:26 +02:00
sixcooler
a3dd4be749 added / corrected charset to be 1.7 compatible.
@Orbiter: please check if this is ok for you
2015-08-10 20:53:20 +02:00
Michael Peter Christen
8028410ab7 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-08-10 14:27:53 +02:00
Michael Peter Christen
df3314ac1a added a new facet type based on a probabilistic classifier using
bayesian filters. This can be used to classify documents during
indexing-time using a predefined bayesian filter.

New wordings:
- a context is a class where different categories are possible. The
context name is equal to a facet name.
- a category is a facet type within a facet navigation. Each context
must have several categories, at least one custom name (things you want
to discover) and one with the exact name "negative".

To use this, you must do:
- for each context, you must create a directory within
DATA/CLASSIFICATION with the name of the context (the facet name)
- within each context directory, you must create text files with one
document each per line for every category. One of these categories MUST
have the name 'negative.txt'.

Then, each new document is classified to match within one of the given
categories for each context.
2015-08-10 14:27:44 +02:00
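The directory rules above (one directory per context under DATA/CLASSIFICATION, one text file per category, one of them named `negative.txt`) can be checked with a short sketch. `ClassificationLayoutSketch` and its methods are hypothetical and are not part of YaCy; the "at least two categories" rule is read from the wording "at least one custom name ... and one with the exact name negative".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not YaCy code: validates the category files of
// one context directory, e.g. DATA/CLASSIFICATION/<context>/.
final class ClassificationLayoutSketch {

    // Maps file names like "sports.txt" to category names like "sports";
    // files without the .txt suffix are ignored.
    static Set<String> categoryNames(Collection<String> fileNames) {
        Set<String> names = new TreeSet<>();
        for (String fileName : fileNames) {
            if (fileName.endsWith(".txt") && fileName.length() > 4) {
                names.add(fileName.substring(0, fileName.length() - 4));
            }
        }
        return names;
    }

    // A context is usable if it has the mandatory "negative" category
    // and at least one custom category.
    static boolean isUsable(Collection<String> fileNames) {
        Set<String> names = categoryNames(fileNames);
        return names.contains("negative") && names.size() >= 2;
    }

    // Convenience wrapper that lists the files of a context directory.
    static boolean isUsable(Path contextDir) {
        Set<String> fileNames = new TreeSet<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(contextDir)) {
            for (Path file : files) {
                fileNames.add(file.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return isUsable(fileNames);
    }
}
```

For a context directory containing `sports.txt` and `negative.txt`, `isUsable` returns true; without `negative.txt` it returns false.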