11954 Commits

Author SHA1 Message Date
reger
78e8c6f3e5 refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES
not used for genericImageParser
2015-10-11 01:23:52 +02:00
reger
d54c5d310a do not automatically add links with an image extension to image links.
With the widespread use of e.g. Wikimedia, links whose URL has an image file extension often point to HTML.
2015-10-10 23:49:58 +02:00
reger
5744342fec handle image preview for URLs with an empty file extension
fix of commit 688f7b2a5c
2015-10-06 04:13:04 +02:00
reger
851e8f6c8a check jpeg file signature in genericImageParser
to fail early without further object allocation if source is not a jpeg.
2015-10-05 01:58:31 +02:00
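A signature check of the kind described in 851e8f6c8a can be sketched as follows. This is a minimal illustration, not the code from genericImageParser: the class and method names are hypothetical, and it only assumes the well-known fact that JPEG data starts with the SOI marker 0xFF 0xD8 followed by the 0xFF byte of the next marker.

```java
final class JpegSignatureCheck {

    // Hypothetical helper: returns true if the first bytes look like a JPEG stream.
    // JPEG data begins with SOI (0xFF 0xD8); the next marker starts with 0xFF.
    static boolean looksLikeJpeg(byte[] head) {
        return head != null
                && head.length >= 3
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xD8
                && (head[2] & 0xFF) == 0xFF;
    }
}
```

A caller could read the first three bytes of the source and return early when the check fails, before any decoder objects are allocated.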
reger
fb75fea446 run recrawljob without sorting results by date
This is a workaround for an existing index (not fully reindexed) since the introduction of the schema with docvalues,
to prevent a Solr exception that causes recrawljob to fail with
org.apache.solr.core.SolrCore java.lang.IllegalStateException: unexpected docvalues type NONE for field 'load_date_dt' (expected=NUMERIC). Use UninvertingReader or index with docvalues.
2015-10-04 05:43:40 +02:00
reger
43c27aa550 upd to solr/lucene 5.3.1 2015-10-03 23:20:33 +02:00
reger
fd5a1dc297 upd to poi-3.13 2015-10-03 21:43:41 +02:00
reger
688f7b2a5c allow/display svg images in image results previews
SVG is not supported by AWT but is supported by most browsers. Image content is delivered as received (without size adjustment)
2015-10-02 01:48:48 +02:00
reger
d5330391de remove some unused var allocation in parser 2015-10-01 23:11:58 +02:00
Michael Peter Christen
3d7dd9d3aa follow-up to latest commit: also flush the search cache if all crawls
had been terminated.
2015-10-01 13:21:28 +02:00
Michael Peter Christen
225200194a every time a crawl is started, the user expects a different search
result behaviour. This requires that the search cache is flushed for
each crawl start. TODO: this should also be done if a crawl is
terminated.
2015-10-01 13:18:44 +02:00
Michael Peter Christen
c737ff235d if the include_string contains several entries, including both
1-char tokens and longer tokens, then remove the 1-char
tokens so that we are not too strict. This makes it possible to
be a bit more fuzzy in the search where it is appropriate.
2015-10-01 13:09:33 +02:00
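The rule described in c737ff235d can be illustrated with a short sketch. The class and method names are hypothetical and the token list is assumed to be already split; this is not the code from the commit.

```java
import java.util.ArrayList;
import java.util.List;

final class IncludeTokenFilter {

    // Hypothetical helper: if the list mixes 1-char tokens with longer ones,
    // keep only the longer ones; otherwise return the tokens unchanged.
    static List<String> dropSingleCharTokens(List<String> tokens) {
        boolean hasLongToken = false;
        for (String token : tokens) {
            if (token.length() > 1) {
                hasLongToken = true;
                break;
            }
        }
        if (!hasLongToken) {
            // only 1-char tokens (or nothing): removing them would empty the filter
            return new ArrayList<>(tokens);
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() > 1) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

A list made up only of 1-char tokens is left alone, so a query that consists of single characters still filters on something.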
Michael Peter Christen
8e555d79a3 also add 1-character tokens to the token list because those could also be
searched for. A full-string search for a filename may fail if those
1-char tokens are omitted
2015-10-01 13:03:22 +02:00
reger
7c82cd4415 add an end condition to svgParser for wrong content
(if the parser was chosen just by file extension)
2015-09-29 22:57:33 +02:00
reger
b92d81b073 remove double caching of inputstream in ViewImage 2015-09-27 03:24:28 +02:00
reger
c7c5e2dff9 fix old/obsolete solr dependency to stax
delete obsolete jar
2015-09-27 00:17:42 +02:00
reger
beed1c417e Add report profile with OWASP Dependency-Check to maven pom 2015-09-26 19:58:15 +02:00
reger
356d4d1301 remove rdfParser from init (current function identical with genericParser) 2015-09-26 17:30:34 +02:00
reger
c647d899e3 add svgParser to parse metadata from svg images
Reads document level included title and description and skips the graphic content to save bandwidth.
svg metadata element is not interpreted
- remove rdfParser from init (current function identical with genericParser)
2015-09-26 17:27:33 +02:00
reger
bad34804fe optimize parseInt for <img> tag attribute parsing
Performs better than using NumberFormat.parse or parseInt(substring())
2015-09-26 15:42:23 +02:00
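The commit message for bad34804fe does not show the implementation. One common way to avoid the cost of substring() and NumberFormat when reading values such as width="120px" is to accumulate digits directly from the character range. The sketch below is an assumption about the general technique, with hypothetical names, and is not the YaCy code.

```java
final class AttributeIntParser {

    // Hypothetical helper: parses the leading decimal digits of s[start, end)
    // without creating a substring. Returns -1 if there are no leading digits
    // or the value would overflow an int.
    static int parseLeadingInt(CharSequence s, int start, int end) {
        int i = start;
        while (i < end && Character.isWhitespace(s.charAt(i))) {
            i++; // skip leading whitespace
        }
        int value = 0;
        boolean sawDigit = false;
        for (; i < end; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                break; // stop at the first non-digit, e.g. the "px" suffix
            }
            int digit = c - '0';
            if (value > (Integer.MAX_VALUE - digit) / 10) {
                return -1; // next step would overflow
            }
            value = value * 10 + digit;
            sawDigit = true;
        }
        return sawDigit ? value : -1;
    }
}
```

The -1 sentinel is a choice made for this sketch; negative sizes are not meaningful for image dimensions, so the caller can treat it as "no usable value".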
Michael Peter Christen
3c31bf845f fix for latest merge 2015-09-24 13:53:54 +02:00
Michael Peter Christen
6ebc2451a9 Merge pull request #14 from luccioman/master
Translator refactoring : no more regular expression processing
2015-09-24 13:50:23 +02:00
reger
2f51baff4f check for loading errors (including unsupported formats)
to prevent blank thumbnails in image search caused by unhandled sources which don't load on click.
Now the cross icon indicates the problem (including unsupported formats)
2015-09-24 01:58:19 +02:00
luc
5578886f6f Merge branch 'master' of https://github.com/luccioman/yacy_search_server.git 2015-09-23 21:04:20 +02:00
luc
c38d6c1f37 Correction for mantis 535: inurl: parameter doesn't work on URLs with
upper-case letters
2015-09-23 21:01:51 +02:00
reger
52e3eb4ce8 harmonize/correct assignment to Ymarkmeta.mime
replace use of deprecated
2015-09-23 00:13:10 +02:00
Michael Peter Christen
87f358058e Fix for index entries which have IDs not computed as a hash from the url.
This makes it possible to operate with outside-computed url hashes in
enterprise environments not using the built-in crawler from YaCy.
2015-09-22 11:56:17 +02:00
reger
2951c9fc40 remove unused check for known fileextension in searchtrailer
(check is done on add to filetype-nav)
2015-09-22 03:52:15 +02:00
reger
3f2b8ab5e5 optionally include mime in p2p url exchange string
if doctype decodes to ambiguous mime and default conversion is not equal to original
2015-09-22 00:12:31 +02:00
reger
a3195d78ae add Portuguese month names to date recognition 2015-09-20 23:28:42 +02:00
reger
d2cc11ea8f fix html parser taking <style> content as text.
Noticed that some result descriptions contain css content from the style tag.
Added <style> to the tag list so that its content is not scraped as text
+ test case included
2015-09-19 05:30:55 +02:00
Michael Peter Christen
5f706797cb patch for a bug inside of solr since solr 5.0 when using a boost
function with a numeric date field:
"unexpected docvalues type NUMERIC for field 'last_modified' (expected
one of [SORTED, SORTED_SET]). Use UninvertingReader or index with
docvalues."
This is a well-known bug inside solr which currently prevents the 'sort
by date' in the YaCy search interface from being used. Without this patch no
results at all are displayed (since the exception prevents that). Now
there is at least a result but it is not ordered properly.
2015-09-18 02:25:44 +02:00
reger
733d725dec limit css scrolling to result/content window x
from pull request #10
2015-09-15 02:11:30 +02:00
Burkhard
4c38083a11 Merge pull request #10 from Raegdan/raegdan-css-layout-fix
Fixed CSS scrolling
2015-09-15 02:09:17 +02:00
reger
7889fc2389 Hack to prevent a Solr issue on partial update of a document containing a multivalued date field
(regardless of whether these fields are part of the update).
Switch partial update option off in postprocessing if schema contains *_dts (multivalued date field).
see http://mantis.tokeek.de/view.php?id=601
2015-09-13 20:23:15 +02:00
reger
b4cbdea1e7 adapt SolrServerConnector.add to handle error on partial update input document.
In case of error we deleted the original document and added the new doc to the index.
This is not valid for partial update documents (which contain only a subset of the fields).
Remove the "delete" error handling step.
2015-09-13 20:19:50 +02:00
reger
e594130aec add test case for partial update - to discover effect on YaCy for update of documents with multivalued date fields (like dates_in_content_dts)
current result: loss of fields/information in index document, see EmbeddedSolrConnectorTest.testUdate_withMultivaluedDateField()
2015-09-13 06:02:07 +02:00
reger
98ab655917 on reindex delete index document with invalid url
if discovered
2015-09-12 23:06:13 +02:00
reger
1e8369e18b use a parsed date in Document.toString 2015-09-12 22:00:40 +02:00
reger
d5da9e5a38 fix test method (add throw for URIMetadataNode) 2015-09-12 20:07:43 +02:00
luccioman
a7179138ce Returned again to main repository location : does anyone want to
consider mantis 597 ?  (http://mantis.tokeek.de/view.php?id=597)
2015-09-11 17:23:59 +02:00
luccioman
199b2ce52d Translator refactoring : to simplify writing locale files, process keys
as simple strings and no longer as regular expressions.
Updated all locale files to adapt to the refactored Translator : removed
useless escaped characters and made minor corrections.
Performed minor syntax corrections on some html source files.
Added a util to translate all html source files with all locales
without launching the full YaCy application.
Corrected main arguments parsing in the other translation utils.
2015-09-11 17:20:11 +02:00
luccioman
711183bd72 Merge branch 'master' of ssh://git@github.com/yacy/yacy_search_server 2015-09-11 11:16:19 +02:00
luccioman
4dd9c0d5d9 Merge from main repository 2015-09-08 08:54:48 +02:00
reger
3428b6f13b improve filtering by filetype navigator.
The used url-filter for filetype doesn't require ".ext" resulting in too many matches,
add a sort-out filter for RWI results.
2015-09-07 02:36:22 +02:00
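The stricter match that 3428b6f13b describes (requiring ".ext" instead of just the extension text) can be sketched like this. The class and method names are hypothetical and the check is applied to a plain URL path; the actual sort-out filter for RWI results is not shown in the commit message.

```java
import java.util.Locale;

final class FileTypeMatch {

    // Hypothetical helper: true only if the path ends with "." + ext,
    // compared case-insensitively. A bare suffix match on ext alone
    // would also accept paths such as "/docs/mypdf".
    static boolean hasExtension(String path, String ext) {
        if (path == null || ext == null || ext.isEmpty()) {
            return false;
        }
        String lowerPath = path.toLowerCase(Locale.ROOT);
        return lowerPath.endsWith("." + ext.toLowerCase(Locale.ROOT));
    }
}
```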
reger
e37a4f0b3d prevent metadata records in index w/o valid url
by throwing MalformedURL exception on URIMetadataNode creation
2015-09-06 22:19:05 +02:00
reger
41c4eade51 extract modification date from vCard (vcfParser) 2015-09-06 04:28:27 +02:00
reger
8768896975 extract lastmodified from openoffice doc
set lastmod date in office document parsers
2015-09-06 00:04:54 +02:00
Michael Peter Christen
c40c302748 when many crawl queues are generated, this NPE can occur; probably
caused by a concurrency issue:
W 2015/09/05 14:09:10 ConcurrentLog java.lang.NullPointerException
java.lang.NullPointerException
	at java.util.TreeMap.rotateRight(TreeMap.java:2239)
	at java.util.TreeMap.fixAfterInsertion(TreeMap.java:2271)
	at java.util.TreeMap.put(TreeMap.java:582)
	at net.yacy.kelondro.table.Table.<init>(Table.java:235)
	at net.yacy.crawler.HostQueue.openStack(HostQueue.java:229)
	at net.yacy.crawler.HostQueue.getStack(HostQueue.java:204)
	at net.yacy.crawler.HostQueue.push(HostQueue.java:397)
	at net.yacy.crawler.HostBalancer.push(HostBalancer.java:237)
	at net.yacy.crawler.data.NoticedURL.push(NoticedURL.java:184)
	at net.yacy.crawler.CrawlStacker.stackCrawl(CrawlStacker.java:355)
	at net.yacy.crawler.CrawlStacker.job(CrawlStacker.java:134)
	at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at net.yacy.kelondro.workflow.InstantBlockingThread.job(InstantBlockingThread.java:101)
	at net.yacy.kelondro.workflow.AbstractBlockingThread.run(AbstractBlockingThread.java:82)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
	at java.lang.Thread.run(Thread.java:745)
2015-09-05 14:12:17 +02:00
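The message for c40c302748 does not say how the problem was addressed. As background, java.util.TreeMap is not thread-safe, and unsynchronized concurrent put() calls can corrupt the tree and fail inside rotateRight or fixAfterInsertion as in the trace above. The sketch below shows one generic way to guard a shared TreeMap; the class is hypothetical and is not taken from the YaCy sources.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

final class GuardedSortedIndex {

    // Every access goes through the synchronized wrapper, so concurrent
    // put() calls cannot interleave inside the tree rebalancing.
    private final SortedMap<String, Integer> index =
            Collections.synchronizedSortedMap(new TreeMap<String, Integer>());

    void put(String key, int value) {
        index.put(key, value);
    }

    int size() {
        return index.size();
    }

    // Demo: several threads insert distinct keys; returns the final size.
    static int fillConcurrently(int threads, int perThread) {
        final GuardedSortedIndex guarded = new GuardedSortedIndex();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    guarded.put(id + ":" + i, i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return guarded.size();
    }
}
```

Other options, such as a ConcurrentSkipListMap or confining the map to a single thread, would serve the same purpose; which one fits depends on how the map is used.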
Michael Peter Christen
94cfa63c46 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-09-05 14:07:53 +02:00