Commit Graph

3470 Commits

Author SHA1 Message Date
reger
b4b6910d60 fix (todo): correct doc.id of remote search result if it does not match the
newly calculated doc hash.
Testing showed that in some cases the delivered url doesn't match the locally
calculated hash. In this case replace doc.id (and host_id_s) with the value
calculated from the url.
2015-12-20 02:10:49 +01:00
reger
dec3e6ad96 fix: adjust urlstub for mailto links
(skip protocol)
2015-12-19 20:13:33 +01:00
reger
cb83e65f89 drop returning document language "en" if unknown (fix todo)
which also harmonizes handling of query.modifier for rwi and solr results
(so a result must match a given language filter)
2015-12-19 01:42:35 +01:00
reger
0c5548a7ff fix (todo) remove redundant holding of email link nameproperty in parser document 2015-12-18 02:35:44 +01:00
reger
71c416f383 show mailto links in ViewFile.html linklist 2015-12-18 01:11:55 +01:00
reger
6b7c10cef8 fix dc:date in mediawikiimporter/document.writexml to use lastmodified 2015-12-17 02:53:10 +01:00
reger
14803d58cd let html scraper accept html5 <link rel="icon"> for favicon links 2015-12-17 00:36:08 +01:00
reger
4d2b934487 prevent mailto links getting into parser result document's in/outbound link collection
by checking mailto scheme early.
- fix upper case mailto protocol assignment
- add test case for getProtocol
2015-12-16 03:01:17 +01:00
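The early mailto check described in commit 4d2b934487 could look roughly like the sketch below. It is not the YaCy code: the class and method names are hypothetical, and the only point shown is that the scheme comparison ignores case, so "MAILTO:" and "mailto:" are treated alike.

```java
public class MailtoCheckSketch {

    // Length of the scheme prefix "mailto:" that is compared.
    private static final String SCHEME = "mailto:";

    /**
     * Hypothetical helper: returns true if the link starts with the mailto
     * scheme, compared case-insensitively. Such links would be kept out of
     * the in/outbound link collections by the caller.
     */
    public static boolean isMailto(String url) {
        return url != null
                && url.regionMatches(true, 0, SCHEME, 0, SCHEME.length());
    }

    public static void main(String[] args) {
        System.out.println(isMailto("MAILTO:user@example.org"));
        System.out.println(isMailto("http://example.org/"));
    }
}
```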
sixcooler
1be67d9ab6 CachedSolrConnector was replaced by ConcurrentUpdateSolrConnector years
ago - time to let it go
Commented out unused table of cache-objects
2015-12-14 21:33:27 +01:00
reger
28b8bc290a fix use of NETWORK_SEARCHVERIFY for rwi verification
was not used to set the searchevent parameter (done in SearchEventCache.getEvent)
- remove unused corresponding QueryParams.filterfailurls param.
2015-12-13 20:01:49 +01:00
reger
020630efd8 remove unused network scanner parameter from queryparameter
Search event is not using networkscanner 
(removed filterscannerfail param always init to false)
2015-12-13 02:50:08 +01:00
luc
ad5586f8f6 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-12-08 03:35:36 +01:00
luc
8ebefa4233 Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was
failing. Looks like it was broken since Commit
b43811d38c
2015-12-08 03:34:03 +01:00
luc
7736ee5a42 Updated MediaWikiImporter main() : display usage in console and stop
properly without calling System.exit
2015-12-08 03:30:51 +01:00
reger
cdb8f3b10d make current ranking score value avail. to search interface / api
Update the result score result field with the result queue ranking value to reflect
the actual calculated/used score,
for rwi & solr stack results.
(calc. etc. is unchanged, it's just that result entry carries the latest val
as api retrieves the number from it)
2015-12-08 03:17:32 +01:00
luc
27d11f8671 Fixed isSolrDump function : PushBackInputStream was not unread when
returning false (for example with a WikiMedia dump).
2015-12-07 21:58:36 +01:00
Michael Peter Christen
135a123a77 less logging in new language detection 2015-12-03 00:39:15 +01:00
Michael Peter Christen
ef8cd80593 fix for npe 2015-12-03 00:33:13 +01:00
reger
0347bfa71f Apply collection query constraint/modifier to rwi result stack.
Collection is not available in pure rwi entries (but in local solr metadata)
But if user wishes to filter by query constraint also rwi shall adhere to this
(even if only rwi entries with parsed or solr received metadata may fit)
2015-12-02 22:57:59 +01:00
luc
2a67d2ba6f Corrected error management for unsupported image formats, parsing
errors, and unavailable resources : avoid logging too many exceptions, as
these errors easily occur when searching images.
2015-12-01 01:06:01 +01:00
Michael Peter Christen
d6e9834040 Merge branch 'master' of
https://github.com/Scarfmonster/yacy_search_server

# Conflicts:
#	.classpath
#	build.xml
2015-11-30 16:54:54 +01:00
Michael Peter Christen
d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	.classpath
2015-11-30 13:34:10 +01:00
reger
b5371ea8c1 read/init crawl queue in a thread
to speed-up YaCy start on large existing crawler queues
2015-11-29 05:19:39 +01:00
reger
1160b13172 remove unused md5 from ViewFile servlet params 2015-11-28 23:09:15 +01:00
reger
e163ea88f6 fix vsdParser (Visio) parser return statement
(final block un-necessary throw)
2015-11-28 02:43:38 +01:00
reger
b2c8bc0ae6 remove md5_s from default index fields
it is not assigned a value / not used
Due to above also excluded from transfer protocol.
2015-11-27 02:41:02 +01:00
luc
e40ae0943b - No max dimensions specified : render raw image data when source and
target image format are the same.
- Corrected scaling condition.
2015-11-26 09:30:43 +01:00
reger
90686a75a2 fix flux factor (additional crawl delay by access count) calculation 2015-11-25 01:34:41 +01:00
luc
4af27289e5 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-23 09:01:25 +01:00
reger
297fdb60d3 throw exception if crawler hostqueue can't create hostpath directory.
In rare cases hostname may not be a valid filesystem directory name,
which can't be created (e.g. containing '*' char). Prevent the crawl queue from
looping on this invalid entry by throwing a MalformedURLException.
2015-11-22 21:26:18 +01:00
luc
755efac17d Use same max file size when loading all resource bytes or opening stream
content
2015-11-20 19:35:39 +01:00
luc
bc6c79fc12 Corrected scaling function for non RGB images. 2015-11-20 14:35:36 +01:00
luc
1565559df8 Refactoring : extracted write InputStream method. 2015-11-20 09:42:24 +01:00
luc
f0478bb14d BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
imageio-bmp-3.2 library.

 - better BMP format flavours support
 - handle PNG encoded icons
 - handle transparency
 
Added some javadoc url references to .classpath
2015-11-20 09:38:16 +01:00
luc
07437986e7 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-20 08:15:24 +01:00
reger
97cc03ef6a start using a template for urlproxy header
It is included as iframe  /proxmsg/urlproxyheader.html
to allow full servlet functionality and flexibility to display some
index/meta data in future.
2015-11-20 01:49:56 +01:00
luc
f01d49c37a Process large or local file images dealing directly with content
InputStream.
2015-11-18 10:15:38 +01:00
luc
3c4c77099d If available, check content length before downloading. Check also
content length is not over Integer.MAX_VALUE.
2015-11-18 10:11:38 +01:00
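A minimal sketch of the content length check described in commit 3c4c77099d, under stated assumptions: the class name, method name and the maxBytes parameter are hypothetical, and a negative length is taken to mean "unknown", which is how java.net.URLConnection.getContentLengthLong() reports a missing header.

```java
public class ContentLengthCheckSketch {

    /**
     * Hypothetical pre-download check.
     *
     * @param contentLength advertised length in bytes, negative if unknown
     * @param maxBytes      configured size limit, negative for "no limit"
     * @return true if the download may proceed
     */
    public static boolean isAcceptable(long contentLength, long maxBytes) {
        if (contentLength < 0) {
            // Length not advertised: nothing can be rejected up front.
            return true;
        }
        if (contentLength > Integer.MAX_VALUE) {
            // Would not fit into a single Java byte array.
            return false;
        }
        return maxBytes < 0 || contentLength <= maxBytes;
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable(500L, 1000L));
        System.out.println(isAcceptable(2000L, 1000L));
    }
}
```

When the length is unknown the check cannot help, so a caller would still need to enforce the limit while reading the stream.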
luc
5bbb2e1730 Ensure resource is closed when reading a full file InputStream 2015-11-18 10:08:06 +01:00
luc
6291a57300 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-18 08:49:31 +01:00
reger
0d3c5b223e have psParser cleanup temp file 2015-11-17 23:45:29 +01:00
reger
7d0d19cb8e avoid File.deleteOnExit() on temp files
JVM registers each file in a list regardless of whether it was already deleted and never
cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir 
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
2015-11-17 22:27:07 +01:00
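The approach described in commit 7d0d19cb8e can be illustrated with the sketch below. It is an assumption-laden illustration, not the YaCy implementation: all names are hypothetical, and the files are created in the subdirectory explicitly rather than by changing the java.io.tmpdir property as the commit does. The idea shown is that one cleanup of a dedicated directory at shutdown replaces one File.deleteOnExit() registration per temp file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class TempSubdirSketch {

    // Hypothetical helper: create (or reuse) a private subdirectory below java.io.tmpdir.
    public static Path createTempSubdir(String name) throws IOException {
        Path base = Paths.get(System.getProperty("java.io.tmpdir"));
        return Files.createDirectories(base.resolve(name));
    }

    // Create a temp file inside the subdirectory; no deleteOnExit() entry is registered.
    public static Path newTempFile(Path subdir, String prefix, String suffix) throws IOException {
        return Files.createTempFile(subdir, prefix, suffix);
    }

    // Delete the directory and its content, children before parents.
    public static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        Path subdir = createTempSubdir("sketch-tmp-" + System.nanoTime());
        // A single shutdown hook cleans up all temp files at once.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                deleteRecursively(subdir);
            } catch (IOException | UncheckedIOException e) {
                // Best effort on shutdown; nothing sensible left to do.
            }
        }));
        Path tmp = newTempFile(subdir, "parser", ".tmp");
        System.out.println("created " + tmp);
    }
}
```

Files that are no longer needed can still be deleted right away during runtime; the shutdown cleanup only catches what is left over.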
luc
bfe51001e3 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-17 08:30:32 +01:00
reger
02e4489a23 set tmpfile.deleteOnExit by default,
to make sure files are removed on shutdown.
2015-11-16 21:37:45 +01:00
reger
2985baaa01 Exclude repetitive protocol part in tokenized url
used as description if none is avail. from parser.
2015-11-16 01:06:20 +01:00
reger
ca3d26a401 harmonize wordsintitle & CollectionSchema.title_words_val calculation,
remove obsolete partial init of wordreference from urimetadata
2015-11-15 06:06:37 +01:00
reger
52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
- by direct using Set vs. List
- remove unneeded String[] getter
2015-11-13 01:48:28 +01:00
luc
49331dc523 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-12 08:21:56 +01:00
reger
47d70732f6 improve locale translator
- skip empty line
- more robust file section detection (space independent)
2015-11-11 00:57:51 +01:00
sixcooler
646afe9183 do not store subfield *_coordinate + make all num-fields being docvalues 2015-11-10 20:45:33 +01:00