reger
c91e712178
further refactor using standard java / (one) utf-8 charset variable
...
extending initiative of commit 9a25751850
2016-01-07 16:17:37 +01:00
luc
571bc55937
Refactoring : use StandardCharsets constants instead of hard-coded
...
charset names.
2016-01-05 23:37:05 +01:00
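A minimal sketch of the kind of change this refactoring describes, assuming the hard-coded name was "UTF-8". The class and method names are hypothetical and not taken from the YaCy source.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class, not part of YaCy.
final class CharsetSketch {

    // Before: the charset is resolved from a hard-coded name at run time.
    // A typo in the name would only surface as an exception when this line runs.
    static byte[] encodeByName(String text) {
        return text.getBytes(Charset.forName("UTF-8"));
    }

    // After: the JDK constant is checked by the compiler and needs no lookup.
    static byte[] encodeByConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "gr\u00fc\u00df";
        // Both variants produce the same bytes; only the way the charset is obtained differs.
        System.out.println(Arrays.equals(encodeByName(text), encodeByConstant(text)));
    }
}
```

The constant form also avoids the checked UnsupportedEncodingException that the String-name overloads such as getBytes(String) declare.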
reger
1af0e9ef74
remove workaround for Solr bug regarding multivalued date fields
...
fixed in 5.4.0
http://issues.apache.org/jira/browse/SOLR-8050
2016-01-03 01:11:27 +01:00
sixcooler
5a35f9383a
bump to solr/lucene 5.4.0
2016-01-02 21:07:50 +01:00
reger
a58d34a4e8
check error URL cache before adding errorDoc to index
...
- del obsolete related switchboardconstant
2016-01-02 05:03:57 +01:00
reger
e9539b1086
reintroduce special handling of file upload multipart/form-data from HTTPDemon.parseMultipart
...
- add filename to parameter fieldname
- add filecontent to special parameter fieldname$file
(some servlets use this $file parameter)
fix for http://mantis.tokeek.de/view.php?id=542
2015-12-31 03:04:13 +01:00
reger
cd26717ba2
fix low memory status hint (dht-in disabled)
...
http://mantis.tokeek.de/view.php?id=619
2015-12-29 20:38:45 +01:00
reger
a5faf73afa
remove obsolete yacy.init entries interaction.*
...
(related to removed triplestore)
2015-12-29 15:41:19 +01:00
sixcooler
dce1cb65c4
Merge remote-tracking branch 'choose_remote_name/master'
2015-12-28 23:20:42 +01:00
reger
46ac0867ff
fix poison mediawikiimporter output queue also after ExecutionException
...
in worker thread.
Writer of importer needs a poison to close the file. On exception (e.g. OOM)
add a poison marker in outermost try/catch to ensure output queue will terminate
in this condition too (and closes+renames the surrogate/in/xxx.prt file)
2015-12-28 02:32:00 +01:00
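The poison-marker idea in the message above can be sketched as follows. This is an assumption-laden illustration, not the importer's code: the class, the POISON sentinel and the simulated failure are hypothetical, and the sketch uses a finally block rather than the outermost try/catch the commit mentions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration class, not part of YaCy.
final class PoisonQueueSketch {

    // Hypothetical sentinel object; it is compared by identity, never by content.
    static final String POISON = new String("<<poison>>");

    // Producer: enqueues records and always enqueues the poison marker at the end,
    // even if processing fails part-way through.
    static void produce(BlockingQueue<String> out, List<String> records, boolean failMidway) {
        try {
            for (int i = 0; i < records.size(); i++) {
                if (failMidway && i == 1) {
                    throw new IllegalStateException("simulated worker failure");
                }
                out.put(records.get(i));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (RuntimeException e) {
            System.err.println("producer failed: " + e.getMessage());
        } finally {
            // offer() cannot fail here because the queue used in this sketch is unbounded.
            out.offer(POISON);
        }
    }

    // Consumer: collects records until it sees the poison marker, then stops.
    static List<String> drain(BlockingQueue<String> in) {
        List<String> written = new ArrayList<>();
        try {
            while (true) {
                String record = in.take();
                if (record == POISON) {
                    break; // this is where a real writer would close its file
                }
                written.add(record);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        produce(queue, Arrays.asList("a", "b", "c"), true);
        System.out.println(drain(queue));
    }
}
```

Without the marker in the failure path, the consumer would block in take() forever and never reach the point where it closes its output.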
reger
a7591d3ed0
fix mediawikiimporter number format exception on coordinate parsing
...
handle incomplete metadata like "NS=43/50//N".
For other {expr ... } type entries a try/catch was added
2015-12-27 01:59:15 +01:00
reger
9da1712a31
increase http header EXPIRES for css and images in DefaultServlet
...
to increase browser cache hits for unchanging content
2015-12-26 17:35:46 +01:00
reger
6d54eb3d36
skip loading document on crawl start for YMark bookmarks
...
by adding a constructor giving the already loaded document as parameter.
2015-12-26 01:15:07 +01:00
reger
80e2c82249
fix NPE on empty blog importfile parameter
2015-12-24 02:00:45 +01:00
reger
e84d94f8ca
fix mime table for ms office / open office documents
...
(causing wrong parser detection in intranet mode)
2015-12-22 17:48:24 +01:00
reger
45b9bd8403
adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters,
...
and feeding hyperlinks to webgraph processing.
2015-12-21 04:42:26 +01:00
reger
d5fd031449
fix reading of ippattern config array in URLProxy
2015-12-20 15:51:54 +01:00
reger
b7e8358645
make use of header.getContentType where possible (mime is normalized afterwards)
...
otherwise use header.mime() differentiated in prev. commit.
2015-12-20 15:49:24 +01:00
reger
7a8c077838
fix HeaderFramework.mime() to strip charset parameter.
...
Differentiate mime() and getContentType() which gives the raw header field.
This improves parser detection if charsets are included in http content-type field.
2015-12-20 06:44:16 +01:00
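As an illustration of the distinction drawn above, a bare mime type can be derived from a raw Content-Type value roughly like this. The class and method are hypothetical stand-ins, not the HeaderFramework API, and the lower-casing step is an assumption of this sketch.

```java
import java.util.Locale;

// Hypothetical illustration class, not part of YaCy.
final class MimeSketch {

    // Returns the media type without parameters such as "; charset=UTF-8".
    // Returns null if no Content-Type value is available.
    static String mime(String rawContentType) {
        if (rawContentType == null) {
            return null;
        }
        int semicolon = rawContentType.indexOf(';');
        String type = semicolon >= 0 ? rawContentType.substring(0, semicolon) : rawContentType;
        // Assumption: lower-case the result, since media type names are case-insensitive.
        return type.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(mime("text/html; charset=UTF-8"));
        System.out.println(mime("application/pdf"));
    }
}
```

A parser lookup keyed on the bare type then matches even when the server appends a charset parameter.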
reger
b4b6910d60
fix (todo): correct doc.id of remote search result if it does not match the newly
...
calculated doc hash.
Testing showed that in some cases delivered url doesn't match the local
calculated hash. In this case replace doc.id (and host_id_s) with calculation
from url.
2015-12-20 02:10:49 +01:00
reger
dec3e6ad96
fix: adjust urlstub for mailto links
...
(skip protocol)
2015-12-19 20:13:33 +01:00
reger
cb83e65f89
drop returning document language "en" if unknown (fix todo)
...
which also harmonizes handling of query.modifier for rwi and solr results
(the result must match a given language filter)
2015-12-19 01:42:35 +01:00
reger
0c5548a7ff
fix (todo) remove redundant holding of email link nameproperty in parser document
2015-12-18 02:35:44 +01:00
reger
71c416f383
show mailto links in ViewFile.html linklist
2015-12-18 01:11:55 +01:00
reger
6b7c10cef8
fix dc:date in mediawikiimporter/document.writexml to use lastmodified
2015-12-17 02:53:10 +01:00
reger
14803d58cd
let html scraper accept html5 <link rel="icon"> for favicon links
2015-12-17 00:36:08 +01:00
luc
b4cdacee76
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-12-16 03:26:06 +01:00
luc
ba0a293f5c
Corrected another case of
...
"org.apache.lucene.store.AlreadyClosedException" occurring when
SearchEvent.cleanup() was called while committing local solr index.
2015-12-16 03:25:07 +01:00
reger
4d2b934487
prevent mailto links getting into parser result document's in/outbound link collection
...
by checking mailto scheme early.
- fix upper case mailto protocol assignment
- add test case for getProtocol
2015-12-16 03:01:17 +01:00
luc
8c4ab9c76b
Added an option to limit the size of remote solr documents put to
...
local index. See mantis #626 .
2015-12-16 02:20:03 +01:00
luc
a2c08402af
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-12-15 23:30:30 +01:00
luc
70595d05d0
Modified MemoryControl.main() test to properly end for better results
...
displaying.
2015-12-14 23:49:28 +01:00
sixcooler
1be67d9ab6
CachedSolrConnector was replaced by ConcurrentUpdateSolrConnector years
...
ago - time to let it go
Commented out unused table of cache-objects
2015-12-14 21:33:27 +01:00
reger
28b8bc290a
fix use of NETWORK_SEARCHVERIFY for rwi verification
...
was not used to set the searchevent parameter (done in SearchEventCache.getEvent)
- remove unused corresponding QueryParams.filterfailurls param.
2015-12-13 20:01:49 +01:00
reger
020630efd8
remove unused network scanner parameter from queryparameter
...
Search event is not using networkscanner
(removed filterscannerfail param always init to false)
2015-12-13 02:50:08 +01:00
luc
ad5586f8f6
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-12-08 03:35:36 +01:00
luc
8ebefa4233
Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was
...
failing. Looks like it was broken since Commit
b43811d38c
2015-12-08 03:34:03 +01:00
luc
7736ee5a42
Updated MediawikiImporter main() : display usage in console and stop
...
properly without calling System.exit
2015-12-08 03:30:51 +01:00
reger
cdb8f3b10d
make current ranking score value avail. to search interface / api
...
Update the result score result field with the result queue ranking value to reflect
the actual calculated/used score,
for rwi & solr stack results.
(calc. etc. is unchanged, it's just that result entry carries the latest val
as api retrieves the number from it)
2015-12-08 03:17:32 +01:00
luc
27d11f8671
Fixed isSolrDump function : PushBackInputStream was not unread when
...
returning false (for example with a WikiMedia dump).
2015-12-07 21:58:36 +01:00
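The fix described above follows a general pattern: whatever is read ahead from a PushbackInputStream to sniff a format has to be pushed back on every return path. A minimal sketch, with a hypothetical class and method (not the actual isSolrDump implementation) and IOException wrapped as unchecked for brevity:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class, not part of YaCy.
final class PeekSketch {

    // Checks whether the stream starts with the given marker and leaves the stream
    // positioned at its start again, whether or not the marker matched.
    // The stream's pushback buffer must be at least marker.length bytes large.
    static boolean startsWith(PushbackInputStream in, byte[] marker) {
        byte[] buffer = new byte[marker.length];
        int count = 0;
        try {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (count < buffer.length) {
                int read = in.read(buffer, count, buffer.length - count);
                if (read < 0) {
                    break;
                }
                count += read;
            }
            boolean matches = count == marker.length && Arrays.equals(buffer, marker);
            // Push back what was consumed, also when the answer is false.
            if (count > 0) {
                in.unread(buffer, 0, count);
            }
            return matches;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "<mediawiki>".getBytes(StandardCharsets.UTF_8);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data), 16);
        System.out.println(startsWith(in, "<?xml".getBytes(StandardCharsets.UTF_8)));
        System.out.println(startsWith(in, "<media".getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the bytes were not unread on a negative result, the next reader would start a few bytes into the file and fail to recognise it.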
Michael Peter Christen
135a123a77
less logging in new language detection
2015-12-03 00:39:15 +01:00
Michael Peter Christen
ef8cd80593
fix for npe
2015-12-03 00:33:13 +01:00
reger
0347bfa71f
Apply collection query constraint/modifier to rwi result stack.
...
Collection is not available in pure rwi entries (but in local solr metadata)
But if user wishes to filter by query constraint also rwi shall adhere to this
(even if only rwi entries with parsed or solr received metadata may fit)
2015-12-02 22:57:59 +01:00
luc
2a67d2ba6f
Corrected error management for unsupported image formats, parsing
...
errors, and unavailable resources : avoid logging too many exceptions as
these errors easily occur when searching images.
2015-12-01 01:06:01 +01:00
Michael Peter Christen
d6e9834040
Merge branch 'master' of
...
https://github.com/Scarfmonster/yacy_search_server
# Conflicts:
# .classpath
# build.xml
2015-11-30 16:54:54 +01:00
Michael Peter Christen
d82d311995
Merge branch 'master' of https://github.com/luccioman/yacy_search_server
...
# Conflicts:
# .classpath
2015-11-30 13:34:10 +01:00
reger
b5371ea8c1
read/init crawl queue in a thread
...
to speed-up YaCy start on large existing crawler queues
2015-11-29 05:19:39 +01:00
reger
1160b13172
remove unused md5 from ViewFile servlet params
2015-11-28 23:09:15 +01:00
reger
e163ea88f6
fix vsdParser (Visio) parser return statement
...
(unnecessary throw in final block)
2015-11-28 02:43:38 +01:00
reger
b2c8bc0ae6
remove md5_s from default index fields
...
it is not assigned a value / not used
Due to above also excluded from transfer protocol.
2015-11-27 02:41:02 +01:00
luc
e40ae0943b
- No max dimensions specified : render raw image data when source and
...
target image format are the same.
- Corrected scaling condition.
2015-11-26 09:30:43 +01:00
reger
90686a75a2
fix flux factor (additional crawl delay by access count) calculation
2015-11-25 01:34:41 +01:00
luc
4af27289e5
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-11-23 09:01:25 +01:00
reger
297fdb60d3
throw exception if crawler hostqueue can't create hostpath directory.
...
In rare cases hostname may not be a valid filesystem directory name,
which can't be created (e.g. containing '*' char). Prevent the crawl queue
from looping on this invalid entry by throwing a MalformedURLException.
2015-11-22 21:26:18 +01:00
luc
755efac17d
Use same max file size when loading all resource bytes or opening stream
...
content
2015-11-20 19:35:39 +01:00
luc
bc6c79fc12
Corrected scaling function for non RGB images.
2015-11-20 14:35:36 +01:00
luc
1565559df8
Refactoring : extracted write InputStream method.
2015-11-20 09:42:24 +01:00
luc
f0478bb14d
BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
...
imageio-bmp-3.2 library.
- better BMP format flavours support
- handle PNG encoded icons
- handle transparency
Added some javadoc url references to .classpath
2015-11-20 09:38:16 +01:00
luc
07437986e7
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-11-20 08:15:24 +01:00
reger
97cc03ef6a
start using a template for urlproxy header
...
It is included as iframe /proxmsg/urlproxyheader.html
to allow full servlet functionality and flexibility to display some
index/meta data in future.
2015-11-20 01:49:56 +01:00
luc
f01d49c37a
Process large or local file images dealing directly with content
...
InputStream.
2015-11-18 10:15:38 +01:00
luc
3c4c77099d
If available, check content length before downloading. Check also
...
content length is not over Integer.MAX_VALUE.
2015-11-18 10:11:38 +01:00
luc
5bbb2e1730
Ensure resource is closed when reading a full file InputStream
2015-11-18 10:08:06 +01:00
luc
6291a57300
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-11-18 08:49:31 +01:00
reger
0d3c5b223e
have psParser cleanup temp file
2015-11-17 23:45:29 +01:00
reger
7d0d19cb8e
avoid File.deleteOnExit() on temp files
...
JVM registers each file in a list regardless of whether it is already deleted and never
cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
2015-11-17 22:27:07 +01:00
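A rough sketch of the approach described above: keep all temp files in one dedicated subdirectory and remove that directory in code, instead of registering every file with File.deleteOnExit(). All names here are hypothetical. Unlike the commit, the sketch passes the directory explicitly rather than changing the java.io.tmpdir property, because a JDK may cache that property.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical illustration class, not part of YaCy.
final class TempDirSketch {

    // Creates (if needed) a dedicated subdirectory below the system temp directory.
    static Path createTempSubdir(String name) {
        try {
            return Files.createDirectories(Paths.get(System.getProperty("java.io.tmpdir"), name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a temp file inside the dedicated directory; no deleteOnExit() registration.
    static Path newTempFile(Path dir, String prefix, String suffix) {
        try {
            return Files.createTempFile(dir, prefix, suffix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes the directory and everything in it; intended to be called once on shutdown.
    static void deleteRecursively(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order visits children before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = createTempSubdir("sketch-tmp-" + System.nanoTime());
        Path file = newTempFile(dir, "doc", ".tmp");
        System.out.println("created " + file);
        deleteRecursively(dir);
        System.out.println("directory still exists: " + Files.exists(dir));
    }
}
```

With this layout the amount of bookkeeping stays constant over the process lifetime, since only the directory has to be remembered, not each file.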
luc
bfe51001e3
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-11-17 08:30:32 +01:00
reger
02e4489a23
set tmpfile.deleteOnExit by default,
...
to make sure files are removed on shutdown.
2015-11-16 21:37:45 +01:00
reger
2985baaa01
Exclude repetitive protocol part in tokenized url
...
used as description if none is avail. from parser.
2015-11-16 01:06:20 +01:00
reger
ca3d26a401
harmonize wordsintitle & CollectionSchema.title_words_val calculation,
...
remove obsolete partial init of wordreference from urimetadata
2015-11-15 06:06:37 +01:00
reger
52a9040ae6
Sort out double keywords (dc_subject) early in parsed documents
...
- by directly using Set vs. List
- remove unneeded String[] getter
2015-11-13 01:48:28 +01:00
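The effect of collecting keywords in a Set instead of a List can be sketched as below. The class and method are hypothetical, and the trimming and the choice of LinkedHashSet (to keep first-seen order) are assumptions of this sketch, not details taken from the commit.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration class, not part of YaCy.
final class KeywordSketch {

    // Collects keywords into a Set so duplicates are dropped as they are added.
    static Set<String> keywords(String... raw) {
        Set<String> result = new LinkedHashSet<>();
        for (String keyword : raw) {
            if (keyword == null) {
                continue;
            }
            String trimmed = keyword.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed); // add() is a no-op for an already present keyword
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(keywords("yacy", "search", " yacy ", "", "p2p"));
    }
}
```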
luc
49331dc523
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-11-12 08:21:56 +01:00
reger
47d70732f6
improve locale translator
...
- skip empty line
- more robust file section detection (space independent)
2015-11-11 00:57:51 +01:00
sixcooler
646afe9183
do not store subfield *_coordinate + make all num-fields being docvalues
2015-11-10 20:45:33 +01:00
sixcooler
194df613de
not using 'location' as defaultfacetfield - since we removed it being
...
default.
2015-11-10 20:43:58 +01:00
sixcooler
d3b9349b6f
simplification / speedup of GenerationMemoryStrategy
2015-11-10 20:39:46 +01:00
sixcooler
4a905ec134
fix to not let the AccessTracker-Log grow too much, but have enough data
...
to monitor.
(+gitignore-correction)
2015-11-10 20:27:17 +01:00
reger
20e18d79f8
harmonize document title for archive parsers
2015-11-10 01:29:13 +01:00
luc
f11b5e8309
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-11-09 08:13:12 +01:00
reger
112ae013f4
update bzip and bzip parser process,
...
to return one document for the file with combined parser results of the
contained file and register it with supplied url and mime of the archive.
2015-11-07 19:13:18 +01:00
reger
e76a90837b
update zip and tar parser process,
...
to return one document for the file with combined parser results of the
contained files.
2015-11-06 23:58:55 +01:00
luc
4e673ffc9a
Ensure closing of InputStream even when an exception occurs.
2015-11-05 09:40:24 +01:00
luc
10696b53f7
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-11-05 08:26:52 +01:00
reger
8532565c7d
optimize order of parsers to try
...
- start with a parser matching the remote supplied mime
2015-11-04 21:52:02 +01:00
reger
681889ae64
use current tar library for untar files
...
- remove old source copy
2015-11-04 02:57:00 +01:00
reger
5d71fc70e3
fix tarParser early exit on looping content
...
- adjust check of data available according to doc
- return null on no recognized content (to not exit TextParser next parser try)
- use commons.compress directly
2015-11-03 22:14:14 +01:00
luc
bcc2e7cb5b
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-11-03 09:29:57 +01:00
reger
2fcf6f104c
fix bzipParser recognition
...
- Bzip2Inputstream checks magic byte itself to identify bz2 (leave it in input)
- try to supply fitting mime for parsing bz2 content
2015-11-03 03:35:01 +01:00
luc
745e97a575
Merge branch 'master' of https://github.com/yacy/yacy_search_server
2015-11-02 08:10:11 +01:00
reger
a60b1fb6c2
differentiate api call getLocalPort() from getConfigInt()
2015-10-31 23:09:03 +01:00
reger
11f3666660
increase use of predefined CATCHALL_QUERY string
2015-10-31 19:44:31 +01:00
reger
a58ee49307
Optimize internal imagequery focus on using content_type to select images
...
(in favor of url file extension)
2015-10-31 19:18:46 +01:00
luc
fc3294382e
Updated javadocs for warning on target encoding format potential errors.
2015-10-30 16:19:05 +01:00
luc
aa70ff4ff6
Corrected images alpha channel rendering
2015-10-30 05:18:16 +01:00
reger
d223cf0ae4
adjust MediaWiki importer geo coordinate calculation
...
- allow lat/long 0.xxx
- south / west assignment
include test class
2015-10-26 21:19:35 +01:00
reger
2b775d5be6
fix typo in WikiCode coordinate calculation
2015-10-25 19:38:42 +01:00
reger
bbe9df2bb3
fix MediawikiImporter for bz2 dump
...
skip reading the bz2 file magic bytes to identify the bz2 format, as an inputstream reset would be required. Commons Compress reads and checks the magic bytes internally and throws an IOException if they are wrong, making the pre-read obsolete.
2015-10-25 03:06:15 +01:00
reger
c6687dd560
fix a system.out to log.fine
...
in bmpParser
2015-10-25 00:26:45 +02:00
reger
e53c6bbd51
fix init of peer flags
...
(remove hiding of ssl flag)
2015-10-24 19:36:33 +02:00
Michael Peter Christen
ac034db8bc
Merge branch 'master' of https://github.com/luccioman/yacy_search_server
...
# Conflicts:
# htroot/js/highslide/highslide.js
# source/net/yacy/document/ImageParser.java
2015-10-24 11:22:35 +08:00