Commit Graph

3435 Commits

Author SHA1 Message Date
luc
6291a57300 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-18 08:49:31 +01:00
reger
0d3c5b223e have psParser cleanup temp file 2015-11-17 23:45:29 +01:00
reger
7d0d19cb8e avoid File.deleteOnExit() on temp files
JVM registers each file in a list regardless of already deleted and never
cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir 
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
2015-11-17 22:27:07 +01:00
luc
bfe51001e3 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-17 08:30:32 +01:00
reger
02e4489a23 set tmpfile.deleteOnExit by default,
to make sure files are removed on shutdown.
2015-11-16 21:37:45 +01:00
reger
2985baaa01 Exclude repetitive protocol part in tokenized url
used as description if none is avail. from parser.
2015-11-16 01:06:20 +01:00
reger
ca3d26a401 harmonize wordsintitle & CollectionSchema.title_words_val calculation,
remove obsolete partial init of wordreference from urimetadata
2015-11-15 06:06:37 +01:00
reger
52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
- by direct using Set vs. List
- remove not neede String[] getter
2015-11-13 01:48:28 +01:00
luc
49331dc523 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-12 08:21:56 +01:00
reger
47d70732f6 improve locale translator
- skip empty line
- robustness file section detection (space independant)
2015-11-11 00:57:51 +01:00
sixcooler
646afe9183 do not store subfield *_coordinate + make all num-fields being docvalues 2015-11-10 20:45:33 +01:00
sixcooler
194df613de not using 'location' as defaultfacetfield - since we removed it being
default.
2015-11-10 20:43:58 +01:00
sixcooler
d3b9349b6f simplification / speedup of GenerationMemoryStrategy 2015-11-10 20:39:46 +01:00
sixcooler
4a905ec134 fix to not let the AccessTracker-Log grow to much, but have enough data
to monitor.
(+gitignore-correction)
2015-11-10 20:27:17 +01:00
reger
20e18d79f8 harmonize document title for archive parsers 2015-11-10 01:29:13 +01:00
luc
f11b5e8309 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-09 08:13:12 +01:00
reger
112ae013f4 update bzip and bzip parser process,
to return one document for the file with combined parser results of the
containing file and registers it with supplied url and mime of the archive.
2015-11-07 19:13:18 +01:00
reger
e76a90837b update zip and tar parser process,
to return one document for the file with combined parser results of the
containing files.
2015-11-06 23:58:55 +01:00
luc
4e673ffc9a Ensure closing of InputStream even when an exception occurs. 2015-11-05 09:40:24 +01:00
luc
10696b53f7 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-05 08:26:52 +01:00
reger
8532565c7d optimize order of parsers to try
- start with a parser matching the remote supplied mime
2015-11-04 21:52:02 +01:00
reger
681889ae64 use current tar library for untar files
- remove old source copy
2015-11-04 02:57:00 +01:00
reger
5d71fc70e3 fix tarParser early exit on looping content
- adjust check of data available according to doc 
- return null on no recognized content (to not exit TextParser next parser try)
- use commons.compress directly
2015-11-03 22:14:14 +01:00
luc
bcc2e7cb5b Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-03 09:29:57 +01:00
reger
2fcf6f104c fix bzipParser recognition
- Bzip2Inputstream checks magic byte itself to identify bz2 (leave it in input)
- try to suppy fitting mime for parsing bz2 content
2015-11-03 03:35:01 +01:00
luc
745e97a575 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-02 08:10:11 +01:00
reger
a60b1fb6c2 differentiate api call getLocalPort() from getConfigInt() 2015-10-31 23:09:03 +01:00
reger
11f3666660 increase use of pre.defined CATCHALL_QUERY string 2015-10-31 19:44:31 +01:00
reger
a58ee49307 Optimize internal imagequery focus on using content_type to select images
(in favor of url file extension)
2015-10-31 19:18:46 +01:00
luc
fc3294382e Updated javadocs for warning on target encoding format potential errors. 2015-10-30 16:19:05 +01:00
luc
aa70ff4ff6 Corrected images alpha channel rendering 2015-10-30 05:18:16 +01:00
reger
d223cf0ae4 adjust MediaWiki importer geo coordinate calculation
- allow lat/long 0.xxx
- south / west assignment
include test class
2015-10-26 21:19:35 +01:00
reger
2b775d5be6 fix typo in WikiCode coordinate calculation 2015-10-25 19:38:42 +01:00
reger
bbe9df2bb3 fix MediawikiImporter for bz2 dump
skip reading bz2 file magicbyte to identify bz2 format as inputstream reset would be required. Common compress reads and checks the magicbytes internally and throws ioexception if wrong, making preread obsolete.
2015-10-25 03:06:15 +01:00
reger
c6687dd560 fix a system.out to log.fine
in bmpParser
2015-10-25 00:26:45 +02:00
reger
e53c6bbd51 fix init of peer flags
(remove hiding of ssl flag)
2015-10-24 19:36:33 +02:00
Michael Peter Christen
ac034db8bc Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	htroot/js/highslide/highslide.js
#	source/net/yacy/document/ImageParser.java
2015-10-24 11:22:35 +08:00
reger
826f14f37f fix unnececary set null of peer flags, causing reread
remove obsolete version flags
2015-10-22 02:35:58 +02:00
luc
5902ce032e Corrected NullPointerException case when ImageIO reader is not found for
image format.
2015-10-19 14:11:26 +02:00
reger
c6495a5b62 add a log entry on parsing ajax crawling scheme snapshot
(prev. commit 9252e36aeb)
2015-10-18 06:19:12 +02:00
reger
9252e36aeb implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content
see freshly deprecated https://developers.google.com/webmasters/ajax-crawling/
Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in header and parses supplied html snapshot instead of mostly empty ajax/scripted page.
Implementation supports also hash-bang urls (url with anchor starting with ! like  ...path#!hashfragment) but our crawler filters it
(use of hash-bang is controversly discussed and proposal is deprecated, makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time).
Quick - how does it work
- if metatag fragment with content "!" is found
   - htmlparser tries to get content of htmls snapshot (using a different url)
   - htmlparser returns 2 documents (original url and snapshot content - but using same original url)
- after parsing result documents are joined (and stored to index containing content also from snapshot page... as the original ajax page contains typically no parseable html content)
2015-10-18 05:51:01 +02:00
Michael Peter Christen
d1ae999ef9 replaced HashMap with LinkedHashMap to preserve the object order 2015-10-16 23:30:51 +02:00
Michael Peter Christen
7d075a1d76 added log lines 2015-10-16 23:30:04 +02:00
Michael Peter Christen
092dac086e Merge branch 'master' of https://github.com/luccioman/yacy_search_server 2015-10-16 23:22:30 +02:00
reger
7a64bebb86 init Recrawl job chunk size to max crawl loader during job start, to use some system preferences
and allow injection of recrawl urls before queue is empty
During recrawl the balancer hangs on the very last urls often on hosts with huge delay time,
by allowing injection earlier progress is more balanced. Max number of injected crawl urls by recrawl job is 2 * max loader.
2015-10-16 03:05:39 +02:00
luc
d6522fa4a2 Integrated haraldk/TwelveMonkeys library to first add TIF image format
support.
2015-10-15 10:06:51 +02:00
Michael Peter Christen
9244694e64 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2015-10-14 15:17:23 +02:00
Michael Peter Christen
151ccd50a9 fix for image size field values (must be multi-valued) 2015-10-14 15:16:16 +02:00
reger
c9937973e3 unescape MultiProtocolURL getAttributes() return values.
use getAttributes() to get query parameters as clear text (w/o url encoding)
use getSearchpartMap() to get in internal format (url encoded)

fix for http://mantis.tokeek.de/view.php?id=606
2015-10-13 02:43:18 +02:00
reger
78e8c6f3e5 refactor special handling (static override) of SUPPORTED_EXTENSIONS/MIME_TYPES
not used for genericImageParser
2015-10-11 01:23:52 +02:00