Commit Graph

12266 Commits

Author SHA1 Message Date
reger
c2a88d53ab fix use of uninitialized variable (m_LiteralDecoder) 2016-03-06 03:24:09 +01:00
reger
96b8d9b09e moving the J7Zip-modified source and Maven build to libbuild
from main pom. 
Using source included in j7zip-modified.jar.
This combines all external lib preparation in the libbuild main pom.
2016-03-06 03:19:52 +01:00
reger
764f5100f0 fix deletion of temp file after odt & ooxml parser
Close zipfile after parsing
2016-03-04 23:05:55 +01:00
reger
379e9b330d use supplied url port to get robots.txt in crawlers hostqueue 2016-03-02 00:12:34 +01:00
reger
ed765de29b adjust start/stop classpath in build script
(with servlet classloader no need for htroot in system classpath)
2016-02-29 00:04:36 +01:00
reger
9a7efa7814 harmonize classpath with startYaCy.bat
(with servlet classloader no need for htroot in system classpath)
2016-02-28 22:53:41 +01:00
reger
0dcda3809e harmonize classpath with startYaCy.bat 2016-02-28 22:10:43 +01:00
reger
58a959403d fix mixed logfactory in UrlProxyServlet.
The class doesn't use functions of its declared ancestor; changed to extend HttpServlet
2016-02-27 03:44:43 +01:00
reger
dc112d0e32 upd to slf4j-1.7.16 2016-02-26 00:50:26 +01:00
Michael Peter Christen
2494a820c7 0N - added recording of dump exports if given time frame is not negative 2016-02-24 15:13:20 +01:00
Michael Peter Christen
ef2cc4f690 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2016-02-24 11:19:32 +01:00
reger
7b02cacb12 upd to Jetty 9.2.15.v20160210 2016-02-24 02:32:12 +01:00
Michael Peter Christen
a6bf0b1649 0N - added option to generate index export files for a specific number
of minutes in the past and reverted latest change. The export file dump
will now contain four data elements: f - first date of index entry write
date, l - last date of index write date, n - now-date of index dump
time, c - count of numbers inside the dump. '0N' denotes a series of
changes which will lead to the opportunity to exchange index data dumps
in a way that is needed to integrate ZeroNet index data. This will be
based on index dump sharing; that causes this commit.
2016-02-23 18:56:20 +01:00
reger
9312fbe563 making WebStructurePicture_p less vulnerable to faulty host input parameter (like host1,,host3)
by continuing the host loop on exception

inspired by http://mantis.tokeek.de/view.php?id=637
2016-02-21 21:38:11 +01:00
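The "continue host loop on exception" approach described in the commit above can be illustrated with a minimal sketch. This is not the actual WebStructurePicture_p code; the class `HostListParser` and the method names `parseHosts` and `requireNonEmpty` are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: tolerate faulty entries in a comma-separated host parameter.
final class HostListParser {

    // Hypothetical validation step standing in for whatever per-host work may fail.
    static String requireNonEmpty(String host) {
        if (host.isEmpty()) {
            throw new IllegalArgumentException("empty host entry");
        }
        return host;
    }

    static List<String> parseHosts(String hostParam) {
        List<String> hosts = new ArrayList<>();
        for (String raw : hostParam.split(",")) {
            try {
                hosts.add(requireNonEmpty(raw.trim()));
            } catch (IllegalArgumentException e) {
                // A faulty entry (e.g. the empty one in "host1,,host3") is skipped
                // instead of aborting the whole loop.
                continue;
            }
        }
        return hosts;
    }
}
```

With this sketch, an input such as `host1,,host3` yields the two valid hosts and silently drops the empty entry.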
reger
6d56beaed8 fix assertion exception in toString of MultiProtocolURL
toString of AnchorURL and MultiProtocolURL are identical code
(no need to override or to protect call to parent)

as reported in https://github.com/yacy/yacy_search_server/issues/43
2016-02-21 00:23:00 +01:00
reger
b12b8fb1c2 include initial Japanese translation in language selection 2016-02-20 23:17:59 +01:00
Burkhard
6a3d27ca5b Merge pull request #44 from ImpactCrater/master
Created a translation file ja.lng
2016-02-20 22:43:41 +01:00
reger
42a7bdb2af fix SolrSelectServlet authentication to default to true 2016-02-20 22:30:15 +01:00
ImpactCrater
567c292302 Created a translation file ja.lng
I wrote a bit of translation to Japanese.
2016-02-21 03:55:33 +09:00
Michael Peter Christen
5b9030180c added peer hash to export dump name. 2016-02-19 19:26:02 +01:00
Michael Peter Christen
287b918bd7 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2016-02-19 07:52:59 +01:00
reger
20e3c25ae3 upd to weupnp-0.1.4.jar 2016-02-18 01:09:29 +01:00
reger
dbb28bb4f3 del unused statistic parameter (from status servlet) 2016-02-17 22:47:03 +01:00
Michael Peter Christen
b851308ee6 enhanced robustness of image computation 2016-02-16 17:36:49 +01:00
reger
06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
- Above brought up that parser start url parameter, declared as AnchorURL, uses only methods of parent object DigestURL (changed parameter declaration accordingly).
2016-02-16 02:05:58 +01:00
reger
caf9e98f09 put metadata dc_publisher in corresponding schema field 2016-02-14 21:13:25 +01:00
reger
38e2b054d4 remove servlet classloader internal cache map (to save resources, cache hits marginal)
- DefaultServlet already includes a class cache "templateMethodCache" which is emptied
  on low mem status
- avoids a classloader cache that gets no hits but over time holds all (used) servlet classes
2016-02-12 01:20:03 +01:00
reger
6f0b073bf3 override detected language (statistic langdetect) only with TLD determined
language if langdetect probability is not high.
+ additionally truncate zh-cn / zh-tw returned by langdetect to 2 char ISO639-1 zh
used by YaCy
2016-02-07 21:16:22 +01:00
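The override rule in the commit above can be sketched roughly as follows. This is an illustration only, not YaCy's code: the class `LanguageChoice`, its method names, and the threshold value are assumptions, since the commit does not state what probability counts as "high".

```java
import java.util.Locale;

// Hypothetical sketch of "prefer the TLD language unless langdetect is confident".
final class LanguageChoice {

    // Assumed threshold; the commit message does not give a number.
    static final double HIGH_PROBABILITY = 0.9;

    // Truncate region-qualified codes such as "zh-cn" / "zh-tw" to the 2-char ISO 639-1 code.
    static String normalize(String detected) {
        if (detected == null) {
            return null;
        }
        String lower = detected.toLowerCase(Locale.ROOT);
        int dash = lower.indexOf('-');
        return dash > 0 ? lower.substring(0, dash) : lower;
    }

    static String choose(String detected, double probability, String tldLanguage) {
        String lang = normalize(detected);
        if (lang == null) {
            return tldLanguage;
        }
        // Only a low-confidence detection is overridden by the TLD-derived language.
        if (probability < HIGH_PROBABILITY && tldLanguage != null) {
            return tldLanguage;
        }
        return lang;
    }
}
```

In this sketch a confident `zh-cn` detection is kept and shortened to `zh`, while a low-probability detection gives way to the TLD-derived language when one is available.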
reger
b65e2b527d include use of condenser's content text for language detection.
Language identification may show poor performance on documents with short or no
title but clear lang indication in text content. Using content text too
improves lang detection.
+ remove double caching of text in Identificator
2016-02-07 01:52:32 +01:00
reger
756c55e6d1 upd to Solr 5.4.1 2016-02-06 21:32:54 +01:00
reger
937fbb0b9f correct isHidden() for smb from last commit 2016-02-04 19:20:27 +01:00
reger
535d4bf75f respect hidden attribute for file and smb directory listing
(hidden directories are not listed, affects crawling of local file system)
2016-02-04 19:16:00 +01:00
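For the local-file case, respecting the hidden attribute in a directory listing might look like the sketch below. It uses only `java.io.File` and does not cover the smb case from the commit; the class `VisibleListing` and the method `listVisible` are hypothetical names.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: list a directory while skipping hidden entries.
final class VisibleListing {

    static List<String> listVisible(File dir) {
        List<String> names = new ArrayList<>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            return names; // not a directory, or an I/O error occurred
        }
        for (File entry : entries) {
            if (entry.isHidden()) {
                continue; // hidden files and directories are not listed
            }
            names.add(entry.getName());
        }
        Collections.sort(names);
        return names;
    }
}
```

Note that what `File.isHidden()` reports is platform dependent: on Unix-like systems it follows the leading-dot convention, on Windows the file system's hidden attribute.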
reger
cc79ad8de6 compare search page, remove diminished search target
(romso.de, dbpedia.neofonie.de )
2016-02-04 00:47:42 +01:00
reger
375d49d536 upd classpath in batches (remove unnecessary htroot)
see prev commit
2016-02-03 21:50:50 +01:00
reger
c28142095a add findClass() to servlet class loader (used in YaCyDefaultServlet)
In the 2 cases where a servlet calls a servlet, the jvm classloader chain is
invoked and the servlet class is loaded by the jvm loader (successful while requiring
htroot in system classpath). This patch uses the standard override design
for loaders to handle these cases (making it no longer crucial to have htroot
in system classpath, as this classLoader is mainly used for servlets and
looks in this case for the class in the configured path).
+ As the default classloader is parallel capable we should register this too.
2016-02-02 03:44:01 +01:00
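The "standard override design" mentioned in the commit above is the usual `ClassLoader` pattern: keep `loadClass()` untouched so the parent chain is asked first, and override `findClass()` for classes the parents cannot find. The sketch below shows that pattern under assumptions; it is not YaCy's servlet class loader, and `DirectoryClassLoader` and `classRoot` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a loader that resolves classes from a configured directory.
final class DirectoryClassLoader extends ClassLoader {

    static {
        // Register as parallel capable, as the default class loader is.
        ClassLoader.registerAsParallelCapable();
    }

    private final Path classRoot;

    DirectoryClassLoader(Path classRoot, ClassLoader parent) {
        super(parent);
        this.classRoot = classRoot;
    }

    // Called by loadClass() only after the parent chain failed to find the class.
    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        Path file = classRoot.resolve(name.replace('.', '/') + ".class");
        try {
            byte[] bytes = Files.readAllBytes(file);
            return defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}
```

Because delegation stays in `loadClass()`, classes available to the parent loaders are still loaded by them, and only the remaining names are looked up under the configured directory.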
reger
8e60788c8f fix json date facet displayname 2016-01-31 02:38:39 +01:00
reger
46772e08d0 upd to pdfbox 1.8.11 2016-01-31 00:30:39 +01:00
reger
a6617ad887 expand initRemoteCrawler() to terminate worker threads if called to deactivate
remote crawl.
On startup we save the resources for the remote crawler if it is disabled. Once started,
threads keep running idle after remote crawl is disabled. Now threads are terminated
to save the resources also when disabling during runtime.
+ remove empty class Channels
2016-01-28 23:14:09 +01:00
reger
2048b7e057 support scraping start-/enddate from html tag with property "datetime"
This may be used in html5 <time> tag (which we don't explicitly support yet for date in content scraping).
2016-01-26 21:27:44 +01:00
reger
900d4584ba complete resource cleanup of lists in contentscraper's close() 2016-01-25 23:54:20 +01:00
reger
06e5cd6164 add support parsing swf-metadata to swfparser
flash supports metadata tag in swf file with metadata in xmp (xml) format.
parse some common data to include it in the head section of the html string
of converttohtml.
2016-01-25 22:13:04 +01:00
reger
11b1587067 replace remaining use of java.util.Vector by ArrayList (WebCat-swf) 2016-01-24 02:30:27 +01:00
reger
9331acdb18 add support for DEFINEFONT3 (swf8) to webcat parser
experienced issue with JPEGTABLE tag (with length=0) causing abort of parsing (ioexception)
as we don't use/need it for text parsing skip this tag.
2016-01-23 22:46:22 +01:00
reger
bf5fca5d99 add missing swf tag constants according to latest spec
reduce use of synced vector in webcat parser
2016-01-23 20:19:01 +01:00
reger
1f18653de0 pass parsed swf content through htmlscraper
Swf may contain a subset of html tags which should appear as text.
Especially <font> tag may totally screw up metadata servlet if not filtered out.
2016-01-21 02:55:05 +01:00
reger
18ecf57792 add support of compressed swf to swfParser
from JavaSWF2 (source compatible to WebCat).
Moved swf file signature check to parser
Changed use of synced vector to list swf InStream
2016-01-20 00:58:29 +01:00
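As background for the commit above: an SWF file starts with the signature `FWS` when uncompressed and `CWS` when the data after the first 8 header bytes is zlib-compressed. The sketch below shows a signature check plus inflation using only the JDK; it is not the JavaSWF2 or WebCat code, and the class `SwfInflate` and method `toUncompressed` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch: turn a zlib-compressed (CWS) SWF into its uncompressed (FWS) form.
final class SwfInflate {

    static byte[] toUncompressed(byte[] swf) throws IOException {
        // Signature check: 3 signature bytes, 1 version byte, 4 length bytes.
        if (swf.length < 8 || swf[1] != 'W' || swf[2] != 'S') {
            throw new IOException("not a SWF file");
        }
        if (swf[0] == 'F') {
            return swf; // already uncompressed
        }
        if (swf[0] != 'C') {
            throw new IOException("unsupported SWF signature");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write('F');          // rewrite signature to the uncompressed form
        out.write(swf, 1, 7);    // copy "WS", version byte and 4-byte length unchanged
        try (InputStream in = new InflaterInputStream(
                new ByteArrayInputStream(swf, 8, swf.length - 8))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }
}
```

Other signatures (for example LZMA-compressed files) are rejected by this sketch rather than handled.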
sixcooler
5cb7ba0dc4 fix for connections not getting closed to get favicon.ico during search 2016-01-19 20:57:22 +01:00
sixcooler
e1dd808e1c fix for 'move test classes to test/java' 2016-01-19 20:50:26 +01:00
reger
6c25710a34 replace bugfixed webcat-swf.jar 2016-01-18 23:36:18 +01:00
reger
4213ff84d4 import WebCat swf parser custom source package
This package is not available as jar (used jar is a custom compile as we 
use just a portion of the package) 
WebCat package is not maintained. To be able to fix bugs, source extract 
of swf parser imported here.
2016-01-18 22:41:49 +01:00