7961 Commits

Author SHA1 Message Date
Michael Peter Christen
849ab671a9 0n: modified the p2p bootstrapping process - rules had been too tight and
did not support the re-start of a network with just one principal peer.
2016-03-11 08:54:42 +01:00
reger
764f5100f0 fix delete of temp file after odt & ooxml parser
Close zipfile after parsing
2016-03-04 23:05:55 +01:00
reger
379e9b330d use supplied url port to get robots.txt in crawler's hostqueue 2016-03-02 00:12:34 +01:00
reger
58a959403d fix mixed logfactory in UrlProxyServlet,
Class doesn't use functions of declared ancestor, changed to extend HttpServlet
2016-02-27 03:44:43 +01:00
Michael Peter Christen
2494a820c7 0N - added recording of dump exports if given time frame is not negative 2016-02-24 15:13:20 +01:00
Michael Peter Christen
ef2cc4f690 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2016-02-24 11:19:32 +01:00
Michael Peter Christen
a6bf0b1649 0N - added option to generate index export files for a specific number
of minutes in the past and reverted latest change. The export file dump
will now contain four data elements: f - first date of index entry write
date, l - last date of index write date, n - now-date of index dump
time, c - count of numbers inside the dump. '0N' denotes a series of
changes which will lead to the opportunity to exchange index data dumps
in a way that is needed to integrate ZeroNet index data. This will be
based on index dump sharing; that causes this commit.
2016-02-23 18:56:20 +01:00
reger
6d56beaed8 fix assertion exception in toString of MultiProtocolURL
toString of AnchorURL and MultiProtocolURL are identical code
(no need to override or to protect call to parent)

as reported in https://github.com/yacy/yacy_search_server/issues/43
2016-02-21 00:23:00 +01:00
reger
42a7bdb2af fix SolrSelectServlet authentication to default to true 2016-02-20 22:30:15 +01:00
reger
dbb28bb4f3 del unused statistic parameter (from status servlet) 2016-02-17 22:47:03 +01:00
reger
06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
- Above brought up that the parser start url parameter, declared as AnchorURL, uses only methods of parent object DigestURL (changed parameter declaration accordingly).
2016-02-16 02:05:58 +01:00
reger
caf9e98f09 put metadata dc_publisher in corresponding schema field 2016-02-14 21:13:25 +01:00
reger
38e2b054d4 remove servlet classloader internal cache map (to save resources; cache hits are marginal)
- DefaultServlet already includes a class cache "templateMethodCache", which is emptied
  on low memory status
- avoids a classloader cache that gets no hits but over time holds all (used) servlet classes
2016-02-12 01:20:03 +01:00
reger
6f0b073bf3 override detected language (statistic langdetect) only with TLD determined
language if langdetect probability is not high.
+ additionally truncate zh-cn / zh-tw returned by langdetect to 2 char ISO639-1 zh
used by YaCy
2016-02-07 21:16:22 +01:00
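The rule in commit 6f0b073bf3 could be sketched roughly as follows. This is an illustration only: the class and method names, the `HIGH_PROBABILITY` threshold of 0.9, and the general two-character truncation are assumptions, not taken from the YaCy source (the commit only says "not high" and names zh-cn / zh-tw).

```java
import java.util.Locale;

public class LanguageChoiceSketch {

    // Hypothetical cut-off; the commit message only says "not high".
    static final double HIGH_PROBABILITY = 0.9;

    // Reduce codes such as "zh-cn" or "zh-tw" to a two-letter code ("zh").
    // Assumption: every longer code is truncated, not only the two named in the commit.
    static String toTwoLetterCode(String code) {
        String lower = code.toLowerCase(Locale.ROOT);
        return lower.length() > 2 ? lower.substring(0, 2) : lower;
    }

    // Keep the detected language when detection is confident; otherwise
    // prefer the language derived from the TLD, if one is known.
    static String chooseLanguage(String detected, double probability, String tldLanguage) {
        if (detected != null && probability >= HIGH_PROBABILITY) {
            return toTwoLetterCode(detected);
        }
        if (tldLanguage != null) {
            return tldLanguage;
        }
        return detected == null ? null : toTwoLetterCode(detected);
    }
}
```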
reger
b65e2b527d include use of condenser's content text for language detection.
Language identification may show poor performance on documents with short or no
title but clear lang indication in text content. Using content text too
improves lang detection.
+ remove double caching of text in Identificator
2016-02-07 01:52:32 +01:00
reger
937fbb0b9f correct isHidden() for smb from last commit 2016-02-04 19:20:27 +01:00
reger
535d4bf75f respect hidden attribute for file and smb directory listing
(hidden directories are not listed, effects crawling of local file system)
2016-02-04 19:16:00 +01:00
reger
c28142095a add findClass() to servlet class loader (used in YaCyDefaultServlet)
In the 2 cases where a servlet calls a servlet, the jvm classloader chain is
invoked and the servlet class is loaded by the jvm loader (which succeeds only if
htroot is in the system classpath). This patch uses the standard override design
for loaders to handle these cases (making it no longer crucial to have htroot
in the system classpath, as this classLoader is mainly used for servlets and
looks in this case for the class in the configured path).
+ As the default classloader is parallel capable, we should register this one too.
2016-02-02 03:44:01 +01:00
reger
a6617ad887 expand initRemoteCrawler() to terminate worker threads if called to deactivate
remote crawl.
On startup we save the resources for the remote crawler if it is disabled. Once started,
threads kept running idle after remote crawl was disabled. Now threads are terminated
to save the resources also when disabling during runtime.
+ remove empty class Channels
2016-01-28 23:14:09 +01:00
reger
2048b7e057 support scraping start-/enddate from html tag with property "datetime"
This may be used in html5 <time> tag (which we don't explicitly support yet for date in content scraping).
2016-01-26 21:27:44 +01:00
reger
900d4584ba complete resource cleanup of lists in contentscraper's close() 2016-01-25 23:54:20 +01:00
reger
1f18653de0 pass parsed swf content through htmlscraper
Swf may contain a subset of html tags which shouldn't appear as text.
Especially <font> tag may totally screw up metadata servlet if not filtered out.
2016-01-21 02:55:05 +01:00
reger
18ecf57792 add support of compressed swf to swfParser
from JavaSWF2 (source compatible to WebCat).
Moved swf file signature check to parser
Changed use of synced vector to list swf InStream
2016-01-20 00:58:29 +01:00
sixcooler
5cb7ba0dc4 fix for connections not getting closed to get favicon.ico during search 2016-01-19 20:57:22 +01:00
reger
ed3e16e092 apply remote result count config value to Bookmark Autosearch
+ prepare to make the widely unused Bookmark feature optional
2016-01-15 02:10:10 +01:00
Ryszard Goń
a98c395023 Add the Autocrawl thread 2016-01-14 00:50:23 +01:00
Ryszard Goń
1728cd30c6 Create autocrawl profiles 2016-01-12 16:28:34 +01:00
reger
ff27824964 fix swfParser reading file signature
before passing to library (current version expects data w/o signature)
2016-01-10 01:16:31 +01:00
reger
c91e712178 further refactor using standard java / (one) utf-8 charset variable
extending initiative of commit 9a25751850
2016-01-07 16:17:37 +01:00
luc
571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
charset names.
2016-01-05 23:37:05 +01:00
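For illustration, the kind of change described in commit 571bc55937 typically looks like the sketch below. The class and method names are hypothetical; only `java.nio.charset.StandardCharsets` is the standard JDK API.

```java
import java.nio.charset.StandardCharsets;

public class CharsetSketch {

    // Before such a refactoring one would often see text.getBytes("UTF-8"),
    // which looks the charset up by name and forces callers to handle
    // the checked UnsupportedEncodingException.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // The constant is resolved at class-load time, so a misspelled
    // charset name becomes a compile error instead of a runtime failure.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```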
reger
1af0e9ef74 remove workaround for Solr bug regarding multivalued date fields
fixed in 5.4.0
http://issues.apache.org/jira/browse/SOLR-8050
2016-01-03 01:11:27 +01:00
sixcooler
5a35f9383a bump to solr/lucene 5.4.0 2016-01-02 21:07:50 +01:00
reger
a58d34a4e8 check error URL cache before adding errorDoc to index
- del obsolete related switchboardconstant
2016-01-02 05:03:57 +01:00
reger
e9539b1086 reintroduce special handling of file upload multipart/form-data from HTTPDemon.parseMultipart
- add filename to parameter fieldname
- add filecontent to special parameter fieldname$file
(some servlets use this $file parameter)

fix for http://mantis.tokeek.de/view.php?id=542
2015-12-31 03:04:13 +01:00
reger
cd26717ba2 fix low memory status hint (dht-in disabled)
http://mantis.tokeek.de/view.php?id=619
2015-12-29 20:38:45 +01:00
reger
a5faf73afa remove obsolete yacy.init entries interaction.*
(related to removed triplestore)
2015-12-29 15:41:19 +01:00
sixcooler
dce1cb65c4 Merge remote-tracking branch 'choose_remote_name/master' 2015-12-28 23:20:42 +01:00
reger
46ac0867ff fix poison mediawikiimporter output queue also after ExecutionException
in worker thread.
Writer of importer needs a poison to close the file. On exception (e.g. OOM)
add a poison marker in outermost try/catch to ensure output queue will terminate
in this condition too (and closes+renames the surrogate/in/xxx.prt file)
2015-12-28 02:32:00 +01:00
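The poison marker idea from commit 46ac0867ff can be sketched as below. This is not the mediawikiimporter code: the class, the `run` method and the simulated failure are hypothetical, and the marker is placed in a `finally` block, which is a variation on the outermost try/catch the commit mentions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PoisonPillSketch {

    // Unique marker instance, recognised by identity rather than by content.
    static final String POISON = new String("POISON");

    // Feeds records to a writer thread and returns what the writer received.
    // When failMidway is true, the producer fails before the second record.
    static List<String> run(List<String> records, boolean failMidway) {
        BlockingQueue<String> out = new LinkedBlockingQueue<>();
        List<String> written = new ArrayList<>();

        Thread writer = new Thread(() -> {
            try {
                // The writer stops (and could close its file) once it sees the marker.
                for (String r = out.take(); r != POISON; r = out.take()) {
                    written.add(r);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        try {
            for (int i = 0; i < records.size(); i++) {
                if (failMidway && i == 1) {
                    throw new IllegalStateException("simulated worker failure");
                }
                out.add(records.get(i));
            }
        } catch (RuntimeException e) {
            // A real importer would log the failure here.
        } finally {
            // Always enqueue the marker, so the writer terminates on failure too.
            out.add(POISON);
        }

        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }
}
```

Without the marker in the failure path, the writer thread would block forever on `take()` and the output file would never be closed.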
reger
a7591d3ed0 fix mediawikiimporter number format exception on coordinate parsing
handle incomplete metadata like "NS=43/50//N".
For other {expr ... } type entries a try/catch was added
2015-12-27 01:59:15 +01:00
reger
9da1712a31 increase http header EXPIRES for css and images in DefaultServlet
to increase browser cache hits for not changing content
2015-12-26 17:35:46 +01:00
reger
6d54eb3d36 skip loading document on crawl start for YMark bookmarks
by adding a constructor giving the already loaded document as parameter.
2015-12-26 01:15:07 +01:00
reger
80e2c82249 fix NPE on empty blog importfile parameter 2015-12-24 02:00:45 +01:00
reger
e84d94f8ca fix mime table for ms office / open office documents
(causing wrong parser detect in intranet mode)
2015-12-22 17:48:24 +01:00
reger
45b9bd8403 adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters,
and feeding hyperlinks to webgraph processing.
2015-12-21 04:42:26 +01:00
reger
d5fd031449 fix reading of ippattern config array in URLProxy 2015-12-20 15:51:54 +01:00
reger
b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
otherwise use header.mime() differentiated in prev. commit.
2015-12-20 15:49:24 +01:00
reger
7a8c077838 fix HeaderFramework.mime() to strip charset parameter.
Differentiate mime() and getContentType() which gives the raw header field.
This improves parser detection if charsets are included in http content-type field.
2015-12-20 06:44:16 +01:00
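The distinction made in commit 7a8c077838 between the raw header field and the bare media type can be illustrated with the sketch below. The signature is hypothetical and not copied from `HeaderFramework`; the lower-casing and the default value are assumptions added for the example.

```java
import java.util.Locale;

public class ContentTypeSketch {

    // Returns the media type without parameters, e.g.
    // "text/html; charset=UTF-8" becomes "text/html".
    // Falls back to defaultMime when the header is missing or empty.
    static String mime(String contentType, String defaultMime) {
        if (contentType == null) {
            return defaultMime;
        }
        int semicolon = contentType.indexOf(';');
        String type = semicolon < 0 ? contentType : contentType.substring(0, semicolon);
        // Assumption: normalise case, since media types are case-insensitive.
        type = type.trim().toLowerCase(Locale.ROOT);
        return type.isEmpty() ? defaultMime : type;
    }
}
```

A parser lookup keyed on the result then matches the same entry whether or not the server appended a charset parameter.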
reger
b4b6910d60 fix (todo): correct doc.id of remote search result if it does not match the newly
calculated doc hash.
Testing showed that in some cases delivered url doesn't match the local
calculated hash. In this case replace doc.id (and host_id_s) with calculation
from url.
2015-12-20 02:10:49 +01:00
reger
dec3e6ad96 fix: adjust urlstub for mailto links
(skip protocol)
2015-12-19 20:13:33 +01:00
reger
cb83e65f89 drop returning document language "en" if unknown (fix todo)
which also harmonizes handling of query.modifier for rwi and solr results
(the result must match a given language filter)
2015-12-19 01:42:35 +01:00