3839 Commits

Author SHA1 Message Date
Michael Peter Christen
bf6709d196 fixed missing browser activation in linux 2016-08-19 19:24:15 +02:00
Michael Peter Christen
d8504418b6 enhanced browser-caching of static content 2016-08-19 19:23:51 +02:00
Michael Peter Christen
079112358c Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-08-19 15:31:09 +02:00
Michael Peter Christen
efeb592661 don't do solr optimization, this creates high IO load. We should leave
this task to solr to do on its own instead of forcing it.
2016-08-19 15:30:53 +02:00
luccioman
46b8836548 Copy image resources contained in donation iframe.
Handle possible image loading errors.
2016-08-17 15:19:15 +02:00
reger
4c7a77662a eliminate dependency on file-extension in storeDocument but use supported mime-type
to also support handling of urls w/o corresponding file-extension.
For this refactor use of document.getParserObject() to always return a Parser (for clean logic)
and define/move the scraperObject as local var of AbstractParser.
Adjust related calls to getParserObject (where actually a scraperObject is wanted).
Additionally skip appending url token to parsed text for dht metadata entries
(by default returned as result by rwi index).
2016-08-14 03:53:16 +02:00
reger
ebde21079a refactor xlsParser to include Excel file attribute (like author) in parser result doc.
Similar to ppt and doc parser, completing a TODO in xlsParser.
2016-08-13 23:46:36 +02:00
luccioman
744c9a2615 Opensearch desc : handle https protocol url with default port (443)
This completes modifications made for mantis 669
(http://mantis.tokeek.de/view.php?id=669)
2016-08-12 12:18:26 +02:00
luccioman
b9c28893ee Merged master to 'heroku' branch. 2016-08-10 11:03:01 +02:00
Michael Peter Christen
103a8348b3 fix for NPE and small performance enhancement 2016-08-10 06:48:08 +02:00
reger
2910fe35c1 add missing scheduler calc of next exec_date (call of calculateAPIScheduler)
- after last_exec_date is altered, next_exec_date should be recalculated
- makes the recalculation of next_exec in advance (without api call surely made) in Switchboard.schedulerJob() obsolete
Slightly modify next_exec calc. on missed event to now+schedule_time (from fixed 10min)
2016-08-09 03:03:04 +02:00
reger
70d47ae38a keep scheduler selection by repeat entry from 07311020d4
to allow exec schedule on actual exec event.
Iterate on exec date (of advantage after interruption/shutdown) to schedule
older or missed events first.
2016-08-08 02:19:48 +02:00
reger
7c3f932e5d revert due to conflict with double count recording by scheduler / servlet by the commit under normal operation (no shutdown) 2016-08-08 01:57:31 +02:00
reger
07311020d4 postpone apicall exec date init until actual call
fix for http://mantis.tokeek.de/view.php?id=677
The difference is on scheduling a large number of rss feeds and loading 
is not finished before shutdown of YaCy. The change makes sure not already
loaded RSS will be loaded by the scheduler on next startup.
2016-08-07 05:08:55 +02:00
reger
5e335b32da fix Blacklist.contains() matching path pattern to string
similar to 5e9e871192
+ add proof testcase
2016-08-04 01:12:49 +02:00
reger
5e9e871192 fix Blacklist.remove by using pattern.toString to find pattern to remove,
parameter String path did never equal Pattern.
+ delete unused removeAll, as it does not persist changes after restart
2016-08-03 02:13:26 +02:00
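The fix in 5e9e871192 rests on a general Java fact: a String is never equal to a java.util.regex.Pattern, so looking up a compiled pattern by its source text has to compare against pattern.toString(). A minimal sketch of that idea follows. The class and method names are hypothetical and are not taken from YaCy's Blacklist code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not YaCy's Blacklist implementation.
class PatternRemovalSketch {

    /** Removes every pattern whose source text equals path; returns true if any was removed. */
    static boolean removeByString(List<Pattern> patterns, String path) {
        // Pattern.toString() returns the regular expression the pattern was compiled from.
        return patterns.removeIf(p -> p.toString().equals(path));
    }

    public static void main(String[] args) {
        List<Pattern> patterns = new ArrayList<>();
        patterns.add(Pattern.compile("/ads/.*"));

        // Passing the String directly compares String with Pattern, which is never equal.
        boolean removedDirectly = patterns.remove((Object) "/ads/.*");
        System.out.println("removed by String argument: " + removedDirectly);

        // Comparing against the pattern's source text finds the entry.
        boolean removedByText = removeByString(patterns, "/ads/.*");
        System.out.println("removed by pattern text: " + removedByText);
    }
}
```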
reger
1843ea7e69 on Blacklist.add pattern to source file also update internal entry maps
as in Blacklist.add(blacklistType) to make entry effective w/o restart
fix for http://mantis.tokeek.de/view.php?id=676
2016-08-02 02:41:03 +02:00
reger
bf6ce33da3 Correct use of _htDocsPath config in YaCyDefaultServlet to use servlet config variable
+ add some javadoc and remove a not useful static declaration
2016-07-31 23:16:24 +02:00
luccioman
480027ec98 Merge remote-tracking branch 'origin/master' into heroku_experiments 2016-07-28 02:29:40 +02:00
reger
fcad2d0744 add uses of config constant INDEX_RECEIVE_ALLOW 2016-07-27 02:16:20 +02:00
reger
226f81cfcf declare poison pill url MultiProtocolURL() as protected to make sure not
used from outside.
After double checking use of poison url revert path init from commit
f8632ad292
2016-07-23 20:03:13 +02:00
reger
f8632ad292 prevent string index out of bounds MultiProtocolURL.getPaths
as path may be an empty string
+ init path to "" also in init for poison url (to guarantee success for 
all existing uses of path w/o check for null)
2016-07-23 19:18:23 +02:00
reger
35a7d57260 update lucenematchversion to current (5.2.0 -> 5.5.0)
there should be no need for reindex by the update
2016-07-23 18:36:43 +02:00
reger
9b07bbf955 deprecate newurl(), not used and already replaced
instead of making it handle all the supported protocols
2016-07-21 02:14:35 +02:00
luccioman
47d486298f Merged changes from master. 2016-07-20 00:37:31 +02:00
reger
774b3906a9 fix GenericFormatter.parse ("time","timeoffset")
change: UTC offset internally expected in minutes
2016-07-19 02:57:41 +02:00
reger
27163af0e1 improve detection of referenced links by taking http and https link protocol
into account
+ correct query start detection of commit f89d4eb51d
2016-07-17 23:42:25 +02:00
reger
f89d4eb51d fix MultiProtocolURL init (assign of host) for urls with '/' in query part
+ add to test case
2016-07-17 04:17:01 +02:00
reger
87fcfc6d78 Adjusted hash computation and toNormalform for file:// protocol to deliver
same hash same file on Windows filesystem path with forward- and backslash in path.
Background see http://mantis.tokeek.de/view.php?id=671
+Test case
2016-07-16 01:59:09 +02:00
luccioman
d6bf90803f Merged from main master branch. 2016-07-12 09:05:31 +02:00
luccioman
9b9c112263 Handle more properly local port configuration by system property
And prefixed property with "net.yacy" to avoid ambiguity.
2016-07-12 01:53:01 +02:00
reger
3811184abd fix GSA servlet clientIP retrieval 2016-07-09 23:39:43 +02:00
reger
7ab41d4ff1 use directory's original lastmodified date in file- & smbloader in response 2016-07-09 19:55:47 +02:00
reger
708bcbb042 one more replacement to use cached hosthash vs. calculated 2016-07-07 02:50:57 +02:00
luccioman
b57a06d88e Let Heroku decide which http port to use 2016-07-06 22:14:40 +02:00
reger
22db449f2a to prevent crawler to concurrently access and alter same crawl queue
after restart, put hosthash in queue's filename (which is used as primary 
key for crawl queue. Hint: initial hosthash from url and recalculated hosthash 
from just hostname:port are not the same. 
fixes http://mantis.tokeek.de/view.php?id=668 (partially)
2016-07-05 23:22:35 +02:00
luccioman
893a40995a Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-07-04 21:24:40 +02:00
Orbiter
50c5ddf1a1 Merge pull request #56 from luccioman/LibreJS
LibreJS compliance : YaCy JavaScript license information
2016-07-04 21:07:11 +02:00
Michael Peter Christen
7466d390b2 small refactoring + do not accept too old peers during bootstrap 2016-07-04 11:02:15 +02:00
luccioman
6e96c7341a Merge remote-tracking branch 'origin/master'
Conflicts:
	htroot/Load_MediawikiWiki.java
	htroot/Load_PHPBB3.java
	htroot/ViewImage.java
2016-07-03 18:59:00 +02:00
reger
8d58a48029 remove wrong log line in CrawlSwitchboard
+ don't allow CrawlSwitchboard to exit application
making network param unused
2016-07-02 20:33:23 +02:00
reger
5aaa057c65 ignore empty input lines in FileUtils.getListArray() to poka-yoke blacklist read.
equalizes behavior with getListString()
improves: case where blacklist file contained an undesired empty line, not
fixed by blacklist-cleaner.
2016-06-28 23:44:28 +02:00
reger
41c36ffd75 exclude rejected results from result count
(by using the resultcontainer.size instead of input docList.size)
skip waiting for write-search-result-to-local-index
  (by removing the Thread.join - which will bring a small performance increase)
2016-06-26 06:46:26 +02:00
reger
d4da4805a8 internal wiki code, require header line to start with markup
(to allow something like  "one=two"  as text)
+ incl. test case
2016-06-25 02:46:44 +02:00
reger
e952e355a2 have Translator servlet adhoc apply added translation by translating a single file
+ fix NPE in Translator, coming from translation read by TranslatorXliff 
  which allows null content for untranslated keys
2016-06-14 22:14:46 +02:00
reger
b119ff65be clean out not used Switchboard variables
counter indexedPages, const xstackCrawlSlots
2016-06-14 01:50:32 +02:00
reger
223071337b Translator to take caution of word boundaries to identify text portion to
be translated. To avoid key="TEST" sourcetext="this is a myTESTcase for it"
translation of partial terms/words.
Add check of word boundary before and after sourcetext (incl. taking care
of the current practice of delimiting keys by > <)
+ add test case
2016-06-10 01:14:19 +02:00
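The word-boundary check described in 223071337b can be sketched with regular-expression lookarounds. This is an illustration under assumptions, not the Translator code: the class and method names are hypothetical, and \w here is Java's default ASCII word class. Lookarounds are used instead of \b so that a key framed by > and < still matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of whole-word replacement, not YaCy's Translator.
class WordBoundaryReplaceSketch {

    /** Replaces key with translation only where key is not part of a longer word. */
    static String translateWholeWords(String source, String key, String translation) {
        // (?<!\w) and (?!\w) require that no word character touches the key on either side.
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(key) + "(?!\\w)");
        // quoteReplacement keeps '$' and '\' in the translation literal.
        return p.matcher(source).replaceAll(Matcher.quoteReplacement(translation));
    }

    public static void main(String[] args) {
        // The partial term inside "myTESTcase" is left untouched.
        System.out.println(translateWholeWords("this is a myTESTcase for it", "TEST", "Pruefung"));
        // A key delimited by > < is still translated.
        System.out.println(translateWholeWords(">TEST<", "TEST", "Pruefung"));
    }
}
```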
luccioman
009657791e Merge remote-tracking branch 'origin/master' into LibreJS 2016-06-09 14:44:51 +02:00
luccioman
a73c9327a5 JavaScript License fixes for LibreJS compatibility 2016-06-08 23:16:10 +02:00
reger
0c40401d28 fix MessageBoard test for null data 2016-06-07 23:34:42 +02:00
reger
5b22c63030 Adjust TranslatorXliff to load default 1st and merge downloaded or modified local translation.
process 1. load default from locales/*.* 
        2. load and merge(overwrite) from DATA/LOCALE/*.* (can be partial translation as it is merged)
- include all entries from DATA/LOCAL to be edited in Translator servlet
  and save just modifications (instead of full list) to DATA/LOCALE

This shall make it easy to share modifications.
2016-06-05 23:01:45 +02:00
reger
a2e0f00456 optimize Translator
- translateFilesRecursive: load translation once (reduce io), return true on complete success
  - remove resulting unused translateFiles() variant
- translate: use StringBuilder parameter (skip toString conversion)
- remove not needed static declaration
- upd some javadoc
2016-06-05 03:57:08 +02:00
reger
a6ba1faa80 introduce a translation edit servlet Translator_p.html YaCy's UI text translation
This is the 1st rudimentary approach to support the translation utilities.
It allows currently to edit untranslated text and save it in a local translation file
in the DATA/LOCALE directory.
+ refactor Translator (less static's) to leverage on class overrides and support garbage collection for this 1 time routine
+ adjust TranslatorXliff to check for local translations in DATA/LOCALE,
  this includes storing manually downloaded translation files in DATA as well 
  (to keep default untouched)
+ on 1st call of Translator_p a master translation file is generated, checking
the supported languages for missing translation text (later this masterfile is planned to be part of the distribution, to harmonize translation key text between the languages)
Outlook: the local modifications (possibly as translation fragments instead of complete file) to be shared with maintainer using xliff features.
2016-06-03 01:46:30 +02:00
reger
b3c9041f79 remove with localHostNames redundant (but unused) publicIPv4HostNames and publicIPv6HostNames
to free unused resources
2016-06-02 01:42:15 +02:00
reger
bd8f7c11f5 Use transparent addToCrawler in AutoSearch instead of addToIndex
This would likely also be of advantage for RSS import/schedule as
following bug-reports suggest
http://mantis.tokeek.de/view.php?id=569
http://mantis.tokeek.de/view.php?id=655
2016-06-01 01:14:22 +02:00
reger
f23d8ab47b fix 2 more servlet RuntimeException in intranet mode thrown due to seed.getIP()
returning null in intranet mode (in servlets: ConfigSearchBox, Load_PHPBB3)
+remove unused (const &empty;) seed.IPTYPE
2016-05-29 20:35:57 +02:00
reger
bb0076c3dd fix: assure close inputstream in TranslatorXliff after reading xlf file
by using try-with-resources block
2016-05-29 01:25:47 +02:00
reger
6384b7d82e fix NPE in Load_MediawikiWiki servlet in intranet mode
- in intranet mode getip returns null causing a NPE
  - adjust starturl (which was set to http://localip/repository) which is never the start url for the Mediawiki
+ correct javadoc for seed.getIP()
2016-05-27 03:10:25 +02:00
Michael Peter Christen
596b5dfa59 add the JRE version in the seed. Purpose: identify if it is possible to
migrate to new JRE version
2016-05-24 23:11:59 +02:00
reger
4cc38e979d add InputStream close after reading input file (Vocabulary_p servlet) 2016-05-24 00:26:28 +02:00
reger
6bf9c55584 adjust Solr select servlet to latest bugfix for boostquery (bq param)
to split query into multiple parameter on line separator in input query.
e.g. split "crawldepth_i_0^10.0 \n crawldepth_i:1^5.0"
but allow   "url_file_ext_s:jpg OR url_file_ext_s:png"  to be unsplit
2016-05-22 22:43:56 +02:00
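The splitting rule in 6bf9c55584 (one boost query per input line, while a single-line OR expression stays whole) can be sketched as below. The class and method names are hypothetical and the servlet's real parameter handling is not shown; the input strings are the ones quoted in the commit message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the SolrSelectServlet code.
class BoostQuerySplitSketch {

    /** Splits a multi-line bq input into one boost query per non-empty line. */
    static String[] splitBoostQueries(String bq) {
        List<String> parts = new ArrayList<>();
        // \R matches any line break sequence (\n, \r\n, \r, ...).
        for (String line : bq.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) parts.add(trimmed);
        }
        return parts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Two lines become two separate boost queries.
        System.out.println(Arrays.toString(
                splitBoostQueries("crawldepth_i_0^10.0 \n crawldepth_i:1^5.0")));
        // A single line with OR is passed on as one boost query.
        System.out.println(Arrays.toString(
                splitBoostQueries("url_file_ext_s:jpg OR url_file_ext_s:png")));
    }
}
```

Passing the resulting array as repeated bq parameters matches the intent of 433217b33e, which reports that concatenating several boost queries into one string did not work.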
Burkhard
9a18e2297b Merge pull request #51 from JeremyRand/multiple-boost-query
Fix multiple boost queries
2016-05-22 22:24:04 +02:00
reger
f0d7b93372 make use and activate autodetect charset in Vocabulary input from file
+ revert mistake of empty cn.lng
2016-05-22 05:38:26 +02:00
JeremyRand
433217b33e Properly support multiple Boost Queries. (Previous code was broken because it concatenated multiple Boost Queries together rather than passing Solr an array.) 2016-05-20 20:17:51 -05:00
JeremyRand
58824dfa6c Refactor escaping in config file read/write code. Now it uses Apache Commons StringUtils instead of RegEx. 2016-05-20 20:17:51 -05:00
reger
9e94989237 upd to PDFBox 2.0.1 2016-05-20 23:12:16 +02:00
reger
d0a571bed2 del cytag trail for own index.html (save resource not used by default) 2016-05-19 01:59:00 +02:00
reger
de46879637 fix SeedDB.get(byte[]) hash string compare (for returning own seed shortcut) 2016-05-17 02:07:49 +02:00
reger
24b0fa2a38 extend snapshot Html2Image.pdf2image to use PDFBox image export capability
if no external tool installed (and for Win)
Resulting jpg are not always perfect (if graphic included) but imho sufficient.
2016-05-16 02:13:33 +02:00
reger
eb2a00b1d8 fix NPE on missing crawldepth_i 2016-05-15 01:26:38 +02:00
reger
efb9f1a8b7 save resource for unused blacklistFiles map 2016-05-12 00:13:57 +02:00
reger
5f113be760 cleanup connectPeer & yacyVersion.latestRelease usage
obsolete since 
527b3decde
2016-05-06 21:05:15 +02:00
reger
7097dcbdbd cleanup hack for partial Solr update on multivalued datefields
has been fixed in Solr http://issues.apache.org/jira/browse/SOLR-8050
2016-05-06 02:47:04 +02:00
reger
f10ea3c155 clean-out unused SwitchboardConstants 2016-05-05 00:55:22 +02:00
reger
ef24593347 delete obsolete SEARCHRESULT busythread constants
not used since 29.05.2013 18:27:27
0c1a018bbd
2016-05-04 01:30:10 +02:00
reger
125b5e26a5 apply bugfix for ChartPlotter from Pullreq 42
https://github.com/yacy/yacy_search_server/pull/42
thanks to otteresk (https://github.com/otteresk)
2016-05-03 03:06:06 +02:00
reger
06ce9ae711 prevent "unchecked conversion" compiler message
+ include "translate" property in xlf "trans-unit" export
2016-05-01 02:22:05 +02:00
reger
b4a576dbdf exclude unused protocol param "duetime"
(receiver interprets param "time" only)
2016-04-25 01:57:33 +02:00
reger
3bd6ae8d8b keep addon/Notepad++ keyword marker on lng export
(length of remarks divider line)
+ harmonize status_p.inc lng text
2016-04-21 00:51:08 +02:00
reger
16837d60c7 fix version in locale version file
(it's compared to full version)
2016-04-17 22:54:28 +02:00
reger
0fb01e429e fix migration, account for ssl port in config (for auto-disable https) 2016-04-17 04:42:05 +02:00
reger
7be1c7a05a fix logger name 2016-04-17 03:20:14 +02:00
reger
1d940e5a94 upd commons-compress 1.11 2016-04-16 23:31:03 +02:00
reger
7789c32c82 delete crawl queue on init exception
(happens occasionally on path name violation and will never get resolved)
2016-04-16 00:22:48 +02:00
reger
f781b9dd47 revert call condition f. migration.installSkins
(a bug introduced in fb8ae14b21 , 
see comment on that commit )
2016-04-14 22:14:32 +02:00
reger
3adb670f44 remove never used Domains.myHostNames set 2016-04-14 02:54:41 +02:00
reger
6ecc180299 fix rwi doubledom return best (highest) ranking 2016-04-12 03:55:43 +02:00
reger
2343e3f1cd keep and update existing xlf translation master instead of create new
in utility CreateTranslationMasters
+ small fixes in lng's
2016-04-09 23:25:05 +02:00
reger
a1935f485f Added utility class CreateTranslationMasters to create a language independent
translation master as source to harmonize individual translation files
Included a main to create masters in YaCy and xliff format for testing

+ restrict TranslatorXliff to use only entries with State=translated

P.S. used https://open-language-tools.java.net/editor/about-xliff-editor.html to
experiment with xlf output (haven't a Pootle avail.)
2016-04-05 01:57:32 +02:00
reger
acaf51b296 keep ConfigLanguage_p as 1st entry in exported translation file
+ rem untranslated text & some typo fixes in several translations
(considering to create a translation master file to harmonize entries)
2016-04-04 02:56:19 +02:00
reger
61c5b6b403 fix empty drop down list in ConfigLanguage after wrong/empty download
+ add xliff translated attribute
+ append japanese lng name
2016-03-31 01:51:25 +02:00
reger
4eddabee42 translate Network History screen -> de
+ remove leftover debug line
2016-03-30 01:09:13 +02:00
reger
90c79014ae remove unused translator routine which also doesn't handle rel path input
+ correct some language file match issues
2016-03-29 21:31:02 +02:00
reger
902e79e261 Introduce a TranslatorXliff which can read/write xliff from/to internal translation map.
This eases up suggested initiatives from http://mantis.tokeek.de/view.php?id=649
Allows longer term also to store translation maps for the htroot files
in standardized/reusable xliff format ( http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html ).
+ added test case creating and comparing xliff file with internal custom prop file.
(currently the introduced class is not used in core code)
2016-03-28 23:26:30 +02:00
reger
d9adc2c255 load handler for Transparent Proxy on startup only if feature is activated
to save the resources and keep handler chain small if the feature is not used.
+add a warning message on settingsack_p page to restart on first activation
2016-03-25 05:26:48 +01:00
reger
ec24a0c85a add test case for optimized toTokens() 2016-03-24 19:26:38 +01:00
reger
cada24f918 adjust utility ListNonTranslatedFiles for path compare on windows
(backslash replace)
2016-03-20 23:46:39 +01:00
reger
fb8ae14b21 make migration version safe 2016-03-20 03:34:28 +01:00
reger
258cd41577 reduce logging (EmbeddedSolrConnector.query)
mainly to reduce the frequent metadata checks like
> EmbeddedSolrConnector.query QUERY: q={!cache=false raw f=id}xXxXxX&rows=1&start=0&fl=id,load_date_dt
(p.s. direct servlet queries logged via AccessTracker.addToDump)
2016-03-14 22:32:06 +01:00
reger
6783ef5540 move example code SearchClient out of yacycore package
to example directory
2016-03-14 02:22:06 +01:00
Michael Peter Christen
b89465d952 0N - basic dump upload servlet infrastructure, to share index dumps
within an experimental new sharing model
2016-03-11 18:12:13 +01:00
Michael Peter Christen
f12a900f3e harmonization of http post of files for one and several files - this had
been handled differently - and wrong for several files. also: base64-encoding
for gzipped push files because our data structures currently only
support ASCII POST pushes.
2016-03-11 08:59:33 +01:00
Michael Peter Christen
849ab671a9 0n: modified the p2p bootstrapping process - rules had been too tight and
did not support the re-start of a network with just one principal peer.
2016-03-11 08:54:42 +01:00
reger
764f5100f0 fix delete of temp file after odt & ooxml parser
Close zipfile after parsing
2016-03-04 23:05:55 +01:00
reger
379e9b330d use supplied url port to get robots.txt in crawlers hostqueue 2016-03-02 00:12:34 +01:00
reger
58a959403d fix mixed logfactory in UrlProxyServlet,
Class doesn't use functions of declared ancestor, change to extend on httpservlet
2016-02-27 03:44:43 +01:00
Michael Peter Christen
2494a820c7 0N - added recording of dump exports if given time frame is not negative 2016-02-24 15:13:20 +01:00
Michael Peter Christen
ef2cc4f690 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2016-02-24 11:19:32 +01:00
Michael Peter Christen
a6bf0b1649 0N - added option to generate index export files for a specific number
of minutes in the past and reverted latest change. The export file dump
will now contain four data elements: f - first date of index entry write
date, l - last date of index write date, n - now-date of index dump
time, c - count of numbers inside the dump. '0N' denotes a series of
changes which will lead to the opportunity to exchange index data dumps
in a way that is needed to integrate ZeroNet index data. This will be
based on index dump sharing; that causes this commit.
2016-02-23 18:56:20 +01:00
reger
6d56beaed8 fix assertion exception in toString of MultiProtocolURL
toString of AnchorURL and MultiProtocolURL are identical code
(no need to override or to protect call to parent)

as reported in https://github.com/yacy/yacy_search_server/issues/43
2016-02-21 00:23:00 +01:00
reger
42a7bdb2af fix SolrSelectServlet authentication to default to true 2016-02-20 22:30:15 +01:00
reger
dbb28bb4f3 del unused statistic parameter (from status servlet) 2016-02-17 22:47:03 +01:00
reger
06d0e2aeb9 result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
- Above brought up that parser start url parameter, declared as AnchorURL uses only methods of parent object DigestURL (changed parameter declaration accordingly).
2016-02-16 02:05:58 +01:00
reger
caf9e98f09 put metadata dc_publisher in corresponding schema field 2016-02-14 21:13:25 +01:00
reger
38e2b054d4 remove servlet classloader internal cache map (to save the resources, cache hits marginal)
- DefaultServlet includes already a class cache "templateMethodCache" which is emptied
  on low mem status
- avoids a classloader cache that gets no hits but over time holds all (used) servlet classes
2016-02-12 01:20:03 +01:00
luc
3f338777f7 Also check and index eventual icon url information from metadata. 2016-02-11 09:33:20 +01:00
luc
9f712146df Display icons in ViewFile "links" mode. 2016-02-10 10:08:07 +01:00
luc
26f1ead57c Created ViewFavicon class specialized in favicon viewing.
Main image processing is now in ImageViewer, used by both ViewImage and
ViewFavicon.

Fixed URIMetadataNode.getFavicon to use non-standard icons with no size
as fallback.
2016-02-09 20:46:44 +01:00
reger
6f0b073bf3 override detected language (statistic langdetect) only with TLD determined
language if langdetect probability is not high.
+ additionally truncate zh-cn / zh-tw returned by langdetect to 2 char ISO639-1 zh
used by YaCy
2016-02-07 21:16:22 +01:00
reger
b65e2b527d include use of condenser's content text for language detection.
Language identification may show poor performance on documents with short or no
title but clear lang indication in text content. Using content text too
improves lang detection.
+ remove double caching of text in Identificator
2016-02-07 01:52:32 +01:00
luc
07222b3e1a Added favicon url transmission in RWI chunks. 2016-02-05 17:05:36 +01:00
luc
480772c070 Fixed json search results from commit "Improved URLLicence reliability" 2016-02-05 15:23:29 +01:00
reger
937fbb0b9f correct isHidden() for smb from last commit 2016-02-04 19:20:27 +01:00
reger
535d4bf75f respect hidden attribute for file and smb directory listing
(hidden directories are not listed, effects crawling of local file system)
2016-02-04 19:16:00 +01:00
luc
3cc5619d93 Improved HTML icons indexing and rendering in search results.
See http://mantis.tokeek.de/view.php?id=629
2016-02-02 09:57:54 +01:00
luc
edef6cd0dc Merge branch 'master' of https://github.com/yacy/yacy_search_server 2016-02-02 07:58:12 +01:00
reger
c28142095a add findClass() to servlet class loader (used in YaCyDefaultServlet)
In the 2 cases where servlet calls servlet the jvm classloader chain is
invoked and servlet class loaded by jvm loader (successful while requiring 
htroot in system classpath). This patch uses the standard override design
for loaders to handle these cases (making it no longer crucial to have htroot
in system classpath, as this classLoader is mainly used for servlets and
looks in this case for the class in the configured path).
+ As the default classloader is parallelcapable we should register this too.
2016-02-02 03:44:01 +01:00
luc
f7b854465b Merge branch 'master' of https://github.com/yacy/yacy_search_server 2016-01-29 08:20:10 +01:00
reger
a6617ad887 expand initRemoteCrawler() to terminate worker threads if called to deactivate
remote crawl.
On startup we save the resources for remote crawler if disabled. Once started
threads are running idle after disable remote crawl. Now threads are terminated
to save the resources also while disabling during runtime.
+ remove empty class Channels
2016-01-28 23:14:09 +01:00
reger
2048b7e057 support scraping start-/enddate from html tag with property "datetime"
This may be used in html5 <time> tag (which we don't explicitly support yet for date in content scraping).
2016-01-26 21:27:44 +01:00
reger
900d4584ba complete resource cleanup of lists in contentscraper's close() 2016-01-25 23:54:20 +01:00
luc
aa60ad1dbc Merge branch 'master' of https://github.com/yacy/yacy_search_server 2016-01-21 08:12:22 +01:00
reger
1f18653de0 pass parsed swf content through htmlscraper
Swf may contain subset of html tags which shoul'd appear as text.
Especially <font> tag may totally screw up metadata servlet if not filtered out.
2016-01-21 02:55:05 +01:00
reger
18ecf57792 add support of compressed swf to swfParser
from JavaSWF2 (source compatible to WebCat).
Moved swf file signature check to parser
Changed use of synced vector to list swf InStream
2016-01-20 00:58:29 +01:00
sixcooler
5cb7ba0dc4 fix for connections not getting closed to get favicon.ico during search 2016-01-19 20:57:22 +01:00
luc
ef83e34b8a Merge branch 'master' of https://github.com/yacy/yacy_search_server 2016-01-19 08:06:49 +01:00
reger
ed3e16e092 apply remote result count config value to Bookmark Autosearch
+ prepare to make the widely unused Bookmark feature optional
2016-01-15 02:10:10 +01:00
Ryszard Goń
a98c395023 Add the Autocrawl thread 2016-01-14 00:50:23 +01:00
Ryszard Goń
1728cd30c6 Create autocrawl profiles 2016-01-12 16:28:34 +01:00
luc
41767a01c2 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2016-01-11 23:08:22 +01:00
reger
ff27824964 fix swfParser reading file signature
before passing to library (current version expects data w/o signature)
2016-01-10 01:16:31 +01:00
luc
7aa1a29e33 Return more accurate HTTP status 400 with detail message when some error
occurs on ViewImage :
 - missing required parameters
 - url licence invalid
2016-01-08 23:18:13 +01:00
luc
bd9dc2f32b Corrected NullPointerException cases occurring in YJsonResponseWriter
when no description is available.
2016-01-08 20:46:02 +01:00
luc
0076f9f97d Updated documented sample url 2016-01-08 20:43:49 +01:00
luc
cfdbc2b487 Improved URLLicence reliability for use by concurrent non-authorized
users.
Removed URLLicence generation when unnecessary (authorized users)
2016-01-08 20:42:57 +01:00
reger
c91e712178 further refactor using standard java / (one) utf-8 charset variable
extending initiative of commit 9a25751850
2016-01-07 16:17:37 +01:00
luc
571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
charset names.
2016-01-05 23:37:05 +01:00
reger
1af0e9ef74 remove workaround for Solr bug regarding multivalued date fields
fixed in 5.4.0
http://issues.apache.org/jira/browse/SOLR-8050
2016-01-03 01:11:27 +01:00
sixcooler
5a35f9383a bump to solr/lucene 5.4.0 2016-01-02 21:07:50 +01:00
reger
a58d34a4e8 check error URL cache before adding errorDoc to index
- del obsolete related switchboardconstant
2016-01-02 05:03:57 +01:00
reger
e9539b1086 reintroduce special handling of file upload multipart/form-data from HTTPDemon.parseMultipart
- add filename to parameter fieldname
- add filecontent to special parameter fieldname$file
(some servlets use this $file parameter)

fix for http://mantis.tokeek.de/view.php?id=542
2015-12-31 03:04:13 +01:00
reger
cd26717ba2 fix low memory status hint (dht-in disabled)
http://mantis.tokeek.de/view.php?id=619
2015-12-29 20:38:45 +01:00
reger
a5faf73afa remove obsolete yacy.init entries interaction.*
(related to removed triplestore)
2015-12-29 15:41:19 +01:00
sixcooler
dce1cb65c4 Merge remote-tracking branch 'choose_remote_name/master' 2015-12-28 23:20:42 +01:00
reger
46ac0867ff fix poison mediawikiimporter output queue also after ExecutionException
in worker thread.
The importer's writer needs a poison to close the file. On exception (e.g. OOM)
add a poison marker in outer most try/catch to assure output queue will terminate
in this condition too (and closes+renames the surrogate/in/xxx.prt file)
2015-12-28 02:32:00 +01:00
reger
a7591d3ed0 fix mediawikiimporter number format exception on coordinate parsing
handle incomplete metadata like "NS=43/50//N".
For other {expr ... } type entries a try catch added
2015-12-27 01:59:15 +01:00
reger
9da1712a31 increase http header EXPIRES for css and images in DefaultServlet
to increase browser cache hits for not changing content
2015-12-26 17:35:46 +01:00
reger
6d54eb3d36 skip loading document on crawl start for YMark bookmarks
by adding a constructor giving the already loaded document as parameter.
2015-12-26 01:15:07 +01:00
reger
80e2c82249 fix NPE on empty blog importfile parameter 2015-12-24 02:00:45 +01:00
reger
e84d94f8ca fix mime table for ms office / open office documents
(causing wrong parser detect in intranet mode)
2015-12-22 17:48:24 +01:00
reger
45b9bd8403 adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters,
and feeding hyperlinks to webgraph processing.
2015-12-21 04:42:26 +01:00
reger
d5fd031449 fix reading of ippattern config array in URLProxy 2015-12-20 15:51:54 +01:00
reger
b7e8358645 make use of header.getContentType where possible (mime is normalized afterwards)
otherwise use header.mime() differentiated in prev. commit.
2015-12-20 15:49:24 +01:00
reger
7a8c077838 fix HeaderFramework.mime() to strip charset parameter.
Differentiate mime() and getContentType() which gives the raw header field.
This improves parser detection if charsets are included in http content-type field.
2015-12-20 06:44:16 +01:00
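The distinction drawn in 7a8c077838 (raw Content-Type header field versus bare mime type) can be sketched as follows. The class name, method name and fallback argument are hypothetical, and the lower-casing step is an assumption about normalization, not a statement about HeaderFramework.

```java
import java.util.Locale;

// Hypothetical illustration, not YaCy's HeaderFramework.
class MimeTypeSketch {

    /** Returns the media type of a raw Content-Type value without parameters such as charset. */
    static String mimeOf(String contentType, String fallback) {
        if (contentType == null) return fallback;
        // Parameters (charset, boundary, ...) follow the first ';'.
        int semicolon = contentType.indexOf(';');
        String mime = (semicolon < 0 ? contentType : contentType.substring(0, semicolon)).trim();
        // Assumption: media types are compared case-insensitively, so lower-case them.
        return mime.isEmpty() ? fallback : mime.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The charset parameter is stripped, so a parser lookup by mime type can match.
        System.out.println(mimeOf("text/html; charset=UTF-8", "application/octet-stream"));
    }
}
```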
reger
b4b6910d60 fix (todo): correct doc.id of remote search result if no match with newly
calculated doc hash if different.
Testing showed that in some cases delivered url doesn't match the local
calculated hash. In this case replace doc.id (and host_id_s) with calculation
from url.
2015-12-20 02:10:49 +01:00
reger
dec3e6ad96 fix: adjust urlstub for mailto links
(skip protocol)
2015-12-19 20:13:33 +01:00
reger
cb83e65f89 drop returning document language "en" if unknown (fix todo)
which also harmonizes handling of query.modifier for rwi and solr results
(to result must match a given language filter)
2015-12-19 01:42:35 +01:00
reger
0c5548a7ff fix (todo) remove redundant holding of email link nameproperty in parser document 2015-12-18 02:35:44 +01:00
reger
71c416f383 show mailto links in ViewFile.html linklist 2015-12-18 01:11:55 +01:00
reger
6b7c10cef8 fix dc:date in mediawikiimporter/document.writexml to use lastmodified 2015-12-17 02:53:10 +01:00
reger
14803d58cd let html scraper accept html5 <link rel="icon"> for favicon links 2015-12-17 00:36:08 +01:00
luc
b4cdacee76 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-12-16 03:26:06 +01:00
luc
ba0a293f5c Corrected another case of
"org.apache.lucene.store.AlreadyClosedException" occurring when
SearchEvent.cleanup() was called while committing local solr index.
2015-12-16 03:25:07 +01:00
reger
4d2b934487 prevent mailto links getting into parser result document's in/outbound link collection
by checking mailto scheme early.
- fix upper case mailto protocol assignment
- add test case for getProtocol
2015-12-16 03:01:17 +01:00
luc
8c4ab9c76b Added an option to eventually limit size of remote solr documents put to
local index. See mantis #626.
2015-12-16 02:20:03 +01:00
luc
a2c08402af Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-12-15 23:30:30 +01:00
luc
70595d05d0 Modified MemoryControl.main() test to properly end for better results
displaying.
2015-12-14 23:49:28 +01:00
sixcooler
1be67d9ab6 CachedSolrConnector was replaced by ConcurrentUpdateSolrConnector years
ago - time to let it go
Commented out unused table of cache-objects
2015-12-14 21:33:27 +01:00
reger
28b8bc290a fix use of NETWORK_SEARCHVERIFY for rwi verification
was not used to set the searchevent parameter (done in SearchEventCache.getEvent)
- remove unused corresponding QueryParams.filterfailurls param.
2015-12-13 20:01:49 +01:00
reger
020630efd8 remove unused network scanner parameter from queryparameter
Search event is not using networkscanner 
(removed filterscannerfail param always init to false)
2015-12-13 02:50:08 +01:00
luc
ad5586f8f6 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-12-08 03:35:36 +01:00
luc
8ebefa4233 Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was
failing. Looks like it was broken since Commit
b43811d38c
2015-12-08 03:34:03 +01:00
luc
7736ee5a42 Updated MediawikiImporter main() : display usage in console and stop
properly without calling System.exit
2015-12-08 03:30:51 +01:00
reger
cdb8f3b10d make current ranking score value avail. to search interface / api
Update the result score result field with the result queue ranking value to reflect
the actual calculated/used score,
for rwi & solr stack results.
(calc. etc. is unchanged, it's just that result entry carries the latest val
as api retrieves the number from it)
2015-12-08 03:17:32 +01:00
luc
27d11f8671 Fixed isSolrDump function : PushBackInputStream was not unread when
returning false (for example with a WikiMedia dump).
2015-12-07 21:58:36 +01:00
Michael Peter Christen
135a123a77 less logging in new language detection 2015-12-03 00:39:15 +01:00
Michael Peter Christen
ef8cd80593 fix for npe 2015-12-03 00:33:13 +01:00
reger
0347bfa71f Apply collection query constraint/modifier to rwi result stack.
Collection is not available in pure rwi entries (but in local solr metadata)
But if user wishes to filter by query constraint also rwi shall adhere to this
(even if only rwi entries with parsed or solr received metadata may fit)
2015-12-02 22:57:59 +01:00
luc
2a67d2ba6f Corrected error management for unsupported image formats, parsing
errors, and unavailable resources : avoid logging too many Exceptions as
these errors easily occur when searching images.
2015-12-01 01:06:01 +01:00
Michael Peter Christen
d6e9834040 Merge branch 'master' of
https://github.com/Scarfmonster/yacy_search_server

# Conflicts:
#	.classpath
#	build.xml
2015-11-30 16:54:54 +01:00
Michael Peter Christen
d82d311995 Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	.classpath
2015-11-30 13:34:10 +01:00
reger
b5371ea8c1 read/init crawl queue in a thread
to speed-up YaCy start on large existing crawler queues
2015-11-29 05:19:39 +01:00
reger
1160b13172 remove unused md5 from ViewFile servlet params 2015-11-28 23:09:15 +01:00
reger
e163ea88f6 fix vsdParser (Visio) parser return statement
(final block un-necessary throw)
2015-11-28 02:43:38 +01:00
reger
b2c8bc0ae6 remove md5_s from default index fields
it is not assigned a value / not used
Due to above also excluded from transfer protocol.
2015-11-27 02:41:02 +01:00
luc
e40ae0943b - No max dimensions specified : render raw image data when source and
target image format are the same.
- Corrected scaling condition.
2015-11-26 09:30:43 +01:00
reger
90686a75a2 fix flux factor (additional crawl delay by access count) calculation 2015-11-25 01:34:41 +01:00
luc
4af27289e5 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-23 09:01:25 +01:00
reger
297fdb60d3 throw exception if crawler hostqueue can't create hostpath directory.
In rare cases hostname may not be a valid filesystem directory name,
which can't be created (e.g. containing '*' char). Prevent the crawl queue from
looping on this invalid entry by throwing a MalformedURLException.
2015-11-22 21:26:18 +01:00
luc
755efac17d Use same max file size when loading all resource bytes or opening stream
content
2015-11-20 19:35:39 +01:00
luc
bc6c79fc12 Corrected scaling function for non RGB images. 2015-11-20 14:35:36 +01:00
luc
1565559df8 Refactoring : extracted write InputStream method. 2015-11-20 09:42:24 +01:00
luc
f0478bb14d BMP and ICO image formats support : integrated /haraldk/TwelveMonkeys
imageio-bmp-3.2 library.

 - better BMP format flavours support
 - handle PNG encoded icons
 - handle transparency
 
Added some javadoc url references to .classpath
2015-11-20 09:38:16 +01:00
luc
07437986e7 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-20 08:15:24 +01:00
reger
97cc03ef6a start using a template for urlproxy header
It is included as iframe  /proxmsg/urlproxyheader.html
to allow full servlet functionality and flexibility to display some
index/meta data in future.
2015-11-20 01:49:56 +01:00
luc
f01d49c37a Process large or local file images dealing directly with content
InputStream.
2015-11-18 10:15:38 +01:00
luc
3c4c77099d If available, check content length before downloading. Check also
content length is not over Integer.MAX_VALUE.
2015-11-18 10:11:38 +01:00
luc
5bbb2e1730 Ensure resource is closed when reading a full file InputStream 2015-11-18 10:08:06 +01:00
luc
6291a57300 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-18 08:49:31 +01:00
reger
0d3c5b223e have psParser cleanup temp file 2015-11-17 23:45:29 +01:00
reger
7d0d19cb8e avoid File.deleteOnExit() on temp files
JVM registers each file in a list regardless of already deleted and never
cleans up the list during runtime.
This accumulates to a considerable amount of mem during large crawls and/or
long uptime.
To tackle this, all temp files are now created in a subdir of java.io.tmpdir 
and the jvm tmpdir property is set to this subdir, which is deleted by
code on shutdown.
Additionally let pdfParser use this tmp subdir too.
2015-11-17 22:27:07 +01:00
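The approach in the commit above (one private temp directory removed on shutdown instead of one `deleteOnExit()` registration per file) can be sketched like this. The sketch is an assumption-laden illustration: the class and method names are hypothetical, and it passes the directory explicitly to `Files.createTempFile` rather than changing the `java.io.tmpdir` property as the commit describes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TempDirSketch {

    /** Creates (if needed) a private subdirectory below the system temp dir. */
    public static Path createPrivateTempDir(String name) {
        try {
            Path dir = Paths.get(System.getProperty("java.io.tmpdir"), name);
            return Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temp file inside the given directory, without deleteOnExit(). */
    public static Path newTempFile(Path dir, String prefix, String suffix) {
        try {
            return Files.createTempFile(dir, prefix, suffix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes the directory and everything below it, children before parents. */
    public static void deleteRecursively(Path dir) {
        if (!Files.exists(dir)) return;
        try (Stream<Path> walk = Files.walk(dir)) {
            List<Path> paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            for (Path p : paths) Files.delete(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = createPrivateTempDir("sketch-tmp");
        // a single shutdown hook replaces one JVM list entry per temp file
        Runtime.getRuntime().addShutdownHook(new Thread(() -> deleteRecursively(tmp)));
        System.out.println("created " + newTempFile(tmp, "parser", ".tmp"));
    }
}
```

With this layout the memory held for cleanup stays constant, no matter how many temp files a long crawl creates.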
luc
bfe51001e3 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-17 08:30:32 +01:00
reger
02e4489a23 set tmpfile.deleteOnExit by default,
to make sure files are removed on shutdown.
2015-11-16 21:37:45 +01:00
reger
2985baaa01 Exclude repetitive protocol part in tokenized url
used as description if none is avail. from parser.
2015-11-16 01:06:20 +01:00
reger
ca3d26a401 harmonize wordsintitle & CollectionSchema.title_words_val calculation,
remove obsolete partial init of wordreference from urimetadata
2015-11-15 06:06:37 +01:00
reger
52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
- by direct using Set vs. List
- remove unneeded String[] getter
2015-11-13 01:48:28 +01:00
luc
49331dc523 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-12 08:21:56 +01:00
reger
47d70732f6 improve locale translator
- skip empty line
- robustness file section detection (space independent)
2015-11-11 00:57:51 +01:00
sixcooler
646afe9183 do not store subfield *_coordinate + make all num-fields being docvalues 2015-11-10 20:45:33 +01:00
sixcooler
194df613de not using 'location' as defaultfacetfield - since we removed it being
default.
2015-11-10 20:43:58 +01:00
sixcooler
d3b9349b6f simplification / speedup of GenerationMemoryStrategy 2015-11-10 20:39:46 +01:00
sixcooler
4a905ec134 fix to not let the AccessTracker-Log grow too much, but have enough data
to monitor.
(+gitignore-correction)
2015-11-10 20:27:17 +01:00
reger
20e18d79f8 harmonize document title for archive parsers 2015-11-10 01:29:13 +01:00
luc
f11b5e8309 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-09 08:13:12 +01:00
reger
112ae013f4 update bzip and bzip parser process,
to return one document for the file with combined parser results of the
containing file and registers it with supplied url and mime of the archive.
2015-11-07 19:13:18 +01:00
reger
e76a90837b update zip and tar parser process,
to return one document for the file with combined parser results of the
containing files.
2015-11-06 23:58:55 +01:00
luc
4e673ffc9a Ensure closing of InputStream even when an exception occurs. 2015-11-05 09:40:24 +01:00
luc
10696b53f7 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-05 08:26:52 +01:00
reger
8532565c7d optimize order of parsers to try
- start with a parser matching the remote supplied mime
2015-11-04 21:52:02 +01:00
reger
681889ae64 use current tar library for untar files
- remove old source copy
2015-11-04 02:57:00 +01:00
reger
5d71fc70e3 fix tarParser early exit on looping content
- adjust check of data available according to doc 
- return null on no recognized content (to not exit TextParser next parser try)
- use commons.compress directly
2015-11-03 22:14:14 +01:00
luc
bcc2e7cb5b Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-03 09:29:57 +01:00
reger
2fcf6f104c fix bzipParser recognition
- Bzip2Inputstream checks magic byte itself to identify bz2 (leave it in input)
- try to supply fitting mime for parsing bz2 content
2015-11-03 03:35:01 +01:00
luc
745e97a575 Merge branch 'master' of https://github.com/yacy/yacy_search_server 2015-11-02 08:10:11 +01:00
reger
a60b1fb6c2 differentiate api call getLocalPort() from getConfigInt() 2015-10-31 23:09:03 +01:00
reger
11f3666660 increase use of pre.defined CATCHALL_QUERY string 2015-10-31 19:44:31 +01:00
reger
a58ee49307 Optimize internal imagequery focus on using content_type to select images
(in favor of url file extension)
2015-10-31 19:18:46 +01:00
luc
fc3294382e Updated javadocs for warning on target encoding format potential errors. 2015-10-30 16:19:05 +01:00
luc
aa70ff4ff6 Corrected images alpha channel rendering 2015-10-30 05:18:16 +01:00
reger
d223cf0ae4 adjust MediaWiki importer geo coordinate calculation
- allow lat/long 0.xxx
- south / west assignment
include test class
2015-10-26 21:19:35 +01:00
reger
2b775d5be6 fix typo in WikiCode coordinate calculation 2015-10-25 19:38:42 +01:00
reger
bbe9df2bb3 fix MediawikiImporter for bz2 dump
skip reading bz2 file magic byte to identify bz2 format, as an inputstream reset would be required. Commons Compress reads and checks the magic bytes internally and throws an IOException if wrong, making the preread obsolete.
2015-10-25 03:06:15 +01:00
reger
c6687dd560 fix a system.out to log.fine
in bmpParser
2015-10-25 00:26:45 +02:00
reger
e53c6bbd51 fix init of peer flags
(remove hiding of ssl flag)
2015-10-24 19:36:33 +02:00
Michael Peter Christen
ac034db8bc Merge branch 'master' of https://github.com/luccioman/yacy_search_server
# Conflicts:
#	htroot/js/highslide/highslide.js
#	source/net/yacy/document/ImageParser.java
2015-10-24 11:22:35 +08:00
reger
826f14f37f fix unnecessary set null of peer flags, causing reread
remove obsolete version flags
2015-10-22 02:35:58 +02:00
luc
5902ce032e Corrected NullPointerException case when ImageIO reader is not found for
image format.
2015-10-19 14:11:26 +02:00
reger
c6495a5b62 add a log entry on parsing ajax crawling scheme snapshot
(prev. commit 9252e36aeb)
2015-10-18 06:19:12 +02:00
reger
9252e36aeb implement ajax crawling scheme for ajax sites which adhere to the proposed use of hash-bangs to provide html content
see freshly deprecated https://developers.google.com/webmasters/ajax-crawling/
Implementation improves parsing of the homepage (ajax page) which uses metatag "fragment" in header and parses supplied html snapshot instead of mostly empty ajax/scripted page.
Implementation supports also hash-bang urls (url with anchor starting with ! like  ...path#!hashfragment) but our crawler filters it
(use of hash-bang is controversially discussed and proposal is deprecated, makes no sense to adjust the crawler, but as long as it is used by some sites the minor change/improvement in htmlparser is good for some time).
Quick - how does it work
- if metatag fragment with content "!" is found
   - htmlparser tries to get content of html snapshot (using a different url)
   - htmlparser returns 2 documents (original url and snapshot content - but using same original url)
- after parsing result documents are joined (and stored to index containing content also from snapshot page... as the original ajax page contains typically no parseable html content)
2015-10-18 05:51:01 +02:00
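The "different url" used to fetch the html snapshot can be sketched as below. The mapping follows the `_escaped_fragment_` convention of the (deprecated) Google AJAX crawling proposal linked in the commit; it is not taken from YaCy's htmlParser, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class AjaxSnapshotUrlSketch {

    /**
     * Maps a page url to its snapshot url:
     * "...path#!fragment" becomes "...path?_escaped_fragment_=fragment",
     * a url without hash-bang (page with metatag fragment "!") gets an empty value.
     */
    public static String snapshotUrl(String url) {
        String base;
        String fragment = "";
        int bang = url.indexOf("#!");
        if (bang >= 0) {
            base = url.substring(0, bang);
            fragment = url.substring(bang + 2);
        } else {
            int hash = url.indexOf('#');
            base = hash >= 0 ? url.substring(0, hash) : url;
        }
        // append to an existing query with '&', otherwise start one with '?'
        String sep = base.indexOf('?') >= 0 ? "&" : "?";
        return base + sep + "_escaped_fragment_=" + escape(fragment);
    }

    /** Percent-escapes the bytes the proposal lists: 00..20, 23, 25..26, 2B, 7F..FF. */
    static String escape(String fragment) {
        StringBuilder sb = new StringBuilder();
        for (byte b : fragment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c <= 0x20 || c >= 0x7F || c == '#' || c == '%' || c == '&' || c == '+') {
                sb.append('%').append(String.format("%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(snapshotUrl("http://example.com/path#!key=value"));
    }
}
```

Both documents (the original page and the snapshot) would then be registered under the original url, as the commit message describes.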
Michael Peter Christen
d1ae999ef9 replaced HashMap with LinkedHashMap to preserve the object order 2015-10-16 23:30:51 +02:00