Commit Graph

3905 Commits

Author SHA1 Message Date
reger
e0816ef2e5 use human readable date format in CrawlStacker error message
"double in: local index, oldDate = "
2016-11-05 19:40:14 +01:00
luccioman
54d879a9b3 Generate HTML relative (to each peer) links from hosted WikiCode.
When WikiCode inserted in a peer hosted Blog, Wiki, Messages or Profile
contains relative links (images or any content, hosted in DATA/HTDOCS),
it is more reliable to keep these links relative, especially when the
peer is behind any kind of reverse Proxy.
2016-11-04 11:21:20 +01:00
luccioman
2da5f339f8 Fixed /News.html and /Wiki.html pages in Search Portal mode (issue #87).
Also fixes rendering of these pages when the peer is not online.

Re-factored code in common with /opensearchdescription.xml and
ConfigPortal.html.
2016-11-03 02:33:36 +01:00
reger
8fe28a83f2 harmonize used lastmodified date for rwi and fulltext in storeDocument 2016-11-02 03:43:39 +01:00
reger
3d1d297308 refactor namespace navigator as part of navigatorplugin map, this allows
the navigator to include counts of all matches (rwi+fulltext).
Fixing also unresolved_pattern in navigators title (of the counter)
The use of inurl: query modifier as filter has not been changed keeping
it as soft (unsharp) filter facet.

Upd StringNavigator to prevent empty string from multivalued solr fields,
removed date value conversion (better handled elsewhere, not needed here).
2016-11-01 04:38:47 +01:00
reger
67f660523b Make navigator's underlying indexfield name accessible in interface
use interface in declaration and extend facet check to include navigator
field.
2016-10-31 18:42:23 +01:00
reger
5eb3ee4e20 Add search navigator interface to allow for additional navigators (plugins)
Prepared the first basic navigators (for authors and collections) for the
list of SearchEvent.navigatorPlugins and adjusted servlet to use these.
- this allows to configure display order of these navigators (by ordering config string)
- eventually allows for additional and/or custom navigators using any
available index field without need for changing servlets
- the Collection navigation has been adjusted to exclude the internal, 
default robot_*  and dht collections from displaying
- rwi results are now also checked for navigator by the refactored navigators

So far no config options were added to customize or add navigators (may
come later if route of upcoming modularization/plugin system is defined).
2016-10-31 02:17:43 +01:00
reger
fd3f58fcaa improve query modifier parsing of "collection:" and possible collision
with "on:" in case multiple collection modifiers were entered (by mistake)
http://mantis.tokeek.de/view.php?id=702
2016-10-31 00:43:01 +01:00
reger
af39a76bf6 Reduce number of default max. search navigator lines (from 10000)
to 100 + make it configurable
2016-10-29 04:19:46 +02:00
reger
20a1b29ed3 add simple test case for ReferenceContainer helpful for debugging
calculated ranking parameter
2016-10-26 01:38:40 +02:00
reger
3c7220bc7b Refactor rwi reference word position and word distance calculation
used for rwi ranking.
Main changes:  
- introduce a posintext() to access the stored value. This also reduces mem alloc of position array for WordReferenceRow (index access)
- use the positions() array for joined references on multi-word queries if needed (otherwise allow positions() to be null)
- adjust assignments and the min() max() and distance() calculation accordingly
2016-10-23 19:40:02 +02:00
luccioman
f0639d810c Customized name for Threads still using the default "Thread-n" pattern.
This makes threads monitoring easier to read.
2016-10-22 17:17:21 +02:00
luccioman
db3b9db9c2 Crawl from local file : faster task end when manually terminating crawl. 2016-10-22 09:11:20 +02:00
reger
4c67ed3f8d catch rwi ranking div by zero exception
during rwi search result processing worddistance calculation is affected
by concurrent update (normalization) of min/max ranking parameter for
wordpositions. On update of min/max the exception is raised in distance calc
and now caught.
This concurrent update and change of ranking results is needed for speed
but should be further checked for optimization
2016-10-22 00:53:47 +02:00
luccioman
47af33a04c Advanced Crawl from local file : better processing of large files.
Applied strategy : when there is no restriction on domains or
sub-path(s), stack anchor links once discovered by the content scraper
instead of waiting for the complete parsing of the file.

This makes it possible to handle a crawling start file with thousands of
links in a reasonable amount of time.

Performance limitation : even if the crawl starts faster with a large
file, the content of the parsed file still is fully loaded in memory.
2016-10-21 13:03:31 +02:00
luccioman
ee92082a3b Updated javadocs : warning about closing stream responsibility. 2016-10-21 12:48:36 +02:00
luccioman
6f49ece22f Fixed redirected URLs processing as crawl start point.
See mantis 699 (http://mantis.tokeek.de/view.php?id=699) for details.
2016-10-20 12:12:26 +02:00
reger
68217465fe div by zero in word distance calculation
(again, description in http://mantis.tokeek.de/view.php?id=698)
as root cause was not seen, added just workaround reducing in favour over a 
try catch (for easier followup).
2016-10-19 22:55:36 +02:00
luccioman
7263d17436 Removed mentions of deprecated LURL-db.
Thanks to LA_FORGE asking about it on the YaCy forum (
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5895 )
2016-10-19 14:56:25 +02:00
reger
8b74a6bf57 fix min/max calculation of WordReferenceVars.distance()
Issue was the calculation in AbstractReference with positions.clear() call,
this made distance result always 0 (distance needs min 2 positions) and created concurrency issues.
+ unit test of changes
2016-10-17 23:58:28 +02:00
luccioman
da362628fb Added fine log level for too long blacklist matching processing. 2016-10-17 22:32:19 +02:00
reger
aaae7c6462 adjust ConcurrentScoreMap internal value map to interface and use parameter
Long -> Integer (saves some bytes)
2016-10-16 06:31:48 +02:00
reger
31d2a5645e remove obsolete query variable
leftover from 8fb370d9f8 (diff-1d4259005ebfddc11083387857a86175)
harmonize ranking shift parameter to 0xFF
correct addresult weight parameter to long
2016-10-15 19:29:19 +02:00
luccioman
a588ed7628 Applied image headers customization to the new ViewFavicon servlet. 2016-10-14 14:05:38 +02:00
luccioman
7717a3d43d Fixed license headers on files created to improve favicon management. 2016-10-14 11:55:49 +02:00
luccioman
6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
	htroot/yacysearchitem.java
	source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
	source/net/yacy/search/schema/CollectionConfiguration.java
	source/net/yacy/server/serverObjects.java
2016-10-14 11:29:55 +02:00
reger
685d8e86bf Avoid frequent data type casting (float/long) for rwi score
refactor to using long in URIMetadataNode too (and related call parameters)
As remote rwi scores are not used (since v1.83), skip reading float-score,
but keep in toString() for communication with older versions.
2016-10-14 01:17:34 +02:00
luccioman
3ccd89e274 Fixed MultiProtocolURL.resolveBackpath to handle remaining '..' segments 2016-10-13 16:18:24 +02:00
luccioman
4b699c469a Blacklist refactoring : extracted a function for easier unit testing 2016-10-13 15:33:31 +02:00
luccioman
54cfcc3f56 CrawlCheck_p.html : also display info about disallowed URLs. 2016-10-12 11:26:59 +02:00
luccioman
8b341e9818 Robots : properly handle URLs including non ASCII characters
This fixes GitHub issue 80 (
https://github.com/yacy/yacy_search_server/issues/80 ) reported by
Lord-Protector.
2016-10-12 11:25:36 +02:00
reger
e68b00678e prevent negative score on URIMetadataNode - in the special case where no
solr score is supplied.
+ assert before use & test case
2016-10-11 19:54:50 +02:00
luccioman
242707f9b4 Fixed loadFromCache with strategy IFFRESH.
This fixes mantis 695 ( http://mantis.tokeek.de/view.php?id=695 ) :
crawl start with 'Link-List of URL' option on websites using cookies.
2016-10-10 01:10:35 +02:00
reger
b752bcfecb adjust date in text detection to ignore some program version strings
like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650
+ expand test case
2016-10-06 23:37:12 +02:00
reger
b017e97421 optimize condenser language detection a little.
langdetect probabilities take letter case into account, add words from
description and anchors etc. as is.
+ add it to javadoc
2016-10-06 19:03:52 +02:00
reger
ae3717d087 adjust Tokenizer sentence count to ignore repeated punctuation (like !!!! )
+ remove unused sentenceword map (we use only the count)
+ upd test case for sentence count
2016-10-06 03:41:07 +02:00
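A minimal sketch of the counting rule described in ae3717d087, not the actual Tokenizer code: a run of consecutive terminators such as "!!!!" is counted as a single sentence end. The class name, the method name, and the choice of '.', '!' and '?' as terminators are assumptions made for illustration.

```java
final class SentenceCountSketch {

    /** Counts sentence ends, treating a run of terminators ("!!!!", "?!", "...") as one. */
    static int countSentences(String text) {
        int count = 0;
        boolean previousWasTerminator = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumption: only these three characters end a sentence.
            boolean terminator = c == '.' || c == '!' || c == '?';
            if (terminator && !previousWasTerminator) {
                count++; // first terminator of a run closes the sentence
            }
            previousWasTerminator = terminator;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSentences("Hello!!!! World."));
    }
}
```

Text without any terminator yields 0 in this sketch; handling of trailing text after the last recognized sentence is a separate concern (see 474f0476c6).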
reger
474f0476c6 adjust Tokenizer sentence count on trailing text after last recognized sentence
+ upd test case for rwi multi-word-query  (leaving results known to fail untested)
2016-10-05 05:52:37 +02:00
reger
3861ac9293 upd maven dependency-check plugin to reflect changes of https://nvd.nist.gov
+ upd unknown ant script with current lib/jsch version
2016-10-04 03:05:26 +02:00
reger
681a61dafb adjust rwi index result word position handling used for rwi ranking
- correct WordReferenceVars.toRowEntry posintext parameter
to set expected min posintext (the difference is on multi-word queries,
while positions are ordered by search word order).
- modified posofphrase/posinphrase join operation
 - to set min posofphrase
 - and keep posinphrase if not same posofphrase (was set to 0, no differentiation during ranking)
+ fix compiler msg (missing type declaration)
2016-10-04 01:42:18 +02:00
reger
14f7577231 add support for older Word versions (Word6/Word95) to docParser 2016-10-03 01:52:51 +02:00
reger
1a79c64495 generalize DateDetection with holiday date rules readily available in icu
to make sure current dates are recognized (was fixed to 2014 - 2016)
+ adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text
+ moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing
+ add test case for parseline (used by query parser)
2016-10-02 03:19:12 +02:00
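The switch from match to find in 1a79c64495 can be illustrated with java.util.regex: Matcher.matches() requires the whole input to fit the pattern, while Matcher.find() also accepts leading and trailing text. The pattern below is a hypothetical day.month. expression chosen for the example, not the one used in DateDetection.

```java
import java.util.regex.Pattern;

final class MatchVersusFindSketch {

    // Hypothetical pattern for a "day.month." date such as "24.12."
    static final Pattern DAY_MONTH = Pattern.compile("\\b(\\d{1,2})\\.(\\d{1,2})\\.");

    /** True only if the entire input is a date. */
    static boolean wholeInputIsDate(String text) {
        return DAY_MONTH.matcher(text).matches();
    }

    /** True if a date occurs anywhere in the input. */
    static boolean containsDate(String text) {
        return DAY_MONTH.matcher(text).find();
    }

    public static void main(String[] args) {
        String line = "Closed on 24.12. all day";
        System.out.println(wholeInputIsDate(line) + " " + containsDate(line));
    }
}
```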
reger
6f68f08354 correct DateDetection Silvester date
add Thanksgiving
2016-10-01 03:16:27 +02:00
reger
32a2e3a22a have RSSFeed.getChannel return empty message on missing channel element,
a) required b) prevent NPE in rss servlets
+ add test
2016-09-30 21:46:57 +02:00
luccioman
8d57b5b970 Added some javadocs. 2016-09-30 17:12:55 +02:00
luccioman
60df09fff9 Fixed some HTML validation errors : Illegal character in query
Now encode space characters in URLs query part.
2016-09-30 10:54:53 +02:00
reger
862f28eaa6 display number of documents/rss-items for label "docs" in load_rss_p servlet
(as replacement for the rarely used "docs" rss-tag for a url to the rss-specification)
2016-09-29 23:59:10 +02:00
luccioman
dcdea2d02f Fixed shutdown for crawler.MaxActiveThreads value greater than 200
Shutdown was hanging in CrawlQueues.close() at
this.workerQueue.put(POISON_REQUEST) when config value
crawler.MaxActiveThreads was greater than 200.

Revealed by "Collision" Threads dumps in mantis 689
(http://mantis.tokeek.de/view.php?id=689#c1312)

Fixed consistency between this.worker.length and this.workerQueue
capacity, and made the process more reliable using non-blocking offer()
function.
2016-09-29 10:33:11 +02:00
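The difference between the blocking put() and the non-blocking offer() mentioned in dcdea2d02f can be sketched as follows. This is an illustration under assumptions, not the CrawlQueues code: the class name, the String poison value, and the queue capacity are invented for the example. With a bounded queue that is smaller than the number of workers, put() would block once the queue is full, whereas offer() returns false immediately.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class PoisonPillShutdownSketch {

    // Hypothetical stand-in for POISON_REQUEST.
    static final String POISON = "POISON";

    /** Offers one poison pill per worker without blocking; returns how many were accepted. */
    static int signalShutdown(BlockingQueue<String> queue, int workerCount) {
        int accepted = 0;
        for (int i = 0; i < workerCount; i++) {
            // offer() returns false right away when the queue is full,
            // where put() would wait until space becomes available.
            if (queue.offer(POISON)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        System.out.println(signalShutdown(queue, 3));
    }
}
```

Keeping the queue capacity consistent with the number of workers, as the commit describes, avoids rejected pills in the first place; offer() only guarantees that shutdown cannot hang on a full queue.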
luccioman
d286ba2c3e Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-09-28 14:53:08 +02:00
luccioman
b8f6458152 Prevent yacy main thread from hanging on browser opening process.
First fix for mantis 689 (http://mantis.tokeek.de/view.php?id=689).

On Debian Linux, with a headless jre and no open browser,
browser.openBrowserClassic() was called and waited forever for the browser
process to end (p.waitFor()). YaCy shutdown was therefore not working until
the browser was closed.

Also modified browser opening command for Unix platform to open the
default browser (with xdg-open util) instead of Firefox.

xdg-open also has the advantage of being asynchronous (not blocking).
2016-09-28 14:52:30 +02:00
reger
70e1eb30a5 prevent StringIndexOutOfBounds in getLocalFile()
+ tighten patching of DOS path w/o protocol to drive "LETTER":
2016-09-27 22:40:36 +02:00
luccioman
1bb0b135ac Avoid duplication of various MS Windows file URLs flavors
Fix for mantis 692 (http://mantis.tokeek.de/view.php?id=692)
2016-09-27 07:53:08 +02:00
luccioman
b9a8476f02 Removed unused import 2016-09-27 07:41:45 +02:00
reger
e73c1eea8c remove unused rootpattern, leftover from commit
9a5ab4e2c1 (diff-d2b184283abed53ae260fc9eabdaef40)
2016-09-26 02:54:58 +02:00
reger
6f8c3ccea4 improve url hash computation for file path with mixed java & windows
file.separator to compute equal hashes (by normalizing path for computation)
+ expand test case to check mixed java / windows file url notation
like e.g. file:///c:/test/file.html vs. file:///c:\test/file.html
- relates partially to http://mantis.tokeek.de/view.php?id=692
2016-09-25 22:08:12 +02:00
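A sketch of the normalization idea in 6f8c3ccea4: unify the path separator before hashing so that both notations yield the same hash. MD5 is used here only as a generic, readily available digest; it is an assumption for illustration and not YaCy's url hash algorithm. Class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class FileUrlHashSketch {

    /** Replaces Windows backslashes with forward slashes. */
    static String normalize(String fileUrl) {
        return fileUrl.replace('\\', '/');
    }

    /** Hashes the normalized form, so mixed separators compare equal. */
    static String hash(String fileUrl) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(normalize(fileUrl).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hash("file:///c:/test/file.html").equals(hash("file:///c:\\test/file.html")));
    }
}
```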
reger
efcb6a1e74 fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison)
+ add mime text/xml as in use for rss in the wild
2016-09-23 23:37:12 +02:00
luccioman
b3b75b0498 Accessibility : add a customizable alternative text to YaCy logo
Applied W3C recommendations :
https://www.w3.org/TR/html51/semantics-embedded-content.html#a-link-or-button-containing-nothing-but-an-image
and
https://www.w3.org/TR/html51/semantics-embedded-content.html#logos-insignia-flags-or-emblems
2016-09-22 16:08:33 +02:00
luccioman
f2bc1b268d Updated URL fragment validation rules according to current standards
See RFC 3986 (https://tools.ietf.org/html/rfc3986) or URL living
standard (https://url.spec.whatwg.org/)
2016-09-22 11:28:33 +02:00
luccioman
b1b8e69da8 Fixed NullPointerException cases 2016-09-22 11:25:33 +02:00
luccioman
3ee4f56c39 Improved ErrorCache behavior when switching networks
Even after network switch, ErrorCache was still holding a reference to
the previous Solr cores, thus becoming useless until next YaCy restart.

Initial error cache filling with recent errors from the index was also
missing after the switch.
2016-09-22 09:07:07 +02:00
luccioman
7d5ba2afa4 Added some JavaDoc and moved crawlStacker close at the right place. 2016-09-22 08:21:14 +02:00
luccioman
8edbcd8ad4 Log eventual Solr instances close errors.
We do not want to block on this kind of error, but this should not
silently fail as it may have later consequences.
2016-09-22 08:20:01 +02:00
reger
330768c8a2 fix for solr write.lock after mode change http://mantis.tokeek.de/view.php?id=686
The embedded core holds a lock on the index and must be closed. Earlier commit
comment states that core should be closed with solr instance instead of on close
of connector.
Adjusted the InstanceMirror.close() to take care of closing the embedded
instance to release the lock.
In 2 routines of fulltext this was already explicitly implemented (disconnectLocalSolr).
Now this disconnect is part of the InstanceMirror.close().
2016-09-22 00:16:22 +02:00
reger
585d2a6441 test case: for NewsPool to check the id modificator (for unique id)
and observe the distribution order .. hands on.
+ add test/DATA to gitignore
2016-09-20 01:55:56 +02:00
luccioman
de5c873e38 Removed unused JavaScript file docs.min.js
This file is used by Bootstrap documentation website
(http://getbootstrap.com/) but is not part of the Bootstrap distribution
and does not need to be included in a Bootstrap based application.
2016-09-20 00:17:42 +02:00
Michael Peter Christen
df51e4ef07 Merge branch 'master' of git@github.com:yacy/yacy_search_server.git 2016-09-19 11:01:58 +02:00
Michael Peter Christen
e063aaf97f enable fuzzy search, solr style (append a ~ to get a fuzziness on the
word)
2016-09-19 11:01:39 +02:00
reger
ff6589fc0f test case: simulating multi word query for local rwi index
Purpose of the test case is to be able to (controlled) analyse the rwi ranking for
multi word searches (with focus on posintext and word-distance ranking)
2016-09-18 00:59:27 +02:00
reger
e990297d2e avoid NPE on hello message with missing "yourip" key
http://mantis.tokeek.de/view.php?id=684
2016-09-15 23:26:25 +02:00
reger
e51ab8c7aa hack to generate a unique message-id for messages created in the same second
by optionally add a 1 second offset counter to the current time (which is
used as the unique id part)
2016-09-15 02:59:32 +02:00
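One way to picture the approach in e51ab8c7aa, as a sketch only: when two ids are requested within the same second, the later one is shifted forward by one second so that it stays unique. The class, field, and method names are hypothetical, and passing the current time as a parameter is a choice made to keep the example deterministic.

```java
final class MessageIdSketch {

    // Last second value handed out; starts below any real timestamp.
    private long lastIssuedSecond = Long.MIN_VALUE;

    /** Returns a second-resolution id strictly greater than the one issued before. */
    synchronized long nextId(long nowSeconds) {
        // If "now" was already used, step one second past the last issued value.
        long candidate = Math.max(nowSeconds, lastIssuedSecond + 1);
        lastIssuedSecond = candidate;
        return candidate;
    }

    public static void main(String[] args) {
        MessageIdSketch ids = new MessageIdSketch();
        long now = System.currentTimeMillis() / 1000L;
        System.out.println(ids.nextId(now) != ids.nextId(now));
    }
}
```

The ids drift ahead of the wall clock when many messages are created in a burst, which is presumably why the commit message itself calls this a hack.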
Michael Peter Christen
b82300358a removed version number check because it does not work any more if
version numbers are expressed in a different way as we expect. That
could cause that YaCy does not run on systems which are appropriate but
we simply do not understand the version string.
2016-09-14 16:32:57 +02:00
Michael Peter Christen
2107674999 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-09-14 16:24:55 +02:00
Michael Peter Christen
0d28f563f4 fix for java version "9-ea" 2016-09-14 16:24:32 +02:00
reger
3b694b3935 add some javadoc to rwi wordreference distance, position
to remember facts for http://mantis.tokeek.de/view.php?id=683
Init missing word position to 0 like in other non text body words
2016-09-14 00:36:19 +02:00
reger
a4465c97d6 as requested, disable/remove old swf parser
http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5861#p33098
2016-09-13 02:47:36 +02:00
reger
7f63fc50f3 prepare a IndexSegment test case for RWI index testing
+ prevent NPE in Segment.clear() on missing embedded solr instance.
2016-09-11 23:25:44 +02:00
reger
96467c5467 remove not needed counter in Tokenizer (completing last changes)
including a small change, word posintext counting. 
We remember/store 1st posintext. Previously following words got a handle (posintext)
excluding found. Now it just counts and assigns true posintext as handle (posintext)
2016-09-10 18:23:09 +02:00
luccioman
d66b0f7b7b Fixed French messages encoding in YaCy tray.
Also added the missing French translations.
2016-09-09 07:43:33 +02:00
reger
7efb66ee10 adjust the WordReference.join wordsintext calc to take the max (instead of sum)
The reference is for the same url (add same for title and phrases).
+ del redundant join() procedure
2016-09-08 02:29:48 +02:00
luccioman
0a9ff14d96 Fixed NullPointerException case and added Javadoc 2016-09-07 10:03:48 +02:00
luccioman
06d4f93d03 Merged master into postprocessing branch 2016-09-07 09:28:37 +02:00
Michael Peter Christen
b73d2db914 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-09-07 06:49:15 +02:00
Michael Peter Christen
25a3c7a6d0 catch exception and write end of object 2016-09-07 06:48:52 +02:00
reger
272cdd496a reactivate sentence counter in WordTokenizer for phrasepos ranking,
by counting punctuation (delivered as 1 char word) again.
2016-09-07 02:16:16 +02:00
Michael Peter Christen
5e165a8150 removed unused imports 2016-09-06 18:46:24 +02:00
Michael Peter Christen
c716648c78 enhanced json encoding of strings 2016-09-06 18:45:29 +02:00
Michael Peter Christen
6139bd85a8 fix for broken facet names 2016-09-06 17:19:54 +02:00
Michael Peter Christen
5060f9fee9 fix for too long snippets 2016-09-06 09:05:39 +02:00
Michael Peter Christen
8681cee3f3 fix for bad comma 2016-09-06 09:00:35 +02:00
Michael Peter Christen
db6d8fc197 fix for bad json 2016-09-06 07:44:38 +02:00
Michael Peter Christen
8f4a341735 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-09-06 06:58:17 +02:00
Michael Peter Christen
9934f546bb added default fl to solr query, removed large texts retrieval and
changed snippet to description tag if no other description is available
2016-09-06 06:56:51 +02:00
reger
120bf7e6e2 implemented RWI WordReference to return the word position value (was always left empty)
This is needed and enables existing word position ranking for RWI.
The upcoming concurrency issues in word position min/max calculation were eliminated
by iterator.hasNext check before next() access.
2016-09-06 03:18:02 +02:00
reger
e310ec5f70 fix posInText ranking calculation to score 0 on no position info
+ fix Word posInText calc in Tokenizer to start with 1
+ test case
2016-09-06 00:05:59 +02:00
luccioman
74f9927ddc Merge remote-tracking branch 'origin/master' into dist_macOS 2016-09-05 22:42:17 +02:00
reger
51c077f493 adjust the getTopics() and getTopicNavigator() to current usage
- move the maxcount limit restriction completely to getTopicNavigator (as it is not used in getTopics)
- let search servlet use getTopics by default (w/o RWI connected check, as of now, Topics are available w/o any additional index interaction)
2016-09-05 00:07:01 +02:00
reger
39dd244693 fix ConcurrentScoreMap.set() calculation of totalCount()
+ test case
2016-09-04 22:18:07 +02:00
reger
ebf818ad95 log an error on aborted news publish (due to duplicate news.id)
+ change printed err msg to log entry in PeerAction.processPeerArrival
2016-09-04 06:42:48 +02:00
reger
cc2d9dd3f1 reactivate the use of included-in-topwords boost in postRanking
+ changed the postRanking to add one score only if word appears more than once.
+ getTopics() unused code block rem'd (save performance)-> routine needs rework !
2016-09-04 00:09:45 +02:00
luccioman
39ea28adfd Merged master to dist_macOS branch. 2016-09-03 15:22:57 +02:00
luccioman
8255e91c99 Fixed serverClassLoader.findClass method
htroot is supposed to be a subfolder of appPath and not of dataPath,
as assumed in other places where htroot is loaded. This issue was not
visible when dataPath and appPath are equal.
2016-09-03 15:21:02 +02:00
reger
6801673a07 apply postranking media search boost only on media queries 2016-09-03 03:37:40 +02:00
luccioman
1dc4306058 Fixed indentation for better readability. 2016-09-02 11:23:02 +02:00
luccioman
8c49a755da Postprocessing refactoring
Added Javadocs to refactored methods.
Added log warnings instead of silently failing some errors.
Only fill collection1hosts when required ( shallComputeCR true).
2016-09-01 15:40:28 +02:00
luccioman
42f45760ed Refactored postprocessing
For easier understanding and performance profiling.
2016-08-31 12:16:25 +02:00
reger
4386e84b55 correct NewsPool retention calculation
(was still clearing everything after one day)
2016-08-31 02:24:30 +02:00
reger
5e72d37f0a TransNews_p: add ad-hoc translation of target file on positive vote (addition to local translation)
+ errmsg on language=default
2016-08-30 00:06:42 +02:00
reger
9462a32244 Added news service for easy, community driven UI translation support.
New or modified translation (via /Translator_p.html) can be shared/distributed
via the YaCy internal news service. Remote peers can see and vote on the
translation via the new http://localhost:8090/TransNews_p.html servlet.
A positive vote will add the received translation to the local translation
list and post a voting message to the news service.
(at this time no processing of received votes is implemented)

+ fixed the msg service retention time check (NewsPool.automaticProcessP)
2016-08-29 02:15:06 +02:00
reger
f8d6543a23 Rename class CreateTranslationMaster to TranslationManager and add
additional routines and the capability to handle translation maps internally 
(to reduce complexity of handling translation maps for calling servlets)
2016-08-28 23:08:03 +02:00
reger
19b4509d54 speed-up reading of xliff language file, by using xmlparser (stax) instead of jaxb
making xliff-core-1.2-1.1.jar obsolete
2016-08-28 02:55:42 +02:00
Michael Peter Christen
e1fac86f53 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-08-26 14:51:43 +02:00
Michael Peter Christen
a9316ceff6 force browser-caching of favicons from search results 2016-08-26 14:51:32 +02:00
Orbiter
503312ca43 Merge pull request #61 from luccioman/heroku_experiments
Deploy YaCy on Heroku
2016-08-26 11:57:41 +02:00
reger
33bf35d90f missing file for previous commit "Introduction of additional language setting browser" 2016-08-23 00:13:20 +02:00
reger
16e8ed3f01 Introduce additional language setting "browser/Browser Language" for UI internationalization.
If language is set to "browser" the client/user browser language is used to choose from
available translation.
simply: one user's browser speaks English -> YaCy responds in English, another user's browser speaks French -> YaCy responds in French.

! To make a translation/language available you have to activate the language once ! 
(or manually use the utility class TranslateAll)
In ConfigBasic.html available translations are marked green on setting language=Browser
The client language is determined by http header Accept-Language (checked in DefaultServlet)
2016-08-23 00:04:24 +02:00
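The language selection described in 16e8ed3f01 can be sketched with the standard java.util.Locale matching API. This is not the DefaultServlet code: the class name, the method name, and the "default" fallback value are assumptions for illustration, and the list of available translations is passed in by the caller.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class BrowserLanguageSketch {

    /** Picks the best available translation for an Accept-Language header value. */
    static String chooseLanguage(String acceptLanguage, List<String> available, String fallback) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return fallback;
        }
        List<Locale.LanguageRange> ranges;
        try {
            // Parses e.g. "fr-FR,fr;q=0.9,en;q=0.8" and orders ranges by weight.
            ranges = Locale.LanguageRange.parse(acceptLanguage);
        } catch (IllegalArgumentException e) {
            return fallback; // malformed header
        }
        // RFC 4647 lookup: "fr-FR" falls back to "fr" if only "fr" is available.
        String match = Locale.lookupTag(ranges, available);
        return match != null ? match : fallback;
    }

    public static void main(String[] args) {
        List<String> available = Arrays.asList("de", "en", "fr");
        System.out.println(chooseLanguage("fr-FR,fr;q=0.9,en;q=0.8", available, "default"));
    }
}
```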
reger
3b47a07dd1 change unused servletProperties entry CONNECTION_PROP_CLIENT_REQUEST_HEADER to
use directly HttpServletRequest. This is used to get the http protocol version
in HTTPDProxyHandler.fulfillRequestFromWeb() for error response to client.
- adjust YaCyProxyServlet and UrlProxyServlet accordingly
- use more http_version constants in headerframework and httpdeamon
- equalize servlets (3) use of HeaderFramework.CONNECTION_PROP_HOST to HeaderFramework.HOST
2016-08-21 19:34:44 +02:00
reger
036c1dc6ef fix CookieTest_p formatting (output of <br> as text),
change to dataoutput only by servlet, leave formatting to html.
+ removed link to obsolete env/grafics gif
2016-08-20 22:23:47 +02:00
Michael Peter Christen
bf6709d196 fixed missing browser activation in linux 2016-08-19 19:24:15 +02:00
Michael Peter Christen
d8504418b6 enhanced browser-caching of static content 2016-08-19 19:23:51 +02:00
Michael Peter Christen
079112358c Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-08-19 15:31:09 +02:00
Michael Peter Christen
efeb592661 don't do solr optimization, this creates high IO load. We should leave
this task to solr to do that on its own instead of forcing it.
2016-08-19 15:30:53 +02:00
luccioman
46b8836548 Copy image resources contained in donation iframe.
Handle eventual images loading errors.
2016-08-17 15:19:15 +02:00
reger
4c7a77662a eliminate dependency on file-extension in storeDocument but use supported mime-type
to also support handling of urls w/o corresponding file-extension.
For this refactor use of document.getParserObject() to always return a Parser (for clean logic)
and define/move the scraperObject as local var of AbstractParser.
Adjust related calls to getParserObject (where actually a scraperObject is wanted).
Additionally skip appending url token to parsed text for dht metadata entries
(by default returned as result by rwi index).
2016-08-14 03:53:16 +02:00
reger
ebde21079a refactor xlsParser to include Excel file attribute (like author) in parser result doc.
Similar to ppt and doc parser, completing a TODO in xlsParser.
2016-08-13 23:46:36 +02:00
luccioman
744c9a2615 Opensearch desc : handle https protocol url with default port (443)
This completes modifications made for mantis 669
(http://mantis.tokeek.de/view.php?id=669)
2016-08-12 12:18:26 +02:00
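A sketch of default-port handling when building a base URL, assuming that 744c9a2615 means the port is left out when it is the protocol default (80 for http, 443 for https). The class and method names are hypothetical and this is not the opensearchdescription servlet code.

```java
final class DefaultPortSketch {

    /** Builds protocol://host[:port], omitting the port when it is the protocol default. */
    static String baseUrl(String protocol, String host, int port) {
        boolean isDefault = ("http".equals(protocol) && port == 80)
                || ("https".equals(protocol) && port == 443);
        return protocol + "://" + host + (isDefault ? "" : ":" + port);
    }

    public static void main(String[] args) {
        System.out.println(baseUrl("https", "example.org", 443));
        System.out.println(baseUrl("http", "localhost", 8090));
    }
}
```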
luccioman
b9c28893ee Merged master to 'heroku' branch. 2016-08-10 11:03:01 +02:00
Michael Peter Christen
103a8348b3 fix for NPE and small performance enhancement 2016-08-10 06:48:08 +02:00
reger
2910fe35c1 add missing scheduler calc of next exec_date (call of calculateAPIScheduler)
- after last_exec_date is altered, next_exec_date should be recalculated
- makes the recalculation of next_exec in advance (without api call surely made) in Switchboard.schedulerJob() obsolete
Slightly modify next_exec calc. on missed event to now+schedule_time (from fix 10min)
2016-08-09 03:03:04 +02:00
reger
70d47ae38a keep scheduler selection by repeat entry from 07311020d4
to allow exec schedule on actual exec event.
Iterate on exec date (of advantage after interruption/shutdown) to schedule
older or missed events first.
2016-08-08 02:19:48 +02:00
reger
7c3f932e5d revert due to conflict with double count recording by scheduler / servlet by the commit under normal operation (no shutdown) 2016-08-08 01:57:31 +02:00
reger
07311020d4 postpone apicall exec date init until actual call
fix for http://mantis.tokeek.de/view.php?id=677
The difference is on scheduling a large number of rss feeds and loading 
is not finished before shutdown of YaCy. The change makes sure that RSS feeds
not yet loaded will be loaded by the scheduler on next startup.
2016-08-07 05:08:55 +02:00
reger
5e335b32da fix Blacklist.contains() matching path pattern to string
similar to 5e9e871192
+ add proof testcase
2016-08-04 01:12:49 +02:00
reger
5e9e871192 fix Blacklist.remove by using pattern.toString to find pattern to remove,
parameter String path did never equal Pattern.
+ delete unused removeAll, as it does not persist changes after restart
2016-08-03 02:13:26 +02:00
reger
1843ea7e69 on Blacklist.add pattern to source file also update internal entry maps
as in Blacklist.add(blacklistType) to make entry effective w/o restart
fix for http://mantis.tokeek.de/view.php?id=676
2016-08-02 02:41:03 +02:00
reger
bf6ce33da3 Correct use of _htDocsPath config in YaCyDefaultServlet to use servlet config variable
+ add some javadoc and remove a not useful static declaration
2016-07-31 23:16:24 +02:00
luccioman
480027ec98 Merge remote-tracking branch 'origin/master' into heroku_experiments 2016-07-28 02:29:40 +02:00
reger
fcad2d0744 add uses of config constant INDEX_RECEIVE_ALLOW 2016-07-27 02:16:20 +02:00
reger
226f81cfcf declare poison pill url MultiProtocolURL() as protected to make sure not
used from outside.
After double checking use of poison url revert path init from commit
f8632ad292
2016-07-23 20:03:13 +02:00
reger
f8632ad292 prevent string index out of bounds MultiProtocolURL.getPaths
as path may be an empty string
+ init path to "" also in init for poison url (to guarantee success for 
all existing uses of path w/o check for null)
2016-07-23 19:18:23 +02:00
reger
35a7d57260 update lucenematchversion to current (5.2.0 -> 5.5.0)
there should be no need for reindex by the update
2016-07-23 18:36:43 +02:00
reger
9b07bbf955 deprecate newurl(), not used and already replaced
instead of making it handle all supported the protocols
2016-07-21 02:14:35 +02:00
luccioman
47d486298f Merged changes from master. 2016-07-20 00:37:31 +02:00
reger
774b3906a9 fix GenericFormatter.parse ("time","timeoffset")
change: UTC offset internally expected in minutes
2016-07-19 02:57:41 +02:00
reger
27163af0e1 improve detection of referenced links by taking http and https link protocol
into account
+ correct query start detection of commit f89d4eb51d
2016-07-17 23:42:25 +02:00
reger
f89d4eb51d fix MultiProtocolURL init (assign of host) for urls with '/' in query part
+ add to test case
2016-07-17 04:17:01 +02:00
reger
87fcfc6d78 Adjusted hash computation and toNormalform for file:// protocol to deliver
same hash same file on Windows filesystem path with forward- and backslash in path.
Background see http://mantis.tokeek.de/view.php?id=671
+Test case
2016-07-16 01:59:09 +02:00
luccioman
d6bf90803f Merged from main master branch. 2016-07-12 09:05:31 +02:00
luccioman
9b9c112263 Handle more properly local port configuration by system property
And prefixed property with "net.yacy" to avoid ambiguity.
2016-07-12 01:53:01 +02:00
reger
3811184abd fix GSA servlet clientIP retrieval 2016-07-09 23:39:43 +02:00
reger
7ab41d4ff1 use directories original lastmodified date in file- & smbloader in response 2016-07-09 19:55:47 +02:00
reger
708bcbb042 one more replacement to use cached hosthash vs. calculated 2016-07-07 02:50:57 +02:00
luccioman
b57a06d88e Let Heroku decide which http port to use 2016-07-06 22:14:40 +02:00
reger
22db449f2a to prevent the crawler from concurrently accessing and altering the same crawl queue
after restart, put hosthash in queue's filename (which is used as primary
key for crawl queue). Hint: initial hosthash from url and recalculated hosthash
from just hostname:port are not the same.
fixes http://mantis.tokeek.de/view.php?id=668 (partially)
2016-07-05 23:22:35 +02:00
luccioman
893a40995a Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-07-04 21:24:40 +02:00
Orbiter
50c5ddf1a1 Merge pull request #56 from luccioman/LibreJS
LibreJS compliance : YaCy JavaScript license information
2016-07-04 21:07:11 +02:00
Michael Peter Christen
7466d390b2 small refactoring + do not accept too old peers during bootstrap 2016-07-04 11:02:15 +02:00
luccioman
6e96c7341a Merge remote-tracking branch 'origin/master'
Conflicts:
	htroot/Load_MediawikiWiki.java
	htroot/Load_PHPBB3.java
	htroot/ViewImage.java
2016-07-03 18:59:00 +02:00
reger
8d58a48029 remove wrong log line in CrawlSwitchboard
+ don't allow CrawlSwitchboard to exit application
making network param unused
2016-07-02 20:33:23 +02:00
reger
5aaa057c65 ignore empty input lines in FileUtils.getListArray() to poka-yoke blacklist read.
equalizes behavior with getListString()
improves: case where blacklist file contained an undesired empty line, not
fixed by blacklist-cleaner.
2016-06-28 23:44:28 +02:00
reger
41c36ffd75 exclude rejected results from result count
(by using the resultcontainer.size instead of input docList.size)
skip waiting for write-search-result-to-local-index
  (by removing the Thread.join - which will bring a small performance increase)
2016-06-26 06:46:26 +02:00
reger
d4da4805a8 internal wiki code, require header line to start with markup
(to allow something like  "one=two"  as text)
+ incl. test case
2016-06-25 02:46:44 +02:00
reger
e952e355a2 have Translator servlet ad hoc apply added translation by translating a single file
+ fix NPE in Translator, coming from translation read by TranslatorXliff
  which allows null content for untranslated keys
2016-06-14 22:14:46 +02:00
reger
b119ff65be clean out not used Switchboard variables
counter indexedPages, const xstackCrawlSlots
2016-06-14 01:50:32 +02:00
reger
223071337b Translator to respect word boundaries when identifying the text portion to
be translated. This avoids translation of partial terms/words,
e.g. key="TEST" sourcetext="this is a myTESTcase for it".
Add check of word boundary before and after sourcetext (incl. taking care
of the current practice of delimiting the key by > <)
+ add test case
2016-06-10 01:14:19 +02:00
luccioman
009657791e Merge remote-tracking branch 'origin/master' into LibreJS 2016-06-09 14:44:51 +02:00
luccioman
a73c9327a5 JavaScript License fixes for LibreJS compatibility 2016-06-08 23:16:10 +02:00
reger
0c40401d28 fix MessageBoard test for null data 2016-06-07 23:34:42 +02:00
reger
5b22c63030 Adjust TranslatorXliff to load default 1st and merge downloaded or modified local translation.
process 1. load default from locales/*.* 
        2. load and merge(overwrite) from DATA/LOCALE/*.* (can be partial translation as it is merged)
- include all entries from DATA/LOCALE to be edited in Translator servlet
  and save just modifications (instead of full list) to DATA/LOCALE

This shall make it easy to share modifications.
2016-06-05 23:01:45 +02:00
reger
a2e0f00456 optimize Translator
- translateFilesRecursive: load translation once (reduce io), return true on complete success
  - remove resulting unused translateFiles() variant
- translate: use StringBuilder parameter (skip toString conversion)
- remove not needed static declaration
- upd some javadoc
2016-06-05 03:57:08 +02:00
reger
a6ba1faa80 introduce a translation edit servlet Translator_p.html for YaCy's UI text translation
This is the 1st rudimentary approach to support the translation utilities.
It allows currently to edit untranslated text and save it in a local translation file
in the DATA/LOCALE directory.
+ refactor Translator (less static's) to leverage on class overrides and support garbage collection for this 1 time routine
+ adjust TranslatorXliff to check for local translations in DATA/LOCALE,
  this includes storing manually downloaded translation files in DATA as well 
  (to keep default untouched)
+ on 1st call of Translator_p a master translation file is generated, checking
the supported languages for missing translation text (later this masterfile is planned to be part of the distribution, to harmonize translation key text between the languages)
Outlook: the local modifications (possibly as translation fragments instead of complete file) to be shared with maintainer using xliff features.
2016-06-03 01:46:30 +02:00
reger
b3c9041f79 remove publicIPv4HostNames and publicIPv6HostNames, redundant with localHostNames (and unused),
to free unused resources
2016-06-02 01:42:15 +02:00
reger
bd8f7c11f5 Use transparent addToCrawler in AutoSearch instead of addToIndex
This would likely also be of advantage for RSS import/schedule as
following bug-reports suggest
http://mantis.tokeek.de/view.php?id=569
http://mantis.tokeek.de/view.php?id=655
2016-06-01 01:14:22 +02:00
reger
f23d8ab47b fix 2 more servlet RuntimeException in intranet mode thrown due to seed.getIP()
returning null in intranet mode (in servlets: ConfigSearchBox, Load_PHPBB3)
+remove unused (const &empty;) seed.IPTYPE
2016-05-29 20:35:57 +02:00
reger
bb0076c3dd fix: assure close inputstream in TranslatorXliff after reading xlf file
by using try-with-resources block
2016-05-29 01:25:47 +02:00
reger
6384b7d82e fix NPE in Load_MediawikiWiki servlet in intranet mode
- in intranet mode getip returns null causing a NPE
  - adjust starturl (which was set to http://localip/repository) which is never the start url for the Mediawiki
+ correct javadoc for seed.getIP()
2016-05-27 03:10:25 +02:00
Michael Peter Christen
596b5dfa59 add the JRE version in the seed. Purpose: identify if it is possible to
migrate to new JRE version
2016-05-24 23:11:59 +02:00
reger
4cc38e979d add InputStream close after reading input file (Vocabulary_p servlet) 2016-05-24 00:26:28 +02:00
reger
6bf9c55584 adjust Solr select servlet to latest bugfix for boostquery (bq param)
to split query into multiple parameters on line separator in input query.
e.g. split "crawldepth_i_0^10.0 \n crawldepth_i:1^5.0"
but allow   "url_file_ext_s:jpg OR url_file_ext_s:png"  to remain unsplit
2016-05-22 22:43:56 +02:00
Burkhard
9a18e2297b Merge pull request #51 from JeremyRand/multiple-boost-query
Fix multiple boost queries
2016-05-22 22:24:04 +02:00
reger
f0d7b93372 make use of and activate autodetect charset in Vocabulary input from file
+ revert mistake of empty cn.lng
2016-05-22 05:38:26 +02:00
JeremyRand
433217b33e Properly support multiple Boost Queries. (Previous code was broken because it concatenated multiple Boost Queries together rather than passing Solr an array.) 2016-05-20 20:17:51 -05:00
JeremyRand
58824dfa6c Refactor escaping in config file read/write code. Now it uses Apache Commons StringUtils instead of RegEx. 2016-05-20 20:17:51 -05:00
reger
9e94989237 upd to PDFBox 2.0.1 2016-05-20 23:12:16 +02:00
reger
d0a571bed2 del cytag trail for own index.html (save resource not used by default) 2016-05-19 01:59:00 +02:00
reger
de46879637 fix SeedDB.get(byte[]) hash string compare (for returning own seed shortcut) 2016-05-17 02:07:49 +02:00
reger
24b0fa2a38 extend snapshot Html2Image.pdf2image to use PDFBox image export capability
if no external tool installed (and for Win)
Resulting jpg are not always perfect (if graphic included) but imho sufficient.
2016-05-16 02:13:33 +02:00
reger
eb2a00b1d8 fix NPE on missing crawldepth_i 2016-05-15 01:26:38 +02:00
reger
efb9f1a8b7 save resource for unused blacklistFiles map 2016-05-12 00:13:57 +02:00
reger
5f113be760 cleanup connectPeer & yacyVersion.latestRelease usage
obsolete since 
527b3decde
2016-05-06 21:05:15 +02:00
reger
7097dcbdbd cleanup hack for partial Solr update on multivalued datefields
has been fixed in Solr http://issues.apache.org/jira/browse/SOLR-8050
2016-05-06 02:47:04 +02:00
reger
f10ea3c155 clean-out unused SwitchboardConstants 2016-05-05 00:55:22 +02:00
reger
ef24593347 delete obsolete SEARCHRESULT busythread constants
not used since 29.05.2013 18:27:27
0c1a018bbd
2016-05-04 01:30:10 +02:00
reger
125b5e26a5 apply bugfix for ChartPlotter from Pullreq 42
https://github.com/yacy/yacy_search_server/pull/42
thanks to otteresk (https://github.com/otteresk)
2016-05-03 03:06:06 +02:00
reger
06ce9ae711 prevent "unchecked conversion" compiler message
+ include "translate" property in xlf "trans-unit" export
2016-05-01 02:22:05 +02:00
reger
b4a576dbdf exclude unused protocol param "duetime"
(receiver interprets param "time" only)
2016-04-25 01:57:33 +02:00
reger
3bd6ae8d8b keep addon/Notepad++ keyword marker on lng export
(length of remarks divider line)
+ harmonize status_p.inc lng text
2016-04-21 00:51:08 +02:00
reger
16837d60c7 fix version in locale version file
(it's compared to full version)
2016-04-17 22:54:28 +02:00
reger
0fb01e429e fix migration, account for ssl port in config (for auto-disable https) 2016-04-17 04:42:05 +02:00
reger
7be1c7a05a fix logger name 2016-04-17 03:20:14 +02:00
reger
1d940e5a94 upd commons-compress 1.11 2016-04-16 23:31:03 +02:00
reger
7789c32c82 delete crawl queue on init exception
(happens occasionally on path name violation and will never get resolved)
2016-04-16 00:22:48 +02:00