Commit Graph

13000 Commits

Author SHA1 Message Date
luccioman
5c8958bcea Updated Javadoc and Junit tests for the WebStructureGraph class. 2017-01-17 17:01:56 +01:00
luccioman
17b7c92009 Made sure webstructure.xml API produces valid XML.
Host names should not contain XML special characters such as quotation
marks, but at this stage the WebGraph may have mistakenly recorded a host
name with such characters. What's more, the DigestURL constructor does
not prevent this.
By using serverObjects.putXML to encode host names, we ensure that the
rendered XML is well formed and can be parsed by external tools even if
a structure entry is incorrect.
2017-01-17 15:59:55 +01:00
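The escaping idea in the commit above can be illustrated with a minimal sketch. It is not the code of serverObjects.putXML; the class XmlEscapeSketch, the method escapeXml and the domain/host markup in main are hypothetical names chosen for this example. It only shows why encoding a host name before rendering keeps the output well formed.

```java
/** Illustrative sketch only; not YaCy's serverObjects.putXML implementation. */
final class XmlEscapeSketch {

    /** Escapes the five XML special characters so the value is safe in text or attribute content. */
    static String escapeXml(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A host name that was wrongly recorded with a quotation mark (made-up value).
        String host = "exa\"mple.org";
        // The element and attribute names here are assumptions for illustration.
        System.out.println("<domain host=\"" + escapeXml(host) + "\"/>");
    }
}
```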
luccioman
d9766ca981 Fixed WatchWebStructure_p.html render to include https URLs.
As described in mantis 721 (http://mantis.tokeek.de/view.php?id=721),
WatchWebStructure_p.html failed to include https URLs, and more generally
protocols and ports other than default http, in its structure view.
2017-01-16 18:41:58 +01:00
luccioman
ed3dd5e31a Fixed webstructure.xml API used with a domain name 'about' parameter.
As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720),
when requesting this API with a domain name instead of a complete URL
only HTTP references on default port were listed.
2017-01-16 16:41:06 +01:00
luccioman
0da1e6ba16 Factored code re-implementing DigestURL.hosthash() method.
This ensures a consistent implementation of the url host hash generation
and makes its usages easier to find in source code.

Also added a unit test for this function.
2017-01-16 10:18:42 +01:00
luccioman
86adfef30f Added automated unit tests and performance tests for WebStructureGraph class.
Fixed references count when multiple links target the same domain name
in one document.
2017-01-13 16:10:59 +01:00
luccioman
f793d97e56 Factored common code with DigestURL.hosthash() 2017-01-13 16:05:46 +01:00
luccioman
9cea7cbb10 Detailed some Javadoc related to /api/webstructure.xml usage. 2017-01-12 17:52:47 +01:00
reger
007e2afa6e Start to rename "Augmented Browsing" to "Web Proxy ..." / "View via Proxy"
The augmented Browsing option was reduced to the web proxy functionality.
Augmented browsing is not available and no known plan exists to reimplement
alteration of result pages with additional information.
2017-01-12 01:36:30 +01:00
luccioman
c9889991b9 Fixed 2 failing JUnit tests. 2017-01-09 17:59:01 +01:00
luccioman
bdaef80a55 Ignore generated Javadoc with git SCM. 2017-01-09 16:45:31 +01:00
luccioman
6a4d51d8f9 Cleaned up some Javadoc warnings. 2017-01-09 16:44:47 +01:00
luccioman
86dc198698 Fixed some broken Javadoc links. 2017-01-09 09:57:53 +01:00
luccioman
c78e2f3b4b Fixed maven assembly base directory to match last main YaCy binaries. 2017-01-09 09:54:14 +01:00
reger
16beb551ea fix DC.Elements namespace in DublinCore vocabulary class
delete redundant (unused) DCElements.
2017-01-07 18:24:29 +01:00
luccioman
339f005ced Blacklist import and update performance improvements.
Measurement sample : import from blacklist local file containing about
15000 entries
 - before refactoring : several minutes
 - after refactoring : a few seconds!
2017-01-06 12:24:31 +01:00
luccioman
e3892b0957 Added some JavaDoc. 2017-01-06 11:23:40 +01:00
luccioman
52d05d14c6 Display result favicons only for http or https resources.
Favicon display only makes sense for http(s) websites, being public or
intranet. So I modified the favicon conditional display to verify the
result URL protocol rather than if we are in intranet mode.

Also prevented rendering an img HTML tag with empty src on other results
protocols such as ftp or file.

Fixed thanks to the report by priest2
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5923).
2017-01-06 09:00:28 +01:00
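The protocol check described in the commit above could look roughly like the following sketch. The class FaviconSketch and the method shouldShowFavicon are hypothetical names, not YaCy code; the sketch only shows a decision based on the result URL scheme rather than on the network mode.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Illustrative sketch only; names are assumptions, not taken from the YaCy sources. */
final class FaviconSketch {

    /** Returns true only when the result URL uses the http or https scheme. */
    static boolean shouldShowFavicon(String resultUrl) {
        try {
            String scheme = new URI(resultUrl).getScheme();
            if (scheme == null) {
                return false; // relative or scheme-less reference: no favicon
            }
            scheme = scheme.toLowerCase(Locale.ROOT);
            return scheme.equals("http") || scheme.equals("https");
        } catch (URISyntaxException e) {
            return false; // unparsable URL: render no img tag at all
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldShowFavicon("https://example.org/page"));
        System.out.println(shouldShowFavicon("ftp://example.org/file.txt"));
    }
}
```

With such a check, a template can skip the img tag entirely for ftp or file results instead of rendering one with an empty src.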
reger
4c9be29a55 fix concurrency issue with htmlParser using stale scraper data,
resulting in incorrect data for some html index metadata.
For details, see http://mantis.tokeek.de/view.php?id=717
2017-01-06 03:01:52 +01:00
luccioman
b154d3eb87 Added descriptive titles to Crawler_p.html speed settings.
As reported by bubul
(http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5924), the meaning of
the LF and MH acronyms was not detailed.
Also added label tags for improved accessibility on these input fields.
2017-01-05 14:54:59 +01:00
reger
eedee6eabb fix exception on URIMetadataNode instantiation with corrected id hash on
host_id_s. Use Solr setField instead of addField to prevent
java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
	at net.yacy.kelondro.data.meta.URIMetadataNode.hosthash(URIMetadataNode.java:247)
	at net.yacy.search.query.SearchEvent.addNodes(SearchEvent.java:966)
	at net.yacy.peers.Protocol.solrQuery(Protocol.java:1242)
	at net.yacy.peers.RemoteSearch$2.run(RemoteSearch.java:349)
2017-01-05 00:24:37 +01:00
luccioman
b55cf16dad Upgraded jgit build library to version 4.5.0
This is the latest Java 7 compatible jgit release.

Properly support GitHub tags marked as "Pre-release". 
With the previous venerable jgit version 1.1.0, a YaCy repository clone
having such a tag made GitRevTask and GitRevMavenTask crash.
2017-01-04 17:09:37 +01:00
luccioman
9cfe8dd6d6 Upgraded Apache Ant to 1.10.0 for the Alpine flavor Docker image. 2017-01-02 14:23:25 +01:00
luccioman
c1401d821e Adjusted crawl depth control for FTP crawl start URLs. 2017-01-02 10:24:17 +01:00
reger
68d4dc5cc5 Complete harmonization RequestHeader getCookie with std ServletRequest
to use javax.servlet.http.Cookie parameters.
Deprecate now obsolete getHeaderCookies.
Adjust setting of MaxAge to spec if >= 0 otherwise keep default.
2017-01-02 03:04:21 +01:00
reger
396ed3c769 On negative result vote also delete document from fulltext index
(not only from dht)
2017-01-01 23:58:38 +01:00
reger
50e211fd92 Merge origin/master 2017-01-01 23:54:18 +01:00
reger
a1e5f7dbca fix of fulltext.remove() by id of webgraph document
webgraph has document hash in source_id_s
2017-01-01 23:53:44 +01:00
luccioman
1eafa7bfaf Fixed docker stop behavior.
- Adjusted start script in debug mode to make sure the main java process
can receive signals such as SIGTERM
- Modified docker images main command to properly propagate SIGTERM
signal to the main java process
2016-12-31 09:51:07 +01:00
luccioman
1df558a6c6 Fixed YaCy proper shutdown triggered by SIGTERM signal.
The main shutdown hook thread was not properly waiting for the main
thread termination, so the main thread could not properly close resources
and threads. After terminating a running YaCy peer this way (Ctrl+C in
console, or kill <pid> for example), the DATA/yacy.running file was
still present.

Tested with :
 - Debian Jessie openjdk 7 and 8 : regular shutdown, Ctrl+C, kill
command, system restart while yacy is running
 - Windows 10 Oracle JDK 7 and 8 : non regression on regular shutdown
2016-12-28 09:47:27 +01:00
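The pattern behind the fix above, a shutdown hook that signals the working thread and then waits for it, can be sketched as follows. This is a minimal illustration and not the YaCy implementation: the class ShutdownHookSketch, its methods and the cleanedUp flag are hypothetical, and the cleanup step merely stands in for closing resources or removing a file such as DATA/yacy.running.

```java
import java.util.concurrent.CountDownLatch;

/** Illustrative sketch only; names are assumptions, not taken from the YaCy sources. */
final class ShutdownHookSketch {

    /** Released when a shutdown is requested; the work thread blocks on it. */
    private final CountDownLatch shutdownRequested = new CountDownLatch(1);

    /** Set by the work thread once its cleanup has finished. */
    volatile boolean cleanedUp = false;

    /** Stand-in for the main work loop: waits for the shutdown signal, then cleans up. */
    void runUntilShutdown() {
        try {
            shutdownRequested.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        cleanedUp = true; // stand-in for closing resources and deleting a lock file
    }

    /** Builds a hook that signals the work thread and then waits for it to terminate. */
    Thread newShutdownHook(Thread workThread) {
        return new Thread(() -> {
            shutdownRequested.countDown();
            try {
                // Without this join, the JVM may halt before the cleanup has completed.
                workThread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }
}
```

In a real program the hook would be registered once at startup, for example with Runtime.getRuntime().addShutdownHook(app.newShutdownHook(Thread.currentThread())) before the main thread enters runUntilShutdown(). The JVM then runs the hook on SIGTERM or Ctrl+C and does not halt until the hook has returned.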
Michael Peter Christen
fce701f5cc release 1.92 2016-12-26 15:00:55 +01:00
Michael Peter Christen
c7af94a0c0 release 1.92 2016-12-26 14:13:45 +01:00
Michael Peter Christen
dbd34befc0 added luccioman development release builds as discussed in
http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5906
2016-12-26 13:50:35 +01:00
Michael Peter Christen
204e507b2b updated seed-list bootstrap locations 2016-12-26 13:32:36 +01:00
reger
b522d540b9 Include itemprop latitude/longitude (see schema.org) in attribute
parsing for lat/lon.
Harmonize number parsing for lat/lon to parseDouble.
Fix endDate_dts value assignment.
2016-12-25 23:39:55 +01:00
reger
083df255e4 fix html tag attribute parsing containing attribute w/o value
e.g. itemscope or autofocus (in such case the next key was not properly
recognized).
2016-12-24 06:57:11 +01:00
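The parsing problem described above (a valueless attribute such as itemscope swallowing the next key) can be illustrated with a small stand-alone parser. This is a sketch under simplifying assumptions, not the YaCy scraper code: the class TagAttributeSketch and the method parseAttributes are hypothetical, and valueless attributes are mapped to an empty string.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; not the attribute parser used by YaCy. */
final class TagAttributeSketch {

    /** Parses the attribute part of an HTML start tag; attributes without a value map to "". */
    static Map<String, String> parseAttributes(String s) {
        Map<String, String> attrs = new LinkedHashMap<>();
        final int n = s.length();
        int i = 0;
        while (i < n) {
            while (i < n && Character.isWhitespace(s.charAt(i))) i++; // skip blanks before a key
            int keyStart = i;
            while (i < n && !Character.isWhitespace(s.charAt(i)) && s.charAt(i) != '=') i++;
            if (i == keyStart) { i++; continue; } // stray '=' or end of input: skip
            String key = s.substring(keyStart, i).toLowerCase(Locale.ROOT);
            while (i < n && Character.isWhitespace(s.charAt(i))) i++;
            if (i >= n || s.charAt(i) != '=') {
                // No '=' follows: valueless attribute, so the next token starts a new key.
                attrs.put(key, "");
                continue;
            }
            i++; // consume '='
            while (i < n && Character.isWhitespace(s.charAt(i))) i++;
            String value;
            if (i < n && (s.charAt(i) == '"' || s.charAt(i) == '\'')) {
                char quote = s.charAt(i++);
                int valueStart = i;
                while (i < n && s.charAt(i) != quote) i++;
                value = s.substring(valueStart, i);
                if (i < n) i++; // consume closing quote
            } else {
                int valueStart = i;
                while (i < n && !Character.isWhitespace(s.charAt(i))) i++;
                value = s.substring(valueStart, i);
            }
            attrs.put(key, value);
        }
        return attrs;
    }
}
```

For the input itemscope itemtype="http://schema.org/Event" autofocus, the sketch keeps itemtype as its own key instead of treating it as the value of itemscope.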
reger
cb95b7339a include html5 <time> tag in content scraper,
add "datetime" property of <time> tag to scrapers startdate list.
Datetime is parsed as iso8601 (xml) date; html5 also allows partial dates
as well as durations (not handled by this)
2016-12-24 03:11:35 +01:00
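As a rough illustration of the datetime handling mentioned above, the following sketch parses a complete ISO 8601 date-time with offset and ignores partial dates and durations. It uses java.time rather than whatever date utilities YaCy uses internally, and the class TimeTagSketch and the method parseDatetime are hypothetical names.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Illustrative sketch only; not the parsing code of the YaCy content scraper. */
final class TimeTagSketch {

    /**
     * Parses the datetime attribute of a time tag when it is a full ISO 8601
     * date-time with offset. Partial dates and durations yield an empty result.
     */
    static Optional<Instant> parseDatetime(String datetime) {
        if (datetime == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(OffsetDateTime.parse(datetime.trim()).toInstant());
        } catch (DateTimeParseException e) {
            return Optional.empty(); // e.g. "2016-12" or "PT4H18M"
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDatetime("2016-12-24T03:11:35+01:00"));
        System.out.println(parseDatetime("PT4H18M"));
    }
}
```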
reger
f153cc4b5d add/allow to create a bookmark of result viewed via urlproxy.
For this, an "add bookmark" button is available in the header of the
viewed result (for authenticated users).
Currently the bookmark is added to a (virtual) bookmark folder "/proxy"
w/o any additional tags etc.
2016-12-23 19:03:44 +01:00
reger
7bf2bcf504 fix and prevent exception on missing required cookie name
skip cookie creation if name is empty.
2016-12-22 19:52:38 +01:00
luccioman
3ca695390c FTP crawl start URLs : applied crawl profile depth control
Applied rules :
- when the FTP URL denotes a file resource, stack it as any start URL :
any embedded links can be followed applying the usual depth rules
- when the FTP URL denotes a directory, list files under this directory
and stack them for crawl, and repeat the process on sub folders until
crawl depth is reached
2016-12-22 16:25:09 +01:00
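The directory rule above can be sketched with an in-memory listing instead of a real FTP connection. Everything here is hypothetical: the class FtpDepthSketch, the method filesToStack, the convention that directory entries end with "/", and the assumption that the start directory counts as depth 0. The commit itself does not state how YaCy counts depth for FTP folders.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; the depth convention is an assumption, not YaCy's documented behavior. */
final class FtpDepthSketch {

    /**
     * Collects the files to stack for crawling, starting at dir and descending
     * into sub folders until maxDepth is reached (start directory = depth 0).
     */
    static List<String> filesToStack(Map<String, List<String>> listing, String dir, int maxDepth) {
        List<String> stacked = new ArrayList<>();
        collect(listing, dir, 0, maxDepth, stacked);
        return stacked;
    }

    private static void collect(Map<String, List<String>> listing, String dir,
                                int depth, int maxDepth, List<String> out) {
        for (String entry : listing.getOrDefault(dir, Collections.emptyList())) {
            String path = dir + entry;
            if (entry.endsWith("/")) {
                // Sub folder: only descend while the crawl depth allows it.
                if (depth < maxDepth) {
                    collect(listing, path, depth + 1, maxDepth, out);
                }
            } else {
                out.add(path); // plain file: stack it for crawl
            }
        }
    }
}
```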
luccioman
128c8ef8d4 Fixed title rendering having non ASCII chars in QuickCrawlLink_p.html. 2016-12-21 08:19:09 +01:00
luccioman
ee6933c004 Added a title on the previous and next page pagination buttons.
This is to clarify the meaning of these buttons for users who could
think they link respectively to the first and last results page.
2016-12-21 07:22:41 +01:00
reger
8eb6fba59c activate filetype navigator plugin and restrict config (append) of navs
to those not already active.
DHT results are now included in the count; this might overshoot when DHT
and Solr results are redundant, while the previous Solr facet based count
was always low.
2016-12-21 02:04:13 +01:00
luccioman
c25e48e969 Enabled displaying results after 14th page for local search queries.
Fixes issue #90 for local queries only: Stealth mode, Portal mode or
Intranet mode. 
For P2P mode, the issue would probably be difficult to solve with
reasonable performance. This still needs investigation.

Also switched some InterruptedException catch log messages to warn
level as this is normal behavior when shutting down a peer.

Fixed yacysearch buttons navbar behavior to deal correctly with total
results count or offset over 1000. Also improved the buttons navbar to
be able to navigate over 10th page for local queries.
2016-12-20 14:52:33 +01:00
luccioman
a3886c6adb Merge branch 'master' of https://github.com/yacy/yacy_search_server.git 2016-12-20 13:41:13 +01:00
luccioman
feaa87005e Improved indentation for easier debugging steps. 2016-12-20 08:27:17 +01:00
reger
6be9d62ab4 show earthsearch.png in ConfigSearchPage layout on activated location
navigator (for more realistic impression)
2016-12-20 02:06:43 +01:00
reger
bab4804d11 add FileTypeNavigator plugin 2016-12-19 23:56:03 +01:00
reger
d35c47090c remove obsolete put of HttpServletRequest attributes to YaCy servlet
parameters on SSI (server side includes).
Query parameters are already merged by dispatcher.include, making the copy
of parameters (RequestDispatcher.INCLUDE_QUERY_STRING) obsolete.
All other parameters are not used as YaCy servlet arguments.
2016-12-19 02:30:55 +01:00
reger
0959038624 correct DefaultServlet resource pathinContext calculation
exclude servletPath option as resources are always relative to htroot 
or htdocs, the change reflects this.
Theoretically, this change and the recent adjustments regarding relative urls
allow the instance to be configured under a path other than
root (/)
2016-12-18 21:11:00 +01:00