Commit Graph

443 Commits

Author SHA1 Message Date
luccioman
e048e74072 Added an optional parameter to webstructure.xml api.
This new "documentStructure" parameter can be set to false to only get
hosts accumulated references on a resource and thus prevent scraping the
specified URL and getting citations references.

Also set WebStructureGraph constants as final and updated the Javadoc
with example api call URLs.
2017-01-19 12:30:44 +01:00
luccioman
17b7c92009 Made sure webstructure.xml API produces valid XML.
Host names should not contain XML special characters such as quotation
mark, but at this stage the WebGraph may have mistakenly recorded a host
name with such characters. What's more the DigestURL constructor does
not prevent this.
By the way using serverObjects.putXML to encode host names we ensure
here the rendered XML is well formed and can be parsed by external tools
even if an structure entry is incorrect.
2017-01-17 15:59:55 +01:00
luccioman
ed3dd5e31a Fixed webstructure.xml API used with a domain name 'about' parameter.
As described in mantis 720 (http://mantis.tokeek.de/view.php?id=720),
when requesting this API with a domain name instead of a complete URL
only HTTP references on default port were listed.
2017-01-16 16:41:06 +01:00
luccioman
f793d97e56 Factored common code with DigestURL.hosthash() 2017-01-13 16:05:46 +01:00
luccioman
9cea7cbb10 Detailed some Javadoc related to /api/webstructure.xml usage. 2017-01-12 17:52:47 +01:00
reger
c50e23c495 reduce creation of empty legacy RequestHeader() in situation where null
is acceptable (less for garbage collection).
2016-12-18 02:38:43 +01:00
reger
f45945cada increase use of header const for custom "EXT" header 2016-11-13 01:39:14 +01:00
luccioman
812abfc868 Converted one more set of URLs to pure relative ones.
Easier YaCy peer configuration behind a reverse proxy subfolder : no
need for the reverse proxy to rewrite HTML links or URLs in css files.

Tested on Debian Jessie with an apache2 reverse proxy.

See related mantis issues http://mantis.tokeek.de/view.php?id=106 and
http://mantis.tokeek.de/view.php?id=701
2016-11-12 15:54:35 +01:00
luccioman
74fec066f4 Converted more URLs to pure relative ones.
Easier YaCy peer configuration behind a reverse proxy subfolder : no
need for the reverse proxy to rewrite HTML links or URLs in css files.

Tested on Debian Jessie with an apache2 reverse proxy.

See related mantis issues http://mantis.tokeek.de/view.php?id=106 and
http://mantis.tokeek.de/view.php?id=701
2016-11-12 10:51:54 +01:00
luccioman
734340c128 Fixed errors for Search portal mode or when peer is not reachable.
Same case as reported on issue #87.
2016-11-04 14:31:22 +01:00
luccioman
6e1959f469 Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
Conflicts:
	htroot/yacysearchitem.java
	source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
	source/net/yacy/search/schema/CollectionConfiguration.java
	source/net/yacy/server/serverObjects.java
2016-10-14 11:29:55 +02:00
reger
7c81160f45 correct blacklist export as text url to blacklists_p.txt
was using servlet for network access and missing network.unit.name
fix for http://mantis.tokeek.de/view.php?id=694
+ prevent unresoved_pattern in yacy/list servlet
2016-10-07 03:03:41 +02:00
reger
91ab8a526a add error msg to api/share.html
and skip display of url on nothing uploaded
2016-08-17 03:07:26 +02:00
luccioman
6e96c7341a Merge remote-tracking branch 'origin/master'
Conflicts:
	htroot/Load_MediawikiWiki.java
	htroot/Load_PHPBB3.java
	htroot/ViewImage.java
2016-07-03 18:59:00 +02:00
reger
4e0892962a fix NPE in citation servlet on empty text field 2016-05-14 03:51:13 +02:00
reger
d9adc2c255 load handler for Transparent Proxy on startup only if feature is activated
to save the resources and keep handler chain small if the feature is not used.
+add a warning message on settingsack_p page to restart on first activation
2016-03-25 05:26:48 +01:00
Michael Peter Christen
b89465d952 0N - basic dump upload servlet infrastructure, to share index dumps
within an experimental new sharing model
2016-03-11 18:12:13 +01:00
Michael Peter Christen
f12a900f3e harmonization of http post of files for one and several files - this had
been differently - and wrong for several files. also: base64-encoding
for gzipped push files because our data structures currently only
supports ASCII POST pushes..
2016-03-11 08:59:33 +01:00
luc
8682dfbd5e Updated getpageinfo outputs to return page icons list. 2016-02-10 09:02:21 +01:00
luc
3cc5619d93 Improved HTML icons indexing and rendering in search results.
See http://mantis.tokeek.de/view.php?id=629
2016-02-02 09:57:54 +01:00
luc
571bc55937 Refactoring : use StandardCharsets constants instead of hard-coded
charset names.
2016-01-05 23:37:05 +01:00
luc
55a4d15775 Added a note on deprecated default search field and operator. 2015-12-14 23:55:12 +01:00
reger
52a9040ae6 Sort out double keywords (dc_subject) early in parsed documents
- by direct using Set vs. List
- remove not neede String[] getter
2015-11-13 01:48:28 +01:00
reger
a60b1fb6c2 differentiate api call getLocalPort() from getConfigInt() 2015-10-31 23:09:03 +01:00
sixcooler
87e4abe393 fight the fieldcache by usind DocValues: in Solr-5.x the fieldcache has
moved and was not cleared anymore. This results in an huge fieldcache.
(http://lucene.apache.org/#highlights-of-the-lucene-release-include
https://issues.apache.org/jira/browse/LUCENE-5666)
Here I try to use DovValues where it is possible.
For this I used the Api-Scheme as new basis für the Solr-Schema.
This needs at least a complete optimization of the Solr-Index to get a
smaller FieldCache.
Everything that is indexed with these setting will not use the
Fieldcache at all.
2015-08-31 20:24:41 +02:00
Michael Peter Christen
b43811d38c added surrogate import process for exported solr dumps.
Just throw your solr dump file into DATA/SURROGATES/in/ and it will be
imported!
2015-05-30 13:19:59 +02:00
reger
3e742d1e34 Init remote crawler on demand
If remote crawl option is not activated, skip init of remoteCrawlJob to save the resources of queue and ideling thread.
Deploy of the remoteCrawlJob deferred on activation of the option.
2015-05-23 02:06:39 +02:00
reger
609c52e987 refactor getBookmark
to consistenly check existance by != null (w/o throwing exception on not found)
2015-05-11 00:37:04 +02:00
reger
8a5b8f8789 on bookmaring of search result, remember orig. query in separate bookmark property
(instead of using the description field)
- adjust display and autosearch
- don't overwrite existing bookmark but combine info
2015-05-03 02:31:50 +02:00
Michael Peter Christen
fed26f33a8 enhanced timezone managament for indexed data:
to support the new time parser and search functions in YaCy a high
precision detection of date and time on the day is necessary. That
requires that the time zone of the document content and the time zone of
the user, doing a search, is detected. The time zone of the search
request is done automatically using the browsers time zone offset which
is delivered to the search request automatically and invisible to the
user. The time zone for the content of web pages cannot be detected
automatically and must be an attribute of crawl starts. The advanced
crawl start now provides an input field to set the time zone in minutes
as an offset number. All parsers must get a time zone offset passed, so
this required the change of the parser java api. A lot of other changes
had been made which corrects the wrong handling of dates in YaCy which
was to add a correction based on the time zone of the server. Now no
correction is added and all dates in YaCy are UTC/GMT time zone, a
normalized time zone for all peers.
2015-04-15 13:17:23 +02:00
reger
7fcf0d0b71 fix missing display of CrawlerMonitor -> robots.txt Monitor
revert delete of file api/table_p.html see 3ffe19b85c
(still used in this menu)
2015-03-24 00:13:05 +01:00
Michael Peter Christen
710a0efa1b generalized time period computations 2015-03-02 12:55:31 +01:00
Michael Peter Christen
974d58b01f IPv6 Fix for push interface 2015-02-04 15:03:34 +01:00
Michael Peter Christen
69eacdf4eb applying precompiled CommonPattern.COMMA.split to all places where
split(",") was used
2015-01-29 01:46:22 +01:00
Michael Peter Christen
bee5ee7cce removed some warnings 2015-01-27 17:00:20 +01:00
Michael Peter Christen
6390454652 fix for vocabulary on/off setting 2015-01-27 16:24:27 +01:00
Michael Peter Christen
7db2888336 fixed font size and print page generation in pdf snapshots 2015-01-20 17:14:14 +01:00
reger
198102304b refactor size() -> filesize() of URIMetadataNode
(harmonize with ResultEntry and to not get confused with Collection.size())
2014-12-21 06:05:35 +01:00
Michael Peter Christen
5516819354 preventing the use of no-cache and expires in case that images are
generated dynamically which will stay static in the future. This applies
mainly to the search result favicon in front of search hits. These icons
will now be generated once, but then caches in the browser. There is
also a YaCy-internal cache for these icons which had prevented the
re-generation of the icons in YaCy, but this cache is now superfluous
since the browser should not call the servlet ViewImage again.
2014-12-19 17:41:38 +01:00
Michael Peter Christen
932faafffe reactivated on-demand snapshot loading 2014-12-16 12:09:57 +01:00
Michael Peter Christen
2362ad7c34 fix for a count issue in snapshot api 2014-12-16 11:33:30 +01:00
Michael Peter Christen
9971e197e0 Added a transaction interface to the snapshots: all documents in the
snapshots can now be processed with transactions using commit and
rollback commands. Furthermore, a large number of monitoring methods had
been added to check the success of transactions.

The transactions for snapshots have two main components: a rss search
API to get information about latest/oldest entries and a commit/rollback
API to move entries away from the rss results. This is done by usage of
two storage locations for the snapshots, INVENTORY and ARCHIVE. New
snapshots are placed to INVENTORY, commited snapshots move to ARCHIVE,
rollback snapshots move to INVENTORY again.

Normal Workflow:
Beside all these options below, usually it is sufficient to process data
like this:
- call
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
- process the rss result and use the <guid> value as <urlhash> (see next
command)
- for each processed result call
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
- then you can call the rss feed again and the commited urls are omited
from the next set of items.

These are the commands to control this:
The rss feed:
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST

The feed will return a <urlhash> in the <guid> - field of the rss. This
must be used for commit/rollback:

Commit/Rollback:
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash>
The json will return a property list containing the property "result"
with possible values "success" or "fail", according of the result. If an
"fail" occurs, please look into the log for further info.

Monitoring:
http://localhost:8090/api/snapshot.json?command=status
This shows the total number of entries in the INVENTORY and the ARCHIVE 
http://localhost:8090/api/snapshot.json?command=list
This will result a list of all hosts which have snapshots and the number
of entries for the hosts. Counts for INVENTORY and ARCHIVE are listed in
the porperties for "count.INVENTORY" and "count.ARCHIVE"
http://localhost:8090/api/snapshot.json?command=list&depth=2
The list can be restricted to such which have a specific depth. The list
contains then the same host names, but the count values change because
only documents at that specific crawl depth are listed
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80
This lists all urlhashes for the given host, not only an accumulated
list of the number of entries
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0
This restricts the list of urlhashes for that host for the given depth
http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY
http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE
This selects either the INVENTORY or ARCHIVE for all list commands,
default is ALL which means that from both snapshot directories the host
information is collected and combined. You can use the state option for
all the commands as listed above

Detailed Information:
http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ
This collects metadata information for the given urlhash. This can also
be restricted with state=INVENTORY and state=ARCHIVE to test if the
document is either in one of these snapshot directories. If an urlhash
is not found, an empty result is returned. If an entry was found and the
state was not restricted, then the result contains a state property
containing the name of the location where the document is, either
INVENTORY or ARCHIVE.

Hint:
If a very large number of documents is inside of INVENTORY, then it
could be better to call the rss feed with
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
because that is very efficient.
2014-12-15 23:32:46 +01:00
Michael Peter Christen
c3c2b6999b fixes on wkhtmltopdf 2014-12-14 04:03:20 +01:00
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory along
the pdf and jpg images
- a transaction layer was placed above of the snapshot directory to
distinguish snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments using archived solr search results
between peers. This is currently unfinished, we need a protocol to move
snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and contains now two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only such peers
running on a server with xkhtml2pdf installed. The expert crawl starts
provides the option for snapshots to everyone. PDF snapshots are now
optional and the option is only shown if xkhtml2pdf is installed.
- the snapshot api now provides the request for historised xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such xml files is identical with solr search results with
only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down the process a lot and
a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
Michael Peter Christen
d97deb5555 npe fix 2014-12-06 00:43:12 +01:00
Michael Peter Christen
4fe4bf29ad added rss feed output to snapshot servlet which can be used to get a
list of latest/oldest entries in the snapshot database. This is an
example:
http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100

The properties depth, order, host and maxcount can be omited. The
meaning of the fields are:
host: select only urls from this host or all, if not given
depth: select only urls at that crawl depth or all, if not given
maxcount: select at most the given number of urls or 10, if not given
order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to
select the first entries or ANY to select any

The rss feed needs administration rights to work, a call to this servlet
with rss extension must attach login credentials.
2014-12-06 00:25:05 +01:00
reger
ff18129def ViewFile servlet: update index if newer,
so viewed text and metadata (stored) info is similar
- to archive it, use request with profile to allow indexing (defaultglobaltext) and update index 
   (the resource is loaded, parsed anyway, so it's not a expensive operation)

Request: remove 2 unused init parameter 
- number of anchors of the parent
- forkfactor sum of anchors of all ancestors
2014-12-05 01:13:37 +01:00
Michael Peter Christen
226aea5914 added a servlet which can create preview images, preview tumbnails and
preview pdfs from web pages, i.e.:
http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/

This supports also an on-the-fly generation of the preview documents if
the user is an administrator. Otherwise, the servlet fails.
To enable this, you must add wkhtmltopdf, imagemagick and (on headless
servers) xvfb to your operation system.

for detailed instructions, see
97f6089a41
2014-12-03 11:45:48 +01:00
Michael Peter Christen
0550b54d56 added fix to postprocessing: avoid caching of postprocessing collection
to always get fresh lists of documents. This is necessary since the
postprocessing changes the same documents which the
postprocessing-collection query selects.
2014-11-14 16:34:55 +01:00
Michael Peter Christen
1db476c67e fix for bad table iteration 2014-11-10 18:52:01 +01:00