5141 Commits

Author SHA1 Message Date
Michael Peter Christen
85773ebd4f removed debug lines 2014-12-21 17:53:06 +01:00
reger
198102304b refactor size() -> filesize() of URIMetadataNode
(harmonize with ResultEntry and to not get confused with Collection.size())
2014-12-21 06:05:35 +01:00
Michael Peter Christen
445fafeb7c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-20 15:38:15 +01:00
Michael Peter Christen
0d69089c61 fix for division by zero 2014-12-20 15:11:06 +01:00
reger
ac61a39828 use peeraddress for link in remote crawl list
to make link work without enabled proxy

upd pom for Jetty (missing in last commit)
2014-12-20 01:59:00 +01:00
Michael Peter Christen
5516819354 preventing the use of no-cache and expires in case that images are
generated dynamically which will stay static in the future. This applies
mainly to the search result favicon in front of search hits. These icons
will now be generated once, but then cached in the browser. There is
also a YaCy-internal cache for these icons which had prevented the
re-generation of the icons in YaCy, but this cache is now superfluous
since the browser should not call the servlet ViewImage again.
2014-12-19 17:41:38 +01:00
Michael Peter Christen
d3e71ed070 fixes for searches when initialization of large autotagging libraries
has not finished
2014-12-19 17:38:58 +01:00
Michael Peter Christen
28683530cd fixes to usage of no-cache: use and recognize also the no-store
directive
2014-12-19 17:37:58 +01:00
Michael Peter Christen
932faafffe reactivated on-demand snapshot loading 2014-12-16 12:09:57 +01:00
Michael Peter Christen
2362ad7c34 fix for a count issue in snapshot api 2014-12-16 11:33:30 +01:00
Michael Peter Christen
9971e197e0 Added a transaction interface to the snapshots: all documents in the
snapshots can now be processed with transactions using commit and
rollback commands. Furthermore, a large number of monitoring methods have
been added to check the success of transactions.

The transactions for snapshots have two main components: an rss search
API to get information about latest/oldest entries and a commit/rollback
API to move entries away from the rss results. This is done by using
two storage locations for the snapshots, INVENTORY and ARCHIVE. New
snapshots are placed in INVENTORY, committed snapshots move to ARCHIVE,
and rolled-back snapshots move to INVENTORY again.

Normal Workflow:
Besides all the options below, it is usually sufficient to process data
like this:
- call
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
- process the rss result and use the <guid> value as <urlhash> (see next
command)
- for each processed result call
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
- then you can call the rss feed again and the committed urls are omitted
from the next set of items.

These are the commands to control this:
The rss feed:
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST
http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=ANY

The feed will return a <urlhash> in the <guid> field of the rss. This
must be used for commit/rollback:

Commit/Rollback:
http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash>
http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash>
The json will return a property list containing the property "result"
with possible values "success" or "fail", according to the result. If a
"fail" occurs, please look into the log for further info.

Monitoring:
http://localhost:8090/api/snapshot.json?command=status
This shows the total number of entries in the INVENTORY and the ARCHIVE 
http://localhost:8090/api/snapshot.json?command=list
This returns a list of all hosts which have snapshots and the number
of entries for the hosts. Counts for INVENTORY and ARCHIVE are listed in
the properties for "count.INVENTORY" and "count.ARCHIVE"
http://localhost:8090/api/snapshot.json?command=list&depth=2
The list can be restricted to entries at a specific depth. The list
then contains the same host names, but the count values change because
only documents at that specific crawl depth are listed
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80
This lists all urlhashes for the given host, not only an accumulated
list of the number of entries
http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0
This restricts the list of urlhashes for that host for the given depth
http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY
http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE
This selects either the INVENTORY or ARCHIVE for all list commands,
default is ALL which means that from both snapshot directories the host
information is collected and combined. You can use the state option for
all the commands as listed above

Detailed Information:
http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ
This collects metadata information for the given urlhash. This can also
be restricted with state=INVENTORY and state=ARCHIVE to test whether the
document is in one of these snapshot directories. If a urlhash
is not found, an empty result is returned. If an entry was found and the
state was not restricted, then the result contains a state property
containing the name of the location where the document is, either
INVENTORY or ARCHIVE.

Hint:
If a very large number of documents is inside of INVENTORY, then it
could be better to call the rss feed with
http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY
because that is very efficient.
2014-12-15 23:32:46 +01:00
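
The "Normal Workflow" from commit 9971e197e0 could be driven by a small client along these lines. This is only a sketch: the helper names are hypothetical, the feed is assumed to be plain RSS 2.0 with one <guid> per item, and no network call is made here (a real client would also have to send administrator credentials).

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Host and port as used in the commit message; adjust for your peer.
BASE = "http://localhost:8090/api"


def feed_url(state="INVENTORY", order="LATESTFIRST"):
    """Build the rss feed URL for one storage location and ordering."""
    query = urllib.parse.urlencode({"state": state, "order": order})
    return f"{BASE}/snapshot.rss?{query}"


def transaction_url(command, urlhash):
    """Build the commit or rollback URL for a single urlhash."""
    if command not in ("commit", "rollback"):
        raise ValueError("command must be 'commit' or 'rollback'")
    query = urllib.parse.urlencode({"command": command, "urlhash": urlhash})
    return f"{BASE}/snapshot.json?{query}"


def urlhashes_from_rss(rss_text):
    """Collect the <guid> values, which the feed uses to carry the urlhash."""
    root = ET.fromstring(rss_text)
    return [guid.text.strip() for guid in root.iter("guid") if guid.text]
```

A client would fetch feed_url(), pass the response body to urlhashes_from_rss(), process each entry, and then request transaction_url("commit", urlhash) for every processed hash.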
reger
6c3f36def1 - fix path to default heuristic.cfg
- deprecate unused ProxyServlet
2014-12-14 21:27:45 +01:00
Michael Peter Christen
c3c2b6999b fixes on wkhtmltopdf 2014-12-14 04:03:20 +01:00
Michael Peter Christen
ff035a20e7 fix for vocabulary import (double term detection) 2014-12-10 14:09:34 +01:00
Michael Peter Christen
e6650050fe fix for Is Facet checkbox 2014-12-10 13:14:39 +01:00
Michael Peter Christen
bd3ed5cae5 added charset detection to vocabulary reader 2014-12-10 13:11:51 +01:00
Michael Peter Christen
7bfc5b80cb added new options to vocabulary editor:
- new switch 'isFacet' which enables or disables the usage of the
vocabulary for search facets. This should be used for large
vocabularies since searches in solr are extremely slow if facets for a
large set of alternative terms are generated
- new option to disable auto-enrichment from synonyms
- new option to add synonyms from another column when importing from csv
- automatically recognize double-occurrences in synonyms and bundle
terms for such synonyms
2014-12-10 12:20:27 +01:00
Michael Peter Christen
8df8ffbb6d enhanced the snapshot functionality:
- snapshots can now also be xml files which are extracted from the solr
index and stored as individual xml files in the snapshot directory along
the pdf and jpg images
- a transaction layer was placed above the snapshot directory to
divide snapshots into 'inventory' and 'archive'. This may be used
to do transactions of index fragments using archived solr search results
between peers. This is currently unfinished, we need a protocol to move
snapshots from inventory to archive
- the SNAPSHOT directory was renamed to snapshot and contains now two
snapshot subdirectories: inventory and archive
- snapshots may now be generated by everyone, not only by peers
running on a server with wkhtmltopdf installed. The expert crawl start
provides the option for snapshots to everyone. PDF snapshots are now
optional and the option is only shown if wkhtmltopdf is installed.
- the snapshot api now provides the request for historised xml files,
i.e. call:
http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ
The result of such xml files is identical with solr search results with
only one hit.
The pdf generation has been moved from the http loading process to the
solr document storage process. This may slow down the process a lot and
a different version of the process may be needed.
2014-12-09 16:20:34 +01:00
Michael Peter Christen
4111d42c81 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-12-08 12:40:12 +01:00
Michael Peter Christen
793ce6d13b added confirmation dialogs for row deletion 2014-12-08 11:41:28 +01:00
Michael Peter Christen
cdc21d43b1 more robustness for broken table data in Table_API_p.html -- see bug
report http://mantis.tokeek.de/view.php?id=495
2014-12-08 11:35:40 +01:00
reger
1d3ea35d69 prevent NPE on host link for too short HeuristicCfg.OpenSearchURL 2014-12-08 01:35:37 +01:00
Michael Peter Christen
a95af11050 enhancement for clearing the crawl queue 2014-12-07 23:43:38 +01:00
reger
5f0bb1214f modified FieldReIndex to reindex queries with low number of documents first
by internally using a score map with number of documents as score
and working through the list from low to high.
2014-12-07 04:31:09 +01:00
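
The ordering described in commit 5f0bb1214f can be illustrated as follows. This is not the FieldReIndex code itself, only a sketch of the idea: the field names and counts are made up, and breaking ties by name is an assumption added to keep the order deterministic.

```python
def reindex_order(doc_counts):
    """Return the keys of a score map ordered from lowest to highest score.

    doc_counts maps a query (here: a field name) to its number of documents.
    Ties are broken by name, which is an assumption of this sketch.
    """
    return sorted(doc_counts, key=lambda name: (doc_counts[name], name))


# Hypothetical score map: queries with few documents are handled first.
queue = reindex_order({"title": 500, "author": 20, "keywords": 120})
```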
Michael Peter Christen
d97deb5555 npe fix 2014-12-06 00:43:12 +01:00
Michael Peter Christen
4fe4bf29ad added rss feed output to snapshot servlet which can be used to get a
list of latest/oldest entries in the snapshot database. This is an
example:
http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100

The properties depth, order, host and maxcount can be omitted. The
meanings of the fields are:
host: select only urls from this host or all, if not given
depth: select only urls at that crawl depth or all, if not given
maxcount: select at most the given number of urls or 10, if not given
order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to
select the first entries or ANY to select any

The rss feed needs administration rights to work, a call to this servlet
with rss extension must attach login credentials.
2014-12-06 00:25:05 +01:00
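
Since every property of the feed in commit 4fe4bf29ad is optional, a client only needs to send the ones it wants to set. A minimal sketch, assuming a hypothetical helper name and the host and port from the example URL:

```python
import urllib.parse

FEED = "http://localhost:8090/api/snapshot.rss"
ORDERS = ("LATESTFIRST", "OLDESTFIRST", "ANY")


def snapshot_feed_url(host=None, depth=None, maxcount=None, order=None):
    """Build the feed URL, leaving out every property that was not given."""
    if order is not None and order not in ORDERS:
        raise ValueError(f"order must be one of {ORDERS}")
    params = {"depth": depth, "order": order, "host": host, "maxcount": maxcount}
    given = {key: value for key, value in params.items() if value is not None}
    if not given:
        return FEED  # the servlet then applies its own defaults
    return FEED + "?" + urllib.parse.urlencode(given)
```

Calling snapshot_feed_url(host="yacy.net", depth=2, maxcount=100, order="LATESTFIRST") reproduces the example URL from the commit message.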
reger
d6539ba597 Merge origin/master 2014-12-05 01:15:41 +01:00
reger
ff18129def ViewFile servlet: update index if newer,
so viewed text and metadata (stored) info is similar
- to achieve it, use request with profile to allow indexing (defaultglobaltext) and update index
   (the resource is loaded, parsed anyway, so it's not an expensive operation)

Request: remove 2 unused init parameters
- number of anchors of the parent
- forkfactor sum of anchors of all ancestors
2014-12-05 01:13:37 +01:00
Michael Peter Christen
d83de9ecf5 added another path for the convert command because on older Macs
ImageMagick has a different installation location
2014-12-03 18:07:05 +01:00
Michael Peter Christen
226aea5914 added a servlet which can create preview images, preview thumbnails and
preview pdfs from web pages, i.e.:
http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128
http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/

This supports also an on-the-fly generation of the preview documents if
the user is an administrator. Otherwise, the servlet fails.
To enable this, you must add wkhtmltopdf, imagemagick and (on headless
servers) xvfb to your operating system.

for detailed instructions, see
97f6089a41
2014-12-03 11:45:48 +01:00
Michael Peter Christen
181911376c showing list of all threads in threaddump using the ThreadMXBean counter
(this obviously shows more threads than before?)
2014-12-02 16:21:06 +01:00
Michael Peter Christen
64887f6b21 show number of threads on status page 2014-12-02 16:04:11 +01:00
Michael Peter Christen
6f0167fac1 get cloned crawl start parameter for snapshots 2014-12-02 12:52:05 +01:00
Michael Peter Christen
97f6089a41 YaCy can now create web page snapshots as pdf documents which can later
be transcoded into jpg for image previews. To create such pdfs you must
do:

Add wkhtmltopdf and imagemagick to your OS, which you can do:
On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from
http://wkhtmltopdf.org/downloads.html and download
http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip
In Debian do "apt-get install wkhtmltopdf imagemagick"

Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and
"Always Fresh" - this is used by wkhtmltopdf to fetch web pages using
the YaCy proxy. Using "Always Fresh" it is possible to get all pages
from the proxy cache.

Finally, you will see a new option when starting an expert web crawl.
You can set a maximum depth for crawling which should cause a pdf
generation. The resulting pdfs are then available in
DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf
2014-12-01 15:03:09 +01:00
Michael Peter Christen
41d00350e4 moved network configuration to Use Case submenu; this is necessary
because the definition of portal peers within the YaCy freeworld network
is otherwise split into two different main menus.
2014-12-01 01:12:51 +01:00
reger
221f86dd5e position api icon (ViewFile.html) 2014-11-30 01:58:14 +01:00
Michael Peter Christen
ad0da5f246 added new web page snapshot infrastructure which will lead to the
ability to have web page previews in the search results.
(This is a stub, no function available with this yet...)
2014-11-29 11:56:32 +01:00
reger
c475be2937 fix (enable) error msg on empty query 2014-11-28 22:44:33 +01:00
reger
f709132961 remove obsolete alternate link
fix api link
2014-11-28 01:40:46 +01:00
Michael Peter Christen
3c71e1c872 show vocabularies in search result (in case of debugging) 2014-11-28 01:19:31 +01:00
Michael Peter Christen
2fce2e2697 larger boost fields for ranking 2014-11-27 12:11:54 +01:00
Michael Peter Christen
6c03ff8355 bold words in snippets should not be coloured black in the base style
because there are styles with dark backgrounds which make the bold word
invisible
2014-11-27 08:08:05 +01:00
Michael Peter Christen
c0f9f6ac66 added option to change the navbar-default, i.e. usable for dark skins 2014-11-26 18:01:35 +01:00
Michael Peter Christen
84763126e0 added option to make the YaCy proxy act as if the cache is never stale. If
set to 'Always Fresh' the cache is always used if the entry in the cache
exists. This is a good way to archive web content and access it without
going online again in case the documents exist.
To do so, open /Settings_p.html?page=ProxyAccess and check the "Always
Fresh" checkbox.
This is set to false by default, which behaves as before.
If you set this to true, then you have your web archive in DATA/HTCACHE.
Copy this to carry around your private copy of the internet!
2014-11-24 20:28:52 +01:00
Michael Peter Christen
5bb52f79be reduce number of calls to queue.size() because that may be a bottleneck
during crawling
2014-11-23 20:09:32 +01:00
Michael Peter Christen
092d97d7ac when importing vocabulary csv files, accept also files without semicolon
and truncate quotes from literals
2014-11-21 12:42:29 +01:00
Michael Peter Christen
ee9ec40048 added hints to ranking to make ranking boosts using vocabularies easier 2014-11-20 18:46:06 +01:00
Michael Peter Christen
70f03f7c8e do not cache search requests to Solr if the result is used for
doublechecking. If a double-check comes from cached results the
doublecheck fails.
2014-11-20 18:45:27 +01:00
Michael Peter Christen
a0b84e4def use a LinkedHashMap for facets to maintain facet order as given by solr 2014-11-20 18:44:29 +01:00
Michael Peter Christen
0dc6e0a5f2 added option to enrich vocabularies with synonyms from synonym database 2014-11-19 18:12:43 +01:00
Michael Peter Christen
6a2a669db4 added loading of the synonyms file from addon/synonyms into the
knowledge loader
2014-11-19 17:36:56 +01:00
Michael Peter Christen
fdba8e2fa0 fix for 2-day network stats table: showing 48 instead of 24 hours from
peer history
2014-11-17 14:23:21 +01:00
Michael Peter Christen
ec9d021568 added option in vocabulary editor to import CSV files with different
encodings (preselected windows-type character encoding which is typical
for CSV files). Fixed also other problems with character encoding in
dictionary files. Automatically generated vocabularies are now also
noted in the API steering.
2014-11-17 14:22:40 +01:00
reger
b558433211 adjust tag cloud font size calculation
to limit max font size to ~ TOPWORDS_MAXSIZE
2014-11-17 01:24:30 +01:00
Michael Peter Christen
0550b54d56 added fix to postprocessing: avoid caching of postprocessing collection
to always get fresh lists of documents. This is necessary since the
postprocessing changes the same documents which the
postprocessing-collection query selects.
2014-11-14 16:34:55 +01:00
Michael Peter Christen
68e8039fd1 added high-precision scheduler for API processes. This also allows
the execution to depend on available RAM or CPU load. The
default value for CPU load is 4.0 and the check runs once a minute.
2014-11-14 10:02:50 +01:00
Michael Peter Christen
0a879c98e7 added new 'firstSeen' database table and necessary data structures which
hold a date for each URL to record when a url was first seen. This is
then used to overwrite the modification date for urls upon recrawl in
case that the first-seen date is before the latest document date. This
behaviour is necessary due to the common behaviour of content management
systems which always attach the current date to all documents. Using the
firstSeen database it is possible to approximate a real first document
creation date in case that the crawler starts frequently for the same
domain. As a result the search results ordered by date have a much
better quality and the usage of YaCy as search agent for latest news has
a better quality.
2014-11-13 00:58:58 +01:00
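
The date correction from commit 0a879c98e7 can be pictured like this. It is a sketch only: the in-memory dict stands in for the firstSeen database table, and the function name and signature are hypothetical.

```python
from datetime import datetime


def effective_date(first_seen, url, document_date, now):
    """Return the date to store for a crawled document.

    first_seen stands in for the firstSeen table: it records when a url was
    first encountered. If that date is earlier than the date the document
    reports, the first-seen date wins.
    """
    seen = first_seen.setdefault(url, now)  # recorded only on the first crawl
    return seen if seen < document_date else document_date


table = {}
url = "http://example.org/news.html"  # hypothetical url
first = effective_date(table, url, datetime(2014, 11, 1), now=datetime(2014, 11, 1))
# On recrawl the content management system reports the current date again,
# but the earlier first-seen date is kept.
again = effective_date(table, url, datetime(2014, 11, 10), now=datetime(2014, 11, 10))
```

A document that reports a date older than its first crawl keeps its own date, because the first-seen date only replaces later dates.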
Michael Peter Christen
487a733c99 fix for catchall handling in search 2014-11-12 22:48:33 +01:00
sixcooler
33b0234454 added an input field for setting 'fileHost'
Set this to avoid error-messages like 'proxy use not allowed / granted'
on accessing your Peer by its hostname.
2014-11-12 21:32:34 +01:00
Michael Peter Christen
1db476c67e fix for bad table iteration 2014-11-10 18:52:01 +01:00
Michael Peter Christen
e05b7332b9 html fix 2014-11-10 02:18:44 +01:00
reger
c1ad265efd remove not used accordion javascript call for facet navs 2014-11-09 22:06:00 +01:00
Michael Peter Christen
ecdfb35f09 added long variables to debug output in index browser 2014-11-07 18:12:09 +01:00
Michael Peter Christen
95d87f00b3 fix for bad query generation in doublecheck in postprocessing 2014-11-07 18:11:23 +01:00
orbiter
a2b5cfb3cf added reverse button to tables, by default on now (to see latest entries
first)
2014-11-02 20:30:49 +01:00
orbiter
fceac5d2d4 added (missing) Tables_p.xml for table xml api 2014-11-02 20:10:32 +01:00
orbiter
dbafd4865e enhanced debug code in host browser 2014-10-30 15:47:44 +01:00
Michael Peter Christen
8f6587e87b fix for broken protocol navigation 2014-10-30 12:41:04 +01:00
Michael Peter Christen
5c962dd009 better scaling of network statistic graphs 2014-10-29 21:41:41 +01:00
orbiter
3ffe19b85c replaced old /api/table_p.xml servlet with /Tables_p.xml to avoid double
code
2014-10-29 17:23:58 +01:00
Michael Peter Christen
b4585e9546 added new index size history image in /Status.html page 2014-10-29 13:37:44 +01:00
Michael Peter Christen
9aebbbebc0 added network history in /Network.html?page=5 2014-10-29 13:21:35 +01:00
Michael Peter Christen
26279b0993 added debug code for statistics about document attributes related to
domains
2014-10-29 10:50:08 +01:00
reger
d65e3f2b53 RankingSolr: display only available or configured boost fields 2014-10-26 23:33:21 +01:00
Michael Peter Christen
4e56d79fc8 replaced input text field with text field for index deletion with query
and replaced GET with POST method. This should make it possible to
submit very large queries for deletion here.
2014-10-24 12:57:37 +02:00
orbiter
6f707b4305 removed spaces in seedlist.xml to reduce data 2014-10-20 18:05:37 +02:00
orbiter
78c9d31388 fix for bad json 2014-10-17 21:32:07 +02:00
Michael Peter Christen
8098a86f1d ipv6 fix for api /yacy/seedlist.[json|xml], multiple IPs are now
attached to the seed info. API clients must be adapted. Documentation
will be fixed in
http://www.yacy-websuche.de/wiki/index.php/Dev:APIseedlist

Also added a new retrieval option for seeds, they can now be retrieved
by their name with the get parameter name=<name>
2014-10-17 12:44:28 +02:00
Michael Peter Christen
07c5b57953 removed warnings 2014-10-15 11:19:25 +02:00
Michael Peter Christen
509eba2484 automatically zoom to location/POI 2014-10-15 11:07:08 +02:00
orbiter
fa2ad101ec enhanced graphics computation (avoiding long string parsing for colours) 2014-10-15 10:31:24 +02:00
orbiter
ef813cec91 added proper copyright notice to OSM tiles presented at the search
result page
2014-10-15 09:13:23 +02:00
Michael Peter Christen
1269e77dfa enhanced location search 2014-10-15 00:55:57 +02:00
Michael Peter Christen
75b5f24be4 make browsing of file://z: - paths in index browser easier - this will
now show the root paths on a shared drive
2014-10-13 18:33:39 +02:00
Michael Peter Christen
8ac3e9f890 fix for api icon in yacysearch_location.html 2014-10-13 16:53:00 +02:00
Michael Peter Christen
a1dd0ae62c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-10-12 23:43:32 +02:00
reger
f5967dfedf add filter to citation page and a on/off button
to display only sentences with citations,
while maintaining the sentence number.
Make the filtered list the default in search result citation link
2014-10-12 06:32:13 +02:00
Michael Peter Christen
f818f84adb more ipv6 fixes 2014-10-11 00:34:07 +02:00
Michael Peter Christen
2c2b50e65d refactoring (class name should start with uppercase letter) 2014-10-10 14:32:21 +02:00
Michael Peter Christen
14385057c2 added also the NetworkHistory servlet... 2014-10-10 14:16:16 +02:00
Michael Peter Christen
d8beafba3a fix for values in CrawlProfileEditor table and xml; now the full profile
is available in the xml.
2014-10-09 13:27:20 +02:00
Michael Peter Christen
ec95dfa2e6 fixed crawl profile xml result which did not show the correct crawl
status.
2014-10-08 18:48:57 +02:00
Michael Peter Christen
8c1a89cb34 added another decoration flag to switch off network graphics in crawler
monitor and index browser: decoration.grafics.linkstructure
Please set this to false to remove the graphics from the interface.
2014-10-08 17:12:35 +02:00
Michael Peter Christen
764e4ed673 fixed appearance of RSS icon on search result page 2014-10-08 15:48:45 +02:00
Michael Peter Christen
9b1958e8ca more ipv6 bugfixes 2014-10-08 15:21:49 +02:00
Michael Peter Christen
7817fc50c9 added a high cpu cycle monitor to PerformanceQueues 2014-10-08 15:20:43 +02:00
Michael Peter Christen
5082feb103 less volume for effect sounds 2014-10-08 15:04:35 +02:00
Michael Peter Christen
0bfc69b29b more ipv6 bugfixes 2014-10-08 12:38:56 +02:00
Michael Peter Christen
a27563e5c3 removed the atmo sound clips because they were too large 2014-10-07 23:42:41 +02:00
Michael Peter Christen
ae58b22f5b ipv6 fixes for Network.html front page 2014-10-07 21:57:41 +02:00