Commit Graph

71 Commits

Author SHA1 Message Date
Michael Peter Christen
92655c7fd9 - added bootstrap css framework
- adopted all YaCy administration pages to new framework
- created new search page layout (working, but still work in progress)
- old skin files are fully appliable! (and looking good)
- target is a new style based on bootstrap examples, see /test.html
- icons in YaCy may be replaced by glyphicons (to be done)
2014-03-18 13:42:31 +01:00
Michael Peter Christen
b08375da33 fix for bad/missing values of size_i 2014-03-11 09:51:04 +01:00
Michael Peter Christen
51800007c4 - added concurrency to postprocessing of webgraph document
- bundeled separate webgraph postprocesing steps into one
2014-03-06 01:43:48 +01:00
Michael Peter Christen
fdaeac374a - enhanced postprocessing speed and memory footprint (by using HashMaps
instead of TreeMaps)
- enhanced memory footprint of database indexes (by introduction of
optimize calls)
- optimize calls shrink the amount of used memory for index sets if they
are not changed afterwards any more
2014-02-28 14:01:09 +01:00
Michael Peter Christen
0f6b72f24b do not use luke requests for remote solr servers if the result is
different from normal requests. This happens if the remote solr is
actually a solrCloud; in such cases the luke request returns only the
result of the single solr peer, not the whole cloud.
also done: some refactoring.
2014-02-26 14:30:48 +01:00
orbiter
da5d4128bf prevent npe 2014-02-25 03:26:20 +01:00
orbiter
f6e441dd77 refactoring 2014-02-24 21:01:56 +01:00
orbiter
c3f6c06f2c removed host increment on stored documents from crawler (that was wrong) 2014-02-24 20:02:15 +01:00
Michael Peter Christen
69391e5d9e changed strategy to test existence of documents in Solr: using the
update time. The reason for that is a better caching for the crawler
double-check, which needs the update time for crawler steering.
2014-02-19 04:03:45 +01:00
Michael Peter Christen
8b14e92ba4 added button in host browser to re-load 404/failed documents 2014-01-23 15:56:36 +01:00
Michael Peter Christen
234ca720f5 only admins should be able to force a commit 2013-11-26 07:03:20 +01:00
Michael Peter Christen
64048ff217 fir for XSS 2013-11-22 09:53:32 +01:00
Michael Peter Christen
ffe8276063 replaced referrer link masking to 'pure' links to the referring page
(that was more useful during testing)
2013-11-06 18:05:46 +01:00
Michael Peter Christen
434e13b46d in host browser also show the properties of failed documents including
referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)
2013-11-01 13:30:53 +01:00
orbiter
1ac504ae51 use html encoding for urls in metadata 2013-10-31 16:16:29 +01:00
Michael Peter Christen
2602be8d1e - removed ZURL data structure; removed also the ZURL data file
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: added must-match pattern
application to feed urls to filter out such urls which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs are now temporary objects in
memory
2013-09-17 15:27:02 +02:00
Michael Peter Christen
5e31bad711 - the webgraph shall store all links which appear on a web page and not
all unique links! This made it necessary, that a large portion of the
parser and link processing classes must be adopted to carry a different
type of link collection which carry a property attribute which are
attached to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI had been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
2013-09-15 00:30:23 +02:00
Michael Peter Christen
e137ff4171 refactoring (im preparation for new removeHost method) 2013-09-05 09:59:41 +02:00
orbiter
f106345eef link strings should not be tokenized 2013-09-01 14:35:36 +02:00
Michael Peter Christen
76afcccaaf fix for default boolean post values: the default value MUST NOT be TRUE,
because it's normal that a boolean value is missing in the post argument
if a checkbox is not selected.
Added also some style enhancements to IndexFederated, removed the Solr
attachment manual and replaced it with a link to the wiki which explains
this in more detail.
2013-07-31 10:49:26 +02:00
Michael Peter Christen
4c242f9af9 always use a default value for boolean options to have transparency for
the outcome if the attribute is missing in servlets
2013-07-25 12:17:29 +02:00
Roland Haeder
841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
to optimize memory usage

Conflicts:
	source/net/yacy/search/Switchboard.java
2013-07-17 18:31:30 +02:00
Michael Peter Christen
5878c1d599 - refactoring of log to ConcurrentLog:
jdk-based logger tend to block
at java.util.logging.Logger.log(Logger.java:476) in concurrent
environments. This makes logging a main performance issue. To overcome
this problem, this is a add-on to jdk logging to put log entries on a
concurrent message queue and log the messages one by one using a
separate process.
- FTPClient uses the concurrent logging instead of the log4j logger
2013-07-09 14:28:25 +02:00
Michael Peter Christen
203921006a redesign of citation index storage 2013-06-30 02:11:46 +02:00
Michael Peter Christen
570511f3c8 removed fields references_internal_id_sxt and
references_internal_url_sxt because they had been shown to be
superfluous. The citation of referrer in the host browser is possible
without them. Therefore now the host browser does not only show
internal, but also external referrer to each link.
2013-06-13 13:01:28 +02:00
Michael Peter Christen
f7e77a21bf Added a citation reference computation for intra-domain link structures.
While the values for the reference evaluation are computed, also a
backlink-structure can be discovered and written to the index as well.
The host browser has been extended to show such backlinks to each
presented links. The host browser therefore can now show an information
where an document is linked. The new citation reference is computed as
likelyhood for a random click path with recursive usage of previously
computed likelyhood. This process is repeated until the likelyhood
converges to a specific number. This number is then normalized to a
ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to
rank popularity within intra-domain link structures.
2013-06-07 13:20:57 +02:00
Michael Peter Christen
8dbc80da70 redesign of index.exist-test: this shall now not be done using a single
id to be tested, but with a collection of ids. This will cause only a
single call to solr instead of many. The result is a much better
performace when testing the existence of many urls. The effect should
cause very much less IO during index transmission, both on sender and
receiver side.
2013-05-17 13:59:37 +02:00
Michael Peter Christen
e26bdd4a52 fixes to deletion methods (removed unnecessary concurrency and added
removal of crawl queue entries)
2013-05-08 13:26:25 +02:00
Michael Peter Christen
f7f3e28c5e prevent that the size of the index is computed too many times.
Because the index size is now provided by solr, and the only way to do
that is a match for [* TO *], a size computation is quite complex and
time-consuming. Therefore this patch prevents that the method is called
at all and if necessary puts a DOS-preventing barrier in front of it.
2013-05-08 11:50:46 +02:00
Michael Peter Christen
cca19d94d4 re-declared some fields to be of type string rather than text which
makes them more efficient and less large
2013-05-06 16:45:54 +02:00
Michael Peter Christen
3841854c97 abstraction of catchall term 2013-05-04 00:14:22 +02:00
Michael Peter Christen
3502b4c697 refactoring (renaming) of yacy-solr api 2013-04-27 01:32:18 +02:00
Michael Peter Christen
579eb01a49 showing now the details of references count in host browser:
external (ext), internal (int) and external hosts (hosts) for each
indexed document.
2013-04-14 11:30:57 +02:00
reger
0f4237d8e5 add admin option to delete load errors from index 2013-04-14 05:33:01 +02:00
Michael Peter Christen
edc0b33f6d - showing references count and clickdepth in host browser
- fixed generation and presentation of both values
2013-04-11 14:46:13 +02:00
Michael Peter Christen
788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
The default schema uses only some of them and the resting search index
has now the following properties:
- webgraph size will have about 40 times as much entries as default
index
- the complete index size will increase and may be about the double size
of current amount
As testing showed, not much indexing performance is lost. The default
index will be smaller (moved fields out of it); thus searching
can be faster.
The new index will cause that some old parts in YaCy can be removed,
i.e. specialized webgraph data and the noload crawler. The new index
will make it possible to:
- search within link texts of linked but not indexed documents (about 20
times of document index in size!!)
- get a very detailed link graph
- enhance ranking using a complete link graph

To get the full access to the new index, the API to solr has now two
access points: one with attribute core=collection1 for the default
search index and core=webgraph to the new webgraph search index. This is
also avaiable for p2p operation but client access is not yet
implemented.
2013-02-22 15:45:15 +01:00
Michael Peter Christen
91a0401d59 introduced a second core named 'webgraph'. This core will hold the link
structure, but is not filled yet. To have the opportunity of a second
core, multi-core functionality had to be implemented to the
deep-embedded solr:
- migrated the solr_40 directory content to a subdirectory
'collection1'; the previously used default core is now called
collection1
- added solr_40/webgraph subdirectory as second core
- added a servlet configuration for the second core 'webgraph' in
/IndexSchema_p.html
- added instance handling as addition to solr connections: all solr
connectors are now instances of an solr 'instance' object; this required
a complete re-design of the solr embedding
- migrated also caching and sharding ontop of new instance handling
- migrated the search apis to handle now the access to a specific core,
the default core named 'collection1'
- migrated the remote solr search interface to access shards of cores;
for the yacy remote search the default core is now called 'solr'; using
the peer address as solr address
- migrated the solr backup and restore process: old backups cannot be
used after this migration!
- redesign of solr instance handling in all methods which access the
instances: they cannot hold copies of these instances any more; the must
retrieve the actuall connection object every time they want to write to
it (this solves also some bugs when switching the index/network)
- added another schema 'solr.webgraph.schema', the old solr.keys.list is
replaced by solr.collection.schema
2013-02-21 13:23:55 +01:00
Michael Peter Christen
4735bd47f4 - changed solr commit call and added an optimize option. Since Solr
4.0.0 there is a new softcommit feature which implements a
near-real-time (NRT) search option. The softcommit does not do IO and
does not cause performance issues.
YaCy has now an extension in its solr connectors to use the softcommit
feature. The softcommit call now replaces all places where a hard commit
was used. Furthermore the commit strategy in when doing a search from
the web interface was changed (it's done every time before a search is
done).

The softcommit feature was implemented because it was needed for the
following changes (customer demands), which is also included in this
git commit:

- added a feature to identify all documents which have unique titles
and/or unique descriptions. These unique flags are disabled by default.
- added also a feature to set a flag when the url from a canonical tag
is equal to the document url. This is also disabled by default.

To support the new softcommit strategy, the commitWithinMs option was
set to -1 do disable automatic commit based on document insert times. If
documents are inserted permanently then also a commit would happen
permanently whenever the commitWithinMs time is reached. This would
conflict with the regular autocommit of 10 minutes and the new
softcommit strategy.
2013-01-23 14:40:58 +01:00
Michael Peter Christen
6f0baaa309 added the clickdepth post-processing: some links may have 'shortcuts' to
already calculated click depths. There are then calculated if the crawl
buffer is empty and therefore no new 'shortcuts' can be discovered.
The status of the clickdepth stack (to-be-processed) can be seen using a
solr search command like this:
http://localhost:8090/solr/select?q=process_sxt:[*%20TO%20*]&start=0&rows=30&fl=sku,clickdepth_i,process_sxt
2013-01-04 16:37:39 +01:00
Michael Peter Christen
5c0c56cfe1 Preparations to produce a click depth attribute in the search index.
This attribute can be used for ranking and for other purpose (demand by
customer)
The click depth is computed in two steps:
- during indexing the current fill-state of the reverse link index is
used to backtrack the current page to the root page. The length of that
backtrack is the clickdepth. But this does not discover the shortest
click depth. To get this, a second process to check again is needed
- added a process tag that can be used to do operations on the existing
index after a crawl; i.e. calculation the shortest clickpath. Added a
field to control this operation but not a method to operate on this.
- added a visualization of the clickpath length in the host browser
2013-01-02 20:55:43 +01:00
reger
e9e0d63897 Add config option to show HostBrowser link in search result
- ConfigPortal: added checkbox Host Browser
- yacy.init: added search.result.show.hostbrowser as default = on (true)
- fix HostBrowser: broken link to protected WebStructurePicture for public user
2012-12-27 10:01:10 +01:00
Michael Peter Christen
10527e28ae fix for wrong display of error urls in HostBrowser 2012-12-07 00:31:10 +01:00
Michael Peter Christen
5f5d66921e patch for funny symbols in url paths (like tilde) 2012-12-05 22:05:49 +01:00
reger
bb20691d4f fix: respect config setting of "show Nav Top-Menu" in HostBrowser.html for public users (as hostbrowser is now available in search results) 2012-12-01 01:14:29 +01:00
Michael Peter Christen
bf42179982 introduced more structure in HostBrowser, table view, better counting,
distinguishing of error cases (fail/excluded)
2012-11-23 14:09:48 +01:00
Michael Peter Christen
f5ca5cea44 - added field options to all solr queries. This can be used to restrict
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during aquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
2012-11-19 17:24:34 +01:00
Michael Peter Christen
074dfd297b added icons and a selection for hosts with urls pending for crawler or
with errors
2012-11-09 16:24:56 +01:00
Michael Peter Christen
61995d508e do the commit anyway before calling a search interface 2012-11-07 17:27:50 +01:00
Michael Peter Christen
29fbbb49dc better colors for host browser and corrected document count 2012-11-07 12:23:21 +01:00
Michael Peter Christen
631b08e7e2 update to HostBrowser 2012-11-07 02:17:24 +01:00