Commit Graph

4797 Commits

Author SHA1 Message Date
Michael Peter Christen
b4b0d14c04 fix for display bug 2014-04-16 22:24:04 +02:00
Michael Peter Christen
9a5ab4e2c1 removed clickdepth_i field and related postprocessing. This information
is now available in the crawldepth_i field which is identical to
clickdepth_i because of a specific crawler strategy.
2014-04-16 22:16:20 +02:00
Michael Peter Christen
da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues:
This organizes all urls to be loaded in separate queues for each host.
Each host separates the crawl depth into its own queue. The primary
rule for urls taken from any queue is that the crawl depth is minimal.
This produces a crawl depth which is identical to the clickdepth.
Furthermore, the crawl is able to create a much better balancing over
all hosts which is fair to all hosts that are in the queue.
This process will create a very large number of files for wide crawls in
the QUEUES folder: for each host a directory, for each crawl depth a
file inside the directory. A crawl with maxdepth = 4 will be able to
create tens of thousands of files. To be able to use that many file readers, it
was necessary to implement a new index data structure which opens the
file only if an access is wanted (OnDemandOpenFileIndex). Using such
on-demand file readers should keep the number of open file
pointers below the system limit, which is usually about 10,000 open
files. Some parts of YaCy had to be adapted to handle the crawl depth
number correctly. The logging and the IndexCreateQueues servlet had to
be adapted to show the crawl queues differently, because the host name
is attached to the port on the host to differentiate between http,
https, and ftp services.
2014-04-16 21:34:28 +02:00
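The queue discipline described in this commit can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not YaCy's implementation: the class names `HostQueueSketch` and `HostBalancerSketch`, the round-robin tie-breaking, and the example host keys are all assumptions made for the sketch. The only rules taken from the commit message are that each host has its own queues, split by crawl depth, and that urls with minimal crawl depth are taken first.

```python
from collections import defaultdict, deque


class HostQueueSketch:
    """Hypothetical stand-in for one host's queues: one FIFO per crawl depth."""

    def __init__(self):
        self.by_depth = defaultdict(deque)

    def push(self, url, depth):
        self.by_depth[depth].append(url)

    def min_depth(self):
        # Smallest crawl depth that still has a pending url, or None if empty.
        depths = [d for d, q in self.by_depth.items() if q]
        return min(depths) if depths else None

    def pop(self):
        depth = self.min_depth()
        if depth is None:
            raise IndexError("host queue is empty")
        return self.by_depth[depth].popleft(), depth


class HostBalancerSketch:
    """Hypothetical balancer: minimal depth first, round-robin over hosts."""

    def __init__(self):
        self.hosts = {}       # host key ("name:port") -> HostQueueSketch
        self.turn = deque()   # rotation order of host keys (assumed policy)

    def push(self, host_key, url, depth):
        if host_key not in self.hosts:
            self.hosts[host_key] = HostQueueSketch()
            self.turn.append(host_key)
        self.hosts[host_key].push(url, depth)

    def pop(self):
        # Find the globally smallest pending crawl depth.
        pending = {h: q.min_depth() for h, q in self.hosts.items()}
        depths = [d for d in pending.values() if d is not None]
        if not depths:
            return None
        lowest = min(depths)
        # Take the next host in rotation that has a url at that depth.
        for _ in range(len(self.turn)):
            host = self.turn[0]
            self.turn.rotate(-1)  # move this host to the back of the rotation
            if pending[host] == lowest:
                url, depth = self.hosts[host].pop()
                return host, url, depth
        return None


balancer = HostBalancerSketch()
balancer.push("a.example:80", "http://a.example/", 0)
balancer.push("a.example:80", "http://a.example/x", 1)
balancer.push("a.example:80", "http://a.example/y", 1)
balancer.push("b.example:443", "https://b.example/", 0)
balancer.push("b.example:443", "https://b.example/z", 1)
order = [balancer.pop() for _ in range(5)]
```

In this sketch both depth-0 urls are handed out before any depth-1 url, and hosts alternate within a depth level. The on-disk layout (one directory per host, one file per depth) and the on-demand file opening mentioned in the commit are not modelled here.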
Michael Peter Christen
dd12dd392f introduction of a data structure for HyperlinkEdges which should use
less memory as it does no double-storage of source links for each edge
of the graph.
2014-04-11 12:09:33 +02:00
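One way to avoid storing the source link once per edge is to key the edges by source and keep only the targets per entry. The following Python sketch illustrates that idea under stated assumptions; the class name `EdgeStoreSketch` and its methods are hypothetical and do not reflect the actual HyperlinkEdges class.

```python
class EdgeStoreSketch:
    """Hypothetical edge store: each source url is stored once, with its targets."""

    def __init__(self):
        self._targets = {}  # source url -> list of target urls

    def add(self, source, target):
        # The source string is kept only as a dictionary key, not per edge.
        self._targets.setdefault(source, []).append(target)

    def edges(self):
        # Reconstruct (source, target) pairs on demand.
        for source, targets in self._targets.items():
            for target in targets:
                yield source, target

    def source_count(self):
        return len(self._targets)


store = EdgeStoreSketch()
store.add("http://a.example/", "http://a.example/x")
store.add("http://a.example/", "http://a.example/y")
store.add("http://a.example/", "http://b.example/")
pairs = list(store.edges())
```

Here three edges share a single stored source entry, while iteration still yields the full list of pairs.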
Michael Peter Christen
a37d067692 refactoring 2014-04-10 23:46:35 +02:00
orbiter
95780eed32 Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-10 21:40:54 +02:00
Michael Peter Christen
6bd8c6f195 fix for wrong status codes of error pages 2014-04-10 09:08:59 +02:00
Michael Peter Christen
9e503b3376 also delete the robots.txt file from the cache when a new crawl is
started
2014-04-09 21:59:54 +02:00
orbiter
67501c9dda Merge branch 'master' of git@gitorious.org:yacy/rc1.git 2014-04-09 19:58:54 +02:00
Michael Peter Christen
1c21b3256d fix for robots.txt handling: delete old entry before starting a new
crawl.
2014-04-09 18:33:48 +02:00
orbiter
c250fac9f4 linkstructure refactoring to get more options for clickdepth analysis 2014-04-09 17:52:51 +02:00
Michael Peter Christen
bd886054cb new structure and enhancements for link graph computation:
- added order option to solr queries to be able to retrieve document
lists in specific order, here: link length
- added HyperlinkEdge class which manages the link structure
- integrated the HyperlinkEdge class into clickdepth computation
- extended the linkstructure.json servlet to show also the clickdepth
and other statistic information
2014-04-09 12:45:04 +02:00
Michael Peter Christen
c8d4a63604 eliminating the word 'Facet' from the interface because it is ugly. If
people do not know what search navigation is, then they also do not know
what a 'facet' is.
2014-04-04 15:25:37 +02:00
Michael Peter Christen
e8ddd415a8 enhanced the new link structure graph 2014-04-04 14:43:54 +02:00
Michael Peter Christen
8443255e18 better link structure limit calibration 2014-04-04 12:48:55 +02:00
Michael Peter Christen
7f5733638b fix for linkstructure computation: now also detecting dead links 2014-04-04 12:47:29 +02:00
orbiter
18f9c40302 moved Edge class out of linkstructure servlet as this does not work on
non-eclipse driven environments (all non-dev cases)
2014-04-04 10:54:11 +02:00
Michael Peter Christen
a6bb9be97e - added d3.js for visualizations using embedded svg
- added a servlet api/linkstructure.json which generates link graph
information in json
- added a javascript link graph renderer hypertree.js using d3 and the
new servlet linkstructure.json
- embedded the new link graph in the crawler monitor and the host
browser
2014-04-03 14:51:19 +02:00
Michael Peter Christen
c64c10ef00 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-04-03 01:58:06 +02:00
Michael Peter Christen
48fbfa60c1 bugfix to inbound/outbound identification 2014-04-03 01:21:43 +02:00
reger
227c42bc96 eliminate obsolete URIMetaDataRow class
by joining it with/into URIMetaDataNode.
2014-04-03 00:35:15 +02:00
Michael Peter Christen
cca851a417 introduced new solr field crawldepth_i which records the crawl depth of
a document. This is the upper limit for the clickdepth_i value which may
be shorter in case that the crawler did not take the shortest path to
the document.
2014-04-02 23:37:01 +02:00
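The relation between the two values can be illustrated with a small link graph. In this Python sketch, the click depth is computed as the shortest link distance from the start page (breadth-first search), while the crawl depth is what a crawler records on first visit. The depth-first crawler, the function names, and the example paths are assumptions chosen to show a case where the two values differ; they are not YaCy's crawler.

```python
from collections import deque


def click_depth(links, root):
    """Shortest number of clicks from root to each reachable page (BFS)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def crawl_depth_depth_first(links, root):
    """Depth recorded at first visit by a hypothetical depth-first crawler."""
    depth = {}
    stack = [(root, 0)]
    while stack:
        page, d = stack.pop()
        if page in depth:
            continue  # already loaded, keep the depth from the first visit
        depth[page] = d
        # Push in reverse so links are followed in their listed order.
        for target in reversed(links.get(page, [])):
            stack.append((target, d + 1))
    return depth


# "/c" is linked directly from "/" but is also reachable via "/a" -> "/b".
links = {"/": ["/a", "/c"], "/a": ["/b"], "/b": ["/c"]}
click = click_depth(links, "/")
crawl = crawl_depth_depth_first(links, "/")
```

In this example the depth-first crawler reaches "/c" through the longer path and records a crawl depth of 3, while its click depth is 1. For every page the crawl depth is at least the click depth, which is the upper-limit property the commit message describes.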
Michael Peter Christen
d321b0314e added missing servlet html 2014-04-02 17:37:21 +02:00
orbiter
b1ba764d81 fix for first start options and added german translation for popup texts 2014-04-02 17:10:59 +02:00
orbiter
043d274af5 fixed crawl start path for cloned crawls 2014-04-02 16:06:29 +02:00
Michael Peter Christen
1b9ec9a1c5 - added popover to p2p/stealth mode button to explain the peer mode and
privacy issues.
- added popover to first-time use case to explain that specific servlets
are only visible after customization and/or crawl starts
2014-04-02 13:33:43 +02:00
Michael Peter Christen
8d35fcb1c7 transition.js is also included in bootstrap.js 2014-04-02 12:19:26 +02:00
Michael Peter Christen
3abc3c4c4c removed alert.js, modal.js and tooltip.js as these libraries are all
included in bootstrap.js
2014-04-02 12:18:33 +02:00
Michael Peter Christen
898f78258e fix for naming bug 2014-04-02 04:06:35 +02:00
Michael Peter Christen
39b641d6cd added tutorial mode - some menu items will only appear if you 'qualify'
for them. Thus, the first-time user will only see four menu items. The
other items will unfold as the user interacts.
2014-04-02 02:33:17 +02:00
Michael Peter Christen
7a49f72480 fix for crawler column width 2014-04-02 01:16:34 +02:00
Michael Peter Christen
46a1a15441 added more bootstrap libraries 2014-04-01 17:18:26 +02:00
Michael Peter Christen
5ccbfeb803 show host list by default in host browser 2014-04-01 16:55:22 +02:00
Michael Peter Christen
ba0e3fb0dc fixed crawl start links after renaming them in latest commit 2014-04-01 00:35:58 +02:00
orbiter
d29b6db270 made crawl start pages public since they do not reveal individual
information and they are also not used as servlet to actually start the
crawl (which is Crawler_p.html).
2014-03-31 20:42:39 +02:00
Michael Peter Christen
e41db47cac added (again) underline to a tags 2014-03-31 18:25:11 +02:00
Michael Peter Christen
ff82a80eb3 Integrated HostBrowser back to administration interface; it can appear
with and without navigation bar.
2014-03-31 18:19:24 +02:00
Michael Peter Christen
94366ba2e5 added template for latest commit 2014-03-31 16:00:13 +02:00
Michael Peter Christen
701df02ead Complete redesign of administration top-level menu. This follows two
principles:
- provide an easy tutorial-like "what should I do first" menu
- provide all elements which are subject to most first questions to YaCy
exhibition people on top level: Resource limitation, Parser and Ranking
settings
I apologize to everyone who is used to the old style and needs to find
the menu items (again) after this change. I hope that this will make the
interface more usable for new users who see a web indexer/crawler the
first time.
2014-03-31 15:47:58 +02:00
Michael Peter Christen
a3b7366aee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-03-31 03:21:02 +02:00
Michael Peter Christen
6b66bb7109 redesign of search page integration menu structure 2014-03-31 03:18:38 +02:00
reger
92811d7850 fix: 3 more links pointing to old /xml path 2014-03-31 02:58:43 +02:00
reger
c183d66d40 fix: blacklist xml export path to xml template 2014-03-31 02:48:28 +02:00
Michael Peter Christen
656e2ce62a replacing direct html table cellspacing with css set-up for cellspacing 2014-03-31 01:15:35 +02:00
reger
e11504309f adding a hint to javascript browser short cut on Url-Proxy page (AugmentedBrowsing_p.html) 2014-03-30 05:11:42 +02:00
reger
7f29eee9ac fix: cut-off button in WatchWebStructure_p.html
(by header css dd height/line-height)
2014-03-30 00:23:54 +01:00
reger
2953ebe701 fix: port in local target address
& button style
2014-03-29 00:34:01 +01:00
Michael Peter Christen
fda591695c fixed visibility of custom icon 2014-03-28 17:25:39 +01:00
Michael Peter Christen
a9b9950d7f Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git 2014-03-28 14:48:32 +01:00
Michael Peter Christen
bd54b85d46 fix for relative sitemap urls 2014-03-28 14:44:52 +01:00