Saturday 11 September 2010

The Featherstone-Kite Openwork Basketweave Mark Two Gentleman’s Flying Machine

In a shopping centre in the middle of Leeds, not far from the Universities, there is - or was - a big glass case.

In the case is a contraption built from wicker-work, string and cogs and lights and bits of old gramophone. Every so often, it springs to life and whirls around and plays a tune.

It is not a Yorkshire-based competitor for the iPod but a sculpture by Rowland Emett: "The Featherstone-Kite Openwork Basketweave Mark Two Gentleman’s Flying Machine". Leeds shoppers passing by look at it and think...
What on earth is THAT meant to do?
Which is a rather tortuous way of introducing the latest bit of R+D work.

We are investigating how to move the important features of the existing INCA monitoring service to a new monitoring service based on WLCG Nagios.
But - as has been said many times - grids are complicated. Which means that the software needed to monitor grids is complicated. Which means that when you start to look at the software, you spend a lot of time staring at a screen and thinking...
What on earth is THAT meant to do?

So after a week of staring and thinking, here is what we think the bits and pieces of the service are meant to do:

At the core sits Nagios: an open-source monitoring system familiar to many system administrators. It consists of a set of programs called 'plugins' and a scheduler that arranges for these plugins to be run.

A plugin tests if a particular service on a given host is working as expected. Plugins typically return a short message and a status code that means one of: 'OK', 'WARNING', 'CRITICAL' or - if the plugin broke - 'UNKNOWN'. They can also track performance data such as disk usage.
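To make that a little more concrete, here is a minimal sketch of what a plugin can look like. It is not one of the WLCG probes: the disk check, the function names and the 80%/90% thresholds are all made up for illustration. What it does follow is the usual Nagios plugin convention of exit codes 0 to 3 for OK, WARNING, CRITICAL and UNKNOWN, with performance data placed after a '|' in the output line.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(percent, warn, crit):
    """Map a usage percentage onto a Nagios status code."""
    if percent >= crit:
        return CRITICAL
    if percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status_code, message) for disk usage on `path`.

    The thresholds are illustrative assumptions, not WLCG defaults.
    """
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        # The plugin itself could not do its job: report UNKNOWN.
        return UNKNOWN, f"DISK UNKNOWN - {err}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    percent = 100.0 * usage.used / usage.total
    status = classify(percent, warn, crit)
    # Text after the '|' is performance data: label=value[unit];warn;crit
    message = (
        f"DISK {STATUS_NAMES[status]} - {percent:.1f}% used on {path}"
        f" | usage={percent:.1f}%;{warn};{crit}"
    )
    return status, message


def main():
    """Print the one-line message and return the status code."""
    code, text = check_disk()
    print(text)
    return code

# A real plugin would finish with: sys.exit(main())
```

The scheduler only ever sees two things from a plugin: the line it printed and the code it exited with. Everything else is up to whoever wrote the check.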

Nagios comes with a set of basic plugins. WLCG Nagios adds a whole raft of Grid-specific ones.
In the WLCG documentation, plugins are referred to as 'probes'.

Next up, a 'configuration generator' called NCG takes data published about a site or set of sites and generates a configuration for Nagios that monitors them.
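The idea of a configuration generator can be sketched in a few lines. This is emphatically not how NCG is implemented: the site description, the host names, the check command names and the 'generic-host' and 'generic-service' templates below are all assumptions for illustration. The only real thing here is the shape of the output, which follows the standard Nagios 'define host' and 'define service' object syntax.

```python
# Hypothetical site description. NCG gathers this kind of data from
# published grid information, not from a literal like this one.
SITES = {
    "EXAMPLE-SITE": [
        {"host": "ce01.example.org", "services": ["CE"]},
        {"host": "se01.example.org", "services": ["SRM"]},
    ],
}

# Hypothetical mapping from a service type to the checks to run against it.
CHECKS = {
    "CE": ["check_job_submit"],
    "SRM": ["check_srm_put", "check_srm_get"],
}


def render(sites, checks):
    """Turn a site description into Nagios object definitions."""
    blocks = []
    for site, hosts in sorted(sites.items()):
        for entry in hosts:
            blocks.append(
                "define host {\n"
                f"    host_name            {entry['host']}\n"
                f"    alias                {site}\n"
                f"    address              {entry['host']}\n"
                "    use                  generic-host\n"
                "}\n"
            )
            for service_type in entry["services"]:
                for command in checks.get(service_type, []):
                    blocks.append(
                        "define service {\n"
                        f"    host_name            {entry['host']}\n"
                        f"    service_description  {service_type}-{command}\n"
                        f"    check_command        {command}\n"
                        "    use                  generic-service\n"
                        "}\n"
                    )
    return "\n".join(blocks)


config = render(SITES, CHECKS)
print(config)
```

The point of generating the configuration rather than writing it by hand is that when a site adds or retires a service, the monitoring follows the published data without anyone editing files.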

Statistics and performance metrics generated by the plugins/probes are collected and delivered via a message bus to a service that stuffs them into a database. A tool called MyEGEE is used to visualise the contents of this database.
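As a rough picture of that pipeline, the sketch below uses Python's in-process queue as a stand-in for the message bus and an in-memory SQLite database as a stand-in for the results store. The real service uses neither, and the message fields and table layout here are invented for illustration.

```python
import json
import queue
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical results table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE results (host TEXT, metric TEXT, status TEXT, detail TEXT)"
    )
    return db


def publish(bus, host, metric, status, detail):
    """Producer side: serialise one probe result and put it on the bus."""
    bus.put(json.dumps(
        {"host": host, "metric": metric, "status": status, "detail": detail}
    ))


def drain(bus, db):
    """Consumer side: move every waiting message into the database."""
    stored = 0
    while True:
        try:
            raw = bus.get_nowait()
        except queue.Empty:
            break
        record = json.loads(raw)
        db.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?)",
            (record["host"], record["metric"], record["status"], record["detail"]),
        )
        stored += 1
    db.commit()
    return stored


bus = queue.Queue()  # stand-in for the message bus
db = make_db()       # stand-in for the results database
publish(bus, "ce01.example.org", "job-submit", "OK", "job completed")
publish(bus, "se01.example.org", "srm-put", "CRITICAL", "transfer timed out")
stored = drain(bus, db)
```

A visualisation tool then only needs to query the database; it never has to talk to Nagios or to the probes directly.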

If you want to know more...

Staff from STFC and Oxford gave an NGS surgery on WLCG Nagios in late July this year. Their slides describing how WLCG Nagios can be configured and how it has been deployed can be found on the NGS web site.

There is more technical information on twiki.cern.ch in the GridMonitoringNcgOverview and GridMonitoringNcgYaim pages. More information about the plugins/probes can be found on SAMProbesMetrics.

If you want to know more about the NGS R+D activity, we will be on hand at All Hands next week.

1 comment:

James said...

Hi Jason,

Good to hear you're making progress with Sam/Nagios. Feel free to ask us if you have any questions on the internals.

We'd be happy to give you some support.

Cheers,

James Casey - Sam/nagios developer.