Friday 29 October 2010

I'm a monitoring service, let me out of here

When we last covered the development of the new Nagios monitoring service in the blog - before last week's commercial break - we had just convinced it that all the hosts were alive and ready to be tested.

We can now proudly say that we have coaxed the service towards its first, official complete and utter failure. 

All those highly motivated people who tell you`failure is not an option' - ignore them. If you are running service that tests things, having a test fail means that you actually persuaded that test to run. It isn't failure, it is a different kind of success.

And it is not as easy as it sounds because the Nagios development server is, quite deliberately, kept isolated from the rest of the world. 

This is not a reference to the Harwell Science and Innovation Campus near Didcot: where the people from the STFC e-Science centre who run the service are based, and where the NGS Innovation Forum 2010 will be held.

It is simply that the Nagios development server has limited Internet access - as befits an experimental service. All access to the World Wide Web must be channeled through a web proxy. Privileged access to services is granted only when needed.

Neither the NCG configuration program or the various tests and probes that Nagios uses were written for an environment with a web proxy. Much of the code is written in Perl and support for proxies is already present -it just needed to be turned on. The Nagios developers at CERN have already accepted the changes for the next release.

With web access granted, NCG could build a complete configuration and the tests that suck information from web sites all began to run.

The next problem was getting permission to do things.

This is a grid. To use a grid, you need a certificate. WLCG Nagios has the wherewithall to download a certificate from a MyProxy Credential Management Service - as long as someone has uploaded it in the first place and there is no passphrase required.

The NGS provides a central MyProxy service and MyProxy allows certificates to uploaded so that they can be downloaded using another certificate as authentication. The command to do this isn't exactly short:


env GT_PROXY_MODE=old myproxy-init -s myproxy.ngs.ac.uk -l nagios_dev -x -Z '/C=UK/O=eScience/OU=CLRC/L=RAL/CN=nagios-dev.ngs.ac.uk/emailAddress=sct-certificates@stfc.ac.uk' -k nagios_dev-ngs -c 336

.. but it works.

Or at least it worked after we had added the certificate DN to the authorized_receivers and trusted_receivers entries in the myproxy server configuration file.

... and ensured that the ngs.ac.uk virtual organisation was defined on the Nagios server.

So at long last, the Nagios server could download a certificate, associate it with a virtual organisation and use it to submit jobs via a Workload Management Server.

Which was the point at which we realised that the Workload Management Service endpoint (https://ngswms01.ngs.ac.uk:7443/glite_wms_wmproxy_server) should have been defined in the glite_wms.conf and glite_wmsui.conf files in $GLITE_LOCATION/etc/ngs.ac.uk/.

With that final hurdle overcome, the test jobs started to flow.

The Compute Element tests were sent sites declaring themselves as Compute Elements - including the original NGS core sites at Leeds and RAL.

I admit to rigging it so that Leeds was tested first. The little status box went Green as the job was submitted, then Red as it failed with a friendly:

- Standard output does not contain useful data.Cannot read JobWrapper output, both from Condor and from Maradona.

Sometimes an Argentinian footballer and a large scavenging bird can make your day.



No comments: