Friday, 11 February 2011

Missing the message bus

[With thanks to Konstantin Skaburskas.]

Two weeks ago, we were very nearly at the point where we could deploy WLCG Nagios and phase out our existing testing service.

We had created our own tests and worked out how to add them to the bundle of tests that are sent out onto the Grid.

The tests were actually being run on remote sites.

All that was missing was - well - a big chunk of the test results.

When the tests landed on gLite-based sites - everything worked as expected. In other places - the tests ran... but resolutely refused to let anyone know the results.

We have now found the missing messages - after losing them twice on the way.

It was all due to subtle differences in the environment variables defined at a site. The fix is to set two environment variables by adding something like...

Environment = {"LCG_GFAL_INFOSYS=<info-service-host:port>", "OSG_HOSTNAME=<ce-hostname>"};
to the template used to generate the JDL file that describes the test.

To understand why, you need to understand how the tests are run on the remote hosts. The hard work is done by a script that:
  • unpacks the bundle of tests and configures them for the local machine;
  • runs them using a bundled copy of Nagios;
  • translates the test results into messages;
  • sends the messages to a message broker - which shoves them on the message bus back to the Nagios server.
At the Nagios server end, each message is unpacked and fed to the central Nagios as a passive test result.
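The translation step above can be sketched roughly like this. This is a minimal illustration, not the real WLCG code; the field names and the `result_to_message` helper are our own invention, standing in for whatever message schema the bus actually uses:

```python
# Sketch of the result-forwarding step: wrap one Nagios plugin result
# as a message ready for the broker. Field names are illustrative,
# not the actual WLCG message schema.

def result_to_message(service, status, output, ce_name):
    """Wrap one Nagios test result as a dictionary for the message bus."""
    return {
        "service": service,   # which test produced the result
        "status": status,     # OK / WARNING / CRITICAL / UNKNOWN
        "output": output,     # the plugin's one-line summary
        "host": ce_name,      # the CE the job actually ran on
    }

msg = result_to_message("org.example.CE-probe", "OK", "all fine", "ce01.example.org")
print(msg["host"])  # prints ce01.example.org
```

The `host` field matters later in the story: if it is wrong, the server end cannot match the result to a machine it knows about.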

If LCG_GFAL_INFOSYS is missing, the tests never make it to the message broker; if OSG_HOSTNAME is missing - they are ignored when they reach the Nagios server.
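Given those two failure modes, a submission script could guard against them by checking both variables before the job is ever sent out. A small sketch (the variable names come from the post; the check itself, and the example values, are ours):

```python
import os

# The two environment variables whose absence lost our messages.
REQUIRED = ["LCG_GFAL_INFOSYS", "OSG_HOSTNAME"]

def missing_variables(env=os.environ):
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Example: an environment that lacks OSG_HOSTNAME.
print(missing_variables({"LCG_GFAL_INFOSYS": "bdii.example.org:2170"}))
# prints ['OSG_HOSTNAME']
```

Failing fast at submission time is far cheaper than hunting for silently vanished results afterwards.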

It is the WLCG Nagios for a reason - it was designed to test machines that sat within the Worldwide LHC Computing Grid. One of its roles was to serve as a replacement for the older 'Service Availability Monitoring' (SAM) tests.

It makes the - perfectly logical - assumption that the environment on the machine running the tests will be like that used for the SAM tests.

WLCG has a dedicated network of message brokers. Any host can find a suitable broker to contact by asking its friendly local information service. The environment variable LCG_GFAL_INFOSYS points to the information service.

Some of the sites we are testing sit outside WLCG. We have our own message broker and pass additional information with the tests to direct messages to it.

A subtle bug, which has been fixed in the current release, meant that even though LCG_GFAL_INFOSYS was not being used, it still had to be set. If it wasn't, the script could not find a message broker to contact.

So, the messages were now making it back to the Nagios server - but the Nagios server was ignoring them.

The reason: the messages sent back are meant to include a reference to the Compute Element (CE) that actually accepted the job. The messages we were sending were all being marked as coming from 'localhost.localdomain' - a dummy name used internally by the Nagios tests.

The script tries to work out the CE name from the local environment and from the output of certain scripts. If all else fails, it assumes the Nagios server knows the answer.

The WLCG Nagios developers had encountered this problem before - when running ATLAS tests against hosts on the US Open Science Grid - and had added code that allows an Open Science Grid hostname to be used as a CE name. It expects the environment variable OSG_HOSTNAME to hold that hostname.
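The fallback chain can be sketched like this. Only OSG_HOSTNAME and the 'localhost.localdomain' dummy come from the post; the post's 'local environment and certain scripts' steps are elided here, so treat this as an outline rather than the actual WLCG code:

```python
def guess_ce_name(env):
    """Work out the CE name, falling back step by step.

    The real script also consults other environment variables and
    the output of site-specific scripts before giving up; those
    steps are omitted from this sketch.
    """
    # 1. An explicitly provided hostname - the Open Science Grid case.
    if env.get("OSG_HOSTNAME"):
        return env["OSG_HOSTNAME"]
    # 2. All else has failed: leave the dummy name and hope the
    #    Nagios server can work out the real CE.
    return "localhost.localdomain"

print(guess_ce_name({"OSG_HOSTNAME": "ce.osg.example.org"}))  # prints ce.osg.example.org
print(guess_ce_name({}))  # prints localhost.localdomain
```

On our non-WLCG sites, nothing earlier in the chain succeeded, so every message arrived wearing the dummy name - which is exactly why setting OSG_HOSTNAME fixed it.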

We can also report that the WMS administrators have reconfigured the server so it no longer gets clogged up with CREAM jobs - and the CREAM CE tests are now running via the WMS - as WLCG intended.

We are now ready to deploy WLCG Nagios - unfortunately without the friendly MyEGI front end - and make it available to site administrators.

We will describe how we decide which sites to test and what tests to run in a future posting.

At which point, Nagios related Research and Development will take a break and I will have to find something else to prattle about every couple of weeks.
