Sunday 19 December 2010

Three months of basketweaving with Elvis and Maradona.

A little over three months ago, we started a project to replace NGS's INCA monitoring service with WLCG Nagios.

This was never going to be easy. There are partner sites in the NGS that provide services which are almost - but not entirely - completely unlike those expected by the Worldwide LHC Computing Grid.

It has been a long, and sometimes tedious, process - documented in long, and no doubt equally tedious, posts on the NGS blog.

We would never have got this far without the WLCG Nagios developers. They have offered advice, produced helpful documentation, fixed quirks and, above all, written code that may be complicated but remains readable and comprehensible.

Over the three months, we have learned how to persuade Nagios to run the tests we need; we have learned how to get a message bus through a firewall, even when said firewall denies that the machine being tested is alive; and we've only broken something important once - when we accidentally clogged up the WMS service with CREAM.

This week, for the very first time, all the bits of the service worked together. We let Nagios run its tests and saw (some of) the results in the MyEGEE and MyEGI 'portals'.

So what took us so long....? Well, we'd accidentally crashed a bus into the database.

The WLCG Nagios software is based around a message bus. Any time anything interesting happens, a message is pushed onto the bus. This relies on a command assigned as a Nagios event handler and the slightly-disturbingly-named 'obsessive compulsive' option that ensures this command is run whenever an interesting test result arrives.
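
(For the curious, the 'obsessive compulsive' option is Nagios's obsess_over_services setting, and the command it runs is the one named by ocsp_command in nagios.cfg.) In spirit, that handler does something like the sketch below - and it is only a sketch, with invented paths and message fields rather than the real WLCG handler script - taking the check result Nagios hands it and dropping it, as a little message file, into an outgoing directory for the bus client to ship off the machine.

#!/usr/bin/env python
# Sketch of an 'obsessive compulsive' (OCSP) handler: Nagios runs it after
# every service check, and it wraps the result in a message file dropped
# into an outgoing spool for the bus client to pick up.
# The spool path and message field names are invented for illustration.
import os
import sys
import time

OUTGOING = "/var/spool/msg-outgoing"  # assumed outgoing spool directory

def main():
    # Nagios passes these via macros on the ocsp_command command line
    host, service, status, detail = sys.argv[1:5]
    stamp = int(time.time())
    body = ("hostName: %s\nmetricName: %s\nmetricStatus: %s\n"
            "details: %s\ntimestamp: %d\n"
            % (host, service, status, detail, stamp))
    tmp = os.path.join(OUTGOING, ".result-%s-%d.tmp" % (host, stamp))
    final = os.path.join(OUTGOING, "result-%s-%d" % (host, stamp))
    open(tmp, "w").write(body)
    os.rename(tmp, final)  # rename so the bus never sees a half-written file

if __name__ == "__main__":
    main()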

At the same time, a dedicated bus spotter, called msg-to-handler, watches for incoming messages from the bus and stores them in directories on the local disk. Special Nagios plugins check the directories, react to incoming messages and possibly create new messages in the process.
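
The consuming side is the mirror image. A plugin in this mould - again only a sketch, with an invented directory name, not the real consumer code - walks its incoming dropbox, deals with whatever msg-to-handler has left there, and reports back using the usual Nagios exit codes.

#!/usr/bin/env python
# Sketch of a dropbox-reading Nagios plugin: handle whatever messages
# msg-to-handler has stored in the incoming directory, then report a
# standard Nagios status. The directory name is invented for illustration.
import os
import sys

INCOMING = "/var/spool/msg-incoming/results"  # assumed dropbox directory
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # standard Nagios exit codes

def handle_message(text):
    # In the real service this is where a message becomes a database
    # update, or a fresh message on the bus; here it is a placeholder.
    pass

def main():
    try:
        names = sorted(os.listdir(INCOMING))
    except OSError as exc:
        print("UNKNOWN: cannot read %s (%s)" % (INCOMING, exc))
        sys.exit(UNKNOWN)
    for name in names:
        path = os.path.join(INCOMING, name)
        handle_message(open(path).read())
        os.remove(path)  # clear the message once it has been dealt with
    print("OK: processed %d message(s)" % len(names))
    sys.exit(OK)

if __name__ == "__main__":
    main()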

The MyEGEE and MyEGI portals are, crudely speaking, pretty views of a complicated MySQL database. They rely on plugins run periodically by Nagios to update the database with details of tests and results.

The messages were arriving. The plugins were running. The database was not being updated.
There were many reasons.

In part, we were simply behind the times....

As WLCG Nagios has developed, the database schema has changed. Earlier versions stuffed different categories of information into separate databases - called things like 'atp', 'metricstore' and 'mddb' - associated with different users and passwords. In newer ones, all the information is kept in one database called 'mrs'.

Some out-of-date entries in the configuration files for the YAIM configuration tool meant we had components using old-style database names instead of the all-conquering mrs.

Fixing the database names brought us to the point where test results were being processed.
The bad news was that the results were being rejected during processing.

Yet, it is to the developers' credit that WLCG Nagios handles rejection well. Duff data is dumped in special SQL tables - with names ending in 'rejected' - each with a reason column explaining what went horribly wrong.
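
Should you find yourself in the same hole, the reasons are only a query away. Something along these lines will show what the loader objected to - though the table name below is a stand-in (the real ones simply end in 'rejected') and the credentials are whatever YAIM set up for your own mrs database.

#!/usr/bin/env python
# Peek at why results are being rejected. The table name is a placeholder
# (look for tables whose names end in 'rejected'), and the connection
# details are whatever your own mrs database was configured with.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="mrs_user",
                       passwd="changeme", db="mrs")
cursor = conn.cursor()
cursor.execute("SELECT reason, COUNT(*) FROM metricdata_rejected "
               "GROUP BY reason ORDER BY COUNT(*) DESC")
for reason, count in cursor.fetchall():
    print("%6d  %s" % (count, reason))
conn.close()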

In our case, it was because we had information on test results but no information on tests.
We were missing the data from one vital message - one generated by the NCG configuration generator to announce the safe arrival of a new configuration.

To get the message, we needed to add

<ncg::configpublish>
  <configcache>
    NAGIOS_ROLE=ROC
    VO=ngs.ac.uk
  </configcache>
</ncg::configpublish>

to the ncg.conf configuration file and ensure that the /usr/sbin/mrs-load-services script was run.

And - for some tests, under some circumstances, for certain sites, with a following wind, on a good day - the results appeared.
