Wednesday 20 July 2011

On the 97th of April 2011...

Back in February, an over-optimistic fool promised that the NGS would have a working Nagios service in the next few weeks.

The over-optimistic fool was confident because he had a real deadline to meet. Nagios had to be ready by April. April was the month during which the old NGS core sites - which ran the tests for our old INCA-based testing framework - were to be decommissioned.

We are running a little late... but I am pleased to say that 2 weeks ago - on Wednesday the 97th of April 2011 - the NGS's Nagios testing service finally went live.

If you have a certificate and it is listed in the Grid Operations Centre database, you can pay it a visit at https://nagios01.ngs.ac.uk/nagios.

If you haven't or aren't - sorry: WLCG Nagios, unlike INCA, denies access to unregistered users by default. We may be able to remove the restriction in future - but, for the moment, we want to focus on fixing the problems it has found.

It is a bit untidy - as we have been without a fully working monitoring service for over 6 months.

While we kept the INCA service running as long as possible, it had become increasingly out of step due to a decision - very early on - to use the 'NeSCForge' software repository as a safe place to keep its configuration.

NeSCForge was not as safe as we had hoped. It vanished in December last year. The list of sites and tests to run remained frozen in their December state... and the Grid moved on.

We have different partner sites offering different services now. INCA wasn't testing them; Nagios is.

More significantly, Nagios takes its list of sites directly from the Grid Operations Centre database. Changes made there should be reflected in Nagios within a day.

My colleagues in the NGS Partnership team are working their way through the Nagios test results. They are identifying problems, finding missing sites and services - and, most importantly, working out how to make things better.
