Friday 3 December 2010

ICE and too much CREAM

If there is one area where the Grid community excels, it is in the creation of acronyms. The 600-odd entries on GridPP's Grid Acronym Soup page include a FIreMan, two kinds of GENIUS and a PanDA.

This post is brought to you by the acronyms ICE and CREAM - which are not yet in the Soup but are widely deployed by GridPP.

ICE is nothing to do with the white stuff covering most of the UK - it stands for Interface to CREAM Environment; CREAM is Computing Resource Execution And Management.

CREAM provides an alternative, web-service-y interface for submitting jobs to a compute cluster. ICE allows CREAM services to accept jobs from resource brokers such as the NGS's UI/WMS service.

The NGS deployment of WLCG Nagios is having problems swallowing ICE and CREAM.

Our plan is to replace the tests run from the existing INCA service with similar tests from WLCG Nagios. The INCA tests use credentials associated with the ngs.ac.uk Virtual Organisation when submitting jobs, so our Nagios instance is doing the same.

This contrasts with the GridPP Nagios deployment, which uses the CERN Ops VO when testing.

Ops membership is accepted anywhere that processes the CERN data. The ngs.ac.uk VO is accepted, at least in part, by all NGS member and affiliate sites. These include many GridPP sites as well as a number of sites that really aren't bothered by the Higgs Boson.

This is where it gets complicated.

When monitoring a whole region, WLCG Nagios does not submit tests directly to the sites. It passes them to a WMS resource broker where they queue until the site is ready. If the site takes too long to respond, the WMS is told to cancel the test.

Tests aimed at 'Classic' Compute Elements are running. The sites run the tests and, after some tweaks at STFC, we are now able to collect test results from the message bus.

Tests aimed at CREAM services are not running. Worse still, they get stuck in a strange state where cancellations are ignored. Under these circumstances, the CREAM testing bit of Nagios sends another cancellation request... and another... and another...

Eventually the cancel requests clog up the resource broker.
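
To picture the pile-up, here is a toy sketch in Python. It is not how Nagios or the WMS actually work: StubBroker, cancel_with_cap and the job id are all made up for illustration, and the retry cap is only there so that the sketch terminates. Nothing above says the real probe has one.

```python
from collections import deque


class StubBroker:
    """Hypothetical stand-in for a resource broker; not a real WMS API."""

    def __init__(self, honours_cancel):
        self.honours_cancel = honours_cancel
        self.pending = deque()      # cancel requests that were never acted on
        self.job_state = "queued"

    def cancel(self, job_id):
        if self.honours_cancel:
            self.job_state = "cancelled"
        else:
            # The request is ignored and simply sits in the broker.
            self.pending.append(job_id)


def cancel_with_cap(broker, job_id, max_attempts):
    """Keep sending cancel requests until the job goes away or the cap is hit."""
    attempts = 0
    while broker.job_state != "cancelled" and attempts < max_attempts:
        broker.cancel(job_id)
        attempts += 1
    return attempts


healthy = StubBroker(honours_cancel=True)
stuck = StubBroker(honours_cancel=False)

healthy_attempts = cancel_with_cap(healthy, "test-job", max_attempts=5)
stuck_attempts = cancel_with_cap(stuck, "test-job", max_attempts=5)
```

With a broker that honours cancellations, one request is enough. With one that ignores them, every retry leaves another request behind, and without a cap the backlog would grow for as long as the test kept trying.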

We are not yet sure why CREAM-based services and the NGS do not get along.

GridPP people who came to a recent NGS Surgery suggested that it might simply be the presence of an email address in our VO certificate's distinguished name. Comparing distinguished names is far more complicated than it appears and embedded email addresses, in particular, cause no end of hassle.
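
As a rough illustration of the kind of hassle involved: the same email attribute can be written under different labels (for example emailAddress, Email or E) depending on which tool printed the name, so a plain string comparison can fail for what is really the same certificate. The Python sketch below assumes OpenSSL-style one-line names and uses invented example DNs. It is not the comparison that CREAM or the WMS perform, and we have not confirmed that this is what is tripping up our tests.

```python
# Labels that different tools use for the same email attribute (assumed list).
EMAIL_ALIASES = {"emailaddress", "email", "e"}


def normalise_dn(dn):
    """Split an OpenSSL-style '/KEY=value/...' name into comparable pairs.

    Simplification: values that themselves contain '/' are not handled.
    """
    pairs = []
    for part in dn.strip().split("/"):
        if not part:
            continue                      # skip the empty piece before the first '/'
        key, _, value = part.partition("=")
        key = key.strip().lower()
        if key in EMAIL_ALIASES:
            key = "emailaddress"          # fold every alias onto one label
        pairs.append((key, value.strip()))
    return pairs


def same_dn(a, b):
    """Compare two names after normalising attribute labels."""
    return normalise_dn(a) == normalise_dn(b)


# Invented DNs for illustration only.
dn_a = "/C=UK/O=eScience/CN=example host/emailAddress=admin@example.org"
dn_b = "/C=UK/O=eScience/CN=example host/Email=admin@example.org"
dn_c = "/C=UK/O=eScience/CN=another host/Email=admin@example.org"

naive_match = dn_a == dn_b
normalised_match = same_dn(dn_a, dn_b)
different_host_match = same_dn(dn_a, dn_c)
```

Here the naive comparison of dn_a and dn_b fails, while the normalised one succeeds. A real implementation would also have to cope with escaping, attribute ordering and character encodings, which is where the complication really starts.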

We've turned off the WMS CREAM tests for now and replaced them with ones sent directly from the Nagios server.

After all, no-one wants a broken broker.

[Update: 8-Dec-2010. The Grid Acronym Soup now includes both ICE and CREAM. I suppose this turns it into a Gazpacho.]

1 comment:

Neasan said...

I have added ICE and CREAM to the Grid Acronym Soup, let me know if any other terms are missing.