Tuesday, 21 December 2010

Looking for some reading material?

If you are looking for some reading material in the run up to Christmas then have a look at the latest edition of NGS News, the quarterly newsletter from the NGS.

This quarters edition contains features on -
  • Using the NGS Cloud Protoype in teaching
  • Running Taverna Workflows on the NGS
  • Research Communities on the NGS Cloud Prototype
  • Quantitative genetic analysis on the NGS
  • ... and much more!
To download your copy of this quarters newsletter go to the dedicated NGS News webpage where you can download copies of all past NGS newsletters.

Sunday, 19 December 2010

Three months of basketweaving with Elvis and Maradona.

A little over 3 months ago, we started a project to replace NGS's INCA monitoring service with WLCG Nagios.

This was never going to be easy. There are partner sites in the NGS provide that services that are almost - but not entirely - completely unlike those expected by the Worldwide LHC Computing Grid.

It has been a long, and sometimes tedious, process - documented in long, and no doubt equally tedious, posts on the NGS blog.

We would never have got this far without the WLCG Nagios developers. They have offered advice and produce helpful documentation, fixed quirks and, above all, written code that may be complicated but remains readable and comprehensible.

Over the three months, we have learned how to persuade Nagios to run the tests we need; we have learned how to get a message bus through a firewall, even when said firewall denies that the machine being tested is alive; and we've only broken something important once - when we accidentally clogged up the WMS service with CREAM.

This week, for the very first time, all the bits of the service worked together. We let Nagios run its tests and saw (some of) the results in the MyEGEE and MyEGI 'portals'.

So what took us so long....? Well, we'd accidentally crashed a bus into the database.

The WLCG Nagios software is based around a message bus. Any time anything interesting happens, a message is pushed onto the bus. This relies on a command assigned as a Nagios event handler and the slightly-disturbingly-named 'obsessive compulsive' option that ensures this command is run whenever an interesting test result arrives.

At the same time, a dedicated bus spotter, called msg-to-handler, watches for incoming messages from the bus and stores them in directories on the local disk. Special Nagios plugins checks the directories, react to incoming messages and possibly creating new messages in the process.

The MyEGEE and MyEGI portals are, crudely speaking, pretty views of a complicated MySQL database. They rely on plugins run periodically by Nagios to update the database with details of tests and results.

The messages were arriving. The plugins were running. The database was not being updated.
There were many reasons.

In part, we were simply behind the times....

As WLCG Nagios has developed, the database schema has changed. Earlier versions stuffed different categories of information into separate databases - called things like 'atp', 'metricstore' and 'mddb' - associated with different users and passwords. In newer ones, all the information is kept in one database called 'mrs'.

Some out of date entries in the configuration files for the YAIM configuration tool meant we had components using old style database names instead of the all-conquering mrs.

Fixing the databases names brought us to the point where test results were being processed.
The bad news was that the results they were being rejected in processing.

Yet, it is to the developers credit that WLCG Nagios handles rejection well. Duff data is dumped in special SQL tables - with names ending in 'rejected' - with a reason column explaining what went horribly wrong.

In our case, it was because we had information on test results but no information on tests.
We were missing the data from one vital message - one generated by the NCG configuration generator to announce the safe arrival of a new configuration.

To get the message, we needed to add
<ncg::configpublish>
<configcache>
NAGIOS_ROLE=ROC
VO=ngs.ac.uk
</configcache>
</ncg>
to the ncg.conf configuration file and ensure that the /usr/sbin/mrs-load-services script was run.

And - for some tests, under some circumstances, for certain sites, with a following wind, on a good day - the results appeared.

Tuesday, 14 December 2010

Win Amazon vouchers in the NGS user survey 2010

The annual NGS user survey is underway and, with all the changes that lie ahead, it has never been more important to gather your feedback.

During the next year there will be changes taking place at the NGS and it is important that our users are involved in these changes and are kept informed every step of the way. We have recently launched the annual NGS user survey and this year, perhaps more than any other, it is vital that we receive your input.

The user survey asks about several possible changes to the services that the NGS currently provides and the impact that changing these services would have for our users. For example the withdrawal of free compute and data resource. We would also like to know what services NGS users see as essential such as the helpdesk, grid certificate management and training. The results of the user survey will be fed directly into our funding bid for NGS 4 which will reflect the wishes and needs of our user community.

The user survey is open to all registered NGS users and all completed user surveys will be entered into a draw for one of 3 Amazon vouchers.

Thursday, 9 December 2010

Dead. Again.

Apparently 'Grid Computing is Dead'.

Again.

It wasn't Colonel Mustard, with the lead piping, in the library. It was David De Roure, with a posting, on the Nature eResearch blog.

To be fair on David: he is an eyewitness - not the perpetuator to the dastardly deed. He was highlighting a panel discussion at the IEEE eScience conference in Brisbane entitled "Grid Computing is dead: Let's all move to the Cloud".

That title looks like another round of that popular panel game: my vague terminology is better that your vague terminology. As Simon Hettrick has pointed out - Clouds computing has one big advantage over Grid computing - the Name. Clouds sound nice and fluffy; grids sound hard and rigid.

You cannot really discuss Cloud computing in general. You need to talking about the the various Somethings-as-a-Service.

Most cloudy discussions concentrate on IaaS - Infrastructure-as-a-Service. Through the wonders of virtualisation, imaginary computers are conjured up on a magic box somewhere on The Internet. You can ask for an imaginary computer and use and abuse it just like a real computer under your desk. This has changed the way computing is delivered.

It is not the only option. Some Research Institutions, and commercial companies, are offering access to High Performance Computing systems and calling it 'Cloud Computing'. More accurately, this is  SaaS - Software-as-a-service - and PaaS - Platform-as-a-Service. This is good when you need access to a particular application or the messy bits needed to build an application. 

A few years ago, they might have been called such an offering 'Grid Computing'.

The Grid is a Platform. We are offering it as a Service. 

It might be a slightly-rickety platform but we have used it to support many applications and enable new research.

The Grid isn't dead. It has PaaS-ed over to the other side.


Friday, 3 December 2010

ICE and too much CREAM

If these is one area where the Grid community excels, it is in the creation of acronyms. The 600-odd entries on GridPP's Grid Acronym Soup page include a FIreMan, two kinds of GENIUS and a PanDA.

This post is brought to you by the acronyms ICE and CREAM - which are not yet in the Soup but are widely deployed by GridPP.

ICE is nothing to do with the white stuff covering most of the UK - it stands for Interface to CREAM Environment; CREAM is Computing Resource Execution And Management.

CREAM provides an alternative, web-service-y, interface for submitting jobs to a compute cluster. ICE allows CREAM services accept jobs from resource brokers such as the NGS's UI/WMS service.

The NGS deployment of WLCG Nagios is having problems swallowing ICE and CREAM.

Our plan is to replace the tests run from the existing INCA service with similar tests from WLCG Nagios. The INCA tests use credentials associated with the ngs.ac.uk Virtual Organisation when submitting jobs, so our Nagios instance is doing the same.

This contrasts with the GridPP Nagios deployment which uses the CERN Ops VO when testing.

Ops membership is acccepted anywhere that processes the CERN data. The ngs.ac.uk VO is accepted, at least in part, by all NGS member and affiliate sites. These include many GridPP sites as well as a number of sites who really aren't bothered by the Higgs Boson.

This is where it gets complicated.

When monitoring a whole region, WLCG Nagios does not submit tests directly to the sites. It passes them to a WMS resource broker where they queue until the site is ready. If the site takes too long to respond, the WMS is told to cancel the test.

Tests aimed at 'Classic' Compute Elements are running. The sites run the tests and, after some tweaks at STFC, we are now able to collect test results from the message bus.

Tests aimed at CREAM services are not running. Worse still, they get stuck in a strange state where cancellations are ignored. Under these circumstances, the CREAM testing bit of Nagios sends another cancellation request... and another... and another...

Eventually the cancel requests clog up the resource broker.

We are not yet sure why CREAM based services and the NGS do not get along.

GridPP people who came to a recent NGS Surgery suggested that it might simply be the presence of an email address in our VO certificate's distinguished name. Comparing distinguished names is far more complicated that it appears and embedded email addresses, in particular, cause no end of hassle.

We've turned off the WMS CREAM tests for now and replaced them with ones sent directly from the Nagios server.

After all no-one wants a broken broker.

[Update: 8-Dec-2010. The Grid Acronym Soup now includes both ICE and CREAM. I suppose this turns it into a Gazpacho.]