Monday 30 May 2011

Lost in the Workload Management System

If you work with a complicated technology for any length of time, you build up a mental map of how the bits fit together.

In a corner of my mental map of the Grid world is the WMS, aka the Workload Management System. Next to it is a warning: Here Be Dragons.

I know what the WMS does: it's a mixture of matchmaker and postal service. It takes a list of tasks, distributes them around the grid and makes sure they reach their destinations. I know that the WMS is part of the NGS's UI/WMS service and provides a comparatively simple way to get work done on the Grid.
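To make the 'matchmaker' half of that description concrete, here is a toy sketch in Python. It is not how the WMS is implemented: the Site and Job classes, their fields and the 'most free slots wins' ranking are all assumptions invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Site:
    # Hypothetical description of a grid site; the fields are illustrative only.
    name: str
    free_slots: int
    max_memory_gb: int


@dataclass
class Job:
    # Hypothetical job description with a single requirement.
    name: str
    memory_gb: int


def match(job: Job, sites: List[Site]) -> Optional[Site]:
    """Return the eligible site with the most free slots, or None."""
    candidates = [
        s for s in sites
        if s.free_slots > 0 and s.max_memory_gb >= job.memory_gb
    ]
    if not candidates:
        return None
    # Assumed ranking rule: prefer the site with the most free slots.
    return max(candidates, key=lambda s: s.free_slots)


def dispatch(jobs: List[Job], sites: List[Site]) -> Dict[str, Optional[str]]:
    """Place each job in turn, using up one slot per placement."""
    placements: Dict[str, Optional[str]] = {}
    for job in jobs:
        site = match(job, sites)
        if site is None:
            placements[job.name] = None  # nowhere suitable right now
        else:
            site.free_slots -= 1
            placements[job.name] = site.name
    return placements
```

The real service has far more to weigh up than slots and memory, which is part of why it needs so many moving parts.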

What I do not understand is how it works. This is a little embarrassing, as the NGS is currently dealing with two WMS problems: a long-term problem that is limiting the kinds of users we can support, and a more urgent problem - which surfaced last week - that left the UI/WMS service unusable by all NGS users.

I must emphasise that, for the urgent problem, we now know what broke and how to fix it. The fix should be deployed later in the week. Please keep watching the status report on the website or the NGS-STATUS mailing list for updates.

In both cases, we see failures in the authentication between the components of the WMS.

In last week's major incident, the WMS refused to recognise the NGS's virtual organisation service after the associated Virtual Organisation Management Server (VOMS) was upgraded. If you tried to use it, you were told...

    unable to delegate the credential to the endpoint...

Our longer-term problem is with the WMS and the 'SARoNGS' certificates generated by cts.ngs.ac.uk. CTS - which I think stands for Credential Translation Service - allows you to obtain a grid certificate using just your institutional username and password.

The downside is that SARoNGS certificates are signed by a certificate authority that is not yet recognised by the International Grid Trust Federation. Each service must explicitly recognise that certificate authority before anyone can use a SARoNGS certificate with it.
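The idea of explicit recognition can be sketched in a few lines of Python. This is only an illustration of the principle: real services verify full X.509 chains and signatures, and the distinguished names below are made-up placeholders, not the real CA names.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Certificate:
    # Toy stand-in for an X.509 certificate: just two names, no signature.
    subject: str
    issuer: str


# Hypothetical distinguished names, invented for this example.
ACCREDITED_CA = "/C=XX/O=Example/CN=Hypothetical Accredited CA"
SARONGS_CA = "/C=XX/O=Example/CN=Hypothetical SARoNGS CA"


def is_recognised(cert: Certificate, trusted_cas: Set[str]) -> bool:
    """A service accepts a certificate only if it trusts the issuing CA."""
    return cert.issuer in trusted_cas


# A service that ships with only the accredited CAs configured.
default_trust = {ACCREDITED_CA}

# The same service after an administrator has explicitly added the extra CA.
extended_trust = default_trust | {SARONGS_CA}

user_cert = Certificate(subject="/CN=some user", issuer=SARONGS_CA)
```

With these definitions, is_recognised(user_cert, default_trust) is False and is_recognised(user_cert, extended_trust) is True. In a service made of many components, every component that checks certificates needs the extended list.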

Somewhere within the WMS, the validation step has gone wrong. If you have a SARoNGS certificate, it tells you...

    Connection failed: CA certificate verification failed

The WMS - as you might gather from reading the overview of its architecture - is really a linked collection of services. The official Service Reference Card lists 12 separate components that need to be running for the WMS to function. Some of these depend on other pieces of software. In particular, some run within a webserver and use Gridsite to provide access control.

It was a slightly-out-of-date version of Gridsite that caused our major problem last week.

The VOMS server update changed the format of the attribute certificates that link you to a particular Virtual Organisation. Previous releases of the VOMS service used MD5 digital signatures within the attribute certificates. The current one has replaced MD5 with the more secure SHA1.

Our copy of Gridsite only knew about MD5 signatures. An updated, SHA1-aware version was made available late last year. We just hadn't realised that it was needed until last week.
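The shape of that failure can be shown with a toy Python example. It is not Gridsite's code: a plain digest stands in for a real digital signature, and the function name, exception and 'supported' sets are assumptions made up for the illustration.

```python
import hashlib
from typing import FrozenSet


class UnsupportedDigest(Exception):
    """Raised when the checker does not know the named digest algorithm."""


def check_digest(payload: bytes, algorithm: str, expected_hex: str,
                 supported: FrozenSet[str]) -> bool:
    # A checker can only validate algorithms it was built to understand.
    if algorithm not in supported:
        raise UnsupportedDigest(
            f"digest algorithm {algorithm!r} is not supported")
    return hashlib.new(algorithm, payload).hexdigest() == expected_hex


# Hypothetical 'old' and 'updated' checkers.
OLD_CHECKER = frozenset({"md5"})
NEW_CHECKER = frozenset({"md5", "sha1"})

# A stand-in for an attribute certificate produced by the upgraded server.
attributes = b"member of some virtual organisation"
sha1_digest = hashlib.sha1(attributes).hexdigest()
```

Calling check_digest(attributes, "sha1", sha1_digest, OLD_CHECKER) raises UnsupportedDigest, while the same call with NEW_CHECKER returns True. Nothing is wrong with the data in either case; the old checker simply cannot read it.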

We think that the problems with SARoNGS certificates can be traced to quirks in the way certificate authority information is being passed around the WMS. We are very fortunate that the Software Sustainability Institute has been able to offer the NGS some additional development effort to find these quirks.

The Software Sustainability Institute knows its way around the grid. Their developers know how to deal with complicated software. It takes more than a 'Here Be Dragons' warning to stop them...
