Friday, 24 September 2010

Santa's little helpers

In a blog post published in The Guardian, Lily Asquith, a particle physicist working at Argonne National Laboratory, described her experience of using the grid as
A bit like sending a letter to Santa (you have no idea where it is going and you can be fairly sure you won't hear anything back).
I suppose that makes us Santa's little helpers: less of a National Grid Service, more of a National Elf Service.

Dr. Asquith is not impressed by the cryptic errors that we - the grid world - inflict on users. These are even less amusing than that last attempt at a joke.

Her example was "Lost Heartbeat" but there are many more. Those of us who answer NGS helpdesk tickets are familiar with people asking why the WMS said...
Standard output does not contain useful data. Cannot read JobWrapper output, both from Condor and from Maradona.
and people running grid software quickly get used to seeing the classic
GSS failed Major:01090000 Minor:00000000 Token:00000003
These are nothing to do with large birds, Argentinian footballers or unsuccessful members of the military. Roughly translated these mean 'Sorry... your job went missing' and 'Oops... invalid certificate' respectively.
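
That rough translation could, in principle, be done by software rather than by the helpdesk. The following is only a minimal sketch: the two fragments and their translations are the ones quoted above, while the table, the function name and the pass-through behaviour are illustrative assumptions, not part of any real grid middleware.

```python
# Hypothetical lookup table. The fragments and translations come from the
# two examples above; everything else here is illustrative only.
TRANSLATIONS = {
    "Cannot read JobWrapper output": "Sorry... your job went missing",
    "GSS failed Major:01090000": "Oops... invalid certificate",
}


def translate(raw_error: str) -> str:
    """Return a friendlier message if the raw error contains a known fragment."""
    for fragment, friendly in TRANSLATIONS.items():
        if fragment in raw_error:
            # Keep the original text so it can still be fed to a search engine.
            return f"{friendly} (original error: {raw_error})"
    # Unknown errors are passed through unchanged rather than guessed at.
    return raw_error
```

Called on the classic GSS message, this would put 'Oops... invalid certificate' in front of the original text; anything it does not recognise, such as "Lost Heartbeat", comes back untouched.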

In their favour, at least these error messages are obscure enough to give sensible answers when fed to Google.

So why are we so bad at telling people what has gone wrong?

In part, this is because so many things have to go right for a job to run: data has to be delivered to the right place, the right software needs to be available and the machine doing the processing needs to be behaving itself. The end result of any failure of any part is the same - the job failed.

The same applies to the letter to Santa. All you know is that you didn't get what you asked for. You will never know if this was because your letter was eaten by a reindeer, or if it was dropped down a chimney or simply that you happen to be on the Naughty List this year.

The situation is not helped by the body of very general purpose code that is buried deep within grid software. That GSS message about the failed major, for example, comes from an implementation of the 'Generic Security Service' Application Programming Interface.

The GSS-API is meant to be able to handle any mechanism for securing network traffic - it handles Kerberos in the same way that it handles certificates. JANET(UK)'s 'Project Moonshot' plans to use it in conjunction with some of the technology behind Shibboleth.

The thing about generic interfaces is that they tend to return generic errors: basically saying that something built around the interface went wrong - go look there instead. That is great for developers but confusing for users.
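
One way the layer built around a generic interface might help is to catch the generic error and re-raise it with the context that only it knows, while keeping the original attached for developers. This is a hedged sketch using Python's exception chaining; the class names, function names and the `credential_ok` flag are all hypothetical stand-ins and do not correspond to any real GSS-API binding.

```python
class GenericSecurityError(Exception):
    """Stand-in for a low-level, mechanism-agnostic failure (hypothetical)."""


class JobSubmissionError(Exception):
    """User-facing error that says what was being attempted (hypothetical)."""


def establish_context(credential_ok: bool) -> None:
    # Hypothetical generic layer: it only knows that something failed.
    if not credential_ok:
        raise GenericSecurityError("Major:01090000 Minor:00000000")


def submit_job(credential_ok: bool) -> str:
    try:
        establish_context(credential_ok)
    except GenericSecurityError as err:
        # Add the context the generic layer cannot know, and keep the
        # original error attached via `from` for developers to inspect.
        raise JobSubmissionError(
            "Could not authenticate to the grid; "
            "check that your certificate is valid"
        ) from err
    return "submitted"
```

The user sees a message about certificates, and the developer can still find the generic error in the exception's `__cause__`.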

The grid can be a scary thing to use. It is complicated. It will go wrong. If we want it to be less scary, we need to learn how to go wrong, better.
