Monday, 9 May 2011

Looking back - from inside a brown paper bag

The phrase 'brown paper bag bug' was coined by Linus 'Mr Linux himself' Torvalds to describe a screw-up that is both embarrassing and visible for all to see.

We have already covered the Good and the Not-So-Quite-Good aspects of the last two years' R+D. It is now time for the Embarrassingly Bad.

Our brown paper bag moment comes courtesy of the under-used NGS Advanced Reservation service - which allowed anyone who needed it to pre-book matching time slots on multiple computers.

Providing an Advanced Reservation service was a major theme in our original plans for the 'Integrated Infrastructure' part of the R+D work. We wanted to build a service, monitor it, account for use and advertise its existence to the grid.

It started well. We had successfully deployed bits of the HARC co-scheduler as part of phase 2 of the NGS. It was being used to simulate the blood flow through the brain using computers in more than one location.

And then we hit a big problem - described on the blog back in June 2010.

HARC relies on having a network of computers acting as acceptors. Acceptors, unsurprisingly, accept user requests for reservations on a set of compute clusters. They work together: identifying and reserving matching slots of time on each cluster on behalf of the users.
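For the curious, the 'matching slots' part of the job can be sketched in a few lines of Python. This is only an illustration of the idea, not HARC's own code or protocol: the function names, the cluster names and the use of whole hours as time units are all made up for the example, and it ignores the hard part, which is getting a distributed set of acceptors to agree on the booking.

```python
from typing import Dict, List, Optional, Tuple

# A free period on a cluster, as (start_hour, end_hour), end exclusive.
# Hypothetical representation, chosen only for this sketch.
Interval = Tuple[int, int]


def common_free(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted lists of non-overlapping free intervals."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Move past whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def earliest_matching_slot(
    free: Dict[str, List[Interval]], duration: int
) -> Optional[Interval]:
    """Return the earliest slot of `duration` hours free on every cluster."""
    windows: Optional[List[Interval]] = None
    for intervals in free.values():
        ordered = sorted(intervals)
        windows = ordered if windows is None else common_free(windows, ordered)
    for start, end in windows or []:
        if end - start >= duration:
            return (start, start + duration)
    return None


# Hypothetical free periods on three clusters (hours from now).
free_time = {
    "cluster_a": [(0, 4), (6, 12)],
    "cluster_b": [(2, 8), (9, 15)],
    "cluster_c": [(3, 11)],
}
slot = earliest_matching_slot(free_time, duration=2)
```

Here `slot` is the first two-hour window that all three clusters have free. Finding the window is the easy bit; actually reserving it on every cluster at once is where the acceptors come in.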

Which is all very well, if the acceptors are working.

We originally piggy-backed on an acceptor network run by the Louisiana Optical Network Initiative (LONI) but, as time went by, this became less and less reliable.

So we tried to set up our own. And mostly failed.

You cannot run a production service if the acceptor network is not robust and, for robustness, you need to spread the set of acceptors over more than one site. Neither LONI nor the NGS could persuade the set of distributed acceptors to stay working for long enough to be useful. When the acceptors worked, they worked well.

When any part of the communication between acceptors went wrong, all the acceptors failed, one after the other, domino style.

The chain of communications could be broken by misplaced firewalls or by small differences in acceptor configuration.

We put a lot of time and effort into sanity checking firewalls and synchronising the configuration between acceptors - ensuring that updates all happened at the same time - but we never managed to build a proper acceptor network.

Eventually we abandoned the project - leaving a small set of working acceptors at Manchester for anyone who wanted to use them.

Our advanced reservation service may have failed but the idea of advanced reservation is still alive.

We are now in the era of the data deluge and of cloud computing. In the not too distant future, researchers will have access to massive amounts of data somewhere on the Internet and will need to get their hands on sufficient computer power and enough network capacity to process it.

There is still a case to be made for a service that can book compute time and network bandwidth, like HARC did. I'm just sorry to say that the NGS was not able to provide it.

Rest assured, the NGS has identified the blithering idiot who decided to spend time and effort on a failed service. The guilty party will be informed - in no uncertain terms - that he is a disgrace to the long and noble history of Grid software development.... the next time I pass a convenient mirror.
