Monday, 30 May 2011

Lost in the Workload Management System

If you work with a complicated technology for any length of time, you build up a mental map of how the bits fit together.

In a corner of my mental map of the Grid world is the WMS, aka the Workload Management System. Next to it is a warning: Here Be Dragons.

I know what the WMS does: it's a mixture of matchmaker and postal service. It takes a list of tasks, distributes them around the grid and makes sure they reach their destinations. I know that WMS is part of the NGS's UI/WMS service and provides a comparatively simple way to get work done on the Grid.

What I do not understand is how it works. This is a little embarrassing, as the NGS is currently dealing with two WMS problems: a long-term problem that is limiting the kinds of users we can support, and a more urgent problem - which surfaced last week - that left the UI/WMS service unusable by all NGS users.

I must emphasise that, for the urgent problem, we now know what broke and how to fix it. The fix should be deployed later in the week. Please keep watching the status report on the website or the NGS-STATUS mailing list for updates.

In both cases, we see failures in the authentication between the components of the WMS.

In last week's major incident, the WMS refused to recognise the NGS's virtual organisation service after the associated Virtual Organisation Membership Service (VOMS) server was upgraded. If you tried, you were told...
unable to delegate the credential to the endpoint...
Our longer term problem is with the WMS and the `SARoNGS' certificates generated by cts.ngs.ac.uk. CTS - which I think stands for Credential Translation Service - allows you to obtain a grid certificate using just your institutional username and password.

The downside is that SARoNGS certificates are signed by a certificate authority that is not yet recognised by the International Grid Trust Federation. Services must explicitly recognise the certificate authority before anyone can use it.

Somewhere within the WMS, the validation step has gone wrong. If you have a SARoNGS certificate, it tells you...
Connection failed: CA certificate verification failed
The WMS - as you might gather from reading the overview of its architecture - is really a linked collection of services. The official Service Reference Card lists 12 separate components that need to be running for the WMS to function. Some of these depend on other pieces of software. In particular, some run within a webserver and use Gridsite to provide access control.

It was a slightly-out-of-date version of Gridsite that caused our major problem last week.

The VOMS server update changed the format of the attribute certificates that link you to a particular Virtual Organisation. Previous releases of the VOMS service used MD5 digital signatures within the attribute certificates. The current one has replaced MD5 with the more secure SHA1.

Our copy of Gridsite only knew about MD5 signatures. An updated, SHA1-aware version was made available late last year. We just hadn't realised that it was needed until last week.
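
To make the failure mode concrete, here is a minimal Python sketch. It is not Gridsite's code: it assumes a hypothetical verifier that looks up the digest algorithm named in a signature in a fixed table. The table names and functions are invented for illustration. A verifier with an MD5-only table has no way to process a SHA1 signature until the table is updated.

```python
import hashlib

# Hypothetical tables of digest algorithms a verifier understands.
# The names are illustrative and are not taken from Gridsite.
OLD_SUPPORTED = {"md5": hashlib.md5}
NEW_SUPPORTED = {"md5": hashlib.md5, "sha1": hashlib.sha1}


def digest_for_verification(algorithm, payload, supported):
    """Return the hex digest of payload, or raise if the algorithm is unknown."""
    try:
        hash_factory = supported[algorithm]
    except KeyError:
        raise ValueError("unsupported signature digest: " + algorithm)
    return hash_factory(payload).hexdigest()


def can_verify(algorithm, supported):
    """Report whether a verifier with this table could process the signature."""
    try:
        digest_for_verification(algorithm, b"attribute certificate body", supported)
    except ValueError:
        return False
    return True
```

With the old table, can_verify("sha1", OLD_SUPPORTED) is False while can_verify("md5", OLD_SUPPORTED) is True, which mirrors what happened when the attribute certificates changed format underneath an older verifier.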

We think that the problems with SARoNGS certificates can be traced to quirks in the way certificate authority information is being passed around the WMS. We are very fortunate that the Software Sustainability Institute has been able to offer the NGS some additional development effort to find these quirks.

The Software Sustainability Institute knows its way around the grid. Their developers know how to deal with complicated software. It takes more than a 'Here Be Dragons' warning to stop them...

Tuesday, 24 May 2011

Getting the user's story across

As you are no doubt aware, I’m always after interesting pieces of user research to feature in the user case study section of the NGS website. User case studies are ideal introductions for people who have never used the grid before, as they can demonstrate what can be done with the resources and what the benefits are.

Whilst producing user case studies benefits the NGS, they also benefit the researchers featured, as the case studies are quite often picked up by other grid organisations.

This was the case with NGS user Cristiano Sabiu who hit the headlines in iSGTW last month with his user case study on his cosmology research.

Cristiano, from the University of Portsmouth, used NGS resources to study the distribution of galaxies in the universe by creating 2000 mock universes for comparison with the actual galaxy distribution in the Sloan Digital Sky Survey (SDSS). By using the NGS he managed to run 20 full-scale simulations, which required approximately 100,000 CPU hours over the course of a year.

His research was made into a user case study for the NGS website, from where it was picked up by International Science Grid This Week (iSGTW) and appeared as their headline article in April. iSGTW is a weekly online publication that provides international coverage of distributed computing and the research it supports. They feature articles on all aspects of distributed computing technology, such as grids and clouds, from all research areas and are always on the lookout for interesting stories to cover.

The NGS user case studies keep on coming though and the most recent one is on quantum mechanics modelling. Stewart Reed from the University of Leeds used NGS resources to develop new methods of performing accurate computer simulations of quantum mechanical tunnelling.

Stewart explained how "the NGS provides excellent computing resources with which to perform these calculations. The computational capacity available from the NGS allows larger systems to be studied more accurately than are possible with standard workstations".

To read more about his research, see the quantum mechanics modelling case study.

Friday, 20 May 2011

Good news, bad news

It isn't a case of one step forward... two steps back.

It's more one step forward... with another step forward coming soon.

This week, both the Leeds CREAM-CE and the NGS's Nagios project inched forward.

CREAM

We started with good news thanks to a comment from Ewan at Oxford on Leeds' plans to install a grid front end to our ARC1 High Performance Computing service.

Ewan pointed out that our preferred approach - leaving the grid access on a machine almost-completely-detached from the HPC service - is a) also other people's preferred approach and b) one that actually works.

Which is nice.

And would be nicer if it wasn't for the bad news: Sun Grid Engine support in CREAM never made it as far as the first major release (EMI-1) of the European Middleware Initiative's grand unified grid software.

Grid Engine support is expected to arrive in a minor release - coming soon.

NAGIOS

We have had a working Nagios development system for some time.

We were trying to build a working 'clean' test system. We were planning to use this to practise the full Nagios install and configuration procedure before being let loose on a proper service.

And when we first practised - a month or so back - the test server refused to install anything.

Since that time progress has been slow. We blame this on the stubborn refusal of the average day to include more than 24 hours, so cruelly depriving the systems staff of enough time to finish everything else that needs to be done.

Earlier this week - after reading the latest Nagios installation documentation and comparing notes with the Nagios developers and our colleagues at Oxford who run the GridPP Nagios - we worked out what had gone awry.

There were some unfortunate conflicts between packages in the software repositories defined in /etc/yum.repos.d. We ended up in RPM hell...

It's better now. And as a minor bonus we developed a utility that can edit YUM repositories in place. It can be found in the UKNGI subversion repository at SourceForge. It isn't pretty or clever but it does work...
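
For anyone wondering what editing a YUM repository definition involves, here is a minimal Python sketch. It is not the UKNGI utility, just an illustration that assumes the files in /etc/yum.repos.d are plain INI-style text that Python's configparser can read. The function name and the example repository are made up. Note that this approach drops any comments in the file when it writes the result.

```python
import configparser
import io


def set_repo_enabled(repo_text, repo_id, enabled):
    """Return the .repo file text with 'enabled' set for one repository section."""
    # Interpolation is switched off so values such as $basearch or %20
    # are passed through untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(repo_text)
    if not parser.has_section(repo_id):
        raise KeyError("no such repository: " + repo_id)
    parser.set(repo_id, "enabled", "1" if enabled else "0")
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Hypothetical repository definition used only for illustration.
EXAMPLE = """[example-repo]
name=Example repository
baseurl=http://repo.example.org/el5/$basearch
enabled=1
"""
```

Calling set_repo_enabled(EXAMPLE, "example-repo", False) returns the same definition with enabled=0. To edit a real file in place, a caller would read it, pass the text through this function and write the result back, ideally after taking a backup.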


Wednesday, 18 May 2011

I’ve recently added a new poll to the NGS website home page which asks if our users collaborate internationally and, if so, if this involves the NGS.

The latest question is the 7th poll to appear on the NGS website so I thought that in this blog post I would recap some of the results of previous polls.

The polls have covered a wide range of topics from “do you like our new website?” way back in November 2009 when we launched the new site, to questions regarding funding and access to resources.

Back in March last year we asked people which operating system they used to access the NGS. This has an important bearing on how our tools and applications are developed in the future as we need to ensure that they are usable by NGS users regardless of their operating system. We had 175 responses and the results were -
  • Windows – 35%
  • Linux – 43%
  • Mac – 19%
  • Other – 3%
The results were pretty much as expected, with Linux users being the largest group. However, it was interesting to see that Windows users are catching up and Mac users are growing in number.

Also many moons ago we asked if our users would like us to offer an academic cloud service. This was a popular poll with 137 replies and the following results –
  • Yes – 66%
  • No – 20%
  • Don’t know – 14%
We are pleased to say that the NGS cloud prototype has been very popular and is currently running at capacity as we help NGS users to make the most of this new prototype service.

Of course it’s not just specialised services that we offer. We also offer the day to day stuff that enables people to get started using the NGS. We wanted to see how the users thought we were doing in terms of authorising new applications to use NGS resources. Were we taking too long? Were we meeting expectations? We were pleased to see that nearly half (46%) of applications were approved the same day, with 28% approved in the next 1 – 3 days.

We are always looking for new polls for the NGS website so if you have any suggestions then we would love to hear them. You can ask anything about software, hardware, services, usage, user stats etc. All the polls are anonymous.

Monday, 16 May 2011

Stuck in the middle

To a casual observer, they look very similar: groups of largely male, occasionally slightly dishevelled individuals who spend far too much time staring at screens and who communicate almost exclusively in acronyms.

But those in the know recognise that your High Performance Computing (HPC) geeks and your Grid Computing geeks are very different creatures.

Should you find yourself talking to one - having somehow managed to side-step the awkward initial `Who are you and how did you get into my server room?' conversation - it is very important to know which species it belongs to.

By far the easiest way to find out is to ask the question: how does someone run a program?

If you have found a Grid geek, the answer will feature web services, UIs, WMSs, CEs of various kinds, certificates and, in extreme cases, XML.

The HPC geek will answer Ssh, Ssh, Ssh. Again and again and again.

That is SSH as in Secure Shell. If the person is saying Shh!, you've wandered into the library by mistake.

HPC is all about building the biggest, fastest computer that can fit in the room. HPC systems are designed to be self contained, with fast disks and fast CPUs linked using fast networks. Users may connect to an HPC service from outside - via SSH - but everything they do from that point on stays within an HPC bubble.

Grid is about connecting a disparate set of resources, spread far and wide, so they can do something useful together. There is very definitely more than one way to do it.

At Leeds, we are in the interesting position of trying to connect our HPC service to the grid.

Rather than trying to graft the full Grid software stack onto a very specialised, and customised HPC environment - we are using a separate (virtual) machine to act as a relay, or maybe a translator, between the two worlds. The HPC service is called ARC1. It is only right that the relay will be called NGS.ARC1.

It will talk Grid to the world, but its only channels of communication to the HPC service will be the batch queuing system - SGE - and good old SSH. There is no shared disk space of any kind.

We are now able to submit jobs from the NGS.ARC1 to the HPC service and monitor their progress.

We have the ability to create separate SSH keys for every grid user. Our next step is to configure the HPC service to use these keys in 'scp' commands within 'prolog' and 'epilog' scripts. Data will be pulled onto ARC1 from NGS.ARC1 when a job starts and pushed back when the job is done.
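
As a rough illustration of what the prolog and epilog steps might run, here is a minimal Python sketch that builds the two 'scp' command lines. It is not our configuration: the key location, sandbox paths and relay host name are all hypothetical placeholders. The only real things in it are the standard scp options -i (identity file), -o (ssh option) and -r (recursive copy).

```python
def scp_command(key_path, source, destination):
    """Build an scp argument list that authenticates with one specific key."""
    return [
        "scp",
        "-i", key_path,         # use this grid user's dedicated private key
        "-o", "BatchMode=yes",  # never prompt; fail rather than wait for input
        "-r",                   # copy the sandbox directory recursively
        source,
        destination,
    ]


def stage_in(user, job_id, relay_host="ngs-relay.example.org"):
    """Prolog step: pull the input sandbox from the relay onto the HPC side."""
    key = "/etc/grid-keys/%s/id_rsa" % user  # hypothetical key location
    remote = "%s@%s:/sandbox/%s/input" % (user, relay_host, job_id)
    return scp_command(key, remote, "/scratch/%s/%s/" % (user, job_id))


def stage_out(user, job_id, relay_host="ngs-relay.example.org"):
    """Epilog step: push the output sandbox from the HPC side back to the relay."""
    key = "/etc/grid-keys/%s/id_rsa" % user  # hypothetical key location
    remote = "%s@%s:/sandbox/%s/" % (user, relay_host, job_id)
    return scp_command(key, "/scratch/%s/%s/output" % (user, job_id), remote)
```

Each function only returns the argument list. A prolog or epilog script would hand that list to something like subprocess.run and check the exit status before letting the job start or finish.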

The latest set of documentation for the CREAM-CE - our choice for the grid side - says that you can set:

  SANDBOX_TRANSFER_METHOD_BETWEEN_CE_WN=LRMS
and let the Local Resource Management System (LRMS), the grid's general term for batch services like SGE, do the donkey work.

At the moment, we have no idea if weakly-linking the Grid and HPC in this way will work.
We'll update you in a week-or-so's time.

And if you are passing through Leeds and want to ask questions, I will - of course - answer `Who are you and what are you doing in my server room?'

Wednesday, 11 May 2011

Moving pictures

Back in March I attended the SSI Collaborations Workshop up in Edinburgh. I used to attend these events when OMII-UK organised them and they were always enjoyable and productive meetings. I wrote up my thoughts and activities on the 2 day event in the blog here (day 1) and here (day 2).

If you are interested in finding out more about these meetings, such as who attends, how it works and what is discussed, then you are in luck. The SSI has just released the videos of the report back sessions onto their YouTube channel. Unlike the usual "death by PowerPoint", these videos show discussion group members reporting back on the issues that their group discussed and the actions their group decided upon. The presentations from the report back sessions are on average about 2-3 mins long thanks to strict timekeeping by the conference chairs. Short and to the point!

SSI have also recently released a list of outcomes from the meeting which they are working on over the forthcoming months.

Monday, 9 May 2011

Looking back - from inside a brown paper bag

The phrase 'brown paper bag bug' was coined by Linus 'Mr Linux himself' Torvalds to describe a screw-up that is both embarrassing and visible for all to see.

We have already covered the Good and the Not-So-Quite-Good aspects of the last two years' R+D. It is now time for the Embarrassingly Bad.

Our brown paper bag moment comes courtesy of the under-used NGS Advanced Reservation service - which allowed anyone who needed to pre-book matching time-slots on multiple computers to do so.

Providing an Advanced Reservation service was a major theme in our original plans for the 'Integrated Infrastructure' part of the R+D work. We wanted to build a service, monitor it, account for use and advertise its existence to the grid.

It started well. We had successfully deployed bits of the HARC co-scheduler as part of phase 2 of the NGS. It was being used to simulate the blood flow through the brain using computers in more than one location.

And then we hit a big problem - described on the blog back in June 2010.

HARC relies on having a network of computers acting as acceptors. Acceptors, unsurprisingly, accept user requests for reservations on a set of compute clusters. They work together: identifying and reserving matching slots of time on each cluster on behalf of the users.

Which is all very well, if the acceptors are working.
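
The slot-matching idea itself is simple enough to sketch in a few lines of Python. This is only an illustration of finding time windows that are free on every cluster. It is not HARC's code or its protocol, and it ignores the hard part, which is getting a set of distributed acceptors to agree. The function name and the representation of free time as (start, end) pairs are assumptions made for the example.

```python
def common_slots(free_by_cluster, duration):
    """Return (start, end) windows, at least `duration` long, free on every cluster.

    free_by_cluster maps a cluster name to a list of (start, end) free windows,
    with start and end given in the same unit, for example hours from now.
    """
    clusters = list(free_by_cluster.values())
    if not clusters:
        return []
    common = sorted(clusters[0])
    for free in clusters[1:]:
        overlap = []
        for a_start, a_end in common:
            for b_start, b_end in free:
                # The shared part of two windows runs from the later start
                # to the earlier end; it is empty if they do not overlap.
                start, end = max(a_start, b_start), min(a_end, b_end)
                if start < end:
                    overlap.append((start, end))
        common = sorted(overlap)
    return [(s, e) for s, e in common if e - s >= duration]
```

Given three clusters with free windows [(0, 4), (6, 10)], [(2, 8)] and [(3, 9)], a request for a two-hour slot can only be met by the window (6, 8). A real co-scheduler then has to reserve that window on every cluster, or on none of them.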

We originally piggy-backed on an acceptor network run by the Louisiana Optical Network Initiative (LONI) but, as time went by, this became less and less reliable.

So we tried to set up our own. And mostly failed.

You cannot run a production service if the acceptor network is not robust and, for robustness, you need to spread the set of acceptors over more than one site. Neither LONI nor the NGS could persuade the set of distributed acceptors to stay working for long enough to be useful. When the acceptors worked, they worked well.

When any part of the communication between acceptors went wrong, all the acceptors failed, one after the other, domino style.

The chain of communications could be broken by misplaced firewalls or by small differences in acceptor configuration.

We put a lot of time and effort into sanity checking firewalls and synchronising the configuration between acceptors - ensuring that updates all happened at the same time - but we never managed to build a proper acceptor network.

Eventually we abandoned the project - leaving a small set of working acceptors at Manchester for anyone who wanted to use them.

Our advanced reservation service may have failed but the idea of advanced reservation is still alive.

We are now in the era of the data deluge and of cloud computing. In the not too distant future, researchers will have access to massive amounts of data somewhere on the Internet and will need to get their hands on sufficient computer power and enough network capacity to process it.

There is still a case to be made for a service that can book compute time and network bandwidth, like HARC did. I'm just sorry to say that the NGS was not able to provide it.

Rest assured, the NGS has identified the blithering idiot who decided to spend time and effort on a failed service. The guilty party will be informed - in no uncertain terms - that he is a disgrace to the long and noble history of Grid software development.... the next time I pass a convenient mirror.


Tuesday, 3 May 2011

NGS TV channel?

I'm beginning to think that the NGS might need its own TV channel as there is another video featuring NGS staff available for your viewing pleasure.

This time it is the videos from the UCISA Cloud Computing seminar which was held in February in Loughborough, UK. This was a one day event focusing on Cloud Computing, which is becoming more and more popular in the UK academic sector.

The event was billed as "a perfect opportunity to take a look at the innovation and challenges with Cloud Computing and to reflect on the benefits it can bring to your organisation". The sessions were designed to give a flavour of Cloud Computing possibilities as well as case studies from early adopters.

The presentation videos are all available on the event website and include a presentation from the NGS Technical Director, David Wallom, on the FleSSR project - Flexible Services for the Support of Research.