Thursday 29 September 2011

Blogging off

It is the last day of All Hands 2011 and it is my last day working for the NGS.

After 4 years of general griddery, I'm moving on.

Four years is a long time in research, and today's All Hands meeting at York is very different from the first grid event I attended, Open Grid Forum 20 in Manchester.

I remember that the Manchester meeting was huge and full of international delegates.

The UK contingent were based in something called the UK e-Science Village - which conjured up bucolic images of computer scientists dancing around the maypole on the e-Science village green, just next to the local shop for local people.

At the very least, I was hoping to see the UK e-Science Village People giving a rousing chorus of their classic - '(it's fun to be at the) STFC'.

The village turned out to be a very large display booth.

All Hands is national, rather than international. The conference and the booths are smaller. As at OGF, people still enthuse about shiny new technology that will solve all our problems in the future.

But in among them are people using the less-shiny, less-all-singing, less-all-dancing software that we have now. And they are using it to do new research that has nothing to do with the technology itself.

And it is those people I want to hear - because what I have learned to call e-Infrastructure is very broad. In one session yesterday, the talks covered the behaviour of the heart, how what the researchers learned there has been applied to the way muscles move during birth, and how to model the way water shapes landscapes over millennia.

I still do not give a damn about how clever, or web-service-y, or standards-compliant, a bit of e-Infrastructure is. It is what you do with it that counts.

It is the researchers who have taken what we provide and used it to deliver research that could not otherwise be done. These are the people you can read about in the case studies.

These are the people who have turned e-Research into Research - and will continue to do so for many years to come.

Tuesday 27 September 2011

Goodbye UKI, hello NGI_UK

At All Hands 2011, in the atrium of the University of York's brand new Ron Cooke Hub conference venue.

On our stand in the middle of the room is a familiar face - helpdesk manager John Kewley - sitting under a slightly less familiar sign.

It doesn't say NGS, or GridPP, although both have posters on display.

The sign says just 'NGI' - aka National Grid Infrastructure - and we have had to get used to it very quickly.

At last week's EGI Technical Forum, what was the UKI ROC - the UK and Ireland Regional Operations Centre - was officially replaced by two new NGIs called NGI_UK and NGI_IE.

And lots of things broke - including the load monitor and the Nagios testing service.

Names matter. Both the load monitor and Nagios were pulling information about sites and users from the Grid Operations Centre Database. More specifically, they were pulling information about sites and users associated with the UKI ROC.

The UKI ROC is no more: it has no sites or users associated with it.

So... we have spent the last few days tracking down every reference to 'UKI' in every configuration file for every service and replacing it with 'NGI_UK'.

There were quite a few....
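
The hunt itself was nothing clever - something along the lines of the sketch below, where the configuration directories are placeholders and every match was checked by eye before anything was changed:

  # Hypothetical sketch - the real configuration lives in assorted
  # service-specific locations, not in one tidy directory.
  CONF_DIRS="/etc/grid-services /opt/monitoring/etc"

  # List every file that mentions UKI so it can be inspected first...
  grep -rl 'UKI' $CONF_DIRS

  # ...and, once checked, swap the old ROC name for the new NGI name,
  # keeping a backup of each file alongside the original.
  for f in $(grep -rl 'UKI' $CONF_DIRS); do
      sed -i.bak 's/UKI/NGI_UK/g' "$f"
  done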

The load monitor is back. We've been working on Nagios today and it should be fully working soon.

Monday 19 September 2011

Three Little Words

There are these three little words. For some people, these words bring feelings of fulfilment and contentment. For others, they bring nothing but frustration.

Those three little words are:

  Proof of Concept

For that part of the e-Research Community interested in how research will be done in future, a proof of concept is evidence that it is possible to do something new and interesting, using something new and interesting. It might change the way research is done next decade. It is more than enough for a published paper and a presentation at All Hands.

And it is of bog-all use to those for whom e-Research is simply a means to an end. They just want something that works now and works reliably.

There is always a gap between the potentially useful and the actually useful. When you can build something that bridges that gap, you can enable research that would not otherwise be done.

Which brings me to the slightly embarrassing news that our project to deploy the ARC middleware in front of the local High Performance Computing service has been a complete success... as a proof of concept.

We have shown that it is possible to deploy ARC services in front of what we should now be calling Oracle Grid Engine.

With some inventive use of ssh copies in prolog and epilog scripts, this can be made to work even where there is no file-space shared between the grid 'front end' and the HPC cluster.

We also know that you can support parallel tasks using ARC's Runtime Environment mechanism --- there are examples at the bottom of the (slightly out of date) Nordugrid documentation --- and make use of the LCAS/LCMAPS authorisation system used by other grid software.

Which is nice....

Whether it is going to be useful is a completely different question.  We do not yet know if the local communities who are best placed to use it --- the rather incongruous pairing of Solar Physics and Social Science --- will want to do so.

Epilogue: Prologs and Epilogs


A quick technical note on faking a shared directory via Grid Engine prolog and epilog scripts.

The scripts run just before the start and just after the end of every job.

ARC-the-middleware obligingly changes directory to the 'shared' scratch directory before submitting the job. This means that the prolog and epilog scripts are presented with the path to this directory in the $SGE_O_WORKDIR environment variable.

The recipe is along the lines of...

  • Create an ssh keypair for each user - to be used solely for transfers from the HPC back end to the grid front end.
  • Copy the private key to a safe place on the HPC back end, readable only by the user. We will call this directory $GRID_KEYS.
  • Use the public key to create a per-user authorized_keys file on the grid front end, somewhere like
       /etc/ssh/authorized_keys.d/$USER
    and change /etc/ssh/sshd_config (again on the grid front end) to set:
        AuthorizedKeysFile  /etc/ssh/authorized_keys.d/%u
  • Add code to the prolog and epilog to use scp (or rdist) with -i $GRID_KEYS/$USER to pull files from $SGE_O_WORKDIR at the beginning of the job and push them back at the end - as sketched below.
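
A minimal sketch of what that prolog and epilog code might look like - assuming the grid front end is reachable as 'gridfe' (a placeholder hostname) and that the working directory path is the same on both machines:

  # prolog - runs on the HPC back end just before the job starts.
  # 'gridfe' and the key location are placeholders.
  GRID_KEYS=/path/to/keys            # the 'safe place' holding per-user keys
  mkdir -p "$SGE_O_WORKDIR"          # mirror ARC's 'shared' scratch directory
  scp -q -r -i "$GRID_KEYS/$USER" \
      "$USER@gridfe:$SGE_O_WORKDIR/." "$SGE_O_WORKDIR/"

  # epilog - runs just after the job finishes: push the results back
  # so that ARC can stage them out from the front end.
  GRID_KEYS=/path/to/keys
  scp -q -r -i "$GRID_KEYS/$USER" \
      "$SGE_O_WORKDIR/." "$USER@gridfe:$SGE_O_WORKDIR/"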




Tuesday 13 September 2011

Conference season approaches...

It's gone slightly quieter for me now that the SeIUCCR e-infrastructure summer school is safely under way.  30 students are now ensconced in Cosener's House in Abingdon where they are learning about the wonders of e-infrastructure and how it can help their research.  As I type they will have just finished a “hands on” session on the NGS and how to run jobs on our resources.

One event is underway but we still have two to go.  Next week sees many of the NGS staff at the EGI Technical Forum in Lyon.  The NGS in conjunction with GridPP is the UK National Grid Infrastructure (UK NGI) and in turn the UK NGI is part of EGI (European Grid Infrastructure).

It’s a very active meeting for many NGS staff due to the level of involvement we have in this major project.  As well as meetings, there will also be presentations in several sessions from NGS staff.  I’ve been asked to give a presentation on the NGS roadshows as EGI are developing their own roadshows – well they say that imitation is the greatest form of flattery!  As always the UK NGI will have a stand at the event where people can talk to us further about our activities, meet staff and obtain information.  If you are attending the EGI Technical Forum then drop by and see us.

The week following the EGI conference, many of us will be in York for the UK e-Science All Hands Meeting.  Registration for this is open until the 19th of September so if you want to go, make sure you register soon!  Again the NGS will have an exhibition stand along with GridPP at the event.  The exhibition stand will be a hive of activity as there will be several demos taking place here.  The demos are –
  • Applying for UK e-Science Certificates using the new CA Certificate tool
  • Taverna Server: Towards enabling long running workflows on the NGS
Some of our users will be actively taking part in the conference with demos and presentations, not to mention NGS staff giving presentations and posters as well.

A major activity at AHM is a workshop organised by SeIUCCR, which is a collaboration between the NGS and the Software Sustainability Institute (SSI).  The workshop is entitled "Meet the Champions" and will take place on the Tuesday, 13:30-16:30.

The workshop is an opportunity to meet researchers who have been promoting and leading research over the past decade of e-Science; find out about their work and how they utilise e-Infrastructure, and learn how you can interact with them.  Specifically the "Champions" to meet are members of the Community Champions network from the SeIUCCR (Supporting e-Infrastructure Uptake through Community Champions) project; the NGS Campus Champions and the Software Sustainability Institute Agents Network.

There will also be 2 key presentations -
  • Scott Lathrop is Blue Waters Technical Program Manager for Education and TeraGrid Area Director for Education, Outreach and Training.  Scott's talk is entitled "Engaging Campuses in XSEDE".  XSEDE is the successor to TeraGrid.  Scott will be talking about the XSEDE Campus Champions programme and also the Campus Bridging programme for XSEDE.
  • Steve Brewer is Chief Community Officer of EGI, the European Grid Infrastructure and he will be talking about Community Engagement in Europe.
And if you thought that was enough there will also be a panel session on the question "Why should researchers use e-Infrastructure?".

For the full details of all the NGS activities at the forthcoming AHM meeting please see the news article on the NGS website.

Hopefully we'll see some of you at some point over the next two weeks!

Wednesday 7 September 2011

It is not easy being parallel

As has been said before - there are differences between Grid and traditional High Performance Computing. Some of the differences are due less to the technology and more to the problems being solved.

The more successful grid users are task farmers: they scatter comparatively small compute tasks and data and wait for them to grow into results. The grid - metaphorically speaking - is there to plough the land, spread the fertilizer and muck out the system administrators.

Traditional HPC concerns itself with big applications and - in particular - applications that are too big to fit on a single computer. HPC systems are built with parallel computing in mind.

The Grid does not do parallel computing well.

Consider the two steps in running any parallel task:
  • Asking for more than one CPU core on the same system.
  • Setting those CPU cores to work.
For each step, there is definitely more than one way to do it...

Take 4...

So, there you are, sitting at your favourite grid client, a freshly minted X.509 proxy at the ready. All you need to answer one of the great problems of modern science is 4 CPUs.

All you need to do is ask.

How you ask depends on who you are asking and what grid dialect they understand.

Globus GRAM5 and ARC accept tasks defined in the Globus Resource Specification Language (RSL), possibly with some Nordic extensions. In RSL, you can ask for more than one CPU with an additional:

  (count=4)

The web-service-y Globus job submission systems (WS-GRAM) used a similar approach but written as XML.

In the Job Description Language (JDL), as understood by the gLite CREAM-CE and WMS, you need

  CPUNumber=4;

And in the Open Grid Forum-approved, XML-based Job Submission Description Language (JSDL), you have the instantly-memorable and easily-readable:

    <jsdl:TotalCPUCount>
       <jsdl:Exact>4.0</jsdl:Exact>
    </jsdl:TotalCPUCount>

(which you will find buried somewhere under 3 levels of XML tags). 

Yes - I know JSDL isn't really there for humans to read, but it doesn't stop some humans trying.

4 go to work...

That was the easy part.

Now it gets complicated.

And, on this occasion, you can't blame the Grid for the complexity.

Large-scale parallel programs are typically written around libraries implementing the Message Passing Interface (MPI). There is more than one version of the MPI standard and more than one library implementing them.

To add to the confusion, for some MPI variants you need to build versions for each Fortran compiler installed.

Launching a parallel job depends on both the job management software and the underlying mechanisms used for communication. MPI installations typically provide either an mpirun or mpiexec command that ensures that the right processes are started in the right way on the right computers.

It is very likely that each version of each MPI implementation will have its own variant of mpirun or mpiexec. It is equally likely that - at least for mpirun - they will expect different arguments.
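
To give a flavour of the differences - these are illustrative invocations rather than an exhaustive survey - the same four-process job can look quite different depending on which MPI stack is loaded:

  # Open MPI spells the host list --hostfile...
  mpirun -np 4 --hostfile hosts.txt ./my_app

  # ...while MPICH-style launchers traditionally use -machinefile,
  # and some prefer mpiexec to mpirun.
  mpiexec -machinefile hosts.txt -n 4 ./my_app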

In the first and second phases of the NGS, we were funded to provide exemplar Grid clusters at RAL, Oxford, Leeds and Manchester. The grid software we deployed - Pre-WS GRAM from Globus 4 - could launch MPI jobs if

  (jobtype="mpi")

was included in the RSL.

It could only launch one of the many possible mpirun commands. To work around this, devious system administrators cooked up a sort of super-mpirun that would locate the correct version for an application.
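
I do not have one of those scripts to hand, but the idea ran along these lines - a hypothetical reconstruction, with the install paths and the use of $NSLOTS (standing in for 'however many cores the batch system granted') as placeholders:

  #!/bin/bash
  # Hypothetical sketch of a 'super-mpirun': inspect the application binary
  # and hand it to the mpirun belonging to whichever MPI it was built against.
  APP="$1"; shift

  if ldd "$APP" | grep -q 'libmpich'; then
      # Linked against MPICH
      exec /opt/mpich/bin/mpirun -np "$NSLOTS" "$APP" "$@"
  elif ldd "$APP" | grep -q 'libmpi\.so'; then
      # Linked against Open MPI
      exec /opt/openmpi/bin/mpirun -np "$NSLOTS" "$APP" "$@"
  else
      echo "Cannot work out which MPI $APP was built with" >&2
      exit 1
  fi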

Researchers in Ireland found ways of launching MPI jobs from within JDL jobs - but they could not hide all the complexity.

ARC supports parallel jobs via its Runtime Environments extension - which can tune the environment for an application so that the right number of CPUs are assigned and the right mpirun is run. Again, this needs the system administrator to do something devious if it is to work.
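
For the curious: a Runtime Environment is essentially a shell script that ARC sources at various stages of a job's life, and a simplified, hypothetical MPI one might do little more than put the right launcher on the path. The stage numbering and paths below are assumptions based on my (possibly dated) reading of the Nordugrid examples, not a tested recipe:

  # Hypothetical sketch of an ARC Runtime Environment script for MPI.
  # ARC sources it with a stage argument; stage 1 is assumed here to run
  # on the execution host just before the job itself starts.
  case "$1" in
      1)
          # Put the chosen MPI's launcher and libraries first in the path.
          export PATH=/opt/openmpi/bin:$PATH
          export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
          ;;
      *)
          # Nothing to do at the other stages in this simplified sketch.
          ;;
  esac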

We haven't even begun to cover parallel programs written outside MPI - such as those using the Java sort-of-MPI library MPJ-Express.

So... what am I trying to say?

It would be nice to have a conclusion, or at least a lame joke, to end this blog post - but I can't think of one.

All I can say is that parallel computing is complicated, distributed computing is complicated and that any attempt to combine the two - either using existing Grid solutions, or something newer, shinier and probably invoking the word Cloud - cannot make either kind of complicated vanish completely.

Thursday 1 September 2011

Where did August go?

Usually August is a quiet month outreach-wise at the NGS but this year it seems to have been the complete opposite.  The SeIUCCR summer school is fast approaching and I've been busy sorting out registrations, accommodation and queries for that.  The summer school was massively oversubscribed, with 4 people applying for each place - which demonstrates that there is quite a demand out there for training in e-infrastructure across institutions in the UK!  Thankfully all speakers and delegates appear to be sorted so perhaps now I can breathe a sigh of relief.  If you were not able to get a place at the summer school keep an eye on the NGS website and mailing list as the material from the course will be made available online.

In other news there is a new NGS user case study up on our website.  This time Maria Holstensson from the Institute of Cancer Research explains how she is using NGS resources to improve cancer treatment for children.

Children with neuroblastoma who are being treated with targeted radionuclide therapy can have their treatment monitored with gamma camera images. These images are used to calculate the amount of drug taken up by the tumour and to estimate the radiation dose. However, the image quality can be poor due to scattering and interference. Maria Holstensson from the Institute of Cancer Research is looking at tackling this problem.

The range of research carried out on NGS resources never fails to amaze me.  We have users from every area, from linguistic analysis to high energy physics, and we are always looking for more.  There are now 23 user case studies on the NGS website and I hope that they demonstrate that e-infrastructure is for everyone and not just those from physics or computing research areas.

This may be a good time to mention an interesting blog post from Steve Crouch over on the Ask Steve blog from the Software Sustainability Institute (SSI).  He's been musing on communications between developers and researchers - do they really speak the same language?  Comments will no doubt be welcome!