Tuesday, 21 December 2010

Looking for some reading material?

If you are looking for some reading material in the run up to Christmas then have a look at the latest edition of NGS News, the quarterly newsletter from the NGS.

This quarter's edition contains features on -
  • Using the NGS Cloud Prototype in teaching
  • Running Taverna Workflows on the NGS
  • Research Communities on the NGS Cloud Prototype
  • Quantitative genetic analysis on the NGS
  • ... and much more!
To download your copy of this quarter's newsletter go to the dedicated NGS News webpage where you can also download copies of all past NGS newsletters.

Sunday, 19 December 2010

Three months of basketweaving with Elvis and Maradona.

A little over 3 months ago, we started a project to replace NGS's INCA monitoring service with WLCG Nagios.

This was never going to be easy. There are partner sites in the NGS that provide services that are almost - but not entirely - completely unlike those expected by the Worldwide LHC Computing Grid.

It has been a long, and sometimes tedious, process - documented in long, and no doubt equally tedious, posts on the NGS blog.

We would never have got this far without the WLCG Nagios developers. They have offered advice, produced helpful documentation, fixed quirks and, above all, written code that may be complicated but remains readable and comprehensible.

Over the three months, we have learned how to persuade Nagios to run the tests we need; we have learned how to get a message bus through a firewall, even when said firewall denies that the machine being tested is alive; and we've only broken something important once - when we accidentally clogged up the WMS service with CREAM.

This week, for the very first time, all the bits of the service worked together. We let Nagios run its tests and saw (some of) the results in the MyEGEE and MyEGI 'portals'.

So what took us so long....? Well, we'd accidentally crashed a bus into the database.

The WLCG Nagios software is based around a message bus. Any time anything interesting happens, a message is pushed onto the bus. This relies on a command assigned as a Nagios event handler and the slightly-disturbingly-named 'obsessive compulsive' option that ensures this command is run whenever an interesting test result arrives.

At the same time, a dedicated bus spotter, called msg-to-handler, watches for incoming messages from the bus and stores them in directories on the local disk. Special Nagios plugins check the directories, react to incoming messages and possibly create new messages in the process.
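
The store-then-check pattern described above can be sketched in a few lines of Python. This is an illustration only: the function names, the one-file-per-message layout and the .msg suffix are assumptions made for the example, not the actual msg-to-handler format.

```python
import os
import tempfile


def store_message(spool_dir, name, body):
    """Play the bus spotter: write one incoming message to its own file."""
    os.makedirs(spool_dir, exist_ok=True)
    tmp_path = os.path.join(spool_dir, name + ".tmp")
    final_path = os.path.join(spool_dir, name + ".msg")
    with open(tmp_path, "w") as handle:
        handle.write(body)
    # Rename last, so a reader never sees a half-written message.
    os.rename(tmp_path, final_path)


def drain_messages(spool_dir):
    """Play the plugin: read every complete message in name order, then delete it."""
    bodies = []
    for entry in sorted(os.listdir(spool_dir)):
        if not entry.endswith(".msg"):
            continue  # skip files that are still being written
        path = os.path.join(spool_dir, entry)
        with open(path) as handle:
            bodies.append(handle.read())
        os.remove(path)
    return bodies


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spool:
        store_message(spool, "0001", "first test result")
        store_message(spool, "0002", "second test result")
        print(drain_messages(spool))
```

Keeping the writer and the reader apart like this means that either side can be restarted without losing messages that are already on disk.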

The MyEGEE and MyEGI portals are, crudely speaking, pretty views of a complicated MySQL database. They rely on plugins run periodically by Nagios to update the database with details of tests and results.

The messages were arriving. The plugins were running. The database was not being updated.
There were many reasons.

In part, we were simply behind the times....

As WLCG Nagios has developed, the database schema has changed. Earlier versions stuffed different categories of information into separate databases - called things like 'atp', 'metricstore' and 'mddb' - associated with different users and passwords. In newer ones, all the information is kept in one database called 'mrs'.

Some out of date entries in the configuration files for the YAIM configuration tool meant we had components using old style database names instead of the all-conquering mrs.

Fixing the database names brought us to the point where test results were being processed.
The bad news was that the results were being rejected in processing.

Yet it is to the developers' credit that WLCG Nagios handles rejection well. Duff data is dumped in special SQL tables - with names ending in 'rejected' - with a reason column explaining what went horribly wrong.
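
A quick way to see why data is being rejected is to count the rows in such a table by reason. The sketch below is hypothetical: the table name metricdata_rejected and the detail column are invented for the example, and an in-memory SQLite database stands in for the real MySQL one. Only the idea of a 'rejected' table with a reason column comes from the real service.

```python
import sqlite3


def count_rejections(conn, table):
    """Return (reason, count) pairs for one rejected table, most common first."""
    # Table names cannot be bound as parameters, so check the name by hand.
    if not table.replace("_", "").isalnum():
        raise ValueError("unexpected table name: %r" % table)
    query = (
        "SELECT reason, COUNT(*) FROM %s "
        "GROUP BY reason ORDER BY COUNT(*) DESC, reason" % table
    )
    return conn.execute(query).fetchall()


def demo_connection():
    """Build a throwaway database holding a made-up rejected table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metricdata_rejected (detail TEXT, reason TEXT)")
    conn.executemany(
        "INSERT INTO metricdata_rejected VALUES (?, ?)",
        [
            ("result 1", "unknown test"),
            ("result 2", "unknown test"),
            ("result 3", "unknown site"),
        ],
    )
    return conn


if __name__ == "__main__":
    for reason, count in count_rejections(demo_connection(), "metricdata_rejected"):
        print(count, reason)
```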

In our case, it was because we had information on test results but no information on tests.
We were missing the data from one vital message - one generated by the NCG configuration generator to announce the safe arrival of a new configuration.

To get the message, we needed to add
<ncg::configpublish>
<configcache>
NAGIOS_ROLE=ROC
VO=ngs.ac.uk
</configcache>
</ncg>
to the ncg.conf configuration file and ensure that the /usr/sbin/mrs-load-services script was run.

And - for some tests, under some circumstances, for certain sites, with a following wind, on a good day - the results appeared.

Tuesday, 14 December 2010

Win Amazon vouchers in the NGS user survey 2010

The annual NGS user survey is underway and, with all the changes that lie ahead, it has never been more important to gather your feedback.

During the next year there will be changes taking place at the NGS and it is important that our users are involved in these changes and are kept informed every step of the way. We have recently launched the annual NGS user survey and this year, perhaps more than any other, it is vital that we receive your input.

The user survey asks about several possible changes to the services that the NGS currently provides, such as the withdrawal of free compute and data resources, and the impact that changing these services would have for our users. We would also like to know what services NGS users see as essential, such as the helpdesk, grid certificate management and training. The results of the user survey will be fed directly into our funding bid for NGS 4 which will reflect the wishes and needs of our user community.

The user survey is open to all registered NGS users and all completed user surveys will be entered into a draw for one of 3 Amazon vouchers.

Thursday, 9 December 2010

Dead. Again.

Apparently 'Grid Computing is Dead'.

Again.

It wasn't Colonel Mustard, with the lead piping, in the library. It was David De Roure, with a posting, on the Nature eResearch blog.

To be fair on David: he is an eyewitness - not the perpetrator of the dastardly deed. He was highlighting a panel discussion at the IEEE eScience conference in Brisbane entitled "Grid Computing is dead: Let's all move to the Cloud".

That title looks like another round of that popular panel game: my vague terminology is better than your vague terminology. As Simon Hettrick has pointed out - Cloud computing has one big advantage over Grid computing - the Name. Clouds sound nice and fluffy; grids sound hard and rigid.

You cannot really discuss Cloud computing in general. You need to talk about the various Somethings-as-a-Service.

Most cloudy discussions concentrate on IaaS - Infrastructure-as-a-Service. Through the wonders of virtualisation, imaginary computers are conjured up on a magic box somewhere on The Internet. You can ask for an imaginary computer and use and abuse it just like a real computer under your desk. This has changed the way computing is delivered.

It is not the only option. Some Research Institutions, and commercial companies, are offering access to High Performance Computing systems and calling it 'Cloud Computing'. More accurately, this is SaaS - Software-as-a-Service - and PaaS - Platform-as-a-Service. This is good when you need access to a particular application or the messy bits needed to build an application.

A few years ago, they might have called such an offering 'Grid Computing'.

The Grid is a Platform. We are offering it as a Service. 

It might be a slightly-rickety platform but we have used it to support many applications and enable new research.

The Grid isn't dead. It has PaaS-ed over to the other side.


Friday, 3 December 2010

ICE and too much CREAM

If there is one area where the Grid community excels, it is in the creation of acronyms. The 600-odd entries on GridPP's Grid Acronym Soup page include a FIreMan, two kinds of GENIUS and a PanDA.

This post is brought to you by the acronyms ICE and CREAM - which are not yet in the Soup but are widely deployed by GridPP.

ICE is nothing to do with the white stuff covering most of the UK - it stands for Interface to CREAM Environment; CREAM is Computing Resource Execution And Management.

CREAM provides an alternative, web-service-y, interface for submitting jobs to a compute cluster. ICE allows CREAM services to accept jobs from resource brokers such as the NGS's UI/WMS service.

The NGS deployment of WLCG Nagios is having problems swallowing ICE and CREAM.

Our plan is to replace the tests run from the existing INCA service with similar tests from WLCG Nagios. The INCA tests use credentials associated with the ngs.ac.uk Virtual Organisation when submitting jobs, so our Nagios instance is doing the same.

This contrasts with the GridPP Nagios deployment which uses the CERN Ops VO when testing.

Ops membership is accepted anywhere that processes the CERN data. The ngs.ac.uk VO is accepted, at least in part, by all NGS member and affiliate sites. These include many GridPP sites as well as a number of sites who really aren't bothered by the Higgs Boson.

This is where it gets complicated.

When monitoring a whole region, WLCG Nagios does not submit tests directly to the sites. It passes them to a WMS resource broker where they queue until the site is ready. If the site takes too long to respond, the WMS is told to cancel the test.

Tests aimed at 'Classic' Compute Elements are running. The sites run the tests and, after some tweaks at STFC, we are now able to collect test results from the message bus.

Tests aimed at CREAM services are not running. Worse still, they get stuck in a strange state where cancellations are ignored. Under these circumstances, the CREAM testing bit of Nagios sends another cancellation request... and another... and another...

Eventually the cancel requests clog up the resource broker.

We are not yet sure why CREAM based services and the NGS do not get along.

GridPP people who came to a recent NGS Surgery suggested that it might simply be the presence of an email address in our VO certificate's distinguished name. Comparing distinguished names is far more complicated than it appears and embedded email addresses, in particular, cause no end of hassle.
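
To give a flavour of the hassle: different tools have been known to print the email attribute of the same certificate under different labels, so two strings that describe the same subject can fail a naive comparison. The sketch below is a guess at the kind of normalisation involved, not a diagnosis of our CREAM problem. The set of labels, the example DNs and the function names are all assumptions.

```python
# Labels assumed, for this example, to mean the same attribute.
EMAIL_LABELS = {"emailaddress", "email", "e"}


def normalise_dn(dn):
    """Split a slash-separated DN and fold the email labels into one spelling.

    Attribute values that themselves contain '/' are not handled here.
    """
    parts = []
    for component in dn.strip().split("/"):
        if not component:
            continue
        label, _, value = component.partition("=")
        if label.lower() in EMAIL_LABELS:
            label = "emailAddress"
        parts.append((label, value))
    return parts


def same_subject(dn_a, dn_b):
    """Compare two DN strings after normalisation."""
    return normalise_dn(dn_a) == normalise_dn(dn_b)


if __name__ == "__main__":
    one = "/C=UK/O=Example/CN=nagios/emailAddress=ops@example.org"
    two = "/C=UK/O=Example/CN=nagios/Email=ops@example.org"
    print(one == two, same_subject(one, two))
```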

We've turned off the WMS CREAM tests for now and replaced them with ones sent directly from the Nagios server.

After all, no-one wants a broken broker.

[Update: 8-Dec-2010. The Grid Acronym Soup now includes both ICE and CREAM. I suppose this turns it into a Gazpacho.]

Tuesday, 30 November 2010

And breathe - it's over for another year

So the NGS Innovation Forum is over for another year and, although it seemed to consume most of my time over the last month, I’ll miss it!

This year's event was well received by all those who attended according to the feedback I was given both at the event and afterwards by email. Always nice to know we’re doing the right thing!

The event kicked off on Tuesday with a day focused primarily on our users and I’m glad to say there were some present in the audience. Steven Young gave a brief summary of the Campus Champions who are our eyes and ears in institutions – ready to help users and to feed back comments and suggestions to us. We then moved on to a series of talks about user tools. The aim of this session was to talk through some of the tools that we offer in order that users could head home from the event and actually apply them in their research. The tools covered were –

The day also featured three presentations from users who make a great deal of use of the NGS resources. We had presentations from a variety of research areas to demonstrate just how widely used our resources are. Luke Rendell from St. Andrews University talked about simulating learning strategies, Zhongwei Guan talked about modelling composite structures and Narcis Fernandes-Fuentes talked about using the NGS for early stage drug discovery. A bit of a range of uses!

Day 1 was really good with lots of questions and discussion which continued right the way from the last session through the drinks reception and poster viewing until the end of the event dinner!

Wednesday was aimed primarily at IT staff, sys admins etc so there were a few new faces on this day. In order to bring everyone up to speed, David Wallom re-capped the discussion and findings from the day before. We then kicked off with a presentation from the University of Westminster, who have been an NGS member for some time, before moving on to a discussion session about how the NGS can help to facilitate collaboration between researchers and institutions.

Presentations on two NGS projects followed – accessing the NGS with Shibboleth and updates to the NGS accounting provision. The last session was dedicated to the EU with an update from the EGI Director, Steven Newhouse followed by presentations from two ESFRI projects – CLARIN and ELIXIR.

An exhausting couple of days but well worth it!

From an outreach point of view I’m now busy organising a couple of new roadshow events that people requested during the IF, I’m gathering the presentations from the event to go on the NGS website (watch this space!) and announcing the winner of the best poster at the event.

Congratulations to Jarmila Husby from the School of Pharmacy, University of London whose poster “Molecular Modelling Studies of the STAT3β homodimer:DNA complex” was voted the winner by the delegates. Jarmila won an Amazon voucher which is very handy with Christmas coming up! All the posters from the event will also be on the website soon.

If you missed the event there are a number of ways to catch up – the Twitter posts are available, a blog post from Catherine Gater of EGI, an article on Cloud computing from Simon Hettrick at SSI and photos from the event are available on the NGS Flickr account.

Thank you once again to all those who attended and hopefully we’ll see you all next year!

Monday, 29 November 2010

Innovation writeup part 1

Another successful NGS Innovation Forum. It was good to hear Real Users™ stand up and say they love the NGS (no, really!) and to tell us about all the interesting work they are doing. (Slides should appear on the agenda page shortly, Gillian is busy chasing people.)

Highlights will always be a personal choice - Jason already mentioned CVMFS. There are many interesting bits one could mention, so let's focus on one in this post: authentication.

We demonstrated the CertWizard on behalf of the dev team, and despite being a live demo, it was 100% successful. This tool will make it much easier to manage certificates: browsers were built for a lot of different things, including e-commerce, so managing certificates with browsers can be challenging. Managing credentials with this tool will be much easier, and even fun.

Speaking of easy credentials, Mike talked about Shibboleth access to the NGS (aka SARoNGS). SARoNGS is not new, but it is still changing access to services. For example, we have demonstrated login to Jason's nodes in Leeds using SARoNGS credentials.

Finally, it is worth mentioning that we are collaborating with JANET on demonstrating Project Moonshot. This project is again about federated access but at a "lower" layer than Shibboleth - Shib is very web (or HTTP) oriented which is very useful, but Moonshot aims at other services like ssh (or at the Moon.) Expect more blog posts as we make progress.

Ultimately all this authentication stuff should benefit all end users who will have a choice of how to access their services.

Sunday, 28 November 2010

Software distribution by squid

Last week saw the NGS Innovation Forum. Many of the people who do the NGS's Research and Development work were involved in the forum which, ironically, left us very little time to do any actual innovating in the last week or so.

So... this post will be about something our colleagues in GridPP are working on - and which was discussed at a gathering of UK High Energy Physics System Managers early last week.

The NGS had been invited to the gathering to talk Nagios and monitoring. Other presentations covered the use of software from CERN called CVMFS.

CVMFS is an interesting approach to delivering software efficiently - by combining the idea of Content Addressable Storage with the World Wide Web's capacity to bring data close to where it is needed. There is a very detailed technical report available from CERN and a twitter feed but little of what could be thought of as public documentation.

To understand why CVMFS is so appealing to GridPP, you need to understand their users.

The use of GridPP systems is very different from that of systems elsewhere in the NGS. They provide a lot of compute power, handle a mind-blowingly-huge amount of data - but deploy a comparatively small range of applications software, albeit on a large number of machines.

It is vital that the software used to analyse data from the major experiments at CERN be available everywhere where that data will be analysed. In the past, special deployment jobs were run for this purpose.

CVMFS is an alternative approach. It sprang from a CERN project to deploy virtual images and the need to keep the images small.

In CVMFS, files are deployed from a single central source. When a file is needed, it is copied to a local disk and read from there. No file is copied more than once and copies are stashed 'nearby' in case another nearby machine needs them.

The caching and stashing is made possible by referring to a file by the SHA1 hash of its contents - hence content addressable storage - and putting it on a web server under a name derived from the hash.

The server provides a catalogue, translating from filenames to hashes. If the same file appears more than once - within an application or within different releases of the same application - it will be represented by the same hash-related-filename on the server.
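
The hash-and-catalogue idea is easy to sketch. The Python below is a toy model, not CVMFS code: the ContentStore class, its method names and the example paths are invented for the illustration, and everything lives in memory rather than on a web server.

```python
import hashlib


class ContentStore:
    """Toy content addressable store with a filename catalogue."""

    def __init__(self):
        self.blobs = {}      # hash-derived name -> file contents
        self.catalogue = {}  # user-visible path -> hash-derived name

    def publish(self, path, data):
        """Store data under the SHA1 of its contents and record the path."""
        digest = hashlib.sha1(data).hexdigest()
        # Identical contents hash to the same name, so they are kept once.
        self.blobs.setdefault(digest, data)
        self.catalogue[path] = digest
        return digest

    def read(self, path):
        """Translate the path to a hash, then fetch the contents by hash."""
        return self.blobs[self.catalogue[path]]


if __name__ == "__main__":
    store = ContentStore()
    store.publish("/app/1.0/lib/common.so", b"shared library bytes")
    store.publish("/app/1.1/lib/common.so", b"shared library bytes")
    print(len(store.catalogue), "paths,", len(store.blobs), "stored copy")
```

Two releases that ship the same file end up pointing at one stored copy, which is what makes the caching further down the line so effective.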

CVMFS uses this with the Filesystem in Userspace feature of Linux - aka FUSE - to present a user with something that looks like any other directory.

Behind the scenes, requests are made to the central server via a local Squid web proxy cache. Squid is designed to collect files from the web on behalf of clients, store copies as it does so and deliver the copy wherever possible. It is very, very good at this.

There are quirks: the first time a file is needed by a site, access will be slow although all subsequent attempts to use it will be much faster.
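
The slow-first, fast-afterwards behaviour is what any read-through cache does. Here is a minimal sketch, with a Python dictionary standing in for the central server; the ReadThroughCache class and its names are made up for the example and have nothing to do with how Squid is actually implemented.

```python
class ReadThroughCache:
    """Fetch from the origin on a miss, serve the stored copy on a hit."""

    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin
        self.copies = {}
        self.origin_fetches = 0  # counts the slow trips to the origin

    def get(self, name):
        if name not in self.copies:
            # First request for this name: go all the way to the origin.
            self.origin_fetches += 1
            self.copies[name] = self.fetch_from_origin(name)
        return self.copies[name]


if __name__ == "__main__":
    origin = {"3f2a": b"file contents"}
    cache = ReadThroughCache(origin.__getitem__)
    for _ in range(3):
        cache.get("3f2a")
    print("requests: 3, trips to the origin:", cache.origin_fetches)
```

Because the names are derived from the contents, a cached copy never goes stale: a changed file gets a new name.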

As long as a site has enough local disk space and a nice big squid, CVMFS can deliver software to where it is needed, when it is needed.

Sunday, 21 November 2010

Failing more successfully - getting past Maradona and Condor

It has been nearly a month since the last progress report on Nagios. Which is a shame, because in that time we have made something that looks rather like progress.

The NGS's development Nagios server was at the point where it was throwing tests at NGS partner sites.

The simpler tests - for things such as service certificates reaching their expiry date - are working.

We have had less success with the more sophisticated tests - like those that poke every nook and cranny of a Compute Element.

A few sites - notably those in Scotgrid - are accepting the tests and running them to completion but we only see part of the results. For other sites we get the infamous
Standard output does not contain useful data.Cannot read JobWrapper output, both from Condor and from Maradona.
error message.

In both cases, the same test - the CE-probe - is involved. This is thrown at all sites that advertise Compute Elements in the GOCDB database of all things griddy.

This test makes use of the Nagios concepts of active and passive tests. In an active test, the Nagios service runs some bit of code and expects that bit of code to provide a result. In a passive test, there is no explicit test code and results are fed in by whatever means necessary.
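
One standard route for feeding in a passive result is the Nagios external command PROCESS_SERVICE_CHECK_RESULT, written as a single line to the Nagios command file. The sketch below only builds that line. The host and service names are placeholders, and it is not a description of how the CE-probe itself reports results.

```python
import time

# Standard Nagios plugin return codes.
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def passive_result_line(host, service, state, output, when=None):
    """Format one passive service check result as an external command."""
    timestamp = int(time.time() if when is None else when)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        timestamp,
        host,
        service,
        STATES[state],
        output,
    )


if __name__ == "__main__":
    print(passive_result_line("ce.example.org", "example-passive-test",
                              "CRITICAL", "no result received"))
```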

The CE-probe appears within Nagios as one active test and a whole raft of passive ones. The active test delivers a bundle of tests to the site - via a Workload Management service (WMS) - and checks on its progress. At various stages in the life of the bundle, the passive test results are updated.

Some passive test results are generated from the Nagios server itself; others are sent directly from the system under test via the next available Message Bus.

When the bundle of tests runs successfully, we see the results generated from within the Nagios server but not those coming from the message bus. This is because the development service uses a message broker that sits outside the core set of brokers used by WLCG. A workaround for this is coming any day now.

The Maradona message appears when the bundle of tests doesn't run at all.

It is a by-product of the script generated within the WMS and sent on to the site and, in particular, how this script handles 'Shallow' resubmission.

A shallow failure is one where the job is rejected and can be tried elsewhere. The WMS touts the job around the grid until it finds a system prepared to accept it. Acceptance is signified by the deletion of a marker file using GridFTP.

Which is all very well, as long as the machine on which the script is running has software that is able to delete a file using GridFTP.

gLite-based systems usually have something suitable; those using the NGS VDT-based installer do not. If this step fails, the script gives up early and prints the Maradona message.

A VDT based system can be persuaded to run the WMS-generated script by installing the UberFTP tool using
  pacman -get http://vdt.cs.wisc.edu/vdt_181_cache:UberFTP

(Pick a different cache if you are using something other than the elderly version 1.8.1 of VDT.)

UberFTP provides enough GridFTP support to allow the bundle of tests to run - though we have yet to persuade them to run to completion. I would call that a more successful failure.

Anyone attending the HEPSYSMAN meeting in Birmingham on 22 November will have the opportunity to hear, and ask questions, about what we needed to do to persuade WLCG Nagios to work on the weirder bits of the NGS.

[Edit 2010-11-24 fixing typos]

Tuesday, 16 November 2010

Unable to attend the NGS IF10?

If you are unable to attend the NGS Innovation Forum next week, we hope to keep you up to date with interesting points, discussions etc through the medium of Twitter and this blog!

I will be encouraging all delegates to Twitter with the tag #ngsif10 and of course the regular bloggers will hopefully be in action on here.

The presentations from the event will be available on the NGS website after the event along with pdfs of the posters from the Tuesday evening poster session.

If there is anything else you would like to see us do to keep you up to speed with developments at the event then please let us know!

Saturday, 13 November 2010

Escape from NeSCForge

NeSCForge - home to the NGS's collection of software, documentation and training material - is officially doomed.

On 20 December 2010, the service will be turned off: there simply isn't the money available to keep it running.

NeSCForge has long provided our version control repository. The code we developed to simplify the deployment of grid software lives there, as does the Myproxy enabled GSISSH and the accounting clients.

We needed to find a new home for our software sharpish - and we didn't want to break anything when we did. In particular, we wanted to retain the code and the history of changes in our CVS repository.

Which made the decision about where to go very easy.

The only public software hosting service which provides CVS support is SourceForge. On 7 November, the UKNGI project joined SourceForge.

Why UKNGI? Partially because we are becoming part of the UK National Grid Initiative, but mostly because the NGS name was already taken.

The next stage is moving all our data and - with perfect timing - the Software Sustainability Institute has come galloping to the rescue. They have recently extended their collection of guides for developers to include:
That covers what we need to do nicely.

Earlier today - following those last two guides - we copied the CVS repository from NeSCForge to its new home on SourceForge - with all branches and tags and other version control stuff intact. We have also added Subversion and Git based repositories which we expect to use for future development.

The software releases and other files will be moving soon and we can allow NeSCForge to retire gracefully.

Friday, 12 November 2010

Notes on validating XML signatures

Technical brain dump - left on the NGS blog in case it is useful to anyone. There will be a proper R+D posting along shortly.

We have been investigating a problem with SARoNGS and Shibboleth that is similar, but not identical, to the XML signature problems covered in an earlier posting.

As in that earlier case, we are being sent a lump of XML within which is:
  • some data,
  • a certificate
  • a digital signature for the data generated from the key matching the certificate.
Unlike the earlier case, there is no known bug in the code that generates the XML - yet something deep within SARoNGS was refusing to accept the data.

We suspected that the XML had been mangled - but needed to prove it.

After much fiddling and searching, we dug up a useful one-line command to check the signature without needing the whole of Shibboleth.

First, catch your assertion. This is left as an exercise for the reader.

Next, verify the signature by running:
  xmlsec1 verify --id-attr:AssertionID Assertion shibdata.xml
where shibdata.xml is a file containing the assertion.

An explanation... ignoring namespaces, the digital signature consists of a block within which there is a <Reference> element along the lines of
 <Reference URI="#_39e459384b39f1ddce64e11c58155abc">
The URI is meant to point you at the bit of XML that has actually been signed. The code expects to find an attribute
   ID="_39e459384b39f1ddce64e11c58155abc"
attached to that element.

In this case there is no such attribute. There is, however, an AssertionID attribute with exactly that value.

Which is why we need that odd looking --id-attr option. It explicitly tells the program to treat AssertionID, within an Assertion element, as the ID attribute when searching for the signed element.
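
The lookup that goes wrong can be mimicked with the Python standard library. This sketch only resolves the reference: it does no cryptography and is no substitute for xmlsec1. The sample document is a stripped-down, namespace-free stand-in for a real assertion and the function name is invented for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Response>
  <Assertion AssertionID="_39e459384b39f1ddce64e11c58155abc">
    <Signature>
      <Reference URI="#_39e459384b39f1ddce64e11c58155abc"/>
    </Signature>
  </Assertion>
</Response>"""


def find_signed_element(root, reference_uri, id_attributes=("ID",)):
    """Return the element a same-document reference points at, or None."""
    if not reference_uri.startswith("#"):
        raise ValueError("only same-document references are handled here")
    wanted = reference_uri[1:]
    for element in root.iter():
        for attribute in id_attributes:
            if element.get(attribute) == wanted:
                return element
    return None


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    uri = root.find(".//Reference").get("URI")
    # Looking only for ID finds nothing, which is the failure we hit.
    print(find_signed_element(root, uri))
    # Also accepting AssertionID finds the Assertion element.
    print(find_signed_element(root, uri, ("ID", "AssertionID")).tag)
```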

Thursday, 11 November 2010

The NGS at the centre of the universe?

I've recently been working on a number of case studies with several NGS users to highlight the different ways that the NGS is used in many different research areas.

The first case study from this batch has recently been released and is now available on the NGS website case study section. This case study highlights the work of Cristiano Sabiu from the University of Portsmouth who used the NGS to analyse the distribution of galaxies in the universe.

Cristiano made use of the freely available Gadget2 code which is installed on the NGS STFC RAL site and ran 20 full scale simulations which required approximately 100,000 cpu hours.

To find out more about Cristiano's research see our user case study page. More user case studies are in the pipeline so watch this space!

Tuesday, 9 November 2010

NGS Cloud at the NGS Innovation Forum

We're on the home straight for the NGS Innovation Forum. Registration closes this Friday (12th Nov) at 4.30pm sharp so if you want to attend and haven't yet registered, you had better do so soon!

All our previous NGS innovation forums have featured breakout sessions with groups reporting back to the meeting as a whole. This year we will have our breakout sessions as usual but we would like to flag one up early!

We will be having a break out session on the new NGS Cloud prototype so this would be an ideal opportunity for current users of our cloud service to feed back their experiences directly to NGS staff. It would also be an excellent opportunity for anyone interested in the NGS cloud prototype to find out more.

Remember you don't have to attend both days of the event: delegates are welcome to attend either day as a single day.

Friday, 5 November 2010

Who do we think you are?

This posting is going deep into the innards of Grid software.

Think of it as a computer programmer's version of Inside Nature's Giants - a wonderful example of TV science but not necessarily suitable for watching over dinner. So before we get out the (metaphorical) scalpels, I want to explain why we need to do this.

The NGS provides the SARoNGS service, which issues certificates to people using their institutional credentials and stores these in a MyProxy server.

We have developed the Myproxy enabled GSISSH to give users command line access to a grid compute service from any SSH client - this reads credentials from a MyProxy server.

By linking SARoNGS and Myproxy-enabled GSISSH, using the ability to create accounts on demand and opening the service to anyone in the UK Access Management Federation, it would be possible to provide such a service to anyone in the UK academic community who needed it.

The big practical problem with this plan - and the one most likely to give your IT security people nightmares - is stopping this service being abused.

The missing link is the ability to provide very restricted access to users who are being nosy - enough to prove that it can be done, not enough to do anything - and full access to ones who have signed up to a suitable acceptable use policy.

Non-technical people can look away now...

If you offer a service that runs actual real programs on behalf of actual real grid users, then at some point you are going to be handed a blob of data that contains:

  • A user proxy certificate - with possible added Virtual Organisation membership - that gives your service rights to pretend to be that user.
  • A description of what it is that the user wants to run.

For services such as Globus GSI-OpenSSH and GRAM you need to associate the proxy certificate with an account on a compute service. The account will be used when running anything on behalf of the user.

This sounds simple. Lots of things about Grid computing sound simple.

This particular problem fails to be simple because there are many, many different ways by which the user's proxy certificate can be delivered.

For GSI-OpenSSH, delivery is left to the Generic Security Service (GSS). Technical details can be found on Globus development webpages.

The code that provides GSS authentication plays a complicated game of network ping-pong as client and server bounce messages at one another until they come to a mutual agreement or give up trying. The people behind the Heimdal project have bravely attempted to explain how it works on their blog.

At the end of the game, the credentials are delivered to Globus in the form of a 'context' stored in a variable of type gss_ctx_id_t.

There is a function within the Globus libraries called globus_gss_assist_map_and_authorize that uses this context, feeds it to whatever authorization mechanism is used locally and returns a local user account.

globus_gss_assist_map_and_authorize is used in both the Globus GRAM gatekeeper and GSI-OpenSSH but does not seem to be part of the official application programming interface.

It will either look up the user in the Globus gridmap file or call out to an external authorization service such as LCAS/LCMAPS. The exact behaviour depends on environment variables and configuration files.
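
The gridmap half of that choice is simple enough to sketch. A gridmap file pairs a quoted distinguished name with a local account name, one entry per line. The Python below is an illustration of that lookup only: it is not the Globus code, it ignores the environment variables and LCAS/LCMAPS callouts mentioned above, and the DNs and account names are made up.

```python
import shlex

EXAMPLE_GRIDMAP = '''
# made-up entries for illustration
"/C=UK/O=Example/OU=Physics/CN=ada lovelace" ngs0001
"/C=UK/O=Example/OU=Biology/CN=charles darwin" ngs0002,pool01
'''


def parse_gridmap(text):
    """Map each quoted DN to the first local account listed for it."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = shlex.split(line)  # copes with the spaces inside the quotes
        if len(fields) != 2:
            continue  # skip anything that is not a DN/account pair
        dn, accounts = fields
        mapping[dn] = accounts.split(",")[0]
    return mapping


def map_user(mapping, dn):
    """Return the local account for a DN, or None if it is not listed."""
    return mapping.get(dn)


if __name__ == "__main__":
    table = parse_gridmap(EXAMPLE_GRIDMAP)
    print(map_user(table, "/C=UK/O=Example/OU=Physics/CN=ada lovelace"))
    print(map_user(table, "/C=UK/O=Example/CN=someone else"))
```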

MyProxy-enabled GSISSH does this mapping by running the Unix id command as the user via the proper gsissh command. This is not going to work if the user is not allowed to run the id command.

We would like to be able to replace the gsissh step by a stand alone program that does the mapping in the same way as gsissh when presented with the same environment as gsissh.

Luckily, we have the basis of this program buried in another NGS project - integrating LCAS/LCMAPS with Globus webservices - which was put on hold several years ago. The developer left his work in the source code repository at NeSCForge.

http://forge.nesc.ac.uk/cgi-bin/cvsweb.cgi/lcas-lcmaps/gt4ws_lcas_lcmaps_callout/src/c/?cvsroot=ngs#dirlist

The idea that code and code history is valuable in itself has been mentioned before in this blog and in much more prestigious publications - and this applies even if the code was never finished.

We have one more problem to overcome. NeSCForge will be closing down on 20 December and we are not going to lose our source code when it does. The details of exactly how we will save our code will have to wait for another day and another posting.

[With thanks to Robert Frank at Manchester]

Tuesday, 2 November 2010

NGS Innovation Forum incoming!

We really are now in the run up to the third Innovation Forum which will take place on the 23rd - 24th November.

The final speaker for the event was announced a few weeks ago and we are pleased to welcome Andrew Lyall from the European Bioinformatics Institute who will be speaking about the European project ELIXIR.

We will also be providing a "Roaming RA" service at the Innovation Forum so if you would like to use the NGS but do not have an RA at your local institution from whom to obtain a grid certificate, come along to the IF! Not only will you obtain a grid certificate but you will also be able to meet NGS staff and hear about tools which will be of great use to you in running jobs. For more information on how to obtain a certificate at the event, see the instructions on our website.

If you are unable to attend the event you can hopefully follow us on Twitter. We will have a special tag for the event, #ngsif10, so look out for this on your Twitter feeds. If you happen to tweet about the NGS in general please use our tag #ukngs. Help us spread the word!

Remember registration closes on the 12th of November!

Friday, 29 October 2010

I'm a monitoring service, let me out of here

When we last covered the development of the new Nagios monitoring service in the blog - before last week's commercial break - we had just convinced it that all the hosts were alive and ready to be tested.

We can now proudly say that we have coaxed the service towards its first official, complete and utter failure.

All those highly motivated people who tell you 'failure is not an option' - ignore them. If you are running a service that tests things, having a test fail means that you actually persuaded that test to run. It isn't failure, it is a different kind of success.

And it is not as easy as it sounds because the Nagios development server is, quite deliberately, kept isolated from the rest of the world. 

This is not a reference to the Harwell Science and Innovation Campus near Didcot: where the people from the STFC e-Science centre who run the service are based, and where the NGS Innovation Forum 2010 will be held.

It is simply that the Nagios development server has limited Internet access - as befits an experimental service. All access to the World Wide Web must be channeled through a web proxy. Privileged access to services is granted only when needed.

Neither the NCG configuration program nor the various tests and probes that Nagios uses were written for an environment with a web proxy. Much of the code is written in Perl and support for proxies is already present - it just needed to be turned on. The Nagios developers at CERN have already accepted the changes for the next release.
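
To show what "turning on" proxy support tends to look like, here is a small sketch in Python rather than the Perl that was actually changed. It builds a URL opener that sends web requests through a proxy, taken either from an explicit address or from the usual http_proxy / https_proxy environment variables. The proxy address in the usage note is hypothetical.

```python
import urllib.request


def make_opener(proxy_url=None):
    """Build a URL opener that routes web requests through a proxy.

    If no proxy is given, fall back to whatever the environment
    (http_proxy, https_proxy, ...) says.
    """
    if proxy_url:
        proxies = {"http": proxy_url, "https": proxy_url}
    else:
        proxies = urllib.request.getproxies()
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler), proxies
```

Something like make_opener("http://proxy.example.ac.uk:8080") returns an opener whose open() method can then be used in place of a direct urlopen() call.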

With web access granted, NCG could build a complete configuration and the tests that suck information from web sites all began to run.

The next problem was getting permission to do things.

This is a grid. To use a grid, you need a certificate. WLCG Nagios has the wherewithal to download a certificate from a MyProxy Credential Management Service - as long as someone has uploaded it in the first place and there is no passphrase required.

The NGS provides a central MyProxy service and MyProxy allows certificates to be uploaded so that they can be downloaded using another certificate as authentication. The command to do this isn't exactly short:


env GT_PROXY_MODE=old myproxy-init -s myproxy.ngs.ac.uk -l nagios_dev -x -Z '/C=UK/O=eScience/OU=CLRC/L=RAL/CN=nagios-dev.ngs.ac.uk/emailAddress=sct-certificates@stfc.ac.uk' -k nagios_dev-ngs -c 336

... but it works.

Or at least it worked after we had added the certificate DN to the authorized_receivers and trusted_receivers entries in the myproxy server configuration file.

... and ensured that the ngs.ac.uk virtual organisation was defined on the Nagios server.

So at long last, the Nagios server could download a certificate, associate it with a virtual organisation and use it to submit jobs via a Workload Management Server.

Which was the point at which we realised that the Workload Management Service endpoint (https://ngswms01.ngs.ac.uk:7443/glite_wms_wmproxy_server) should have been defined in the glite_wms.conf and glite_wmsui.conf files in $GLITE_LOCATION/etc/ngs.ac.uk/.

With that final hurdle overcome, the test jobs started to flow.

The Compute Element tests were sent to sites declaring themselves as Compute Elements - including the original NGS core sites at Leeds and RAL.

I admit to rigging it so that Leeds was tested first. The little status box went Green as the job was submitted, then Red as it failed with a friendly:

- Standard output does not contain useful data.Cannot read JobWrapper output, both from Condor and from Maradona.

Sometimes an Argentinian footballer and a large scavenging bird can make your day.



Wednesday, 27 October 2010

Arts and Humanities and ICT

Following on from my previous blog post about the JISC Future of Research? conference, the next parallel session I attended was “Evolution and Revolution in ICT and Arts and Humanities research”.

The first presenter was Simon Tanner from King’s College London who spoke about some JISC digitised collections. No slides from this one I’m afraid as he showed us pretty pictures instead!

The next speaker was from Mimas and spoke about “long tails and efficiencies of scale”. The slides for this one are available. Of particular relevance to me and the NGS was the observation that services have to show –

  • User demand
  • Benefits
  • Impact and value
  • Sustainability

This is something that the NGS has been undertaking recently and will continue to do so as we come up for re-funding next year. You may have seen a small flurry of stats etc. on the website, and we have been undertaking more stats gathering behind the scenes from the proliferation of data available from usage stats and user applications. From this we are building up a picture of user demand, impact and value etc.

We hope that our users will help us by contributing to our forthcoming annual user survey and our follow up roadshow survey. If you have attended any NGS roadshows we would be grateful if you could complete the short survey which will take about a minute to complete! This will help us shape future roadshow events and also analyse the benefits that users get from attending these events.

The final speaker was John Coleman from the University of Oxford who presented on “Large Scale Computational Research in Arts and Humanities”. He started with an interesting fact that in 2008 YouTube was the second most popular search engine, with people looking for speech / audio instead of text. He is currently working on a JISC funded project to “mine a year of speech” which aims to annotate a year's worth of data in the form of a corpus. To do this he is currently using about 20 computers set up as a cluster in a local lab. However he is now looking at placing the data in several other universities which he says is “like grid computing”. John also highlighted a report available on "ICT Tools for searching, annotation and analysis of audiovisual material" which may be of interest to people.

The JISC Future of Research? conference had some very interesting parallel sessions and I hope it continues next year!

Tuesday, 26 October 2010

Centralising your IT support - thoughts from the JISC Conference

Last Tuesday the NGS was out on the road with an exhibition stand at the JISC Future of Research conference which was held in London and online.

As well as speaking to delegates browsing the exhibition stands, I also went to some very interesting parallel sessions. The first session I attended was “Centralising your IT Support for Research” and consisted of 3 presentations including Mary Visser, Director of IT from the University of Leicester. Mary talked about how researchers want “free at point of use” as funders are unwilling to pay FEC for these, seeing IT as basic facilities which should be provided by the institution. Leicester currently have an IT research liaison manager who speaks to researchers about their IT needs and provides guidance and advice about local and national resources. Sounds like something that many of our users would like!

Mary’s presentation for a centralised IT support service was countered by Rob Procter from Manchester e-Research Centre who argued the case for more distributed IT provision within schools and departments. Rob pointed out that many researchers do not trust IT services to provide what they need as they are teaching focused with most of their effort going in this direction rather than towards research. Rob's argument for embedding IT staff in departments is certainly one I’ve heard many times before and I’ve also seen some very good results and collaborations come out of these situations.

Here at the NGS we realise that not all universities have IT services that can help with grid computing or even the use of computing in research. The NGS can’t have technical staff based in every institution in the country but we do have a variety of means to try and help from afar.

Check this list to see if you have a local Campus Champion who can provide some advice or support to you in your institution. If you don’t have a Campus Champion, we have our helpdesk where knowledgeable staff can answer your queries by email. We also have a variety of tutorials to talk you through getting started and running jobs. If there is anything else you would find useful regarding training material then please let us know!

Friday, 22 October 2010

Now wash your schema

In that strange parallel universe that exists only in TV adverts: two friends sit in a remarkably spacious and clean kitchen, sipping low-calorie-but-surprisingly-tasty beverages, and talking.

And what are they talking about? Which wonderful washing powder washes whites whitest.

If we lived in their world, I would be able to introduce:

The all-new NGS schema washing service - removing unpleasant stains from your grid information and leaving it huggably soft and smelling of Summer Meadows.

and not feel like a complete idiot.

You really don't want to have dirty schema. Not only will your friends talk incessantly about it when they visit for a cup of low-calorie-but-surprisingly-tasty beverage - your site will be completely ignored by the UI/WMS Resource broker.

The technical details were covered in an NGS surgery earlier this year.

It is all to do with the Grid Information Services through which sites publish static information about the hardware and software available and dynamic information such as the number of running jobs or the amount of free disk space.

Information services feed off one another. A site service would collect and combine all the information from the compute and data services within the site.

Data from all the sites is gathered together into a central service where it can be used by the UI/WMS, and the load monitor and applications page on the NGS website.

The information is passed around using LDAP.

LDAP represents things as Objects. Each object will be a member of one or more Object Classes. Each object class is associated with a set of Attributes. An attribute is a label and one or more values.

Beneath any LDAP service there is a set of schema. A schema defines which attributes can be defined for an object of a particular class.

The problems appear when the data passed around does not follow the schema, or follows a slightly different schema. Older publishing software will take anything. Newer software is routinely configured to be far more strict and will silently drop any object that does not fit the schema.
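
The difference between the relaxed and the strict behaviour can be sketched in a few lines of Python. This is a toy model only: the schema, class names and attributes below are invented for illustration, and real LDAP servers check rather more than this.

```python
# Toy schema: object class -> attributes allowed for that class (all invented).
TOY_SCHEMA = {
    "Host": {"hostName", "cpuCount"},
    "Storage": {"freeDiskGB"},
}


def fits_schema(entry, schema):
    """True if every object class is known and every attribute is allowed."""
    classes = entry["objectClass"]
    if any(c not in schema for c in classes):
        return False  # a reference to an unknown object class is enough to fail
    allowed = set().union(*(schema[c] for c in classes))
    attributes = set(entry) - {"objectClass"}
    return attributes <= allowed


def publish(entries, schema, strict):
    """Relaxed publishers pass everything on; strict ones silently drop misfits."""
    if not strict:
        return list(entries)
    return [e for e in entries if fits_schema(e, schema)]


ENTRIES = [
    {"objectClass": ["Host"], "hostName": ["node1"], "cpuCount": ["8"]},
    {"objectClass": ["Host", "LegacyTop"], "hostName": ["node2"]},
]
```

With strict=False both entries are republished. With strict=True the second one disappears without any error, because it mentions an object class the schema has never heard of.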

A number of NGS partner sites are using a rather elderly version of the Virtual Data Toolkit to publish data. The NGS central service was recently updated to the latest, greatest and strictest version of the Berkeley Database Information Index (BDII).

The VDT and BDII schemas differed by one small detail: two object classes called GlueTop and GlobusStub existed only in the VDT version. Neither GlueTop nor GlobusStub has any attributes directly associated with it, so their presence did not affect the content published. It was just that a reference to them was enough for the VDT-flavour data to be ignominiously dumped before ever reaching the central BDII.

Information from sites simply disappeared.

But... the BDII is perfectly capable of collecting 'dirty' data, removing the extra object classes and similar quirks and republishing it as clean, fresh data (with a hint of Melon and Lotus Flower). All it needed was a touch of FIX_GLUE.

FIX_GLUE is a BDII configuration option, originally intended to be used at a site level, that turns on the data cleaning. An NGS staff member at STFC realised that this same approach would work as a national service - and the schema washing service was born.
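
For the curious, the washing step amounts to something like the following Python sketch. It is not the BDII's own code: it simply assumes an entry is held as a dictionary of attribute names to lists of values and strips out the two extra object classes named above. The sample entry is hypothetical.

```python
# The two object classes that exist only in the VDT flavour of the schema.
EXTRA_OBJECT_CLASSES = {"gluetop", "globusstub"}


def wash_entry(entry):
    """Return a copy of an entry with the extra object classes removed."""
    washed = {}
    for attribute, values in entry.items():
        if attribute.lower() == "objectclass":
            values = [v for v in values if v.lower() not in EXTRA_OBJECT_CLASSES]
        washed[attribute] = list(values)  # copy, so the dirty entry is untouched
    return washed


# Hypothetical 'dirty' entry, for illustration only.
DIRTY = {
    "objectClass": ["GlueClusterTop", "GlueCluster", "GlueTop", "GlobusStub"],
    "GlueClusterName": ["cluster.example.ac.uk"],
}

CLEAN = wash_entry(DIRTY)
```

Everything else in the entry is passed through unchanged, which matches the observation that GlueTop and GlobusStub carry no attributes of their own.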

[27 Oct: Edit to improve the description of how the washing works. With thanks to Jonathan Churchill at STFC.]

Friday, 15 October 2010

It's alive

If you work at one of those institutions that let users choose names for their own computers: it is inevitable that, sooner or later, someone will claim the name 'Elvis'.

Occasionally, this will be because they like his music. In most cases, it is because they want to be able to run 'ping elvis' and be told that, despite the events of August 16 1977:
 elvis is alive [*]
'Ping', the friendly name of an ICMP echo request packet, was invented as a way of testing network connectivity. The original idea was that if a machine on the Internet was working and it was pinged, it should send back the contents of the 'ping' to the sender as an ICMP echo reply.
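
The echo request and reply are simple enough to build by hand. The Python sketch below packs an ICMP echo request (type 8), computes the standard Internet checksum, and constructs the matching echo reply (type 0) with the same identifier, sequence number and payload. It only builds the bytes; actually sending them needs a raw socket and usually root privileges, which is left out here.

```python
import struct

ECHO_REQUEST, ECHO_REPLY = 8, 0


def internet_checksum(data):
    """16-bit one's complement checksum used by ICMP."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


def build_echo(icmp_type, ident, seq, payload):
    """Pack an ICMP echo message: type, code 0, checksum, id, sequence, payload."""
    header = struct.pack("!BBHHH", icmp_type, 0, 0, ident, seq)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", icmp_type, 0, checksum, ident, seq) + payload


def reply_to(request):
    """Build the echo reply a working host would send back, or None."""
    icmp_type, _code, _checksum, ident, seq = struct.unpack("!BBHHH", request[:8])
    if icmp_type != ECHO_REQUEST:
        return None
    return build_echo(ECHO_REPLY, ident, seq, request[8:])


REQUEST = build_echo(ECHO_REQUEST, 0x1234, 1, b"elvis")
REPLY = reply_to(REQUEST)
```

A receiver can validate either message by running internet_checksum over the whole packet: a correct packet gives zero.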

Ping dates from when the Internet was a smaller and nicer place. These days, pings are seen as a security issue and are frequently blocked at campus and departmental firewalls.

So, once Elvis has left the building, you may never know if he is still alive.

This is a real issue for the current WLCG Nagios deployment. The Nagios service is hosted at the Rutherford Appleton Laboratory but the sites that make up the NGS and GridPP are spread around the country behind many different firewalls. Some hosts can be pinged from off site, some can't.

Nagios has the concept of hosts and services provided by those hosts. It will only check services if the associated host is working. The usual way of testing a host is by sending a ping. If pings are not permitted, no service on that host is tested.

The NCG utility that generates Nagios configurations can use a dummy test in place of a ping test for all hosts. To enable this, the /etc/ncg/ncg.conf configuration file needs to be changed to include a line setting CHECK_HOSTS to zero:

<NCG::ConfigGen>

<Nagios>
...
# Disable 'ping' checks of hosts
CHECK_HOSTS=0

</Nagios>
</NCG::ConfigGen>

When ncg is run and a new Nagios configuration built, all services on all hosts are tested. On the down side, if a host really has dropped off the network, Nagios will continue to test the services and generate alerts.
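
The trade-off can be modelled in a few lines of Python. This is only a sketch of the decision described above, not Nagios code; the host names and service names are invented.

```python
def services_to_test(hosts, pingable, check_hosts=True):
    """Return the (host, service) pairs that would go on to be tested.

    With check_hosts=True a host must answer a ping before its services
    are checked. With check_hosts=False the host check is a dummy that
    always passes, so every service is tested regardless.
    """
    tested = []
    for host, services in hosts.items():
        host_ok = (host in pingable) if check_hosts else True
        if host_ok:
            tested.extend((host, service) for service in services)
    return tested


# Hypothetical hosts: the second sits behind a firewall that drops pings.
HOSTS = {
    "elvis.example.ac.uk": ["gram", "gridftp"],
    "maradona.example.ac.uk": ["gsissh"],
}
PINGABLE = {"elvis.example.ac.uk"}
```

With the ping check on, only the services on the pingable host are tested. With it off, all three are tested, including any on a host that has genuinely vanished.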

Now, if you will excuse me, I had better stop writing about Nagios and go back to configuring it. As someone once said: a little less conversation, a little more action please.

[*] For the pedants... the ping command on Linux runs until you stop it and then prints statistics. You will need to find a Solaris system or a Cisco switch to actually see this message.

Wednesday, 13 October 2010

On a mission

Over the last week or two I've been on a mission to gather stats on just about everything you can think of here at the NGS. Some of this was at the request of our funders and some of this was a "stocktaking" exercise here at the NGS.

One area that has become more important for the NGS is funding and PIs. Our funders (JISC and EPSRC) want to know who funds the researchers that use our resources: funding councils, universities, industry etc. It certainly builds up an interesting picture for us! In turn the funding councils want to know which of their researchers use the NGS. Now while we know who our users are, many of the research councils are interested in who the PIs are.

Obviously this is information that we have to rely on being provided by the users when they complete the application form to use our resources. We recently updated the form to make the funding field and PI field compulsory no matter if you are a new user or are asking for a renewal. It only takes 2 secs to fill in this information and helps us tremendously in keeping track of our usage and stats.

We have had some teething problems with PhD students putting their own names down instead of their supervisors' and users performing biomedical research putting the AHRC down as their funding council! However we're sure we can tidy up the database with some help from our users so please help us to accurately populate our databases.

Sunday, 10 October 2010

The Bottom Line

Interesting thing about the "Science is Vital" rally yesterday in London (I was there, man). Most of the speakers spoke about medicine. Is it easier to explain to a politician (or a member of the public) the benefits of science from the point of medicine, when lives are directly affected?

I was thinking about the work that researchers do on the NGS, and it seems entirely plausible that in-silico drug screening, modelling bloodflow through the brain, or defibrillation of the heart will save lives someday, if they haven't already (and there's much more!)

While this work has a high impact on the individual whose life it saves, it is more subtle to assess the economic impact of some of the other work. What use is it to know how dinosaurs walk? If we find the Higgs (over on GridPP), will it make us all richer? What is the use of astronomy? A point that was made yesterday is that we can't say beforehand what is "good" and what is not. Impact can come from the strangest places. Knowing how dinosaurs walk would make the next dinosaur-themed blockbuster more interesting. The recent work which led to a Nobel prize in physics involved (more or less) pencils and sticky tape. The particle accelerators and the Universe are laboratories for the very small and the very large (in fact the Universe is also a particle accelerator), and they both lead to an increased understanding of the laws of physics, which in turn lead to many other benefits.

It was left to Simon Singh to remind us of maths, and the lives saved by cryptanalysis during WWII when Enigma was cracked. (I might add the same kind of maths is also the foundation of e-Commerce.)

The more subtle benefit is that if and when we solve these problems, we also have the infrastructure to solve them (like the NGS), and we know how to do it (e.g. CFD, or computational modelling, visualisation, steering, etc.)

And of course doing exciting things like watching planets and stars, or seeing dinosaurs walk, helps attract young people into a science career, the next generation of researchers.

Saturday, 9 October 2010

Nagios, MyEGEE and MyEGI

[With thanks to Cristina Del Cano Novales at STFC]

The story so far... early last month we started to deploy WLCG Nagios as a replacement for the existing INCA testing service.

STFC set up the servers and deployed the latest published version of WLCG Nagios before handing the baton to Leeds. Leeds are configuring Nagios and the ecosystem of software that supports it.

And there is a lot of supporting software.

In addition to Nagios itself, the NCG configuration tool and the plugins that actually do the work - all of which were described in a previous post - the WLCG package includes...
  • pnp4nagios to keep and display historical records of performance metrics.
  • an ActiveMQ message bus to pass data to wherever it needs to be. STFC have cunningly configured this on the development box so that it can only talk to itself.
  • a MySQL database to keep status information.
There are two web tools called MyEGEE and MyEGI that allow people to extract the bits of information in that database that are relevant to them.

Both are built on the kind of general purpose frameworks that have sprung up since people started bandying around the term 'Web 2.0' as if it meant something. Much of what is seen as Web 2.0 is - at its heart - some way of reading, or updating, a database from a suitably interactive and pretty webpage. Frameworks to do database wrangling are available for all popular languages and quite a few unpopular ones.

MyEGEE is written in PHP on top of Zend Framework; MyEGI is written in Python and based around Django.

According to a presentation at the EGI Technical Forum - highlighted by one of our colleagues from STFC - MyEGI will replace MyEGEE in the next few months. The developers are expecting to produce the first official MyEGI release in November.

Tuesday, 5 October 2010

Gathering pace

Preparations for the NGS Innovation Forum are quickly gathering pace. Our call for poster abstracts closed on the 24th of September and the Programme Committee meeting to discuss the abstracts takes place tomorrow. If you have submitted an abstract for the event you will hear the outcome by the end of this week so watch your inbox!

I've also confirmed the last of the speakers so the agenda is now complete although missing one presentation title which will be with us very soon hopefully so watch this space...

In the meantime I can announce details of some other presentations at the event -

NGS site presentation - What the University of Westminster gained from being a NGS Partner Site - Gabor Terstyanszky, University of Westminster
The University of Westminster joined NGS in 2007 as a partner site. Being a partner site is very different than being a core site and this presentation will overview the challenges and experiences of the University of Westminster as a partner site. The presentation will also outline how we identify prospective users at the University of Westminster and what kind of application support and technical services we provide for local users to port and run their applications on NGS resources.

Federated Access to NGS resources - Mike Jones, NGS, University of Manchester
This talk will demonstrate how to use NGS resources using your institutional login credentials (via the UK Access Management Federation). It will describe how the UK's two main eScience authentication systems are combined to form an easy to use yet robust identity management environment. It will discuss how this mechanism links together with system, project and VO registration procedures.

The registration for the event will close on the 12th of November so make sure you register before then!

Saturday, 2 October 2010

Secret messages

Listen very carefully, I will say this only once.

This posting is about keeping secrets. As such, information will be provided on a strictly need to know basis.

And what you need to know is that...

We may (or may not) be transferring some data from somewhere that might (or might not) be near Didcot to somewhere else that might (or might not) be Manchester.

Some people might speculate that this is something to do with job accounting data.

While it is true that there are a number of GridPP sites that are NGS partners and it is true that these sites will run jobs on behalf of users within the ngs.ac.uk virtual organisation and it is true that the amount of CPU time used by these jobs is recorded in the EGI APEL accounting service - we obviously can neither confirm nor deny that APEL data is now sent to the NGS and recorded against users' records in the User Account Service.

So, putting aside the cloak and dagger for the moment, what is the big secret?

Like many secrets it is very, very dull. We are indeed passing accounting data around. As far as data protection rules are concerned, accounting records count as personal information so we have to ensure that they are never left sitting around unprotected.

As we are not currently able to directly link the NGS and the APEL accounts database, we needed to find a delivery mechanism that is both secure and which could be automated.

The solution we chose was to encrypt the data and put it somewhere where it can be collected at regular intervals.

The encryption is done with OpenSSL...
  
export TOP_SECRET="if we told you, then it wouldn't be a secret any more."
openssl enc -e -blowfish -in secret-data.plain -out secret-data.enc -pass env:TOP_SECRET

and - when the file is transferred - the data can be decrypted with

export TOP_SECRET="that password in the line above that we are still not telling you"
openssl enc -d -blowfish -in secret-data.enc -out secret-data.plain -pass env:TOP_SECRET
This is symmetric encryption so now we need to play pass the password.

This is where certificates come to the fore as they support asymmetric or public/private key encryption. As long as the password is shorter than the key length, it can be encrypted using the recipient's certificate so that only the certificate holder can retrieve it. The command is...

echo "$TOP_SECRET" | \
openssl rsautl -encrypt -certin -inkey someones-public-cert.pem \
-out password.enc
The password.enc file can be sent to the certificate holder who can unscramble it using something like
  
openssl rsautl -inkey /path/to/my/userkey.pem -decrypt -in password.enc
This can be used to update the password securely and ensure that all the secrets stay secret.
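
The remark that the password must be shorter than the key length can be made a little more precise. The Python sketch below assumes the PKCS#1 v1.5 padding that openssl rsautl uses by default, which takes 11 bytes of overhead out of each RSA block, and it allows for the newline that echo appends. The key sizes are just examples.

```python
PKCS1_V15_OVERHEAD = 11  # bytes of padding needed by PKCS#1 v1.5 encryption


def max_plaintext_bytes(key_bits, overhead=PKCS1_V15_OVERHEAD):
    """Largest message, in bytes, that fits in one RSA block of this key size."""
    return key_bits // 8 - overhead


def password_fits(password, key_bits):
    """True if the password, plus echo's trailing newline, fits in one block."""
    needed = len(password.encode("utf-8")) + 1
    return needed <= max_plaintext_bytes(key_bits)
```

So a 2048-bit key leaves room for 245 bytes and a 1024-bit key for 117, which is plenty for a password but a reminder of why the bulk data is encrypted symmetrically.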

That is all. This blog posting will self destruct in 30 seconds...

Monday, 27 September 2010

Mind the gap!

With this year's All Hands over, it is worth going through the notes, and summarising at least the plenaries for the benefit of those who didn't have a chance to go. I am sure 2010 will be remembered for the amazing quality of the plenaries, and quite a lot of excellent content in the parallel sessions as well.

Overall there was a theme of "bridging the gap" - or is it a chasm? - to take software from its initial deployment with limited uptake to a broader user base, the place where prototypes either fail or become successful. Prof Dan Atkins from UMich spoke about this chasm, and how to bridge it, and about the UK's e-science and its impact.

Prof Alex Szalay from Johns Hopkins spoke about coping with data volumes in the context of Amdahl's law, i.e. the advantages - or (eventually) lack of advantages - of parallelisation, and the importance of simulation. We can today simulate things made of 1,000,000,000 interacting pieces, which was seen as impossible just a short time ago, and we expect to soon be able to process 1000 times that - so simulations and the data they generate will have an increasingly large impact on research and society as a whole.

Our very own prof, Carole Goble, spoke about the long tail scientist, all the many who do research but are not part of big groups, and how this tail is getting "fatter" (or how to make it fatter), and even getting "normal" people involved like Galaxy Zoo did. To do this, you (i.e. us: infrastructure and software people, e-scientists) need to "walk in their shoes", and to reward people for engaging and sharing. There are no prizes for second place or for developing standards, so people feel possessive about their work, and they may not realise the benefits of sharing and collaborating until they've tried.

I can only encourage you to browse the programme for presentations. Dan Atkins' plenary even contains a transcription! Make yourself a cuppa and go read them now.

Friday, 24 September 2010

Santa's little helpers

In a blog post published in The Guardian, Lily Asquith, a particle physicist working at Argonne National Laboratory, described her experience of using the grid as
A bit like sending a letter to Santa (you have no idea where it is going and you can be fairly sure you won't hear anything back).
I suppose that makes us Santa's little helpers: less of a National Grid Service, more of a National Elf Service.

Dr. Asquith is not impressed by the cryptic errors that we - the grid world - inflict on users. These are even less amusing than that last attempt at a joke.

Her example was "Lost Heartbeat" but there are many more. Those of us who answer NGS helpdesk tickets are familiar with people asking why the WMS said...
Standard output does not contain useful data. Cannot read JobWrapper output, both from Condor and from Maradona.
and people running grid software quickly get used to seeing the classic
GSS failed Major:01090000 Minor:00000000 Token:00000003
These are nothing to do with large birds, Argentinian footballers or unsuccessful members of the military. Roughly translated these mean 'Sorry... your job went missing' and 'Oops... invalid certificate' respectively.

In their favour, at least these error messages are obscure enough to give sensible answers when fed to Google.

So why are we so bad at telling people what has gone wrong?

In part, this is because so many things have to go right for a job to run: data has to be delivered to the right place, the right software needs to be available and the machine doing the processing needs to be behaving itself. The end result of any failure of any part is the same - the job failed.

The same applies to the letter to Santa. All you know is that you didn't get what you asked for. You will never know if this was because your letter was eaten by a reindeer, or if it was dropped down a chimney or simply that you happen to be on the Naughty List this year.

The situation is not helped by the body of very general purpose code that is buried deep within grid software. That GSS message about the failed major, for example, comes from an implementation of the 'Generic Security Service' Application Programming Interface.

The GSS-API is meant to be able to handle any mechanism for securing network traffic - it handles Kerberos in the same way that it handles certificates. JANET(UK)'s 'Project Moonshot' plans to use it in conjunction with some of the technology behind Shibboleth.

The thing about generic interfaces is that they tend to return generic errors: basically saying that something built around the interface went wrong - go look there instead. That is great for developers but confusing for users.
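
To see just how generic, the 'Major' number in that GSS message can be pulled apart. The Python sketch below assumes the eight digits are the GSS-API major status printed in hexadecimal and laid out as in the C bindings (RFC 2744): a calling error in the top byte, a routine error in the next byte and supplementary bits in the low 16 bits. Only a handful of the routine error names are listed.

```python
# A few of the routine error codes defined in the GSS-API C bindings (RFC 2744).
ROUTINE_ERRORS = {
    7: "GSS_S_NO_CRED",
    9: "GSS_S_DEFECTIVE_TOKEN",
    10: "GSS_S_DEFECTIVE_CREDENTIAL",
    11: "GSS_S_CREDENTIALS_EXPIRED",
    13: "GSS_S_FAILURE",
}


def split_major_status(hex_digits):
    """Split a major status, given as hex digits, into its three fields."""
    status = int(hex_digits, 16)
    return {
        "calling_error": (status >> 24) & 0xFF,
        "routine_error": (status >> 16) & 0xFF,
        "supplementary": status & 0xFFFF,
    }


FIELDS = split_major_status("01090000")
ROUTINE_NAME = ROUTINE_ERRORS.get(FIELDS["routine_error"], "unknown")
```

Under those assumptions the classic message decodes to routine error 9, a defective token: accurate as far as the interface is concerned, and no help at all to someone whose certificate is the real problem.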

The grid can be a scary thing to use. It is complicated. It will go wrong. If we want it to be less scary, we need to learn how to go wrong, better.

Tuesday, 21 September 2010

Attention, attention! Call for poster abstracts closes this Friday!

An extra special reminder that the Call for Poster Abstracts for the NGS Innovation Forum closes this Friday. If you or anyone you know would like to submit a poster abstract to this event please make sure that you submit your 200 word contribution to the NGS website by 5pm on Friday (24th).

Remember that there will be a prize for the best poster as voted for by the delegates and all abstracts will be peer reviewed by the Innovation Forum Programme Committee. We would like to encourage all users to submit a poster and to attend the event in order to hear about the latest developments and tools from the NGS and also to leave with the knowledge of how to apply these tools in their research. Delegates are welcome to attend for one or both days of the event.

Registration for the event is also open now and further details are available on the event page on the NGS website.