Friday 25 February 2011

See SPOT run

Just when you thought it was safe to go back to the Internet, up pops another Acronym.

Meet SPOT.

You won't - yet - find SPOT in the Grid Acronym Soup because it is not one of ours. It is an escapee from another of the great sources of Acronyms - the world of IT Business Systems.

SPOT stands for Single Point Of Truth.

A SPOT is not a single genuine fact that slipped into a Business Systems sales pitch. Nor does the name imply that Business Systems remind those involved of an inflamed, infected ball of pus. SPOTs are, in general, good.

If you have a SPOT, then you know there is one-and-only-one definitive source for any piece of information - whether it is a price, a name, a salary or an office number.

As anyone who has dealt with a large organisation will appreciate, we do not have as many SPOTs as we should.

On the Grid - which is large, diverse and dispersed by its nature - single points of truth are very hard to find.

Which makes deciding what we should test with Nagios... interesting.

The NCG configuration generator, which writes the Nagios configuration, needs to know:
  • What sites to test.
  • What services to test at those sites.
  • What tests to run for each service.
Perhaps the closest we have to a SPOT is the Grid Operations Centre Database or GOCDB - which lists every site on the European Grids, their official downtimes and some of the services they provide.

The 'some of' is there because the GOCDB defines services in terms of service endpoints - which represent a host within a site acting as, say, a Compute Element or a GSISSH server or a GridFTP server.

There are a comparatively small number of predefined endpoints and these will never cover everything a site can offer - you cannot, for example, advertise an iRODS service.

The GOCDB does not directly provide information about the Virtual Organisations that a service is prepared to support, but it should point anyone wanting this information at a site information service willing and able to provide it.

For our first attempt at a WLCG-like Nagios service...
  • We collect the list of sites from the GOCDB - we take any site flagged as belonging to the NorthGrid, SouthGrid, Scotgrid or London Tier 2 subgrids within the UK and Ireland Region (there is a sketch of this sort of query after this list).
  • We only test services for which GOCDB service endpoints are defined.
  • We define the tests for each endpoint within the Perl code of NCG. There is a 'standard' set of tests defined within a Perl module called NCG::LocalMetrics::Hash which forms part of the NCG package.
    We modified the module to include local changes from an NCG::LocalMetrics::Hash_local module - a change that has been adopted by the NCG maintainers.
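
For the curious, here is a rough sketch in Python of the kind of site-list query the first step involves. The GOCDB exposes its data through a programmatic interface; the endpoint URL, method name, parameter and region names below are illustrative assumptions, not a copy of what NCG actually does.

# A minimal sketch: fetch the list of sites for a region from a GOCDB-style
# programmatic interface and print their names.  The URL, method and
# parameter names are assumptions for illustration only.
import urllib.request
import xml.etree.ElementTree as ET

GOCDB_PI = "https://goc.egi.eu/gocdbpi/public/"   # assumed public endpoint

def sites_in_region(region):
    """Return the site names the GOCDB reports for the given region."""
    url = "%s?method=get_site_list&roc=%s" % (GOCDB_PI, region)
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    # Each <SITE> element is assumed to carry a NAME attribute.
    return [site.get("NAME") for site in tree.getroot().iter("SITE")]

if __name__ == "__main__":
    for site in sites_in_region("NGI_UK"):   # placeholder region name
        print(site)
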
As an approach, it works well enough for Nagios tests but not for the friendly front end, MyEGI.

MyEGI gets its truth from elsewhere: from the Aggregated Topology Provider (ATP). The ATP is a sort of single point of single points of truth. It swallows data from the Metric Description Database (MDDB) and from Virtual Organisation feeds. It is scarily complicated in places - as you might be able to gather by looking at the MDDB and ATP database schemas.

The European Grid Initiative exists to make the grid work better, in part by giving us nicer SPOTs, and is encouraging development of the ATP and friends. The curious can find out more on the SAM and Nagios wiki pages at CERN.

The Single Point of Truth is Out There....

Wednesday 23 February 2011

Ch-ch-ch-changes as David Bowie would say...

Hopefully all NGS users are now aware of the forthcoming changes to the NGS service from the end of March 2011. There will be a slight reduction in the number of free-to-use cores available to NGS users, and individual CPU allocations will be reduced to reflect this.

If you haven't already checked, please read the full announcement, which is available on the NGS website. I would also strongly recommend ensuring that you are subscribed to the NGS news list, which is the first port of call (along with the website) for all news about the forthcoming changes.

In the meantime, if you have any concerns about the changes, please contact the NGS helpdesk. We are there to help and to answer any queries you may have about how the changes will affect you.

Thursday 17 February 2011

Leaving and Joining

Most NGS users should now be aware of the major changes due at the end of March 2011 - when some of the machines providing free-to-use CPU time will be retired.

The NGS cluster at Leeds - ngs.leeds.ac.uk - is among those that have reached the end of their useful life. It will be removed from service on 31 March 2011.

This does not mean that Leeds as a site is dropping off the Grid forever. We will be back...

The NGS cluster is one of a number of systems managed by the Research Computing group within the Central IT Service at Leeds. It is also - by modern standards - one of the smallest. In compute terms, it is dwarfed by the one in the room next door - which boasts 4000 CPUs, 7 TB of memory and 100 TB of fast disk - and goes by the name of ARC1.

ARC1 is so big because it is two computer clusters rolled into one.

Around half the cluster was funded by the University for use by local researchers - for whom applications such as DL_POLY, AMBER and CASTEP have been installed.

The rest comes from a UK-wide consortium of Solar Physicists - so there is a need for people from outside Leeds to use the service safely and securely. Cross-site access is why the National Grid Service exists. We can do that.

While it is primarily aimed at the Sun spotters, Leeds has kindly offered some CPU time on ARC1 to the NGS. If, that is, the NGS can get ARC1 onto the Grid.

If we can... applications installed locally will be made available to external users - where licenses permit. The users we are expecting are those who currently access resources via the UI/WMS.

So what do we need...?
  • A standard way of presenting the applications to the world. We can do that... it is why the Uniform Execution Environment was invented.
  • A way of limiting access to licensed applications to the right people. We can do that too.
  • A means of accepting requests from the UI/WMS. Taking the lead from the particle physics community, we are looking at CREAM as deployed by gLite.
  • An information service that lets people outside see the state of the system and the applications available - the obvious choice here is the BDII, widely deployed and also available from gLite (there is a sketch of such a query after this list).
  • A way of accounting for use in APEL. We may send data directly, or indirectly via the NGS accounting service and RUS records.
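
As a flavour of what the information-service piece looks like in practice, here is a minimal sketch, assuming the python-ldap package, of asking a BDII which software tags a cluster publishes - the sort of thing the Uniform Execution Environment relies on. The host name is a placeholder, not a real NGS endpoint.

# A minimal sketch, assuming the python-ldap package: query a BDII for the
# software tags published by each sub-cluster.  The host name below is a
# placeholder.
import ldap

BDII = "ldap://bdii.example.ac.uk:2170"     # BDIIs conventionally listen on 2170

conn = ldap.initialize(BDII)
results = conn.search_s(
    "o=grid",                               # conventional Glue base DN
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueSubCluster)",
    ["GlueHostApplicationSoftwareRunTimeEnvironment"],
)

for dn, attrs in results:
    for tag in attrs.get("GlueHostApplicationSoftwareRunTimeEnvironment", []):
        print(dn, tag.decode())
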
Whatever we build will need to work with what is now called Oracle Grid Engine - the local batch management system - and within a highly specialised and customised Linux environment that is significantly different to that used by the Worldwide LHC Computing Grid.

As a first stage, we are going to deploy a separate (virtual) machine - running Scientific Linux and using packages from gLite - to act as the link between ARC1 and the grid.

We want to produce - and make available to others - a kickstart configuration and associated scripts that automate as much of the installation as possible.

It will certainly produce a blog post or two.

Leeds has been involved in the NGS for a long time: we have hosted two generations of NGS cluster; been involved in the NGS Outreach, Research and Development and Monitoring activities and created automatic installation systems for deploying and configuring Grid software.

We've had a lot of practice. It certainly hasn't made us perfect but, if you will excuse what sounds like Yorkshire pride coming from someone who is technically an Essex Boy - if anyone can do it, Leeds can.

Tuesday 15 February 2011

Are you a researcher looking for a partner?

If so then you really should consider coming to the dating event of the year! Well, okay, it’s not quite a dating event but matchmaking is definitely part of the plan of the Software Sustainability Institute’s Collaboration Workshop.

It will be held in Edinburgh at the beginning of March and is described as “the perfect opportunity to meet the researchers, software developers, funders and other software experts that can help you advance your research”.

If you are a researcher who uses software but could do with some help from a computer scientist to make the most of your software, come along and advertise yourself and your situation / problem. You can give a lightning talk or present a poster and then work through the issue with other attendees in small break out groups.

There have already been some break out group discussion topics suggested BUT one of the best things about this meeting is that the agenda constantly changes depending on the interests of the attendees, the emergence of new discussion threads and tangents. There’s never a dull moment!

I’ve attended several of these meetings and I’ve thoroughly enjoyed every single one. No death by PowerPoint and plenty of discussion and positive solutions instead! The constantly changing agenda works really well as we report back to the assembled attendees after every session and then decide on the next set of discussion topics. I’ve always met new and interesting people and come away fully enthused.

If you are a developer there are some free places up for grabs but the deadline is this Friday (18th Feb) so don’t delay!

Friday 11 February 2011

Missing the message bus

[With thanks to Konstantin Skaburskas.]

Two weeks ago, we were very nearly at the point where we could deploy WLCG Nagios and phase out our existing testing service.

We had created our own tests and worked out how to add them to the bundle of tests that are sent out onto the Grid.

The tests were actually being run on remote sites.

All that was missing was - well - a big chunk of the test results.

When the tests landed on gLite-based sites - everything worked as expected. In other places - the tests ran... but resolutely refused to let anyone know the results.

We have now found the missing messages - after losing them twice on the way.

It was all due to subtle differences in the environment variables defined at a site. The fix is to set two environment variables by adding something like...

Environment = {
"OSG_HOSTNAME=<jdlreqceinfohostname>",
"LCG_GFAL_INFOSYS=bdii.ngs.ac.uk:2170"
};
to the template used to generate the JDL file that describes the test.

To understand why, you need to understand how the tests on remote hosts are run. The hard work is done by a script called nagrun.sh, which:
  • unpacks the bundle of tests and configures them for the local machine.
  • runs them using a bundled copy of Nagios.
  • translates the test results into messages.
  • sends the messages to a message broker - which shoves them on the message bus back to the Nagios server.
At the Nagios server end, each message is unpacked and fed to the central Nagios as a passive test result.
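
For anyone unfamiliar with passive results, the final hand-off is pleasingly low-tech: a line is written to the Nagios external command file. Here is a minimal sketch, with the command-file path, host and service names as placeholder assumptions rather than our real configuration.

# A minimal sketch of submitting a passive service check result to Nagios
# through its external command file.  The path, host and service names are
# placeholders; the real consumer takes them from the message bus.
import time

CMD_FILE = "/var/nagios/rw/nagios.cmd"     # assumed location of the command file

def submit_passive_result(host, service, status, output):
    """status codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN."""
    line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(time.time()), host, service, status, output)
    with open(CMD_FILE, "w") as cmd:
        cmd.write(line)

submit_passive_result("ce.example.ac.uk", "org.sam.CE-JobSubmit",
                      0, "Job completed successfully")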

If LCG_GFAL_INFOSYS is missing, the tests never make it to the message broker; if OSG_HOSTNAME is missing - they are ignored when they reach the Nagios server.

It is the WLCG Nagios for a reason - it was designed to test machines that sat within the Worldwide LHC Computing Grid. One of its roles was to serve as a replacement for the older 'Service Availability Monitoring' (SAM) tests.

It makes the - perfectly logical - assumption that the environment on the machine running the tests will be like that used for the SAM tests.

WLCG has a dedicated network of message brokers. Any host can find a suitable broker to contact by asking its friendly local information service. The environment variable LCG_GFAL_INFOSYS points to the information service.

Some of the sites we are testing sit outside WLCG. We have our own message broker and pass additional information with the tests to direct messages to it.

A subtle bug, which has been fixed in the current release, meant that even though LCG_GFAL_INFOSYS was not being used, it still had to be set. If it wasn't, nagrun.sh could not find a message broker to contact.

So, the messages were making it back to the Nagios server - but the Nagios server was ignoring them.

The reason: the messages sent back are meant to include a reference to the Compute Element (CE) that actually accepted the job. The messages we were sending were all being marked as coming from 'localhost.localdomain' - a dummy name used internally by the Nagios tests.

The nagrun.sh script tries to work out the CE name from the local environment and from the output of certain scripts. If all else fails, it assumes Nagios knows the answer.

The WLCG Nagios developers had encountered this problem before - when running ATLAS tests against hosts on the US Open Science Grid - and had added code that allows an Open Science Grid hostname to be used as a CE name. It expects the environment variable OSG_HOSTNAME to hold that hostname.
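
In outline, the fallback is something like the sketch below. Apart from OSG_HOSTNAME, the sources of the name are illustrative stand-ins rather than the actual nagrun.sh logic.

# A sketch of the kind of fallback used to work out the CE name.  Only
# OSG_HOSTNAME comes from the real script; the other steps are stand-ins.
import os
import subprocess

def guess_ce_name():
    # 1. Try the local environment (an Open Science Grid style hostname).
    name = os.environ.get("OSG_HOSTNAME")
    if name:
        return name
    # 2. Try the output of a local command (illustrative only).
    try:
        out = subprocess.run(["hostname", "-f"], capture_output=True,
                             text=True, check=True).stdout.strip()
        if out:
            return out
    except (OSError, subprocess.CalledProcessError):
        pass
    # 3. Give up and let the Nagios server sort it out.
    return "localhost.localdomain"

print(guess_ce_name())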

We can also report that the WMS administrators have reconfigured the server so it no longer gets clogged up with CREAM jobs - and the CREAM CE tests are now running via the WMS - as WLCG intended.

We are now ready to deploy WLCG Nagios - unfortunately without the MyEGI friendly front end - and make it available to site administrators.

We will describe how we decide which sites to test and what tests to run in a future posting.

At which point, Nagios related Research and Development will take a break and I will have to find something else to prattle about every couple of weeks.


Tuesday 8 February 2011

It's not about us, it's about you

As Jason mentioned in his blog post last week, we have a number of NGS user case studies on our website and the number is growing.

The user case studies are a collaboration between the user and myself as the NGS Liaison Officer. I work with the user to produce a short case study outlining their research, how they use the NGS and (rather importantly) the benefits that the NGS has brought them.

I also work with users to make their research more accessible to the wider NGS community which can be pretty difficult given the very specialised nature of some of the research performed on the NGS!

Most of the user case studies come about through users volunteering to write a case study, mainly through our annual user survey. For example, this year I had 31 users who volunteered to produce a user case study - a rather large increase on last year's number!

I hope the increase has come about through people realising the benefits of advertising their research to the wider community and not just so they can tick a box on their final project report! I have found that the most enthusiastic communicators tend to be PhD students which bodes well for the future of science communication.

The user case studies don't just stay on our website. With over 2000 hits since they were placed on the website, they are picked up by other dissemination teams UK- and Europe-wide. Quite a few of our case studies have been picked up by iSGTW, which has 7300 subscribers and many more unique visitors to their website (over 125,000 at the last count!). EGI have also become interested in our case studies and are looking to produce something similar for their own project.

And it's not just within e-science / grid / e-infrastructure organisations that the case studies are picked up. Hardly an issue of Scientific Computing World goes by without me getting the NGS mentioned in there, courtesy of one of our users' research being featured! I am often asked by editors if I know of anyone researching X or Y and, thanks in no small part to the user case studies and the NGS Communities service, I can usually track someone down! SCW is free to read online and you can subscribe to a free printed copy as well.

The latest NGS user case study has recently been put on our website - Simulating carbon nanotubules on the NGS by Rebeca Garcia Fandino at the University of Oxford. Watch this space for more new case studies coming soon!

Friday 4 February 2011

While interesting things happen... elsewhere.

If you look through the case studies on the NGS web site - you can see how the grid has helped researchers to study stars, genes and criminal activity; to understand how the brain, heart and nervous system work; to search for new drugs to treat disease; and to study how existing drugs are absorbed by the body. That isn't even mentioning the original goal of the grid - wrangling the enormous amount of data produced when bashing hadrons together in a big pipe under Switzerland.

There is a huge amount of interesting and important research out there.

And a huge amount of very dull - but important - work to be done to enable this research.

Someone needs to make sure the infrastructure is running and keep track of what was used and when. Someone - actually Jens Jensen, one of the contributors to this blog - has to worry about how we look after private keys.

NGS Research and Development's latest excursion to the outer limits of tedium comes courtesy of a project to feed data from our RUS accounting service directly into the EGI's APEL accounting service.

We are already combining data recorded by APEL with that recorded by RUS when updating the CPU usage for account holders within the NGS User Account Service.

In future, we want APEL to be the only place to look for accounting records and, in the great Grid Tradition, this is where things get complicated....

RUS is standards-based. We use the Open Grid Forum Usage Record (UR) format - as defined in great detail in an official specification - to transfer accounting information from individual compute resources to a central store.

UR records can identify the resource which actually did the computing and how much CPU time, wallclock time or memory was used. They do not carry any information about the relative speed of that resource compared to any other random computer on The Internet.

The NGS accounting clients and Grid-SAFE all generate UR-format data.

APEL is more pragmatic and - as it is part of the gLite software stack - much more widely deployed. It does incorporate a measure (typically derived from the SPECMark) of the speed of the CPU that did the work. It is not explicitly tied to a resource.

So - we are creating a tool that takes records sent to RUS and translates them into APEL updates. It will have to fill in the missing scaling factor - if necessary falling back to a 'custom' scaling to warn potential users of the data that we have no idea how fast the computer is.
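
A toy sketch of that translation step is below, assuming the OGF Usage Record 1.0 XML namespace; the element choices, the scaling table and the shape of the output record are simplifications of what the real tool will have to do.

# A toy sketch of the RUS-to-APEL translation described above, assuming the
# OGF Usage Record 1.0 namespace.  The scaling table and the output record
# layout are illustrative simplifications.
import xml.etree.ElementTree as ET

UR_NS = {"ur": "http://schema.ogf.org/urf/2003/09/urf"}

# Known per-resource scaling factors; anything else falls back to 'custom'.
SCALING = {"ngs.leeds.ac.uk": ("si2k", 1000)}

def ur_to_apel(ur_xml):
    record = ET.fromstring(ur_xml)
    machine = record.findtext("ur:MachineName", default="unknown", namespaces=UR_NS)
    wall = record.findtext("ur:WallDuration", default="PT0S", namespaces=UR_NS)
    cpu = record.findtext("ur:CpuDuration", default="PT0S", namespaces=UR_NS)
    unit, factor = SCALING.get(machine, ("custom", 1))
    return {
        "ExecutingSite": machine,
        "WallDuration": wall,
        "CpuDuration": cpu,
        "ScalingFactorUnit": unit,
        "ScalingFactor": factor,
    }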

These updates will be fed to the central APEL service in exactly the same way as those generated by any other APEL client - and eventually find their way back to the User Account Service.

And the users of the service - well, they don't have to care in the slightest... they have far more interesting things to do.

Thursday 3 February 2011

NGS roadshow event at the University of Huddersfield

Last week saw the NGS outreach team at the University of Huddersfield for an NGS roadshow event. The local organiser was Ibad Kureshi, who did a fantastic job of organising everything locally and making sure the day went smoothly. Ibad also did a great job of advertising the event locally, meaning that we had over 30 participants at the morning roadshow and over 20 registered for the afternoon training event.

The roadshow kicked off in the usual fashion with an overview of the NGS from our Technical Director, David Wallom from the University of Oxford. Many people in the audience were not familiar with the NGS before the event or had just heard our name mentioned but were unaware of what the NGS actually does, so this presentation is designed as a gentle introduction!

Following David we moved on to presentations from NGS users, including Paul Martin, who is a researcher at the University of Huddersfield. Paul has been using the NGS for computer modelling of thoria in order to determine its suitability as a next-generation nuclear fuel. He explained how he has used DL_POLY 2, which scales very well on the NGS. Paul in particular praised the good on-line instructions/training, FAQs, blog and helpdesk/support, which is great to hear!

Paul was followed by another user, Matt Smith from the University of Liverpool. Ibad had suggested that a presentation by an Abaqus user would go down very well due to the interest in this software package at Huddersfield. We were happy to oblige! Matt spoke about his use of Abaqus on the NGS to model lattice structures. This was a very interesting talk helped by the sample materials Matt brought along to illustrate his presentation. As Ibad had thought, Matt’s presentation prompted plenty of questions about his usage of the NGS and about his research.

David Fergusson from the training team at NeSC then elaborated on how users actually get started on the NGS and how to apply for a certificate, run jobs etc. Again there were plenty of questions showing that Huddersfield researchers are keen to use our resources!

The final presentation was by Ibad who outlined the locally available resources and how these tied in with the resources offered by the NGS. We’ve found in previous roadshows that having presentations about local resources helps the audience to visualise how the NGS fits in locally and that it is there to complement existing resources.

All the presentations from the event are available on the NGS website from the event page.

Wednesday 2 February 2011

Private keys

The point about public key cryptography is that public keys are public: they are used to prove possession of a secret (namely, the private key) without revealing any information about the secret. This is called a Zero Knowledge proof. In other words, much of the level of assurance in the infrastructure rests on management (not just protection) of private keys.

So this is why you have all generated your keys and protected them with strong passphrases.
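
If you ever need to do this programmatically rather than with the usual command-line tools, here is a minimal sketch, assuming the third-party Python cryptography package; the file name and passphrase are placeholders (pick your own, and make it long).

# A minimal sketch, assuming the third-party 'cryptography' package: generate
# an RSA key and write it out encrypted under a passphrase.  The file name and
# passphrase below are placeholders.
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.primitives import serialization

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

pem = key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.TraditionalOpenSSL,
    encryption_algorithm=serialization.BestAvailableEncryption(
        b"a long, hard-to-guess passphrase"),
)

with open("userkey.pem", "wb") as f:
    f.write(pem)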

Recently there has been some discussion among the grid CAs as to whether the rules can be relaxed, without lowering the level of assurance too much. Many people generate their keys on systems which are maintained by someone else: e.g. your desktop at work, or maybe even a UI.

This leads to the proposed loosening of the rules, or perhaps a better description of existing practices.

It is likely that in the future, we will support private keys generated by:
  • users themselves, on trusted systems (e.g. your own machine, or your desktop machine at work);
  • institutions, letting them pre-generate keys for their users (apparently some like to do this);
  • third parties, e.g. running a credential repository like MyProxy.
However, as with much change, it is easy to introduce new rules without fully understanding the problem. There are serious (but fixable) problems with the draft rules. For example, different CAs interpret "third parties" in different ways. Is the CA a third party? I would have thought not. Would the NGS count as a third party, despite the fact that it runs a CA? Probably.

Anyway; the upshot of this is that private key protection rules will be relaxed. What is currently missing is the in-depth understanding of the security aspects of the lifecycle of the private key. I have soapboxed on this topic before. More on this later. Stay tuned.