Friday 29 October 2010

I'm a monitoring service, let me out of here

When we last covered the development of the new Nagios monitoring service in the blog - before last week's commercial break - we had just convinced it that all the hosts were alive and ready to be tested.

We can now proudly say that we have coaxed the service towards its first official, complete and utter failure.

All those highly motivated people who tell you 'failure is not an option' - ignore them. If you are running a service that tests things, having a test fail means that you actually persuaded that test to run. It isn't failure, it is a different kind of success.

And it is not as easy as it sounds because the Nagios development server is, quite deliberately, kept isolated from the rest of the world. 

This is not a reference to the Harwell Science and Innovation Campus near Didcot, where the people from the STFC e-Science centre who run the service are based, and where the NGS Innovation Forum 2010 will be held.

It is simply that the Nagios development server has limited Internet access - as befits an experimental service. All access to the World Wide Web must be channeled through a web proxy. Privileged access to services is granted only when needed.

Neither the NCG configuration program nor the various tests and probes that Nagios uses were written for an environment with a web proxy. Much of the code is written in Perl, and support for proxies is already present - it just needed to be turned on. The Nagios developers at CERN have already accepted the changes for the next release.
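If you find yourself in the same position, the change mostly amounts to respecting the usual proxy environment variables. A rough sketch - the proxy host and port here are placeholders, not the real ones:

# Point command-line tools and proxy-aware code at the site web proxy
export http_proxy=http://wwwcache.example.ac.uk:3128
export https_proxy=http://wwwcache.example.ac.uk:3128

Perl's LWP::UserAgent will honour these once env_proxy() has been called on the user agent, which - roughly speaking - is the sort of switch that needed turning on.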

With web access granted, NCG could build a complete configuration and the tests that suck information from web sites all began to run.

The next problem was getting permission to do things.

This is a grid. To use a grid, you need a certificate. WLCG Nagios has the wherewithal to download a certificate from a MyProxy Credential Management Service - as long as someone has uploaded it in the first place and no passphrase is required.

The NGS provides a central MyProxy service, and MyProxy allows certificates to be uploaded so that they can be downloaded using another certificate as authentication. The command to do this isn't exactly short:


env GT_PROXY_MODE=old myproxy-init -s myproxy.ngs.ac.uk -l nagios_dev -x -Z '/C=UK/O=eScience/OU=CLRC/L=RAL/CN=nagios-dev.ngs.ac.uk/emailAddress=sct-certificates@stfc.ac.uk' -k nagios_dev-ngs -c 336

... but it works.

Or at least it worked after we had added the certificate DN to the authorized_retrievers and trusted_retrievers entries in the myproxy server configuration file.

... and ensured that the ngs.ac.uk virtual organisation was defined on the Nagios server.
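For the record, those entries look something like this - an illustrative extract only, using the same DN as the myproxy-init command above:

# myproxy-server.config (illustrative extract)
authorized_retrievers "/C=UK/O=eScience/OU=CLRC/L=RAL/CN=nagios-dev.ngs.ac.uk/emailAddress=sct-certificates@stfc.ac.uk"
trusted_retrievers "/C=UK/O=eScience/OU=CLRC/L=RAL/CN=nagios-dev.ngs.ac.uk/emailAddress=sct-certificates@stfc.ac.uk"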

So at long last, the Nagios server could download a certificate, associate it with a virtual organisation and use it to submit jobs via a Workload Management Server.

Which was the point at which we realised that the Workload Management Service endpoint (https://ngswms01.ngs.ac.uk:7443/glite_wms_wmproxy_server) should have been defined in the glite_wms.conf and glite_wmsui.conf files in $GLITE_LOCATION/etc/ngs.ac.uk/.
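For anyone making the same change, the entry in those files is roughly of this shape - a from-memory sketch of the classad-style syntax rather than a verbatim copy of our configuration:

WmsClient = [
    WMProxyEndPoints = {"https://ngswms01.ngs.ac.uk:7443/glite_wms_wmproxy_server"};
];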

With that final hurdle overcome, the test jobs started to flow.

The Compute Element tests were sent to sites declaring themselves as Compute Elements - including the original NGS core sites at Leeds and RAL.

I admit to rigging it so that Leeds was tested first. The little status box went Green as the job was submitted, then Red as it failed with a friendly:

- Standard output does not contain useful data.Cannot read JobWrapper output, both from Condor and from Maradona.

Sometimes an Argentinian footballer and a large scavenging bird can make your day.



Wednesday 27 October 2010

Arts and Humanities and ICT

Following on from my previous blog post about the JISC Future of Research? conference, the next parallel session I attended was “Evolution and Revolution in ICT and Arts and Humanities research”.

The first presenter was Simon Tanner from King’s College London who spoke about some JISC digitised collections. No slides from this one I’m afraid as he showed us pretty pictures instead!

The next speaker was from Mimas, who spoke about “long tails and efficiencies of scale”. The slides for this one are available, and of particular relevance to me and the NGS was the observation that services have to show:

  • User demand
  • Benefits
  • Impact and value
  • Sustainability

This is something that the NGS has been working on recently and will continue to do as we come up for refunding next year. You may have seen a small flurry of stats etc. on the website, and behind the scenes we have been gathering more from the wealth of data available in usage statistics and user applications. From this we are building up a picture of user demand, impact, value and so on.

We hope that our users will help us by contributing to our forthcoming annual user survey and our follow-up roadshow survey. If you have attended any NGS roadshows we would be grateful if you could complete the short survey - it only takes about a minute! This will help us shape future roadshow events and also analyse the benefits that users get from attending these events.

The final speaker was John Coleman from the University of Oxford who presented on “Large Scale Computational Research in Arts and Humanities”. He started with the interesting fact that in 2008 YouTube was the second most popular search engine, with people looking for speech and audio instead of text. He is currently working on a JISC-funded project to “mine a year of speech”, which aims to annotate a year's worth of data in the form of a corpus. To do this he is currently using about 20 computers set up as a cluster in a local lab. However, he is now looking at placing the data at several other universities, which he says is “like grid computing”. John also highlighted a report available on "ICT Tools for searching, annotation and analysis of audiovisual material" which may be of interest to people.

The JISC Future of Research? conference had some very interesting parallel sessions and I hope it continues next year!

Tuesday 26 October 2010

Centralising your IT support - thoughts from the JISC Conference

Last Tuesday the NGS was out on the road with an exhibition stand at the JISC Future of Research conference which was held in London and online.

As well as speaking to delegates browsing the exhibition stands, I also went to some very interesting parallel sessions. The first session I attended was “Centralising your IT Support for Research” and consisted of 3 presentations, including one from Mary Visser, Director of IT at the University of Leicester. Mary talked about how researchers want IT that is “free at point of use”, as funders are unwilling to pay FEC for it, seeing IT as a basic facility which should be provided by the institution. Leicester currently have an IT research liaison manager who speaks to researchers about their IT needs and provides guidance and advice about local and national resources. Sounds like something that many of our users would like!

Mary’s case for a centralised IT support service was countered by Rob Procter from the Manchester e-Research Centre, who argued for more distributed IT provision within schools and departments. Rob pointed out that many researchers do not trust IT services to provide what they need, as they are teaching-focused with most of their effort going in that direction rather than towards research. Rob's argument for embedding IT staff in departments is certainly one I’ve heard many times before, and I’ve also seen some very good results and collaborations come out of these situations.

Here at the NGS we realise that not all universities have IT services that can help with grid computing or even the use of computing in research. The NGS can’t have technical staff based in every institution in the country but we do have a variety of means to try and help from afar.

Check this list to see if you have a local Campus Champion who can provide some advice or support to you in your institution. If you don’t have a Campus Champion, we have our helpdesk where knowledgeable staff can answer your queries by email. We also have a variety of tutorials to talk you through getting started and running jobs. If there is anything else you would find useful as regards training material then please let us know!

Friday 22 October 2010

Now wash your schema

In that strange parallel universe that exists only in TV adverts: two friends sit in a remarkably spacious and clean kitchen, sipping low-calorie-but-surprisingly-tasty beverages, and talking.

And what are they talking about? Which wonderful washing powder washes whites whitest.

If we lived in their world, I would be able to introduce:

The all-new NGS schema washing service - removing unpleasant stains from your grid information and leaving it huggably soft and smelling of Summer Meadows.

and not feel like a complete idiot.

You really don't want to have dirty schema. Not only will your friends talk incessantly about it when they visit for a cup of low-calorie-but-surprisingly-tasty beverage - your site will be completely ignored by the UI/WMS Resource broker.

The technical details were covered in an NGS surgery earlier this year.

It is all to do with the Grid Information Services through which sites publish static information about the hardware and software available and dynamic information such as the number of running jobs or the amount of free disk space.

Information services feed off one another. A site service would collect and combine all the information from the compute and data services within the site.

Data from all the sites is gathered together into a central service where it can be used by the UI/WMS, and the load monitor and applications page on the NGS website.

The information is passed around using LDAP.
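If you are curious about what a site is publishing, you can ask its information service directly. Something along these lines - the host and site names are made up, and 2170 is the conventional BDII port:

# Dump everything a site BDII publishes
ldapsearch -x -H ldap://site-bdii.example.ac.uk:2170 -b mds-vo-name=EXAMPLE,o=grid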

LDAP represents things as Objects. Each object will be a member of one or more Object Classes. Each object class is associated with a set of Attributes. An attribute is a label and one or more values.

Beneath any LDAP service there is a set of schema. A schema defines which attributes can be defined for an object of a particular class.
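As a purely made-up illustration - the names follow the GLUE schema conventions but this is not real NGS data - an entry describing a Compute Element might look like this in LDIF:

# A fictional Compute Element entry, GLUE 1.x style
dn: GlueCEUniqueID=ce.example.ac.uk:2119/jobmanager-pbs-long,mds-vo-name=EXAMPLE,o=grid
objectClass: GlueCE
objectClass: GlueCEInfo
GlueCEUniqueID: ce.example.ac.uk:2119/jobmanager-pbs-long
GlueCEInfoTotalCPUs: 128

The schema is what says that an object of class GlueCE may carry a GlueCEUniqueID, that GlueCEInfoTotalCPUs holds a number, and so on.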

The problems appear when the data passed around does not follow the schema, or follows a slightly different schema. Older publishing software will take anything. Newer software is routinely configured to be far more strict and will silently drop any object that does not fit the schema.

A number of NGS partner sites are using a rather elderly version of the Virtual Data Toolkit to publish data. The NGS central service was recently updated to the latest, greatest and strictest version of the Berkeley Database Information Index (BDII).

The VDT and BDII schemas differed by one small detail: two object classes called GlueTop and GlobusStub existed only in the VDT version. Neither GlueTop nor GlobusStub has any attributes directly associated with it, so their presence did not affect the content published. It was just that a reference to them was enough for the VDT-flavour data to be ignominiously dumped before ever reaching the central BDII.

Information from sites simply disappeared.

But... the BDII is perfectly capable of collecting 'dirty' data, removing the extra object classes and similar quirks and republishing it as clean, fresh data (with a hint of Melon and Lotus Flower). All it needed was a touch of FIX_GLUE.

FIX_GLUE is a BDII configuration option, originally intended to be used at a site level, that turns on the data cleaning. An NGS staff member at STFC realised that this same approach would work as a national service - and the schema washing service was born.
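We won't vouch for the exact file or syntax - both depend on the BDII version you are running - but in spirit it is a one-line switch in the BDII configuration, something like:

# Illustrative assumption only: check your BDII version's documentation for the exact form
FIX_GLUE=yes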

[27 Oct: Edit to improve the description of how the washing works. With thanks to Jonathan Churchill at STFC.]

Friday 15 October 2010

It's alive

If you work at one of those institutions that let users choose names for their own computers, it is inevitable that, sooner or later, someone will claim the name 'Elvis'.

Occasionally, this will be because they like his music. In most cases, it is because they want to be able to run 'ping elvis' and be told that, despite the events of August 16 1977:
 elvis is alive [*]

'Ping', the friendly name of an ICMP echo request packet, was invented as a way of testing network connectivity. The original idea was that if a machine on the Internet was working and it was pinged, it should send back the contents of the 'ping' to the sender as an ICMP echo reply.
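In its simplest form it looks like this (the -c 1 stops Linux's ping after a single packet, for reasons the footnote explains):

# Send one ICMP echo request and wait for the reply
ping -c 1 elvis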

Ping dates from when the Internet was a smaller and nicer place. These days, pings are seen as a security issue and are frequently blocked at campus and departmental firewalls.

So, once Elvis has left the building, you may never know if he is still alive.

This is a real issue for the current WLCG Nagios deployment. The Nagios service is hosted at the Rutherford Appleton Laboratory but the sites that make up the NGS and GridPP are spread around the country behind many different firewalls. Some hosts can be pinged from off site, some can't.

Nagios has the concept of hosts and services provided by those hosts. It will only check services if the associated host is working. The usual way of testing a host is by sending a ping. If pings are not permitted, no service on that host is tested.

The NCG utility that generates Nagios configurations can use a dummy test in place of a ping test for all hosts. To enable this, the /etc/ncg/ncg.conf configuration file needs to be changed to include a line setting CHECK_HOSTS to zero:

<NCG::ConfigGen>

<Nagios>
...
# Disable 'ping' checks of hosts
CHECK_HOSTS=0

</Nagios>
</NCG::ConfigGen>

When ncg is run and a new Nagios configuration built, all services on all hosts are tested. On the down side, if a host really has dropped off the network, Nagios will continue to test the services and generate alerts.

Now, if you will excuse me, I better stop writing about Nagios and go back to configuring it. As someone once said: a little less conversation, a little more action please.

[*] For the pedants... the ping command on Linux runs until you stop it and then prints statistics. You will need to find a Solaris system or a Cisco switch to actually see this message.

Wednesday 13 October 2010

On a mission

Over the last week or two I've been on a mission to gather stats on just about everything you can think of here at the NGS. Some of this was at the request of our funders and some of this was a "stocktaking" exercise here at the NGS.

One area that has become more important for the NGS is funding and PIs. Our funders (JISC and EPSRC) want to know who funds the researchers that use our resources - funding councils, universities, industry and so on. It certainly builds up an interesting picture for us! In turn the funding councils want to know which of their researchers use the NGS. Now while we know who our users are, many of the research councils are interested in who the PIs are.

Obviously this is information that we have to rely on users providing when they complete the application form to use our resources. We recently updated the form to make the funding field and PI field compulsory, whether you are a new user or asking for a renewal. It only takes 2 secs to fill in this information and helps us tremendously in keeping track of our usage and stats.

We have had some teething problems with PhD students putting their own names down instead of their supervisors' and users performing biomedical research putting the AHRC down as their funding council! However we're sure we can tidy up the database with some help from our users so please help us to accurately populate our databases.

Sunday 10 October 2010

The Bottom Line

Interesting thing about the "Science is Vital" rally yesterday in London (I was there, man). Most of the speakers spoke about medicine. Is it easier to explain to a politician (or a member of the public) the benefits of science from the point of view of medicine, when lives are directly affected?

I was thinking about the work that researchers do on the NGS, and it seems entirely plausible that in-silico drug screening, modelling blood flow through the brain, or defibrillation of the heart will save lives someday, if they haven't already (and there's much more!)

While this work has a high impact on the individual whose life it saves, it is more subtle to assess the economic impact of some of the other work. What use is it to know how dinosaurs walk? If we find the Higgs (over on GridPP), will it make us all richer? What is the use of astronomy? A point that was made yesterday is that we can't say beforehand what is "good" and what is not. Impact can come from the strangest places. Knowing how dinosaurs walk would make the next dinosaur-themed blockbuster more interesting. The recent work which led to a Nobel prize in physics involved (more or less) pencils and sticky tape. The particle accelerators and the Universe are laboratories for the very small and the very large (in fact the Universe is also a particle accelerator), and they both lead to an increased understanding of the laws of physics, which in turn lead to many other benefits.

It was left to Simon Singh to remind us of maths, the lives saved by cryptanalysis during WWII when Enigma was cracked. (I might add the same kind of maths is also the foundation of e-Commerce.)

The more subtle benefit is that if and when we solve these problems, we are also left with the infrastructure that solved them (like the NGS), and with the know-how (e.g. CFD, computational modelling, visualisation, steering, etc.)

And of course doing exciting things like watching planets and stars, or seeing dinosaurs walk, helps attract young people into a science career, the next generation of researchers.

Saturday 9 October 2010

Nagios, MyEGEE and MyEGI

[With thanks to Cristina Del Cano Novales at STFC]

The story so far... early last month we started to deploy WLCG Nagios as a replacement for the existing INCA testing service.

STFC set up the servers and deployed the latest published version of WLCG Nagios before handing the baton to Leeds. Leeds are configuring Nagios and the ecosystem of software that supports it.

And there is a lot of supporting software.

In addition to Nagios itself, the NCG configuration tool and the plugins that actually do the work - all of which were described in a previous post - the WLCG package includes...
  • pnp4nagios to keep and display historical records of performance metrics.
  • an ActiveMQ message bus to pass data to wherever it needs to be. STFC have cunningly configured this on the development box so that it can only talk to itself.
  • a MySQL database to keep status information.
There are two web tools called MyEGEE and MyEGI that allow people to extract the bits of information in that database that are relevant to them.

Both are built on the kind of general purpose frameworks that have sprung up since people started bandying around the term 'Web 2.0' as if it meant something. Much of what is seen as Web 2.0 is - at its heart - some way of reading, or updating, a database from a suitably interactive and pretty webpage. Frameworks to do database wrangling are available for all popular languages and quite a few unpopular ones.

MyEGEE is written in PHP on top of the Zend Framework; MyEGI is written in Python and based around Django.

According to a presentation at the EGI Technical Forum - highlighted by one of our colleagues from STFC - MyEGI will replace MyEGEE in the next few months. The developers are expecting to produce the first official MyEGI release in November.

Tuesday 5 October 2010

Gathering pace

Preparations for the NGS Innovation Forum are quickly gathering pace. Our call for poster abstracts closed on the 24th of September and the Programme Committee meeting to discuss the abstracts takes place tomorrow. If you have submitted an abstract for the event you will hear the outcome by the end of this week so watch your inbox!

I've also confirmed the last of the speakers, so the agenda is now complete, although it is missing one presentation title which will hopefully be with us very soon - so watch this space...

In the meantime I can announce details of some other presentations at the event -

NGS site presentation - What the University of Westminster gained from being a NGS Partner Site - Gabor Terstyanszky, University of Westminster
The University of Westminster joined NGS in 2007 as a partner site. Being a partner site is very different than being a core site and this presentation will overview the challenges and experiences of the University of Westminster as a partner site. The presentation will also outline how we identify prospective users at the University of Westminster and what kind of application support and technical services we provide for local users to port and run their applications on NGS resources.

Federated Access to NGS resources - Mike Jones, NGS, University of Manchester
This talk will demonstrate how to use NGS resources using your institutional login credentials (via the UK Access Management Federation). It will describe how the UK's two main eScience authentication systems are combined to form an easy to use yet robust identity management environment. It will discuss how this mechanism links together with system, project and VO registration procedures.

The registration for the event will close on the 12th of November so make sure you register before then!

Saturday 2 October 2010

Secret messages

Listen very carefully, I will say this only once.

This posting is about keeping secrets. As such, information will be provided on a strictly need to know basis.

And what you need to know is that...

We may (or may not) be transferring some data from somewhere that might (or might not) be near Didcot to somewhere else that might (or might not) be Manchester.

Some people might speculate that this is something to do with job accounting data.

While it is true that there are a number of GridPP sites that are NGS partners, and it is true that these sites will run jobs on behalf of users within the ngs.ac.uk virtual organisation, and it is true that the amount of CPU time used by these jobs is recorded in the EGI APEL accounting service - we obviously can neither confirm nor deny that APEL data is now sent to the NGS and recorded against users' records in the User Account Service.

So, putting aside the cloak and dagger for the moment, what is the big secret?

Like many secrets it is very, very dull. We are indeed passing accounting data around. As far as data protection rules are concerned, accounting records count as personal information so we have to ensure that they are never left sitting around unprotected.

As we are not currently able to directly link the NGS and the APEL accounts database, we needed to find a delivery mechanism that is both secure and which could be automated.

The solution we chose was to encrypt the data and put it somewhere where it can be collected at regular intervals.

The encryption is done with OpenSSL...
  
export TOP_SECRET="if we told you, then it wouldn't be a secret any more."
openssl enc -e -blowfish -in secret-data.plain -out secret-data.enc -pass env:TOP_SECRET

and - when the file has been transferred - the data can be decrypted with

export TOP_SECRET="that password in the line above that we are still not telling you"
openssl enc -d -blowfish -in secret-data.enc -out secret-data.plain -pass env:TOP_SECRET

This is symmetric encryption, so now we need to play pass-the-password.

This is where certificates come to the fore, as they support asymmetric, or public/private key, encryption. As long as the password is shorter than the key length, it can be encrypted using the recipient's certificate so that only the certificate holder can retrieve it. The command is...

echo $TOP_SECRET | \
openssl rsautl -encrypt -certin -inkey someones-public-cert.pem \
-out password.enc

The password.enc file can be sent to the certificate holder, who can unscramble it using something like
  
openssl rsautl -inkey /path/to/my/userkey.pem -decrypt -in password.enc

This can be used to update the password securely and ensure that all the secrets stay secret.

That is all. This blog posting will self-destruct in 30 seconds...