Back in February, an over-optimistic fool promised that the NGS would have a working Nagios service in the next few weeks.
The over-optimistic fool was confident because he had a real deadline to meet. Nagios had to be ready by April. April was the month during which the old NGS core sites - which ran the tests for our old INCA-based testing framework - were to be decommissioned.
We are running a little late... but I am pleased to say that 2 weeks ago - on Wednesday the 27th of April 2011 - the NGS's Nagios testing service finally went live.
If you have a certificate and it is listed in the Grid Operations Centre database - you can pay it a visit at https://nagios01.ngs.ac.uk/nagios.
If you haven't or aren't - sorry: WLCG Nagios, unlike INCA, denies access to unregistered users by default. We may be able to remove the restriction in future - but, for the moment, we want to focus on fixing the problems it has found.
It is a bit untidy - as we have been without a fully working monitoring service for over 6 months.
While we kept the INCA service running as long as possible, it had become increasingly out of step due to a decision - very early on - to use the 'NeSCForge' software repository as a safe place to keep its configuration.
NeSCForge was not as safe as we had hoped. It vanished in December last year. The list of sites and tests to run remained frozen in their December state... and the Grid moved on.
We have different partner sites offering different services now. INCA wasn't testing them; Nagios is.
More significantly, Nagios takes its list of sites directly from the Grid Operations Centre database. Changes made there should be reflected in Nagios within a day.
My colleagues in the NGS Partnership team are working their way through the Nagios test results. They are identifying problems, finding missing sites and services - and, most importantly, working out how to make things better.
Wednesday, 20 July 2011
Tuesday, 19 July 2011
Avoid meaningless pretty pictures
With the OGF science-in-the-cloud SAUCG workshop closing, it is time to reflect on the many interesting presentations, and try to identify common areas, next steps, etc.
How do we best provision resources for scientists? "Cloud" is a buzzword but there are drivers behind the push for it: increasing resource utilisation (maybe), service provision for small customers (the large, from the service provider perspective, being griddy), dynamically catching up with work and coping with the last-minute work prior to a conference. Lots of projects presented interesting stuff - see the slides - and expect an NGS surgery on the topic. To take this forward we now need to look at roadmaps - eg NIST and SIENA - identify gaps etc.
And the award for the quote-of-the-day goes to Etienne Urbah for the title of this post, and for his recommendation that "Passive sentences should be avoided."
PS. If you pronounce SAUCG "sausage" then it's entirely your own fault.
Thursday, 14 July 2011
Going to town
One of the local organisers said that they could not recall a time when the lecture theatre - which officially holds 185 people - had been so full.
It was all the more remarkable given that it was being used for a meeting called - deep breath - A Town Meeting to discuss UK Strategy for a Research Computing Ecosystem and the Future of e-Science.
As a rule: meetings that mention 'UK Strategy', 'Computing Ecosystem' and - especially - 'e-Science' do not attract huge numbers of people. This one was special because, somehow, the organisers had persuaded every branch of that amorphous thing called e-Science to come along.
There were the people from PRACE - who provide the really big compute for solving really big problems - and people who run the Institutional High Performance Computing services that drive so much UK research. We had the big data brigade - from Bioinformatics and Earth Systems science - who feed new research. We had representatives from research computing services, institutional IT services and the JANET network. We had the academics who push the limits of what you can do with a computer.
And, of course, there were representatives of the Grid - including the NGS, the Particle Physics community and less-traditional-users such as biology.
And everyone in the room agreed on what we needed to do.
I'll repeat that.
Nearly 200 people involved in academic research gathered in a room and unanimously agreed on what we needed to do next.
That is 'what we needed to do' not 'how we were going to do it'.
Everyone agreed that 'e-Science' must be driven by what the people who do the research actually need.
Everyone agreed that training for researchers is vital.
Everyone agreed that well-written robust software leads to better research.
Personally, I would like to have heard less agreement and more discussion. The e-Science community is full of people who have tackled difficult problems - sometimes successfully, sometimes less so - but the town meeting was simply too large scale for discussions.
Technical discussions are best served by gathering small groups of well informed people. They can get quite heated, but this is not necessarily a bad thing. The Moonshot meeting the day before was about the right size.
That is not to say that the discussions - heated or otherwise - did not happen. It is just that they happened outside the meeting, by the coffee urn, or in the pub, or on the slow train home - between smaller groups of people who happened to be in London at the same time for a big meeting with a very unwieldy name.
If you weren't among the attendees, you can find some of the presentations on the meeting's web page and follow the collective twitterings of some of those who were.
Monday, 11 July 2011
On Moonshot and telling the world who you are.
The International Coffee Organisation's Board Room would make a damn good lair for a James Bond villain.
It also served rather well as a venue for a workshop organised by Project Moonshot, held last Thursday and focused on using Moonshot authentication in Grid and High Performance Computing.
Josh Howlett - JANET's Mr. Moonshot and the workshop organiser - singularly failed to bring a white cat to stroke. And if he had a secret button to drop troublesome guests into his pet shark's tank - he resisted the urge to use it.
He was, however, quite happy to describe his plans to Take Over The World.
We've mentioned Moonshot before: its goal is to take the network of authentication servers that was created to provide Eduroam and re-use it to control access to other services.
Moonshot allows people to authenticate themselves securely using their 'home' username and password. It is based around Tunneled Transport Layer Security provided by the Extensible Authentication Protocol and a network of RADIUS servers.
A service can refer authentication decisions onto a remote Authentication (AAA) server. Any chatter between the client and the AAA server that proves the user is who he or she claims to be - such as username, passwords or SPECTRE membership number - is hidden from the service itself. For the simplest uses, all the service needs to see is a simple yes or no.
There are many places where Moonshot could make life easier:
- Moonshot could make it easier to share High Performance Computers. If it delivers what it promises, you could be granted SSH-access to a service anywhere in the world without needing a separate username and password.
- In the grid world, adding a sprinkle of Moonshot magic to a MyProxy service or to a Credential Translation Service could make grid certificates available without resorting to a web browser.
This is where things get interesting, or complicated, or political - depending on your point of view and position in the IT food chain.
For SSH or certificate access, the service needs to obtain some kind of unique, persistent identifier for every user.
But for the current Eduroam service - all you need is confirmation that the user is from a particular institution. It does not need to know if they are the Vice Chancellor, an esteemed professor or a junior researcher.
At the moment all an institution's RADIUS server need do is confirm or deny that the person connecting is a legitimate user of the network. There are no unique identifiers involved.
For Moonshot to be of use in the grid and HPC worlds, institutional RADIUS servers need to release additional information that can be passed back to the service.
The question is what additional information?
- Should it be an email address? james.bond@mi6.gov.uk
- Should it be something like a login-identifier? bondjames@mi6.gov.uk
- Or should it be pseudo-anonymous? 007@mi6.gov.uk (If you don't have a license to kill, then this would be something like the Shibboleth eduPersonTargetedID - as used by the UK Access Management Federation - which is unique to a person and a service.)
All have advantages and disadvantages....
- IT Security People really dislike seeing usernames being released. You really don't want to give a potential attacker any help in cracking into a system.
- There are legal and licensing rules that restrict access to certain classes of data - such as Ordnance Survey maps - to named individuals. Likewise, HPC service managers are far happier granting access based on an email address rather than a random collection of characters.
- Researchers in some fields, especially Life Sciences, are understandably protective of their personal information and would much prefer pseudo-anonymity.
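To make the pseudo-anonymous option a little more concrete, here is a minimal sketch of an identifier that is "unique to a person and a service". It is an illustration only: it is not how Shibboleth or Moonshot actually compute such values, and the function name, the secret and the example hostnames are all invented for the purpose.

```python
import hashlib
import hmac


def targeted_id(user_id: str, service_id: str, secret: bytes, scope: str) -> str:
    """Derive an opaque identifier for one (user, service) pair.

    Hypothetical helper: the same user always gets the same value at a
    given service, but different values at different services, and the
    value does not reveal the local username.
    """
    # Key the hash with a secret known only to the home institution, so
    # that a service cannot recompute the identifier from a guessed username.
    message = f"{service_id}!{user_id}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{digest[:32]}@{scope}"


if __name__ == "__main__":
    secret = b"kept-by-the-home-institution"  # assumed, for illustration
    print(targeted_id("bondjames", "hpc.example.org", secret, "mi6.gov.uk"))
    print(targeted_id("bondjames", "myproxy.example.org", secret, "mi6.gov.uk"))
```

The two services see unrelated strings for the same person, which pleases the privacy-minded but is exactly the "random collection of characters" that makes HPC service managers uneasy.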
This is far too complicated a problem to solve at a single meeting, even if one has the threat of becoming a shark's lunchtime snack to concentrate the mind.
Moonshot is a very impressive project, with international reach and practical contributions from experts in the field. They strike me as the right people to solve it.
Friday, 8 July 2011
Tweeting - the sound of (e)Science.
We are in the Sir Ambrose Fleming lecture theatre in University College London at the start of the e-Science town meeting.
The room - which has space for nearly 200 people - is full. The meeting has just started.
On his perch high above me - resplendent in his finest plumage - is Steven Young, our official NGS tweeter.
You can follow Steven as he reduces the sweet sounds of e-science to 140 characters-or-less on the NGS twitter feed.
Wednesday, 6 July 2011
CERN and SSO - your questions answered (maybe)
There have been a number of questions about this recently.
CERN lets people sign in to some of their stuff with a certificate from any approved CA. Which is jolly nice of them - those of us who don't have CERN accounts can just log in with our personal certificate if we need to access something for some reason.
Last year, around August some time, CERN upgraded their single sign-on (SSO) infrastructure, and one of the security-related "improvements" was to check for CRL Distribution Point (CDP) extensions in intermediate CA certificates. We didn't have that extension (neither, for that matter, had DoESG), as it is not required by the standards, and the requirement was discovered, of course, only after the upgrade. (It doesn't seem to affect any other service that uses certificates.)
As a workaround we created a new certificate just like the existing one but with a CDP added to keep CERN's SSO happy. This was sent to CERN (only), and they deployed it, and it worked, and everybody was happy.
Until fairly recently when it stopped working, for reasons unknown, and of course without any warning. It looks like it was "undeployed" for some reason, and cannot be redeployed?!
So I guess it's up to us to "fix" it - but how? Changing existing CA certificates in the distribution is something that should not be done lightly - we could do this, it should work, technically, but it may at least be an inconvenience for users who have no need for CERN SSO, and one never knows if there's something out there that'll break.
Meanwhile, we have a rollover coming up in September (of which more anon). New CA certificates will have CDP, so should also keep CERN's SSO happy. We can test whether it works at CERN already because they are already deployed. Therefore, the new workaround is to start moving people who need CERN SSO over to the new CA certificates. We'll trial this with a few certificates shortly, and then proceed with the rest.
These are all "just" technical issues, and it is certainly not the first time someone has broken something by upgrading or "improving security" - it happens all the time (and occasionally to us, too). This is a good reason to keep certificates fairly conservative and not try out new exciting features on unsuspecting users. Which is why we have documents like GFD.125. (Which may be due for a revision - something to discuss at OGF32 perhaps.)
ARC quirks and keeping track
Another week, another apology for having very little new to say.
Our excuse is that we have all been busy preparing for a meeting of NGS Collaborators (today, 6 July), a workshop on Moonshot, Grid and High Performance Computing (on Thursday) and for a Town Meeting on the future of e-Science and HPC Infrastructures and Applications in the UK (Friday).
On the plus side, when we have all recovered, there should be plenty to write about.
There has been a small amount of time available to work on Leeds' ARC grid software deployment - concentrating on the dull-but-useful task of tracking down and reporting minor bugs.
One such quirk appears when logfiles are rotated - that is, renamed and compressed at regular intervals to conserve disk space. ARC continues to write to the original - now renamed - file rather than to a new one. We found the bug and reported it, only to discover that the developers were already aware of it and that it will be fixed in the next release.
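For anyone wondering how a program ends up writing to a renamed file, the general mechanism can be sketched in a few lines of Python. This is not ARC's code and says nothing about how the ARC developers will fix it: it simply assumes a POSIX filesystem, where an open file handle follows the file rather than its name, and uses the standard library's logging handlers. The function name and the file names are made up for the demonstration.

```python
import logging
import logging.handlers
import os
import tempfile


def demo_rotation(handler_cls):
    """Log a line, rename the logfile as a rotation tool would, log again.

    Returns (contents of the renamed file, contents of any new file at the
    original path).
    """
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "service.log")
    rotated = path + ".1"

    logger = logging.getLogger(f"demo.{handler_cls.__name__}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = handler_cls(path)
    logger.addHandler(handler)

    logger.info("before rotation")
    os.rename(path, rotated)  # what a rotation tool does behind our back
    logger.info("after rotation")

    handler.close()
    logger.removeHandler(handler)

    with open(rotated) as f:
        old_text = f.read()
    new_text = ""
    if os.path.exists(path):
        with open(path) as f:
            new_text = f.read()
    return old_text, new_text


if __name__ == "__main__":
    # A plain FileHandler keeps its handle open, so both lines end up in
    # the renamed file.
    print(demo_rotation(logging.FileHandler))
    # WatchedFileHandler checks the path before each write and reopens it
    # if the file has been moved, so the second line goes to a new file.
    print(demo_rotation(logging.handlers.WatchedFileHandler))
```

The other common cure is for the rotation tool to tell the service to reopen its logs once the rename is done.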
Which gives me a chance to opine...
In the 4-years-or-so that I have been involved with The Grid, I have needed to contact software developers all over the world.
With very few exceptions, the developers have been capable, helpful and responsive.
(and I am not going to identify those very few exceptions)
The open nature of much of the development work, with bug databases and source code repositories readable by anyone, gives bug-hunters from outside the development team as much information as those inside. If used wisely, this information means better bug reports and faster fixes.
Unfortunately, bug-hunting is becoming harder. It is an unfortunate side effect of the European Grid Infrastructure and European Middleware Initiative - and their remit to co-ordinate development activity from many disparate teams.
The individual development teams have their own development processes and tools....
- The WLCG Nagios developers at CERN use installations of Jira for bug-tracking, Confluence for documentation and Fisheye for viewing the source code.
- The gLite developers use their Savannah service for bugs and feature requests and provide ViewVC for source code browsing. [Addition 8-July: As Brian Bockelman has kindly pointed out - the ViewVC link above no longer takes you to the latest versions as components of gLite have moved to a TRAC instance at CERN.]
- ARC has a bugzilla for bugs and uses TRAC for source code browsing.
There are different interfaces for different tools, sometimes you need a certificate, sometimes you need an account. As someone slightly outside - but with an interest in - the development of grid software, I know how hard it can be to check if a bug has been reported, whether it has been fixed and when the fix will be available.
There is clearly work being done to improve the situation and no-one would claim that distributed, international software development is easy to do. What the grid really does not want to do is weaken whatever connections there are with the system administrators who deploy the software and the external developers who use it.