Friday, 20 May 2011

Good news, bad news

It isn't a case of one step forward... two steps back.

It's more one step forward... with another step forward coming soon.

This week, both the Leeds CREAM-CE and the NGS's Nagios project inched forward.

CREAM

We started with good news thanks to a comment from Ewan at Oxford on Leeds' plans to install a grid front end to our ARC1 High Performance Computing service.

Ewan pointed out that our preferred approach - leaving the grid access on a machine almost-completely-detached from the HPC service - is a) also other people's preferred approach and b) one that actually works.

Which is nice.

And would be nicer if it wasn't for the bad news: Sun Grid Engine support in CREAM never made it as far as the first major release (EMI-1) of the European Middleware Initiative's grand unified grid software.

Grid Engine support is expected to arrive in a minor release - coming soon.

NAGIOS

We have had a working Nagios development system for some time.

We were trying to build a working 'clean' test system. We were planning to use this to practice the full Nagios install and configuration procedure before being let loose on a proper service.

And when we first practiced - a month or so back - the test server refused to install anything.

Since that time progress has been slow. We blame this on the stubborn refusal of the average day to include more than 24 hours, so cruelly depriving the systems staff of enough time to finish everything else that needs to be done.

Earlier this week, after reading the latest Nagios installation documentation and comparing notes with the Nagios developers and with our colleagues at Oxford who run the GridPP Nagios, we worked out what had gone awry.

There were some unfortunate conflicts between packages in the software repositories defined in /etc/yum.repos.d. We ended up in RPM hell...

It's better now. And as a minor bonus we developed a utility that can edit YUM repository definitions in place. It can be found in the UKNGI subversion repository at SourceForge. It isn't pretty or clever, but it does work...
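For the curious, here is the rough idea behind 'editing a YUM repository in place'. This is not the utility in the UKNGI subversion repository - just a minimal sketch, in Python, that flips the enabled flag for a named repository across the .repo files in /etc/yum.repos.d. The repository id at the bottom is made up for illustration.

#!/usr/bin/env python
# Minimal sketch (not the UKNGI utility): enable or disable a named
# repository by rewriting its 'enabled=' line in every .repo file
# under /etc/yum.repos.d, editing the files in place.

import glob
import re

REPO_DIR = '/etc/yum.repos.d'   # standard YUM repository directory

def set_repo_enabled(repo_id, enabled):
    """Set enabled=1 or enabled=0 in the [repo_id] section of all .repo files."""
    flag = '1' if enabled else '0'
    for path in glob.glob('%s/*.repo' % REPO_DIR):
        with open(path) as f:
            lines = f.readlines()

        in_section = False
        changed = False
        for i, line in enumerate(lines):
            header = re.match(r'\s*\[(.+?)\]\s*$', line)
            if header:
                # Track whether we are inside the section we want to change.
                in_section = (header.group(1) == repo_id)
            elif in_section and re.match(r'\s*enabled\s*=', line):
                lines[i] = 'enabled=%s\n' % flag
                changed = True

        if changed:
            with open(path, 'w') as f:
                f.writelines(lines)

if __name__ == '__main__':
    # Example: disable a hypothetical conflicting repository.
    set_repo_enabled('example-extras', False)

The real thing does a little more than this, but the principle - walk the .repo files, find the section, rewrite the flag - is the same.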


1 comment:

Ewan said...

For the sake of completeness I should probably point out that for full-blown gLite style job submission the HPC worker nodes will need (access to) the glite-WN tools; otherwise they won't be able to access any grid resources.

There are several 'shared' clusters running in GridPP as well as the dedicated PP ones, and a common model is to install the 'tarball' copy of the worker node tools on an NFS server, then have all the worker nodes mount it. This keeps the grid stuff on a system the grid folks have access to, with minimal changes needed on the worker nodes by the HPC admins.

It sounds like what you're doing is quite similar to Edinburgh's ECDF system and Birmingham's 'Bluebear'.