Sunday 28 November 2010

Software distribution by squid

Last week saw the NGS Innovation Forum. Many of the people who do the NGS's Research and Development work were involved in the forum which, ironically, left us very little time to do any actual innovating in the last week or so.

So... this post will be about something our colleagues in GridPP are working on - and which was discussed at a gathering of UK High Energy Physics System Managers early last week.

The NGS had been invited to the gathering to talk Nagios and monitoring, Other presentations covered the use of sofware from CERN called CVMFS.

CVMFS is interesting approach of delivering software efficiently - by combining the idea of Content Addressable Storage with the World Wide Web's capacity to bring data close to where it is needed. There is a very detailed technical report available from CERN and a twitter feed but little of what could be thought of as public documentation.

To understand why CVMFS is so appealing to GridPP, you need to understand their users.

The use of GridPP systems is very different from that of systems elsewhere in the NGS. They provide a lot of compute power, handle a mind-blowingly-huge amount of data - but deploy a comparatively small range of applications software, albeit on a large number of machines.

It is vital that the software used to analyse data from the major experiments at CERN be available everywhere where that data will be analysed. In the past, special deployment jobs were run for this purpose.

CVMFS is an alternative approach. It was sprung from a CERN project to deploy virtual images and the need to keep the images small.

In CVMFS, files are deployed from a single central source. When a file is needed, it is copied to a local disk and read from there. No file is copied more than once and copies are stashed 'nearby' in case another nearby machine needs them.

The caching and stashing is made possible by referring to a file by the SHA1 hash of its contents - hence content addressable storage - and putting it on a web server under a name derived from the hash.

The server provides a catalogue, translating from filenames to hashes. If the same file appears more than once - within an application or within different releases of the same application - it will be represented by the same hash-related-filename on the server.

CVMFS uses this with the Filesystem in Userspace feature of Linux - aka FUSE - to present a user with something looks like any other directory.

Behind the scenes, requests are made to the central server via a local Squid web proxy cache. Squid is designed to collect files from the web on behalf of clients, store copies as it does so and deliver the copy where-ever possible. It is very, very good at this.

There are quirks: the first time as file is needed by a site, access will be slow although all subsequent attempts to use it will be much faster.

As long as a site has enough local disk space and a nice big squid, CVMFS can deliver software to where it is needed, when it is needed.

No comments: