Tuesday, 29 March 2011

The Problems of Pilots and Pools

The Grid relies on users having a unique identity, represented by their certificates.

Which is all very well until you actually have to run something. At this point, the certificate must be mapped to a set of local credentials - on a Unix system this will be a username and a set of groups.
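To make the idea concrete, here is a minimal sketch of that mapping step. The names and the map-file layout are invented for illustration; real sites use the LCAS/LCMAPS framework rather than anything this simple.

```python
# Hypothetical sketch: mapping a certificate's Distinguished Name (DN)
# to local Unix credentials (a username plus groups). The DN, the
# account name and the map structure are all illustrative.

dn = "/C=UK/O=eScience/OU=Leeds/CN=some user"

# The site keeps a map from grid identity to local account and groups.
grid_mapfile = {
    "/C=UK/O=eScience/OU=Leeds/CN=some user": ("pheno001", ["pheno"]),
}

def map_credentials(dn):
    """Return the (username, groups) pair this host assigns to a DN."""
    if dn not in grid_mapfile:
        raise PermissionError("no local account for " + dn)
    return grid_mapfile[dn]

user, groups = map_credentials(dn)
print(user, groups)  # pheno001 ['pheno']
```

The important point is that this map is per-host: nothing forces two machines to agree on what `pheno001` means, or to hand the same DN the same account.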

There is no reason why your local credentials will be the same on different hosts or even that they will be the same on different worker nodes within the same compute cluster - especially where pool accounts meet pilot jobs.

A compute service cannot support a large virtual organisation by giving every single member his or her own account - especially if the bulk of the members will never come near the service. On practical grounds, it is more common to set aside pools of accounts and hand them out on a first-come-first-served basis.

Pilot jobs are widely used in the particle physics world. The sole purpose of a pilot job is to find a big enough chunk of compute power and then and only then find something useful to do with it.

Users submit tasks which are kept on a central queue. When a pilot job runs, it will pick a task from the queue, magically become the task's owner and perform the task.
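The pilot pattern itself is simple enough to sketch. Everything below - the queue contents, `run_as` and `pilot` - is a hypothetical stand-in: in production the identity switch is glExec's job, not a Python call.

```python
# Sketch of the pilot-job pattern: the pilot has already won a batch
# slot; it then repeatedly pulls tasks from a central queue and runs
# each one under its real owner's identity. All names are invented.

import queue

central_queue = queue.Queue()
central_queue.put({"owner": "/CN=alice", "payload": "analyse run 1234"})

def run_as(owner, payload):
    # Stand-in for glExec: switch to the account mapped to the task's
    # owner, then execute the payload under that identity.
    print("running '%s' as the account mapped to %s" % (payload, owner))

def pilot():
    # Find something useful to do with the compute slot we occupy.
    while not central_queue.empty():
        task = central_queue.get()
        run_as(task["owner"], task["payload"])

pilot()
```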

The magic is provided by a program called glExec which itself depends on the LCAS/LCMAPS framework.

In some of its early incarnations, LCAS/LCMAPS was configured so that every worker node had a separate pool of accounts - and the 'real' user of a pilot job usually ended up with one of these per-worker pool accounts.

These days, it is more common for glExec to pass requests on to a central authorization service.

The ARGUS service, currently being tested by Southgrid at Oxford, is the latest generation of central authorization service. Its behaviour and quirks were covered at a presentation by Kashif Mohammed at a recent NGS Surgery.

Kashif's slides describe how the components of the authorization framework can control access for pilot jobs. The same service can also centrally manage the mapping of certificates to local credentials through an Obligation Handler (OH).

This isn't just relevant to pilot-and-pool-pushing particle physicists. glExec also provides authorization for the CREAM compute element, and we plan to use glExec and ARGUS to centrally manage the mapping of credentials when CREAM is deployed in front of Leeds' ARC1 cluster.
