Friday 22 October 2010

Now wash your schema

In that strange parallel universe that exists only in TV adverts: two friends sit in a remarkably spacious and clean kitchen, sipping low-calorie-but-surprisingly-tasty beverages, and talking.

And what are they talking about? Which wonderful washing powder washes whites whitest.

If we lived in their world, I would be able to introduce:

The all-new NGS schema washing service - removing unpleasant stains from your grid information and leaving it huggably soft and smelling of Summer Meadows.

and not feel like a complete idiot.

You really don't want to have dirty schema. Not only will your friends talk incessantly about it when they visit for a cup of low-calorie-but-surprisingly-tasty beverage - your site will be completely ignored by the UI/WMS Resource broker.

The technical details were covered in an NGS surgery earlier this year.

It is all to do with the Grid Information Services, through which sites publish static information about the hardware and software available, and dynamic information such as the number of running jobs or the amount of free disk space.

Information services feed off one another. A site service would collect and combine all the information from the compute and data services within the site.

Data from all the sites is gathered together into a central service where it can be used by the UI/WMS, and the load monitor and applications page on the NGS website.

The information is passed around using LDAP.

LDAP represents things as Objects. Each object will be a member of one or more Object Classes. Each object class is associated with a set of Attributes. An attribute is a label and one or more values.

Beneath any LDAP service there is a set of schemas. A schema defines which attributes can be defined for an object of a particular class.
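For those who prefer code to kitchens, here is a rough sketch of that model in Python. It is a toy, not a real LDAP library: the dictionary layout, the function name allowed_attributes, the host ce.example.org and the particular class and attribute names are all assumptions chosen for illustration.

```python
# Toy model of LDAP objects and schema. Not a real LDAP client:
# the layout and every name and value below are illustrative assumptions.

# A schema maps each object class to the attributes it permits.
SCHEMA = {
    "GlueTop": set(),  # a class with no attributes of its own
    "GlueCE": {"GlueCEUniqueID", "GlueCEName"},
    "GlueCEState": {"GlueCEStateRunningJobs"},
}


def allowed_attributes(object_classes, schema):
    """Return the union of attributes permitted by the given classes."""
    allowed = set()
    for object_class in object_classes:
        allowed |= schema[object_class]
    return allowed


# An object belongs to one or more classes; each attribute is a label
# with one or more values (hence the lists).
entry = {
    "objectClass": ["GlueCE", "GlueCEState"],
    "GlueCEUniqueID": ["ce.example.org/jobmanager"],  # hypothetical host
    "GlueCEStateRunningJobs": ["12"],
}

allowed = allowed_attributes(entry["objectClass"], SCHEMA)
used = set(entry) - {"objectClass"}
entry_is_valid = used <= allowed
```

The point to take away is that an object's permitted attributes come from the combination of all its classes, which is why a class with no attributes can still matter.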

The problems appear when the data passed around does not follow the schema, or follows a slightly different schema. Older publishing software will take anything. Newer software is routinely configured to be far more strict and will silently drop any object that does not fit the schema.
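The difference between the easy-going and the strict behaviour might be sketched like this. It is only an illustration of the idea, not how any real publishing software is written: the functions fits_schema and publish, the strict flag, the sample objects and the host names are all made up for the example.

```python
# Sketch of lenient versus strict publishing. Hypothetical helper names;
# real information services do their checking inside the LDAP server.

# The strict service only knows about the classes in its own schema.
STRICT_SCHEMA = {
    "GlueCE": {"GlueCEUniqueID", "GlueCEName"},
}


def fits_schema(entry, schema):
    """True if every class is known and every attribute is permitted."""
    classes = entry["objectClass"]
    if any(object_class not in schema for object_class in classes):
        return False
    allowed = set().union(*(schema[object_class] for object_class in classes))
    used = set(entry) - {"objectClass"}
    return used <= allowed


def publish(entries, schema, strict):
    """Lenient mode takes anything; strict mode silently drops misfits."""
    if not strict:
        return list(entries)
    return [entry for entry in entries if fits_schema(entry, schema)]


clean_entry = {
    "objectClass": ["GlueCE"],
    "GlueCEUniqueID": ["ce1.example.org/jobmanager"],  # hypothetical host
}
other_flavour_entry = {
    "objectClass": ["GlueTop", "GlueCE"],  # GlueTop is unknown to STRICT_SCHEMA
    "GlueCEUniqueID": ["ce2.example.org/jobmanager"],  # hypothetical host
}

entries = [clean_entry, other_flavour_entry]
lenient_result = publish(entries, STRICT_SCHEMA, strict=False)
strict_result = publish(entries, STRICT_SCHEMA, strict=True)
```

Note that the strict branch raises no error and logs nothing. The second object simply never appears in the output, which is what makes this kind of problem so hard to spot.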

A number of NGS partner sites are using a rather elderly version of the Virtual Data Toolkit to publish data. The NGS central service was recently updated to the latest, greatest and strictest version of the Berkeley Database Information Index (BDII).

The VDT and BDII schemas differed by one small detail: two object classes called GlueTop and GlobusStub existed only in the VDT version. Neither GlueTop nor GlobusStub has any attributes directly associated with it, so their presence did not affect the content published. It was just that a reference to them was enough for the VDT-flavour data to be ignominiously dumped before ever reaching the central BDII.

Information from sites simply disappeared.

But... the BDII is perfectly capable of collecting 'dirty' data, removing the extra object classes and similar quirks and republishing it as clean, fresh data (with a hint of Melon and Lotus Flower). All it needed was a touch of FIX_GLUE.

FIX_GLUE is a BDII configuration option, originally intended to be used at a site level, that turns on the data cleaning. An NGS staff member at STFC realised that this same approach would work as a national service - and the schema washing service was born.
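In spirit, the wash cycle looks something like the Python below. This is a sketch of the idea only and makes no claim about how the BDII implements FIX_GLUE: the function name wash, the EXTRA_CLASSES set, the sample object and the host name are assumptions for illustration.

```python
# Sketch of "washing" an object: strip the object classes that the
# strict schema does not know and leave everything else untouched.
# Not the BDII's actual code; all names here are hypothetical.

EXTRA_CLASSES = {"GlueTop", "GlobusStub"}


def wash(entry, extra_classes=EXTRA_CLASSES):
    """Return a copy of entry without the unwanted object classes."""
    cleaned = dict(entry)  # shallow copy so the dirty original is kept
    cleaned["objectClass"] = [
        object_class
        for object_class in entry["objectClass"]
        if object_class not in extra_classes
    ]
    return cleaned


dirty_entry = {
    "objectClass": ["GlueTop", "GlobusStub", "GlueCE"],
    "GlueCEUniqueID": ["ce.example.org/jobmanager"],  # hypothetical host
}

clean_entry = wash(dirty_entry)
```

Because neither of the extra classes carries any attributes, removing them loses no published content. Only the reference that upset the strict schema goes down the drain.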

[27 Oct: Edit to improve the description of how the washing works. With thanks to Jonathan Churchill at STFC.]
