Friday 16 July 2010

Delivering data

There now follows a Public Service announcement from The National Grid Service Department of stating the bleeding obvious.

There is very little point in using Grid software on a machine in Daresbury to run an application on a computer near Didcot if the data you need is stuck on a server in Darwin.

That statement is not going to be a surprise to anyone. After all, the Worldwide LHC Computing Grid was built to ship the flood of data from CERN to somewhere where it can be stored and then on to somewhere where it can be analysed.

When delivering data, there is definitely more than one way to do it: you could use GridFTP or SRB or iRODS or SRM, or SFTP or FTP or WebDAV or HTTP or even, if you are feeling old-fashioned, read and write to files on a local disk.

Things get more complicated when you need to collect data through one mechanism and deliver it through another. In practice, this almost inevitably means that the data is copied onto local storage before being sent to its final destination.

This is not practical if there is a lot of data and you are on a comparatively slow network connection.
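
One general way to avoid staging a whole file locally is to relay it in small chunks, reading from one end and writing to the other as you go. The sketch below illustrates that idea only; it is not taken from DTS, and it uses in-memory streams as stand-ins for two different protocols. The names relay and CHUNK_SIZE are made up for the example.

```python
import io

CHUNK_SIZE = 64 * 1024  # hypothetical buffer size, chosen for illustration


def relay(source, sink, chunk_size=CHUNK_SIZE):
    """Copy from a readable stream to a writable one, a chunk at a time.

    Only one chunk is held in memory at once and nothing is written
    to local disk. Returns the number of bytes copied.
    """
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        sink.write(chunk)
        copied += len(chunk)
    return copied


# In-memory stand-ins: in real life these would be, say, a GridFTP
# download on one side and an SFTP upload on the other.
source = io.BytesIO(b"x" * 200_000)
sink = io.BytesIO()
total = relay(source, sink)
```

Any pair of objects with read and write methods would do here, which is the point: the relay does not need to know which protocol sits at either end.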

This is one of the problems that the DataMINX Data Transfer Service (DTS) aims to solve.

The DTS is an international collaboration jointly funded by the Australian Research Collaboration Service and OMII-UK. It isn't really NGS R&D, but it is built on earlier work from the NGS, and NGS staff have provided much of the development effort.

The idea behind DTS is that you give the job of delivering your data to the DTS in very much the same way as you would give the job of delivering a favourite Aunt's birthday present to a parcel courier service.

A courier will have a network of planes, trains, vans and delivery drivers to collect the parcel and carry it to its destination. You just have to book your collection. Auntie just needs to sign for the parcel.

Delivery in the DTS is done by pools of worker nodes with fast network connections and the wherewithal to send and receive data using the many network protocols. An internal messaging system allows requests for data transfers to be made and the status of the transfers to be reported.
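
The pattern of a worker pool fed by one message channel and reporting on another can be sketched in a few lines. This is a toy illustration using Python threads and in-process queues, not the DTS implementation; the job fields, the example.org and example.net URLs and the status names are all invented for the example.

```python
import queue
import threading


def worker(name, requests, statuses):
    """Take transfer requests off one queue and report progress on another."""
    while True:
        job = requests.get()
        if job is None:  # sentinel: no more work for this worker
            break
        statuses.put((job["id"], "started", name))
        # A real worker would move the data from job["source"]
        # to job["target"] here.
        statuses.put((job["id"], "done", name))


requests = queue.Queue()   # transfer requests going in
statuses = queue.Queue()   # status reports coming back

workers = [
    threading.Thread(target=worker, args=(f"worker-{i}", requests, statuses))
    for i in range(2)
]
for w in workers:
    w.start()

# Hypothetical transfer requests: the URLs are placeholders.
jobs = [
    {"id": 1, "source": "gsiftp://example.org/a", "target": "sftp://example.net/a"},
    {"id": 2, "source": "http://example.org/b", "target": "ftp://example.net/b"},
    {"id": 3, "source": "sftp://example.org/c", "target": "file:///tmp/c"},
]
for job in jobs:
    requests.put(job)
for _ in workers:
    requests.put(None)  # one sentinel per worker
for w in workers:
    w.join()

# Keep the most recent status reported for each job.
final = {}
while not statuses.empty():
    job_id, state, _worker_name = statuses.get()
    final[job_id] = state
```

The person who booked the transfer only ever sees the two queues, much as the sender of a parcel only sees the booking form and the tracking page.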

In software terms, the developers of DTS have deliberately avoided reinventing the wheel - something for which the Grid has a not-entirely-undeserved reputation. Where possible, they have adopted and adapted existing widely-used libraries. For example:

There is much more to DTS than can be covered in a blog post. If you want to know more, a PowerPoint presentation describing how DTS works can be found, with the source code, on the project's web site (http://dtsproject.googlecode.com), and a formal paper describing the work is due to be published in Philosophical Transactions of the Royal Society A in late July or early August.

[With thanks to David Meredith of the DTS project.]
