Monday 16 April 2012

"hello, science\n"

It is worth pondering how scientific programming is different from other programming. Last year I gave an introductory talk on specialised languages used for science (in which I include Fortran but mainly covered R, APlus, and suchlike). How do you do "hello, world" in science?  It has to be floating point, so I picked calculating the length of a vector.

Let's just digress for a second to do that. Say I want to calculate the length of (vi); I can then start with s=0 and loop over i, adding vi2, and finally take the square root of the sum:

my $s=0; foreach (@v) {$s+=$_*$_;} return sqrt($s);

Or we can do it more functionally, creating a new vector of squares ("map"), the elements of which are then added together ("reduce"):

(sqrt (reduce #'+ (map 'list (lambda (x) (* x x)) v)))

... which is the origin of the MapReduce paradigm, but it has the disadvantage of creating a temporary copy (here a list) of the squares. But. If you are doing them in parallel, with each task squaring its own entry (which you might if v is large), in this case you do need to keep the intermediate results anyway.

Then there are questions of precision and suchlike, for which David Goldberg's paper is still one of the best introductions. This is in contrast to "normal" programming, where one should read Zen and the Art of Motorcycle Maintenance (but see also 10 papers).

We can then ask how science use of * is different from normal use of * (where * is anything). Do scientists use the cloud in a different way from non-scientists?

With this in mind, JISC and STFC co-organised a workshop on scientific computing in the cloud (and grids.) Funded by EPSRC, and with about 75 registered participants and 15 speakers from the UK and beyond, it focused on the science use of cloud (and grid) resources. There were a number of discussions on cost effectiveness, cost models, and the true cost of doing science in clouds compared to your own (university's) resources. How careful should you be about putting your data "in the cloud" - and here we are just talking about analysis of data, not long term storage. How do you convince sceptical users?

It seems that some of the lessons learned from the grid carry over to the cloud world: the use of gateways and portals is a useful way to get researchers started using the cloud, but then someone needs to build these things for the research communities - and they will in general be domain specific. And building these cannot just be a proof of principle; they have to be production ready and supported.

Of course e-scientists have scientific applications, specialised libraries, and repositories of libraries - and every e-science programmer should know their BLAS and LAPACK... on the other hand, the presence of gateways and portals brings hope to the "ordinary" researchers who want to make the most of the brave new world of the fourth paradigm but are not themselves programmers and choose (rightly) to focus on their science.

Science use of clouds may have learnt from science use of grids, but clouds also introduce new issues. We agreed at the workshop that it was worth pursuing the case studies. There was no single "pain point" for everyone, but everyone learnt from each other. Supporting scientific research in the clouds (and grids) is a research topic in its own right, bringing together computing, science, best practices, usability, security, performance, and more - and as long as we continue to share experiences, the researchers who use the infrastructure will benefit.

No comments: