In an ideal world, computer systems would tick along nicely, never causing any trouble or going wrong.
And IT staff could sit around doing nothing and drinking coffee all day.
The reality is that - if your computer system seems to be ticking along nicely - then your IT staff are doing their jobs very well indeed... and drinking coffee all day.
Any well-run computer system will have some kind of monitoring in place to alert the system administrator when a disk starts to fill up or a computer becomes overloaded. A decent monitoring system will detect minor problems before they turn into major ones. There are many monitoring systems available, one of the most popular being Nagios.
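To make the "disk starts to fill up" case concrete, here is a minimal sketch of a Nagios-style check written in Python. The only thing taken from Nagios itself is the plugin convention of one line of output plus an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the 80%/90% thresholds are assumptions chosen for illustration.

```python
import shutil

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_usage(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a status code (thresholds are illustrative)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status, message) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {err}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    percent = 100.0 * usage.used / usage.total
    status = classify_usage(percent, warn, crit)
    return status, f"DISK {LABELS[status]} - {path} is {percent:.1f}% full"


def main():
    """Print the one-line result and return the exit code."""
    status, message = check_disk("/")
    print(message)
    return status
```

A real plugin would finish with `sys.exit(main())` so that the monitoring system can read the exit code. The WARNING level is what lets a minor problem be seen before it becomes a major one.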
A well-run Grid needs monitoring too.
A Grid is a collection of services spread over a number of institutions, and monitoring it brings its own complications. For example: if you want to test whether accounts are working as expected on the NGS partner site at Leeds, you will first need to contact a MyProxy server at STFC for a certificate and a VOMS server at Manchester for Virtual Organisation membership. A failure at either of those two sites would make the test fail even if Leeds itself were healthy.
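One way to picture this complication is as a chain of checks in which each step depends on the one before it. The sketch below is not how INCA or any NGS service works. It is a hypothetical illustration in which each check is just a Python callable returning True or False. When a step fails, the later steps are reported as UNKNOWN rather than CRITICAL, because they were never really tested.

```python
# Status codes following the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_chain(steps):
    """Run (name, check) pairs in order and return (name, status, detail) tuples.

    Each check is a callable returning True on success. After the first
    failure, the remaining steps are marked UNKNOWN instead of being run.
    """
    results = []
    blocked_by = None
    for name, check in steps:
        if blocked_by is not None:
            results.append((name, UNKNOWN, f"not tested: {blocked_by} failed"))
            continue
        try:
            passed = bool(check())
            detail = "passed" if passed else "failed"
        except Exception as err:  # a crashing check counts as a failure
            passed = False
            detail = f"failed: {err}"
        if passed:
            results.append((name, OK, detail))
        else:
            results.append((name, CRITICAL, detail))
            blocked_by = name
    return results


# Stub checks standing in for the real services (hypothetical outcomes).
example = run_chain([
    ("MyProxy at STFC", lambda: True),
    ("VOMS at Manchester", lambda: False),
    ("Account test at Leeds", lambda: True),
])
for name, status, detail in example:
    print(f"{name}: status {status} ({detail})")
```

In this made-up run the VOMS step fails, so the Leeds account test is reported as "not tested" rather than as a fault at Leeds.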
The NGS monitors all partner sites using a service based on the INCA framework from the San Diego Supercomputer Center. The results are publicly available at http://inca2.ngs.ac.uk.
Our colleagues in GridPP make use of a number of monitoring services.
Yet INCA and many of the other monitoring services are specialised and not widely used outside the grid community.
Current R&D efforts within the UK Grids focus on using tools such as Nagios, which are more familiar to system administrators.
We already have a Nagios plugin - available from the NGS project pages on NeSCForge - that can be used to integrate INCA results into an existing Nagios service.
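The general idea of feeding test results from one system into Nagios can be sketched as follows. This is not the NeSCForge plugin, and it does not use the real INCA data format. The record layout (a dict with `test`, `passed` and `age_minutes` keys) and the 60 minute staleness limit are assumptions made up for this example.

```python
# Status codes following the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def summarise(results, max_age_minutes=60):
    """Collapse a list of test results into one (status, message) pair.

    Each result is a dict with hypothetical keys:
      "test" (str), "passed" (bool), "age_minutes" (number).
    Failures take priority; otherwise stale results give UNKNOWN.
    """
    if not results:
        return UNKNOWN, "INCA UNKNOWN - no results available"
    failed = [r["test"] for r in results if not r["passed"]]
    stale = [r["test"] for r in results if r["age_minutes"] > max_age_minutes]
    if failed:
        return CRITICAL, (
            f"INCA CRITICAL - {len(failed)} of {len(results)} tests failing: "
            + ", ".join(failed)
        )
    if stale:
        return UNKNOWN, (
            f"INCA UNKNOWN - {len(stale)} results older than "
            f"{max_age_minutes} minutes"
        )
    return OK, f"INCA OK - all {len(results)} tests passing"


# Made-up sample data, not real NGS results.
sample = [
    {"test": "job-submit", "passed": True, "age_minutes": 12},
    {"test": "file-transfer", "passed": False, "age_minutes": 15},
]
status, message = summarise(sample)
print(message)
```

Treating old results as UNKNOWN rather than OK is a design choice: a test that has stopped reporting should not be mistaken for a test that is passing.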
There is also a project underway at STFC to determine how far the WLCG Nagios system can replace what we currently do with INCA and whether we can build a monitoring system that covers the NGS and GridPP sites.
[Edit, 11-Apr - changed explanation of the complications of testing a grid]