We are now extending the WLCG code with some NGS-specific tests.
In particular, we are adding to the set of tests that are run on individual worker nodes as part of the 'CE' tests and, eventually, the 'CREAM-CE' tests.
This is not exactly a common requirement, so documentation is understandably sparse. The best place to start seems to be https://twiki.cern.ch/twiki/bin/view/LCG/PracticalHintsForMigrating2Nagios
The test we will use to test the testing service is deliberately simple. It is a Nagios-style plugin that checks whether a site supports the 'Uniform Execution Environment' (UEE) conventions. It looks for a /usr/ngs directory. If the directory is missing, this is an error; if it is empty, this warrants a warning; otherwise everything is OK.
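For the curious, a check along these lines might look something like the sketch below. This is not the actual WN-uee script, just a minimal illustration of the logic described above. It assumes the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and assumes that the 'error' case maps to CRITICAL; the function names and message wording are made up for the example.

```python
import os
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_uee(path="/usr/ngs"):
    """Return (exit_code, message) for a UEE directory check.

    Hypothetical helper: missing directory -> CRITICAL (assumed mapping
    for 'error'), empty directory -> WARNING, anything else -> OK.
    """
    if not os.path.isdir(path):
        return CRITICAL, "UEE CRITICAL - %s is missing" % path
    try:
        entries = os.listdir(path)
    except OSError as err:
        # The directory exists but cannot be read, so we cannot tell.
        return UNKNOWN, "UEE UNKNOWN - cannot read %s: %s" % (path, err)
    if not entries:
        return WARNING, "UEE WARNING - %s is empty" % path
    return OK, "UEE OK - %s contains %d entries" % (path, len(entries))


def main():
    """Print the one-line status Nagios expects and return the exit code."""
    code, message = check_uee()
    print(message)
    return code

# When installed as a plugin, the script would end with: sys.exit(main())
```

Nagios only looks at the exit code and the first line of output, which is why the check boils down to one tuple.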
We know that WLCG-Nagios uses a mixture of active and passive tests. Active tests deliver results immediately while the results of passive tests filter in slowly via the message broker.
Our initial plan was to extend the CE-probe tests. The CE-probe works by...
- building a compressed tar file containing some Nagios tests, a copy of Nagios to run them, and bits of Python to deliver the results to the message broker.
- generating a JDL file that describes how to fetch the tar file and run the tests within it.
The CE-probe allows additional directory trees to be added to the tar file, as long as they look rather like...
/usr/libexec/grid-monitoring/probes
|
`-- uk.ac.ngs
    `-- wnjob
        |-- uk.ac.ngs
        |   |-- etc
        |   |   `-- wn.d
        |   |       `-- uk.ac.ngs
        |   |           |-- commands.cfg
        |   |           `-- services.cfg
        |   `-- probes
        |       `-- uk.ac.ngs
        |           `-- WN-uee
        `-- uk.ac.ngs.gridJob.jdl.template
This is mostly directories and subdirectories. There are only four real files: WN-uee is the test script; the *.cfg files are Nagios configuration files describing how to run it; and the *.jdl.template file is used when writing the JDL.
Eagle-eyed readers may have noticed lots of uk.ac.ngs's scattered around.
This serves as a convenient namespace - it exists to stop files in this directory tree inadvertently overwriting those from another tree when the tar file is being created.
The convention used in WLCG Nagios is that the namespace should be your organisation written backwards. Argue not will I.
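As a small illustration of the convention, here is a sketch that derives the namespace and the probe directory from it. It assumes 'organisation written backwards' means the organisation's domain name with its labels reversed (so ngs.ac.uk becomes uk.ac.ngs); the helper names are invented for this example and are not part of WLCG Nagios.

```python
import posixpath

# Base directory taken from the tree shown above.
PROBES_ROOT = "/usr/libexec/grid-monitoring/probes"


def namespace_for(domain):
    """Reverse the labels of a domain name: 'ngs.ac.uk' -> 'uk.ac.ngs'."""
    return ".".join(reversed(domain.strip(".").split(".")))


def wnjob_tree(domain):
    """Path of the worker-node tree for this organisation (hypothetical helper)."""
    ns = namespace_for(domain)
    return posixpath.join(PROBES_ROOT, ns, "wnjob", ns)


print(namespace_for("ngs.ac.uk"))
print(wnjob_tree("ngs.ac.uk"))
```

Because every organisation's files live under its own reversed name, two trees can both ship a commands.cfg without treading on each other.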
Incorporating the new directories involves adding extra arguments to the CE-probe:
--add-wntar-nag-nosamcfg
--add-wntar-nag /usr/libexec/grid-monitoring/probes/uk.ac.ngs/wnjob/uk.ac.ngs
--jdl-templ /usr/libexec/grid-monitoring/probes/uk.ac.ngs/wnjob/uk.ac.ngs.gridJob.jdl.template
The first of these turns off the standard WLCG 'SAM' tests; the GridPP Nagios service is already checking those. The other two add our directory tree to the tar file and supply our JDL template.
At the time of writing, a grand total of one site has passed - congratulations Glasgow ScotGrid - and the test has flagged up a few sites that do not provide the UEE.
While we can claim one successful success and several successful failures, there are a lot of sites where the results have yet to arrive.
These laggards include all the old core NGS sites - all of which support UEE, but use the Virtual Data Toolkit rather than gLite for grid software. We have tested the test-test on one of these sites and know it works. The next step is to find out why the results are getting lost on the way home.