RMCS Documentation

eMinerals Grid Computing Framework -- Remote My Condor Submit (RMCS)

Introduction

RMCS is the grid computing framework developed by CCLRC in conjunction with other partners within the NERC-funded Environment from the Molecular Level (eMinerals) project. The primary driver for the project was to study key environmental problems at the molecular level, using a wide range of different simulation codes. Since so many different codes are relevant to the project, a reasonably generic grid computing client framework was developed. Furthermore, since the key output of the project was science, the emphasis of the technology work has very much been on usability and pragmatism, rather than on cutting-edge technology and computer science techniques (unless there was a compelling need).

Functionality

Essentially, RMCS takes a data-centric approach to grid computing. It combines compute, data and metadata management functionality, whilst shielding users from the underlying details as much as possible. It has also been designed to scale as far as possible within the constraints imposed by the underlying technologies.

The emphasis on data/metadata management in conjunction with scaling reflects the primary use case of grid computing within eMinerals: parameter sweeps and ensemble runs. In this scenario, tracking dozens or hundreds of jobs and the resulting data is a non-trivial task. Parameter sweeps are now being used in other science areas, as there is a growing realisation that this is an appropriate and effective use of grid resources.

A grid simulation run via the RMCS system can be viewed as having the following stages (sketched in code after the list):

  1. The job is "meta-scheduled" to an appropriate grid resource. The aim here is to maximise the throughput of calculations by minimising the time spent queuing for resources; again, this suits parameter sweep calculations, where the emphasis is primarily on job throughput.
  2. Once a resource has been chosen, a temporary working directory is created within the user's home space. The input files and executables for this calculation are then downloaded from the Storage Resource Broker (SRB).
  3. The job is scheduled within the machine's batch queue system.
  4. After the simulation has completed, the output files are uploaded back to the SRB along with metadata describing the calculation (creating a complete audit trail for the simulation).
  5. If the simulation code produces XML output, then key data can be extracted from the output file itself and stored as metadata within the database. This enables the user to view the most relevant results of a calculation without having to retrieve the file from the SRB, and it significantly simplifies the handling of output from related calculations (e.g. the calculation of susceptibility as a function of temperature), as the key parameters from all related calculations can be obtained trivially from the metadata.
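
As a rough illustration of how these stages fit together, the sketch below expresses the lifecycle in Python-style pseudocode. It is purely illustrative: the function and attribute names are assumptions and do not correspond to the actual RMCS internals.

# Illustrative sketch only: all names here are hypothetical and simply
# mirror the five stages described above; they do not reflect the real
# RMCS implementation.

def run_job(job, resources, srb, metadata_db):
    # 1. Meta-schedule: pick the resource expected to start the job soonest,
    #    to maximise throughput by minimising time spent queuing.
    resource = min(resources, key=lambda r: r.estimated_queue_time())

    # 2. Stage in: create a temporary working directory in the user's home
    #    space and download the inputs and executables from the SRB.
    workdir = resource.make_temp_dir(job.user)
    srb.download(job.input_collection, workdir)

    # 3. Submit the job to the machine's batch queue and wait for it.
    batch_id = resource.submit_to_batch(job, workdir)
    resource.wait_for(batch_id)

    # 4. Stage out: upload output files to the SRB and record metadata
    #    describing the calculation (the audit trail).
    srb.upload(workdir, job.output_collection)
    metadata_db.record(job, resource)

    # 5. If the code produced XML output, extract key values and store them
    #    as metadata so results can be inspected without fetching files.
    if job.has_xml_output:
        values = resource.extract_values(job.agentx_expressions, workdir)
        metadata_db.add_parameters(job, values)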

Architecture

RMCS can be viewed as a three tier application, with the tiers being:

  1. Client
  2. Application Server
  3. Grid

The use of a three-tier model introduces an application server, which allows the perennial grid problems of client dependencies and extensive firewall configuration to be avoided. This is a common architecture for web applications. Within RMCS, communication between the client and application server tiers is via web services, which is both firewall friendly and allows for extremely thin client interfaces. Since the client layer is mostly composed of web service invocation code, the functionality is easy to integrate into existing applications: a client can be written in any language with appropriate web service support. We refer to this as "Lightweight Grid Computing".
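
As an illustration of how thin such a client can be, the fragment below makes a single call to the application server using the Python zeep SOAP library. The service URL is a placeholder and the exact operation signature is an assumption; the point is simply the pattern of one remote call with no grid middleware on the client.

# A deliberately minimal "thin client": one SOAP call, no grid middleware.
# The WSDL URL below is a placeholder, not the real service location.
from zeep import Client

client = Client("https://rmcs.example.ac.uk/rmcs?wsdl")

# Ask the application server which compute resources it knows about.
print(client.service.getClusterList())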

The bottom tier of the RMCS architecture is "the Grid". RMCS currently relies on the following pieces of middleware:

  • Globus Toolkit 2 (or equivalent within VDT or GT4)
  • Access to a MyProxy credential repository
  • Storage Resource Broker client
  • RCommands (if metadata functionality is required)
  • AgentX (if automatic data extraction during post-processing is required)

User Interface

Within the context of eMinerals, the user interface to RMCS is a set of shell commands and higher-level helper scripts, which automate bulk job submission and the preparation of the input files within the SRB. However, as mentioned above, these shell tools merely invoke web services, so the same functionality can readily be integrated into existing GUIs or expert systems.

The philosophy with respect to web services is to keep them relatively coarse grained. Hence the web service API is both limited and high level. This has the advantage of making the system easy to use (for both users and developers), but does restrict users to the functionality offered. In particular, it should be noted that RMCS is not a framework for executing arbitrary workflows.

The web service API exposes the following RPC endpoints:

  • changePassword
  • submitJob
  • listJobs
  • listJobsByName
  • listJobsInState
  • cancelJob
  • removeJobDetails
  • getJobDetails
  • updateProxy
  • getClusterList
  • getCondorPoolList
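
The sketch below shows how a handful of these operations might be strung together from a Python client to submit and monitor a job. Only the operation names are taken from the list above; the WSDL location, argument names, return structures and the "RUNNING" state label are assumptions made for illustration.

# Illustrative client-side job submission and monitoring. Only the
# operation names come from the API list above; arguments, return
# values and the state label are assumptions.
import time
from zeep import Client

client = Client("https://rmcs.example.ac.uk/rmcs?wsdl")  # placeholder URL

# Submit a job described by an RMCS input file (see the example below).
with open("lattice.mcs") as f:
    job_id = client.service.submitJob(f.read())

# Poll periodically until the job is no longer listed as running.
while job_id in client.service.listJobsInState("RUNNING"):
    time.sleep(60)

# Retrieve the final details (exit status, timings, etc.) for the job.
print(client.service.getJobDetails(job_id))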

The input file used to describe the job and the associated data/metadata management is based on those used by Condor, which proved to be relatively easy for users to master. An example file is shown below:


Executable = lmto
notification = NEVER

pathToExe = /home/rty.eminerals/SIC_20_cml
preferredMachineList = lake.esc.cam.ac.uk grid-compute.oesc.ox.ac.uk grid-compute.leeds.ac.uk grid-data.rl.ac.uk

Input = scf

jobType = performance
numOfProcs = 1

Sforce = true

RDatasetID = 202

Sdir = /home/rty.eminerals/CaCuO2/Lattice
Sget = *
Srecurse = true

Sdir = /home/rty.eminerals/CaCuO2/Lattice
Sput = char_out out fort.41 e-ny output.xml
RDesc = Lattice Parameter

GetEnvMetadata = true

AgentX = Scaling,output.xml:/ParameterList[title='Input parameters']/Parameter[name='Lattice Scaling']
AgentX = Energy,output.xml:/Module[last]/Property[title='Total Energy']
AgentX = Convergence,output.xml:/Module[last]/Property[title='Convergence Factor']

Queue


Whilst it is not important to understand the details of the input file, it should be noted that the file is relatively short and simple compared with the functionality it enacts. It is also composed of a controlled and well-defined vocabulary, which would make automatic generation of these files within an existing GUI or expert system relatively straightforward.
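
As a concrete illustration of that point, the fragment below generates a minimal file of this kind from a Python dictionary. The keywords are copied from the example above and are not a complete or authoritative description of the RMCS vocabulary; the output file name is arbitrary.

# Sketch of generating an RMCS input file programmatically. The keywords
# are taken from the example above; this is not the full vocabulary.
job = {
    "Executable": "lmto",
    "notification": "NEVER",
    "pathToExe": "/home/rty.eminerals/SIC_20_cml",
    "Input": "scf",
    "jobType": "performance",
    "numOfProcs": 1,
    "Sdir": "/home/rty.eminerals/CaCuO2/Lattice",
    "Sput": "char_out out fort.41 e-ny output.xml",
    "RDesc": "Lattice Parameter",
}

with open("lattice.mcs", "w") as f:
    for keyword, value in job.items():
        f.write(f"{keyword} = {value}\n")
    f.write("Queue\n")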


Relevance for Large Scale Experimental Facilities and Computational Science

The strong emphasis on data and metadata management provides two compelling use cases. The first is enacting parameter sweeps, where the number of calculations required would otherwise lead to significant problems managing the number of files and the volume of data generated. In this context, the use of grid computing allows simulations to be carried out that would be essentially impossible using traditional methods.

The second use case is for simulations that have strong auditing requirements. As data and metadata functionality is combined with the computation functionality, it is much easier to determine how a specific output file was generated. This could be highly advantageous if the functionality were extended to capture the post-processing stages of experimental data from particular CCLRC Facilities.

Future Work

There are effectively three main work areas for the RMCS system:

  1. Hardening - There is a huge gulf between a system that is appropriate for a relatively small project such as eMinerals and an enterprise-quality system suitable for use by CSED and the Facilities. The issue of hardening is particularly difficult as RMCS has a number of dependencies (SRB, Condor, Globus) over which CCLRC has no control, and it is not clear how this can be addressed. Regardless, a significant amount of work relating to hardening, testing, robust fault tolerance/graceful failure and the development of debugging tools remains to be done on the components of the RMCS stack that we do control.
  2. User rollout/integration into client environments - Due to the mindset shift and learning curve associated with grid computing, significant effort is required to get new users up and running. It is particularly important that core developers do not spend all their time engaged in this work, as it is very time consuming. The danger is that development and bug-fixing work is paused while new users are trained; this then alienates existing users, who see that their problems are not being fixed or that the framework is stagnant.
  3. Generalisation - this is detailed in the following section.

More details of Generalisation Work

RMCS was developed for eMinerals, and hence work is required to make it applicable to CSED and the Facilities. Brief details of some of the work areas (in no particular order):

  • Metascheduler - the current implementation is very simple and can get "confused"; it needs further testing and an improved algorithm.
  • Deployed applications - currently all executables are pulled from the SRB at run time. The notion of versioning here (particularly w.r.t. MPI) needs significant work. In addition, for the Facilities work it makes sense to have a fixed set of applications pre-deployed, e.g. the CCP4 suite. Hence RMCS's metascheduler and persistent storage framework need to be extended to handle this.
  • Arbitrary pre- and post-processing scripts - users frequently request the ability to run arbitrary pre- or post-processing scripts within the RMCS framework. The challenge here is the limited environment inherited from the Globus fork() jobmanager.
  • Condor - RMCS cannot be used in production with Condor due to scaling problems related to the way Globus queries Condor's queue.
  • Data management - eMinerals do not consider SRB to be fit for purpose for production use. The work package here would be to adapt RMCS to work with different data grid technologies or network accessible filesystems.
  • User management - RMCS currently relies on config files for each user on each grid resource. Whilst this is acceptable for a small project such as eMinerals, this would entail an unacceptably high administrative burden for a large user base. In addition, RMCS should integrate with the SSO framework.
  • Scaling - federate MCS servers behind a single RMCS server in order to push the scaling limits.
  • Licensing/admin tools/packaging - these need to be provided if RMCS is to be distributed more widely.