RCOMMANDS

References [#!metadata!#,#!rcommands!#].

This chapter describes the RCommands. There is a more comprehensive tutorial available on the NGS Web site: http://wiki.ngs.ac.uk/index.php?title=RCommands_tutorial.

[To do: explain how to find what a topic means: does it have a description or a name?]

[To do: explain the use of sessionId.]

[To do: address the requirements for multiple collaborations, authentication (authn) and authorisation (authz).]

[To do: tidy up the code by separating the MCS endpoint etc. from the SOAP invocation.]

Summary

This chapter describes a set of client routines and commands collectively known as RCommands [#!metadata!#]. These tools are designed to create and edit metadata which is related to actual data stored separately, e.g. in SRB [#!srb!#].

RCommands are shell tools and associated Web services that provide metadata functionality from a user's desktop client or from Grid compute resources. RCommands are usually used with Grid job submission or data management tools such as RMCS, GROWL or SRB.

Metadata is important for a number of reasons:

Architecture

The metadata framework we have developed is based on a three-tier architecture, as shown in Figure 7.1. The three tiers are:

[re-write details.]

Client:
The client layer seen by most users is a set of command-line tools written in Perl using SOAP::Lite; these provide the RCommands shell tools described below. The tools were originally written in C using the gSOAP library, motivated by the requirement that they be as self-contained as possible so that they could easily be executed server-side via Globus. This requirement no longer applies, and it is easier to maintain the RMCS and RCommands scripts in Perl. Other client tools can still be built onto this layer.
Application Server:
The application server was originally written in Java, using Axis as the SOAP engine and JDBC for database access. The code essentially mapped procedural Web service calls to SQL appropriate for the underlying database. It has also been re-written in Perl, and the database has been replaced with a simple hierarchical file structure, similar in approach to a Perl-based Wiki such as TWiki.
RDBMS:
The back-end database was served by an Oracle 10g RAC cluster. Although Oracle was used, there was no requirement for any Oracle-specific functionality. The back end is currently being re-written to use XML; the Oracle database no longer exists, which removes its maintenance overhead.

One of the main reasons for using a three-tier model was that the back-end database was behind a firewall and thus could not be accessed directly from client machines. The application server provided the interface between the user's tools and the metadata database. SOAP messages were sent from the client tools to the application server via SSL-encrypted HTTP. The application server was authenticated using its certificate, while client requests were authenticated using usernames and passwords. The security described above still has to be applied to the new Perl implementation.

In addition to the rapid development afforded by the use of Web services, there is an additional requirement to allow the server side functionality to be exploited by members of the e-Minerals project interested in web services workflows.

Figure 7.1: 3-tier Architecture of the RCommands metadata framework.

Attributes

Version: 0.2.2
Public calls:
Public modules: RCommands
Other modules required: SOAP::Lite, (previously gSOAP)
Restrictions:
Date: 2006-8, 2009
Origin: P.R. Tyer, CCLRC Daresbury Laboratory, re-written by R.J. Allan, Daresbury Laboratory.
Language: Perl, (previously C)
Conditions on external use: Standard, see separate chapter

How to use the Package

The client package is included with the G-R-Toolkit and is configured to communicate with an available RCommands server which is currently the same as the MCS server used for RMCS.

Metadata Organisation Model

The conceptual data model we use has three levels within which metadata are organised. The top level is the study level, which is self-explanatory. It is possible to associate named collaborators with this level, enabling collaborators to view and edit the metadata within a study. The next level is the data set level. This is the most abstract level, and users are free to interpret this level in a variety of ways. Examples are given in Table 1. The third level is the data object level. This is the level that is associated with a specific URI, such as the address of a file or collection of files within the SRB, or an HTTP or FTP URL. The data object may be the files generated by a single simulation run, and/or the outputs from subsequent data analysis. In combinatorial or ensemble studies, it is anticipated that there will be many data objects associated with a single data set, and it is the data set level at which different types of calculations within a single study are organised.

A data object may be interpreted in the same way as the "research object" of Goble et al. [#!ref!#]. This has yet to be further explored.

This hierarchy of levels provides a high degree of flexibility for the individual scientists, each of whom will tailor it for different work patterns. Some scientists may generate a large number of study levels, whereas other scientists might put all their work into one study. We remark that our tools can attach metadata to each of the three levels as will be shown below.
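
As an illustration only, the three-level model can be sketched as a small set of Python data classes. The class and field names below are assumptions chosen for this sketch and are not part of the RCommands implementation; the example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataObject:
    """Lowest level: tied to a specific URI (e.g. an SRB collection)."""
    name: str
    uri: str
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class DataSet:
    """Middle level: groups the data objects of one type of calculation."""
    name: str
    description: str = ""
    parameters: Dict[str, str] = field(default_factory=dict)
    data_objects: List[DataObject] = field(default_factory=list)


@dataclass
class Study:
    """Top level: carries a description, topics and named collaborators."""
    name: str
    description: str = ""
    topics: List[str] = field(default_factory=list)
    collaborators: List[str] = field(default_factory=list)
    data_sets: List[DataSet] = field(default_factory=list)


# Hypothetical example values, not taken from a real study.
study = Study(name="Molecular dynamics simulation of silica under pressure")
data_set = DataSet(name="4096 atoms, potential model A")
data_set.data_objects.append(
    DataObject(name="run-001", uri="srb://Test/home/user.domain/run-001")
)
study.data_sets.append(data_set)

print(len(study.data_sets[0].data_objects))
```

In this sketch, metadata can be attached at each level: descriptions and topics on the study, and name/value parameters on data sets and data objects.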

The table below shows two examples of how the study, data set and data object levels are used to organise data in the e-Minerals project.

Level | Example 1 | Example 2
Study | Molecular dynamics simulation of silica under pressure | Ab initio study of dioxin molecules on clay surface
Data set | Identified by the number of atoms in the sample and the interatomic potential model used | Identified by the number of chlorine atoms in the molecule and the specific surface site
Data object | Collection on the SRB containing all input and output files | Collection on the SRB containing all input and output files

Figure 7.2: RCommands Use Case.

Metadata to Capture

Typically it is expected that metadata associated with the study and data set levels will be added by hand, with the automated metadata capture to be provided at the data object level (although we provide tools for metadata to be automatically captured for data sets). We define five types of metadata to capture:

Simulation metadata:
information such as the user who performed the run, the date, the computer run on etc.
Parameter metadata:
values for the parameters that control the simulation run, such as temperature, pressure, potential energy functions, cut-offs, number of time steps etc.
Property metadata:
values of various properties calculated in the simulation that would be useful for subsequent metadata searches, such as final or average energy or volume.
Code metadata:
information that the code developers deemed useful when creating the code to enable users to determine what produced the simulation data, such as code name, version and compilation options.
Arbitrary metadata:
strings that the user deems important when running the study to enable later indexing into the data, such as whether a calculation is a test or production run.

Collecting Metadata

The role of XML in output files

Much of the metadata we collect is harvested from output data files. To facilitate this, we have enabled our key simulation programs to write the main output files in XML. Specifically we use the Chemical Markup Language [#!ref!#]. The vast majority of our simulation codes have been written in Fortran, and the e-Minerals project has developed a Fortran 95 library called FoX to provide routines for writing well-formed XML[#!ref!#].

CML specifies a number of XML element types for representing lists of data. We make use of three of these:

metadataList:
This list contains certain general properties of the simulation, such as code version;
parameterList:
This list contains parameters associated with the simulation;
propertyList:
This list contains properties computed by the simulation.

It is usual to have more than one of each list, particularly the propertyList. Clearly these lists correspond closely to the metadata types described here.

Automatic metadata capture within a Grid computing environment

The e-Minerals project scientists run their simulations within the e-Minerals mini-Grid using the RMCS tool, see Chapter 6. This is a generic tool that will work on any compute Grid infrastructure that has Globus and the SRB and metadata client tools installed (e.g. NW-GRID or NGS).

As noted in Chapter 6, RMCS has a Condor-like interface for the user. It has the standard Condor commands for running jobs, with additional commands for managing data within the SRB. RCommands provides additional commands for accessing the metadata database.

Here we give a brief introduction to the metadata capture processes. RMCS deals with four different types of metadata capture:

  1. An arbitrary text string specified by the user.
  2. Environment metadata automatically captured from the submission and execution environment including details such as machine names, submission and completion dates, executable name, etc.
  3. Metadata extracted from the first metadataList and parameterList elements described in the previous section.
  4. Additional metadata extracted from the XML documents according to given specifications. These specifications take the form of expressions with a syntax similar to XPath expressions. These are parsed by MCS and broken down into a single term used to provide the context of this metadata (such as "FinalEnergy") and a series of calls to be made to the AgentX library, see Chapter 8.
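
The second kind of capture (environment metadata) can be illustrated with a minimal Python sketch. This is not how RMCS gathers the information; the function name, the dictionary keys and the example executable and host names are all assumptions made for illustration.

```python
import platform
from datetime import datetime, timezone


def capture_environment(executable, submit_host=None):
    """Collect a few environment details as name/value pairs.

    The key names are hypothetical; RMCS may record different items.
    """
    return {
        "executable": executable,
        # Fall back to the local machine name if no host is given.
        "submit_host": submit_host or platform.node(),
        "submitted": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


env = capture_environment("DLPOLY.X", submit_host="desktop.example.org")
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

Each name/value pair produced this way could then be attached to a data object in the same manner as any other parameter.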

Post-processing using the RParse tool

Although, by design, XML output captures all information associated with the inputs and outputs of a simulation, it is inevitable that the automatic tools may miss some of the metadata. The nature of research work dictates that it is not always obvious at the start of a piece of work what properties are of most interest. Thus we have developed a tool to scan over the XML files contained within the data objects in a single data set to extract metadata based on CML parameters.

[frame=single]
<?xml version="1.0" encoding="UTF-8"?>
<cml xmlns="http://www.xml-cml.org/schema"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <metadataList>
    <metadata name="identifier" content="DL_POLY version 3.06 / March 2006"/>
  </metadataList>
  <parameterList title="control parameters">
    <parameter title="simulation temperature" name="simulation temperature"
               dictRef="dl_poly:temperature">
      <scalar dataType="xsd:double" units="dl_polyUnits:K">50.0</scalar>
    </parameter>
  </parameterList>
  <propertyList title="rolling averages">
    <property title="total energy" dictRef="dl_poly:eng_tot">
      <scalar dataType="xsd:double" units="dl_polyUnits:eV_mol.-1">-2.7360E+04</scalar>
    </property>
  </propertyList>
</cml>

Figure X. Extracts of a CML output file, showing examples of the metadataList, parameterList and propertyList containers.
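
To make the harvesting step concrete, the following Python sketch reads the three CML containers from an extract like the one above using the standard library XML parser. It is an illustration only and does not use the AgentX library; the function name `harvest` and the choice of the `title` attribute as the metadata name are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace used by the CML extract shown in the figure.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

SAMPLE = """<cml xmlns="http://www.xml-cml.org/schema"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <metadataList>
    <metadata name="identifier" content="DL_POLY version 3.06 / March 2006"/>
  </metadataList>
  <parameterList title="control parameters">
    <parameter title="simulation temperature" dictRef="dl_poly:temperature">
      <scalar dataType="xsd:double" units="dl_polyUnits:K">50.0</scalar>
    </parameter>
  </parameterList>
  <propertyList title="rolling averages">
    <property title="total energy" dictRef="dl_poly:eng_tot">
      <scalar dataType="xsd:double" units="dl_polyUnits:eV_mol.-1">-2.7360E+04
      </scalar>
    </property>
  </propertyList>
</cml>"""


def harvest(xml_text):
    """Return name/value pairs found in the three CML list containers."""
    root = ET.fromstring(xml_text)
    harvested = {}
    # metadataList entries carry their value in the 'content' attribute.
    for item in root.findall("cml:metadataList/cml:metadata", CML_NS):
        harvested[item.get("name")] = item.get("content")
    # parameterList and propertyList entries wrap their value in <scalar>.
    for tag in ("parameter", "property"):
        for item in root.findall(f"cml:{tag}List/cml:{tag}", CML_NS):
            scalar = item.find("cml:scalar", CML_NS)
            if scalar is not None and scalar.text:
                harvested[item.get("title")] = scalar.text.strip()
    return harvested


for name, value in harvest(SAMPLE).items():
    print(f"{name} = {value}")
```

The resulting pairs have the same name=value shape that Rannotate accepts with its -p option.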

The RParse tool allows automatic metadata ingestion after the simulation has finished and the data have been stored. The tool trawls specified collections, e.g. within SRB space, downloads key XML output files and harvests metadata for the user. This metadata is then inserted into a specified study/data set within the metadata database, along with the URI pointing to the relevant data file within the SRB. This functionality is achieved by using the Scommands (SRB unix shell client tools), the RCommands shell tools, and the AgentX library. The user needs to specify a root collection in SRB, the output files of interest, AgentX query expressions, and the data set into which the metadata is to be inserted. Using this tool, it has been possible to automatically insert metadata for hundreds of XML files that were generated before the MCS metadata functionality became available.

Note: RParse is not yet implemented in this version of G-R-Toolkit.

Specification of RCommands

We have developed a set of scriptable UNIX line commands as the primary client interface to the metadata database. These enable users to perform functions such as uploading metadata to the database and listing metadata items. The various commands, referred to as RCommands, are defined in the table below.

For information about the use of .sessionId see Chapter 6.

Note: issues related to authentication, authorisation and collaborators are not yet implemented in this version.

Command | Description
Rinit | Starts an RCommand session by setting up session files.
Rpasswd | Changes the password for access to the metadata database.
Rcreate | Creates study, data set and data object levels, associating the lower levels with the level immediately above and adding a name to each level. When creating a study it also adds a metadata description and a topic association; when creating a data object it associates a URI.
Rannotate | Adds metadata. In the case of studies or data sets, this enables a metadata description, and in the case of data sets and data objects it also enables metadata name/value pairs. It also enables more topics to be associated with a study.
Rls | Lists entities within the metadata database. With no parameters, it lists all studies, and with parameters it will list the entries within a study or data set level. It can also be used to list all possible collaborators or science topics.
Rget | Gives the metadata associated with a given study, data set or data object. In the case of a study, it can also list associated collaborators and science topics.
Rrm | Removes entities or parameters from the metadata database.
Rchmod | Adds or removes co-investigators from a study.
Rsearch | Searches the metadata database, tuned to search within different levels and against descriptions, name/value pairs and parameters.
Rexit | Ends an RCommand session, cleaning up session files.

Rannotate

NAME Rannotate - Annotates entities within the metadata database
SYNOPSIS Rannotate -v

Rannotate -s studyID -t topicID

Rannotate -s studyID | -d dataSetID | -o dataObjectID -k description

Rannotate -d dataSetID | -o dataObjectID [-f] -p name=value

DESCRIPTION Used to annotate different entities within the metadata database. Topics can be assigned to studies, and parameters (name/value pairs) can be attached to either a data set or a data object. The description field of a study or data set can be updated using the -k flag.
OPTIONS -v Prints version string and exits

-s studyID Specifies study to annotate

-d dataSetID Specifies data set to annotate

-o dataObjectID Specifies data object to annotate

-t topicID TopicID to add to study

-k description Description to add to either study or data set

-p name=value Name/ value to add to data set/ data object

-f If used with -p option, then forces overwrite of existing parameter

EXIT STATUS Rannotate returns zero on success or non zero if there is an error.

Rcd

Rcd changes to a different active session.

Rchmod

This is currently a dummy routine.

NAME Rchmod - Adds or removes investigators to/ from a study
SYNOPSIS Rchmod -v

Rchmod -s studyID +c|-c personID

DESCRIPTION Adds or removes investigators to/ from a study
OPTIONS -v Prints version string and exits

-s studyID StudyID to modify investigator list

-c personID Removes corresponding person as an investigator

+c personID Adds corresponding person as an investigator

EXIT STATUS Rchmod returns zero on success or non zero if there is an error.

Rcreate

NAME Rcreate - Creates metadata objects
SYNOPSIS Rcreate -v

Rcreate -n name -k description -t topicID

Rcreate -s studyID -n name

Rcreate -d dataSetID -n name -u url

DESCRIPTION Creates either study, data set or data object.
OPTIONS -v Prints version string and exits

-s studyID StudyID to create data set in

-d dataSetID DataSetID to create data object in

-n name Name of study, data set or data object

-k description Description of study

-t topicID Initial topic ID for study

-u url URL of data object

EXIT STATUS Rcreate returns zero on success or non zero if there is an error.

Rexit

This is currently a dummy routine.

NAME Rexit - Finishes RCommand session
SYNOPSIS Rexit [-v]
DESCRIPTION Rexit removes the shell session file (/.rcommands/rcommand.pid) and contacts the RCommand server to invalidate the session key. If Rexit is not used, the session key will expire one hour after it was created.
OPTIONS -v Prints version string and exits
EXIT STATUS Rexit returns zero on success or non zero if there is an error.
FILES /.rcommands/rcommand.config - RCommand configuration information /.rcommands/rcommand.pid - Session key for shell with pid

Rfind

For test purposes only, used to locate a study, data set or object.

Rget

NAME Rget - Displays metadata associated with particular entity
SYNOPSIS Rget -v

Rget -s studyID [-c|-t]

Rget -d dataSetID | -o dataObjectID [-p]

DESCRIPTION Shows metadata associated with metadata objects or their parameters.
OPTIONS -v Prints version string and exits

-s studyID Selects study to show metadata

-d dataSetID Selects data set to show metadata

-o dataObjectID Selects data object to show metadata

-c If used with -s, will list investigators associated with study

-t If used with -s, will list topics associated with study

-p If used with -d or -o, will list parameters associated with either data set or data object

EXIT STATUS Rget returns zero on success or non zero if there is an error.

Rinit

This is currently a dummy routine.

NAME Rinit - Starts RCommand session
SYNOPSIS Rinit [-v]
DESCRIPTION Rinit reads in the config information from /.rcommands/rcommand.config, authenticates with the RCommand server obtaining a session key which is then stored in /.rcommands/rcommand.shell pid. This session key is valid for one hour and is specific to the shell instance within which Rinit was executed.
OPTIONS -v Prints version string and exits
EXIT STATUS Rinit returns zero on success or non zero if there is an error.
FILES /.rcommands/rcommand.config - RCommand configuration information /.rcommands/rcommand.pid - Session key for shell with pid

Rls

NAME Rls - Lists different entities within metadata database
SYNOPSIS Rls [-v | -c | -t]

Rls -s studyID

Rls -d dataSetID

DESCRIPTION Rls lists entities within the metadata database. With no arguments, it will list all studies where the user is either the originator or an investigator. With -c or -t options, it lists the people or topics, respectively, within the database. The -s will list the data sets within the specified study, while -d will list the data objects within the specified data set.

OPTIONS -v Prints version string and exits

-s studyID Lists data sets within a given study.

-d dataSetID Lists data objects within a given data set.

-t Lists topics within database.

-c Lists people (colleagues/ collaborators) within database.

EXIT STATUS Rls returns zero on success or non zero if there is an error.

FILES /.rcommands/rcommand.config - RCommand configuration information /.rcommands/rcommand.pid - Session key for shell with pid

Rpasswd

This is currently a dummy routine.

NAME Rpasswd - Changes RCommand password
SYNOPSIS Rpasswd [-v]
DESCRIPTION Rpasswd changes user RCommand password both on the RCommand server and within the user configuration file.
OPTIONS -v Prints version string and exits
EXIT STATUS Rpasswd returns zero on success or non zero if there is an error.
FILES /.rcommands/rcommand.config - RCommand configuration information

Rpwd

Prints out current sessionId.

Rrm

This is currently a dummy routine.

NAME Rrm - Removes different entities from metadata database
SYNOPSIS Rrm [-v]

Rrm -s studyID [-f]

Rrm -s studyID -t topicID

Rrm -d dataSetID | -o dataObjectID [-p paramName]

DESCRIPTION Rrm removes entities or parameters from within the metadata database.

OPTIONS -v Prints version string and exits

-s studyID Specifies study either to delete, or to remove topics from if used in conjunction with -t option.

-d dataSetID Specifies data set to remove, or data set parameter to remove if used in conjunction with -p option.

-o dataObjectID Specifies data object to remove, or data object parameter to remove if used in conjunction with -p option.

-t topicID Specifies topic to remove from study topic list.

-p paramName Used with -d or -o options in order to remove data set or data object parameters.

-f If used with -s, will force deletion of non-empty study.

EXIT STATUS Rrm returns zero on success or non zero if there is an error.

Rsearch

NAME Rsearch - Searches data set and data objects for parameters
SYNOPSIS Rsearch -v

Rsearch -u url

Rsearch -t topicID

Rsearch -s studyID | -d dataSetID -p name=value

Rsearch -s studyID | -d dataSetID -p name<value

Rsearch -s studyID | -d dataSetID -p name>value

Rsearch [ -d dataSetID | -o dataObjectID ] -k keyword

DESCRIPTION Searches for entities within the metadata database. Can search for studies by topic. Can search for keywords within study and/or data set metadata. Can search for specified parameters attached to data sets and/or data objects. Can search for data objects with a specific url. If all values of a parameter with a given name are numerical, then the < and > operators can be used.
OPTIONS -v Prints version string and exits

-u url Searches for data object with specified url

-s studyID Specifies study to search

-d dataSetID Specifies data set to search

-t topicID Searches for studies with this topicID

-p name<value | name=value | name>value Parameter to search for

-k keyword Specifies keyword to search for

NOTES If you use Rsearch -p with the = operator, then it will treat the value as a string. Hence values of 0.2 and 2e-1 will not match. In contrast, the < and > operators should work fine with a mixture of standard and scientific notation.
EXIT STATUS Rsearch returns zero on success or non zero if there is an error.
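
The difference between string matching and numeric comparison described in the NOTES entry can be illustrated in a few lines of Python. This sketch only demonstrates the general behaviour of the two kinds of comparison; it is not the server's search code, and the function names are assumptions.

```python
def string_match(stored, wanted):
    """Exact string comparison: the values are compared character by character."""
    return stored == wanted


def numeric_less_than(stored, wanted):
    """Numeric comparison: both values are converted to floats first."""
    return float(stored) < float(wanted)


print(string_match("0.2", "2e-1"))        # False: the two strings differ
print(float("0.2") == float("2e-1"))      # True: the two numbers are equal
print(numeric_less_than("2e-1", "0.25"))  # True: 0.2 is less than 0.25
```

In practice this means that a value should be written in the same notation when it is stored and when it is searched for by equality.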

Example

Username

You will need a username to provide you with access to the RCommands database: this will be provided by the database manager.

Create the configuration files

You need to create a file of the name /.rcommands/rcommand.config, which has the form

[frame=single]
username = <your username in the RCommands database>
password = <your password>
cacertdir = /etc/grid-security/certificates
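
As an illustration of the file format shown above, the following Python sketch reads "key = value" lines and checks that the three expected keys are present. The function name and the example values are hypothetical, and the real tools may parse the file differently.

```python
def parse_rcommand_config(text):
    """Parse 'key = value' lines into a dictionary, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical example values; substitute your own username and password.
example = """
username = alice
password = not-a-real-password
cacertdir = /etc/grid-security/certificates
"""

config = parse_rcommand_config(example)
missing = [k for k in ("username", "password", "cacertdir") if k not in config]
print(missing)  # prints [] when all three keys are present
```

Since the file contains a password, it is sensible to make it readable only by its owner.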

Initiating an RCommand session

You initiate an RCommand session using the Rinit command. You can test that all is well by typing the Rls command: it will return a message telling you about any studies you have. [Note: this does not match the specification, under which Rls with no arguments should list the studies in the current session.] To get information about other commands, you can use the unix man command or consult the specification above. [Note: typing a command name with no arguments does not currently give this information.]

Creating a study

First use the Rcreate command to create a study level. To use Rcreate you will need to give the study a name, add a description and assign it to a topic as follows:

[frame=single]
Rcreate -n <name> -k <description> -t <topicID>

First you should think about the topic. You can list all topics with the command Rls -t. [To do: add this to the Rls command.]

Choose a topic and note the number - this will be the topicID label. Run the Rcreate command to create a study. The name and description labels can contain more than one word within quotes. For example, suppose we want to create a database entry containing a set of workshop papers; we might do this as follows:

[frame=single]
Rcreate -n "Workshop papers" -k "Papers for workshop" -t 4

We can check that this has worked by running the Rls command. [To do: cache the last studyId etc.] This will return information like:

[frame=single]
-------------------------
StudyID: 1026
Name: Workshop papers
-------------------------

where the StudyID number will differ for different people. Now we can look at this in more detail using the Rget command:

[frame=single]
Rget -s studyID

where you add your StudyID number. For the example above:

[frame=single]
Rget -s 1026

gives:

[frame=single]
-------------------------
StudyID: 1026
Name: Workshop papers
Description: Papers for workshop
Created by: martin dove
Status: In Progress
Start_date: 07-01-2006
-------------------------

Adding data sets with metadata

Now we want to add some data sets to the study. Continuing the example of workshop papers (PDF publications), we could create some data sets by:

[frame=single]
Rcreate -s 1026 -n "Papers on Grid computing"
Rcreate -s 1026 -n "Papers on data management"
Rcreate -s 1026 -n "Papers on collaborative tools"
Rcreate -s 1026 -n "Papers on escience applications"

Each invocation will create a DataSetID, as will be echoed to the screen. Now check on the results of these commands by:

[frame=single]
Rls -s 1026

This will show you the DataSetID for each data set (again, different users will get different numbers). You can look at any one data set by using the command:

[frame=single]
Rget -d DataSetID

where you use the appropriate number of each DataSetID.
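
When many data sets are needed, the Rcreate invocations shown earlier could be scripted. The Python sketch below builds the documented command line (Rcreate -s studyID -n name) for each name and checks the exit status, which is zero on success. The function names are assumptions, and the example uses a stand-in runner that only prints the commands, so nothing is executed.

```python
import shlex
import subprocess
from types import SimpleNamespace


def create_data_sets(study_id, names, runner=subprocess.run):
    """Call Rcreate once per data set name; return the names that failed."""
    failed = []
    for name in names:
        result = runner(["Rcreate", "-s", str(study_id), "-n", name])
        if result.returncode != 0:  # non-zero exit status signals an error
            failed.append(name)
    return failed


def dry_run(argv):
    """Stand-in runner: print the command instead of executing it."""
    print(shlex.join(argv))
    return SimpleNamespace(returncode=0)


failed = create_data_sets(
    1026,
    ["Papers on Grid computing", "Papers on data management"],
    runner=dry_run,
)
print(failed)
```

Omitting the runner argument makes the function call the real Rcreate command, which requires the RCommands tools to be installed and a session to have been started with Rinit.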

Now we will add some metadata against each data set. For this we use the Rannotate command. The first is to add a brief description to the data set. In my example, running Rls -s 1026 gives:

[frame=single]
-------------------------
Data Set ID: 26
Data Set Name: Papers on grid computing
Parent StudyID: 1026
-------------------------
Data Set ID: 27
Data Set Name: Papers on data management
Parent StudyID: 1026
-------------------------
Data Set ID: 28
Data Set Name: Papers on collaborative tools
Parent StudyID: 1026
-------------------------
Data Set ID: 29
Data Set Name: Papers on escience applications
Parent StudyID: 1026
-------------------------
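
For scripting purposes, a listing in the format shown above can be turned into a list of records. The Python sketch below assumes that records are separated by lines of dashes and that each field has the form "label: value", as in the sample output; the function name is hypothetical and the actual output format may change.

```python
def parse_records(listing):
    """Split a dashed listing into a list of {label: value} dictionaries."""
    records, current = [], {}
    for line in listing.splitlines():
        line = line.strip()
        if line and set(line) == {"-"}:  # a separator line made only of dashes
            if current:
                records.append(current)
                current = {}
        elif ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


listing = """\
-------------------------
Data Set ID: 26
Data Set Name: Papers on grid computing
Parent StudyID: 1026
-------------------------
Data Set ID: 27
Data Set Name: Papers on data management
Parent StudyID: 1026
-------------------------
"""

records = parse_records(listing)
ids_by_name = {r["Data Set Name"]: int(r["Data Set ID"]) for r in records}
print(ids_by_name)
```

The mapping from data set name to ID is convenient when the IDs are needed for later Rannotate or Rget calls.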

[To do: add the extra horizontal lines between data set metadata listings.]

We can use the Rannotate command in two ways. First we can add a description to the data set. An example is:

[frame=single]
Rannotate -d 29 -k "Collection of papers on escience applications"

Second we can add some parameters as name=value pairs. My example is:

[frame=single]
Rannotate -d 29 -p topic=escience
Rannotate -d 29 -p topicarea=applications

Running the Rget -d 29 command to view the metadata gives:

[frame=single]
-------------------------
DataSetID: 29
Name: Papers on escience applications
Parent StudyID: 1026
Created by: martin dove
Creation_date: 07-01-2006
Description: Collection of papers on escience applications
-------------------------

[To do: add in the parent study ID.]

[To do: do *not* print the name=value pairs (why?).]

Note that this shows the description but not the name/value pairs. To see the name/value pairs I need to use the command Rget -d 29 -p, which yields:

[frame=single]
-------------------------
Parameter Name: topic
Parameter Value: escience
-------------------------
Parameter Name: topicarea
Parameter Value: applications
-------------------------

You can repeat this for other data sets, and you can add whatever name=value pairs you like.

Adding data objects with metadata

Finally we reach the point where we can add metadata to the data objects. You first need to have data stored somewhere; in the e-Minerals case our data was in the SRB. The data object can be either a file or a collection of files within the SRB. The command for adding a new data object with metadata is:

[frame=single]
Rcreate -u <url> -d <dataSetID> -n <name>

The url specifies where the file is and has the form:

[frame=single]
srb://<zone>/<collection>/<object>

In general <collection> is composed of:

[frame=single]
/home/<username>.<domain>/<subcollection1>/.../<subcollectionN>

An example might be:

[frame=single]
srb://Test/home/nieessrb40.srbdom/test.dat

The dataSetID gives the data set that you want to associate the file with, and name is the name you want to give the data object.
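
The structure of such a URL can be checked with the Python standard library. The sketch below splits the example URL into zone, collection and object, following the srb://<zone>/<collection>/<object> form given above; the function name is an assumption.

```python
import posixpath
from urllib.parse import urlsplit


def split_srb_url(url):
    """Return (zone, collection, object) for an srb:// URL."""
    parts = urlsplit(url)
    if parts.scheme != "srb":
        raise ValueError(f"not an srb:// URL: {url}")
    # The last path component is the object; the rest is the collection.
    collection, obj = posixpath.split(parts.path)
    return parts.netloc, collection, obj


zone, collection, obj = split_srb_url("srb://Test/home/nieessrb40.srbdom/test.dat")
print(zone)        # Test
print(collection)  # /home/nieessrb40.srbdom
print(obj)         # test.dat
```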

You then add metadata with the Rannotate command in the same way that you added name/value pair metadata to the data set:

[frame=single]
Rannotate -o dataObjectID -p <name>=<value>

where you get the dataObjectID from the data set using the command Rls -d dataSetID. Hopefully by now you are getting more familiar with the various ID labels: studyID, dataSetID and now dataObjectID for the study, data set and data object respectively.

As before, you can use the Rget command to get the metadata from a data object:

[frame=single]
Rget -o <dataObjectID> -p

Searching on the metadata

The power of metadata comes down to what you do with it. The RCommands provide for this with the Rsearch command. There are several ways to use this command:

[frame=single]
Rsearch -s studyID -p <name>=<value>
Rsearch -d dataSetID -p <name>=<value>
Rsearch -d dataSetID -k <keyword>
Rsearch -o dataObjectID -k <keyword>

Once you have created enough metadata you can experiment with the Rsearch command.
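
When searches are issued from a script, it can help to build the argument list in one place. The Python sketch below constructs Rsearch command lines of the forms listed above. The function name is hypothetical, nothing is executed, and the sketch does not check which combinations of options the server accepts.

```python
import shlex


def rsearch_args(level_flag, entity_id, parameter=None, keyword=None):
    """Build an Rsearch argument list for a parameter or keyword search."""
    if level_flag not in ("-s", "-d", "-o"):
        raise ValueError("level_flag must be -s, -d or -o")
    if (parameter is None) == (keyword is None):
        raise ValueError("give exactly one of parameter or keyword")
    argv = ["Rsearch", level_flag, str(entity_id)]
    if parameter is not None:
        name, value = parameter
        argv += ["-p", f"{name}={value}"]
    else:
        argv += ["-k", keyword]
    return argv


print(shlex.join(rsearch_args("-d", 29, parameter=("topic", "escience"))))
print(shlex.join(rsearch_args("-d", 29, keyword="escience")))
```

The returned list can be passed directly to subprocess.run once the RCommands tools are installed and a session is active.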

Rob Allan 2009-11-10