Developing a Data Model for Computational Chemistry
Outline of discussion topics
On this page you will find a summary of the issues that have arisen so far in our
development of a data representation for quantum chemical data.
This outline text, and the more detailed discussions accessed via the links in this
document, are taken from a paper
Data modelling for computational chemistry - a methodology
by Philip Couch, Daresbury Laboratory.
As discussed in ECCP1Activities, we are hoping to develop a way to represent
quantum chemical data that will facilitate exchange of data between different
quantum chemistry packages, starting with those used within the CCP1 community
(e.g. GAMESS-UK and Molpro) but in the hope that this exercise can lead to a wider,
de facto standardisation effort in which other package developers will get involved.
To achieve this we welcome input during the development process from any other
groups involved in developing quantum chemistry packages, and also those with experience
of data model and data representation development in the chemistry area. To comment,
you can either use the mailing list,
if you want to send email to all involved parties, or take part in the
e-CCP1 open discussion forum.
Adoption of a markup language
The interoperability of computational codes running in Grid environments is currently hindered by a lack of data standards. The most straightforward level of standardisation is that of the data syntax; the eCCP1 project has proposed the adoption of XML to address this issue. However, XML imposes a hierarchical structure on the data that is not always appropriate, and a question remains as to whether a more suitable technology (such as the Resource Description Framework) can be used, perhaps alongside XML, to represent computational chemistry concepts.
At the moment, we are assuming that an XML-based approach offers the best way forward. See the detailed discussion.
The data model design
The data model can be designed in one of two extreme ways. The first involves the consideration of all the concepts of interest and the relationships between them, followed by the construction of a rigid data model that supports these. Tools can then be developed that are 'hard-wired' to work with implicit semantics. The second approach involves the design of components that relate to concepts, and provides methods to link these components together. But is there a natural granularity for the XML components?
We are following a component-oriented approach, each document comprising components
represented as defined by a number of individual schemas, with additional information provided to cross-reference
the components as required to describe complex, multi-component data. See details.
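As a rough illustration of the component-oriented approach described above, the following Python sketch reads a document assembled from two components, each belonging to its own schema namespace. The namespace URIs, element names and `id` attributes are hypothetical and are not taken from any eCCP1 or CML schema; only the standard-library `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, one per component schema (illustration only).
MOLECULE_NS = "http://example.org/eccp1/molecule"
BASIS_NS = "http://example.org/eccp1/basis"

# A container document holding two independently defined components.
DOCUMENT = f"""
<document>
  <mol:molecule xmlns:mol="{MOLECULE_NS}" id="m1"/>
  <bas:basisSet xmlns:bas="{BASIS_NS}" id="b1"/>
</document>
"""


def list_components(xml_text):
    """Return (namespace, local name, id) for each top-level component."""
    root = ET.fromstring(xml_text)
    components = []
    for child in root:
        # ElementTree stores qualified tags as "{namespace}localname".
        namespace, _, local_name = child.tag[1:].partition("}")
        components.append((namespace, local_name, child.get("id")))
    return components


components = list_components(DOCUMENT)
for namespace, local_name, component_id in components:
    print(f"{component_id}: <{local_name}> defined by {namespace}")
```

Because each component carries its own namespace, a tool could in principle validate or process one component without knowing about the others. The question of how fine-grained the components should be remains open.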
Existing XML languages with relevance to computational chemistry
Other XML languages exist with relevance to computational chemistry, such as the Chemical Markup Language, and some XML components can be based on these. The computational chemistry component of CML
is under development and some extensions are required to cover all of the required concepts. This raises questions as to which XML components can be based on CML and other languages, such as VTK, and which must be created. Consideration must also be given to choosing the most appropriate method for collaboratively designing such representations - eCCP1 is currently using UML for this purpose.
Status: From discussions at our meeting in April 2004 we conclude there is strong support for making use of the CML framework for representing chemical structure and molecular properties. We are now working on defining which extensions are needed, and candidates for other parts of the data model - details.
General data containers
In some cases it is a good idea to group concepts into classes and attempt to find a common representation for the class. For example, all scalar properties could be represented using one XML 'scalar' element-type. This approach can reduce the number of schema entries and the burden on application developers. It is not clear to what extent these general containers should be used with quantum chemistry data.
Status: We will follow the generic container model as defined by the CML specification, but welcome comments on
the wider issues of how to define new parts of the markup. Details.
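To show why a general container can reduce the burden on application developers, here is a minimal Python sketch in which one routine reads every scalar property, whatever it represents. The `title` and `units` attribute names and the numeric values are placeholders chosen for illustration; they are assumptions, not the CML definition of a scalar.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: attribute names and values are illustrative only.
FRAGMENT = """
<properties>
  <scalar title="totalEnergy" units="hartree">-76.0267</scalar>
  <scalar title="dipoleMoment" units="debye">1.85</scalar>
</properties>
"""


def read_scalars(xml_text):
    """Map each scalar's title to a (value, units) pair."""
    root = ET.fromstring(xml_text)
    scalars = {}
    for element in root.iter("scalar"):
        # The same code path handles every property: no per-property element.
        scalars[element.get("title")] = (float(element.text), element.get("units"))
    return scalars


scalars = read_scalars(FRAGMENT)
for title, (value, units) in scalars.items():
    print(f"{title} = {value} {units}")
```

The trade-off is that the meaning of each value now lives in an attribute rather than in the element name, so a schema alone cannot check that a required property is present.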
Proposed elements of the data representation
As we develop proposals for new components we will add them here.
Relationships between concepts
It is important to represent not only computational chemistry concepts, but also the relationships between them. For example, a way of assigning basis sets to atoms is required. There are several important types of relationship that need to be addressed, along with multiple methods that can be used to express them. The relative merits of each approach need to be carefully considered.
Status: This is an active area of research. We are exploring the W3C XLink specification and RDF as alternatives to
explicitly including identifiers and references in the data elements, details.
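The option of explicit identifiers and references can be sketched in Python as follows, using the basis-set assignment example. The element names and the `basisRef` attribute are hypothetical, invented for this illustration; XLink or RDF would express the same links differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: each atom points at a basis set through an
# illustrative "basisRef" attribute that holds the target's id.
FRAGMENT = """
<calculation>
  <basisSet id="b1" title="basis-A"/>
  <basisSet id="b2" title="basis-B"/>
  <atom id="a1" elementType="O" basisRef="b1"/>
  <atom id="a2" elementType="H" basisRef="b2"/>
  <atom id="a3" elementType="H" basisRef="b2"/>
</calculation>
"""


def assign_basis_sets(xml_text):
    """Resolve each atom's reference to the title of its basis set."""
    root = ET.fromstring(xml_text)
    basis_by_id = {b.get("id"): b.get("title") for b in root.iter("basisSet")}
    assignments = {}
    for atom in root.iter("atom"):
        ref = atom.get("basisRef")
        if ref not in basis_by_id:
            # A dangling reference is an error the application must catch itself.
            raise ValueError(f"atom {atom.get('id')} refers to unknown basis set {ref!r}")
        assignments[atom.get("id")] = basis_by_id[ref]
    return assignments


assignments = assign_basis_sets(FRAGMENT)
for atom_id, basis_title in assignments.items():
    print(f"{atom_id} -> {basis_title}")
```

In this style the relationship is embedded in the data elements and every application has to resolve the references itself, which is one reason to consider keeping the links outside the data.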
Metadata and provenance
In addition to representing data, there is also a requirement to represent metadata, such as information on the particular code that was used to calculate the data, details of the computational methods and individuals involved. CCLRC has produced a general scientific metadata model (CCLRC Scientific Metadata Model) that may fulfil the metadata and provenance requirements of the eCCP1 project.
Status: The merits and limitations of the CCLRC Scientific Metadata Model are under investigation, details.
-- PhilipCouch - 08 Nov 2004