Developing a Data Model for Computational Chemistry

Outline of discussion topics

On this page you will find a summary of the issues that have arisen so far in our development of a data representation for quantum chemical data. This outline text, and the more detailed discussions accessed via the links in this document, are taken from the paper "Data modelling for computational chemistry - a methodology" by Philip Couch, Daresbury Laboratory.

As discussed in ECCP1Activities, we are hoping to develop a way to represent quantum chemical data that will facilitate the exchange of data between different quantum chemistry packages. We are starting with those used within the CCP1 community (e.g. GAMESS-UK and Molpro), in the hope that this exercise can lead to a wider, de facto standardisation effort in which other package developers will get involved. To achieve this we welcome input during the development process from any other groups involved in developing quantum chemistry packages, and also from those with experience of data model and data representation development in the chemistry area. To make your comments you can either use the mailing list, if you want to send email to all involved parties, or take part in the e-CCP1 open discussion forum.

Adoption of a markup language

The interoperability of computational codes running in Grid environments is currently hindered by a lack of data standards. The simplest and most straightforward level of standardisation is that of the data syntax; the eCCP1 project has proposed the adoption of XML to address this issue. However, XML imposes a hierarchical structure on the data that is not always appropriate, and a question remains as to whether a more suitable technology, such as the Resource Description Framework (RDF), can be used (perhaps alongside XML) to represent computational chemistry concepts.
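
As a rough illustration of why a strict hierarchy can be awkward, the sketch below records a few statements as subject-predicate-object triples, the general shape used by RDF. It uses plain Python tuples rather than an RDF library, and every identifier in it (calc:scf, geom:g1, usesGeometry and so on) is a hypothetical name invented for this example, not part of any eCCP1 proposal.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers here are hypothetical and used only for illustration.
triples = {
    ("calc:scf", "usesGeometry", "geom:g1"),
    ("calc:mp2", "usesGeometry", "geom:g1"),
    ("calc:mp2", "startsFrom", "calc:scf"),
}


def subjects_of(predicate, obj):
    """Return, sorted, every subject linked to obj by predicate."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


def objects_of(subject, predicate):
    """Return, sorted, every object linked from subject by predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


# One geometry is shared by two calculations. In a tree it would have to be
# nested under one of them (or duplicated); here both simply point to it.
sharing = subjects_of("usesGeometry", "geom:g1")
parents = objects_of("calc:mp2", "startsFrom")
print(sharing, parents)
```

In a purely hierarchical document the shared geometry needs either duplication or an explicit reference mechanism, which is the issue taken up under "Relationships between concepts".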

Status: At the moment, we are assuming that an XML-based approach offers the best way forward. See the detailed discussion.

The data model design

The data model can be designed in one of two extreme ways. The first involves the consideration of all the concepts of interest and the relationships between them, followed by the construction of a rigid data model that supports these. Tools can then be developed that are 'hard-wired' to work with the semantics implicit in that model. The second approach involves the design of components that relate to individual concepts, together with methods for linking these components. This raises a question: is there a natural granularity for the XML components?

Status: We are following a component-oriented approach. Each document comprises components, each defined by one of a number of individual schemas, with additional information provided to cross-reference the components as required to describe complex, multi-component data. See details.

Existing XML languages with relevance to computational chemistry

Other XML languages exist with relevance to computational chemistry, such as the Chemical Markup Language (CML), and some XML components can be based on these. The computational chemistry component of CML is under development and some extensions are required to cover all of the required concepts. This raises questions as to which XML components can be based on CML and other languages, such as VTK, and which must be created. Consideration must also be given to choosing the most appropriate method for collaboratively designing such representations - eCCP1 is currently using UML for this purpose.

Status: From discussions at our meeting in April 2004 we conclude there is strong support for making use of the CML framework for representing chemical structure and molecular properties. We are now working on defining which extensions are needed, and candidates for other parts of the data model - details.

General data containers

In some cases it is a good idea to group concepts into classes and attempt to find a common representation for the class. For example, all scalar properties could be represented using one XML 'scalar' element-type. This approach can reduce the number of schema entries and the burden on application developers. It is not clear to what extent these general containers should be used with quantum chemistry data.
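
The idea of a single 'scalar' element-type can be sketched as follows, using Python's standard xml.etree.ElementTree module. This is only a minimal illustration under stated assumptions: the attribute names (dictRef, dataType, units), the ex: prefixes, the propertyList wrapper and the numerical values are placeholders chosen for the example and should not be read as the CML specification or as an agreed eCCP1 schema.

```python
import xml.etree.ElementTree as ET


def make_scalar(dict_ref, value, data_type, units=None):
    """Build one generic <scalar> element for any scalar property.

    The attribute names are assumptions made for this sketch.
    """
    attrs = {"dictRef": dict_ref, "dataType": data_type}
    if units is not None:
        attrs["units"] = units
    elem = ET.Element("scalar", attrs)
    elem.text = str(value)
    return elem


def read_scalars(parent):
    """Return {dictRef: (value, units)} for every <scalar> child of parent."""
    converters = {"xsd:double": float, "xsd:integer": int}
    result = {}
    for elem in parent.findall("scalar"):
        # Fall back to a plain string when the data type is not recognised.
        convert = converters.get(elem.get("dataType"), str)
        result[elem.get("dictRef")] = (convert(elem.text), elem.get("units"))
    return result


# Two unrelated properties share the same element-type; only the
# attributes differ. The values are placeholders, not real results.
properties = ET.Element("propertyList")
properties.append(make_scalar("ex:totalEnergy", -76.0267, "xsd:double", "ex:hartree"))
properties.append(make_scalar("ex:scfIterations", 12, "xsd:integer"))

scalars = read_scalars(properties)
print(ET.tostring(properties, encoding="unicode"))
```

One reader function handles every scalar property, which is where the saving for application developers comes from. The cost is that the meaning of each value now lives in the dictionary reference rather than in the element name, so a schema alone can no longer check that a given property is present.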

Status: We will follow the generic container model as defined by the CML specification, but welcome comments on the wider issues of how to define new parts of the markup. See details.

Proposed elements of the data representation

As we develop proposals for new components we will add them here.

Relationships between concepts

It is important to represent not only computational chemistry concepts, but also the relationships between them. For example, a way of assigning basis sets to atoms is required. There are several important types of relationship that need to be addressed, along with multiple methods that can be used to express them. The relative merits of each approach need to be carefully considered.
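
One of the possible methods, explicit identifiers and references carried in the data elements, might look like the sketch below for the basis set example. The element and attribute names (basisSet, basisAssignment, atomRef, basisRef) and the titles are hypothetical and chosen only for this illustration; they are not taken from CML or from any eCCP1 schema.

```python
import xml.etree.ElementTree as ET

# A hypothetical document: atoms and basis sets carry ids, and separate
# link elements refer to those ids to express the assignment.
DOCUMENT = """
<calculation>
  <molecule>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </molecule>
  <basisSet id="b1" title="basis-for-oxygen"/>
  <basisSet id="b2" title="basis-for-hydrogen"/>
  <basisAssignment atomRef="a1" basisRef="b1"/>
  <basisAssignment atomRef="a2" basisRef="b2"/>
  <basisAssignment atomRef="a3" basisRef="b2"/>
</calculation>
"""


def resolve_basis_assignments(root):
    """Map each atom id to the title of the basis set assigned to it."""
    atoms = {a.get("id") for a in root.iter("atom")}
    basis_sets = {b.get("id"): b.get("title") for b in root.iter("basisSet")}
    assignments = {}
    for link in root.iter("basisAssignment"):
        atom_id, basis_id = link.get("atomRef"), link.get("basisRef")
        # A reference is only useful if both ends exist in the document.
        if atom_id not in atoms or basis_id not in basis_sets:
            raise ValueError(f"dangling reference: {atom_id} -> {basis_id}")
        assignments[atom_id] = basis_sets[basis_id]
    return assignments


root = ET.fromstring(DOCUMENT)
assignments = resolve_basis_assignments(root)
print(assignments)
```

Here the two hydrogen atoms share one basis set definition without duplicating it, but the application has to resolve and validate the references itself.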

Status: This is an active area of research. We are exploring the W3C XLink specification and RDF as alternatives to explicitly including identifiers and references in the data elements. See details.

Metadata and provenance

In addition to representing data, there is also a requirement to represent metadata, such as information on the particular code that was used to calculate the data, details of the computational methods and individuals involved. CCLRC has produced a general scientific metadata model (CCLRC Scientific Metadata Model) that may fulfil the metadata and provenance requirements of the eCCP1 project.

Status: The merits and limitations of the CCLRC Scientific Metadata Model are under investigation. See details.

-- PhilipCouch - 08 Nov 2004

Topic revision: r6 - 17 Jan 2005 - 11:27:32 - PhilipCouch
 