Data Modelling for Computational Chemistry – a Methodology

Part One (Draft)

Philip Couch
Daresbury Laboratory, Daresbury, Warrington, Cheshire, UK. WA4 4AD.

Adoption of a markup language

The first step in developing standards for computational chemistry data is to specify the concepts that need to be considered. In the current context, these include concepts such as molecular structures, atomic basis sets and molecular orbitals. The eCCP1 project has been involved in a number of meetings held to consult the international quantum chemistry community and gather these requirements (Towards a Common Data and Command Representation for Quantum Chemistry, http://www.nesc.ac.uk/action/esi/contribution.cfm?Title=394); a summary can be found in the TWiki document data-type requirements. The simplest level of data standardisation is that of the data syntax; this can bring quick advantages for little effort. In particular, use of an agreed syntax reduces the burden on application code developers who wish to make use of the data. A syntactic standard can be achieved through the adoption of the eXtensible Markup Language (XML). This provides a standard, specified by the W3C (http://www.w3c.org), for describing and hierarchically structuring data. The advent of XML has had a big impact on areas such as publishing, data sharing, database access and distributed computing. Its growing importance in a number of domains has led to the development of mature tools that computational code developers can use to read and manipulate XML documents.
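
As a minimal sketch of how little code an agreed syntax demands of an application developer, the following Python fragment reads a molecular structure using only the standard library XML parser. The element and attribute names (calculation, molecule, atom, element, x, y, z) and the coordinates are illustrative assumptions, not taken from any agreed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: names and coordinates are illustrative assumptions,
# not part of any agreed quantum chemistry schema.
DOCUMENT = """<calculation>
  <molecule id="water">
    <atom id="O1" element="O" x="0.000" y="0.000" z="0.117"/>
    <atom id="H1" element="H" x="0.000" y="0.757" z="-0.469"/>
    <atom id="H2" element="H" x="0.000" y="-0.757" z="-0.469"/>
  </molecule>
</calculation>"""


def read_atoms(xml_text):
    """Return a list of (element symbol, (x, y, z)) for every atom element."""
    root = ET.fromstring(xml_text)
    atoms = []
    for atom in root.iter("atom"):
        position = tuple(float(atom.get(axis)) for axis in ("x", "y", "z"))
        atoms.append((atom.get("element"), position))
    return atoms


atoms = read_atoms(DOCUMENT)
for symbol, position in atoms:
    print(symbol, position)
```

The parsing itself is handled entirely by the generic XML tool; only the choice of names is specific to the domain, which is where the later levels of standardisation come in.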

Question: XML imposes a hierarchical structure on the data, and this is not always suitable. Is there a technology, other than XML, that is more appropriate for quantum chemistry data (such as the Resource Description Framework)?

The data model design

A way of specifying XML representations for quantum chemistry concepts is required, and this can be achieved using W3C XML schemas. XML schemas are not simple to understand for those unfamiliar with the schema language, so the Unified Modelling Language (UML) is preferred for visualising the representation. This graphical notation can be converted to XML schemas for use with XML tools. In order to specify some of the XML schema design choices explicitly in the UML, it is necessary to adopt a convention (a UML profile) for this purpose. Many UML modelling tools exist; the Object Domain R3 tool (http://www.objectdomain.com) is being used within the eCCP1 project, adopting the XML schema generation profile of David Carlson (http://www.xml.com/pub/a/2001/08/22/uml.html). The XML schema can be generated by serialising the UML as XMI (XML Metadata Interchange) and then importing this into the Eclipse-based Hypermodel tool (http://www.xmlmodeling.com).

There are two extreme approaches to the design of the data model. The first involves considering all of the possible concepts and the relationships between them, and creating a rigid data model to support these. This approach relies heavily on XML nesting. Tools can then be created to work with this data model and with its implicit semantics; this is a common approach. The problem is that it is rather inflexible. Everything is ‘hardwired’ and any extension requires adaptation of both the data model and the code. However, writing code generators that automatically produce APIs (for parsing the formatted data) from the data model can ease this burden.

The second extreme involves the design of XML components that have little, or no, interdependence. These components would represent the various quantum chemistry concepts and the specification of links between them. This modular approach makes the data model extensible and, through componentisation, more readily reusable. It results in quite a flat format, since XML nesting is kept to a minimum and mainly used within components. The adoption of this approach does, however, introduce some complexity, in that careful consideration must be given to the specification of the relationships between components.

It is useful to group the quantum chemistry concepts into two categories: those that can exist on their own (called entities) and those that exist through the association of several concepts (called associations). Examples of entities are molecular structures and atomic basis sets; both of these concepts have some meaning in isolation. An example of an association is a molecular vibration, which requires information about dynamics, perhaps specified by a vector, to be associated with a molecular structure. It is the entities that form the primitive types and relate to a single XML component; the associations are complex types that are formed from the linking of components. The adoption of a modular approach also makes it simpler to use representations from other XML languages.
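
The distinction between entities and associations can be sketched as a small in-memory data structure. This is only an illustration under stated assumptions: the class names Entity and Association, the role names and the identifiers are hypothetical, and the displacement values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """A concept that is meaningful in isolation; maps to one XML component."""
    id: str
    kind: str
    data: dict


@dataclass
class Association:
    """A concept that exists only by linking entities together."""
    kind: str
    members: dict  # role name -> identifier of the entity playing that role


def resolve(association, entities):
    """Return {role: Entity} for an association, given the available entities."""
    by_id = {entity.id: entity for entity in entities}
    return {role: by_id[ref] for role, ref in association.members.items()}


# Two independent entities (hypothetical identifiers, placeholder values).
structure = Entity("mol1", "molecularStructure", {"atoms": ["O", "H", "H"]})
displacements = Entity("vec1", "vector", {"values": [0.0] * 9})

# A vibration is not a component in its own right: it is the association
# of a displacement vector with a molecular structure.
vibration = Association("vibration",
                        {"structure": "mol1", "displacements": "vec1"})

parts = resolve(vibration, [structure, displacements])
print(parts["structure"].kind, parts["displacements"].kind)
```

Because the entities carry no knowledge of the association, either one could be replaced by a representation from another XML language without touching the other.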

Question: Is there a natural granularity for the XML components? As a simple example, should all the data associated with a molecular vibration form a single component, or should a vibration be formed from the association of a vector (specifying dynamics) with a molecular structure?

Existing XML languages with relevance to quantum chemistry

The domain of computational chemistry is fortunate in that XML representations already exist for a number of its key concepts. These have been developed as part of the Chemical Markup Language (CML) (http://cml.sourceforge.net). This includes a number of XML schemas that specify the representations for concepts that relate to various chemistry domains. One schema specifies common concepts (CMLCore) and the others (known as extensions) focus on particular areas, such as computational chemistry (CMLComp) and chemical reactions (CMLReact). The XML schemas and example documents can be downloaded from the CML site. CML has excellent support for the representation of small molecule structures, and is now supported by a number of tools (such as Jmol (http://jmol.sourceforge.net), JChemPaint (http://jchempaint.sourceforge.net) and Marvin (http://www.chemaxon.com/marvin/)).

CMLComp is under development and eCCP1 has found that some extensions are required to enable better coverage of computational chemistry concepts. Discussions at a number of meetings have identified the need for further support for concepts including basis sets (e.g. Gaussian atomic), properties on grids (such as charge density maps), molecular vibrations and macromolecules. Some of these extensions can be taken from other XML languages, while others will be too specific to computational chemistry and will need to be created. For example, the Visualisation Tool Kit (VTK) (http://public.kitware.com/VTK/) has XML representations for properties on grids (regular, rectilinear, irregular). As with the use of the Chemical Markup Language, the adoption of representations from other XML languages provides benefits through the availability of tools that understand data in this format. The use of VTK markup for grids allows this data to be visualised simply, using the APIs made available through VTK. The atomic basis set is an example of a concept that will need to have a representation designed, something that is being addressed within the eCCP1 project.

Question: What CML markup would be useful for quantum chemistry?
Question: What does CML not cover?

General data containers

In some cases, it is a good idea to group concepts into classes and attempt to find a common representation for the class. For example, there are many scalar properties that are important in computational chemistry and need to be supported in the data model. Each scalar property, such as the electron exchange energy or kinetic energy, could be considered separately. However, this would lead to a rather large XML schema that would need constant revision as new scalar properties are added. An alternative is to find a common representation for all scalar properties. CML implements this by specifying representations for several classes of concepts (scalar, matrix, array). These representations allow references to be made to XML dictionary entries. These references specify the particular instance of the class that is represented (perhaps electron correlation energy for scalars). Dictionaries need to be created by specific communities, providing individual terms and their definitions.

The use of general containers for data does require some extra consideration with regard to validation. XML documents should be both valid and well formed. For a document to be well formed it must be syntactically correct. In order for the document to be valid, it must conform to the restrictions specified in an XML schema. Due to limitations of W3C XML schemas, it is not possible to specify restrictions on a document's structure based on element attribute values. Therefore, it is possible to specify where the general containers (representing a class of concepts) may occur, but not where specific concepts may occur. Combining W3C XML schemas with the eXtensible Stylesheet Language (XSL) can enhance their expressive power. The CML schemas use the schema appinfo element to contain XSL that expresses further constraints on the validity of a CML document. There are many mature XSL processor APIs that can be used by application code developers to check conformance to these validity constraints. In addition to the inclusion of XSL in the XML schemas, it may also be included in dictionaries for the same purpose.

It is simpler to make use of components from other XML languages if the schemas in which they are specified are designed in a modular fashion (e.g. by specifying all elements as global elements). In particular, such components can be easily referenced from other schemas. The CML schemas have been designed in this way, and it is therefore a simple task to make use of CML components. Components used from other languages should make use of XML namespacing to ensure that name collisions are avoided.

It is important to note that XML does not add any semantics to data; it is used to describe and structure data. The description adds some implicit semantics, but often this is not sufficient. Further semantics can be added via two mechanisms. The first is to add simple annotation to the XML schema, and this could be made machine and/or human readable. The second is through XML dictionaries. The format of the XML dictionaries is specified in XML schemas; in CML this specification is part of the Scientific, Technical and Medical Markup Language (STMML).
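
The gap between what a schema can check and what a dictionary must check can be illustrated with a short Python sketch. It is not the XSL-based mechanism used by CML: a plain Python check is substituted here for illustration. The scalar container with a dictRef attribute follows the CML style, but the qc: prefix, the dictionary entries and the numerical values are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical community dictionary: term -> definition.
DICTIONARY = {
    "qc:exchangeEnergy": "Electron exchange energy",
    "qc:kineticEnergy": "Electron kinetic energy",
}

# Every property uses the same general container; only dictRef differs.
# Values are placeholders, not real results.
DOCUMENT = """<results>
  <scalar dictRef="qc:exchangeEnergy" units="hartree">-8.91</scalar>
  <scalar dictRef="qc:kineticEnergy" units="hartree">75.62</scalar>
  <scalar dictRef="qc:dipoleFlux" units="hartree">0.10</scalar>
</results>"""


def undefined_terms(xml_text, dictionary):
    """Return the dictRef values that have no entry in the dictionary.

    ET.fromstring raises ET.ParseError if the text is not well formed,
    so well-formedness is checked before the dictionary constraint.
    """
    root = ET.fromstring(xml_text)
    return [scalar.get("dictRef")
            for scalar in root.iter("scalar")
            if scalar.get("dictRef") not in dictionary]


print(undefined_terms(DOCUMENT, DICTIONARY))
```

A schema alone would accept all three scalar elements, since each is a correctly placed container; only the additional check against the dictionary reveals that one of them refers to a term that no community has defined.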

Question: What types of data should be contained within general containers?
Question: How should we standardise dictionaries?

The relationships between concepts

If a modular approach is adopted for the data model, the representation of the concepts alone is not enough. It is also necessary to be able to link concepts and specify what these links mean. As an example, an XML document may provide information about a quantum chemistry calculation, including several atomic basis sets and a molecular structure. In order for this data to be correctly interpreted, it could be necessary to understand which atomic basis set has been assigned to which atom. A way of specifying such ‘mappings’ is required, and this is a surprisingly non-trivial problem.

The type of link described here is one that ‘maps’ entities. A further type of link is one that forms an association concept by linking other entity concepts together. An example would be the ‘binding’ of a vector specifying atom dynamics to a molecular structure to form the association concept of a vibration. It is important to distinguish between these two types of link. A third type of link is one that relates an entity to another link. An example could be the linking of a density matrix to a link that maps basis sets to atoms.

It should not be possible to use data from an XML document out of context. Distinguishing between links in this manner helps to avoid this problem by providing information on the semantic ties between data. These ties are not just important for the semantics, but also for allowing some implicit mappings. For example, specifying that a particular vector is bound to a particular molecule does not provide any information about how this vector might map to the individual atoms. However, if the vector is always tied to a particular molecule then we can rely on, for example, document ordering to map components of the vector to atoms of the molecule. Although in general it is not good practice to rely on implicit mappings, in this case it is useful because it significantly reduces the verbosity of the overall representation.

Additional links are required that do not relate to the association of XML components, but to the repetition of data. In some cases, several components may share a significant amount of data that it is not desirable to duplicate. In this case, it is useful to have a method of linking sub-components. An example would be the use of sub-components of a 6-31G atomic basis set to form part of the 6-31G* atomic basis set. Further, a method of grouping components could be required, for example an additional component that acts as a general container for other components. This could be used, for example, to group 50 scalar properties to form a list, which could then be linked to a molecular structure. This is in contrast to explicitly linking each scalar to the structure. Often, current methods rely on document ordering for this grouping. The eCCP1 project is currently addressing how these links should be expressed in an XML document. There are several standards that could be used for this purpose and each is briefly discussed here.
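
The implicit mapping through document ordering described above can be sketched as follows. The markup is a hypothetical illustration (the element names vibration, molecule, atom and vector are assumptions), the displacement values are placeholders, and the convention of three consecutive vector components per atom is likewise assumed for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a displacement vector bound to a molecule.
# Values are placeholders, not a computed normal mode.
DOCUMENT = """<vibration>
  <molecule>
    <atom id="O1"/>
    <atom id="H1"/>
    <atom id="H2"/>
  </molecule>
  <vector>0.0 0.0 0.07 0.0 -0.43 -0.56 0.0 0.43 -0.56</vector>
</vibration>"""


def displacements_by_atom(xml_text):
    """Map each atom id to its (dx, dy, dz), relying on document order."""
    root = ET.fromstring(xml_text)
    atom_ids = [atom.get("id") for atom in root.iter("atom")]
    values = [float(v) for v in root.find("vector").text.split()]
    # The implicit mapping is only safe if the sizes agree.
    if len(values) != 3 * len(atom_ids):
        raise ValueError("vector length does not match the number of atoms")
    return {atom_id: tuple(values[3 * i:3 * i + 3])
            for i, atom_id in enumerate(atom_ids)}


mapping = displacements_by_atom(DOCUMENT)
print(mapping["H1"])
```

No per-atom reference appears in the document; the price of this brevity is that the vector is meaningless if it is separated from the molecule to which it is bound, which is why the binding itself must be made explicit.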

The simplest way to link components is via ‘id’ and ‘ref’ attributes. Each component is allowed a unique identifier and has the ability to reference the identifiers of other components. This approach is commonly used, but has some drawbacks. The first is that the data model needs to specify all the possible references that a particular component will need to make, making this approach somewhat inflexible. The second, and more significant, drawback is that components may often be taken from different sources and collated for the purpose of a calculation. For example, a structure may be taken from a structure database and a basis set from a basis set library. These components do not know about each other’s identifiers, and a simple method is required to express how the components relate. Difficulties are also introduced when a particular component needs to reference a large number of other components, creating a large amount of reference information. Further, this method does not provide any explicit semantics for the link.
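
A minimal sketch of the ‘id’ and ‘ref’ approach, and of how it fails when components come from different sources, is given below. The attribute name basisRef and all identifiers are hypothetical; the third atom deliberately refers to an identifier that the basis set library does not use.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the structure and the basis sets were collated
# from different sources, so one reference uses an unknown identifier.
DOCUMENT = """<calculation>
  <basisSet id="carbon-6-31G"/>
  <basisSet id="hydrogen-6-31G"/>
  <molecule id="methane-fragment">
    <atom id="C1" basisRef="carbon-6-31G"/>
    <atom id="H1" basisRef="hydrogen-6-31G"/>
    <atom id="H2" basisRef="h-6-31G"/>
  </molecule>
</calculation>"""


def resolve_references(xml_text, ref_attribute="basisRef"):
    """Return (resolved, dangling) for every element carrying ref_attribute.

    resolved maps the referring element's id to the id it points at;
    dangling lists (referring id, unknown target id) pairs.
    """
    root = ET.fromstring(xml_text)
    known_ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    resolved, dangling = {}, []
    for element in root.iter():
        ref = element.get(ref_attribute)
        if ref is None:
            continue
        if ref in known_ids:
            resolved[element.get("id")] = ref
        else:
            dangling.append((element.get("id"), ref))
    return resolved, dangling


resolved, dangling = resolve_references(DOCUMENT)
print(resolved)
print(dangling)
```

Note also that the attribute name is the only indication of what the link means, and that the atom component had to be designed in advance to carry it.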

Of course, XML nesting relates components in a hierarchy. Nesting is useful for grouping data that form entities. However, heavy dependence on nesting causes problems; data becomes locked away in the branches of the tree structure and it often becomes necessary to repeat data across nodes. It is therefore likely that a rather flat document structure is preferable to a heavily nested one. Repetition could be avoided through the id and ref attribute approach, but this raises the issues discussed above.

A further approach is to include a specification for expressing these links in the data model. This would need to provide support for locating the XML nodes to be linked and for adding semantics to the link. This specification could make use of W3C standards that allow the location of nodes and node sets to be specified (XPath, XPointer). In addition, the XLink specification provides much of what is needed here; it allows the use of XPointer and makes provision for link semantics.
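
The idea of a link that lives outside the components it connects, locating them by path expression and naming its own meaning, can be sketched as below. This is not XLink syntax: the link element and its role, source and target attributes are hypothetical, and the expressions use only the limited XPath subset supported by the Python standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup. The atoms and basis sets carry no references to
# each other; a separate link element locates both ends and names the
# meaning of the relationship.
DOCUMENT = """<calculation>
  <basisSet id="carbon"/>
  <basisSet id="hydrogen"/>
  <molecule>
    <atom id="C1" element="C"/>
    <atom id="H1" element="H"/>
    <atom id="H2" element="H"/>
  </molecule>
  <link role="basisSetAssignedTo"
        source=".//basisSet[@id='hydrogen']"
        target=".//atom[@element='H']"/>
</calculation>"""


def expand_links(xml_text):
    """Expand each link into (source id, role, target id) statements."""
    root = ET.fromstring(xml_text)
    statements = []
    for link in root.iter("link"):
        sources = root.findall(link.get("source"))
        targets = root.findall(link.get("target"))
        for source in sources:
            for target in targets:
                statements.append(
                    (source.get("id"), link.get("role"), target.get("id")))
    return statements


statements = expand_links(DOCUMENT)
for statement in statements:
    print(statement)
```

A single link covers every hydrogen atom, so the amount of reference information no longer grows with the number of atoms, and neither component had to anticipate the link in its own definition.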

It is clear from the above discussion that the difficulty does not lie in creating a representation for quantum chemistry concepts, but rather in associating the concepts and specifying what the associations mean. Difficulties are encountered because the XML data model is hierarchical and the nature of the mappings can be complex. It seems that a natural solution would be to work towards a way of integrating XML with other semantic web technologies that are able to express complex relationships in a more natural way. One promising technology is the Resource Description Framework (RDF). This specifies a way of making statements such as ‘The basis set with identifier “carbon” is assigned to all atoms that have the identifier “C1”’. RDF statements can be expressed as triples of a subject, a predicate and an object (as in the N3 format). These triples can then be serialised as XML and can therefore be included as part of the XML data document. It will also be important to document the types of possible links and to place constraints on the components that must be used by each type of link. XML dictionaries could be used for this purpose.
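
The triple form of such statements can be illustrated without any RDF library, using plain Python tuples. This is a sketch of the subject, predicate and object structure only, not of RDF itself; the prefixes (basis:, atom:, qc:) and predicate names are hypothetical.

```python
# Each statement is a (subject, predicate, object) triple. All names are
# hypothetical and stand in for the URIs a real RDF document would use.
TRIPLES = [
    ("basis:carbon", "qc:assignedTo", "atom:C1"),
    ("basis:hydrogen", "qc:assignedTo", "atom:H1"),
    ("basis:hydrogen", "qc:assignedTo", "atom:H2"),
    ("vector:v1", "qc:boundTo", "molecule:methane"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [triple for triple in triples
            if all(wanted is None or wanted == actual
                   for wanted, actual in zip(pattern, triple))]


# Which atoms use the hydrogen basis set?
hydrogen_atoms = [obj for _, _, obj in
                  match(TRIPLES, subject="basis:hydrogen",
                        predicate="qc:assignedTo")]
print(hydrogen_atoms)
```

Because every link has the same three-part shape regardless of its meaning, mappings, bindings and links to other links can all be stored and queried in the same way, with the predicate carrying the semantics.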

Question: Which method is best for the specification of links between XML components?

Metadata and provenance

In addition to representing quantum chemistry data, there is a requirement to be able to represent metadata. This would include data such as the description of the code used to perform a particular calculation along with its version and details of people involved in the computation. The Data Management Group of the CLRC have worked in close collaboration with various scientific groups (such as ISIS, Rutherford Appleton Lab, UK and SRS, Daresbury Laboratory, UK) to develop an XML schema for the representation of general scientific metadata (The CLRC Scientific Metadata Model) (http://www-dienst.rl.ac.uk/library/2002/tr/dltr-2002001.pdf). It is likely that this can be used to fulfil the metadata and provenance requirements of the eCCP1 project, a question that is currently under investigation.

It should be possible to create a framework that allows XML components to be ‘plugged in’ and easily incorporated from other XML languages. The framework should support the specification of relationships and semantics, so that tools can be created that work with the framework rather than with a rigid data model. The framework would integrate XML with other semantic web technologies.

-- PhilipCouch - 09 Nov 2004

Topic revision: r1 - 02 Dec 2004 - 15:54:34 - PhilipCouch