Toward a Common Data and Command Representation for Quantum Chemistry

Couch P.A., Daresbury Laboratory, Daresbury, Warrington, Cheshire, WA4 4AD, UK.

Introduction

The realisation of Grid technologies has provided a strong technological framework for the interoperability of computer codes. This has significant benefits for many scientific communities, including those of quantum chemistry. However, communication between such codes is hindered by a lack of agreed standards in data and command representation.

A common format would also bring advantages in other areas. In particular, storing data in a 'universal' format with suitable meta-data would simplify its analysis, interpretation and appropriate re-use.

This meeting aimed to address issues associated with the implementation of such a representation, including: design implications, software tools and existing efforts such as the XML-based Chemical Markup Language[1].

The meeting agenda and abstracts can be found at http://www.nesc.ac.uk/esi/events/394/ and presentations can be downloaded from http://www.nesc.ac.uk/action/esi/contribution.cfm?Title=394.

Open Discussions.

The open discussions had two aims. The first discussion group focused on identifying issues associated with the key computational chemistry data and command types of interest, and the second on the logistics of future developments in this area. Important areas for consideration with regard to the data types are listed below, with any comments or questions raised at the meeting detailed under the corresponding heading. An area that was mentioned without further comment appears as a heading only. Sub-category headings are italicised under the main heading.

Computational Chemistry Data and Command Types.

Geometries

The discussion group concluded that the known current representations of geometries needed extensions to fulfil the requirements raised during the session.

Containers

Containers group information associated with a particular concept. For example, this may include atomic coordinates and charge for molecules. CML containers include those for molecules and calculations.

Coordinates

CML supports a number of methods of representing the structure of material. This includes representations for atomic connectivity and coordinates (Cartesian and internal).

Derivatives

It is a requirement to represent the geometric derivatives of energy and other properties.

Periodicity

CML can be used to specify symmetry and band structure. It can't be used for real space Brillouin zones. CML can define regions within structures (e.g. useful for 3D periodicity). It is a further requirement to represent materials with 1D and 2D periodicity, taking full advantage of the symmetry.

Dummy Atoms and Ghost Atoms

CML allows for dummy atoms and these can be used to mark the location of point charges. CML does not support the use of ghost atoms.

Macromolecules

CML will not be used for macromolecules; mmCIF[2] should be used instead.

Isotopes

CML has an isotope element.

Symmetry

CML has support for point and space groups.

Solvation

It will be important to be able to represent solvation models.

Collections/Sequences

It is a requirement to represent the relationship between collections and sequences of data. For example, information about a reaction involving a molecule could be provided as a sequence of structural and spatial data, reflecting the situation at different points in time. It should be clear that the information shows a temporal evolution. Another example is the data associated with the relaxation of a structure, where the relationship between the structures and energies should be declared explicitly. This declaration of relationships could also have an 'advisory' role, giving the intended use of the data. For example, it would not normally be sensible to take a structure from mid-way through a relaxation and use it to calculate normal modes for comparison with experimental data. CML has been considered for enzyme reactions and combustion. Currently, there is no reaction pathway or transition state support. Some consideration should be given to dynamics (e.g. reactions and trajectories). Currently, CML provides limited support for dynamics.
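As a minimal sketch of the idea, a sequence could carry an explicit statement of what its ordering means, together with an advisory rule about sensible uses of its members. The class names, the "geometry-relaxation" and "normal-modes" labels and the rule itself are illustrative assumptions and are not part of CML; the energies are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One member of a sequence: a structure label plus its energy."""
    structure_id: str
    energy: float            # units would be declared in accompanying meta-data
    converged: bool = False


@dataclass
class Sequence:
    """An ordered collection with an explicit statement of what the order means."""
    relationship: str        # hypothetical vocabulary, e.g. "geometry-relaxation"
    steps: list = field(default_factory=list)

    def suitable_for(self, use):
        """Advisory role: return the steps that are sensible inputs for `use`.

        Assumed rule, for illustration only: a normal-mode calculation
        should start from a converged step of a relaxation.
        """
        if self.relationship == "geometry-relaxation" and use == "normal-modes":
            return [s for s in self.steps if s.converged]
        return list(self.steps)


# Placeholder energies, chosen only to show a relaxation that lowers the energy.
relaxation = Sequence(
    relationship="geometry-relaxation",
    steps=[
        Step("geom-0", -1.00),
        Step("geom-1", -1.10),
        Step("geom-2", -1.12, converged=True),
    ],
)
```

Here relaxation.suitable_for("normal-modes") returns only the converged step, while any other use receives the whole sequence.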

Internal Coordinates

CML supports Z-matrices (lengths, angles and torsions can be represented). Non-redundant internal coordinates are generated algorithmically and some meta-data should provide a description of the method of generation.
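To illustrate what a description of the method of generation might accompany, the following sketch derives a length, an angle and a torsion from Cartesian coordinates and stores them alongside a free-text "method" entry. The function names and the dictionary layout are assumptions made for this example and do not reflect CML's own representation.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bond_length(p0, p1):
    d = _sub(p1, p0)
    return math.sqrt(_dot(d, d))


def bond_angle(p0, p1, p2):
    """Angle p0-p1-p2 in degrees, with p1 at the apex."""
    u, v = _sub(p0, p1), _sub(p2, p1)
    cos_t = _dot(u, v) / (math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def torsion(p0, p1, p2, p3):
    """Signed torsion p0-p1-p2-p3 in degrees, between -180 and 180."""
    b0, b1, b2 = _sub(p0, p1), _sub(p2, p1), _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def internal_coordinates(points):
    """Describe the fourth atom of a four-atom chain relative to the first three."""
    return {
        # Free-text description standing in for the suggested meta-data.
        "method": "z-matrix: atom referenced to the three atoms preceding it",
        "length": bond_length(points[3], points[2]),
        "angle": bond_angle(points[3], points[2], points[1]),
        "torsion": torsion(points[3], points[2], points[1], points[0]),
    }
```

The sign of the torsion depends on the convention chosen, which is one reason the method of generation needs to be recorded with the data.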

Embedded Structures

An example would be crystal impurities.

Basis Sets

Expression

The expressions should support Gaussian, plane wave and numerical basis sets. It is possible that a common data structure will be difficult to find and that these should be represented separately. For functional basis sets, such as Gaussian, there is a need to provide information on: type of function, contraction scheme, angular functions, role (e.g. orbital, density), ordering of basis functions, normalisation, exponents and coefficients. Some information can be provided implicitly through a meta-data description of the representation; this could include the normalisation and ordering of basis functions. Such implicit assumptions/legislation would help to restrict the freedom of representation (resulting in simpler code generation) and cut down on verbosity. Some consideration should be given to how collections of basis sets can be represented (e.g. a 6-31G* representation should be able to reference that of a 6-31G to avoid repetition). This will be particularly important for basis set libraries. A decision also needs to be made about other potentially valuable information: for example, do we need to include published energies for the basis set derivation?
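One possible shape for such a structure is sketched below: each shell records the items listed above, normalisation and ordering are declared once per basis set, and a derived set references its parent rather than repeating it. All class and field names are assumptions, and the exponents and coefficients are placeholders rather than published values.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Shell:
    """One contracted shell of a functional (e.g. Gaussian) basis set."""
    function_type: str       # e.g. "gaussian"
    angular: str             # e.g. "s", "p", "d"
    role: str                # e.g. "orbital" or "density"
    exponents: tuple
    coefficients: tuple

    def __post_init__(self):
        if len(self.exponents) != len(self.coefficients):
            raise ValueError("one contraction coefficient per exponent is required")


@dataclass
class BasisSet:
    name: str
    normalisation: str       # declared once instead of per function
    ordering: str            # declared once instead of per function
    shells: dict = field(default_factory=dict)    # element symbol -> list of Shell
    extends: Optional["BasisSet"] = None          # parent set, to avoid repetition

    def shells_for(self, element):
        """Shells inherited from the parent set, followed by this set's own."""
        inherited = self.extends.shells_for(element) if self.extends else []
        return inherited + self.shells.get(element, [])


# Placeholder numbers: NOT taken from any published basis set.
base = BasisSet(
    name="base", normalisation="normalised primitives", ordering="s, p, d",
    shells={"C": [Shell("gaussian", "s", "orbital", (2.0, 0.5), (0.6, 0.4))]},
)
polarised = BasisSet(
    name="base+pol", normalisation="normalised primitives", ordering="s, p, d",
    shells={"C": [Shell("gaussian", "d", "orbital", (0.8,), (1.0,))]},
    extends=base,
)
```

The derived set stores only its additional polarisation shell; polarised.shells_for("C") returns the parent's s shell followed by the added d shell.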

Mappings/Referrals

Mappings between data-types are often required (e.g. mappings between basis sets and atoms). We could explicitly declare these (using something like RDF), and/or formulate a convention and rely on implicit mappings. We can't solely rely on implicit mappings because we may want to change the rules for some calculations (e.g. use a particular basis set for certain carbon atoms of a molecule, and a different basis set for others), and a method of representing this is required.
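The combination of an implicit convention with explicit exceptions can be sketched as follows. The function name, the atom identifiers and the choice of a per-element default as the implicit rule are assumptions made for illustration.

```python
def assign_basis(atoms, default_by_element, overrides=None):
    """Return a mapping of atom identifier to basis set name.

    atoms              -- list of (atom_id, element_symbol) pairs
    default_by_element -- the implicit convention: one basis set per element
    overrides          -- explicit declarations for individual atoms
    """
    overrides = overrides or {}
    known_ids = {atom_id for atom_id, _ in atoms}
    unknown = set(overrides) - known_ids
    if unknown:
        raise KeyError(f"override refers to unknown atoms: {sorted(unknown)}")
    return {
        atom_id: overrides.get(atom_id, default_by_element[element])
        for atom_id, element in atoms
    }


atoms = [("a1", "C"), ("a2", "C"), ("a3", "H")]
mapping = assign_basis(
    atoms,
    default_by_element={"C": "6-31G", "H": "6-31G"},
    overrides={"a2": "6-31G*"},   # a different basis set for one carbon atom
)
```

Only the exception needs to be written down explicitly; every other atom falls back to the convention.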

Interpretation of some data will require referrals back to other data. An example includes the link between molecular orbital coefficients, basis sets and atom coordinates. As with mappings, we could explicitly declare relationships using RDF, but also produce legislation for implicit referrals. This means that we don't have to generate RDF for standard mappings and referrals.
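An explicit declaration of referrals amounts to a set of subject, predicate, object statements. The sketch below uses plain Python tuples as a stand-in for RDF triples (no RDF library is involved), and the identifiers and predicate names are hypothetical.

```python
# Each statement is (subject, predicate, object); all names are hypothetical.
triples = {
    ("mo-coefficients-1", "expandedIn", "basis-set-1"),
    ("basis-set-1", "centredOn", "geometry-1"),
}


def referrals(subject, triples):
    """Follow links transitively and return everything `subject` refers back to."""
    found, stack = [], [subject]
    while stack:
        current = stack.pop()
        for s, _predicate, o in sorted(triples):
            if s == current and o not in found:
                found.append(o)
                stack.append(o)
    return found
```

Calling referrals("mo-coefficients-1", triples) returns the basis set and then the geometry, which is the chain a program would need to resolve before it could interpret the coefficients.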

Pseudo Potentials

The FSAtom pseudo potential efforts[3] are not CML based. These can be included with CML through the use of namespaces.

Effective Core Potentials

Reference Frames

Sometimes, energies will depend on orientation and some legislation may be required (e.g. z points to Z). If a molecule has symmetry and the principal axes are not XYZ then some action is required. It was commented that in GAMESS-UK the molecule is re-orientated. It is a requirement to represent orientation when defining sub-units of a structure.

Grids

Orbitals and Densities

Integrals

Overlap Matrices and Hamiltonian Matrices

Binary Representations

It was suggested that data structures that scale as the square of the number of electrons should have both a binary and an ASCII representation. Data structures that scale faster than the square should be in binary only (this would include two electron integrals). It was also commented that such data is difficult to represent because it is not well standardised. Binary data could be used in conjunction with a data format description language such as DFDL[4].
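The suggested rule can be written down as a small selection function. In this sketch the treatment of data that scales more slowly than the square (ASCII only) is an assumption, since the meeting did not address it, and the little-endian double-precision layout is likewise an arbitrary choice of the kind a format description would need to record.

```python
import struct


def representations(scaling_power):
    """Choose representations from how the data size scales with electron count."""
    if scaling_power > 2:
        return ("binary",)             # e.g. two electron integrals
    if scaling_power == 2:
        return ("binary", "ascii")
    return ("ascii",)                  # assumption: not covered at the meeting


def encode(values, scaling_power):
    """Encode a flat list of floats in every representation the rule selects."""
    encoded = {}
    for kind in representations(scaling_power):
        if kind == "binary":
            # Assumed layout: little-endian IEEE 754 doubles, no header.
            encoded[kind] = struct.pack(f"<{len(values)}d", *values)
        else:
            encoded[kind] = " ".join(repr(v) for v in values)
    return encoded
```

For example, encode([1.0, 2.0], 2) yields both a 16-byte binary string and the text "1.0 2.0", whereas a scaling power of 4 yields the binary form only.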

Accuracies and Tolerances

Both meta-data and commands are required. One example would be DFT grid properties.

Dynamics

Excited Electronic States and Transition Probabilities

Scientific Units

Meta-data

Careful consideration should be given to a meta-data model. Sometimes the data will be large and we don't want to have to search through it to find the meta-data. Therefore, it would be best to keep data and meta-data separate. Links would need to be maintained from the data to the meta-data and this would require data to be handled by tools/portals that maintain these links (see CLRC meta-data portal). Meta-data could be used as an advisor (as discussed above in collections/sequences). The idea of meta-data conformance levels could be introduced to indicate the meta-data available for a specific set of data (as used in the CLRC meta-data portal). The Dublin Core[5] meta-data elements could be used. The use of meta-data for retrieval of requested data is not as difficult as harvesting the meta-data in the first instance. Keeping track of pedigree is particularly challenging.
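A separate meta-data record with a maintained link to its data might look like the following sketch. The element names title, creator, date and identifier are taken from Dublin Core; the checksum field, the function names and the two conformance levels are assumptions made for illustration and do not describe the CLRC meta-data portal's scheme.

```python
import hashlib


def describe(data, location, title, creator, date):
    """Build a meta-data record that is stored separately from `data` (bytes)."""
    return {
        "title": title,                  # Dublin Core element
        "creator": creator,              # Dublin Core element
        "date": date,                    # Dublin Core element
        "identifier": location,          # Dublin Core element: where the data lives
        # Assumption (not Dublin Core): a checksum to detect a broken link.
        "checksum": hashlib.sha256(data).hexdigest(),
    }


def link_is_intact(meta, data):
    """True if `data` is still the object the meta-data record describes."""
    return hashlib.sha256(data).hexdigest() == meta["checksum"]


def conformance_level(meta):
    """Hypothetical levels: 0 = no link, 1 = link only, 2 = link plus description."""
    if not meta.get("identifier"):
        return 0
    if all(meta.get(key) for key in ("title", "creator", "date")):
        return 2
    return 1
```

Because the record holds only a link and a checksum, it can be searched without opening the (possibly very large) data it describes.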

Semantics and Ontology

Dictionaries of parameters/inputs could be generated and code owners should comment on their completeness.
CML makes use of the CIF dictionary.
Dictionaries are difficult to use when we have complex objects.
The dictionaries should be defined by schema.
Relationships need to be described for the specification of mappings and referrals. RDF/XML is a formalised way of describing the relationship between concepts.

References.

1. The Chemical Markup Language (http://www.xml-cml.org/)

2. The Macromolecular Crystallographic Information File (http://ndbserver.rutgers.edu/mmcif/)

3. The Free Software Project for Atomic Scale Simulations (http://www.tddft.org/fsatom/programs.php)

4. The Data Format Description Language (http://forge.gridforum.org/projects/dfdl-wg/)

5. Dublin Core Metadata Initiative (http://dublincore.org/)

-- PhilipCouch - 04 May 2004

Topic revision: r3 - 06 May 2004 - 09:56:00 - PhilipCouch
 