Management and Analysis of Large Scientific Data Sets

Rob Allan

Computational Science and Engineering Department,

STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD

Abstract:

This report looks at issues and solutions for data management in the context of high performance computing for scientific simulation and modelling. It considers data set formats, together with cluster, wide area, distributed, hierarchical and high performance storage systems. Finally it looks at novel ways to extract information from large data sets, particularly where this can be carried out ``in situ'', avoiding the need for lengthy file transfers.

© STFC 2009-10. Neither the STFC Computational Science and Engineering Department nor its collaborators accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.



Introduction

Managing scientific data has been identified as one of the most important emerging needs of the scientific community because of the sheer volume and increasing complexity of data being collected. This is particularly true in the growing field of computational science where increases in computer performance permit ever more realistic simulations and the potential to explore large parameter spaces. Effectively generating, managing and analysing the data and resulting information requires a comprehensive, end-to-end approach that encompasses all of the stages from the initial data acquisition to the final analysis of the data. This is sometimes referred to as Information Lifecycle Management or ILM.

A SciDAC project [21] has identified three significant requirements based on community input. Firstly, access to more efficient storage systems - in particular, parallel file system improvements are needed to write and read large volumes of data without slowing a simulation, analysis or visualisation engine. Secondly, scientists require technologies to facilitate better understanding of their data, in particular the ability to perform complex data analysis and searches over large data sets in an effective way - specialised feature discovery, parallel statistical analysis and efficient indexing are needed before the data can be understood or visualised. Finally, generating the data, collecting and storing the results, data post-processing and analysis of results is a tedious, fragmented process - workflow tools are required to automate this process in a robust, tractable and recoverable fashion, enhancing scientific exploration and adding provenance and other metadata along the way.

In this report we review the requirements and tools for managing large data sets of interest to the UK science community involved in computational simulation and modelling. This extends work in a previous report [4] prepared for the UK High End Computing project http://www.ukhec.ac.uk.

Data Management

Scientists often limit the meaning of ``data management'' to the physical data storage and access layer and the movement of data from one location to another. The scope of scientific data management is however much broader, including both meaning and content.

The cutting edge of computational science involves very large simulations taking many hours or days on the latest high performance computers. For business data, projects implement enterprise wide data architectures, with data warehouses and data mining to extract information from their data. Can something similar be done for scientific data?

Some problems identified from current scientific projects include the following.

The increasing size of scientific data collections brings not only problems, but also opportunities. One of the biggest is the possibility to re-use existing data for new studies. This was one of the great hopes of the e-Science programme. Many projects investigated data curation, provenance and metadata definitions based on common ontologies. At STFC it was a goal of the Facilities e-Infrastructure Programme with a focus on SRB, iCAT and DataPortal for SRD, Diamond and ISIS. This is now being taken up in Australia [1].

Scientific data is composed not only of bytes, but also of workflow definitions, computation parameters, environment setup and provenance. Capturing and using all this associated information is a goal of, among others, the JISC funded myExperiment project which focusses on workflows encapsulated in ``research objects''.

Other aspects of data management, particularly virtualisation of the underlying storage, have been tackled in projects such as SRB, the Storage Resource Broker from SDSC.

Much of the work mentioned above is aimed at creating catalogues, such as iCAT, which reference large data collections. These may be the outputs of a facility or a long term research programme, will have consumed extensive funding and are deemed to be of lasting and sometimes national importance.

We will not attempt to fully describe or reproduce this e-Science work here, the background to which is discussed in a report [11]. We also do not consider issues of data curation, for which appropriate standards and processes have been developed and documented by the Digital Curation Centre http://www.dcc.ac.uk. Instead, we will focus on ways to manipulate and analyse large scientific data sets. Some examples are as follows.

Astrophysics:
Virtual Observatories, e.g. the Sloan Digital Sky Survey.
Biology, Bio-chemistry and Bio-physics:
CFD and Computational Engineering:
Environment and Atmosphere:
BADC, POL.
Facilities Data:
iCAT and associated tools developed at STFC. iCAT is an Information Catalogue with associated scientific metadata schema.
Geo-physics:
e.g. FAGS: Federation of Astronomical and Geo-physical Data Analysis Services, http://www.icsu-fags.org/
High energy physics:
Analysis of data from CERN LHC, e.g. UK Tier-1 Centre at RAL.
Medical Imaging:
Meteorology:
e.g. Met Office examination of longitudinal data sets for climate trends.
Oceans:
BODC and POL
Protein Data Bank:
3D biological macro-molecular structure data widely used by projects such as CCP4.
Social and Geo-spatial:
e.g. ESDS, EDINA

The LHC data analysis represents an extreme case as it will generate upwards of 14 PB of data a year, which has to be distributed across the EGEE, OSG and NorduGrid Grids for analysis. Such data volumes cannot be handled easily with current production networks, so they have required the provisioning of optical private networks (an unusual form of SAN) linking CERN's Tier-0 centre to the key Tier-1 computing centres around the world, the one for the UK being situated at RAL. Dedicated 10 Gb/s network links are provided in this way for data movement. We do not consider the middleware aspects of such data grids in this report, but focus on the requirements of a high end data centre supporting computational simulation and modelling.

The following figure illustrates the architecture implemented in the SciDAC Data Management Center [21].

Figure 1: Architecture of the SciDAC Data Management Center [21]

This organises activities in three layers that abstract the end-to-end data flow. The layers are Storage Efficient Access (SEA), Data Mining and Analytics (DMA), and Scientific Process Automation (SPA). The SEA layer is immediately on top of hardware, operating systems, file systems and mass storage systems and provides parallel data access technology and transparent access to archival storage. The DMA layer, which builds on the functionality of the SEA layer, consists of indexing, feature selection and parallel statistical analysis technology. The SPA layer, which is on top of the DMA layer, provides the ability to compose scientific workflows from the components in the DMA layer as well as application specific modules. This architecture has been used in the centre to organise its components and apply them to various scientific applications.

Important components in any data management system have been identified as follows.


Data Formats

Not surprisingly, many data format standards are concerned with multi-dimensional data or image processing. In some cases these are domain specific, but others are more generic.

Data management issues and processes are mostly independent of the data format. Nevertheless some formats are more suited to support metadata and more helpful in reaching the various data management goals. The format can also have a significant effect on i/o performance and the ability to search across and pull out sub-sets of data. An example of this is slicing through 3D data. For system independence, e.g. transferring data between big endian and little endian systems, formats such as XDR (eXternal Data Representation) can be used, although translation may be slow.
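As a minimal illustration of the byte ordering issue, the following Python sketch packs values with an explicit big-endian layout (the order XDR mandates), so that the same bytes are produced whatever the native order of the writing machine. The values and format string are examples only; this does not implement the full XDR standard.

  # Explicit byte ordering when exchanging binary data between machines.
  # XDR mandates big-endian encoding; here Python's struct module is used
  # with an explicit ">" (big-endian) format so the bytes written are
  # independent of the native byte order of the writing machine.
  import struct

  values = (1024, 3.14159)           # an int and a double to be exchanged

  # Pack with an explicit big-endian layout (">id" = int32 then float64).
  payload = struct.pack(">id", *values)

  # The receiving side uses the same explicit format to recover the values,
  # whatever its local endianness happens to be.
  recovered = struct.unpack(">id", payload)
  print(recovered)                   # (1024, 3.14159)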

There have been several attempts to use XML to describe binary data formats, but these have been largely unsuccessful; initiatives such as DFDL (the Data Format Description Language) and BinX attempt to do this. Traditional formats like HDF-5, NetCDF and CGNS are widely used, and newer formats using XML may be better suited to storing metadata. It is also possible to use relational or object databases for certain types of data. We note that the metadata is usually stored separately from the actual data, for instance in a catalogue which contains location references or URLs. There are issues of transaction management and consistency when data or metadata are distributed or replicated.

We do not address low level file systems such as FAT, EXT3, HPFS, etc. in this document, although we note that some performance improvements can be associated with an appropriate choice. This is particularly true of newer file systems being developed to address scalability to large networked file stores, such as Oracle's BTRFS, the (fault tolerant) B-Tree File System introduced in 2007, see http://btrfs.wiki.kernel.org, and CRFS, the Coherent Remote File System. ZFS from Sun offers some similar capabilities and is available on the newer NW-GRID clusters. This includes volume management functions, scalability, snapshots and copy-on-write clones, plus built in integrity checking and repair, RAID and NFS-4 support. See Wikipedia for more information, http://en.wikipedia.org/wiki/List_of_file_systems.

The following list describes a number of widely used data set formats. Section 2.2 describes storage and distributed high performance file systems which may be of interest.

CDF:
the Common Data Format is a library and toolkit developed by NASA. The software is an interface for the storage and manipulation of multi-dimensional data sets.

CGNS:
the CFD General Notation System consists of a collection of conventions and software for the storage and retrieval of CFD data.

The CGNS system is designed to facilitate the exchange of data between sites and applications and to help stabilise the archiving of aerodynamic data. The data are stored in a compact, binary format and are accessible through a comprehensive and extensible library of functions. The API is platform independent and can be easily implemented in C, C++ and Fortran.

CIF:
the IUCr Crystallographic Information File is becoming the standard for crystallography and related fields, see http://www.iucr.org/resources/cif. iCAT for example uses imgCIF for crystallographic binary image data. There is a related mmCIF format for macro-molecular structures.

DEM:
a Digital Elevation Model consists of a sampled array of elevations for ground positions that are normally at regularly spaced intervals. Information about this format, along with data availability, is available from USGS, the US Geological Survey. Note there are several related DEM file formats.

DLG-3:
the Digital Line Graph format is used for cartography by the USGS to store geographical vector data as part of a geographical information system.

FITS:
the Flexible Image Transport System is a digital file format used to store, transmit and manipulate scientific and other images. It is the most commonly used digital file format in astronomy. Unlike many image formats, FITS is designed specifically for scientific data and hence includes many provisions for describing photometric and spatial calibration information, together with image origin metadata. See http://fits.gsfc.nasa.gov.

A major feature of the FITS format is that image metadata is stored in a human readable ASCII header, so that an interested user can examine the headers to investigate a file of unknown provenance.

FITS is also often used to store non-image data, such as spectra, photon lists, data cubes, or even structured data such as multi-table databases. A FITS file may contain several extensions, and each of these may contain a data object. For example, it is possible to store X-ray and infrared exposures in the same file.
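The following short Python sketch, assuming the astropy library is installed, illustrates these points: image data is written with human readable header keywords and read back for inspection. The file name and keywords are illustrative only.

  # Minimal sketch of reading and writing FITS files with astropy
  # (assumed to be installed); file name and keywords are illustrative.
  import numpy as np
  from astropy.io import fits

  # Create an image HDU and record some human readable header metadata,
  # stored as ASCII keyword/value cards as described above.
  image = np.random.random((256, 256)).astype(np.float32)
  hdu = fits.PrimaryHDU(image)
  hdu.header["OBSERVER"] = "Example observer"
  hdu.header["EXPTIME"] = (30.0, "exposure time in seconds")
  hdu.writeto("example.fits", overwrite=True)

  # Re-open the file and inspect header and data; a FITS file may also
  # carry further extensions (hdul[1], hdul[2], ...) holding tables or
  # additional images.
  with fits.open("example.fits") as hdul:
      print(hdul[0].header["OBSERVER"])
      print(hdul[0].data.shape)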

GRIB-1 and GRIB-2:
GRIB (GRIdded Binary) is a concise data format commonly used in meteorology to store historical and forecast weather data. It is standardised by the World Meteorological Organisation's Commission for Basic Systems and is known as GRIB FM 92-IX. The first version is still used operationally worldwide by most meteorological centres for numerical weather prediction output. Since the introduction of GRIB-2, data is slowly changing over to the new format; GRIB-2 is, for example, used for derived products distributed via EUMETCast from Meteosat Second Generation. Another example is NAM, the North American Mesoscale model. See http://www.wmo.ch/pages/prog/www/DPS/FM92-GRIB2-11-2003.pdf.

HDF-5:
the Hierarchical Data Format is a general purpose library and file format for storing scientific data. It is a self defining file format for the transfer of various types of data between different machines. The HDF library contains interfaces for storing and retrieving compressed or uncompressed raster images with palettes, and an interface for storing and retrieving n-dimensional scientific datasets together with information about the data, such as labels, units, formats and scales for all dimensions. HDF-5 stores two primary kinds of object: data sets and groups. A data set is essentially a multi-dimensional array of homogeneous data elements and a group is a structure for organising objects (data sets or other groups) within an HDF-5 file. Using these two basic objects, one can create and store almost any kind of scientific data structure, such as images, arrays of vectors and structured and unstructured meshes. Data is accessed using a POSIX style path notation.

HDF was originally from NCSA and is now supported by the HDF Group, see http://www.hdfgroup.org. We note that HDF-4 still exists, but is significantly different both in design and API.

Note: Q5cost is an HDF-5 based format which has a Fortran API developed in an EU COST D23 project for computational chemistry [3], see http://abigrid.cineca.it/abigrid/the-docs-archive/q5cost/.
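The sketch below, assuming the h5py Python binding to HDF-5, illustrates the group and data set model and the POSIX style path access described above; group, data set and attribute names are purely illustrative.

  # A minimal sketch of the HDF-5 group/data set model using the h5py
  # Python binding (assumed available); names and values are illustrative.
  import numpy as np
  import h5py

  with h5py.File("simulation.h5", "w") as f:
      # Groups organise objects much like directories; data sets are
      # multi-dimensional arrays of homogeneous elements.
      run = f.create_group("/runs/run001")
      temp = run.create_dataset("temperature", data=np.zeros((64, 64, 64)),
                                compression="gzip")
      # Attributes attach metadata (labels, units, scales) to an object.
      temp.attrs["units"] = "K"

  with h5py.File("simulation.h5", "r") as f:
      # Objects are addressed with a POSIX style path, as noted above.
      dset = f["/runs/run001/temperature"]
      print(dset.shape, dset.attrs["units"])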

NetCDF:
the Network Common Data Form is a set of software libraries and self describing, machine independent data formats that support the creation, access and sharing of array oriented scientific data. NetCDF implements a machine independent, self describing, extensible file format. The project homepage is hosted by Unidata at UCAR, the University Corporation for Atmospheric Research, who are also the chief source of netCDF software, standards development, updates, etc. NetCDF is an open standard and is widely used for climate modelling and related studies.

The project is actively supported. The recently released (2008) version 4.0 greatly enhances the data model by allowing the use of the HDF-5 data file format. See http://www.unidata.ucar.edu/software/netcdf.
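As an illustration of the self describing, array oriented model, the following sketch uses the netCDF4 Python module (assumed to be available) to create dimensions and a variable with a units attribute, and then to read them back; the names and values are illustrative.

  # A minimal sketch of the netCDF data model using the netCDF4 Python
  # module (assumed available); dimension and variable names are examples.
  import numpy as np
  from netCDF4 import Dataset

  with Dataset("climate.nc", "w", format="NETCDF4") as ds:
      # netCDF files are self describing: dimensions, variables and
      # attributes are all recorded in the file itself.
      ds.createDimension("time", None)        # unlimited dimension
      ds.createDimension("lat", 90)
      ds.createDimension("lon", 180)
      t = ds.createVariable("temperature", "f4", ("time", "lat", "lon"))
      t.units = "K"
      t[0, :, :] = np.full((90, 180), 288.0, dtype="f4")

  with Dataset("climate.nc", "r") as ds:
      print(ds.variables["temperature"].units,
            ds.variables["temperature"].shape)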

NeXus:
is a common data format for neutron, X-ray, and muon science. It is being developed as an international standard by scientists and programmers representing major scientific facilities in Europe, Asia, Australia and USA in order to facilitate greater cooperation in the analysis and visualisation of neutron, X-ray and muon data. See http://www.nexusformat.org.

Like HDF-5, NeXus is a hierarchical format with a directory style structure. Some metadata in NeXus files is encoded as XML with standard tag names, making them easy to interpret. NeXus is used in iCAT.

OpenMath:
aims at developing a standard exchange format for mathematical objects such as formulae processed by computer algebra systems. See http://www.openmath.org.

PDS:
the Planetary Data System (PDS) is an archive which has been responsible for archiving space mission data on CD-ROM media, using its own self describing data format, variously known as PDS or ODL, the Object Description Language. At least some current projects (e.g. Magellan, Galileo) are using the PDS format as a ``pointer'' to detached VICAR format imagery on the mission CD-ROM volumes. See http://pds.nasa.gov.

SAIF:
Spatial Archive and Interchange Format is a Canadian standard for the exchange of geographic data. It uses an object oriented data model and consists of definitions of the underlying building blocks, including tuples, sets, lists, enumerations, and primitives. A company, Safe Software, was formed to provide tools and training for the SAIF data standard.

SDTS:
the Spatial Data Transfer Standard is US Federal Information Processing Standard (FIPS) 173 for the transfer of geological and other spatial data. Documentation and examples are available from the USGS. There are SDTS versions of DEM and DLG.

VICAR:
Video Image Communication and Retrieval is a collection of image processing programs supported by the Multi-Mission Image Processing Laboratory (MIPL) at the US Jet Propulsion Laboratory (JPL), for use in manipulating and analysing images from spacecraft. The image format used by VICAR programs and for all or most data from JPL managed missions, is referred to as VICAR format. An independent third party description of the VICAR image format is available.

Miscellaneous graphics formats:
include formats for storing graphics files - TIFF, GIF, JPEG, FLI, CGM, MPEG, etc.

There are many other formats, some proprietary or application specific, see Section 3.2.


Data Transfer Tools and Wide Area File Systems

Distributed File Systems are sometimes called Distributed Datastore Networks - see Wikipedia. In this technical report we consider only those which work over a wide area network and are therefore suitable for campus or inter-site computing and data management. Unless otherwise stated below, the systems described have multiple implementations, are location dependent and support access control lists (ACLs).

We separate the rest of this section into server centric storage systems and those supporting distributed file servers. Most systems do however have some dependency on one or more central services, such as metadata services, database or catalogues, which are noted. We assume that all systems reviewed can access distributed storage or provide storage to distributed clients in some way.

Keywords and Definitions

Keywords include: data migration; hierarchical storage management; information lifecycle management; storage area network; tiered storage.

Data Migration

Data migration is the transfer of data between storage types, formats or computer systems. It is usually performed programmatically to achieve an automated migration, freeing staff from tedious tasks. It is required when organisations or individuals change computer systems or upgrade to new systems. Migration is a key issue in data curation, see the Digital Curation Centre http://www.dcc.ac.uk. Migration is thus a means of overcoming technological obsolescence by transferring digital resources from one hardware or software generation to the next. The purpose of migration is to preserve the intellectual content of digital objects and to retain the ability for clients to retrieve, display and otherwise use them in the face of constantly changing technology. Migration differs from the refreshing of storage media in that it is not always possible to make an exact digital copy or replicate original features and appearance and still maintain the compatibility of the resource with the new generation of technology.
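As a much simplified illustration of what performing migration programmatically can involve, the following Python sketch copies each file to a new storage location and verifies a checksum before the original would be retired. The paths are hypothetical, and a real migration would also carry metadata and provenance along with the data.

  # An illustrative sketch (not a production tool) of programmatic data
  # migration: each file is copied to the new storage location and its
  # checksum verified, so the migrated copy can be trusted before the
  # original is retired.
  import hashlib
  import shutil
  from pathlib import Path

  def sha256_of(path: Path) -> str:
      h = hashlib.sha256()
      with path.open("rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              h.update(chunk)
      return h.hexdigest()

  def migrate(src_root: Path, dst_root: Path) -> None:
      for src in src_root.rglob("*"):
          if not src.is_file():
              continue
          dst = dst_root / src.relative_to(src_root)
          dst.parent.mkdir(parents=True, exist_ok=True)
          shutil.copy2(src, dst)                  # preserves timestamps
          if sha256_of(src) != sha256_of(dst):    # verify before retiring
              raise IOError("checksum mismatch migrating %s" % src)

  # migrate(Path("/old/storage"), Path("/new/storage"))   # hypothetical paths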

ILM

Information Lifecycle Management refers to a wide ranging set of strategies for administering storage systems on computing devices. Specifically, four categories of storage strategies may be considered under the auspices of ILM. These concern: policies including SLAs around data management; operational aspects including backup and data protection; logical and physical infrastructure; and definition of how the strategies are applied.

SAN

A Storage Area Network (SAN) is a high speed network designed to attach computer storage devices such as disk array controllers and tape libraries to servers. As of 2006, SANs are most commonly found in enterprise (e.g. campus) storage.

A SAN allows a machine to connect to remote targets such as disks and tape drives on a network usually for block level I/O. From the point of view of the class drivers and application software, the devices appear as locally attached devices.

There are two variations of SANs:

  1. A network whose primary purpose is the transfer of data between computer systems and storage elements. A SAN consists of a communication infrastructure, which provides physical connections, and a management layer, which organises the connections, storage elements and computer systems so that data transfer is secure and robust.

  2. A storage system consisting of storage elements, storage devices, computer systems, and/ or appliances, plus all control software, communicating over a network.

Tiered Storage

Tiered storage is a data storage environment consisting of two or more kinds of storage delineated by differences in at least one of these four attributes: Price; Performance; Capacity; Function. Any significant difference in one or more of the four defining attributes can be sufficient to justify a separate storage tier.

Examples:

HSM

Hierarchical Storage Management is related to tiered storage. It is a data storage technique that automatically moves data between high cost and low cost storage media. HSM systems exist because high speed storage devices, such as hard disk drives, are typically more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. While it would be ideal to have all data available on high speed devices all the time, this is prohibitively expensive for many installations. HSM systems instead store the bulk of the organisation's data on slower devices and copy data to faster disk drives only when needed. In effect, HSM turns the fast disk drives into caches for the slower mass storage. The HSM system monitors the way data is used and makes best guesses as to which data can safely be relegated to slower devices and which data should stay on the hard disks.

HSM implements policy based management of file backup and archiving in a way that uses storage devices economically and without the user needing to be aware of when files are being retrieved from backup storage media. Although HSM can be implemented on a standalone system, it is more frequently used in the distributed network of an enterprise. The hierarchy represents different types of storage media, such as redundant array of independent disks systems, optical storage, or tape, each type representing a different level of cost and speed of retrieval when access is needed. For example, as a file ages in an archive, it can be automatically moved to a slower but less expensive form of storage. Using an HSM product, an administrator can establish and state guidelines for how often different kinds of files are to be copied to a backup storage device. Once the guideline has been set up, the HSM software manages everything automatically.

HSM adds to archiving and file protection for disaster recovery the capability to manage storage devices efficiently, especially in large scale user environments where storage costs can mount rapidly. It also enables the automation of backup, archiving, and migration to the hierarchy of storage devices in a way that frees users from having to be aware of the storage policies. Older files can automatically be moved to less expensive storage. If needed, they appear to be immediately accessible and can be restored transparently from the backup storage medium. The apparently available files are known as stubs and point to the real location of the file in backup storage. The process of moving files from one storage medium to another is known as migration.

An administrator can set high and low thresholds for hard disk capacity that HSM software will use to decide when to migrate older or less-frequently used files to another medium. Certain file types, such as executable files (programs), can be excluded from those to be migrated.
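A much simplified sketch of such a threshold policy is given below in Python: files on the fast tier that have not been modified for a given period and exceed a minimum size become candidates for migration. The mount point and thresholds are hypothetical, and a real HSM product would also leave stubs behind so that recall is transparent.

  # Illustrative sketch of an HSM-style selection policy: files untouched
  # for a given period and above a minimum size are candidates for
  # migration to a cheaper tier. Stub creation and recall are omitted.
  import time
  from pathlib import Path

  FAST_TIER = Path("/fast/disk")         # hypothetical mount point
  AGE_DAYS = 90                          # migrate files untouched for 90 days
  MIN_SIZE = 10 * 1024 * 1024            # ignore files smaller than 10 MB

  def migration_candidates(root=FAST_TIER, age_days=AGE_DAYS, min_size=MIN_SIZE):
      cutoff = time.time() - age_days * 86400
      for path in root.rglob("*"):
          if not path.is_file():
              continue
          st = path.stat()
          if st.st_mtime < cutoff and st.st_size > min_size:
              yield path, st.st_size

  # for path, size in migration_candidates():
  #     print("would migrate", path, size, "bytes")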

File Migration

Assuming a file based storage system, efficient file migration services are at the heart of HSM. It is also relevant when moving data from one vendor's product to another, but see under Data Migration above.

File migration thus arises from an ILM strategy that relegates data to less expensive devices as the data decreases in value to the enterprise. Migration may also be driven by a need to simplify or standardise environments, to improve storage space utilisation, to balance workloads between filers, or to consolidate storage management.

File migration services are present in many commercial products, e.g. from CommVault, HP, LSI and Symantec.

Server Centric Storage Systems

This section lists some traditional storage systems aimed principally at backup and restore, but now increasingly including business logic. These are typically aimed at managing (e.g. by indexing) many small files to produce a searchable archive store or collection.

CASTOR:
the CERN Advanced Storage Manager is used as an interface to storage systems for high energy physics, including the Atlas Data Store at RAL. This is an HSM system in which files can be stored, listed, retrieved and accessed using command line tools or applications built on top of different data transfer protocols such as RFIO (Remote File IO), the ROOT libraries, GridFTP and XROOTD. CASTOR manages disk cache(s) and the data on tertiary storage or tapes. CASTOR provides a POSIX like directory structure with a single namespace per site. All files are staged to allow for retrieval from tape etc. on demand. Metadata is contained in a central database. See http://castor.web.cern.ch. This is also discussed by Stewart et al. [17] in the context of EGEE storage management.

CommVault:
the Simpana-8 product has modules for backup, archive, replication, resource management and search built on a common software platform. Modules can be individually licensed, see http://www.commvault.com.

Simpana software indexes and manages data across all tiers of enterprise storage (including laptops) - online, archive and backup - as a single virtual pool. Global de-duplication is implemented. This is aimed at data and legal queries, which can be issued from a ``self service'', search engine like interface.

HDS:
Hitachi Data Systems offer storage management and virtualisation. See http://www.hds.com/solutions/infrastructure.

HP:
HP Neoview data warehousing is aimed at the commercial sector and includes data analysis and customer relationship management. It is typically delivered alongside HP storage solutions and other business intelligence products. See http://h71028.www7.hp.com/enterprise/w1/en/software/business-intelligence-neoview.html.

IBRIX Fusion:
from HP is based on a patented segmented file system architecture that, unlike other file systems, does not require a central metadata server or a distributed lock manager. Good speed up and scalability are achieved by ``parallelizing'' the data as well as the metadata. It is available for Linux under a proprietary software license.

Isilon:
clustered storage system architecture consists of independent nodes that are all integrated with the OneFS operating system software. The systems can be installed in standard data center environments and are accessible to users and applications running Windows, UNIX/ Linux and Mac operating systems using industry standard file sharing protocols over standard Gigabit Ethernet. The OneFS operating system software is designed with file striping functionality across each node in a cluster, a fully distributed lock manager, caching, fully distributed metadata, and a remote block manager to maintain global coherency and synchronization across the cluster. See Wikipedia.

LSI:
offers traditional high performance networking and storage for HPC systems, see http://www.lsi.com/storage_home/high_performance_computing/index.html.

The StoreAge MultiMigrate product, now from LSI (see http://www.lsi.com/storage_home/products_home/storage_virtualization_data_services/storeage_multimigrate), enables the online migration of data from any storage device to any other, regardless of vendor. The migration takes place while the applications remain online, without any interruption. It is aimed at migrating critical applications from older storage devices onto newer platforms.

ONStor:
See http://www.onstor.com/ for clustered NAS storage gateways. Among other things ONStor offers an SMB implementation that also supports the NFS protocol, so users can access the same data through both protocols, see Section 2.4. Note that ONStor is now part of LSI.

RelData:
UnitedStorage is an IP storage gateway which consolidates and virtualises open storage resources, providing a storage ``pool'' over an existing IP network infrastructure in NAS or SAN form. Storage is thereby better utilised, protected and centrally managed. RelData has open back end connectivity, enabling existing storage to be re-used, and also permits new fibre channel disk arrays to be added. There is no vendor tie in.

Symantec:
the Enterprise Vault product offers a range of storage and backup solutions aimed at Microsoft servers and is typically used for archiving e-mail and SharePoint content. It has tools for legal and compliance testing in commercial settings.

Tivoli:
product line from IBM, see http://www-03.ibm.com/systems/storage/software/. Tivoli includes StorageManager-6 HSM for Microsoft Windows and Sharepoint. Virtualisation is available for storage consolidation. For high performance cluster and networked storage see GPFS in Section 2.5.4.

Systems with Distributed Services

High performance computing environments require parallel file systems. Traditional server based file systems such as those exported via NFS are unable to efficiently scale to support hundreds of nodes or utilise multiple servers. Parallel file systems are typically deployed for dedicated high performance storage solutions within clusters, usually as part of a vendor’s integrated cluster solution. These parallel file systems are often tightly integrated with a single cluster’s hardware and software environment making sharing them impractical. Recently, several parallel filesystems have been introduced that are designed to make sharing a filesystem between clusters feasible in the presence of hardware and software heterogeneity. These are the ones discussed in this section.

Here we review distributed file systems, possibly also parallel and fault tolerant, which stripe and replicate data over multiple servers for high performance and to maintain data integrity: even if a server fails no data is lost. These file systems are used in both HPC and high availability clusters.

All file systems listed here focus on high availability, scalability and high performance unless otherwise stated below. Whilst these provide distinct advantages over more traditional file systems they may be more or less complicated to install, configure and manage and may require specific Linux kernel patches.

AFS:
Andrew File System is scalable and location independent, has a large client cache and uses Kerberos for authentication. AFS is a distributed networked file system which uses a set of trusted servers to present a homogeneous, location transparent file name space to all the client workstations. It was developed by Carnegie Mellon University as part of the Andrew Project and is named after Andrew Carnegie and Andrew Mellon. See Wikipedia.

AFP:
Apple Filing Protocol for MacOS supplements SMB, NFS and FTP. See Wikipedia.

Ceph:
a scalable, distributed, open source file system from the Storage Systems Research Center, UC Santa Cruz. It is aimed at petabyte storage with replication and fault tolerance. Ceph has been included as an experimental feature in Linux kernels since 2.6.34 in March 2010. Earlier versions used FUSE.

Ceph storage consists of a potentially large number of data object servers (bricks), a smaller set of metadata server daemons, and a few monitor daemons for managing cluster membership and state. Metadata daemons use the Paxos consensus protocol, see http://en.wikipedia.org/wiki/Paxos_algorithm. The storage daemons rely on the new Linux BTRFS object store. This makes the storage cluster simple to deploy, while providing scalability not currently available from block based cluster file systems. V0.20 is available for Linux under LGPL from SourceForge, see http://ceph.newdream.net.

Coda:
is a fault tolerant distributed file system from Carnegie Mellon University which focuses on bandwidth adaptive operation, including disconnected operation using a client side cache for mobile computing. It is a descendant of AFS-2, see the AFS architecture section below. The client side caching can give good performance for multiple read operations. It is available for Linux under the GPL. See Wikipedia.

dCache:
from Fermilab and DESY is part of the EGEE data management architecture and aims to provide a mechanism for storing and retrieving huge amounts of data among a large number of heterogeneous server nodes, which can be of varying architectures. It provides a single namespace view of all of the files that it manages and allows access to these files using a variety of protocols, including SRM. By connecting dCache to a tape backend, it becomes a hierarchical storage manager (HSM). See http://www.dcache.org and also Stewart et al. [17].

DCE/DFS:
Distributed File System from IBM (earlier Transarc) is similar to AFS with a focus on full POSIX file system semantics and high availability. Available for AIX and Solaris under a proprietary software license. See the AFS architecture section below.

DFS:
fault tolerant Distributed File System from Microsoft focuses on location transparency and redundancy for high availability. Based on SMB and available for Windows under a proprietary software license. See Wikipedia.

eXludus:
High performance data management for optimisation within a cluster. This relies on multiple multicast routes on the internal cluster network and is implemented as a storage server and clients on each node. Benchmarking has reported good scalability but overall performance somewhat slower than GPFS. Such solutions can improve application startup times. See separate reports [9,12].

FraunhoferFS:
from the Fraunhofer Society Competence Center for High Performance Computing. Available free of charge for Linux under a proprietary license. See http://www.fhgfs.com.

FraunhoferFS is written from scratch and incorporates results from previous experience. It is a fully POSIX compliant, scalable file system including features as follows.

GAM:
IBM's Grid Access Manager [13] software delivers a virtualisation and data protection layer that creates a unified, fixed content storage interface across multiple facilities and heterogeneous storage media. See http://www-03.ibm.com/systems/storage/software/gam.

GAM software enables formation of fixed content storage systems that can scale to petabytes of data across numerous sites. The system's efficient wide area replication is essential to delivering a storage system spanning sites linked together with differing bandwidth networks, even in a straightforward single site and disaster recovery configuration. Furthermore, GAM software optimises broad availability of data across sites through network file system interfaces (CIFS and NFS) and as an object store delivering a global namespace.

GFS:
Google File System has a focus on fault tolerance, high throughput and scalability. It was originally developed for internal use to provide efficient, reliable access to data using large clusters of commodity hardware, but is now freely available as GFS-2. It is based on the concept of 64 MB chunks held on chunk servers; we note that Hadoop's HDFS follows the same design. See Wikipedia.

Gluster:
GlusterFS is a general purpose distributed file system for scalable storage. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system. GlusterFS is based on a stackable user space design without compromising performance.

GlusterFS has a client and server component. Servers are typically deployed as storage bricks, with each server running a glusterfsd daemon to export a local file system as a volume. The glusterfs client process, which connects to servers with a custom protocol over TCP/IP, InfiniBand or SDP, composes remote volumes into larger ones using stackable translators. The final volume is then mounted by the client host through the FUSE mechanism. Applications doing large amounts of file i/o can also use the libglusterfs client library to connect to the servers directly and run in-process translators, without going through the file system and incurring FUSE overhead.

Most of the functionality of GlusterFS is implemented as translators, including: file based mirroring and replication; file based striping; file based load balancing; volume failover; scheduling and disk caching; storage quotas.

The GlusterFS server is kept minimally simple - it exports an existing file system as-is, leaving it up to client side translators to structure the store. The clients themselves are stateless, do not communicate with each other, and are expected to have translator configurations consistent with each other. This can cause coherency problems, but allows GlusterFS to scale up to several petabytes on commodity hardware by avoiding bottlenecks that normally affect more tightly coupled distributed file systems.

There seems to be good community support for Gluster, although most users appear to be Web hosting companies, see http://www.gluster.org; it is being considered for use by Streamline Computing.

GPFS:
the General Parallel File System is a high performance shared disk clustered file system for AIX and Linux developed by IBM. It is used by many of the supercomputers that populate the Top 500 List. GPFS was evaluated in [5].

GPFS provides concurrent high speed file access to applications executing on multiple nodes of clusters. It can be used with AIX, Linux, Microsoft Windows Server 2003-r2 or a heterogeneous cluster of AIX, Linux and Windows nodes. In addition to providing filesystem storage capabilities, GPFS provides tools for management and administration of the GPFS cluster and allows for shared access to file systems from remote GPFS clusters.

HDFS:
the Apache Hadoop file system stores large files (an ideal size is a multiple of 64 MB) across multiple nodes or machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts or a SAN. With the default replication value of 3, data is stored on three nodes; for redundancy it is recommended that two are on the same rack and one on a different rack in the cluster. Hadoop is a Java application which uses a variant of the Google MapReduce method to split data out, perform functions on the data in parallel, create replicas of the blocks and re-combine (reduce) the results when required. In this way Hadoop can be used for the processing of large data sets; the map and reduce functions are typically expressed in a functional programming style, and a simplified sketch of the MapReduce pattern is given at the end of this entry.

The filesystem is thus built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS. They also serve the data over HTTP, allowing access to all content from a Web browser or other client. Data nodes can talk to each other to re-balance data, to move copies around, and to keep the replication of data high. Hadoop is based on Google's GFS and uses P2P methods. It can be used for storage scavenging. Other projects are now building on Hadoop, such as Yale University's HadoopDB, which also uses MapReduce and is aimed at petabyte databases. This may be very valuable for data mining applications, see Section [].

A filesystem requires one unique server, the name node. This is therefore a single point of failure for an HDFS installation. If the name node goes down, the filesystem is effectively off line. When it comes back up, the name node must replay all outstanding operations. Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system is an action that often needs to be performed before and after executing a job, so this can be inconvenient. A filesystem in user space has been developed to address this problem, at least for Linux and some other Unix systems.

It is speculated in 2010 that Sun Grid Engine v6.2-5 will include support for Hadoop. Thus SGE, which is aware of HDFS, would be able to route processing jobs to where the data is already located in the nodes, which speeds up execution of those jobs. This is more efficient than starting up a job somewhere and then trying to move the data over to that node. It can also be of use for chaining multi-stage jobs which use large data sets.

It was speculated in 2009 that Condor would take a similar approach from v7.4 onwards, although this has not yet been seen.

Hadoop is now available with support for running on Amazon EC2 or S3 clouds, although its performance has been questioned in this context.

A number of applications are becoming available built on Hadoop. Hive is a data warehouse infrastructure with its own query language (called Hive QL). Hive makes Hadoop more familiar to those with a Structured Query Language (SQL) background, but it also supports the traditional MapReduce infrastructure for data processing. HBase is a high performance database system similar to Google BigTable. Instead of traditional file processing, HBase makes database tables the input and output form for MapReduce processing. Finally, Pig is a platform on Hadoop for analysing large data sets. Pig provides a high level language that compiles to map and reduce applications.
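As mentioned above, Hadoop's processing model is MapReduce. The following single process Python sketch illustrates the pattern itself (here a simple word count): a map function is applied to each data chunk independently, intermediate results are grouped by key, and a reduce function combines them. Hadoop distributes exactly these steps across the nodes that already hold the HDFS blocks; this is an illustration of the programming model only, not of the Hadoop API.

  # A much simplified, single-process sketch of the MapReduce pattern.
  from collections import defaultdict

  def map_chunk(chunk):
      # emit (word, 1) pairs for one block of text
      for word in chunk.split():
          yield word.lower(), 1

  def reduce_key(word, counts):
      return word, sum(counts)

  def map_reduce(chunks):
      grouped = defaultdict(list)
      for chunk in chunks:                 # in Hadoop: one map task per block
          for key, value in map_chunk(chunk):
              grouped[key].append(value)   # the "shuffle" phase
      return dict(reduce_key(k, v) for k, v in grouped.items())

  print(map_reduce(["the quick brown fox", "the lazy dog", "the fox"]))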

HP:
StorageWorks file share for clusters is based on Lustre, see Section 2.5.3.

Lustre:
originated as a project at Carnegie Mellon University in 1999 as an object based, distributed file system, generally used for large scale cluster computing. The project aims to provide a file system for clusters of tens of thousands of nodes with petabytes of storage capacity, without compromising speed, security or availability through the decades. Due to the high scalability of Lustre file systems, Lustre deployments are popular in the oil and gas, manufacturing, rich media and finance sectors. Lustre was evaluated in [5].

Lustre is now designed, developed and maintained by Sun Microsystems, Inc. with input from many other individuals and companies after its acquisition of Cluster File Systems, Inc. in 2007, with the intention of bringing the benefits of Lustre technologies to Sun's ZFS file system and the Solaris operating system. See http://www.sun.com/software/products/lustre.

Lustre has failover, but multi-server RAID-1 or RAID-5 is still on the roadmap for future versions. It is available for Linux under the GPL. The largest current Lustre implementation is the TeraGrid Data Capacitor [15].

MogileFS:
from Danga Interactive is an open source distributed file system. It is not POSIX compliant and is designed for archiving, e.g. write once files which are read multiple times; this would be appropriate in a data collection scenario. It uses a flat namespace, operates at the application level, uses MySQL for metadata and NFS or HTTP for transport. Available for Linux (but may be ported) under the GPL. See http://www.danga.com/mogilefs.

NFS-4:
Network File System (NFS) originally from Sun is an open standard in POSIX based networked file systems. The NFS protocol allows clients on a network to mount shared file systems from one or more remote servers. NFS may use Kerberos authentication and a client cache. NFS v4.1 includes parallel file access and separates location metadata from the actual files. This is also referred to as pNFS. NFS-4 is developed and maintained by the IETF. See Wikipedia and also the AFS architecture Section 2.5.1 below.

OCFS:
Oracle Cluster File System, currently OCFS-2 is a POSIX compliant shared disk cluster file system for Linux capable of providing both high performance and high availability. See http://oss.oracle.com/projects/ocfs2.

As it provides local file system semantics, it can be used with any application. Cluster aware applications can make use of parallel i/o for higher performance. Applications that are not able to benefit from parallel i/o can take advantage of the file system to provide a fail-over setup to increase their availability.

Apart from being used with Oracle's Real Application Cluster (RAC) database product, OCFS-2 is currently in use to provide scalable Web servers and file servers as well as fail over mail servers and for hosting virtual machine images.

Some of the notable features of the file system are:

Variable block sizes
Flexible allocations (extents, sparse, unwritten extents with the ability to punch holes)
Journaling (ordered and write-back data journaling modes)
Endian and architecture neutral (x86, x86-64, ia64 and ppc64)
Built in Clusterstack with a distributed lock manager
Support for buffered, direct, asynchronous, splice and memory mapped i/o
Comprehensive tools support

PanFS:
Panasas ActiveScale File System uses object storage devices available as a proprietary storage solution. The DirectFLOW capability offers users fully parallel i/o to allow high speed, direct communications between Linux clusters and Panasas storage. Panasas have won several industry awards and have had a large influence on pNFS, see Section 2.4.

PanFS is used at LANL for RoadRunner and on the NW-GRID clusters and some other systems at Daresbury Laboratory. See http://www.panasas.com.

Parrot:
Parrot is a virtual file system component of Condor that allows computation jobs to access data stored on remote servers. It is used for deploying campus Grids, see http://www.cse.nd.edu/~ccl/software/parrot.

PeerFS:
from Radiant Data Corporation focuses on high availability and high performance and uses peer-to-peer replication with multiple sources and targets. Available for Linux under a proprietary software license, see http://www.radiantdata.com.

PVFS:
the Parallel Virtual File System is available for Linux under the GPL. It is designed to scale to petabytes of storage and provide access rates of 100s of GB/s. PVFS is supported by SciDAC through the Scientific Data Management Center and developed by a multi-institution team with an interest in reliability and performance, supporting multiple hardware platforms and MPI-IO implementations. The current version is PVFS-2, see http://www.pvfs.org. PVFS-2 was evaluated in [5].
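As an illustration of the kind of parallel access a file system such as PVFS is designed to support through MPI-IO, the sketch below uses the mpi4py binding (assumed to be installed) so that each MPI rank writes its own contiguous slice of a shared file with a collective operation. The file name and sizes are illustrative, and the script would be launched under mpirun with several processes.

  # A minimal sketch of parallel file access through MPI-IO via mpi4py.
  # Each rank writes its own contiguous slice of the shared file.
  import numpy as np
  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()

  n_local = 1024                                   # elements written per rank
  data = np.full(n_local, rank, dtype=np.float64)

  fh = MPI.File.Open(comm, "parallel_output.dat",
                     MPI.MODE_CREATE | MPI.MODE_WRONLY)
  offset = rank * n_local * data.itemsize          # byte offset for this rank
  fh.Write_at_all(offset, data)                    # collective write
  fh.Close()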

SMB:
Server Message Block originally from IBM operates as an application layer network protocol mainly used to provide shared access to files, printers, serial ports and miscellaneous communications between nodes on a network. It also provides an authenticated inter-process communication mechanism. Most usage of SMB involves computers running Microsoft Windows, where it is often known as Microsoft Windows Network. SMB is also known as Common Internet File System (CIFS) or Samba file system. SMB may use Kerberos authentication.

SRB and iRODS:
SRB, originally from the San Diego Supercomputer Center, provides the capability to virtualise distributed storage resources and to provide standardised access to a broad range of underlying technologies, from flat file systems to databases and tape archiving systems. Through SRB, users are freed from concerns about the location of data and determining the correct procedures to recall or transfer data to their local or host compute environment. SRB abstracts these aspects of distributed data management away from the end user and provides a simplified and uniform way to recall data via indexing systems (meta-catalogues) which keep a logical mapping of the underlying distributed data. SRB has been widely adopted within large scale Grid applications, particularly in the science communities, and provides the data management backbone for the National Grid Service (NGS).

The developers of SRB have now developed iRODS, another data Grid software system with the important addition of a distributed rules engine. Users can execute rules and micro-services to automate the enforcement of management policies to control data access, manipulation operations at distributed sites, etc. The use of rules provides iRODS with a flexibility that would have had to be hard coded using SRB.

Tahoe:
is a distributed file system from allmydata.com, which they claim safely stores files on multiple machines to protect against hardware failures. Cryptographic tools are used to ensure integrity and confidentiality, and a de-centralised architecture minimises single points of failure. Files can be accessed through a Web interface or native system calls via FUSE. Fine grained sharing allows individual files or directories to be delegated by passing short URI like strings through e-mail. Tahoe grids are easy to set up, and can be used by a handful of friends or by a large company for thousands of customers. Tahoe relies on distributing multiple copies of data split into blocks and using an erasure coding re-construction algorithm. It is coded in Python. See http://allmydata.org/~warner/pycon-tahoe.html.
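As a deliberately over-simplified illustration of erasure style redundancy, the sketch below uses a single XOR parity block; Tahoe itself uses Reed-Solomon style erasure coding over many shares, so this shows only the reconstruction principle, not Tahoe's actual scheme.

  # Simplified erasure-style redundancy with one XOR parity block: the file
  # is split into blocks, a parity block is stored alongside them, and any
  # one missing block can be rebuilt from the survivors.
  def xor_blocks(blocks):
      out = bytearray(len(blocks[0]))
      for block in blocks:
          for i, b in enumerate(block):
              out[i] ^= b
      return bytes(out)

  data_blocks = [b"aaaa", b"bbbb", b"cccc"]
  parity = xor_blocks(data_blocks)

  # Suppose the second block is lost; it can be recovered from the rest.
  recovered = xor_blocks([data_blocks[0], data_blocks[2], parity])
  assert recovered == data_blocks[1]
  print("recovered:", recovered)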


Detailed Architectures

We have chosen to illustrate in more detail the architectures of four of the distributed file systems: AFS, iRODS, Lustre, and GPFS. All are suitable for use in large data centres or experimental facilities. In future Ceph and GFS may be evaluated too.


AFS

AFS has several benefits over traditional networked file systems, particularly in the areas of security and scalability. It is not uncommon for enterprise AFS cells to exceed 25,000 clients. AFS uses Kerberos for authentication and implements access control lists on directories for users and groups. Each client caches files on the local filesystem for increased speed on subsequent requests for the same file. This also allows limited filesystem access in the event of a server crash or a network outage.

Read and write operations on an open file are directed only to the locally cached copy. When a modified file is closed, the changed portions are copied back to the file server. Cache consistency is maintained by a callback mechanism. Clients are informed by the server if the file is changed elsewhere. Callbacks are discarded and must be re-established after any client, server or network failure, including a time-out. Re-establishing a callback involves a status check and does not require re-reading the file itself.

A consequence of the file locking strategy is that AFS does not support large shared databases or record updating within files shared between client systems. This was a deliberate design decision based on the perceived needs of the university computing environment.

A significant feature of AFS is the ``volume'', a tree of files, sub-directories and AFS mount points (links to other AFS volumes). Volumes are created by administrators and linked at a specific named path in an AFS cell. Once created, users of the filesystem may create directories and files as usual without concern for the physical location of the volume. A volume may have a quota assigned to it in order to limit the amount of space consumed. As needed, AFS administrators can move that volume to another server and disk location without the need to notify users; indeed the operation can occur while files in that volume are being used.

AFS volumes can be replicated to read only cloned copies. When accessing files in a read only volume, a client system will retrieve data from a particular read only copy. If at some point that copy becomes unavailable, clients will look for any of the remaining copies. Again, users of that data are unaware of the location of the read only copy; administrators can create and relocate such copies as needed. The AFS command suite guarantees that all read only volumes contain exact copies of the original read write volume at the time the read only copy was created.

The file name space on an Andrew workstation is partitioned into a shared and local name space. The shared name space (usually mounted as /afs on the Unix filesystem) is identical on all workstations. The local name space is unique to each workstation. It only contains temporary files needed for workstation initialisation and symbolic links to files in the shared name space.

AFS heavily influenced SUN's NFS-4. Additionally, a variant of AFS, the Distributed File System (DFS) was adopted by the Open Software Foundation in 1989 as part of their Distributed Computing Environment.

There are currently three major implementations from Transarc (IBM), OpenAFS and Arla, although the Transarc software is losing support and is deprecated. AFS-2 is also the predecessor of the Coda file system.

A fourth implementation exists in the Linux kernel source code since at least version 2.6.10. This was committed by Red Hat, but is a fairly simple implementation in its early stages of development and therefore still incomplete.

Note: AFS is in use at University of Manchester and clients can be provided on the National Grid Service clusters if projects require them.


iRODS

iRODS stands for integrated Rule Oriented Data Systems. It is a second generation data grid system providing a unified view and seamless access to distributed digital objects across a wide area network. It is an evolution of the first generation data grid system, the Storage Resource Broker (SRB), which provided a unified view based on logical naming concepts - users, resources, data objects and virtual directories were abstracted by logical names and mapped onto physical entities - providing a physical-to-logical independence for client level applications. iRODS builds upon this by abstracting the data management process itself - this is referred to as policy abstraction.

iRODS v1.0 provides user friendly installation tools, a modular environment for extensibility through micro-services, a robust Web based interface and support for Java, C and shell programming through libraries and utilities for application development.

The ``integrated'' part of iRODS comes from the fact that it provides a unified software envelope for interactions with a host of underlying services which interact in a complex fashion among themselves. This idea differs from a toolkit methodology, where one is provided with a suite of modules (tools) which can be integrated by the user or application to form a customised system. The integrated envelope exposes a uniform interface to the client application, hiding the complexity of dealing with the details. SRB also had an integrated envelope methodology, with a single server installation hiding the details of third party authentication, authorisation, auditing, metadata management, streaming access mechanisms, resources (at vendor level) and other information.

iRODS currently has around 100 API functions and 80 command level utilities. These build upon the integrated envelope adding more functionality and services. Functionalities include the following.

For an evaluation of iRODS (the JISC funded iREAD project) see http://www.wrg.york.ac.uk/iread.

Note: SRB has been used in the NERC funded e-Minerals e-Science project, the JISC funded Cheshire-3 VRE project and on the Diamond synchrotron facility. iRODS is under evaluation at STFC.


Lustre

A Lustre file system has three major functional units: one or more metadata servers (MDS) with metadata targets (MDT) that store the namespace metadata; one or more object storage servers (OSS) exporting object storage targets (OST) that hold the file data; and the clients that access and use the data.

The MDT, OST, and client can be on the same node or on different nodes. In typical installations these functions are on separate nodes with two to four OSTs per OSS node communicating over a network. Lustre supports several network types, including Infiniband, TCP/IP on Ethernet, Myrinet, Quadrics and other proprietary technologies. Lustre can take advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.

The storage attached to the servers is partitioned, optionally organised with logical volume management (LVM) and/ or RAID and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.

An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read and write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use a modified version of ext3 to store data. In the future, Sun's ZFS or DMU will also be used to store data.

When a client accesses a file, it completes a filename lookup on the MDS. As a result, a file is created on behalf of the client or the layout of an existing file is returned to the client. For read or write operations, the client then passes the layout to a logical object volume (LOV), which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
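To make the layout-to-object mapping concrete, the following Python sketch illustrates round-robin (RAID-0 style) striping of a file across OST objects. The stripe size and stripe count used here are illustrative values, not Lustre defaults, and in a real system the layout is obtained from the MDS.

    # Minimal sketch of RAID-0 style striping as used by Lustre's LOV layer.
    # stripe_size and stripe_count are illustrative values, not defaults.

    def object_for_offset(offset, stripe_size=1 << 20, stripe_count=4):
        """Return (object index, offset within that object) for a file offset."""
        stripe_no = offset // stripe_size          # which stripe the byte falls in
        obj_index = stripe_no % stripe_count       # objects are used round-robin
        obj_offset = (stripe_no // stripe_count) * stripe_size + offset % stripe_size
        return obj_index, obj_offset

    def objects_for_range(offset, length, stripe_size=1 << 20, stripe_count=4):
        """List the objects (and hence OSTs) touched by a read or write."""
        touched = set()
        pos = offset
        while pos < offset + length:
            obj, _ = object_for_offset(pos, stripe_size, stripe_count)
            touched.add(obj)
            pos = (pos // stripe_size + 1) * stripe_size   # next stripe boundary
        return sorted(touched)

    if __name__ == "__main__":
        # A 6 MB write starting at offset 3 MB spans stripes on all four objects,
        # which is why client bandwidth scales with the number of OSTs.
        print(objects_for_range(3 << 20, 6 << 20))

Because each object lives on a different OST, the client can issue the per-object transfers in parallel, which is the source of the near-linear bandwidth scaling described above.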

Clients do not directly modify the objects on the OST filesystems, but, instead, delegate this task to OSSes. This approach ensures scalability for large scale clusters and supercomputers, as well as improved security and reliability.

In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified POSIX like filesystem even though it may be composed of tens to thousands of individual servers and MDT or OST filesystems.

On some MPP installations, computational processors can access a Lustre file system by redirecting their I/O requests to a dedicated node configured as a Lustre client. This approach was used, for instance, in the LLNL BlueGene installation.

Another approach uses the liblustre library to provide userspace applications with direct filesystem access. Liblustre is a user level library that allows computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors can access a Lustre file system even if the service node on which the job was launched is not a Lustre client. Liblustre allows data movement directly between application space and the Lustre OSSes without requiring an intervening data copy through the kernel, thus providing low latency, high bandwidth access from computational processors to the Lustre file system directly. Good performance characteristics and scalability make this approach the most suitable for using Lustre with MPP systems. Liblustre is the most significant design difference between Lustre implementations on MPPs such as Cray XT3 and Lustre implementations on conventional clustered computational systems.

High availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version inter-operability between successive minor versions of the software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, merely experiencing a delay.

Note: Lustre is being deployed for the Diamond synchrotron facility.


GPFS

GPFS is an IBM proprietary filesystem which provides high performance by allowing data to be accessed over multiple computers at once. Most existing file systems are designed for a single server environment and adding more file servers does not improve performance. GPFS provides higher input and output performance by ``striping'' blocks of data from individual files over multiple disks, and reading and writing these blocks in parallel. Other features provided by GPFS include high availability, support for heterogeneous clusters, disaster recovery, security, DMAPI, HSM and ILM. See http://www.ibm.com.

A file that is written to GPFS is broken up into blocks of a configured size, less than 1MB each. These blocks are distributed across multiple nodes, so that a single file is fully distributed across the disk array. This results in high read and write speeds as the combined bandwidth of the many physical drives is high. This however makes the filesystem vulnerable to disk failures. To prevent data loss, the filesystem nodes therefore have RAID controllers. It is also possible to replicate blocks on different filesystem nodes.

Further features of the filesystem support ILM, including storage pools, filesets and user defined policies, described below.

For ILM, storage pools allow for the grouping of disks within a file system. Tiers of storage can be created by grouping disks based on performance, locality or reliability characteristics. For example, one pool could be high performance fibre channel disks and another more economical SATA storage.

A fileset is a sub-tree of the file system namespace and provides a way to partition the namespace into smaller, more manageable units. Filesets provide an administrative boundary that can be used to set quotas and be specified in a policy to control initial data placement or data migration. Data in a single fileset can reside in one or more storage pools. Where the file data resides and how it is migrated is based on a set of rules in a user defined policy.

There are two types of user defined policies in GPFS: file placement and file management. File placement policies direct file data to the appropriate storage pool as files are created. File placement rules are determined by attributes such as the file name, the user name or the fileset. File management policies allow a file's data to be moved or replicated, or files to be deleted. File management policies can be used to move data from one pool to another without changing the file's location in the directory structure. File management policies are determined by file attributes such as last access time, path name or size of the file.
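To illustrate the flavour of such policies, the following Python sketch evaluates simple placement and management rules against file attributes. It is a conceptual sketch only, not GPFS's SQL-like policy language, and the pool names, thresholds and file suffixes are invented.

    # Illustrative sketch of placement and management rules in the spirit of GPFS
    # ILM policies; the rule logic and pool names here are invented, not GPFS's own.
    import os, time

    DAY = 86400

    def placement_pool(path, fileset):
        """File placement: choose a storage pool when a file is created."""
        if fileset == "scratch":
            return "sata_pool"                  # cheap storage for scratch data
        if path.endswith((".nc", ".h5")):
            return "fc_pool"                    # fast fibre channel disks for datasets
        return "sata_pool"

    def management_action(path):
        """File management: migrate or delete based on size and last access time."""
        st = os.stat(path)
        age_days = (time.time() - st.st_atime) / DAY
        if age_days > 365:
            return "delete"
        if age_days > 30 and st.st_size > 100 * 2**20:
            return "migrate to sata_pool"       # large, cold files move to a cheaper tier
        return "keep"

In GPFS itself the equivalent rules are expressed declaratively and applied in bulk by the policy engine rather than evaluated file by file in application code.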

The GPFS policy processing engine is scalable and can be run on many nodes at once. This allows management policies to be applied to a single file system with billions of files and complete in a few hours.

Note: GPFS is used at Daresbury Laboratory for HPCx and related services. It is also used at POL, the Proudman Oceanographic Laboratory and other UK academic data centres.


Data Analysis

In-situ Analysis

To follow.


Visualisation

Visualisation typically requires access to large volumes of data. Rendering is carried out to view part of the data, e.g. to produce surface or volume images of multi-dimensional data. It is usual to want to scan through this data interactively, viewing successive parts, for instance rendering slices through a volume to explore the interior structure. Other activities are to map functions using artefacts such as streamlines or vectors (hedgehog plots). It is also likely that these will be composed together as a ``movie''.

The visualisation workflow is thus to extract a part of the data and then manipulate and render it. In terms of what is commonly referred to as the standard visualisation pipeline the steps are: (i) read and filter data; (ii) map; (iii) render; and (iv) display. This involves a combination of data extraction, computation and rendering which is greatly facilitated by an appropriate dataset format. For instance, slicing multi-dimensional meteorological data is facilitated by array structured formats such as GRIB. Software is often built from modules, each of which is responsible for one of the pipeline steps (e.g. parallel rendering) and may produce intermediate file output.

Many visualisation software packages exist which are intended to be used with data in one or more of the available scientific formats. Here are pointers to some lists of information about this software. An overview can be found on Wikipedia at http://en.wikipedia.org/wiki/Visualization_computer_graphics. Brief descriptions and pointers to software that can be used with netCDF is at http://www.unidata.ucar.edu/packages/netcdf/utilities.html. A page of links to many scientific visualization and graphics software packages is at http://static.msi.umn.edu/user_support/scivis/scivis-list.html.

A certain amount of filtering can be applied to reduce the volume of data required in this process. Visualisation does not always need 64-bit precision, so reduction to 32-bit is often possible. It may also be possible to extract a sub-set of the data to manipulate if the full volume is not required for rendering. This helps if the data needs to be moved to the visualisation system. An alternative is remote visualisation, with rendering carried out in situ; a simple sketch of such a reduction follows.
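As an illustration of this kind of reduction, the Python sketch below (using the netCDF4 library; the file name and variable name are hypothetical) extracts a single slice of a multi-dimensional field and downcasts it from 64-bit to 32-bit before it is moved or rendered.

    # Sketch of data reduction before visualisation: sub-setting plus precision reduction.
    # "output.nc" and the variable name "temperature" are hypothetical.
    import numpy as np
    from netCDF4 import Dataset

    ds = Dataset("output.nc", "r")
    temp = ds.variables["temperature"]      # e.g. dimensions (time, level, lat, lon)

    # Extract one time step and one vertical level rather than the full 4D volume.
    slab = temp[0, 0, :, :]

    # Visualisation rarely needs 64-bit precision, so halve the data volume moved.
    slab32 = np.asarray(slab, dtype=np.float32)

    np.save("slice_for_vis.npy", slab32)    # much smaller file to transfer or render
    ds.close()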

Server Side Graphics

The primary techniques for building visualisation servers are as follows.

Parallel Architecture:
uses multiple graphics cards for one application, sharing the load. These graphics accelerators could be attached to a single system, e.g. with large memory and multiple CPUs to render large data sets. A cluster with graphics accelerators can form a parallel rendering system.
Parallel Decomposition:
split the problem across multiple graphics cards in the application's data space or in display space.
Open Software Architecture:
addresses server side rendering at middleware, API, or application levels.

Software such as Chromium, ParaView, OpenSceneGraph, DMX, VisIt and SciRun can also do parallel rendering, so it is important to be able to manipulate the data inside a cluster to enable this, for instance re-ordering the data from the original application into an order suitable for the rendering package. Table 1 lists properties of some well known packages.

Table 1: Some Visualisation Packages
Package Name Comments/ File Formats Web site
Amira (licensed) *, Amira, Analyze, various images, netCDF, DXF, PS, HTML, Hoc, Icol, Interfile, Matlab, Nifti, Open Inventor, PSI, Ply, raw binary, SGI RGB, STL, SWC, Tecplot, VRML, Vevo http://www.amiravis.com
Avizo (licensed) * http://www.vsg3d.com
AVS/ Express (licensed) ** netCDF, ... http://www.avs.com
Blender 3D rendering and animation  
Chromium parallel rendering http://chromium.sourceforge.net
DMX multi-display X http://dmx.sourceforge.net/
EnSight netCDF, ... http://www.ensight.com
Environmental Workbench netCDF, ... http://www.ssesco.com/files/ewb.html
IDL netCDF, ... http://www.rsinc.com/idl/index.cfm
OpenSceneGraph   http://www.openscenegraph.org/projects/osg
OpenDX **, CDF, netCDF, HDF-4, HDF-5, DX, various images, mesh, ... http://www.opendx.org
ParaView netCDF, HDF-5, ParaView, EnSight, VTK, Exodus, XDMF, Plot3d, SpyPlot, DEM, VRML, Ply, PDB, XMol, Stereo Lithography, Gaussian Cube, AVS, Meta Image, Facet, various images, SAF, LS-Dyna, raw binary http://www.itk.org/Wiki/ParaView
PV-Wave (licensed) netCDF, ... http://www.vni.com/products/wave/index.html
SciRun **, Nimrod, ... http://www.sci.utah.edu
VisIt Analyze, Ansys, BOV, Boxlib, CGNS, Chombo, CTRL, Curve2D, Ensight, Enzo, Exodus, FITS, FLASH, Fluent, FVCOM, GGCM, GIS, H5Nimrod, H5Part, various images, ITAPS, MFIX, MM5, NASTRAN, Nek5000, netCDF, OpenFOAM, PATRAN, Plot3d, Point3D, PDB, SAMRAI, Silo, Spheral, STL, TecPlot, VASP, Vis5D, VTK, Wavefront OBJ, Xmdv, XDMF, ZeusMP (HDF-4) https://wci.llnl.gov/codes/visit

* Amira also has a number of ``packs'' available, such as Molecular Pack, Mesh Pack, VR Pack, etc. which support additional application specific file formats.

* Avizo is also packaged in several editions with extension modules, including Standard, Earth, Wind, Fire and Green.

** OpenDX, SciRun and AVS can be extended by adding special purpose file readers into the pipeline.

For more activities involving scientific visualisation in the UK, see the JISC funded VizNet project http://www.viznet.ac.uk.

Remote Visualisation

The primary techniques to provide easy, high-performance access from remote networked clients are as follows.

Transmitting images:
rather than graphical data, from servers to clients.
Compression:
compressing images on the server and decompressing them on the client to avoid the need for high network bandwidth. This generally increases performance but can make the CPU a bottleneck; a minimal sketch of the idea is given below.
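A minimal sketch of the compression trade-off, assuming rendered frames are available as numpy arrays and using lossless zlib for simplicity (real systems typically use lossy codecs such as JPEG):

    # Sketch of image compression for remote visualisation: the server compresses
    # each rendered frame before sending it; the client decompresses it for display.
    import zlib
    import numpy as np

    def server_side(frame, level=6):
        """Compress a rendered RGB frame (numpy array) into bytes for the network."""
        return zlib.compress(frame.tobytes(), level)

    def client_side(payload, shape, dtype=np.uint8):
        """Decompress the payload back into an image array for display."""
        return np.frombuffer(zlib.decompress(payload), dtype=dtype).reshape(shape)

    if __name__ == "__main__":
        frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # synthetic blank frame
        payload = server_side(frame)
        print(len(frame.tobytes()), "->", len(payload), "bytes")  # bandwidth saved
        restored = client_side(payload, frame.shape)
        assert np.array_equal(frame, restored)

Raising the compression level reduces the bytes sent but increases the CPU cost per frame, which is exactly the bottleneck noted above.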

Some of the packages noted above can operate in client-server mode; this permits remote visualisation so that the data does not have to be moved. Instead, a visualisation server is installed on a system with a good connection to the data store and a client allows the user to control the images.

Widely used packages which operate in this way include the following.

IBM DCV:
Deep Computing Visualisation. The DCV client can be freely installed and is used with a modified version of RealVNC to display 3D images from the server via a compressed OpenGL graphics stream. The DCV server can also be freely installed for academic use. For more information about IBM visualisation projects, see http://domino.research.ibm.com/comm/research.nsf/pages/r.graphics.html but note some of the links are broken. It is not clear what IBM intends to do with the DCV product in future.

ParaView:

SGI RemoteVUE:
Part of the VUE family of products which includes PowerVUE. No longer available after SGI merged with Rackable Systems in 2009.

SGI Express:
to follow

TurboVNC:
Virtual Network Computing with accelerated JPEG compression. See http://www.virtualgl.org/About/TurboVNC.
VirtualGL:
which interposes on un-modified Unix 3D applications, redirecting their 3D commands to a server side 3D graphics accelerator, reading back the rendered images and converting them into a video stream with which remote clients can interact to view and control the 3D application in real time. The video stream is normally compressed using server specific optimisations. See http://www.virtualgl.org/.

Statistical Analysis

To follow


Data Mining and Feature Recognition

Techniques similar to those used for visualisation also apply to data mining and feature recognition. Sub-sets of data can be used and the processes can be applied in situ, for example when searching simulation data for phase transitions, as explored on NW-GRID. Data mining is sometimes referred to as KDD, Knowledge Discovery in Databases.

Data mining is a process for extracting patterns from data. It is commonly used in a wide range of profiling practices, such as marketing, surveillance and fraud detection, and is also used in scientific discovery. In recent years, data mining has been widely used in areas of science and engineering including bio-informatics, genetics, medicine, education, electrical power engineering, plasma physics experiments and simulations, remote sensing imagery, video surveillance, climate simulations, astronomy, and fluid mix experiments and simulations. An introduction is provided in a paper by Ramakrishnan and Grama [14]. A more recent review is provided by Kamath [8].

Methodologies applied include Bayesian analysis, regression analysis and self organising feature maps (SOMs or Kohonen maps). In addition to some standardisation efforts, freely available open source software packages have been developed, including Orange, RapidMiner, Weka, KNIME, and the packages in the GNU R Project which include data mining processes. Most of these systems are able to import and export models in PMML (Predictive Model Markup Language), which provides a standard way to represent data mining models so that they can be shared between different statistical applications. PMML is an XML based language developed by the Data Mining Group (DMG), an independent group composed of many data mining companies. PMML version 4.0 was released in June 2009, see http://www.dmg.org.
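As an illustration of one of these methodologies, the following Python sketch trains a small Kohonen self organising map using only numpy; the grid size, learning rate and data are illustrative and no particular data mining package is assumed.

    # Minimal Kohonen self-organising map in numpy; all parameters are illustrative.
    import numpy as np

    def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
        rng = np.random.default_rng(seed)
        n_features = data.shape[1]
        weights = rng.random((grid[0], grid[1], n_features))
        # Coordinates of each map node, used for neighbourhood distances.
        coords = np.stack(np.meshgrid(np.arange(grid[0]), np.arange(grid[1]),
                                      indexing="ij"), axis=-1)
        n_steps = epochs * len(data)
        step = 0
        for epoch in range(epochs):
            for x in rng.permutation(data):
                lr = lr0 * (1 - step / n_steps)             # decaying learning rate
                sigma = sigma0 * (1 - step / n_steps) + 1e-3
                # Best matching unit: node whose weight vector is closest to x.
                d = np.linalg.norm(weights - x, axis=-1)
                bmu = np.unravel_index(np.argmin(d), grid)
                # Pull the BMU and its neighbours towards x.
                dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
                h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
                weights += lr * h * (x - weights)
                step += 1
        return weights

    # Example: cluster 500 random 5-dimensional feature vectors onto a 10x10 map.
    som = train_som(np.random.rand(500, 5))

After training, similar feature vectors map onto nearby nodes, which is what makes SOMs useful for exploratory clustering and feature recognition in large data sets.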

A project called Sapphire at LLNL is carrying out research to develop scalable algorithms for interactive exploration of large, complex, multi-dimensional scientific data sets. By applying and extending ideas from data mining, image and video processing, statistics, and pattern recognition, they are developing a new generation of computational tools and techniques that are being used to improve the way in which scientists extract useful information from data. See https://computation.llnl.gov/casc/sapphire/sapphire_home.html (last updated 2006).

A useful place to look at activities around statistical analysis and data mining is the community using the Analytic Bridge Web 2.0 tool http://www.analyticbridge.com.

Computational Steering

Computational steering is a valuable mechanism for scientific investigation in which the parameters of a running program can be altered and the results visualised interactively. As an investigative paradigm its history goes back more than ten years, though it has only relatively recently captured the scientific imagination. With the advent of Grid computing, the range of problems that can be tackled interactively has widened markedly. Applications specialists now see the prospect of accomplishing ``real'' science using computational steering whilst, at the same time, computing researchers are working to bring about the supporting infrastructure of middleware, toolkits and environments that will render this prospect concrete.

EPSRC funded a collaboration network to develop and promote computational steering techniques from 2006-2008, see http://compusteer.dcs.hull.ac.uk/index.php.

Computational steering typically requires remote visualisation, see above.
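A minimal sketch of the steering loop itself, assuming a simple file-based control channel (production systems such as those listed below use steering libraries and sockets instead): the simulation re-reads its parameters between iterations while the user's client writes updated values.

    # Sketch of a file-polled steering loop; the parameter file name and parameter
    # names are invented, and production systems use sockets or steering libraries.
    import json, os, time

    PARAM_FILE = "steer.json"              # written by the (remote) steering client

    def read_steering(params):
        """Merge any steered values into the current parameter set."""
        if os.path.exists(PARAM_FILE):
            with open(PARAM_FILE) as f:
                params.update(json.load(f))
        return params

    def simulate():
        params = {"dt": 0.01, "coupling": 1.0}
        state = 0.0
        for step in range(1000):
            params = read_steering(params)     # pick up changes made by the user
            state += params["coupling"] * params["dt"]   # stand-in for the real model
            if step % 100 == 0:
                # Emit data the remote visualisation client can render.
                print(step, state)
            time.sleep(0.01)

    if __name__ == "__main__":
        simulate()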

Some packages available include the following.

RealityGrid:
http://www.realitygrid.org;
gViz:
http://www.comp.leeds.ac.uk/vvr/gviz;
CSE:
Computational Steering Environment http://www.cwi.nl/projects/cse/cse.html;
Magellan:
;
Cumulvs:
http://www.csm.ornl.gov/cs/cumulvs.html
SciRun:
http://software.sci.utah.edu/scirun.html.

Other relevant systems include VISIT from Juelich (http://www.jstor.org/pss/30039695), COVISE from HLRS, and GRASPARC (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.2629).

Technologies and Tools

In this section we illustrate some use cases with collections of technologies and tools in use in typical data centric research environments.

SciDAC Data Management Center

See [21].

HPCx and NW-GRID

Note: HECToR is using Lustre.

CERN and ADS

CERN currently uses CASTOR-2 to manage storage of LHC experimental data. The system design goals are to provide reliable central recording of the data, plus transparent access. Much of this access is intended to be from remote sites around the world via data Grid tools. CASTOR itself is designed around a mass storage (tape) system and is therefore not suitable for deployment at sites without this facility.

As CASTOR-2 is such a large installation it only exists at the largest sites in EGEE. These are CERN, plus four of the Tier-1 centres, including RAL. The instance at CERN currently manages more than 50 million files and 5 PB of data.

CASTOR-2 has been designed around a central Oracle RDBMS system, which handles, as much as possible, the state of the system. The resilience of this database is key to CASTOR's reliability. Around this database core all daemons are designed to be stateless, allowing for redundancy and daemon restarts without loss of service.

The stager is another key component. This manages the disk pools in front of the tape system, see Figure 2.

Figure 2: Architecture of CASTOR-2

In order for the stager to manage access to files on the disk pool it uses a scheduler plugin. This balances the load on each of the disk servers and can also provide policy and ``fair shares'' considerations to file access. The scheduling problem for disk pools is in fact rather similar to that faced in compute clusters and CASTOR currently uses the commercial LSF batch system scheduler.

CASTOR has the ability to dynamically replicate ``hot'' files and even to switch access on an open file to a less busy replica.

Monitoring is carried out by integration with the LHC Era Monitoring system at CERN, with alarms being issued when any abnormal conditions are detected. Logging is done into the Oracle database, allowing the central gathering of information, plus the ability to cross-query logs from different services.

Bibliography

1
R.J. Allan Virtual Research Environments: from Portals to Science Gateways (Chandos Publishing, Oxford, 2009) 230pp in press http://www.woodheadpublishing.com/en/book.aspx?bookID=1892&ChandosTitle=1

2
R.J. Allan and K. Kleese Data Management 2000 Proc. Intl. Workshop on Advanced Data Storage and Management for HPC. DL-CONF-00-001 (May 2000)

3
C. Angeli, G.L. Bendazzoli, S. Borini, R. Cimiraglia, A. Emerson, S. Evangelisti, D. Maynau, A. Monari, E. Rossi, J. Sanchez-Marin, P.G. Szalay and A. Tajti. The problem of inter-operability: a common data format for quantum chemistry codes COST D23 MetaChem working group draft http://abigrid.cineca.it/the-docs-archive/pubblications/Papero1.8_finalDraft.pdf

4
J.V. Ashby, C. Greenough and R.J. Allan Data Management Tools for High Performance Applications Technical Report (UKHEC, 2001) RAL-TR-2001-013 http://epubs.cclrc.ac.uk/work-details?w=29541

5
J. Cope, M. Oberg, H.M. Tufo and M. Woitaszek Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments (University of Colorado)

6
Y. Gu, R.L. Grossman and J. Mambretti A Peer-to-Peer Infrastructure for Distributing Large Scientific Data Sets over Wide Area High-Performance Networks: Experimental Studies Using Wide Area Layer 2 Services http://www.rgrossman.com/dl/proc-106.pdf

7
M. Grove Necho - A System for Distributing and Managing Very Large Datasets University of Reading (2007) http://acet.reading.ac.uk/projects/necho/index.php

8
C. Kamath Scientific Data Mining: a Practical Perspective SIAM (2009) 286pp ISBN 978-0898716757

9
I. Kozin and M. Deegan Evaluation of the eXludus Grid Optimiser (DL, Nov'2007) http://www.cse.scitech.ac.uk/disco/publications/eXludus_GridOptimizer.pdf

10
G. Mallinson CFD Visualisation: Challenges of Complex 3D and 4D Data Fields Int. J. Com. Fluid Dynamics 22:1-2 (Jan'2008) 49-59

11
B. Mann, R. Williams, M. Atkinson, K. Brodlie, A. Storkey and C. Williams Scientific Data Mining, Integration, and Visualization Report of the workshop held at the e-Science Institute, Edinburgh, 24-25/10/2002. http://umbriel.dcs.gla.ac.uk/NeSC/general/talks/sdmiv/report.pdf

12
C. Mountford EXludus Evaluation for the ETF (University of York, 2007) Technical Report UKeS-2007-04 http://www.wrgrid.org.uk/eXludus.pdf

13
A. Osuna, G. Miller, B. Poston and J. Auvenshine Introducing the IBM Grid Access Manager (IBM Redbooks, 2008, SG24-7612-00) 90pp ISBN 0738485012 http://www.redbooks.ibm.com/abstracts/sg247612.html?Open

14
N. Ramakrishnan and A.Y. Grama Mining Scientific Data Adv. Computers 55 (2001) 119-69

15
S.C. Simms, G.G. Pike and D. Balog Wide Area Filesystem Performance using Lustre on the TeraGrid Proc. TeraGrid Conference (2007) http://datacapacitor.researchtechnologies.uits.iu.edu/lustre_wan_tg07.pdf

16
R.R. Sinha, A. Termehchy, M. Winslett, S. Mitra and J. Norris Maitri: A Format-Independent Framework for Managing Large Scale Scientific Data http://dais.cs.uiuc.edu/~termehch/cidr-2007.pdf (2007)

17
G.A. Stewart, D. Cameron, G.A. Cowan and G. McCance Storage Management in EGEE Proc. 5th Australasian symposium on ACSW frontiers 68 (2007) 69-77 http://epp.ph.unimelb.edu.au/twiki/pub/EPP/WebHome/storage_and_dm.pdf

18
M. Valle Scientific Data Management - an Introduction CSCS (2008) http://personal.cscs.ch/~mvalle/sdm/scientific-data-management.html

19
J. Vetter and K. Schwan Techniques for High Performance Computational Steering IEEE Concurrency 7:4 (1999) 63-74 doi:10.1109/4434.806980

20
S.A. Weil, S.A. Brandt, E.L. Miller and C. Maltzahn CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data (UC Santa Cruz, 2006)

21
SciDAC Scientific Data Management Center. https://sdm.lbl.gov/sdmcenter

22
Scientific Data Format FAQ http://www.cv.nrao.edu/fits/traffic/scidataformats/faq.html

23
NetCDF FAQ http://www.unidata.ucar.edu/software/netcdf/docs/faq.html

24
Digital Curation Centre Glossary http://www.dcc.ac.uk/resource/glossary

25
Digital Preservation Coalition Definitions http://www.dpconline.org/graphics/intro/definitions.html
