© STFC 2009-10. Neither the STFC Computational Science and Engineering Department nor its collaborators accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.
Managing scientific data has been identified as one of the most important emerging needs of the scientific community because of the sheer volume and increasing complexity of data being collected. This is particularly true in the growing field of computational science where increases in computer performance permit ever more realistic simulations and the potential to explore large parameter spaces. Effectively generating, managing and analysing the data and resulting information requires a comprehensive, end-to-end approach that encompasses all of the stages from the initial data acquisition to the final analysis of the data. This is sometimes referred to as Information Lifecycle Management or ILM.
A SciDAC project  has identified three significant requirements based on community input. Firstly, access to more efficient to storage systems - in particular, parallel file system improvements are needed to write and read large volumes of data without slowing a simulation, analysis, or visualisation engine. Secondly, scientists require technologies to facilitate better understanding of their data, in particular the ability to perform complex data analysis and searches over large data sets in an effective way - specialised feature discovery, parallel statistical analysis and efficient indexing are needed before the data can be understood or visualised. Finally, generating the data, collecting and storing the results, data post-processing, and analysis of results is a tedious, fragmented process - workflow tools are required for automation of this process in a robust, tractable and recoverable fashion to enhance scientific exploration adding provenance and other metadata in the process.
In this report we review the requirements and tools for managing large data sets of interest to the UK science community involved in computational simulation and modelling. This extends work in a previous report  prepared for the UK High End Computing project http://www.ukhec.ac.uk.
Scientists often limit the meaning of ``data management'' to the mere physical data storage and access layer and movement of data from one location to another. The scope of scientific data management is however much boarder, including both meaning and content.
The cutting edge of computational science involves very large simulations taking many hours or days on the latest high performance computers. For business data, projects implement enterprise wide data architectures, with data warehouse and data mining to extract information from they data. Can something similar be done for scientific data?
Some problems identified from current scientific projects include the following.
The increasing size of scientific data collections brings not only problems, but also opportunities. One of the biggest is the possibility to re-use existing data for new studies. This was one of the great hopes of the e-Science programme. Many projects investigated data curation, provenance and metadata definitions based on common ontologies. At STFC it was a goal of the Facilities e-Infrastructure Programme with a focus on SRB, iCAT and DataPortal for SRD, Diamond and ISIS. This is now being taken up in Australia .
Scientific data is composed not only of bytes, but also of workflow definitions, computation parameters, environment setup and provenance. Capturing and using all this associated information is a goal of, among others, the JISC funded myExperiment project which focusses on workflows encapsulated in ``research objects''.
Other aspects of data management, particularly virtualisation of the underlying storage, has been tackled in projects such as SRB, the Storage Resource Broker from SDSC.
Much of the work mentioned above is aimed a creating catalogues such as iCAT which reference large data collections. These may be the outputs of a facility or long term research programme, will have consumed extensive funding and are deemed to be of lasting and sometimes national importance.
We will not attempt to fully describe or re-produce this e-Science work here, the background to which is discussed in a report . We also do not consider issues of data curation for which appropriate standards and processes have been developed and documented by the Digital Curation Centre http://www.dcc.ac.uk. Instead, we will focus on ways to manipulate and analyse large scientific data sets. Some examples are as follows.
The LHC data analysis represents an extreme case as it will generate upwards of 14PB of data a year, which has to be distributed across the EGEE, OSG and NorduGrid Grids for analysis. Such data volumes cannot be handled easily with current production networks, so have required the provisioning of optical private networks (an unusual form of SAN) linking CERN's Tier-0 centre to the key Tier-1 computing centres around the world, the one for the UK being situated at RAL. Dedicated 10Gb network links are provided in this way for data movement. We do not consider the middleware aspects of such data grids in this report, but focus more on the requirements of a high end data centre focusing on computational simulation and modelling.
The following figure illustrates the architecture implemented in the SciDAC Data Management Center .
This organises activities in three layers that abstract the end-to-end data flow. The layers are Storage Efficient Access (SEA), Data Mining and Analytics (DMA), and Scientific Process Automation (SPA). The SEA layer is immediately on top of hardware, operating systems, file systems and mass storage systems and provides parallel data access technology and transparent access to archival storage. The DMA layer, which builds on the functionality of the SEA layer, consists of indexing, feature selection and parallel statistical analysis technology. The SPA layer, which is on top of the DMA layer, provides the ability to compose scientific workflows from the components in the DMA layer as well as application specific modules. This architecture has been used in the centre to organise its components and apply them to various scientific applications.
Important components in any data management system have been identified as follows.
Not surprisingly, many data format standards are concerned with multi-dimensional data or image processing. In some cases these are domain specific, but others are more generic.
Data management issues and processes are mostly independent of the data format. Nevertheless some formats are more suited to support metadata and more helpful in reaching the various data management goals. The format can also have a significant effect on i/o performance and the ability to search across and pull out sub-sets of data. An example of this is slicing through 3D data. For system independence, e.g. transferring data between big endian and little endian systems, formats such as XDR (eXternal Data Representation) can be used, although translation may be slow.
There have been several attempts to use XML to describe binary data formats, but these have been largely un-successful. Initiatives such as DFDL, Data Format Description Language and BINX attempt to do this. Traditional formats like HDF-5, NetCDF and CGNS are widely used and newer formats using XML may be more suited for storing metadata. It is also possible to use relational or object databases for certain types of data. We note that usually the metadata is stored separately from the actual data, for instance in a catalogue which contains location references or URLs. There are issues of transaction management and consistency when data or metadata are distributed or replicated.
We do not address low level file systems such as FAT, EXT3, HPFS, etc. in this document, although we note that some performance improvements can be associated with an appropriate choice. This is particularly true of newer ones which are being developed to address scalability to large networked file stores, such as Oracle's BTRFS, B-Tree File System (fault tolerant) introduced in 2007, see http://btrfs.wiki.kernel.org and CRFS, the Coherent Remote File System. ZFS from Sun offers some similiar capabilities and is available on the newer NW-GRID clusters. This includes volume management funtions, scalability, snapshots and copy-on-write clones plus built in integrity checking and repair, RAID and NFS-4 support. See Wikipedia for more information, http://en.wikipedia.org/wiki/List_of_file_systems.
The following list describes a number of widely used data set formats. Section 2.2 describes storage and distributed high performance file systems which may be of interest.
The CGNS system is designed to facilitate the exchange of data between sites and applications and to help stabilise the archiving of aerodynamic data. The data are stored in a compact, binary format and are accessible through a comprehensive and extensible library of functions. The API is platform independent and can be easily implemented in C, C++ and Fortran.
A major feature of the FITS format is that image metadata is stored in a human readable ASCII header, so that an interested user can examine the headers to investigate a file of unknown provenance.
FITS is also often used to store non-image data, such as spectra, photon lists, data cubes, or even structured data such as multi-table databases. A FITS file may contain several extensions, and each of these may contain a data object. For example, it is possible to store X-ray and infrared exposures in the same file.
HDF was originally from NCSA and is now supported by the HDF Group, see http://www.hdfgroup.org. We note that HDF-4 still exists, but is significantly different both in design and API.
Note: Q5cost is an HDF-5 based format which has a Fortran API developed in an EU COST D23 project for computational chemistry , see http://abigrid.cineca.it/abigrid/the-docs-archive/q5cost/.
The project is actively supported. The recently released (2008) version 4.0 greatly enhances the data model by allowing the use of the HDF-5 data file format. See http://www.unidata.ucar.edu/software/netcdf.
Like HDF-5, NeXuS is a hierarchical format with a directory style structure. Some metadata in NeXuS files is encoded as XML with standard tag names making them easy to interpret. NeXuS is used in iCAT.
There are many other formats, some proprietary or application specific, see Section 3.2.
Distributed File Systems are sometimes called Distributed Datastore Networks - see Wikipedia. In this technical report we consider only those which work on a wide area network and are thefore suitable for Campus or inter-site computing and data management. Normally many implementations have been made, they are location dependent and they have access control lists (ACLs), unless otherwise stated below.
We separate the rest of this section into server centric storage systems and those supporting distributed file servers. Most systems do however have some dependency on one or more central services, such as metadata services, database or catalogues, which are noted. We assume that all systems reviewed can access distributed storage or provide storage to distributed clients in some way.
Keywords include: data migration; hierarchical storage management; information lifecycle management; storage area network; tiered storage.
Data migration is the transferring of data between storage types, formats, or computer systems. Data migration is usually performed programmatically to achieve an automated migration, freeing up human resources from tedious tasks. It is required when organisations or individuals change computer systems or upgrade to new systems. Migration is a key issue in data curation, see the Digital Curation Centre http://www.dcc.ac.uk. Migration is thus a means of overcoming technological obsolescence by transferring digital resources from one hardware or software generation to the next. The purpose of migration is to preserve the intellectual content of digital objects and to retain the ability for clients to retrieve, display and otherwise use them in the face of constantly changing technology. Migration differs from the refreshing of storage media in that it is not always possible to make an exact digital copy or replicate original features and appearance and still maintain the compatibility of the resource with the new generation of technology.
Information Lifecycle Management refers to a wide ranging set of strategies for administering storage systems on computing devices. Specifically, four categories of storage strategies may be considered under the auspices of ILM. These concern: policies including SLAs around data management; operational aspects including backup and data protection; logical and physical infrastructure; and definition of how the strategies are applied.
A Storage Area Network (SAN) is a high speed network designed to attach computer storage devices such as disk array controllers and tape libraries to servers. As of 2006, SANs are most commonly found in enterprise (e.g. campus) storage.
A SAN allows a machine to connect to remote targets such as disks and tape drives on a network usually for block level I/O. From the point of view of the class drivers and application software, the devices appear as locally attached devices.
There are two variations of SANs:
Tiered storage is a data storage environment consisting of two or more kinds of storage delineated by differences in at least one of these four attributes: Price; Performance; Capacity; Function. Any significant difference in one or more of the four defining attributes can be sufficient to justify a separate storage tier.
Hierarchical Storage Management is related to tiered storage. It is a data storage technique that automatically moves data between high cost and low cost storage media. HSM systems exist because high speed storage devices, such as hard disk drives, are typically more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. While it would be ideal to have all data available on high speed devices all the time, this is prohibitively expensive for many installations. HSM systems instead store the bulk of the organisation's data on slower devices and copy data to faster disk drives only when needed. In effect, HSM turns the fast disk drives into caches for the slower mass storage. The HSM system monitors the way data is used and makes best guesses as to which data can safely be relegated to slower devices and which data should stay on the hard disks.
HSM implements policy based management of file backup and archiving in a way that uses storage devices economically and without the user needing to be aware of when files are being retrieved from backup storage media. Although HSM can be implemented on a standalone system, it is more frequently used in the distributed network of an enterprise. The hierarchy represents different types of storage media, such as redundant array of independent disks systems, optical storage, or tape, each type representing a different level of cost and speed of retrieval when access is needed. For example, as a file ages in an archive, it can be automatically moved to a slower but less expensive form of storage. Using an HSM product, an administrator can establish and state guidelines for how often different kinds of files are to be copied to a backup storage device. Once the guideline has been set up, the HSM software manages everything automatically.
HSM adds to archiving and file protection for disaster recovery the capability to manage storage devices efficiently, especially in large scale user environments where storage costs can mount rapidly. It also enables the automation of backup, archiving, and migration to the hierarchy of storage devices in a way that frees users from having to be aware of the storage policies. Older files can automatically be moved to less expensive storage. If needed, they appear to be immediately accessible and can be restored transparently from the backup storage medium. The apparently available files are known as stubs and point to the real location of the file in backup storage. The process of moving files from one storage medium to another is known as migration.
An administrator can set high and low thresholds for hard disk capacity that HSM software will use to decide when to migrate older or less-frequently used files to another medium. Certain file types, such as executable files (programs), can be excluded from those to be migrated.
Assuming a file based storage system, efficient file migration services are at the heart of HSM. It is also relevant when moving data from one vendor's product to another, but see under Data Migration above.
File migration thus arises from an information ILM strategy that relegates data to less expensive devices as the data decreases in value to the enterprise. Migration may also be driven by a need to simplify or standardise environments, to improve storage space utilisation, to balance workloads between filers, or to consolidate storage management.
File migration services are present in many commercial products, e.g. from CommVault, HP, LSI and Symantec.
This section lists some traditional storage systems aimed principally at backup and restore, but now increasingly including business logic. These are typically aimed at managing (e.g. by indexing) many small files to produce a searchable archive store or collection.
Simpana software indexes manages data across all tiers of enterprise storage (including laptops), online, archive and backup, into a single virtual pool. Global de-duplication is implemented. This is aimed at data and legal queries which be issued from a ``self service'' search engine like interface.
StoreAge MultiMigrate product, now from LSI, see http://www.lsi.com/storage_home/products_home/storage_virtualization_data_services/storeage_multimigrate. This enables the online migration of data from any storage device to any other storage device, regardless of vendor. The migration takes place while the applications remain on-line without any interruption. Aimed at migrating critical applications from older storage devices onto newer platforms.
High performance computing environments require parallel file systems. Traditional server based file systems such as those exported via NFS are unable to efficiently scale to support hundreds of nodes or utilise multiple servers. Parallel file systems are typically deployed for dedicated high performance storage solutions within clusters, usually as part of a vendor’s integrated cluster solution. These parallel file systems are often tightly integrated with a single cluster’s hardware and software environment making sharing them impractical. Recently, several parallel filesystems have been introduced that are designed to make sharing a filesystem between clusters feasible in the presence of hardware and software heterogeneity. These are the ones discussed in this section.
Here we review distributed file systems, which are also possibly also parallel and fault tolerant, stripe and replicate data over multiple servers for high performance and to maintain data integrity. Even if a server fails no data is lost. The file systems are used in both HPC and high availability clusters.
All file systems listed here focus on high availability, scalability and high performance unless otherwise stated below. Whilst these provide distinct advantages over more traditional file systems they may be more or less complicated to install, configure and manage and may require specific Linux kernel patches.
Ceph storage consists of a potentially large number of data object servers (bricks), a smaller set of metadata server daemons, and a few monitor daemons for managing cluster membership and state. Metadata daemons use the Paxos consensus protocol, see http://en.wikipedia.org/wiki/Paxos_algorithm. The storage daemons rely on the new Linux BTRFS object store. This makes the storage cluster simple to deploy, while providing scalability not currently available from block based cluster file systems. V0.20 is available for Linux under LGPL from SourceForge, see http://ceph.newdream.net.
FraunhoferFS is written from scratch and incorporates results from previous experience. It is a fully POSIX compliant, scalable file system including feature as follows.
GAM software enables formation of fixed content storage systems that can scale to petabytes of data across numerous sites. The system's efficient wide area replication is essential to delivering a storage system spanning sites linked together with differing bandwidth networks, even in a straightforward single site and disaster recovery configuration. Furthermore, GAM software optimises broad availability of data across sites through network file system interfaces (CIFS and NFS) and as an object store delivering a global namespace.
GlusterFS has a client and server component. Servers are typically deployed as storage bricks, with each server running a glusterfsd daemon to export a local file system as a volume. The glusterfs client process, which connects to servers with a custom protocol over TCP/IP, InfiniBand or SDP, composes remote volumes into larger ones using stackable translators. The final volume is then mounted by the client host through the FUSE mechanism. Applications doing large amounts of file i/o can also use the libglusterfs client library to connect to the servers directly and run in-process translators, without going through the file system and incurring FUSE overhead.
Most of the functionality of GlusterFS is implemented as translators, including: file based mirroring and replication; file based striping; file based load balancing; volume failover; scheduling and disk caching; storage quotas.
The GlusterFS server is kept minimally simple - it exports an existing file system as-is, leaving it up to client side translators to structure the store. The clients themselves are stateless, do not communicate with each other, and are expected to have translator configurations consistent with each other. This can cause coherency problems, but allows GlusterFS to scale up to several petabytes on commodity hardware by avoiding bottlenecks that normally affect more tightly coupled distributed file systems.
There seems to be good community support for Gluster, although most users seem to be from Web hosting companies, see http://www.gluster.org and it is being considered for use by Streamline Computing.
GPFS provides concurrent high speed file access to applications executing on multiple nodes of clusters. It can be used with AIX, Linux, Microsoft Windows Server 2003-r2 or a heterogeneous cluster of AIX, Linux and Windows nodes. In addition to providing filesystem storage capabilities, GPFS provides tools for management and administration of the GPFS cluster and allows for shared access to file systems from remote GPFS clusters.
The filesystem is thus built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS. They also serve the data over HTTP, allowing access to all content from a Web browser or other client. Data nodes can talk to each other to re-balance data, to move copies around, and to keep the replication of data high. Hadoop is based on Google's GFS and uses P2P methods. It can be used for storage scavaging. Other projects are now building on Hadoop, such as Yale University's HadoopDB which also used MapReduce and is aimed at petabyte databases. This may be very valuable for data mining applications, see Section .
A filesystem requires one unique server, the name node. This is therefore a single point of failure for an HDFS installation. If the name node goes down, the filesystem is effectively off line. When it comes back up, the name node must replay all outstanding operations. Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system is an action that often needs to be performed before and after executing a job, so this can be inconvenient. A filesystem in user space has been developed to address this problem, at least for Linux and some other Unix systems.
It is speculated in 2010 that Sun Grid Engine v6.2-5 will include support for Hadoop. Thus SGE, which is aware of HDFS, would be able to route processing jobs to where the data is already located in the nodes, which speeds up execution of those jobs. This is more efficient than starting up a job somewhere and then trying to move the data over to that node. It can also be of use for chaining multi-stage jobs which use large data sets.
It was speculated in 2009 the Condor would take a similar approach from v7.4 onwards, how this has not yet been seen.
Hadoop is now available with support for running on Amazon EC2 or S3 clouds, although its performance has been questioned in this context.
A number of applications are becoming available built on Hadoop. Hive is a data warehouse infrastructure with its own query language (called Hive QL). Hive makes Hadoop more familiar to those with a Structured Query Language (SQL) background, but it also supports the traditional MapReduce infrastructure for data processing. HBase is a high performance database system similar to Google BigTable. Instead of traditional file processing, HBase makes database tables the input and output form for MapReduce processing. Finally, Pig is a platform on Hadoop for analysing large data sets. Pig provides a high level language that compiles to map and reduce applications.
Lustre is now designed, developed and maintained by Sun Microsystems, Inc. with input from many other individuals and companies after its acquisition of Cluster File Systems, Inc. in 2007, with the intention of bringing the benefits of Lustre technologies to Sun's ZFS file system and the Solaris operating system. See http://www.sun.com/software/products/lustre.
Lustre has failover, but multi-server RAID-1 or RAID-5 is still in their roadmap for future versions. Available for Linux under GPL. Largest current Lustre implementation is on the TeraGrid Data Capacitor .
As it provides local file system semantics, it can be used with any application. Cluster aware applications can make use of parallel i/o for higher performance. Applications, not able to benefit from parallel i/o, can take advantage of the file system to provide a fail-over setup to increase its availability.
Apart from being used with Oracle's Real Application Cluster (RAC) database product, OCFS-2 is currently in use to provide scalable Web servers and file servers as well as fail over mail servers and for hosting virtual machine images.
Some of the notable features of the file system are:
Variable block sizes
Flexible allocations (extents, sparse, unwritten extents with the ability to punch holes)
Journaling (ordered and write-back data journaling modes)
Endian and architecture neutral (x86, x86-64, ia64 and ppc64)
Built in Clusterstack with a distributed lock manager
Support for buffered, direct, asynchronous, splice and memory mapped i/o
Comprehensive tools support
PanFS is used at LANL for RoadRunner and on the NW-GRID clusters and some other systems at Daresbury Laboratory. See http://www.panasas.com.
The developers of SRB have now developed iRODS, another data Grid software system with the important addition of a distributed rules engine. Asers can execute rules and micro-services to automate the enforcement of management policies to control data access, manipulation operations at distributed sites, etc. The use of rules provides iRODS with a flexibility that would have to be hard coded using SRB.
We have chosen to illustrate in more detail the architectures of four of the distributed file systems: AFS, iRODS, Lustre, and GPFS. All are suitable for use in large data centres or experimental facilities. In future Ceph and GFS may be evaluated too.
AFS has several benefits over traditional networked file systems, particularly in the areas of security and scalability. It is not uncommon for enterprise AFS cells to exceed 25,000 clients. AFS uses Kerberos for authentication and implements access control lists on directories for users and groups. Each client caches files on the local filesystem for increased speed on subsequent requests for the same file. This also allows limited filesystem access in the event of a server crash or a network outage.
Read and write operations on an open file are directed only to the locally cached copy. When a modified file is closed, the changed portions are copied back to the file server. Cache consistency is maintained by a callback mechanism. Clients are informed by the server if the file is changed elsewhere. Callbacks are discarded and must be re-established after any client, server or network failure, including a time-out. Re-establishing a callback involves a status check and does not require re-reading the file itself.
A consequence of the file locking strategy is that AFS does not support large shared databases or record updating within files shared between client systems. This was a deliberate design decision based on the perceived needs of the university computing environment.
A significant feature of AFS is the ``volume'', a tree of files, sub-directories and AFS mount points (links to other AFS volumes). Volumes are created by administrators and linked at a specific named path in an AFS cell. Once created, users of the filesystem may create directories and files as usual without concern for the physical location of the volume. A volume may have a quota assigned to it in order to limit the amount of space consumed. As needed, AFS administrators can move that volume to another server and disk location without the need to notify users; indeed the operation can occur while files in that volume are being used.
AFS volumes can be replicated to read only cloned copies. When accessing files in a read only volume, a client system will retrieve data from a particular read only copy. If at some point that copy becomes unavailable, clients will look for any of the remaining copies. Again, users of that data are unaware of the location of the read only copy; administrators can create and relocate such copies as needed. The AFS command suite guarantees that all read only volumes contain exact copies of the original read write volume at the time the read only copy was created.
The file name space on an Andrew workstation is partitioned into a shared and local name space. The shared name space (usually mounted as /afs on the Unix filesystem) is identical on all workstations. The local name space is unique to each workstation. It only contains temporary files needed for workstation initialisation and symbolic links to files in the shared name space.
AFS heavily influenced SUN's NFS-4. Additionally, a variant of AFS, the Distributed File System (DFS) was adopted by the Open Software Foundation in 1989 as part of their Distributed Computing Environment.
There are currently three major implementations from Transarc (IBM), OpenAFS and Arla, although the Transarc software is losing support and is deprecated. AFS-2 is also the predecessor of the Coda file system.
A fourth implementation exists in the Linux kernel source code since at least version 2.6.10. This was committed by Red Hat, but is a fairly simple implementation in its early stages of development and therefore still incomplete.
Note: AFS is in use at University of Manchester and clients can be provided on the National Grid Service clusters if projects require them.
iRODS stands for integrated Rule Oriented Data Systems. It is a second generation data grid system providing a unified view and seamless access to distributed digital objects across a wide area network. It is an evolution ofthe first generation data grid system Storage Resource Broker (SRB) which provided a unified view based on logical naming concepts - users, resources, data objects and virtual directories were abstracted by logical names and mapped onto physical entities - providing a physical-to-logical independence for client level applications. iRODS builds upon this by abstracting the data management process itself - this is referred to as policy abstraction.
iRODS v1.0 provides user friendly installation tools, a modular environment for extensibility through micro-services, a robust Web based interface and support for Java, C and shell programming through libraries and utilities for application development.
The ``integrated'' part of iRODS comes from the fact that it provides a unified software envelope for interactions with a host of underlying services which interact in a complex fashion among themselves. This idea is different to a toolkit methodology where one is provided with a suite of modules (tools) which can be integrated by the user or application to form a customised system. The integrated envelope exposes a uniform interface to the client application hiding the complexity of dealing with details. SRB also had an integrated envelope methodology with a single server installation hiding the details about third party authentication, authorization, auditing, metadata managment, streaming access mechanism, resource (vendor level) and other information.
iRODS currently has around 100 API functions and 80 command level utilities. These build upon the integrated envelope adding more functionality and services. Functionalities include the following.
For an evaluation of iRODS (the JISC funded iREAD project) see http://www.wrg.york.ac.uk/iread.
Note: SRB has been used in the NERC funded e-Minerals e-Science project, the JISC funded Cheshire-3 VRE project and on the Diamond synchrotron facility. iRODS is under evaluation at STFC.
A Lustre file system has three major functional units as follows.
The MDT, OST, and client can be on the same node or on different nodes. In typical installations these functions are on separate nodes with two to four OSTs per OSS node communicating over a network. Lustre supports several network types, including Infiniband, TCP/IP on Ethernet, Myrinet, Quadrics and other proprietary technologies. Lustre can take advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.
The storage attached to the servers is partitioned, optionally organised with logical volume management (LVM) and/ or RAID and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.
An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read and write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use a modified version of ext3 to store data. In the future, Sun's ZFS or DMU will also be used to store data.
When a client accesses a file, it completes a filename lookup on the MDS. As a result, a file is created on behalf of the client or the layout of an existing file is returned to the client. For read or write operations, the client then passes the layout to a logical object volume (LOV), which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
Clients do not directly modify the objects on the OST filesystems, but, instead, delegate this task to OSSes. This approach ensures scalability for large scale clusters and supercomputers, as well as improved security and reliability.
In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified POSIX like filesystem even though it may be composed of tens to thousands of individual servers and MDT or OST filesystems.
On some MPP installations, computational processors can access a Lustre file system by redirecting their i/o requests to a dedicated node configured as a Lustre client. This approach was for instance used in the LLNL BlueGene installation.
Another approach uses the liblustre library to provide userspace applications with direct filesystem access. Liblustre is a user level library that allows computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors can access a Lustre file system even if the service node on which the job was launched is not a Lustre client. Liblustre allows data movement directly between application space and the Lustre OSSes without requiring an intervening data copy through the kernel, thus providing low latency, high bandwidth access from computational processors to the Lustre file system directly. Good performance characteristics and scalability make this approach the most suitable for using Lustre with MPP systems. Liblustre is the most significant design difference between Lustre implementations on MPPs such as Cray XT3 and Lustre implementations on conventional clustered computational systems.
High availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version inter-operability between successive minor versions of the software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, merely experiencing a delay.
Note: Lustre is being deployed for the Diamond synchrotron facility.
GPFS is an IBM proprietary filesystem which provides high performance by allowing data to be accessed over multiple computers at once. Most existing file systems are designed for a single server environment and adding more file servers does not improve performance. GPFS provides higher input and output performance by ``striping'' blocks of data from individual files over multiple disks, and reading and writing these blocks in parallel. Other features provided by GPFS include high availability, support for heterogeneous clusters, disaster recovery, security, DMAPI, HSM and ILM. See http://www.ibm.com.
A file that is written to GPFS is broken up into blocks of a configured size, less than 1MB each. These blocks are distributed across multiple nodes, so that a single file is fully distributed across the disk array. This results in high read and write speeds as the combined bandwidth of the many physical drives is high. This however makes the filesystem vulnerable to disk failures. To prevent data loss, the filesystem nodes therefore have RAID controllers. It is also possible to replicate blocks on different filesystem nodes.
Other features of the filesystem include the following.
For ILM, storage pools allow for the grouping of disks within a file system. Tiers of storage can be created by grouping disks based on performance, locality or reliability characteristics. For example, one pool could be high performance fibre channel disks and another more economical SATA storage.
A fileset is a sub-tree of the file system namespace and provides a way to partition the namespace into smaller, more manageable units. Filesets provide an administrative boundary that can be used to set quotas and be specified in a policy to control initial data placement or data migration. Data in a single fileset can reside in one or more storage pools. Where the file data resides and how it is migrated is based on a set of rules in a user defined policy.
There are two types of user defined policies in GPFS: file placement and file management. File placement policies direct file data as files are created to the appropriate storage pool. File placement rules are determined by attributes such as file name, the user name or the fileset. File management policies allow the file's data to be moved or replicated or files deleted. File management policies can be used to move data from one pool to another without changing the file's location in the directory structure. File management policies are determined by file attributes such as last access time, path name or size of the file.
The GPFS policy processing engine is scalable and can be run on many nodes at once. This allows management policies to be applied to a single file system with billions of files and complete in a few hours.
Note: GPFS is used at Daresbury Laboratory for HPCx and related services. It is also used at POL, the Proudman Oceanographic Laboratory and other UK academic data centres.
Visualisation typically requires access to large volumes of data. Rendering is carried out to view part of the data, e.g. to produce surface or volume images of multi-dimensional data. It is usual to want to scan through this data interactively, viewing successive parts, for instance rendering slices through a volume to explore the interior structure. Other activities are to map functions using artefacts such as streamlines or vectors (hedgehog plots). It is also likely that these will be composed together as a ``movie''.
The visualisation workflow is thus to extract a part of the data and then manipulate and render it. In terms of what is commonly referred to as the standard visualisation pipeline the steps are: (i) read and filter data; (ii) map; (iii); render; and (iv) display. This involves a combination of data extraction, computation and rendering which is greatly facilitated by an appropriate dataset format. For instance slicing multi-dimensional meteorological data is facilitated by array structured formats such as GRIB. Software is often build from modules, each of which is responsible for one of the pipeline steps (e.g. parallel rendering) and may produce intermediate file output.
Many visualisation software packages exist which are intended to be used with data in one or more of the available scientific formats. Here are pointers to some lists of information about this software. An overview can be found on Wikipedia at http://en.wikipedia.org/wiki/Visualization_computer_graphics. Brief descriptions and pointers to software that can be used with netCDF is at http://www.unidata.ucar.edu/packages/netcdf/utilities.html. A page of links to many scientific visualization and graphics software packages is at http://static.msi.umn.edu/user_support/scivis/scivis-list.html.
A certain amount of filtering can be applied to reduce the volume of data required in this process. Visualisation doesn't always need 64-bit, so reduction is possible. It may also be possible to extract a sub-set of data to manipulate if the full volume is not required for rendering. This can help if the data needs to be moved to the visualisation system. An alternative is to do remote visualisation but rendering in-situ.
The primary techniques for building visualization servers are as follows.
Software such as Chromium, ParaView, OpenSceneGraph, DMX, VisIt and SciRun can also do parallel rendering so it is important to be able to manipulate the data inside a cluster to enable this, for instance re-ordering the data from the original application to an order suitable for the rendering package. Table 3.3 lists properties of some well known packages.
|Package Name||Comments/ File Formats||Web site|
|Amira (licensed)||*, Amira, Analyze, various images, netCDF, DXF, PS, HTML, Hoc, Icol, Interfile, Matlab, Nifti, Open Inventor, PSI, Ply, raw binary, SGI RGB, STL, SWC, Tecplot, VRML, Vevo||http://www.amiravis.com|
|AVS/ Express (licensed)||** netCDF, ...||http://www.avs.com|
|Blender||3D rendering and animation|
|Environmental Workbench||netCDF, ...||http://www.ssesco.com/files/ewb.html|
|OpenDX||**, CDF, netCDF, HDF-4, HDF-5, DX, various images, mesh, ...||http://www.opendx.org|
|ParaView||netCDF, HDF-5, ParaView, EnSight, VTK, Exodus, XDMF, Plot3d, SpyPlot, DEM, VRML, Ply, PDB, XMol, Stereo Lithography, Gaussian Cube, AVS, Meta Image, Facet, various images, SAF, LS-Dyna, raw binary||http://www.itk.org/Wiki/ParaView|
|PV-Wave (licensed)||netCDF, ...||http://www.vni.com/products/wave/index.html|
|SciRun||**, Nimrod, ...||http://www.sci.utah.edu|
|VisIt||Analyze, Ansys, BOV, Boxlib, CGNS, Chombo, CTRL, Curve2D, Ensight, Enzo, Exodus, FITS, FLASH, Fluent, FVCOM, GGCM, GIS, H5Nimrod, H5Part, various images, ITAPS, MFIX, MM5, NASTRAN, Nek5000, netCDF, OpenFOAM, PATRAN, Plot3d, Point3D, PDB, SAMRAI, Silo, Spheral, STL, TecPlot, VASP, Vis5D, VTK, Wavefront OBJ, Xmdv, XDMF, ZeusMP (HDF-4)||https://wci.llnl.gov/codes/visit|
* Amira also has a number of ``packs'' available, such as Molecular Pack, Mesh Pack, VR Pack, etc. which support additional application specific file formats.
* Avizo is also packaged in six editions with extension modules: Standard; Earth; Wind; Fire and Green.
** OpenDX, SciRun and AVS can be extended by adding special purpose file readers into the pipeline.
For more activities involving scientific visualisation in the UK, see the JISC funded VizNet project http://www.viznet.ac.uk.
The primary techniques to provide easy, high-performance access from remote networked clients are as follows.
Some of the packages noted above can operate in client-server mode, this permits remote visualisation so that the data does not have to be moved. Instead, a visualisation server is installed on a system wih good connection to the data store and a client allows the user to control the images.
Widely used packages which operate in this way include the following.
Techniques similar to visualisation also apply to processes of data mining and feature recognition. Sub-sets of data can be used and processes can be applied in-situ. Example, looking for phase transitions - cite NW-GRID example. Data Mining is sometimes referred to as KDD, Knowledge Discovery through Data.
Data mining is a process for extracting patterns from data. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and is also used in scientific discovery. In recent years, data mining has been widely used areas of science and engineering including bio-informatics, genetics, medicine, education, electrical power engineering, plasma physics experiments and simulations, remote sensing imagery, video surveillance, climate simulations, astronomy, and fluid mix experiments and simulations. An introduction is provided in a paper by Ramakrishnan and Grama . A more recent review is provided by Kamath .
Methodologies applied include Bayesian analysis, regression analysis, and self organising feature maps (SOMs or Kohonen maps). In addition to some standardisation efforts, freely available open source software have been developed, including Orange, RapidMiner, Weka, KNIME, and the packages in the GNU R Project which include data mining processes. Most of these systems are able to import and export models in PMML (Predictive Model Markup Language) which provides a standard way to represent data mining models so that these can be shared between different statistical applications. PMML is an XML based language developed by the Data Mining Group (DMG), an independent group composed of many data mining companies. PMML version 4.0 was released in June 2009, see http://www.dmg.org.
A project called Sapphire at LLNL is is carrying out research to develop scalable algorithms for interactive exploration of large, complex, multi-dimensional scientific data sets. By applying and extending ideas from data mining, image and video processing, statistics, and pattern recognition, they are developing a new generation of computational tools and techniques that are being used to improve the way in which scientists extract useful information from data. See https://computation.llnl.gov/casc/sapphire/sapphire_home.html (last updated 2006).
A useful place to look at activities around statisitical analysis and data mining is the community using the Analytic Bridge Web 2.0 tool http://www.analyticbridge.com.
Computational steering is a valuable mechanism for scientific investigation in which the parameters of a running program can be altered and the results visualised interactively. As an investigative paradigm its history goes back more than ten years, though it has only relatively recently captured the scientific imagination. With the advent of Grid computing, the range of problems that can be tackled interactively has widened markedly. Applications specialists now see the prospect of accomplishing ``real'' science using computational steering whilst, at the same time, computing researchers are working to bring about the supporting infrastructure of middleware, toolkits and environments that will render this prospect concrete.
EPSRC funded a collaboration network to develop and promote computational steering techniques from 2006-2008, see http://compusteer.dcs.hull.ac.uk/index.php.
Computational steering typically requires remote visualisation, see above.
Some packages available include the following.
Visit from Juelich http://www.jstor.org/pss/30039695
In this section we illustrate some use cases with collections of technologies and tools in use in typical data centric research environments.
Note: HECToR is using Lustre.
CERN currently uses CASTOR-2 to manage storage of LHC experimental data. The system design goals are to provide reliable central data recording of the data, plus transparent access. Much of this access is intended to be from remote sites around the world via data Grid tools. CASTOR itself is designed around a mass storage (tape) system and is therefore not suitable for deployment at sites without this facility.
As CASTOR-2 is such a large installation it only exists at the largest sites in EGEE. These are CERN, plus 4 of the Tier-1 centres including RAL. The instance at CERN currently manages more than 50 million files and 5PB of data.
CASTOR-2 has been designed around a central Oracle RDBMS system, which handles, as much as possible, the state of the system. The resilience of this database is key to CASTOR's reliability. Around this database core all daemons are designed to be stateless, allowing for redundancy and daemon restarts without loss of service.
The stager is another key component. This manages the disk pools in front of the tape system, see Figure 2.
In order for the stager to manage access to files on the disk pool it uses a scheduler plugin. This balances the load on each of the disk servers and can also provide policy and ``fair shares'' considerations to file access. The scheduling problem for disk pools is in fact rather similar to that faced in compute clusters and CASTOR currently uses the commercial LSF batch system scheduler.
CASTOR has the ability to dynamically replicate ``hot'' files and even to switch access on an open file to a less busy replica.
Monitoring is carried by integration with the LHC Era Monitoring system at CERN, with alarms being issued when any abnormal conditions are detected. Logging is done into the Oracle database, allowing the central gathering of information, plus the ability to cross query logs from different services.
This document was generated using the LaTeX2HTML translator Version 2008 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -local_icons -split 3 -html_version 4.0 LargeData
The translation was initiated by Rob Allan on 2010-07-26