This report does not address storage solutions, backup or checkpointing which are nowadays considered to be part of the infrastructure provision. It also does not address the implementation of archival, curation and discovery technologies for which we refer the reader to work of the UK e-Science Programme and in particular the Digital Curation Centre.
© Science and Technology Facilities Council 2010-12. Neither the Council nor the Laboratory accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.
Managing scientific data has been identified as one of the most important emerging needs of the scientific community because of the sheer volume and increasing complexity of data being created or collected. This is particularly true in the growing field of computational science where increases in computer performance permit ever more realistic simulations and the potential to automatically explore large parameter spaces, e.g. using tools based on workflow. Bell et al. [6] noted that As simulations and experiments yield ever more data, a fourth paradigm is emerging, consisting of the techniques and technologies needed to perform data intensive science. ... The demands of data intensive science represent a challenge for diverse scientific communities.
Effectively generating, managing and analysing the data and resulting information requires a comprehensive, end-to-end approach that encompasses all the stages from the initial data acquisition to its final analysis. This is sometimes referred to as Information Lifecycle Management or ILM. For a discussion of the research activity lifecycle in the context of data management see [36].
A SciDAC project [25] has identified three significant requirements based on community input. Firstly, access to more efficient storage systems - in particular, parallel file system improvements are needed to write and read large volumes of data without slowing a simulation, analysis, or visualisation engine. Secondly, scientists require technologies to facilitate better understanding of their data, in particular the ability to perform complex data analysis and searches over large data sets in an effective way - specialised feature discovery, parallel statistical analysis and efficient indexing are needed before the data can be understood or visualised. Finally, generating the data, collecting and storing the results, data post-processing, and analysis of results is a tedious, fragmented process - workflow tools are required for automation of this process in a robust, tractable and recoverable fashion to enhance scientific exploration adding provenance and other metadata in the process. To this we could add stages of pre-processing which bring similar requirements, for instance mesh generation in engineering. We consider workflow tools in a separate report.
Here we review the requirements and tools for managing large data sets of interest to the UK research community involved in computational simulation and modelling. This in part extends work in a previous report [5] prepared for the UK High End Computing project http://www.ukhec.ac.uk.
We do not address archival, curation and discovery for which we refer the reader to work of the UK e-Science Programme and in particular the Digital Curation Centre. Nevertheless we first survey the expectations of public funding bodies in terms of publishing science outcomes and making related data available. This makes it essential not only to be able to manage and interpret large data sets at the time of the original research, but to create accurate metadata at the time of the original data creation 1 and ensure that data sharing and subsequent analysis is possible many years in the future.
A second report will focus on data intensive computing with requirements and examples of approaches to data analysis [2].
A report [18] commissioned by JISC, the Joint Information Systems Committee and RIN, the Research Information Network, was published in Sep'2011 and surveyed the work of a number of UK data centres and services including: ADS, Archeology Data Service; BADC, British Atmospheric Data Centre; CDS, Chemical Database Service; EBI, European Bioinformatics Institute; ESDS, Economic and Social Data Service; NCDR, National Cancer Data Repository; NGDC, National Geo-science Data Centre; UKSSDC, UK Solar System Data Centre. These receive funding from the UK Research Councils plus CR-UK and the Wellcome Trust.
Rather than policies or technology, the report looked at the centres from a user perspective and was based on surveys by the Technolopis Group from Nov'2009-Jan'2010. It concluded that the re-use of curated data was high and it did indeed lead to improved research efficiency and quality with quantifiable impacts, followed by additional data deposits. Nearly all users were academic, with the exception of social data sets.
There are two main issues addressed briefly in the rest of this section: (1) open publication of research outcomes; and (2) the need for data retention (curation).
Policies arise from the seven core RCUK principles on data sharing, see http://www.rcuk.ac.uk/research/Pages/DataPolicy.aspx. Two of these principles are of particular importance: (1) that publicly funded research data should generally be made as widely and freely available as possible in a timely and responsible manner; and (2) the research process should not be damaged by inappropriate release of such data.
We note that these principles are themselves derived from a statement of the OECD (world wide Organisation for Economic Co-operation and Development) that publicly funded research data are a public good, produced in the public interest and, therefore, should be openly available to the maximum extent possible [34].
Policies must also reflect the principal UK legal provisions intended to assure access to publicly held information. These include: the Freedom of Information Act 2000 and the Freedom of Information (Scotland) Act 2002; Data Protection Act 1998; the Environmental Information Regulations 2004; and the Environmental Information (Scotland) Regulations 2004. These Acts allow any person to ask any public authority (including universities) for any information they believe to be held by that authority, and require the authority to respond in writing stating whether or not they hold the information sought and, if so, to supply that information unless certain exemptions apply.
Any exemptions, which may be absolute or qualified, generally relate to considerations such as national security, law enforcement, commercial interests or data protection, all of which may be relevant to research data. Guidance is available to help researchers and their institutional representatives understand their obligations. See for example the JISC publication [37]. Note: the exemptions in Scotland differ from those in the rest of the UK.
Digital curation is about maintaining and adding value to a trusted body of digital research data for current and future use. It includes the active management of data throughout the research lifecycle. To be useful the data should be validated and the method by which it was generated should be recorded.
In the UK, the Digital Curation Centre (DCC) has experts in curating digital research data who promote best practice in storing, managing and protecting digital data, see http://www.dcc.ac.uk. They explain the principles of curation to primary stake holders and to the wider community, seek to inform and influence political positioning in the curation and preservation landscape and promote and publicise curation concepts.
The DCC has created a number of information brochures and a Curation Reference Manual [30]. There is also a collection of papers on Managing Research Data [31].
Developing a data management plan is now a core part of good research practice and has been shown to bring significant benefits in terms of more efficiently conducted research and avoiding the risk of data loss. The starting point in developing such a plan should be to consult the DCC's useful overview of research funders' requirements as also summarised below.
The widely praised DMP Online is the DCC's data management planning tool, see http://dmponline.dcc.ac.uk. The tool draws upon the DCC's analysis of funders' data requirements to help project teams create up to three iterations of a data management plan: a ``minimal'' version for use at the grant application stage; a ``core'' version to be developed during the project itself; and towards the end of the project a ``full'' version that addresses issues of longer term access and preservation. It contains information from RCUK, non-RCUK funders, US funders and sets of institutional and disciplinary templates are being developed.
A set of slides giving further information and examples of using DMP Online can be found here http://tardis.dl.ac.uk/NWGrid/mew22-dm-ab.pdf.
The BBSRC states that publications should be deposited at the earliest opportunity and expects data to be made available in a timely and responsible manner. Timely release could be considered as no later than the release of main findings through publication, or three years as a general guide. Data should be maintained for a minimum of ten years after project completion through their home institutions.
The BBSRC encourages data sharing in all research areas where there is strong scientific need and where it is cost effective. They encourage researchers to make material openly available, in suitably accessible formats using established standards. A publications repository and financial support for data sharing is available to facilitate sustained access. Researchers are therefore required to submit a data sharing plan with their proposals.
A number of databases are recommended for depositing research data, several of which are hosted at EBI, the European Bio-molecular Institute at Hinxton near Cambridge.
Further information:
http://www.dcc.ac.uk/resources/policy-and-legal/research-funding-policies/bbsrc
http://www.bbsrc.ac.uk/publications/policy/access-research-outputs.aspx
http://www.bbsrc.ac.uk/web/FILES/Policies/data-sharing-policy.pdf
http://www.bbsrc.ac.uk/publications/policy/good-scientific-practice.aspx
EPSRC has mandated open access publication of research that it funds since 2009. From 1/5/2011, they introduced a new policy framework covering access to, and management of, research data arising from research sponsored by EPSRC. This sets out their expectations arising from the core RCUK principles.
The framework is not prescriptive about how the expectations should be met and gives freedom for institutions to develop policies and practices based on individual circumstances. Specific expectations are published on-line [33] and include the following. Publications should include information on how and under what terms the related data can be accessed. Organisations must ensure that structured metadata exists describing the research data they hold made freely accessible on the internet. The metadata must be sufficient to allow others to understand what research data exists, why, when and how it was generated and how to access it. Where the research data referred to in the metadata is a digital object it is expected that the metadata will include use of a robust digital object identifier (DOI).
Organisations must ensure that research data is securely preserved for a minimum of 10 years from the date that any ``privileged access'' period expires or from the last date on which access to the data was requested by a third party. They must also ensure that effective data curation is provided throughout the full data life cycle as defined by the Digital Curation Centre.
EPSRC do not provide data centres. Instead, research organisations must ensure adequate resources are provided to support the curation of data arising from publicly funded research. These resources must be allocated from within their existing public funding streams, whether received from RCs as direct or indirect support for specific projects or from HEFCE as block grants.
Further information:
http://www.dcc.ac.uk/resources/policy-and-legal/research-funding-policies/epsrc
http://www.epsrc.ac.uk/about/infoaccess/Pages/roaccess.aspx
http://www.epsrc.ac.uk/funding/managing/Documents/goodpracticeguide.pdf
http://www.epsrc.ac.uk/about/standards/researchdata/Pages/default.aspx
http://www.epsrc.ac.uk/about/standards/researchdata/Pages/expectations.aspx
ESRC require applicants to consider what outputs will be created at the proposal stage and how these will be made available in the long term. Researchers are expected to make all outputs accessible as soon as possible. ESRC provide a publications repository and data service to facilitate this. Grant holders are then expected to deposit publications at the earliest opportunity and data must be offered to the Economic and Social Data Service, ESDS, based at the UK Data Archive in Colchester within three months of the end of the award. Planning to do this is part of the grant application process. ESRC's research data policy was last updated in Sep'2010.
Further information:
http://www.dcc.ac.uk/resources/policy-and-legal/research-funding-policies/esrc
http://www.esrcsocietytoday.ac.uk/ESRCInfoCentre/Support/access/
http://www.esrc.ac.uk/about-esrc/information/data-policy.aspx
http://www.esrc.ac.uk/_images/Research_Data_Policy_2010_tcm8-4595.pdf
NERC published a position statement on access to research outputs in 2006. To support access to environmental data, their data policy also requires that award holders offer a copy of any data set resulting from NERC funded activities to its data centres. A new version of the data policy was published in Jan'2011 with data management requirements expected to be implemented in 2012.
Long term curation is central to NERC and an extensive data centre support infrastructure is in place to facilitate this. Use of these centres is free to NERC funded researchers who are expected to consider aspects of data creation and management prior to beginning research. Over arching data plans are produced for each thematic programme. BADC's Data Management Plans template is indicative. Grant applicants must include a plan for their work drawn up and implemented with an appropriate data centre.
The current NERC data centres are as follows.
Information on all data held within the centres will be made available through the NERC Data Discovery Service which provides an integrated, searchable catalogue.
NERC also support an e-Prints document repository. Publications should be made accessible through this or other institutional repositories. Publications resulting from NERC funding must be deposited at the earliest opportunity and data must be offered after a ``reasonable period'' of exclusive use, currently considered to be two years from the end of data collection.
Further information:
http://www.dcc.ac.uk/resources/policy-and-legal/research-funding-policies/nerc
http://www.nerc.ac.uk/about/access/statement.asp
http://www.nerc.ac.uk/research/sites/data/
http://www.nerc.ac.uk/research/sites/data/policy.asp
As an example, the full policy statement from STFC (as published Sep'2011) is re-produced in Appendix B.
Researchers are expected to make publications that arise from STFC funded research available at the earliest opportunity. An e-Pubs system has been set up for this purpose. Activities in the e-Science Centre have led to STFC's statement on data management or sharing and best practice, but there is currently no overall formal policy covering long term curation. It is suggested that a domain specific or institutional repository be used and that data should be retained for a minimum of 10 years. Data archives are being implemented for facilities such as Diamond and ISIS, see Secion 5.5. There is also separate provision for access and management of particle physics data through the GridPP consortium and UK Tier-1 centre.
Note that STFC, at the Rutherford Appleton Laboratory, host the BADC, the particle physics Tier-1 Centre and a data archive for BBSRC. STFC also formerly hosted the HPCx service at Daresbury Laboratory.
Further information:
http://www.dcc.ac.uk/resources/policy-and-legal/research-funding-policies/stfc
http://www.scitech.ac.uk/rgh/rghDisplay2.aspx?m=s&s=64
http://www.stfc.ac.uk/stfcconsultation/sources/strategy/StrategyConsultationDocument.pdf
The JISC, the Joint Information Systems Committee, funded a programme on Managing Research Data from 2009-11, see http://www.jisc.ac.uk/whatwedo/programmes/mrd/outputs.aspx.
The Web page provides a narrative guide to outputs from the programme and some related JISC funded activities. This contains links to the projects and services mentioned below. It is intended as an easy point of introduction to key outputs that will be of interest to others seeking to improve research data management in universities, and therefore relevant to the discussion above. It is intended that this will be useful for institutions seeking to improve research data management.
This is an ongoing activity, with JISC funding for further projects in this area announced from time to time.
Research data management support for researchers
The Incremental project has produced Web pages providing support and guidance for managing research data: Support for Managing Research Data at the University of Cambridge; and Data Managing Support for Researchers at the University of Glasgow.
The EIDCSR project, sister project to SUDAMIH, created a similar Research Data Management site for the University of Oxford. Likewise, University of Edinburgh Information Services has put together a site providing Research Data Management Guidance.
Introductions and ``How To'' guides
The UK Data Archive has recently revised its guide Managing and Sharing Data - Best Practice Guide for Researchers, in part as a result of work undertaken by the JISC funded project Data Management Planning for ESRC Research Data Rich Investments (DMP-ESRC).
Building on its wide ranging set of briefing papers, the Digital Curation Centre is also producing a series of ``How To'' guides which provide a working knowledge of curation topics, aimed at people in research or support posts who are new to curation, but are taking on responsibilities for managing data, whether at local research group level or in an institutional data centre or repository. The first two guides in this series deal with how to appraise and select research data and how to license research data.
In its early stages, the ERIM project produced a Review of the State of the Art of the Digital Curation of Research Data which serves as an introduction to and overview of the issues.
Model data management plans and guidance
The DMP-ESRC project produced a substantial and detailed set of Data Management Recommendations for Research Centres and Programmes as well as a summary guide to two key recommendations, relating to research data management strategies and the maintenance of a resources library. Although targeted at large ESRC research investments, these guidelines are widely applicable and could be useful for data management planning in other disciplines.
The ERIM project, examining research data management and sharing issues for researchers at the University of Bath's Innovative Design and Manufacturing Research Centre produced a Draft Data Management Plan for IdMRC Projects. The work builds on a set of high level Principles for Engineering Research Data Management; a Thematic Analysis of Data Management Plan Tools and Exemplars; and a Requirement Specification for an Engineering Research Data Management Plan.
Case studies and requirements analyses
A number of the projects in the programme produced requirements analyses. Some projects took a broad institutional view while others focused on the requirements of specific disciplines.
It is worth noting that a number of other institutional DM projects (e.g. Bristol, UCL) have found that including the library as a partner is essential. This brings in people with essential information management skills.
Research data management platforms
Many projects in the Managing Research Data programme developed technical platforms and software to help researchers manage their data.
As an example, the core technical output of the I2S2 project was to develop the I2S2 Information Model and to implement this within the STFC's ICAT Lite ``personal workbench for managing data flows''. This allows the user to manage data, to capture provenance information and to ``commit data'' for long term storage. The project has produced a useful implementation plan and a description of their pilot implementation. ICAT was formerly a product of the STFC e-Science Centre at Daresbury.
Research data management costing
Understanding how to model the full cost of research data management is a challenging area and one which will require further work at the institutional level. Material for understanding activity based costing has come out of the Keeping Research Data Safe project and a good starting point is the project's user guide.
The DMP-ESRC project produced a light weight activity based research data management costing tool for researchers in the social sciences.
Training materials
A number of projects in the JISC programme produced training materials which are available for re-use and adaptation. As an exmaple, working with the Humanities Division at the University of Oxford, the SUDAMIH project produced training materials for humanities researchers.
Management of research data produced in collaboration with or derived from the NHS falls under the Research Governance Framework for Health and Social Care [35]. This requires an organisation to have clearly documented standard operating procedures for the management of all research data. As an example, the University of Manchester ensures compliance with the framework through a Research Governance MoU with partner NHS Trusts, and a joint Research Governance Group meets regularly [11]. The framework states that data collected in the course of research must be retained for an appropriate period, to allow further analysis by the original or other research teams subject to consent, and to support monitoring by regulatory and other authorities.
The work described above can be compared to similar work in Australia, see [39]. In 2008 the Federal Department of Industry, Innovation, Science Research and Tertiary Education (DIISRTE) entered into an agreement with Monash University to establish ANDS under the National Collaborative Research Infrastructure Strategy (NCRIS). Subsequent funding from several departments has enabled a number of projects to contribute to this overall activity.
The task of ANDS is to create the infrastructure to enable Australian researchers to easily publish, discover, access and use research data. Its approach is to engage in partnerships with the research institutions to enable better local data management that enables structured collections to be created and published. ANDS then connects those collections so that they can be found and used. Since 2009, these connected collections, together with the infrastructure, formed the ``Australian Research Data Commons'', see Figure 1.
This includes all higher education providers in Australia and all research organisations that are publicly funded such as CSIRO, GeoScience Australia, Bureau of Meteorology, ABS, AIMS, DsPI. The ten year objectives for data management within ANDS are as follows.
Scientists often consider ``data management'' to mean a physical data store with an access layer for movement of data from one location to another. The scope of scientific data management is however much broader, encompassing both its meaning and content.
The cutting edge of computational science involves very large simulations taking many hours or days on the latest high performance (and therefore expensive) computers. For business data, large companies implement enterprise wide data architectures, with data warehousing and data mining to extract information from their data. Can something similar be done for scientific data?
Some problems identified from current scientific projects include the following.
The increasing size of scientific data collections brings not only problems, but also opportunities. One of the biggest is the possibility to re-use existing data for new studies. This was one of the great hopes of the e-Science programme. Many projects investigated data curation, provenance and metadata definitions based on common ontologies. At STFC it was a goal of the Facilities e-Infrastructure Programme with a focus on SRB, ICAT and DataPortal for SRS, Diamond and ISIS.
Scientific data is composed not only of bytes, but also of workflow definitions, computation parameters, environment setup and provenance. Capturing and using all this associated information is a goal of, among others, the JISC funded myExperiment project which focusses on workflows encapsulated in ``research objects'', see http://www.myexperiment.org.
Other aspects of data management, particularly virtualisation of the underlying storage, has been tackled in projects such as SRB, the Storage Resource Broker from SDSC, now superceded by iRODS.
Much of the work mentioned above was aimed a creating catalogues such as ICAT which reference large data collections, see below. A similar project for high energy physics data is LFC, the LCG File Catalogue. Entries in the catalogues may point to the outputs of a facility or long term research programme which will have consumed extensive public funding and are deemed to be of lasting and sometimes national importance.
We will not attempt to fully describe or re-produce this e-Science work here, the background to which is discussed in a report [14]. We also do not consider issues of data curation for which appropriate standards and processes have been developed and documented by the Digital Curation Centre as noted above. Instead, we will now focus on ways to manipulate and analyse large scientific data sets. There are lists of open data repositories on-line, for instance http://oad.simmons.edu/oadwiki/Data_repositories. A few examples are as follows.
The LHC data analysis represents an extreme case as it will generate upwards of 14PByte of data a year, which has to be distributed across the EGEE, OSG and NorduGrid Grids for analysis. Such data volumes cannot be handled easily with current production networks, so have required the provisioning of optical private networks (an unusual form of SAN) linking CERN's Tier-0 centre to the key national Tier-1 computing centres around the world, the one for the UK being situated at RAL. Dedicated 10Gb/s network links are provided in this way for data movement.
We do not consider the middleware aspects of such data grids in this report, but focus more on the requirements of a high end data centre focusing on computational simulation and modelling. The following figure illustrates the architecture implemented in the SciDAC Data Management Center [25].
Here activities are organised in three layers that abstract the end-to-end data flow. The layers are: Storage Efficient Access (SEA); Data Mining and Analytics (DMA); and Scientific Process Automation (SPA). The SEA layer is immediately on top of the infrastructure, i.e. data intensive computing hardware, operating systems, file systems and hierarchical mass storage systems, and provides parallel data access technology and transparent access to archival storage. The DMA layer, which builds on the functionality of the SEA layer, consists of indexing, feature selection and parallel statistical analysis technology. The SPA layer, above the DMA layer, provides the ability to compose scientific workflows from the components in the DMA layer as well as application specific modules. This architecture has been used in the centre to organise its components and apply them to various scientific applications.
Important aspects in any data management system have been identified as follows.
To be accessible, scientific data must be stored in a widely recognised format. Not surprisingly, many data format standards are concerned with multi-dimensional data or image processing. In some cases these are domain specific, but others are more generic.
Data management issues and processes are mostly independent of the data format. Nevertheless some formats are more suited to support metadata and more helpful in reaching the various data management goals. The format can also have a significant effect on i/o performance and the ability to search across and pull out sub-sets of data. An example of this is slicing through 3D data. For system independence, e.g. transferring data between big endian and little endian systems, formats such as XDR (eXternal Data Representation) can be used, although translation may be slow.
There have been several attempts to use XML to describe binary data formats, but these have been largely un-successful. Initiatives such as DFDL, Data Format Description Language and BINX attempt to do this. Traditional formats like HDF-5, NetCDF and CGNS are widely used and newer formats using XML may be more suited for storing metadata. It is also possible to use relational or object databases for certain types of data. We note that usually the metadata is stored separately from the actual data, for instance in a catalogue (searchable database) which contains location references or URLs. There are issues of transaction management and consistency when data or metadata are distributed or replicated.
We do not address low level file systems such as FAT, EXT3, HPFS, etc. in this document, although we note that some performance improvements can be associated with an appropriate choice. This is particularly true of newer ones which are being developed to address scalability to large networked file stores, such as Oracle's BTRFS, B-Tree File System (fault tolerant) introduced in 2007, see http://btrfs.wiki.kernel.org and CRFS, the Coherent Remote File System. ZFS from Sun offers some similiar capabilities and is available on the newer NW-GRID clusters. This includes volume management functions, scalability, snapshots and copy-on-write clones plus built in integrity checking and repair, RAID and NFS-4 support. See Wikipedia for more information about file systems, http://en.wikipedia.org/wiki/List_of_file_systems.
The following list describes a number of widely used data set formats expanding on that in [5]. Section 4.2 goes on to describe storage and distributed high performance file systems which may be of interest.
The CGNS system is designed to facilitate the exchange of data between sites and applications and to help stabilise the archiving of aerodynamic data. The data are stored in a compact, binary format and are accessible through a comprehensive and extensible library of functions. The API is platform independent and can be easily implemented in C, C++ and Fortran. A data viewer is available.
A major feature of the FITS format is that image metadata is stored in a human readable ASCII header, so that an interested user can examine the headers to investigate a file of unknown provenance.
FITS is also often used to store non-image data, such as spectra, photon lists, data cubes, or even structured data such as multi-table databases. A FITS file may contain several extensions, and each of these may contain a data object. For example, it is possible to store X-ray and infrared exposures in the same file.
HDF was originally from NCSA and is now supported by the HDF Group, see http://www.hdfgroup.org/HDF5. We note that HDF-4 still exists, but is significantly different both in design and API.
Note: Q5cost is an HDF-5 based format which has a Fortran API developed in an EU COST D23 project for computational chemistry [4], see http://abigrid.cineca.it/abigrid/the-docs-archive/q5cost/.
The project is actively supported. The recently released (2008) version 4.0 greatly enhances the data model by allowing the use of the HDF-5 data file format. See http://www.unidata.ucar.edu/software/netcdf.
Like HDF-5, NeXuS is a hierarchical format with a directory style structure. Some metadata in NeXuS files is encoded as XML with standard tag names making them easy to interpret. NeXuS is used in ICAT.
There are many other formats, some proprietary or application
specific, see also Section
.
Distributed File Systems are sometimes called Distributed Datastore Networks - see Wikipedia. In this report we consider only those which work on a wide area network and are thefore suitable for Campus or inter-site computing and data management. Normally many implementations have been made, they are location dependent and they have access control lists (ACLs), unless otherwise stated below.
We separate the rest of this section into server centric storage systems and those supporting distributed file servers. Most systems do however have some dependency on one or more central services, such as metadata services, database or catalogues, which are noted. We assume that all systems reviewed can access distributed storage or provide storage to distributed clients in some way.
Keywords include: data migration; hierarchical storage management; information lifecycle management; storage area network; tiered storage.
Data Migration
Data migration is the transferring of data between storage types, formats, or computer systems. Data migration is usually performed programmatically to achieve an automated migration. It is required when organisations or individuals change computer systems or upgrade to new systems. Migration is a key issue in data curation, see the Digital Curation Centre http://www.dcc.ac.uk. Migration is thus a means of overcoming technological obsolescence by transferring digital resources from one hardware or software generation to the next. The purpose of migration is to preserve the intellectual content of digital objects and to retain the ability for clients to retrieve, display and otherwise use them with evolving technology. Migration differs from the refreshing of storage media in that it is not always possible to make an exact digital copy or replicate original features and appearance and still maintain the compatibility of the resource with the new generation of technology.
ILM
Information Lifecycle Management (ILM) is a comprehensive approach to managing the flow of information, data and metadata from creation to obsolescence. ILM encapsulates potentially complex criteria for storage management going beyond age of data and access frequency.
ILM thus refers to a wide ranging set of strategies for administering storage systems on computing devices. Specifically, four categories of storage strategies may be considered under the auspices of ILM. These concern: policies including SLAs around data management; operational aspects including backup and data protection; logical and physical infrastructure; and definition of how the strategies are applied.
ILM products automate the processes involved, typically organising data into separate tiers (see below) according to specified policies, and automating data migration from one tier to another based on those criteria. As a rule, newer data, and data that must be accessed more frequently, is stored on faster, but more expensive storage media, while less critical data is stored on cheaper, but slower media. ILM can specify different policies for data that declines in value at different rates or that retains its value throughout its life span.
SAN
A Storage Area Network (SAN) is a high speed network designed to attach computer storage devices such as disk array controllers and tape libraries to servers. SANs became widely used in enterprise (campus) storage from around 2006.
A SAN allows a machine to connect to remote targets such as disks and tape drives on a network usually for block level i/o. From the point of view of the class drivers and application software, the devices appear as locally attached devices.
There are two variations of SAN:
Tiered Storage
Tiered storage is a data storage environment consisting of two or more kinds of storage differentiated by at least one of four attributes: Price; Performance; Capacity; Function. Any significant difference in one or more of the four defining attributes can be sufficient to justify a separate storage tier.
Examples include the following.
HSM
Hierarchical Storage Management is related to tiered storage. It is a data storage technique that automatically migrates data between high cost and low cost (and probably higher capacity) storage media. HSM systems exist because high speed storage devices, such as hard disk drives, are typically more expensive (per byte stored) than slower devices, such as optical discs and magnetic tape drives. Whilst it would be ideal to have all data available on high speed devices all the time, this would be prohibitively expensive for most installations. HSM systems instead store the bulk of the organisation's data on slower devices and copy data to faster disk drives only when needed. In effect, HSM turns the fast disk drives into caches for the slower mass storage. The HSM system monitors the way data is used and makes best guesses as to which data can safely be relegated to slower devices and which data should stay on the fast disks.
HSM thus implements policy based management of file backup and archiving in a way that uses storage devices economically and without the user needing to be aware of when files are being retrieved from backup storage media. Although HSM can be implemented on a standalone system, it is more frequently used in the distributed network of an enterprise. Using an HSM product, an administrator can establish and state guidelines for how often different kinds of files are to be copied to a backup storage device. Once the guideline has been set up, the HSM software manages everything automatically.
File Migration
Assuming a file based storage system, efficient file migration services are at the heart of HSM. It is also relevant when moving data from one vendor's product to another, but see under Data Migration above.
File migration thus arises from an ILM strategy that relegates data to less expensive devices as the data decreases in value to the enterprise. Migration may also be driven by a need to simplify or standardise environments, to improve storage space utilisation, to balance workloads between file systems, or to consolidate storage management.
File migration services are present in many commercial products, e.g. from CommVault, HP, LSI and Symantec.
This section lists some ``traditional'' storage systems aimed principally at backup, restore and archival, but now increasingly including business logic. These are typically aimed at managing (e.g. by indexing) many small files to produce a searchable archive store sometimes known as a collection.
StoreAge MultiMigrate product, now from LSI, see http://www.lsi.com/storage_home/products_home/storage_virtualization_data_services/storeage_multimigrate. This enables the on-line migration of data from any storage device to any other storage device, regardless of vendor. The migration takes place while the applications remain on-line without any interruption. It is aimed at migrating critical applications from older storage devices onto newer platforms.
High performance computing environments require parallel file systems and access to data from multiple clients. Traditional server based file systems such as those exported via NFS are unable to scale efficiently to support hundreds of nodes or multiple servers. Parallel file systems are typically deployed for dedicated high performance storage solutions within clusters, usually as part of a vendor’s integrated cluster solution. These file systems are often tightly integrated with a single cluster’s hardware and software environment making sharing them impractical. Recently, several parallel file systems have been introduced that are designed to make sharing a files between clusters feasible in the presence of hardware and software heterogeneity.
In this section we review distributed file systems, which are also possibly also parallel and fault tolerant, stripe and replicate data over multiple servers for high performance and maintain data integrity.
All file systems listed here focus on high availability, scalability and high performance unless otherwise stated. Whilst these provide distinct advantages over more traditional file systems they may be more or less complicated to install, configure and manage and may require specific Linux kernel patches.
FraunhoferFS is written from scratch and incorporates results from previous experience. It is a fully Posix compliant, scalable file system including features as follows.
GAM software enables formation of fixed content storage systems that can scale to petabytes of data across numerous sites. It has an efficient wide area replication to deliver a storage system spanning sites linked together with differing bandwidth networks. GAM software optimises broad availability of data across sites through network file system interfaces (CIFS and NFS) and as an object store delivering a global name space.
GlusterFS has client and server components. Servers are typically deployed as storage bricks, with each server running a glusterfsd daemon to export a local file system as a volume. The glusterfs client process, which connects to servers with a custom protocol over TCP/IP, InfiniBand or SDP, composes remote volumes into larger ones using stackable translators. The final volume is then mounted by the client host through the FUSE mechanism to provide a Posix interface. Applications doing large amounts of file i/o can also use the libglusterfs client library to connect to the servers directly and run in-process translators, without going through the file system and incurring FUSE overhead.
Most of the functionality of GlusterFS is implemented as translators, including: file based mirroring and replication; file based striping; file based load balancing; volume failover; scheduling and disk caching; storage quotas.
The GlusterFS server is kept minimally simple - it exports an existing file system as-is, leaving it up to client side translators to structure the store. The clients themselves are stateless, do not communicate with each other, and are expected to have translator configurations consistent with each other. This can cause coherency problems, but allows GlusterFS to scale up to several petabytes on commodity hardware by avoiding bottlenecks that normally affect more tightly coupled distributed file systems - there is no master node.
There seems to be good community support for Gluster, although most users seem to be from Web hosting companies, see http://www.gluster.org. It is being considered for use by Streamline Computing.
GPFS provides concurrent high speed file access to applications executing on multiple nodes of clusters. It can be used with AIX, Linux, Microsoft Windows Server 2003-r2 or a heterogeneous cluster. In addition to providing file system storage, GPFS provides tools for management and administration of the storage cluster and allows for shared access to file systems from remote GPFS clusters.
HDFS is thus built from a cluster of data nodes, each of which serves up blocks of data over the network using a protocol specific to HDFS. They can also serve the data over HTTP, allowing access to all content from a Web browser or other client. Data nodes can talk to each other to re-balance data, to move copies around, and to keep the replication of data high (which helps with access speed using peer-to-peer methods). It can also be used for storage scavaging.
A Hadoop file system requires one unique server, the name node. This is therefore a single point of failure for HDFS (or indeed GFS or KFS). When it comes back up, the name node must replay all outstanding operations. Another limitation of HDFS is that it cannot be directly mounted by an existing operating system as it does not have a Posix API. Getting data into and out of the HDFS file system is an action that often needs to be performed before and after executing a job, so this can be inconvenient.
It was speculated in 2010 that Sun Grid Engine v6.2-5 would include support for Hadoop. Thus SGE, which is aware of HDFS, would be able to route processing jobs to where the data is already located in the nodes, which speeds up execution of those jobs. This is more efficient than starting up a job somewhere and then trying to move the data over to that node. It can also be of use for chaining multi-stage jobs which use large data sets. It was speculated in 2009 that Condor would take a similar approach from v7.4 onwards, how this has not yet been seen.
HDFS may also be valuable for data mining applications, see
Section
. It is now available with support for
running on Amazon EC2 or S3 clouds, although its performance has been
questioned in this context.
A number of applications are becoming available built on Hadoop. Yale University's HadoopDB uses MapReduce and is aimed at petabyte databases. Hive is a data warehouse infrastructure with its own query language (called Hive QL). Hive makes Hadoop more familiar to those with a Structured Query Language (SQL) background, but it also supports the traditional MapReduce infrastructure for data processing. HBase is a high performance database system similar to Google BigTable. Instead of traditional file processing, HBase makes database tables the input and output form for MapReduce processing. Finally, PIG is a platform on Hadoop for analysing large data sets. PIG provides a high level language that compiles to MapReduce applications.
Lustre is now developed and maintained by Oracle (previously Sun) with input from many other individuals and companies after the acquisition of Cluster File Systems, Inc. in 2007, with the intention of bringing the benefits of Lustre technologies to Sun's ZFS file system and the Solaris operating system. See http://wiki.lustre.org/index.php/Main_Page.
Lustre has failover, but multi-server RAID-1 or RAID-5 is still in their roadmap for future versions. Available for Linux under GPL. Largest current Lustre implementation is on the TeraGrid Data Capacitor [19].
Cluster aware applications can make use of parallel i/o. It can also be used as a fail over setup to increase its resiliance. Apart from being used with Oracle's Real Application Cluster (RAC) database product, OCFS-2 is currently in use to provide scalable Web servers, file servers, mail servers and for hosting virtual machine images.
Some of the notable features of OCFS are:
Optimised allocations (extents, reservations, sparse,
unwritten extents, punch holes);
Inode based writeable snapshots;
Indexed directories;
Metadata checksums;
Extended attributes (unlimited number of attributes per
inode);
Advanced security (Posix ACLs and SELinux);
User and group quotas;
Variable block and cluster sizes;
Journaling (ordered and writeback data journaling modes);
Endian and architecture neutral (x86, x86_64, ia64 and
ppc64);
Buffered, direct, asynchronous, splice and memory mapped
i/o;
In-built Clusterstack with a distributed lock manager
Cluster aware tools (mkfs, fsck, tunefs, etc.)
PanFS is used at LANL for RoadRunner and on the NW-GRID clusters and some other systems at Daresbury Laboratory. See http://www.panasas.com.
The developers of SRB have now developed iRODS, another data Grid software system with the important addition of a distributed rules engine. Users can execute rules and micro-services to automate the enforcement of management policies to control data access, manipulate data at distributed sites, etc. The use of rules provides iRODS with a flexibility that would have to be hard coded using SRB.
We have chosen to illustrate in more detail the architectures of six distributed file systems: AFS, Ceph, iRODS, Lustre, GPFS and Panasas. All are suitable for use in large data centres or experimental facilities.
AFS has several benefits over traditional networked file systems, particularly in the areas of security and scalability. It is not un-common for enterprise AFS cells to exceed 25,000 clients. AFS uses Kerberos for authentication and implements access control lists on directories for users and groups. Each client caches files on the local file system for increased speed on subsequent requests for the same file. This also allows limited file access in the event of a server crash or a network outage.
Read and write to an open file is directed only to the local copy. When a modified file is closed, the changes are copied back to the file server. Cache consistency is maintained by a callback mechanism. Clients are informed by the server if the file is changed elsewhere. Callbacks are discarded and must be re-established after any client, server or network failure, including a time-out. Re-establishing a callback involves a status check and does not require re-reading the file itself.
A consequence of the file locking strategy required to implement the above mechanism is that AFS does not support large shared databases or record updating within files shared between client systems. This was a deliberate design decision based on the perceived needs of the university computing environment at the time.
A significant feature of AFS is the ``volume'', a tree of files, sub-directories and AFS mount points (links to other AFS volumes). Volumes are created by administrators and linked at a specific named path in an AFS cell. Once created, users of the file system may create directories and files as usual without concern for the physical location of the volume. A volume may have a quota assigned to it in order to limit the amount of space consumed. AFS administrators can move that volume to another server and disk location as required without the need to notify users; indeed the operation can occur while files in that volume are being used.
AFS volumes can be replicated to read only cloned copies. When accessing files in a read only volume, a client system will retrieve data from a particular read only copy. If at some point that copy becomes unavailable, clients will look for any of the remaining copies. Again, users of that data are un-aware of the location of the read only copy; administrators can create and relocate such copies as needed. The AFS command suite guarantees that all read only volumes contain exact copies of the original read-write volume at the time the read only copy was created.
The file name space on an Andrew workstation is partitioned into a shared and local name space. The shared name space (usually mounted as /afs on the Unix file system) is identical on all workstations. The local name space is unique to each workstation. It only needs to contain temporary files needed for workstation initialisation and symbolic links to files in the shared name space.
AFS heavily influenced NFS-4. Additionally, a variant of AFS, the Distributed File System (DFS) was adopted by the Open Software Foundation in 1989 as part of their Distributed Computing Environment.
There are currently three major implementations from Transarc (IBM), OpenAFS and Arla, although the Transarc software is losing support and is deprecated. AFS-2 is also the predecessor of the Coda file system.
A fourth implementation exists in the Linux kernel source code since at least version 2.6.10. This was committed by Red Hat, but is a fairly simple implementation in its early stages of development and therefore still incomplete.
Note: AFS is in use at University of Manchester and clients can be provided on the National Grid Service clusters if projects require them.
Ceph has three main components: 1) a cluster of Object Storage Devices (OSDs), which collectively store all data and metadata; 2) a metadata server (MDS) cluster, which manages the namespace (file names and directories), consistency and coherence; 3) clients, each instance of which exposes a Posix like API.
Ceph storage consists of a potentially large number of OSDs, a smaller set of MDS daemons and a few monitor daemons for managing cluster membership and state. The OSDs handle data migration, replication, failure detection and recovery. They rely on the new Linux BTRFS object store and use an algorithm like hashing to compute the data location rather than using lookup tables (the data distribution function is referred to as CRUSH, Controlled Replica Under Scalable Hashing). Replicas are used in CRUSH to improve access and reliability. The storage cluster is simple to deploy, while providing better scalability than other current block based cluster file systems. The placement policy in CRUSH can also take into account storage and server hierarchy, e.g. utilising redundant or spatially separated devices to enhance resilience.
Metadata daemons compute the data location and use the Paxos consensus protocol to arbitrate access, see http://en.wikipedia.org/wiki/Paxos_algorithm. This avoids any need to exchange location metadata. There is typically one MSD per 100 OSDs.
Clients use the FUSE mechanism for Posix like i/o and cache data locations returned from the MDS.
Ceph is being considered for use at Imperial College London.
iRODS stands for integrated Rule Oriented Data Systems. It is a second generation data grid system providing a unified view and seamless access to distributed digital objects across a wide area network. It is an evolution of the first generation Storage Resource Broker (SRB) which provided a unified view based on logical naming concepts - users, resources, data objects and virtual directories were abstracted by logical names and mapped onto physical entities thus providing a physical-to-logical independence for client level applications. iRODS builds on this by abstracting the data management process itself - this is referred to as policy abstraction.
iRODS v1.0 provides user friendly installation tools, a modular environment for extensibility through micro-services, a Web based interface and support for Java, C and shell programming through libraries and utilities for application development.
The ``integrated'' part of iRODS comes from the fact that it provides a unified software environment for underlying services which interact in a complex fashion among themselves. This idea is different to a toolkit methodology where one is provided with a suite of modules to be integrated by the user or application to form a customised system. IRODS integration exposes a uniform interface to the client application hiding the underlying complexity. SRB also had an integrated envelope methodology with a single server installation hiding the details of third party authentication, authorisation, auditing, metadata managment, streaming access mechanism, resource (vendor level), etc.
iRODS currently has around 100 API functions and 80 command level utilities. These build on the integrated envelope adding more functionality and services. Functionalities include the following.
I presentation containing more informatino can be found at http://tardis.dl.ac.uk/NWGrid/irods_2011.pdf. For an evaluation of iRODS in the JISC funded iREAD project see http://www.wrg.york.ac.uk/iread.
Note: SRB has been used in the NERC funded e-Minerals e-Science project, the JISC funded Cheshire-3 VRE project and on the Diamond synchrotron facility. iRODS is under evaluation at STFC.
Lustre is probably the most pervasive parallel file system on large scale systems at the time of writing. It is open source and found on aroud 60 of the top 100 systems. Since Lustre was a Sun product, there is now some doubt about continued support from Oracle from Lustre-2 onwards. The Open Scaleable File System Consortium is addressing some of the concerns.
A Lustre file system has three major functional units as follows.
The MDT, OST, and client can be on the same node or on different nodes. In typical installations these functions are on separate nodes with two to four OSTs per OSS node communicating over a network. Lustre supports several network types, including InfiniBand, TCP/IP on Ethernet, Myrinet, Quadrics and other proprietary technologies. Lustre can take advantage of remote direct memory access (RDMA) transfers, e.g. over IB, to improve throughput and reduce CPU usage.
The storage attached to the servers is partitioned, optionally organised with logical volume management (LVM) and/ or RAID and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.
An OST is a dedicated file system that exports an interface to byte ranges of objects for read and write operations. An MDT is a dedicated file system that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use a modified version of ext3 to store data. In the future it was planned that Sun's ZFS or DMU would be used.
When a client accesses a file, it completes a filename lookup on the MDS. As a result, a file is created on behalf of the client or the layout of an existing file is returned to the client. For read or write operations, the client then passes the layout to a logical object volume (LOV), which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the file system.
Clients do not directly modify the objects on the OST file systems, but, instead, delegate this task to OSSes. This approach ensures scalability for large scale clusters and super-computers, as well as improved security and reliability.
In a typical Lustre installation on a Linux client, a Lustre file system driver module is loaded into the kernel and the file system is mounted like any other local or network file system. Client applications see a single, unified Posix like file system even though it may be composed of tens to thousands of individual servers and MDT or OST file systems.
On some HPC installations, computational nodes can access a Lustre file system by re-directing their i/o requests to a dedicated node configured as a Lustre client. This approach was for instance used in the LLNL BlueGene installation.
Another approach uses the liblustre library to provide user space applications with direct file system access. Liblustre is a user level library that allows nodes to mount and use the Lustre file system as a client. Using liblustre, the nodes can access the file system even if the service node on which the job was launched is not a Lustre client. Liblustre allows data movement directly between application space and the Lustre OSSes without requiring an intervening data copy through the kernel, thus providing low latency, high bandwidth access from nodes directly to Lustre. Good performance characteristics and scalability make this approach the most suitable for using Lustre with HPC systems. Liblustre is the most significant design difference between Lustre implementations on systems such as Cray XT3 and Lustre implementations on conventional clustered workstations.
High availability features include a robust failover and recovery mechanism, making server failures and re-boots transparent. Version inter-operability between successive minor versions of the software enables a server to be upgraded by taking it off-line (or failing it over to a standby server), performing the upgrade, and re-starting it, while all active jobs continue to run. Users merely experience a delay.
Note: Lustre is being deployed for the Diamond synchrotron facility. It is used on the Jaguar system at ORNL which supports some 26,000 file system clients and 10PB of RAID6 storage. Lustre is also deployed on HECToR and will be part of the HPC Wales Grid with Fujitsu's Exa-byte File System (FEFS) as deployed in the K-computer in Riken. It is also used on WhamCloud supported by OpenSFS.
GPFS is an IBM proprietary high performance file system. GPFS provides higher i/o performance by ``striping'' blocks of data from individual files over multiple disks and reading and writing them in parallel. Other features provided by GPFS include high availability, support for heterogeneous clusters, disaster recovery, security, DMAPI, HSM and ILM. See http://www.ibm.com.
A file that is written to GPFS is broken up into blocks of a configured size, typically less than 1MB each. These blocks are distributed across multiple nodes, so that a single file is fully distributed across the disk array. This results in high read and write speeds as the combined bandwidth of the many physical drives is high. This however makes the file system vulnerable to disk failures, so to prevent data loss, the file system nodes also have RAID controllers. It is also possible to replicate blocks on different file system nodes.
Other features of the file system include the following.
For ILM, storage pools allow for the grouping of disks within a file system. Tiers of storage can be created by grouping disks based on criteria of performance, locality or reliability.
A file set is a sub-tree of the file system namespace and provides a way to partition the namespace into smaller, more manageable units. File sets provide an administrative boundary that can be used to set quotas and be specified in a policy to control initial data placement or data migration. Data in a single file set can reside in one or more storage pools. Where the file data resides and how it is migrated is based on a set of rules in a user defined policy.
There are two types of user defined policies in GPFS: file placement and file management. File placement policies direct file data as files are created to the appropriate storage pool. File placement rules are determined by attributes such as file name, the user name or the fileset. File management policies allow the file's data to be moved or replicated or files deleted. File management policies can be used to move data from one pool to another without changing the file's location in the directory structure. File management policies are determined by file attributes such as last access time, path name or size of the file.
The GPFS policy processing engine is scalable and can be run on many nodes at once. This allows management policies to be applied to a single file system with billions of files and complete in a few hours.
Note: GPFS is used at Daresbury Laboratory for HPCx, BlueGene, iDataPlex, POWER-7 and related services. It is also used at POL, the Proudman Oceanographic Laboratory and other UK academic data centres including many members of the UK HPC-SIG.
PanFS has roots in common with Lustre as they are contemporary designs from Carnegie Mellon University. Garth Gibson, CTO of Panasas and former professor at CMU was also a co-author of RAID in 1988.
The Panasas system is a specialised storage cluster. It uses per-file, client driven RAID, has parallel RAID rebuild, treatment of different classes of metadata (block, file, system) and a commodity parts based blade hardware with integrated UPS. It also has many other NOW standard features such as object storage, fault tolerance, caching and cache consistency and a simple management model.
The storage cluster is divided into storage nodes and manager nodes at a ratio of typically about 10:1. The storage nodes implement an object store, and are accessed directly from PanFS clients. The manager nodes control the storage cluster, implement the distributed file system semantics, handle failure recovery and can export the Panasas file system via NFS and CIFS.
Each file is striped over two or more objects to provide redundancy and high bandwidth access. The file system semantics are implemented by metadata managers that mediate client access to objects using the iSCSI/OSD protocol for read and write operations. I/O proceeds directly and in parallel to the storage nodes, bypassing the metadata managers. The clients interact with the metadata managers out-of-band via RPC to obtain access capabilities and location information for the objects that store files.
Object attributes are used to store file level attributes, and directories are implemented with objects that store name to object ID mappings. Thus the file system metadata is kept in the object store itself, rather than being kept in a separate database or some other form of storage on the metadata nodes.
Note: Panasas is used at Daresbury on several clusters and also at other NW-GRID sites. Panasas is expected to be used on the LANL Cielo system (Cray). It is used at other US government sites including Lawrence Berkeley, LLNL, NASA, ORNL and Sandia.
In this section we illustrate some use cases with collections of technologies and tools in use in typical data centric research environments.
Typical requirements are to have high bandwidth for parallel i/o within an HPC system, HSM for data migration and backup, distributed replicas for resiliance, multi-access within a data centre for pre- and post-processing. Remote access may be required with search facilities using metadata. It is likely that such a system will be coupled with a dedicate large memory server or other data intensive system and have the capability to export compressed data streams for remote visualisation.
The overall architecture of the SciDAC Data Management Center has been described in Section 1. For further details about the project, see [25]. In practice the project has adopted the following software technologies.
The Institute for Data Intensive Engineering and Science (IDIES) was founded in Apr'2009. It was based on the work of Alex Szalay and Jim Gray who provided a large scale database to allow astronomers to use SQL queries to extract data and execute user defined functions on 12TB data from the Sloan Digital Sky Survey. Applications from IDIES are explored further in [2].
This initial work has now been extended to a number of research domains including: turbulence, with a 27TB database; biology and environment, with 120M observations from forest sensor networks; data center monitoring using wireless sensors in collaboration with Microsoft; computer science research into data preservation and parallel query optimisation; data intensive architectures such as the GrayWulf and Amdahl Blade; 3D surface fitting such as in the Stanford Digital Michaelangelo Project and the LIDAR survey of New York City; neuro-science databases for statistical inference on EM and MR imaging data; Pan-STARRS asteroid database; Large Synoptic Survey Telescope data. Many of the services are publicly acessible via the Internet and have been used for ``crowd sourcing'' projects such as GalaxyZoo where users were asked to visually identify galaxy types.
Hardware available includes: GrayWulf, 50 Dell servers (500 CPU) and 1PB disk, Amdahl number to memory 1.0 and to disc 0.5; a 1,200 core cluster with 2TB memory connected using InfiniBand to database servers and the GrayWulf; 50 nodes with 100 nVidia GPUs to execute user defined DB functions out of process (SQLCLR); 36 node Amdahl Blade system, N330 dual core Atom, 4GB memory, 16 GPU cores total 76TB disk including SSD; visualisation facility producing 3D video streams from PB data sets with remote interaction; 10Gb/s dedicated connection to ORNL and UIC; proposed DataScope facility.
The Pacific Northwest National Laboratory's approach to data intensive computing initially focused on three key research areas. From 2006-10 they developed and combined new technologies to create capabilities to test: (1) enabling scientific discovery and insight applied to remediating the environment; (2) decision support and control in securing cyber networks; and (3) situational awareness and response in preventing terrorism.
Some propsed applications include: Social media analysis; Contingency analysis for the electric power grid; High-throughput video analysis; Understanding text documents; Architectural studies on multi-threaded languages; Chapel language for hybrid systems methods; Dynamic network analysis; Social network analysis; Irregular database and runtime systems; Compiler and runtime system; Performance analysis and tools; Communication software for hybrid systems. The systems are also used for applications such as un-structured mesh generation and machine learning.
The National HPCx service ran from 2002-2008, and NW-GRID started in 2005. Technologies deployed include:
Note: HECToR is using Lustre.
CERN currently uses CASTOR-2 to manage storage of LHC experimental data. The system design goals are to provide reliable central recording of the data, plus transparent access. Much of this access is intended to be from remote sites around the world via data Grid tools such as SRM. CASTOR itself is designed around a mass storage (tape) system and is therefore not suitable for deployment at sites without this facility.
CASTOR-2 is such a large installation it only exists at the largest sites in EGEE. These are CERN, plus three of the Tier-1 centres including RAL (UK), CNAF (Italy) and ASGC (Taiwan). The instance at CERN currently manages more than 50 million files and 5PB of data. The work at RAL [] is standardising on CASTOR-2 plus some additional e-Science software components for management of data across all of the STFC experimental facilities. Variants are already in use on DLS, ISIS and CLF.
CASTOR-2 has been designed around a central Oracle database, which handles, as much as possible, the state of the system. The resilience of this is key to CASTOR's reliability. Around this database core all daemons are designed to be stateless, allowing for redundancy and daemon restarts without loss of service.
The Stager is a key component responsible for file migration. This manages the disk pools in front of the tape system, see Figure 3.
In order for the Stager to manage access to files on the disk pool it uses a scheduler plugin. This balances the load on the disk servers and can also provide policy such as ``fair share'' for file access. The scheduling problem for disk pools is similar to that faced in compute clusters, so CASTOR currently uses the commercial LSF batch scheduler from Platform Computing (recently acquired by IBM).
CASTOR has the ability to dynamically replicate ``hot'' files and even to switch access on an open file to a less busy replica.
The Name Server implements a hierarchical ``directory'' view of files stored in CASTOR. Files may actually be segmented, replicated and stored on various media in the system. The Name Server interface implements functionality required by the Posix standard.
Catalogue services are provided by ICAT which has a metadata database based on CSMD (the Core Scientific Metadata Model) [15]. ICAT also includes user session management. Functionality is provided via a web service so that a variety of client applications can be used. ICAT is available as open source from http://code.google.com/p/icatproject/ and has active support and development. The TopCat client provides a browser based user interface for context aware search and data access via ICAT. TopCat is available as open source from http://code.google.com/p/topcat. TopCat has replaced DataPortal. Access control of metadata and data is supported. The releases of TopCat and ICAT are coordinated to ensure continuing compatibility. ICAT therefore responds to the need to make scientific data collected from experimental facilities available and re-usable, see Appendix B. The data itself is stored, curated and retained in other systems such as CASTOR. ICAT captures the context of the data including provenance and relevant experimental conditions.
Monitoring of the CERN system is carried out by integration with the LHC Era. Alarms are issued when any abnormal conditions are detected. Logging is done into the Oracle database, allowing the central gathering of information, plus the ability to cross query logs from different services.
This work was partly funded by EPSRC through the SLA with Daresbury Laboratory.
The author would like to thank the following people for input:
June Finch (NeISS Project and University of Manchester), Simon Hodson (JISC), Jason Lander (NGS and University of Leeds), Brian Matthews (RAL), Alistair Mills (RAL), David Corney (RAL), Terry Hewitt (WTH Associates Ltd.), Anthony Beitz (Monash University, Australia), Paul Watry (Liverpool),
http://www.rcuk.ac.uk/research/Pages/DataPolicy.aspx
STFC, through the facilities it operates and subscribes to and the grants it funds, is one of the main UK producers of scientific data. This data is one of the major outputs of STFC and a major source of its economic impact. STFC, as a publicly funded organisation, has a responsibility to ensure that this data is carefully managed and optimally exploited, both in the short and the long term.
This policy applies to all scientific data produced as a result of STFC funding:
This includes data produced as a result of past funding from STFC or its predecessor organisations (e.g. PPARC, CCLRC) which has already been curated.
This policy does not apply to:
This document was generated using the LaTeX2HTML translator Version 2008 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -local_icons -split 3 -html_version 4.0 LargeData
The translation was initiated by Rob Allan on 2012-04-19