Data Intensive Computing

Rob Allan

Computational Science and Engineering Department,

***** D R A F T *****

STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD


This report looks at a variety of requirements for and approaches to data intensive computing. It also considers novel ways to extract information from large data sets, particularly if this can be carried out ``in situ'' avoiding the need for lengthy file transfers.

© Science and Technology Facilities Council 2011. Neither the Council nor the Laboratory accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.


An introduction to data intensive computing is given in Wikipedia There, it is defined as a class of parallel computing applications which use a data parallel approach to processing large volumes of data typically terabytes or petabytes in size. They devote most of their processing time to I/O and manipulation of data rather than computation. This is orthogonal to what we are familiar with from modelling and simulation on high performance computing systems.

Data intensive computing refers to capturing, managing, analysing and understanding data at volumes and rates that push the frontiers of current technologies. The challenge is to provide the hardware architectures and related software systems and techniques which are capable of transforming ultra-large data into valuable information and ultimately knowledge.

The National Science Foundation has noted that data intensive computing requires a ``fundamentally different set of principles'' to other computing approaches. They funded a programme to seek ``increased understanding of the capabilities and limitations of data intensive computing''. The key focus areas are:

Pacific Northwest National Labs, responding to the NSF call, defined data intensive computing as ``capturing, managing, analyzing, and understanding data at volumes and rates that push the frontiers of current technologies''. They believe that to address the rapidly growing data volumes and complexity requires ``epochal advances in software, hardware and algorithm development'' which can scale readily with size of the data and provide effective and timely analysis and processing results.

There are several important common characteristics of data intensive computing systems that distinguish them from other forms of computing.

  1. The principle of co-locating the data and applications or algorithms. To achieve high performance in data intensive computing, it is important to minimise movement of data. For this reason it is useful to execute applications on the nodes where the data resides. High bandwidth and low latency networking using technologies, such as InfiniBand which enables RDMA, allow data to be stored in a separate nearby repository and provide performance comparable to data on local disk.

  2. The programming model used. Typical data intensive computing applications are expressed in terms of high level operations on data, and the runtime system transparently controls the scheduling, execution, load balancing, communications and movement of computation and data across the distributed computing cluster. The programming abstraction and language tools allow the processing to be expressed in terms of data flows and transformations incorporating new programming languages and shared libraries of common data manipulation algorithms such as sorting. A database is often used as optimisations are well known.

  3. A focus on reliability and availability. Data intensive computing systems must be designed to be fault tolerant. This typically involves redundant copies of data on disk, storage of intermediate processing results on disk, automatic detection of node or processing failures and selective rollback or re-computation of results. Database technologies are also used for this purpose.

  4. The inherent scalability of the underlying hardware and software architecture. Data intensive computing systems can typically be scaled in a linear fashion to accommodate virtually any amount of data, or to meet time critical performance requirements by simply adding additional processing and storage nodes. The number of nodes and processing tasks assigned for a specific application can be variable or fixed depending on the hardware, software, communications and distributed file system architecture.

Note that these criteria implicitly preclude in-memory solutions (but see more below).

A variety of system architectures have been implemented for data intensive computing and large scale data analysis applications including parallel and distributed relational database management systems which have been available to run on clusters for more than two decades. This assumes the data is structured and can be mapped onto a set of connected tables for manipulation. However most data growth is with data in un-structured ore semi-structured form and new processing paradigms with more flexible models are needed. Emerging solutions include the MapReduce architecture pioneered by Google [8] and now available in an open source implementation called Apache Hadoop, see [46,2,1] (note Dryad/ LINQ has been used on Windows platforms).

Amdahl's Laws

Amdahl's Laws are as follows.

  1. Amdahl's parallelism law: If a computation has a serial part which takes time S to execute and a parallel component which takes time P/N to execute on N processors, then the speedup on N is (S+P)/(S+P/N). The maximim speedup is therefore (S+P)/S.
  2. Amdahl's balanced system law: A system needs one bit of I/O per second per instruction per second.
  3. Amdahl's memory law: alpha=1: that is the MB/MIPS ratio, in a balanced system is 1. That is one byte of memory per cpu instruction cycle.
  4. Amdahl's I/O law: Programs do one I/O per 50,000 instructions

Looking at Amdahl's balanced system law, we typically see the following.

The GrayWulf

We should be aware that with multi-level caching, a much lower I/O to MIPS ratio coupled with a large enough memory can still provide satisfactory performance [37]. This is however only true if the problem fits in memory.

Typically, cache based systems with branch prediction and speculation hardware in CPU are not useful for applications which require streaming data. Here the order of magnitude mis-match between disk memory and CPU bandwidth hampers the processing. A more balanced system is required. Both FPGAs and GPUs have been investigated for this purpose. SSD devices might also be used to reduce the bandwidth to disk and they significantly reduce the seek time for un-structured data. Some data intensive benchmarks were created to test different architectures by Gokhale et al. [11].

Special purpose computers based on Amdahl's balanced system law are built to process large volumes of data that will not fit entirely in memory. The original machine designed by Jim Gray of Microsoft is now known as the GrayWulf [37]. This is typically a good solution for university groups who host their own large data sets (it is difficult to move over 1 TB of data over the internet) but who could maintain a reasonably compact commodity cluster solution.

The goals of the GrayWulf project were as follows.

It is found, for instance by Gokhale et al. [11] that using special purpose hardware can remove the CPU bottlneck and ultimately the system becomes limited by bandwidth to disk.

Some examples of machines based on this concept are given below.


The original GrayWulf at Johns Hopkins University has 416 CPUs capable of 1,107 GI/s with a total memory of 1,152 GB. Disk I/O performance was a total of 70 GB/s from just over 1PB of storage. This results in an Amdahl memory number of 1.04 and I/O number of 0.506. The modular structure of this solution is shown in the following table.

No. Servers CPU ea Memory ea Disk ea Interconnect
Wide area interconnect, 10Gb/s FC
Tier 1
2 16 core 128GB 11.25TB InfiniBand
  Dell R900   1x MD1000 as below QLogic 20Gb/s
Tier 2
4 16 core 64GB 33.75TB InfiniBand
  Dell R900   3x MD1000 as below QLogic 20Gb/s
Tier 3
40 8 core 16GB 22.5TB InfiniBand
  Dell 2950 2.66 GHz   2x MD1000 SAS total 30x 750GB 7,200RPM SATA discs QLogic 20Gb/s

Further details are given in [37].

This is basically a distributed SQL server cloud. Applications are run on the Tier-1 servers which are also used for remote access and data transfer.


A second example of a more energy efficient solution has been built using Atom/ Ion technology.

No. Servers CPU ea Memory ea Disk ea Interconnect
Tier 1
36 2x Atom + 16x GPU 4GB 1x 120GB SSD  
  Zotac N330 Intel Atom/ nVidia Ion 1.6GHz   + 2x 1TB Samsung F1  

In this case, the Ion GPUs are used typically for data mining over astronomy image data. They can be used for other user-defined functions executed as out of process SQL procedures.


The Edinburgh Data Intensive Machine 1 was built recently also using Atom technology.

No. Servers CPU ea Memory ea Disk ea Interconnect
Tier 1
120 2x Atom + 16x GPU 4GB 1x 256MB SDD  
  Zotac N330 Intel Atom/ nVidia Ion 1.6GHz + 3x 2TB HDD    


The Gordon system at SDSC was launched on 6/12/2011. At the time of writing it is the world's largest flash memory based super-computer. It was developed over a period of two years based on the Dash prototype and supported by an NSF Track-2 grant of by a $20M. Gordon thus represents the first really big purpose built super-computer for data intensive applications. One aim was to pioneer the technology for the wider enterprise market.

The intention of SDSC and the NSF is to draw in data intensive science codes that have never had a platform this size to push the envelope. An example is world-wide societal evoluation. Another example is genomics, a specific design target of the system engineers. Genomics is the classic "big data" science problem, and is the one most frequently cited in HPC circles as suffering from the data deluge crisis. Other application areas like graph problems, geophysics, financial market analytics and data mining are targeted.

The system is an evolution of the Appro HPC cluster, using the third generation Xtreme-X architecture and Intel 8-core SandyBridge Xeon E5 CPUs prior to GA. There are 1,024 dual socket nodes each with 64GB of DDR3 memory and a peak performance of 280 Tflop/s. The system has over 300 TB of Intel solid state ``flash'' disks, spread over 64 I/O nodes.

Despite each Intel SSD being relatively slow, the aggregate IO/s (IOPS) performance of the machine is impressive - on all 64 I/O nodes it achieves a peak output of 36M IO/s. This has to be converted to aggregate bandwidth using a figure of around 250 MB/s for 8kB read operations per SSD yielding 256 GB/s.

Operating system is based on ScaleMP's virtual SMP (vSMP) technology. It allows users to run large memory applications on what they call a "super-node" - an aggregation of 32 compute servers plus two I/O servers, providing access to 512 cores, 2 TB of RAM and 9.6 TB of SSD. To a program running on a supernode, the hardware behaves as a big cache coherent server. The system can be partitioned into 32 super-nodes. The system is connected to a disk sub-system via NFS and Lustre, currently 150 TB but planned to rise to 4 PB by July 2010.

The flash device being employed is Intel's new iSolid-State Drive 710. The 710 uses Intel's High Endurance Technology (HET), which is Intel's version of enterprise multi-level cell (eMLC) flash memory. Like eMLC, the HET flash features the performance and resiliency of single level cell (SLC) flash, but at a much lower cost. SDSC also developed its own flash device drivers to maximize performance of the SSD gear.

Inserting this much flash memory into a supercomputer had never been attempted before, and this aspect was probably the biggest risk for the project. When they began the Gordon effort two years ago, flash memory was just starting to make its way into enterprise storage and was an expensive and un-proven technology.

The system is currently undergoing acceptance testing and is expected to be available for production use dedicated to XSEDE users (the evolution of TeraGrid) on 1/1/2012.

No. Servers CPU ea Memory ea Disk ea Interconnect
32 512 2TB 9.6TB Mellanox QDR IB
Compute Nodes
1,024 16 core 64GB   Mellanox QDR IB
dual socket SandyBridge 2.6GHz DDR3   dual 3D torus
I/O Nodes
64 12 48GB 4.8TB Mellanox QDR IB
dual socket Westmere 2.67GHz DDR3 Intel iSSD 710 dual 3D torus
      16x 2Gb/s  

This gives an Ahmdahl memory number of 1.53 and Amdahl I/O number of 0.048.

A summer institute was held at SDSC in 2011 at which applications were suggested in the areas of: societal evolution; data mining; de novo genome assembly; asteroid search and discovery using LSST data; graph analysis; real time data; seismic hasard evaluation;

Graph500 Benchmark

The Graph500 benchmark is intended to guide the design of hardware architectures and software systems to support data intensive applications of which graph based algorithms are a core part. The benchmark currently includes: (i) concurrent search; (ii) optimisation (single source shortest path); and (iii) edge oriented (maximal independent set). The output is a metric of TEPS, traversed edges per second. Results are typically in the millions or billions.

The Web site contains further information and provides the source code.

Software Approaches

We now consider software for data intensive systems. Much of this discussion is based on [34]. We first consider the use of a database approach for structured data sets. A high level language based on SQL is typically used to encode required operations which can be uploaded by users.

Most scientific data analyses (part of Information Lifecycle Management) are performed in hierarchical steps. During the first pass, a subset of the data is extracted by either filtering on certain attributes (typically removing erroneous or unwanted data) or extracting a vertical subset of the columns. In the next step, data are usually transformed or aggregated in some way. In more complex datasets, these patterns are often accompanied by complex joins among multiple attributes, such as external calibrations or extracting and analysing different parts of a gene sequence. For very large datasets, the most efficient way to perform most of these computations is clearly to move the analysis functions as close to the data as possible. Many of the patterns that have been identified are easily expressed by a set oriented, declarative language whose execution can benefit enormously from cost based query optimisation, automatic parallelism and indexing.


These features were recognised by Jim Gray and his collaborators at Microsoft and sometimes referred to as ``Gray's Laws''. They have shown in several projects that existing relational database technologies can be successfully applied in this context [34]. It is also possible to integrate complex functions written in a procedural or other languages (such as C or R) as an extension of the underlying database engine, as exemplified by the work at Johns Hopkins University.

In recent years, MapReduce has become a popular distributed data analysis and computing paradigm [8], see below. The principles behind it resemble the distributed grouping and aggregation patters that already existed in parallel relational database systems. Such capabilities are now being referred to as ``MapReduce in the database''. Benchmarks to compare different approaches are being developed [26].

It was found by Szalay et al. that data operations performed by users of the Sloan Digital Survey database [37] exhibit a 1/f power law distribution. Most queries are very simple single row lookups based on simple indices such as celestial position (high frequency, low volume). A small number of analyses did not use pre-computed indices, and required lengthy sequential scans through the data possibly combined with merge-join operations. These can take over an hour on a multi-terabyte database. Other access patterns involved multiple simple scripts to extract a series of results, a kind of database ``crawler''. Application of such scripts might be combined into a more complex workflow which can be executed from a batch queue.

For efficient analysis of structured data. it is important to start with an appropriate schema (for metadata) and database design (table layout and connectivity, data ordering, etc.). It is notoriously hard to modify these on an ad hoc basis. Given these requirements it is then possible to provide a set of sample SQL scripts which might meet the requirements of most users. The database design should aim to accommodate these scripts in the simplest way. The scripts can be extended to do more complex analyses.

To facilitate access and reduce I/O requirements the data must be factored in different ways which reflect requirements, for instance it could be divided in space or time into blocks which can be read in and processed in parallel. Typically this will result in a hierarchical database architecture. This is explained in further detail with examples from the Pan-STARRS and SLOAN DB in [34].

Management of the database and servers, in particular to ensure fault tolerance, is also described in [34].

A number of management tasks can be codified into workflows as follows.

Load Data:
Data streaming from sensors and instruments or simulations may need to be assimilated into existing databases. The load workflow does the initial data format conversion from the incoming data format into the database schema. This may involve data translation, verifying structure consistency, checking quality, mapping the incoming data structure to the database schema, determining optimised placement of the ``load databases'', performing the actual load and running post-load validations.
Merge Data:
The merge workflow ingests the incoming data in the load databases into the ``cold'' databases and re-organises the incremented database in an efficient manner. If multiple databases are to be loaded, there may be additional scheduling required to do the updates concurrently and expose a consistent data release version to the users.
Data Replication:
The data replication workflow creates copies of databases over the network in an efficient manner. This is necessary to ensure availability of adequate copies of the databases, replicating databases upon updates or failures. The workflow coordinates with job schedulers to acquire locks on ``hot'' and ``warm'' databases, determines data placement, makes optimal use of I/O and network bandwidth and ensures the databases are not corrupted during copy.
Complete scans of the available data are required to check for silent data corruption, or to update object data statistics. These workflows need to be designed as minimally intrusive background tasks on live systems. A common pattern for these crawlers is to ``bucket'' portions of the large data space and apply the validation or statistic on that bucket. This allows the tasks to be performed incrementally, checkpoint, recover from faults at the last processed bucket, balance the load on the machine and enable parallelisation.
Fault Recovery:
Recovering from hardware or software faults is a key task. Fault recovery requires complex coordination between several software components and must be fully automated. Fault recovery can be modelled using state diagrams that can then be implemented as workflows. These may be associated as fault handlers for other workflows or triggered by a monitoring event. Occasionally, a fault can be recovered in a shorter time by allowing existing tasks to proceed to completion; hence look ahead planning is required.


Some authors, e.g. [20] have claimed that MapReduce is the most successful abstration to date to go beyond the classic von Neumann model of computing. It is aimed at developing scale free data intensive applications.

MapReduce is a software methodology introduced by Google in 2004 to support distributed computing on large data sets on loosely coupled compute clusters. The framework is inspired by the map and reduce operations commonly used in functional programming, but in a modified form. MapReduce libraries have now been written in C++, C#, Erlang, Java, OCaml, Perl, Python, PHP, Ruby, F#, R and other languages.

Such abstractions manage complexity by hiding details and presenting well defined behaviours to users. They make certain tasks easier but others more difficult, and sometimes impossible. It may be only the first in a new class of programming models that will allow us to more effectively organise computations at a massive scale, and extensions are already being suggested.

MapReduce frameworks were developed for commercial information retrieval, which is perhaps currently the world's most demanding data analysis problem. Exploiting commercial approaches offers a good chance that one can achieve a high quality robust solution. MapReduce now has both commercial and open source implementations. Qiu et al. [29] looked at MapReduce and MPI, and showed how to analyse biological samples with modest numbers of sequences on a multi-core cluster. To support iterative operations they evaluated the open source Java i-MapReduce [51] software which uses Apache Hadoop, see [46,2]. Another iterative MapReduce implementation is Twister [9,47].

Iterative data intensive computation is pervasive in many applications such as data mining or social network analysis. These typically involve massive data sets containing at least millions or even billions of records. MapReduce in its original form lacks built in support for iterative processes that require parsing data sets repeatedly and doing book keeping. Besides specifying MapReduce jobs, users have to write a driver program that submits multiple jobs and performs convergence or other testing at the client. In a system such as i-MapReduce, users specify the iterative computation with the separated map and reduce functions. The system provides support for automatic iterative processing without the need of a user defined driver program. As well as making it easier to use, this can improve the performance by: (1) reducing the overhead of creating new MapReduce jobs repeatedly; (2) eliminating the shuffling of static data; and (3) allowing asynchronous execution of map tasks.

Whilst specially optimised block based file systems such as Hadoop FS have been developed to support MapReduce [2,1], there are also versions implemented in MPI such as the library from Sandia National Labs [28,27] and Indiana [15]. Interfaces from Sandia are available for C, C++, Python and OINK (c.f. PIG, the scripting language for Hadoop). It has for instance been used to parallelise BLAST and SOM algorithms [35]. Another is YAMML, a Google code prototype. Other are mentioned in [1].

There are several database like implementations on top of Hadoop, for instance Hive which offers an SQL like interface.

Geoffrey Fox (Indiana University) has compared MPI with iterative MapReduce. He noted that MapReduce is more dynamic and fault tolerant than MPI and it is simpler and easier to use. Both Fox [9] and Chen and Scholosser [7] noted that it however requires some extensions and an iteration framework to make it useful for a wider range of applications. Some applications listed in papers on MapReduce are as follows.

K-means clustering:
divides data points into a pre-defined number of clusters based on mean ``distance''
Matrix multiplication:
using traditional row and column decomposition. Maps columns onto processors then iterestes over rows and reduces the partial blocks into row blocks for output
Sequence comparison:
Smith Waterman Gotoh pair wise distance algorithm
sequence assembly program for mRNA analysis
Gene Analysis Tool Kit
Network analysis:
MDS (multi-dimensional scaling) compared to GTM (Generative Topographic Mapping) using GraphViz for visualisation of output graphs. These are examples of non-linear dimensionality reduction applied to interpret large data sets and for statistical analysis.
A computational biology code from Microsoft. PhyloD is a statistical tool that can identify HIV mutations that defeat the function of the HLA proteins in certain patients.
data analysis originally coded in ROOT and CINT.
effectively a recursive Markov Chain algorithm which iterates to convergence
Graph search:
a breadth first search algorithm by Cailin uses iterative MapReduce
Word count:
simple counting algorithm adds up occurrences of words in files
Mesh refinement:
using hierarchically tiled arrays (HTA), e.g. with htalib [5]
Machine learning:
e.g. classification algorithms, image recognition, ranking
BLAST and SOM algorithms have been implemented using both Twister and MPI-MapReduce [35,47]
Self Organising Map, for dimensionality reduction and clustering or classification used in bio-informatics, see above
Agent Based Modelling and Simulation has been implemented by Wang et al. [45] using their own BRACE framework and BRASIL scripting language.
there has been at least one attempt to implement a Molecular Dynamics algorithm using Hadoop [13]. Other uses of MapReduce are in trajectory analysis.

We note that applications written in languages other than Java can also be used on Hadoop with the Pipes library [46]. There are several projects to integrate with R, for example RHIPE from Purdue University.


Dryad [16] is a Microsoft Research project developing an execution environment for data parallel applications expressed as directed acyclic graphs (DAGs). In general this ongoing programme of work is investigating programming models.


Pregel [22] is another technology developed by Google to mine information from graphs. Examples include road networks (maps), the Internet, etc. Pregel uses three design patterns based on MapReduce [21] to extract information about the nodes and vertices. This is an inherently iterative algorithm with a similar approach to BSP [44].

It is over 20 years since Les Valiant and Bill McColl started working on Block Synchronous Programming, and interest in BSP is now growing again. Pregel is a BSP system for graph analytics at scale and is now a major element of Google's internal big data infrastructure. Within the Hadoop community, BSP models are now being used to achieve much higher levels of performance than can be achieved using MapReduce.


Some examples of data intensive work in the UK and elsewhere which could presumably benefit from access to a larger facility and some joint development work.

Life Sciences: Biology and Bio-informatics

In biology, an obvious set of examples is represented by the so called next generation sequencing methods for nucleic acids. Recent reviews of these are given by [23,33]. Big databases with fast access and information visualisation look like being important for this area of biology.

Modelling biological networks is inherently data driven and data intensive. The combinatorial nature of this type of modelling, however, requires new methods capable of dealing with the enormous size and irregularity of the search. Searching via ``back tracking'' is one possible solution that avoids exhaustive searches by constraining the search space to the sub-space of feasible solutions. Despite its wide use in many combinatorial optimisation problems, there are currently few parallel implementations of backtracking capable of effectively dealing with the memory intensive nature of the process and the extremely un-balanced loads present.

Some other topics include: bio-medical studies in gene selection and filtering; feature selection algorithms for mining high dimensional DNA micro-array data; random matrix theory to analyse biological data.

Bio-informatics in this context is primarily concerned with sequence studies. Qiu et al. [29] provide a detailed description of current approaches to this.


Some topics include: processing extreme scale datasets in the geo-sciences; geo-spatial data management, e.g. with TerraFly (from NASA); parallel earthquake simulations on large scale multi-core super-computers.

Social Science and Security

One application is to discovering relevant entities in large scale social information systems. This typically involves graph based network analysis. A breadth first search will compute all nodes in the tree which can be accessed from a given node and gives the shortest path between them. This is done by processing an adjacency list iteratively down each connected node.

The adjacency list can be repressented as a sparse adjacency matrix. In Cailin's method, colouring is used to identify if a node is yet to be explored, is being explored or its depth has been found. This illustrates a problem with the basic Hadoop implementation. Since the algorithm is iterative, a global condition must be set to ensure termination when there are no more links still being explored. There is no support in the Hadoop framework to do this. See

Evolutionary Biology

Example of such work is from Prof. Mark Pagel FRS, University of Reading.

His group builds statistical models to examine the evolutionary processes imprinted in human behaviour, from genomics to the emergence of complex systems, to culture. Their latest work examines the parallels between linguistic and biological evolution by applying methods of phylogenetics, the study of evolutionary relatedness among groups, essentially viewing language as a culturally transmitted replicator with many of the same properties we find in genes. They are looking for patterns in the rates of evolution of language elements, and hoping to find the social factors that influence trends of language evolution.

Software and projects are based on Bayesian analysis and visualisation.

Astronomy, Astrophysics and Cosmology

Examples are from Jeremy Yates, DIRAC Consortium, see

Researchers at 27 HEIs in the UK are studying the properties of sub-atomic particles, cosmology and the physics of the early universe, the formation of large scale structure and galaxies, the properties of galaxies, the formation of stars and planets, astro-chemistry and the inter-stellar medium, the properties of exo-planets and planetary atmospheres, the physics of the sun and solar system, the properties of nano-particles, molecules and atoms in astrophysical conditions.

Software and projects include: AstroGrid; COSMOS; VIRGO; Miracle.

Elsewhere, projects like the Large Synoptic Sky Telescope, LSST, will begin collecting data in 2012 at a rate of approx. 500 MB/s which will need to be processed in real time. SWarp is used in the processing pipeline and is based on a Lanczos re-sampling filter [4]. This normalises the images to the sky and makes it possible to recognise new or changed features by comparing images.

Data Mining

Some applications of data mining are in: Science - Chemistry, Physics, Medicine; Biochemical analysis, remote sensors on a satellite, Telescopes, star galaxy classification, medical image analysis. Bioscience - Sequence-based analysis, protein structure and function prediction, protein family classification, microarray gene expression. Pharmaceutical companies, Insurance and Health care, Medicine - Drug development, identify successful medical therapies, claims analysis, fraudulent behavior, medical diagnostic tools, predict office visits. Financial Industry, Banks, Businesses, E-commerce - Stock and investment analysis, identify loyal customers vs. risky customer, predict customer spending, risk management, sales forecasting.

ADMIRE, Advanced Data Mining and Integration Research for Europe, is an EU project which includes EPCC and NeSC, see

The main application areas addressed are: gene annotation, flood forecasting using data mining on observed data linked to simulations, analytical CRM from call centre data, search for quasars using digital telescope data, analysis of seismology databases.

One of the deliverables of the project is DISPEL, a parallel process engineering language.

The ADMIRE software stack is open source, see and

Other software which support data mining includes: Weka, RapidMiner, Fastlib, MatLab, R, Octave, Phoenix MapReduce (a shared memory implementation from Stanford written in C++).

AI and Computational Linguistics

Language processing is becoming increasingly important, for instance in analysis of text on the Web. This is not confined to the English language. It requires processing large document streams.

Some work at University of Bristol, group won the 2009 Morpho challenge for machine learning. Other work by Prof. Paul Watry, University of Liverpool, see

Current focus is on developing and implementing technologies in the areas of computational linguistics and textual analysis. Current work in understanding the set of relationships that exist between human and machine defined semantics and exploring the application of semantic grid technologies to improve discovery and generate knowledge. Related work in the analysis of electronic document structure, data and text mining, natural language processing, corpus linguistic tools (information retrieval), annotation and visualisation technologies, persistent archives and new media.

Software and projects from the Liverpool group include: Cheshire III Digital Library software; iRODS; SHAMAN; PrestoPrime. See also

One processing techique is to use n-grams consisting of a sequence of n characters or words. The algorithm extracts the t most frequent n-grams found. See Wikipedia Document profiles can be compared, e.g. to determine their similarity or what language they are written in by closeness to the language training set profile. Languages can be modelled statistically as n-grams, based on Shannons' theories of information. These are sometimes Markov models. FPGAs can be used for this purpose, e.g. via a parallel Bloom filter as a probabilistic test of set membership [11]. N-grams are also used in gene sequence analysis.

Graph analysis is another application described in [11].

There is a book on text processing with MapReduce [20].

Data Analytics

This typically refers to what is done with customer derived commercial data using SQL or a combination of SQL and MapReduce. It can yield information used in: predictive and granular forecasting; trend analysis and modelling; credit and risk management; fraud detection; cross platform advert and event attribution; cross platform media affinity analysis; merchandising and packaging optimisation; service personalisation; graph analysis; consumer segmentation; consumer buying patterns and behaviour; click stream analysis; compliance and regulatory reporting.

Sometimes the acronym OLAP is used for On Line Analytic Processing. It can be found that an RDBMS approach does not scale to the size requires for this kind of semi-structured data, and solutions like Hadoop Hive and PIG are used instead.

MapReduce Design Patterns

In this section we expand on MapReduce programming with examples of some commonly found design patterns which can be used to construct applications.

For more information about some of the examples mentioned above see

See also Google code tutorial

Keys and Values

All data objects in MapReduce are referenced by keys and values which are defined by the user. These are handled as key value pairs or as key multi-value sets .

The first thing to decide when implementing an algorithm is what are the keys. Keys are objects can be simple data types or structures. Some examples are:

a string: for instance each word from a text in the word sort or related examples
a vertex (or node): for instance v=(i,j) can refer to an element of a 2D array or v=(i,j,k) a 3D array or similar entity in a graph

One or several values is associated with a key, again values can be simple data types or structures.


A MapReduce framework relies on functiona callbacks. Thus, the user will define the map and reduce functions which are invoked by the framework.

The map operation translates input keys and values to new keys and values .

The framework sorts the keys and groups each unique key with all its values .

The reduce operation translates the set of values for each key to a new key and value .

The map and reduce callbacks need to be able to handle the types of objects defined for the keys and values. It may also be necessary to provide a callback function for the sort operation if the values are not simple types.

MapReduce Job

It is important to remember a few things about how the framework functions. A ``job'' is referred to as a single map operation with a reduce operation. The grouping and sorting is done by the framework. An iterative application may require to do many MapReduce jobs, typically data will be output to local file store between jobs or held in memory.

Map and reduce operations are similar to those in functional languages like Lisp. The data is distributed and a process run on each CPU against its local data.

MapReduce is particularly suitable for graph based algorithms. Any sequential data dependencies should be avoided, e.g. do breadth first, not depth first. MapReduce applications can typically be represented as a network diagram.

[example using GraphViz]

The ``map'' operation is required, and is typically a parallel process which distributes the input data across processors according to the keys. This can be assumed to perform some load balancing. The result of each map is a set of zero or more KV pairs, but distributed such that a key may appear more than once.

The ``aggregate'' operation re-maps the KV pairs such that all occurences of a particular key are on the same processor. It also performs load balancing again, possibly using a hashing function. This is an all-to-all communication method.

The ``convert'' operation groups takes the set of values for each unique key and converts to a KMV object. This should be done following the aggregate step and is an on-processor operation. Typically the two operations are combined into one, such as ``collate''.

The ``reduce'' operation is optional, but if present will operate over the keys in ``sort order'', but keys may still be randomly distributed across reducers (processors). Thus the KMV set is un-ordered. The result of each reduce is a set of zero or more KV pairs.

Thus a skeleton MR job might begin to look as follows:

  MapReduce *mr = new MapReduce(MPI_COMM_WORLD);
  mr->map(input, map_callback);
  mr->reduce(reduce_callback, output);

Other things to note are as follows:

Filter, grep

Read in all lines in a file. The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count, sum, index

The word count example is probably the most common one seen in MapReduce tutorials. One or more text (or other) files are read in and the mapper selects words (or other entities) and assigns values to them, initially 1 for each word, i.e. . These are sorted and gathered such that identical words are on one processor . The reducer can then output the final KV pair s .

In the indexing application, instead of a count the value stored is . This is reduced to a list of document ids or pages on which the word occurs.

Reverse links

Read content from a Web URL. The map function searches for target URLs in the content and associates them with the source URL as a KV. These are gathered to a KMV and and reduced to . This can be further processed using the Graph methods explained below.

Parse, convert

Graph search

Figure 1: Simple Graph
Image simple_graph

This is typical of a garph processing algorithm, and uses a breadth first search as explained by Cailin Nelson. It is not optimised.

A graph as in Figure 1 can typically be stored as an ``adjacency list'' connecting its nodes, e.g.

This can be represented using the node as key and values as structures as follows . One node, e.g. ``a'' is assigned at the start to be at distance zero and coloured ``grey''. Other nodes are coloured ``white'' and are yet to be processed. All other distances are initially ``unknown''.

The algorithm is iterative. At each map step we take all grey nodes and expand their adjacency lists which are known to be at the next largest distance as follows. This adds more KV pairs. The nodes which have been processed are then marked ``black''.


Now the set is sorted and gathered to give the following.


The reducers combine occurences of the same key as follows to give all nodes (in this case b and e) at distance 1 from the source.


This proceeds in a similar fashion until there are no more grey nodes; either they are all black or some remain white which are not connected to the source node. The resulting rank is the distance from the source or unknown if not connected.

Bin, collate

distributed tasks

total sorting

chained jobs

group by


secondary sort

co-grouping, joining

distributed total sort



One algorithm used is K-means. This looks at points distributed in an n-dimensional space and attempts to use a Euclidian distance measure to assoociate points with cluster centres. The centres are refined iteratively.

Assign random set of cluster centres at start and read in all points. Map tasks calculates Euclidean distance from each point in its partition to each cluster center. Map tasks assign points to cluster centers and sum the partial cluster center values. KV is cluster center sums + number of points assigned. Reduce task sums all the corresponding partial sums and calculate new cluster centers.

This process is iterated until a stable result is found.


dimension reduction

evolutionary algorithm

Matrix multiplication

This description is written in terms of row and column elements, but should be generalised to row and column blocks.

C=A*B, simplified to square (N*N) matrices.

Take rows i from matrix A and columns j from matrix B. The key is which is the element index for output matrix C. Input values are which must be reduced to .

In the Twister approach, rows of A are read in and processed iteratively until the whole output matrix is complete. The trick here is to avoid having to hold the entire 2*N*N data set which could be very large. Reading sucessive rows required only N+(N*N) elements to be handled on each iteration to compute output row C(i,*). The map operation ensures that N+N elements are multiplied on each of N processors.

More Complex Examples

There is a small but growing on-line literature addressing more complex algorithms using MapReduce based patterns. Just a few are listed here.

Radix-2 FFT, outline algorithm only by J. Chen

Large scale integer maths using Hadoop, presentation by Tsz-Wo Sze plus separate papers by the same author [38,39].

A number of authors have presented matrix multiplication algorithms: J. Norstad; G. Fox; S. Seo et al. [31]; E.J. Yoon;

Conjugate Gradient method for solving linear systems: S. Seo et al. [31];

Data Exploration

This section discusses some other ``non-traditional'' ways to analyse large data sets resulting from numerical modelling and simulation.


Visualisation typically uses large volumes of data. Rendering is carried out to view and study parts of the data, e.g. to produce surface or volume images of multi-dimensional data. It is usual to want to scan through this data interactively, viewing successive parts, for instance rendering slices through a volume to explore the interior structure. Other activities are to map functions using artefacts such as streamlines or vectors (hedgehog plots). It is also likely that these will be composed together as a ``movie''. Some examples of work performed on HECToR are given by Turner and Leaver [43].

The visualisation workflow is thus to extract a part of the data and then manipulate and render it. In terms of what is commonly referred to as the standard visualisation pipeline the steps are: (i) read and filter data; (ii) map; (iii); render; and (iv) display. This involves a combination of data extraction, computation and rendering which is greatly facilitated by an appropriate dataset format. For instance slicing multi-dimensional meteorological data is facilitated by array structured formats such as GRIB. Software is often build from modules, each of which is responsible for one of the pipeline steps (e.g. parallel rendering) and may produce intermediate file output requiring high bandwidth i/o.

A certain amount of filtering can be applied to reduce the volume of data required in this process. Visualisation doesn't always need 64-bit, so reduction is possible. It may also be possible to extract a sub-set of data to manipulate if the full volume is not required for rendering. This can help if the data needs to be migrated to the visualisation system. An alternative is to do remote visualisation but rendering with data in situ.

Many visualisation software packages exist which are intended to be used with data in one or more of the available scientific formats discussed in [2]. Here are pointers to some lists of information about this software. An overview can be found on Wikipedia at Brief descriptions and pointers to software that can be used with netCDF is at A page of links to many scientific visualisation and graphics software packages is at

Server Side Visualisation

The primary techniques for building visualisation servers are as follows.

Parallel Architecture:
uses multiple graphics cards for one application, sharing the load. These graphics accelerators could be attached to a single system, e.g. with large memory and multiple CPUs to render large data sets. A cluster with graphics accelerators can form a parallel rendering system reducing data available on local disks.
Parallel Decomposition:
splits the problem across multiple graphics cards in the application's data space or in display space.
Open Software Architecture:
addresses server side rendering at middleware, API, or application levels.

Software such as Chromium, FieldView, ParaView, OpenSceneGraph, DMX, VisIt and SciRun can also do parallel rendering so it is important to be able to manipulate the data inside a cluster to enable this, for instance re-ordering the data from the original application to an order suitable for the rendering package. This could be done in MapReduce. Table 6.2 lists properties of some well known packages.

Table 5: Some Visualisation Packages
Package Name Comments/ File Formats Web site
Amira (licensed) *, Amira, Analyze, various images, netCDF, DXF, PS, HTML, Hoc, Icol, Interfile, Matlab, Nifti, Open Inventor, PSI, Ply, raw binary, SGI RGB, STL, SWC, Tecplot, VRML, Vevo
AtomEye PDB, standard CFG and extended CFG. Provides scripts to convert between PDB and CFG as well as for converting VASP files into CFG
Avizo (licensed) * see
AVS/ Express (licensed) ** netCDF, etc. see
Chimera see
Chromium parallel rendering
DMX multi-display X
EnSight (free or licensed) CFD and multi-physics, netCDF, etc. see User Manual Chapter 11
Environmental Workbench netCDF, ...
IDL netCDF, ...
FieldView Plot3D, FV-UNS, etc.
OpenDX **, CDF, netCDF, HDF-4, HDF-5, DX, various images, mesh, ...
ParaView netCDF, HDF-5, ParaView, EnSight, VTK, Exodus, XDMF, Plot3d, SpyPlot, DEM, VRML, Ply, PDB, XMol, Stereo Lithography, Gaussian Cube, AVS, Meta Image, Facet, various images, SAF, LS-Dyna, raw binary
PV-Wave (licensed) netCDF, ...
SciRun **, Nimrod, ...
VisIt Analyze, Ansys, BOV, Boxlib, CGNS, Chombo, CTRL, Curve2D, Ensight, Enzo, Exodus, FITS, FLASH, Fluent, FVCOM, GGCM, GIS, H5Nimrod, H5Part, various images, ITAPS, MFIX, MM5, NASTRAN, Nek5000, netCDF, OpenFOAM, PATRAN, Plot3d, Point3D, PDB, SAMRAI, Silo, Spheral, STL, TecPlot, VASP, Vis5D, VTK, Wavefront OBJ, Xmdv, XDMF, ZeusMP (HDF-4)
VMD molecular coordinates, dynamics and fields, see

* Amira also has a number of ``packs'' available, such as Molecular Pack, Mesh Pack, VR Pack, etc. which support additional application specific file formats.

* Avizo is also packaged in six editions with extension modules: Standard; Earth; Wind; Fire and Green.

** OpenDX, SciRun and AVS can be extended by adding special purpose file readers into the pipeline.

For more activities involving scientific visualisation in the UK, see the JISC funded VizNet project

Remote Visualisation

Some of the packages noted above can operate in client-server mode, this permits remote visualisation so that the data does not have to be moved. Instead, a visualisation server is installed on a system wih good connection to the data store and a client allows the user to control the images.

The primary techniques to provide easy, high performance access from remote networked clients are as follows.

Transmitting images:
rather than graphical data, from servers to clients.
compressing images on the server and de-compressing them on the client to avoid the need for high network bandwidth. This generally increases performance but can make the CPU a bottleneck.

Widely used packages which operate in this way include the following.

FieldView from Intelligent Light is a proprietary system similar to ParaView described below. See It is used as a post-processor for CFD applications such as Fluent. FieldView will run in parallel and client-server mode and can be used for exploration of large data sets in an immersive VR environment.

Deep Computing Visualisation. The DCV client can be freely installed and is used with a modified version of RealVNC to display 3D images from the server via a compressed OpenGL graphics stream. The DCV server can also be freely installed for academic use. DCV is now supported by NICE and used in the EnginFrame portal, see

ParaView is an open source, multi-platform application designed to visualise data sets of various sizes. The goals of the ParaView project include supporting distributed computational models to process large data sets. It has an open, flexible, and intuitive user interface. Furthermore, ParaView is built on an extensible architecture based on open standards. It runs on distributed and shared memory parallel as well as single processor systems and has been tested on Windows, Linux, Mac OS-X, IBM BlueGene, Cray XT3 and various Unix workstations and clusters. Under the hood, ParaView uses the Visualization Toolkit (VTK) as the data processing and rendering engine and has a user interface written using the Qt cross platform application framework. See

SGI RemoteVUE:
Part of the VUE family of products which includes PowerVUE. No longer available after SGI merged with Rackable Systems in 2009.

Virtual Network Computing with accelerated JPEG compression. See

which interposes on un-modified Unix 3D applications, re-directing its 3D commands onto a server side 3D graphics accelerator, reading back rendered images, and converting the application images into a video stream with which remote clients can interact to view and control the 3D application in real time. The video stream is normally compressed using server specific optimisation. See

VisIT from LLNL is an open source package for parallel and distributed visualisation. This enables it to handle extremely large data sets interactively. VisIt's rendering and data processing capabilities are split into viewer and engine components that may be distributed across multiple machines. The Viewer is responsible for rendering and is typically run on a local desktop or visualisation server so that it can leverage the extremely powerful graphics cards that have been developed in the last few years. The Engine is responsible for the bulk of the data processing and i/o. It typically runs on a remote server where the data is located. This eliminates the need to move the data and makes high end compute and i/o resources available. The engine can be run serially or in parallel on the remote system. See

Data Mining and Feature Recognition

Techniques similar to visualisation also apply to processes of data mining and feature recognition, some examples of which were mentioned above. Sub-sets of data can be used and processes can be applied in situ. Example, looking for phase transitions in aluminium flourides [42]. Data Mining is sometimes referred to as KDD, Knowledge Discovery through Data.

Data mining is a process for extracting patterns from data. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and is also used in scientific discovery. In recent years, data mining has been widely used in areas of science and engineering including bio-informatics, genetics, medicine, education, electrical power engineering, plasma physics experiments and simulations, remote sensing imagery, video surveillance, climate simulations, astronomy and fluid mixing experiments and simulations. An introduction is provided in a paper by Ramakrishnan and Grama [30] and a more recent review is provided by Kamath [17].

A project called Sapphire at LLNL is carrying out research to develop scalable algorithms for interactive exploration of large, complex, multi-dimensional scientific data sets. By applying and extending ideas from data mining, image and video processing, statistics and pattern recognition, they are developing a new generation of computational tools and techniques that are being used to improve the way in which scientists extract useful information from data. See (last updated Mar'2009).

An interesting application is to code vaildation through comparison of simulations to experiment, for instance in studies of the Richtmyer-Meshkov instability, see [18] and This could become increasingly important as the complexity of the model increases, e.g. in fluids and plasmas, climate modelling, agent based modelling, etc.

Methodologies applied include Bayesian analysis, regression analysis, and self organising feature maps (SOMs or Kohonen maps). In addition to some standardisation efforts, freely available open source software have been developed, including Orange, RapidMiner, Weka, KNIME, and packages in the GNU R Project which include data mining processes. Most of these systems are able to import and export models in PMML (Predictive Model Markup Language) which provides a standard way to represent data mining models so that they can be shared between different statistical applications. PMML is an XML based language developed by the Data Mining Group (DMG), an independent group composed of many companies interested in data mining. PMML version 4.0 was released in June 2009, see

A useful places to look at activities around statisitical analysis and data mining are the communities using the Analytic Bridge Web 2.0 tool and KDD Nuggets

Computational Steering

Computational steering is a potentially valuable procedure for scientific investigation in which the parameters of a running program can be altered interactively and the results visualised. As an investigative paradigm its history goes back more than ten years. With the advent of Grid computing, the range of problems that can be tackled interactively has widened, although some implementations have been very complicated. Applications specialists now see the prospect of accomplishing ``real'' science using computational steering whilst, at the same time, computing researchers are working to improve the supporting infrastructure of middleware, tool kits and environments that makes this possible.

EPSRC funded a collaboration network to develop and promote computational steering techniques from 2006-2008, see

Computational steering typically requires remote visualisation, see above. Some packages available include the following.

Computational Steering Environment, Centre for Mathematics and Computer Science, Netherlands,;
Oak Ridge National Laboratory,;
University of Leeds,;
Progress and Magellan:
Georgia Institute of Technology
University of Manchester,;
University of Utah,;
Vizualisation and Application Steering Environment, University of Illinois,
University of Munich

A survey and comparison is presented in [24].


This work was partly funded by EPSRC through the SLA with Daresbury Laboratory.

The author would like to thank the following people for input:

Srikanth Nagella (RAL), Martin Turner (Manchester), Paul Watry (Liverpool),


Y. Ahmad, R. Burns, M. Kazhdan, C. Meneveau, A. Szalay and A. Terzis Scientific Data Management at the Johns Hopkins Institute for Data Intensive Engineering and Science. In SIGREC (2010)

R.J. Allan Management and Analysis of Large Research Data Sets DL Technical Report (2011) draft See

G. Bell, T. Hey and A. Szalay Beyond the data deluge Science 323 (2009) 1297-8

E. Bertin SWarp v2.17.0, User’s Guide Institut d'Astrophysique et Observatoire de Paris (2008)

G. Bikshandi, J. Guo, C. von Praun, G. Tanase, B.B. Fraguela, M.J. Garzarán , D. Padua and L. Rauchwerger Design and Use of htalib – a Library for Hierarchically Tiled Arrays

R.H. Bisseling Parallel Scientific Computation A Structured Approach using BSP and MPI OUP (2004) 326pp ISBN 978-0-19-852939-2

S. Chen and S.W. Schlosser MapReduce meets wider varieties of Applications Intel IRP-TR-08-05 (2008)

J. Dean and S. Ghemawat MapReduce: Simplified Data Processing on Large Clusters Comm. ACM 51 (2008) 107-13 DOI: 10.1145/1327452.1327492.

J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, G. Fox Twister: A Runtime for Iterative MapReduce 1st Int. Workshop on MapReduce and its Applications at HPDC2010 (2010)

Twister Iterative MapReduce Pervasive Technology Institute, Indiana University. Web site:

B. Furht and A. Escalante (Eds.) Handbook of Data Intensive Computing (Springer, 2011) 962pp. ISBN: 978-1461414148

M. Gokhale, J. Cohen, A. Yoo, W.M. Miller, A. Jacob, C. Ulmer and R. Pearce Hardware Technologies for high performance Data Intensive Computing IEEE Computer Society (2008) 0018-9162/08 32-48

I. Gorton, P. Greenfield, A. Szalay and R. Williams Computing in the 21st Century IEEE Computer 41:4 (2008)30-32.

C. He Molecular Dynamics Simulation based on Hadoop MapReduce M.Sc. Thesis (University of Nebraska, Lincoln, 2011)

T. Hey, S. Tansley and K. Tolle (eds.) The Fourth Paradigm: Data-Intensive Scientific Discovery (Microsoft Research, Redmond, 2009) ISBN 978-0-9825442-0-4

T. Hoeffer, A. Lumsdaine and J. Dongarra Towards Efficient MapReduce Using MPI Lecture Notes in Computer Science Vol.5759. M. Ropo, J. Westerholm, J. Dongarra (Eds.) (Springer, 2009) 240–49

M. Isard, M. Budiu, Y. Yu, A. Birrell and D. Fetterly Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks European Conference on Computer Systems (EuroSys) (Lisbon, 21-23/3/2007)

C. Kamath Scientific Data Mining: a Practical Perspective SIAM (2009) 286pp ISBN 978-0898716757

C. Kamath and T. Nguyen Feature Extraction from Simulations and Experiments: Preliminary Results Using a Fluid Mix Problem Technical report UCRL-TR-208853 (Lawrence Livermore National Laboratory, Jan'2005)

R.T. Kouzes, G.A. Anderson, S.T. Elbert, I. Gorton and D.K. Gracio The Changing Paradigm of Data-Intensive Computing Computer 42:1 (2009) 26-3

J. Lin and C. Dyer Data Intensive Text Processing with MapReduce Morgan and Claypool (Synthesis Lectures on Human Language Technologies) (2010) 178pp ISBN: 978-1608453429

J. Lin and M. Schatz Design Patterns for Efficient Graph Algorithms in MapReduce Eighth Workshop on Mining and Learning with Graphs (MLG'10, 2010) 78-85

G. Malewicz, M.H. Austern, A.J.C. Bik, J.C. Dehnert, I. Horn, N. Leiser and G. Czajkowski Pregel: a system for large-scale graph processing In Proc. 28th ACM symposium on Principles of distributed computing (PODC '09) (ACM, New York, 2009) 6-6 DOI: 10.1145/1582716.1582723

E.R. Mardis Next generation DNA sequencing methods Annual Rev. Genomics Hum. Genet. 9 (2008) 387-402

J.D. Mulder, J.J. van Wijk and R. van Liere A Survey of Computational Steering Environments Elsevier Science (July 1998)

B.-S. Park, N.F. Samatova, T. Karpinets, A. Jallouk, S. Molony, S. Horton and S. Arcangeli Data-driven, data-intensive computing for modelling and analysis of biological networks: application to bioethanol production J. Phys.: Conf. Ser. 78 (2007) DOI:10.1088/1742-6596/78/1/012061

A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. Dewitt, S. Madden and M. Stonebraker A Comparison of Approaches to Large-Scale Data Analysis Proc. 35th SIGMOD Int. Conf. on Management of Data (2009). DOI: 10.1145/1559845.1559865.

S.J. Plimpton and K.D. Devine, MapReduce in MPI for Large-Scale Graph Algorithms Parallel Computing 37:9 (Elsevier, 2011) 610-32. DOI: 10.1016/j.paco.2011.2.004

S.J. Plimpton and K.D. Devine MapReduce-MPI Library Sandia National Labs. Web site:

J. Qiu, J. Ekanayake, T. Gunarathne, J. Youl Choi, S.-H. Bae, Y. Ruan, S. Ekanayake, S. Wu, S. Beason, G. Fox, M. Rho and H. Tang Data Intensive Computing for Bioinformatics In ``Data Intensive Distributed Computing: Challenges for large scale information management'' ed. T. Kosar (IGI Global, 2012) 351pp. DOI: 10.4018/978-1-61520-971-2; ISBN: 978-1-61520-971-2

N. Ramakrishnan and A.Y. Grama Mining Scientific Data Adv. Computers 55 (2001) 119-69

S. Seo, E.J. Yoon, J. Kim and S. Jin HAMA: an Efficient Matrix Computation with the MapReduce Framework CS-TR-2012-330 (KAIST, July 2012)

P. Sethia High Performance Multi-Agent System based Simulations M.Sc. Thesis (IIIT, Hyderabad, June 2011)

J. Shendure and H. Ji Next generation DNA sequencing Nat. Biotechnol. 26:10 (2008) 1135-45.

Y. Simmhan, R. Barga, C. van Ingen, M. Nieto-SantiSteban, L. Dobos, N. Li, M. Shipway, A.S. Szalay, S. Werner, J. Heasley GrayWulf: Scalable Software Architecture for Data Intensive Computing HICSS'09. 42nd Hawaii International Conference on System Sciences (5-8/1/2009) 1-10 ISBN: 978-0-7695-3450-3 DOI: 10.1109/HICSS.2009.235

S.-J. Sul and A. Tovchigrechko Parallelizing BLAST and SOM algorithms with MapReduce-MPI library IEEE International Parallel and Distributed Processing HICOMB Symposium (2011)

A.S. Szalay and J. Gray Science in an Exponential World Nature 440 (2006) 23-24

A.S. Szalay, G. Bell, J. Vandenberg, A. Wonders, R. Burns, D. Fay, J. Heasley, T. Hey, M. Nieto-SantiSteban, A. Thakar, C. van Ingen and R. Wilton. GrayWulf: Scalable Clustered Architecture for Data Intensive Computing HICSS'09. 42nd Hawaii International Conference on System Sciences (2009) 1-10. DOI: 10.1109/HICSS.2009.234

Tsz-Wo Sze Schönhage-Strassen Algorithm with MapReduce for Multiplying Tera-bit Integrers Symbolic Numeric Computatino (2011)

Tsz-Wo Sze The Two Quadrillionth Bit of Pi is 0! Distributed Computation of Pi with Apache Hadoop 2nd Int. Conf. on Cloud Computing Technology and Science (CloudCom 2012, IEEE) 727-32

J. Talbot, R. Yoo and A. Romano. Phoenix MapReduce (Stanford, 2007 onwards)

R.C. Taylor An overview of the Hadoop/ MapReduce/ HBase Framework and its current applications in bio-informatics BMC Bioinformatics 11 (Suppl 12):S1 (2010) doi:10.1186/1471-2105-11-S12-S1

J.M.H. Thomas, J. Kewley, R.J. Allan, J.M. Rintelman, P. Sherwood, C.L. Bailey, A. Wander, B.G. Searle, N.M. Harrison, S. Mukhopadyhay, A. Trevin, G.R. Darling and A.I. Cooper Experiences with Different Middleware Solutions on the NW-GRID Proc. UK e-Science All Hands Conference 2007 (EPSRC, 2007)

M. Turner and G. Leaver Remote Scientific Visualization for Large Datasets EG-UK Theory and Practice of Computer Graphics (Elsevier, 2010)

L.G. Valiant A bridging model for parallel computation Communications of the ACM 33:8 (1990)

G. Wang, M.V. Salles, B. Sowell, X. Wang, T. Cao, A. Demers, J. Gehrke and W. White Behavioural Simulations in MapReduce Proc. 36th Int. Conf. on Very Large Data Bases 3:1 (VLDB Endowment, 2010) ISSN: 2150-8097/10/09

T. White Hadoop, the definitive Guide (O'Reilly and Yahoo! Press, 2009, 2nd edn. 2010) ISBN: 978-1-449-38973-4

B. Zhang, Y. Ruan, T.-L. Wu, J. Qiu, A. Hughes and G. Fox Applying Twister to Scientific Applications Proc. 2nd Int. Conf. on Cloud Computing Technology for Science (CloudCom'10, IEEE, 2010) 25-32 ISBN: 978-1-4244-9405-7 DOI: 10.1109/CloudCom.2010.37

SciDAC Scientific Data management Center.

Data Intensive Computing by NSF Data Intensive Computing (2009)

Data Intensive Computing by PNNL Data Intensive Computing (2008)

i-MapReduce Web site:

G.K. Lockwood Hadoop's Uncomfortable Fit in HPC (16/5/2014) Blog article:

About this document ...

Data Intensive Computing

This document was generated using the LaTeX2HTML translator Version 2008 (1.71)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -local_icons -split 3 -html_version 4.0 DataIntensive

The translation was initiated by Rob Allan on 2014-05-19

Rob Allan 2014-05-19