© Science and Technology Facilities Council 2011. Neither the Council nor the Laboratory accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.
An introduction to data intensive computing is given in Wikipedia http://en.wikipedia.org/wiki/Data_Intensive_Computing. There, it is defined as a class of parallel computing applications which use a data parallel approach to process large volumes of data, typically terabytes or petabytes in size. Such applications devote most of their processing time to I/O and to the manipulation of data rather than to computation. This is orthogonal to what we are familiar with from modelling and simulation on high performance computing systems.
Data intensive computing refers to capturing, managing, analysing and understanding data at volumes and rates that push the frontiers of current technologies. The challenge is to provide the hardware architectures and related software systems and techniques which are capable of transforming ultra-large data into valuable information and ultimately knowledge.
The National Science Foundation has noted that data intensive computing requires a ``fundamentally different set of principles'' from other computing approaches. They funded a programme to seek ``increased understanding of the capabilities and limitations of data intensive computing''. The key focus areas are:
Pacific Northwest National Labs, responding to the NSF call, defined data intensive computing as ``capturing, managing, analyzing, and understanding data at volumes and rates that push the frontiers of current technologies''. They believe that addressing the rapidly growing data volumes and complexity requires ``epochal advances in software, hardware and algorithm development'' which can scale readily with the size of the data and provide effective and timely analysis and processing results.
There are several important common characteristics of data intensive computing systems that distinguish them from other forms of computing.
Note that these criteria implicitly preclude in-memory solutions (but see more below).
A variety of system architectures have been implemented for data intensive computing and large scale data analysis applications, including parallel and distributed relational database management systems which have been available to run on clusters for more than two decades. This assumes the data is structured and can be mapped onto a set of connected tables for manipulation. However, most data growth is of data in un-structured or semi-structured form, and new processing paradigms with more flexible models are needed. Emerging solutions include the MapReduce architecture pioneered by Google and now available in an open source implementation called Apache Hadoop, see [46,2,1] (note that Dryad/LINQ has been used on Windows platforms).
Amdahl's Laws are as follows.
Looking at Amdahl's balanced system law, we typically see the following.
We should be aware that with multi-level caching, a much lower I/O to MIPS ratio coupled with a large enough memory can still provide satisfactory performance. This is, however, only true if the problem fits in memory.
Typically, cache based systems with branch prediction and speculation hardware in the CPU are not useful for applications which require streaming data. Here the order-of-magnitude mismatch between disk bandwidth and CPU/memory bandwidth hampers the processing; a more balanced system is required. Both FPGAs and GPUs have been investigated for this purpose. SSD devices might also be used to improve the bandwidth to disk, and they significantly reduce the seek time for un-structured data. Some data intensive benchmarks were created to test different architectures by Gokhale et al.
Special purpose computers based on Amdahl's balanced system law are built to process large volumes of data that will not fit entirely in memory. The original machine designed by Jim Gray of Microsoft is now known as the GrayWulf. This is typically a good solution for university groups who host their own large data sets (it is difficult to move over 1 TB of data over the internet) but who could maintain a reasonably compact commodity cluster solution.
The goals of the GrayWulf project were as follows.
It is found, for instance by Gokhale et al., that using special purpose hardware can remove the CPU bottleneck; ultimately the system becomes limited by bandwidth to disk.
Some examples of machines based on this concept are given below.
The original GrayWulf at Johns Hopkins University has 416 CPUs capable of 1,107 GI/s with a total memory of 1,152 GB. Disk I/O performance was a total of 70 GB/s from just over 1PB of storage. This results in an Amdahl memory number of 1.04 and I/O number of 0.506. The modular structure of this solution is shown in the following table.
|No. Servers||CPU ea||Memory ea||Disk ea||Interconnect|
|Wide area interconnect, 10Gb/s FC|
|Dell R900||1x MD1000 as below||QLogic 20Gb/s|
|Dell R900||3x MD1000 as below||QLogic 20Gb/s|
|Dell 2950 2.66 GHz||2x MD1000 SAS total 30x 750GB 7,200RPM SATA discs||QLogic 20Gb/s|
Further details are given in .
This is basically a distributed SQL server cloud. Applications are run on the Tier-1 servers which are also used for remote access and data transfer.
A second example, a more energy efficient solution, has been built using Atom/Ion technology.
|No. Servers||CPU ea||Memory ea||Disk ea||Interconnect|
|36||2x Atom + 16x GPU||4GB||1x 120GB SSD|
|Zotac N330 Intel Atom/ nVidia Ion 1.6GHz||+ 2x 1TB Samsung F1|
In this case, the Ion GPUs are used typically for data mining over astronomy image data. They can be used for other user-defined functions executed as out of process SQL procedures.
The Edinburgh Data Intensive Machine 1 was built recently also using Atom technology.
|No. Servers||CPU ea||Memory ea||Disk ea||Interconnect|
|120||2x Atom + 16x GPU||4GB||1x 256MB SSD|
|Zotac N330 Intel Atom/ nVidia Ion 1.6GHz||+ 3x 2TB HDD|
The Gordon system at SDSC was launched on 6/12/2011. At the time of writing it is the world's largest flash memory based super-computer. It was developed over a period of two years, based on the Dash prototype and supported by a $20M NSF Track-2 grant. Gordon thus represents the first really big purpose built super-computer for data intensive applications. One aim was to pioneer the technology for the wider enterprise market.
The intention of SDSC and the NSF is to draw in data intensive science codes that have never had a platform this size to push the envelope. An example is world-wide societal evolution. Another example is genomics, a specific design target of the system engineers. Genomics is the classic "big data" science problem, and is the one most frequently cited in HPC circles as suffering from the data deluge crisis. Other application areas like graph problems, geophysics, financial market analytics and data mining are targeted.
The system is an evolution of the Appro HPC cluster, using the third generation Xtreme-X architecture and Intel 8-core SandyBridge Xeon E5 CPUs obtained prior to general availability. There are 1,024 dual socket nodes, each with 64GB of DDR3 memory, and a peak performance of 280 Tflop/s. The system has over 300 TB of Intel solid state ``flash'' disks, spread over 64 I/O nodes.
Despite each Intel SSD being relatively slow, the aggregate IO/s (IOPS) performance of the machine is impressive: across all 64 I/O nodes it achieves a peak of 36M IO/s. Converting to aggregate bandwidth using a figure of around 250 MB/s per SSD for 8kB read operations yields 256 GB/s.
The operating system is based on ScaleMP's virtual SMP (vSMP) technology. This allows users to run large memory applications on what they call a "super-node": an aggregation of 32 compute servers plus two I/O servers, providing access to 512 cores, 2 TB of RAM and 9.6 TB of SSD. To a program running on a super-node, the hardware behaves as a big cache coherent server. The system can be partitioned into 32 super-nodes. The system is connected to a disk sub-system via NFS and Lustre, currently 150 TB but planned to rise to 4 PB by July 2012.
The flash device being employed is Intel's new Solid-State Drive 710. The 710 uses Intel's High Endurance Technology (HET), which is Intel's version of enterprise multi-level cell (eMLC) flash memory. Like eMLC, the HET flash features the performance and resiliency of single level cell (SLC) flash, but at a much lower cost. SDSC also developed its own flash device drivers to maximize performance of the SSD gear.
Inserting this much flash memory into a supercomputer had never been attempted before, and this aspect was probably the biggest risk for the project. When they began the Gordon effort two years ago, flash memory was just starting to make its way into enterprise storage and was an expensive and un-proven technology.
The system is currently undergoing acceptance testing and is expected to be available for production use dedicated to XSEDE users (the evolution of TeraGrid) on 1/1/2012.
|No. Servers||CPU ea||Memory ea||Disk ea||Interconnect|
|32 super-nodes||512 cores||2TB||9.6TB||Mellanox QDR IB|
|1,024 dual socket||16 core SandyBridge 2.6GHz||64GB DDR3||||Mellanox QDR IB, dual 3D torus|
|64 dual socket||12 core Westmere 2.67GHz||48GB DDR3||4.8TB Intel iSSD 710||Mellanox QDR IB, dual 3D torus|
This gives an Amdahl memory number of 1.53 and an Amdahl I/O number of 0.048.
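These Amdahl numbers can be checked against the figures quoted above (a back-of-envelope sketch; the Gordon instruction rate assumes one instruction per cycle per core, which rounds slightly differently from the 1.53 quoted):

```python
# Back-of-envelope check of the Amdahl numbers for GrayWulf and Gordon.
# Amdahl memory number = bytes of memory per instruction/s;
# Amdahl I/O number    = bits/s of I/O per instruction/s.

def amdahl_memory(mem_gb, gips):
    """Memory (GB) divided by instruction rate (giga-instructions/s)."""
    return mem_gb / gips

def amdahl_io(io_gb_per_s, gips):
    """I/O rate converted to Gbit/s, divided by instruction rate in GI/s."""
    return io_gb_per_s * 8 / gips

# GrayWulf: 1,152 GB memory, 1,107 GI/s, 70 GB/s aggregate disk I/O.
print(round(amdahl_memory(1152, 1107), 2))   # 1.04
print(round(amdahl_io(70, 1107), 3))         # 0.506

# Gordon: 1,024 nodes x 16 cores x 2.6 GHz (1 instr/cycle assumed),
# 1,024 x 64 GB memory, 256 GB/s aggregate flash bandwidth.
gips = 1024 * 16 * 2.6
print(round(amdahl_memory(1024 * 64, gips), 2))  # 1.54
print(round(amdahl_io(256, gips), 3))            # 0.048
```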
A summer institute was held at SDSC in 2011 at which applications were suggested in the areas of: societal evolution; data mining; de novo genome assembly; asteroid search and discovery using LSST data; graph analysis; real time data; seismic hazard evaluation.
The Graph500 benchmark is intended to guide the design of hardware architectures and software systems to support data intensive applications of which graph based algorithms are a core part. The benchmark currently includes: (i) concurrent search; (ii) optimisation (single source shortest path); and (iii) edge oriented (maximal independent set). The output metric is TEPS, traversed edges per second; results are typically in the millions or billions.
The Web site http://www.graph500.org/ contains further information and provides the source code.
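The flavour of the search kernel and the TEPS metric can be sketched as follows (a toy single-process sketch, not the reference Graph500 implementation; the five-vertex graph is invented):

```python
import time
from collections import deque

def bfs_teps(adj, source):
    """Breadth-first search over an adjacency list; returns the parent
    tree and a TEPS figure (traversed edges per second)."""
    parent = {source: source}
    edges = 0
    start = time.perf_counter()
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            edges += 1                 # every edge scanned counts as traversed
            if w not in parent:
                parent[w] = v
                queue.append(w)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return parent, edges / elapsed

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
parent, teps = bfs_teps(adj, 0)
print(sorted(parent))   # [0, 1, 2, 3, 4]
```

A real Graph500 run generates a much larger synthetic Kronecker graph and reports the harmonic mean TEPS over many search roots.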
We now consider software for data intensive systems. Much of this discussion is based on . We first consider the use of a database approach for structured data sets. A high level language based on SQL is typically used to encode required operations which can be uploaded by users.
Most scientific data analyses (part of Information Lifecycle Management) are performed in hierarchical steps. During the first pass, a subset of the data is extracted by either filtering on certain attributes (typically removing erroneous or unwanted data) or extracting a vertical subset of the columns. In the next step, data are usually transformed or aggregated in some way. In more complex datasets, these patterns are often accompanied by complex joins among multiple attributes, such as external calibrations or extracting and analysing different parts of a gene sequence. For very large datasets, the most efficient way to perform most of these computations is clearly to move the analysis functions as close to the data as possible. Many of the patterns that have been identified are easily expressed by a set oriented, declarative language whose execution can benefit enormously from cost based query optimisation, automatic parallelism and indexing.
These features were recognised by Jim Gray and his collaborators at Microsoft and sometimes referred to as ``Gray's Laws''. They have shown in several projects that existing relational database technologies can be successfully applied in this context . It is also possible to integrate complex functions written in a procedural or other languages (such as C or R) as an extension of the underlying database engine, as exemplified by the work at Johns Hopkins University.
In recent years, MapReduce has become a popular distributed data analysis and computing paradigm, see below. The principles behind it resemble the distributed grouping and aggregation patterns that already existed in parallel relational database systems. Such capabilities are now being referred to as ``MapReduce in the database''. Benchmarks to compare different approaches are being developed.
It was found by Szalay et al. that data operations performed by users of the Sloan Digital Sky Survey database exhibit a 1/f power law distribution. Most queries are very simple single row lookups based on simple indices such as celestial position (high frequency, low volume). A small number of analyses did not use pre-computed indices and required lengthy sequential scans through the data, possibly combined with merge-join operations; these can take over an hour on a multi-terabyte database. Other access patterns involved multiple simple scripts to extract a series of results, a kind of database ``crawler''. Application of such scripts might be combined into a more complex workflow which can be executed from a batch queue.
For efficient analysis of structured data, it is important to start with an appropriate schema (for metadata) and database design (table layout and connectivity, data ordering, etc.). It is notoriously hard to modify these on an ad hoc basis. Given these requirements it is then possible to provide a set of sample SQL scripts which might meet the requirements of most users. The database design should aim to accommodate these scripts in the simplest way; the scripts can then be extended to do more complex analyses.
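Such a sample script might, for instance, filter flagged records, project a vertical subset of columns and then aggregate (a sketch using Python's built-in sqlite3 module; the `observations` table and all its column names are invented for illustration):

```python
import sqlite3

# Hypothetical observations table: names and values are illustrative only,
# not taken from any real survey schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE observations (id INTEGER, ra REAL, dec REAL, "
           "flux REAL, flag INTEGER)")
rows = [(1, 10.0, -5.0, 2.0, 0), (2, 11.0, -5.5, 3.0, 0),
        (3, 12.0, -6.0, 0.5, 1),   # flagged as erroneous
        (4, 10.5, -5.2, 4.0, 0)]
db.executemany("INSERT INTO observations VALUES (?,?,?,?,?)", rows)

# Pass 1: filter out flagged rows and keep a vertical subset of the columns.
good = db.execute("SELECT ra, dec, flux FROM observations WHERE flag = 0")

# Pass 2: aggregate - here a simple mean flux over the filtered subset.
(mean_flux,) = db.execute(
    "SELECT AVG(flux) FROM observations WHERE flag = 0").fetchone()
print(mean_flux)   # 3.0
```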
To facilitate access and reduce I/O requirements the data must be factored in different ways which reflect requirements, for instance it could be divided in space or time into blocks which can be read in and processed in parallel. Typically this will result in a hierarchical database architecture. This is explained in further detail with examples from the Pan-STARRS and SLOAN DB in .
Management of the database and servers, in particular to ensure fault tolerance, is also described in .
A number of management tasks can be codified into workflows as follows.
Some authors have claimed that MapReduce is the most successful abstraction to date to go beyond the classic von Neumann model of computing. It is aimed at developing scale-free data intensive applications.
MapReduce is a software methodology introduced by Google in 2004 to support distributed computing on large data sets on loosely coupled compute clusters. The framework is inspired by the map and reduce operations commonly used in functional programming, but in a modified form. MapReduce libraries have now been written in C++, C#, Erlang, Java, OCaml, Perl, Python, PHP, Ruby, F#, R and other languages.
Such abstractions manage complexity by hiding details and presenting well defined behaviours to users. They make certain tasks easier but others more difficult, and sometimes impossible. It may be only the first in a new class of programming models that will allow us to more effectively organise computations at a massive scale, and extensions are already being suggested.
MapReduce frameworks were developed for commercial information retrieval, which is perhaps currently the world's most demanding data analysis problem. Exploiting commercial approaches offers a good chance of achieving a high quality, robust solution. MapReduce now has both commercial and open source implementations. Qiu et al. looked at MapReduce and MPI, and showed how to analyse biological samples with modest numbers of sequences on a multi-core cluster. To support iterative operations they evaluated the open source Java i-MapReduce software which uses Apache Hadoop, see [46,2]. Another iterative MapReduce implementation is Twister [9,47].
Iterative data intensive computation is pervasive in many applications such as data mining or social network analysis. These typically involve massive data sets containing millions or even billions of records. MapReduce in its original form lacks built-in support for iterative processes that require parsing data sets repeatedly and doing bookkeeping. Besides specifying MapReduce jobs, users have to write a driver program that submits multiple jobs and performs convergence or other testing at the client. In a system such as i-MapReduce, users specify the iterative computation with separate map and reduce functions, and the system provides support for automatic iterative processing without the need for a user defined driver program. As well as being easier to use, this can improve performance by: (1) reducing the overhead of repeatedly creating new MapReduce jobs; (2) eliminating the shuffling of static data; and (3) allowing asynchronous execution of map tasks.
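The driver loop that plain MapReduce leaves to the user can be sketched as follows (the `run_job` and `converged` functions are toy stand-ins for real job submission and convergence testing, not any framework's API; the "job" here is just a Newton fixed-point update):

```python
# Sketch of the hand-written driver loop that plain MapReduce requires:
# submit one job per iteration and test convergence at the client.

def run_job(state):
    """Toy stand-in for one MapReduce job: the update x -> (x + 2/x)/2."""
    return {k: (v + 2.0 / v) / 2.0 for k, v in state.items()}

def converged(old, new, tol=1e-9):
    return all(abs(new[k] - old[k]) < tol for k in old)

state = {"x": 1.0}
iterations = 0
while True:                      # the user-written driver loop
    new_state = run_job(state)   # a fresh job is launched on every pass
    iterations += 1
    if converged(state, new_state):
        break
    state = new_state
print(round(state["x"], 6))      # converges to sqrt(2)
```

Each pass pays the job start-up cost and re-reads any static data, which is exactly the overhead that iterative frameworks such as i-MapReduce or Twister aim to remove.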
Whilst specially optimised block based file systems such as Hadoop FS have been developed to support MapReduce [2,1], there are also versions implemented in MPI, such as the library from Sandia National Labs [28,27] and one from Indiana. Interfaces to the Sandia library are available for C, C++, Python and OINK (cf. PIG, the scripting language for Hadoop). It has, for instance, been used to parallelise BLAST and SOM algorithms. Another is YAMML, a Google code prototype. Others are mentioned in .
There are several database like implementations on top of Hadoop, for instance Hive, which offers an SQL like interface.
Geoffrey Fox (Indiana University) has compared MPI with iterative MapReduce. He noted that MapReduce is more dynamic and fault tolerant than MPI, and that it is simpler and easier to use. Both Fox and Chen and Schlosser noted that it nevertheless requires some extensions and an iteration framework to make it useful for a wider range of applications. Some applications listed in papers on MapReduce are as follows.
We note that applications written in languages other than Java can also be used on Hadoop with the Pipes library. There are several projects to integrate with R, for example RHIPE from Purdue University.
Dryad is a Microsoft Research project developing an execution environment for data parallel applications expressed as directed acyclic graphs (DAGs). In general this ongoing programme of work is investigating programming models.
Pregel is another technology developed by Google to mine information from graphs. Examples include road networks (maps), the Internet, etc. Pregel uses three design patterns based on MapReduce to extract information about the vertices and edges. This is an inherently iterative algorithm with a similar approach to BSP.
It is over 20 years since Les Valiant and Bill McColl started working on Bulk Synchronous Parallel (BSP) computing, and interest in BSP is now growing again. Pregel is a BSP system for graph analytics at scale and is now a major element of Google's internal big data infrastructure. Within the Hadoop community, BSP models are now being used to achieve much higher levels of performance than can be achieved using MapReduce.
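The vertex-centric, superstep-based style can be illustrated with maximum value propagation, the running example used in the Pregel paper (a minimal single-process sketch; the graph and starting values are invented):

```python
def propagate_max(adj, values):
    """Pregel-style supersteps over an undirected graph: each active vertex
    sends its value to its neighbours; a vertex reactivates only when an
    incoming message improves its value; halt when no vertex is active."""
    values = dict(values)
    active = set(adj)                  # superstep 0: everyone is active
    while active:
        inbox = {v: [] for v in adj}
        for v in active:               # message-passing phase
            for w in adj[v]:
                inbox[w].append(values[v])
        active = set()
        for v in adj:                  # compute phase
            if inbox[v] and max(inbox[v]) > values[v]:
                values[v] = max(inbox[v])
                active.add(v)
    return values

adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
vals = propagate_max(adj, {1: 3, 2: 6, 3: 2, 4: 1})
print(vals)   # {1: 6, 2: 6, 3: 6, 4: 6}
```

The barrier between the message-passing and compute phases is the BSP superstep; termination by "no active vertices" is the Pregel halting convention.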
Some examples are given below of data intensive work in the UK and elsewhere which could presumably benefit from access to a larger facility and some joint development work.
In biology, an obvious set of examples is represented by the so called next generation sequencing methods for nucleic acids. Recent reviews of these are given by [23,33]. Big databases with fast access and information visualisation look likely to be important for this area of biology.
Modelling biological networks is inherently data driven and data intensive. The combinatorial nature of this type of modelling, however, requires new methods capable of dealing with the enormous size and irregularity of the search space. Searching via ``back tracking'' is one possible solution that avoids exhaustive searches by constraining the search space to the sub-space of feasible solutions. Despite its wide use in many combinatorial optimisation problems, there are currently few parallel implementations of backtracking capable of effectively dealing with the memory intensive nature of the process and the extremely un-balanced loads involved.
Some other topics include: bio-medical studies in gene selection and filtering; feature selection algorithms for mining high dimensional DNA micro-array data; random matrix theory to analyse biological data.
Bio-informatics in this context is primarily concerned with sequence studies. Qiu et al. provide a detailed description of current approaches to this.
Some topics include: processing extreme scale datasets in the geo-sciences; geo-spatial data management, e.g. with TerraFly (from NASA); parallel earthquake simulations on large scale multi-core super-computers.
One application is discovering relevant entities in large scale social information systems. This typically involves graph based network analysis. A breadth first search computes all nodes in the tree which can be reached from a given node and gives the shortest path between them. This is done by processing an adjacency list iteratively down each connected node.
The adjacency list can be represented as a sparse adjacency matrix. In Cailin Nelson's method, colouring is used to identify whether a node is yet to be explored, is being explored, or has had its depth found. This illustrates a problem with the basic Hadoop implementation: since the algorithm is iterative, a global condition must be set to ensure termination when there are no more links still being explored, and there is no support in the Hadoop framework to do this. See http://www.johnandcailin.com/blog/cailin/breadth-first-graph-search-using-iterative-map-reduce-algorithm.
An example of such work is from Prof. Mark Pagel FRS, University of Reading: http://www.evolution.reading.ac.uk/.
His group builds statistical models to examine the evolutionary processes imprinted in human behaviour, from genomics to the emergence of complex systems, to culture. Their latest work examines the parallels between linguistic and biological evolution by applying methods of phylogenetics, the study of evolutionary relatedness among groups, essentially viewing language as a culturally transmitted replicator with many of the same properties we find in genes. They are looking for patterns in the rates of evolution of language elements, and hoping to find the social factors that influence trends of language evolution.
Software and projects are based on Bayesian analysis and visualisation.
Examples are from Jeremy Yates, DIRAC Consortium, see http://www.ucl.ac.uk/star/research/miracle/dirac.
Researchers at 27 HEIs in the UK are studying the properties of sub-atomic particles, cosmology and the physics of the early universe, the formation of large scale structure and galaxies, the properties of galaxies, the formation of stars and planets, astro-chemistry and the inter-stellar medium, the properties of exo-planets and planetary atmospheres, the physics of the sun and solar system, the properties of nano-particles, molecules and atoms in astrophysical conditions.
Software and projects include: AstroGrid; COSMOS; VIRGO; Miracle.
Elsewhere, projects like the Large Synoptic Survey Telescope (LSST) will begin collecting data in 2012 at a rate of approx. 500 MB/s which will need to be processed in real time. SWarp is used in the processing pipeline and is based on a Lanczos re-sampling filter. This normalises the images to the sky and makes it possible to recognise new or changed features by comparing images.
Some applications of data mining are as follows. Science (chemistry, physics, medicine): biochemical analysis, remote sensing from satellites, telescope data, star/galaxy classification, medical image analysis. Bioscience: sequence-based analysis, protein structure and function prediction, protein family classification, micro-array gene expression. Pharmaceutical companies, insurance and health care, medicine: drug development, identifying successful medical therapies, claims analysis, fraudulent behaviour, medical diagnostic tools, predicting office visits. Financial industry, banks, businesses, e-commerce: stock and investment analysis, identifying loyal versus risky customers, predicting customer spending, risk management, sales forecasting.
ADMIRE, Advanced Data Mining and Integration Research for Europe, is an EU project which includes EPCC and NeSC, see http://www.admire-project.eu/.
The main application areas addressed are: gene annotation, flood forecasting using data mining on observed data linked to simulations, analytical CRM from call centre data, search for quasars using digital telescope data, analysis of seismology databases.
One of the deliverables of the project is DISPEL, a data intensive systems process engineering language.
The ADMIRE software stack is open source, see http://www.admire-project.eu/ and http://sf.net/projects/admire/.
Other software which support data mining includes: Weka, RapidMiner, Fastlib, MatLab, R, Octave, Phoenix MapReduce (a shared memory implementation from Stanford written in C++).
Language processing is becoming increasingly important, for instance in analysis of text on the Web. This is not confined to the English language. It requires processing large document streams.
Some work is at the University of Bristol, whose group won the 2009 Morpho challenge for machine learning. Other work is by Prof. Paul Watry, University of Liverpool, see http://www.liv.ac.uk/english/staff/paulwatry.htm.
The current focus is on developing and implementing technologies in the areas of computational linguistics and textual analysis. There is current work in understanding the set of relationships that exist between human and machine defined semantics, and in exploring the application of semantic grid technologies to improve discovery and generate knowledge. Related work covers the analysis of electronic document structure, data and text mining, natural language processing, corpus linguistic tools (information retrieval), annotation and visualisation technologies, persistent archives and new media.
Software and projects from the Liverpool group include: Cheshire III Digital Library software; iRODS; SHAMAN; PrestoPrime. See also http://www.nactem.ac.uk/.
One processing technique is to use n-grams consisting of a sequence of n characters or words. The algorithm extracts the t most frequent n-grams found; see Wikipedia http://en.wikipedia.org/wiki/N-gram. Document profiles can be compared, e.g. to determine their similarity or what language they are written in by closeness to the language training set profile. Languages can be modelled statistically as n-grams, based on Shannon's theories of information; these are sometimes Markov models. FPGAs can be used for this purpose, e.g. via a parallel Bloom filter as a probabilistic test of set membership. N-grams are also used in gene sequence analysis.
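The profiling and comparison steps can be sketched as follows (character trigrams; the training texts are tiny invented samples, far smaller than a real training corpus, and the similarity measure is a crude overlap count rather than a proper rank-distance):

```python
from collections import Counter

def profile(text, n=3, t=50):
    """Return the t most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(t)]

def similarity(p, q):
    """Crude closeness measure: fraction of p's top n-grams shared with q."""
    return len(set(p) & set(q)) / max(len(p), 1)

english = profile("the quick brown fox jumps over the lazy dog and then "
                  "the dog chases the fox over the hill")
german  = profile("der schnelle braune fuchs springt ueber den faulen hund "
                  "und dann jagt der hund den fuchs ueber den huegel")

sample = profile("the fox and the dog run over the hill")
print(similarity(sample, english) > similarity(sample, german))  # True
```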
Graph analysis is another application described in .
There is a book on text processing with MapReduce.
This typically refers to what is done with customer derived commercial data using SQL or a combination of SQL and MapReduce. It can yield information used in: predictive and granular forecasting; trend analysis and modelling; credit and risk management; fraud detection; cross platform advert and event attribution; cross platform media affinity analysis; merchandising and packaging optimisation; service personalisation; graph analysis; consumer segmentation; consumer buying patterns and behaviour; click stream analysis; compliance and regulatory reporting.
Sometimes the acronym OLAP is used, for On Line Analytic Processing. It can be found that an RDBMS approach does not scale to the size required for this kind of semi-structured data, and solutions like Hadoop Hive and PIG are used instead.
For more information about some of the examples mentioned above see http://www.iterativemapreduce.org/samples.html.
See also Google code tutorial http://code.google.com/edu/parallel/mapreduce-tutorial.html.
All data objects in MapReduce are referenced by keys and values which are defined by the user. These are handled as key value (KV) pairs or as key multi-value (KMV) sets.
The first thing to decide when implementing an algorithm is what the keys are. Keys are objects which can be simple data types or structures. Some examples are:
a string: for instance each word from a text in the word sort or
a vertex (or node): for instance v=(i,j) can refer to an element of a 2D array or v=(i,j,k) a 3D array or similar entity in a graph
One or several values are associated with a key; again, values can be simple data types or structures.
A MapReduce framework relies on function callbacks. Thus, the user defines the map and reduce functions which are invoked by the framework.
The map operation translates input keys and values to new keys and values, (k1, v1) -> list(k2, v2).
The framework sorts the keys and groups each unique key with all its values, (k2, list(v2)).
The reduce operation translates the set of values for each key to a new key and value, (k2, list(v2)) -> (k3, v3).
The map and reduce callbacks need to be able to handle the types of objects defined for the keys and values. It may also be necessary to provide a callback function for the sort operation if the values are not simple types.
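The map, sort/group and reduce stages described above can be sketched as a toy in-memory framework (illustrative only; a real framework distributes each stage across processors and file systems):

```python
from itertools import groupby

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy single-process MapReduce: map, sort/group by key, reduce."""
    # Map: each input record may emit zero or more (key, value) pairs.
    kv = [pair for record in inputs for pair in map_fn(record)]
    # Sort and group: collect every value belonging to the same key.
    kv.sort(key=lambda p: p[0])
    kmv = [(k, [v for _, v in grp])
           for k, grp in groupby(kv, key=lambda p: p[0])]
    # Reduce: translate each key and its value set to a new (key, value).
    return [reduce_fn(k, values) for k, values in kmv]

# Example: total the lengths of each distinct word across lines of text.
result = map_reduce(
    ["one two three", "two three three"],
    map_fn=lambda line: [(w, len(w)) for w in line.split()],
    reduce_fn=lambda k, vs: (k, sum(vs)),
)
print(result)   # [('one', 3), ('three', 15), ('two', 6)]
```

Here the keys are strings and the values simple integers; in general both can be structures, provided the sort can order the keys.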
It is important to remember a few things about how the framework functions. A ``job'' refers to a single map operation together with a reduce operation. The grouping and sorting are done by the framework. An iterative application may need to run many MapReduce jobs; typically data will be output to local file store between jobs or held in memory.
Map and reduce operations are similar to those in functional languages like Lisp. The data is distributed and a process is run on each CPU against its local data.
MapReduce is particularly suitable for graph based algorithms. Any sequential data dependencies should be avoided, e.g. do breadth first, not depth first. MapReduce applications can typically be represented as a network diagram.
[example using GraphViz]
The ``map'' operation is required, and is typically a parallel process which distributes the input data across processors according to the keys. This can be assumed to perform some load balancing. The result of each map is a set of zero or more KV pairs, but distributed such that a key may appear more than once.
The ``aggregate'' operation re-maps the KV pairs such that all occurrences of a particular key are on the same processor. It also performs load balancing again, possibly using a hashing function. This is an all-to-all communication operation.
The ``convert'' operation takes the set of values for each unique key and converts them to a KMV object. This should be done following the aggregate step and is an on-processor operation. Typically the two operations are combined into one, such as ``collate''.
The ``reduce'' operation is optional, but if present will operate over the keys in ``sort order'', but keys may still be randomly distributed across reducers (processors). Thus the KMV set is un-ordered. The result of each reduce is a set of zero or more KV pairs.
Thus a skeleton MR job might begin to look as follows:
MapReduce *mr = new MapReduce(MPI_COMM_WORLD);  // one MR object per communicator
mr->map(input, map_callback);                   // emit KV pairs from the input
mr->collate(NULL);                              // aggregate + convert KV pairs to KMV sets
mr->reduce(reduce_callback, output);            // process each KMV set into new KV pairs
Other things to note are as follows:
Read in all lines in a file. The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
The word count example is probably the most common one seen in MapReduce tutorials. One or more text (or other) files are read in and the mapper selects words (or other entities) and assigns values to them, initially 1 for each word, i.e. <word, 1>. These are sorted and gathered such that identical words are on one processor. The reducer can then output the final KV pairs <word, count>.
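The steps above can be sketched in a few lines (single-process; a real deployment spreads the map and reduce phases across nodes):

```python
from collections import defaultdict

lines = ["to be or not to be", "be quick"]

# Map: emit <word, 1> for every word in every line.
mapped = [(w, 1) for line in lines for w in line.split()]

# Shuffle/sort: gather all the values for each distinct word together.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the ones to give the final <word, count> pairs.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts["be"], counts["to"], counts["quick"])   # 3 2 1
```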
In the indexing application, instead of a count the value stored is the document id (or page number), i.e. <word, document id>. This is reduced to a list of the document ids or pages on which the word occurs.
Read content from a Web URL. The map function searches for target URLs in the content and associates them with the source URL as KV pairs. These are gathered to a KMV and reduced to give, for each key, the list of URLs associated with it. This can be further processed using the graph methods explained below.
This is typical of a graph processing algorithm, and uses a breadth first search as explained by Cailin Nelson http://www.johnandcailin.com/blog/cailin/breadth-first-graph-search-using-iterative-map-reduce-algorithm. It is not optimised.
A graph as in Figure 1 can typically be stored as an ``adjacency list'' connecting its nodes, e.g.
This can be represented using the node as key and, as value, a structure holding the adjacency list, the distance from the source and a colour. One node, e.g. ``a'', is assigned at the start to be at distance zero and coloured ``grey''. Other nodes are coloured ``white'' and are yet to be processed. All other distances are initially ``unknown''.
The algorithm is iterative. At each map step we take all grey nodes and expand their adjacency lists, emitting their neighbours at the next largest distance. This adds more KV pairs. The nodes which have been processed are then marked ``black''.
Now the set is sorted and gathered so that all of the values emitted for each node are brought together.
The reducers combine occurrences of the same key to give all nodes (in this case b and e) at distance 1 from the source.
This proceeds in a similar fashion until there are no more grey nodes; either they are all black or some remain white, not being connected to the source node. The resulting value is the distance from the source, or unknown if not connected.
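The iteration can be sketched serially in Python; the small ring graph in the test below is a stand-in, not necessarily the graph of Figure 1, and the map and reduce phases are marked as comments:

```python
GREY, WHITE, BLACK = "grey", "white", "black"
INF = float("inf")

def bfs(adjacency, source):
    # node -> [distance, colour]; the source starts grey at distance 0
    state = {n: [INF, WHITE] for n in adjacency}
    state[source] = [0, GREY]
    while any(colour == GREY for _, colour in state.values()):
        # map: expand every grey node, emitting its neighbours at the
        # next largest distance, then mark the expanded node black
        emitted = []
        for node, (dist, colour) in state.items():
            if colour == GREY:
                for nbr in adjacency[node]:
                    emitted.append((nbr, dist + 1))
                state[node][1] = BLACK
        # reduce: keep the smallest distance seen for each node; newly
        # reached white nodes turn grey for the next iteration
        for node, dist in emitted:
            if dist < state[node][0]:
                state[node][0] = dist
            if state[node][1] == WHITE:
                state[node][1] = GREY
    return {n: d for n, (d, _) in state.items()}
```

Nodes never connected to the source keep the distance INF, corresponding to ``unknown'' above.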
One algorithm used is K-means. This looks at points distributed in an n-dimensional space and attempts to use a Euclidean distance measure to associate points with cluster centres. The centres are refined iteratively.
A random set of cluster centres is assigned at the start and all points are read in. Each map task calculates the Euclidean distance from each point in its partition to each cluster centre, assigns points to their nearest centre and sums the partial cluster centre values. The KV is the cluster centre sums plus the number of points assigned. The reduce task sums all the corresponding partial sums and calculates the new cluster centres.
This process is iterated until a stable result is found.
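A serial sketch of this K-means loop in two dimensions, using a naive first-k initialisation rather than random centres so the result is repeatable:

```python
def kmeans(points, k, iterations=10):
    """points: list of (x, y) tuples; returns the k cluster centres."""
    # naive initialisation: take the first k points as starting centres
    centres = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # map: assign each point to its nearest centre (squared Euclidean
        # distance) and accumulate partial sums (sum_x, sum_y, count)
        sums = [[0.0, 0.0, 0] for _ in range(k)]
        for x, y in points:
            c = min(range(k),
                    key=lambda i: (x - centres[i][0]) ** 2
                                + (y - centres[i][1]) ** 2)
            sums[c][0] += x
            sums[c][1] += y
            sums[c][2] += 1
        # reduce: combine the partial sums and compute the new centres
        centres = [(sx / n, sy / n) if n else centres[i]
                   for i, (sx, sy, n) in enumerate(sums)]
    return centres
```

In the parallel form each map task would hold a partition of the points and emit only its partial (sum, count) pairs, keeping the communication per iteration proportional to k rather than to the number of points.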
This description is written in terms of row and column elements, but should be generalised to row and column blocks.
C=A*B, simplified to square (N*N) matrices.
Take rows i from matrix A and columns j from matrix B. The key is (i,j), which is the element index for the output matrix C. Input values are the partial products A(i,k)*B(k,j), which must be reduced to C(i,j), the sum over k of A(i,k)*B(k,j).
In the Twister approach, rows of A are read in and processed iteratively until the whole output matrix is complete. The trick here is to avoid having to hold the entire 2*N*N data set, which could be very large. Reading successive rows requires only N+(N*N) elements to be handled on each iteration to compute output row C(i,*). The map operation ensures that N+N elements are multiplied on each of N processors.
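The basic keying scheme (not the Twister row-iterative variant) can be sketched serially: the map phase emits partial products under the key (i,j) and the reduce phase sums them:

```python
def mr_matmul(A, B):
    """C = A*B for square N x N matrices given as lists of lists."""
    N = len(A)
    # map: for output element (i, j), emit the partial products A(i,k)*B(k,j)
    partials = {}
    for i in range(N):
        for j in range(N):
            for k in range(N):
                partials.setdefault((i, j), []).append(A[i][k] * B[k][j])
    # reduce: sum the partial products gathered under each key (i, j)
    C = [[0] * N for _ in range(N)]
    for (i, j), values in partials.items():
        C[i][j] = sum(values)
    return C
```

The dictionary of partials here is exactly the 2*N*N-sized intermediate set that the Twister scheme is designed to avoid materialising in full.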
There is a small but growing on-line literature addressing more complex algorithms using MapReduce based patterns. Just a few are listed here.
Radix-2 FFT, outline algorithm only by J. Chen http://theotherjchen.blogspot.com/2010/08/fft-with-mapreduce.html.
Large scale integer maths using Hadoop, presentation by Tsz-Wo Sze http://www.slideshare.net/hortonworks/large-scale-math-with-hadoop-mapreduce plus separate papers by the same author [38,39].
A number of authors have presented matrix multiplication algorithms: J. Norstad http://www.norstad.org/matrix-multiply/index.html; G. Fox http://www.iterativemapreduce.org/samples.html; S. Seo et al. ; E.J. Yoon http://blog.udanax.org/2008/10/parallel-matrix-multiply-on-hadoop.html;
Conjugate Gradient method for solving linear systems: S. Seo et al. ;
This section discusses some other ``non-traditional'' ways to analyse large data sets resulting from numerical modelling and simulation.
Visualisation typically uses large volumes of data. Rendering is carried out to view and study parts of the data, e.g. to produce surface or volume images of multi-dimensional data. It is usual to want to scan through this data interactively, viewing successive parts, for instance rendering slices through a volume to explore the interior structure. Other activities are to map functions using artefacts such as streamlines or vectors (hedgehog plots). It is also likely that these will be composed together as a ``movie''. Some examples of work performed on HECToR are given by Turner and Leaver .
The visualisation workflow is thus to extract a part of the data and then manipulate and render it. In terms of what is commonly referred to as the standard visualisation pipeline the steps are: (i) read and filter data; (ii) map; (iii) render; and (iv) display. This involves a combination of data extraction, computation and rendering which is greatly facilitated by an appropriate dataset format. For instance slicing multi-dimensional meteorological data is facilitated by array structured formats such as GRIB. Software is often built from modules, each of which is responsible for one of the pipeline steps (e.g. parallel rendering) and may produce intermediate file output requiring high bandwidth I/O.
A certain amount of filtering can be applied to reduce the volume of data required in this process. Visualisation does not always need 64-bit precision, so the data can be reduced. It may also be possible to extract a sub-set of data to manipulate if the full volume is not required for rendering. This can help if the data needs to be migrated to the visualisation system. An alternative is to do remote visualisation, rendering with the data in situ.
Many visualisation software packages exist which are intended to be used with data in one or more of the available scientific formats discussed in . Here are pointers to some lists of information about this software. An overview can be found on Wikipedia at http://en.wikipedia.org/wiki/Visualization_computer_graphics. Brief descriptions and pointers to software that can be used with netCDF are at http://www.unidata.ucar.edu/packages/netcdf/utilities.html. A page of links to many scientific visualisation and graphics software packages is at http://static.msi.umn.edu/user_support/scivis/scivis-list.html.
The primary techniques for building visualisation servers are as follows.
Software such as Chromium, FieldView, ParaView, OpenSceneGraph, DMX, VisIt and SciRun can also do parallel rendering so it is important to be able to manipulate the data inside a cluster to enable this, for instance re-ordering the data from the original application to an order suitable for the rendering package. This could be done in MapReduce. Table 6.2 lists properties of some well known packages.
|Package Name||Comments/File Formats||Web site|
|Amira (licensed)||*, Amira, Analyze, various images, netCDF, DXF, PS, HTML, Hoc, Icol, Interfile, Matlab, Nifti, Open Inventor, PSI, Ply, raw binary, SGI RGB, STL, SWC, Tecplot, VRML, Vevo||http://www.amiravis.com|
|AtomEye||PDB, standard CFG and extended CFG. Provides scripts to convert between PDB and CFG as well as for converting VASP files into CFG||http://mt.seas.upenn.edu/Archive/Graphics/A/|
|Avizo (licensed)||* see http://www.uni-ulm.de/fileadmin/website_uni_ulm/kiz/it/software-betriebssysteme/software-liste/Visualisierungssoftware/Avizo/VSG_Avizo_FileFormats.pdf||http://www.vsg3d.com|
|AVS/ Express (licensed)||** netCDF, etc. see http://personal.cscs.ch/~mvalle/AVS/Reader_Status_List.html||http://www.avs.com|
|EnSight (free or licensed)||CFD and multi-physics, netCDF, etc. see User Manual Chapter 11||http://www.ensight.com|
|Environmental Workbench||netCDF, ...||http://www.ssesco.com/files/ewb.html|
|FieldView||Plot3D, FV-UNS, etc.||http://www.ilight.com/products.php|
|OpenDX||**, CDF, netCDF, HDF-4, HDF-5, DX, various images, mesh, ...||http://www.opendx.org|
|ParaView||netCDF, HDF-5, ParaView, EnSight, VTK, Exodus, XDMF, Plot3d, SpyPlot, DEM, VRML, Ply, PDB, XMol, Stereo Lithography, Gaussian Cube, AVS, Meta Image, Facet, various images, SAF, LS-Dyna, raw binary||http://www.itk.org/Wiki/ParaView|
|PV-Wave (licensed)||netCDF, ...||http://www.vni.com/products/wave/index.html|
|SciRun||**, Nimrod, ...||http://www.sci.utah.edu|
|VisIt||Analyze, Ansys, BOV, Boxlib, CGNS, Chombo, CTRL, Curve2D, Ensight, Enzo, Exodus, FITS, FLASH, Fluent, FVCOM, GGCM, GIS, H5Nimrod, H5Part, various images, ITAPS, MFIX, MM5, NASTRAN, Nek5000, netCDF, OpenFOAM, PATRAN, Plot3d, Point3D, PDB, SAMRAI, Silo, Spheral, STL, TecPlot, VASP, Vis5D, VTK, Wavefront OBJ, Xmdv, XDMF, ZeusMP (HDF-4)||https://wci.llnl.gov/codes/visit|
|VMD||molecular coordinates, dynamics and fields, see http://www.ks.uiuc.edu/Research/vmd/plugins/molfile/||http://www.ks.uiuc.edu/Research/vmd/|
* Amira also has a number of ``packs'' available, such as Molecular Pack, Mesh Pack, VR Pack, etc. which support additional application specific file formats.
* Avizo is also packaged in a number of editions with extension modules, including Standard, Earth, Wind, Fire and Green.
** OpenDX, SciRun and AVS can be extended by adding special purpose file readers into the pipeline.
For more activities involving scientific visualisation in the UK, see the JISC funded VizNet project http://www.viznet.ac.uk.
Some of the packages noted above can operate in client-server mode, which permits remote visualisation so that the data does not have to be moved. Instead, a visualisation server is installed on a system with a good connection to the data store and a client allows the user to control the images.
The primary techniques to provide easy, high performance access from remote networked clients are as follows.
Widely used packages which operate in this way include the following.
Techniques similar to visualisation also apply to processes of data mining and feature recognition, some examples of which were mentioned above. Sub-sets of data can be used and processes can be applied in situ. An example is looking for phase transitions in aluminium fluorides . Data mining is sometimes referred to as KDD, Knowledge Discovery in Databases.
Data mining is a process for extracting patterns from data. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and is also used in scientific discovery. In recent years, data mining has been widely used in areas of science and engineering including bio-informatics, genetics, medicine, education, electrical power engineering, plasma physics experiments and simulations, remote sensing imagery, video surveillance, climate simulations, astronomy and fluid mixing experiments and simulations. An introduction is provided in a paper by Ramakrishnan and Grama  and a more recent review is provided by Kamath .
A project called Sapphire at LLNL is carrying out research to develop scalable algorithms for interactive exploration of large, complex, multi-dimensional scientific data sets. By applying and extending ideas from data mining, image and video processing, statistics and pattern recognition, they are developing a new generation of computational tools and techniques that are being used to improve the way in which scientists extract useful information from data. See https://computation.llnl.gov/casc/sapphire/research.html (last updated Mar'2009).
An interesting application is code validation through comparison of simulations to experiment, for instance in studies of the Richtmyer-Meshkov instability, see  and https://computation.llnl.gov/casc/sapphire/jacobsdata/jacobsdata.html. This could become increasingly important as the complexity of the model increases, e.g. in fluids and plasmas, climate modelling, agent based modelling, etc.
Methodologies applied include Bayesian analysis, regression analysis, and self organising feature maps (SOMs or Kohonen maps). In addition to some standardisation efforts, freely available open source software has been developed, including Orange, RapidMiner, Weka, KNIME, and packages in the GNU R Project which include data mining processes. Most of these systems are able to import and export models in PMML (Predictive Model Markup Language), which provides a standard way to represent data mining models so that they can be shared between different statistical applications. PMML is an XML based language developed by the Data Mining Group (DMG), an independent group composed of many companies interested in data mining. PMML version 4.0 was released in June 2009, see http://www.dmg.org.
Useful places to look at activities around statistical analysis and data mining are the communities using the Analytic Bridge Web 2.0 tool http://www.analyticbridge.com/ and KDD Nuggets http://www.kdnuggets.com/.
Computational steering is a potentially valuable procedure for scientific investigation in which the parameters of a running program can be altered interactively and the results visualised. As an investigative paradigm its history goes back more than ten years. With the advent of Grid computing, the range of problems that can be tackled interactively has widened, although some implementations have been very complicated. Applications specialists now see the prospect of accomplishing ``real'' science using computational steering whilst, at the same time, computing researchers are working to improve the supporting infrastructure of middleware, tool kits and environments that make this possible.
EPSRC funded a collaboration network to develop and promote computational steering techniques from 2006-2008, see http://compusteer.dcs.hull.ac.uk/index.php.
Computational steering typically requires remote visualisation, see above. Some packages available include the following.
A survey and comparison is presented in .
This work was partly funded by EPSRC through the SLA with Daresbury Laboratory.
The author would like to thank the following people for input:
Srikanth Nagella (RAL), Martin Turner (Manchester), Paul Watry (Liverpool),
Twister Iterative MapReduce, Pervasive Technology Institute, Indiana University. Web site: http://www.iterativemapreduce.org/
This document was generated using the LaTeX2HTML translator Version 2008 (1.71)
The translation was initiated by Rob Allan on 2014-05-19