Sakai Applications: Bioinformatics

Bioinformatics Applications

Genomic sequence analysis is notoriously resource-intensive and is often performed on multi-node computer systems. It is increasingly recognized that advances in modern sequence analysis will require state-of-the-art statistics. Although R is not an ideal environment for handling strings, it provides unique access to a wide range of statistical methods and integrates well with C/C++ code. A library of building blocks that provides seamless access to computational resources from both R and C will greatly facilitate the development of efficient large-scale analyses.

The R system is, moreover, rapidly becoming the tool of choice for biologists involved in microarray data analysis. To encourage this, statisticians are developing a collection of tools and extensions and making them available to the community via the Bioconductor project. Perl was the first programming language to become widely accepted among biologists, but it lacks access to high-quality statistical methods. Biologists with an interest in data analysis can therefore be expected to become fluent in the R language, which will give them the ability to combine building blocks into more complex analyses.

Easier access to computing resources will allow more sophisticated analyses. For example, the design of oligonucleotide probes for DNA microarrays faces a twofold challenge: predicting, from nucleotide sequences alone, both the thermodynamic hybridization properties of probe candidates and their likely cross-reactions with non-targets. Present approaches are limited to traditional sequence similarity searches and rough approximations of thermodynamic properties. Transparent access to distributed and specialized computing resources will allow DNA folding criteria to enter the scan for permissible probe candidates much earlier, improving the quality of the final probe set. Similarly, more realistic calculations of hybridization behaviour would become feasible on a large scale.
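To illustrate what a "rough approximation" of thermodynamic properties can look like, the following Python sketch scans a target sequence for probe candidates using the Wallace rule, a well-known rule of thumb for short oligonucleotides that estimates the melting temperature as 2 °C per A/T base plus 4 °C per G/C base. The function names, the window length and the temperature bounds are assumptions made for this illustration and are not part of the project; Python is used only for brevity, and the same logic could equally be written in R or C.

```python
from collections import Counter


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    counts = Counter(seq.upper())
    return (counts["G"] + counts["C"]) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C (Wallace rule).

    Tm = 2 * (A + T) + 4 * (G + C). This ignores neighbouring-base
    effects, salt concentration and secondary structure, which is
    exactly the kind of simplification the text describes.
    """
    counts = Counter(seq.upper())
    return 2 * (counts["A"] + counts["T"]) + 4 * (counts["G"] + counts["C"])


def candidate_probes(target: str, length: int, tm_min: int, tm_max: int):
    """Slide a window over the target and yield (start, probe, tm)
    for every window whose estimated Tm lies in [tm_min, tm_max]."""
    for start in range(len(target) - length + 1):
        probe = target[start:start + length]
        tm = wallace_tm(probe)
        if tm_min <= tm <= tm_max:
            yield start, probe, tm


if __name__ == "__main__":
    # Hypothetical target sequence, for demonstration only.
    for start, probe, tm in candidate_probes("ATGCGTACGTTAGCCGATAG", 12, 34, 38):
        print(start, probe, tm, round(gc_fraction(probe), 2))
```

A filter of this kind is cheap enough to run on a single machine. The more realistic criteria discussed above, such as DNA folding, would replace `wallace_tm` with a far more expensive calculation, which is where access to distributed resources becomes relevant.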

Three aspects play a major role in this context: (i) the availability of large computational resources, (ii) library support for allocating resources to sub-tasks and for collecting and integrating results, and (iii) accessibility for non-experts in the application sciences. While shared resources can address issue (i), the overhead of developing and deploying solutions that adequately exploit heterogeneous shared resources is prohibitive for any individual laboratory in the applied sciences.
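Aspect (ii) amounts to a scatter/gather pattern: split the work into sub-tasks, hand each to a resource, then collect and integrate the partial results. The Python sketch below shows the shape of such a building block under simplifying assumptions: local threads stand in for distributed compute nodes, and the names `split_into_chunks`, `scatter_gather` and `count_gc` are hypothetical, chosen only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal parts."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def scatter_gather(items, worker, combine, n_workers=4):
    """Run worker on each chunk in parallel, then integrate the results.

    Assumption: a thread pool stands in for remote resources here; a
    real deployment would dispatch the chunks to other machines.
    """
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker, chunks))
    return combine(partial_results)


def count_gc(sequences):
    """Example sub-task: count G and C bases in a batch of sequences."""
    return sum(s.upper().count("G") + s.upper().count("C") for s in sequences)


if __name__ == "__main__":
    sequences = ["ATGC", "GGCC", "AATT"]
    print(scatter_gather(sequences, count_gc, sum, n_workers=2))
```

The caller supplies only the per-chunk function and the rule for combining results. Hiding the allocation and collection steps behind one call in this way is also what would serve aspect (iii), accessibility for non-experts.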

Activities and Deliverables