MedTech I.Q.

The Cutting Edge of Medical Technology Content, Community & Collaboration

Colleagues,

As reported in GenomeWeb ... See excerpts from a post by Michael Schatz of the Johns Hopkins Center for Bioinformatics & Computational Biology (CBCB) outlining how Crossbow, an open-source, Hadoop-enabled pipeline, quickly, accurately, and cheaply analyzes human genomes using cloud computing ...

... DNA sequencing has improved tremendously since the completion of the human genome project in 2003, and it is now possible to sequence a genome in a few days for about $50,000. This more-than-one-thousand-fold improvement in throughput and cost is spurring a new era of biomedical research, where genomes from many individuals are sequenced and studied over the course of one project...

... While sequencing has undoubtedly become an important and ubiquitous tool, the rapid improvements in sequencing technology have created a “firehose” problem of how to store and analyze the huge volume of DNA sequence data being generated. The human genome is about 3 billion DNA nucleotides (characters), about the same as the English portion of Wikipedia...

... because of the limitations of DNA sequencing technology, we cannot simply read an entire genome end-to-end. Instead the machine reports a very large number of tiny fragments called reads, each 25-500 letters long, collected from random locations in the genome. Then, much like how raindrops will eventually cover the whole sidewalk, we can sequence an entire genome by sequencing many billions of reads, with 20-fold to 30-fold oversampling to ensure each nucleotide is seen. Presently, this process generates about 100GB of compressed data (read sequences and associated quality scores) for one human genome...
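The oversampling arithmetic in the excerpt above can be sketched as follows. The genome size and 30-fold coverage figures come from the post itself; the 100-letter read length is an illustrative choice within the quoted 25-500 letter range, and the calculation is just the standard coverage formula:

```python
# Back-of-envelope sketch of the oversampling arithmetic described above.
GENOME_SIZE = 3_000_000_000   # ~3 billion nucleotides (from the excerpt)
READ_LENGTH = 100             # illustrative; reads are 25-500 letters long
COVERAGE = 30                 # 30-fold oversampling (from the excerpt)

# Total bases sequenced, and the number of reads that implies.
total_bases = GENOME_SIZE * COVERAGE
num_reads = total_bases // READ_LENGTH

print(f"{total_bases:,} bases sequenced")          # 90,000,000,000
print(f"{num_reads:,} reads of {READ_LENGTH} bp")  # 900,000,000
```

At a 100-letter read length this works out to roughly a billion reads per genome, consistent with the "many billions" order of magnitude quoted for shorter reads.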

... the problem of mapping and scanning 100GB of data isn’t too onerous, especially for large sequencing centers with large compute grids ... The “problem” is that sequencing technology continues to improve, and pretty soon a single sequencing machine will generate 100GB of data in a few hours. If our computational methods aren’t as efficient as our sequencing methods, we’ll only get further and further behind as more and more data arrives. Clearly we need very efficient and scalable methods if we hope to keep up, especially as sequencing moves from large sequencing centers to smaller research centers, and perhaps eventually to hospitals and clinical labs...

... This is exactly the problem Crossbow aims to solve. Crossbow combines one of the fastest sequence alignment algorithms, Bowtie, with a very accurate genotyping algorithm, SoapSNP, within Hadoop to distribute and accelerate the computation (CC's comment ... in the "Cloud"). The pipeline can accurately analyze an entire genome in one day on a 10-node local cluster, or in about three hours for less than $100 using a 40-node, 320-core cluster rented from Amazon’s EC2 utility (CC's comment ... "Cloud") computing service ...

... As sequencing reaches an ever wider audience and becomes used in small labs, Crossbow will enable computational analysis without requiring researchers to own or maintain their own compute infrastructure (CC's comment ... by using the "Cloud").

... This is a compelling result from both a user's and a systems perspective: it is an accurate, fast, and cheap way of squeezing 1000 hours of computation into an afternoon...
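The "1000 hours into an afternoon" figure checks out against the cluster numbers quoted earlier. A quick sketch, assuming ideal parallel scaling (which real pipelines only approximate):

```python
# Rough check of the figures quoted above: a 40-node, 320-core EC2
# cluster finishing in ~3 hours of wall time delivers roughly 1000
# single-core hours of work.
CORES = 320       # from the excerpt: 40 nodes, 320 cores
WALL_HOURS = 3    # from the excerpt: "about three hours"

core_hours = CORES * WALL_HOURS     # total single-core hours of work
print(core_hours)  # 960, i.e. ~1000 hours squeezed into an afternoon
```

The gap between 960 core-hours and a literal 1000 reflects that this is an order-of-magnitude claim, not an exact benchmark.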

For more information see: http://bowtie-bio.sf.net/crossbow.

Schatz is also the man behind CloudBurst, a parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome as well as other reference genomes. CloudBurst, which is based on the short read mapping program RMAP, also uses Hadoop to parallelize execution across multiple nodes (CC's comment ... in the "Cloud").
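To give a flavor of how Hadoop parallelizes read mapping, here is a toy, single-process sketch of the map/shuffle/reduce pattern such mappers use: key both reference positions and reads by a short seed k-mer, group by key, then pair them up. This is an illustration of the general seed-and-shuffle idea only, not CloudBurst's actual implementation (real mappers use longer seeds and extend candidate hits with a scored alignment):

```python
from collections import defaultdict

K = 4  # seed length (illustrative; real mappers use longer seeds)

def map_reference(reference):
    """Map step over the reference: emit (seed, position) per k-mer."""
    for i in range(len(reference) - K + 1):
        yield reference[i:i + K], i

def map_reads(reads):
    """Map step over the reads: emit (seed, read) keyed on the leading k-mer."""
    for read in reads:
        yield read[:K], read

def shuffle_and_reduce(reference, reads):
    """Group by seed (the 'shuffle'), then report matching positions."""
    index = defaultdict(list)
    for seed, pos in map_reference(reference):
        index[seed].append(pos)
    hits = {}
    for seed, read in map_reads(reads):
        # A real mapper would extend and score each candidate alignment;
        # here we keep only exact matches of the whole read.
        hits[read] = [p for p in index.get(seed, [])
                      if reference[p:p + len(read)] == read]
    return hits

reference = "ACGTACGTTTGACGT"
print(shuffle_and_reduce(reference, ["ACGTA", "TTGAC"]))
# {'ACGTA': [0], 'TTGAC': [8]}
```

In Hadoop the two map steps run in parallel across many nodes and the framework performs the shuffle, which is what lets pipelines like this scale with cluster size.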

Read on at: http://www.genomeweb.com/blog/genome-analysis-clouds?emc=el&m=5...

ENJOY!

CC


© 2024   Created by CC-Conrad Clyburn-MedForeSight.
