Recent advances in high-throughput DNA sequencing machines (e.g., Illumina Genome Analyzer and HiSeq) have revolutionized bioscience research. The raw sequencing cost per megabase has plummeted from more than five thousand dollars fifteen years ago to less than a dollar today. As a result, the growth of sequencing data has outpaced Moore’s Law, which describes the pace at which lithographic technologies reduce the feature sizes of silicon circuits.
Consequently, current scientific methods for real-time big data genome analysis demand more compute cycles per processor than ever before. Intel Hyper-Threading, which offers only two simultaneous threads per core, has been found to limit the performance of these recent big data genome analysis techniques. For example, in multiple prior Hadoop-based genome analysis attempts on existing Intel-based LSU HPC resources, we could not analyze a large 3.2TB metagenome dataset in a reasonable period of time, even on more than 100 nodes.
On the other hand, OpenPOWER technologies, such as IBM POWER8 (with 8-way SMT) and CAPI, with several orders of magnitude more computational power, are rapidly becoming the natural platform to further drive genome research. In this talk, we will discuss our evaluation of the IBM POWER8 system using our Hadoop-based benchmark genome assembler. In particular, we highlight how we analyzed a 3.2TB metagenome dataset with Hadoop in only 6.5 hours on a cluster of 40 POWER8 S824L nodes, producing a 6.6TB graph data structure called a de Bruijn graph.
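For readers unfamiliar with the de Bruijn graph mentioned above, the following is a minimal single-machine sketch in Python. It is not the Hadoop-based assembler evaluated in the talk; the function name `build_de_bruijn`, the toy reads, and the choice of k = 3 are illustrative assumptions. Each k-mer in a read contributes one edge from its (k-1)-mer prefix to its (k-1)-mer suffix.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph as an adjacency map.

    Nodes are (k-1)-mers; every k-mer found in a read adds one edge
    from its prefix node to its suffix node. Edge values count how
    often that k-mer was seen across all reads.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


# Hypothetical toy reads; real metagenome reads are far longer and more numerous.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=3)

for prefix in sorted(graph):
    for suffix, count in sorted(graph[prefix].items()):
        print(f"{prefix} -> {suffix} (seen {count}x)")
```

In a MapReduce setting, one plausible arrangement (again an assumption, not a description of the authors' implementation) is to emit k-mers in the map phase and group edges by prefix node in the reduce phase, which is why the intermediate graph can grow larger than the input data.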
Dr. Seung-Jong Park is an associate professor in the School of Electrical Engineering & Computer Science and the Center for Computation & Technology at Louisiana State University. He received his Ph.D. from the School of Electrical and Computer Engineering at the Georgia Institute of Technology in 2004. His interdisciplinary research involves (1) distributed computing, ranging from cloud computing over high-speed optical networks to mobile computing over wireless networks; and (2) computational science, developing large-scale whole-genome sequence analysis software tools for high-performance computing and high-speed networking. He has developed many protocols and cyberinfrastructures through projects including NSF CC-NIE, NSF Global Environment for Network Innovations (GENI), NSF Major Research Instrumentation (MRI), and Office of Naval Research projects, among others.