Mining large-scale repetitive sequences in a MapReduce setting

Authors: Cao Hongfei; Phinney Michael; Petersohn Devin; Merideth Benjamin; Shyu Chi Ren*
Source: International Journal of Data Mining and Bioinformatics, 2016, 14(3): 210-228.
DOI: 10.1504/IJDMB.2016.074873

Abstract

Recent research suggests that DNA repeats play critical roles in cellular regulatory functions and disease development. Identifying repeats across a collection of genomes is challenging because of the volume of data stored within DNA and because the intermediate data generated by alignment- and hash-based approaches are substantial. We present a MapReduce-based method for repeat identification and propose efficient storage and search techniques. Our approach distributes the computation and storage across a cluster of commodity computers, providing a cost-effective, flexible, robust, and scalable solution to the challenge of identifying various types of repetitive sequences across a collection of genomes. In this study, we benchmark our method using a collection of six genomes, totalling approximately 14.2 billion base pairs. We demonstrate a tenfold speedup over previous state-of-the-art approaches and linear scalability. In addition, we conduct a deeper scalability analysis by processing a collection of 39 genomes, approximately 104 billion base pairs.

  • Publication date: 2016