Mining large-scale repetitive sequences in a MapReduce setting

Cao Hongfei; Phinney Michael; Petersohn Devin; Merideth Benjamin; Shyu Chi Ren*

doi:10.1504/IJDMB.2016.074873

Abstract

Recent research suggests DNA repeats play critical roles in cellular regulatory functions and disease development. The challenge associated with identifying repeats across a collection of genomes arises from the amount of data stored within DNA, and intermediate data generated by alignment- and hash-based approaches are substantial. We present a MapReduce-based method for repeat identification and propose efficient storage and search techniques. Our approach distributes the computation and storage across a cluster of commodity computers, lending a cost-effective, flexible, robust, and scalable solution to the challenge of identifying various types of repetitive sequences across a collection of genomes. In this study, we benchmark our method using a collection of six genomes, totalling approximately 14.2 billion base pairs. We demonstrate a tenfold speedup over previous state-of-the-art approaches and linear scalability. In addition, we conduct a deeper scalability analysis by processing a collection of 39 genomes, approximately 104 billion base pairs.

Publication date: 2016


