GapReduce: A Gap Filling Algorithm Based on Partitioned Read Sets

Authors: Luo, Junwei; Wang, Jianxin*; Shang, Juan; Luo, Huimin; Li, Min; Wu, Fangxiang; Pan, Yi
Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2020, 17(3): 877-886.
DOI: 10.1109/TCBB.2018.2789909

Abstract

With advances in sequencing and assembly technologies, draft sequences are available for more and more genomes. However, these draft sequences commonly contain gaps, which affect various downstream analyses in biological studies. Gap filling methods can shorten the gaps and improve the completeness of draft genome sequences. Although some gap filling tools have been developed, their effectiveness and accuracy need to be improved. In this study, we develop a novel tool, called GapReduce, which fills gaps using paired reads. For each gap, GapReduce selects the reads whose mates are aligned to the left or the right flanking region, and partitions these reads into two sets. GapReduce then uses different k values and k-mer frequency thresholds to iteratively construct De Bruijn graphs, which are used to find the correct path that fills the gap. To overcome the branching problems caused by repetitive regions and sequencing errors during path selection, GapReduce introduces a novel approach that simultaneously considers k-mer frequency and the distribution of paired reads based on the partitioned read sets. We compare the performance of GapReduce with that of current popular gap filling tools. The experimental results demonstrate that GapReduce produces satisfactory gap filling results, especially for datasets with long insert sizes. GapReduce is publicly available for download at https://github.com/bioinfomaticsCSU/GapReduce.
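
To make the workflow described in the abstract more concrete, the following is a minimal Python sketch, not the GapReduce implementation. It partitions reads by the side on which their mate aligned, builds a De Bruijn graph for a given k and k-mer frequency threshold, and retries with other settings when no path across the gap is found. All function names, the `(sequence, mate_side)` input format, and the order in which settings are tried are assumptions made for illustration. Branches are resolved greedily by k-mer frequency alone, which is a simplification: the abstract states that GapReduce also uses the distribution of paired reads across the two sets.

```python
from collections import Counter, defaultdict


def partition_reads(pairs):
    """Split reads by the flank their mate aligned to (assumed input format:
    (read_sequence, mate_side) with mate_side in {"left", "right"})."""
    left_set = [seq for seq, side in pairs if side == "left"]
    right_set = [seq for seq, side in pairs if side == "right"]
    return left_set, right_set


def build_graph(reads, k, min_freq):
    """De Bruijn graph: nodes are (k-1)-mers, edges are k-mers seen >= min_freq times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    graph = defaultdict(list)
    for kmer, n in counts.items():
        if n >= min_freq:
            graph[kmer[:-1]].append((kmer[1:], n))
    return graph


def fill_gap(left_flank, right_flank, reads, k, min_freq, max_gap=200):
    """Walk from the end of the left flank to the start of the right flank.
    Returns the gap sequence, or None if no path is found.
    Assumes both flanks are at least k-1 bases long."""
    graph = build_graph(reads, k, min_freq)
    node, target = left_flank[-(k - 1):], right_flank[:k - 1]
    seen, path = {node}, []
    while node != target:
        successors = graph.get(node)
        if not successors or len(path) >= max_gap + k - 1:
            return None
        # Simplification: follow the most frequent outgoing k-mer.
        node = max(successors, key=lambda s: s[1])[0]
        if node in seen:  # stop on cycles (e.g. repeats)
            return None
        seen.add(node)
        path.append(node[-1])
    if len(path) < k - 1:
        return None
    # The last k-1 bases of the walk belong to the right flank.
    return "".join(path[:len(path) - (k - 1)])


def fill_gap_iterative(left_flank, right_flank, reads, settings):
    """Try each (k, min_freq) pair in turn until one yields a path."""
    for k, min_freq in settings:
        filled = fill_gap(left_flank, right_flank, reads, k, min_freq)
        if filled is not None:
            return filled
    return None


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTTACG"  # toy sequence; the gap is genome[6:11]
    reads = [genome[i:i + 8] for i in range(len(genome) - 7)]
    pairs = [(r, "left" if i % 2 == 0 else "right") for i, r in enumerate(reads)]
    left_set, right_set = partition_reads(pairs)
    print(fill_gap_iterative("ACGTTG", "CTTACG", left_set + right_set,
                             [(6, 3), (4, 2)]))
```

In the toy run, the first setting is too strict for the available coverage, so the sketch falls back to the second one. This mirrors the idea of iterating over k values and frequency thresholds, under the assumptions stated above.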