Unsupervised Binning of Metagenomic Assembled Contigs Using Improved Fuzzy C-Means Method

Authors: Liu, Yun*; Hou, Tao; Kang, Bing; Liu, Fu
Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017, 14(6): 1459-1467.
DOI: 10.1109/TCBB.2016.2576452

Abstract

Binning metagenomic contigs is a necessary step in metagenome analysis. After assembly, the number of contigs belonging to each genome is usually unequal, so a metagenomic contig dataset is imbalanced, and the traditional fuzzy c-means method (FCM) does not handle it well. In this paper, we introduce an improved version of the fuzzy c-means method (IFCM) for metagenomic contig binning. First, tetranucleotide frequencies are calculated for every contig. Second, the number of bins is roughly estimated from the distribution of genome lengths in a complete set of non-draft sequenced microbial genomes from NCBI. Then, IFCM is used to cluster the DNA contigs using the estimated number of bins. Finally, a clustering validity function is used to determine the binning result. We tested this method on one synthetic and two real datasets, and the experimental results show its effectiveness compared with other tools.
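
To make the first and third steps concrete, the following is a minimal sketch in Python, not the authors' implementation. It computes a 256-dimensional tetranucleotide frequency vector per contig and then runs the standard fuzzy c-means update. The abstract does not describe the modifications that make up IFCM, so they are not reproduced here. The function names `tetranucleotide_frequencies` and `fuzzy_c_means`, the fuzzifier `m = 2.0`, the fixed iteration count, and the choice to count 4-mers on a single strand while skipping windows with ambiguous bases are all assumptions made for illustration.

```python
from itertools import product
import random

# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tetranucleotide_frequencies(seq):
    """Return the normalized 4-mer frequency vector of one contig.

    Assumption: single-strand counting; windows containing characters
    other than A, C, G, T (for example N) are skipped.
    """
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = KMER_INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def fuzzy_c_means(points, c, m=2.0, n_iter=100, seed=0):
    """Standard (unmodified) fuzzy c-means, NOT the paper's IFCM.

    points: list of equal-length feature vectors
    c:      number of clusters (bins)
    m:      fuzzifier, must be > 1
    Returns (centers, memberships); each membership row sums to 1.
    """
    rng = random.Random(seed)
    n, d = len(points), len(points[0])

    # Random initial memberships, normalized per point.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(row)
        u.append([v / s for v in row])

    centers = []
    for _ in range(n_iter):
        # Center update: mean of the points weighted by u_ij^m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            sw = sum(w)
            centers.append(
                [sum(w[i] * points[i][k] for i in range(n)) / sw
                 for k in range(d)]
            )
        # Membership update: u_ij proportional to D_ij^(-1/(m-1)),
        # where D_ij is the squared Euclidean distance to center j.
        for i in range(n):
            sq = [
                max(sum((points[i][k] - centers[j][k]) ** 2
                        for k in range(d)), 1e-12)
                for j in range(c)
            ]
            inv = [v ** (-1.0 / (m - 1.0)) for v in sq]
            s = sum(inv)
            u[i] = [v / s for v in inv]
    return centers, u
```

In a binning workflow, each contig would be mapped to its frequency vector, and the resulting vectors passed to the clustering routine with the estimated number of bins as `c`. A contig can then be assigned to the bin with the highest membership value.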