Abstract

Although high-throughput sequencing is an efficient approach for studying the structure of microbial communities in detail, it is still impossible to enumerate all individuals with this method. A common strategy is therefore to generate sampling datasets that are representative of the assemblages. However, the representativeness of these sampling datasets has never been assessed. In this study, we developed a method to determine the minimum number of sequences that must be analyzed to obtain a reliable description of microbial community structure. First, a set of datasets from microbial communities was constructed by in silico sampling at different depths. Second, an equation relating the dissimilarity among the sampling datasets to sampling depth was fitted and used to predict the minimum number of 16S rRNA gene sequences. Finally, we verified the method using empirical data on microbiota from a laboratory mesocosm. Our method showed that at least 5,528,079 sequences were required to reliably characterize the microbial communities inhabiting the mesocosms. However, if only dominant OTUs (> 1%) were considered, thousands of sequences were sufficient. This promising method provides a criterion for ensuring sequencing sufficiency when analyzing the structure of natural microbial communities.
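
The three steps summarized above (in silico sampling at several depths, fitting dissimilarity against depth, and solving for the required depth) can be illustrated with a minimal sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: Bray-Curtis as the dissimilarity measure, a power-law (log-log linear) fit, a dissimilarity threshold of 0.05, and a synthetic OTU count vector. The function names `bray_curtis`, `mean_dissimilarity`, and `estimate_min_depth` are hypothetical.

```python
import itertools

import numpy as np


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two relative-abundance vectors."""
    return 0.5 * np.abs(p - q).sum()


def mean_dissimilarity(counts, depth, n_rep, rng):
    """Mean pairwise dissimilarity among replicate sub-samples of one depth."""
    # Draw `depth` reads without replacement, `n_rep` times.
    draws = rng.multivariate_hypergeometric(counts, depth, size=n_rep)
    profiles = draws / depth
    pairs = itertools.combinations(range(n_rep), 2)
    return np.mean([bray_curtis(profiles[i], profiles[j]) for i, j in pairs])


def estimate_min_depth(counts, depths, threshold=0.05, n_rep=5, seed=0):
    """Fit log(dissimilarity) = intercept + slope * log(depth) and solve
    for the depth at which the fitted dissimilarity equals `threshold`.

    The power-law form and the threshold are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    d = np.array([mean_dissimilarity(counts, n, n_rep, rng) for n in depths])
    slope, intercept = np.polyfit(np.log(depths), np.log(d), 1)
    n_min = np.exp((np.log(threshold) - intercept) / slope)
    return slope, intercept, n_min


if __name__ == "__main__":
    # Synthetic community: 300 OTUs with geometrically decaying abundance.
    counts = np.maximum(1, 1e5 * 0.97 ** np.arange(300)).astype(np.int64)
    depths = [500, 1000, 2000, 5000, 10000]
    slope, intercept, n_min = estimate_min_depth(counts, depths)
    print(f"fitted slope: {slope:.3f}")
    print(f"estimated minimum depth: {n_min:.0f} sequences")
```

In this sketch the predicted depth depends strongly on the chosen threshold and on how many rare OTUs the community contains, which is consistent with the large gap reported above between characterizing the whole community and characterizing only the dominant OTUs.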