A database of metazoan cytochrome c oxidase subunit I gene sequences derived from GenBank with CO-ARBitrator

Authors: Heller Philip*; Casaletto James; Ruiz Gregory; Geller Jonathan
Source: Scientific Data, 2018, 5(1): 180156.
DOI: 10.1038/sdata.2018.156

Abstract

The Cytochrome C Oxidase subunit I gene ("COI") is the de facto standard for animal DNA barcoding. Organism identification based on COI requires an accurate and extensive annotated database of COI sequences. Such a database can also be of value in reconstructing evolutionary history and in diversity studies. Two COI databases are currently available: BOLD and Midori. BOLD's submissions conform to stringent sequence and metadata requirements; BOLD is specific to COI but makes no attempt to be comprehensive. Midori, derived from GenBank, has more sequences but less stringent standards than BOLD, resulting in higher error rates. To address the need for a comprehensive and accurate COI database, we adapted the ARBitrator algorithm, which classifies based only on sequence properties and has successfully auto-curated bacterial genes mined from GenBank. The adapted algorithm, which we call CO-ARBitrator, built a database of over a million metazoan COI sequences. Its sensitivity and specificity are significantly higher than those of Midori. Its specificity is comparable to what BOLD achieves with data quality prerequisites. Results and software are publicly available.

  • Publication date: 2018-8-7