Abstract

k-mer frequencies are widely used as numerical features of DNA fragments in microbial DNA recognition. However, achieving high identification accuracy often requires extracting a feature vector of nearly ten thousand dimensions from each DNA fragment to serve as a species label. Such high dimensionality leads to excessive computational cost. Rough set theory is well suited to attribute reduction but can handle only discrete data, so this paper presents a new OTSU discretization method. Experiments on signals from 30 microbial strains and on 6 UCI datasets show that, after discretization with this method, rough set theory yields fewer feature dimensions and higher classification accuracy: the number of features is reduced by 69.53%, accuracy is 6.28% higher, and running time is reduced by 78.38%.
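
To make the two ingredients named above concrete, the following is a minimal sketch, not the method proposed in this paper. It assumes the classic single-threshold Otsu criterion (maximizing between-class variance) applied to one feature column, and normalized k-mer counts over the alphabet ACGT. The function names `kmer_frequencies`, `otsu_threshold`, and `discretize` are hypothetical and chosen only for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Return normalized k-mer frequencies, ordered lexicographically over ACGT."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # assumption: skip windows with ambiguous bases such as N
            counts[j] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * len(kmers)


def otsu_threshold(values):
    """Classic Otsu criterion on a non-empty list of numbers.

    Tries every cut between distinct sorted values and keeps the one that
    maximizes the between-class variance w0 * w1 * (mu0 - mu1) ** 2.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_cut, best_var = xs[0], -1.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # no cut can separate equal values
        w0 = i / n
        w1 = 1.0 - w0
        mu0 = left_sum / i
        mu1 = (total - left_sum) / (n - i)
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var = var
            best_cut = (xs[i - 1] + xs[i]) / 2
    return best_cut


def discretize(values, cut):
    """Map each value to 1 if it lies above the cut, else 0."""
    return [1 if v > cut else 0 for v in values]
```

In this sketch each feature column (one k-mer's frequency across all fragments) would be thresholded independently, producing the discrete table that rough set attribute reduction requires. The vector length is 4^k, so k = 6 gives 4096 features and k = 7 gives 16384.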

Full text