Abstract

Partial least squares-discriminant analysis (PLS-DA) is the most widely used statistical tool for classification and biomarker screening in metabolomics. However, PLS-DA tends to overfit the data, and its selection of biomarkers is often unstable because uninformative variables disturb the principal components. In this paper, we propose an algorithm that performs stable biomarker screening and seeks optimal generalization performance. Biomarkers are identified by sparse regularization variable selection combined with subsampling (SRS), and classification is then performed by a linear support vector machine (SVM) classifier in the selected-variable space to maximize classification accuracy. Two metabolomics datasets measured by gas chromatography-mass spectrometry are used to evaluate the proposed SRS-SVM algorithm and to compare it with existing related algorithms. The results show that the SRS-SVM algorithm outperforms PLS-DA and is competitive with the other related algorithms in predictive classification accuracy, as measured by both internal and external validation. Furthermore, the candidate biomarkers selected by the SRS-SVM algorithm are quite stable, which makes it a competitive alternative for the analysis of metabolomics data. The R code implementing the SRS-SVM algorithm is available in the Electronic supplementary material.
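As a rough illustration of the subsampling idea behind SRS, the following Python sketch repeatedly fits a sparse linear model on random subsamples and records how often each variable receives a nonzero coefficient. It is not the authors' R implementation. The lasso penalty, the coordinate-descent solver, the function names (`lasso_cd`, `selection_frequencies`), and the parameters `lam`, `n_subsamples`, and `frac` are all assumptions made for illustration, since the abstract does not specify the regularizer or the subsampling scheme.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    X is a list of rows and y a list of responses, both assumed centered.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def selection_frequencies(X, y, lam, n_subsamples=50, frac=0.5, seed=0):
    """Fraction of subsamples in which each variable gets a nonzero weight."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = max(2, int(frac * n))
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), m)  # subsample without replacement
        Xs = [X[i] for i in idx]
        ys = [y[i] for i in idx]
        # Center within the subsample so no intercept term is needed.
        y_mean = sum(ys) / m
        col_means = [sum(row[j] for row in Xs) / m for j in range(p)]
        Xc = [[row[j] - col_means[j] for j in range(p)] for row in Xs]
        yc = [v - y_mean for v in ys]
        beta = lasso_cd(Xc, yc, lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_subsamples for c in counts]
```

Variables whose selection frequency exceeds a chosen cutoff would be kept as candidate biomarkers, and a linear SVM would then be trained on those variables only. The cutoff and the SVM step are omitted here because the abstract gives no details about them.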