Abstract

To select informative variables and thereby improve ensemble performance in random forests (RF), this work proposes a modified RF method, random forest combined with Monte Carlo and uninformative variable elimination (MC-UVE-RF), for multi-class classification of near-infrared (NIR) spectroscopy data. The MC step increases the diversity of the classification trees in the RF, and the UVE step gradually eliminates the less important variables according to a variable reliability obtained by aggregating the sub-models. Together, these two steps constitute a variable selection process. For comparison, conventional RF, model population analysis combined with RF (MPA-RF) and support vector machine (SVM) were also applied to the discrimination of tobacco grades by NIR spectroscopy. In the coarse classification, MC-UVE-RF showed a marked superiority in assigning tobacco samples to high-quality, medium-quality and low-quality groups, with external validation accuracies of 100% and 96.83% for datasets I and II, respectively. In the refined classification, which subdivides the high-quality, medium-quality and low-quality groups, the external validation accuracies were 88.46%, 97.22% and 96% for dataset I and 100%, 97.14% and 100% for dataset II, respectively, all better than or equal to those of the other methods. MC-UVE-RF is therefore a powerful alternative for multi-class classification problems. Coupled with NIR technology, it could also serve as a fast and effective replacement for manual judgment in the discrimination of tobacco leaf grades.
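The two-step selection loop summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that variable reliability is the mean of a variable's importance across MC sub-models divided by its standard deviation, as in classical MC-UVE; the abstract does not give the exact definition. The names `reliability`, `mc_uve_select`, `make_toy_data` and `class_gap_importance` are hypothetical, as are the parameters `n_mc`, `sample_frac` and `drop_frac`. A simple class-mean gap on synthetic data stands in for the RF variable importance.

```python
import random
from statistics import mean, stdev


def reliability(importances):
    """Per-variable mean/std of importances[k][j] (sub-model k, variable j).

    ASSUMPTION: mean/std is the classical MC-UVE reliability; the source
    does not state the exact formula used in MC-UVE-RF.
    """
    scores = []
    for col in zip(*importances):
        m, s = mean(col), stdev(col)
        if s > 0:
            scores.append(m / s)
        else:
            # Constant importance: never useful (0) or perfectly stable (inf).
            scores.append(0.0 if m == 0 else float("inf"))
    return scores


def mc_uve_select(n_samples, n_vars, importance_fn, n_keep,
                  n_mc=50, sample_frac=0.8, drop_frac=0.25, seed=0):
    """Gradually drop the least reliable variables until n_keep remain."""
    rng = random.Random(seed)
    active = list(range(n_vars))
    k = max(2, int(sample_frac * n_samples))
    while len(active) > n_keep:
        # MC step: one sub-model per random subset of the training samples.
        runs = [importance_fn(rng.sample(range(n_samples), k), active)
                for _ in range(n_mc)]
        scores = reliability(runs)
        # UVE step: remove the lowest-reliability fraction of active variables.
        n_drop = min(len(active) - n_keep, max(1, int(drop_frac * len(active))))
        ranked = sorted(range(len(active)), key=lambda i: scores[i])
        dropped = set(ranked[:n_drop])
        active = [v for i, v in enumerate(active) if i not in dropped]
    return active


def make_toy_data(n=60, seed=1):
    """Synthetic two-class data: variables 0 and 1 informative, 2 and 3 noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[5 * c + rng.gauss(0, 1), -5 * c + rng.gauss(0, 1),
          rng.gauss(0, 1), rng.gauss(0, 1)] for c in y]
    return X, y


def class_gap_importance(X, y):
    """Stand-in for RF importance: absolute gap between class means."""
    def fn(idx, active):
        out = []
        for j in active:
            a = [X[i][j] for i in idx if y[i] == 1]
            b = [X[i][j] for i in idx if y[i] == 0]
            out.append(abs(mean(a) - mean(b)))
        return out
    return fn


X, y = make_toy_data()
selected = mc_uve_select(len(X), 4, class_gap_importance(X, y), n_keep=2)
print(selected)
```

In an actual application, `importance_fn` would presumably fit a random forest on the sampled spectra and return its per-variable importances for the active wavelengths; the stopping rule and the fraction dropped per round are design choices that this sketch fixes arbitrarily.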