A new validity index of feature subset for evaluating the dimensionality reduction algorithms

Authors: Liu, Chuan; Wang, Wenyong*; Konan, Martin; Wang, Siyang; Huang, Lisheng; Tang, Yong; Zhang, Xiang
Source: Knowledge-Based Systems, 2017, 121: 83-98.
DOI: 10.1016/j.knosys.2017.01.017

Abstract

A critical aspect of dimensionality reduction is properly assessing the quality of the selected (or produced) feature subsets. In machine learning, feature subset assessment refers to splitting a given feature subset into a training set, which is used to estimate the parameters of a classification model, and a test set, which is used to estimate the predictive performance of the model. The results of multiple splits are then averaged (i.e., Cross-Validation, CV) to decrease the variance of the estimator. In practice, however, the CV scheme is computationally very expensive. In this paper, we propose a new statistical index, called the LW-index, for evaluating feature subsets and dimensionality reduction algorithms in general. The proposed method is a type of "classical statistics" approach that uses the feature subset to compute an empirical estimate of the quality of the feature subset. A large number of performance comparisons with the machine learning approach, conducted on fourteen benchmark collections, show that the proposed LW-index is highly correlated with the external indices (i.e., Macro-F1, Micro-F1) of SVM and the Centroid-Based Classifier (CBC) trained with a five-fold CV scheme. Furthermore, the experimental results indicate that the LW-index performs as well as the traditional CV scheme for evaluating dimensionality reduction algorithms while being more efficient. Therefore, one contribution of this paper is an alternative methodology, based on an internal index typically used in the unsupervised learning context, that is computationally cheaper than the traditional CV methodology. Another contribution is a new internal index that behaves better than other similar indices widely used in clustering and shows high correlation with the results obtained by the traditional methodology.
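The traditional CV baseline that the abstract compares against can be sketched as follows. This is a minimal illustration only, not the authors' implementation and not the LW-index, whose definition is not given in this record. It assumes a simple centroid-based classifier with squared Euclidean distance (the paper's CBC may use a different similarity measure), and the names `cross_validate`, `centroid_fit`, `centroid_predict`, `f1_scores`, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def centroid_fit(X, y):
    """Compute the mean vector (centroid) of each class."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def centroid_predict(centroids, row):
    """Assign the class of the nearest centroid (squared Euclidean; an assumption)."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda c: sqdist(centroids[c], row))


def f1_scores(y_true, y_pred):
    """Return (macro-F1, micro-F1) for single-label predictions."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


def cross_validate(X, y, feature_idx, k=5, seed=0):
    """Score a feature subset by averaging macro/micro-F1 over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    # Keep only the columns of the feature subset under evaluation.
    Xs = [[row[j] for j in feature_idx] for row in X]
    macros, micros = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        centroids = centroid_fit([Xs[i] for i in train], [y[i] for i in train])
        preds = [centroid_predict(centroids, Xs[i]) for i in fold]
        macro, micro = f1_scores([y[i] for i in fold], preds)
        macros.append(macro)
        micros.append(micro)
    return sum(macros) / k, sum(micros) / k


# Hypothetical toy data: feature 0 separates the two classes, feature 1 is noise.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[10 * label + rng.uniform(-1, 1), rng.uniform(-1, 1)] for label in y]

macro_f1, micro_f1 = cross_validate(X, y, feature_idx=[0])        # informative subset
noise_macro, noise_micro = cross_validate(X, y, feature_idx=[1])  # noise-only subset
```

Every candidate feature subset requires k rounds of training and prediction here, which is the cost that motivates an internal index computed directly from the feature subset.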