Abstract

Features are eliminated because they are either irrelevant or redundant. The major challenge in feature selection for clustering is that the relevance of a feature is not well defined. This paper attempts to address that gap. Feature relevance is first defined in terms of the Variability Score (VSi), a novel score that measures a feature's contribution to the overall variability of the dataset. Second, feature relevance is evaluated using entropy. VSi is a multivariate measure of feature relevance, whereas entropy is univariate. Both are used in a greedy forward search to select an optimal feature subset (FSELECT-VS, FSELECT-EN). Redundancy is handled using Pearson's correlation coefficient. Because dataset characteristics also influence the result, it is recommended to apply both methods and adopt the better one for a given dataset. An extensive empirical study over thirty publicly available datasets shows that the proposed method outperforms several state-of-the-art methods. The average feature reduction achieved is 44%, with no statistically significant reduction in performance (t = -0.35, p = 0.73) compared with using all features. Moreover, the proposed method is shown to be relatively computationally inexpensive.
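The selection procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give closed-form definitions of VSi or the entropy score, so a simple variance proxy stands in for the relevance function, and all function names and the correlation threshold are assumptions.

```python
import numpy as np

def pearson_redundant(candidate, selected_cols, threshold=0.9):
    """True if the candidate column is highly correlated (|r| >= threshold)
    with any already-selected column."""
    for col in selected_cols:
        r = np.corrcoef(candidate, col)[0, 1]
        if abs(r) >= threshold:
            return True
    return False

def greedy_forward_select(X, relevance, threshold=0.9):
    """Greedy forward search: rank features by a relevance score, then admit
    them in descending order, skipping any feature that is redundant with the
    features selected so far (Pearson correlation filter)."""
    n_features = X.shape[1]
    scores = [relevance(X[:, j]) for j in range(n_features)]
    order = np.argsort(scores)[::-1]  # most relevant first
    selected = []
    for j in order:
        if not pearson_redundant(X[:, j], [X[:, k] for k in selected], threshold):
            selected.append(int(j))
    return selected

# Toy relevance proxy: sample variance. This is NOT the paper's VSi or entropy
# score, merely a placeholder with the same role (higher = more relevant).
def variance_relevance(col):
    return float(np.var(col))
```

For example, if one column of `X` duplicates another, the duplicate has correlation 1 with the original and is rejected by the redundancy filter, so only one of the pair survives.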

  • Published: 2017