Abstract

Feature subset selection and/or dimensionality reduction is an essential preprocessing step before any data mining task, especially when the problem space contains too many features. In this paper, a clustering-based feature subset selection (CFSS) algorithm is proposed for identifying the more relevant features. At each level of agglomeration, it uses a similarity measure among features to merge the two most similar clusters of features. By gathering similar features into clusters and then selecting representative features from each cluster, it aims to remove redundant features. To identify the representative features, a criterion based on mutual information is proposed. Since CFSS works in a filter manner when specifying the representatives, it is noticeably fast. Because it relies on hierarchical clustering, the number of clusters does not need to be determined in advance. In CFSS, the clustering process is repeated until every feature has been assigned to a cluster. However, to distribute the features over a reasonable number of clusters, a recently proposed approach is used to find a suitable level at which to cut the clustering tree. To assess the performance of CFSS, we applied it to several UCI datasets and compared it with some popular feature selection methods. The experimental results demonstrate the efficiency and speed of the proposed method.
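The pipeline described above (agglomerate similar features, cut the tree, keep one representative per cluster) can be illustrated with a minimal Python sketch. This is not the CFSS implementation: the abstract does not specify the similarity measure, the mutual-information criterion, or the tree-cutting approach, so the sketch assumes absolute Pearson correlation as the similarity, average linkage, a fixed similarity threshold as the cut, and equal-width binning before computing mutual information with the class label. All function names are hypothetical.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def discretize(x, bins):
    """Equal-width binning (an assumption; any discretizer could be used)."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0] * len(x)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in x]

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def cluster_features(columns, threshold):
    """Average-linkage agglomeration of features.

    Merging stops once the best inter-cluster similarity drops below
    `threshold`; this fixed cut is a stand-in for the cutting approach
    mentioned in the abstract.
    """
    sim = [[abs(pearson(a, b)) for b in columns] for a in columns]
    clusters = [[i] for i in range(len(columns))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(sim[a][b] for a in clusters[i] for b in clusters[j])
                s = total / (len(clusters[i]) * len(clusters[j]))
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters

def select_representatives(columns, labels, threshold=0.8, bins=2):
    """Keep, per cluster, the feature with the highest MI with the label."""
    clusters = cluster_features(columns, threshold)
    relevance = [mutual_information(discretize(c, bins), labels) for c in columns]
    return sorted(max(c, key=lambda f: relevance[f]) for c in clusters)

# Toy data: f1 is a slightly perturbed copy of f0; f2 is unrelated to both.
f0 = [1, 2, 3, 4, 5, 6, 7, 8]
f1 = [1, 2, 3, 5, 4, 6, 7, 8]
f2 = [5, 1, 4, 2, 5, 1, 4, 2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(select_representatives([f0, f1, f2], labels))  # prints [0, 2]
```

In this toy run f0 and f1 fall into one cluster and f0 is kept because its binned values align perfectly with the label, while f2 remains in its own cluster. The pairwise search is cubic in the number of features, which is acceptable for a sketch but not for high-dimensional data.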

  • Publication date: 2018-02
