A feature cluster taxonomy based feature selection technique

作者:Goswami Saptarsi*; Das Amit Kumar; Chakrabarti Amlan; Chakraborty Basabi
来源:Expert Systems with Applications, 2017, 79: 76-89.
DOI:10.1016/j.eswa.2017.01.044

摘要

Feature subset selection is basically an optimization problem for choosing the most important features from various alternatives in order to facilitate classification or mining problems. Though lots of algorithms have been developed so far, none is considered to be the best for all situations and researchers are still trying to come up with better solutions. In this work, a flexible and user-guided feature subset selection algorithm, named as FCTFS (Feature Cluster Taxonomy based Feature Selection) has been proposed for selecting suitable feature subset from a large feature set. The proposed algorithm falls under the genre of clustering based feature selection techniques in which features are initially clustered according to their intrinsic characteristics following the filter approach. In the second step the most suitable feature is selected from each cluster to form the final subset following a wrapper approach. The two stage hybrid process lowers the computational cost of subset selection, especially for large feature data sets. One of the main novelty of the proposed approach lies in the process of determining optimal number of feature clusters. Unlike currently available methods, which mostly employ a trial and error approach, the proposed method characterises and quantifies the feature clusters according to the quality of the features inside the clusters and defines a taxonomy of the feature clusters. The selection of individual features from a feature cluster can be done judiciously considering both the relevancy and redundancy according to user's intention and requirement. The algorithm has been verified by simulation experiments with different bench mark data set containing features ranging from 10 to more than 800 and compared with other currently used feature selection algorithms. The simulation results prove the superiority of our proposal in terms of model performance, flexibility of use in practical problems and extendibility to large feature sets. Though the current proposal is verified in the domain of unsupervised classification, it can be easily used in case of supervised classification.

  • 出版日期2017-8-15