Abstract

Feature selection is an important problem for pattern classification systems. Supervised feature selection methods generally outperform unsupervised ones. However, almost all existing supervised methods use class labels as supervision information, and very little work has been done on other forms of supervision information such as pairwise constraints, which specify whether a pair of data samples belongs to the same class (must-link constraints) or to different classes (cannot-link constraints). In practice, pairwise constraints can be obtained easily by specifying whether some pairs of examples belong to the same class or not. For this reason, a filter method for feature selection with pairwise constraints, called Constraint Score, has been proposed. Unfortunately, Constraint Score does not consider the case where only cannot-link constraints are given. Moreover, the conclusion drawn from the Constraint Score algorithm, that 'must-link constraints are more important than cannot-link constraints', needs further verification, since cannot-link constraints seem more important than must-link constraints from the viewpoint of the hypothesis-margin or margin. In addition, like the existing supervised feature selection methods, the recently proposed hypothesis-margin based approach for feature selection, called Simba, also uses class labels as supervision information. In this paper, to further study feature selection with pairwise constraints, we introduce a novel hypothesis-margin based approach for feature selection with side pairwise constraints, called Simba-sc, which uses only cannot-link constraints as supervision information. We compare our algorithm with the well-known Constraint Score, Fisher Score and Laplacian Score algorithms. Experiments are carried out on 6 UCI data sets using three different classifiers.
Experimental results show that, with only a few cannot-link constraints, Simba-sc achieves performance similar to, or even higher than, that of Fisher Score using full class labels on all training data, and performs comparably to or better than Constraint Score.
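
To make the two constraint types mentioned above concrete, the following minimal Python sketch builds must-link and cannot-link pairs for a handful of samples. It is only an illustration and is not part of Constraint Score or Simba-sc: the function name `pairwise_constraints` and the toy data are assumptions, and class labels are used here solely to generate the example, whereas in practice such constraints may be specified without knowing the labels themselves.

```python
from itertools import combinations


def pairwise_constraints(labels):
    """Split all index pairs into must-link and cannot-link lists.

    `labels` maps a sample index to its class label. Only the relation
    between two samples (same class or not) is kept, not the label.
    """
    must_link, cannot_link = [], []
    for (i, yi), (j, yj) in combinations(sorted(labels.items()), 2):
        if yi == yj:
            must_link.append((i, j))    # same class
        else:
            cannot_link.append((i, j))  # different classes
    return must_link, cannot_link


# Hypothetical toy data: four samples drawn from two classes.
toy_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
must_link, cannot_link = pairwise_constraints(toy_labels)
```

A method restricted to cannot-link supervision, as described for Simba-sc, would consume only the second list.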