Abstract

Feature selection (FS) plays an important role in data mining and pattern recognition, especially for large-scale text, image, and biological data. The Markov blanket provides a complete and sound solution for selecting the optimal feature subset in supervised feature selection, as it thoroughly characterizes both the relevance of features to the class and the conditional independence relationships among features. However, incomplete label information makes it particularly difficult to acquire the optimal feature subset. In this paper, we propose a novel algorithm, the Semi-supervised Representatives Feature Selection algorithm based on information theory (SRFS), which is independent of any classification learning algorithm and can rapidly and effectively identify and remove irrelevant and redundant features. More importantly, unlabeled data are utilized in the Markov blanket, in the same manner as labeled data, through the relevance gain. Our results on several benchmark datasets demonstrate that SRFS significantly improves upon state-of-the-art supervised and semi-supervised algorithms.