Abstract

Mutual information (MI) is a powerful concept for correlation-centric applications, and it has been widely used for feature selection from microarray gene expression data. One of the merits of MI is that, unlike many heuristic methods, it rests on a mature theoretical foundation. When applied to microarray data, however, it faces two challenges. First, because of the large number of features (i.e., genes) in microarray data, the true distributions of the expression values of some genes may be distorted by noise. Second, evaluating inter-group mutual information requires estimating multivariate distributions, which is difficult if not impossible. To address these problems, we propose in this paper a new MI-based feature selection approach for microarray data. Our approach relies on two strategies: relevance boosting, which requires a desirable feature to contribute substantial additional relevance to the class labeling beyond that of the already selected features; and feature-interaction enhancing, which probabilistically compensates for the feature interactions missed by simple aggregation-based evaluation. We justify our approach both theoretically and experimentally: a synthetic dataset demonstrates the statistical significance of the proposed strategies, and real-life datasets show the improved performance of our approach over existing methods.
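To make the relevance-boosting idea concrete, the sketch below shows a greedy MI-based selector in the spirit described above: a candidate feature is admitted only if it carries additional relevance with the class labels conditioned on every already-selected feature. This is an illustrative approximation (CMIM-style, using pairwise conditional MI and a `threshold` cutoff of our own choosing), not the exact algorithm proposed in the paper; all function names and parameters here are hypothetical.

```python
import numpy as np
from collections import Counter

def mi(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # I(X;Y) = sum_{a,b} p(a,b) * log( p(a,b) / (p(a) p(b)) )
    return sum((c / n) * np.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def cond_mi(x, y, z):
    """Conditional MI I(X;Y|Z): MI within each group of Z, weighted by p(z)."""
    n = len(z)
    out = 0.0
    for v, cnt in Counter(z).items():
        idx = [i for i in range(n) if z[i] == v]
        out += (cnt / n) * mi([x[i] for i in idx], [y[i] for i in idx])
    return out

def select_features(X, y, k, threshold=0.0):
    """Greedy selection with a relevance-boosting stop rule: a candidate must
    show MI with y beyond `threshold` even conditioned on each selected feature."""
    n_features = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_features)]
    selected = []
    while len(selected) < k:
        best_j, best_score = None, threshold
        for j in range(n_features):
            if j in selected:
                continue
            if not selected:
                score = mi(cols[j], y)  # plain relevance for the first pick
            else:
                # additional relevance beyond every already-selected feature
                score = min(cond_mi(cols[j], y, cols[s]) for s in selected)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break  # no candidate adds enough relevance; stop early
        selected.append(best_j)
    return selected
```

For example, if feature 1 is an exact copy of feature 0, its conditional MI given feature 0 is zero, so the selector skips it in favor of a feature that genuinely adds class information; a plain relevance ranking would have kept the redundant copy.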