Abstract

Histograms have been used extensively as a simple tool for nonparametric probability density function estimation. In practice, however, the accuracy of histogram-based derived quantities, such as the marginal entropy (ME), the joint entropy (JE), or the mutual information (MI), depends on the number of bins chosen for the histogram. In this paper, we investigate the binning problem of the bivariate histogram for JE estimation. By minimizing a theoretical mean square error (MSE) of the JE estimate, we derive a new formula for the optimal number of bins of the bivariate histogram for continuous random variables. This novel JE estimator is then used in MI estimation to avoid the accumulation of error in the joint MI between the class variable and the feature subset during feature selection. In a synthetic Gaussian feature selection problem, only the proposed method retrieves the exact number of relevant features that explain the class variable, compared with a concurrent univariate estimator based on a binning formula proposed for ME estimation. In speech and speaker recognition applications, the proposed method selects a limited number of features that achieves approximately the same or even a better recognition rate than using the total number of features.
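To illustrate the quantity at the heart of the abstract, the following is a minimal sketch of histogram-based joint entropy estimation for two continuous variables. The paper's contribution is the optimal bin-count formula obtained by minimizing the MSE; since that formula is not reproduced here, the bin count below is a fixed placeholder and the data are synthetic.

```python
import numpy as np

def joint_entropy(x, y, bins):
    """Estimate the joint entropy H(X, Y) in nats from a 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p = counts / counts.sum()      # normalize counts to joint probabilities
    p = p[p > 0]                   # drop empty bins to avoid log(0)
    return -np.sum(p * np.log(p))

# Synthetic correlated Gaussian data for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.5 * x + rng.normal(size=10_000)

# 'bins=16' is an arbitrary placeholder; the paper derives the optimal
# number of bins by minimizing a theoretical MSE of the JE estimate.
print(joint_entropy(x, y, bins=16))
```

The sensitivity of this estimate to `bins` is exactly the issue the abstract describes: too few bins over-smooth the density, too many leave bins empty and bias the entropy estimate.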

  • Publication date: 2018-01-01