Abstract

Imbalanced data sets in real-world applications have a majority class of normal instances and a minority class of abnormal or important instances. Learning from such data sets usually produces biased classifiers with high predictive accuracy on the majority class but rather poor predictive accuracy on the minority class. The synthetic minority over-sampling technique (SMOTE) is specifically designed for learning from imbalanced data sets. This paper presents a novel approach for learning from imbalanced data sets based on an improved SMOTE algorithm. The approach handles noisy data with a hierarchical filtering mechanism, employs a selection strategy for minority instances, and makes full use of the dynamic distribution density of the minority class before applying the SMOTE process. An empirical analysis showed that the approach is quantitatively competitive with SMOTE and a series of its improved variants in terms of the receiver operating characteristic curve when applied to several highly and moderately imbalanced data sets.
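
For context, the sketch below illustrates the basic SMOTE interpolation step referenced in the abstract: each synthetic sample is placed on the line segment between a minority instance and one of its k nearest minority neighbours. The hierarchical noise filtering, minority-instance selection strategy, and dynamic distribution density weighting proposed in the paper are not reproduced here; the function name `smote` and its parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic samples by interpolating between minority instances
    (rows of `minority`) and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(minority)
    # Pairwise Euclidean distances between minority instances (brute force for clarity).
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    # For each instance, indices of its k nearest neighbours, excluding itself.
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                 # pick a minority instance at random
        j = rng.choice(neighbours[i])       # pick one of its k nearest neighbours
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.asarray(synthetic)

# Example: oversample a toy 2-D minority class with 4 synthetic instances.
minority = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                     [1.1, 1.3], [0.9, 0.8], [1.3, 1.0]])
print(smote(minority, n_synthetic=4).shape)  # (4, 2)
```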

Full text