Abstract

This paper addresses the problem of identifying meaningful patterns and trends in data via clustering (i.e., automatically dividing a data set into meaningful homogeneous subgroups such that data within the same subgroup are highly similar and data in different subgroups are highly dissimilar). The clustering framework that we propose is based on the generalized Dirichlet distribution, which is widely accepted as a flexible modeling approach, and a hierarchical Dirichlet process mixture prior. A main advantage of the adopted hierarchical Dirichlet process is that it provides a principled, elegant nonparametric Bayesian approach to model selection by assuming that the number of mixture components can go to infinity. In addition to capturing the structure of the data, the combination of the hierarchical Dirichlet process and the generalized Dirichlet distribution is shown to offer a natural and efficient solution to the feature selection problem when dealing with high-dimensional data. We develop two variational learning approaches (batch and incremental) for learning the parameters of the proposed model. The batch algorithm examines the entire data set at once, while the incremental one learns the model one step at a time (i.e., it updates the model's parameters each time new data are introduced). The utility of the proposed approach is demonstrated on real applications, namely face detection, facial expression recognition, human gesture recognition, and off-line writer identification. The obtained results clearly show the merits of our statistical framework.