Abstract

One major challenge in building Bayesian text classifiers is the data sparsity problem, especially when the amount of training data is very small. Recently, the log-bilinear language model, a form of neural language model, has been shown to be an effective way to combat data sparsity. In this paper, we propose a novel semantic smoothing method based on the log-bilinear model to improve the performance of the naive Bayes classifier. The key idea is to learn semantically oriented representations for words and to perform semantic smoothing based on these representations. Noise-contrastive estimation is employed to enable fast training on large document collections. We conduct comprehensive experiments on three test collections (20NG, Reuters, and WebKB) to compare our smoothing method with other approaches. Experimental results show that the proposed method not only outperforms two commonly used smoothing methods for Bayesian text classification, but also beats state-of-the-art SVM classifiers when the number of training documents is small.
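To make the key idea concrete, the following is a minimal sketch of one plausible form of embedding-based semantic smoothing for naive Bayes. It assumes word embeddings have already been learned (e.g. by a log-bilinear model trained with noise-contrastive estimation); the function name `semantic_smooth`, the interpolation weight `lam`, the temperature `tau`, and the particular smoothing formula are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def semantic_smooth(counts, E, lam=0.5, tau=1.0):
    """Blend per-class MLE word probabilities with mass redistributed
    through embedding similarities (an assumed smoothing form)."""
    # Per-class maximum-likelihood estimates P_mle(w | c).
    p_mle = counts / counts.sum(axis=1, keepdims=True)

    # Similarity kernel from embeddings: row w' is a softmax distribution
    # over the semantic neighbours of word w'.
    logits = (E @ E.T) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    sim = np.exp(logits)
    sim /= sim.sum(axis=1, keepdims=True)

    # Semantic component: P_sem(w | c) = sum_w' P_mle(w' | c) * sim(w', w).
    # Each row of `sim` sums to 1, so P_sem stays a proper distribution.
    p_sem = p_mle @ sim

    # Linear interpolation of the sparse MLE and the smoothed estimate.
    return (1 - lam) * p_mle + lam * p_sem

# Toy usage: 2 classes, a 4-word vocabulary, 3-dimensional embeddings.
rng = np.random.default_rng(0)
counts = np.array([[5.0, 0.0, 2.0, 1.0],
                   [0.0, 4.0, 1.0, 3.0]])  # word counts per class
E = rng.normal(size=(4, 3))                # hypothetical learned embeddings
print(semantic_smooth(counts, E))          # each row sums to 1
```

Note how words with zero counts in a class (a direct symptom of data sparsity) receive nonzero probability borrowed from semantically similar words, which is the effect the proposed method aims for.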

Full Text