A method for Chinese text classification based on apparent semantics and latent aspects

作者:Chen Ye Wang; Wang Jiong Liang; Cai Yi Qiao; Du Ji Xiang*
来源:Journal of Ambient Intelligence and Humanized Computing, 2015, 6(4): 473-480.
DOI:10.1007/s12652-015-0257-z

摘要

The existing methods for text classification fail to achieve high accuracy in processing Chinese texts, for that the basic unit of Chinese texts is not hanzis but Chinese phrases, and there is no natural delimiter in Chinese texts to separate the phrases. Things go even worse in the case of processing large number of Chinese Web texts, for these texts often lack of enough context, because most of these text are often short, irregular and sparse. In this paper, a new classification method is proposed for Chinese texts based on apparent semantics and latent aspects (ASLA). First, the apparent semantics of Chinese text are extracted as features instead of hanzis by BaiduBaike; Second, pLSA is applied for mining the latent aspects of these apparent semantics. Third, the relevant degree of a document to a category is calculated according to the apparent semantics and latent aspects. Finally, the category of a document is determined by the relevant degree. The proposed method is able to process Chinese web short text well with mini train data. Our experiments showed that the proposed method is promising, and it outperforms pLSA,SVM, KNN and CRF in the case of training data is not enough and the text is irregular.