A method for Chinese text classification based on apparent semantics and latent aspects

Chen Ye Wang; Wang Jiong Liang; Cai Yi Qiao; Du Ji Xiang*

doi:10.1007/s12652-015-0257-z

Abstract

Existing text classification methods fail to achieve high accuracy on Chinese texts because the basic unit of Chinese text is not the individual character (hanzi) but the phrase, and Chinese text has no natural delimiter to separate phrases. The problem is worse for large volumes of Chinese Web texts, which often lack sufficient context because most of them are short, irregular, and sparse. This paper proposes a new classification method for Chinese texts based on apparent semantics and latent aspects (ASLA). First, the apparent semantics of a Chinese text are extracted as features, in place of hanzi, using BaiduBaike. Second, pLSA is applied to mine the latent aspects of these apparent semantics. Third, the degree of relevance of a document to a category is calculated from the apparent semantics and latent aspects. Finally, the category of the document is determined by this degree of relevance. The proposed method handles short Chinese Web texts well with only a small amount of training data. Our experiments show that the method is promising and that it outperforms pLSA, SVM, KNN, and CRF when training data is scarce and the text is irregular.
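
The second step relies on pLSA, so a minimal sketch of the standard pLSA EM updates may help make it concrete. This is the textbook algorithm, not the authors' implementation: the function name `plsa`, its parameters, and the toy count matrix are assumptions made for illustration. In the ASLA setting, the columns of `counts` would presumably correspond to apparent-semantic features rather than single characters. The relevance-degree step is not sketched because the abstract does not give its formula.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit standard pLSA by EM on a document-feature count matrix.

    counts[d][w] is how often feature w occurs in document d.
    Returns (p_z_d, p_w_z): P(z|d) as a D x K list of rows and
    P(w|z) as a K x W list of rows.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    # Random positive initialisation; every row is a probability distribution.
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z | d, w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_topics):
                    expected = n * joint[z] / total
                    new_z_d[d][z] += expected
                    new_w_z[z][w] += expected
        # M-step: renormalise expected counts; keep the old row if it got no mass.
        p_z_d = [normalize(new) if sum(new) > 0 else old
                 for new, old in zip(new_z_d, p_z_d)]
        p_w_z = [normalize(new) if sum(new) > 0 else old
                 for new, old in zip(new_w_z, p_w_z)]
    return p_z_d, p_w_z


if __name__ == "__main__":
    # Toy corpus (hypothetical): two documents use features 0-1, two use features 2-3.
    toy_counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
    doc_topics, topic_features = plsa(toy_counts, n_topics=2)
    for row in doc_topics:
        print([round(p, 3) for p in row])
```

Each row of the returned `p_z_d` is a document's mixture over latent aspects, which is the kind of quantity a later relevance computation could draw on.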

Publication date: 2015-8
Affiliation: Huaqiao University

