Using cross ambiguity model improves the effect of Vietnamese word segmentation

Niu, Yitong<sup>*</sup>; Xiong, Mingming; Guo, Jianyi; Mao, Cunli; Xian, Yantuan; Yu, Zhengtao

Abstract

Ambiguity is widespread in Vietnamese sentences and reduces the accuracy of word segmentation. In this paper, we propose a Vietnamese word segmentation method based on conditional random fields (CRF) and a cross ambiguity model, combined with Vietnamese lexical features so that essential characteristics of Vietnamese are incorporated into the CRF. In total, 5377 ambiguity fragments were extracted from the training corpus. Statistical features, features internal to the ambiguity field, and ambiguity context features were selected and fed into the maximum entropy model and the cross ambiguity model, which were then incorporated into the segmentation model. The training corpus was divided evenly into ten parts for cross-validation, and the segmentation accuracy reached 96.55%. A comparison with the Vietnamese segmentation tool VnTokenizer suggests that the proposed method performs well and is precise: precision and recall are higher than VnTokenizer's by 1.34% and 0.63%, respectively, and the alignment error rate (AER) is lower by 0.98%.
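
To make the notion of a cross (overlapping) ambiguity fragment concrete, the following is a minimal sketch and not the authors' extraction procedure, which the abstract does not describe. It assumes a sentence already split into syllables and a hypothetical toy lexicon of two-syllable words; a window of three syllables A B C is flagged when both "A B" and "B C" are lexicon entries. The function name `find_cross_ambiguities` and the lexicon are illustrative assumptions.

```python
def find_cross_ambiguities(syllables, lexicon):
    """Return (start_index, (A, B, C)) for each three-syllable window
    in which both "A B" and "B C" are words in the lexicon."""
    fragments = []
    for i in range(len(syllables) - 2):
        a, b, c = syllables[i:i + 3]
        left = f"{a} {b}"    # candidate word covering A and B
        right = f"{b} {c}"   # candidate word covering B and C
        if left in lexicon and right in lexicon:
            fragments.append((i, (a, b, c)))
    return fragments


# Hypothetical toy lexicon: "học sinh" (student) and "sinh học" (biology).
toy_lexicon = {"học sinh", "sinh học"}
sentence = ["học", "sinh", "học", "sinh", "học"]

for start, fragment in find_cross_ambiguities(sentence, toy_lexicon):
    print(start, " ".join(fragment))
```

Each flagged window is a place where the middle syllable could attach to the left or to the right, which is the kind of decision a disambiguation model would have to make from fragment-internal and contextual features.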

Publication date: 2016-11
Institution: Kunming University of Science and Technology

Last updated: 2024-05-12 16:54
