Abstract

Ambiguity is widespread in Vietnamese sentences and degrades the accuracy of word segmentation. In this paper, we propose a Vietnamese word segmentation method based on conditional random fields (CRFs) and a cross ambiguity model, combined with Vietnamese lexical features so that essential characteristics of the language are incorporated into the CRF. In total, 5,377 ambiguous fragments were extracted from the training corpus. Statistical features, features internal to the ambiguity field, and contextual features of the ambiguity were selected as inputs to the maximum entropy model and the cross ambiguity model, which were then incorporated into the segmentation model. The training corpus was divided evenly into ten parts for a cross-validation experiment, in which segmentation accuracy reached 96.55%. Compared with the Vietnamese segmentation tool VnTokenizer, the proposed model improves precision and recall by 1.34% and 0.63%, respectively, and reduces the alignment error rate (AER) by 0.98%.
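As an illustration of how word-level precision and recall are commonly computed for segmentation, the following is a minimal sketch and not the evaluation script used in this work. It assumes that a segmented sentence is a list of words whose syllables are joined by an underscore; the function names `to_spans` and `precision_recall` and the toy sentence are hypothetical.

```python
def to_spans(words):
    """Map a segmented sentence to a set of (start, end) syllable-index spans."""
    spans, start = set(), 0
    for word in words:
        # Assumption: syllables inside one word are joined by "_".
        n_syllables = len(word.split("_"))
        spans.add((start, start + n_syllables))
        start += n_syllables
    return spans


def precision_recall(gold_words, pred_words):
    """Word-level precision and recall of a predicted segmentation."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    # A predicted word counts as correct only if both boundaries match the gold word.
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy example of a cross ambiguity: the same five syllables, segmented two ways.
gold = ["học_sinh", "học", "sinh_học"]
pred = ["học_sinh", "học_sinh", "học"]
precision, recall = precision_recall(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the toy example only the first word has matching boundaries in both segmentations, so a single misplaced boundary in an ambiguous fragment costs two words in both precision and recall.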