An Improved Unsupervised Approach to Word Segmentation

作者:Wang Hanshi; Han Xuhong; Liu Lizhen*; Song Wei; Yuan Mudan
来源:China Communications, 2015, 12(7): 82-95.
DOI:10.1109/CC.2015.7188527

摘要

ESA is an unsupervised approach to word segmentation previously proposed by Wang, which is an iterative process consisting of three phases: Evaluation, Selection and Adjustment. In this article, we propose ExESA, the extension of ESA. In ExESA, the original approach is extended to a 2-pass process and the ratio of different word lengths is introduced as the third type of information combined with cohesion and separation. A maximum strategy is adopted to determine the best segmentation of a character sequence in the phrase of Selection. Besides, in Adjustment, ExESA re-evaluates separation information and individual information to overcome the overestimation frequencies. Additionally, a smoothing algorithm is applied to alleviate sparseness. The experiment results show that ExESA can further improve the performance and is time-saving by properly utilizing more information from un-annotated corpora. Moreover, the parameters of ExESA can be predicted by a set of empirical formulae or combined with the minimum description length principle.

全文