Unsupervised language identification based on Latent Dirichlet Allocation

Authors: Zhang, Wei*; Clark, Robert A. J.*; Wang, Yongyuan; Li, Wen
Source: Computer Speech and Language, 2016, 39: 47-66.
DOI:10.1016/j.csl.2016.02.001

Abstract

To automatically build, from scratch, the language processing component of a speech synthesis system in a new language, a purified text corpus is needed in which any words and phrases from other languages are clearly identified or excluded. When using found data, and where there is no inherent linguistic knowledge of the language or languages contained in the data, identifying the pure data is a difficult problem. We propose an unsupervised language identification approach based on Latent Dirichlet Allocation that takes raw n-gram counts as features, without any smoothing, pruning or interpolation. The Latent Dirichlet Allocation topic model is reformulated for the language identification task, and Collapsed Gibbs Sampling is used to train an unsupervised language identification model. To find the number of languages present, we compare four kinds of measures, as well as the Hierarchical Dirichlet Process, on several configurations of the ECI/MCI benchmark. Experiments on the ECI/MCI data and a Wikipedia-based Swahili corpus show that this LDA method, without any annotation, achieves precision, recall and F-scores comparable to state-of-the-art supervised language identification techniques.
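
To make the approach concrete, below is a minimal sketch, not the authors' implementation, of LDA trained with collapsed Gibbs sampling over raw character n-gram counts, where each topic is read as a language. The toy corpus, the assumed number of languages K = 2, the hyperparameters alpha and beta, the n-gram range and the iteration count are all hypothetical choices made for illustration.

```python
# Illustrative sketch: LDA via collapsed Gibbs sampling over raw character
# n-gram counts, with topics interpreted as languages. All settings below
# (corpus, K, alpha, beta, n-gram range, sweeps) are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",       # English-like
    "le chat est sur le tapis",     # French-like
    "the dog chased the cat",
    "le chien poursuit le chat",
]

# Raw character n-gram counts: no smoothing, pruning or interpolation.
vec = CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))
X = vec.fit_transform(docs).toarray()          # shape: (n_docs, n_ngrams)

n_docs, n_ngrams = X.shape
K = 2                       # assumed number of languages (topics)
alpha, beta = 0.1, 0.01     # symmetric Dirichlet hyperparameters
rng = np.random.default_rng(0)

# Expand counts into individual n-gram occurrences with random initial labels.
doc_ids, ngram_ids = np.nonzero(X)
tokens = [(d, w) for d, w in zip(doc_ids, ngram_ids) for _ in range(X[d, w])]
z = rng.integers(K, size=len(tokens))

ndk = np.zeros((n_docs, K))      # language counts per document
nkw = np.zeros((K, n_ngrams))    # n-gram counts per language
nk = np.zeros(K)                 # total counts per language
for (d, w), k in zip(tokens, z):
    ndk[d, k] += 1
    nkw[k, w] += 1
    nk[k] += 1

for _ in range(200):                           # Gibbs sweeps
    for i, (d, w) in enumerate(tokens):
        k = z[i]
        ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
        # Collapsed conditional p(z_i = k | everything else)
        p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + beta * n_ngrams)
        k = rng.choice(K, p=p / p.sum())
        z[i] = k
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Per-document language distribution; argmax gives the identified language.
theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)
print(np.argmax(theta, axis=1))
```

In this sketch the per-document language label is simply the argmax of the estimated document-language distribution theta; selecting the number of languages would additionally require the model-selection measures or the Hierarchical Dirichlet Process discussed in the paper.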