Abstract

Automatic transcription of out-of-vocabulary words into their corresponding phoneme strings is widely used in speech synthesis and spoken-term detection systems. To meet the challenges of grapheme-to-phoneme (G2P) conversion, this paper proposes a phoneme transition network (PTN)-based architecture that combines several G2P methods. The proposed method first builds a confusion network from multiple phoneme-sequence hypotheses generated by those methods. It then selects the best output phoneme from each block of phonemes in the generated network. Moreover, to extend the feasibility and improve the performance of the proposed PTN-based model, we introduce a novel use of right-to-left (reversed) grapheme-phoneme sequences along with grapheme-generation rules. Both techniques reduce the number of methods or source models that the architecture requires, and both increase the number of phoneme-sequence hypotheses without adding methods. They therefore lower the risk that combining accurate and inaccurate methods degrades phoneme prediction. Evaluation results on various pronunciation dictionaries show that the proposed model, when trained on reversed grapheme-phoneme sequences, often outperformed models trained on conventional left-to-right grapheme-phoneme sequences. In addition, the evaluation demonstrates that the proposed PTN-based method for G2P conversion is more accurate than all baseline approaches that were tested.
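The two steps named in the abstract (building a confusion network from several hypotheses, then picking one phoneme per block) can be illustrated with a minimal sketch. The abstract does not specify how the network is aligned or how the best phoneme is chosen, so everything here is an assumption: the edit-distance alignment, the plain majority vote, the function names `align`, `build_network` and `vote`, and the toy phoneme sequences are all hypothetical and are not taken from the paper.

```python
from collections import Counter

EPS = "-"  # assumed placeholder meaning "no phoneme in this block"

def align(slots, hyp):
    """Edit-distance alignment of one hypothesis against the current blocks.
    Returns (slot_index or None, phoneme or None) pairs in left-to-right order."""
    n, m = len(slots), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if hyp[j - 1] in slots[i - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match or substitution
                             cost[i - 1][j] + 1,        # block with no phoneme
                             cost[i][j - 1] + 1)        # phoneme with no block
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # trace the cheapest path back to the origin
        if i > 0 and j > 0:
            sub = 0 if hyp[j - 1] in slots[i - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                pairs.append((i - 1, hyp[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs

def build_network(hypotheses):
    """Merge the hypotheses into blocks; each block holds one entry per hypothesis."""
    slots = [[p] for p in hypotheses[0]]
    for k, hyp in enumerate(hypotheses[1:], start=1):
        merged = []
        for idx, phoneme in align(slots, hyp):
            if idx is None:  # new block: earlier hypotheses had nothing here
                merged.append([EPS] * k + [phoneme])
            else:
                merged.append(slots[idx] + [phoneme if phoneme is not None else EPS])
        slots = merged
    return slots

def vote(slots):
    """Majority vote per block (an assumed selection rule); EPS winners are dropped."""
    winners = (Counter(slot).most_common(1)[0][0] for slot in slots)
    return [p for p in winners if p != EPS]

# Toy hypotheses (made up). The third imitates a right-to-left model, whose
# output is reversed back to left-to-right order before it joins the network.
left_to_right_a = ["F", "OW", "N", "IY", "M"]
left_to_right_b = ["F", "OW", "N", "AH", "M"]
right_to_left = ["M", "IY", "N", "OW", "F"]
network = build_network([left_to_right_a, left_to_right_b, right_to_left[::-1]])
print(vote(network))  # ['F', 'OW', 'N', 'IY', 'M']
```

In this sketch the reversed model contributes a third hypothesis without a third G2P method being introduced, which is the effect the abstract attributes to right-to-left sequences. An unweighted vote treats every source as equally reliable; a real system would likely need some weighting to address the risk of mixing accurate and inaccurate methods.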

  • Publication date: 2016-4