Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis

Authors: Wang Xin; Takaki Shinji; Yamagishi Junichi
Source: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(8): 1406-1419.
DOI: 10.1109/TASLP.2018.2828650

Abstract

Recurrent neural networks (RNNs) have been successfully used as fundamental frequency (F0) models for text-to-speech synthesis. However, this paper showed that a normal RNN may not take into account the statistical dependency of the F0 data across frames and consequently only generate noisy F0 contours when F0 values are sampled from the model. A better model may take into account the causal dependency of the current F0 datum on the previous frames' F0 data. One such model is the shallow autoregressive (AR) recurrent mixture density network (SAR) that we recently proposed. However, as this study showed, an SAR is equivalent to the combination of trainable linear filters and a conventional RNN and is therefore still weak for F0 modeling. To better model the temporal dependency in F0 contours, we propose a deep AR model (DAR). On the basis of an RNN, this DAR propagates the previous frame's F0 value through the RNN, which allows nonlinear AR dependency to be achieved. We also propose F0 quantization and data dropout strategies for the DAR. Experiments on a Japanese corpus demonstrated that this DAR can generate appropriate F0 contours by using the random-sampling-based generation method, which is impossible for the baseline RNN and SAR. When a conventional mean-based generation method was used in the proposed DAR and other experimental models, the DAR generated accurate and less oversmoothed F0 contours and achieved a better mean opinion score (MOS) in a subjective evaluation test.
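The core idea in the abstract can be illustrated with a short sketch. This is not the authors' code; it is a minimal illustration, under assumed dimensions and with untrained random weights, of how a deep AR model conditions each frame on the previous frame's quantized F0 class (the feedback loop), and of the data dropout strategy that occasionally hides that feedback. The names `generate`, `N_QUANT`, and the feature sizes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N_QUANT = 8   # number of quantized F0 classes (assumed, for illustration)
N_LING = 4    # linguistic input feature dimension (assumed)
N_HID = 16    # hidden units

# Random weights stand in for a trained model.
W_in = rng.standard_normal((N_HID, N_LING + N_QUANT)) * 0.1
W_rec = rng.standard_normal((N_HID, N_HID)) * 0.1
W_out = rng.standard_normal((N_QUANT, N_HID)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate(ling_feats, data_dropout_p=0.0):
    """Frame-by-frame generation: the previous frame's quantized F0 class
    is fed back into the RNN input, giving a nonlinear AR dependency.
    With data_dropout_p > 0, the feedback is sometimes zeroed out
    (the data dropout strategy mentioned in the abstract)."""
    h = np.zeros(N_HID)
    prev = np.zeros(N_QUANT)  # one-hot of the previous frame's F0 class
    out = []
    for x in ling_feats:
        feedback = np.zeros(N_QUANT) if rng.random() < data_dropout_p else prev
        h = np.tanh(W_in @ np.concatenate([x, feedback]) + W_rec @ h)
        p = softmax(W_out @ h)       # distribution over quantized F0 classes
        k = int(np.argmax(p))        # greedy pick; sampling from p is also possible
        prev = np.zeros(N_QUANT)
        prev[k] = 1.0
        out.append(k)
    return out

f0_classes = generate(rng.standard_normal((5, N_LING)))
print(f0_classes)
```

Quantizing F0 into classes turns the per-frame prediction into a classification problem, which is why the feedback can be represented as a one-hot vector here; a baseline RNN would omit the `feedback` term entirely and predict each frame independently of its own previous outputs.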

  • Publication date: 2018-08