Abstract

In this paper, explicit and implicit modelling of subsegmental excitation information are compared experimentally. For explicit modelling, the static and dynamic values of the standard Liljencrants-Fant (LF) parameters that model the glottal flow derivative (GFD) are used. A simplified approximation method is proposed to compute these LF parameters by locating the glottal closing and opening instants. The proposed approach significantly reduces the computation needed to implement the LF model. For implicit modelling, linear prediction (LP) residual samples, taken in blocks of 5 ms with a shift of 2.5 ms, are used. Different speaker recognition studies are performed using the NIST-99 and NIST-03 databases. In the case of speaker identification, implicit modelling provides significantly better performance than explicit modelling. In contrast, explicit modelling seems to provide better performance in the case of speaker verification. This indicates that explicit modelling seems to have relatively less intra- and inter-speaker variability. Implicit modelling, on the other hand, has more intra- and inter-speaker variability. What is desirable is less intra-speaker and more inter-speaker variability. Therefore, explicit modelling may be used for the speaker verification task and implicit modelling for the speaker identification task. Further, for both speaker identification and verification tasks, explicit modelling provides relatively more complementary information to the state-of-the-art vocal tract features. The contribution of the explicit features is also relatively more robust against noise. We suggest that the explicit approach can be used to model the subsegmental excitation information for speaker recognition.
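To make the implicit representation concrete, the following is a minimal sketch of how an LP residual could be computed and cut into 5 ms blocks with a 2.5 ms shift. It is not the authors' implementation. The function names `lp_residual` and `block_signal`, the least-squares (covariance-style) LP fit over the whole input, the LP order of 10, and the 8 kHz sampling rate used in the usage note are all assumptions made for illustration; the abstract specifies only the block size and shift.

```python
import numpy as np


def lp_residual(x, order=10):
    """Return the linear prediction residual of x (illustrative sketch).

    Assumption: one predictor is fitted by least squares over the whole
    input; a practical system would usually re-estimate it per short frame.
    """
    x = np.asarray(x, dtype=float)
    # Column k-1 holds x[n-k] for n = order .. len(x)-1.
    past = np.column_stack(
        [x[order - k: len(x) - k] for k in range(1, order + 1)]
    )
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    # Residual = actual sample minus its prediction from past samples.
    return target - past @ coeffs


def block_signal(x, fs, block_ms=5.0, shift_ms=2.5):
    """Cut x into overlapping blocks; one block per row."""
    x = np.asarray(x, dtype=float)
    size = int(round(fs * block_ms / 1000.0))
    hop = int(round(fs * shift_ms / 1000.0))
    if len(x) < size:
        return np.empty((0, size))
    count = 1 + (len(x) - size) // hop
    # Row i selects samples i*hop .. i*hop + size - 1.
    index = np.arange(size)[None, :] + hop * np.arange(count)[:, None]
    return x[index]
```

At an assumed sampling rate of 8 kHz, `block_signal(lp_residual(speech), 8000)` would yield rows of 40 samples that advance by 20 samples, so consecutive blocks overlap by half. Each row could then serve as one feature vector for a speaker model.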

  • Publication date: 2013-8