Autocorrelation-Based Features for Speech Representation

Author: Ando Yoichi*
Source: Acta Acustica united with Acustica, 2015, 101(1): 145-154.
DOI: 10.3813/AAA.918812

Abstract

This study investigates autocorrelation-based features as a potential basis for phonetic and syllabic distinctions. These features have emerged from a theory of auditory signal processing that was originally developed for architectural acoustics. Correlation-based auditory features extracted from monaural autocorrelation and binaural cross-correlation functions are used to predict perceptual attributes important for the design of concert halls: pitch, timbre, loudness, duration, reverberation-related coloration, sound direction, apparent source width, and envelopment [1, 2, 3, 4]. The current study investigates the use of features of monaural autocorrelation functions (ACFs) for representing phonetic elements (vowels), syllables (CV pairs), and phrases using a small set of temporal factors extracted from the short-term running ACF. These factors include listening level (loudness), zero-lag ACF peak width (spectral tilt), tau(1) (voice pitch period), phi(1) (voice pitch strength), tau(e) (effective duration of the ACF envelope, temporal repetitive continuity/contrast), segment duration, and Delta phi(1)/Delta t (the rate of pitch strength change, related to voice pitch attack-decay dynamics). Times at which ACF effective duration tau(e) is minimal reflect rapid signal pattern changes that usefully demarcate segmental boundaries. Results suggest that vowels, CV syllables, and phrases can be partially distinguished on the basis of this ACF-derived feature set, whose neural correlates lie in population-wide distributions of all-order interspike intervals in early auditory stations.
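The temporal factors named above can be illustrated in code. The sketch below (a simplified reading of the abstract, not the paper's exact procedure; the function name, peak-picking heuristic, and regression details are assumptions) extracts τ1 as the lag of the largest normalized-ACF peak beyond the first zero crossing, φ1 as that peak's height, and τe as the delay at which a line fitted to the early ACF-envelope peaks in dB extrapolates to -10 dB (i.e., where the envelope has decayed to 0.1).

```python
import numpy as np

def acf_features(frame, fs):
    """Illustrative extraction of three ACF temporal factors from one
    signal frame: tau1 (pitch period, s), phi1 (pitch strength), and
    tau_e (effective duration of the ACF envelope, s).

    A sketch under assumed heuristics, not the paper's exact method.
    """
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:]
    acf = acf / acf[0]                       # normalize so phi(0) = 1

    # tau1 / phi1: lag and height of the largest ACF peak beyond the
    # first zero crossing (a common pitch-peak heuristic).
    zc = int(np.argmax(acf < 0))             # first negative-going lag
    k = zc + int(np.argmax(acf[zc:]))
    tau1, phi1 = k / fs, float(acf[k])

    # tau_e: delay at which the ACF envelope decays to 0.1 (-10 dB).
    # Fit a line to the early envelope peaks in dB, extrapolate to -10 dB.
    aa = np.abs(acf)
    peaks = np.flatnonzero((aa[1:-1] > aa[:-2]) & (aa[1:-1] > aa[2:])) + 1
    early = peaks[aa[peaks] >= 10 ** -0.5]   # keep peaks above -5 dB
    if len(early) < 2:
        return tau1, phi1, float("nan")
    slope, intercept = np.polyfit(early / fs, 10 * np.log10(aa[early]), 1)
    tau_e = (-10.0 - intercept) / slope if slope < 0 else float("nan")
    return tau1, phi1, tau_e
```

For a steady 100 Hz sinusoid sampled at 8 kHz, this sketch recovers a pitch-peak lag τ1 near 10 ms with φ1 close to 1; applying it over a sliding window would yield the running factors described in the abstract, with local minima of τe marking candidate segment boundaries.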

  • Publication date: 2015-02