Abstract

Gabor features have been proposed for extracting spectro-temporal modulation information from speech signals and have been shown to yield large improvements in recognition accuracy. We use a flexible Tandem system framework that integrates multi-stream information from Gabor, MFCC, and pitch features in various ways, modeling tone variations, phoneme variations, or both in Mandarin speech recognition. Either phonemes or tonal phonemes (tonemes) serve as the target classes of MLP posterior estimation, as the acoustic units of HMM recognition, or as both. The experiments yield a comprehensive analysis of the contribution that each feature set makes to recognition accuracy. We discuss the complementarity of the feature sets in tone, phoneme, and toneme classification. We show that Gabor features are better for recognizing vowels and unvoiced consonants, while MFCCs are better for voiced consonants. In addition, Gabor features can capture the changes across time and frequency bands that Mandarin tone patterns cause in the signal, while pitch features offer additional tonal information. This explains why integrating Gabor, MFCC, and pitch features yields significant improvements.