Abstract

Of the millions of emergency calls made each year, about a quarter are non-emergencies. To avoid dispatching a response in such cases, forensic examination of the reported situation, using the speech in the call as evidence, has become an indispensable requirement for emergency response centers. Caller profile information extracted from emergency calls, such as gender, age, emotional state, transcript, and contextual sounds, can be highly beneficial for this kind of forensic analysis. However, callers reporting emergencies are often under emotional stress, which causes variations in speech production. Furthermore, low voice quality and background noise make it very difficult to recognize caller attributes efficiently in such unconstrained environments. To overcome the limitations of traditional classification systems in these conditions, this paper proposes a hybrid two-stage classification scheme. The framework consists of an ensemble of support vector machines (e-SVM) and a deep neural network (DNN) arranged in a cascade. The first-stage e-SVM consists of two models discriminatively trained on normal and stressful speech from emergency calls. The DNN, which forms the second stage of the classification pipeline, is used only when the first stage produces an ambiguous prediction. The adaptive nature of this two-stage scheme helps achieve both efficiency and high performance. Experiments conducted on a large dataset affirm the suitability of the proposed architecture for efficient real-time speaker attribute recognition. The framework is evaluated on gender recognition from emergency calls in the presence of emotions and background noise, where it yields significant performance improvements over comparable state-of-the-art gender recognition approaches.
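The control flow of the cascade described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how an "ambiguous" first-stage result is detected, so the rule used here (the two first-stage models disagree, or either score falls inside a margin) is an assumption, as are the names `cascade_predict` and `margin`, the sign convention for the labels, and the toy linear scorers that stand in for trained SVM and DNN models.

```python
from typing import Callable, List, Sequence, Tuple

Features = Sequence[float]
# A scorer returns a signed decision value.
# Assumed convention: > 0 means "female", <= 0 means "male".
Scorer = Callable[[Features], float]


def cascade_predict(
    x: Features,
    stage1_models: List[Scorer],
    stage2_model: Scorer,
    margin: float = 0.5,
) -> Tuple[str, int]:
    """Return (label, stage_used) for one feature vector.

    Stage 1 is accepted only if all of its models agree on the sign
    and every score is at least `margin` away from zero (assumed
    ambiguity rule). Otherwise the costlier stage 2 decides.
    """
    scores = [model(x) for model in stage1_models]
    all_agree = len({s > 0 for s in scores}) == 1
    confident = all_agree and min(abs(s) for s in scores) >= margin
    if confident:
        return ("female" if scores[0] > 0 else "male", 1)
    return ("female" if stage2_model(x) > 0 else "male", 2)


# Hypothetical stand-ins for the trained models (toy linear scorers).
def normal_svm(x: Features) -> float:   # trained on normal speech
    return x[0] - 1.0


def stress_svm(x: Features) -> float:   # trained on stressed speech
    return 0.5 * x[0] + x[1] - 1.0


def dnn(x: Features) -> float:          # second-stage model
    return x[0] + x[1] - 1.5


clear_case = cascade_predict([3.0, 1.0], [normal_svm, stress_svm], dnn)
ambiguous_case = cascade_predict([1.2, 0.0], [normal_svm, stress_svm], dnn)
print(clear_case, ambiguous_case)
```

In this sketch the second stage runs only for inputs on which the first stage is uncertain, which is the source of the efficiency argument: the more expensive model is invoked for a fraction of the calls rather than for all of them.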

  • Publication date: 2018-2