Abstract

The classification of human age and gender from speech and face images is a challenging task with important real-life applications that are expected to grow in the future. Deep neural networks (DNNs) and convolutional neural networks (CNNs) are among the state-of-the-art feature extractors and classifiers and have proven very effective on problems with complex feature spaces. In this work, we propose a new cost function for fine-tuning two DNNs jointly. The proposed cost function is evaluated on speech utterances and unconstrained face images for the age and gender classification task. The proposed classifier design consists of two DNNs trained on different feature sets extracted from the same input data. For speech, Mel-frequency cepstral coefficients (MFCCs) together with the fundamental frequency (F0) form the first feature set, and shifted delta cepstral (SDC) coefficients form the second. For face images, facial appearance forms the first feature set and depth information forms the second. Jointly training the two DNNs with the proposed cost function improved classification accuracy and reduced over-fitting for both the speech-based and the image-based systems. Extensive experiments were conducted to evaluate the performance and accuracy of the proposed work on two publicly available databases: the Age-Annotated Database of German Telephone Speech (aGender) and the Adience database. The overall accuracy of the proposed system is 56.06% for seven speaker classes, and the overall exact accuracy is 63.78% for the Adience database.
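The abstract does not specify the form of the proposed joint cost function. As a minimal illustrative sketch only, the following assumes a common formulation for jointly fine-tuning two classifiers on different feature sets of the same input: the sum of each network's cross-entropy loss plus a symmetric-KL agreement penalty that encourages the two DNNs to produce consistent posteriors. The function names, the agreement term, and the weight `lam` are assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_cost(logits_a, logits_b, labels, lam=0.1):
    """Hypothetical joint cost: cross-entropy of both networks plus a
    symmetric KL term penalizing disagreement between their outputs."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    n = labels.shape[0]
    eps = 1e-12
    ce_a = -np.log(pa[np.arange(n), labels] + eps).mean()
    ce_b = -np.log(pb[np.arange(n), labels] + eps).mean()
    # symmetric KL divergence between the two posterior distributions
    kl = 0.5 * ((pa * np.log((pa + eps) / (pb + eps))).sum(axis=1)
                + (pb * np.log((pb + eps) / (pa + eps))).sum(axis=1)).mean()
    return ce_a + ce_b + lam * kl
```

Under this sketch, gradients of the agreement term flow into both networks, so each DNN is regularized by the other's predictions, which is one plausible way a joint cost could reduce over-fitting relative to training the two networks independently.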

  • Publication date: 2017-11-1