Audio-visual speech recognition integrating 3D lip information obtained from the Kinect

Wang, Jianrong; Zhang, Ju; Honda, Kiyoshi; Wei, Jianguo*; Dang, Jianwu

doi:10.1007/s00530-015-0499-9

Abstract

Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, because visual speech information is typically obtained from planar video data, problems in region-of-interest detection and feature extraction can degrade recognition performance. In this paper, we depart from traditional visual speech information and propose an AVSR system that integrates 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. Different feature extraction and selection algorithms were applied to the planar images and to the 3D lip information, and the resulting planar-image and 3D lip features were fused into a joint visual-3D lip feature. For automatic speech recognition (ASR), fusion methods were investigated and the audio-visual speech information was integrated into a state-synchronous two-stream Hidden Markov Model. The experimental results demonstrated that our AVSR system integrating 3D lip information improved on the recognition performance of traditional ASR and AVSR systems in acoustically noisy environments.
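The abstract does not spell out how the two streams are combined inside the Hidden Markov Model. As a rough illustration only, a state-synchronous two-stream HMM is commonly scored by weighting the per-stream emission log-likelihoods of a shared state, with stream weights that sum to one. The sketch below assumes that convention and, for brevity, a single diagonal-covariance Gaussian per stream; the function names, the `state` dictionary layout, and the `audio_weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def two_stream_log_likelihood(audio_obs, visual_obs, state, audio_weight):
    """Stream-weighted emission log-likelihood for one shared HMM state.

    Assumed convention (not confirmed by the abstract):
        log b(o) = w_A * log b_A(o_A) + w_V * log b_V(o_V),  w_A + w_V = 1
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    log_audio = gaussian_log_pdf(audio_obs, state["audio_mean"], state["audio_var"])
    log_visual = gaussian_log_pdf(visual_obs, state["visual_mean"], state["visual_var"])
    # Both streams are scored against the same state at the same frame,
    # which is what makes the model state-synchronous.
    return audio_weight * log_audio + (1.0 - audio_weight) * log_visual


if __name__ == "__main__":
    # Hypothetical one-dimensional parameters, for illustration only.
    state = {
        "audio_mean": [0.0], "audio_var": [1.0],
        "visual_mean": [1.0], "visual_var": [4.0],
    }
    # A lower audio weight leans more on the visual stream, as one might
    # choose when the acoustic signal is noisy.
    print(two_stream_log_likelihood([0.2], [0.9], state, audio_weight=0.3))
```

Under this assumed convention, setting `audio_weight` to 1 reduces the score to audio-only recognition and setting it to 0 reduces it to visual-only recognition; intermediate values trade off the two streams.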

Publication date: 2016-6
Affiliation: Tianjin University


