Audio visual speech source separation via improved context dependent association model

Kazemi Alireza<sup>*</sup>; Boostani Reza; Sobhanmanesh Fariborz

doi:10.1186/1687-6180-2014-47

Abstract

In this paper, we exploit the non-linear relation between a speech source and its associated lip video as a source of extra information to propose an improved audio-visual speech source separation (AVSS) algorithm. The audio-visual association is modeled using a neural associator which estimates the visual lip parameters from a temporal context of acoustic observation frames. We define an objective function based on the mean square error (MSE) between the estimated and target visual parameters. This function is minimized to estimate the de-mixing vector or filters that separate the relevant source from linear instantaneous or time-domain convolutive mixtures. We also propose a hybrid criterion which uses AV coherency together with kurtosis as a non-Gaussianity measure. Experimental results are presented and compared in terms of visually relevant speech detection accuracy and output signal-to-interference ratio (SIR) of source separation. The suggested audio-visual model significantly improves relevant speech classification accuracy compared with the existing GMM-based model, and the proposed AVSS algorithm improves the speech separation quality compared with reference ICA- and AVSS-based methods.
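The hybrid criterion described in the abstract can be illustrated with a minimal sketch for the linear instantaneous case. The abstract does not specify the acoustic features, the associator architecture, or how the two terms are weighted, so everything below is an assumption made for illustration: the per-frame log-energy feature, the `associator` callable (a stand-in for the trained neural associator), the weight `lam`, and the choice to subtract the absolute excess kurtosis from the AV mean square error.

```python
import numpy as np

def excess_kurtosis(y):
    """Excess kurtosis of a 1-D signal (close to 0 for a Gaussian)."""
    y = y - y.mean()
    var = np.mean(y ** 2)
    return np.mean(y ** 4) / var ** 2 - 3.0

def stack_context(frames, context):
    """Concatenate each frame with `context` neighbours on both sides.
    frames: (T, D) -> (T - 2*context, (2*context + 1) * D)."""
    T = frames.shape[0]
    width = 2 * context + 1
    return np.hstack([frames[i:T - width + 1 + i] for i in range(width)])

def frame_log_energy(y, frame_len):
    """Toy acoustic feature (assumption): log energy per frame, shape (T, 1)."""
    n = len(y) // frame_len
    frames = y[:n * frame_len].reshape(n, frame_len)
    return np.log(np.mean(frames ** 2, axis=1, keepdims=True) + 1e-12)

def hybrid_cost(w, X, visual, associator, frame_len=160, context=2, lam=0.1):
    """AV mean square error minus a weighted non-Gaussianity term.
    w: (M,) de-mixing vector, X: (M, N) mixtures, visual: (T, K) lip parameters.
    `associator` is a hypothetical callable mapping context features to (T', K)."""
    w = w / np.linalg.norm(w)            # remove the scale ambiguity of w
    y = w @ X                            # candidate separated source
    feats = stack_context(frame_log_energy(y, frame_len), context)
    predicted = associator(feats)        # estimated visual parameters
    target = visual[context:context + len(predicted)]  # align with centre frames
    av_mse = np.mean((predicted - target) ** 2)
    return av_mse - lam * abs(excess_kurtosis(y))
```

A de-mixing vector could then be sought by passing `hybrid_cost` to any general-purpose minimizer over `w`; the convolutive case would replace the vector with de-mixing filters and is not covered by this sketch.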

Publication date: 2014-04-05

