Multimodal (audio-visual) source separation exploiting multi-speaker tracking, robust beamforming and time-frequency masking

Authors: Naqvi S Mohsen*; Wang W; Khan M Salman; Barnard M; Chambers J A
Source: IET Signal Processing, 2012, 6(5): 466-477.
DOI:10.1049/iet-spr.2011.0124

Abstract

A novel multimodal source separation approach is proposed for physically moving and stationary sources, which exploits a circular microphone array, multiple video cameras, robust spatial beamforming and time-frequency masking. The challenge in separating moving sources, and stationary sources at higher reverberation times (RT), is that the mixing filters are time varying; the unmixing filters should therefore also be time varying, but these are difficult to determine from audio measurements alone. In the proposed approach, the visual modality is therefore used to facilitate the separation of both stationary and moving sources. The movement of the sources is detected by a three-dimensional (3D) tracker based on a Markov chain Monte Carlo particle filter. The audio separation is performed by a robust least squares frequency-invariant data-independent beamformer. The uncertainties in the source localisation and direction-of-arrival information obtained from the 3D video-based tracker are controlled by a convex optimisation approach in the beamformer design. In the final stage, the separated audio sources are further enhanced by applying a binary time-frequency masking technique in the cepstral domain. Experimental results show that, by using the visual modality, the proposed algorithm can not only achieve better performance than conventional frequency-domain source separation algorithms, but also provide acceptable separation performance for moving sources.
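As a rough illustration of the final enhancement stage only (not the authors' implementation), the sketch below shows binary time-frequency masking applied to two beamformer output signals: each time-frequency bin is kept for whichever source estimate has the larger magnitude. The paper performs this masking in the cepstral domain; this simplified version operates directly on STFT magnitudes, and the signal names, sampling rate and STFT parameters are assumptions.

```python
# Minimal sketch of binary time-frequency masking on two beamformer outputs.
# Simplification: masks are computed from STFT magnitudes, not in the
# cepstral domain as in the paper.
import numpy as np
from scipy.signal import stft, istft

def binary_tf_mask(y1, y2, fs=16000, nperseg=1024):
    """Keep, for each time-frequency bin, only the dominant source estimate."""
    _, _, Y1 = stft(y1, fs=fs, nperseg=nperseg)
    _, _, Y2 = stft(y2, fs=fs, nperseg=nperseg)

    # Binary masks: assign each bin to the estimate with larger magnitude.
    mask1 = (np.abs(Y1) >= np.abs(Y2)).astype(float)
    mask2 = 1.0 - mask1

    # Apply the masks and resynthesise the enhanced time-domain signals.
    _, s1 = istft(Y1 * mask1, fs=fs, nperseg=nperseg)
    _, s2 = istft(Y2 * mask2, fs=fs, nperseg=nperseg)
    return s1, s2

if __name__ == "__main__":
    # Synthetic stand-ins for the two beamformer outputs (1 s at 16 kHz).
    rng = np.random.default_rng(0)
    y1 = rng.standard_normal(16000)
    y2 = rng.standard_normal(16000)
    s1, s2 = binary_tf_mask(y1, y2)
```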

  • Publication date: July 2012