Abstract

In conventional computational auditory scene analysis, segment segregation and pitch estimation are necessary steps. However, at low signal-to-noise ratios it is difficult to obtain an accurate pitch contour of the clean speech for segregating the segments. This often leads to an inaccurate binary mask estimate and produces artifacts and temporal discontinuities in the enhanced speech. To overcome this problem, we propose in this paper a new binary mask estimation method based on convex optimization of the speech power. In the proposed method, segment segregation and pitch estimation are excluded, and only the speech power in each Gammatone channel is used as the key cue for labeling the binary masks. An objective function of the speech power is built by considering the cross-correlation between the power spectra of the noisy speech and the noise in each channel, and the speech power is then solved for by gradient descent. Accordingly, the time-frequency units are labeled as speech or noise by computing a decision factor derived from the powers of the noisy speech, the estimated speech, and the pre-estimated noise. Erroneous local masks are refined by time-frequency unit smoothing. Objective measures, including the segmental signal-to-noise ratio, the HIT-False Alarm rate, the percentage of energy loss, and the percentage of noise residue, together with an additional subjective listening test, demonstrate the effectiveness of the proposed method.
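The following is a minimal sketch of the processing pipeline described above: a gradient-descent estimate of the per-channel speech power followed by decision-factor mask labeling and local mask smoothing. The quadratic objective, the ratio-type decision factor, the threshold of 0.5, and the 3x3 majority-vote smoothing window are illustrative assumptions and not the paper's exact formulation; in particular, the cross-correlation term of the actual objective function is omitted here.

```python
import numpy as np


def estimate_speech_power(noisy_power, noise_power, lr=0.05, n_iter=200):
    """Gradient-descent estimate of the speech power in each time-frequency unit.

    Hypothetical convex objective (not the paper's exact one):
        J(s) = || noisy_power - (s + noise_power) ||^2,  with s >= 0.
    """
    s = np.maximum(noisy_power - noise_power, 0.0)  # crude spectral-subtraction init
    for _ in range(n_iter):
        grad = 2.0 * (s + noise_power - noisy_power)  # dJ/ds
        s = np.maximum(s - lr * grad, 0.0)            # projected gradient step
    return s


def label_binary_mask(noisy_power, est_speech_power, threshold=0.5):
    """Label each time-frequency unit as speech (1) or noise (0).

    The decision factor (estimated speech power over noisy power) is an
    illustrative stand-in for the factor derived in the paper from the noisy,
    estimated-speech, and pre-estimated-noise powers.
    """
    decision = est_speech_power / np.maximum(noisy_power, 1e-12)
    return (decision > threshold).astype(np.uint8)


def smooth_mask(mask, size=3):
    """Refine erroneous local masks by majority voting over a size x size window."""
    pad = size // 2
    padded = np.pad(mask, pad, mode="edge")
    out = np.empty_like(mask)
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            window = padded[i:i + size, j:j + size]
            out[i, j] = 1 if window.mean() >= 0.5 else 0
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy per-unit powers: 64 Gammatone channels x 100 frames (synthetic values).
    noise_power = rng.uniform(0.5, 1.5, (64, 100))
    clean_power = rng.uniform(0.0, 3.0, (64, 100))
    noisy_power = clean_power + noise_power

    est = estimate_speech_power(noisy_power, noise_power)
    mask = smooth_mask(label_binary_mask(noisy_power, est))
    print("speech-dominant units:", int(mask.sum()), "of", mask.size)
```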

  • Publication date: 2018-3