Abstract

In this paper, we investigate how deep neural networks can effectively learn features for audio forensic problems. Using a preliminary feature preprocessing step based on electric network frequency (ENF) analysis, we propose a convolutional neural network (CNN) that classifies audio recordings as genuine or recaptured. The network learns hierarchical representations that capture the ENF components at different levels of detail, and these representations can be used for further classification. The proposed method works on short audio clips of 2-second duration, whereas state-of-the-art methods may fail on clips this short. Experimental results demonstrate that the proposed network yields high detection accuracy when each ENF harmonic component is represented as a single-channel input. Performance improves further with a combined input representation that incorporates both the fundamental ENF and its harmonics. We also study the convergence behavior of the network and the effect of the analysis window size. A performance comparison against the support tensor machine demonstrates the advantage of using a CNN for audio recapture detection. Moreover, visualization of the intermediate feature maps provides some insight into what the network actually learns and how it makes decisions.