摘要

Video action recognition is widely applied in video indexing, intelligent surveillance, multimedia understanding, and other fields. Recently, it was greatly improved by incorporating the learning of deep information using Convolutional Neural Network (CNN). This motivated us to review the notable CNN-based action recognition works. Because CNN is primarily designed to extract 2D spatial features from still image and videos are naturally viewed as 3D spatiotemporal signals, the core issue of extending the CNN from image to video is temporal information exploitation. We divide the solutions for exploiting temporal information exploration into three strategies: 1) 3D CNN; 2) taking the motion-related information as the CNN input; and 3) fusion. In this paper, we present a comprehensive review of the CNN-based action recognition methods according to these strategies. We also discuss the action recognition performance on recent large-scale benchmarks and the limitations and future research directions of CNN-based action recognition. This paper offers an objective and clear review of CNN-based action recognition and provides a guide for future research.