Abstract

Human activity recognition in videos using convolutional neural network (CNN) features has received increasing attention in multimedia understanding. Treating videos as sequences of frames, recent work set a new record on several benchmark datasets by feeding frame-level CNN sequence features to a long short-term memory (LSTM) model for video activity recognition. This recurrent-model-based visual recognition pipeline is a natural choice for perceptual problems with time-varying visual input or sequential outputs. However, the above pipeline feeds frame-level CNN sequence features to the LSTM, which may fail to capture the rich motion information across adjacent frames or multiple clips. Furthermore, an activity is performed by one or more subjects, so it is important to employ attention that emphasizes salient features rather than mapping an entire frame into a static representation. To tackle these issues, we propose a novel pipeline, saliency-aware three-dimensional (3-D) CNN with LSTM, for video action recognition that integrates LSTM with saliency-aware deep 3-D CNN features computed on video shots. Specifically, we first apply saliency-aware methods to generate saliency-aware videos. We then design an end-to-end pipeline that integrates a 3-D CNN with an LSTM, followed by a time-series pooling layer and a softmax layer to predict the activities. Notably, we set a new record on two benchmark datasets, i.e., UCF101 with 13,320 videos and HMDB-51 with 6,766 videos. Our method outperforms state-of-the-art end-to-end methods for action recognition by 3.8% and 3.2%, respectively, on these two datasets.
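To make the described architecture concrete, below is a minimal sketch (not the authors' released code) of the clip-level pipeline: 3-D CNN features per saliency-aware video shot, an LSTM over the shot sequence, a temporal (time-series) pooling layer, and a softmax classifier. It assumes PyTorch; the r3d_18 backbone, layer sizes, and mean pooling are illustrative assumptions, not choices taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18  # illustrative 3-D CNN backbone

class Saliency3DCNNLSTM(nn.Module):
    def __init__(self, num_classes=101, hidden_size=512):
        super().__init__()
        backbone = r3d_18(weights=None)
        backbone.fc = nn.Identity()  # drop the head; keep the 512-d clip embedding
        self.cnn3d = backbone
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, num_clips, 3, frames, H, W), saliency-aware video shots
        b, n = clips.shape[:2]
        feats = self.cnn3d(clips.flatten(0, 1))  # (b*n, 512) per-clip 3-D CNN features
        feats = feats.view(b, n, -1)             # (b, n, 512) clip sequence
        out, _ = self.lstm(feats)                # (b, n, hidden) recurrent encoding
        pooled = out.mean(dim=1)                 # time-series pooling over clips
        return self.classifier(pooled)           # logits; softmax gives class probabilities

# Usage: a batch of 2 videos, each split into 8 clips of 16 frames at 112x112.
model = Saliency3DCNNLSTM()
video = torch.randn(2, 8, 3, 16, 112, 112)
print(model(video).shape)  # torch.Size([2, 101])
```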