Abstract

Human action recognition in naturalistic videos is an important task with a broad range of applications. Recently, attention-based encoder-decoder frameworks have been applied to action recognition. Although such methods achieve state-of-the-art results, they often struggle to distinguish similar actions. To address this problem, we propose a novel recurrent attention convolutional neural network (RACNN), which combines convolutional neural networks (CNNs), long short-term memory (LSTM), and an attention mechanism. Motivated by the composition of an action, in which the pre-action and the result of the action may both be important parts, we introduce a bidirectional LSTM with a hierarchical structure. Additionally, our method employs separate spatial and temporal attention. Furthermore, we find that combining spatio-temporal features extracted by three-dimensional CNNs (3DCNNs) with RGB features strengthens the relationships mined within each frame. Comprehensive experiments on two benchmark datasets, HMDB51 and UCF101, verify the effectiveness of the proposed method and show that it significantly outperforms current state-of-the-art methods.