Abstract

In this paper, we consider multi-stage joint training of a two-stream convolutional neural network for action recognition in videos. The challenge in action recognition is to extract appearance and motion information that describes actions efficiently, and to correctly classify videos of differing quality. The proposed architecture has favorable model capacity and allows appearance and motion information to be extracted effectively from video frames. In addition, with the proposed multi-stage joint training strategy, multiple classifiers are jointly optimized to handle action video samples of different quality. Finally, a Support Vector Machine classifier is used in place of the Softmax classifier, achieving good classification results. Our model is trained and evaluated on the standard UCF-101 action benchmark, and we also test it on the HMDB51 dataset through transfer learning; the results show that our method is competitive with the state of the art.