Abstract

Automatically describing videos containing rich, open-domain activities is a very challenging task for computer vision and machine learning research. Accurate description of video content requires an understanding of both visual concepts and their temporal dynamics. Considerable effort has been devoted to understanding visual concepts in still-image tasks, e.g., image classification and object detection. However, the combination of visual concepts and temporal dynamics has not received sufficient attention. To exploit the unique characteristics of videos, we propose a novel video captioning architecture that integrates both visual concepts and temporal dynamics. In this paper, an attention mechanism and memory networks are combined in a multimodal framework together with a feature selection algorithm. Specifically, we use the soft attention mechanism to select frames relevant to the visual concepts based on previously generated words, while the memorization of temporal dynamics is handled by memory networks, which excel at retaining long-term information. The visual concepts and the temporal dynamics are then integrated in our multimodal architecture. Moreover, the feature selection algorithm chooses the more relevant of the two feature streams according to the part of speech of the word being generated. Finally, we evaluate the proposed framework on both the MSVD and MSR-VTT datasets and achieve competitive performance compared with other state-of-the-art methods.
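For intuition, the sketch below illustrates the soft attention step described above: per-frame features are scored against a decoder hidden state that summarizes previously generated words, and the resulting weights produce a context vector. This is a minimal, self-contained illustration with randomly initialized parameters; all names, shapes, and the additive scoring form are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def soft_attention(frame_feats, hidden, W_v, W_h, w):
    """Context vector as an attention-weighted sum of frame features.

    frame_feats: (T, d_v) per-frame features (e.g., CNN outputs)
    hidden:      (d_h,) decoder state summarizing previously generated words
    W_v, W_h, w: projection parameters (random here, learned in practice)
    """
    # Additive (Bahdanau-style) scores: one scalar per frame.
    scores = np.tanh(frame_feats @ W_v + hidden @ W_h) @ w  # (T,)
    # Softmax over frames yields the attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of frame features gives the context vector.
    return weights @ frame_feats  # (d_v,)

# Toy usage with random features and parameters.
rng = np.random.default_rng(0)
T, d_v, d_h, d_a = 8, 16, 12, 10
ctx = soft_attention(
    rng.normal(size=(T, d_v)), rng.normal(size=d_h),
    rng.normal(size=(d_v, d_a)), rng.normal(size=(d_h, d_a)),
    rng.normal(size=d_a),
)
print(ctx.shape)  # (16,)
```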