Attention-based LSTM with Semantic Consistency for Videos Captioning

Authors: Guo Zhao*; Gao Lianli; Song Jingkuan; Xu Xing; Shao Jie; Shen Heng Tao
Source: 24th ACM Multimedia Conference (MM), 2016-10-15 to 2016-10-19.
DOI: 10.1145/2964284.2967242

Abstract

Recent progress in using Long Short-Term Memory (LSTM) networks for image description has motivated the exploration of their application to automatically describing video content with natural language sentences. By treating a video as a sequence of features, an LSTM model is trained on video-sentence pairs to learn the association between a video and a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without an attention mechanism that allows the model to focus on salient features. Furthermore, most existing approaches model only the translation error and ignore the correlation between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, for translating videos into natural sentences. This framework integrates an attention mechanism with the LSTM to capture salient structures of the video, and exploits the correlation between multi-modal representations to generate sentences with rich semantic content. More specifically, we first propose an attention mechanism that computes a dynamic weighted sum of local 2D Convolutional Neural Network (CNN) and 3D CNN representations. Then, an LSTM decoder takes these visual features at time t and the word-embedding feature at time t-1 to generate the next word. Finally, we use a multi-modal embedding to map the visual and sentence features into a joint space, guaranteeing semantic consistency between the sentence description and the video's visual content. Experiments on benchmark datasets demonstrate the superiority of our method over state-of-the-art baselines for video captioning in both BLEU and METEOR.
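
The abstract outlines a decoder that, at each step, attends over per-frame 2D/3D CNN features and feeds the attended context together with the previous word embedding into an LSTM. Below is a minimal PyTorch sketch of one such decoding step, written only to illustrate that structure; the module names, dimensions, and soft-attention formulation are assumptions for illustration and do not reproduce the authors' implementation or the joint-embedding (semantic consistency) loss.

```python
# Hypothetical sketch of an attention-based LSTM decoding step for video captioning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveLSTMDecoder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Soft attention: score each frame feature against the previous hidden state.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # One LSTM step consumes [attended visual context; previous word embedding].
        self.lstm = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, feats, prev_word, h, c):
        # feats: (B, T, feat_dim) per-frame 2D/3D CNN features; prev_word: (B,) word ids.
        scores = self.att_score(torch.tanh(
            self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))  # (B, T, 1)
        alpha = F.softmax(scores, dim=1)                           # attention weights over frames
        context = (alpha * feats).sum(dim=1)                       # dynamic weighted sum of features
        h, c = self.lstm(torch.cat([context, self.embed(prev_word)], dim=1), (h, c))
        return self.out(h), h, c                                   # word logits and new LSTM state
```

In training, the per-step word logits would typically be scored with a cross-entropy loss against the reference caption; the paper additionally enforces semantic consistency by embedding the visual and sentence features into a joint space, which is not shown in this sketch.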