Abstract

Video understanding is a challenging problem that has attracted considerable research attention. Recently, a new task called movie fill-in-the-blank (MovieFIB) was proposed: given a movie clip and a description containing one blank, the goal is to predict the word that fills the blank. Previous studies have made progress on this problem. However, some do not exploit the relationship between words and video frames, while others treat visual information as an essential element of blank-word prediction and therefore fail to distinguish the effects of the text before and after the blank. To overcome these limitations, we propose to use adaptive temporal attention and to fuse text information with attention. We first extract video and word features. Adaptive temporal attention is then used to update the original description, and text information is extracted from the updated description. An attention mechanism is applied to fuse this text information. Finally, adaptive temporal attention is used again to predict the blank word. Extensive experiments demonstrate that our model achieves satisfactory performance.
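To make the idea of attending over video frames concrete, the following is a minimal sketch of generic soft temporal attention, in which a word-level query vector weights per-frame features. It is not the adaptive variant proposed in this paper, whose details are not given in the abstract; the scaled dot-product scoring, the function names `softmax` and `temporal_attention`, and the toy feature values are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(query, frames):
    """Attend over frames with a query vector (hypothetical helper).

    query:  list of floats, e.g. a word feature.
    frames: list of per-frame feature vectors, same dimension as query.
    Returns (weights, context): one weight per frame and the
    weighted sum of the frame features.
    """
    scale = math.sqrt(len(query))
    # Scaled dot-product score between the query and each frame (assumed scoring).
    scores = [sum(q * f for q, f in zip(query, frame)) / scale for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Context vector: attention-weighted average of frame features.
    context = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return weights, context


# Toy example: three frames with 2-dimensional features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = temporal_attention(query, frames)
```

In this toy input, the first and third frames align equally with the query and so receive equal weight, while the second frame receives less. An adaptive scheme would additionally decide how much to rely on the visual context at all, which this sketch does not model.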