Movie fill in the blank by joint learning from video and text with adaptive temporal attention

Chen, Jie; Shao, Jie<sup>*</sup>; He, Chengkun

doi:10.1016/j.patrec.2018.06.030

Abstract

Video understanding is a challenging problem that has attracted considerable research attention. Recently, a new task called movie fill-in-the-blank (MovieFIB) was proposed: given a movie clip and a description containing one blank, the goal is to predict the missing word accurately. Previous studies have made many contributions to this problem. However, some do not exploit the relationship between words and video frames, while others treat visual information as essential to blank-word prediction and fail to distinguish the effects of the text before and after the blank. To overcome these limitations, we propose to use adaptive temporal attention and to fuse text information with attention. We first extract video and word features. Adaptive temporal attention is then used to update the original description, and text information is extracted from the updated description. An attention mechanism is applied to fuse this text information. Finally, adaptive temporal attention is used again to predict the blank word. Extensive experiments demonstrate that our model achieves satisfactory performance.
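The abstract does not give the model's equations, so the following is only a minimal sketch of what soft temporal attention with an adaptive gate can look like, not the authors' formulation. The dot-product scoring, the scalar gate `gate_logit`, and all function names are assumptions introduced here for illustration; in a trained model the gate would be predicted from the text state rather than passed in by hand.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_attention(frames, query):
    # Score each frame feature against the text query (dot product is an
    # assumed scoring function), then average the frames by those weights.
    weights = softmax([dot(f, query) for f in frames])
    dim = len(frames[0])
    context = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]
    return weights, context

def adaptive_fuse(context, text_state, gate_logit):
    # Hypothetical adaptive gate: beta near 1 relies on the video context,
    # beta near 0 falls back on the text state alone.
    beta = 1.0 / (1.0 + math.exp(-gate_logit))
    return [beta * c + (1.0 - beta) * t for c, t in zip(context, text_state)]

# Toy example: three frame features and one text query, all 2-dimensional.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = temporal_attention(frames, query)
fused = adaptive_fuse(context, query, gate_logit=0.0)
```

The gate is what makes the attention "adaptive" in this sketch: when the blank word depends mostly on the surrounding text, a low gate value lets the prediction ignore the visual context.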

Publication date: 2020-4
Affiliation: University of Electronic Science and Technology of China

