Multimodal Deep Embedding via Hierarchical Grounded Compositional Semantics

Zhuang, Yueting<sup>*</sup>; Song, Jun; Wu, Fei; Li, Xi; Zhang, Zhongfei; Rui, Yong

doi:10.1109/TCSVT.2016.2606648

Abstract

For a number of important problems, isolated semantic representations of individual syntactic words or visual objects do not suffice; instead, a compositional semantic representation is required, for example, of a literal phrase or a set of spatially concurrent objects. In this paper, we aim to harness existing image-sentence databases to exploit the compositional nature of image-sentence data for multimodal deep embedding. In particular, we propose an approach called hierarchical-alike (bottom-up two layers) multimodal grounded compositional semantics (hiMoCS) learning. The proposed hiMoCS systematically captures the compositional semantic connotation of multimodal data in the setting of hierarchical-alike deep learning by modeling the inherent correlations between two modalities of collaboratively grounded semantics, such as the textual entity (with its describing attribute) and visual object, the phrase (e.g., subject-verb-object triplet), and spatially concurrent objects. We argue that hiMoCS is more appropriate for reflecting the multimodal compositional semantics of an image and its narrative textual sentence, which are strongly coupled. We evaluate hiMoCS on several benchmark data sets and show that using hiMoCS (textual entities and visual objects, textual phrase, and spatially concurrent objects) achieves much better performance than using only flat grounded compositional semantics.

Publication date: 2018-1
Affiliation: Zhejiang University

