Multimodal Deep Embedding via Hierarchical Grounded Compositional Semantics

Authors: Zhuang, Yueting*; Song, Jun; Wu, Fei; Li, Xi; Zhang, Zhongfei; Rui, Yong
Source: IEEE Transactions on Circuits and Systems for Video Technology, 2018, 28(1): 76-89.
DOI: 10.1109/TCSVT.2016.2606648

Abstract

For a number of important problems, isolated semantic representations of individual syntactic words or visual objects do not suffice; a compositional semantic representation is required instead, for example, of a literal phrase or a set of spatially concurrent objects. In this paper, we aim to harness existing image-sentence databases to exploit the compositional nature of image-sentence data for multimodal deep embedding. In particular, we propose an approach called hierarchical-alike (bottom-up, two-layer) multimodal grounded compositional semantics (hiMoCS) learning. The proposed hiMoCS systematically captures the compositional semantic connotation of multimodal data in the setting of hierarchical-alike deep learning by modeling the inherent correlations between two modalities of collaboratively grounded semantics, such as the textual entity (with its describing attribute) and the visual object, or the phrase (e.g., a subject-verb-object triplet) and spatially concurrent objects. We argue that hiMoCS is more appropriate for reflecting the multimodal compositional semantics of an image and its narrative textual sentence, which are strongly coupled. We evaluate hiMoCS on several benchmark data sets and show that utilizing hiMoCS (textual entities and visual objects; textual phrases and spatially concurrent objects) achieves much better performance than using only flat grounded compositional semantics.