Abstract

With the development of multimedia technology, demand for effective cross-modal retrieval methods is growing. The key to cross-modal retrieval is analyzing the correlation between heterogeneous modalities. There are two main types of correlation: content correlation and semantic correlation. Semantic correlation is constructed at a high level of abstraction and is therefore closer to human understanding than content correlation. In this paper, we investigate a semantic model that constructs the semantic correlation for cross-modal retrieval. We assume that the semantic correlation of multimedia data from different modalities can be conditionally generated by semantic concepts in a probabilistic generation framework, and we propose the cross-modal semantic generation model (CMSGM) based on this assumption. We consider three cases of the cross-modal retrieval task. The first is the ideal case, in which all manifest concepts are available in the training data for constructing the correlation; for this case we propose manifest CMSGM (M-CMSGM), which applies CMSGM directly to the manifest semantic concepts for retrieval. The second is the case in which the training data contain no manifest concepts; for this case we propose latent CMSGM (L-CMSGM), which is based on latent semantic concepts learned by asymmetric spectral clustering. The last and most general case is the one in which only some of the manifest concepts are available; for this case we combine M-CMSGM and L-CMSGM into combinative CMSGM (C-CMSGM). Experimental results on Wikipedia featured articles and MIR Flickr show that our methods outperform previous state-of-the-art methods. Moreover, C-CMSGM maintains good performance when manifest concepts are lacking, which confirms its robustness and practicality.