Abstract

Understanding the sentiments of users from cross-media contents that contain texts and images is an important task for many social network applications. However, due to the semantic gap between cross-media features and sentiments, machine learning methods need a large number of human-labeled samples. Furthermore, for each kind of media content, many new human-labeled samples must constantly be added because new expressions of sentiments keep emerging. Fortunately, cross-media contents contain some emotion signals, such as emoticons, which indicate users' emotions. In order to use these weak labels to build a unified multi-modality sentiment learning framework, we propose an Explicit Emotion Signal (EES) based multi-modality sentiment learning approach that exploits a huge number of weakly labeled samples in sentiment learning. Our approach has three advantages. Firstly, only a few human-labeled samples are needed to reach the same performance obtained by traditional machine learning based sentiment prediction approaches. Secondly, the approach is flexible and can easily combine text-based and vision-based sentiment learning through deep neural networks. Thirdly, because a large number of weakly labeled samples can be used in EES, the trained model is more robust under domain transfer. In this paper, we first investigate the correlation between sentiments and emoticons and choose emoticons as the Explicit Emotion Signals in our approach; we then build a two-stage multi-modality sentiment learning framework based on Explicit Emotion Signals. Our experimental results show that our approach not only achieves the best performance but also needs only 3% and 43% of the training samples to match the performance of the Visual Geometry Group (VGG) model on images and the Long Short-Term Memory (LSTM) model on texts, respectively.
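
To make the two-stage idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes precomputed VGG image features, a toy vocabulary, PyTorch, and hypothetical data loaders `weak_loader` (emoticon-derived weak labels) and `human_loader` (the small human-labeled set); all module and function names are illustrative.

```python
import torch
import torch.nn as nn

class EESMultiModalNet(nn.Module):
    """Hypothetical two-branch network: an LSTM text branch and an image
    branch over precomputed VGG features, fused for sentiment prediction."""
    def __init__(self, vocab_size, embed_dim=128, img_feat_dim=4096,
                 hidden_dim=256, num_classes=2):
        super().__init__()
        # Text branch: token embedding followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Image branch: projection of precomputed VGG features (assumption).
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Fusion classifier over the concatenated modality representations.
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, token_ids, img_feats):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        text_repr = h_n[-1]                         # last LSTM hidden state
        img_repr = torch.relu(self.img_proj(img_feats))
        return self.classifier(torch.cat([text_repr, img_repr], dim=1))


def train_stage(model, loader, epochs, lr):
    """One training stage: run first on the weakly labeled emoticon data,
    then again on the small human-labeled set for fine-tuning."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for token_ids, img_feats, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(token_ids, img_feats), labels)
            loss.backward()
            opt.step()
    return model


# Stage 1: pre-train on a huge number of weakly labeled (emoticon) samples.
# model = train_stage(EESMultiModalNet(vocab_size=50000), weak_loader,
#                     epochs=5, lr=1e-3)
# Stage 2: fine-tune on the few human-labeled samples with a smaller rate.
# model = train_stage(model, human_loader, epochs=3, lr=1e-4)
```

The two calls to `train_stage` mirror the two stages described above: the weak emoticon signals supply cheap supervision for representation learning, after which only a small human-labeled set is needed to calibrate the sentiment classifier.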