Abstract

Deep learning has been successfully applied to multimodal representation learning. Like single-modal deep learning methods, such multimodal deep learning methods consist of greedy layer-wise feedforward propagation followed by backpropagation (BP) fine-tuning driven by diverse objectives. These models have the drawback of being time-consuming to train. In contrast, the extreme learning machine (ELM) is a fast learning algorithm for single-hidden-layer feedforward neural networks, and previous work has shown the effectiveness of an ELM-based hierarchical framework for multilayer perceptrons. In this paper, we introduce an ELM-based hierarchical framework for multimodal data. The proposed architecture consists of three main components: (1) self-taught feature extraction for each specific modality by an ELM-based sparse autoencoder, (2) fused representation learning based on the features learned in the previous step, and (3) supervised classification based on the fused representation. The framework is strictly feedforward: once a layer is established, its weights are fixed without fine-tuning. It therefore has much better learning efficiency than gradient-based multimodal deep learning methods. Experiments on the MNIST, XRMB and NUS datasets show that the proposed algorithm converges faster and achieves better classification performance than other existing multimodal deep learning models.
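
To illustrate the feedforward, fine-tuning-free pipeline summarized above, the following is a minimal NumPy sketch of an ELM autoencoder layer and a hypothetical two-modality fusion stack. The function names are illustrative only, and the sketch uses an l2 (ridge) regularized closed-form solution rather than the sparse penalty used in the paper; it is not the authors' implementation.

```python
import numpy as np

def elm_autoencoder(X, n_hidden, reg=1e-3, seed=None):
    """One ELM autoencoder layer: random hidden weights, analytic output weights.
    Sketch assumption: l2 (ridge) solution instead of the paper's sparse l1 penalty."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights, never trained
    b = rng.standard_normal(n_hidden)                  # random biases
    H = np.tanh(X @ W + b)                             # hidden-layer activations
    # Solve H @ beta ~= X in closed form; no backpropagation is involved.
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ X)
    return beta                                        # beta.T maps input -> learned features

def encode(X, beta):
    """Project data through the learned autoencoder weights."""
    return np.tanh(X @ beta.T)

def multimodal_features(X1, X2, n_hidden=256):
    """Hypothetical two-modality pipeline: per-modality self-taught features,
    then a fused representation layer; each layer is fixed once solved."""
    b1 = elm_autoencoder(X1, n_hidden)
    b2 = elm_autoencoder(X2, n_hidden)
    fused_in = np.hstack([encode(X1, b1), encode(X2, b2)])
    b_fused = elm_autoencoder(fused_in, n_hidden)
    return encode(fused_in, b_fused)                   # input to a supervised classifier
```

Because every layer is solved in closed form and then frozen, training cost is dominated by a few matrix products and linear solves, which is the source of the efficiency advantage over gradient-based fine-tuning.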