Abstract

Scene classification is one of the most important tasks in remote sensing image processing. To obtain a highly discriminative feature representation for an image to be classified, traditional methods usually accumulate densely extracted hand-crafted low-level descriptors (e.g., the scale-invariant feature transform) with feature encoding techniques. However, performance is largely limited by the hand-crafted descriptors, as they cannot describe the rich semantic information contained in diverse remote sensing images. To alleviate this problem, we propose a novel method that extracts discriminative image features from the rich hierarchical information contained in convolutional neural networks (CNNs). Specifically, the low-level and middle-level intermediate convolutional features are each encoded with the vector of locally aggregated descriptors (VLAD) and then reduced by principal component analysis (PCA) to obtain hierarchical global features; meanwhile, the fully connected features are average pooled and subsequently normalized to form additional global features. The proposed encoded mixed-resolution representation (EMR) is the concatenation of all of these global features. Because both encoding strategies (VLAD and average pooling) are independent of the spatial resolution of the feature maps, our method can handle images of different sizes. In addition, to reduce the computational cost of the training stage, we extract EMR directly from VGG-VD and ResNet models pretrained on the ImageNet dataset. We show in this paper that CNNs pretrained on a natural image dataset transfer more readily to a remote sensing dataset when the local structural similarity between the two datasets is higher. Experimental evaluations on the UC-Merced and Brazilian Coffee Scenes datasets demonstrate that our method outperforms the state of the art.
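
As a concrete illustration of the pipeline described above, the following is a minimal NumPy/scikit-learn sketch, not the authors' implementation. The helper names (build_codebook, fit_pca, vlad_encode, emr_feature), the codebook size, the PCA dimension, and the treatment of fully connected activations as spatial maps (i.e., applying the network fully convolutionally to variable-size inputs before average pooling) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def build_codebook(train_descriptors, k=100):
    """Learn a VLAD codebook with k-means over training descriptors (assumed k)."""
    return KMeans(n_clusters=k, n_init=10).fit(train_descriptors).cluster_centers_


def fit_pca(train_vlad_codes, dim=128):
    """Fit PCA offline on VLAD codes of training images (assumed dimension)."""
    return PCA(n_components=dim).fit(train_vlad_codes)


def vlad_encode(descriptors, codebook):
    """VLAD: sum residuals of each local descriptor to its nearest centroid,
    then apply power and L2 normalization."""
    k, d = codebook.shape
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assignments = dists.argmin(axis=1)
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assignments == i]
        if len(members):
            vlad[i] = (members - codebook[i]).sum(axis=0)
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))  # power normalization
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)  # L2 normalization


def emr_feature(conv_maps, fc_maps, codebooks, pcas):
    """Concatenate PCA-reduced VLAD codes of intermediate convolutional
    features with average-pooled, L2-normalized fully connected features.
    All maps are H x W x C arrays; each spatial position is treated as one
    C-dimensional local descriptor, so inputs of any size are accepted."""
    parts = []
    for fmap, codebook, pca in zip(conv_maps, codebooks, pcas):
        descriptors = fmap.reshape(-1, fmap.shape[-1])
        parts.append(pca.transform(vlad_encode(descriptors, codebook)[None, :])[0])
    for fmap in fc_maps:
        pooled = fmap.reshape(-1, fmap.shape[-1]).mean(axis=0)   # average pooling
        parts.append(pooled / (np.linalg.norm(pooled) + 1e-12))  # normalization
    return np.concatenate(parts)
```

Because VLAD and average pooling both collapse the spatial dimensions of each feature map into a fixed-length vector, the resulting EMR has the same dimensionality regardless of the input image size, which is the property the abstract highlights.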