Abstract

Convolutional Neural Networks (CNNs) have achieved breakthrough results on a large number of image retrieval benchmarks. However, most previous works employ CNNs following the image classification strategy, where the activations of the last fully connected layer on the whole image are used as a single holistic feature vector. To improve the representation power of CNNs, this paper proposes a Multilayer Fusion (MF) approach that aggregates deep activations for the image retrieval task. The key insight of our approach is that different layers of a CNN are sensitive to specific patterns and complement each other for image representation. Specifically, our approach transforms CNN activations into deep binary codes embedded in the inverted index of a Bag-of-Words structure for fast retrieval. These activations are extracted from multiple layers of a CNN on local patches, since features from orderless local regions have proved superior to global ones in the case of low-level handcrafted features. Layer-specific weights and a diffusion process are then applied to penalize and re-rank the similarity scores of the individual layers. Our method is efficient, extracting visual features from the different layers in a single pass. Furthermore, the proposed MF approach can easily be extended to incorporate SIFT features, further enhancing the representation power. Extensive experiments on four public retrieval datasets quantitatively demonstrate the effectiveness of our contributions, and the proposed algorithm establishes a new state-of-the-art on the Holidays and UKBench datasets.
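
For concreteness, the sketch below illustrates how activations from multiple layers might be collected in a single forward pass, binarized into compact codes, and fused at the score level. It is a minimal illustration assuming a pretrained torchvision VGG-16 backbone; the layer indices, the mean-threshold `binarize` rule, and the `fuse_scores` helper are hypothetical stand-ins for exposition, not the paper's actual pipeline (which operates on local patches, embeds binary codes in a Bag-of-Words inverted index, and re-ranks with a diffusion process).

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained backbone; VGG-16 is an assumption for this sketch.
backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

# Hypothetical indices of the layers whose activations are fused.
LAYER_IDS = {4, 9, 16, 23, 30}

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def multilayer_activations(image: Image.Image) -> list[torch.Tensor]:
    """One forward pass; collect one descriptor per selected layer."""
    x = preprocess(image).unsqueeze(0)
    feats = []
    with torch.no_grad():
        for i, layer in enumerate(backbone):
            x = layer(x)
            if i in LAYER_IDS:
                # Global max-pool each feature map into a vector.
                feats.append(x.amax(dim=(2, 3)).squeeze(0))
    return feats

def binarize(v: torch.Tensor) -> torch.Tensor:
    """Hypothetical binarization: threshold each dimension at the
    vector's mean, yielding a compact binary code for indexing."""
    return (v > v.mean()).to(torch.uint8)

def fuse_scores(per_layer_scores: list[torch.Tensor],
                weights: list[float]) -> torch.Tensor:
    """Weighted fusion of per-layer similarity scores; the diffusion
    re-ranking step of the paper is omitted from this sketch."""
    return sum(w * s for w, s in zip(weights, per_layer_scores))
```

In this sketch, each layer contributes one pooled descriptor per image, so a database image is represented by a small set of binary codes rather than a single holistic vector, and retrieval scores from the different layers are combined with per-layer weights, mirroring the fusion idea described above.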