Abstract

Image description generation has great practical value in online image search. Inspired by recent advances in the study of the neocortex, we design a deep image understanding framework that implements a description generator for general images involving human activities. Unlike existing work on image description, which treats the task as a retrieval problem rather than attempting to understand the image, our framework recognizes the human-object interaction (HOI) activity in the image based on a co-occurrence analysis of the 3-D spatial layout and generates a natural language description of what is actually happening in the image. We propose a deep hierarchical model for image recognition and a syntactic tree-based model for natural language generation. To support online image search, the two models are designed to uniformly extract features from humans and different object classes and to produce well-formed sentences that describe exactly what is happening in the image. Through experiments on a dataset containing images from the phrasal recognition dataset, the six-class sports dataset, and the UIUC Pascal sentence dataset, we demonstrate that our framework outperforms state-of-the-art methods in recognizing HOI activities and generating image descriptions.