Abstract

This paper presents an effective deep attention network for joint hand gesture localization and recognition from static RGB-D images. Our method trains a CNN equipped with a soft attention mechanism in an end-to-end manner, automatically localizing hands and classifying gestures with a single network rather than the conventional stage-wise pipeline of hand segmentation/detection followed by classification. More precisely, the attention network first computes a weight for each region proposal generated from the entire image, reflecting the probability that a hand appears in that region. It then aggregates all proposals through a global sum weighted by these values to obtain a representation of the entire image. We demonstrate the feasibility and effectiveness of our method through extensive experiments on the NTU Hand Digits (NTU-HD) benchmark and the challenging HUST American Sign Language (HUST-ASL) dataset. Moreover, the proposed attention network is simple to train, requiring neither bounding-box nor segmentation-mask annotations, which makes it easy to deploy in hand gesture recognition systems. Using the proposed attention network with RGB-D images as input, we achieve state-of-the-art hand gesture recognition performance on the challenging HUST-ASL dataset.
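For concreteness, the following is a minimal sketch of the weighted global-sum pooling described above, assuming softmax-normalized attention weights over proposals; the function name soft_attention_pool and the NumPy formulation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def soft_attention_pool(proposal_features: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Weighted global-sum pooling over region proposals.

    proposal_features: (N, D) array, one CNN feature vector per proposal.
    scores: (N,) raw attention logits estimating hand presence per proposal.
    Returns a single (D,) image-level representation.
    (Softmax normalization is an assumption; the paper only states that the
    sum is influenced by the per-proposal weights.)
    """
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()                  # weights now sum to 1 over proposals
    return weights @ proposal_features        # (N,) @ (N, D) -> (D,)

# Example: pool 5 hypothetical proposals with 8-dim features
feats = np.random.randn(5, 8)
logits = np.random.randn(5)
image_repr = soft_attention_pool(feats, logits)  # shape (8,)
```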