Abstract

In this paper, an eleven-layer Convolutional Neural Network with Visual Attention is proposed for facial expression recognition. The network consists of three components. First, local convolutional features of faces are extracted by a stack of ten convolutional layers. Second, regions of interest are determined automatically from these local features by an embedded attention model. Third, the local features within these regions are aggregated and used to infer the emotion label. The three components are integrated into a single network that can be trained end to end. Extensive experiments on four kinds of data (namely aligned frontal faces, faces in different poses, aligned unconstrained faces, and grouped unconstrained faces) demonstrate that the proposed method improves recognition accuracy and yields interpretable visualizations. The visualizations show that the learned regions of interest partly coincide with the locations of emotion-specific Action Units. This finding supports the interpretation of the Facial Action Coding System and the Emotional Facial Action Coding System from a machine learning perspective.
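The three-component pipeline described above (convolutional feature extraction, attention over spatial regions, and aggregation for classification) can be sketched in NumPy. This is a minimal illustration under assumed shapes and random weights, not the paper's actual implementation; the feature-map size, attention parameterization, and classifier are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Component 1 (assumed output): a conv stack would produce an (H, W, C)
# spatial feature map; here we stand in random features for illustration.
H, W, C = 7, 7, 64
features = rng.standard_normal((H, W, C))

# Component 2 (sketch): a learned projection scores each spatial location;
# a softmax over locations gives attention weights marking regions of interest.
w_att = rng.standard_normal(C)
scores = features.reshape(-1, C) @ w_att       # (H*W,) location scores
weights = np.exp(scores - scores.max())
weights /= weights.sum()                       # softmax over locations

# Component 3: aggregate local features, weighted by attention, into a
# single descriptor, then infer the emotion label with a linear classifier.
pooled = weights @ features.reshape(-1, C)     # (C,) attended descriptor
n_classes = 7                                  # e.g. the basic emotions
w_cls = rng.standard_normal((C, n_classes))
logits = pooled @ w_cls
label = int(np.argmax(logits))
print(pooled.shape, label)
```

In the actual network all three stages share gradients, so the attention weights are learned jointly with the convolutional features and the classifier in one end-to-end training scheme.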