Abstract

We address the recognition of human actions in uncontrolled videos that may contain complex temporal structures. The problem is difficult because of large intra-class variations in viewpoint, video length, motion pattern, and other factors. To address these difficulties, we propose a novel system that represents each action class by hidden temporal models. In this system, the crucial action event of each category is represented by a video segment that covers a fixed number of frames and can shift temporally within the sequence. To capture temporal structure, the video segment is described by a temporal pyramid model. To capture large intra-class variations, multiple models are combined with an OR operation to represent alternative structures. The index of the model and the start frame of the segment are both treated as hidden variables. We implement a learning procedure based on the latent SVM method. The proposed approach is evaluated on two challenging benchmarks, the Olympic Sports and HMDB51 data sets. The experimental results show that our system is comparable to state-of-the-art methods in the literature.
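To make the role of the two hidden variables concrete, the following is a minimal sketch of the inference step only, not the authors' implementation. It assumes per-frame feature vectors, mean pooling inside each pyramid cell, and a linear score per model; the names `temporal_pyramid` and `latent_score` are hypothetical, and latent SVM training is not shown.

```python
import numpy as np

def temporal_pyramid(frames, levels=2):
    """Concatenate mean-pooled features over 1, 2, 4, ... equal temporal cells.

    frames: array of shape (segment_length, feature_dim).
    Assumes segment_length >= 2 ** (levels - 1) so that no cell is empty.
    """
    parts = []
    for level in range(levels):
        for cell in np.array_split(frames, 2 ** level, axis=0):
            parts.append(cell.mean(axis=0))
    return np.concatenate(parts)

def latent_score(video, weights, seg_len, levels=2):
    """Maximise a linear score over the hidden variables.

    The hidden variables are the model index k (the OR over alternative
    models) and the start frame of the fixed-length segment.
    Returns (best score, model index, start frame).
    """
    best = (-np.inf, -1, -1)
    for start in range(video.shape[0] - seg_len + 1):
        phi = temporal_pyramid(video[start:start + seg_len], levels)
        for k, w in enumerate(weights):
            score = float(w @ phi)
            if score > best[0]:
                best = (score, k, start)
    return best
```

In this sketch the OR over models and the temporal shift of the segment both reduce to a maximisation, which is the structure a latent SVM alternates with weight updates during training.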