Abstract

Human actions may be observed from multiple views, which are highly correlated yet can appear quite different from one another. Traditional metric learning algorithms achieve satisfactory performance in the single-view setting, but they often fail or perform poorly when used to fuse different views. We therefore propose multi-view discriminative and structured dictionary learning with group sparsity and a graph model (GM-GS-DSDL) to fuse different views and recognize human actions. First, spatio-temporal interest points are extracted for each view and encoded with a multi-view bag-of-words (MVBoW) representation; at the same time, a graph model fuses the views by removing overlapping interest points so as to exploit their consistency properties. GM-GS-DSDL is then formulated to discover the latent correlations among the multiple views. In addition, we release a new multi-view action dataset with RGB, depth, and skeleton data (called CVS-MV-RGBD). Extensive experiments on the multi-view IXMAS and CVS-MV-RGBD datasets show that exploiting the consistency of different views via the graph model is highly beneficial; moreover, learning the per-view dictionaries of GM-GS-DSDL simultaneously further improves fusion performance. Comparative experiments demonstrate that the proposed algorithm achieves competitive performance against state-of-the-art methods.
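To make the formulation concrete, a joint objective of this kind typically takes the following form. This is only a minimal sketch under stated assumptions: the symbols $X_v$, $D_v$, $A_v$, $L_v$ and the weights $\lambda$, $\beta$, $\gamma$ are illustrative placeholders, not the paper's exact notation.

$$
\min_{\{D_v, A_v\}_{v=1}^{V}} \; \sum_{v=1}^{V} \Big( \underbrace{\lVert X_v - D_v A_v \rVert_F^2}_{\text{reconstruction}} \;+\; \underbrace{\lambda \lVert A_v \rVert_{2,1}}_{\text{group sparsity}} \;+\; \underbrace{\beta \, \operatorname{tr}\!\big(A_v L_v A_v^{\top}\big)}_{\text{graph regularization}} \;+\; \underbrace{\gamma \, f(A_v)}_{\text{discriminative term}} \Big)
$$

Here $X_v$ would denote the MVBoW feature matrix of view $v$, $D_v$ the dictionary learned for that view, $A_v$ its sparse codes, $\lVert \cdot \rVert_{2,1}$ a group-sparsity penalty that selects dictionary atoms jointly, $L_v$ a graph Laplacian encoding cross-view consistency, and $f(\cdot)$ a discriminative (e.g., label-consistency) term; minimizing over all views simultaneously couples the per-view dictionaries, which is what the abstract attributes the fusion gain to.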