Abstract

In this paper, we present a novel view-invariant motion-pose geometric descriptor (MPGD) as a human-human interaction representation that captures the semantic relationships between the body parts of two interacting humans. The proposed MPGD representation builds on the concept of anatomical planes to construct a motion profile and a pose profile for each human. These two profiles are then concatenated to form a descriptor of the two interacting humans. Using the proposed MPGD representation, we study two problems in human-human interaction analysis, namely human-human interaction classification and prediction. For the classification problem, we propose a hierarchical classification framework consisting of a representation layer and three classification layers. The framework aims to recognize which interaction is performed in an input video by understanding how and when each individual performs sub-activities toward the other over time. The prediction problem aims to predict the class of an ongoing human-human interaction at its early stages. To this end, we propose a prediction framework that uses the proposed MPGD to construct an accumulated-histogram-based representation of the ongoing interaction. The accumulated histograms of MPGDs are then used to train a set of support vector machine classifiers with probabilistic outputs to predict the class of the ongoing interaction. To evaluate the proposed MPGD representation and both the classification and prediction frameworks, we use a Microsoft Kinect sensor to capture a video dataset of human-human interactions consisting of 12 interactions performed by 12 individuals. We evaluate the performance of the proposed classification framework and compare the results with an appearance-based representation and with a representation that combines the MPGD and appearance-based representations. On the one hand, the proposed MPGD representation shows promising results compared to the appearance-based representation, achieving an average accuracy of 94.86% in classifying human-human interactions. On the other hand, the human-human interaction prediction framework achieves an average prediction accuracy of 82.46% with only 50% of the interaction video observed.
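To make the prediction pipeline described above more concrete, the following is a minimal sketch of the accumulated-histogram-plus-probabilistic-SVM idea. It assumes each observed frame of an interaction has already been reduced to a quantized MPGD codeword index; the bin count, RBF kernel, toy data, and the `accumulated_histogram` helper are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical dimensions: illustrative values, not taken from the paper.
N_BINS = 64        # number of histogram bins for quantized MPGD codewords
N_CLASSES = 12     # the dataset described in the abstract has 12 interaction classes


def accumulated_histogram(mpgd_frames, n_bins=N_BINS):
    """Accumulate per-frame quantized MPGD codeword indices into a normalized histogram."""
    hist = np.bincount(np.asarray(mpgd_frames), minlength=n_bins).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist


# Toy training data: each training video is reduced to one accumulated histogram.
rng = np.random.default_rng(0)
train_videos = [rng.integers(0, N_BINS, size=rng.integers(30, 90)) for _ in range(120)]
train_labels = rng.integers(0, N_CLASSES, size=len(train_videos))
X_train = np.vstack([accumulated_histogram(v) for v in train_videos])

# SVM classifiers with probabilistic output, as the abstract describes.
clf = SVC(kernel="rbf", probability=True)
clf.fit(X_train, train_labels)

# Early prediction: observe only the first 50% of an ongoing interaction.
test_video = rng.integers(0, N_BINS, size=80)
observed = test_video[: len(test_video) // 2]
probs = clf.predict_proba(accumulated_histogram(observed).reshape(1, -1))[0]
print("Predicted class:", int(np.argmax(probs)), "probability:", float(probs.max()))
```

Because the histogram is accumulated only over the frames observed so far, the same classifier can be queried at any point during the interaction, which is what enables the early-stage prediction reported in the abstract.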

  • Publication date: 2015-8