Abstract

Human actions in movies and sitcoms often carry semantic cues for story understanding, offering a novel search paradigm beyond traditional video search scenarios. However, action-level video search faces great challenges, such as global motion, concurrent actions, and variations in actor appearance. In this paper, we introduce a generalized action retrieval framework that achieves fully unsupervised, robust, and actor-independent action search in large-scale databases. First, an Attention Shift model is presented to extract human-focused foreground actions from videos containing global motion or concurrent actions. Subsequently, a spatiotemporal vocabulary is built from 3D-SIFT features extracted in these human-focused action regions; these features offer robustness against rotation and viewpoint changes. The spatiotemporal vocabulary ensures search efficiency, which is achieved by an inverted indexing structure with approximate nearest-neighbor search. In online ranking, we employ a dynamic time warping distance to handle variations in action duration as well as partial action matching. Finally, an appearance hashing strategy is presented to address the performance degradation caused by divergent actor appearances. For experimental validation, we deployed the actor-independent action retrieval framework on three seasons of the sitcom "Friends" (over 30 hours of video). On this database, we report the best performance (MAP@1 > 0.53) in comparison with alternative and state-of-the-art approaches.
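To make the online ranking step concrete, the following is a minimal sketch of a dynamic time warping (DTW) distance between two action descriptor sequences, assuming each sequence is an array of per-frame feature vectors (e.g., quantized 3D-SIFT descriptors). The function name, interface, and Euclidean local cost are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dtw_distance(query, candidate):
    """DTW distance between two action descriptor sequences.

    query, candidate: arrays of shape (T, D), one D-dim descriptor
    per frame. Local cost is Euclidean distance between descriptors
    (an assumption); the warping path absorbs duration differences.
    """
    n, m = len(query), len(candidate)
    # cost[i, j] = minimal accumulated cost aligning query[:i] with candidate[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(query[i - 1] - candidate[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # skip a query frame
                                 cost[i, j - 1],      # skip a candidate frame
                                 cost[i - 1, j - 1])  # match both frames
    return cost[n, m]

# Illustrative usage: rank database actions by ascending DTW distance.
query = np.random.rand(20, 128)
database = [np.random.rand(np.random.randint(10, 40), 128) for _ in range(5)]
ranking = sorted(range(len(database)), key=lambda k: dtw_distance(query, database[k]))
```

In this sketch, relaxing the endpoint so the path may terminate before the end of the candidate sequence would be one way to realize the partial action matching mentioned above.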