摘要

Detecting complex events in a large video collection crawled from video websites is a challenging task. When applying directly good image-based feature representation, e.g., HOG, SIFT, to videos, we have to face the problem of how to pool multiple frame feature representations into one feature representation. In this paper, we propose a novel learning-based frame pooling method. We formulate the pooling weight learning as an optimization problem and thus our method can automatically learn the best pooling weight configuration for each specific event category. Extensive experimental results conducted on TRECVID MED 2011 reveal that our method outperforms the commonly used average pooling and max pooling strategies on both high-level and low-level features.