Abstract

Content-based video event detection currently faces great challenges due to the complex scenes and blurred actions typical of surveillance videos. To alleviate these challenges, we propose a novel spatial-temporal architecture of deep convolutional neural networks for this task. To exploit spatial-temporal information, we fine-tune two-stream networks and then fuse their spatial and temporal features at the convolutional layers using a 2D pooling fusion method, which enforces the consistency of spatial-temporal information. Combining the two-stream networks with this spatial-temporal fusion layer yields a triple-channel model. Furthermore, we apply trajectory-constrained pooling to both deep features and hand-crafted features to combine their merits. A fusion over the three channels produces the final detection result. Experiments on two benchmark surveillance video datasets, VIRAT 1.0 and VIRAT 2.0, which involve a suite of challenging events such as a person loading an object into a vehicle or a person opening a vehicle trunk, demonstrate that the proposed method achieves superior performance compared with state-of-the-art methods on these event benchmarks.
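As a rough illustration of the convolutional-layer fusion idea described above (not the paper's exact implementation), the following PyTorch sketch stacks same-shaped feature maps from the spatial and temporal streams along the channel axis, mixes them with a 1x1 convolution, and applies a 2D pooling step. The module name, feature-map shapes, and the mix-then-pool ordering are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SpatialTemporalFusion(nn.Module):
    """Illustrative sketch: fuse spatial- and temporal-stream conv features.

    Assumes both streams produce feature maps of identical shape
    (N, C, H, W) at the chosen fusion layer.
    """
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution mixes the stacked spatial/temporal channels
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # 2D pooling applied to the mixed feature maps
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, spatial_feat: torch.Tensor,
                temporal_feat: torch.Tensor) -> torch.Tensor:
        # Stack the two streams along the channel axis ...
        fused = torch.cat([spatial_feat, temporal_feat], dim=1)
        # ... mix them, then pool spatially
        return self.pool(self.mix(fused))

# Hypothetical usage on conv5-like feature maps from both streams
fusion = SpatialTemporalFusion(channels=512)
spatial = torch.randn(4, 512, 14, 14)    # appearance-stream features
temporal = torch.randn(4, 512, 14, 14)   # motion-stream features
out = fusion(spatial, temporal)          # shape: (4, 512, 7, 7)
```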