Abstract

In recent years, deep learning approaches have gained great attention due to their superior performance and the availability of high-speed computing resources. These approaches have also been extended to the real-time processing of multimedia content, exploiting its spatial and temporal structure. In this paper, we propose a deep learning-based video description framework that first extracts visual features from video frames using deep convolutional neural networks (CNNs) and then passes the derived representations to a long short-term memory (LSTM)-based language model. To capture accurate information about human presence, a fine-tuned multi-task CNN is presented. The proposed pipeline is end-to-end trainable and capable of learning dense visual features along with generating accurate natural language descriptions of video streams. Evaluation is performed by computing Metric for Evaluation of Translation with Explicit ORdering (METEOR) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores between system-generated and human-annotated video descriptions on a carefully designed data set. The video descriptions generated by a traditional feature-learning framework and the proposed deep learning framework are also compared through their ROUGE scores.
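To make the described CNN-to-LSTM pipeline concrete, below is a minimal sketch of such an encoder-decoder video captioner. It is not the paper's implementation: the backbone (ResNet-18), the single-layer LSTM decoder, and all dimensions and the vocabulary size are illustrative assumptions; the fine-tuned multi-task CNN for human presence is omitted.

```python
# Minimal sketch of a CNN + LSTM video-description pipeline (assumed design,
# not the paper's exact architecture). Frames are encoded per-frame by a CNN;
# the feature sequence is fed to an LSTM that then decodes caption tokens.
import torch
import torch.nn as nn
import torchvision.models as models

class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)           # per-frame CNN encoder (assumed)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.feat_proj = nn.Linear(512, embed_dim)         # project CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)       # next-word logits

    def forward(self, frames, captions):
        # frames: (batch, n_frames, 3, H, W); captions: (batch, seq_len) token ids
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (b*t, 512)
        feats = self.feat_proj(feats).view(b, t, -1)       # (b, t, embed_dim)
        words = self.embed(captions)                       # (b, seq_len, embed_dim)
        # Feed the visual feature sequence first, then the caption tokens.
        seq = torch.cat([feats, words], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden[:, t:])                     # logits over caption steps

# Toy usage: batch of 2 clips, 4 frames each, 10-token captions.
model = VideoCaptioner()
frames = torch.randn(2, 4, 3, 224, 224)
captions = torch.randint(0, 5000, (2, 10))
logits = model(frames, captions)                           # (2, 10, 5000)
```

Because the decoder consumes both the visual features and the ground-truth caption tokens during training, the whole model can be optimized end to end with a standard cross-entropy loss over the predicted token logits, consistent with the end-to-end trainability claimed above.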

  • Publication date: 2018