Abstract

Tracking objects across multiple frames is a well-investigated problem in computer vision. The majority of existing algorithms assume that an accurate initialization is readily available. However, in many real-life settings, in particular for applications where the video is streaming in real time, the initialization has to be provided by a human operator. This requirement inevitably introduces uncertainty. Here, we first collect a large new data set of more than 20K human initialization clicks, made by several subjects under three practical user-interface scenarios on the popular TB50 tracking benchmark. We analyze the factors and mechanisms of human input, derive statistical models, and show that human input always contains deviations, which grow as the relative object-camera motion becomes large. We also design and evaluate alternative refinement schemes, and propose a strategy that refits an object window on the most probable target region after a single click. To compensate for human initialization errors, our method generates window proposals using objectness cues extracted from color and motion attributes, accumulates them into a likelihood map that is weighted by the initial click position and visual saliency scores, and assigns the final window by the maximum likelihood estimate. Our experiments demonstrate that the presented refinement strategy effectively reduces human input errors.
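
The refinement step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `refine_click`, the Gaussian click prior with width `sigma`, the per-proposal `score`, and the rule of choosing the window with the highest mean weighted likelihood are all assumptions made for illustration. The proposals themselves, which the method derives from color and motion objectness cues, are taken as given input here.

```python
import math

def refine_click(click, proposals, width, height, saliency=None, sigma=20.0):
    """Return the proposal (x, y, w, h) with the highest mean weighted likelihood.

    click     -- (cx, cy) pixel position of the user's click
    proposals -- list of (x, y, w, h, score) windows (assumed precomputed)
    saliency  -- optional height x width nested list of saliency scores
    sigma     -- assumed width of the Gaussian prior around the click
    """
    cx, cy = click

    # Accumulate proposal scores into a per-pixel likelihood map.
    like = [[0.0] * width for _ in range(height)]
    for (x, y, w, h, score) in proposals:
        for r in range(max(0, y), min(height, y + h)):
            row = like[r]
            for c in range(max(0, x), min(width, x + w)):
                row[c] += score

    # Weight every pixel by a Gaussian prior around the click and by saliency.
    for r in range(height):
        for c in range(width):
            d2 = (c - cx) ** 2 + (r - cy) ** 2
            prior = math.exp(-d2 / (2.0 * sigma ** 2))
            sal = saliency[r][c] if saliency is not None else 1.0
            like[r][c] *= prior * sal

    # Select the window whose mean weighted likelihood is largest.
    best, best_val = None, -1.0
    for (x, y, w, h, _) in proposals:
        total, count = 0.0, 0
        for r in range(max(0, y), min(height, y + h)):
            for c in range(max(0, x), min(width, x + w)):
                total += like[r][c]
                count += 1
        if count and total / count > best_val:
            best, best_val = (x, y, w, h), total / count
    return best
```

For example, with two equally scored windows and a click inside the first, `refine_click((22, 18), [(10, 10, 20, 20, 1.0), (60, 60, 20, 20, 1.0)], 100, 100)` selects the window containing the click, because the click prior suppresses the likelihood of the distant one.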

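  • Publication date: 2017-2
  • Affiliation: State Key Laboratory of Intelligent Control and Decision of Complex Systems; Beijing Institute of Technology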