Video is increasingly used in various advanced applications. Many of these applications require common video representations oriented towards how people describe video content. In this paper, we first discuss the background of high-level video representations. We then introduce a computational framework for high-level video representation that moves towards how people describe video content. Our framework represents a video shot in terms of its moving objects and their related semantic features, such as events and other high-level motion features. To achieve higher applicability, content should be extracted independently of the type and context of the input video. Our representation system, tested on 6371 images with multi-object occlusion and artifacts, produces stable results in real time. This stability results from adaptation to noise, compensation of estimation errors at the various processing levels, and division of the processing system into simple but effective tasks.