We approach the problem of video temporal segmentation and propose an intensity-based dissolve detection approach able to operate on animated video content. It builds on the hypothesis that, during a dissolve, the number of fading-out and fading-in pixels is significant compared with other visual transitions; we use this quantity as a visual discontinuity function. Instead of simply applying a global threshold to these values, as most existing approaches do, we combine twin thresholding with a shape analysis of the discontinuity measure. This allows us to reduce false detections caused by steep intensity fluctuations, as well as to retrieve dissolves occurring within other visual transitions (e.g., those caused by movement, color effects, etc.). Experimental tests conducted on more than 452 dissolve transitions show that, where classic approaches tend to fail, the proposed method still provides good performance, achieving average precision and recall ratios above 94% and 79.6%, respectively.
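The abstract above does not give the exact discontinuity measure or threshold values, so the following is only a minimal sketch of the general idea: count pixels whose intensity shifts between consecutive frames as the discontinuity curve, then scan it with two thresholds (a low one to open a candidate segment, a high one to confirm it as a dissolve). All function names, parameters, and default values are hypothetical.

```python
import numpy as np

def fading_ratio(prev_frame, next_frame, min_step=2):
    """Fraction of pixels whose grayscale intensity rises (fading in) or
    falls (fading out) by more than min_step between consecutive frames.
    This stands in for the paper's visual discontinuity function."""
    d = next_frame.astype(np.int16) - prev_frame.astype(np.int16)
    return float(np.mean(np.abs(d) > min_step))

def detect_dissolves(frames, t_low=0.3, t_high=0.6, min_len=5):
    """Twin-threshold scan of the discontinuity curve: t_low opens a
    candidate segment; the segment is accepted as a dissolve only if the
    curve peaks above t_high and the segment lasts at least min_len frames
    (a crude proxy for the shape analysis mentioned in the abstract)."""
    curve = [fading_ratio(a, b) for a, b in zip(frames, frames[1:])]
    dissolves, start = [], None
    for i, v in enumerate(curve):
        if start is None:
            if v >= t_low:
                start = i
        elif v < t_low:
            seg = curve[start:i]
            if len(seg) >= min_len and max(seg) >= t_high:
                dissolves.append((start, i))
            start = None
    if start is not None:
        seg = curve[start:]
        if len(seg) >= min_len and max(seg) >= t_high:
            dissolves.append((start, len(curve)))
    return dissolves
```

On a synthetic clip (a static dark scene, a 10-frame linear dissolve, then a static bright scene), this sketch flags exactly the dissolve span, while a single global threshold on raw frame differences would also fire on any isolated intensity spike.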
We propose an audio-visual approach to video genre classification using content descriptors that exploit audio, color, temporal, and contour information. Audio information is extracted at the block level, which has the advantage of capturing local temporal information. At the temporal structure level, we consider action content in relation to human perception. Color perception is quantified using statistics of color distribution, elementary hues, color properties, and relationships between colors. Further, we compute statistics of contour geometry and relationships. The main contribution of our work lies in harnessing the combined descriptive power of these descriptors for genre classification. Validation was carried out on over 91 h of video footage encompassing 7 common video genres, yielding average precision and recall ratios of 87% to 100% and 77% to 100%, respectively, and an overall average correct classification of up to 97%. Also, experimental comparison as part of the MediaEval 2011 benchmarking campaign demonstrated the advantage of the proposed audio-visual descriptors over other existing approaches. Finally, we discuss a 3-D video browsing platform that displays movies using feature-based coordinates and thus groups them by genre.
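The color-perception features above (distribution statistics, elementary hues, color properties) are only named, not specified, so here is a toy sketch of what such a per-frame color descriptor might look like: a normalized histogram over elementary hue bins plus mean saturation and mean brightness. The function name, bin count, and feature layout are all assumptions, not the paper's actual descriptor.

```python
import numpy as np

def color_descriptor(rgb, hue_bins=12):
    """Toy color descriptor: normalized elementary-hue histogram, mean
    saturation, and mean brightness, for an RGB image in [0, 1].
    A hypothetical simplification of the features described above."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    d = mx - mn
    hue = np.zeros_like(mx)
    m = d > 0                       # chromatic pixels only
    rm = m & (mx == r)
    gm = m & (mx == g) & ~rm
    bm = m & ~rm & ~gm
    hue[rm] = ((g - b)[rm] / d[rm]) % 6
    hue[gm] = (b - r)[gm] / d[gm] + 2
    hue[bm] = (r - g)[bm] / d[bm] + 4
    hue *= 60.0                     # degrees in [0, 360)
    hist, _ = np.histogram(hue[m], bins=hue_bins, range=(0.0, 360.0))
    hist = hist / max(hue[m].size, 1)
    sat = np.where(mx > 0, d / np.maximum(mx, 1e-6), 0.0)
    return np.concatenate([hist, [sat.mean()], [mx.mean()]])
```

Descriptors like this, computed per frame and aggregated over a movie, would then be concatenated with the audio, temporal, and contour features and fed to a standard classifier.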
We address the issue of producing automatic video abstracts in the context of the video indexing of animated movies. For quick browsing of a movie's visual content, we propose a storyboard-like summary that follows the movie's events by retaining one key frame for each specific scene. To capture each shot's visual activity, we use histograms of cumulative interframe distances, and key frames are selected according to the distribution of the histogram's modes. For a preview of the movie's exciting action parts, we propose a trailer-like video highlight, whose aim is to show only the most interesting parts of the movie. Our method is based on a relatively standard approach, i.e., highlighting action through the analysis of the movie's rhythm and visual activity information. To suit every type of movie content, including predominantly static movies or movies without exciting parts, the notion of action is made relative to the movie's average rhythm. The efficiency of our approach is confirmed through several end-user studies.
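The key-frame selection step can be illustrated with a minimal sketch: build the cumulative interframe distance curve, histogram its values, and keep one frame per histogram mode (a plateau in the curve corresponds to low visual activity, i.e., a stable scene). This is a hypothetical simplification of the abstract's method; the distance metric, bin count, and mode-selection rule are assumptions.

```python
import numpy as np

def keyframes_from_cumulative_distance(frames, n_modes=3):
    """Pick one representative frame per mode of the histogram of
    cumulative interframe distances (sketch, not the paper's exact
    algorithm). Frames are grayscale NumPy arrays."""
    dists = [np.mean(np.abs(b.astype(np.float32) - a.astype(np.float32)))
             for a, b in zip(frames, frames[1:])]
    cum = np.concatenate([[0.0], np.cumsum(dists)])
    _, edges = np.histogram(cum, bins=n_modes)
    keys = []
    for lo, hi in zip(edges, edges[1:]):
        idx = np.where((cum >= lo) & (cum <= hi))[0]
        if idx.size:
            keys.append(int(idx[idx.size // 2]))  # median frame of the mode
    return keys
```

On a clip made of three static scenes separated by cuts, the curve is a staircase and each step becomes one histogram mode, so the sketch returns the middle frame of each scene.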
In order to improve the link between an operator and the machine, some human-oriented communication systems now use natural modalities such as speech or gesture. The goal of this paper is to present a gesture recognition system based on the fusion of measurements issued from different kinds of sources. Sensors able to capture at least the position and the orientation of the hand are required, such as a DataGlove and a video camera. The DataGlove provides a measure of the hand posture, while the video camera provides a measure of the general arm gesture, which captures the physical and spatial properties of the gesture and is based on a 2D skeleton representation of the arm. The measurements used are partially complementary and partially redundant. The application is distributed over intelligent cooperating sensors. The paper presents the measurement of hand and arm gestures, the fusion processes, and the implementation solution.
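The abstract does not specify the fusion rule, so the following is only a minimal late-fusion sketch under the assumption that each sensor (glove and camera) outputs per-gesture confidence scores: redundant evidence is combined by a weighted sum, and the best-scoring gesture wins. The function, weights, and gesture names are all hypothetical.

```python
def fuse_scores(glove_scores, camera_scores, w_glove=0.6):
    """Hypothetical late-fusion rule for two partially redundant sensors:
    combine per-gesture confidence scores by a weighted sum (a gesture
    seen by only one sensor contributes its score through that sensor
    alone, which handles the complementary case) and return the winner
    together with the fused score table."""
    gestures = set(glove_scores) | set(camera_scores)
    fused = {g: w_glove * glove_scores.get(g, 0.0)
                + (1 - w_glove) * camera_scores.get(g, 0.0)
             for g in gestures}
    return max(fused, key=fused.get), fused
```

For example, if the glove favors "point" from the hand posture and the camera also sees a "point"-like arm trajectory, the redundant evidence reinforces that hypothesis over gestures reported by only one sensor.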