We tested our previously reported sports-highlights playback for personal video recorders with a carefully chosen set of
sports aficionados. Each subject spent about an hour with the content, going through the same basic steps: an
introduction, hands-on use of the system, and a follow-up questionnaire. The main conclusion was that the users
unanimously liked the functionality, even when it made mistakes. Furthermore, the users felt that the functionality
would be greatly enhanced if the user interface were made much more responsive, so that false alarms and misses could
be quickly compensated for. The ability to choose summaries of any desired length turned out to be the main attraction.
The immediate availability of a vast amount of multimedia content has created a growing need for improvements in the field of content analysis and summarization. While researchers have been rapidly making contributions and improvements to the field, we must never forget that content analysis and summarization themselves are not the user's goals. Users' primary interests fall into one of two categories: they want to be entertained, to be informed, or both. Summarization is therefore just another tool for improving the entertainment value or the information-gathering value of the video watching experience. In this paper, we first explore the relationship between the viewer, the interface, and the summarization algorithms. Through an understanding of the user's goals and concerns, we present means for measuring the success of summarization tools. Guidelines for the successful use of summarization in consumer video devices are also discussed.
Severe complexity constraints on consumer electronic devices motivate us to investigate general-purpose video summarization techniques that can apply a common hardware setup to multiple content genres. On the other hand, we know that high-quality summaries can only be produced with domain-specific processing. In this paper, we present a time-series analysis based video summarization technique that provides a general core to which we are able to add small content-specific extensions for each genre. The proposed time-series analysis technique consists of unsupervised clustering of samples taken through sliding windows from the time series of features obtained from the content. We classify content into two broad categories: scripted content such as news and drama, and unscripted content such as sports and surveillance. The summarization problem then reduces to either finding semantic boundaries in the scripted content or detecting highlights in the unscripted content. The proposed technique is essentially an event detection technique and is thus best suited to unscripted content; however, we also find applications to scripted content. We thoroughly examine the trade-off between content-neutral and content-specific processing for effective summarization across a number of genres, and find that our core technique enables us to minimize the complexity of the content-specific processing and to postpone it to the final stage. We achieve the best results with unscripted content such as sports and surveillance video, both in quality of summaries and in minimizing content-specific processing. For other genres such as drama, we find that more content-specific processing is required. We also find that a judicious choice of key audio-visual object detectors enables us to minimize the complexity of the content-specific processing while maintaining its applicability to a broad range of genres. We will present a demonstration of the proposed technique at the conference.
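As a minimal sketch of the core idea, assuming a one-dimensional feature time series (e.g. frame-level motion activity): overlapping windows are clustered, and the minority cluster is taken as the unusual, event-bearing candidate. The window length, hop, and the deterministic two-means seeding below are illustrative choices, not the parameters used in the paper.

```python
import numpy as np

def sliding_windows(series, win, hop):
    """Overlapping windows over a 1-D feature time series."""
    return np.array([series[i:i + win]
                     for i in range(0, len(series) - win + 1, hop)])

def two_means(X, iters=20):
    """Tiny 2-cluster k-means; centers seeded at the lowest- and
    highest-energy windows so the run is deterministic."""
    norms = np.linalg.norm(X, axis=1)
    centers = np.stack([X[norms.argmin()], X[norms.argmax()]]).astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Synthetic "motion activity" stream: quiet except for one burst (the event).
activity = np.zeros(200)
activity[120:140] = 5.0
X = sliding_windows(activity, win=20, hop=5)
labels = two_means(X)
# Windows in the minority cluster are flagged as unusual.
event = labels == np.argmin(np.bincount(labels, minlength=2))
```

With this setup only the windows overlapping the burst land in the minority cluster; in the actual system the flagged windows would then be handed to the small content-specific stage.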
We present a consumer video browsing system that makes multiple alternative summaries available through a simple and effective user interface suitable for consumer electronics platforms. We present a news and talk video segmentation and summary generation technique for this platform. We run face detection on consumer video and classify video segments using simple face features such as face count, size, and x-location. More specifically, we cluster one-face segments using face sizes and x-locations. We observe that different scene types, such as anchorperson, outdoor correspondent, and weather report, form separate clusters. We then apply temporal morphological filtering to the label streams to obtain alternative summary streams for smooth summaries and effective browsing through stories. We also apply our technique to talk show video to generate separate summaries of monologue segments and guest interviews.
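The temporal morphological filtering step can be sketched as 1-D binary morphology on a per-segment label stream: an opening removes isolated false labels and a closing fills short gaps, yielding smooth summary segments. The filter size and the example label stream below are illustrative, not taken from the paper.

```python
import numpy as np

def erode(x, k):
    """A label survives only if all k labels in its window agree."""
    pad = k // 2
    xp = np.pad(x, pad, constant_values=0)
    return np.array([xp[i:i + k].all() for i in range(len(x))], dtype=int)

def dilate(x, k):
    """A label is set if any label in its window is set."""
    pad = k // 2
    xp = np.pad(x, pad, constant_values=0)
    return np.array([xp[i:i + k].any() for i in range(len(x))], dtype=int)

def smooth_labels(x, k=3):
    """Opening (erode, dilate) drops spurious labels; closing (dilate,
    erode) fills short gaps between runs of the same label."""
    opened = dilate(erode(np.asarray(x), k), k)
    return erode(dilate(opened, k), k)

# A noisy per-segment "anchorperson" label stream: a one-segment false
# alarm at index 2 and a one-segment gap at index 9.
labels = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
clean = smooth_labels(labels)
# -> [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

The spike is removed and the gap is bridged, so playback of the "anchorperson" summary stream does not flicker across single misclassified segments.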
We discuss the meaning and significance of the video mining problem, and present our work on some aspects of video mining. A simple definition of video mining is unsupervised discovery of patterns in audio-visual content. Such purely unsupervised discovery is readily applicable to video surveillance as well as to consumer video browsing applications. We interpret video mining as content-adaptive or "blind" content processing, in which the first stage is content characterization and the second stage is event discovery based on the characterization obtained in the first stage. We discuss the target applications and find that purely unsupervised approaches are too computationally complex to be implemented on our product platform. We then describe various combinations of unsupervised and supervised learning techniques that help discover patterns that are useful to the end user of the application. We target consumer video browsing applications such as commercial message detection, sports highlights extraction, etc. We employ both audio and video features. We find that supervised audio classification combined with unsupervised unusual event discovery enables accurate supervised detection of desired events. Our techniques are computationally simple and robust to common variations in production styles and the like.
In our previous work, we described an adaptive fast playback framework for video summarization where we changed the playback rate using the motion activity feature so as to maintain a constant “pace.” This method provides an effective way of skimming through video, especially when the motion is not too complex and the background is mostly still, such as in surveillance video. In this paper, we present an extended summarization framework that, in addition to motion activity, uses semantic cues such as face or skin color appearance, speech and music detection, or other domain-dependent semantically significant events to control the playback rate. The semantic features we use are computationally inexpensive and can be computed in the compressed domain, yet are robust, reliable, and have a wide range of applicability across different content types. The presented framework also allows for adaptive summaries based on preference, for example, to include more dramatic vs. action elements, or vice versa. The user can switch at any time between the skimming and the normal playback modes. The continuity of the video is preserved, and complete omission of segments that may be important to the user is avoided by using adaptive fast playback instead of skipping over long segments. The rule set and the input parameters can be further modified to fit a certain domain or application. Our framework can be used by itself, or as a subsequent presentation stage for a summary produced by any other summarization technique that relies on generating a subset of the content.
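The rate-control rule can be sketched as follows, assuming per-segment motion activity scores and a boolean semantic-cue stream: the playback rate rises as activity falls below the target pace (capped at a maximum), and drops back to normal speed wherever a semantic cue fires. The function name, cap, and example values are illustrative, not the paper's rule set.

```python
import numpy as np

def playback_rates(activity, semantic, target_pace=1.0,
                   r_min=1.0, r_max=8.0, semantic_rate=1.0):
    """Per-segment fast-forward rate: inversely proportional to motion
    activity (constant-pace skimming), clamped to [r_min, r_max], and
    forced back to semantic_rate where a semantic cue (face, speech,
    music, ...) is present."""
    activity = np.asarray(activity, dtype=float)
    rates = np.clip(target_pace / np.maximum(activity, 1e-6), r_min, r_max)
    rates[np.asarray(semantic, dtype=bool)] = semantic_rate
    return rates

rates = playback_rates([0.1, 0.5, 1.0, 0.2], [False, False, True, False])
# -> [8.0, 2.0, 1.0, 5.0]: capped sprint through near-still video,
#    moderate speed-up, semantic hold at normal speed, plain speed-up
```

Because every segment is played (just faster), continuity is preserved and nothing is skipped outright, matching the framework's design goal.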
We present a psychophysical and analytical framework for comparing the performance of different analytical measures of motion activity in video segments with respect to a subjective ground truth. We first construct a test set of video segments and conduct a psychophysical experiment to obtain a ground truth for the motion activity. Then we present several low-complexity motion activity descriptors computed from compressed-domain block motion vectors. In the first analysis, we quantize the descriptors and show that they perform well against the ground truth. We also show that the MPEG-7 motion activity descriptor is among the best. In the second analysis, we find the pairs of video segments for which the human subjects unanimously rate one as higher activity than the other. We then examine the specific cases where each descriptor fails to give the correct ordering. We show that distance from the camera and strong camera motion are the main cases where motion vector based descriptors tend to overestimate or underestimate the intensity of motion activity. We finally discuss the experimental methodology and analysis methods we used, along with possible alternatives. We review the applications of motion activity and how the results presented here relate to those applications.
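The pairwise evaluation in the second analysis can be sketched as a simple ordering-accuracy measure over the unanimously ranked pairs; the scores and pair list below are made-up illustrations, not the study's data.

```python
def ordering_accuracy(descriptor, unanimous_pairs):
    """Fraction of unanimously ranked segment pairs (hi, lo) for which
    the descriptor also scores segment `hi` strictly above segment `lo`."""
    correct = sum(descriptor[hi] > descriptor[lo] for hi, lo in unanimous_pairs)
    return correct / len(unanimous_pairs)

descriptor = {'a': 3.2, 'b': 1.1, 'c': 0.5}   # per-segment activity scores
pairs = [('a', 'b'), ('a', 'c'), ('c', 'b')]  # subjects unanimously rated hi > lo
acc = ordering_accuracy(descriptor, pairs)    # 2/3: the ('c', 'b') pair is misordered
```

The misordered pairs are exactly the failure cases examined above, e.g. segments shot far from the camera whose motion vectors underestimate the perceived activity.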
We present a technique for rapidly generating highlights of sports videos using temporal patterns of motion activity extracted in the compressed domain. The basic hypothesis of this work is that temporal patterns of motion activity are related to the grammar of the sports video. We present experimental verification of this hypothesis. By using very simple rules that depend on the type of sport, we are thus able to provide highlights by skipping over the uninteresting parts of the video and identifying interesting events characterized, for instance, by falling or rising edges in the motion activity. Moreover, the compressed-domain extraction of motion activity intensity is much simpler than color-based summarization calculations. Other compressed-domain features or more complex rules can be used to further improve the accuracy.
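The edge-based event cue can be sketched as threshold crossings of the motion activity stream: a rising edge marks a transition from quiet to busy (e.g. a play starting) and a falling edge the reverse. The threshold and sample stream are illustrative assumptions.

```python
import numpy as np

def activity_edges(activity, threshold):
    """Indices where the motion activity stream crosses the threshold:
    rising edges (quiet -> busy) and falling edges (busy -> quiet)."""
    above = np.asarray(activity) >= threshold
    change = np.diff(above.astype(int))
    rising = np.nonzero(change == 1)[0] + 1
    falling = np.nonzero(change == -1)[0] + 1
    return rising, falling

rising, falling = activity_edges([1, 1, 5, 6, 6, 2, 1, 7, 1], threshold=4)
# rising edges at indices 2 and 7, falling edges at 5 and 8
```

A sport-specific rule then decides which edge type anchors a highlight, e.g. a falling edge after a sustained burst for a completed play.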
We describe a technique for video summarization that uses motion descriptors computed in the compressed domain to speed up conventional color-based video summarization techniques. The basic hypothesis of the work is that the intensity of motion activity of a video segment is a direct indication of its 'summarizability.' We present experimental verification of this hypothesis. We are thus able to quickly identify easy-to-summarize segments of a video sequence, since they have a low intensity of motion activity. Moreover, the compressed-domain extraction of motion activity intensity is much simpler than the color-based calculations. We are able to summarize these segments by simply choosing a key frame at random from each low-activity segment. We then apply conventional color-based summarization techniques to the remaining segments. We thus speed up color-based summarization by reducing the number of segments on which the computationally more expensive color-based computation is needed.
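The routing logic can be sketched as below, assuming per-segment activity scores and a pluggable color-based summarizer; the threshold, segment data, and the `color_summarizer` stand-in are hypothetical.

```python
import random

def summarize(segments, activity, threshold, color_summarizer, seed=0):
    """Route each segment by motion activity: low activity -> any frame is
    assumed representative, so pick one at random (cheap path); high
    activity -> fall back to the expensive color-based summarizer."""
    rng = random.Random(seed)
    keyframes = []
    for frames, act in zip(segments, activity):
        if act < threshold:
            keyframes.append(rng.choice(frames))        # cheap path
        else:
            keyframes.extend(color_summarizer(frames))  # expensive path
    return keyframes

# Hypothetical stand-in for the expensive color-based stage.
color_summarizer = lambda frames: [frames[len(frames) // 2]]
segments = [[0, 1, 2], [10, 11, 12]]   # frame indices per segment
keyframes = summarize(segments, activity=[0.1, 5.0], threshold=1.0,
                      color_summarizer=color_summarizer)
```

The speed-up comes from the branch ratio: the more segments fall below the threshold, the fewer color-based calls are made.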
We present a psycho-visual and analytical framework for automatic measurement of motion activity in video sequences. We construct a test set of video segments by carefully selecting segments from the MPEG-7 video test set. We construct a ground truth based on a subjective test with naive subjects. We find that the subjects agree reasonably well on the motion activity of the video segments, which makes the ground truth reliable. We present a set of automatically extractable motion activity descriptors, both known and novel, based on different hypotheses about the subjective perception of motion activity. We show that all the descriptors perform well against the ground truth. We find that the MPEG-7 motion activity descriptor, based on the variance of motion vector magnitudes, is one of the best in overall performance over the test set.
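The variance-based descriptor can be sketched as follows: the standard deviation of the block motion vector magnitudes is quantized to five intensity levels. The quantization thresholds below are illustrative placeholders, not the normative MPEG-7 quantizer.

```python
import numpy as np

def activity_intensity(motion_vectors, thresholds=(1.0, 2.5, 5.0, 10.0)):
    """MPEG-7-style intensity of motion activity: standard deviation of
    block motion vector magnitudes, quantized to five levels
    (1 = very low ... 5 = very high)."""
    mv = np.asarray(motion_vectors, dtype=float)   # shape (N, 2): (dx, dy)
    magnitudes = np.hypot(mv[:, 0], mv[:, 1])
    sigma = magnitudes.std()
    return int(np.searchsorted(thresholds, sigma)) + 1, sigma

level_still, _ = activity_intensity([[0, 0]] * 16)   # uniform field -> level 1
level_busy, _ = activity_intensity([[0, 0], [12, 0], [0, 12], [12, 12]])
```

A uniform motion field, however fast, scores low because the descriptor responds to the spread of the magnitudes rather than their mean, which matches the perceptual finding that smooth global motion feels less "active" than scattered local motion.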