This paper presents our latest work on analyzing and understanding
the content of learning media such as instructional and training
videos, based on the identification of video frame types. In
particular, we achieve this goal by first partitioning a video
sequence into homogeneous segments where each segment contains
frames of the same image type, such as slide or web-page; we then
categorize the frames within each segment into one of four
classes: slide, web-page, instructor, or picture-in-picture, by analyzing various
visual and text features. Preliminary experiments carried out on
two seminar talks have yielded encouraging results. It is our
belief that by classifying video frames into semantic image
categories, we are able to better understand and annotate the
learning media content and subsequently facilitate its content
access, browsing and retrieval.
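The per-segment frame categorization could be sketched as a nearest-centroid rule over a small feature vector. Everything below (the feature choices, the centroid values, and the toy inputs) is an illustrative assumption, not the paper's actual feature set or classifier:

```python
# Minimal sketch: assign a video frame to one of the four semantic types
# with a nearest-centroid rule over a toy 3-D feature vector
# (text-region ratio, edge density, skin-pixel ratio).
# Centroid values below are hypothetical, chosen only for illustration.

CENTROIDS = {
    "slide":              (0.60, 0.30, 0.02),
    "web-page":           (0.40, 0.50, 0.05),
    "instructor":         (0.05, 0.20, 0.40),
    "picture-in-picture": (0.30, 0.40, 0.25),
}

def classify_frame(features):
    """Return the frame type whose centroid is closest in squared Euclidean distance."""
    def dist2(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(CENTROIDS, key=lambda label: dist2(CENTROIDS[label]))

# A text-heavy, low-skin frame lands in the "slide" class.
print(classify_frame((0.58, 0.28, 0.01)))
```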
A scalable video summarization and navigation system is proposed
in this work. In particular, given the desired number of keyframes
for a video sequence, we first distribute this keyframe budget among
the underlying video scenes and sinks based on their respective importance ranks.
Then, we select the most important shot of each sink as its R-shot
and further assign each sink's designated number of keyframes to
its R-shot. Finally, a time-constrained keyframe extraction scheme
is developed to locate all keyframes. Consequently, we can achieve
a scalable video summary from the initial keyframe set by
exploiting such a video structure-based ranking scheme. In
addition, a content navigation tool is also developed which could
help users freely access or locate specific video scenes or shots.
User studies have shown that this summarization and
navigation system not only helps users quickly browse video
content, but also assists them in searching for particular video content.
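The rank-based distribution of the keyframe budget could be sketched as proportional allocation with largest-remainder rounding. The scene names and importance ranks below are hypothetical:

```python
# Sketch: distribute a total keyframe budget among scenes/sinks in
# proportion to their importance ranks. Largest-remainder rounding
# guarantees the per-scene counts sum exactly to the budget.

def allocate_keyframes(importance, total):
    """importance: dict scene -> positive rank; returns dict scene -> keyframe count."""
    weight_sum = sum(importance.values())
    shares = {k: total * v / weight_sum for k, v in importance.items()}
    counts = {k: int(s) for k, s in shares.items()}
    leftover = total - sum(counts.values())
    # hand remaining keyframes to the largest fractional remainders
    for k in sorted(shares, key=lambda k: shares[k] - counts[k], reverse=True)[:leftover]:
        counts[k] += 1
    return counts

print(allocate_keyframes({"scene1": 5, "scene2": 3, "scene3": 2}, 10))
```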
The problem of identifying speakers for movie content analysis is
addressed in this paper. While most previous work on speaker
identification was carried out in a supervised mode using pure
audio data, more robust results can be obtained in real-time by
integrating knowledge from multiple media sources in an
unsupervised mode. In this work, both audio and visual cues will
be employed and subsequently combined in a probabilistic framework
to identify speakers. Particularly, audio information is used to
identify speakers with a maximum likelihood (ML)-based approach
while visual information is adopted to distinguish speakers by
detecting and recognizing their talking faces based on face
detection/recognition and mouth tracking techniques. Moreover, to
accommodate speakers' acoustic variations over time, we
update their models on the fly by adapting them to newly
contributed speech data. Encouraging results have been achieved
through extensive experiments, which show a promising future for
the proposed audiovisual-based unsupervised speaker identification scheme.
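The probabilistic combination of the two cues could be sketched as a weighted fusion of per-speaker audio log-likelihoods and visual talking-face scores, followed by a maximum-likelihood decision. The fusion weight and all score values below are illustrative assumptions, not the paper's actual framework:

```python
# Sketch: fuse per-speaker audio log-likelihoods with visual
# talking-face scores (treated here as log-probabilities) via a
# weighted sum, then pick the maximum-scoring speaker.

def identify_speaker(audio_loglik, visual_score, w_audio=0.7):
    """Both args: dict speaker -> score. w_audio balances the two cues."""
    fused = {
        spk: w_audio * audio_loglik[spk] + (1 - w_audio) * visual_score[spk]
        for spk in audio_loglik
    }
    return max(fused, key=fused.get)

audio = {"alice": -12.0, "bob": -15.0}   # hypothetical audio log-likelihoods
visual = {"alice": -3.0, "bob": -1.0}    # hypothetical talking-face scores
print(identify_speaker(audio, visual))
```

With the default audio-dominant weight, the audio cue decides; shrinking `w_audio` lets the visual cue override it.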
This research addresses the problem of automatically extracting semantic video scenes from daily movies using multimodal information. A three-stage scene detection scheme is proposed. In the first stage, we use pure visual information to extract a coarse-level scene structure based on generated shot sinks. In the second stage, the audio cue is integrated to further refine the scene detection results by considering various kinds of audio scenarios. Finally, in the third stage, users can directly interact with the system to fine-tune the detection results to their own satisfaction. The generated scene structure provides a compact yet meaningful abstraction of the video data, which clearly facilitates content access. Preliminary experiments on integrating multiple media cues for movie scene extraction have yielded encouraging results.
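The coarse, visual-only first stage could be sketched as merging consecutive shots into a scene while they remain visually similar to the running scene, and opening a new scene when similarity drops. The similarity values and threshold below are illustrative assumptions:

```python
# Sketch of coarse-level scene extraction: shot_sims[i-1] is the visual
# similarity of shot i to the current scene; a drop below the threshold
# starts a new scene. Real systems would compute similarity from shot
# sinks (color histograms, etc.); the numbers here are hypothetical.

def coarse_scenes(shot_sims, threshold=0.5):
    """Returns the scene structure as lists of shot indices."""
    scenes, current = [], [0]
    for i, sim in enumerate(shot_sims, start=1):
        if sim >= threshold:
            current.append(i)       # shot still belongs to the running scene
        else:
            scenes.append(current)  # similarity dropped: close the scene
            current = [i]
    scenes.append(current)
    return scenes

print(coarse_scenes([0.8, 0.7, 0.2, 0.9, 0.4]))
```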
A fundamental task in video analysis is to organize and index multimedia data in a meaningful manner so as to facilitate user access for tasks such as browsing and retrieval. This paper addresses the problem of automatic index generation for movie databases based on audiovisual information. In particular, given a movie, we first extract key movie events, including two-speaker dialog scenes, multiple-speaker dialog scenes and hybrid scenes, by using the proposed window-based sweep algorithm and the K-means clustering algorithm. Following event detection, the identity of each individual speaker in a dialog scene is recognized with a statistical maximum likelihood approach. The identification relies on the likelihood ratio between the incoming speech data and Gaussian mixture models of the speakers and the background. It is evident that the event and speaker identity information will serve as a crucial part of the movie index table. Preliminary experimental results show that, by integrating multiple media cues, we can obtain robust and meaningful event detection and speaker identification results.
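The likelihood-ratio decision could be sketched as follows, using 1-D features and toy Gaussian mixture parameters purely for illustration (real systems model multi-dimensional cepstral feature vectors):

```python
import math

# Sketch: average log-likelihood ratio of incoming speech frames between
# a speaker's Gaussian mixture model and a background model. A positive
# ratio favors the speaker hypothesis. Model parameters are hypothetical.

def gmm_loglik(x, gmm):
    """gmm: list of (weight, mean, variance) components; returns log p(x)."""
    p = sum(
        w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
        for w, m, v in gmm
    )
    return math.log(p)

def llr(frames, speaker_gmm, background_gmm):
    """Average per-frame log-likelihood ratio over the incoming speech."""
    return sum(
        gmm_loglik(x, speaker_gmm) - gmm_loglik(x, background_gmm) for x in frames
    ) / len(frames)

speaker = [(0.6, 0.0, 1.0), (0.4, 2.0, 1.0)]   # toy 2-component speaker GMM
background = [(1.0, 5.0, 4.0)]                  # toy 1-component background model
print(llr([0.1, 1.8, 0.5], speaker, background) > 0)  # frames near the speaker model
```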
An automatic web content classification system is proposed in this research for web information filtering. A sample group of web pages is first collected via commercial search engines. These pages are then classified into different subject groups, and more related web pages can be retrieved for further analysis. This frees users from the troublesome, routine filtering process that most search engines leave to human effort, and the clustered information can be updated automatically at any specified time. Preliminary experimental results demonstrate the effectiveness of the proposed system.
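The subject grouping could be sketched as incremental clustering of pages by keyword overlap. The abstract does not specify the actual clustering method, so the Jaccard-similarity rule, the keyword sets, and the threshold below are all hypothetical:

```python
# Sketch: group collected pages into subject clusters. A page joins the
# first cluster whose accumulated keyword set it overlaps sufficiently
# (Jaccard similarity); otherwise it seeds a new cluster.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_pages(pages, threshold=0.25):
    """pages: dict url -> set of keywords; returns [(keyword_set, urls), ...]."""
    clusters = []
    for url, words in pages.items():
        for keywords, urls in clusters:
            if jaccard(words, keywords) >= threshold:
                urls.append(url)
                keywords |= words  # grow the cluster's keyword set in place
                break
        else:
            clusters.append((set(words), [url]))
    return clusters

pages = {
    "a.html": {"video", "summarization", "keyframe"},
    "b.html": {"video", "keyframe", "shot"},
    "c.html": {"stock", "market", "trading"},
}
print([urls for _, urls in cluster_pages(pages)])
```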
Automatic generation of user profiles, as specified in the MPEG-7 user preference description scheme, for personalized broadcast services is investigated in this work. Our research focuses on the categorization of user-favored video into different semantically meaningful classes. This knowledge is then used to guide media filtering and to select user-preferred AV content. Several visual and motion features are extracted from source video sequences for classification purposes, such as the number of intra-coded macroblocks, macroblock motion information, temporal variances and shot activity histograms. Moreover, to further improve classification accuracy, a 'fuzzy nearest prototype classifier' is applied in this work. Experimental results show that the proposed classification scheme is efficient and accurate.
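A fuzzy nearest-prototype decision could be sketched as computing, for each class, a membership grade inversely related to the distance from the class prototype (with fuzzifier m = 2, as in fuzzy c-means), then picking the class of highest membership. The class names, prototypes, and feature values below are illustrative assumptions:

```python
# Sketch: fuzzy memberships of a feature vector in each class, derived
# from inverse squared distances to class prototypes. Memberships sum
# to 1; the winning class is the one with the largest membership.

def fuzzy_memberships(x, prototypes, m=2.0):
    d2 = {c: sum((a - b) ** 2 for a, b in zip(x, p)) or 1e-12  # guard exact hits
          for c, p in prototypes.items()}
    exponent = 1.0 / (m - 1.0)
    inv = {c: (1.0 / v) ** exponent for c, v in d2.items()}
    total = sum(inv.values())
    return {c: v / total for c, v in inv.items()}

prototypes = {"sports": (0.9, 0.8), "news": (0.2, 0.1), "drama": (0.4, 0.5)}
u = fuzzy_memberships((0.85, 0.75), prototypes)
print(max(u, key=u.get))  # prints "sports"
```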
Detecting and extracting commercial breaks from a TV program is important for achieving efficient video storage and transmission. In this work, we approach this problem by utilizing both visual and audio information. Commercial breaks have several special characteristics, such as a restricted temporal length, a high cut frequency, a high level of action, delimiting black frames and silences, etc., which can be used to separate them from regular TV programs. A feature-based commercial break detection system is thus proposed to fulfill this task. We first perform a coarse-level detection of commercial breaks with pure visual information, since the high activity and high cut frequency manifest themselves in the statistics of some measurable features. In the second step, we refine the detected break boundaries by integrating audio cues: there is always a short period of silence between commercial breaks and the TV program. Two audio features, i.e., the short-time energy and the short-time average zero-crossing rate, are extracted for silence detection. In the last step, we return to the visual domain to achieve frame-wise precision by locating the black frames. Extensive experiments show that by combining both visual and audio information, we can obtain accurate commercial break detection results.
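The silence-detection step could be sketched as thresholding the two named audio features per short-time frame. The window contents and thresholds below are illustrative assumptions, not the paper's tuned values:

```python
# Sketch: flag a short-time audio frame as silence when both its energy
# and its zero-crossing rate fall below (hypothetical) thresholds.

def short_time_energy(frame):
    """Mean squared amplitude of the samples in the frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def is_silence(frame, energy_thresh=1e-3, zcr_thresh=0.1):
    return (short_time_energy(frame) < energy_thresh
            and zero_crossing_rate(frame) < zcr_thresh)

quiet = [0.001, 0.002, 0.001, 0.000, 0.001, 0.001, 0.002, 0.001]
loud = [0.5, -0.4, 0.6, -0.5, 0.4, -0.6, 0.5, -0.4]
print(is_silence(quiet), is_silence(loud))  # prints: True False
```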