Model vector-based retrieval is a novel approach for video indexing that uses a semantic model vector signature that describes the detection of a fixed set of concepts across a lexicon. The model vector basis is created using a set of independent binary classifiers that correspond to the semantic concepts. The model vectors are created by applying the binary detectors to video content and measuring the confidence of detection. Once the model vectors are extracted, simple techniques can be used for searching to find similar matches in a video database. However, since confidence scores alone do not capture information about the reliability of the underlying detectors, techniques are needed to ensure good performance in the presence of varying qualities of detectors. In this
paper, we examine the model vector-based retrieval framework for video and propose methods that use detector validity to improve matching performance. In particular, we develop a model vector distance metric that weights the dimensions by detector validity scores. We empirically evaluate retrieval effectiveness on a large video test collection using different methods of measuring and incorporating detector validity indicators.
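As an illustration of the weighted matching idea, the following sketch computes a validity-weighted Euclidean distance between two model vectors. The function name and the particular weighting scheme are assumptions for illustration, not the paper's exact metric:

```python
import numpy as np

def weighted_model_vector_distance(query, target, validity):
    """Euclidean distance between two model vectors, with each
    concept dimension weighted by its detector validity score.
    (Illustrative weighting; the paper's metric may differ.)"""
    q, t, w = map(np.asarray, (query, target, validity))
    return float(np.sqrt(np.sum(w * (q - t) ** 2)))
```

Setting a dimension's validity weight to zero removes an unreliable detector from the comparison entirely, while intermediate weights down-weight it smoothly.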
Many recent efforts have been made to automatically index multimedia content with the aim of bridging the semantic gap between syntax and semantics. In this paper, we propose a novel framework that uses context to automatically index and understand video. First we discuss the notion of context and how it relates to video understanding. Then we present the framework we are constructing, which is modeled as an expert system that uses a rule-based engine, domain knowledge, visual detectors (for objects and scenes), and the different data sources available with the video (metadata, text from automatic speech recognition, etc.). We also describe our approach to aligning text from speech recognition with video segments, and present experiments using a simple implementation of our framework. Our experiments show that context can be used to improve the performance of visual detectors.
Enabling semantic detection and indexing is an important task in multimedia content management. Learning and classification techniques are increasingly relevant to state-of-the-art content management systems. From relevance feedback to semantic detection, there is a shift in the amount of supervision that precedes retrieval, from lightweight classifiers to heavyweight classifiers. In this paper we compare the performance of several popular classifiers for semantic video indexing. Among other techniques, we compare one technique for generative modeling and one for discriminant learning, and show how they behave depending on the number of examples that the user is willing to provide to the system. We report results using the NIST TREC Video Corpus.
Media analysis for video indexing is witnessing an increasing influence of statistical techniques. Examples of these techniques include the use of generative models as well as discriminant techniques for video structuring, classification, summarization, indexing, and retrieval. Advances in multimedia analysis are related directly to advances in signal processing, computer vision, pattern recognition, multimedia databases, and smart sensors. This paper highlights the statistical techniques in multimedia retrieval, with particular emphasis on semantic characterization.
A model-based approach to video retrieval requires ground-truth data for training the models. This leads to the development of video annotation tools that allow users to annotate each shot in a video sequence and to identify and label scenes, events, and objects by applying labels at the shot level. The annotation tool considered here also allows the user to associate object labels with an individual region in a key-frame image. However, the abundance of video data and the diversity of labels make annotation a difficult and overly expensive task. To combat this problem, we formulate annotation in the framework of supervised training with partially labeled data, viewing it as an exercise in active learning. In this scenario, one first trains a classifier with a small set of labeled data, and subsequently updates the classifier by selecting the most informative, or most uncertain, subset of the available data set. Consequently, labels are automatically propagated to the as-yet-unlabeled data. The purpose of this paper is twofold. The first is to describe a video annotation tool that has been developed for annotating generic video sequences in the context of a recent TREC video benchmarking exercise. The tool is semi-automatic in that it propagates labels to similar shots and requires the user only to confirm or reject them. The second is to show how an active learning strategy can be implemented in this context to further improve the performance of the annotation tool. While many variants of active learning could be considered, we specifically report results of experiments with support vector machine classifiers with polynomial kernels.
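The uncertainty-based selection step can be sketched as follows, assuming the classifier exposes signed decision values such as SVM margins. The helper name is hypothetical; real systems would obtain the decision values from the trained SVM:

```python
import numpy as np

def select_most_uncertain(decision_values, k):
    """Uncertainty sampling: pick the k unlabeled examples whose
    signed classifier outputs lie closest to the decision
    boundary, i.e. with the smallest |f(x)|."""
    idx = np.argsort(np.abs(np.asarray(decision_values)))
    return idx[:k]
```

The selected examples are then labeled by the user and added to the training set, after which the classifier is retrained and the cycle repeats.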
A necessary capability for content-based retrieval is support for the query-by-example paradigm. In the past, there have been several attempts to use low-level features for video retrieval. None of these approaches, however, uses the multimedia information content of the video. We present an algorithm for matching multimodal patterns for content-based video retrieval. The novel ability of our approach to use the information content in multiple media, coupled with a strong emphasis on temporal similarity, differentiates it from the state of the art in content-based retrieval. At the core of the pattern matching scheme is a dynamic programming algorithm, which leads to a significant improvement in performance. By coupling audio with video, this algorithm can also be applied to grouping shots based on audio-visual similarity, which is much more effective for constructing scenes from shots than using visual content alone.
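The dynamic-programming core of such a matching scheme can be sketched as a generic alignment recurrence over a precomputed pairwise dissimilarity matrix between the two sequences. This is an illustrative recurrence, not necessarily the paper's exact formulation:

```python
def dp_sequence_match(cost):
    """Dynamic-programming alignment of two shot sequences given a
    pairwise dissimilarity matrix cost[i][j] (e.g. combined
    audio-visual distances); returns the minimal alignment cost."""
    n, m = len(cost), len(cost[0])
    INF = float('inf')
    D = [[INF] * m for _ in range(n)]
    D[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(D[i - 1][j] if i else INF,          # skip in sequence 1
                       D[i][j - 1] if j else INF,          # skip in sequence 2
                       D[i - 1][j - 1] if i and j else INF)  # match both
            D[i][j] = cost[i][j] + best
    return D[n - 1][m - 1]
```

The recurrence tolerates local temporal misalignment between the two sequences while still rewarding globally consistent ordering.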
Semantic filtering of multimedia content is a challenging problem. The gap that exists between low-level media features and high-level semantics of multimedia is difficult to bridge. We propose a flexible probabilistic graphical framework to bridge this gap to some extent and perform automatic detection of semantic concepts. Using probabilistic multimedia objects (multijects) and a network of such objects (a multinet), we support semantic filtering. By discovering the relationships that exist between semantic concepts, we show how detection performance can be improved. We show that concepts that may not be directly observable in terms of media features can be inferred from their relation to concepts that are already detected. Heterogeneous features can also be fused in the multinet. We demonstrate this by inferring the concept outdoor from the five detected multijects sky, snow, rocks, water, and forestry, together with a frame-level, global-features-based outdoor detector.
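One simple way to infer an unobserved concept from detected supporting concepts is a noisy-OR combination of detector confidences, sketched here as an illustrative stand-in for the multinet inference above. The function name and weights are assumptions, not the paper's model:

```python
def infer_concept(confidences, weights):
    """Noisy-OR fusion: the inferred concept (e.g. 'outdoor')
    fires if any supporting concept (e.g. sky, snow, water)
    fires, with a per-concept reliability weight.
    (Illustrative stand-in for graphical-model inference.)"""
    p_none_fire = 1.0
    for p, w in zip(confidences, weights):
        p_none_fire *= 1.0 - w * p
    return 1.0 - p_none_fire
```

Under this combination, a single highly confident supporting detector is enough to push the inferred concept's probability high, while weak evidence from several detectors accumulates gradually.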
Image classification into meaningful classes is essentially a supervised pattern recognition problem. These classes include indoor, outdoor, landscape, urban, faces, etc. The recognition problem necessitates a large set of labeled examples for training the classifier. Any strategy that reduces the burden of labeling is therefore very important to the deployment of such classifiers in practical applications. In this paper we show that the labeled training set can be augmented with a set of unlabeled examples to boost the performance of the classifier. In general, a set of unlabeled examples is not guaranteed to improve classifier performance. We show that if the actual examples to be labeled are automatically selected through an unsupervised clustering step, performance is more likely to improve with the unlabeled set. We first present a modified EM algorithm that combines labeled and unlabeled sets for training, and then apply this algorithm to image classification. Using mutually exclusive classes, we show that the clustering step is crucial to the improvement in classifier performance.
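The combined use of labeled and unlabeled data in EM can be sketched in a simplified one-dimensional, two-class form: labeled points contribute hard responsibilities, unlabeled points contribute soft ones. The fixed unit variance and the function name are simplifying assumptions for illustration:

```python
import numpy as np

def semi_supervised_em(X_lab, y_lab, X_unl, iters=20):
    """Simplified EM combining labeled and unlabeled 1-D data for
    two classes: class means are re-estimated from labeled points
    (fixed responsibilities) plus unlabeled points (soft
    responsibilities). Unit variance assumed for brevity."""
    mu = np.array([X_lab[y_lab == c].mean() for c in (0, 1)])
    for _ in range(iters):
        # E-step on unlabeled data: soft class responsibilities
        dens = np.exp(-0.5 * (X_unl[:, None] - mu[None, :]) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: hard labels and soft responsibilities combined
        for c in (0, 1):
            num = X_lab[y_lab == c].sum() + (resp[:, c] * X_unl).sum()
            den = (y_lab == c).sum() + resp[:, c].sum()
            mu[c] = num / den
    return mu
```

With only a couple of labeled points per class, the unlabeled points pull the class means toward the true cluster centers, which is the effect the abstract's clustering-based selection aims to guarantee.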
Efficient ways to manage digital video data have assumed enormous importance lately. An integral aspect is the ability to browse, index, and search huge volumes of video data automatically and efficiently. This paper presents a novel scheme for matching video sequences based on low-level features. The scheme supports fast and efficient matching and can search 450,000 frames of video data within 72 seconds on a 400 MHz Pentium II for a 50-frame query. Video sequences are processed in the compressed domain to extract histograms of the DC images in the DCT sequence, and these histograms are used for matching video clips. The bins of the histograms of successive frames are differenced for comparison, which leads to efficient storage and transmission. The histogram representation can be compacted to 4.26 real numbers per frame while achieving high matching accuracy. Multiple temporal-resolution sampling of the videos to be matched is also supported, and any key-frame-based matching scheme thus becomes a particular implementation of this scheme.
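A minimal sketch of histogram-based clip matching, assuming per-frame histograms have already been extracted from the DC images. The L1 frame distance and exhaustive sliding search below are illustrative simplifications of the scheme described above:

```python
import numpy as np

def histogram_l1(h1, h2):
    """L1 distance between two per-frame histograms."""
    return float(np.abs(np.asarray(h1) - np.asarray(h2)).sum())

def match_query(db_hists, query_hists):
    """Slide the query clip over the database frame sequence and
    return the start index minimizing the summed frame-histogram
    distance. (Simplified sketch of compressed-domain matching.)"""
    n, m = len(db_hists), len(query_hists)
    costs = [sum(histogram_l1(db_hists[s + i], query_hists[i])
                 for i in range(m))
             for s in range(n - m + 1)]
    return int(np.argmin(costs))
```

Because the histograms come from tiny DC images rather than full decoded frames, each comparison is cheap, which is what makes searching hundreds of thousands of frames in seconds feasible.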
Tools for efficient and intelligent management of digital content are essential for digital video data management. An extremely challenging research area in this context is that of multimedia analysis and understanding. The capabilities of audio analysis in particular for video data management are yet to be fully exploited. We present a novel scheme for indexing and segmentation of video by analyzing the audio track. This analysis is then applied to the segmentation and indexing of movies. We build models for some interesting events in the motion picture soundtrack. The models built include music, human speech, and silence. We propose the use of hidden Markov models to model the dynamics of the soundtrack and detect audio events. Using these models we segment and index the soundtrack. A practical problem in motion picture soundtracks is that the audio in the track is of a composite nature, corresponding to the mixing of sounds from different sources. Speech in the foreground with music in the background is a common example. The coexistence of multiple individual audio sources forces us to model such composite events explicitly. Experiments reveal that explicit modeling gives better results than modeling individual audio events separately.
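Decoding the most likely audio-event state sequence (e.g. music / speech / silence) from such an HMM can be sketched with the standard Viterbi recurrence. The toy states and probabilities in the usage below are assumptions, not the paper's trained models:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Viterbi decoding: most likely HMM state sequence given
    log initial probabilities log_pi (N,), log transition matrix
    log_A (N, N), log emission matrix log_B (N, K), and a
    sequence of discrete observation symbols obs."""
    T, N = len(obs), len(log_pi)
    delta = log_pi + log_B[:, obs[0]]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A        # (from_state, to_state)
        back[t] = scores.argmax(axis=0)        # best predecessor per state
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Contiguous runs of the decoded state sequence then yield the audio-based segment boundaries used for indexing.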