Model vector-based retrieval is a novel approach to video indexing that uses a semantic model vector signature describing the detection of a fixed set of concepts across a lexicon. The model vector basis is created using a set of independent binary classifiers that correspond to the semantic concepts. The model vectors are created by applying the binary detectors to video content and measuring the confidence of detection. Once the model vectors are extracted, simple techniques can be used to search a video database for similar matches. However, since confidence scores alone do not capture information about the reliability of the underlying detectors, techniques are needed to ensure good performance when detector quality varies. In this paper, we examine the model vector-based retrieval framework for video and propose methods that use detector validity to improve matching performance. In particular, we develop a model vector distance metric that weights the dimensions using detector validity scores. We empirically evaluate retrieval effectiveness on a large video test collection using different methods of measuring and incorporating detector validity indicators.
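The validity-weighted matching idea can be illustrated with a minimal sketch. It assumes a normalized weighted Euclidean distance, which the abstract does not specify; the function names and the example scores are hypothetical and are not the authors' metric or data.

```python
import math


def weighted_distance(query, target, validity):
    """Validity-weighted Euclidean distance between two model vectors.

    Each dimension holds one detector's confidence score. The weight form
    (normalized weighted Euclidean) is an assumption made for illustration.
    """
    if not (len(query) == len(target) == len(validity)):
        raise ValueError("all vectors must have the same length")
    total = sum(validity)
    if total <= 0:
        raise ValueError("validity weights must sum to a positive value")
    squared = sum(w * (q - t) ** 2 for q, t, w in zip(query, target, validity))
    return math.sqrt(squared / total)


def rank(query, database, validity):
    """Return database keys ordered from closest to farthest from the query."""
    return sorted(
        database,
        key=lambda name: weighted_distance(query, database[name], validity),
    )
```

With an unreliable third detector given zero weight, a shot that differs from the query only in that dimension ranks first; with uniform weights the same shot can drop in the ranking.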
A personalized video summary is dynamically generated in our video personalization and summarization system based on user preference and usage environment. The three-tier personalization system adopts the server-middleware-client architecture in order to maintain, select, adapt, and deliver rich media content to the user. The server stores the content sources along with their corresponding MPEG-7 metadata descriptions. In this paper, the metadata includes visual semantic annotations and automatic speech transcriptions. Our personalization and summarization engine in the middleware selects the optimal set of desired video segments by matching shot annotations and sentence transcripts with user preferences. Besides finding the desired contents, the objective is to present a coherent summary. There are diverse methods for creating summaries, and we focus on the challenges of generating a hierarchical video summary based on context information. In our summarization algorithm, three inputs are used to generate the hierarchical video summary output. These inputs are (1) MPEG-7 metadata descriptions of the contents in the server, (2) user preference and usage environment declarations from the user client, and (3) context information including the MPEG-7 controlled term list and classification scheme. In a video sequence, descriptions and relevance scores are assigned to each shot. Based on these shot descriptions, context clustering groups consecutive similar shots so that the groups correspond to hierarchical scene representations. The context clustering is based on the available context information, which may be derived from domain knowledge or rules engines. Finally, the selection of structured video segments for the hierarchical summary balances scene representation against shot selection.
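The context clustering step can be sketched under simple assumptions: each shot carries a set of annotation labels and a relevance score, similarity between consecutive shots is measured with Jaccard overlap, and one shot is kept per scene. The similarity measure, the threshold, and the field names `labels` and `score` are hypothetical choices for illustration, not the system's actual algorithm.

```python
def jaccard(a, b):
    """Jaccard overlap between two label collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_consecutive(shots, threshold=0.5):
    """Group consecutive shots whose labels overlap into scenes.

    A shot joins the current scene when its labels are similar enough to
    the previous shot's labels; otherwise it starts a new scene.
    """
    scenes = []
    for shot in shots:
        if scenes and jaccard(scenes[-1][-1]["labels"], shot["labels"]) >= threshold:
            scenes[-1].append(shot)
        else:
            scenes.append([shot])
    return scenes


def summarize(scenes):
    """Keep the highest-scoring shot of each scene as its representative."""
    return [max(scene, key=lambda s: s["score"]) for scene in scenes]
```

Keeping one representative per scene is only one way to trade scene coverage against shot relevance; a real selection would also account for the duration limits in the usage environment.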
There are many ways of capturing images to represent a detailed scene. Our motivation is to use inexpensive digital cameras with minimal setup requirements and to allow photographers to capture low-resolution overviews and high-resolution details selectively. We present the heterogeneous image pyramid as a non-uniform representation composed of multiple captured images at different resolutions. Each image captures a specific portion of the scene, chosen at the photographer’s discretion, at the desired resolution. These images are highly correlated since they are captured from the same scene. Consequently, they can be registered and represented more compactly in a 3-dimensional spatial image pyramid called the heterogeneous image pyramid.
Many recent efforts have been made to automatically index multimedia content with the aim of bridging the semantic gap between syntax and semantics. In this paper, we propose a novel framework to automatically index video using context for video understanding. First we discuss the notion of context and how it relates to video understanding. Then we present the framework we are constructing, which is modeled as an expert system that uses a rule-based engine, domain knowledge, visual detectors (for objects and scenes), and different data sources available with the video (metadata, text from automatic speech recognition, etc.). We also describe our approach to align text from speech recognition and video segments, and present experiments using a simple implementation of our framework. Our experiments show that context can be used to improve the performance of visual detectors.
A video personalization and summarization system is designed and implemented that incorporates the usage environment to dynamically generate a personalized video summary. The personalization system adopts the three-tier server-middleware-client architecture in order to select, adapt, and deliver rich media content to the user. The server stores the content sources along with their corresponding MPEG-7 metadata descriptions. Our semantic metadata is provided through the use of the VideoAnnEx MPEG-7 Video Annotation Tool. When the user initiates a request for content, the client communicates the MPEG-21 usage environment description along with the user query to the middleware. The middleware is powered by the personalization engine and the content adaptation engine. Our personalization engine includes the VideoSue Summarization on Usage Environment engine, which selects the optimal set of desired contents according to user preferences. Afterwards, the adaptation engine performs the required transformations and compositions of the selected contents for the specific usage environment using our VideoEd Editing and Composition Tool. Finally, two personalization and summarization systems are demonstrated, one for the IBM WebSphere Portal Server and one for pervasive PDA devices.
We have designed and implemented a video semantic summarization system, which includes an MPEG-7 compliant annotation interface, a semantic summarization middleware, a real-time MPEG-1/2 video transcoder on PCs, and an application interface on color/black-and-white Palm OS PDAs. We designed a video annotation tool, VideoAnn, to annotate semantic labels associated with video shots. Videos are first segmented into shots based on their audio-visual characteristics. They are played back using an interactive interface, which facilitates and speeds up the annotation process. Users can annotate the video content in units of temporal shots or spatial regions. The annotated results are stored in the MPEG-7 XML format. We also designed and implemented a video transmission system, Universal Tuner, for wireless video streaming. This system transcodes MPEG-1/2 videos or live TV broadcast videos for black-and-white or indexed-color Palm OS devices. In our system, the complexity of multimedia compression and decompression algorithms is adaptively partitioned between the encoder and decoder. At the client end, users can access the summarized video based on their preferences, time, and keywords, as well as the transmission bandwidth and the remaining battery power of the pervasive devices.
The model-based approach to video retrieval requires ground-truth data for training the models. This has led to the development of video annotation tools that allow users to annotate each shot in the video sequence as well as to identify and label scenes, events, and objects by applying the labels at the shot level. The annotation tool considered here also allows the user to associate object labels with an individual region in a key-frame image. However, the abundance of video data and the diversity of labels make annotation a difficult and overly expensive task. To combat this problem, we formulate the task of annotation in the framework of supervised training with partially labeled data by viewing it as an exercise in active learning. In this scenario, one first trains a classifier with a small set of labeled data, and subsequently updates the classifier by selecting the most informative, or most uncertain, subset of the available data set. Consequently, propagation of labels to as yet unlabeled data is achieved automatically as well. The purpose of this paper is twofold. The first is to describe a video annotation tool that has been developed for annotating generic video sequences in the context of a recent video-TREC benchmarking exercise. The tool is semi-automatic in that it automatically propagates labels to similar shots, after which the user confirms or rejects the propagated labels. The second purpose is to show how an active learning strategy can potentially be implemented in this context to further improve the performance of the annotation tool. While many versions of active learning are conceivable, we specifically report results on experiments with support vector machine classifiers with polynomial kernels.
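The selection step of such an active learning loop can be sketched as uncertainty sampling. The sketch assumes the classifier (for example, an SVM) has already produced signed decision values for the unlabeled shots; training itself is omitted, and the function names and the margin of 1.0 are hypothetical choices for illustration rather than the paper's procedure.

```python
def select_most_uncertain(decision_values, k):
    """Return indices of the k samples closest to the decision boundary.

    A small absolute decision value means the classifier is unsure, so
    these are the shots a user would be asked to label next.
    """
    order = sorted(range(len(decision_values)), key=lambda i: abs(decision_values[i]))
    return order[:k]


def propagate_labels(decision_values, margin=1.0):
    """Propose labels (+1 or -1) for samples confidently outside the margin.

    These proposals correspond to automatically propagated labels that the
    user can then confirm or reject.
    """
    return {
        i: (1 if value > 0 else -1)
        for i, value in enumerate(decision_values)
        if abs(value) >= margin
    }
```

In a full loop, the newly labeled shots would be added to the training set, the classifier retrained, and the two functions applied again to the remaining unlabeled shots.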
We present a novel whiteboard system that uses one or more active input devices. The system is especially suitable for situations in which several users provide input on a large writing surface that also serves as a projection surface. The surface can have a general orientation. We have implemented a system that uses an inexpensive, infrared-emitting stylus and an off-the-shelf videoconferencing camera fitted with an IR-transmitting, visible-blocking filter to capture handwritten strokes. The system uses a calibration method based on projective mapping, thus allowing off-axis camera placement. We also propose a self-calibrating system in which the capture device shares its optical system with the projector. In a system with multiple local users, the question of 'who wrote what' becomes relevant. Therefore, we address the question of attaching identity to stroke information.
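Calibration by projective mapping can be sketched as fitting a 3x3 homography from four camera-to-surface point pairs and then applying it to stylus positions. This is a generic textbook construction, assumed here for illustration; the abstract does not describe the system's actual calibration procedure, and all function names and coordinates are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_homography(camera_pts, surface_pts):
    """Fit the projective map (bottom-right entry fixed to 1) from four point pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(camera_pts, surface_pts):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map a camera pixel to surface coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)
```

Because the mapping is projective rather than affine, the four calibration points may appear as an arbitrary quadrilateral in the camera image, which is what permits off-axis camera placement.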