This paper describes research activities at FX Palo Alto Laboratory (FXPAL) in the area of multimedia browsing,
search, and retrieval. We first consider interfaces for organization and management of personal photo collections.
We then survey our work on interactive video search and retrieval. Throughout we discuss the evolution of both
the research challenges in these areas and our proposed solutions.
Hypervideo is a form of interactive video that allows users to follow links to other video. A simple form of hypervideo, called “detail-on-demand video,” provides at most one link from one segment of video to another, supporting a single-button interaction. Detail-on-demand video is well suited for interactive video summaries, because the user can request a more detailed summary while watching the video. Users interact with the video through a special hypervideo player that displays keyframes with labels indicating when a link is available. While detail-on-demand summaries can be authored manually, doing so is time-consuming. To address this issue, we developed an algorithm to automatically generate multi-level hypervideo summaries. The highest level of the summary consists of the most important clip from each take or scene in the video. At each subsequent level, more clips from each take or scene are added in order of their importance. We give one example in which a hypervideo summary is created for a linear training video. We also show how the algorithm can be modified to produce a hypervideo summary for home video.
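The level construction described above can be sketched as follows. This is a minimal illustration, not FXPAL's published algorithm: the `Clip` structure, the `build_summary_levels` name, and the numeric importance scores are hypothetical stand-ins for the paper's clip-importance measure.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    scene_id: int        # which take or scene the clip belongs to
    importance: float    # higher means more important (hypothetical score)


def build_summary_levels(clips, num_levels):
    """Level 1 keeps the single most important clip per scene;
    each deeper level adds the next-most-important clip per scene."""
    by_scene = {}
    for c in clips:
        by_scene.setdefault(c.scene_id, []).append(c)
    for scene in by_scene.values():
        scene.sort(key=lambda c: c.importance, reverse=True)
    levels = []
    for k in range(1, num_levels + 1):
        # Level k takes the top-k clips from every scene.
        levels.append([c for scene in by_scene.values() for c in scene[:k]])
    return levels
```

A viewer following a "more detail" link would simply move from level k to the corresponding position in level k+1.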
We present a framework, motivated by rate-distortion theory and the human visual system, for optimally representing the real world given limited video resolution. To provide users with high fidelity views, we built a hybrid video camera system that combines a fixed wide-field panoramic camera with a controllable pan/tilt/zoom (PTZ) camera. In our framework, a video frame is viewed as a limited-frequency representation of some "true" image function. Our system combines outputs from both cameras to construct the highest fidelity views possible, and controls the PTZ camera to maximize information gain available from higher spatial frequencies. In operation, each remote viewer is presented with a small panoramic view of the entire scene, and a larger close-up view of a selected region. Users may select a region by marking the panoramic view. The system operates the PTZ camera to best satisfy requests from multiple users. When no regions are selected, the system automatically operates the PTZ camera to minimize predicted video distortion. High-resolution images are cached and sent if a previously recorded region has not changed and the PTZ camera is pointed elsewhere. We present experiments demonstrating that the panoramic image can effectively predict where to gain the most information, and also that the system provides better images to multiple users than conventional camera systems.
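As a rough illustration of automatic PTZ control, the sketch below scores candidate regions of the low-resolution panorama by local gradient energy, a hypothetical proxy for the predicted information gain at higher spatial frequencies; the paper's actual rate-distortion criterion is not reproduced here, and all function names are invented.

```python
import numpy as np


def predicted_gain(panorama, region):
    """Proxy for high-frequency information in a region of the
    low-resolution panorama: mean gradient energy of the patch."""
    y0, y1, x0, x1 = region
    patch = panorama[y0:y1, x0:x1].astype(float)
    gy, gx = np.gradient(patch)
    return float((gx ** 2 + gy ** 2).mean())


def choose_ptz_region(panorama, candidates):
    """Point the PTZ camera at the candidate region whose panoramic
    appearance predicts the largest gain from a close-up view."""
    return max(candidates, key=lambda r: predicted_gain(panorama, r))
```

When users have marked regions, their requests would take priority; this selection rule applies only in the idle, fully automatic case described above.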
A system for detecting and locating user-specified
search strings, or phrases, in lines of imaged text is described. The phrases may be single words or multiple words, and may contain a partially specified word. The imaged text can be composed of a number of different fonts and graphics. Textlines in a deskewed image are hypothesized using multiresolution morphology. For each textline, the baseline, topline and x-height are identified by simple statistical methods and then used to normalize each textline bounding box. Columns of pixels in the resulting bounding box serve as feature vectors. One hidden Markov model is created for each user-specified phrase and another represents all text and graphics other
than the user-specified phrases. Phrases are identified using Viterbi decoding on a spotting network created from the models. The operating point of the system can be varied to trade off the percentage of words correctly spotted and the percentage of false alarms. Results are given using a subset of the UW English Document Image Database I.
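The feature-extraction step can be sketched roughly as follows. Fixed-height resampling here is a simplified stand-in for the baseline/x-height normalization described above, and the function name and target height are hypothetical.

```python
import numpy as np


def column_features(textline, target_height=20):
    """Resample a binary textline image to a fixed height (a crude
    stand-in for baseline/x-height normalization), then treat each
    pixel column as one observation vector for the HMM."""
    h, w = textline.shape
    # Nearest-neighbor row resampling to the normalized height.
    rows = np.arange(target_height) * h // target_height
    normalized = textline[rows, :]
    return [normalized[:, x] for x in range(w)]  # one vector per column
```

Each textline thus becomes a left-to-right sequence of fixed-length vectors, the observation sequence the spotting HMMs are decoded against.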
A system that searches for user-specified phrases in imaged text is described. The search `phrases' can be word fragments, words, or groups of words. The imaged text can be composed of a number of different fonts and can contain graphics. A combination of morphology, simple statistical methods and hidden Markov modeling is used to detect and locate the phrases. The image is deskewed, and then bounding boxes are found for text-lines in the image using multiresolution morphology. Baselines, toplines and the x-height in a text-line are identified using simple statistical methods. The distance between baseline and x-height is used to normalize each hypothesized text-line bounding box, and the columns of pixel values in a normalized bounding box serve as the feature vector for that box. A hidden Markov model is created for each user-specified search string, and another is created to represent all text and graphics other than the search strings. Phrases are identified using Viterbi decoding on a spotting network created from the models. The operating point of the system can be varied to trade off the percentage of words correctly spotted against the percentage of false alarms. Results are given using a subset of the UW English Document Image Database I.
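The Viterbi decoding at the heart of the spotter can be illustrated with a generic two-state example, where one state stands for a keyword model and the other for the background (filler) model. The implementation below is standard textbook Viterbi, not the paper's spotting network, and the probabilities in the usage are invented for illustration.

```python
import numpy as np


def viterbi(log_A, log_B, log_pi):
    """Most likely state path through an HMM.
    log_A:  (S, S) transition log-probabilities
    log_B:  (T, S) per-frame emission log-likelihoods
    log_pi: (S,)   initial-state log-probabilities"""
    T, S = log_B.shape
    delta = log_pi + log_B[0]          # best score ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A   # scores[i, j]: best path i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # Backtrack from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```

Frames decoded into the keyword state mark a detected occurrence of the phrase; scaling the keyword model's scores shifts the operating point between hits and false alarms.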
In this paper, a technique for audio indexing based on speaker identification is proposed. When speakers are known a priori, a speaker index can be created in real time using the Viterbi algorithm to segment the audio into intervals from a single talker. Segmentation is performed using a hidden Markov model network consisting of interconnected speaker sub-networks. Speaker training data is used to initialize a sub-network for each speaker. Sub-networks can also be used to model silence, or non-speech sounds such as a musical theme. When no prior knowledge of the speakers is available, unsupervised segmentation is performed using a non-real-time iterative algorithm. The speaker sub-networks are first initialized, and segmentation is performed by iteratively generating a segmentation using the Viterbi algorithm and retraining the sub-networks on the results of that segmentation. Since the accuracy of the speaker segmentation depends on how well the speaker sub-networks are initialized, agglomerative clustering is used to approximately segment the audio by speaker for initialization of the speaker sub-networks. The distance measure for the agglomerative clustering is a likelihood ratio in which speech segments are characterized by Gaussian distributions. The distance between merged segments is recomputed at each stage of the clustering, and a duration model is used to bias the likelihood ratio. Segmentation accuracy using agglomerative clustering initialization matches accuracy using initialization with speaker-labeled data.
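The agglomerative initialization can be sketched with a one-dimensional generalized-likelihood-ratio distance between Gaussian-modeled segments. This sketch omits the duration bias mentioned above, real systems use multivariate cepstral features rather than scalars, and all names and data are illustrative.

```python
import numpy as np


def glr_distance(x, y):
    """Generalized likelihood-ratio distance: how much log-likelihood
    is lost by modeling two segments with one Gaussian instead of two."""
    z = np.concatenate([x, y])
    eps = 1e-9  # guard against zero variance in tiny segments
    return (len(z) * np.log(z.std() + eps)
            - len(x) * np.log(x.std() + eps)
            - len(y) * np.log(y.std() + eps))


def agglomerate(segments, num_clusters):
    """Repeatedly merge the closest pair of segments, recomputing
    distances against each merged cluster, until num_clusters remain."""
    clusters = [np.asarray(s, float) for s in segments]
    while len(clusters) > num_clusters:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: glr_distance(clusters[p[0]], clusters[p[1]]))
        merged = np.concatenate([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

The resulting clusters supply the per-speaker data used to initialize the HMM sub-networks before the iterative Viterbi segmentation and retraining begins.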