We describe a real-time wide area surveillance system (WA-ACTV) for the automatic tracking of vessels using a network
of PTZ cameras. The system is capable of optimally managing hundreds of PTZ cameras to simultaneously track a large
number of vessels. The tracked vessels are fingerprinted using a scale-invariant part-based representation, and the fingerprints are
subsequently used to reacquire tracks when a vessel comes back into view on a different sensor, thus allowing the system to
extend the tracking range over the wide area of surveillance. We have realized a small-scale version of the system and
demonstrated it in an inland waterway as well as a small section of a sea port. The system operates in real time at a frame
rate of 15 Hz and is easily scalable to hundreds of PTZ cameras. The fingerprint-based reacquisition of targets has been
evaluated to have an accuracy of 91%.
Gunshot recordings have the potential for both tactical detection and forensic evaluation, particularly to ascertain
information about the type of firearm and ammunition used. Perhaps the most significant challenge to such an analysis is
the effect of recording conditions on the audio signature of the recorded data. In this paper, we present a first study of using
an exemplar embedding approach to automatically detect and classify firearm type across different recording conditions.
We demonstrate that a small number of exemplars can span the space of gunshot audio signatures and that this optimal
set can be obtained using a wrapper function. By projecting a given gunshot onto the subspace spanned by the exemplar set,
a distance measure/feature vector is obtained that enables comparisons across recording conditions. We also investigate
the use of a hierarchy of gunshot classifications that improves finer-level classification by pruning out gunshot
labels that are inconsistent with the higher-level type. The embedding-based approach can thus be used both by itself and
as a pruning stage for other search techniques.
Our dataset includes 20 different gun types captured in a number of different conditions. This data acts as our original
exemplar set. The dataset also includes 12 gun types each with multiple shots recorded in the same conditions as the
exemplar set. This second set provides our training and testing sets. We show that we can reduce our exemplar space
from 20 to only 4 uniquely different gunshots without significantly limiting the ability of our embedding approach to
discriminate different gunshots in the training and testing sets. The basic hypothesis of the embedding approach is that
the relationship between the set of exemplars and the space of gunshots, including the testing/training sets, is robust to
a change in recording conditions or the environment. That is to say, the embedding distance between a particular gunshot
and the exemplars tends to remain the same in changing environments. The implications of this are two-fold: first,
unlike in other dimensionality reduction approaches, we have access to particular instances/examples of entities (the
exemplars), which act as bridges to connect different recording conditions. Second, since the embedding distances are
invariant across recording conditions, the embedded vector can be used as a feature of similarity between gunshots
recorded in different conditions.
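The core computation behind such an embedding is small. Below is a minimal sketch in Python with NumPy of how an exemplar embedding of this kind could be realized; the choice of feature vector and the Euclidean distance are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def exemplar_embedding(x, exemplars):
    """Embed a gunshot by its relation to a fixed exemplar set.

    x         : (d,) feature vector of the query gunshot (e.g. a spectral
                feature; the feature choice is an assumption here).
    exemplars : (k, d) matrix, one exemplar feature vector per row.
    Returns the (k,) vector of Euclidean distances to the exemplars, which
    serves as the recording-condition-tolerant feature described above.
    """
    return np.linalg.norm(exemplars - x, axis=1)

def project_to_exemplar_subspace(x, exemplars):
    """Least-squares projection of x onto the span of the exemplar set."""
    coeffs, *_ = np.linalg.lstsq(exemplars.T, x, rcond=None)
    return exemplars.T @ coeffs
```

Because each dimension of the embedded vector is tied to one fixed exemplar, vectors computed from recordings made under different conditions remain directly comparable.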
Unlike other dimensionality reduction approaches, our approach generates descriptions that are always in terms of the
same exemplars. In other approaches such as PCA, the data-driven nature makes it difficult, if not impossible, to establish a
correspondence between the dimensions of one space and another.
We have shown that gunshot classification across different recording conditions can be performed with a reasonable degree
of certainty (60-72%) at a finer level (gunshot to weapon model) and with a high degree of certainty (95-100%) at a
higher degree of abstraction (gunshot to "handgun" or "rifle"). We also investigate the use of simulated recording
conditions and artificial noise to quantitatively evaluate the performance of our approach.
Associating audio events with video events is a challenge for the typical camera-microphone approach when AV signals
must be captured from a large distance: setting up a long-range microphone array and performing geo-calibration of
both audio and video sensors is difficult. In this work, in addition to a geo-calibrated electro-optical camera, we propose
to use a novel optical sensor, a Laser Doppler Vibrometer (LDV), for real-time audio sensing, which allows us to
capture acoustic signals from a large distance and to use the same geo-calibration for both the camera and the audio (via
the LDV). We have promising preliminary results on the association of the audio recording of speech with the video of the speaker.
We apply a unique hierarchical audio classification technique to weapon identification using gunshot analysis. The
audio classifier assigns each audio segment to one of ten weapon classes (e.g., 9mm, .22, shotgun) using low-complexity
Gaussian Mixture Models (GMMs). The first level of the hierarchy consists of classification into broad weapon
categories such as rifle and handgun, and the second consists of classification into specific weapons such as 9mm and .357.
Our experiments have yielded over 90% classification accuracy at the coarse (rifle vs. handgun) level of the
classification hierarchy and over 85% accuracy at the finer level (weapon category such as 9mm).
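The following sketch shows one plausible realization of such a two-level GMM hierarchy using scikit-learn; the class names, feature shapes, and number of mixture components are illustrative assumptions, not the paper's configuration.

```python
from sklearn.mixture import GaussianMixture

# Hypothetical hierarchy: coarse categories mapped to their fine classes.
HIERARCHY = {"handgun": ["9mm", ".357", ".22"],
             "rifle":   ["rifle_a", "rifle_b", "shotgun"]}

def train_gmms(features_by_label, n_components=4):
    """Fit one low-complexity GMM per label; features_by_label maps a
    label to an (n_frames, d) array of audio features (e.g. MFCCs)."""
    return {lab: GaussianMixture(n_components).fit(X)
            for lab, X in features_by_label.items()}

def classify_hierarchical(x, coarse_gmms, fine_gmms):
    """x: (n_frames, d) features of one gunshot segment. Pick the coarse
    category first, then score only that category's fine classes."""
    coarse = max(coarse_gmms, key=lambda c: coarse_gmms[c].score(x))
    fine = max(HIERARCHY[coarse], key=lambda w: fine_gmms[w].score(x))
    return coarse, fine
```

Restricting the fine-level search to the winning coarse category is what allows the hierarchy to prune inconsistent labels before the more difficult fine-grained decision.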
We describe a novel scalable approach for the management of a large number of Pan-Tilt-Zoom (PTZ) cameras
deployed outdoors for persistent tracking of humans and vehicles, without resorting to the large fields of view of
associated static cameras. Our system, Active Collaborative Tracking - Vision (ACT-Vision), is essentially a real-time
operating system that can control hundreds of PTZ cameras to ensure uninterrupted tracking of target objects while
maintaining image quality and coverage of all targets using a minimal number of sensors. The system ensures the
visibility of targets during handoffs between PTZ cameras by using criteria such as distance from the sensor and occlusion.
We present a technique for genre-independent scene-change detection using audio and video features in a discriminative support vector machine (SVM) framework. This work builds on our previous work by adding a video feature based on the MPEG-7 "scalable color" descriptor. Adding this feature improves our detection rate over all genres by 5% to 15% for a fixed false positive rate of 10%. We also find that the genres that benefit the most are those with which the previous audio-only approach was least effective.
In this paper, we present a content-adaptive audio texture based method to segment video into audio scenes. The audio
scene is modeled as a semantically consistent chunk of audio data. Our algorithm is based on "semantic audio texture
analysis." At first, we train GMM models for basic audio classes such as speech, music, etc. Then we define the
semantic audio texture based on those classes. We study and present two types of scene changes, those corresponding to
an overall audio texture change and those corresponding to a special "transition marker" used by the content creator,
such as a short stretch of music in a sitcom or silence in dramatic content. Unlike prior work using genre specific
heuristics, such as some methods presented for detecting commercials, we adaptively find out if such special transition
markers are being used and if so, which of the base classes are being used as markers without any prior knowledge about
the content. Our experimental results show that our proposed audio scene segmentation works well across a wide variety
of broadcast content genres.
We tested our previously reported sports highlights playback for personal video recorders with a carefully chosen set of
sports aficionados. Each subject spent about an hour with the content, going through the same basic steps of
introduction, trying out the system, and a follow-up questionnaire. The main conclusion was that the users unanimously
liked the functionality very much even when it made mistakes. Furthermore, the users felt that if the user interface were
made much more responsive so as to quickly compensate for false alarms and misses, the functionality would be vastly
enhanced. The ability to choose summaries of any desired length turned out to be the main attraction.
Severe complexity constraints on consumer electronic devices motivate us to investigate general-purpose video summarization techniques that are able to apply a common hardware setup to multiple content genres. On the other hand, we know that high-quality summaries can only be produced with domain-specific processing. In this paper, we present a time-series analysis based video summarization technique that provides a general core to which we are able to add small content-specific extensions for each genre. The proposed time-series analysis technique consists of unsupervised clustering of samples taken through sliding windows from the time series of features obtained from the content. We classify content into two broad categories: scripted content such as news and drama, and unscripted content such as sports and surveillance. The summarization problem then reduces to either finding semantic boundaries of the scripted content or detecting highlights in the unscripted content. The proposed technique is essentially an event detection technique and is thus best suited to unscripted content; however, we also find applications to scripted content. We thoroughly examine the trade-off between content-neutral and content-specific processing for effective summarization for a number of genres, and find that our core technique enables us to minimize the complexity of the content-specific processing and to postpone it to the final stage. We achieve the best results with unscripted content such as sports and surveillance video in terms of quality of summaries and minimizing content-specific processing. For other genres such as drama, we find that more content-specific processing is required. We also find that judicious choice of key audio-visual object detectors enables us to minimize the complexity of the content-specific processing while maintaining its applicability to a broad range of genres. We will present a demonstration of our proposed technique at the conference.
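A minimal sketch of the core time-series step, under the assumption that per-frame features have already been extracted; the window length, hop, and cluster count are illustrative parameters rather than the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def window_cluster(features, win=30, hop=10, n_clusters=2):
    """Unsupervised clustering of sliding windows over a feature time series.

    features : (T, d) array of low-level features, one row per frame.
    Each window is flattened into one sample; the most populated cluster is
    treated as the 'usual' background, and windows in the other clusters are
    returned as candidate events (highlights or boundaries).
    """
    starts = np.arange(0, len(features) - win + 1, hop)
    windows = np.stack([features[s:s + win].ravel() for s in starts])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(windows)
    background = np.bincount(labels).argmax()
    return [(int(s), int(s + win)) for s, l in zip(starts, labels)
            if l != background]
```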
The immediate availability of a vast amount of multimedia content has created a growing need for improvements in the field of content analysis and summarization. While researchers have been rapidly making contributions and improvements to the field, we must never forget that content analysis and summarization themselves are not the user's goals. Users' primary interests fall into one of two categories; they normally either want to be entertained or want to be informed (or both). Summarization is therefore just another tool for improving the entertainment value or the information gathering value of the video watching experience. In this paper, we first explore the relationship between the viewer, the interface, and the summarization algorithms. Through an understanding of the user's goals and concerns, we present means for measuring the success of the summarization tools. Guidelines for the successful use of summarization in consumer video devices are also discussed.
We present a consumer video browsing system that enables use of multiple alternative summaries in a simple and effective user interface suitable for consumer electronics platforms. We present a news and talk video segmentation and summary generation technique for this platform. We use face detection on consumer video, and use simple face features such as face count, size, and x-location to classify video segments. More specifically, we cluster 1-face segments using face sizes and x-locations. We observe that different scenes such as anchorperson, outdoor correspondent, weather report, etc. form separate clusters. We then apply temporal morphological filtering on the label streams to obtain alternative summary streams for smooth summaries and effective browsing through stories. We also apply our technique to talk show video to generate separate summaries of monologue segments and guest interviews.
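As a sketch of the temporal morphological filtering mentioned above: treating one cluster's occurrences as a binary stream, morphological opening removes spurious short runs and closing bridges brief gaps. The minimum run length is an assumed parameter.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_opening

def smooth_label_stream(labels, target, min_len=5):
    """labels : 1-D array of per-segment cluster labels; target is the
    cluster of interest (e.g. the anchorperson cluster). Returns a smoothed
    boolean mask usable as one alternative summary stream."""
    mask = np.asarray(labels) == target
    mask = binary_opening(mask, structure=np.ones(min_len))  # drop short runs
    mask = binary_closing(mask, structure=np.ones(min_len))  # fill short gaps
    return mask
```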
We present a systematic framework for arriving at audio classes for detection of crimes in elevators. We use a time series analysis framework to analyze the low-level features extracted from the audio of elevator surveillance content to perform an inlier/outlier based temporal segmentation. Since suspicious events in elevators are outliers in a background of usual events, such a segmentation helps bring out such events without any a priori knowledge. Then, by performing automatic clustering on the detected outliers, we identify consistent patterns for which we can train supervised detectors. We apply the proposed framework to a collection of elevator surveillance audio data to systematically acquire audio classes such as banging, footsteps, non-neutral speech, and normal speech. Based on the observation that the banging and non-neutral speech classes are indicative of suspicious events in the elevator data set, we are able to detect all of the suspicious activities without any misses.
In our past work on sports highlights extraction, we have shown the utility of detecting audience reaction using an audio classification framework. The audio classes in the framework were chosen based on intuition. In this paper, we present a systematic way of identifying the key audio classes for sports highlights extraction using a time series clustering framework. We treat the low-level audio features as a time series and model the highlight segments as "unusual" events in a background of a "usual" process. The set of audio classes to characterize the sports domain is then identified by analyzing the consistent patterns in each of the clusters output from the time series clustering framework. The distribution of features from the training data so obtained for each of the key audio classes is parameterized by a Minimum Description Length Gaussian Mixture Model (MDL-GMM). We also interpret the meaning of each of the mixture components of the MDL-GMM for the key audio class (the "highlight" class) that is correlated with highlight moments. Our results show that the "highlight" class is a mixture of audience cheering and commentator's excited speech. Furthermore, we show that the precision-recall performance for highlights extraction based on this "highlight" class is better than that of our previous approach, which uses only audience cheering as the key highlight class.
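For the MDL-GMM, one common shortcut, shown here as an assumption rather than the paper's exact criterion, is to select the mixture order with BIC, the standard large-sample approximation to a two-part MDL code length:

```python
from sklearn.mixture import GaussianMixture

def fit_mdl_gmm(X, max_components=10):
    """Fit GMMs of increasing order to the (n_samples, d) feature array X
    and keep the one with minimum BIC (used here as an MDL surrogate)."""
    models = [GaussianMixture(k).fit(X) for k in range(1, max_components + 1)]
    return min(models, key=lambda m: m.bic(X))
```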
We describe an MPEG-7 metadata-enhanced audio-visual encoder system that targets DVD recorders. We extract features in the compressed domain from both video and audio, which allows us to add the metadata extraction without altering the hardware architecture of the encoder core. Our feature extraction algorithms are simple, and thus implementable through a simple combination of software and hardware on the integrated DVD chip. The primary application of the metadata is video summarization, which enables rapid browsing of stored video by the end user. The simplicity of our summarization and feature extraction algorithms enables incorporation of the powerful functionality of smart content navigation through content summarization into the DVD recorder at a low cost.
Removing commercials from television programs is a much
sought-after feature for a personal video recorder. In this paper,
we employ an unsupervised clustering scheme (CM_Detect) to detect
commercials in television programs. Each program is first divided
into W-minute chunks, and we extract audio and visual features
from each of these chunks. Next, we apply k-means clustering to
assign a commercial/program label to each chunk. In
contrast to other methods, we do not make any assumptions
regarding the program content. Thus, our method is highly
content-adaptive and computationally inexpensive. Through
empirical studies on various content, including American news,
Japanese news, and sports programs, we demonstrate that our method
is able to filter out most of the commercials without falsely
removing the regular program.
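A minimal sketch of the clustering step of CM_Detect, assuming per-chunk audio-visual feature vectors are already available; labeling the smaller cluster as "commercial" is an illustrative heuristic consistent with commercials occupying a minority of airtime.

```python
import numpy as np
from sklearn.cluster import KMeans

def cm_detect(chunk_features):
    """chunk_features : (n_chunks, d) array, one feature vector per chunk.
    Returns a boolean mask marking the chunks labeled as commercials."""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(chunk_features)
    commercial = np.bincount(labels).argmin()  # minority cluster
    return labels == commercial
```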
We discuss the meaning and significance of the video mining problem, and present our work on some aspects of video mining. A simple definition of video mining is unsupervised discovery of patterns in audio-visual content. Such purely unsupervised discovery is readily applicable to video surveillance as well as to consumer video browsing applications. We interpret video mining as content-adaptive or "blind" content processing, in which the first stage is content characterization and the second stage is event discovery based on the characterization obtained in stage 1. We discuss the target applications and find that purely unsupervised approaches are too computationally complex to be implemented on our product platform. We then describe various combinations of unsupervised and supervised learning techniques that help discover patterns that are useful to the end-user of the application. We target consumer video browsing applications such as commercial message detection and sports highlights extraction. We employ both audio and video features. We find that supervised audio classification combined with unsupervised unusual event discovery enables accurate supervised detection of desired events. Our techniques are computationally simple and robust to common variations in production styles.
In our past work, we have attempted to use a mid-level feature, namely the state population histogram obtained from the Hidden Markov Model (HMM) of a general sound class, for speaker change detection so as to extract semantic boundaries in broadcast news. In this paper, we compare the performance of our previous approach with another approach based on video shot detection and speaker change detection using the Bayesian Information Criterion (BIC). Our experiments show that the latter approach performs significantly better than the former. This motivated us to examine the mid-level feature closely. We found that the component population histogram enabled discovery of broad phonetic categories such as vowels, nasals, and fricatives, regardless of the number of distinct speakers in the test utterance. In order for it to be useful for speaker change detection, the individual components should model the phonetic sounds of each speaker separately. From our experiments, we conclude that state/component population histograms can only be useful for further clustering or semantic class discovery if the features are chosen carefully so that the individual states represent the semantic categories of interest.
In our previous work, we described an adaptive fast playback framework for video summarization where we changed the playback rate using the motion activity feature so as to maintain a constant “pace.” This method provides an effective way of skimming through video, especially when the motion is not too complex and the background is mostly still, such as in surveillance video. In this paper, we present an extended summarization framework that, in addition to motion activity, uses semantic cues such as face or skin color appearance, speech and music detection, or other domain dependent semantically significant events to control the playback rate. The semantic features we use are computationally inexpensive and can be computed in compressed domain, yet are robust, reliable, and have a wide range of applicability across different content types. The presented framework also allows for adaptive summaries based on preference, for example, to include more dramatic vs. action elements, or vice versa. The user can switch at any time between the skimming and the normal playback modes. The continuity of the video is preserved, and complete omission of segments that may be important to the user is avoided by using adaptive fast playback instead of skipping over long segments. The rule-set and the input parameters can be further modified to fit a certain domain or application. Our framework can be used by itself, or as a subsequent presentation stage for a summary produced by any other summarization technique that relies on generating a sub-set of the content.
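The rate-control rule can be sketched as follows; the linear form and the parameter names are illustrative assumptions, not the paper's exact rule-set.

```python
import numpy as np

def playback_rate(activity, semantic_weight, target_pace=1.0,
                  r_min=1.0, r_max=8.0):
    """Per-segment playback speed keeping perceived 'pace' roughly constant.

    activity        : per-segment motion activity (higher = busier scene).
    semantic_weight : per-segment score from face/speech/music detectors;
                      larger values slow playback back toward normal.
    """
    rate = target_pace / np.maximum(activity, 1e-3)  # inverse to activity
    rate = rate / (1.0 + semantic_weight)            # respect semantic cues
    return np.clip(rate, r_min, r_max)
```

Clipping to r_min keeps normal-speed playback available at any time, which preserves the continuity property described above.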
We present a technique for rapidly generating highlights of soccer videos using peaks in audio volume in conjunction with temporal patterns of motion activity extracted in the compressed domain. Our intuition is that any interesting event, such as a goal, in a soccer match leads to an interruption of the game for a non-trivial duration. Furthermore, interesting events are associated with a sharp increase (or peak) in audio volume since the crowd noise goes up in anticipation of the event or as a result of the event. We thus use the temporal patterns of motion activity around each audio peak to detect and capture interesting events. Our preliminary results indicate that the scheme works well for a variety of soccer content from different parts of the world. The computational simplicity of our scheme enables rapid and flexible generation of highlights.
Casey describes a generalized sound recognition framework based on reduced-rank spectra and minimum-entropy priors. This approach enables successful recognition of a wide variety of sounds such as male speech, female speech, music, and animal sounds. In this work, we apply this recognition framework to news video to enable quick video browsing. We identify speaker change positions in the broadcast news using the sound recognition framework. We combine the speaker change positions with color and motion cues from the video and are able to locate the beginning of each of the topics covered by the news video. We can thus skim the video by merely playing a small portion starting from each of the locations where one of the principal cast begins to speak. In combination with our motion-based video browsing approach, our technique provides simple automatic news video browsing. While similar work has been done before, our approach is simpler and faster than competing techniques, and provides a rich framework for further analysis and description of content.
We present a novel low-complexity content-based browsing system for personal video recorders. It provides convenient access to any part of the content with an integrated browser-player that uses unique rapid summarization and indexing with compressed domain color and motion features, as well as audio features. Our summarization is similar in accuracy to other competing techniques and is computationally much simpler.
We present a psychophysical and analytical framework for the comparison of the performance of different analytical measures of motion activity in video segments with respect to a subjective ground truth. We first construct a test set of video segments and conduct a psychophysical experiment to obtain a ground truth for the motion activity. Then we present several low-complexity motion activity descriptors computed from compressed-domain block motion vectors. In the first analysis, we quantize the descriptors and show that they perform well against the ground truth. We also show that the MPEG-7 motion activity descriptor is among the best. In the second analysis, we find the pairs of video segments for which the human subjects unanimously rate one as higher activity than the other. Then we examine the specific cases where each descriptor fails to give the correct ordering. We show that distance from the camera and strong camera motion are the main cases where motion-vector-based descriptors tend to overestimate or underestimate the intensity of motion activity. We finally discuss the experimental methodology and analysis methods we used and possible alternatives. We review the applications of motion activity and how the results presented here relate to those applications.
We describe a technique for reducing the data set for principal cast and other talking head detection in broadcast news content using the spatial attributes of the MPEG-7 motion activity descriptor. The fact that these descriptors are easy to extract from the compressed domain and also work well when used for matching talking head sequences motivated us to utilize them for rapidly pruning the data set for subsequent sophisticated face detection techniques. We are thus able to speed up the process of finding the principal cast in broadcast news content by reducing the number of segments on which computationally more expensive face detection and recognition is employed. We present experimental results for two clustering procedures. The first is based on the distance from the centroid of the ground truth set and is computationally less expensive. The second clustering procedure is based on multiple templates, which are the mean feature vectors of the component Gaussians of a Gaussian Mixture Model (GMM) trained to best fit the training data. We are able to save 50% of the computation, measured in terms of the ratio of rejected shots to total shots, while missing 25% of the talking head shots in the news program. We also observe that the second clustering procedure, while slightly more computationally intensive, allows for higher pruning factors with more accuracy.
We present a technique for rapidly generating highlights of sports videos using temporal patterns of motion activity extracted in the compressed domain. The basic hypothesis of this work is that temporal patterns of motion activity are related to the grammar of the sports video. We present experimental verification of this hypothesis. By using very simple rules depending on the type of sport, we are thus able to provide highlights by skipping over the uninteresting parts of the video and identifying interesting events characterized, for instance, by a falling or rising edge in the activity domain. Moreover, the compressed-domain extraction of motion activity intensity is much simpler than the color-based summarization calculations. Other compressed-domain features or more complex rules can be used to further improve the accuracy.
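Edge detection on the activity series is straightforward; a minimal sketch, with the active/inactive threshold as an assumed parameter:

```python
import numpy as np

def activity_edges(activity, threshold):
    """Return indices of rising and falling edges in a per-frame motion
    activity series; in sports video these often bracket interesting plays."""
    active = np.asarray(activity) > threshold
    step = np.diff(active.astype(int))
    rising = np.flatnonzero(step == 1) + 1
    falling = np.flatnonzero(step == -1) + 1
    return rising, falling
```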
The soon-to-be-released MPEG-7 standard provides a Multimedia Content Description Interface. In other words, it provides a rich set of tools to describe content with a view to facilitating applications such as content-based querying, browsing, and searching of multimedia content. In this paper, we describe practical applications of MPEG-7 tools. We use descriptors of features such as color, shape, and motion to both index and analyze the content. The aforementioned descriptors stem from our previous work and are currently in the draft international MPEG-7 standard. In our previous work, we have shown the efficacy of each of the descriptors individually. In this paper, we show how we combine color and motion to effectively browse video in our first application. In our second application, we show how we can combine shape and color to recognize objects in real time. We will present a demonstration of our system at the conference. We have already successfully demonstrated it to the Japanese press.
The ongoing MPEG-7 standard intends to provide a "Multimedia Content Description Interface." In other words, it will provide a rich set of tools to describe content with a view to facilitating applications such as content-based querying, browsing, and searching of multimedia content. The MPEG-4 standard provides tools for compressing multimedia content at bitrates that are feasible with typical internet connections. Such bitrates fall significantly below those supported by prior standards such as MPEG-1 and MPEG-2. Thus, in this paper, we present a remote video and still image browsing system that uses MPEG-7 for the querying/browsing/searching and MPEG-4 for compressing any transmitted content. We use descriptors of features such as color, shape, and motion to annotate the stored content with MPEG-7-like metadata. The aforementioned descriptors stem from our previous work and are currently in the working draft of the MPEG-7 standard. In our previous work, we have shown the efficacy of each of the descriptors individually. In this paper, we show how we combine some of the features to effectively browse remote video and still image content. Our emphasis is on accurate and quick browsing of the remote content. Our system consists of a video web server with stored MPEG-4 video/still content that is able to support remote requests through a simple browser interface. We have used a combination of CGI-script and servlet/applet-based configurations. We will present a demonstration of our system at the conference. We have already successfully demonstrated it to the Japanese press.
We present a psycho-visual and analytical framework for automatic measurement of motion activity in video sequences. We construct a test set of video segments by carefully selecting video segments from the MPEG-7 video test set. We construct a ground truth based on a subjective test with naive subjects. We find that the subjects agree reasonably on the motion activity of video segments, which makes the ground truth reliable. We present a set of automatically extractable, known and novel, descriptors of motion activity based on different hypotheses about subjective perception of motion activity. We show that all the descriptors perform well against the ground truth. We find that the MPEG-7 motion activity descriptor, based on the variance of motion vector magnitudes, is one of the best in overall performance over the test set.
We describe a technique for video summarization that uses motion descriptors computed in the compressed domain to speed up conventional color-based video summarization techniques. The basic hypothesis of the work is that the intensity of motion activity of a video segment is a direct indication of its 'summarizability.' We present experimental verification of this hypothesis. We are thus able to quickly identify easy-to-summarize segments of a video sequence, since they have a low intensity of motion activity. Moreover, the compressed-domain extraction of motion activity intensity is much simpler than the color-based calculations. We are able to easily summarize these segments by simply choosing a key frame at random from each low-activity segment. We can then apply conventional color-based summarization techniques to the remaining segments. We are thus able to speed up color-based summarization techniques by reducing the number of segments on which computationally more expensive color-based computation is needed.
In this paper, we present a fade detection technique for indexing of MPEG-2 and MPEG-4 compressed video sequences. We declare a fade-in if the number of positive residual dc coefficients in P frames exceeds a certain percentage of the total number of non-zero dc coefficients consistently over several consecutive frames. Our fade-detection technique has fair accuracy and the advantage of high simplicity since it uses only entropy decoding and does not use computationally expensive inverse DCTs.
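A sketch of the decision rule, assuming the residual DC coefficients of each P frame have already been entropy-decoded; the ratio and run-length thresholds are illustrative values, not the paper's tuned parameters.

```python
import numpy as np

def detect_fade_in(dc_residuals, ratio=0.6, run=5):
    """dc_residuals : iterable of 1-D arrays, one per P frame, holding that
    frame's residual DC coefficients. Declare a fade-in when the fraction of
    positive coefficients among the non-zero ones exceeds `ratio` for `run`
    consecutive frames."""
    hits = 0
    for frame in dc_residuals:
        nz = frame[frame != 0]
        frac = float((nz > 0).mean()) if nz.size else 0.0
        hits = hits + 1 if frac > ratio else 0
        if hits >= run:
            return True
    return False
```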
In this paper, we present a new descriptor for the spatial distribution of motion activity in video sequences. We use the magnitude of the motion vectors as a measure of the intensity of motion activity in a macro-block. We construct a matrix Cmv consisting of the magnitudes of the motion vector for each macro-block of a given P frame. We compute the average magnitude of the motion vector per macro-block, Cavg, and then use Cavg as a threshold on the matrix Cmv by setting the elements of Cmv that are less than Cavg to zero. We classify the runs of zeros into three categories based on length, and count the number of runs of each category in the matrix Cmv. Our activity descriptor for a frame thus consists of four parameters, viz. the average magnitude of the motion vectors and the numbers of runs of short, medium, and long length. Since the feature extraction is in the compressed domain and simple, it is extremely fast. We have tested it on the MPEG-7 test content set, which consists of approximately 14 hours of MPEG-1 encoded video content of different kinds. We find that our descriptor enables fast and accurate indexing of video. It is robust to noise and to changes in encoding parameters such as frame size, frame rate, encoding bit rate, and encoding format. It is a low-level non-semantic descriptor that gives semantic matches within the same program, and is thus very suitable for applications such as video program browsing. We also find that indirect and computationally simpler measures of the magnitude of the motion vectors, such as the bits taken to encode the motion vectors, though less effective, can also be used in our run-length framework.
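A sketch of the descriptor computation; counting zero-runs row by row and the short/medium run-length cut-offs are illustrative assumptions.

```python
import numpy as np

def run_length_activity_descriptor(mv_mag, short=2, medium=5):
    """mv_mag : 2-D matrix Cmv of motion vector magnitudes per macro-block
    of a P frame. Returns (Cavg, n_short, n_medium, n_long): the mean
    magnitude and the counts of zero-runs by length after thresholding."""
    C = np.asarray(mv_mag, dtype=float)
    c_avg = C.mean()
    C = np.where(C < c_avg, 0.0, C)       # zero out below-average blocks
    counts = [0, 0, 0]                    # short, medium, long zero-runs
    for row in C:
        run = 0
        for v in np.append(row, 1.0):     # sentinel closes a trailing run
            if v == 0.0:
                run += 1
            elif run:
                counts[0 if run <= short else 1 if run <= medium else 2] += 1
                run = 0
    return (c_avg, *counts)
```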
In this paper, we present a new, computationally efficient, effective technique for detection of abrupt scene changes in MPEG-4/2 compressed video sequences. We combine the dc image-based approach of Yeo with the bit allocation-based approach of Feng, Lo, and Mehrpour. The bit allocation-based approach has the advantage of computational simplicity, since it only requires entropy decoding of the sequence. Since extraction of dc images from I-Frames/Objects is simple, the dc image-based technique of Yeo is a good alternative for comparison of I-Frames/Objects. For P-Frames/Objects, however, Yeo's algorithm requires additional computation. We find that the bit allocation-change based approach is prone to false detection when comparing intra-coded objects in MPEG-4 sequences. However, if a suspected scene/object change has been located accurately in a group of consecutive frames/objects, the bit allocation-based technique quickly and accurately locates the cut point therein. This motivates us to use dc image-based detection between successive I-Frames/Objects to identify the subsequences with scene/object changes, and then use bit allocation-based detection to find the cut point therein. Our technique thus has only a marginally greater complexity than the completely bit allocation-based technique, but has greater accuracy. It is applicable to both MPEG-2 sequences and MPEG-4 multiple-object sequences. In the MPEG-4 multiple-object case, we use a weighted sum of the change in each object of the frame, using the area of the object as the weight.
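The two-stage logic can be sketched as follows, assuming both the I-frame DC-image distances and the per-frame bit-allocation changes are available; the threshold and the GOP bookkeeping are illustrative, not the paper's exact procedure.

```python
import numpy as np

def locate_cut(dc_image_dist, bit_change, gop_ranges, coarse_thresh):
    """dc_image_dist : per-GOP distance between successive I-frame DC images.
    bit_change      : per-frame change in bit allocation (entropy decode only).
    gop_ranges      : list of (start, end) frame indices for each GOP.
    Stage 1 flags the GOP whose I-frame pair differs most; stage 2 pinpoints
    the cut at the largest bit-allocation jump inside that GOP."""
    gop = int(np.argmax(dc_image_dist))
    if dc_image_dist[gop] < coarse_thresh:
        return None                       # no suspected scene change
    start, end = gop_ranges[gop]
    return start + int(np.argmax(bit_change[start:end]))
```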