This PDF file contains the front matter associated with SPIE Proceedings Volume 9089 including the Title Page, Copyright information, Table of Contents, Introduction to Part A, and Conference Committees listing.
Automated event recognition in video data has numerous practical applications. The ability to recognize events in practice depends on accurate tracking of objects in the video data. Scene complexity has a large effect on tracker
performance. Background models can address this problem by providing a good estimate of the image region surrounding the object of interest. However, the utility of the background model depends on accurately representing
current imaging conditions. Changing imaging conditions, such as lighting and weather, render the background model
inaccurate, degrading the tracker performance. As a preprocessing step, developing a set of robust background models
can substantially improve system performance. We present an approach to robustly modeling the background as a
function of the data acquisition conditions. We will describe the formulation of these models and discuss model
selection in the context of real-time processing. Using results from a recent experiment, we demonstrate empirically the
performance benefits of robust background modeling.
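The abstract does not give the model formulation, but the core idea of a background estimate that adapts to changing imaging conditions can be sketched with a simple per-pixel exponential running average (the `alpha` and `threshold` values below are illustrative assumptions, not the authors' parameters):

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Exponential running-average update: B <- (1 - alpha) * B + alpha * F.
    Small alpha adapts slowly to lighting/weather drift; large alpha forgets
    the old background quickly."""
    return (1.0 - alpha) * background + alpha * frame

def foreground_mask(background, frame, threshold=25.0):
    """Flag pixels that differ from the background estimate by more than
    threshold; these are candidate object pixels for the tracker."""
    return np.abs(frame - background) > threshold

# Toy example: an empty background with one bright "object" pixel appearing.
bg = np.zeros((4, 4))
frame = np.zeros((4, 4))
frame[1, 1] = 200.0
mask = foreground_mask(bg, frame)
bg = update_background(bg, frame)
```

A real system, as the abstract suggests, would maintain a set of such models indexed by acquisition conditions and select among them at runtime.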
Traditional motion-based trackers often fail in maritime environments due to a lack of image features to help stabilize video. In this paper, we describe a computationally efficient approach which automatically detects, tracks and classifies different objects within aerial full motion video (FMV) sequences in the maritime domain. A multi-layered saliency detector is utilized to first remove any image regions likely belonging to background categories (i.e., calm water), followed by progressively pruning out distractor categories such as wake, debris, and reflection. This pruning stage combines features generated at the level of each individual pixel with 2D descriptors formulated around the outputs of prior stages grouped into connected components. Additional false positive reduction is performed by aggregating detector outputs across multiple frames, by formulating object tracks from these detections and, lastly, by classifying the resultant tracks using machine learning techniques. As a by-product, our system also produces image descriptors specific to each individual object, which are useful in later pipeline elements for appearance-based indexing and matching.
We study an efficient texture segmentation model for multichannel videos using a local feature fitting based active contour scheme. We propose a flexible motion segmentation approach using fused features computed from texture and intensity components in a globally convex continuous optimization and fusion framework. A fast numerical implementation is demonstrated using an efficient dual minimization formulation. The novel contributions include the fusion of local feature density functions including luminance-chromaticity and local texture in a globally convex active contour variational method, combined with label propagation in scale space using noisy sparse object labels initialized from long term optical flow-based point trajectories. We provide a proof-of-concept demonstration of this novel multi-scale label propagation approach to video object segmentation using synthetic textured video objects embedded in a noisy background and starting with sparse label set trajectories for each object.
Through its ability to create situation awareness, multi-target tracking is an extremely important capability for almost any kind of surveillance and tracking system. Many approaches have been proposed to address its inherent challenges. However, the majority of these approaches make two assumptions: that the probability of detection and the clutter rate are constant. In practice, neither is likely to be true. For example, as a target moves farther from the sensor, its projected size becomes smaller and the probability of detection declines. When target detection is carried out using templates, the clutter rate will depend on how much the environment resembles the current target of interest.
In this paper, we begin to investigate the impacts of these effects. Using a simulation environment inspired by the challenges of Wide Area Surveillance (WAS), we develop a state-dependent formulation for the probability of detection and the clutter rate. The impacts of these models are compared in a simulated urban environment populated by multiple vehicles and subject to occlusions. The results show that by accurately modelling the effects of occlusion and degradation in detection, significant improvements in performance can be obtained.
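As a minimal illustration of what a state-dependent formulation might look like (the functional forms and constants below are hypothetical, not the paper's models), the probability of detection can be made to decay with range as the projected target size shrinks, and the clutter rate can grow with scene-template similarity:

```python
def detection_probability(target_range, p0=0.98, half_range=500.0):
    """Hypothetical range-dependent Pd: rather than a constant, Pd decays
    smoothly as the target's projected size shrinks with distance.
    p0 and half_range are illustrative constants."""
    return p0 / (1.0 + (target_range / half_range) ** 2)

def clutter_rate(similarity, base_rate=2.0):
    """Hypothetical template-matching clutter model: the expected number of
    false detections grows with how much the environment resembles the
    current target template (similarity in [0, 1])."""
    return base_rate * (1.0 + 4.0 * similarity)
```

A state-dependent tracker would evaluate these per target state at each update rather than fixing them at design time.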
Image change detection has long been used to detect significant events in aerial imagery, such as the arrival or departure
of vehicles. Usually only the underlying structural changes are of interest, particularly for movable objects, and the
challenge is to differentiate the changes of intelligence value (change detections) from incidental appearance changes (false
detections). However, existing methods for automated change detection continue to be challenged by nuisance variations in
operating conditions such as sensor (camera exposure, camera viewpoints), targets (occlusions, type), and the environment
(illumination, shadows, weather, seasons). To overcome these problems, we propose a novel vehicle change detection
method based on the detection response maps (DRM). The detector serves as an advanced filter that normalizes the images
being compared specifically for object level change detection (OLCD). In contrast to current methods that compare pixel
intensities, the proposed DRM-OLCD method is more robust to nuisance changes and variations in image appearance. We
demonstrate object-level change detection for vehicles appearing and disappearing in electro-optical (EO) visual imagery.
Optimal full motion video (FMV) registration is a crucial need for the Geospatial community. It is required for
subsequent and optimal geopositioning with simultaneous and reliable accuracy prediction. An overall approach being
developed for such registration is presented that models relevant error sources in terms of the expected magnitude and
correlation of sensor errors. The corresponding estimator is selected based on the level of accuracy of the a priori
information of the sensor’s trajectory and attitude (pointing) information, in order to best deal with non-linearity effects.
Estimator choices include near real-time Kalman Filters and batch Weighted Least Squares. Registration solves for
corrections to the sensor a priori information for each frame. It also computes and makes available a posteriori
accuracy information, i.e., the expected magnitude and correlation of sensor registration errors. Both the registered
sensor data and its a posteriori accuracy information are then made available to “down-stream” Multi-Image
Geopositioning (MIG) processes. An object of interest is then measured on the registered frames and a multi-image
optimal solution, including reliable predicted solution accuracy, is then performed for the object’s 3D coordinates. This
paper also describes a robust approach to registration when a priori information of sensor attitude is unavailable. It
makes use of structure-from-motion principles, but does not use standard Computer Vision techniques, such as
estimation of the Essential Matrix which can be very sensitive to noise. The approach used instead is a novel, robust,
direct search-based technique.
In this paper we demonstrate a technique for extracting 3-dimensional data from 2-dimensional GPS-tagged video. We call our method Minimum Separation Vector Mapping (MSVM), and we verify its performance versus traditional Structure From Motion (SFM) techniques in the field of GPS-tagged aerial imagery, including GPS-tagged full motion video (FMV). We explain how MSVM is better posed to natively exploit the a priori content of GPS tags when compared to SFM. We show that given GPS-tagged images and moderately well known intrinsic camera parameters, our MSVM technique consistently outperforms traditional SFM implementations under a variety of conditions.
Future surveillance systems will work in complex and cluttered environments which require systems engineering
solutions for applications such as airport ground surface management. In this paper, we highlight the use of an L1
video tracker for monitoring activities at an airport. We present methods of information fusion, entity detection, and
activity analysis using airport videos for runway detection and airport terminal events. For coordinated airport security,
automated ground surveillance enhances efficient and safe maneuvers for aircraft, unmanned air vehicles (UAVs) and
unmanned ground vehicles (UGVs) operating within airport environments.
In the last decade, there have been numerous developments in wide-area motion imagery (WAMI), from sensor design to data exploitation. In this paper, we summarize the published literature on WAMI results in an effort to organize the techniques, discuss the developments, and determine the state of the art. Using the organization of developments, we see the variations in approaches and their relations to the data sets available. The literature summary provides an anthology of many of the developers in the last decade and their associated techniques. In our use case, we showcase current methods and products that enable future WAMI exploitation developments.
Existing nadir-viewing aerial image databases such as that available on Google Earth contain data from a variety of sources at varying spatial resolutions. Low-cost, low-altitude, high-resolution aerial systems such as unmanned aerial vehicles and balloon-borne systems can provide ancillary data sets providing higher resolution, oblique looking data to enhance the data available to the user. This imagery is difficult to georeference due to the different projective geometry present in these data. Even if this data is accompanied by metadata from global positioning system (GPS) and inertial measurement unit (IMU) sensors, the accuracy obtained from low-cost versions of these sensors is limited. Combining automatic image registration techniques with the information provided by the IMU and onboard GPS, it is possible to improve the positioning accuracy of these oblique data
sets on the ground plane using existing orthorectified imagery available from sources such as Google Earth. Using both the affine scale-invariant feature transform (ASIFT) and maximally stable extremal regions (MSER), feature detectors aid in automatically detecting correspondences between the obliquely collected images and the base map. These correspondences are used to georeference the high-resolution, oblique image data collected from these low-cost aerial platforms providing the user with an enhanced visualization experience.
With recent advances in technologies, reconstructions of three-dimensional (3D) point clouds from multi-view aerial imagery are readily obtainable. However, the fidelity of these point clouds has not been well studied, and voids often exist within the point cloud. Voids in the point cloud are present in texturally flat areas that failed to generate features during the initial stages of reconstruction, as well as areas where multiple views were not obtained during collection or where a constant occlusion existed due to collection angles or overlapping scenery. A method is presented for identifying the type of void present using a voxel-based approach to partition the
3D space. By using collection geometry and information derived from the point cloud, it is possible to detect
unsampled voxels such that voids can be identified. A similar line-of-sight analysis can then be used to pinpoint locations at aircraft altitude at which the voids in the point clouds could theoretically be imaged, such that the new images can be included in the 3D reconstruction, with the goal of reducing the voids in the point cloud that are a result of lack of coverage. This method has been tested on high-frame-rate oblique aerial imagery captured over Rochester, NY.
Advanced wide area persistent surveillance (WAPS) sensor systems on manned or unmanned airborne vehicles are
essential for wide-area urban security monitoring in order to protect our people and our warfighter from terrorist attacks.
Currently, human (imagery) analysts process huge data collections from full motion video (FMV) for data exploitation
and analysis (real-time and forensic), providing slow and inaccurate results. An Automated Data Exploitation System
(ADES) is urgently needed. In this paper, we present a recently developed ADES for airborne vehicles under heavy
urban background clutter conditions. This system includes four processes: (1) fast image registration, stabilization, and
mosaicking; (2) advanced non-linear morphological moving target detection; (3) robust multiple target (vehicles,
dismounts, and humans) tracking (up to 100 target tracks); and (4) moving or static target/object recognition (super-resolution).
Test results with real FMV data indicate that our ADES can reliably detect, track, and recognize multiple
vehicles under heavy urban background clutters. Furthermore, our example shows that ADES as a baseline platform can
provide capability for vehicle abnormal behavior detection to help imagery analysts quickly trace down potential threats.
This paper deals with deblurring of aerial imagery and develops a methodology for blind restoration of spatially varying blur induced by camera motion caused by instabilities of the moving platform. This is a topic of significant relevance with a potential impact on image analysis, characterization and exploitation. A sharp image is beneficial not only from the perspective of visual appeal but also because it forms the basis for applications such as moving object tracking, change detection, and robust feature extraction. In the presence of general camera motion, the apparent motion of scene points in the image will vary at different locations resulting in space-variant blurring. However, due to the large distances involved in aerial imaging, we show that the blurred image of the ground plane can be expressed as a weighted average of geometrically warped instances of the original focused but unknown image. The weight corresponding to each warp denotes the fraction of the total exposure duration the camera spent in that pose. Given a single motion blurred aerial observation, we propose a scheme to estimate the original focused image affected by arbitrarily-shaped blur kernels. The latent image and its associated warps are estimated by optimizing suitably derived cost functions with judiciously chosen priors within an alternating minimization framework. Several results are given on the challenging VIRAT aerial dataset for validation.
In recent years, digital cameras have been widely used for image capturing. These devices are equipped in cell phones, laptops, tablets, webcams, etc. Image quality is an important component of digital image analysis. To assess image quality for these mobile products, a standard image is normally required as a reference image, in which case Root Mean Square Error and Peak Signal to Noise Ratio can be used to measure the quality of the images. However, these methods are not possible if there is no reference image. In our approach, a discrete wavelet transformation is applied to the blurred image, decomposing it into the approximation image and three detail sub-images, namely the horizontal, vertical, and diagonal images. We then focus on measuring noise in the detail images and blur in the approximation image to assess the image quality, computing a noise mean and noise ratio from the detail images, and a blur mean and blur ratio from the approximation image. The Multi-scale Blur Detection (MBD) metric provides an assessment of both the noise and blur content. These values are weighted based on a linear regression against full-reference quality values. From these statistics, we can compare against the statistics of normal, useful images to judge image quality without needing a reference image. We then test the validity of the obtained weights by R2 analysis, as well as by using them to estimate the quality of an image with a known quality measure. The results show that our method provides acceptable results for images containing low to mid noise levels and blur content.
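The decomposition step can be sketched with a one-level 2-D Haar transform, the simplest discrete wavelet, splitting an image into an approximation band and three detail sub-bands; the statistics below (mean absolute detail energy for noise, approximation-band gradient strength for blur) are illustrative stand-ins for the paper's MBD measures:

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar transform: returns the approximation band LL and
    the horizontal (LH), vertical (HL), and diagonal (HH) detail bands."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row-pair averages
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row-pair differences
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0
    HL = (a[:, 0::2] - a[:, 1::2]) / 2.0
    LH = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return LL, (LH, HL, HH)

def noise_blur_stats(img):
    """Hypothetical no-reference statistics: noise estimated from detail
    sub-band energy, blur proxied by edge strength in the approximation."""
    LL, (LH, HL, HH) = haar_dwt2(np.asarray(img, dtype=float))
    noise_mean = np.mean(np.abs(LH)) + np.mean(np.abs(HL)) + np.mean(np.abs(HH))
    blur_mean = np.mean(np.abs(np.gradient(LL)[0]))
    return noise_mean, blur_mean
```

A perfectly flat image yields zero for both statistics, while added noise raises the detail-band term.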
Wide-Area Motion Imagery (WAMI) feature extraction is important for applications such as target tracking, traffic management
and accident discovery. With the increasing amount of WAMI collections and feature extraction from the data,
a scalable framework is needed to handle the large amount of information. Cloud computing is one of the approaches
recently applied in large scale or big data. In this paper, MapReduce in Hadoop is investigated for large scale feature
extraction tasks for WAMI. Specifically, a large dataset of WAMI images is divided into several splits. Each split has a
small subset of WAMI images. The feature extractions of WAMI images in each split are distributed to slave nodes in the
Hadoop system. Feature extraction of each image is performed individually in the assigned slave node. Finally, the feature
extraction results are sent to the Hadoop Distributed File System (HDFS) to aggregate the feature information over the collected imagery.
Experiments of feature extraction with and without MapReduce are conducted to illustrate the effectiveness of our
proposed Cloud-Enabled WAMI Exploitation (CAWE) approach.
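The split-and-distribute pattern described above can be sketched in the Hadoop streaming style, where a mapper processes the images named in its split independently and a reducer aggregates the emitted records (the `extract_features` hook and the tab-separated record format are illustrative assumptions, not the paper's implementation):

```python
def extract_features(image_id):
    """Placeholder per-image feature extractor; a real slave node would load
    the WAMI image and compute its descriptors here."""
    return len(image_id)

def mapper(lines):
    """Mapper for one split: each input line names one WAMI image; emit one
    key<TAB>value record per image, computed independently."""
    for line in lines:
        image_id = line.strip()
        if not image_id:
            continue
        yield f"{image_id}\t{extract_features(image_id)}"

def reducer(records):
    """Reducer: aggregate the per-image feature records; in Hadoop the
    aggregated result would be written back to HDFS."""
    total = 0
    for rec in records:
        _, value = rec.split("\t")
        total += int(value)
    yield f"total_feature_count\t{total}"
```

Because each image is processed independently, the mapper parallelizes trivially across slave nodes, which is what makes the workload a good fit for MapReduce.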
Detecting anomalies in non-stationary signals has valuable applications in many fields, including medicine and meteorology, such as identifying possible heart conditions from electrocardiography (ECG) signals or predicting earthquakes via seismographic data. Given the many choices of anomaly detection algorithms, it is important to compare possible methods. In this paper, we examine and compare two approaches to anomaly detection and see how data fusion methods may improve performance. The first approach involves using an artificial neural network (ANN) to detect anomalies in a wavelet de-noised signal. The other method uses a perspective neural network (PNN) to analyze an arbitrary number of “perspectives” or transformations of the observed signal for anomalies. Possible perspectives may include wavelet de-noising, Fourier transform, peak-filtering, etc. In order to evaluate these techniques via signal fusion metrics, we must apply signal preprocessing techniques such as de-noising methods to the original signal and then use a neural network to find anomalies in the generated signal. From this secondary result it is possible to use data fusion techniques that can be evaluated via existing data fusion metrics for single and multiple perspectives. The result will show which anomaly detection method, according to the metrics, is better suited overall for anomaly detection applications. The method used in this study could be applied to compare other signal processing algorithms.
Various initiatives have been taken all over the world to involve the citizens in the collection and reporting of data to make better and informed data-driven decisions. Our work shows how the geotagged images collected through the general population can be used to combat Malaria and Dengue by identifying and visualizing localities that contain potential mosquito breeding sites.
Our method first employs image quality assessment on the client side to reject images with distortions like blur and artifacts. Each geotagged image received on the server is converted into a feature vector using the bag of visual words model. We train an SVM classifier on a histogram-based feature vector, obtained after vector quantization of SIFT features, to discriminate images containing either a small stagnant water body like a puddle, or open containers, tires, bushes, etc., from those that contain flowing water, manicured lawns, tires attached to a vehicle, etc. A geographical heat map is generated by assigning each location a probability of being a potential mosquito breeding ground, using feature-level fusion or the max approach presented in the paper. The heat map thus generated can be used by concerned health authorities to take appropriate action and to promote civic awareness.
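The vector-quantization step at the heart of the bag-of-visual-words model can be sketched as follows: each local descriptor is assigned to its nearest visual word, and the counts form the histogram fed to the SVM (the toy 1-D descriptors and two-word vocabulary below are purely illustrative; real SIFT descriptors are 128-dimensional and vocabularies are learned by clustering):

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Quantize local descriptors against a visual vocabulary and return a
    normalized bag-of-visual-words histogram (one bin per visual word)."""
    # squared Euclidean distance from every descriptor to every visual word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)                  # nearest-word assignment
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()
```

The resulting fixed-length histogram is what makes images of differing sizes and descriptor counts comparable by a classifier.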
Wavelet transformation has become a cutting edge and promising approach in the field of image and signal processing. A wavelet is a waveform of effectively limited duration that has an average value of zero. Wavelet analysis is done by breaking up the signal into shifted and scaled versions of the original signal. The key advantage of a wavelet is that it is capable of revealing smaller changes, trends, and breakdown points that are not revealed by other techniques such as Fourier analysis. The phenomenon of polarization has been studied for quite some time and is a very useful tool for target detection and tracking. Long Wave Infrared (LWIR) polarization is beneficial for detecting camouflaged objects and is a useful approach when identifying and distinguishing manmade objects from natural clutter. In addition, the Stokes polarization parameters, which are calculated from the 0°, 45°, 90°, 135°, right-circular, and left-circular intensity measurements, provide spatial orientations of target features and suppress natural features. In this paper, we propose a wavelet-based polarimetry analysis (WPA) method to analyze Long Wave Infrared Polarimetry Imagery to discriminate targets such as dismounts and vehicles from background clutter. These parameters can be used for image thresholding and segmentation. Experimental results show the wavelet-based polarimetry analysis is efficient and can be used in a wide range of applications such as change detection, shape extraction, target recognition, and feature-aided tracking.
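The Stokes parameters mentioned above follow directly from the six intensity measurements; a minimal per-pixel computation (with the commonly derived degree of linear polarization added for illustration) looks like:

```python
def stokes_parameters(i0, i45, i90, i135, irc, ilc):
    """Stokes vector from the six polarimetric intensity measurements:
    linear polarizer at 0/45/90/135 degrees plus right/left circular."""
    s0 = i0 + i90                # total intensity
    s1 = i0 - i90                # horizontal vs. vertical linear
    s2 = i45 - i135              # +45 vs. -45 linear
    s3 = irc - ilc               # right vs. left circular
    dolp = (s1 ** 2 + s2 ** 2) ** 0.5 / s0   # degree of linear polarization
    return s0, s1, s2, s3, dolp
```

High DoLP values tend to flag smooth manmade surfaces, which is why thresholding these parameters helps suppress natural clutter.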
Motion imagery capabilities within the Department of Defense/Intelligence Community (DoD/IC) have advanced
significantly over the last decade, attempting to meet continuously growing data collection, video processing and
analytical demands in operationally challenging environments. The motion imagery tradecraft has evolved accordingly,
enabling teams of analysts to effectively exploit data and generate intelligence reports across multiple phases in
structured Full Motion Video (FMV) Processing Exploitation and Dissemination (PED) cells. Yet now the operational
requirements are drastically changing. The exponential growth in motion imagery data continues, but to this the
community adds multi-INT data, interoperability with existing and emerging systems, expanded data access, nontraditional
users, collaboration, automation, and support for ad hoc configurations beyond the current FMV PED cells.
To break from the legacy system lifecycle, we look towards a technology application and commercial adoption model
that will meet these future Intelligence, Surveillance and Reconnaissance (ISR) challenges. In this paper, we
explore the application of cutting edge computer vision technology to meet existing FMV PED shortfalls and address
future capability gaps. For example, real-time georegistration services developed from computer-vision-based feature
tracking, multiple-view geometry, and statistical methods allow the fusion of motion imagery with other georeferenced
information sources - providing unparalleled situational awareness. We then describe how these motion imagery
capabilities may be readily deployed in a dynamically integrated analytical environment; employing an extensible
framework, leveraging scalable enterprise-wide infrastructure and following commercial best practices.
Activity Based Intelligence (ABI) is the derivation of information from the composite of a series of individual
actions being recorded over a period of time. Due to its temporal nature, ABI is usually developed from
Motion Imagery (MI) or Full Motion Video (FMV) taken from a given scene. One common misconception
is that ABI boils down to a simple resolution problem: more pixels at a higher frame rate is better. As part
of this research an experiment was designed and performed to address this assumption; by analyzing varying
temporal resolutions in conjunction with several modalities, a trade space for characterizing activities can
be developed. Thermal Infrared (IR), multispectral, and polarimetric data were used to augment RGB MI.
As these data are still being analyzed, this paper gives an update to the experiment and analysis process.
The ability of computer systems to perform gender classification using the dynamic motion of the human subject has
important applications in medicine, human factors, and human-computer interface systems. Previous works in motion
analysis have used data from sensors (including gyroscopes, accelerometers, and force plates), radar signatures, and
video. However, full-motion video, motion capture, and range data provide higher temporal and spatial resolution datasets for
the analysis of dynamic motion. Works using motion capture data have been limited by small datasets in a controlled
environment. In this paper, we explore machine learning techniques to a new dataset that has a larger number of
subjects. Additionally, these subjects move unrestricted through a capture volume, representing a more realistic, less
controlled environment. We conclude that existing linear classification methods are insufficient for gender
classification on the larger dataset captured in a relatively uncontrolled environment. A method based on a nonlinear support
vector machine classifier is proposed to obtain gender classification for the larger dataset. In experimental testing with a
dataset consisting of 98 trials (49 subjects, 2 trials per subject), classification rates using leave-one-out cross-validation
are improved from 73% using linear discriminant analysis to 88% using the nonlinear support vector machine classifier.
Government agencies, including defense and law enforcement, increasingly make use of video from surveillance systems and camera phones owned by non-government entities. Making advanced and standardized motion imaging technology available to private and commercial users at cost-effective prices would benefit all parties. In particular, incorporating thermal infrared into commercial surveillance systems offers substantial benefits beyond night vision capability. Face rendering is a process to facilitate exploitation of thermal infrared surveillance imagery from the general area of a crime scene, to assist investigations with and without cooperating eyewitnesses. Face rendering automatically generates greyscale representations similar to police artist sketches for faces in surveillance imagery collected from proximate locations and times to a crime under investigation. Near-realtime generation of face renderings can provide law enforcement with an investigation tool to assess witness memory and credibility, and to integrate reports from multiple eyewitnesses. Renderings can be quickly disseminated through social media to warn of a person who may pose an immediate threat, and to solicit the public's help in identifying possible suspects and witnesses. Renderings are pose-standardized so as not to divulge the presence and location of eyewitnesses and surveillance cameras. Incorporation of thermal infrared imaging into commercial surveillance systems will significantly improve system performance, and reduce manual review times, at an incremental cost that will continue to decrease. Benefits to criminal justice would include improved reliability of eyewitness testimony and improved accuracy of distinguishing among minority groups in eyewitness and surveillance identifications.
There is an explosion in the quantity and quality of IMINT data being captured in Intelligence Surveillance and
Reconnaissance (ISR) today. While automated exploitation techniques involving computer vision are arriving, only a
few architectures can manage both the storage and bandwidth of large volumes of IMINT data and also present results to
analysts quickly. Lockheed Martin Advanced Technology Laboratories (ATL) has been actively researching in the area
of applying Big Data cloud computing techniques to computer vision applications. This paper presents the results of this
work in adopting a Lambda Architecture to process and disseminate IMINT data using computer vision algorithms. The
approach embodies an end-to-end solution by processing IMINT data from sensors to serving information products
quickly to analysts, independent of the size of the data. The solution lies in dividing up the architecture into a speed layer
for low-latent processing and a batch layer for higher quality answers at the expense of time, but in a robust and fault-tolerant
way. This approach was evaluated using a large corpus of IMINT data collected by a C-130 Shadow Harvest
sensor over Afghanistan from 2010 through 2012. The evaluation data corpus included full motion video from both
narrow and wide area field-of-views. The evaluation was done on a scaled-out cloud infrastructure that is similar in
composition to those found in the Intelligence Community. The paper shows experimental results to prove the scalability
of the architecture and precision of its results using a computer vision algorithm designed to identify man-made objects
in sparse data terrain.
This paper reviews and compares the performance of several methods to detect target tracks in image sequences. The
targets are assumed to be sub-pixel or not resolved by the imaging system, and moving over a static background. To
process the resulting large amount of data requires simple, fast and robust processing methods to quickly find and
display tracks of moving targets in a single image. An object moving through a pixel in a scene will momentarily perturb
the pixel intensity signal, introducing a change of both skewness and kurtosis in the intensity histogram relative to an
undisturbed pixel. Numerical experiments show that for Gaussian and Poisson distributed system noise, higher order
moments (order > 2) perform better than second order detectors.
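The per-pixel statistic described above can be sketched directly: treat each pixel's intensity over the image sequence as a time series and compute its sample skewness and excess kurtosis, both of which a transiting sub-pixel target perturbs relative to an undisturbed pixel (this is a generic moment computation, not the paper's exact detector):

```python
import numpy as np

def pixel_track_score(intensity_series):
    """Sample skewness and excess kurtosis of one pixel's intensity over
    time. A momentary perturbation by a moving target drives both away
    from the (near-zero) values of an undisturbed noise-only pixel."""
    x = np.asarray(intensity_series, dtype=float)
    mu, sigma = x.mean(), x.std()
    if sigma == 0:
        return 0.0, 0.0
    z = (x - mu) / sigma
    skew = np.mean(z ** 3)          # third standardized moment
    kurt = np.mean(z ** 4) - 3.0    # fourth standardized moment, excess
    return skew, kurt
```

Mapping these scores over all pixels yields a single image in which the tracks of moving targets stand out.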
In the context of aerial imagery, one of the first steps toward a coherent processing of the information contained in multiple images is geo-registration, which consists in assigning geographic 3D coordinates to the pixels of the image. This enables accurate alignment and geo-positioning of multiple images, detection of moving objects and fusion of data acquired from multiple sensors. To solve this problem there are different approaches that require, in addition to a precise characterization of the camera sensor, high resolution referenced images or terrain elevation models, which are usually not publicly available or out of date. Building upon the idea of developing technology that does not need a reference terrain elevation model, we propose a geo-registration technique that applies variational methods to obtain a dense and coherent surface elevation model that is used to replace the reference model. The surface elevation model is built by interpolation of scattered 3D points, which are obtained in a two-step process following a classical stereo pipeline: first, coherent disparity maps between image pairs of a video sequence are estimated and then image point correspondences are back-projected. The proposed variational method enforces continuity of the disparity map not only along epipolar lines (as done by previous geo-registration techniques) but also across them, in the full 2D image domain. In the experiments, aerial images from synthetic video sequences have been used to validate the proposed technique.
Image and video compression plays a major role in multimedia transmission. Specifically the discrete cosine transform (DCT) is the key tool employed in a vast variety of compression standards such as H.265/HEVC due to its remarkable energy compaction properties. Rapid growth in digital imaging applications, such as multimedia and automatic surveillance that operates with limited bandwidths has led to extensive development of video processing systems. The main objective of this paper is to discuss some DCT approximations equipped with fast algorithms which require minimum addition operations and zero multipliers or bit-shifting operations leading to significant reductions in chip area and power consumption compared to conventional DCT algorithms.
We provide complete design details for several k × k (k = 8, 16) blocked 2-D algorithms for DCT computation
with video evaluation using the HEVC software encoder. Custom digital architectures are proposed, simulated, and implemented on Xilinx FPGAs and verified in conjunction with software models.