Aerial video recognition is challenging for several reasons. Prior work on action recognition is constrained by the unavailability of object detection bounding-box ground truth, which inhibits the application of localization models, and by computational budgets that preclude expensive space-time self-attention. Optical flow and pretrained models for detecting the human actor performing an action do not work well due to domain gap issues. Our contributions are as follows: 1. We present a frequency-domain space-time attention method that encapsulates long-range space-time dependencies by emulating the weighted outer product in the frequency domain. 2. We present a frequency-based object-background disentanglement method to inherently separate the moving human actor from the background. 3. We present a mathematical model for static salient regions and an identity loss function to learn disentangled features in a differentiable manner.
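As a rough illustration of contribution 1, the sketch below shows one way token interactions can be emulated in the frequency domain via the convolution theorem, where elementwise spectral products stand in for pairwise, outer-product-style token interactions. The function name, shapes, and exact mixing rule are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def frequency_domain_mixing(q, k, v):
    """Illustrative frequency-domain token mixing (not the paper's exact method).

    q, k, v: (num_tokens, dim) real-valued arrays. By the convolution
    theorem, elementwise products in the frequency domain correspond to
    circular correlations over the token axis, giving a cheap stand-in
    for pairwise (outer-product style) token interactions.
    """
    Qf = np.fft.rfft(q, axis=0)           # spectra over the token axis
    Kf = np.fft.rfft(k, axis=0)
    Vf = np.fft.rfft(v, axis=0)
    mixed = Qf * np.conj(Kf) * Vf         # weighted interaction in frequency space
    return np.fft.irfft(mixed, n=q.shape[0], axis=0)

tokens = np.random.randn(196, 64)         # e.g., 14x14 spatial tokens, 64-dim
out = frequency_domain_mixing(tokens, tokens, tokens)
print(out.shape)                          # (196, 64)
```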
Hyperspectral imaging provides extensive spectral reflectance information useful for material classification and discrimination that is not available with conventional broadband imaging. In this work, we first seek to characterize the hyperspectral signature of human faces in the shortwave infrared (SWIR) band. A hyperspectral SWIR face dataset of 100 subjects was collected as part of this study. Regions of interest (ROI) were defined for each subject, and the mean and variance of each ROI were computed. The results show that hyperspectral signatures are similar between male and female subjects for the cheek, forehead, and hair ROIs. Furthermore, this study investigated whether the hyperspectral face signatures from the ROIs contain discriminative information for gender classification. We implemented and trained five different classifiers for gender classification. Results from the machine learning experiments indicate that hyperspectral facial signatures in the SWIR band are only weakly discriminative with respect to gender.
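A minimal sketch of the ROI signature computation described above, assuming the hyperspectral data is available as an (H, W, bands) cube and the ROI as a boolean mask; array sizes and the rectangular cheek ROI are hypothetical.

```python
import numpy as np

def roi_signature(cube, mask):
    """Per-band mean and variance of a hyperspectral ROI.

    cube: (H, W, num_bands) SWIR hyperspectral image cube.
    mask: (H, W) boolean array marking the ROI (e.g., cheek, forehead, hair).
    Returns (mean, var), each of length num_bands.
    """
    pixels = cube[mask]                   # (num_roi_pixels, num_bands)
    return pixels.mean(axis=0), pixels.var(axis=0)

# Hypothetical usage with a 100-band cube and a rectangular cheek ROI
cube = np.random.rand(256, 320, 100)
mask = np.zeros((256, 320), dtype=bool)
mask[120:150, 90:130] = True
mean_sig, var_sig = roi_signature(cube, mask)
```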
This paper presents an overview of polarimetric thermal imaging for biometrics, focusing on face recognition, with a short discussion of fingerprints and iris. Face recognition has been and continues to be an active area of biometrics research, with most of the research dedicated to recognition in the visible spectrum. However, face recognition in the visible spectrum is not practical for discreet surveillance in low-light and nighttime scenarios. Polarimetric thermal imaging represents an ideal modality for acquiring the naturally emitted thermal radiation from the human face, providing additional geometric and textural details not available in conventional thermal imagery. One of the main challenges lies in matching the acquired polarimetric thermal facial signature to gallery databases containing only visible facial signatures, for interoperability with existing government biometric repositories. This paper discusses approaches and algorithms that exploit polarization information, as represented by the Stokes vectors, through feature extraction and nonlinear regression to enable polarimetric thermal-to-visible face recognition. In addition to cross-spectrum feature-based approaches, cross-spectrum image synthesis methods are discussed that seek to reconstruct a visible-like image given a polarimetric thermal face image input. Beyond facial biometrics, this paper presents an initial exploration of polarimetric thermal imaging for latent fingerprint acquisition. Latent prints are formed when the oils and sweat from the finger are deposited onto another surface through contact; they are typically collected by first dusting with powder, then imaged and lifted with adhesive tape. This paper presents polarimetric thermal imagery of latent prints on a nonporous glass surface, acquired without the dusting process. A brief discussion of the utility of polarimetric thermal imaging for iris recognition is also presented.
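For reference, the linear Stokes components mentioned above are conventionally computed from intensity measurements behind polarizers at four orientations, with the degree of linear polarization (DoLP) derived from them; the paper's exact acquisition details may differ.

```latex
S_0 = I_{0^\circ} + I_{90^\circ}, \qquad
S_1 = I_{0^\circ} - I_{90^\circ}, \qquad
S_2 = I_{45^\circ} - I_{135^\circ}, \qquad
\mathrm{DoLP} = \frac{\sqrt{S_1^2 + S_2^2}}{S_0}
```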
Light detection and ranging (LIDAR) has become a widely used tool in remote sensing for mapping, surveying, modeling, and a host of other applications. The motivation behind this work is the modeling of piping systems in industrial sites, where cylinders are the most common primitive or shape. We focus on cylinder parameter estimation in three-dimensional point clouds, proposing a mathematical formulation based on angular distance to determine the cylinder orientation. We demonstrate the accuracy and robustness of the technique on synthetically generated cylinder point clouds (where the true axis orientation is known) as well as on real LIDAR data of piping systems. The proposed algorithm is compared with a discrete space Hough transform-based approach as well as a continuous space inlier approach, which iteratively discards outlier points to refine the cylinder parameter estimates. Results show that the proposed method is more computationally efficient than the Hough transform approach and is more accurate than both the Hough transform approach and the inlier method.
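For context, a common baseline for cylinder axis estimation exploits the fact that every surface normal of an ideal cylinder is perpendicular to its axis; the sketch below implements that least-variance formulation, which is not the paper's angular-distance method.

```python
import numpy as np

def estimate_cylinder_axis(normals):
    """Estimate a cylinder's axis direction from point-cloud surface normals.

    For an ideal cylinder every surface normal is perpendicular to the axis,
    so the axis is the direction of least variance of the normal vectors:
    the right-singular vector with the smallest singular value.
    (Illustrative baseline; the paper uses an angular-distance formulation.)
    """
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(normals, full_matrices=False)
    return vt[-1]

# Synthetic cylinder with axis along z: normals lie in the xy-plane
theta = np.random.rand(500) * 2 * np.pi
normals = np.stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)], axis=1)
axis = estimate_cylinder_axis(normals + 0.01 * np.random.randn(500, 3))
print(axis)  # approximately (0, 0, +/-1)
```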
Human group activity recognition is a very complex and challenging task, especially for Partially Observable Group Activities (POGA) that occur in confined spaces with limited visual observability and often under severe occultation. In this paper, we present the IRIS Virtual Environment Simulation Model (VESM) for the modeling and simulation of dynamic POGA. More specifically, we address sensor-based modeling and simulation of a specific category of POGA, called In-Vehicle Group Activities (IVGA). In VESM, human-like animated characters, called humanoids, are employed to simulate complex in-vehicle group activities within the confined space of a modeled vehicle. Each articulated humanoid is kinematically modeled with physical attributes and appearances comparable to its human counterpart. Each humanoid exhibits harmonious full-body motion, simulating human-like gestures and postures, facial expressions, and hand motions for coordinated dexterity. VESM facilitates the creation of interactive scenarios consisting of multiple humanoids with different personalities and intentions, capable of performing complicated human activities within the confined space inside a typical vehicle. In this paper, we demonstrate the efficiency and effectiveness of VESM in terms of its capabilities to seamlessly generate time-synchronized, multi-source, and correlated imagery datasets of IVGA, which are useful for the training and testing of multi-source full-motion video processing and annotation algorithms. Furthermore, we demonstrate full-motion video processing of such simulated scenarios under different operational contextual constraints.
KEYWORDS: Skin, Space operations, Detection and tracking algorithms, RGB color model, Light sources and illumination, Visual process modeling, Data modeling, Statistical modeling, Video, Machine vision
In many military and homeland security persistent surveillance applications, accurate detection of different skin colors in varying observability and illumination conditions is a valuable capability for video analytics. One of those applications is In-Vehicle Group Activity (IVGA) recognition, in which significant changes in observability and illumination may occur during the course of a specific human group activity of interest. Most of the existing skin color detection algorithms, however, are unable to perform satisfactorily in confined operational spaces with partial observability and occultation, as well as under diverse and changing levels of illumination intensity, reflection, and diffraction. In this paper, we investigate the salient features of ten popular color spaces for skin subspace color modeling. More specifically, we examine the advantages and disadvantages of each of these color spaces, as well as the stability and suitability of their features in differentiating skin colors under various illumination conditions. The salient features of different color subspaces are methodically discussed and graphically presented. Furthermore, we present robust and adaptive algorithms for skin color detection based on this analysis. Through examples, we demonstrate the efficiency and effectiveness of these new color skin detection algorithms and discuss their applicability for skin detection in IVGA recognition applications.
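As a point of reference for the fixed-threshold baselines that adaptive methods improve upon, a minimal skin detector in the YCbCr space (one of the ten color spaces examined above) can be written as below; the threshold values are commonly cited defaults and are assumptions here, not the paper's adaptive algorithm.

```python
import numpy as np
import cv2

def skin_mask_ycbcr(bgr):
    """Baseline skin detection by thresholding the chrominance plane.

    Illustrative fixed-threshold rule in YCbCr; the adaptive algorithms
    discussed in the paper go beyond this. Threshold values follow
    commonly cited ranges and are assumptions here.
    """
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)  # OpenCV orders Y, Cr, Cb
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]
    return (cr >= 133) & (cr <= 173) & (cb >= 77) & (cb <= 127)

frame = cv2.imread("frame.png")                     # hypothetical input frame
mask = skin_mask_ycbcr(frame)
```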
Human activity detection and recognition capabilities have broad applications for military and homeland security. These tasks are very complicated, however, especially when multiple persons are performing concurrent activities in confined spaces that impose significant obstruction, occultation, and observability uncertainty. In this paper, our primary contribution is a dedicated taxonomy and kinematic ontology developed for in-vehicle group human activities (IVGA). Secondly, we describe a set of hand-observable patterns that represents certain IVGA examples. Thirdly, we propose two classifiers for hand gesture recognition and compare their performance individually and jointly. Finally, we present a variant of the Hidden Markov Model (HMM) for Bayesian tracking, recognition, and annotation of hand motions, which enables spatiotemporal inference for human group activity perception and understanding. To validate our approach, synthetic (graphical data from a virtual environment) and real physical-environment video imagery are employed to verify the performance of these hand gesture classifiers, while measuring their efficiency and effectiveness based on the proposed Hidden Markov Model for tracking and interpreting dynamic spatiotemporal IVGA scenarios.
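A minimal sketch of likelihood-based gesture classification with per-class HMMs, in the spirit of the approach above; it uses the third-party hmmlearn package and standard Gaussian HMMs rather than the paper's HMM variant, and all names and dimensions are assumptions.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_gesture_hmms(sequences_by_class, n_states=5):
    """Train one Gaussian HMM per hand-gesture class (illustrative sketch).

    sequences_by_class: dict mapping class name -> list of (T_i, D) arrays
    of hand-motion features. Returns a dict of fitted models.
    """
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.concatenate(seqs)                  # stack all frames
        lengths = [len(s) for s in seqs]          # per-sequence lengths
        models[label] = GaussianHMM(n_components=n_states,
                                    covariance_type="diag").fit(X, lengths)
    return models

def classify(models, seq):
    """Assign the class whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(seq))
```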
Visible spectrum face detection algorithms perform reliably under controlled lighting conditions. However, variations in illumination and the application of cosmetics can distort the features used by common face detectors, thereby degrading their detection performance. Thermal and polarimetric thermal facial imaging are relatively invariant to illumination and robust to the application of makeup, because they measure emitted radiation rather than reflected light. The objective of this work is to evaluate a government off-the-shelf wavelet-based naïve-Bayes face detection algorithm and a commercial off-the-shelf Viola-Jones cascade face detection algorithm on face imagery acquired in different spectral bands. New classifiers were trained using the Viola-Jones cascade object detection framework with preprocessed facial imagery. Preprocessing with Difference of Gaussians (DoG) filtering reduces the modality gap between facial signatures across the different spectral bands, enabling more correlated histogram of oriented gradients (HOG) features to be extracted from the preprocessed thermal and visible face images. Since the availability of training data is much more limited in the thermal spectrum than in the visible spectrum, it is not feasible to train a robust multi-modal face detector using thermal imagery alone. A large training dataset was constituted from DoG-filtered visible and thermal imagery and subsequently used to generate a custom-trained Viola-Jones detector. A 40% increase in face detection rate was achieved on a testing dataset, compared to the performance of a pre-trained/baseline face detector. Insights gained in this research are valuable for the development of more robust multi-modal face detectors.
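The DoG preprocessing step is straightforward to sketch: band-pass filter each face image by subtracting two Gaussian blurs. The sigma values below are assumptions, not the settings used in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_filter(image, sigma_low=1.0, sigma_high=2.0):
    """Difference-of-Gaussians band-pass preprocessing.

    Suppresses low-frequency modality-specific appearance while keeping
    edge structure, narrowing the gap between thermal and visible faces.
    The sigma values here are assumptions, not the paper's settings.
    """
    img = image.astype(np.float64)
    return gaussian_filter(img, sigma_low) - gaussian_filter(img, sigma_high)
```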
Face images are an important source of information for biometric recognition and intelligence gathering. While face recognition research has made significant progress over the past few decades, recognition of faces at extended ranges is still highly problematic. Recognizing a low-resolution probe face image against a gallery database, typically containing high-resolution facial imagery, yields lower performance than traditional face recognition. Learning- and super-resolution-based approaches have been proposed to improve face recognition at extended ranges; however, the resolution threshold for face recognition has not been examined extensively. Establishing a threshold resolution corresponding to the theoretical and empirical limitations of low-resolution face recognition will allow algorithm developers to avoid focusing on improving performance where no distinguishable information for identification exists in the acquired signal. This work examines the intrinsic dimensionality of facial signatures and seeks to estimate a lower bound for the size of a face image required for recognition. We estimate a lower bound for face signatures in the visible and thermal spectra by conducting eigenanalysis using principal component analysis (PCA) (i.e., the eigenfaces approach). We seek to estimate the intrinsic dimensionality of facial signatures, in terms of reconstruction error, by maximizing the amount of variance retained in the reconstructed dataset while minimizing the number of reconstruction components. Extending this approach, we also examine the identification error to estimate the dimensionality lower bound for low-resolution to high-resolution (LR-to-HR) face recognition performance. Two multimodal face datasets are used in this study to evaluate the effects of dataset size and diversity on the underlying intrinsic dimensionality: 1) the 50-subject NVESD face dataset (containing visible, MWIR, and LWIR face imagery) and 2) the 119-subject WSRI face dataset (containing visible and MWIR face imagery).
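A minimal sketch of the variance-retention analysis described above, using PCA to find the smallest number of eigenfaces that retain a target fraction of dataset variance; the data and threshold below are hypothetical, and the paper additionally ties the bound to reconstruction and identification error.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(face_matrix, target=0.95):
    """Smallest number of eigenfaces retaining `target` of the variance.

    face_matrix: (num_faces, num_pixels) matrix of vectorized face images.
    A rough proxy for intrinsic dimensionality.
    """
    pca = PCA().fit(face_matrix)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, target) + 1)

faces = np.random.rand(200, 64 * 64)     # hypothetical vectorized face set
print(components_for_variance(faces, 0.95))
```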
Human activity recognition research relies heavily on extensive datasets to verify and validate the performance of activity recognition algorithms. However, obtaining real datasets is expensive and highly time-consuming. A physics-based virtual simulation can accelerate the development of context-based human activity recognition algorithms and techniques by generating relevant training and testing videos that simulate diverse operational scenarios. In this paper, we discuss in detail the requisite capabilities of a virtual environment to serve as a test bed for evaluating and enhancing activity recognition algorithms. To demonstrate the numerous advantages of virtual environment development, a newly developed virtual environment simulation modeling (VESM) environment is presented here to generate calibrated multi-source imagery datasets suitable for the development and testing of recognition algorithms for context-based human activities. The VESM environment serves as a versatile test bed to generate vast amounts of realistic data for training and testing of sensor processing algorithms. To demonstrate the effectiveness of the VESM environment, we present various simulated scenarios and processed results to infer proper semantic annotations from the high-fidelity imagery data for human-vehicle activity recognition under different operational contexts.
KEYWORDS: Surveillance, Video, Video surveillance, Data modeling, Binary data, Video compression, Process modeling, Machine vision, Computer vision technology, Statistical analysis
Surveillance cameras have become ubiquitous in society, used to monitor areas such as residential blocks, city streets, university campuses, industrial sites, and government installations. Surveillance footage, especially of public areas, is frequently streamed online in real time, providing a wealth of data for computer vision research. The focus of this work is the detection of anomalous patterns in surveillance video data recorded over a period of months to years. We propose an anomaly detection technique based on support vector data description (SVDD) to detect anomalous patterns in video footage of a university campus scene recorded over a period of months. SVDD is a kernel-based anomaly detection technique which models the normalcy data in a high-dimensional feature space using an optimal enclosing hypersphere – samples that lie outside this boundary are detected as outliers or anomalies. Two types of anomaly detection are conducted in this work: track-level analysis to determine individual tracks that are anomalous, and day-level analysis using aggregate scene-level feature maps to determine which days exhibit anomalous activity. Experimentation and evaluation are conducted using a scene from the Global Webcam Archive.
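A minimal sketch of the normalcy modeling step, assuming per-track feature vectors have already been extracted; it uses scikit-learn's one-class SVM, which with an RBF kernel is equivalent to SVDD, with `nu` bounding the permissible fraction of outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Illustrative stand-in: with an RBF kernel, the one-class SVM is
# equivalent to SVDD. `nu` upper-bounds the fraction of training samples
# allowed outside the enclosing boundary. Feature construction is assumed.
track_features = np.random.rand(5000, 16)    # hypothetical per-track features

model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(track_features)
labels = model.predict(track_features)        # +1 = normal, -1 = anomaly
anomalous_tracks = np.where(labels == -1)[0]
```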
Recognizing faces acquired in the thermal spectrum from a gallery of visible face images is a desired capability for the military and homeland security, especially for nighttime surveillance and intelligence gathering. However, thermal-to-visible face recognition is a highly challenging problem, due to the large modality gap between thermal and visible imaging. In this paper, we propose a thermal-to-visible face recognition approach based on multiple kernel learning (MKL) with support vector machines (SVMs). We first subdivide the face into non-overlapping spatial regions or blocks using a method based on coalitional game theory. For comparison purposes, we also investigate uniform spatial subdivisions. Following this subdivision, histogram of oriented gradients (HOG) features are extracted from each block and used to compute a kernel for each region. We apply sparse multiple kernel learning (SMKL), an MKL-based approach that learns a set of sparse kernel weights, as well as the decision function of a one-vs-all SVM classifier for each of the subjects in the gallery. We also apply equal (non-sparse) kernel weights and obtain one-vs-all SVM models for the same subjects in the gallery. Only visible images of each subject are used for MKL training, while thermal images are used as probe images during testing. With the subdivision generated by game theory, we achieved a Rank-1 identification rate of 50.7% for SMKL and 93.6% for equal kernel weighting using a multimodal dataset of 65 subjects. With uniform subdivisions, we achieved a Rank-1 identification rate of 88.3% for SMKL, but 92.7% for equal kernel weighting.
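A minimal sketch of the equal-kernel-weight pipeline, assuming uniform spatial subdivisions: per-block HOG features, an averaged per-block RBF kernel, and an SVM on the precomputed kernel. The block geometry, gamma value, and toy gallery are assumptions; the game-theoretic subdivision and SMKL training are not shown.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def block_hog_features(face, grid=(4, 4)):
    """Per-block HOG features over a uniform spatial subdivision
    (illustrative; the paper also derives blocks via coalitional game theory)."""
    h, w = face.shape
    bh, bw = h // grid[0], w // grid[1]
    return [hog(face[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw],
                orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
            for i in range(grid[0]) for j in range(grid[1])]

def equal_weight_kernel(blocks_a, blocks_b, gamma=1.0):
    """Average of per-block RBF kernels, i.e., equal (non-sparse) kernel weights."""
    sims = [np.exp(-gamma * np.sum((a - b) ** 2))
            for a, b in zip(blocks_a, blocks_b)]
    return np.mean(sims)

# Hypothetical visible-light gallery: 5 subjects x 3 images each
gallery = [np.random.rand(128, 128) for _ in range(15)]
labels = np.repeat(np.arange(5), 3)
feats = [block_hog_features(f) for f in gallery]
K = np.array([[equal_weight_kernel(a, b) for b in feats] for a in feats])
svm = SVC(kernel="precomputed").fit(K, labels)  # the paper trains one-vs-all models
```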
In this paper, an unsupervised pedestrian detection algorithm is proposed. An input image is first divided into overlapping detection windows in a sliding fashion, and Histogram of Oriented Gradients (HOG) features are collected over each window using non-overlapping cells. A distance metric is used to determine the distance between the histograms of corresponding cells in each detection window and the average pedestrian HOG template (determined a priori). These distances over a group of cells are concatenated to obtain the feature vector pertaining to a block of cells. The feature vectors over overlapping blocks of cells are concatenated to form the distance feature vector of a detection window. Each window provides a data sample, and the data samples extracted from the whole image are then modeled as a normalcy class using Support Vector Data Description (SVDD). The benefit of using the state-of-the-art SVDD technique to model the normalcy class is that the modeling can be controlled by setting an upper limit on the permissible outliers. Assuming that most of the image is covered by background, the outliers detected during the modeling of the normalcy class can be hypothesized as detection windows that contain pedestrians. Detections are obtained at different scales to account for the different sizes of pedestrians, and the final pedestrian detections are generated by applying non-maximal suppression on all detections at all scales. The system is tested on the INRIA pedestrian dataset, and its performance is analyzed with respect to accuracy and detection rate.
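A minimal sketch of the distance-feature construction for one detection window, assuming per-cell HOG histograms have already been computed; the Euclidean metric and cell counts are assumptions, since the paper leaves the distance metric generic.

```python
import numpy as np

def window_distance_features(cell_hogs, template_cell_hogs):
    """Distance-feature vector for one detection window (illustrative).

    cell_hogs, template_cell_hogs: (n_cells, n_bins) per-cell HOG histograms
    for the window and the average pedestrian template. Each output entry
    is the distance between corresponding cell histograms; the concatenation
    over all cells is the window's data sample for SVDD normalcy modeling.
    """
    return np.linalg.norm(cell_hogs - template_cell_hogs, axis=1)

cells = np.random.rand(105, 9)             # e.g., 15x7 cells, 9 orientation bins
template = np.random.rand(105, 9)          # precomputed average pedestrian HOG
feat = window_distance_features(cells, template)   # one SVDD data sample
```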
In this paper, we present our implementation of a cascaded Histogram of Oriented Gradient (HOG) based pedestrian detector. Most human detection algorithms can be implemented as a cascade of classifiers to decrease computation time while maintaining approximately the same performance. Although cascaded versions of Dalal and Triggs's HOG detector already exist, we aim to provide a more detailed explanation of an implementation than is currently available. We also use Asymmetric Boosting instead of Adaboost to train the cascade stages. We show that this reduces the number of weak classifiers needed per stage. We present the results of our detector on the INRIA pedestrian detection dataset and compare them to Dalal and Triggs's results.
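For illustration, the cascade's control flow reduces to early rejection through a sequence of boosted stages, sketched below with hypothetical stage score functions and thresholds.

```python
def cascade_detect(window, stages):
    """Evaluate a detection window through a cascade of boosted stages.

    stages: list of (score_fn, threshold) pairs, cheap stages first.
    A window must pass every stage; most background windows are rejected
    by the early stages, which is where the cascade's speedup comes from.
    Illustrative sketch only, not the trained detector from the paper.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False        # early rejection: stop computing immediately
    return True                 # survived all stages: report a detection
```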
Airborne laser scanning light detection and ranging (LiDAR) systems are used for remote sensing of topology and bathymetry. The most common data collection technique used in LiDAR systems employs linear-mode scanning, and the resulting scan data form a non-uniformly sampled 3D point cloud. To interpret and further process the 3D point cloud data, these raw data are usually converted to digital elevation models (DEMs). To obtain DEMs in a uniform and upsampled raster format, the elevation information from the available non-uniform 3D point cloud data is mapped onto uniform grid points; grid points with missing elevation information are then filled using interpolation techniques. In this paper, a partial differential equation (PDE) based approach is proposed to perform the interpolation and to upsample the 3D point cloud onto a uniform grid. Due to the desirable properties of higher-order PDEs, smoothness is maintained over homogeneous regions, while sharp edge information in the scene is well preserved. The proposed algorithm reduces the draping effects near the edges of distinctive objects in the scene, which are commonly associated with existing point cloud rendering algorithms. Simulation results are presented to illustrate the advantages of the proposed algorithm.
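A minimal sketch of PDE-based hole filling on a DEM grid, using second-order Laplace relaxation for brevity; the paper's approach employs higher-order PDEs for better edge preservation, and the grid contents here are assumptions.

```python
import numpy as np

def pde_fill(grid, known_mask, iterations=500):
    """Fill missing DEM cells by relaxing Laplace's equation (illustrative;
    the paper uses higher-order PDEs for sharper edge preservation).

    grid: 2D array of elevations; known_mask: True where LiDAR returns
    were mapped onto the grid. Unknown cells converge to the average of
    their four neighbors, i.e., a harmonic interpolant.
    """
    z = np.where(known_mask, grid, grid[known_mask].mean())
    for _ in range(iterations):
        avg = 0.25 * (np.roll(z, 1, 0) + np.roll(z, -1, 0)
                      + np.roll(z, 1, 1) + np.roll(z, -1, 1))
        z = np.where(known_mask, z, avg)   # keep measured elevations fixed
    return z
```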
In low-light conditions, visible-light face identification is infeasible due to the lack of illumination. For nighttime surveillance, thermal imaging is commonly used because of the intrinsic emissivity of thermal radiation from the human body. However, matching thermal images of faces acquired at nighttime to the predominantly visible-light face imagery in existing government databases and watch lists is a challenging task. The difficulty arises from the significant difference between the face's thermal signature and its visible signature (i.e., the modality gap). To match the thermal face to the visible face acquired by the two different modalities, we applied face recognition algorithms that reduce the modality gap in each step of face identification, from low-level analysis to machine learning techniques. Specifically, partial least squares-discriminant analysis (PLS-DA) based approaches were used to correlate the thermal face signatures to the visible face signatures, yielding a thermal-to-visible face identification rate of 49.9%. While this work makes progress for thermal-to-visible face recognition, more effort needs to be devoted to solving this difficult task. Successful development of a thermal-to-visible face recognition system would significantly enhance the Nation's nighttime surveillance capabilities.
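A minimal sketch of PLS-DA-style scoring, assuming feature vectors have already been extracted from the visible gallery and thermal probes: regress one-hot subject labels with partial least squares, then take the top-scoring subject for each probe. Dimensions and the feature pipeline are assumptions, not the paper's.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Illustrative PLS-DA sketch: regress one-hot subject labels from visible
# gallery features, then score thermal probe features with the same model.
n_subjects, feat_dim = 50, 512
gallery_feats = np.random.rand(n_subjects * 4, feat_dim)   # visible features
labels = np.repeat(np.arange(n_subjects), 4)
Y = np.eye(n_subjects)[labels]                             # one-hot targets

pls = PLSRegression(n_components=20).fit(gallery_feats, Y)

probe_feats = np.random.rand(10, feat_dim)                 # thermal features
scores = pls.predict(probe_feats)                          # per-subject scores
predicted_ids = scores.argmax(axis=1)                      # Rank-1 decisions
```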
Vast amounts of video footage are being continuously acquired by surveillance systems on private premises, commercial properties, government compounds, and military installations. Facial recognition systems have the potential to identify suspicious individuals on law enforcement watchlists, but accuracy is severely hampered by the low resolution of typical surveillance footage and the far distance of suspects from the cameras. To improve accuracy, super-resolution can enhance suspect details by utilizing a sequence of low-resolution frames from the surveillance footage to reconstruct a higher-resolution image for input into the facial recognition system. This work measures the improvement of face recognition with super-resolution in a realistic surveillance scenario. Low-resolution and super-resolved query sets are generated using a video database at different eye-to-eye distances corresponding to different distances of subjects from the camera. Performance of a face recognition algorithm using the super-resolved and baseline query sets was calculated by matching against galleries consisting of frontal mug shots. The results show that super-resolution improves performance significantly at the examined mid and close ranges.
Flash laser detection and ranging (LADAR) systems are increasingly used in robotics applications for autonomous navigation and obstacle avoidance. Their compact size, high frame rate, wide field of view, and low cost are key advantages over traditional scanning LADAR devices. However, these benefits are achieved at the cost of spatial resolution. Super-resolution enhancement can be applied to improve the resolution of flash LADAR devices, making them ideal for small robotics applications. Previous work by Rosenbush et al. applied the super-resolution algorithm of Vandewalle et al. to flash LADAR data and observed quantitative improvement in image quality in terms of the number of edges detected. This study uses the super-resolution algorithm of Young et al. to enhance the resolution of range data acquired with a SwissRanger SR-3000 flash LADAR camera. To improve the accuracy of sub-pixel shift estimation, a wavelet preprocessing stage was developed and applied to the flash LADAR imagery. The authors used the triangle orientation discrimination (TOD) methodology for a subjective evaluation of the performance improvement (measured in terms of probability of target discrimination and subject response times) achieved with super-resolution. Super-resolution of flash LADAR imagery resulted in superior probabilities of target discrimination at all investigated ranges while reducing subject response times.
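For context, sub-pixel shift estimation between low-resolution frames is commonly done with upsampled phase correlation, sketched below; this is a standard method, not necessarily the Young et al. algorithm, and it omits the wavelet preprocessing stage developed in the paper.

```python
import numpy as np
from skimage.registration import phase_cross_correlation

# Illustrative sub-pixel shift estimation between low-resolution LADAR
# frames via phase correlation. The frame contents are synthetic stand-ins.
ref = np.random.rand(144, 176)            # hypothetical SR-3000-sized frame
moving = np.roll(ref, 2, axis=0)          # stand-in for a shifted frame

shift, error, _ = phase_cross_correlation(ref, moving, upsample_factor=20)
print(shift)                              # estimated (row, col) shift
```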