In this paper, we present a pipeline and prototype vision system for near-real-time semantic segmentation and classification of objects such as roads, buildings, and vehicles in large, high-resolution, wide-area, real-world aerial LiDAR point-cloud and RGBD imagery. Unlike previous works, which have focused on exploiting ground-based sensors or narrowed the scope to detecting the density of large objects, here we address the full semantic segmentation of aerial LiDAR and RGBD imagery by exploiting crowd-sourced labels that densely canvas each image in the 2015 Dublin dataset. Our results indicate substantial improvements in detection and segmentation accuracy when aerial LiDAR is added to RGB imagery, compared with RGB imagery alone, with implications for civilian applications such as autonomous navigation and rescue operations. Moreover, the prototype system can segment and search geographic areas as large as 1 km² in a matter of seconds on commodity hardware with high accuracy (_ 90%), suggesting the feasibility of real-time scene understanding on small aerial platforms.
Despite the wide availability of geospatial data, registration and exploitation of these datasets remains a persistent challenge in geoinformatics. Popular signal processing and machine learning algorithms, such as non-linear SVMs and neural networks, rely on well-formatted input models as well as reliable output labels, which are not always immediately available. In this paper we outline a pipeline for gathering, registering, and classifying initially unlabeled wide-area geospatial data. As an illustrative example, we demonstrate the training and testing of a convolutional neural network to recognize 3D models in the OGRIP 2007 LiDAR dataset using fuzzy labels derived from OpenStreetMap as well as other datasets available on OpenTopography.org. When auxiliary label information is required, various text and natural language processing filters are used to extract and cluster keywords useful for identifying potential target classes. A subset of these keywords is subsequently used to form multi-class labels, with no assumption of independence. Finally, we employ class-dependent geometry extraction routines to identify candidates from both training and testing datasets. Our regression networks are able to identify the presence of 6 structural classes, including roads, walls, and buildings, in volumes as large as 8000 m³ in as little as 1.2 seconds on a commodity 4-core Intel CPU. Because of the registration process, the presented framework is limited to neither a particular dataset nor a particular sensor modality, and is capable of multi-sensor data fusion.
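As an illustration of how extracted keywords can be turned into multi-class labels without assuming the classes are mutually exclusive, the following is a minimal sketch. The class list, the keyword table, and the function name are hypothetical and are not taken from the paper; only three of the six structural classes are shown.

```python
# Hypothetical class vocabulary and keyword table (illustrative only).
CLASSES = ["road", "wall", "building"]
KEYWORD_TO_CLASS = {
    "highway": "road",
    "street": "road",
    "fence": "wall",
    "wall": "wall",
    "house": "building",
    "garage": "building",
}


def multi_label_target(tags):
    """Map free-text tags to a multi-hot vector over CLASSES.

    Several entries may be 1.0 at once, so the target does not assume
    that classes are independent or mutually exclusive.
    """
    hits = set()
    for tag in tags:
        key = tag.lower()
        if key in KEYWORD_TO_CLASS:
            hits.add(KEYWORD_TO_CLASS[key])
    return [1.0 if name in hits else 0.0 for name in CLASSES]


target = multi_label_target(["Street", "house", "tree"])
```

A vector of this form can serve as the regression target for a network with one output per class; unknown tags such as "tree" are simply ignored in this sketch.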
We discuss an algorithmic approach for detecting spatially stationary, dim signals in cluttered optical data. In the problem considered here, cluttered scene backgrounds are substantially more intense than sensor noise and signal variations from scene anomalies of interest. As a result, clutter estimation and rejection algorithms are performed prior to implementing signal detection schemes. Even then, stationary residual clutter may be spatially similar to, and have intensities much greater than, those of the signals of interest. This poses an extreme challenge for the automated detection of low-contrast scene anomalies, and detectors based solely on spatial properties of the optical scene generally fail. In our newly developed signal detection algorithm, we exploit not only the structure of the dim signals of interest, but also the time-lapsed residual clutter. By examining the properties and statistics of both the signals of interest and the signals we wish to reject, Toyon has developed an algorithm for the automated detection of low-contrast signals in the presence of high-intensity clutter. We discuss here the developed signal detection algorithm and results for overcoming the challenges inherent to heavily cluttered optical data.
A method for generating and utilizing structure from motion (SfM) uncertainty estimates within image-based pose estimation is presented. The method is applied to a class of problems in which SfM algorithms are utilized to form a geo-registered reference model of a particular ground area using imagery gathered during flight by a small unmanned aircraft. The model is then used to form camera pose estimates in near real-time from imagery gathered later. The resulting pose estimates can be utilized by other onboard systems (e.g., as a replacement for GPS data) or by downstream exploitation systems (e.g., image-based object trackers). However, many of the consumers of pose estimates require an assessment of the pose accuracy. The method for generating the accuracy assessment is presented. First, the uncertainty in the reference model is estimated. Bundle Adjustment (BA) is utilized for model generation. While the high-level approach for generating a covariance matrix of the BA parameters is straightforward, typical computing hardware is not able to support the required operations due to the scale of the optimization problem within BA. Therefore, a series of sparse matrix operations is utilized to form an exact covariance matrix for only the parameters that are needed at a particular moment. Once the uncertainty in the model has been determined, it is used to augment Perspective-n-Point pose estimation algorithms to improve the pose accuracy and to estimate the resulting pose uncertainty. The implementation of the described method is presented along with results, including those gathered from flight test data.
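One generic way to obtain an exact covariance block for only a few parameters, without inverting the full information matrix, is to factor the matrix once and solve for the needed columns of the inverse. The sketch below assumes the covariance is the inverse of a symmetric positive-definite information matrix H (for example, a weighted normal matrix from BA). It uses a small dense Cholesky factorization for readability, whereas a real BA problem would use a sparse factorization. The function names are hypothetical, and this is not necessarily the sequence of operations used in the paper.

```python
import math


def cholesky(H):
    """Return lower-triangular L with H = L * L^T (H symmetric positive definite)."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_with_cholesky(L, b):
    """Solve (L * L^T) x = b by forward and then backward substitution."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def covariance_block(H, idx):
    """Exact covariance sub-block for parameters idx, without forming all of inv(H)."""
    L = cholesky(H)
    columns = []
    for j in idx:
        unit = [0.0] * len(H)
        unit[j] = 1.0
        columns.append(solve_with_cholesky(L, unit))  # column j of inv(H)
    return [[columns[c][r] for c in range(len(idx))] for r in idx]


# Toy 3-parameter information matrix; request the block for parameters 0 and 2.
H = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
block = covariance_block(H, [0, 2])
```

The factorization is computed once and reused for every requested column, so the cost of each additional parameter is one pair of triangular solves.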
In this paper, we discuss algorithmic approaches for exploiting wide-area persistent EO/IR motion imagery for multisensor
geo-registration and automated information extraction, including moving target detection. We first present
enabling capabilities, including sensor auto-calibration and automated high-resolution 3D reconstruction using passive
2D motion imagery. We then present algorithmic approaches for 3D-based geo-registration, and demonstrate and
quantify performance achieved using public release data from AFRL's Columbus Large Image Format (CLIF) 2006 data
collection and the Ohio Geographically Referenced Information Program (OGRIP). Finally, we discuss algorithmic
approaches for 3D-based moving target detection with near-optimal parallax mitigation, and demonstrate automated
detection of dismount and vehicle targets in coarse-resolution CLIF 2006 imagery.
We present a system for scale and affine invariant recognition of vehicular objects in video sequences. We use
local descriptors (SIFT keypoints) from image frames to model the object. These features are claimed in the
literature to be highly distinctive and invariant to rotation, scale, and affine transformations. However, since the
SIFT keypoints that are extracted from an object are instance-specific (variable), they form a dynamic feature
space. This presents certain challenges for classification techniques, which generally require use of the same set
of features for every instance of an object to be classified. To resolve this difficulty, we associate the extracted
keypoints with the components (representative keypoints) in a mixture model for each target class. While the
extracted keypoints are variable, the mixture components are fixed. The mixture models the keypoint features,
as well as the location and scale at which each keypoint was detected in the frame. Keypoint-to-component
association is achieved via a switching optimization procedure that locally maximizes the joint likelihood of
keypoints and their locations and scales, with the latter based on an affine transformation. To each mixture
component from a class, we link a (first layer) support vector machine (SVM) classifier which votes for or
against the hypothesis that the keypoint associated with the component belongs to the model's target class. A
second layer SVM pools the votes from the ensemble of SVM classifiers in the first layer and gives the final
class decision. We show promising results of experiments for video sequences from the VIVID database.
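The two-layer voting scheme can be sketched as follows, to show how a variable set of keypoints becomes a fixed-length input. This sketch makes several assumptions that are not from the paper: each first-layer classifier is replaced by a plain linear scorer rather than a trained SVM, a component with no associated keypoint abstains with a vote of 0, the second layer is a linear decision over the votes, and all names and numbers are hypothetical.

```python
def first_layer_votes(assignments, scorers):
    """Produce one vote per mixture component.

    assignments: dict mapping component index -> descriptor (list of floats)
                 for the keypoint associated with that component.
    scorers:     list of (weights, bias) pairs, one per component; a linear
                 stand-in for the per-component first-layer classifiers.
    Returns a fixed-length list with +1.0 (for), -1.0 (against), or
    0.0 (no keypoint associated; an assumed abstention rule).
    """
    votes = []
    for k, (weights, bias) in enumerate(scorers):
        if k not in assignments:
            votes.append(0.0)
            continue
        score = sum(w * x for w, x in zip(weights, assignments[k])) + bias
        votes.append(1.0 if score >= 0.0 else -1.0)
    return votes


def second_layer_decision(votes, weights, bias):
    """Pool the first-layer votes into a single class decision (linear stand-in)."""
    return sum(w * v for w, v in zip(weights, votes)) + bias >= 0.0


# Three components, two of which have an associated keypoint in this frame.
scorers = [([1.0, 0.0], 0.0), ([0.0, 1.0], -0.5), ([1.0, 1.0], 0.0)]
assignments = {0: [0.2, 0.1], 1: [0.9, 0.1]}
votes = first_layer_votes(assignments, scorers)
is_target = second_layer_decision(votes, [1.0, 0.5, 1.0], 0.0)
```

Because the vote vector always has one entry per mixture component, the second layer sees the same feature layout for every frame regardless of how many keypoints were extracted.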
Vast quantities of EO and IR data are collected on airborne platforms (manned and unmanned) and terrestrial platforms
(including fixed installations, e.g., at street intersections), and can be exploited to aid in the global war on terrorism.
However, intelligent preprocessing is required to enable operator efficiency and to provide commanders with actionable
target information. To this end, we have developed an image plane tracker which automatically detects and tracks
multiple targets in image sequences using both motion and feature information. The effects of platform and camera
motion are compensated via image registration, and a novel change detection algorithm is applied for accurate moving
target detection. The contiguous pixel blob on each moving target is segmented for use in target feature extraction and
model learning. Feature-based target location measurements are used for tracking through move-stop-move maneuvers,
close target spacing, and occlusion. Effective clutter suppression is achieved using joint probabilistic data association
(JPDA), and confirmed target tracks are indicated for further processing or operator review. In this paper we describe
the algorithms implemented in the image plane tracker and present performance results obtained with video clips from
the DARPA VIVID program data collection and from a miniature unmanned aerial vehicle (UAV) flight.
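The moving-target detection and blob segmentation steps can be illustrated with a deliberately simple stand-in. The sketch below uses plain thresholded frame differencing, not the novel change detection algorithm of the paper, and assumes the two frames have already been registered to compensate for platform and camera motion. The function name, the threshold, and the toy frames are hypothetical.

```python
from collections import deque


def moving_blobs(prev, curr, thresh):
    """Return contiguous blobs of changed pixels between two registered frames.

    prev, curr: equally sized 2D lists of intensities.
    A pixel is marked as changed if its absolute difference exceeds thresh.
    Blobs are 4-connected components, each a list of (row, col) tuples.
    """
    h, w = len(curr), len(curr[0])
    mask = [[abs(curr[r][c] - prev[r][c]) > thresh for c in range(w)]
            for r in range(h)]
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited changed pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            blob = []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs


prev = [[0] * 4 for _ in range(4)]
curr = [[50, 50, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 50]]
blobs = moving_blobs(prev, curr, 10)
```

Each returned blob gives the pixel support from which target features could be extracted; data association across frames (JPDA in the paper) is outside the scope of this sketch.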