Finding, tracking and monitoring events and activities of interest on a continuous basis remains one of our highest
Intelligence Surveillance and Reconnaissance (ISR) requirements. Unmanned Aerial Systems (UAS) serve as one of the
warfighter's primary and most responsive means for surveillance and gathering intelligence information and are
becoming vital assets in military operations. This is demonstrated by their significant use in Afghanistan during
Operation Enduring Freedom and in Iraq as part of Operation Iraqi Freedom. Lessons learned from these operations
indicate that UAVs provide critical capabilities for enhancing situational awareness, intelligence gathering and force
protection for our military forces. Current UAS high resolution electro-optics offers a small high resolution field of
view (FOV). This narrow FOV is a limiting factor on the utility of the EO system. The UAS that are available offer
persistence; however, the effectiveness of the EO system is limited by the sensors and available processing.
DARPA is addressing this by developing the next generation of persistent, very wide area surveillance with the
Autonomous Real-time Ground Ubiquitous Surveillance - Imaging System (ARGUS-IS). The system will be capable of
imaging an area of greater than 40 square kilometers with a Ground Sample Distance (GSD) of 15 cm at video rates of
greater than 12 Hz. This paper will discuss the elements of the ARGUS-IS program.
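For a rough sense of scale, the figures quoted above can be turned into a per-frame pixel count and a pixel rate. The sketch below is back-of-the-envelope arithmetic only: it assumes square ground samples at exactly 15 cm, an area of exactly 40 square kilometers, and a frame rate of exactly 12 Hz, whereas the abstract states the area and rate as lower bounds. The function names are illustrative and do not come from the ARGUS-IS program.

```python
def pixels_per_frame(area_km2: float, gsd_m: float) -> float:
    """Ground samples needed to tile an area at a given ground sample distance."""
    area_m2 = area_km2 * 1_000_000.0  # 1 km^2 = 1e6 m^2
    return area_m2 / (gsd_m * gsd_m)  # each sample covers gsd_m x gsd_m


def pixel_rate(area_km2: float, gsd_m: float, frame_rate_hz: float) -> float:
    """Ground samples per second at the given frame rate."""
    return pixels_per_frame(area_km2, gsd_m) * frame_rate_hz


if __name__ == "__main__":
    frame = pixels_per_frame(40.0, 0.15)
    rate = pixel_rate(40.0, 0.15, 12.0)
    print(f"{frame / 1e9:.2f} gigapixels per frame")   # 1.78
    print(f"{rate / 1e9:.1f} gigapixels per second")   # 21.3
```

Under these assumptions a single frame is on the order of 1.8 gigapixels, which illustrates why the abstract points to sensors and available processing as the limiting factors.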
Automatic target detection (ATD) systems process imagery to detect and locate targets in imagery in support of a variety of military missions. Accurate prediction of ATD performance would assist in system design and trade studies, collection management, and mission planning. A need exists for ATD performance prediction based exclusively on information available from the imagery and its associated metadata. We present a predictor based on image measures quantifying the intrinsic ATD difficulty on an image. The modeling effort consists of two phases: a learning phase, where image measures are computed for a set of test images, the ATD performance is measured, and a prediction model is developed; and a second phase to test and validate performance prediction. The learning phase produces a mapping, valid across various ATR algorithms, which is even applicable when no image truth is available (e.g., when evaluating denied area imagery). The testbed has plug-in capability to allow rapid evaluation of new ATR algorithms. The image measures employed in the model include: statistics derived from a constant false alarm rate (CFAR) processor, the Power Spectrum Signature, and others. We present a performance predictor using a trained classifier ATD that was constructed using GENIE, a tool developed at Los Alamos National Laboratory. The paper concludes with a discussion of future research.
A major challenge for ATR evaluation is developing an accurate image truth that can be compared to an ATR algorithm's decisions to assess performance. We have developed a semi-automated video truthing application, called START, that greatly improves the productivity of an operator truthing video sequences. The user, after previewing the video, selects a set of salient frames (called "keyframes"), each corresponding to significant events in the video. These keyframes are then manually truthed. We provide a spectrum of truthing tools that generate truth for additional frames from the keyframes. These tools include fully automatic feature tracking, interpolation, and completely manual methods. The application uses a set of diagnostic measures to manage the user's attention, flagging portions in the video for which the computed truth needs review. This changes the role of the operator from raw data entry to that of an expert appraiser supervising the quality of the image truth. We have implemented a number of graphical displays summarizing the video truthing at various timescales. Additionally, we can view the track information, showing only the lifespan information of the entities involved. A combination of these displays allows the user to manage their resources more effectively. Two studies have been conducted that have shown the utility of START: one focusing on the accuracy of the automated truthing process, and the other focusing on usability issues of the application by a set of expert users.
The development and evaluation of precision strike weaponry requires high fidelity image simulation, as data collections involving moving platforms are difficult to schedule and costly to perform. Furthermore, live data collections where the weapon is being guided by an autonomous target acquisition (ATA) system cannot be performed in dense urban environments. The only solution is to develop high fidelity image and navigation simulations of realistic operating environments. We are currently developing a system that automatically generates a detailed urban scene requiring minimal user input. Given a set of parameters such as population, terrain, and city style, the system generates a two-dimensional city plan containing features such as road networks, buildings, vehicles, vegetation, and miscellaneous additional urban objects. The two-dimensional city representation is then processed by an interactive scene modeling and simulation environment that generates a textured, high-resolution, three-dimensional representation of the scene in a format compatible with well-known LADAR and IR sensor simulation suites such as IRMA. At each step in the process, the user has the ability to interact with the scene, whether to change specific scene parameters or to manually insert, remove, or modify targets and objects of interest.
Image exploitation algorithms for Intelligence, Surveillance and Reconnaissance (ISR) and weapon systems are extremely sensitive to differences between the operating conditions (OCs) under which they are trained and the extended operating conditions (EOCs) in which the fielded algorithms are tested. As an example, terrain type is an important OC for the problem of tracking hostile vehicles from an airborne camera. A system designed to track cars driving on highways and on major city streets would probably not do well in the EOC of parking lots because of the very different dynamics. In this paper, we present a system we call ALPS for Adaptive Learning in Particle Systems. ALPS takes as input a sequence of video images and produces labeled tracks. The system detects moving targets and tracks those targets across multiple frames using a multiple hypothesis tracker (MHT) tightly coupled with a particle filter. This tracker exploits the strengths of traditional MHT based tracking algorithms by directly incorporating tree-based hypothesis considerations into the particle filter update and resampling steps. We demonstrate results in a parking lot domain tracking objects through occlusions and object interactions.
Many fielded mobile robot systems have demonstrated the importance of directly estimating the 3D shape of objects in the robot's vicinity. The most mature solutions available today use active laser scanning or stereo camera pairs, but both approaches require specialized and expensive sensors. In prior publications, we have demonstrated the generation of stereo images from a single very low-cost camera using structure from motion (SFM) techniques. In this paper we demonstrate the practical usage of single-camera stereo in real-world mobile robot applications. Stereo imagery tends to produce incomplete 3D shape reconstructions of man-made objects because of smooth/glary regions that defeat stereo matching algorithms. We demonstrate robust object detection despite such incompleteness through matching of simple parameterized geometric models. Results are presented where parked cars are detected, and then recognized via license plate recognition, all in real time by a robot traveling through a parking lot.
Mobile robot designers frequently look to computer vision to solve navigation, obstacle avoidance, and object detection problems such as those encountered in parking lot surveillance. Stereo reconstruction is a useful technique in this domain and can be done in two ways. The first requires a fixed stereo camera rig to provide two side-by-side images; the second uses a single camera in motion to provide the images. While stereo rigs can be accurately calibrated in advance, they rely on a fixed baseline distance between the two cameras. The advantage of a single-camera method is the flexibility to change the baseline distance to best match each scenario. This directly increases the robustness of the stereo algorithm and increases the effective range of the system. The challenge comes from accurately rectifying the images into an ideal stereo pair. Structure from motion (SFM) can be used to compute the camera motion between the two images, but its accuracy is limited and small errors can cause rectified images to be misaligned. We present a single-camera stereo system that incorporates a Levenberg-Marquardt minimization of rectification parameters to bring the rectified images into alignment.
Most goal-oriented mobile robot tasks involve navigation to one or more known locations. This is generally done using GPS coordinates and landmarks outdoors, or wall-following and fiducial marks indoors. Such approaches ignore the rich source of navigation information that is already in place for human navigation in all man-made environments: signs. A mobile robot capable of detecting and reading arbitrary signs could be tasked using directions that are intuitive to humans, and it could report its location relative to intuitive landmarks (a street corner, a person's office, etc.). Such ability would not require active marking of the environment and would be functional in the absence of GPS. In this paper we present an updated version of a system we call Sign Understanding in Support of Autonomous Navigation (SUSAN). This system relies on cues common to most signs: the presence of text, vivid color, and compact shape. By not relying on templates, SUSAN can detect a wide variety of signs: traffic signs, street signs, store-name signs, building directories, room signs, etc. In this paper we focus on the text detection capability. We present results summarizing probability of detection and false alarm rate across many scenes containing signs of very different designs and in a variety of lighting conditions.
The interpretation of video imagery is the quintessential goal of computer vision. The ability to group moving pixels into regions and then associate those regions with semantic labels has long been studied by the vision community. In urban nighttime scenarios, the difficulty of this task is simultaneously alleviated and compounded. At night there is typically less movement in the scene, which makes the detection of relevant motion easier. However, the poor quality of the imagery makes it more difficult to interpret actions from these motions. In this paper, we present a system capable of detecting moving objects in outdoor nighttime video. We focus on visible-and-near-infrared (VNIR) cameras, since they offer low cost and very high resolution compared to alternatives such as thermal infrared. We present empirical results demonstrating system performance on a parking lot surveillance scenario. We also compare our results to a thermal infrared sensor viewing the same scene.
A major challenge for ATR evaluation is developing an accurate image truth that can be compared to an ATR algorithm's decisions to assess performance. While many standard truthing methods and scoring metrics exist for stationary targets in still imagery, techniques for dealing with motion imagery and moving targets are not as prevalent. This is partially due to the fact that the moving imagery / moving targets scenario introduces the data association problem of assigning targets to tracks. Video datasets typically contain far more imagery than static collections, increasing the size of the truthing task. Specifying the types and locations of the targets present for a large number of images is tedious, time consuming, and error prone. In this paper, we present an updated version of a complete truthing system we call the Scoring, Truthing, And Registration Toolkit (START). The application consists of two components: a truthing component that assists in the automated construction of image truth, and a scoring component that assesses the performance of a given algorithm relative to the specified truth. In motion imagery, both stationary and moving targets can be detected and tracked over portions of a motion imagery clip. We summarize the capabilities of START with emphasis on the target tracking and truthing diagnostics. The user manually truths certain key frames; truth for intermediate frames is then inferred, and sets of diagnostics verify the quality of the truth. If ambiguous situations are encountered in the intermediate frames, diagnostics flag the problem so that the user can intervene manually. This approach can dramatically reduce the effort required for truthing video data, while maintaining high fidelity in the truth data. We present the results of two user evaluations of START, one addressing the accuracy and the other focusing on the human factors aspects of the design.
Automatic Target Recognition (ATR) algorithms are extremely sensitive to differences between the operating conditions under which they are trained and the extended operating conditions in which the fielded algorithms operate. For ATR algorithms to robustly recognize targets while retaining low false alarm rates, they must be able to identify the conditions under which they are operating and tune their parameters on the fly. In this paper, we present a method for tuning the parameters of a model-based ATR algorithm using estimates of the current operating conditions. The problem has two components: 1) identifying the current operating conditions and 2) using that information to tune parameters to improve performance. In this paper, we explore the use of a reinforcement learning technique called tile coding for parameter adaptation. In tile coding, we first define a set of valid states describing the world (the operating conditions of interest, such as the level of obscuration). Next, actions (or parameter settings used by the ATR) are defined that are applied when in that state. Parameter settings for each operating condition are learned using an off-line reinforcement learning feedback loop. The result is a lookup table to select the optimal parameter settings for each operating condition. We present results on real LADAR imagery based on parameter tuning learned off-line using synthetic imagery.
Automatic Target Recognition (ATR) algorithms are extremely sensitive to differences between the operating conditions under which they are trained and the extended operating conditions (EOCs) in which the fielded algorithms are tested. These extended operating conditions can cause a target's signature to be drastically different from training exemplars/models. For example, a target's signature can be influenced by: the time of day, the time of year, the weather, atmospheric conditions, position of the sun or other illumination sources, the target surface and material properties, the target composition, the target geometry, sensor characteristics, sensor viewing angle and range, the target surroundings and environment, and the target and scene temperature. Recognition rates degrade if an ATR is not trained for a particular EOC. Most infrared target detection techniques are based on a very simple probabilistic theory. This theory states that a pixel should be assigned the label of "target" if a set of measurements (features) is more likely to have come from an assumed (or learned) distribution of target features than from the distribution of background features. However, most detection systems treat these learned distributions as static and they are not adapted to changing EOCs. In this paper, we present an algorithm for assigning a pixel the label of target or background based on a statistical comparison of the distributions of measurements surrounding that pixel in the image. This method provides a feature-level adaptation to changing EOCs. Results are demonstrated on infrared imagery containing several military vehicles.
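The "very simple probabilistic theory" described above amounts to comparing two likelihoods. The sketch below illustrates only that static baseline rule, not the adaptive algorithm the abstract presents. It assumes a single scalar feature and Gaussian target and background distributions; both assumptions, and all names and numbers, are hypothetical and chosen for illustration.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def label_pixel(feature: float, target_stats, background_stats) -> str:
    """Label a pixel by whichever distribution makes its feature more likely.

    target_stats and background_stats are (mean, std) pairs. They are fixed
    here, which is exactly the static behavior the abstract criticizes.
    """
    p_target = gaussian_pdf(feature, *target_stats)
    p_background = gaussian_pdf(feature, *background_stats)
    return "target" if p_target > p_background else "background"


if __name__ == "__main__":
    target = (10.0, 1.0)      # hypothetical learned target statistics
    background = (2.0, 1.0)   # hypothetical learned background statistics
    for value in (9.0, 3.0):
        print(value, label_pixel(value, target, background))
```

Because the two (mean, std) pairs never change, a shift in operating conditions that moves the true feature distributions would leave this rule miscalibrated, which is the motivation for adapting at the feature level.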
A major challenge for ATR evaluation is developing an accurate image truth that can be compared to an ATR algorithm's decisions to assess performance. While many standard truthing methods and scoring metrics exist for stationary targets in still imagery, techniques for dealing with motion imagery and moving targets are not as prevalent. This is partially because the moving imagery / moving targets scenario introduces the data association problem of assigning targets to tracks. This problem complicates the truthing and scoring task in two ways. First, video datasets typically contain far more imagery that must be truthed than static collections. Specifying the types and locations of the targets present for a large number of images is tedious, time consuming and error prone. Second, scoring ATR performance is ambiguous when assessing performance over a collection of video sequences. For example, if a target is tracked and successfully identified for 90% of a single video sequence, is the identification rate 90%, or is the single sequence evaluated in its entirety and the vehicle identification simply recorded as correct? In the former case, a bias will be introduced for easily identified targets that show up frequently in a sequence. In the latter case, the bias is avoided but system accuracy could be overstated.
In this paper, we present a complete truthing system we call the Scoring, Truthing, And Registration Toolkit (START). The first component is registration, which involves aligning the images of the same scene to a common reference frame. Once that reference frame has been determined, the second component, truthing, is used to specify target identity, position, orientation, and other scene characteristics. The final component, scoring, is used to assess the performance of a given algorithm as compared to the specified truth. In motion imagery, both stationary and moving targets can be detected and tracked over portions of a motion imagery clip. We present an approach to scoring performance in this context that provides a natural generalization of the standard methods for dealing with still imagery.
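The scoring ambiguity raised above (per-frame versus per-sequence identification rate) can be made concrete with a small numeric sketch. It is not START's scoring method. Each sequence is represented as a list of per-frame booleans, and a sequence counts as correct when more than half of its frames are correct; that majority threshold is an assumption introduced here for illustration.

```python
def per_frame_rate(sequences):
    """Identification rate with every frame weighted equally."""
    correct = sum(sum(seq) for seq in sequences)
    total = sum(len(seq) for seq in sequences)
    return correct / total


def per_sequence_rate(sequences, threshold=0.5):
    """Identification rate with every sequence weighted equally.

    A sequence is scored as correct when its fraction of correctly
    identified frames exceeds the (assumed) threshold.
    """
    hits = sum(1 for seq in sequences if sum(seq) / len(seq) > threshold)
    return hits / len(sequences)


if __name__ == "__main__":
    easy = [True] * 90 + [False] * 10   # long sequence, 90% of frames correct
    hard = [True] * 2 + [False] * 8     # short sequence, 20% of frames correct
    data = [easy, hard]
    print(f"per-frame:    {per_frame_rate(data):.3f}")     # 0.836
    print(f"per-sequence: {per_sequence_rate(data):.3f}")  # 0.500
```

The per-frame figure is dominated by the long, easy sequence, while the per-sequence figure records the easy sequence as fully correct even though 10% of its frames were misidentified. These are the two biases the abstract describes.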
Model-based Automatic Target Recognition (ATR) algorithms are adept at recognizing targets in high fidelity 3D LADAR imagery. Most current approaches involve a matching component where a hypothesized target and target pose are iteratively aligned to pre-segmented range data. Once the model-to-data alignment has converged, a match score is generated indicating the quality of match. This score is then used to rank one model hypothesis over another. The main drawback of this approach is twofold. First, to ensure the correct target is recognized, a large number of model hypotheses must be considered. Even with a highly accurate indexing algorithm, the number of target types and variants that need to be explored is prohibitive for real-time operation. Second, the iterative matching step must consider a variety of target poses to ensure that the correct alignment is recovered. Inaccurate alignments produce erroneous match scores and thus errors when ranking one target hypothesis over another. To compensate for such drawbacks, we explore the use of situational awareness information already available to an image analyst. Examples of such information include knowledge of the surrounding terrain (to assess potential occlusion levels) and targets of interest (to account for target variants).
Mobile robot designers frequently look to computer vision to solve navigation, obstacle avoidance, and object detection problems. Potential solutions using low-cost video cameras are particularly alluring. Recent results in 3D scene reconstruction from a single moving camera seem particularly relevant, but robot designers who attempt to use such 3D techniques have uncovered a variety of practical concerns. We present lessons learned from developing a single-camera 3D scene reconstruction system that provides both a real-time camera motion estimate and a rough model of major 3D structures in the robot’s vicinity. Our objective is to use the motion estimate to supplement GPS (indoors in particular) and to use the model to provide guidance for further vision processing (look for signs on walls, obstacles on the ground, etc.). The computational geometry involved is closely related to traditional two-camera stereo; however, a number of degenerate cases exist. We also demonstrate how SFM can be used to improve the performance of two specific robot navigation tasks.
The success of any potential application for mobile robots depends largely on the specific environment where the application takes place. Practical applications are rarely found in highly structured environments, but unstructured environments (such as natural terrain) pose major challenges to any mobile robot. We believe that semi-structured environments, such as parking lots, provide a good opportunity for successful mobile robot applications. Parking lots tend to be flat and smooth, and cars can be uniquely identified by their license plates. Our scenario is a parking lot where only known vehicles are supposed to park. The robot looks for vehicles that do not belong in the parking lot. It checks both license plates and vehicle types, in case the plate is stolen from an approved vehicle. It operates autonomously, but reports back to a guard who verifies its performance. Our interest is in developing the robot's vision system, which we call Scene Estimation & Situational Awareness Mapping Engine (SESAME). In this paper, we present initial results from the development of two SESAME subsystems, the ego-location and license plate detection systems. While their ultimate goals are obviously quite different, our design demonstrates that by sharing intermediate results, both tasks can be significantly simplified. The inspiration for this design approach comes from the basic tenets of Situational Awareness (SA), where the benefits of holistic perception are clearly demonstrated over the more typical designs that attempt to solve each sensing/perception problem in isolation.
Mobile robots currently cannot detect and read arbitrary signs. This is a major hindrance to mobile robot usability, since they cannot be tasked using directions that are intuitive to humans. It also limits their ability to report their position relative to intuitive landmarks. Other researchers have demonstrated some success on traffic sign recognition, but using template-based methods limits the set of recognizable signs. There is a clear need for a sign detection and recognition system that can process a much wider variety of signs: traffic signs, street signs, store-name signs, building directories, room signs, etc. We are developing a system for Sign Understanding in Support of Autonomous Navigation (SUSAN) that detects signs from various cues common to most signs: vivid colors, compact shape, and text. We have demonstrated the feasibility of our approach on a variety of signs in both indoor and outdoor locations.
A common approach to detecting targets in laser radar (LADAR) 3-dimensional x, y and z imagery is to first estimate the ground plane. Once the ground plane is identified, the regions of interest (ROI) are segmented based on height above that plane. The ROIs can then be classified based on their shape statistics (length, width, height, moments, etc.). In this paper, we present an empirical comparison of three different ground plane estimators. The first estimates the ground plane based on global constraints (a least median squares fit to the entire image). The other two are based on progressively more local constraints: a least median squares fit to each row and column of the image, and a local histogram analysis of the re-projected range data. These algorithms are embedded in a larger system that first computes the target height above the ground plane and then recognizes the targets based on properties within the target region. The evaluation is performed using 98 LADAR images containing eight different targets and structured clutter (trees). Performance is measured in terms of percentage of correct detection and false alarm.
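The global estimator described above can be illustrated with a generic least median of squares plane fit followed by a height-above-plane computation. This is a sketch of the general technique, not the estimator evaluated in the paper. It approximates the fit by random sampling of three-point planes, assumes the ground is expressible as z = a*x + b*y + c, and uses a trial count and seed chosen arbitrarily. All function names are hypothetical.

```python
import random
import statistics


def plane_from_points(p, q, r):
    """Plane z = a*x + b*y + c through three points, or None if degenerate."""
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = p, q, r
    det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)
    if abs(det) < 1e-12:  # points are collinear or coincide in (x, y)
        return None
    a = ((z2 - z1) * (y3 - y1) - (z3 - z1) * (y2 - y1)) / det
    b = ((x2 - x1) * (z3 - z1) - (x3 - x1) * (z2 - z1)) / det
    c = z1 - a * x1 - b * y1
    return a, b, c


def lmeds_ground_plane(points, trials=200, seed=0):
    """Keep the sampled plane whose median squared residual is smallest."""
    rng = random.Random(seed)
    best_plane, best_score = None, float("inf")
    for _ in range(trials):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c = plane
        score = statistics.median(
            (z - (a * x + b * y + c)) ** 2 for x, y, z in points
        )
        if score < best_score:
            best_plane, best_score = plane, score
    return best_plane


def height_above_plane(point, plane):
    """Signed vertical distance of a point from the plane."""
    x, y, z = point
    a, b, c = plane
    return z - (a * x + b * y + c)
```

Because the score is a median rather than a sum, the fit tolerates up to roughly half of the points lying off the ground, so targets and clutter do not drag the plane upward the way they would in an ordinary least squares fit. Regions of interest could then be formed by thresholding `height_above_plane`.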
Every year, large volumes of imagery are collected for the sole purpose of evaluating Automatic Target Recognition (ATR) algorithms. However, this data cannot be used without adequate truthing information for each image. Truthing information typically consists of the types and locations of the targets present in the imagery. Specifying this information for a large number of images is tedious, time consuming, and error prone. In this paper, we present a complete truthing system we call the Scoring, Truthing, And Registration Toolkit (START). The first component is registration, which involves aligning heterogeneous and homogeneous sensor images of the same scene to a common reference frame. Once that reference frame has been determined, the second component, truthing, is used to specify target identity, position, orientation, and other scene characteristics. The final component, scoring, is used to assess the performance of a given algorithm as compared to the specified truth. The scoring module allows statistical comparisons to assess algorithm sensitivity to specific operating conditions (e.g., sensitivity to object occlusion).
Laser vibrometry sensors measure minute surface motion colinear with the sensor's line-of-sight. If the vibrometry sensor has a high enough sampling rate, an accurate estimate of the surface vibration is measured. For vehicles with running engines, an automatic target recognition algorithm can use these measurements to produce identification estimates. The level of identification possible is a function of the distinctness of the vibration signature. This signature is dependent upon many factors, such as engine type and vehicle weight. In this paper, we present results of using data mining techniques to assess the identification potential of vibrometry data. Our technique starts with unlabeled vibrometry measurements taken from a variety of vehicles. Then an unsupervised clustering algorithm is run on features extracted from this data. The final step is to analyze the resulting clusters and determine if physical vehicle characteristics can be mapped onto the clusters.
Fusing information from sensors with very different phenomenology is an attractive and challenging option for autonomous target acquisition (ATA) systems because correct target detections should correlate between sensors while false alarms might not. In this paper, we present a series of algorithms for detecting and segmenting targets from their background in passive millimeter wave (PMMW) and laser radar (LADAR) data. PMMW sensors provide a consistent signature for metallic targets. They can also operate effectively under adverse weather conditions; however, they exhibit poor angular resolution. LADAR sensors produce high-resolution range and reflectance images, but are sensitive to adverse weather conditions. Sensor fusion techniques are applied with the goal of maintaining high probability of detection while decreasing the false alarm rate.
Numerous feature detectors have been defined for detecting military vehicles in natural scenes. These features can be computed for a given image chip containing a known target and used to train a classifier. This classifier can then be used to assign a label to an un-labeled image chip. The performance of the classifier is dependent on the quality of the set of features used. In this paper, we first describe a set of features commonly used by the Automatic Target Recognition (ATR) community. We then analyze feature performance on a vehicle identification task in laser radar (LADAR) imagery. Our features are computed over both the range and reflectance channels. In addition, we perform feature subset selection using two different methods and compare the results. The goal of this analysis is to determine which subset of features to choose in order to optimize performance in LADAR Autonomous Target Acquisition (ATA).
Automatic Target Recognition (ATR) algorithm performance is sensitive to variability in the observed target signature. Algorithms are developed and tested under a specific set of operating conditions and then are often required to perform well under very different conditions (referred to as Extended Operating Conditions, or EOCs). The stability of the target signature as the operating conditions change dictates the success or failure of the recognition algorithm. Laser vibrometry is a promising sensor modality for vehicle identification because target signatures tend to remain stable under a variety of EOCs. A micro-doppler vibrometry sensor measures surface deflection at a very high frequency, thus enabling the surface vibrations of a vehicle to be sensed from afar. Vehicle identification is possible since most vehicles with running engines have a unique vibration signature defined by the engine type. In this paper, we present an ATR algorithm that operates over data collected from a set of accelerometers. These contact accelerometers were placed at a variety of locations on three target vehicles to emulate an ideal laser vibrometer. We discuss a set of features that are useful for discrimination of the three different target categories. We also present classification results based on these features.
Situational Awareness (SA) is a critical component of effective autonomous vehicles, reducing operator workload and allowing an operator to command multiple vehicles or simultaneously perform other tasks. Our Scene Estimation & Situational Awareness Mapping Engine (SESAME) provides SA for mobile robots in semi-structured scenes, such as parking lots and city streets. SESAME autonomously builds volumetric models for scene analysis. For example, a SESAME-equipped robot can build a low-resolution 3-D model of a row of cars, then approach a specific car and build a high-resolution model from a few stereo snapshots. The model can be used onboard to determine the type of car and locate its license plate, or the model can be segmented out and sent back to an operator who can view it from different viewpoints. As new views of the scene are obtained, the model is updated and changes are tracked (such as cars arriving or departing). Since the robot's position must be accurately known, SESAME also has automated techniques for determining the position and orientation of the camera (and hence, robot) with respect to existing maps. This paper presents an overview of the SESAME architecture and algorithms, including our model generation algorithm.
Air-to-ground missiles with day/night, adverse-weather, pinpoint-accuracy Autonomous Target Acquisition (ATA) seekers are essential for today's modern warfare scenarios. Passive millimeter wave (PMMW) sensors have the ability to see through clouds; in fact, they tend to show metallic objects in high contrast regardless of weather conditions. However, their resolution is very low when compared with other ATA sensors such as laser radar (LADAR). We present an ATA algorithm suite that combines the superior target detection potential of PMMW with the high-quality segmentation and recognition abilities of LADAR. Preliminary detection and segmentation results are presented for a set of image-pairs of military vehicles that were collected for this project using an 89 GHz, 18 inch aperture PMMW sensor from TRW and a 1.06 µm high-resolution LADAR.