Reliable vehicle detection and tracking in wide area motion imagery (WAMI), a novel class of imagery captured by airborne sensor arrays and characterized by large ground coverage and low frame rate, are the basis for higher-level image analysis tasks in wide area aerial surveillance. Possible applications include real-time traffic monitoring, driver behavior analysis, and anomaly detection. Most frameworks for detection and tracking in WAMI data rely on motion-based input detections generated by frame differencing or background subtraction. The tracking approaches applied to these detections aim to recover missing motion detections and thereby enable persistent tracking, i.e., tracking that continues even when vehicles become stationary. Recently, a moving object detection method based on convolutional neural networks (CNNs) showed promising results on WAMI data. In this work, we therefore analyze how CNN-based detection methods can improve persistent WAMI tracking compared to detection methods based on difference images. To find detections, we employ a network that takes consecutive frames as input and computes detection heatmaps as output. The high quality of the output heatmaps allows detections to be localized by non-maximum suppression without further post-processing. For quantitative evaluation, we use several regions of interest defined on the publicly available, annotated WPAFB 2009 dataset. We employ the common metrics precision, recall, and f-score to evaluate detection performance, and additionally consider track identity switches and multiple object tracking accuracy to assess tracking performance. We first evaluate the moving object detection performance of our deep network in comparison to a previous analysis of difference-image-based detection methods. Subsequently, we apply a persistent multiple hypothesis tracker with WAMI-specific adaptations to the CNN-based motion detections and evaluate the tracking results with respect to a persistent tracking ground truth.
We achieve significant improvements in both the motion-based input detections and the output tracking quality, demonstrating the potential of CNNs in the context of persistent WAMI tracking.
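The localization step mentioned above, in which detections are read off a heatmap by non-maximum suppression, might look roughly like the following minimal sketch. The function name `heatmap_peaks`, the score threshold, and the suppression radius are illustrative assumptions; they are not taken from this work, and the network that produces the heatmap is not modeled here.

```python
import numpy as np


def heatmap_peaks(heatmap, threshold=0.5, radius=2):
    """Return (row, col, score) for each local maximum of a 2D heatmap.

    A pixel is kept if its score is at least `threshold` and no pixel in
    the surrounding (2 * radius + 1) x (2 * radius + 1) window scores higher.
    Both parameters are hypothetical defaults chosen for illustration.
    Pixels on a flat plateau are all reported, since none exceeds the others.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    height, width = heatmap.shape

    # Pad with -inf so that border pixels are compared only with real pixels.
    padded = np.pad(heatmap, radius, mode="constant", constant_values=-np.inf)

    # Sliding-window maximum, built from shifted views of the padded map.
    local_max = np.full((height, width), -np.inf)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            window = padded[dy:dy + height, dx:dx + width]
            local_max = np.maximum(local_max, window)

    is_peak = (heatmap >= local_max) & (heatmap >= threshold)
    rows, cols = np.nonzero(is_peak)
    return [(int(r), int(c), float(heatmap[r, c])) for r, c in zip(rows, cols)]


if __name__ == "__main__":
    demo = np.zeros((7, 7))
    demo[1, 1] = 0.9   # strong response
    demo[1, 2] = 0.6   # weaker neighbor, suppressed by the pixel at (1, 1)
    demo[5, 5] = 0.8   # second, well separated response
    demo[3, 3] = 0.3   # below the threshold
    for row, col, score in heatmap_peaks(demo, threshold=0.5, radius=1):
        print(row, col, score)
```

In practice the radius would be chosen relative to the expected vehicle size in pixels, so that two responses on the same vehicle collapse into one detection while neighboring vehicles remain separate.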
Wide area motion imagery (WAMI) acquired by an airborne multi-camera sensor enables continuous monitoring of large urban areas. Each image can cover regions of several square kilometers and contain thousands of vehicles. Reliable vehicle tracking in this imagery is an important prerequisite for surveillance tasks, but remains challenging due to low frame rate and small object size. Most WAMI tracking approaches rely on moving object detections generated by frame differencing or background subtraction. These detection methods fail when objects slow down or stop. Recent approaches for persistent tracking compensate for missing motion detections by combining a detection-based tracker with a second tracker based on appearance or local context. In order to avoid the additional complexity introduced by combining two trackers, we employ an alternative single-tracker framework that is based on multiple hypothesis tracking and recovers missing motion detections with a classifier-based detector. We integrate an appearance-based similarity measure, merge handling, vehicle-collision tests, and clutter handling to adapt the approach to the specific context of WAMI tracking. We apply the tracking framework to a region of interest of the publicly available WPAFB 2009 dataset for quantitative evaluation; a comparison to other persistent WAMI trackers demonstrates state-of-the-art performance of the proposed approach. Furthermore, we analyze in detail the impact of different object detection methods and detector settings on the quality of the output tracking results. For this purpose, we choose four different motion-based detection methods that vary in detection performance and computation time to generate the input detections.
As detector parameters can be adjusted to achieve different precision and recall performance, we combine each detection method with different detector settings that yield (1) high precision and low recall, (2) high recall and low precision, and (3) best f-score. Comparing the tracking performance achieved with all generated sets of input detections allows us to quantify the sensitivity of the tracker to different types of detector errors and to derive recommendations for detector and parameter choice.
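The three detector settings described above can be illustrated with a minimal sketch that sweeps a score threshold over a set of detections and picks one operating point of each type. Everything here is a hypothetical illustration: the function names, the targets `min_precision` and `min_recall`, and the assumption that each detection carries a distinct confidence score and a true-positive flag are not taken from this work.

```python
import numpy as np


def precision_recall_f(scores, is_true_positive, num_ground_truth):
    """Precision, recall and f-score for every possible score threshold.

    Entry k describes the detector setting that accepts all detections
    with a score of at least thresholds[k] (scores assumed distinct,
    num_ground_truth assumed positive).
    """
    scores = np.asarray(scores, dtype=float)
    flags = np.asarray(is_true_positive, dtype=bool)
    order = np.argsort(-scores, kind="stable")  # highest score first
    thresholds, flags = scores[order], flags[order]

    tp = np.cumsum(flags)    # true positives accepted so far
    fp = np.cumsum(~flags)   # false positives accepted so far
    precision = tp / (tp + fp)
    recall = tp / num_ground_truth
    denom = np.maximum(precision + recall, 1e-12)  # avoid division by zero
    f_score = 2.0 * precision * recall / denom
    return thresholds, precision, recall, f_score


def select_operating_points(scores, is_true_positive, num_ground_truth,
                            min_precision=0.9, min_recall=0.9):
    """Pick (threshold, precision, recall, f) for three illustrative settings."""
    thr, p, r, f = precision_recall_f(scores, is_true_positive, num_ground_truth)

    def point(k):
        return float(thr[k]), float(p[k]), float(r[k]), float(f[k])

    points = {"best_f": point(int(np.argmax(f)))}

    # High precision: best recall among settings that reach min_precision.
    ok = p >= min_precision
    points["high_precision"] = (
        point(int(np.argmax(np.where(ok, r, -np.inf)))) if ok.any() else None
    )

    # High recall: best precision among settings that reach min_recall.
    ok = r >= min_recall
    points["high_recall"] = (
        point(int(np.argmax(np.where(ok, p, -np.inf)))) if ok.any() else None
    )
    return points


if __name__ == "__main__":
    demo = select_operating_points(
        scores=[0.9, 0.8, 0.7, 0.6, 0.5],
        is_true_positive=[True, True, False, True, False],
        num_ground_truth=4,
        min_precision=0.9,
        min_recall=0.7,
    )
    for name, values in demo.items():
        print(name, values)
```

Feeding the tracker with the detections that survive each of the three thresholds then yields one set of input detections per setting, which is the kind of comparison the analysis above relies on.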