In this paper, we benchmark five state-of-the-art trackers on aerial platform videos: the Multi-Domain Convolutional Neural Network (MDNET) tracker, which won the VOT2015 tracking challenge; the Fully Convolutional Neural Network Tracker (FCNT); the Spatially Regularized Discriminative Correlation Filter (SRDCF) tracker; the Continuous Convolution Operator Tracker (CCOT), which won the VOT2016 challenge; and the Tree-structured Convolutional Neural Network (TCNN) tracker. We assess performance in terms of both tracking accuracy and processing speed on two sets of videos: a subset of the OTB dataset in which the cameras are located at a high vantage point, and a new dataset of aerial videos captured from a moving platform. Our results indicate that these trackers performed as expected on the videos in the OTB subset. However, tracker performance degraded significantly on the aerial videos due to small target size, camera motion, and target occlusions. The CCOT tracker yielded the best overall accuracy, while the SRDCF tracker was the fastest.
As Unmanned Aerial Systems grow in number, pedestrian detection from aerial platforms is becoming a topic of increasing importance. By providing greater contextual information and a reduced potential for occlusion, the aerial vantage point provided by Unmanned Aerial Systems is highly advantageous for many surveillance applications, such as target detection, tracking, and action recognition. However, due to the greater distance between the camera and the scene, targets of interest in aerial imagery are generally smaller and have less detail. Deep Convolutional Neural Networks (CNNs) have demonstrated excellent object classification performance, and in this paper we adapt them to the problem of pedestrian detection from aerial platforms. We train a CNN with five layers: three convolution-pooling layers followed by two fully connected layers. We also address the computational inefficiency of the sliding window method for object detection. In the sliding window configuration, a very large number of candidate patches are generated from each frame, while only a small number of them contain pedestrians. We use the Edge Box object proposal generation method to screen candidate patches based on an "objectness" criterion, so that only regions that are likely to contain objects are processed. This screening significantly reduces the number of image patches processed by the neural network and makes our classification method very efficient. The resulting two-stage system is a good candidate for real-time implementation onboard modern aerial vehicles. Furthermore, testing on three datasets confirmed that our system offers high accuracy for detecting pedestrians on the ground in aerial imagery.
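To give a sense of the scale of the sliding window problem described above, the following minimal sketch counts the candidate patches a dense scan would produce for a single frame. The frame size, window size, stride, and scale set are illustrative assumptions, not values taken from the paper, and `count_sliding_windows` is a hypothetical helper name.

```python
def count_sliding_windows(frame_w, frame_h, win_w, win_h, stride, scales=(1.0,)):
    """Count the candidate patches a dense sliding-window scan would produce.

    All parameters are illustrative assumptions; the paper does not state them.
    """
    total = 0
    for s in scales:
        # Window size at this scale, rounded to whole pixels.
        w, h = int(round(win_w * s)), int(round(win_h * s))
        if w > frame_w or h > frame_h:
            continue  # window does not fit in the frame at this scale
        nx = (frame_w - w) // stride + 1  # horizontal window positions
        ny = (frame_h - h) // stride + 1  # vertical window positions
        total += nx * ny
    return total


if __name__ == "__main__":
    # Assumed 1920x1080 frame, 32x64 pedestrian window, 8-pixel stride.
    n = count_sliding_windows(1920, 1080, 32, 64, 8)
    print(n, "patches at a single scale")
```

Under these assumed settings a single scale already yields tens of thousands of patches per frame, which is why screening candidates with an object proposal stage before classification can pay off.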
Unmanned Aerial Vehicles are becoming an increasingly attractive platform for many applications as their cost decreases and their capabilities increase. Creating detailed maps from aerial data requires fast and accurate video mosaicking methods. Traditional mosaicking techniques rely on inter-frame homography estimates that are cascaded through the video sequence. Computationally expensive keypoint matching algorithms are often used to determine the correspondence of keypoints between frames. This paper presents a video mosaicking method that uses an object tracking approach to match keypoints between frames, improving both efficiency and robustness. The proposed tracking method matches local binary descriptors between frames and leverages the spatial locality of the keypoints to simplify the matching process. Our method avoids cascaded errors by determining the homography between each frame and the ground plane rather than between each frame and the prior frame. The frame-to-ground homography is calculated from the relationship between each point’s image coordinates and its estimated location on the ground plane. Robustness to moving objects is built into the homography estimation step by detecting anomalies in the motion of keypoints and eliminating the influence of outliers. The resulting mosaics are highly accurate and can be computed in real time.
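The cascaded error problem mentioned above can be illustrated with a small sketch. It chains inter-frame homographies that each carry a small, constant estimation error and measures how far the composed transform drifts from the true one. The pure-translation motion model, the step size, and the per-frame error are simplifying assumptions chosen for illustration; the helper names are hypothetical and this is not the paper's estimation procedure.

```python
def apply_homography(H, x, y):
    """Map point (x, y) through a 3x3 homography H given as nested lists."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w  # divide by the homogeneous coordinate


def compose(A, B):
    """Return the 3x3 matrix product A @ B (apply B first, then A)."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def translation(tx, ty):
    """Homography representing a pure translation (assumed motion model)."""
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]


def cascaded_drift(n_frames, true_step=10.0, step_error=0.5):
    """Horizontal drift in pixels after chaining n noisy inter-frame estimates."""
    H = translation(0.0, 0.0)  # start from the identity
    for _ in range(n_frames):
        # Each inter-frame estimate is off by step_error pixels (assumed).
        H = compose(H, translation(true_step + step_error, 0.0))
    x, _ = apply_homography(H, 0.0, 0.0)
    return x - n_frames * true_step


if __name__ == "__main__":
    print(cascaded_drift(100))  # drift grows linearly with the frame count
```

In this toy setting the error grows with every frame added to the chain, whereas an estimate made directly between a frame and a fixed reference such as the ground plane would carry only its own error.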
With the growing ubiquity of mobile devices, advanced applications are relying on computer vision techniques to provide novel experiences for users. Currently, few tracking approaches take into consideration the resource constraints of mobile devices. Designing efficient tracking algorithms and optimizing their performance for mobile devices can result in better and more efficient tracking for applications such as augmented reality. In this paper, we use binary descriptors, including Fast Retina Keypoint (FREAK), Oriented FAST and Rotated BRIEF (ORB), Binary Robust Independent Elementary Features (BRIEF), and Binary Robust Invariant Scalable Keypoints (BRISK), to obtain real-time tracking performance on mobile devices. We implement our tracking approach on both Google’s Android and Apple’s iOS operating systems. The Android implementation uses Android’s Native Development Kit (NDK), which gives the performance benefits of native code as well as access to legacy libraries. The iOS implementation was created using both the native Objective-C and the C++ programming languages. We also introduce simplified versions of the BRIEF and BRISK descriptors that improve processing speed without compromising tracking accuracy.
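Binary descriptors such as those listed above are typically compared with the Hamming distance, which reduces to an XOR followed by a bit count and is one reason they suit resource-constrained devices. The sketch below shows brute-force nearest-neighbor matching under that distance. Representing descriptors as Python integers, the `max_dist` threshold, and the function names are assumptions made for illustration; this is not the paper's implementation.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two descriptors stored as integers."""
    return bin(a ^ b).count("1")


def match(query, train, max_dist):
    """For each query descriptor, return the index of the nearest train
    descriptor, or None when no candidate is within max_dist bits."""
    matches = []
    for q in query:
        best_i, best_d = None, max_dist + 1
        for i, t in enumerate(train):
            d = hamming(q, t)
            if d < best_d:  # keep the closest candidate seen so far
                best_i, best_d = i, d
        matches.append(best_i)
    return matches


if __name__ == "__main__":
    # Toy 8-bit descriptors; real binary descriptors are far longer.
    train = [0b00001110, 0b11110001, 0b10101010]
    query = [0b11110000, 0b00001111, 0b01010101]
    print(match(query, train, max_dist=2))
```

A practical tracker would usually restrict the inner loop to keypoints near the previous location rather than scanning every candidate, but the distance computation itself stays the same.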