Visual object tracking, which is the process of estimating the motion parameters such as location, scale, and rotation of the object in an image sequence given the initial box in the first frame, is a popular problem in computer vision, with wide-ranging applications including visual navigation, military reconnaissance, and human–computer interaction.1,2 Although significant progress has been made in recent years, the problem is still difficult due to factors such as partial occlusion, deformation, scale variation, rotation, and background clutter.3 To solve these problems, numerous algorithms have been proposed.45.–6
The online learning algorithm is one of the useful algorithms that has been widely used to solve the problem of objects’ changing appearance. As some information of the objects to be tracked is known in advance in various scenarios, it is possible to employ prior knowledge to design the tracker. However, for other applications, as nothing about the objects of interest is known beforehand, no prior knowledge can be of use. Also, it is impossible to employ offline machine learning techniques to achieve efficient tracking because the appearance of an object is likely to vary due to its constant movements and also under different environmental conditions, such as varying level of brightness.7,8 Instead, online learning algorithms have been employed to adapt the object model to the abovementioned uncertainties. In practice, however, updating a model often introduces errors as it is difficult to explicitly assign hard class labels.
To efficiently track the constantly changing object and avoid the errors caused by an online learning algorithm, a model that precisely represents the object is needed. Various forms of representation of the object are used in practice, for example: points,9,10 contours,11,12 optical flow,13,14 or articulated models.15,16 Models that decompose the object into parts are more robust,17,18 as local changes only affect individual parts. Even when individual parts are lost or in an erroneous state, other object parts can still represent the object well. Keypoint, such as SIFT,19 SURF,20 ORB,21 AKAZE,22 and so on, is a representative kind of local feature that has been widely used in image fusion, object recognition, and other fields.
In this paper, a model-free tracking method based on fusing AKAZE and KLT features is proposed. The brief procedure is as follows: first, generate matching-keypoints by finding corresponding keypoints from the consecutive frames and the object template, then generate tracking-keypoints using the forward–backward flow tracking method, and at last, obtain credible keypoints by AKT fusion algorithm. To avoid the instability of a statistical method, the median method is adopted to compute object’s location, scale, and rotation in each frame.
AKAZE22 is regarded as the improved version of SIFT features and SURF. It is a more stable feature detection algorithm. Traditional SIFT and SURF feature detection algorithms build scale space by the linear Gaussian pyramid. However, this kind of linear decomposition can cause loss of accuracy, object’s edge blur, and loss of details. In order to solve these problems, AKAZE algorithm uses the method based on nonlinear scale space. The fast explicit diffusion (FED)23 is used to construct scale space. By using this method, any step length can be applied. Compared to SIFT and SURF, the computational complexity is greatly reduced and the robustness is improved. In the following subsections, the detailed procedures of constructing nonlinear scale space using FED scheme will be illustrated. The process of feature detection and the effects of feature description of AKAZE algorithm based on modified-local difference binary (M-LDB) will then be discussed.
Building Nonlinear Scale Space
Similar to SIFT in the construction of the nonlinear scale space, scale level increases logarithmically. The scale space constructed has octaves and each octave has layers. Different octaves and layers are marked with serial numbers o and s, respectively. The relationship between them and the scale parameter σ is shown in the equation below:
Feature detection of AKAZE is achieved by computing the Hessian local maxima after normalization of various scales for the filtered images in the nonlinear scale space. Calculation of a Hessian matrix is as follows:
The diagram in Fig. 1, as supplied by Ref. 22, demonstrates LDB24 and M-LDB tests between grid divisions around a keypoint. The intensity is expressed by colorful grids and the gradients in are expressed by the arrows. The feature description of AKAZE algorithm is based on M-LDB that exploits gradient and intensity information from the nonlinear scale space. And there are two main improvements of M-LDB compared with LDB: (1) rotation invariance is obtained by estimating the main orientation of the keypoint, as is done in KAZE,25 and rotating the grid of LDB accordingly. (2) A function of the scale σ is used as the subsample grids in steps instead of using the average of all pixels inside each subdivision of the grid. The scale-dependent sampling in turn makes the descriptor robust to changes in scale.
Fusing AKT Tracking
Forward–Backward Flow Tracking
Because of the environmental impact or object’s appearance change, the results of KLT often produce deviation, an evaluation method needs to be established to judge the accuracy of tracking results. Forward–backward error,26 which is based on the forward–backward continuity assumption, can effectively estimate the trajectory error of keypoints, i.e., if the object tracking is correct, then the tracking results are independent of time.
As shown in Fig. 2, for two adjacent frame and , is a random keypoint from object template in the frame , is the corresponding keypoint of in the frame using forward tracking, and is the corresponding keypoint of in the frame using backward tracking. Forward–backward error is defined as the Euclidean distance between two keypoints in frame , i.e., . If error is bigger than a threshold which we set, the keypoint will be tracked falsely.
We set the location of keypoint and status of forward–backward error as a pair pair (keypoint, status). If the status corresponding to keypoint is TRUE, which means the status of forward KLT and backward KLT themselves both must be TRUE, and error is smaller than the Euclidean distance threshold, then we call the keypoint with TRUE status tracking-keypoint. The rest are called failing tracking-keypoint.
Model of AKT
When calculating the homographic matrix between the initial keypoints and the current keypoint based on the traditional AKAZE algorithm, robust statistical methods, such as RANSAC and LMEDS, are usually adopted. However, when the number of outliers is too much, homographic matrix estimation will get poor results. So, in this paper, we put forward a tracking model called AKT, which can fundamentally eliminate the false matching-keypoints and reduce the proportion of outliers to effectively solve the problem of inaccurate parameter estimation.
The diagram in Fig. 3 demonstrates how the AKT algorithm fuses the matching-keypoints and tracking-keypoints by AKT algorithm. The collection of is composed of matching-keypoints in the ’th frame corresponding to the keypoints in object template obtained by AKAZE matching algorithm. And these matching-keypoints are represented by black circles in Fig. 3. The collection of is composed of tracking-keypoints in the ’th frame corresponding to the keypoints in object template obtained by KLT algorithm. And these tracking-keypoints are represented by gray circles in Fig. 3. There is a one-to-one correspondence between matching-keypoints and tracking-keypoints. Keypoints surrounded by the curve are credible keypoints in the ’th frame, which will make contributions to calculating an object’s location, scale, and rotation. The rest of the key points are outliers and thus, they are deleted. The credible keypoints are obtained by fusing matching-keypoints and tracking-keypoints. Its collection is .
Sort the Euclidean distance between the ’th pair of matching-keypoints and tracking-keypoints in the ’th frame in descending order, then the experiments show that the optimal value to be set as maximum allowable deviation threshold is in 0.26th of the distance sequence because enough credible keypoints are ensured, and the obvious false matching-keypoints can be removed. This means that the all but the bottom 0.74th pairs of points are valid matches. Set keypoint as the center, as the width and the height of the patch as . The degree of similarity between two patches is defined as
The traditional ways to calculate the homographic matrix are statistical methods, such as RANSAC and LMEDS. However, experiments show that the estimation of homography gives poor results for nonplanar objects, even though the keypoint association was performed correctly.27 So, in this paper, the median method is put forward to compute object’s location, scale, and rotation in each frame.
As shown in Fig. 4, and represent the center of the initial template and the object’s bounding box in the ’th frame, respectively. and represent credible keypoints of the initial template and that in the ’th frame. and represent the angle between the and keypoints of the initial template and that in the ’th frame. and , respectively, represent the Euclidean distance between the keypoints in the initial template and that in the ’th frame. With the following equations, the relative changing rate of position, scale and rotation angle can be calculated:
Given a sequence of images and an initializing region in , our aim in each frame of the sequence is to recover the box of the object of interest. Steps of the AKT Algorithm 1 are as follows:
Fusing AKAZE-KLT tracking.
We evaluated the proposed tracking algorithm based on fusing AKAZE and KLT (AKT) algorithm using sequences, as supplied by Ref. 28, with challenging factors including partial occlusion, drastic illumination changes, nonrigid deformation, background clutter, and motion blur. We compared the proposed AKT tracker with seven state-of-the-art methods: tracking-learning-detection (TLD),14 compressive tracker (CT),29 context tracker (CXT),30 color-based probabilistic tracking (CPF),31 structured output tracking with kernels (Struck),32 multiple instance learning tracker (MIL)33 and the circulant structure of tracking with kernels (CSK).34 All data in the experimental results and the quantitative evaluation are based on the unified dataset and the same initial state conditions. Since our algorithm focuses primarily on the challenges of partial occlusion, deformation, rotation, and scale variation, we only include eight of the videos that mainly contain these challenges and neglect the others in the following discussions. Additionally, the results of precision and success rate are based on 22 videos, in which the good ones are as shown in Fig. 5 and Table 1. Experimental environment: Visual Studio 2013 + OpenCV3.1.0. Equipment is configured to: 2.00 GHz, dual processor, a 64-bit operating system, the 32-Gb installed memory.
The CLE and average frame per second (pixel/FPS).
There are a range of measures available in previous research for assessing the performance of tracking algorithms quantitatively. Many authors employ the center-error measure that expresses the distance between the centroid of the algorithmic output and the centroid of the ground truth. This measure is only a rough assessment of the localization. Since it is not bounded, the comparison of results obtained on different sequences is difficult. So, we also employed the widely used overlap measure35
Since the rotation is not considered in the ground truth of the benchmarks, it is excluded in the overlap comparisons between our results and the benchmarks.
Accuracy Comparison of Methods for Tracking
The tracking performance of the AKT algorithm on different datasets28 is as shown in Fig. 5. Sequences (a) and (b) mainly contain the challenging aspect of partial occlusion. Sequences (c) and (d) mainly contain deformation. Sequences (e) and (f) mainly contain plane rotation and out-of-rotation. Sequences (g) and (h) mainly contain scale variation, and so on. The results show that facing different situations, the AKT algorithm can accurately track the object and has a very good robustness.
Although the AKT algorithm shows good tracking results in these videos, there are still some challenges that are hard to deal with. Since the AKT algorithm is based on keypoints, when the object’s appearance is smooth or the texture is not rich, it may struggle, as shown in Fig. 6(a). Also, when the object’s appearance is almost or totally changed, the tracking box may drift. For example, the initial object is the face, but when the person turns around, it is hard to track because of the changed appearance, as shown in Fig. 6(b).
Performance Comparison of Methods for Tracking
The center location error (CLE) and average frame per second (fps) of AKT algorithm and other seven kinds of tracking algorithms are shown in Table 1 (bold fonts indicate the best or second best performance), the results of the other seven kinds of tracking methods on different sequences in the table comes from Ref. 26. In Table 1, the results show that among the tracking on the eight datasets, the frame rate of AKT algorithm is 77.9 fps, showing a high real-time performance (the average fps comes in the top two 7 times), and achieving a high tracking accuracy with the average CLE of 11.9 pixels (the average CLE comes in the top two 5 times), the tracking performance is better than the other seven methods.
The CLE is defined as the average Euclidean distance between the center locations of the tracking boxes using our method and the manually labeled ground truths. Then the average CLE over all the frames of one sequence is used to summarize the overall performance for that sequence. Precision plot shows the percentage of frames whose estimated location is within the given threshold distance of the ground truth, as shown in Fig. 7(a). The results show that precision of AKT tracking is higher than the other algorithms and similar to Struck.
To measure the performance of success rate on a sequence of frames, we count the number of successful frames whose overlap is larger than the given threshold . The success plot shows the ratios of successful frames at the thresholds varied from 0 to 1, as shown in Fig. 7(b). The results show that AKT algorithm is superior to other algorithms.
Error Comparison of Methods for Homography Estimation
In order to evaluate the different methods for homography estimation, we developed our own dataset because the data supplied by Ref. 26 did not include rotation data. We gained a total of 200 frames randomly as original frames. Then we transformed these frames using the affine model, as shown in Eq. (13).
Then, under the condition that the keypoints of original frames and that of transformed frames are the same, we calculate the errors of displacement (pixel), scale (1) and rotation (deg) to get the error figures (method LMEDS in red, RANSAC in blue, MEDIAN in green), as shown in Fig. 8. The independent variable of error figures is the number of frames, whereas the dependent variable is the error.
The average error (AE) is used for comparison as the first evaluation criterion, as shown in Table 2. There will be noises causing by obvious variable estimation error, so to make better comparison of the methods for homography estimation, we set up average error without noise (AEN) as the second evaluation criterion. From the error figures, we set 100 pixels as location noise threshold, 10 as scale noise threshold, 150 deg as rotation noise threshold. The lower the AE and AEN, the better the performance of method for homography estimation. The smaller the difference between AE and AEN, the more stable the method for homography estimation. Therefore, the experimental results show not only that the median method is more stable, not having apparent noises, but also that its value of AE and AEN is less than that of the traditional statistical method.
The AE and AEN of center location, scale, and rotation (pixel/1/deg).
|Methods||x-coordinate (pixel)||y-coordinate (pixel)||Scale||Rotation (deg)|
|AE||AEN (<100)||AE||AEN (<100)||AE||AEN (<10)||AE||AEN (<150)|
Selection of Threshold for Tracking Results
The ratio of the number of inliers to the total number of matching-keypoints is called inlier ratio (IR). The larger the IR, the better the estimation of homographies. We impose that the error in location for two corresponding keypoints has to be less than 2.5 pixels, i.e., , where is the true homography between the frames, is the location of keypoint in original frame , and is the location of keypoint in transformed frame . The keypoint meeting above condition is called inlier. To find the threshold for better tracking, we still use the dataset put forward in Sec. 4.3 with the total number changed to 2000. We calculate the IR of these corresponding frames and the mean of IR is 0.74, as shown in Fig. 9. Therefore, we set optimal value for tracking as the mean of IR to avoid outliers.
In this paper, in an effort to reduce an excess of outliers when using traditional AKAZE match-tracking algorithm and solve the problems caused by poor homography estimates produced by statistical methods, AKT algorithm is put forward. The experimental results on different datasets show that the AKT algorithm can deal with challenges, such as partial occlusion, deformation, scale variation, rotation, and background clutter, showing high real-time performance and accuracy. However, since the tracking method used is based on keypoints, when the objects appearance is smooth, and texture is not rich, using the AKT algorithm may result in reduction of the effectiveness of tracking. Therefore, in future work, we will address the problems mentioned above.
This work was supported by the National Natural Science Foundation of China (Grant No. 61471194), Science and Technology on Electro-optic Control Laboratory and Aeronautical Science Foundation of China (Grant No. 20135152049), CASC (Grant No. China Aerospace Science and Technology Corporation) Aerospace Science and Technology Innovation Foundation Project; the Fundamental Research Funds for the Central Universities; Nanjing University of Aeronautics and Astronautics Graduate School Innovation Base (Laboratory) Open Foundation Program (Grant No. kfjj20151505).
K. Cannons, “A review of visual tracking,” Technical Report CSE-2008-07, Department of Computer Science Engineering, York University, Toronto, Canada (2008).Google Scholar
E. Maggio and A. Cavallaro, Video Tracking: Theory and Practice, Wiley Online Library, Hoboken, New Jersey (2011).Google Scholar
T. K. Lee et al., “Reliable tracking algorithm for multiple reference frame motion estimation,” J. Electron. Imaging 20(3), 033003 (2011).JEIME51017-9909http://dx.doi.org/10.1117/1.3605574Google Scholar
A. W. Smeulders et al., “Visual tracking: an experimental survey,” IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1442–1468 (2014).ITPIDJ0162-8828http://dx.doi.org/10.1109/TPAMI.2013.230Google Scholar
Y. Junhua et al., “Real-time tracking of targets with complex state based on ICT algorithm,” J. Huazhong Univ. Sci. Technol. (Natural Sci. Ed.) 43(3), 107–112 (2015).Google Scholar
B. Babenko, M. H. Yang and S. Belongie, “Robust object tracking with online multiple instance learning,” IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1619–1632 (2011).ITPIDJ0162-8828http://dx.doi.org/10.1109/TPAMI.2010.226Google Scholar
P. Sand and S. Teller, “Particle video: long-range motion estimation using point trajectories,” Int. J. Comput. Vision 80(1), 72–91 (2008).IJCVEQ0920-5691http://dx.doi.org/10.1007/s11263-008-0136-6Google Scholar
G. Nebehay and R. Pflugfelder, “Consensus-based matching and tracking of keypoints for object tracking,” in IEEE Winter Conf. on Applications of Computer Vision, pp. 862–869, IEEE (2014).http://dx.doi.org/10.1109/WACV.2014.6836013Google Scholar
C. Bibby and I. Reid, “Robust real-time visual tracking using pixel-wise posteriors,” in European Conf. on Computer Vision (2008).Google Scholar
C. Bibby and I. Reid, “Real-time tracking of multiple occluding objects using level sets,” in IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1307–1314 (2010).http://dx.doi.org/10.1109/CVPR.2010.5539818Google Scholar
T. Brox et al., “High accuracy optical flow estimation based on a theory for warping,” in European Conf. on Computer Vision, pp. 25–36 (2004).Google Scholar
D. Ramanan, D. A. Forsyth and A. Zisserman, “Tracking people by learning their appearance,” IEEE Trans. Pattern Anal. Mach. Intell. 29(1), 65–81 (2007).http://dx.doi.org/10.1109/TPAMI.2007.250600Google Scholar
P. Buehler et al., “Long term arm and hand tracking for continuous sign language TV broadcasts,” in British Machine Vision Conf. (2008).Google Scholar
A. Adam, E. Rivlin and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 798–805, IEEE (2006).http://dx.doi.org/10.1109/CVPR.2006.256Google Scholar
S. M. S. Nejhum, J. Ho and M. H. Yang, “Online visual tracking with histograms and articulating blocks,” Comput. Vision Image Understanding 114(8), 901–914 (2010).CVIUF41077-3142http://dx.doi.org/10.1016/j.cviu.2010.04.002Google Scholar
D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision 60(2), 91–110 (2004).IJCVEQ0920-5691http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94Google Scholar
H. Bay, T. Tuytelaars and L. Van Gool, “SURF: speeded up robust features,” in European Conf. on Computer Vision, pp. 404–417, Springer, Berlin Heidelberg (2006).Google Scholar
E. Rublee et al., “ORB: an efficient alternative to SIFT or SURF,” in Int. Conf. on Computer Vision, pp. 2564–2571 (2011).Google Scholar
P. Alcantarilla, J. Nuevo and A. Bartoli, “Fast explicit diffusion for accelerated features in nonlinear scale spaces,” In Proc. British Machine Vision Conf., 1–11 (2013).ITPIDJ0162-8828Google Scholar
S. Grewenig, J. Weickert and A. Bruhn, “From box filtering to fast explicit diffusion,” in Pattern Recognition, pp. 533–542, Springer, Berlin Heidelberg (2010).Google Scholar
X. Yang and K. T. Cheng, “LDB: an ultra-fast feature for scalable augmented reality on mobile devices,” in IEEE Int. Symp. on Mixed and Augmented Reality, pp. 49–57, IEEE (2012).http://dx.doi.org/10.1109/ISMAR.2012.6402537Google Scholar
P. F. Alcantarilla, A. Bartoli and A. J. Davison, “KAZE features,” in European Conf. on Computer Vision, pp. 214–227, Springer, Berlin Heidelberg (2012).Google Scholar
Z. Kalal, K. Mikolajczyk and J. Matas, “Forward-backward error: automatic detection of tracking failures,” in 20th Int. Conf. on Pattern Recognition, pp. 2756–2759, IEEE (2010).http://dx.doi.org/10.1109/ICPR.2010.675Google Scholar
G. Nebehay and R. Pflugfelder, “Consensus-based matching and tracking of keypoints for object tracking,” in IEEE Winter Conf. on Applications of Computer Vision, pp. 862–869 (2014).http://dx.doi.org/10.1109/WACV.2014.6836013Google Scholar
K. Zhang, L. Zhang and M. H. Yang, “Real-time compressive tracking,” in European Conf. on Computer Vision, pp. 864–877, Springer, Firenze, Italy (2012).Google Scholar
T. B. Dinh, N. Vo and G. Medioni, “Context tracker: exploring supporters and distracters in unconstrained environments,” in IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1177–1184 (2011).http://dx.doi.org/10.1109/CVPR.2011.5995733Google Scholar
P. Pérez et al., “Color-based probabilistic tracking,” in European Conf. on Computer Vision, pp. 661–675, Springer, Berlin Heidelberg (2002).Google Scholar
S. Hare, S. Amir and P. H. Torr, “Struck: Structured output tracking with kernels,” in Int. Conf. on Computer Vision, pp. 263–270 (2011).Google Scholar
B. Babenko, M. H. Yang and S. Belongie, “Visual tracking with online multiple instance learning,” in IEEE Conf. on Computer Vision and Pattern Recognition, pp. 983–990 (2009).http://dx.doi.org/10.1109/CVPR.2009.5206737Google Scholar
J. F. Henriques et al., “Exploiting the circulant structure of tracking-by-detection with kernels,” in European Conf. on Computer Vision, pp. 702–715, Springer, Berlin Heidelberg (2012).Google Scholar
Junhua Yan is an assistant professor at Nanjing University of Aeronautics and Astronautics, a visiting researcher in Science and Technology on Electro-Optic Control Laboratory. She received her BSc, MSc, and PhD degrees from Nanjing University of Aeronautics and Astronautics in 1993, 2001, and 2004, respectively. She is the author of more than 30 journal papers and has 5 patents. Her current research interests include multisource information fusion, and target detection, tracking, and recognition.
Zhigang Wang received his BSc degree from Nanjing University of Aeronautics and Astronautics in 2013. Now, he is a MSc degree candidate at Nanjing University of Aeronautics and Astronautics. His main research direction is object detection and tracking.