Real-time tracking of deformable objects based on combined matching-and-tracking

Abstract. Visual tracking is very challenging due to the existence of several sources of variations, such as partial occlusion, deformation, scale variation, rotation, and background clutter. A model-free tracking method based on fusing accelerated features using fast explicit diffusion in nonlinear scale spaces (AKAZE) and KLT features is presented. First, matching-keypoints are generated by finding corresponding keypoints from the consecutive frames and the object template, then tracking-keypoints are generated using the forward–backward flow tracking method, and at last, credible keypoints are obtained by AKAZE-KLT tracking (AKT) algorithm. To avoid the instability of a statistical method, the median method is adopted to compute the object’s location, scale, and rotation in each frame. The experimental results show that the AKT algorithm has strong robustness and can achieve accurate tracking especially under conditions of partial occlusion, scale variation, rotation, and deformation. The tracking performance shows higher robustness and accuracy in a variety of datasets and the average frame rate reaches 78 fps, showing good performance in real time.


Introduction
Visual object tracking, which is the process of estimating the motion parameters such as location, scale, and rotation of the object in an image sequence given the initial box in the first frame, is a popular problem in computer vision, with wide-ranging applications including visual navigation, military reconnaissance, and human-computer interaction. 1,2lthough significant progress has been made in recent years, the problem is still difficult due to factors such as partial occlusion, deformation, scale variation, rotation, and background clutter. 3][6] The online learning algorithm is one of the useful algorithms that has been widely used to solve the problem of objects' changing appearance.As some information of the objects to be tracked is known in advance in various scenarios, it is possible to employ prior knowledge to design the tracker.However, for other applications, as nothing about the objects of interest is known beforehand, no prior knowledge can be of use.Also, it is impossible to employ offline machine learning techniques to achieve efficient tracking because the appearance of an object is likely to vary due to its constant movements and also under different environmental conditions, such as varying level of brightness. 7,8nstead, online learning algorithms have been employed to adapt the object model to the abovementioned uncertainties.In practice, however, updating a model often introduces errors as it is difficult to explicitly assign hard class labels.
To efficiently track the constantly changing object and avoid the errors caused by an online learning algorithm, a model that precisely represents the object is needed.Various forms of representation of the object are used in practice, for example: points, 9,10 contours, 11,12 optical flow, 13,14 or articulated models. 15,16Models that decompose the object into parts are more robust, 17,18 as local changes only affect individual parts.Even when individual parts are lost or in an erroneous state, other object parts can still represent the object well.Keypoint, such as SIFT, 19 SURF, 20 ORB, 21 AKAZE, 22 and so on, is a representative kind of local feature that has been widely used in image fusion, object recognition, and other fields.
In this paper, a model-free tracking method based on fusing AKAZE and KLT features is proposed.The brief procedure is as follows: first, generate matching-keypoints by finding corresponding keypoints from the consecutive frames and the object template, then generate tracking-keypoints using the forward-backward flow tracking method, and at last, obtain credible keypoints by AKT fusion algorithm.To avoid the instability of a statistical method, the median method is adopted to compute object's location, scale, and rotation in each frame.
2 Background Work AKAZE 22 is regarded as the improved version of SIFT features and SURF.It is a more stable feature detection algorithm.Traditional SIFT and SURF feature detection algorithms build scale space by the linear Gaussian pyramid.However, this kind of linear decomposition can cause loss of accuracy, object's edge blur, and loss of details.In order to solve these problems, AKAZE algorithm uses the method based on nonlinear scale space.The fast explicit diffusion (FED) 23 is used to construct scale space.By using this method, any step length can be applied.Compared to SIFT and SURF, the computational complexity is greatly reduced and the robustness is improved.In the following subsections, the detailed procedures of constructing nonlinear scale space using FED scheme will be illustrated.The process of feature detection and the effects of feature description of AKAZE algorithm based on modified-local difference binary (M-LDB) will then be discussed.

Building Nonlinear Scale Space
Similar to SIFT in the construction of the nonlinear scale space, scale level increases logarithmically.The scale space constructed has P t ai ðx; yÞ octaves and each octave has V a layers.Different octaves and layers are marked with serial numbers o and s, respectively.The relationship between them and the scale parameter σ is shown in the equation below: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 1 ; 6 3 ; 5 7 2 σ i ðo; sÞ ¼ 2 oþs∕S ; (1) where o ∈ ½0: M is the total number of images that remain after filtration by the filter.Since the nonlinear diffusion filter is based on the scale of time, scale parameters σ i with the unit of pixel is transformed to the unit of time, as shown below: ; t e m p : i n t r a l i n k -; e 0 0 2 ; 6 3 ; 4 8 6 t i ¼ where t i is o ∈ ½0: : : O − 1, s ∈ ½0: : : S − 1, i ∈ ½0: : : M called evolutionary time.For each input image, a Gaussian filter is first applied, then the gradient histogram of the image is calculated.The contrast factor P t i ðx; yÞ is set as 70% of the gradient histogram.In the case of twodimensional (2-D) images, since the image derivative is one pixel grid size, the maximal step size t max is 0.25 without violating stable conditions.Then by using a set of evolutionary time t i , all the images of scale space can be obtained using FED scheme.

Feature Detection
Feature detection of AKAZE is achieved by computing the Hessian local maxima after normalization of various scales for the filtered images in the nonlinear scale space.Calculation of a Hessian matrix is as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 3 ; 6 3 ; 2 6 4 L i where σ i;norm ¼ σ i ∕2 o i .For computing the second order derivatives, the concatenated Scharr filters with step size σ i;norm are applied.First, search for maxima of the detector response in spatial location.Check that the detector response is higher than a predefined threshold and that it is a maxima in a window of 3 × 9 pixels of three adjacent sublevels.Finally, the 2-D position of the keypoint is estimated with subpixel accuracy by fitting a 2-D quadratic function to the determinant of the Hessian response in a 3 × 3 pixels neighborhood and finding its maximum.

Feature Description
The diagram in Fig. 1, as supplied by Ref. 22, demonstrates LDB 24 and M-LDB tests between grid divisions around a keypoint.The intensity is expressed by colorful grids and the gradients in x are expressed by the arrows.The feature description of AKAZE algorithm is based on M-LDB that exploits gradient and intensity information from the nonlinear scale space.And there are two main improvements of M-LDB compared with LDB: (1) rotation invariance is obtained by estimating the main orientation of the keypoint, as is done in KAZE, 25 and rotating the grid of LDB accordingly.(2) A function of the scale σ is used as the subsample grids in steps instead of using the average of all pixels inside each subdivision of the grid.The scale-dependent sampling in turn makes the descriptor robust to changes in scale.

Forward-Backward Flow Tracking
Because of the environmental impact or object's appearance change, the results of KLT often produce deviation, an evaluation method needs to be established to judge the accuracy of tracking results.Forward-backward error, 26 which is based on the forward-backward continuity assumption, can effectively estimate the trajectory error of keypoints, i.e., if the object tracking is correct, then the tracking results are independent of time.
As shown in Fig. 2, for two adjacent frame I t−1 and I t , x t−1 is a random keypoint from object template in the frame I t−1 , x t is the corresponding keypoint of x t−1 in the frame I t using forward tracking, and xt−1 is the corresponding keypoint of x t in the frame I t−1 using backward tracking.Forward-backward error is defined as the Euclidean distance between two keypoints in frame t−1 is bigger than a threshold which we set, the keypoint will be tracked falsely.
We set the location of keypoint and status of forwardbackward error as a pair pair (keypoint, status).If the status corresponding to keypoint is TRUE, which means the status of forward KLT and backward KLT themselves both must be TRUE, and error e FB t−1 is smaller than the Euclidean distance threshold, then we call the keypoint with TRUE status tracking-keypoint.The rest are called failing trackingkeypoint.

Model of AKT
When calculating the homographic matrix between the initial keypoints and the current keypoint based on the traditional AKAZE algorithm, robust statistical methods, such as RANSAC and LMEDS, are usually adopted.However, when the number of outliers is too much, homographic matrix estimation will get poor results.So, in this paper, we put forward a tracking model called AKT, which can fundamentally eliminate the false matching-keypoints and reduce the proportion of outliers to effectively solve the problem of inaccurate parameter estimation.
The diagram in Fig. 3 demonstrates how the AKT algorithm fuses the matching-keypoints and tracking-keypoints by AKT algorithm.The collection of V a is composed of matching-keypoints P t ai ðx; yÞ in the t'th frame corresponding to the keypoints in object template obtained by AKAZE matching algorithm.And these matching-keypoints are represented by black circles in Fig. 3.The collection of V k is composed of tracking-keypoints P t ki ðx; yÞ in the t'th frame corresponding to the keypoints in object template obtained by KLT algorithm.And these tracking-keypoints are represented by gray circles in Fig. 3.There is a one-to-one correspondence between matching-keypoints and trackingkeypoints.Keypoints surrounded by the curve are credible keypoints in the t'th frame, which will make contributions to calculating an object's location, scale, and rotation.The rest of the key points are outliers and thus, they are deleted.
The credible keypoints are obtained by fusing matchingkeypoints and tracking-keypoints.Its collection is V.
Sort the Euclidean distance l t i between the i'th pair of matching-keypoints and tracking-keypoints in the t'th frame in descending order, then the experiments show that the optimal value l t Th to be set as maximum allowable deviation threshold is in 0.26th of the distance sequence because enough credible keypoints are ensured, and the obvious false matching-keypoints can be removed.This means that the all but the bottom 0.74th pairs of points are valid matches.Set keypoint P t i ðx; yÞ as the center, a as the width and the height of the patch as M t i .The degree of similarity between two patches is defined as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 4 ; 3 2 6 ; 6 0 9 where β NCC is the normalization correlation coefficient.Set minimum allowed similarity threshold to be α Th , the set V of credible keypoints is composed of three parts: (1) when the Euclidean distance between the i'th pair of matching-keypoints and tracking-keypoints satisfies l t i ≤ l t Th , keypoints P t ai ðx; yÞ ∈ V; (2) when l t i > l t Th , AKAZE match or KLT track may cause an error, lead to an excessively large deviation, so mistakenly deleted credible keypoints can be screened out by referring to similarity, namely if αðM i ; M t i Þ > α Th , matching-keypoints P t ai ðx; yÞ ∈ V; and (3) if αðM i ; M t i Þ > α Th , tracking-keypoints P t ki ðx; yÞ ∈ V.

Bounding Box
The traditional ways to calculate the homographic matrix are statistical methods, such as RANSAC and LMEDS.
However, experiments show that the estimation of homography gives poor results for nonplanar objects, even though the keypoint association was performed correctly. 27So, in this paper, the median method is put forward to compute object's location, scale, and rotation in each frame.
As shown in Fig. 4, P center ðx; yÞ and P t center ðx; yÞ represent the center of the initial template and the object's bounding box in the t'th frame, respectively.P i ðx; yÞ and P t i ðx; yÞ represent credible keypoints of the initial template and that in the t'th frame.θ n and θ t n represent the angle between the i and i þ 1 keypoints of the initial template and that in the t'th frame.d n and d t n , respectively, represent the Euclidean distance between the keypoints in the initial template and that in the t'th frame.With the following equations, the relative changing rate of position, scale and rotation angle can be calculated: Fig. 3 The model of AKT.
where x t rotate and y t rotate , respectively, represent the x-coordinate and y-coordinate after rotation.P t ri ðx; yÞ are the four vertices' coordinates of tracking box in the t'th frame.The tracking box B ¼ ðb 1 ; b 2 ; : : : b n Þ of each frame can be obtained through the calculation above.

Algorithm Procedure
Given a sequence of images I 1 ; : : : ; I n and an initializing region b 1 in I 1 , our aim in each frame of the sequence is to recover the box of the object of interest.Steps of the AKT Algorithm 1 are as follows:

Experimental Results
We evaluated the proposed tracking algorithm based on fusing AKAZE and KLT (AKT) algorithm using sequences, as supplied by Ref. 28, with challenging factors including partial occlusion, drastic illumination changes, nonrigid deformation, background clutter, and motion blur.We compared the proposed AKT tracker with seven state-of-the-art methods: tracking-learning-detection (TLD), 14 compressive tracker (CT), 29 context tracker (CXT), 30 color-based probabilistic tracking (CPF), 31 structured output tracking with kernels (Struck), 32 multiple instance learning tracker (MIL) 33 and the circulant structure of tracking with kernels (CSK). 34ll data in the experimental results and the quantitative evaluation are based on the unified dataset and the same initial state conditions.Since our algorithm focuses primarily on the challenges of partial occlusion, deformation, rotation, and scale variation, we only include eight of the videos that mainly contain these challenges and neglect the others in the following discussions.Additionally, the results of precision and success rate are based on 22 videos, in which the good ones are as shown in Fig. 5 and Table 1.Experimental environment: Visual Studio 2013 + OpenCV3.1.0.Equipment is configured to: 2.00 GHz, dual processor, a 64-bit operating system, the 32-Gb installed memory.
There are a range of measures available in previous research for assessing the performance of tracking algorithms quantitatively.Many authors employ the center-error measure that expresses the distance between the centroid of the algorithmic output and the centroid of the ground truth.This measure is only a rough assessment of the localization.Since it is not bounded, the comparison of results obtained Algorithm 1: Fusing AKAZE-KLT tracking.
1: P i ðx; yÞ←AKAZE detectðI 1 Þ, detect and describe keypoints of object template in the first frame using AKAZE algorithm.on different sequences is difficult.So, we also employed the widely used overlap measure E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 2 ; 6 3 ; 9 6o where b T is the tracker output and b GT refers to the manually annotated bounding box, ∪ represents union, namely, the overlap of b T , and b GT , ∩ represents intersection of these boxes.The overlap rate is a better indicator for per-frame success when bounded between 0 and 1. 35  Since the rotation is not considered in the ground truth of the benchmarks, it is excluded in the overlap comparisons between our results and the benchmarks.

Accuracy Comparison of Methods for Tracking
The tracking performance of the AKT algorithm on different datasets 28 is as shown in Fig. 5. Sequences (a) and (b) mainly contain the challenging aspect of partial occlusion.Sequences (c) and (d) mainly contain deformation.Sequences (e) and (f) mainly contain plane rotation and out-of-rotation.Sequences (g) and (h) mainly contain scale variation, and so on.The results show that facing different situations, the AKT algorithm can accurately track the object and has a very good robustness.
Although the AKT algorithm shows good tracking results in these videos, there are still some challenges that are hard to deal with.Since the AKT algorithm is based on keypoints, when the object's appearance is smooth or the texture is not rich, it may struggle, as shown in Fig. 6(a).Also, when the object's appearance is almost or totally changed, the tracking box may drift.For example, the initial object is the face, but    1, the results show that among the tracking on the eight datasets, the frame rate of AKT algorithm is 77.9 fps, showing a high real-time performance (the average fps comes in the top two 7 times), and achieving a high tracking accuracy with the average CLE of 11.9 pixels (the average CLE comes in the top two 5 times), the tracking performance is better than the other seven methods.
The CLE is defined as the average Euclidean distance between the center locations of the tracking boxes using our method and the manually labeled ground truths.Then the average CLE over all the frames of one sequence is used to summarize the overall performance for that sequence.Precision plot shows the percentage of frames whose estimated location is within the given threshold distance T th of the ground truth, as shown in Fig. 7(a).The results show that precision of AKT tracking is higher than the other algorithms and similar to Struck.
To measure the performance of success rate on a sequence of frames, we count the number of successful frames whose overlap o is larger than the given threshold T 0 th .The success plot shows the ratios of successful frames at the thresholds varied from 0 to 1, as shown in Fig. 7(b).The results show that AKT algorithm is superior to other algorithms.

Error Comparison of Methods for Homography
Estimation In order to evaluate the different methods for homography estimation, we developed our own dataset because the data supplied by Ref. 26 did not include rotation data.We gained a total of 200 frames randomly as original frames.Then we transformed these frames using the affine model, as shown in Eq. ( 13).
E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 3 ; 6 3 ; 4 2 5 x 0 y where ½x 0 y 0 T represents the coordinate of a point in the original frame.½xy T represents the coordinate of a point in the transformed frame.s, α, and ½dxdy T , respectively, represent scale, rotation, and displacement of the affine model.After transforming, we can get the dataset composed of original frames and transformed frames with known affine homography.
Then, under the condition that the keypoints of original frames and that of transformed frames are the same, we calculate the errors of displacement (pixel), scale (1) and rotation (deg) to get the error figures (method LMEDS in red, RANSAC in blue, MEDIAN in green), as shown in Fig. 8.The independent variable of error figures is the number of frames, whereas the dependent variable is the error.
The average error (AE) is used for comparison as the first evaluation criterion, as shown in Table 2.There will be noises causing by obvious variable estimation error, so to make better comparison of the methods for homography estimation, we set up average error without noise (AEN) as the second evaluation criterion.From the error figures, we set 100 pixels as location noise threshold, 10 as scale noise threshold, 150 deg as rotation noise threshold.The lower the AE and AEN, the better the performance of method for homography estimation.The smaller the difference between AE and AEN, the more stable the method for homography estimation.Therefore, the experimental results show not only that the median method is more stable, not having apparent noises, but also that its value of AE and AEN is less than that of the traditional statistical method.

Selection of Threshold for Tracking Results
The ratio of the number of inliers to the total number of matching-keypoints is called inlier ratio (IR).The larger the IR, the better the estimation of homographies.We impose that the error in location for two corresponding keypoints has to be less than 2.5 pixels, i.e., kF 0 b − HðF a Þk < 2.5, where H is the true homography between the frames, F a is the location of keypoint a in original frame F, and F 0 b is the location of keypoint b in transformed frame F b .The keypoint meeting above condition is called inlier.To find the threshold for better tracking, we still use the dataset put forward in Sec.4.3 with the total number changed to 2000.We calculate the IR of these corresponding frames and the mean of IR is 0.74, as shown in Fig. 9. Therefore, we set optimal value l t Th for tracking as the mean of IR to avoid outliers.

Conclusion
In this paper, in an effort to reduce an excess of outliers when using traditional AKAZE match-tracking algorithm and solve the problems caused by poor homography estimates produced by statistical methods, AKT algorithm is put forward.The experimental results on different datasets show that the AKT algorithm can deal with challenges, such as partial occlusion, deformation, scale variation, rotation, and background clutter, showing high real-time performance and accuracy.However, since the tracking method used is based on keypoints, when the objects appearance is smooth, and texture is not rich, using the AKT algorithm may result in reduction of the effectiveness of tracking.Therefore, in future work, we will address the problems mentioned above.

Fig. 2
Fig. 2 Forward-backward error in two adjacent frames.

Fig. 4
Fig.4The median method to get object's location, scale, and rotation.

Fig. 6 Fig. 7 (
Fig. 6 The AKT algorithm suffers from texture less object and the changed appearance: (a) the tracking box is given falsely because of fewer keypoints and (b) the tracking box drifts because of the changed appearance.

4. 2
Performance Comparison of Methods for Tracking The center location error (CLE) and average frame per second (fps) of AKT algorithm and other seven kinds of tracking algorithms are shown in

Fig. 8
Fig. 8 Comparison results of methods for homography estimation: (a) similar accurate results for homography estimation, (b) LMEDS and RANSAC gives poor results while MEDIAN gives good result, (c) errors of x -coordinate displacement, (d) errors of y -coordinate displacement, (e) errors of scale, and (f) errors of rotation.

Fig. 9
Fig. 9 IR and mean of IR.
P t di ðx; yÞ←AKAZE detect½I t ðROIÞ, detect and describe keypoints of search window in the t th frame 4: P t ai ðx; yÞ←AKAZE matchðP i ; P t di Þ, match keypoints of object template and search window using AKAZE algorithm.5:P t ki ðx; yÞ←KLTtrack½P i ðx; yÞ; I 1 ; I t , track keypoints of object template in search window in the t th frame using forward-backward KLT algorithm.

Table 1
The CLE and average frame per second (pixel/FPS).

Table 1 (
bold fonts indicate the best or second best performance), the results of the other seven kinds of tracking methods on different sequences in the table comes from Ref.26.In Table