Recently, the correlation filter (CF)-based methods have achieved great success in the field of object tracking. In most of these methods, the CF utilizes L2 norm as the regularization, which does not pay attention to the stability and robustness of the feature. However, there may exist some unstable points in the image because the object in the video may have different appearance changes. We propose a tracking method based on a structured robust correlation filter (SRCF), which employs the L2,1 norm as the regularization. The robust CF can not only retain the accuracy from the regression formulation but also take into account the stability of the image region to improve the robustness of the appearance model. The alternating direction method of multipliers algorithm is used to solve the L2,1 optimization problem in SRCF. Moreover, the multilayer convolutional features are adopted to further improve the representation accuracy. The proposed method is evaluated in several benchmark datasets, and the results demonstrate that it can achieve comparable performance with respect to the state-of-the-art tracking methods.
Visual object tracking is a hot research topic in the domains of computer vision, multimedia, etc. It has been successfully used in many fields, such as video surveillance,1,2 traffic monitoring,3 and motion analysis, and has attracted the attention of more researchers.4,5 However, realizing accurate and robust tracking is still a challenging task because there are many complex conditions, including appearance deformation, occlusion of similar or different objects, illumination variations, scale changes, background clutter, etc.
According to the appearance model, the tracking methods can be divided into two types, i.e., the generative model6–12 and the discriminative model.13–17 The generative model often formulates tracking as a matching problem, which only uses the information of the target. On the contrary, the discriminative model utilizes the information of both the target and the background, which is always formulated as a binary classification or a regression problem. Because the discriminative model-based methods use more information, they can get better performance during the tracking process. Furthermore, the regression formulation, which uses more spatial information, attracts more attention because it replaces the sparse sampling in binary classification with dense sampling.
Recently, some tracking methods based on a correlation filter (CF), which corresponds to the regression formulation, have achieved great success.18–21 On one hand, the CF addresses the sparse sampling in binary classification model, which makes full use of the spatial information. On the other hand, by introducing circulant assumption to generate training samples, CF can greatly improve the efficiency of sample selection and speed up the training and detection process by fast Fourier transform (FFT). Bolme et al.18 first model the appearance by learning the CF and propose a minimum output sum-of-squared error filter tracking method, but this method does not make full use of the spatial constraints. Henriques et al.19,22 exploit the circulant structure of the local image patch and learn a ridge regression as well as a CF for tracking. Danelljan et al.23 develop the adaptive color attributes based tracker by adding the color attribute to augment the intensity feature. Zhang et al.24 incorporate geometric transformations into a CF-based network to handle boundary effect issue.
Inspired by the successful applications in face recognition, image detection, image classification, etc., deep learning has been introduced into tracking by some researchers as well. For example, Wang and Yeung25 introduce an autoencoder into tracking and develop the first deep learning-based tracker. Li et al.26 present a single convolutional neural network (CNN) based tracking method, which can learn effective feature representations. Nam and Han27 propose to learn a multidomain CNN for tracking, which is composed of shared layers and multiple branches of domain-specific layers. Due to its powerful representation ability, deep learning greatly improves the tracking performance. Commonly, deep learning works together with the generative model, different classifiers, or regression algorithms; thus, the tracking methods with deep learning still retain the disadvantages of these formulations.
Recently, some researchers28–31 have also proposed some new tracking methods, which utilize both the deep CNN and CF to further improve the tracking performance. For example, Ma et al.28 develop the hierarchical convolutional features based tracking, which exploits the multiple levels of abstraction for pyramid representation under the CF tracking framework. Mueller et al.30 present the context-aware CF tracking, which takes global context into account and incorporates it into the CF. Danelljan et al.29 introduce a factorized convolution operation and a compact generative model of the training sample distribution in CF tracking, which greatly improves the tracking efficiency. However, in most of these methods, only the L2 norm is used and less attention is paid to the unstable positions in the image region. In practice, because of the appearance changes caused by deformation or occlusion (Fig. 1), there always exist unstable points in the region.
In this study, we propose a tracking method based on the structured robust correlation filter (SRCF) with the L2,1 norm. First, to address the impact of the unreliable points in the image region with a multichannel feature, we develop a robust CF and formulate tracking as a structured robust regression problem. By introducing the structured sparse formulation, the stable features can be adaptively selected. Further, we derive the solution algorithm corresponding to the SRCF based on the alternating direction method of multipliers (ADMM) approach. Second, based on the traditional CF tracking methods, we implement a concrete tracking algorithm based on the proposed SRCF. Specifically, we extract the multilayer multichannel features with a CNN for representation, which can further improve the representation ability. Moreover, we also present a judgment-based update model to improve the tracking robustness in complex conditions. We evaluate the performance of the proposed tracking method on many public datasets, and the experimental results illustrate that the proposed tracking method based on SRCF with the L2,1 norm can achieve comparable performance to many state-of-the-art trackers.
The remainder of this study is organized as follows. In Sec. 2, we introduce the related work of classical CF-based tracking. In Secs. 3 and 3.5, we describe the proposed SRCF and its corresponding tracking method, respectively. Section 4 shows the experimental results and the last section concludes the study.
Correlation Filter Tracking
Before discussing our proposed tracking method based on the robust CF, we first review the tracking method based on the traditional CF.22 Hereby, we briefly introduce the key components of the CF tracker, which include the ridge regression formulation, the fast realization with FFT, the dense sampling, and the circulant assumption.
The CF corresponding to the ridge regression is represented as follows:
Based on the circulant assumption, we can obtain the solution to Eq. (1) in the Fourier domain:
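As an illustration of how the circulant assumption turns ridge regression into element-wise operations in the Fourier domain, the following minimal sketch trains a one-dimensional, single-channel filter and checks it against an explicit solve over all circular shifts. The function names, the toy signal, the Gaussian label width, and the correlation convention (samples are the rows produced by `np.roll`) are assumptions made for this example; they are not taken from the tracker itself.

```python
import numpy as np

def circulant_samples(x):
    """Stack every circular shift of the base sample x as one row."""
    return np.stack([np.roll(x, i) for i in range(x.size)])

def train_ridge_cf(x, y, lam):
    """Ridge-regression filter in the Fourier domain (element-wise division)."""
    x_hat = np.fft.fft(x)
    y_hat = np.fft.fft(y)
    return x_hat * y_hat / (np.abs(x_hat) ** 2 + lam)

def detect(w_hat, z):
    """Filter responses for all circular shifts of the test sample z."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(z)) * w_hat))

rng = np.random.default_rng(0)
n = 32
x = rng.standard_normal(n)                       # toy base sample (assumption)
dist = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-0.5 * (dist / 2.0) ** 2)             # Gaussian label peaked at shift 0
lam = 0.01

w_hat = train_ridge_cf(x, y, lam)

# Reference: solve the same ridge regression explicitly on the shifted samples.
X = circulant_samples(x)
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
max_gap = np.max(np.abs(np.real(np.fft.ifft(w_hat)) - w_direct))

peak = int(np.argmax(detect(w_hat, x)))          # response on the training sample
```

The explicit solve needs a dense n-by-n system, whereas the Fourier-domain version only needs FFTs and element-wise arithmetic, which is the efficiency argument made for CF trackers.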
CF tracking brings many benefits. First, by formulating tracking as a regression problem, the spatial information of the image can be fully utilized, and the appearance model built on the CF can be represented more accurately. Second, based on the circulant assumption, many more training samples can be generated virtually without increasing the computational complexity. Since the regularization term in the traditional filter is the L2 norm, the filter can be computed quickly with the FFT algorithm. However, because the object is always moving in the video sequences, its appearance may change heavily, which generates unstable regions. In this condition, the L2 norm is not robust to the outlier points and the appearance model may not be accurate enough.
Tracking with Structured Robust Correlation Filter
To address the unstable points and improve the accuracy of the appearance model, we formulate tracking as a robust regression problem and develop an SRCF-based tracking method. The overview of the proposed method is shown in Fig. 2. Different from the traditional ridge regression formulation, we formulate tracking as a structured robust regression problem with the L2,1 norm, which can adaptively select the robust features for tracking. First, the SRCF with L2,1 norm regularization, which is built based on the training region and a predefined response map, is trained. Then, the learned filter is used for tracking in the following frame. Specifically, the multilayer CNN features are used to improve the representation ability. Moreover, an update model with judgment and incremental strategies is constructed to accommodate the filters.
L1 Norm-Based Robust Correlation Filter
We first introduce the robust CF with the L1 norm, which is suitable for the single-channel feature. In this condition, each element of the feature corresponds to a specific position in the image region. Therefore, using the L1 norm can adaptively choose the stable points, alleviating the effect of the appearance changes.
Assume that a training sample matrix and its corresponding label vector are given. Similar to the sample generation in the traditional CF, the sample matrix can be approximately obtained by circular shifts of the center sample. Inspired by the feature selection property of the L1 norm and considering the stability of the points, we develop the CF with the L1 norm:
L2,1 Norm-Based Structured Robust Correlation Filter
The L1 norm is only suitable for the single-channel feature. Since the single-channel feature usually means intensity, it cannot represent the appearance accurately. Commonly, to improve the representation ability, the single-channel feature is extended to a multichannel feature, such as the histogram of oriented gradients (HOG), CNN features, etc. With a multichannel feature, there is a group of feature elements at each position of the image region. Choosing specific groups of features can be taken as a structured sparse learning problem, which can be solved with the L2,1 norm. Thus, the L1 norm is extended to the L2,1 norm to select the stable feature groups. Correspondingly, the new CF with L2,1 norm regularization is named SRCF.
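To make the difference between element-wise and group-wise sparsity concrete, the sketch below computes the L1 norm and the L2,1 norm of a small filter. The array layout (one row per spatial position, one column per feature channel) and the function names are assumptions chosen for illustration.

```python
import numpy as np

def l1_norm(w):
    """Sum of absolute values: penalizes every element independently."""
    return float(np.sum(np.abs(w)))

def l21_norm(W):
    """Sum over positions (rows) of the L2 norm taken across channels (columns)."""
    return float(np.sum(np.sqrt(np.sum(W ** 2, axis=1))))

# Three positions, two channels; the middle position is switched off as a group.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])

group_penalty = l21_norm(W)          # row norms are 5, 0, and 1
single_channel = np.array([[2.0], [-3.0], [0.0]])
```

With a single channel each row holds one element, so the L2,1 norm reduces to the L1 norm, which matches the extension from the single-channel to the multichannel case described above.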
The CF with L2,1 norm regularization can be represented as
We employ the ADMM algorithm to solve the problem in Eq. (4). By introducing the auxiliary variable and adding more constraints, Eq. (4) becomes
Then, the Lagrange function can be represented as
First, update with the other parameters fixed. In this condition, the optimization problem becomes
By setting the gradient of Eq. (7) with respect to to 0, we can get the closed-form solution:
Based on the circulant assumption, the feature matrix of each channel can be obtained by circular shifts of the corresponding channel of the center sample feature map. Thus, by the FFT algorithm, the Fourier transform of the parameter in each channel is represented as follows:
Second, update with the other parameters fixed. Hereby, the optimization problem becomes
The problem with both the L2 norm and the L2,1 norm in Eq. (10) has a closed-form solution:
Third, the rest of the parameters can be updated as follows:
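The three-step structure above (a quadratic subproblem, a subproblem with an L2 term and an L2,1 term, and a multiplier update) can be sketched with a generic scaled-form ADMM for L2,1-regularized least squares. This is a simplified stand-in under stated assumptions, not the exact derivation of the paper: it uses a single splitting variable `Z`, small dense matrices instead of FFT-based solves, a fixed penalty `rho`, and hypothetical function names and toy data.

```python
import numpy as np

def row_shrink(V, tau):
    """Shrink the L2 norm of every row of V by tau (rows below tau become zero)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * V

def objective(Xs, y, W, lam):
    """0.5 * squared residual plus lam times the L2,1 norm of W (positions x channels)."""
    pred = sum(X @ W[:, d] for d, X in enumerate(Xs))
    return 0.5 * np.sum((pred - y) ** 2) + lam * np.sum(np.linalg.norm(W, axis=1))

def admm_l21(Xs, y, lam, rho=1.0, iters=15):
    """Xs: list of D sample matrices of shape (n, P); returns Z of shape (P, D)."""
    D, P = len(Xs), Xs[0].shape[1]
    A = np.hstack(Xs)                              # columns ordered channel by channel
    lhs = A.T @ A + rho * np.eye(D * P)
    Aty = A.T @ y
    Z = np.zeros((P, D))
    U = np.zeros((P, D))                           # scaled dual variable
    for _ in range(iters):
        # Step 1: quadratic subproblem, solved in closed form.
        rhs = Aty + rho * (Z - U).T.reshape(-1)
        W = np.linalg.solve(lhs, rhs).reshape(D, P).T
        # Step 2: L2 + L2,1 subproblem, solved by row-wise shrinkage.
        Z = row_shrink(W + U, lam / rho)
        # Step 3: dual (multiplier) update.
        U = U + W - Z
    return Z

rng = np.random.default_rng(1)
n, P, D = 40, 8, 3
Xs = [rng.standard_normal((n, P)) for _ in range(D)]
W_true = np.zeros((P, D))
W_true[1] = [2.0, -1.5, 1.0]                       # only positions 1 and 5 carry signal
W_true[5] = [-1.0, 2.0, 1.5]
y = sum(X @ W_true[:, d] for d, X in enumerate(Xs))

Z = admm_l21(Xs, y, lam=0.1, iters=50)
row_norms = np.linalg.norm(Z, axis=1)
```

The row-wise shrinkage is what produces group sparsity: a position is either kept across all channels or removed across all channels, which mirrors the selection of stable feature groups.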
Tracking with SRCF
Under the CF tracking framework, we develop the tracking method with the proposed SRCF. Moreover, we utilize the convolutional feature to represent the appearance and use a judgment strategy to improve the accuracy of the update model.
The representation in tracking includes two parts: selection of the training and searching regions and feature extraction. Since we follow the CF tracking framework, we adopt the same region selection scheme. For the training region, we select an image region that has the same center as the target and a much larger area. On one hand, the larger region can satisfy the circulant assumption, which is useful for the fast realization with FFT. On the other hand, because the training samples are approximately generated by circular shifts, the larger region provides denser sampling, which improves the discriminability of the model. The searching region is selected in the next frame in the same manner as the training region.
Once the training or searching region is selected, specific features can be extracted for better representation. In the original CF tracking method and some variants, both the intensity and HOG features are adopted. Inspired by the powerful representation ability of convolutional features, we extract the convolutional features via VGG-Net, which is trained on the ImageNet dataset and achieves excellent performance on classification and detection challenges. Different layers of the features describe the image from different aspects, i.e., the lower layers have more location information while the higher layers keep more semantic information. Because both location and semantic information are important for tracking, we use several layers of the convolutional features for representation.
Because convolutional features from multiple layers are used for representation, we train a multilayer SRCF group as the appearance model. Given the feature map of the training region in each layer, we can train an individual SRCF for that layer according to Eqs. (7)–(12). Then, the CFs in all layers are collected and taken as the model for tracking.
Determining the tracking result
Given the multichannel feature map of the searching region in each layer and the Fourier transform of the trained SRCF for that layer, we can calculate the response map in the Fourier domain:
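A minimal sketch of the per-layer response computation follows: the correlations of all channels are summed in the Fourier domain and transformed back, and the peak of the result indicates the target displacement. Because the SRCF solution itself is not reproduced here, the filter in this example is a multichannel ridge-regression stand-in; that substitution, the function names, the toy feature map, and the label width are all assumptions.

```python
import numpy as np

def train_multichannel_ridge(x, y, lam):
    """Ridge stand-in for the learned filter; x: (H, W, D), y: (H, W)."""
    x_hat = np.fft.fft2(x, axes=(0, 1))
    y_hat = np.fft.fft2(y)
    energy = np.sum(np.abs(x_hat) ** 2, axis=2) + lam
    return x_hat * (y_hat / energy)[:, :, None]

def response_map(w_hat, z):
    """Sum the per-channel correlations in the Fourier domain, then invert."""
    z_hat = np.fft.fft2(z, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(np.conj(z_hat) * w_hat, axis=2)))

def gaussian_label(h, w, sigma):
    """Gaussian regression target whose peak sits at index (0, 0) and wraps around."""
    dy = np.minimum(np.arange(h), h - np.arange(h))[:, None]
    dx = np.minimum(np.arange(w), w - np.arange(w))[None, :]
    return np.exp(-0.5 * (dy ** 2 + dx ** 2) / sigma ** 2)

rng = np.random.default_rng(2)
x = rng.standard_normal((16, 16, 3))             # toy multichannel feature map
w_hat = train_multichannel_ridge(x, gaussian_label(16, 16, 1.5), lam=0.01)

z = np.roll(x, shift=(3, 2), axis=(0, 1))        # search region: circularly shifted content
resp = response_map(w_hat, z)
peak = np.unravel_index(np.argmax(resp), resp.shape)
```

With this correlation convention the peak appears at the negative of the applied shift, modulo the map size; a different sign convention would simply mirror the peak location.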
To better capture the changes of the appearance, the CF should be updated in a timely manner. Besides, online learning should be adaptively controlled to avoid learning the occlusion. In our method, the model update includes two stages. In the first stage, to alleviate the impact of the occlusion, we present a judgment strategy to control the update. We calculate the cosine similarity of two consecutive frames:
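The judgment step can be sketched as a cosine similarity between the features of two consecutive frames followed by a threshold test. Which quantities are compared and the direction of the test are assumptions in this example: it treats a low similarity as a sign of occlusion and therefore skips the update. The function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature arrays, flattened to vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def should_update(prev_feat, curr_feat, threshold):
    """Assumed rule: update only when consecutive frames are similar enough."""
    return cosine_similarity(prev_feat, curr_feat) >= threshold

prev_feat = np.array([[1.0, 2.0], [3.0, 4.0]])
similar = prev_feat * 1.5                        # same direction, different magnitude
different = np.array([[4.0, -3.0], [2.0, -1.0]])
```

Cosine similarity ignores the overall magnitude of the features, so a global brightness change that scales all features equally would not by itself block the update under this rule.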
In the second stage, we utilize an incremental strategy to realize the model update. Once it is determined to update, we can select a new training region in the current frame and extract its multichannel feature. Then, the model is updated by
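An incremental update of this kind is commonly written as a linear interpolation between the stored quantity and the one computed from the new training region. The sketch below assumes that form; the function name and the convention that `rate` weights the new term are assumptions and may differ from the convention behind the update rate reported in the experiments.

```python
import numpy as np

def incremental_update(old, new, rate):
    """Blend the stored model term with the newly extracted one."""
    return (1.0 - rate) * old + rate * new

old_model = np.zeros((2, 2))
new_model = np.ones((2, 2))
blended = incremental_update(old_model, new_model, rate=0.25)
```

A small weight on the new term makes the model change slowly, which trades responsiveness to appearance changes for resistance to drift.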
During the tracking process, the scale of the object may change. To obtain better tracking performance, we adopt the scale adaptation strategy23 to address the scale changes. Besides the translation filters used for location, a scale filter is built to estimate the optimal scale of the target. The scale filter is learned based on the image patch centered around the target, and 33 scales are used for scale estimation (Algorithm 1).
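A scale search over 33 candidates is often organized as a geometric series of scale factors centered on the current size. The sketch below assumes that layout; the step of 1.02 between neighboring scales and the function names are assumptions for illustration, since only the number of scales is stated in the text.

```python
import numpy as np

def scale_factors(num_scales=33, step=1.02):
    """Geometric series of scale factors, symmetric around 1.0."""
    exponents = np.arange(num_scales) - (num_scales - 1) / 2.0
    return step ** exponents

def best_scale(scores, factors):
    """Pick the factor whose scale-filter score is highest."""
    return float(factors[int(np.argmax(scores))])

factors = scale_factors()
toy_scores = -np.abs(np.arange(33) - 20.0)        # toy scores peaking at index 20
chosen = best_scale(toy_scores, factors)
```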
Algorithm 1 SRCF tracking: iteration in frame t.
Input: Frame; previous object position; robust filter.
Output: Object position and corresponding bounding box; updated filter.
Tracking:
(1) Crop the candidate image patch at the previous object position from the current frame and extract the convolutional features conv3-4, conv4-4, and conv5-4 of VGG-Net-19;
(2) Calculate the response map by Eq. (13);
(3) Determine the optimal position by taking the maximum value of the response map.
Update:
(1) Crop the training patch at the estimated position from the current frame, and extract and augment the features conv3-4, conv4-4, and conv5-4 of VGG-Net-19;
(2) Judge whether to update by calculating the similarity with Eq. (14).
IF the judgment rejects the update, do NOT update;
ELSE:
(2.1) Update the model by Eq. (15);
(2.2) Learn the filter by solving Eq. (4) with ADMM.
We denote the proposed tracking method as SRCF, which is initialized as follows. VGG-Net-19 network is used to extract features and the outputs of the conv3-4, conv4-4 and conv5-4 are taken as the features. For each layer of the feature, we train a corresponding model and the final tracking response map is the summation of the response in the above three layers. The weights of the above three layers are set as 1, 0.5, and 0.25, respectively. For the Gaussian function, . The regularization parameter is set as 0.01. The coefficient is set as 3 and the iterations for ADMM are set as 15. The judging threshold is set as 0.99 and the update rate is set as 0.99. The padding factor for the larger region selection is set at 1.8. All parameters are fixed for all sequences.
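The fusion of the three layer responses can be sketched as a weighted sum followed by a peak search. The example assumes that the response maps have already been brought to a common size and that the weights 1, 0.5, and 0.25 apply in the order in which the layers are listed above; the function names and the conversion of the peak index to a signed circular offset are assumptions for illustration.

```python
import numpy as np

def fuse_responses(responses, weights=(1.0, 0.5, 0.25)):
    """Weighted sum of same-sized response maps, one per feature layer."""
    fused = np.zeros_like(responses[0], dtype=float)
    for r, w in zip(responses, weights):
        fused += w * r
    return fused

def peak_offset(resp):
    """Peak index converted to a signed circular offset from the map origin."""
    idx = np.unravel_index(np.argmax(resp), resp.shape)
    return tuple(int(i) if i <= s // 2 else int(i) - s for i, s in zip(idx, resp.shape))

# Toy maps: the first layer votes for (7, 1), the other two vote for (2, 2).
r1 = np.zeros((8, 8)); r1[7, 1] = 1.0
r2 = np.zeros((8, 8)); r2[2, 2] = 1.0
r3 = np.zeros((8, 8)); r3[2, 2] = 1.0

fused = fuse_responses([r1, r2, r3])
offset = peak_offset(fused)
```

In this toy case the first layer outweighs the other two combined (1.0 against 0.75), so its vote decides the fused peak.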
The precision plots and success plots, which are obtained from the precision and the success rate (SR), are used to evaluate the performance of the trackers. Precision is the ratio of the number of frames in which the center location error is smaller than a threshold to the total number of frames. The visual overlap rate (VOR) is defined as the average of the overlap ratio, i.e., the area of the intersection of the bounding boxes of the tracking result and the ground truth divided by the area of their union.32 SR is defined as the ratio of the number of successful frames to the total number of frames, where tracking in one frame is taken to be successful if the VOR in that frame is larger than a predefined threshold. By assigning different values to the two thresholds, the precision plots and success plots can be obtained to display the overall performance. The area under the curve (AUC) is used as another evaluation criterion as well.
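The evaluation criteria can be sketched as follows. Boxes are assumed to be given as (x, y, width, height); the function names, the 20-pixel and 0.5 default thresholds, and the approximation of the AUC by averaging the success rate over 21 evenly spaced overlap thresholds are assumptions in the style of common benchmark toolkits.

```python
import numpy as np

def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ca = np.array([a[0] + a[2] / 2.0, a[1] + a[3] / 2.0])
    cb = np.array([b[0] + b[2] / 2.0, b[1] + b[3] / 2.0])
    return float(np.linalg.norm(ca - cb))

def overlap_ratio(a, b):
    """Intersection area divided by union area of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def precision(pred, gt, pixel_thr=20.0):
    """Fraction of frames whose center error is below pixel_thr."""
    return float(np.mean([center_error(p, g) < pixel_thr for p, g in zip(pred, gt)]))

def success_rate(pred, gt, overlap_thr=0.5):
    """Fraction of frames whose overlap ratio exceeds overlap_thr."""
    return float(np.mean([overlap_ratio(p, g) > overlap_thr for p, g in zip(pred, gt)]))

def success_auc(pred, gt, thresholds=np.linspace(0.0, 1.0, 21)):
    """Average success rate over the thresholds, approximating the area under the curve."""
    return float(np.mean([success_rate(pred, gt, t) for t in thresholds]))

gt = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (5, 0, 10, 10), (40, 0, 10, 10)]   # exact, partial, lost
```

Sweeping `pixel_thr` produces the precision plot and sweeping `overlap_thr` produces the success plot.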
Comparison with State-of-the-Art Methods
We compare the performance of the proposed SRCF tracking method with several state-of-the-art tracking methods in the OTB-2013 dataset.33 The competing trackers include DSLT,34 MetaCREST,35 MetaSDNet,35 DaSiamRPN,36 STRCF,37 CNNSVM,38 MEEM,39 KCF,22 Struck,40 SCM,41 TLD,42 ASLA,43 HDT,44 and CXT.45
We first evaluate the overall performance of the proposed SRCF tracker and the competing trackers. The comparison results are shown in Fig. 3, which displays the precision plots and the success plots of SRCF and the competing trackers. It can be seen that our SRCF achieves a precision of 0.911 at 20 pixels, which ranks second among the trackers, and an AUC score of 0.653, which ranks fifth and outperforms most of the competing trackers.
In our method, we exploit the powerful representation ability of the CNN and extract the features from three different layers of VGG-Net-19 for representation. To verify the role of the CNN features, we build another two trackers, which only use the handcrafted features, i.e., HOG and grayscale features for comparison. We evaluate the performance of the trackers in the OTB-2013 dataset and show the result in Fig. 4. It can be found that the tracker with CNN feature achieves better performance on both the precision and success plots. Specifically, we can also see that the precision at 20 pixels and AUC obtained by the SRCF tracker with only HOG feature also outperform the KCF method by 2.5% and 5.9%, respectively.
Since we use three different layers of VGG-Net-19, i.e., conv3-4, conv4-4, and conv5-4, for comprehensive representation, we further explore the contribution of each layer. Besides the standard tracker, which uses all three layers, we build another three trackers, each of which makes use of the feature in a single layer. The comparison results on OTB-2013 are shown in Fig. 5. It can be seen that, among the single-layer trackers, the tracker with conv4-4 obtains the best precision and success plots, the tracker with conv5-4 achieves the second-best results, and the tracker with conv3-4 ranks third. However, by combining the features in all three layers, the tracking performance can be further improved, indicating that all three layers make a significant contribution to tracking.
Analysis of regularization
The main difference between our formulation and the traditional tracking methods is that we adopt the structured robust regularization with the L2,1 norm instead of the original L2 norm (Frobenius norm for the multichannel feature). Hereby, we explore the contribution of the L2,1 norm by building a new comparison tracker with the L2 norm. The CF with the L2 norm has the same configuration as the standard SRCF except for the regularization.
The comparison results on the OTB-2013 dataset are shown in Fig. 6. It can be found that the precision at 20 pixels of the tracker with the L2,1 norm is 0.912, whereas the precision obtained with the L2 norm is 0.898. The AUC score of the L2,1 norm is 0.652, which outperforms the L2 norm by 1%. Figure 7 shows two examples in which SRCF with the L2,1 norm gets better results than CF with the L2 norm. Note that the proposed SRCF outperforms the traditional CF tracking method, indicating that the L2,1 norm achieves better robustness than the L2 norm.
Analysis of model update
In our method, we develop a judgment-based update model to adaptively learn the appearance changes, which can effectively handle the appearance changes and occlusion problems. To explore the impact of the judgment-based update model, we also build a tracker, which only uses the traditional update model without judgment. The comparison results in OTB-100 dataset are shown in Fig. 8. It can be seen that the precision and AUC score of the SRCF tracker with judgment outperform that without judgment by 2.3% and 1.8%, respectively, indicating that the performance of SRCF can be further improved by introducing the judgment-based update.
Evaluation in More Datasets
Besides the OTB-2013 dataset in which the SRCF tracker has achieved good results, we also evaluate its performance in more datasets, including the Tcolor128 dataset,46 OTB-100 dataset,47 VOT2016 dataset,48 and VOT201749 dataset to explore the effect of the settings of the tracker.
Evaluation in the OTB-100 dataset
We further evaluate the performance of SRCF on the OTB-100 dataset, which includes 100 different sequences. We compare SRCF with several famous tracking methods, including DSLT,34 MetaCREST,35 MetaSDNet,35 DaSiamRPN,36 STRCF,37 CNNSVM,38 MEEM,39 KCF,22 Struck,40 SCM,41 TLD,42 ASLA,43 HDT,44 and CXT45 on this dataset. The comparison results of precision plots and success plots are shown in Fig. 9. We can see that SRCF achieves similar performance to that on the OTB-2013 dataset. The precision at 20 pixels of SRCF is 0.854, which ranks fifth among the competing trackers, and the AUC score is 0.613, which ranks sixth.
Evaluation in the Tcolor128 dataset
There are 128 color sequences in the Tcolor128 dataset, in which the sequences are collected from various circumstances, including highway, airport terminal, railway station, etc. Hereby, we evaluate our SRCF tracker on the Tcolor128 dataset against the competing trackers, which include ECO,29 ADNet,50 MEEM,39 KCF,22 Struck,40 VTD,51 CN2,23 ASLA,43 and L1APG.52 The overall precision plots and the success plots over the whole dataset are shown in Fig. 10. It can be observed that the precision at 20 pixels obtained by SRCF is 0.698, and the AUC score is 0.504, both of which rank third among the competing trackers. In this dataset, the MEEM method achieves good results on both criteria. Since MEEM adopts the color feature, the encoded LAB color model greatly improves the performance. However, ECO, ADNet, and our SRCF tracker, which use deep convolutional features, have more powerful representation ability and get better tracking performance. Moreover, we would like to encode the color feature to further improve our method as well.
Evaluation in the VOT2016 and VOT2017 datasets
We further evaluate the performance of SRCF on the VOT2016 and VOT2017 datasets, each of which contains 60 challenging sequences. The accuracy and robustness scores are used as the criteria for evaluation. On the VOT2016 dataset, we mainly compare our method with DaSiamRPN, DSLT, SA-Siam,53 SiamRPN,54 SRDCF, Staple,55 Struck, KCF, ASMS,56 BDF,57 HCF,28 DFST,58 and SAMF,59 and the comparison results are shown in Fig. 11. Our SRCF tracker ranks eighth on accuracy and fifth on robustness. We also compare our method with DaSiamRPN, KCF, MEEM, SA-Siam, SiamFC, SiamRPN, SRDCF, Staple, L1APG, Struck, and ASMS on VOT2017 and display the results in Fig. 12. It can be seen that the proposed SRCF tracker gets better robustness than most trackers, indicating that the L2,1 norm can improve the robustness to some degree.
The running speed is also important for the tracker. Our method is implemented in MATLAB on a PC with an Intel i7 CPU 3.4 GHz and a Nvidia GTX1080 GPU. We compare the running speed of our SRCF and some famous trackers on the OTB-100 dataset and show the result in Table 1. We can see that the running speed of SRCF is similar to that of MDNet27 and LSART,60 faster than that of C-COT61 and SCM, and slower than that of KCF, DaSiamRPN, MEEM, DSLT, STRCF, and MetaCREST. Although the L2,1 norm increases the robustness, it also decreases the running speed. Thus, some parallel strategies will be explored to further improve the running speed in the future.
Since our SRCF tracker is realized based on the CF tracking framework, it retains many advantages of CF tracking methods. For example, it can make full use of the spatial information of the training region, which can improve the representation accuracy. Moreover, it also borrows the feature extraction method from the deep learning algorithms, which further improves the representation ability. Compared to the methods that also follow the CF tracking framework, e.g., ECO, STRCF, SRDCF, Staple, and HDT, our SRCF tracker runs slower because of the L2,1 norm, but it also improves the tracking robustness by adaptively selecting the robust features. On the other hand, compared to the methods that adopt an end-to-end deep neural network, such as SiamRPN, DaSiamRPN, MetaSDNet, and LSART, SRCF may obtain somewhat lower accuracy but does not need a complex training process. In addition, compared to the sparse learning-based methods, such as SCM, L1APG, and ASLA, and ensemble learning-based methods, such as MEEM, SRCF has significant advantages in both accuracy and robustness (Table 1).
The comparison of running speed of SRCF and some famous trackers in the OTB-100.
In this study, we present a tracking method based on SRCF. Different from the traditional CF, which only uses the L2 norm for regularization, the proposed method introduces the L2,1 norm to deal with the unstable region and is suitable for the multichannel CNN features. Besides, we also use the ADMM method to solve the problem in the SRCF. The proposed method is tested on many public datasets and outperforms many state-of-the-art tracking methods. In the future, we expect to improve the method's efficiency to satisfy real-time applications.
This work was supported by the National Natural Science Foundation of China (Grant No. 61601021), the Beijing Natural Science Foundation (Grant No. L172022), and the Fundamental Research Funds for the Central Universities (Grant No. 2016RC015).
Yongjin Guo received his BS degree from Wuhan University of Technology in 2001 and MS degree from Tsinghua University in 2008. He is currently a senior engineer in the Systems Engineering Research Institute. His research interests include artificial intelligence, image processing, and computer vision.
Shunli Zhang received his BS and MS degrees from Shandong University in 2008 and 2011, respectively, and his PhD from Tsinghua University in 2016. He is currently a faculty member at Beijing Jiaotong University. His research interests include image processing, pattern recognition, and computer vision.