Robust visual tracking via multiscale deep sparse networks

Abstract. In visual tracking, deep learning with offline pretraining can extract more intrinsic and robust features. It has significant success solving the tracking drift in a complicated environment. However, offline pretraining requires numerous auxiliary training datasets and is considerably time-consuming for tracking tasks. To solve these problems, a multiscale sparse networks-based tracker (MSNT) under the particle filter framework is proposed. Based on the stacked sparse autoencoders and rectifier linear unit, the tracker has a flexible and adjustable architecture without the offline pretraining process and exploits the robust and powerful features effectively only through online training of limited labeled data. Meanwhile, the tracker builds four deep sparse networks of different scales, according to the target’s profile type. During tracking, the tracker selects the matched tracking network adaptively in accordance with the initial target’s profile type. It preserves the inherent structural information more efficiently than the single-scale networks. Additionally, a corresponding update strategy is proposed to improve the robustness of the tracker. Extensive experimental results on a large scale benchmark dataset show that the proposed method performs favorably against state-of-the-art methods in challenging environments.


Introduction
Visual tracking is one of the current research hotspots in computer vision. It has been widely used in many fields, such as visual surveillance, human-computer interface, medical image analysis, 1,2 and so on. Given the initial state of the target (including position, scale, etc.), classical visual trackers track the target by estimating its state in each of the following frames.
In recent years, a large number of tracking algorithms have been proposed. Existing tracking algorithms can be divided into two categories: 3 generative methods and discriminative methods. The former is a "model-driven" approach that uses the target's information to establish a target model and selects the most similar sample as the tracking result. Popular generative methods include incremental visual tracking (IVT), 4 multitask tracking (MTT), 5 and adaptive structural local appearance model (ASLA). 6 The latter is a "data-driven" approach that treats tracking as a binary classification problem between target and background. Some state-of-the-art trackers, such as compressive tracking (CT), 7 tracking-learning-detection (TLD), 8 and multiple instance learning (MIL), 9 are discriminative methods. These trackers, which use hand-crafted features, perform well in simple and controlled environments, but they often drift or lose the target in practical, complicated environments involving illumination variation, deformation, occlusion, motion blur, and so on. Therefore, a challenging gap remains between a robust real-time tracker and realistic applications in extreme and complicated conditions.
The emergence and development of "deep learning" has gradually become a potential solution to the above problems. 10 Different from hand-crafted features, deep learning learns high-level semantic features automatically. Owing to the deep architectures, these features are effective in distinguishing the target from the background. Recently, deep learning-based trackers have become a trend in the visual tracking field because they outperform traditional tracking methods.
However, tracking methods based on deep learning still suffer from some difficulties: 11
1. Numerous data are required to train a robust and stable deep network. However, only a limited number of labeled data are available in an actual tracking scene. The unsupervised pretraining method with numerous auxiliary training datasets 12 solved the problem to some extent, but it still requires high-performance hardware and is complicated and time-consuming. Moreover, the learned generic representation may not be suitable for tracking a specific object.
2. The "gradient vanishing" problem easily occurs in the stochastic gradient descent 13 method during the training of deep networks. It is caused by the saturation of traditional nonlinear activation functions and often makes it difficult to train a robust deep network.
3. Traditional deep learning-based methods track targets via a single-scale deep network. The samples are usually normalized into a unified pattern in the single tracking network. This deforms the target and loses some of the inner structure information of the data. These factors are likely to result in tracking drift to some degree.
In this paper, we propose a multiscale deep sparse network (MDSN) and build a robust tracker: multiscale sparse networks-based tracker (MSNT). The main contributions of our work can be summarized as follows:
1. We propose an MDSN based on the stacked sparse autoencoders (SAE) 14 and rectifier linear unit (ReLU). 15,16 The combination of SAE and ReLU makes the deep network highly sparse and avoids the complex and time-consuming pretraining.
The multiscale networks can retain the inner structure information of targets as much as possible. The architecture improves the robustness of deep networks for different shapes of targets.
A large number of experiments and analyses are carried out on the CVPR2013 tracking benchmark dataset 17 (including 51 challenging videos) with nine recent state-of-the-art trackers. The experimental results show that our tracker achieves outstanding performance in challenging environments and attains a practical tracking speed.

Related Work
The concept of "deep learning" was first proposed by Hinton and Salakhutdinov. 12 Since then, deep learning has attracted wide attention and made great progress. With its robust and efficient features, deep learning has been applied in diverse fields, such as image classification, 14,18 automatic speech recognition, 19 face recognition, 20 and so on.
Recently, deep neural networks (DNNs) have been applied in the visual tracking field. Fan et al. 21 extracted specific features from convolutional neural networks (CNNs) with offline pretraining for human tracking and obtained acceptable tracking results in some complex situations. By training a stacked denoising autoencoder on a large-scale image dataset, the deep learning tracker (DLT) 22 learned generic features and achieved robust tracking performance. Li et al. 23 applied a single CNN to visual tracking tasks without pretraining and combined it with multiple image cues to improve the tracking success rate. Wang et al. 24 used hierarchical features for tracking by training a two-layer CNN on an auxiliary dataset and obtained good results in complicated tracking situations. Zhang et al. 25 proposed a convolutional network-based tracker (CNT), which combined the local structure feature and global geometric information of tracking targets and attained state-of-the-art performance.
The sparse distributed representation (SDR) is the key to learning powerful features in deep learning, while the activation function plays an important role in encouraging sparsity. 26 The performance of the activation function directly influences the effectiveness and robustness of the extracted features. The most popular nonlinear activation functions are "sigmoid" and "tanh." They have been widely used in many deep networks, but they suffer from some drawbacks, 11 such as slow training and, with random initialization, convergence to poor local solutions without good predictive performance. Recently, a sparse activation function called ReLU was proposed in Ref. 15. As illustrated in Fig. 1, different from traditional activation functions, such as sigmoid and tanh, the rectifier function $\mathrm{ReLU}(x) = \max(0, x)$ is a one-sided activation function. It enforces hard zeros in the learned feature representation 26 and leads to the sparsity of hidden units by rectifying the negative output of the hidden units. 16 The sparsity of hidden units has an effect similar to that of pretraining methods. The experimental results in Ref. 27 showed that pretraining leads to more sparsity in deep networks compared to DNNs without pretraining.
Moreover, Glorot's experiments 16 showed that deep networks with ReLU can reach their best performance without any unsupervised pretraining because of their sparsity. Further experiments in Ref. 27 supported this conclusion and showed that pretraining brings no significant improvement for DNNs with ReLU. ReLU was also used in a sparse deep stacking network (S-DSN) for image classification in Ref. 18, where it avoided expensive inference and achieved higher sparsity and better classification performance than S-DSN with sigmoid. Furthermore, the active part of ReLU is an unsaturated linear function, which effectively alleviates the "gradient vanishing" problem in training and speeds up training. Therefore, ReLU is a practical activation function for quickly building sparse and powerful deep architectures without requiring a pretraining process.
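The two properties of ReLU discussed above (hard zeros and a non-saturating active part) can be illustrated with a minimal Python sketch. The zero-mean Gaussian pre-activations and all function names are assumptions made for illustration only; they are not taken from the tracker itself.

```python
import math
import random


def relu(x: float) -> float:
    """Rectifier: positive inputs pass unchanged, everything else becomes a hard zero."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relu_grad(x: float) -> float:
    """Derivative of ReLU: constant 1 on the active side, so it does not saturate."""
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid: shrinks toward 0 for large |x| (saturation)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def zero_fraction(values) -> float:
    """Fraction of outputs that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


# Hypothetical pre-activations: zero-mean Gaussian, as after a random initialization.
rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

relu_sparsity = zero_fraction([relu(v) for v in pre_activations])
sigmoid_sparsity = zero_fraction([sigmoid(v) for v in pre_activations])
```

Under this assumption roughly half of the ReLU outputs are exact zeros, whereas sigmoid produces none, and the ReLU gradient stays at 1 for large positive inputs where the sigmoid gradient is nearly zero.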

Deep Sparse Network Model
Different from sparse coding (SC), sparsity in neural networks attempts to represent the features of the input data using the smallest number of hidden neurons. The feature of objects in sparse neural networks is SDR, 26 which dictates that all representational units participate in data representation while very few units activate for a single data sample. It can exploit more powerful and robust feature representations from the input data. Therefore, it is reasonable to build a deep sparse network model for tracking.
The SAE is a basic unsupervised learning model and is often used in deep learning. In this paper, we use an SAE structure similar to that of Ref. 14 and obtain a deep sparse network by training the stacked SAEs using the "layer-by-layer greedy algorithm." 12 The cost function in the model is defined as
$$E = \sum_{i=1}^{m} \| x_i - \hat{x}_i \|_2^2 + \lambda \left( \| W \|_F^2 + \| W' \|_F^2 \right) + \mu H(\rho \| \hat{\rho}), \tag{1}$$
where $\hat{x}_i$ denotes the reconstruction of $x_i$, $W$ and $W'$ are the weight matrices of the encoder and decoder, respectively, $b$ is the bias vector of the encoder included in $\hat{x}_i$, $m$ is the number of samples, $\lambda$ is a penalty factor that balances the reconstruction loss and weights, $\mu$ is the sparsity penalty factor, and $\| \cdot \|_F$ denotes the Frobenius norm. The cross-entropy $H(\rho \| \hat{\rho})$ is given as
$$H(\rho \| \hat{\rho}) = -\sum_{j=1}^{n} \left[ \rho \log \hat{\rho}_j + (1 - \rho) \log (1 - \hat{\rho}_j) \right], \tag{2}$$
$$\hat{\rho}_j = \frac{1}{k} \sum_{i=1}^{k} h_j(x_i), \tag{3}$$
where $k$ and $n$ are the number of neurons in the input layer and hidden layer, respectively, and $h_j(x_i)$ denotes the activation value of the $j$th hidden neuron for the input $x_i$. The sparsity target $\rho$ is close to 0, and it is set to 0.05 in our experiments.
Refs. 16, 18, and 27 show that ReLU brings inherent sparsity to DNNs, which makes pretraining less effective for DNNs that use the ReLU activation function. Hence, we adopt ReLU as the activation function of the aforementioned deep sparse network and leave out the offline pretraining. Benefiting from the intrinsic sparsity of ReLU, around 50% of the hidden units' output values are real zeros once the deep network is built. This transforms the basic stacked SAEs into a variant, as shown in Fig. 2. Moreover, this percentage of inactive neurons (units that do not activate for any data sample) can easily increase with online training based on the sparsity constraints of SAE. 16
Based on the architecture of Fig. 2(b), a "softmax" classifier layer is added to the model as the last layer to classify the learned features. The logistic regression is included in the softmax classifier layer
$$l_\theta(t) = \frac{1}{1 + \exp(-\theta^{\mathrm{T}} t)}, \tag{4}$$
where $l_\theta(t)$ is a value in [0, 1], i.e., it represents the probability that the sample $t$ is the true target in the visual tracking problem, and $\theta$ is the model parameter. The final model of the deep sparse neural network for tracking is shown in Fig. 3.
During the tracking process, each sampling patch gets a value in [0, 1] through the softmax classifier in the tracking network.
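As a rough illustration of how the SAE cost in this section combines its three terms, the following Python sketch assumes the standard sparse-autoencoder form: squared reconstruction error, a $\lambda$-weighted Frobenius penalty on the encoder and decoder weights, and a $\mu$-weighted cross-entropy sparsity term. The function names and the clipping constant `eps` are hypothetical; the clipping is an added safeguard because ReLU activations are not bounded by 1. The default values of $\lambda$, $\mu$, and $\rho$ are those reported in the experiments.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def cross_entropy_sparsity(rho: float, rho_hat: Sequence[float], eps: float = 1e-8) -> float:
    """Cross-entropy between the sparsity target rho and each mean activation rho_hat[j]."""
    total = 0.0
    for r in rho_hat:
        r = min(max(r, eps), 1.0 - eps)  # assumption: clip so both logarithms are defined
        total -= rho * math.log(r) + (1.0 - rho) * math.log(1.0 - r)
    return total


def frobenius_sq(matrix: Matrix) -> float:
    """Squared Frobenius norm of a matrix given as nested sequences."""
    return sum(v * v for row in matrix for v in row)


def sae_cost(x: Matrix, x_hat: Matrix, w_enc: Matrix, w_dec: Matrix,
             rho_hat: Sequence[float],
             lam: float = 0.005, mu: float = 0.2, rho: float = 0.05) -> float:
    """Reconstruction loss + weight penalty + sparsity penalty (assumed form of the cost)."""
    recon = sum((a - b) ** 2 for xi, xhi in zip(x, x_hat) for a, b in zip(xi, xhi))
    decay = lam * (frobenius_sq(w_enc) + frobenius_sq(w_dec))
    return recon + decay + mu * cross_entropy_sparsity(rho, rho_hat)
```

The sparsity term is smallest when the mean activation of every hidden unit equals the target $\rho$, which is what pushes the network toward sparse codes during online training.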

Tracking Algorithm Based on Multiscale Deep Sparse Networks
A single deep sparse network for tracking is introduced in Sec. 3. However, this fixed deep network architecture is too rigid for practical tracking tasks and cannot adapt to different situations effectively. Based on the single network model described in Sec. 3, we propose an MDSN and combine it with a particle filter framework to cope with complex tracking tasks.

Multiscale Deep Sparse Networks
The conventional neural network for tracking usually normalizes the initial target patch or sampling patches into the same size in the input layer, which effectively reduces the number of input neurons and the complexity of the network. For example, the target patch in the first frame is normalized into a low-resolution (LR) image with 32 × 32 pixels in DLT. 22 However, we observe in several experiments that a fixed normalization for different targets stretches or compresses the targets to various degrees. The deformation damages the inner structure information of the targets, reduces the robustness of the extracted features, and increases tracking drift. In contrast, when different normalization methods are used for targets of different shapes, more inner structure information is preserved and a better tracking result is achieved because the deformation of the targets is reduced.
As shown in Fig. 4, the red bounding box and line represent the tracker based on a 32 × 16 normalization scale (normalization-2) while the green ones represent the 32 × 32 normalization (normalization-1).We clearly observe that the 32 × 16 normalization has better performance than 32 × 32 normalization in this case.
Based on the observations, we propose an MDSN to adapt to different targets and situations effectively. It is called "multiscale" because we build four different architectures of deep sparse networks aimed at four different kinds of situations. The four defined situations of tracked targets and the corresponding architectures of the deep network are as follows:
1. LR-target: The number of pixels inside the initial ground-truth bounding box is less than $t_r$ ($t_r = 400$). 17 In this situation, we normalize the input image patches into 16 × 16 pixel grayscale images and build a six-layer deep network in which the numbers of neurons in each layer are [256 512 256 128 64 1]. The deep architecture has an overcomplete filter layer after the input layer. It captures the image's structure more effectively 22 for LR-targets.
2. Square target (S-target): The target is not LR, and the aspect ratio $r \in [2/3, 3/2]$, where $r = w/h$ and $w$ and $h$ are the width and height of the initial ground-truth bounding box of the target, respectively. In this situation, the width and height of the initial target are approximately equal, so we normalize the input image patches into 32 × 32 grayscale images. Hence, a six-layer deep network with neurons of [1024 512 256 128 64 1] is built.
3. Vertical target (V-target): The target is not LR, and the aspect ratio $r < 2/3$. In this situation, the height is more than 1.5 times the width of the initial target, so we consider the target a V-target and normalize the input image patches into 32 × 16 grayscale images. A five-layer deep network is built, and the numbers of neurons in each layer are [512 256 128 64 1].
4. Horizontal target: The target is not LR, and the aspect ratio $r > 3/2$. Contrary to the V-target, the width is more than 1.5 times the height of the initial target. We normalize the image patches into 16 × 32 grayscale images and build a five-layer deep network of [512 256 128 64 1].
The entire architecture of the MDSN is shown in Fig. 5.
With a new tracking task, MDSN first chooses the corresponding tracking network according to the situations defined above. The multiscale architecture preserves the inner structure information of targets as much as possible, so it improves the robustness of the extracted features.
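The network selection rule described in this section can be sketched as a small Python function. This is only an illustration under stated assumptions: the names are hypothetical, patch sizes are interpreted as (height, width), and the horizontal case is taken as $r > 3/2$, which matches the statement that the width is more than 1.5 times the height.

```python
from typing import Dict, List, Tuple

LR_PIXEL_THRESHOLD = 400  # t_r: pixel-count threshold for low-resolution targets

# Target type -> (normalized patch size as (height, width), neurons per layer)
NETWORKS: Dict[str, Tuple[Tuple[int, int], List[int]]] = {
    "LR": ((16, 16), [256, 512, 256, 128, 64, 1]),
    "S": ((32, 32), [1024, 512, 256, 128, 64, 1]),
    "V": ((32, 16), [512, 256, 128, 64, 1]),
    "H": ((16, 32), [512, 256, 128, 64, 1]),
}


def classify_target(width: float, height: float) -> str:
    """Return the target type from the initial ground-truth bounding box."""
    if width * height < LR_PIXEL_THRESHOLD:
        return "LR"
    r = width / height  # aspect ratio r = w / h
    if r < 2 / 3:
        return "V"  # clearly taller than wide
    if r > 3 / 2:
        return "H"  # clearly wider than tall
    return "S"  # roughly square


def select_network(width: float, height: float) -> Tuple[Tuple[int, int], List[int]]:
    """Pick the normalization size and layer sizes for a new tracking task."""
    return NETWORKS[classify_target(width, height)]
```

For example, a 20 × 60 (width × height) pedestrian box is classified as a V-target and is routed to the 32 × 16 network, so the patch is not stretched into a square before it reaches the input layer.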

Particle Filter Tracking Framework
The particle filter algorithm 22,28 is a popular tracking framework used in the visual tracking field. Let $s_t$ and $z_t$ denote the state and observation of the target at time $t$, respectively. The tracking task can be considered the process of searching for the target's state of maximum probability at time $t$ according to the observations $\{z_{1:t}\}$
$$\hat{s}_t = \arg\max p(s_t | z_{1:t}), \tag{5}$$
where $p(s_t | z_{1:t})$ is the posterior distribution of the target at time $t$. According to the Bayes criterion, we get
$$p(s_t | z_{1:t}) \propto p(z_t | s_t) \int p(s_t | s_{t-1}) \, p(s_{t-1} | z_{1:t-1}) \, \mathrm{d}s_{t-1}. \tag{6}$$
The particle filter algorithm estimates the posterior distribution through a set of random particles $\{s_t^i\}_{i=1}^{N}$ with corresponding weights $\{\omega_t^i\}_{i=1}^{N}$, where $N$ denotes the number of sampling particles and the initial weights are $1/N$. The particle weights are prone to degeneracy, so the weights are updated as follows:
$$\omega_t^i = \omega_{t-1}^i \, \frac{p(z_t | s_t^i) \, p(s_t^i | s_{t-1}^i)}{q(s_t^i | s_{t-1}^i, z_t)}, \tag{7}$$
where $q(s_t^i | s_{t-1}^i, z_t)$ is the proposal distribution, which depends on the particle distribution at time $t - 1$ and the observation at time $t$. Additionally, it is often simplified to a first-order Markov process $q(s_t^i | s_{t-1}^i)$, which is independent of the current observation. Thus, the update formulation can be simplified to
$$\omega_t^i = \omega_{t-1}^i \, p(z_t | s_t^i). \tag{8}$$
Meanwhile, the weights should be further normalized to satisfy the equation
$$\sum_{i=1}^{N} \omega_t^i = 1. \tag{9}$$
In our proposed algorithm, we use the particle filter to randomly sample candidate patches around the last tracking result; then, we send the sampled patches to the tracking network proposed in Sec. 4.1. We get the confidence coefficient $\varsigma_i$ through the classifier layer, i.e., the posterior distribution $p(s_t | z_{1:t}) = \varsigma_i$, and then we choose the maximum $\varsigma_i$ to get the current target's state by Eq. (5). Meanwhile, to adapt to changes of the target's scale during tracking, a random disturbance $r = (w_r, h_r)$ is added to the width and height of the candidate patches. In this paper, $w_r$ and $h_r$ obey a normal distribution with zero mean and a variance of 0.01.
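The sampling, weight update, and state selection steps of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the position noise `pos_std` is a hypothetical value, the scale disturbance is interpreted as a relative change of width and height (the text does not state its units), and the proposal is taken as the first-order transition so that the weight update reduces to a multiplication by the observation likelihood.

```python
import random
from typing import List, Sequence, Tuple

State = Tuple[float, float, float, float]  # (x, y, width, height)


def sample_particles(prev: State, n: int, rng: random.Random,
                     pos_std: float = 4.0, scale_var: float = 0.01) -> List[State]:
    """Draw n candidate states around the previous tracking result."""
    scale_std = scale_var ** 0.5  # variance 0.01 -> standard deviation 0.1
    x, y, w, h = prev
    particles = []
    for _ in range(n):
        particles.append((
            x + rng.gauss(0.0, pos_std),
            y + rng.gauss(0.0, pos_std),
            max(1.0, w * (1.0 + rng.gauss(0.0, scale_std))),  # assumed relative disturbance
            max(1.0, h * (1.0 + rng.gauss(0.0, scale_std))),
        ))
    return particles


def update_weights(prev_weights: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """Multiply each weight by its observation likelihood, then normalize to sum to 1."""
    unnormalized = [w * p for w, p in zip(prev_weights, likelihoods)]
    total = sum(unnormalized)
    if total <= 0.0:
        # All likelihoods vanished: fall back to uniform weights.
        return [1.0 / len(unnormalized)] * len(unnormalized)
    return [u / total for u in unnormalized]


def estimate_state(particles: Sequence[State], confidences: Sequence[float]) -> State:
    """Return the particle with the highest classifier confidence."""
    best = max(range(len(particles)), key=lambda i: confidences[i])
    return particles[best]
```

In a full tracker, `confidences` would be the outputs of the classifier layer for the image patches cropped at each particle; here they are left as plain numbers so that the sketch stays self-contained.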

Online Training and Updating Strategy for Tracking Network
After determining the corresponding tracking network, the tracking network with random initialization cannot satisfy the requirements of the specific tracking task, so we adjust the network parameters using specific labeled samples by online training.
In specific tracking tasks, we need to collect enough positive and negative samples to train the network while only the initial state $s_0 = \{x_0, y_0, w_0, h_0\}$ is given. Here, $(x_0, y_0)$ denotes the initial position of the target and $w_0$ and $h_0$ denote the width and height, respectively. In our proposed method, we randomly collect 10 positive samples close to the target's center and 100 negative samples far away from the target. By training the tracking network with these positive and negative samples at the beginning of tracking, we obtain a network adapted to the specific task.
A robust tracking algorithm should be able to track the target consistently without drifting or losing it, which requires the tracker to adjust its parameters adaptively as the environment changes. The condition for updating the proposed method is as follows:
$$\max_i \varsigma_i < \tau \quad \text{or} \quad fn > \eta, \tag{10}$$
where $\tau$ is the update threshold, $fn$ is the number of frames accumulated since the last update, and $\eta$ is the maximum number of accumulated frames. If Eq. (10) is satisfied, the current tracking result is added to the positive sample set, and the negative samples are randomly sampled again in the current frame. Then, the tracking network is retrained with the updated positive and negative samples to complete the update.
Fig. 5 The architecture of MDSN.
Table 1 The main steps of MSNT algorithm.
Output: Tracking results, i.e., the estimated object state $\hat{s}_i$ for frame $i$.
Step 1 Determine the tracking network based on the target type with $s_0$ and initialize the network.
Step 2 Collect positive sample patches and negative sample patches to train the network online.
Step 3 For $i = 1, 2, \ldots, n$:
Step 3.1 Do particle sampling to get $N$ sample patches in the neighborhood of $(x_{i-1}, y_{i-1})$;
Step 3.2 Send the sample patches to the tracking network to get the confidence coefficient $\varsigma_i$;
Step 3.3 Choose the maximum $\varsigma_i$ to get the estimated state by Eq. (5);
Step 3.4 Update the network according to Eq. (10) and the updating strategy.
Step 4 End of the image sequences.
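The update strategy of this section can be sketched as a small scheduler. The exact form of the update condition is an assumption here: the sketch triggers an update when the best confidence in a frame falls below $\tau$ or when more than $\eta$ frames have passed since the last update. The class and method names are hypothetical; the defaults $\tau = 0.9$ and $\eta = 50$ are the values reported in the experiments.

```python
class UpdateScheduler:
    """Decides, frame by frame, whether the tracking network should be retrained."""

    def __init__(self, tau: float = 0.9, eta: int = 50) -> None:
        self.tau = tau  # confidence threshold for updating
        self.eta = eta  # maximum number of frames between two updates
        self.frames_since_update = 0  # fn: frames accumulated since the last update

    def step(self, max_confidence: float) -> bool:
        """Register one tracked frame and report whether an update is due.

        Assumed rule: update when the best confidence drops below tau
        or when more than eta frames have accumulated.
        """
        self.frames_since_update += 1
        if max_confidence < self.tau or self.frames_since_update > self.eta:
            self.frames_since_update = 0
            return True
        return False
```

When `step` returns `True`, the caller would add the current tracking result to the positive set, resample the negatives in the current frame, and retrain the network, as described above.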

Overall Process of Algorithm
Integrating the above description of the key components, we present a visual tracking method MSNT via the proposed MDSNs.The main steps of MSNT are shown in Table 1, and the flow chart of the overall algorithm is shown in Fig. 6.

Experiments
The proposed MSNT algorithm is implemented in MATLAB® on an experimental platform with a CPU (Intel Xeon 2.4 GHz) and a GPU (TITAN X). The initial parameters of the MSNT are as follows: $\lambda = 0.005$, $\rho = 0.05$, $\mu = 0.2$, $\eta = 50$, and $\tau = 0.9$. In addition, we set the learning rate $\xi$ to 0.01 during the online training. The weight matrix $W$ is randomly initialized according to Eq. (11), where $W_{i,i+1} \in W$ denotes the weight matrix between the $i$th layer and the $(i+1)$th layer, $n_i$ and $n_{i+1}$ denote the numbers of neurons of the $i$th and the $(i+1)$th layer, and $\mathrm{rand}(n_i, n_{i+1})$ generates a random $n_i \times n_{i+1}$ matrix with uniform distribution between 0 and 1. Therefore, $W$ is randomly initialized into $[-0.5, 0.5]$ of the microweights, and the weights are different in different layers.
To verify the validity of our proposed method, the one-pass evaluation (OPE) as in Ref. 17 is used in our experiments. The MSNT algorithm is evaluated on the tracking benchmark dataset, 17 which includes 51 fully annotated videos. We compare the performance of our tracker with nine state-of-the-art trackers, including DLT, 22 CNT, 25 kernelized correlation filters (KCF), 29 tracking with Gaussian processes regression (TGPR), 30 sparsity-based collaborative model (SCM), 31 Struck, 32 structural sparse tracking (SST), 33 linearization to nonlinear learning tracker (LNLT), 34 and circulant sparse tracker (CST). 35 A brief introduction of these referenced trackers is shown in Table 2, and their tracking results are provided by their authors. Qualitative and quantitative comparisons are carried out to evaluate the performance of our tracker. The detailed and color comparisons can be obtained in the online version of this paper.

Qualitative Comparisons
In qualitative comparisons, eight challenging sequences are selected to evaluate the MSNT intuitively. The results are shown in Fig. 7, and the different colors indicate different trackers. We analyze the trackers qualitatively from the following aspects:
1. Illumination variation: Taking the video "Coke" as an example, when the illumination changes dramatically, MSNT, TGPR, and Struck can always track the target correctly, but the others lose the target easily.
2. Scale variation: Taking the videos "Car4" and "Singer1" as examples, MSNT can adapt to the scale variation of the target, but KCF, Struck, TGPR, and CST cannot adjust the size of the bounding box adaptively; TGPR even drifts in Singer1.
3. Occlusion: It indicates the full or partial occlusion of the target by the background or other objects. For frame #78 in "Jogging-1," when the full occlusion disappears gradually, only MSNT and LNLT can track the target immediately and accurately. For the partial occlusion in "Tiger1," only MSNT can track the target from beginning to end.
4. Fast motion: For the target in "Deer," the motion of the target is fast and even blurs the target region. KCF, DLT, SST, SCM, and LNLT fail to track the target when it moves too fast, but MSNT can always track the target very well.
5. Background clutter: In "Girl," tracking drift arises in KCF, TGPR, CST, and LNLT when a background similar to the target appears in the tracking region, such as in frames #441 and #470, while MSNT tracks the target successfully.
6. LR: For LR targets, such as in "Freeman4," the target is too small to provide enough features. Owing to the overcomplete layer appended to the "LR-target" tracking network, MSNT captures more available features to track the target robustly.
7. Rotation: It is divided into in-plane and out-of-plane rotation. Both rotations occur in Girl, in which MSNT tracks the target consistently and the sizes of the bounding boxes match the target well.

Quantitative Comparisons
To evaluate our tracker comprehensively and reliably, we use four quantitative evaluation metrics, which are introduced in Ref. 17, to carry out quantitative analysis.
1. Overlap rate: Given the ground-truth bounding box $S_G$ and the tracked bounding box $S_T$, the overlap rate is defined as $\alpha = \frac{|S_T \cap S_G|}{|S_T \cup S_G|}$, where $\cap$ and $\cup$ represent the intersection and union of two regions, respectively, and $|\cdot|$ denotes the area of the region. A larger overlap rate indicates a better performance of the tracker.
2. Center location error (CLE): It is defined as the Euclidean distance between the center locations of the tracked results and the manually annotated ground truths. A smaller CLE indicates a better performance of the tracker.
3. Success rate: Success rate is associated with the overlap rate $\alpha$. Given a threshold $t_0$, the target is considered to be tracked successfully if and only if $\alpha > t_0$. The success rate is defined as the percentage of successful frames, and a larger value indicates a better performance of the tracker.
4. Precision: Precision shows the ratio of frames whose CLE is within a given threshold, and a larger value indicates a better performance of the tracker.
In our experiments, we quantitatively analyze our tracker from three aspects: the tracking performance for a single sequence, the overall performance, and the attribute-based performance for 51 sequences. 17
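The four metrics above can be computed with a few lines of Python. In this sketch, boxes are assumed to be axis-aligned and given as (x, y, width, height) with (x, y) the top-left corner; the function names and the default thresholds are chosen for illustration.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), top-left corner


def overlap_rate(tracked: Box, truth: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    ix = max(0.0, min(tracked[0] + tracked[2], truth[0] + truth[2]) - max(tracked[0], truth[0]))
    iy = max(0.0, min(tracked[1] + tracked[3], truth[1] + truth[3]) - max(tracked[1], truth[1]))
    inter = ix * iy
    union = tracked[2] * tracked[3] + truth[2] * truth[3] - inter
    return inter / union if union > 0 else 0.0


def center_location_error(tracked: Box, truth: Box) -> float:
    """Euclidean distance between the two box centers."""
    dx = (tracked[0] + tracked[2] / 2) - (truth[0] + truth[2] / 2)
    dy = (tracked[1] + tracked[3] / 2) - (truth[1] + truth[3] / 2)
    return math.hypot(dx, dy)


def success_rate(tracked: Sequence[Box], truth: Sequence[Box], t0: float = 0.5) -> float:
    """Fraction of frames whose overlap rate exceeds the threshold t0."""
    hits = sum(1 for a, b in zip(tracked, truth) if overlap_rate(a, b) > t0)
    return hits / len(truth)


def precision(tracked: Sequence[Box], truth: Sequence[Box], threshold: float = 20.0) -> float:
    """Fraction of frames whose CLE is within the given pixel threshold."""
    hits = sum(1 for a, b in zip(tracked, truth) if center_location_error(a, b) <= threshold)
    return hits / len(truth)
```

As a quick check, two 10 × 10 boxes shifted horizontally by 5 pixels share an area of 50 out of a union of 150, so their overlap rate is 1/3 and their CLE is 5 pixels.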

Tracking performance for a single sequence
The eight challenging sequences introduced in Sec. 5.1 are used to compare the tracking performance for a single sequence quantitatively.
Figure 8 and Table 3 show the overlap rate plots and the success rates at the success threshold $t_0 = 0.5$, respectively, of the 10 trackers on the eight challenging sequences. From Fig. 8, we see that the overlap rates of our tracker are always at a high level in these eight challenging sequences, and the success rates of our tracker in Table 3 are higher than those of most other trackers. These metrics show that our tracker achieves a good tracking success rate for single sequences in different challenging scenes.
Figure 9 and Table 4 show the CLE plots and the average CLEs of the trackers, respectively, on eight challenging sequences.In the tracking process for a single sequence, our tracker maintains lower center errors compared to others and achieves a low tracking error for the whole sequence.These quantitative metrics show that our tracker possesses a higher precision during the tracking process.

Overall performance for 51 sequences
For evaluating our tracker's overall performance for 51 sequences in the benchmark, 17 we plot the success plots and the precision plots of the above 10 trackers.The success plot shows the success rates at a varied overlap threshold t 0 in the interval [0, 1], and the precision plot shows the precisions at a varied CLE threshold from 0 to 50 pixels.Furthermore, to verify the effectiveness of the multiscale tracking networks of our tracker, a single network tracker based on the S-target network, named single network-based tracker (SNT) algorithm, is used for comparison.
Figure 10 shows the overall performance comparisons of 11 trackers based on success plots and precision plots. These trackers are ranked according to the area under curve (AUC) values of the success plots in Fig. 10(a) and the precision values at the threshold of 20 pixels in Fig. 10(b). For success plots, MSNT achieves an AUC value of 0.564 and ranks first among the 11 trackers. Compared with DLT and CNT, which are also based on deep learning, the value of MSNT is improved by 29.4% and 4.0%, respectively. For precision plots, the precision of MSNT reaches 0.753 and ranks second, behind only CST with 0.777. Similarly, the precision of MSNT is 28.3% and 4.1% higher than that of DLT and CNT, respectively.
Table 3 The success rates at the success threshold $t_0 = 0.5$.
Analyzing the success plots and precision plots of MSNT and SNT, we find that MSNT clearly improves on SNT in both metrics. MSNT improves the AUC value by 6.6% over SNT in the success plots and improves the precision by 8.5% over SNT. These results suggest that our proposed multiscale networks can extract more robust and effective features and perform better than the single, fixed network.
These experimental data and the above analyses illustrate that our tracker outperforms these state-of-the-art trackers and achieves satisfactory tracking results in different challenging scenarios.
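The AUC ranking used for the success plots can be sketched as follows. This sketch assumes that the success plot is sampled at evenly spaced overlap thresholds in [0, 1] and that the AUC is approximated by the mean success rate over those thresholds; the number of thresholds (21) and the function names are assumptions for illustration.

```python
from typing import List, Sequence


def success_rate_at(overlaps: Sequence[float], threshold: float) -> float:
    """Fraction of frames whose overlap rate exceeds the threshold."""
    return sum(1 for a in overlaps if a > threshold) / len(overlaps)


def success_plot(overlaps: Sequence[float], num_thresholds: int = 21) -> List[float]:
    """Success rates at evenly spaced thresholds from 0 to 1 (inclusive)."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    return [success_rate_at(overlaps, t) for t in thresholds]


def success_plot_auc(overlaps: Sequence[float], num_thresholds: int = 21) -> float:
    """Approximate the area under the success plot by the mean success rate."""
    rates = success_plot(overlaps, num_thresholds)
    return sum(rates) / len(rates)
```

Because a single score summarizes the whole curve, ranking by AUC does not depend on one particular overlap threshold such as 0.5.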

Attribute-based performance for 51 sequences
To further analyze the performance of our tracker under different tracking conditions, we evaluated these trackers on 11 attributes, which are defined in Ref. 17. The success plots and precision plots on different attributes are shown in Figs. 11 and 12, respectively. Among the 11 attributes, MSNT ranks first in eight attributes (including "illumination variation," "out-of-plane rotation," "scale variation," "occlusion," "fast motion," "in-plane rotation," "out of view," and "LR") and outperforms SNT in all attributes in Fig. 11. Only on the attributes of "deformation" and "background clutter" does MSNT not rank in the top 3 in the success plots.
For the precision plots in Fig. 12, MSNT ranks in the top 3 on eight attributes and outperforms SNT in all attributes. In particular, on the attributes of fast motion, out of view, and LR, MSNT has the best tracking precision. However, MSNT performs worse on the attributes of illumination variation, deformation, and background clutter than some trackers, such as CST, KCF, and LNLT. We draw several observations from these attribute-based data. First, our tracker achieves good performance on most attributes, especially fast motion, out of view, and LR. Second, our tracker cannot perform as well as CST and KCF on some attributes, such as deformation and background clutter, especially in the precision plots. These may be the next research areas for improving our tracker. Third, MSNT outperforms SNT in all attributes in both success plots and precision plots, which further demonstrates the effectiveness of a multiscale tracking network.

Tracking Speed of Tracker
On our experimental platform, our tracker achieves a practical tracking speed of 13.2 frames per second (FPS) on average for the 51 sequences. Table 5 shows the tracking speed of the above 10 trackers. All the data are published by the authors in their papers. The "-" indicates that the authors do not give the tracking speed explicitly. From Table 5, we see that KCF has the highest tracking speed, and our tracker is faster than CST, SST, TGPR, and SCM. Compared to DLT, which is also based on deep learning, our tracker is slightly slower, but it avoids the complex and time-consuming pretraining process. This property makes the establishment and adjustment of tracking networks simpler and more flexible than in DLT.

Discussion
For a more thorough evaluation, we also add the following recent trackers with their corresponding results (success rate, precision, and FPS) to the comparison: STCT (0.640, 0.780, 2.5), 36 RTT (0.588, 0.827, 3 to 4), 37 and DLRT (0.512, 0.694, 3). 38 Among these trackers, the proposed MSNT (0.564, 0.753, 13.2) achieves better performance than DLRT but is inferior to STCT and RTT. Nevertheless, our tracker has a faster processing speed than these trackers and achieves performance comparable to RTT in success rate and to STCT in precision. However, our tracker still has room to improve compared with the best tracker, STCT. STCT regards the CNN as an ensemble of base learners and trains the convolutional layers with random binary masks. These techniques reduce the correlation between the learned features and prevent overfitting effectively, although they lead to a higher computational cost to some degree. Like the random binary masks in STCT, a similar trick, "Dropout," 39 may be used in our tracker to further avoid overfitting.
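To make the Dropout idea concrete, the following Python sketch applies a random binary mask to the output of a ReLU hidden layer. It is a minimal illustration of the commonly used "inverted" form of Dropout, not the authors' code or the STCT masking scheme; the function names, the drop rate, and the layer shapes are assumptions.

```python
import numpy as np

def relu(x):
    """Rectified linear unit."""
    return np.maximum(x, 0.0)

def dropout(h, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest.

    Rescaling by 1 / (1 - rate) keeps the expected activation unchanged,
    so no extra scaling is needed when the mask is switched off at test time.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must lie in [0, 1)")
    if not training or rate == 0.0:
        return h
    keep = 1.0 - rate
    mask = rng.random(h.shape) < keep  # random binary mask
    return h * mask / keep

def hidden_layer(x, W, b, rate, rng, training=True):
    """One ReLU hidden layer followed by dropout (hypothetical layer layout)."""
    return dropout(relu(x @ W + b), rate, rng, training)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 16))        # 8 samples, 16 input features (assumed sizes)
    W = rng.standard_normal((16, 4)) * 0.1  # 4 hidden units
    b = np.zeros(4)
    print(hidden_layer(x, W, b, rate=0.5, rng=rng).shape)
```

During online training the mask would be resampled for every mini-batch, and `training=False` would be used when scoring candidate particles.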
In this paper, we propose a simple but effective MDSN for achieving real-time tracking. A robust tracker is built on the MDSN without offline pretraining on an auxiliary dataset, and the tracker alleviates the "gradient vanishing" problem in the training process owing to the ReLU activation function. However, as shown in Fig. 13, there are some serious failure cases for our MSNT tracker. In "Bolt," the runners have similar appearances, so it is difficult to discriminate the correct target from the others. In "Ironman," a combination of factors (including intense lighting changes, similar background, fast motion, and rotation) leaves the tracker unable to differentiate the dark target from the noisy background effectively. Finally, in "MotorRolling," the significant rotation and deformation of the target cause the tracking failure. In these cases, our tracker easily makes its biggest errors, i.e., tracking drift and target loss at the beginning of tracking. An analysis of these failure cases suggests that deformation and background clutter of targets may be the main causes of failure for the proposed MSNT. Moreover, the rankings of our tracker in Figs. 11 and 12 also indicate that MSNT has relatively poor performance on the attributes of deformation and background clutter. In addition, the above trick for preventing overfitting, such as Dropout, can improve the performance of our tracker to some degree; combination with more robust and semantic feature extractors, such as CNNs (in Ref. 36) or RNNs (in Ref. 37), may be a potential solution for our method on both challenging attributes. These problems will be interesting research directions in our future work.

Conclusions
In this paper, we proposed an MDSN for extracting robust and powerful features for visual tracking. The intrinsic sparsity of the networks avoids offline pretraining with an auxiliary image dataset and exploits sparser and more robust feature representations. The multiscale networks adaptively select the corresponding tracking network based on the shape of the target, which captures more useful structural information of the target. By combining the MDSN with the particle filter tracking framework, we propose the MSNT tracker to solve the tracking problem. In quantitative and qualitative comparisons with state-of-the-art trackers on the challenging tracking benchmark dataset, our proposed tracker achieves satisfactory results and a practicable tracking speed. Furthermore, there are several possible directions to investigate in detail for this work. First, blocking techniques, such as histograms of oriented gradients (HOG), 40 can be used in the proposed method to improve the performance on the attributes of deformation and background clutter. Second, CNNs and other feature extractors may be combined with our proposed method to exploit more robust and semantic features for tracking. Third, more effective classification methods, such as the support vector machine, may be employed instead of the softmax classifier, which may further improve the robustness of tracking.
Zhiqiang Hou graduated from AFEU and received his MS degree in 1998. He received his PhD from Xi'an Jiaotong University in 2005. He was a visiting scholar at the University of Bristol, UK, in 2009. He is currently a professor at AFEU. His research interests include pattern recognition, computer vision, image processing, and information fusion.
Wangsheng Yu received his MS degree and PhD in communication and information systems from the AFEU in 2010 and 2014, respectively. He is currently a lecturer at the Information and Navigation College, AFEU. His research interests include computer vision and image processing.
Yang Xue received his BS degree in information engineering from AFEU, Xi'an, China, in 2015. He is currently working toward his graduate degree in the group of Professor L. H. Ma

2. Due to unsaturation and constancy of the gradient of ReLU, the "gradient vanishing" problem of training is effectively alleviated by MDSN. It also makes the online training of the deep networks easier and faster.
3. Combined with the particle filter framework, we built a simple but effective tracker named MSNT by the MDSN for overcoming the weakness of traditional trackers based on a single network. MSNT can automatically choose the corresponding tracking network according to different targets. It further improves the robustness of the trackers based on a single network.

Fig. 2 The basic stacked-SAEs and its variant with ReLU: (a) the basic stacked-SAEs and (b) the variant of stacked-SAEs with ReLU.

Fig. 4 Comparisons for two trackers based on different normalizations (red bounding box and line represent 32 × 16 normalization while green ones represent 32 × 32 normalization). (a) The tracking results of two trackers based on different normalizations and (b) CLEs of two trackers based on different normalizations.

Fig. 11 The success plots of OPE for the trackers on different attributes.

Fig. 12 The precision plots of OPE for the trackers on different attributes.

Fig. 13 Some failure cases for our tracker. Red boxes show our results and the yellow ones are the ground truth. (a) Bolt, (b) Ironman, and (c) MotorRolling.

Table 2 Brief introduction to nine referenced trackers.
5 of the trackers on eight challenging sequences.
Note: %, the best results are in bold and the second best in italics.

Table 4 Average CLEs of the trackers on eight sequences.

Table 5 The tracking speed comparison for the 10 trackers.
His research interests include quantum information and machine learning.
Zefenfen Jin received her BS degree from Hunan University, Changsha, China, in 2015. She is currently a master's candidate at the Information and Navigation College, AFEU. Her current research interests include image processing, pattern recognition, and visual tracking.
Bo Dai received his BS and MS degrees from AFEU in 2014 and 2016, respectively. His current research interests include computer vision, image processing, and machine learning.
Optical Engineering, Vol. 56(4). Wang et al.: Robust visual tracking via multiscale deep sparse networks