Online visual tracking based on selective sparse appearance model and spatiotemporal analysis

Abstract. To tackle robust visual tracking in complex environments, an online algorithm based on a generative model is proposed. The target is represented with overlapped local patches selected by key point proportion ranking, and its location is estimated by spatiotemporal analysis. Temporally, a propagated affine warping dynamical model is introduced. Spatially, an observation model based on weighted sparse representation and geometric confidence inference is established. Both the selection pattern and the templates are periodically updated to adapt to the target's appearance variation. Experiments demonstrate that the proposed approach achieves more favorable performance than classical works on challenging image sequences.


Introduction
Visual tracking is an important topic in the computer vision community and has been intensively investigated in recent decades. It lays the foundation for high-level visual problems such as motion analysis and behavior understanding. [3][4] Recently, methods that track targets while evolving the appearance model in an online manner, called online visual tracking, have become popular. 5 An online visual tracking method typically follows the Bayesian inference framework 6 and mainly consists of three components: an appearance representation scheme, a dynamical model (or state transition model), and an observation model. Of these components, the first concerns how uniquely the target appearance is formulated, the second describes the target states and their interframe relationship, and the third evaluates the likelihood of an observed image patch belonging to the object class. Evolving the appearance model introduces several challenges. For example, the evolution incurs the risk of including wrong measurements and thus causes the tracking window to drift from the target. Moreover, the tracker must be able to evaluate online the quality of the estimated result in the last frame, so that it can adjust that result's contribution to the model update in the current frame. Although visual tracking has been intensively investigated, many challenges remain, such as partial occlusion, appearance variation, scale change, significant motion, and cluttered background. These challenges make efficient online visual tracking a difficult task.
In this paper, an online visual tracking algorithm is proposed based on a selective sparse appearance model and spatiotemporal analysis. Compared with other online tracking methods, the main contributions of this work are as follows: (1) For representation, a selective sparse appearance model is proposed based on key patch selection, which establishes a balance between flexibility and uniqueness in target representation. (2) Temporally, an adaptive dynamical model is introduced based on target state analysis and joint-Gaussian propagation. The sampling covariance matrix is updated over time in view of the previous tracking results, unlike the fixed-parameter proposals in other tracking algorithms. (3) Spatially, a geometric inference method is proposed to measure the appearance similarity for observation modeling. Unlike the maximum a posteriori (MAP) estimation in other generative works, target location estimation in this paper is conducted by confidence inference using a portion of the most similar candidates. Evaluations on numerous image sequences have been conducted, and the results demonstrate more satisfactory performance than state-of-the-art online algorithms.
The remainder of this paper is organized as follows. Related works are presented in Sec. 2. In Sec. 3, a general description of the proposed algorithm is given. Details of the target representation scheme are described in Sec. 4, whereas the tracking framework based on Bayesian inference is proposed in Sec. 5. Experimental results and discussions are given in Sec. 6. In Sec. 7, concluding remarks and possible directions for future research are provided.
Related Works and Context
Visual tracking has been studied for several decades. In this section, studies related to our work are summarized. A thorough survey can be found in the related references. 1,7

Appearance Representation in Visual Tracking
Representation of the target is basic but important to appearance-based visual tracking. Discrimination capability, computational efficiency, and occlusion resistance are generally considered the three main aspects for evaluation. Early tracking works construct the scheme in the form of feature points, 8 contours, 2 or silhouettes. 3 For online visual tracking, the schemes are classified into patch-based schemes (e.g., holistic gray-level image vectors 9,10 and fragments [11][12][13]), feature-based schemes, [14][15][16][17] statistics-based schemes, [18][19][20][21] and their combinations. Among patch-based schemes, Yang et al. 12 propose an attentional visual tracking algorithm that extracts, at an early stage, a pool of attentional regions with good localization properties. Zhou et al. 13 explore informative fragments based on human detectors to compose the reference model during the tracking process.
In target representation, taking the whole target region could be a good choice, since it collects all the visual information from the target and can be directly implemented without additional processing. However, such a scheme can be blunt and inflexible, especially when the target appearance varies sharply or when occlusion or abrupt motion occurs. Moreover, since the target is labeled using rectangles, the region inside the labeling rectangle but outside the target area could negatively affect the tracking performance. Practically, all visual data of the target needs to be further processed, which results in heavy computation. Discovering features or regions with little variance in scale, rotation, and translation is important in visual tracking. 8 Feature points are advantageous for their uniqueness and flexibility in appearance representation. However, a number of previous works merely transform visual information into data statistics, which lacks generalization capability. It also prevents further processing directly from the visual aspect. Moreover, intrinsic visual characteristics, such as continuity and sparsity, cannot be further exploited. Though targets could be jointly represented based on features and holistic regions, complicated calculations might cause slow processing speed.

Particle Filtering for Online Visual Tracking
Particle filtering is a Bayesian sequential importance sampling technique for estimating the posterior distribution of state variables characterizing a dynamical system. For visual tracking, various improved works have been proposed since the condensation algorithm. 2 In current online visual tracking, it is regarded as a dynamical modeling method. Ross et al. 9 propose a variant of the condensation algorithm called affine warping. They model the target state $X_t$ by a Gaussian distribution around the previous state $X_{t-1}$, $p(X_t \mid X_{t-1}) = \mathcal{N}(X_t; X_{t-1}, \Psi)$, where $\Psi$ is an affine covariance vector. Kwon et al. 22 propose a geometric method, where the two-dimensional (2-D) affine motion of a given target is estimated by means of coordinate-invariant particle filtering on the 2-D affine Lie group Aff(2). Mei and Ling 19 treat the local target motion as a constant velocity model and add the latest horizontal and vertical velocities to the translation parameters.
However, these works only consider the target state in the latest frame, which can be regarded as a one-dimensional (1-D) Markovian chain. They fail to make use of earlier tracking results. Moreover, they predefine the covariance matrix manually for each image sequence and keep it fixed throughout the tracking process. Decoupling the covariance from the tracking results in this way can prevent the sampler from finding better candidates, so the tracking performance might be negatively affected.
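The fixed-covariance Gaussian state transition described above can be sketched as follows. This is a minimal illustration rather than the implementation of Ross et al.: the function name, the example state, and the variance values are hypothetical, and each of the six state elements is assumed to be perturbed independently.

```python
import random

def sample_candidates(prev_state, variances, num_particles, rng):
    """Draw candidate states X_t ~ N(X_{t-1}, diag(variances))."""
    candidates = []
    for _ in range(num_particles):
        candidate = [
            rng.gauss(mean, var ** 0.5)  # gauss() expects a standard deviation
            for mean, var in zip(prev_state, variances)
        ]
        candidates.append(candidate)
    return candidates

# Hypothetical six-element state: x, y, rotation, scale, aspect ratio, skew.
prev_state = [120.0, 80.0, 0.0, 1.0, 1.0, 0.0]
# Illustrative variances only; they stay fixed for the whole sequence here.
variances = [16.0, 16.0, 0.0004, 0.0001, 0.0001, 0.0]
rng = random.Random(0)
candidates = sample_candidates(prev_state, variances, 600, rng)
```

Because the variances never change, the spread of the candidates is the same whether the target is still or moving quickly, which is the limitation noted above.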

Online Generative Visual Tracking with Sparse Representation
Observation modeling refers to a similarity evaluation process between the sampled candidates and the target, and can be classified into three categories: 7 generative methods, discriminative methods, and hybrid methods. Generative methods search for the target observation with minimal predefined error under specific evaluation criteria, whereas discriminative ones attempt to maximize the margin between the target and nontarget regions using classification techniques. Hybrid trackers integrate the two methods above into a combined framework. Specifically, generative visual trackers include mixture models, 23,24 integral histogram, 11 subspace learning, 9,10 sparse representation, [19][20][21][25][26][27] visual tracking decomposition, 28 covariance tracking, 29 etc. They often drive the localization procedure by a maximum-likelihood or a MAP formulation relying on the target appearance model. Jepson et al. 23 design an elaborate mixture model with an online expectation-maximization algorithm to explicitly model the appearance changes during tracking. Adam et al. 11 decompose the template into fragments and vote on the possible positions and scales of the target by comparing their histograms with the corresponding candidate counterparts. Ross et al. 9 propose a generalized tracking framework based on the incremental principal component analysis subspace learning method with a sample mean update. Li et al. 29 explore the log-Euclidean Riemannian metric for statistics based on the covariance matrices of target features. Kwon and Lee 28 decompose the target observation model into multiple basic object models, and a compound tracking scheme is then established by information integration and exchange via interactive Markov chain Monte Carlo (IMCMC). Cruz-Mota et al. 10 introduce spatial and temporal weights to the algorithm proposed by Ross et al. 9 and establish an incremental temporally weighted visual tracking algorithm with spatial penalty (ITWVTSP).
Sparse representation follows the native linear combination characteristics and can capture the region similarity in a more efficient way. 30,31 It was first introduced to visual tracking by Mei and Ling. 19 They propose an $\ell_1$ minimization tracking algorithm, where the target is approximately spanned by target templates and trivial templates. The candidate with the smallest projection error is taken as the estimated tracking result. Liu et al. 20 model the target appearance based on a static sparse dictionary and a dynamically updated basis distribution, which is learned by K-selection and sparse-constraint-regularized mean shift. Bao et al. 32 apply the accelerated proximal gradient (APG) optimization approach to achieve real-time tracking performance. Bai and Li 26 construct the target appearance using a sparse linear combination of structured subspace unions, which consists of a learned eigen template set and a partitioned occlusion template set. Jia et al. 27 propose a structural local sparse appearance model to represent the target and introduce an alignment pooling method for location estimation.

General Description of Proposed Online Visual Tracking Algorithm
In this paper, we continue to explore the partial selection routines in appearance representation inspired by Yang et al. 12 and Zhou et al., 13 and a generative online visual tracking algorithm is proposed based on a selective sparse appearance model and spatiotemporal analysis. The workflow diagram is shown in Fig. 1. Once the target region is divided into overlapped patches, key patches are selected as the representation of the target based on key point proportion ranking (KPPR). Accordingly, masked sparse representation is introduced to compute the patch coefficients based on elastic net regularization. In dynamical modeling, candidates are sampled based on temporal affine warping propagation. State analysis is conducted based on the joint Gaussian assumption and tracking information from the previous frames, and a parameter update scheme is introduced to adjust the dynamical model. Then, in observation modeling, the masked sparse representation is conducted to obtain the coefficients of the candidates, and their p-norms of kernel-weighted traces are established as the confidence scores for ranking. The most similar candidates are further used to estimate the target location based on Gaussian approximation. As time evolves, both the selection pattern and the templates are periodically updated to adapt to the target's appearance.
The proposed formulation has the following advantages. First, the proposed target representation scheme takes advantage not only of the uniqueness and flexibility of feature points but also of the comprehensiveness and efficiency of the holistic region. Second, the proposed affine propagation method makes the covariance matrix of the distribution temporally flexible and provides more opportunities for finding better candidates. Third, the proposed process solves the linear approximation with a masked and weighted convex optimization under an elastic net regularizer, and thus manual setting of $\ell_1$ norm constraints is not necessary. The proposed p-norm of the kernel-weighted trace function can capture the overall infinitesimal change in volume of the sparse coding output. Fourth, the proposed inference scheme has little negative influence on tracking accuracy while showing spatial robustness against various visual challenges, especially cluttered background and severe occlusion.

Target Representation Based on Selective Sparse Appearance Model
In this section, we propose a selective sparse appearance model for target representation. The definition of a key patch and the KPPR algorithm are introduced, and then the corresponding sparse representation scheme based on selected patches is presented.

KPPR for Patch Selection
We define a KEY patch for better selection of the target patches as follows.
Definition 1 In an image $Y$, a region $P$ is defined as a KEY patch if and only if the following conditions are satisfied: 1. The location and size of $P$ have been defined inside $Y$; 2. There is at least one key point $p^{KEY}$ in $P$: $p^{KEY} \in P$.
Thus, suppose $L$ key feature points $p_i^{KEY}$, $i = 1, 2, \ldots, L$, have been detected in the target region, and $K$ patches $P_j$, $j = 1, 2, \ldots, K$, $K \le N$, have been defined. The KEY patch $P_j^{KEY}$ is generated as follows: $P_j^{KEY} = \{P_j \mid p_i^{KEY} \in P_j;\ i = 1, 2, \ldots, L;\ j = 1, 2, \ldots, K\}$. (1) In the rest of this paper, we use $P_j$ to represent a KEY patch $P_j^{KEY}$ for simplicity unless otherwise noted. Obviously, if every patch contains key feature points, all the patches are regarded as key patches. Moreover, the number of key feature points in each patch can differ, and the importance of a patch is assumed to be positively proportional to the number of feature points it contains. This assumption naturally follows the characteristics of features and can also be considered reasonable from a context perspective. Heuristically, if a key point is found, its local neighborhood can also be regarded as an important and representative region. Therefore, more feature points in a fixed-size region imply that the neighborhoods connect with each other and compose a larger important region. In the extreme case, each pixel in the patch is a feature point, and thus the whole region uniquely represents itself.
For feature point extraction in this paper, the Shi-Tomasi corner detector is chosen. 8 It finds points with a large response function, which for this detector is the smaller eigenvalue $\min(\rho_1, \rho_2)$, where $\rho_1, \rho_2$ are the eigenvalues of a structure tensor $A = [x \;\; y] \begin{bmatrix} g_x & g_{xy} \\ g_{xy} & g_y \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$. Here $g_x$, $g_y$, and $g_{xy}$ are the horizontal, vertical, and diagonal image gradients convolved with a circularly weighted window function. Other well-known feature point extraction methods might also be applicable. To select the most important patches, a key point proportion (KPP) is defined as follows.
Definition 2 For a KEY patch $P_j$, $j = 1, 2, \ldots, K$, with $L$ key feature points, its KPP is $L$, $\mathrm{KPP}_j \triangleq L$, if and only if Eq. (1) is satisfied. Feature points are important for capturing invariance in visual tracking, and the more feature points a region contains, the more important it is taken to be. Thus, KPP is applied to evaluate the importance of a patch, and KPPR is further presented to select the most important KEY patches, namely, the patches with the most key feature points; it is illustrated in Fig. 2 and summarized in Algorithm 1. Once the KEY patches are decided, the selection pattern is fixed for the next few frames until the next update.
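The ranking idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: patches are laid on a regular square grid, key points are given as pixel coordinates (the detector itself is omitted), and ties are broken by patch index, none of which is specified in the text. All function names and the sample key points are hypothetical.

```python
def sample_patches(region_size, patch_size, overlap_rate):
    """Return top-left corners of overlapped square patches in a square region."""
    stride = max(1, int(round(patch_size * (1.0 - overlap_rate))))
    starts = range(0, region_size - patch_size + 1, stride)
    return [(x, y) for y in starts for x in starts]

def kppr_select(key_points, corners, patch_size, num_selected):
    """Rank patches by the number of key points they contain; keep the top ones."""
    counts = []
    for index, (x0, y0) in enumerate(corners):
        count = sum(
            1 for (px, py) in key_points
            if x0 <= px < x0 + patch_size and y0 <= py < y0 + patch_size
        )
        counts.append((count, index))
    # Only patches holding at least one key point qualify as KEY patches.
    key_patches = [c for c in counts if c[0] > 0]
    # Assumed tie-break: lower patch index first.
    ranked = sorted(key_patches, key=lambda c: (-c[0], c[1]))
    return [index for _, index in ranked[:num_selected]]

corners = sample_patches(32, 16, 0.5)
key_points = [(3, 4), (10, 10), (12, 9), (20, 20), (21, 22), (22, 21), (30, 5)]
selected = kppr_select(key_points, corners, 16, 6)
```

Because the patches overlap, a single key point can count toward several patches, so central patches tend to rank higher than corner patches.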

Target Sparse Representation Based on Selected Overlapped Patches
The global appearance of an object under different illumination and viewpoint conditions is known to lie approximately in a low-dimensional subspace. 19 In this work, it is assumed that a good target candidate can be sparsely represented, with a projection residual, by its selected overlapped patches in the target template subspace. Suppose that at time $t$ the target region $Y_t$ with size $s_x, s_y$, $s = s_x \times s_y$, is sampled into $N$ overlapped patches $Y_t = [P_t^1, P_t^2, \ldots, P_t^N]$, each of size $d = d_x \times d_y$, $d_x \le s_x$, $d_y \le s_y$, and $K$ patches are selected based on the KPPR described above. Moreover, there exists a set of templates $T_t = [t_t^1, t_t^2, \ldots, t_t^M] \in \mathbb{R}^{(d_x \times d_y \times K) \times M}$, where $M$ refers to the number of templates. The corresponding patches, $t_t^j = [b_j^1, b_j^2, \ldots, b_j^K] \in \mathbb{R}^{d \times K}$, $j = 1, 2, \ldots, M$, have been stacked, normalized, and vectorized. They share the same patch sampling and selection scheme as the target candidates. Then, any patch of a target candidate $P_t^i \in \mathbb{R}^d$, $i = 1, 2, \ldots, K$, in the current frame will approximately lie in the linear span of the corresponding template patches in the past $M$ frames for some scalars $\beta_k^i \in \mathbb{R}^K$, $i = 1, 2, \ldots, N$; $k = 1, 2, \ldots, M \times K$.

Input:
Target region $Y_t$, required KEY patch number $K$.
Predefined number of overlapped patches $N$, patch size $d_x, d_y$, overlap rate $R_o$.
1: Sample region $Y_t$ into $N$ patches $P = \{P_1, P_2, \ldots, P_N\}$, with $d_x, d_y$ and $R_o$.
2: Compute the key feature points $p$ for $Y_t$.
Thus, a target patch $P_t^i$, $i = 1, 2, \ldots, K$, is represented over the dictionary composed of the corresponding templates by solving a masked convex optimization problem with elastic net regularization, 33,34 subject to $\beta_j \ge 0$, $j = 1, 2, \ldots, K$, where $\mathrm{diag}(\sigma_j)$ refers to the diagonal matrix supported by $\sigma_j$, and $\sigma_j$ belongs to a block circulant mask matrix $P = [\sigma_1, \sigma_2, \ldots, \sigma_N] \in \mathbb{R}^{d \times N}$. Each column of $P$ is a vector composed of $d$ successive "1" elements and $s - d$ "0" elements. $\lambda_1$ and $\lambda_2$ are regularization constants. Therefore, only $K$ columns are selected, and $D = [t^1, \ldots, t^M] \in \mathbb{R}^{d \times (M \times K)}$ refers to the dictionary, whose columns are composed of the template patches according to the KPPR selection scheme described above.
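To make the nonnegative elastic net coding step concrete, the sketch below minimizes $\frac{1}{2}\|y - D\beta\|_2^2 + \lambda_1 \sum_k \beta_k + \frac{\lambda_2}{2}\|\beta\|_2^2$ subject to $\beta \ge 0$ by cyclic coordinate descent. Two assumptions should be kept in mind: this objective is the standard elastic net form, which may differ in scaling from the paper's masked formulation, and coordinate descent is swapped in here for the solver the paper actually uses. The function name and the toy dictionary are hypothetical, and the mask is taken as already applied to the input patch.

```python
def nonneg_elastic_net(dictionary, target, lam1, lam2, num_iters=200):
    """Nonnegative elastic net by cyclic coordinate descent.

    dictionary: list of columns (atoms), each a list of floats.
    target: the (already masked) patch vector as a list of floats.
    """
    coeffs = [0.0] * len(dictionary)
    residual = list(target)  # residual = y - D b, with b = 0 initially
    for _ in range(num_iters):
        for k, atom in enumerate(dictionary):
            # Add atom k's contribution back to obtain the partial residual.
            partial = [r + coeffs[k] * a for r, a in zip(residual, atom)]
            correlation = sum(a * p for a, p in zip(atom, partial))
            norm_sq = sum(a * a for a in atom)
            # Closed-form coordinate update, clipped at zero for nonnegativity.
            new_value = max(0.0, (correlation - lam1) / (norm_sq + lam2))
            residual = [p - new_value * a for p, a in zip(partial, atom)]
            coeffs[k] = new_value
    return coeffs

# Toy example with two orthonormal atoms (illustrative values only).
example = nonneg_elastic_net([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.005], 0.01, 0.01)
```

In the toy example the second atom correlates with the target less than $\lambda_1$, so its coefficient is driven exactly to zero, which is the sparsity effect the $\ell_1$ term contributes; the $\ell_2$ term only shrinks the surviving coefficient.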

Generative Visual Tracking Process Based on Spatiotemporal Analysis
An online visual tracking process can be interpreted as a Bayesian recursive and sequential inference task in a Markov model with hidden state variables. It can be further divided into cascaded estimation of a dynamical model and an observation model. 9 Suppose a set of target images $Y_t = \{y_1, y_2, \ldots, y_t\}$ has been provided up to time $t$; the hidden state variable of the target $X_t$ can be estimated as follows: $p(X_t \mid Y_t) \propto p(y_t \mid X_t) \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Y_{t-1})\, dX_{t-1}$, where $p(X_t \mid X_{t-1})$ refers to the dynamical model between two consecutive states and $p(y_t \mid X_t)$ denotes the observation model related to the likelihood estimation of $y_t$ based on the state $X_t$. The target state in this paper is approximately parameterized using the six-tuple introduced by Ross et al., 9 $X_t = \{x_t, y_t, \theta_t, s_t, \alpha_t, \phi_t\}$. The elements denote, respectively, horizontal and vertical translation, rotation angle, scale, aspect ratio, and skew direction.

Dynamical Modeling: Temporal Propagation Based on Affine State Analysis
In this paper, we analyze the current target state based on previous ones under the joint Gaussian assumption proposed below. The comparison of the original and proposed affine warping is shown in Fig. 3. Correspondingly, a theorem is stated as follows, with an informal proof afterwards.
Theorem Suppose $X_t = \{x_t, y_t, \theta_t, s_t, \alpha_t, \phi_t\}$, $t \ge 0$, where each element is a time-varying random variable. Then $X_t$ is joint Gaussian.
Proof Since the joint distribution of independent Gaussian variables is still Gaussian, 35 the theorem holds based on the Gaussian assumption proposed by Ross et al. 9 and the target state definition. ▯ Thus, the dynamical model can be updated based on the analysis of previous target states in a joint Gaussian way. In the new model, $\alpha$ is an update rate parameter, and $\Psi_0$ contains the initial affine variances of the six elements. To tackle unexpected motion variation, the target states in the previous $R$ frames are taken as the input for calculating $\Psi$. Correspondingly, suppose $X_R = [X_1, X_2, \ldots, X_R]^T$; $\mu$ and $\Psi$ up to time $t$ can be computed following Gaussian kernel estimation, where $\mathrm{var}(X_R)$ refers to the variance of $X_R$. The sampling state at time $t-1$ is computed as detailed in Sec. 5.3. The proposed dynamical model can also be viewed as a weighted multidimensional Markovian chain form of affine warping, which transforms the 1-D Markovian chain into a weighted R-D form. Moreover, it is also a sample-biased estimation. Although the general dynamical assumption between two target states in an indefinite time process follows a Gaussian distribution without bias, the states of a specific target are predictable under a motion continuity assumption, and thus the estimation can be biased toward previous target states over a fixed time interval.
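One way to picture how the sampling variances could follow the recent tracking history is sketched below. The exact update equations are not reproduced in this text, so the element-wise blend $(1 - \alpha)\Psi_0 + \alpha\,\mathrm{var}(X_R)$ used here is purely an assumption made for illustration, as are the function names and the population-variance estimator.

```python
def per_dimension_variance(states):
    """Population variance of each state element over the stored frames."""
    count = len(states)
    dims = len(states[0])
    means = [sum(s[d] for s in states) / count for d in range(dims)]
    return [
        sum((s[d] - means[d]) ** 2 for s in states) / count
        for d in range(dims)
    ]

def update_variances(initial_variances, recent_states, update_rate):
    """ASSUMED blend: (1 - alpha) * Psi_0 + alpha * var(X_R), element-wise."""
    recent = per_dimension_variance(recent_states)
    return [
        (1.0 - update_rate) * v0 + update_rate * vr
        for v0, vr in zip(initial_variances, recent)
    ]

# Two stored states with two elements each, for a compact illustration.
history = [[0.0, 1.0], [2.0, 1.0]]
blended = update_variances([4.0, 0.5], history, 0.1)
```

Under this assumed blend, a target that has moved a lot over the last $R$ frames widens the search along the moving dimensions, while a stationary dimension shrinks slightly toward its recent (small) variance.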

Observation Modeling: Confidence Calculation Based on Weighted Sparse Representation
Based on the selective sparse appearance model described above, we introduce a patch-view form of Eq. (4), denoted Eq. (8), subject to $\beta_j \ge 0$, $j = 1, 2, \ldots, K$. Equation (8) can be solved by the least angle regression (LARS) algorithm to compute the coefficients $\beta_j = [\beta_1^j, \beta_2^j, \ldots, \beta_{M \times K}^j]$. Details of the LARS algorithm can be found in Ref. 36.
Earlier templates could be more similar to the initial appearance of the target, but they might distort the target appearance approximation under abrupt variation. Thus, a temporal weight $w$ is introduced as $w = \left[ \frac{e^{-\eta_0}}{\sum_{j=0}^{K-1} e^{-(\eta_0 - \eta j)}}, \frac{e^{-(\eta_0 - \eta)}}{\sum_{j=0}^{K-1} e^{-(\eta_0 - \eta j)}}, \ldots \right]$, (10) where $\eta_0$ and $\eta$ are constants that control the weights. Thus, Eq. (8) changes to Eq. (11), again subject to $\beta_j \ge 0$, $j = 1, 2, \ldots, K$, where $\langle \cdot, \cdot \rangle$ refers to the inner product. Equation (11) can also be solved by LARS. 36 The difference is that each column is premultiplied by a weight $w$. It should be noted that although template-based sparse representation has recently been discussed, [19][20][21]26,27 none of these works considers template importance from a temporal perspective. The fragment-based tracking algorithm 11 applies a kernel-weighted scheme, which assigns low weights to the pixels far from the target's center. These pixels are more likely to contain background information or occluding objects, and thus their contributions to location estimation should be diminished. In this paper, we apply this concept to coefficient-based confidence modeling. Suppose $\beta = \{\beta_1, \beta_2, \ldots, \beta_M\} \in \mathbb{R}^{(M \times K) \times M}$ has been obtained. A p-norm of the kernel-weighted trace of $\beta$ is presented for confidence calculation, and the confidence score $L_v$ for a certain target candidate is defined through it, where $e_i$ refers to the $i$'th element. The kernel weight $k$ is defined from $\kappa_i$, the $i$'th value of a vectorized Gaussian kernel function $\kappa$. It follows the same selection pattern as the patches described in Sec. 4.
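The temporal weight vector can be computed directly from its definition, $w_i \propto e^{-(\eta_0 - \eta i)}$ normalized to sum to one. The sketch below is illustrative: the function name is hypothetical, and treating index 0 as the earliest template (so that later templates receive larger weights) is an assumption about the ordering.

```python
import math

def temporal_weights(count, eta0, eta):
    """Normalized weights w_i = exp(-(eta0 - eta*i)) / sum_j exp(-(eta0 - eta*j))."""
    raw = [math.exp(-(eta0 - eta * j)) for j in range(count)]
    total = sum(raw)
    return [r / total for r in raw]

# Ten weights with eta0 = 1 and eta = 0.1 (values chosen for illustration).
weights = temporal_weights(10, 1.0, 0.1)
```

Since $e^{-\eta_0}$ cancels in the normalization, only $\eta$ shapes the profile: consecutive weights differ by the constant factor $e^{\eta}$.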

Observation Modeling: Geometric Inference of Candidate Confidence
Compared with the maximal scheme in previous works, we construct the observation estimation based on the spatial distribution of the top candidates in the confidence ranking results.
To begin with, a 3-D confidence-coordinate space (CCS) is introduced as follows.
Definition 3 Given a set of target candidates $(x_t^k, y_t^k)$, $k \in \mathbb{Z}^+$, with corresponding normalized confidence scores $L^k$, the CCS is the 3-D space formed by the candidate coordinates and their confidence scores.
If we plot the distribution of top candidate confidence scores in a local area around the true target location, as shown in Fig. 4, it can be seen that, without noise, the more candidates we obtain, the more Gaussian the confidence distribution becomes. This can be justified by the classical central limit theorem, with each candidate regarded as a sample of confidence. Suppose there is only one point with maximal confidence, corresponding to the target in the current frame, and each candidate is sampled following a Gaussian distribution around the target; the confidence then gradually drops as a candidate moves away from the extreme point. Based on these observations, we assume that the confidence values follow a Gaussian distribution in the CCS within a limited local region.
Fig. 4 Confidence score distribution in a local area and inference result by Gaussian approximation. Without noise introduced, the more candidates we obtain, the more Gaussian the distribution of the confidence would be.
Then, a geometric inference method is presented to estimate the target location. Suppose the $Q$ points with the highest confidence scores are known; the observation model in this paper fits a 2-D Gaussian function in the CCS to find the peak. Furthermore, the observation estimate for a target candidate is proportional to the geometric confidence inference output $I(\cdot)$. Note that $Q$ should not be large, since noise could be introduced and the assumption above might then not hold. In this paper, only the minimum of $Q$ is predefined, and the number of samples finally used may increase. The geometric inference process is summarized in Algorithm 2. Each time $Q$ points in the CCS are obtained, a Gaussian fit is conducted. The inference result is checked and, if accepted, used to compose the target state in the current frame; otherwise, inference is applied up to another $M - 1$ times, with $Q$ updated each time. Eventually, if there is no suitable result, the target state used for sampling is updated with a predicted bias vector $\Delta$ computed by constant velocity approximation.
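The peak-finding step can be illustrated with a log-linear least-squares fit. The sketch assumes an axis-aligned 2-D Gaussian, whose logarithm is $c_0 + c_1 x + c_2 y + c_3 x^2 + c_4 y^2$ with peak at $(-c_1 / 2c_3,\ -c_2 / 2c_4)$; this model has five free parameters, so at least five points are needed. The fitting procedure, the acceptance check, and all names are assumptions for illustration and are not taken from Algorithm 2.

```python
import math

def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting; returns None if singular."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def gaussian_peak(points):
    """Estimate the peak of an axis-aligned 2-D Gaussian from CCS points.

    points: list of (x, y, confidence) with confidence > 0.
    Returns (x0, y0), or None when the fit does not describe a proper peak.
    """
    rows = [[1.0, x, y, x * x, y * y] for x, y, _ in points]
    logs = [math.log(conf) for _, _, conf in points]
    # Normal equations of the least-squares problem: (A^T A) c = A^T b.
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(5)] for i in range(5)]
    moment = [sum(r[i] * b for r, b in zip(rows, logs)) for i in range(5)]
    coeffs = solve_linear(normal, moment)
    # A peak requires negative curvature along both axes.
    if coeffs is None or coeffs[3] >= 0.0 or coeffs[4] >= 0.0:
        return None
    return (-coeffs[1] / (2.0 * coeffs[3]), -coeffs[2] / (2.0 * coeffs[4]))
```

Returning None when the curvature has the wrong sign mirrors the idea that an inference result is checked before use; a caller could then enlarge the point set and retry. For pixel-scale coordinates, centering the points before fitting would improve the conditioning of the normal equations.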

Template and Selection Pattern Update
Long-time fixed templates might negatively affect the tracking performance in dynamic scenes, so an update is essential. In this paper, we propose to periodically replace one template of the set $t_t^i$, $i = 1, 2, \ldots, M$, by sparse representation. A template $t$ can be obtained by sparsely representing the estimated target vector $\tilde{Y}_t$ as a linear combination of eigen-basis vectors under the elastic net, with $H = [U \;\; I]$ and $c = [q \;\; e]$. $U$ is the matrix composed of eigen-basis vectors computed following the method of Ross et al., 9 $q$ refers to the coefficients of the eigen-basis vectors, and $e$ represents trivial noise. A similar process also appears in Ref. 27; in contrast, we apply the elastic net constraint rather than the $\ell_1$ constraint. This process can also be viewed as template denoising with the underlying formulation $t = Uq + e$, so that reconstruction errors in Eqs. (4) and (8) due to appearance variation can be effectively reduced. If deformation occurs, the selected patches change regularly to adapt to the appearance variation. Since the target is labeled with rectangles, some areas that do not belong to the target might fall within the rectangles. However, this does not affect the final tracking performance, because these areas are limited; the overlapped patches within the target region cover the major areas and suppress the noise. The template update strategy is summarized in Algorithm 3.
In this paper, it is assumed that the first $M$ templates of the target are known; they can be generated by manual labeling or by other trackers. In the meantime, the KPPR algorithm is reapplied to the tracking result to reselect the KEY patches.

Summary of Algorithm
The proposed algorithm is integrated in Algorithm 4.
Qualitatively, in Algorithm 4, sparse coding in confidence score calculation and template update are the most time-consuming parts, and the proposed spatial confidence inference process ranks second. The dynamical modeling and patch selection parts take the least running time. To speed up processing, we apply a C implementation of elastic net regularization proposed by Mairal et al. 34 Moreover, we define an inference flag counter $F_c$ in the proposed confidence inference algorithm. It bounds the number of iterations so that the algorithm does not search indefinitely for an inference result. Further quantitative analysis is described in the next section.

Algorithm 2 Spatial confidence inference based on 2-D Gaussian approximation in CCS.
Output:
Target state $X_t$ and the state used for sampling.
17: Obtain a predicted bias vector

Experiment and Discussion
In this section, we present experiments on test image sequences to demonstrate the efficiency and effectiveness of the proposed algorithm. Both qualitative and quantitative evaluations are presented, and separate evaluations and analyses of the number of selected patches, the confidence inference algorithm, and the computational complexity are also conducted.

Experiment Setup
The proposed algorithm is implemented in MATLAB and C/C++ and runs at 1.0 to 1.6 fps on a 2.5-GHz machine with 2 GB RAM. For parameter configuration, the target region is normalized to 32 × 32 pixels, $s_x = s_y = 32$, $s = 1024$, and the patch size is set to 16 × 16 pixels, $d_x = d_y = 16$, $d = 256$, while the overlap percentage between neighboring patches is 0.5. Thus, nine overlapped patches are sampled in total, $N = 9$.
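The patch count can be checked with a short calculation. It assumes the patches lie on a regular grid whose stride is the patch side multiplied by one minus the overlap rate; the sampling grid is not spelled out in the text, so this is an assumption. For a 32-pixel region side, a 16-pixel patch side, and an overlap of 0.5:

```latex
\[
\text{stride} = 16 \times (1 - 0.5) = 8, \qquad
n = \frac{32 - 16}{8} + 1 = 3, \qquad
N = n^2 = 9 .
\]
```

This agrees with the nine patches stated above.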
Six hundred particles are used for dynamical modeling, $V = 600$. Target states of the latest eight frames are used for propagation, $R = 8$, and the update rate parameter is set to $\alpha = 0.1$. The number of templates is $M = 10$, where the target in the first frame is manually labeled and the other $M - 1$ frames are labeled based on the tracking results of a KD-tree forest visual tracker. 37 The regularization constants $\lambda_1$ and $\lambda_2$ are set to 0.01, and the initial number of candidates used for inference is $Q = 5$. The inference tolerance is set to $\mathrm{Tol} = 0.1$.
Both the template and the KEY patch selection pattern are updated every five frames, $U_f = 5$. The weight parameters in Eq. (10) are $\eta_0 = 1$, $\eta = 0.1$. In all the experiments of this paper except Sec. 6.4, six patches are selected, $K = 6$. It should be noted that the settings of $V$, $M$, $U_f$, $\lambda_1$, and $\lambda_2$ above follow the setup of classical online visual tracking algorithms to allow a fair performance comparison. 9,10,16,17,19,38 The overlap percentage between neighboring patches is related to the appearance variation of the target region. Since a low percentage would lead to lower efficiency, and the benchmark videos are of various kinds, an unbiased value of 0.5 is set. $Q$ is set considering the minimum number of points needed for Gaussian fitting. Other parameters, including $R$, $\alpha$, $\eta_0$, $\eta$, and Tol, are established after repeated experiments with reference to the balance between accuracy and efficiency. Increasing them would lead to lower accuracy, while high $R$ and $\alpha$ would cause the sampling location to drift away, resulting in unfavorable adaptation for fast motion and occlusion handling.
For tracking performance evaluation, 14 image sequences, totaling more than 6,000 frames, are used in the experiments, where the target locations in all frames are already manually labeled as ground truth. For comparison, the proposed tracker is evaluated against eight state-of-the-art algorithms based on the source codes provided by the authors, including Frag, 11 IVT, 9 VTD, 28 L1T, 19 MIL, 16 TLD, 17 ITWVTSP, 10 and PLS. 38 The image sequences described above are also separately obtained from their web sites. Their parameter settings are shown in Table 1.

Algorithm 3 Template update based on elastic net regulation.
Output: New template set T_t.
2: Generate a random integer i ∈ [2, M] to index the template to be replaced.
3: based on affine warping propagation by Eqs. (6) and (7).
6: (Observation Modeling) Obtain {L^k_v}_{k=1}^{V} based on confidence score by Eqs. (12) and (13).
7: (Observation Modeling) Conduct geometric inference to obtain X_t and Xt based on
The maximal iteration number is set M.
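The replacement and normalization steps of Algorithm 3 (random index in [2, M], replacement by t = Uq, normalization of the template set) can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the reconstructed template t = Uq is taken as a ready-made input because its computation is not shown here, and unit ℓ2 normalization of each template is an assumption.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def update_templates(templates, new_template, rng=random):
    """Replace one randomly chosen template and renormalize the set.

    templates    : list of M template vectors
    new_template : reconstructed template (t = Uq in the text), given as input
    rng          : object providing randint, e.g. random or random.Random(seed)
    """
    M = len(templates)
    # 1-based index in [2, M]: the first template is never replaced.
    i = rng.randint(2, M)
    updated = [list(t) for t in templates]
    updated[i - 1] = list(new_template)
    return [l2_normalize(t) for t in updated], i
```

Keeping index 1 out of the random draw means the first (manually labeled) template is always retained, which matches the index range [2, M] stated in the algorithm.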
Since implicit stochasticity exists in all of the algorithms, each quantitative score below is averaged over five independent runs of the corresponding algorithm. Live video demos and more results can be obtained from the authors.

Qualitative Evaluation
Qualitative analysis and discussions are provided below for three common applications: tracking human bodies, vehicles, and human and animal faces. The visual challenges include heavy occlusion, illumination change, scale change, fast motion, cluttered background, pose variation, motion blur, and low contrast.

Human bodies
Tracking human bodies is widely used in motion-based recognition and automated surveillance. The sequences used for evaluation include Caviar 1, Caviar 2, and Singer.
It is shown in Fig. 5 that IVT, 9 ITWVTSP, 10 MIL, 16 L1T, 19 and PLS 38 do not perform well in Caviar 1. They fail to recover the target when it is occluded by a similar object (e.g., #0133 and #0192). Only the proposed tracker, VTD, 28 Frag, 11 and TLD 17 handle the heavy occlusion successfully. However, VTD 28 and Frag 11 cannot smoothly adapt to the scale changes of the person (e.g., #0133 and #0367). In Caviar 2, almost all the trackers evaluated except PLS 38 and MIL 16 can follow the target. However, many of them, including IVT, 9 VTD, 28 ITWVTSP, 10 and TLD, 17 cannot adapt the scale as the human moves closer to the camera (e.g., #0220 and #0455). By contrast, our algorithm performs well in terms of position estimation and scale adaptation.
In Singer, shown in Fig. 5(c), only the results of some trackers (e.g., the proposed one and VTD 28) are satisfactory, while the others cannot adjust the scale [e.g., Frag, 11 L1T, 19 and MIL 16] or accurately locate the target [e.g., TLD 17 at #0098, #0116, and #0226, IVT 9 at #0126]. Both drastic scale and location deviations occur when the lighting condition changes. In particular, PLS 38 cannot capture the scale variation of the target through all the frames of Singer. The ITWVTSP 10 algorithm performs much better than the IVT algorithm 9 in this video. In comparison, the proposed algorithm can locate the target more accurately and robustly against illumination variation.

Human and animal faces
Face detection and tracking are very important in HCI and animal monitoring applications. In the experiments, five videos are used, including David Indoor, Occlusion 1, Occlusion 2, Girl, and Deer.
Figure 6 shows that in Occlusion 1, all the evaluated algorithms can follow the target approximately correctly, yet some trackers drift from the face when occlusion occurs [e.g., MIL 16 at #0300, #0565, and #0833, ITWVTSP 10 at #0565 and #0833, IVT, 9 L1T, 19 Frag, 11 TLD, 17 and VTD 28 at #0565]. In Occlusion 2, the differences are more obvious. It can be found that L1T 19 drifts more from the target compared with other algorithms [e.g., MIL 16 at #0576 and #0713], and IVT 9 and TLD 17 cannot adapt the appearance during occlusion and head rotation (e.g., #0713). Though VTD 28 and ITWVTSP 10 could locate the face center more accurately, they could not cover the occluded area due to pose variation (e.g., #0713). PLS 38 cannot continuously follow the target, while MIL 16 and Frag 11 estimate the target less accurately than the proposed algorithm.
In Girl, it is found in Fig. 7 that only the proposed algorithm, Frag, 11 TLD, 17 and VTD 28 can consistently follow the face, while the proposed method can estimate the location more accurately (e.g., at #0310 and #0345). The other trackers gradually drift from the target to the surroundings. In David Indoor, some algorithms [e.g., Frag 11 and PLS 38] drift away from the target during the tracking process, while some algorithms cannot adapt the scale when out-of-plane rotation occurs [e.g., MIL 16 and L1T 19 at #0175 and #0389, VTD 28 and ITWVTSP 10 at #0389]. In Deer, the successful trackers only include the proposed algorithm, VTD, 28 and PLS, 38 while the others fail to capture the head of the deer when it jumps up and down repeatedly. Overall, in qualitative terms, the proposed algorithm performs the best.

Vehicles
In vehicle navigation, especially self-driving technology, the basic task is to steadily track the rear of vehicles under different kinds of weather conditions and road environments. The sequences used for evaluation include Car 4 and Car 11, which are recorded in the day and at night, respectively. It is shown in Fig. 8 that Frag 11 and MIL 16 do not perform well in the first two sequences. When the car goes into or out of the shadows, there is a drastic lighting change, which causes the locations estimated by VTD 28 and L1T 19 to drift (e.g., at #0312 and #0429). The ITWVTSP 10 tracker can locate the target center accurately but fails to adapt to the scale change. In Car 11, only IVT, 9 ITWVTSP, 10 PLS, 38 and the proposed algorithm successfully track the target in the whole sequence.
The remaining trackers drift away or take the surroundings as the target [e.g., MIL 16 at #0182 and #0269 and VTD 28 and L1T 19 at #0269 and #0336].

Quantitative Evaluation
Besides qualitative evaluation, quantitative evaluation of the tracking results is also important; it typically computes the difference between the predicted and the manually labeled ground truth information. Similar to other classical works, two performance criteria are applied to compare the proposed tracker with other reference trackers. The first one is the center error (CE), which is the Euclidean distance from the tracking location to the ground truth center at each frame. The second one is the overlap ratio (OR), which is also used in object detection 39 and defined as the shared area proportion of the box obtained by the tracker and the ground truth box at each frame. Furthermore, in this paper, the average CE (ACE) and average overlap rate (AOR) are introduced, which are the per-frame CE and OR averaged over all frames of one test sequence. In these definitions, c^i_eval, c^i_gt ∈ R^{2×1} refer to the horizontal and vertical center coordinates of the evaluation and ground-truth labeling results at the i'th frame, respectively, and A^i_eval, A^i_gt ∈ R^+ are the corresponding areas of the target.
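To make the two averaged criteria concrete, the sketch below computes ACE and AOR for a sequence. It rests on assumptions that are not stated in the text: boxes are axis-aligned and given as (x, y, width, height), and the overlap ratio is taken as intersection area over union area, as in the PASCAL VOC criterion cited above. The function names are illustrative.

```python
import math


def center_error(box_eval, box_gt):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    cx_e = box_eval[0] + box_eval[2] / 2.0
    cy_e = box_eval[1] + box_eval[3] / 2.0
    cx_g = box_gt[0] + box_gt[2] / 2.0
    cy_g = box_gt[1] + box_gt[3] / 2.0
    return math.hypot(cx_e - cx_g, cy_e - cy_g)


def overlap_ratio(box_eval, box_gt):
    """Intersection area over union area (assumed PASCAL VOC style)."""
    x1 = max(box_eval[0], box_gt[0])
    y1 = max(box_eval[1], box_gt[1])
    x2 = min(box_eval[0] + box_eval[2], box_gt[0] + box_gt[2])
    y2 = min(box_eval[1] + box_eval[3], box_gt[1] + box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_eval[2] * box_eval[3] + box_gt[2] * box_gt[3] - inter
    return inter / union if union > 0 else 0.0


def average_scores(boxes_eval, boxes_gt):
    """Return (ACE, AOR): per-frame CE and OR averaged over the sequence."""
    n = len(boxes_gt)
    ace = sum(center_error(e, g) for e, g in zip(boxes_eval, boxes_gt)) / n
    aor = sum(overlap_ratio(e, g) for e, g in zip(boxes_eval, boxes_gt)) / n
    return ace, aor
```

A lower ACE and a higher AOR both indicate better tracking, which is how the entries of Table 2 should be read.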
The results of ACE and AOR for the 10 sequences above are summarized in Table 2. For each sequence, the first line refers to ACE, whereas the second refers to AOR. It can be concluded that the proposed tracking method achieves the best or the second-best performance on ACE and AOR among all the tested trackers. Though some CE values are higher, the gaps are limited, and all the AORs of the proposed tracker except on Car 4 are better than those of the others. Moreover, based on the ACE and AOR averages across all the experimental sequences, it can be concluded that the proposed method performs more favorably overall than the other methods. The details of the "center error" and "overlap rate" plots can be obtained from the authors.
The number of selected patches is one of the key issues related to tracking performance in the proposed algorithm. An experiment is conducted to evaluate its robustness. A number selection rate P is introduced to vary the selection number K, K = round(K_0 × P), where K_0 refers to the patch number without selection, and round(·) is the rounding function. The rate P varies from 0.2 to 0.9, while the other parameters are the same as the settings above. Two challenging sequences, PETS2001 25 and Woman, 11 are used.
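As a small worked example of K = round(K_0 × P), the sketch below sweeps the selection rate over the stated range. The value K_0 = 9 is purely hypothetical, and rounding halves upward is an assumption (Python's built-in `round` rounds halves to the nearest even integer, so an explicit floor is used instead).

```python
import math


def selection_count(k0: int, p: float) -> int:
    """K = round(K_0 * P), with halves rounded upward (assumed convention)."""
    return int(math.floor(k0 * p + 0.5))


K0 = 9  # hypothetical number of patches before selection
rates = [p / 10 for p in range(2, 10)]          # P = 0.2, 0.3, ..., 0.9
counts = [selection_count(K0, p) for p in rates]
```

Under these assumptions the number of selected patches grows from 2 at P = 0.2 to 8 at P = 0.9.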
Results of patch selection are shown in Fig. 9. Correspondingly, the CE and OR values are shown in Fig. 10. It is shown that the proposed tracker can generally follow the target with different selection rates rather than totally lose it. As P decreases, the performance does not deteriorate much. Clearly, too little information about the target could prevent the tracker from uniquely and successfully modeling the target's appearance, and thus the tracker fails to estimate the location with high accuracy. However, our proposed tracker could still find the likely location. Supposing that a target is regarded as successfully tracked when the OR is >0.5, a threshold line is added to the figure. A similar criterion is also applied in PASCAL VOC. 39 It is found in Fig. 10 that the proposed method is able to successfully track the target with a limited number of selected patches, provided that P is empirically not less than 0.4.
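The per-frame success criterion (OR > 0.5) can be turned into a single number per sequence. The sketch below does so; note that summarizing the criterion as a fraction of successful frames is an illustrative assumption, not a metric reported in this paper, and the function name is hypothetical.

```python
def success_rate(overlap_ratios, threshold=0.5):
    """Fraction of frames whose overlap ratio strictly exceeds the threshold."""
    if not overlap_ratios:
        return 0.0
    hits = sum(1 for r in overlap_ratios if r > threshold)
    return hits / len(overlap_ratios)
```

A frame with OR exactly equal to 0.5 does not count as a success here, following the strict inequality in the text.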

Comparison between Maximal and Proposed Inference Scheme
In Sec. 5.3, a geometric inference method is proposed to locate the target. Since the final target location is decided not by the candidate with the highest confidence score but by the inference output of the highest-scoring candidates in the CCS, the scheme might affect the tracking accuracy. However, we argue that the influence is quite limited, and more favorable performance compared with other works has been obtained as described above. More importantly, the proposed scheme is quite useful in cluttered background and complete occlusion environments when it is integrated with covariance variation in dynamic modeling. Heuristically, it could be viewed as a soft and local abnormality detection scheme.
In cluttered background, the tracker is subject to distraction from outside the target. Under the motion continuity assumption, the proposed scheme obtains spatial cues from the most confident candidates to stabilize and centralize the location. In the complete occlusion situation, the scheme provides extra opportunities to detect the target in a wider area. To demonstrate such characteristics and advantages over the maximal scheme, two challenging sequences, Football 28 and Pets2009, 40 are used. The selected qualitative results are shown in Fig. 11. In Fig. 11(a) on sequence Football, it is found that without geometric inference, the tracker gradually drifts to the surrounding areas of the player's head due to the neighborhood similarity (e.g., #0289 and #0336), while more stable performance is obtained with the proposed geometric inference scheme. The sequence Pets2009 is quite challenging because another pedestrian passes by the target while it is heavily occluded. The tracker with the maximal scheme eventually follows a wrong object. In the proposed method, although the tracker mistakes the wrong pedestrian for the target in the first several frames, sparse coefficients of the false target would scatter the point distribution in the CCS, violate the inference condition, and cause the sampling state Xt to be much biased. Based on these unacceptable inference results, the searching range is extended according to Algorithm 2. When the true target appears again without much appearance variation, the tracker re-detects it and continues with correct location estimation in the following frames.

Computational Complexity Analysis
In Sec. 5, it could be found that sparse coding and the proposed spatial confidence inference algorithm are the most time consuming. Thus, we also compare the computational complexity and processing time with three representative trackers, IVT, 9 ITWVTSP, 10 and L1T, 19 as shown in Table 3. Suppose d refers to the dimension of a vectorized image, and M is the number of eigenvectors or templates, d ≫ M. The computational complexity of IVT 9 and ITWVTSP 10 is O(dM), for they mainly involve matrix-vector multiplication. The computational load of L1T 19 is O(d^2 + dM), while the load of the proposed algorithm is O(KsM + 9QM). The first part is related to sparse coding, where K refers to the number of selected patches, and s is the patch size, d^2 ≫ Ks > d. The second part is related to geometric inference, where Q is the inference point number, M < 9QM ≪ d. Moreover, processing times for one image at different normalized image sizes (16 × 16 and 32 × 32) are also presented. It can be found that enlarging the normalized size of the target region increases the computation time. Both the L1T 19 algorithm and the proposed one apply sparse representation, yet the proposed tracker is much faster than the L1T 19 tracker. Although the proposed algorithm is slower than the IVT 9 and ITWVTSP 10 algorithms, it achieves better performance in accuracy evaluation.
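The stated complexity orders can be compared numerically by plugging in sample values. The sketch below is only indicative: it drops all constant factors, takes d = 32 × 32, M = 10, K = 6, and Q = 5 from the settings in this paper, and assumes a hypothetical patch size s = 256 (a 16 × 16 patch counted in pixels), which is not given in the text.

```python
def operation_counts(d: int, M: int, K: int, s: int, Q: int) -> dict:
    """Leading-order operation counts from the stated complexity expressions."""
    return {
        "IVT / ITWVTSP": d * M,            # O(dM)
        "L1T": d * d + d * M,              # O(d^2 + dM)
        "Proposed": K * s * M + 9 * Q * M,  # O(KsM + 9QM)
    }


# s = 256 is an assumed patch size; the other values follow the text.
counts = operation_counts(d=32 * 32, M=10, K=6, s=256, Q=5)
```

With these values the counts are 10,240 for IVT and ITWVTSP, 15,810 for the proposed algorithm, and 1,058,816 for L1T, which agrees with the ordering described above: slower than IVT and ITWVTSP but much faster than L1T.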

Conclusion
This paper presents a generative tracking algorithm based on sparse representation of selected overlapped patches via KPPR and spatiotemporal geometric inference of candidate confidences sampled by propagated affine motion modeling. Qualitative and quantitative evaluations are conducted, together with analyses of the selected patch number and the geometric inference process. The experiments demonstrate that on challenging image sequences, our proposed tracking algorithm performs more favorably overall against state-of-the-art online tracking algorithms. Future work might include exploring more efficient ℓ1 minimization algorithms (e.g., APG) 32 for real-time application and extending this algorithm to multiple-object tracking given certain application environments. Currently, the temporal weight matrix in Eq. (10) is fixed during the tracking process. More information could be introduced to adapt it to the latest tracking conditions.

Fig. 1 Workflow of proposed algorithm. The proposed selective sparse appearance model based on KPPR in yellow shading is detailed in Sec. 4, whereas the proposed affine warping propagation, confidence calculation, geometric inference, and update process colored in green shading are described from Sec. 5.1 to 5.4.

Fig. 2 Key point proportion ranking (KPPR). Key point features are first extracted, and then the point proportions for each patch are calculated and ranked to obtain the selected K KEY patches.

Fig. 3 Affine warping comparison. The original affine warping introduced by Ross et al. 9 in (a) only considers the target state in the latest frame, and the proposed one in (b) considers more previous target states with a nonfixed covariance update.

8: if t mod U_f = 0 then
9: (Template Update) Obtain new template set T_t by Algorithm 3.
10: (Selection Pattern Update) Update the K KEY patches based on KPPR by Algorithm 1.

Fig. 5 Qualitative evaluation of (a) Caviar 1, (b) Caviar 2, and (c) Singer, where object appearances change due to heavy occlusion, scale change, and light variation. Similar objects also appear in the scenes. Six patches are selected for the proposed algorithm.

Fig. 6 Qualitative evaluation of (a) Occlusion 1 and (b) Occlusion 2, where object appearances change drastically due to heavy occlusion and pose variation. Six patches are selected for the proposed algorithm.

Fig. 7 Qualitative evaluation of (a) Girl, (b) David Indoor, and (c) Deer, where object appearances change drastically due to fast motion, pose variation, light variation, scale change, cluttered background, and motion blur. Six patches are selected for the proposed algorithm.

Fig. 8 Qualitative evaluation of (a) Car 4 and (b) Car 11, where object appearance changes drastically due to scale change, abrupt illumination variation, cluttered background, and low contrast. Six patches are selected for the proposed algorithm.

3: Replace the template t^i_t with t = Uq.
4: Normalize the template set T_t.

Algorithm 4 Proposed online visual tracking algorithm.
Input: Image sequence with T frames, initial target state X_0, particle number V, inference point number Q, template and selection update frequency U_f, template weight W, fitting tolerance Tol, template number M, overlapped percentage, state number for analysis R, update rate α, constants λ_1, λ_2, d_x, d_y, s_x, s_y, η_0, η.
Track the target in the first M frames to obtain the state X_{1:M} and template set T_{1:M}.

Table 1 Main parameter settings for eight state-of-the-art algorithms.

Table 2 ACE (pixels) and average OR of tracking methods. The best two results are in bold and italics.

Table 3 Computation complexity and processing time (seconds) of tracking methods.