## 1. Introduction

Probability hypothesis density (PHD) filter-based^{1} trackers have enjoyed growing popularity in recent years, particularly in the field of nonlinear, non-Gaussian multitarget visual tracking. The original PHD filter-based visual tracker usually uses the outputs of detectors, such as a motion detector, to establish the observation model, so its efficiency relies on the accuracy of the detection.^{2} In addition, because the target models in most visual trackers are potentially nonlinear and non-Gaussian, a particle PHD filter^{3} is used to implement the PHD recursion. However, interactions among multiple targets, such as occlusion, and clutter often lead to a complex multimodal distribution of the resampled particles, which increases the complexity of state extraction. In such cases, the classical *k*-means clustering algorithm may suffer serious degradation in state extraction performance.

In this paper, to prevent inaccurate detection from generating estimation errors, as happens in the original PHD filter-based visual tracker,^{2} a color histogram with position constraints^{4} is incorporated into the PHD filtering framework, which combines the appearance model of the target with its temporal dynamics in a unified framework. Moreover, to obtain more accurate state estimates, a new state extraction method based on Gaussian mixture model (GMM) clustering is proposed. The result is a robust visual tracking framework.

The multitarget visual tracking problem can be formulated as a multitarget Bayes filter in a random finite set (RFS) framework by propagating the multitarget posterior in time. The particle PHD filter^{3} is a sequential Monte Carlo implementation for the multitarget Bayes filter that approximates the PHD with a set of random samples (weighted particles). It involves prediction and update steps. Let the posterior PHD at time *k* − 1 be approximated by a set $\{ w_{k-1}^{(i)}, x_{k-1}^{(i)} \}_{i=1}^{L_{k-1}}$ of *L*_{k − 1} particles and their corresponding weights. The predicted PHD *v*_{k|k − 1}(*x*_{k}) can then be approximated by $\{ \tilde w_{k|k-1}^{(i)}, \tilde x_k^{(i)} \}_{i=1}^{L_{k-1}+J_k}$ after applying importance sampling:

## Eq. 1

$$v_{k|k-1}(x_k) = \sum_{i=1}^{L_{k-1}+J_k} \tilde w_{k|k-1}^{(i)}\, \delta_{\tilde x_k^{(i)}}(x_k),$$

## Eq. 2

$$\tilde w_{k|k-1}^{(i)} = \begin{cases} \dfrac{\phi_{k|k-1}\big(\tilde x_k^{(i)}, x_{k-1}^{(i)}\big)\, w_{k-1}^{(i)}}{q_k\big(\tilde x_k^{(i)} \mid x_{k-1}^{(i)}, Z_k\big)}, & i = 1, \cdots, L_{k-1} \\[6pt] \dfrac{\gamma_k\big(\tilde x_k^{(i)}\big)}{J_k\, p_k\big(\tilde x_k^{(i)} \mid Z_k\big)}, & i = L_{k-1}+1, \cdots, L_{k-1}+J_k \end{cases}.$$

Here, *q*_{k}( · | *x*_{k − 1}, *Z*_{k}) and *p*_{k}( · | *Z*_{k}) are the importance functions for targets at time *k* − 1 and new targets at time *k*, ϕ_{k|k − 1}( ·, ·) denotes the intensity of survived and spawned targets from time *k* − 1, and γ_{k}( · ) is the intensity of the new target birth RFS. Once the observation likelihood $p(z_k \mid \tilde x_k^{(i)})$ is obtained, the weights in Eq. 2 are updated by

## Eq. 3

$$\tilde w_k^{(i)} = \left[ P_M\big(\tilde x_k^{(i)}\big) + \sum_{z_k \in Z_k} \frac{P_D\big(\tilde x_k^{(i)}\big)\, p\big(z_k \mid \tilde x_k^{(i)}\big)}{\kappa_k(z_k) + C_k(z_k)} \right] \tilde w_{k|k-1}^{(i)},$$

where κ_{k}( · ) is the clutter intensity, and $C_k(z_k) = \sum_{j=1}^{L_{k-1}+J_k} P_D\big(\tilde x_k^{(j)}\big)\, p\big(z_k \mid \tilde x_k^{(j)}\big)\, \tilde w_{k|k-1}^{(j)}$.
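As a rough illustration of Eqs. 2 and 3, the weight prediction and update can be sketched in plain Python. This is a minimal sketch under stated assumptions: the function and argument names are hypothetical, all densities are passed in as precomputed numbers, and the missed-detection term is taken as *P*_{M} = 1 − *P*_{D}, which this section does not state explicitly.

```python
def predict_weights(w_prev, phi_vals, q_vals, gamma_vals, p_vals):
    """Eq. 2: predicted weights for survived/spawned and newborn particles.

    phi_vals[i], q_vals[i]: transition intensity and importance density for
    the i-th old particle; gamma_vals[i], p_vals[i]: birth intensity and
    birth importance density for the i-th newborn particle.
    """
    survived = [phi * w / q for phi, w, q in zip(phi_vals, w_prev, q_vals)]
    j_k = len(gamma_vals)  # number of newborn particles J_k
    born = [g / (j_k * p) for g, p in zip(gamma_vals, p_vals)]
    return survived + born


def update_weights(w_pred, p_d, lik, clutter):
    """Eq. 3: update predicted weights with the observation likelihood.

    lik[j][i] is p(z_j | x_i) and clutter[j] is kappa_k(z_j).
    Assumption: P_M = 1 - P_D (not stated in the text).
    """
    n = len(w_pred)
    # C_k(z_j) = sum_i P_D(x_i) p(z_j | x_i) w_pred[i]
    c = [sum(p_d[i] * lik_j[i] * w_pred[i] for i in range(n)) for lik_j in lik]
    updated = []
    for i in range(n):
        detected = sum(
            p_d[i] * lik[j][i] / (clutter[j] + c[j]) for j in range(len(lik))
        )
        updated.append(((1.0 - p_d[i]) + detected) * w_pred[i])
    return updated
```

With perfect detection and no clutter, each measurement contributes exactly one unit of mass, so the updated weights sum to the number of measurements; this is a convenient sanity check for an implementation.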

## 2. Tracking Model

In the proposed tracker, the target candidate in an image is approximated by a *w* × *h* rectangle. Let the state of a target at time *k* be $x_k = (p_{x,k}, \dot p_{x,k}, p_{y,k}, \dot p_{y,k}, w, h)^T$, where **p**_{k} = (*p*_{x, k}, *p*_{y, k}) is the centroid and $(\dot p_{x,k}, \dot p_{y,k})$ is the target velocity. Assume that each target follows a linear Gaussian constant velocity model, i.e.,

## Eq. 4

$$x_k = \mathbf{F} x_{k-1} + v_k,$$

where **F** is the state transition matrix and *v*_{k} is zero-mean Gaussian white process noise.
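The constant velocity model of Eq. 4 can be sketched as follows. The entries of **F** are not spelled out in the text, so the matrix below is an assumption: positions advance by velocity times a frame interval `dt`, while velocities and the rectangle size stay constant. The names `transition_matrix`, `propagate`, and `noise_std`, and the use of independent noise per component, are likewise illustrative.

```python
import random


def transition_matrix(dt=1.0):
    """Assumed F for the state (p_x, v_x, p_y, v_y, w, h)."""
    return [
        [1, dt, 0, 0, 0, 0],  # p_x <- p_x + dt * v_x
        [0, 1, 0, 0, 0, 0],   # v_x unchanged
        [0, 0, 1, dt, 0, 0],  # p_y <- p_y + dt * v_y
        [0, 0, 0, 1, 0, 0],   # v_y unchanged
        [0, 0, 0, 0, 1, 0],   # width unchanged
        [0, 0, 0, 0, 0, 1],   # height unchanged
    ]


def propagate(x, F, noise_std, rng=random):
    """Eq. 4: x_k = F x_{k-1} + v_k with independent zero-mean Gaussian noise.

    noise_std holds one standard deviation per state component
    (a diagonal noise covariance, which is a simplifying assumption).
    """
    return [
        sum(f * xi for f, xi in zip(row, x)) + rng.gauss(0.0, s)
        for row, s in zip(F, noise_std)
    ]
```

Calling `propagate` once per particle gives the predicted particle set when the transition density is used as the importance function.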

To incorporate the appearance model into the tracking framework, we design the observation model using a color histogram.^{4} Let $\{ \mathbf{s}_i \}_{i=1,\cdots,n_h}$ be the pixel locations of the target centered at **p**_{k} = (*p*_{x, k}, *p*_{y, k}), and let the window radius be **h** = (*w*, *h*). Define a function $\mathrm{b}: R^2 \to \{1, \cdots, m\}$ that associates the pixel at location **s**_{i} with the index $\mathrm{b}(\mathbf{s}_i)$ of the histogram bin corresponding to the color of that pixel. The color histogram of a target candidate $\hat{\mathbf{q}}(\mathbf{p}_k)$ and the probability of the feature *u* = 1, ⋅⋅⋅, *m* are defined by Eqs. 5 and 6,

## Eq. 5

$$\hat{\mathbf{q}}(\mathbf{p}_k) = \{ \hat q^{(u)}(\mathbf{p}_k) \}_{u=1,\cdots,m}, \quad \sum_{u=1}^{m} \hat q^{(u)}(\mathbf{p}_k) = 1,$$

## Eq. 6

$$\hat q^{(u)}(\mathbf{p}_k) = C_h \sum_{i=1}^{n_h} \mathrm{k}\left( \left\| \frac{\mathbf{p}_k - \mathbf{s}_i}{\mathbf{h}} \right\| \right) \delta[\mathrm{b}(\mathbf{s}_i) - u],$$

where *u* denotes the color histogram bin, k is a spatial weighting function, and *C*_{h} is a normalization term. Similarly, the reference target model can be represented by $\hat{\mathbf{q}}_c = \{ \hat q_c^{(u)} \}_{u=1,\cdots,m}$. Then the observation likelihood is defined by the similarity between a target candidate $\hat{\mathbf{q}}(\mathbf{p}_k)$ and the reference target model $\hat{\mathbf{q}}_c$, i.e.,

## Eq. 7

$$p(z_k \mid x_k) = \frac{1}{\sqrt{2\pi}\, \sigma_c} \exp\left\{ -\frac{d^2\big(\hat{\mathbf{q}}(\mathbf{p}_k), \hat{\mathbf{q}}_c\big)}{2\sigma_c^2} \right\},$$

where σ_{c} is the standard deviation of the noise, which is determined experimentally.
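A minimal sketch of Eqs. 5 to 7 is given below. Two choices are assumptions, because this section does not specify them: the spatial weighting function k is taken to be an Epanechnikov-style profile, and the distance *d* is taken to be the Bhattacharyya-based distance commonly paired with mean-shift color tracking. Pixels are passed as hypothetical `(x, y, bin)` triples with zero-based bin indices, whereas the text numbers bins from 1 to *m*.

```python
import math


def color_histogram(pixels, center, half_size, m):
    """Eqs. 5-6: kernel-weighted, normalized color histogram.

    pixels: iterable of (x, y, bin) with bin in 0..m-1 (zero-based here).
    The Epanechnikov-style profile max(0, 1 - r^2) is an assumed kernel.
    """
    hist = [0.0] * m
    for x, y, b in pixels:
        r2 = ((x - center[0]) / half_size[0]) ** 2 \
            + ((y - center[1]) / half_size[1]) ** 2
        hist[b] += max(0.0, 1.0 - r2)
    total = sum(hist)  # 1 / C_h
    return [v / total for v in hist] if total > 0 else hist


def distance_sq(q, q_ref):
    """Assumed d^2 = 1 - sum_u sqrt(q_u * q_ref_u) (Bhattacharyya-based)."""
    rho = sum(math.sqrt(a * b) for a, b in zip(q, q_ref))
    return max(0.0, 1.0 - rho)  # clamp tiny negative rounding errors


def likelihood(q, q_ref, sigma_c):
    """Eq. 7: Gaussian likelihood of the histogram distance."""
    d2 = distance_sq(q, q_ref)
    return math.exp(-d2 / (2.0 * sigma_c ** 2)) / (math.sqrt(2.0 * math.pi) * sigma_c)
```

A candidate whose histogram matches the reference attains the maximum likelihood 1/(√(2π) σ_{c}); the value decays as the histograms diverge, at a rate controlled by σ_{c}.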

## 3. GMM Clustering

In the particle PHD filter, a clustering algorithm is required to detect the peaks of the PHD, which define the candidate target states, from the resampled particles. We propose a GMM clustering method for state extraction. First, a GMM is used to fit the underlying distribution of a resampled particle *x*_{k} as

## Eq. 8

$$S_k(x_k \mid \Theta_k) = \sum_{l=1}^{G_k} \pi_k^l\, \mathcal{N}\big(x_k \mid \mu_k^l, \Sigma_k^l\big) \quad \text{with} \quad \sum_{l=1}^{G_k} \pi_k^l = 1,$$

where *G*_{k} is the number of Gaussian components, and $\Theta_k = \{ \pi_k^l, \mu_k^l, \Sigma_k^l \}$ is the parameter set of the Gaussian components, consisting of their weights, means, and covariances. Assuming that the state vectors of all particles are independent, the resulting density for the resampled particles $\tilde X_k = \{ w_k^{(i)}, x_k^{(i)} \}_{i=1}^{L_k}$ is

## Eq. 9

$$S_k(\tilde X_k \mid \Theta_k) = \prod_{i=1}^{L_k} S_k\big(x_k^{(i)} \mid \Theta_k\big) = \prod_{i=1}^{L_k} \sum_{l=1}^{G_k} \pi_k^l\, \mathcal{N}\big(x_k^{(i)} \mid \mu_k^l, \Sigma_k^l\big).$$

The parameter set $\Theta_k$ is estimated with the EM algorithm, starting from the clusters at time *k* − 1 and new clusters generated by position observations at time *k*. The detailed algorithm is presented in Sec. 4.
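In practice the product in Eq. 9 is evaluated in the log domain to avoid numerical underflow. The sketch below computes the log of Eq. 9 for a given parameter set; it assumes diagonal covariances for brevity (the text allows full covariances), and the function names are illustrative.

```python
import math


def gaussian_pdf_diag(x, mean, var):
    """Gaussian density with a diagonal covariance (simplifying assumption)."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return density


def gmm_log_likelihood(particles, weights, means, variances):
    """log of Eq. 9: sum over particles of the log mixture density (Eq. 8).

    Raises ValueError if a particle has zero density under every component.
    """
    total = 0.0
    for x in particles:
        mixture = sum(
            pi * gaussian_pdf_diag(x, mu, var)
            for pi, mu, var in zip(weights, means, variances)
        )
        total += math.log(mixture)
    return total
```

Each EM iteration does not decrease this quantity, so it can serve as a convergence check for the clustering step.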

## 4. Particle PHD Filter-Based Visual Tracker with Robust State Extraction

When tracking starts, the initial state RFS of the targets is input to the proposed algorithm, and the reference models of the targets are extracted using Eq. 5 at time *k* = 0. Tracking then proceeds for *k* ⩾ 1 as follows.

Prediction: according to Eq. 1, draw particles $\tilde x_k$ and compute the predicted weights $\{ \tilde w_{k|k-1}^{(i)} \}_{i=1}^{L_{k-1}+J_k}$.

Compute the observation likelihood: for *i* = 1, ⋅⋅⋅, *L*_{k − 1} + *J*_{k}, compute $p(z_k \mid \tilde x_k^{(i)})$ using Eq. 7.

Update: update the weights $\{ \tilde w_{k|k-1}^{(i)} \}_{i=1}^{L_{k-1}+J_k}$ using $p(z_k \mid \tilde x_k^{(i)})$ according to Eq. 3.

Resampling: resample the updated particles to get $\{ w_k^{(i)}, x_k^{(i)} \}_{i=1}^{L_k}$ using the multinomial resampling algorithm.

GMM clustering: cluster the resampled particles $\{ w_k^{(i)}, x_k^{(i)} \}_{i=1}^{L_k}$ using the proposed GMM clustering method.

Step 1: generate observations (positions) *Z*_{k} of candidate targets using detectors such as background subtraction.

Step 2: for each observation *z* ∈ *Z*_{k}, associate *z* with $\hat\Theta_{k-1}$ using the nearest-neighbor algorithm; discard *z* if it can be associated with an old cluster, otherwise add it to θ_{k}. Then, for each remaining observation *c* ∈ θ_{k}, initialize a new cluster $\{ 1/\hat N_{k|k}, [c, 0, 0], \Sigma(c) \}$ and add it to the new cluster set Θ′.

Step 3: augment $\hat\Theta_{k-1}$ with Θ′, and update their parameters using the EM algorithm on $\{ w_k^{(i)}, x_k^{(i)} \}_{i=1}^{L_k}$ to obtain $\tilde\Theta_k$.

Step 4: remove the small clusters in $\tilde\Theta_k$ with $\tilde\pi_k < 0.2$, where 0.2 is set experimentally, and merge similar clusters using the pruning method in Ref. 5 to obtain $\hat\Theta_k$.

State output: extract $\hat X_k = \{ \hat\mu_{k,i} \mid \hat\pi_{k,i} > 0.5 \}_{i=1}^{N_s}$ from $\hat\Theta_k$, where 0.5 is set experimentally.
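Steps 3 and 4 of the clustering stage can be sketched as a warm-started EM followed by pruning. This is a simplified illustration, not the authors' implementation: it clusters positions only, uses diagonal covariances, treats the resampled particles as equally weighted, gives each observation-seeded cluster the same initial weight and renormalizes (instead of using 1/N̂_{k|k}), and omits both the merging step of Ref. 5 and the final state-output threshold. All function and parameter names are hypothetical.

```python
import math


def gaussian_pdf_diag(x, mean, var):
    """Gaussian density with a diagonal covariance (simplifying assumption)."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return density


def em_step(points, clusters, min_var=1e-3):
    """One EM iteration; each cluster is a (weight, mean, variance) tuple."""
    # E-step: responsibility of every cluster for every point.
    resp = []
    for x in points:
        joint = [pi * gaussian_pdf_diag(x, mu, var) for pi, mu, var in clusters]
        norm = sum(joint)
        resp.append([j / norm if norm > 0.0 else 0.0 for j in joint])
    # M-step: re-estimate weight, mean, and variance of each cluster.
    dim = len(points[0])
    updated = []
    for l, (pi, mu, var) in enumerate(clusters):
        mass = sum(r[l] for r in resp)
        if mass < 1e-12:  # cluster explains no particles: zero its weight
            updated.append((0.0, mu, var))
            continue
        mean = [sum(r[l] * x[d] for r, x in zip(resp, points)) / mass
                for d in range(dim)]
        new_var = [
            max(min_var,
                sum(r[l] * (x[d] - mean[d]) ** 2 for r, x in zip(resp, points)) / mass)
            for d in range(dim)
        ]
        updated.append((mass / len(points), mean, new_var))
    return updated


def gmm_cluster(points, old_clusters, new_centers, init_var,
                n_iter=20, min_weight=0.2):
    """Warm-start EM from old plus observation-seeded clusters, then prune."""
    clusters = list(old_clusters) + [
        (1.0, list(c), list(init_var)) for c in new_centers
    ]
    total = sum(pi for pi, _, _ in clusters)
    clusters = [(pi / total, mu, var) for pi, mu, var in clusters]
    for _ in range(n_iter):
        clusters = em_step(points, clusters)
    # Step 4 (pruning only): drop clusters whose weight is below the threshold.
    return [c for c in clusters if c[0] >= min_weight]
```

Starting EM from the previous frame's clusters gives temporal continuity to the cluster estimates, while the observation-seeded clusters let newly appearing targets acquire a component; seeds that attract no particles end up with negligible weight and are pruned.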

## 5. Results

The pedestrian sequence from the CAVIAR data set is used as the test video. Figure 1 indicates that PHD filter-based visual trackers can handle tracking problems with a varying number of targets without data association. Figure 1 presents the detection results of a background subtraction detector. Figure 1 shows that the particle PHD filter that directly uses detection results as measurements (denoted as DPHD) tends to generate false state estimates due to inaccurate detection, such as a person detection splitting into several blobs. Figure 1 shows that the particle PHD filter with an observation likelihood based on the color histogram and *K*-means clustering (denoted as KPHD) can avoid failures due to inaccurate detection but outputs state estimates with unsatisfactory accuracy. Figure 1 demonstrates that more accurate state estimates can be filtered and extracted effectively by our method. Figure 1 also shows an example of a slower response of the proposed tracker, caused by color histogram variation when the candidate target region suffers occlusion. Moreover, it can be inferred that appearance variation of targets due to illumination change and occlusion, as well as background regions with color histograms similar to those of the targets, would mislead a tracker that uses color histograms only. Additional information is needed to improve the tracker.

The Wasserstein distance^{5} is used here to evaluate the performance of the trackers. Figure 2 compares the Wasserstein distance of the three trackers and indicates that the proposed tracker performs best among them.

## 6. Conclusions and Discussion

In this paper, we have presented a robust multitarget visual tracking framework based on the PHD filter. It stabilizes the tracker by incorporating the color histograms of targets and their temporal dynamics in a unified framework, and it improves the accuracy of state extraction using the proposed GMM clustering method. Experiments show that the proposed framework can effectively track a varying number of targets with more accurate state estimates. Possible topics for future work include incorporating the brightness gradient into the appearance model for a more robust observation likelihood and developing a more efficient particle clustering method.

## Acknowledgments

This work was jointly supported by the National Natural Science Foundation of China (61074106) and the China Aviation Science Foundation (2009ZC57003).

## References

1. “Multitarget filtering using a multitarget first-order moment statistic,” Proc. SPIE 4380, 184–195 (2001). https://doi.org/10.1117/12.436947

2. “Efficient multitarget visual tracking using random finite sets,” IEEE Trans. Circuits Syst. Video Technol. 18(8), 1016–1027 (2008). https://doi.org/10.1109/TCSVT.2008.928221

3. “Sequential Monte Carlo implementation of the PHD filter for multi-target tracking,” 792–799 (2003).

4. “Real-time tracking of non-rigid objects using mean shift,” 142–149 (2000).

5. “The Gaussian mixture probability hypothesis density filter,” IEEE Trans. Signal Process. 54(11), 4091–4104 (2006). https://doi.org/10.1109/TSP.2006.881190