1 July 2005 Performance analysis for tracking of variable scale objects using mean-shift algorithm
Abstract
Classic mean-shift trackers have no integrated scale adaptation, which limits their performance in tracking variable scale objects. By analyzing the similarity of object kernel histograms, we found that the changes of object scale and position within the fixed kernel make the Bhattacharyya coefficient monotonic decreasing. The work plays a guiding role in solving scaling problems within the mean-shift framework.

The mean-shift algorithm1 is an efficient method for mode seeking without doing an exhaustive search, which leads to a real-time property. It has been introduced recently for tracking applications.2, 3, 4, 5 However, the fixed kernel bandwidth is always leading to poor localization in tracking objects changing in scale. A moment is used to compute the size of the tracking windows.2 However, the computational complexity is too high to meet the real-time requirement. In general, an object scale is detected by calculating the Bhattacharyya coefficient for three different sizes (same scale, $±5%$ change) and choosing the size that gives the highest similarity to the target model.5 Since it is a naive method for scale adaptation without considering the underlying relationship between the similarity and the object scale changes, the size of the tracking windows cannot always keep up with the object scale changes. In this paper, this relationship is theoretically analyzed for a possible total solution in the future.

Definition 1. A round region $T$ containing the whole object region $F$ and some background region $B$ is called a tracking window. Function $c\left(T\right)$ and $c\left(F\right)$ denote the center of $T$ and $F$ , respectively. Their distance is measured by $d\left(T,F\right)=\parallel c\left(T\right)-c\left(F\right)\parallel$ .

Definition 2. Let ${\left\{{x}_{i}\right\}}_{i=1\dots n}$ be the pixel locations with $c\left(T\right)$ as the origin point. The kernel histogram5 of $T$ with $m$ bins is defined by $P={\left\{{p}_{\mu }\right\}}_{\mu =1\dots m}$ where

## 1

${p}_{\mu }=C\sum _{i=1}^{n}k\left({\parallel {x}_{i}∕r\parallel }^{2}\right)\delta \left[q\left({x}_{i}\right)-\mu \right].$
Here $k$ is the kernel function and $r$ is the kernel bandwidth, which determines the radius of $T$ . Function $q:{R}^{2}\to \left\{1\dots m\right\}$ associates the pixel at location ${x}_{i}$ to the index $q\left({x}_{i}\right)$ of the kernel-histogram bin corresponding to the color of that pixel. $C$ is derived by imposing the constraint ${\sum }_{\mu =1}^{m}{p}_{\mu }=1$ . Suppose the color distribution of $F$ is distinguished from $B$ . It can be approximately satisfied in many applications, e.g., traffic surveillance, and described by ${\sum }_{\mu =1}^{m}{p}_{\mu }^{i}\bullet {p}_{\mu }^{j}=0$ where the color distribution of $F$ and $B$ are represented by ${P}_{i}={\left\{{p}_{\mu }^{i}\right\}}_{\mu =1\dots m}$ and ${P}_{j}={\left\{{p}_{\mu }^{j}\right\}}_{\mu =1\dots m}$ , respectively.

Definition 3. The similarity of two kernel histograms ${P}_{i}$ and ${P}_{j}$ with $m$ bins is measured by the Bhattacharyya coefficient5

## 2

$\rho \left(i,j\right)=\sum _{\mu =1}^{m}\sqrt{{p}_{\mu }^{i}{p}_{\mu }^{j}},\phantom{\rule{1em}{0ex}}i\ne j,$
where ${p}_{\mu }^{i}$ and ${p}_{\mu }^{j}$ are the value of bin $\mu$ in ${P}_{i}$ and ${P}_{j}$ , respectively.

Theorem 1. Given ${T}_{1}$ with $c\left({F}_{1}\right)=c\left({T}_{1}\right)$ in frame $i$ and ${T}_{2}$ with the same position of ${T}_{1}$ in frame $i+1$ where object scale and position are changed, $\forall {T}_{3}∊\text{frame}$ $i+1$ , if $d\left({T}_{2},{F}_{2}\right) then $\rho \left(2,1\right)>\rho \left(3,1\right)$ .

Proof. By assuming without loss of generality that (1) the object shrinks its scale from frame $i$ to $i+1$ . (2) ${F}_{i}$ , $i=1,2,3$ consists of $u$ subregions with different intensity levels, i.e., ${F}_{i}={\left\{{f}_{j}\right\}}_{j=1\dots u}$ , while ${B}_{i}$ , $i=1,2,3$ consists of ${\nu }_{i}$ subregions with different intensity levels, i.e., ${B}_{i}={\left\{{b}_{j}\right\}}_{j=1\dots {\nu }_{i}}$ . (3) Consider ${T}_{i}$ ; suppose its kernel histogram ${P}_{i}={\left\{{p}_{\mu }^{i}\right\}}_{\mu =1\dots m}$ consists of two entries, sets ${\left\{f{p}_{j}^{i}\right\}}_{j=1\dots u}$ and ${\left\{b{p}_{j}^{i}\right\}}_{j=1\dots {\nu }_{i}}$ , corresponding to the subregion ${\left\{{f}_{j}\right\}}_{j=1\dots u}$ and ${\left\{{b}_{j}\right\}}_{j=1\dots {\nu }_{i}}$ , respectively, where $u+\mathrm{max}\left({\nu }_{1},{\nu }_{2},{\nu }_{3}\right)⩽m$ .

The continuous form of Eq. 1 is as follows:

$\left\{\begin{array}{c}f{p}_{j}^{i}={C}_{i}{\iint }_{\sigma ={f}_{j}}k\left({\parallel x∕r\parallel }^{2}\right)d\sigma ,\phantom{\rule{1em}{0ex}}i=1,2,3;\phantom{\rule{1em}{0ex}}j=1\dots u\\ b{p}_{j}^{i}={C}_{i}{\iint }_{\sigma ={b}_{j}}k\left({\parallel x∕r\parallel }^{2}\right)d\sigma ,\phantom{\rule{1em}{0ex}}i=1,2,3;\phantom{\rule{1em}{0ex}}j=1\dots {v}_{i}\end{array}\phantom{\right\}},$
where
${C}_{i}=1}{\left[\sum _{j=1}^{u}{\iint }_{\sigma ={f}_{j}}k\left({\parallel x∕r\parallel }^{2}\right)d\sigma +\sum _{j=1}^{{v}_{i}}{\iint }_{\sigma ={b}_{j}}k\left({\parallel x∕r\parallel }^{2}\right)d\sigma \right]}.$
By using integral theorem of mean, we have

## 3

$\left\{\begin{array}{c}f{p}_{j}^{2}={C}_{2}\bullet {S}_{{f}_{j}}^{2}\bullet k\left({\parallel {\xi }_{{f}_{j}}^{2}∕r\parallel }^{2}\right),\phantom{\rule{1em}{0ex}}{\xi }_{{f}_{j}}^{2}∊{f}_{j}\phantom{\rule{0.3em}{0ex}}\text{in}\phantom{\rule{0.3em}{0ex}}{F}_{2}\\ f{p}_{j}^{3}={C}_{3}\bullet {S}_{{f}_{j}}^{3}\bullet k\left({\parallel {\xi }_{{f}_{j}}^{3}∕r\parallel }^{2}\right),\phantom{\rule{1em}{0ex}}{\xi }_{{f}_{j}}^{3}∊{f}_{j}\phantom{\rule{0.3em}{0ex}}\text{in}\phantom{\rule{0.3em}{0ex}}{F}_{3}\end{array}\phantom{\right\}}$
where ${S}_{{f}_{i}}^{2}$ and ${S}_{{f}_{i}}^{3}$ are areas of subregion ${f}_{j}$ in ${F}_{2}$ and ${F}_{3}$ , respectively.

The fixed kernel bandwidth leads to ${C}_{2}=1∕{\iint }_{\sigma ={T}_{2}}k\left({\parallel x∕r\parallel }^{2}\right)d\sigma ={C}_{3}$ , and it is clear that ${S}_{{f}_{j}}^{2}={S}_{{f}_{j}}^{3}$ owing to ${F}_{2}={F}_{3}$ . Since $k$ is monotonic decreasing1 and $d\left({T}_{2},{F}_{2}\right) , we have $k\left({\parallel {\xi }_{{f}_{j}}^{2}∕r\parallel }^{2}\right)>k\left({\parallel {\xi }_{{f}_{j}}^{3}∕r\parallel }^{2}\right)$ . Consequently, we obtain $f{p}_{j}^{2}>f{p}_{j}^{3}$ . Moreover, ${\sum }_{j=1}^{{\nu }_{2}}b{p}_{j}^{2}<{\sum }_{j=1}^{{\nu }_{3}}b{p}_{j}^{3}$ holds owing to the constraint

## 4

$\sum _{j=1}^{u}f{p}_{j}^{i}+\sum _{j=1}^{{v}_{i}}b{p}_{j}^{i}=1,\phantom{\rule{1em}{0ex}}i=1,2,3.$
Since the scale of ${F}_{2}$ is less than ${F}_{1}$ , the area of ${B}_{2}$ is greater than ${B}_{1}$ . Thus, ${\sum }_{j=1}^{{\nu }_{1}}b{p}_{j}^{1}<{\sum }_{j=1}^{{\nu }_{2}}b{p}_{j}^{2}$ holds and then ${\sum }_{j=1}^{u}f{p}_{j}^{2}<{\sum }_{j=1}^{u}f{p}_{j}^{1}$ . Therefore,

## 5

$\left\{\begin{array}{c}1>\sum _{j=1}^{u}f{p}_{j}^{1}>\sum _{j=1}^{u}f{p}_{j}^{2}>\sum _{j=1}^{u}f{p}_{j}^{3}>0\\ 0<\sum _{j=1}^{{v}_{1}}b{p}_{j}^{1}<\sum _{j=1}^{{v}_{2}}b{p}_{j}^{2}<\sum _{j=1}^{{v}_{3}}b{p}_{j}^{3}<1\end{array}\phantom{\right\}}.$
According to Eqs. 2, 4, the geometric interpretation of the Bhattacharyya coefficient is the cosine of the angle between the $m$ -dimensional unit vectors $\left(\sqrt{{p}_{1}^{i}}\dots \sqrt{{p}_{m}^{i}}\right)$ and $\left(\sqrt{{p}_{1}^{j}}\dots \sqrt{{p}_{m}^{j}}\right)$ . The smaller angle they have, the more similar the two kernel histograms are. For the target tracking application, this angle is equal to the angle between two 2-D unit vectors: ${\mathbf{Z}}_{i}=\left[{\left({\sum }_{j=1}^{u}f{p}_{j}^{i}\right)}^{1∕2},{\left({\sum }_{j=1}^{{\nu }_{i}}b{p}_{j}^{i}\right)}^{1∕2}\right]$ and ${\mathbf{Z}}_{j}=\left[{\left({\sum }_{k=1}^{u}f{p}_{k}^{j}\right)}^{1∕2},{\left({\sum }_{k=1}^{{\nu }_{j}}b{p}_{k}^{j}\right)}^{1∕2}\right]$ . Then, $\rho \left(i,j\right)$ can be measured by $\angle \left({\mathbf{Z}}_{i},{\mathbf{Z}}_{j}\right)$ . Using Eqs. 4, 5 in conjunction with the geometric relationship, it is clear that $\angle \left({\mathbf{Z}}_{3},{\mathbf{Z}}_{1}\right)>\angle \left({\mathbf{Z}}_{2},{\mathbf{Z}}_{1}\right)$ . Finally, $\rho \left(2,1\right)>\rho \left(3,1\right)$ .

Using theorem 1, we can easily determine that the Bhattacharyya coefficient $\rho \left(t,1\right)$ is monotonic decreasing and achieves its maximum in the case where $d\left({T}_{t},{F}_{t}\right)=0$ . It means the image in ${T}_{t}$ $\left[d\left({T}_{t},{F}_{t}\right)=0\right]$ is most similar to the image in ${T}_{1}$ . As long as some parts of the object in the next frame reside inside the kernel, theorem 1 ensures mean-shift iterations converge to the object center.2, 5

In our experiments, the object kernel histogram computed by the Gaussian kernel has been derived in the RGB space with $32×32×32$ bins. Figure 1 shows two video clips where the size of tracking window (white circle) is unchanged. The top row shows the tracking results where the object expands its scale, while the bottom row demonstrates the results for the object shrinking its scale. In the first frame of each clip, the initial kernel histogram is obtained from the initial tracking window whose center overlaps the object center. Figure 2 shows the Bhattacharyya coefficients corresponding to the tracking windows centered in a $60×60$ neighborhood around the object center. Figures 2a and 2b correspond to Figs. 1b and 1d, respectively. The Bhattacharyya coefficient in Fig. 2b is monotonic decreasing and the maximum corresponds to the object center, which validates our theorem. In the case where the object expands its scale and can not be enwrapped by the tracking window, the monotonic decreasing profile in Fig. 2b no longer holds and poor localization potentially occurs; see also top row in Fig. 1. The reason lies in the fact that there are more local maxima in Fig. 2a and any location of a tracking window that is too small will yield a similar value of the Bhattacharyya coefficient.

## Fig. 1

Tracking results with fixed kernel bandwidth (left to right).

## Fig. 2

Surface plot of Bhattacharyya coefficient around the object center.

In conclusion, the changes of object scale and position within the fixed kernel will not impact the localization accuracy of the mean-shift tracking algorithm. When the object scale exceeds the size of the tracking window, the tracker outputs poor localization. On the contrary, when the object shrinks its scale, the center of the tracking window locates the object center all the time. Indeed, our previous work4 for tracking rigid objects with scale changes is based on this conclusion. We hope this paper will valuable for fully solving scaling problems within the mean-shift framework in the future.

## References

1.  K. Fukanaga and L. D. Hostetler, “The estimation of the gradient of a density function, with applications in pattern recognition,” IEEE Trans. Inf. Theory0018-9448 10.1109/TIT.1975.1055330 21, 32–40 (1975). Google Scholar

2.  G. R. Bradski, “Computer vision face tracking for use in a perceptual user interface,” in IEEE Workshop on Applications of Computer Vision, p. 214–219, Princeton, NJ (1998). Google Scholar

3.  A. Yilmaz, K. Shafique, and M. Shah, “Target tracking in airborne forward looking infrared imagery,” Image Vis. Comput.0262-8856 10.1016/S0262-8856(03)00059-3 21, 623–635 (2003). Google Scholar

4.  N. S. Peng, J. Yang, and J. X. Chen, “Kernel-bandwidth adaptation for tracking object changing in size,” Lecture Notes in Computer Science 3212, 581–588 (2004). Google Scholar

5.  D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” IEEE Trans. Pattern Anal. Mach. Intell.0162-8828 5, 564–575 (2003). Google Scholar

© (2005) Society of Photo-Optical Instrumentation Engineers (SPIE)
Ning Song Peng, Ning Song Peng, Jie Yang, Jie Yang, Zhi Liu, Zhi Liu, } "Performance analysis for tracking of variable scale objects using mean-shift algorithm," Optical Engineering 44(7), 070505 (1 July 2005). https://doi.org/10.1117/1.1985487 . Submission:
JOURNAL ARTICLE
3 PAGES

SHARE
KEYWORDS
RELATED CONTENT