The mean-shift algorithm^{1} is an efficient mode-seeking method that avoids an exhaustive search, which makes it suitable for real-time use. It has recently been introduced for tracking applications.^{2, 3, 4, 5} However, a fixed kernel bandwidth often leads to poor localization when the tracked object changes in scale. Moments have been used to compute the size of the tracking window.^{2} However, the computational complexity of that approach is too high to meet the real-time requirement. More commonly, the object scale is detected by calculating the Bhattacharyya coefficient for three different sizes (same scale,
$\pm 5\%$
change) and choosing the size that gives the highest similarity to the target model.^{5} Because this method adapts the scale without considering the underlying relationship between the similarity and the change in object scale, the size of the tracking window cannot always keep up with the object. In this paper, this relationship is analyzed theoretically as a step toward a complete solution in the future.

*Definition 1*. A round region
$T$
containing the whole object region
$F$
and some background region
$B$
is called a tracking window. The functions
$c\left(T\right)$
and
$c\left(F\right)$
denote the centers of
$T$
and
$F$
, respectively. Their distance is measured by
$d(T,F)=\parallel c\left(T\right)-c\left(F\right)\parallel $
.

*Definition 2*. Let
${\left\{{x}_{i}\right\}}_{i=1\dots n}$
be the pixel locations with
$c\left(T\right)$
as the origin point. The kernel histogram^{5} of
$T$
with
$m$
bins is defined by
$P={\left\{{p}_{\mu}\right\}}_{\mu =1\dots m}$
where

## 1

$${p}_{\mu}=C\sum _{i=1}^{n}k\left({\parallel {x}_{i}/r\parallel}^{2}\right)\delta [q\left({x}_{i}\right)-\mu ].$$

*Definition 3*. The similarity of two kernel histograms
${P}_{i}$
and
${P}_{j}$
with
$m$
bins is measured by the Bhattacharyya coefficient^{5}

## 2

$$\rho (i,j)=\sum _{\mu =1}^{m}\sqrt{{p}_{\mu}^{i}{p}_{\mu}^{j}},\quad i\ne j.$$

*Theorem 1*. Given
${T}_{1}$
with
$c\left({F}_{1}\right)=c\left({T}_{1}\right)$
in frame
$i$
and
${T}_{2}$
at the same position as
${T}_{1}$
in frame
$i+1$
where the object scale and position have changed,
$\forall {T}_{3}\in \text{frame}$
$i+1$
, if
$d({T}_{2},{F}_{2})<d({T}_{3},{F}_{3})$
then
$\rho (2,1)>\rho (3,1)$
.
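
Definitions 2 and 3, on which the theorem relies, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors' implementation: the symbols in Eq. 1 are read in the usual way ($k$ as the kernel profile, $r$ as the window radius, $q({x}_{i})$ as the bin index of pixel ${x}_{i}$, $C$ as the normalizing constant), the profile `exp(-2t)` is an arbitrary decreasing choice, and the function names and the toy image are hypothetical.

```python
import math

def gaussian_profile(t):
    """Kernel profile k(t) at t = ||x/r||^2; any decreasing profile would do (assumed form)."""
    return math.exp(-2.0 * t)

def kernel_histogram(bin_image, center, r, m, profile=gaussian_profile):
    """Eq. 1: p_mu = C * sum_i k(||x_i/r||^2) * delta[q(x_i) - mu].

    bin_image[y][x] plays the role of q(x_i) and holds a bin index in 0..m-1
    (the text counts bins from 1). The round window of radius r around
    `center` must lie entirely inside the image.
    """
    cx, cy = center
    hist = [0.0] * m
    for y in range(cy - r, cy + r + 1):
        for x in range(cx - r, cx + r + 1):
            t = ((x - cx) ** 2 + (y - cy) ** 2) / float(r * r)
            if t <= 1.0:  # pixel lies inside the round window T
                hist[bin_image[y][x]] += profile(t)
    total = sum(hist)  # C = 1 / total makes the entries sum to one
    return [h / total for h in hist]

def bhattacharyya(p, q):
    """Eq. 2: rho = sum over bins of sqrt(p_mu * q_mu)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

# Toy 9x9 bin image: bin 0 everywhere except a 3x3 block of bin 1 in the middle.
image = [[1 if 3 <= x <= 5 and 3 <= y <= 5 else 0 for x in range(9)]
         for y in range(9)]
p = kernel_histogram(image, center=(4, 4), r=4, m=2)
print(p, bhattacharyya(p, p))
```

A histogram compared with itself gives a coefficient of one (up to rounding), and histograms with no common occupied bin give zero, which are the two extremes of the similarity measure.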

*Proof*. Assume, without loss of generality, that (1) the object shrinks its scale from frame
$i$
to
$i+1$
. (2)
${F}_{i}$
,
$i=1,2,3$
consists of
$u$
subregions with different intensity levels, i.e.,
${F}_{i}={\left\{{f}_{j}\right\}}_{j=1\dots u}$
, while
${B}_{i}$
,
$i=1,2,3$
consists of
${\nu}_{i}$
subregions with different intensity levels, i.e.,
${B}_{i}={\left\{{b}_{j}\right\}}_{j=1\dots {\nu}_{i}}$
. (3) Consider
${T}_{i}$
; suppose its kernel histogram
${P}_{i}={\left\{{p}_{\mu}^{i}\right\}}_{\mu =1\dots m}$
consists of two sets of entries,
${\left\{f{p}_{j}^{i}\right\}}_{j=1\dots u}$
and
${\left\{b{p}_{j}^{i}\right\}}_{j=1\dots {\nu}_{i}}$
, corresponding to the subregions
${\left\{{f}_{j}\right\}}_{j=1\dots u}$
and
${\left\{{b}_{j}\right\}}_{j=1\dots {\nu}_{i}}$
, respectively, where
$u+\mathrm{max}({\nu}_{1},{\nu}_{2},{\nu}_{3})\le m$
.

The continuous form of Eq. 1 is as follows:

## 3

$$\left\{\begin{array}{c}f{p}_{j}^{2}={C}_{2}\cdot {S}_{{f}_{j}}^{2}\cdot k\left({\parallel {\xi}_{{f}_{j}}^{2}/r\parallel}^{2}\right),\quad {\xi}_{{f}_{j}}^{2}\in {f}_{j}\ \text{in}\ {F}_{2}\\ f{p}_{j}^{3}={C}_{3}\cdot {S}_{{f}_{j}}^{3}\cdot k\left({\parallel {\xi}_{{f}_{j}}^{3}/r\parallel}^{2}\right),\quad {\xi}_{{f}_{j}}^{3}\in {f}_{j}\ \text{in}\ {F}_{3}\end{array}\right.$$

The fixed kernel bandwidth leads to
${C}_{2}=1/{\iint}_{\sigma ={T}_{2}}k\left({\parallel x/r\parallel}^{2}\right)d\sigma ={C}_{3}$
, and it is clear that
${S}_{{f}_{j}}^{2}={S}_{{f}_{j}}^{3}$
owing to
${F}_{2}={F}_{3}$
. Since
$k$
is monotonically decreasing^{1} and
$d({T}_{2},{F}_{2})<d({T}_{3},{F}_{3})$
, we have
$k\left({\parallel {\xi}_{{f}_{j}}^{2}/r\parallel}^{2}\right)>k\left({\parallel {\xi}_{{f}_{j}}^{3}/r\parallel}^{2}\right)$
. Consequently, we obtain
$f{p}_{j}^{2}>f{p}_{j}^{3}$
. Moreover,
${\sum}_{j=1}^{{\nu}_{2}}b{p}_{j}^{2}<{\sum}_{j=1}^{{\nu}_{3}}b{p}_{j}^{3}$
holds owing to the constraint

## 4

$$\sum _{j=1}^{u}f{p}_{j}^{i}+\sum _{j=1}^{{\nu}_{i}}b{p}_{j}^{i}=1,\quad i=1,2,3.$$

Combining these inequalities with assumption (1) gives

## 5

$$\left\{\begin{array}{c}1>\sum _{j=1}^{u}f{p}_{j}^{1}>\sum _{j=1}^{u}f{p}_{j}^{2}>\sum _{j=1}^{u}f{p}_{j}^{3}>0\\ 0<\sum _{j=1}^{{\nu}_{1}}b{p}_{j}^{1}<\sum _{j=1}^{{\nu}_{2}}b{p}_{j}^{2}<\sum _{j=1}^{{\nu}_{3}}b{p}_{j}^{3}<1\end{array}\right..$$

Using Theorem 1, we can easily determine that the Bhattacharyya coefficient
$\rho (t,1)$
decreases monotonically as the distance between the window center and the object center grows, and achieves its maximum in the case where
$d({T}_{t},{F}_{t})=0$
. This means that the image in
${T}_{t}$
$[d({T}_{t},{F}_{t})=0]$
is most similar to the image in
${T}_{1}$
. As long as some part of the object in the next frame resides inside the kernel, Theorem 1 ensures that the mean-shift iterations converge to the object center.^{2, 5}
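
The step from Eq. 5 to the ordering of the coefficients can be illustrated in the simplest setting. As an assumption made only for this illustration (it is not the general argument), let the object and the background each fall into a single bin, the same two bins in every window, so that $u={\nu}_{i}=1$. Writing $q$ for the object entry of ${P}_{1}$ and $p$ for the object entry of a candidate histogram, Eqs. 2 and 4 give

$$\rho (p)=\sqrt{pq}+\sqrt{(1-p)(1-q)},\quad \frac{d\rho}{dp}=\frac{1}{2}\left(\sqrt{\frac{q}{p}}-\sqrt{\frac{1-q}{1-p}}\right)>0\quad \text{for}\ 0<p<q,$$

so $\rho$ increases with $p$ on $(0,q)$. In this two-bin case the first line of Eq. 5 reads $f{p}_{1}^{1}>f{p}_{1}^{2}>f{p}_{1}^{3}$, that is, $q$ exceeds both candidate entries and the entry of ${T}_{2}$ exceeds that of ${T}_{3}$, which yields $\rho (2,1)>\rho (3,1)$.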

In our experiments, the object kernel histogram is computed with a Gaussian kernel in RGB space with $32\times 32\times 32$ bins. Figure 1 shows two video clips in which the size of the tracking window (white circle) is fixed. The top row shows the tracking results when the object expands in scale, and the bottom row shows the results when the object shrinks. In the first frame of each clip, the initial kernel histogram is obtained from the initial tracking window, whose center coincides with the object center. Figure 2 shows the Bhattacharyya coefficients for tracking windows centered in a $60\times 60$ neighborhood around the object center. Figures 2a and 2b correspond to Figs. 1b and 1d, respectively. The Bhattacharyya coefficient in Fig. 2b is monotonically decreasing, and its maximum corresponds to the object center, which validates our theorem. When the object expands and can no longer be enclosed by the tracking window, the monotonically decreasing profile of Fig. 2b no longer holds, and poor localization can occur; see also the top row in Fig. 1. The reason is that there are more local maxima in Fig. 2a, and any location of a tracking window that is too small yields a similar value of the Bhattacharyya coefficient.
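
The qualitative behavior described for Fig. 2 can be reproduced on a synthetic scene. The sketch below is not the authors' experiment. It assumes a two-bin image (background in bin 0, a disk-shaped object in bin 1) instead of RGB video, a decreasing profile `exp(-2t)` truncated at the window border, and hypothetical helper names, and it shifts the window along one axis only instead of scanning a full neighborhood.

```python
import math

def kernel_histogram(bin_image, center, r, m):
    """Kernel histogram of Eq. 1 for a round window of radius r (profile exp(-2t) assumed)."""
    cx, cy = center
    hist = [0.0] * m
    for y in range(cy - r, cy + r + 1):
        for x in range(cx - r, cx + r + 1):
            t = ((x - cx) ** 2 + (y - cy) ** 2) / float(r * r)
            if t <= 1.0:  # keep only pixels inside the round window
                hist[bin_image[y][x]] += math.exp(-2.0 * t)
    total = sum(hist)
    return [h / total for h in hist]

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of Eq. 2."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def disk_scene(size, center, radius):
    """Synthetic frame: bin 1 inside a disk-shaped object, bin 0 elsewhere."""
    cx, cy = center
    return [[1 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0
             for x in range(size)] for y in range(size)]

def coefficient_profile(frame, target, center, r, offsets):
    """Coefficient between the target model and windows shifted along x."""
    cx, cy = center
    return [bhattacharyya(kernel_histogram(frame, (cx + d, cy), r, 2), target)
            for d in offsets]

SIZE, CENTER, R = 81, (40, 40), 12          # image size, object center, window radius
OFFSETS = list(range(0, 11))                # window shifts in pixels

# Target model: window centered on an object of radius 8 in the first frame.
target = kernel_histogram(disk_scene(SIZE, CENTER, 8), CENTER, R, 2)

# Next frame, two cases: the object shrinks (radius 6) or outgrows the window (radius 25).
shrunk = coefficient_profile(disk_scene(SIZE, CENTER, 6), target, CENTER, R, OFFSETS)
expanded = coefficient_profile(disk_scene(SIZE, CENTER, 25), target, CENTER, R, OFFSETS)

for d, s, e in zip(OFFSETS, shrunk, expanded):
    print(d, round(s, 4), round(e, 4))
```

In the shrunken case the coefficient falls steadily as the window moves off the object center, so the maximum sits at zero offset. In the expanded case every shifted window lies entirely inside the object, all offsets give the same coefficient, and no unique maximum exists for the tracker to find.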

In conclusion, changes in object scale and position within the fixed kernel do not impair the localization accuracy of the mean-shift tracking algorithm. When the object scale exceeds the size of the tracking window, the tracker localizes poorly. In contrast, when the object shrinks, the center of the tracking window consistently locates the object center. Indeed, our previous work^{4} on tracking rigid objects with scale changes is based on this conclusion. We hope this paper will be valuable for fully solving the scaling problem within the mean-shift framework in the future.