## 1. Introduction

Evaluating the reliability of a tracking algorithm is important because it can guide the design of a good tracker. A variety of reliability measures have been proposed to improve the robustness of the tracking process.^{1, 2, 3, 4} Several feature-point-based metrics are proposed in Ref. 1 for the analysis of partial and total occlusion in video tracking. Erdem introduced other metrics based on color and motion differences.^{2} However, these feature-point-based and color-based metrics are not suitable for evaluating video infrared target tracking, because the feature points and color information extracted from the target region are not reliable in infrared images. Infrared sequences are extremely noisy owing to rampant systemic noise or color noise introduced by the sensing instrument and to noise from the environment.^{5} The aim of this letter is to design a metric that quantitatively evaluates the performance of infrared target tracking with a kernel-based method, using the intensity values discriminatively and avoiding the extraction of feature points from the target region.

## 2. Tracker Evaluation Metric

A kernel-based target tracking approach, such as the mean shift algorithm,^{6} is commonly used in the tracking field. Let $\{x_{i}\}_{i=1\dots n}$ be the normalized pixel locations in the target region with center $c$ in the current frame. The function $b:{R}^{2}\to \{1\dots m\}$ (an $m$-bin histogram is used) associates with the pixel at location $x_{i}$ the index $b(x_{i})$ of its bin in the quantized feature space. The kernel density estimate of the feature $u=1\dots m$ in the target region is computed as^{6}

$$q_{u}=C\sum_{i=1}^{n}k\left(\left\|\frac{x_{i}-c}{h}\right\|^{2}\right)\delta [b(x_{i})-u], \tag{1}$$

where $k$ is the kernel profile, $h$ is the kernel bandwidth, $C$ is a normalization constant that makes the $q_{u}$ sum to 1, and $\delta$ is the Kronecker delta function.

It is unavoidable that some background pixels fall inside the located target region unless a contour-based method is used, in which tracking is achieved by evolving the contour from frame to frame.^{7} To evaluate the tracking performance, we seek the discriminative components of the tracking model, that is, the components that best describe the tracked target. A rectangular set of pixels covering the target is chosen to represent the target pixels, and an outer surrounding ring of pixels is chosen to form the sampled background. Given a certain feature $u$, let $q_{u}$ and $o_{u}$ be the kernel density estimates of feature $u$ for pixels in the target region and in the background sample, respectively. The log-likelihood ratio of the feature $u$ is given by^{8}

$$L(u)=\log \frac{\max (q_{u},\xi )}{\max (o_{u},\xi )},\quad u=1\dots m, \tag{3}$$

Therefore, a cost function $S_{k}$ is defined to embody the information lost from the selected discriminative components of the initial target region during the tracking process:

where $N$ is the number of pixels in the target region that construct the selected components in the initial frame and $N_{k}$ is the number of pixels in the target region that construct these components in frame $k$. Large values of $S_{k}$ indicate a loss of information in the selected components of the initial target model.
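Equations 1 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the Epanechnikov profile, the value of `xi`, the use of the same estimator for the background ring, and all function names are hypothetical choices, and bins are indexed from 0 rather than 1.

```python
import math

def epanechnikov_profile(r2):
    """Kernel profile k(r^2). Epanechnikov is an assumed choice of kernel."""
    return 1.0 - r2 if r2 < 1.0 else 0.0

def kernel_histogram(pixels, center, h, m, profile=epanechnikov_profile):
    """Eq. 1. `pixels` is a list of ((x, y), bin_index) with bin_index in 0..m-1."""
    hist = [0.0] * m
    for (x, y), b in pixels:
        # Squared normalized distance of the pixel from the region center.
        r2 = ((x - center[0]) / h) ** 2 + ((y - center[1]) / h) ** 2
        hist[b] += profile(r2)
    total = sum(hist)
    if total > 0.0:
        hist = [v / total for v in hist]  # plays the role of the constant C
    return hist

def log_likelihood_ratio(q, o, xi=1e-3):
    """Eq. 3. `xi` (assumed value) keeps the ratio and the logarithm finite."""
    return [math.log(max(qu, xi) / max(ou, xi)) for qu, ou in zip(q, o)]

# Toy data: three target pixels and three background pixels, four bins.
target = [((0.0, 0.0), 3), ((0.5, 0.0), 3), ((0.0, 0.5), 1)]
background = [((0.0, 0.0), 0), ((0.5, 0.0), 1), ((0.0, 0.5), 1)]
q = kernel_histogram(target, (0.0, 0.0), 1.0, 4)
o = kernel_histogram(background, (0.0, 0.0), 1.0, 4)
ratio = log_likelihood_ratio(q, o)
```

In this toy example, the bin that occurs only in the target receives a positive ratio and the bin that occurs only in the background receives a negative one, which is the property used to pick discriminative components.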

For two discrete-valued random vectors $X$ and $Y$ with marginal probability mass functions $p(x)$ and $p(y)$ and joint probability mass function $p(x,y)$, the mutual information between them is defined as

$$I(X,Y)=\sum_{x}\sum_{y}p(x,y)\log \frac{p(x,y)}{p(x)p(y)}. \tag{6}$$

$$p(v\mid u)=\frac{1}{\sqrt{2\pi }\,\sigma }\exp \left[-\frac{(u-v)^{2}}{2\sigma ^{2}}\right], \tag{9}$$

$$I(U,V)=\sum_{u=1}^{m}\sum_{v=1}^{m}p(u,v)\log \frac{p(u,v)}{p(u)p(v)}. \tag{10}$$

$$H_{1}=\sum_{u=1}^{m}p(u)\log u,\quad H_{2}=\sum_{v=1}^{m}p(v)\log v, \tag{12}$$

A single metric for evaluating the tracking performance can be obtained by combining the information of the discriminative components of the kernel target model in frame $k$ with the kernel mutual information cost function defined above, as follows:

where the constants $c_{1}$, $\alpha$, and $c_{2}$ are chosen to satisfy $0\le E_{k}\le 1$. In our work, the constants are chosen in the same way as for the feature-points-based mutual information metric presented in Ref. 1, that is, $c_{1}=0.5$, $\alpha =-1$, $c_{2}=1$. This means that when the tracked target is lost $(S_{k}=1,M_{k}=0)$, $E_{k}$ attains its minimum value 0, whereas when the target is located entirely accurately $(S_{k}=0,M_{k}=1)$, $E_{k}$ attains its maximum value 1. The kernel-based metric $E_{k}$ measures the tracking performance of a tracking process: a large value of $E_{k}$ represents good tracking performance and a reliable tracker output in the current frame.
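The mutual information of Eq. 10, on which the kernel mutual information cost function relies, can be evaluated directly from a joint table. The sketch below assumes that the joint distribution $p(u,v)$ is already available as an $m\times m$ table; how that table is built from the two target regions is outside this sketch. The function name is hypothetical, bins are 0-based, and the natural logarithm is an assumed choice.

```python
import math

def mutual_information(joint):
    """I(U,V) = sum_u sum_v p(u,v) log(p(u,v) / (p(u) p(v))) for a joint table."""
    total = sum(sum(row) for row in joint)
    p = [[v / total for v in row] for row in joint]  # normalize defensively
    pu = [sum(row) for row in p]                     # marginal over v
    pv = [sum(col) for col in zip(*p)]               # marginal over u
    mi = 0.0
    for i, row in enumerate(p):
        for j, puv in enumerate(row):
            if puv > 0.0:  # the term 0 * log 0 is taken as 0
                mi += puv * math.log(puv / (pu[i] * pv[j]))
    return mi

# Two limiting cases on a 2-bin feature space.
independent = [[0.25, 0.25], [0.25, 0.25]]  # U and V unrelated
identical = [[0.5, 0.0], [0.0, 0.5]]        # V always equals U
mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```

The two cases bracket the behavior of interest: unrelated regions give zero mutual information, while identical feature distributions give the largest value attainable for that marginal distribution.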

## 3. Experimental Results

Different tracked regions produced by a standard mean shift tracker^{6} on a 400-frame infrared ship sequence (each frame is $128\times 128$ pixels) and a 100-frame infrared plane sequence (each frame is $160\times 120$ pixels) are evaluated with the kernel-based metric. The intensity space is taken as the feature space and is quantized into 64 bins. We implemented the tracking algorithm with the metric output in VC++ 6.0 on a Pentium 4 platform; the current implementation runs at 15 and 17 frames/s on the ship sequence and the plane sequence, respectively. The kernel-based metric is applied to evaluate the tracking process after a top-hat transform preprocessing of the target region. Some representative frames from these sequences are shown in Figs. 1 and 2, respectively. The rectangle in each infrared image indicates the located target region. The metric outputs for the different located target regions represent quantitatively the amount of information about the selected target that the tracker captures in different frames. The variations of the tracking performance indicated by the proposed metric over the image frames of the two sequences are shown in Figs. 3 and 4.
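The quantization of the intensity space into 64 bins can be sketched as follows. The sketch assumes 8-bit intensity values (0 to 255), which is not stated above; under that assumption each bin spans four gray levels. The function name is hypothetical and bins are 0-based.

```python
def intensity_bin(intensity, m=64, levels=256):
    """Map an integer intensity in [0, levels) to a bin index in [0, m)."""
    if not 0 <= intensity < levels:
        raise ValueError("intensity out of range")
    return intensity * m // levels

# Gray levels 0..3 share bin 0, levels 4..7 share bin 1, and so on.
bins = [intensity_bin(v) for v in (0, 3, 4, 128, 255)]
```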

The parameters $c_{1}$, $\alpha$, and $c_{2}$ in Eq. 14 are chosen to satisfy the requirement $0\le E_{k}\le 1$, and their values are kept constant throughout the experiments. From Fig. 4, we find that the variation of the cost function $M_{k}$ is almost the same as that of the proposed metric, and that the cost function $S_{k}$ follows a similar curve but with the reverse variation, because it evaluates the information lost from the selected components of the initial target model during the tracking process. In fact, in most cases we can treat the cost functions identically by setting the parameters to $c_{1}=0.5$, $\alpha =-1$, and $c_{2}=1$. Note that for abrupt appearance changes (for example, the size of the tracked target increases abruptly when one target crosses another), the metric is ineffective because the tracker output is not reliable in this situation. Since such abrupt changes are transient, the metric works effectively again afterward. A robust tracker with a proper model update method is less sensitive to appearance changes and can track the target even when the tracked target model differs greatly from the initial target model. Here, $N$ in Eq. 5 and $H_{1}$ in Eq. 11, which are computed from the target region of the initial frame, are also updated when a model update method is implemented.
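The effect of the parameter choice can be checked numerically. The sketch below assumes, purely for illustration, a linear combination $E_{k}=c_{1}(M_{k}+\alpha S_{k}+c_{2})$; this form is chosen only because it reproduces the two endpoint values stated for $c_{1}=0.5$, $\alpha =-1$, $c_{2}=1$ and should not be read as the exact definition in Eq. 14. The function name is hypothetical.

```python
def combined_metric(m_k, s_k, c1=0.5, alpha=-1.0, c2=1.0):
    """ASSUMED linear form E_k = c1 * (M_k + alpha * S_k + c2); illustrative only."""
    return c1 * (m_k + alpha * s_k + c2)

lost = combined_metric(m_k=0.0, s_k=1.0)      # target lost
accurate = combined_metric(m_k=1.0, s_k=0.0)  # target located accurately
partial = combined_metric(m_k=0.5, s_k=0.5)   # intermediate case
```

Under this assumed form the metric rises with $M_{k}$ and falls with $S_{k}$, consistent with the reverse variation of $S_{k}$ noted above, and it stays within $[0,1]$ whenever both cost functions lie in $[0,1]$.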

## 4. Conclusions

This letter has presented a kernel-based metric to evaluate the reliability of the tracking process. The metric is constructed with a kernel method by combining the information flow of the selected discriminative components of the kernel target model with the kernel mutual information between the target regions of the initial frame and the current frame. Future research will attempt to design a more suitable kernel target model to complement the kernel-based metric.

## Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments. This work is partially supported by the Aeronautics Science Fund (China) under Grant No. 04F57004.