## 1. Introduction

Foreground segmentation plays an important role in a wide range of computer vision applications. Foreground modeling^{1, 2} has recently been used in conjunction with background modeling^{3} for segmentation. Foreground and background models can be created in a consistent fashion, and the nonparametric statistical model^{4} is now frequently used for both.

To compare the performance of different segmentation algorithms, several metrics have been proposed. Precision and recall^{1} are the standard measures in the current literature. These two measures compare segmentations with the ground truth at the pixel level and ignore region-level information. Nascimento and Marques^{5} proposed a region-level method that classifies segmentation errors into detection failures, false alarms, splits, merges, and split/merges. The method presented in Ref. 6 is also a pixel-level approach, designed to compare the ground truth with detected silhouettes used in gait recognition.

It is known that segmentation performance depends largely on foreground modeling. Although some metrics have been presented for comparing segmentation performance, no metric has been reported for comparing model performance. We present a novel metric to compare the performance of different foreground models. Further, the proposed metric can explain the difference in segmentation performance between algorithms from the perspective of foreground modeling. The metric is also helpful in developing new foreground models.

This work is organized as follows. The proposed metric is described in Sec. 2. Experimental results are given in Sec. 3, followed by conclusions in Sec. 4.

## 2. Proposed Metric

Some nonparametric methods use multiple features as statistical variables.^{7, 8} Although using multiple statistical variables distinctly improves segmentation performance, it is still difficult to obtain a complete segmentation, because statistical analysis cannot resolve the uncertainty of the foreground, such as the motion and deformation of a moving object. We therefore do not consider models that use multiple features for statistical analysis, but only models that exploit multiple features in some other way. For example, the model proposed in Ref. 9 uses the color histogram to select the most suitable samples for foreground modeling from all historical segmentations.

Let ${I}^{t}$ be the input image at time instant $t$, and ${I}_{n}^{t}$ be the color vector of the pixel at position $n$ (for fair comparison, the YUV color space is used for all models). All nonparametric statistical foreground models of ${I}^{t}$ can be denoted in the form ${\varphi}^{t}=\{{Y}^{1},{Y}^{2},\dots ,{Y}^{R}\}$, where each element of ${\varphi}^{t}$ consists of all pixels labeled foreground at a certain time instant, and $R$ is the number of frames in ${\varphi}^{t}$. Let ${X}^{t}$ be the binary ground truth of image ${I}^{t}$, with 1 and 0 denoting foreground and background pixels, respectively. Let ${X}^{r}$ be the binary mask of ${Y}^{r}$, with 1 and 0 denoting pixels labeled foreground in ${Y}^{r}$ and all other pixels, respectively. Let ${Q}^{t,r}$ be the XOR image of ${X}^{t}$ and ${X}^{r}$, where each pixel ${Q}_{n}^{t,r}$ of ${Q}^{t,r}$ is defined as

$${Q}_{n}^{t,r}=\begin{cases}1,& \text{if } {X}_{n}^{t}={X}_{n}^{r}=1\\ 0,& \text{if } {X}_{n}^{t}={X}_{n}^{r}=0\\ -1,& \text{if } {X}_{n}^{t}\ne {X}_{n}^{r}\end{cases} \tag{1}$$

The performance of ${\varphi}^{t}$ can be measured with the proposed metric as

$$M\left({\varphi}^{t}\right)=\frac{\sum_{r}\sum_{n}{Q}_{n}^{t,r}}{R\sum_{n}{X}_{n}^{t}}. \tag{2}$$

According to the definition of $M$, the most desirable foreground model should be such that each element ${X}^{r}$ of the model is the same as ${X}^{t}$. In other words, each element in the most desirable model is a segmentation in which the moving object shows the same shape in the same place as the moving object in the current frame. In the previous definitions, $Y$ is the set of all pixels classified as foreground in a certain frame, where each pixel is a color vector, and $X$ is a binary image with the same size as the original image $I$.
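The two definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes masks are given as nested lists of 0/1 values, reads the denominator of the metric as $R$ multiplied by the number of ground-truth foreground pixels, and uses the hypothetical names `q_sum` and `model_metric`.

```python
def q_sum(x_t, x_r):
    """Sum of Q^{t,r} over all pixels: +1 where both masks are 1,
    0 where both are 0, and -1 where they disagree."""
    total = 0
    for row_t, row_r in zip(x_t, x_r):
        for a, b in zip(row_t, row_r):
            if a != b:
                total -= 1
            elif a == 1:
                total += 1
    return total


def model_metric(x_t, model_masks):
    """M(phi^t): summed Q over the R model masks, normalized by
    R times the number of ground-truth foreground pixels."""
    fg = sum(sum(row) for row in x_t)
    r = len(model_masks)
    if r == 0 or fg == 0:
        raise ValueError("need at least one model mask and one foreground pixel")
    return sum(q_sum(x_t, x_r) for x_r in model_masks) / (r * fg)


# Toy ground truth and a model element displaced by one pixel.
x_t = [[0, 1, 1, 0],
       [0, 1, 1, 0]]
shifted = [[0, 0, 1, 1],
           [0, 0, 1, 1]]

print(model_metric(x_t, [x_t]))           # 1.0 (every element equals X^t)
print(model_metric(x_t, [x_t, shifted]))  # 0.25
```

In this toy case a single displaced element pulls the score from 1 down to 0.25, which is the kind of sensitivity to object motion the metric is meant to expose.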

To test the robustness of the model to an object’s motion, we could compute the position distance between corresponding objects in ${X}^{t}$ and ${X}^{r}$. To test the robustness of the model to an object’s deformation, we could compute the shape distance of the moving object between ${X}^{t}$ and ${X}^{r}$ using various shape descriptors. However, finding corresponding features is labor intensive and error prone, and ${X}^{r}$ is often corrupted by splits and defects, which makes shape similarity measurement unreliable. The advantage of the proposed metric is that no distance computation is needed, and the uncertainty of the moving object is implicitly highlighted by the XOR operation.
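As a worked illustration (not part of the original derivation), suppose a model element ${X}^{r}$ contains an object with the same shape and area as the object in ${X}^{t}$, but displaced so that the two masks overlap in only part of their area. Writing $A$ for the common area in pixels and $O$ for the overlap (both symbols are introduced here only for this example), the pixel-wise definition of ${Q}^{t,r}$ gives $+1$ on the $O$ overlapping pixels and $-1$ on the $A-O$ pixels that each mask covers alone:

$$\sum_{n}{Q}_{n}^{t,r}=O-2\left(A-O\right)=3O-2A,\qquad \frac{\sum_{n}{Q}_{n}^{t,r}}{\sum_{n}{X}_{n}^{t}}=3\,\frac{O}{A}-2.$$

Under these assumptions the contribution of a single element falls linearly from $1$ at perfect overlap to $-2$ when the displaced object no longer overlaps the true object at all, so a motion error is penalized without any explicit position distance being computed.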

## 3. Experimental Results

The proposed metric is applied to characterize the performance of five foreground models. Each foreground model is used in conjunction with a background model to classify pixels based on energy minimization, as in Ref. 1. The five foreground models are briefly described as follows.

The first foreground model, the general foreground model ${\varphi}_{G}^{t}$,^{1} can be denoted as ${\varphi}_{G}^{t}=\{{G}^{t-1},\dots ,{G}^{t-r},\dots ,{G}^{t-R}\}$, where ${G}^{t-r}$ is the set of all pixels labeled foreground at time instant $t-r$. We then consider two ways of using motion information for foreground modeling. For simplicity, only single-object detection is considered. In the first, the centroid of the moving object in the current frame is predicted by a Kalman filter, and all elements in ${\varphi}_{G}^{t}$ are moved from their centroids to the predicted centroid, resulting in the second foreground model ${\varphi}_{P}^{t}$.^{10} An alternative to prediction is tracking: the moving object is tracked from one frame to the next by the mean-shift tracker, and all elements in ${\varphi}_{G}^{t}$ are shifted from their centroids to the centroid of the tracking window in the current frame, leading to the third foreground model ${\varphi}_{T}^{t}$.
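The alignment step shared by the second and third models can be sketched as follows. This is an illustrative sketch only: it operates on binary masks (the color values of each element would move with them), rounds the translation to whole pixels, and takes the target centroid as an input, since the Kalman predictor and mean-shift tracker themselves are not reproduced here. The names `centroid` and `shift_to` are hypothetical.

```python
def centroid(mask):
    """(row, col) centroid of the foreground pixels of a binary mask."""
    pts = [(i, j) for i, row in enumerate(mask) for j, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask has no foreground pixels")
    n = len(pts)
    return sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n


def shift_to(mask, target):
    """Translate the mask so its centroid lands on `target`, rounded to
    whole pixels. Pixels shifted outside the frame are dropped."""
    ci, cj = centroid(mask)
    di, dj = round(target[0] - ci), round(target[1] - cj)
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i, row in enumerate(mask):
        for j, v in enumerate(row):
            ni, nj = i + di, j + dj
            if v and 0 <= ni < h and 0 <= nj < w:
                out[ni][nj] = 1
    return out


# A historical mask and a target centroid, e.g. supplied by a predictor or tracker.
old_mask = [[1, 1, 0, 0],
            [0, 0, 0, 0]]
aligned = shift_to(old_mask, (1, 2.5))
print(aligned)  # [[0, 0, 0, 0], [0, 0, 1, 1]]
```

Applying `shift_to` to every element of the general model with a predicted centroid corresponds to building ${\varphi}_{P}^{t}$, and with a tracked centroid to building ${\varphi}_{T}^{t}$, under the simplifications stated above.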

The fourth foreground model^{9} takes advantage of both shape and motion information for foreground modeling. First, predetection is carried out on the current frame with ${\varphi}_{G}^{t}$. Then all historical segmentations are aligned to the presegmentation based on the centroid of the moving object. The shape similarity of the moving object between the presegmentation and each aligned segmentation is measured based on the color histogram. The $R$ aligned segmentations with the largest similarity values are chosen to form the fourth foreground model ${\varphi}_{H}^{t}$. The fifth foreground model ${\varphi}_{U}^{t}$ consists of the same historical segmentations as ${\varphi}_{H}^{t}$, but with each element unaligned, which means that the motion information is ignored.
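The sample-selection step can be illustrated with a short sketch. The text does not state which histogram similarity is used, so this example assumes histogram intersection over a coarse joint color histogram. The bin count, the 8-bit three-channel pixel format, and the names `color_histogram`, `intersection`, and `select_samples` are likewise assumptions made for illustration.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint histogram of 8-bit, 3-channel colors."""
    hist = [0.0] * (bins ** 3)
    for c0, c1, c2 in pixels:
        b0, b1, b2 = c0 * bins // 256, c1 * bins // 256, c2 * bins // 256
        hist[(b0 * bins + b1) * bins + b2] += 1
    n = len(pixels)
    return [h / n for h in hist] if n else hist


def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def select_samples(pre_pixels, history, r):
    """Return the r historical segmentations whose foreground colors
    are most similar to those of the presegmentation."""
    ref = color_histogram(pre_pixels)
    ranked = sorted(history,
                    key=lambda px: intersection(ref, color_histogram(px)),
                    reverse=True)
    return ranked[:r]


# Each segmentation is represented by the colors of its foreground pixels.
pre = [(200, 10, 10)] * 4
history = [
    [(10, 200, 10)] * 4,                         # different colors
    [(210, 20, 5)] * 4,                          # same bins as `pre`
    [(200, 10, 10)] * 2 + [(10, 10, 200)] * 2,   # half matching
]
chosen = select_samples(pre, history, r=2)
print(chosen == [history[1], history[2]])  # True
```

Aligning the chosen segmentations to the presegmentation centroid would correspond to ${\varphi}_{H}^{t}$, and leaving them in place to ${\varphi}_{U}^{t}$.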

The first column of Fig. 1 shows two typical images from the first test sequence, in which the foreground and background colors are very similar. The foreground detected by ${\varphi}_{G}^{t}$, ${\varphi}_{U}^{t}$, ${\varphi}_{P}^{t}$, ${\varphi}_{T}^{t}$, and ${\varphi}_{H}^{t}$ is shown in the second to sixth columns, respectively. Segmentations of 32 frames are compared with the ground truth in terms of recall,^{1} which characterizes the robustness of the segmentations to splits and defects caused by the color similarity problem. The performance test is shown in Fig. 2.
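For reference, pixel-level precision and recall can be computed from a ground-truth mask and a detected mask as in the sketch below. It uses the standard definitions (true positives over detected foreground, and true positives over true foreground). Returning 0.0 when a denominator is zero is a convention chosen here, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(truth, detected):
    """Pixel-level precision and recall of a detected binary mask
    against a ground-truth binary mask of the same size."""
    tp = fp = fn = 0
    for row_t, row_d in zip(truth, detected):
        for t, d in zip(row_t, row_d):
            if t and d:
                tp += 1   # foreground correctly detected
            elif d:
                fp += 1   # background detected as foreground
            elif t:
                fn += 1   # foreground missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# A detection that misses half of the object (e.g. because of a defect).
truth = [[1, 1, 1, 1]]
detected = [[1, 1, 0, 0]]
print(precision_recall(truth, detected))  # (1.0, 0.5)
```

Missed foreground pixels lower recall but leave precision untouched, which is why recall is the measure that reflects splits and defects in the detected object.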

Figure 2 shows that segmentation performance clearly depends on model performance. By using motion information, the model accuracy is greatly improved compared with ${\varphi}_{G}^{t}$; as a result, ${\varphi}_{P}^{t}$ and ${\varphi}_{T}^{t}$ yield segmentations with much better recall. The segmentations by ${\varphi}_{T}^{t}$ are slightly better than those by ${\varphi}_{P}^{t}$ because of the better model performance of ${\varphi}_{T}^{t}$. The reason is that the tracker uses the information in the current frame, whereas the predictor does not.

Shape alone cannot provide a notable improvement in modeling and segmentation. However, combining motion and shape, as in ${\varphi}_{H}^{t}$, yields a more obvious improvement than using motion or shape alone. Some foreground pixels still cannot be detected in the last column of Fig. 1. This suggests the use of finer features for shape representation, such as the Zernike moment descriptor and the Fourier descriptor. The performance of different shape descriptors for foreground modeling can also be assessed with the proposed metric.

Experimental results on the second test sequence are shown in Figs. 3 and 4. This sequence also appears in Ref. 11. The second to last columns of Fig. 3 are segmentations by ${\varphi}_{G}^{t}$, ${\varphi}_{U}^{t}$, ${\varphi}_{P}^{t}$, ${\varphi}_{T}^{t}$, and ${\varphi}_{H}^{t}$, respectively. From the segmentations in Fig. 3, one would expect recall curves similar to those in Fig. 2. Comparing Figs. 3 and 4 leads to the same conclusions as those drawn from the first test sequence.

## 4. Conclusion

We propose a metric to check the robustness of different foreground models to the uncertainty of the moving foreground. This metric can explain the difference between segmentations produced by different algorithms from the perspective of foreground modeling. Our future work is to use the proposed metric to develop new foreground models that take more effective advantage of the motion and shape information of the moving object.

## Acknowledgments

The authors are grateful to the anonymous reviewers for their comments, which have helped us to improve this work. This study is supported by the China 863 High Tech. Plan (number 2007AA01Z164) and by the National Natural Science Foundation of China (numbers 60602012, 60772097, and 60675023).