## 1.

## Introduction

Foreground object detection in a complex scene is an important step in many computer vision applications, such as visual surveillance,^{1–3} intelligent behavior recognition,^{4–7} and vehicle motion tracking.^{8–12} These applications depend heavily on the results of foreground object detection, so it is always desirable to extract the foreground object(s) accurately. A common and efficient approach for extracting foreground objects from a complex background is background subtraction,^{13–16} which detects foreground objects by comparing each current frame with a background model. A difficulty with background subtraction is that complex scenes are usually dynamic: the background may change because of waving trees, falling rain, illumination changes, and other variations. To process such scenes, a robust background model is crucial.

Early background models^{17–20} had the advantage of low memory consumption and high processing speed. These approaches work well with stationary scenes, but they usually cannot handle complex scenes properly. Therefore, a number of background modeling methods have been proposed, the most common of which is the Mixture of Gaussians (MoG) model.^{21} By using more than one Gaussian distribution per pixel, MoG is able to handle complex scenes, but it consumes more memory and processing time. Several methods^{22–24} have been proposed to reduce this drawback. Instead of MoG, Kim^{13} presented a real-time object detection algorithm based on a codebook model that is efficient in terms of memory and processing time, but it does not take into account the dependence between adjacent pixels. Heikkila and Pietikainen^{25} presented a novel approach to background subtraction in which the background is modeled by texture features. It is capable of real-time processing at an image size of $160\times 120$. Wang^{3} presented a background modeling method called SACON that computes sample consensus and estimates a statistical model of the background scene. Chan and Chien^{26} used a multi-background registration technique that calculates a weight value for each pixel to update the background model. According to the weight value, the updating mechanism determines whether the pixel is replaced; thus, it consumes less memory and computation time. Chiu^{27} proposed a fast background model construction algorithm that improves on the original weighted-average method. It uses a probability-based algorithm to construct the background model and to detect the object. This approach works well with slight background changes, but it needs connected-component labeling to overcome the challenges of complex scenes.

In recent years, video compression technology and neural networks have been used to solve many problems in video surveillance. Wang^{28} proposed a background modeling approach in the frequency domain, which constructs the background model from DCT coefficients to achieve a lower processing time. However, it has difficulty handling complex scenes. Maddalena and Petrosino^{29} presented a self-organizing method that models the background by learning motion patterns and was employed to model complex scenes. This method records a weight vector for each pixel and therefore needs a large memory to store the neuronal map. Tsai^{30} proposed a Fourier spectrum background model that can adapt to illumination changes, but it is only suitable for indoor scenes and grayscale images. Ref. 31 proposed a background subtraction algorithm called ViBe that can be initialized with a single frame. This algorithm was embedded in a digital camera that has a low-speed ARM processor. Ref. 32 presents a real-time approach for background subtraction that can overcome gradual and sudden illumination changes. This approach segments each pixel by using a probabilistic mixture-based and non-parametric model.

In this paper, we propose a hybrid background modeling approach based on a stable record and multi-layer astable records. It offers effective foreground object detection in complex scenes and is applicable to background pixels that vary over time. In the detection phase, it takes into account the dependence of adjacent pixels in the astable background record by using homogeneous background subtraction; therefore, the foreground object can be extracted with a low error rate. In this way, our hybrid pixel-based background (HPB) model and detection method are resistant to erroneous detection in complex scenes.

## 2.

## Proposed HPB Model and Homogeneous Background Subtraction

The block diagram of our proposed foreground object detection system, which includes a learning phase and a detection phase, is illustrated in Fig. 1. At the beginning, the video input is switched to the learning phase. Based on an analysis of pixel variation, the steady pixels are kept in a stable record and the varying pixels are saved in the multi-layer astable records. Once these records have been used to construct a complete HPB model, the system switches the input sequence to the detection phase and starts to perform foreground object detection.

## 2.1.

### Creating a Stable Background Record

In the learning phase, stable pixel analysis is applied to build a stable background record (SBR). For a video sequence with a frame size of $n\times n$ pixels, let ${x}_{i,j}(t)$ denote the RGB pixel at location ($i,j$) at time $t$, as shown in Eq. (1). Equation (2) assesses the similarity between two pixels ${x}_{i,j}(t)$ and ${x}_{i,j}(t-1)$: they are regarded as similar if their difference does not exceed the threshold value ($\mathrm{Th}=30$). The stable time, ${\mathrm{ST}}_{i,j}(t)$, stores how long the pixel has remained similar; a large value of ${\mathrm{ST}}_{i,j}(t)$ indicates that the pixel has been unchanged for a period of time. If ${x}_{i,j}(t)$ is not similar to ${x}_{i,j}(t-1)$, the stable time is reset to 0 and counting resumes. Thus, the value of ${\mathrm{ST}}_{i,j}(t)$ indicates the stability of a pixel, and the stable background record can be built according to Eqs. (2) and (3),

$${\mathrm{ST}}_{i,j}(t)=\begin{cases}{\mathrm{ST}}_{i,j}(t-1)+1,& \text{if }|{x}_{i,j}(t)-{x}_{i,j}(t-1)|\le \mathrm{Th}\\ 0,& \text{else}\end{cases},\tag{2}$$

$${\mathrm{SBR}}_{i,j}=\begin{cases}{x}_{i,j}(t),& \text{if }{\mathrm{ST}}_{i,j}(t)\ge \mathrm{Th}\_S\\ \varphi ,& \text{else}\end{cases}.\tag{3}$$

## 2.2.

### Constructing the Astable Background Record

Building the astable background record (ABR) is important because the stable background record alone is not sufficient to extract a foreground object precisely. The ABR consists of multi-layer two-dimensional buffers that store the complex scene. In the learning phase, ${\mathrm{ABR}}_{i,j}^{u}$ denotes the background pixel stored at location ($i,j$) of the $u$th layer buffer, and the match counter ${\mathrm{MC}}_{i,j}^{u}$ counts how many times ${x}_{i,j}(t)$ matches ${\mathrm{ABR}}_{i,j}^{u}$. The multi-layer astable background record is gradually established frame by frame.

Step 1: Initialization at $t=1$;

Step 2: Major learning for the pixel of frames from $t=2$ to $N$;

for $i,j=1,2,\dots ,n$

Step 2.1: Find the layers ${\mathrm{ABR}}_{i,j}^{u}$ that match ${x}_{i,j}(t)$, while ${\mathrm{ABR}}_{i,j}^{u}\ne \mathrm{\Phi}$:

if $|{x}_{i,j}(t)-{\mathrm{ABR}}_{i,j}^{u}|\le \mathrm{Th}$

then ${\mathrm{MC}}_{i,j}^{u}={\mathrm{MC}}_{i,j}^{u}+1$;

end

Step 2.2: If there is no match, save the input pixel into a new layer ${\mathrm{ABR}}_{i,j}^{u}$:

${\mathrm{ABR}}_{i,j}^{u}={x}_{i,j}(t)$

end

Step 3: Release ${\mathrm{ABR}}_{i,j}^{u}$ if it meets criterion (a) or (b):

(a) ${\mathrm{MC}}_{i,j}^{u}<\mathrm{Th}\_f$

(b) $|{\mathrm{ABR}}_{i,j}^{u}-{\mathrm{SBR}}_{i,j}|<\mathrm{Th}$
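The steps above can be sketched for a single pixel location as follows. This is only an illustrative sketch, not the authors' implementation: the helper names `distance` and `learn_abr` are hypothetical, the pixel difference is assumed to be the largest absolute difference over the R, G, and B channels (the text does not define it), every matching layer has its counter incremented, and a new layer is assumed to start with a match count of 0.

```python
TH = 30      # similarity threshold Th given in the text
TH_F = 15    # match-count threshold Th_f given in the text


def distance(a, b):
    """Assumed pixel difference: largest absolute per-channel difference."""
    return max(abs(p - q) for p, q in zip(a, b))


def learn_abr(pixels, sbr=None, th=TH, th_f=TH_F):
    """Learn the ABR layers of ONE pixel location.

    pixels: RGB triples x(1), ..., x(N) observed at this location.
    sbr:    stable background value at this location, or None if empty.
    Returns a list of (layer_value, match_count) pairs kept after Step 3.
    """
    layers = [tuple(pixels[0])]        # Step 1: first frame fills layer 1
    counts = [0]                       # assumption: counters start at 0
    for x in pixels[1:]:               # Step 2: frames 2..N
        matched = False
        for u, layer in enumerate(layers):      # Step 2.1: look for matches
            if distance(x, layer) <= th:
                counts[u] += 1
                matched = True
        if not matched:                          # Step 2.2: open a new layer
            layers.append(tuple(x))
            counts.append(0)
    kept = []
    for layer, mc in zip(layers, counts):        # Step 3: release layers
        rarely_matched = mc < th_f                                    # criterion (a)
        same_as_sbr = sbr is not None and distance(layer, sbr) < th   # criterion (b)
        if not (rarely_matched or same_as_sbr):
            kept.append((layer, mc))
    return kept
```

For a pixel that alternates between two colors and shows one outlier, the sketch keeps two layers; if one of the colors is already held in the stable record, only the other layer survives.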

In the first frame, the input pixel ${x}_{i,j}(1)$ is stored in the first layer, ${\mathrm{ABR}}_{i,j}^{1}$. For the second frame, the pixel ${x}_{i,j}(2)$ is compared with the corresponding ${\mathrm{ABR}}_{i,j}^{1}$. If they are similar, ${\mathrm{MC}}_{i,j}^{1}$ is increased. If there is no match, ${x}_{i,j}(2)$ is stored in the second layer, ${\mathrm{ABR}}_{i,j}^{2}$, and so on. At the end of the learning phase, the useless ${\mathrm{ABR}}_{i,j}^{u}$ entries that meet criterion (a) or (b) are deleted to reduce the memory requirements. When ${\mathrm{MC}}_{i,j}^{u}$ is less than $\mathrm{Th}\_f$ ($\mathrm{Th}\_f=15$), the corresponding ${\mathrm{ABR}}_{i,j}^{u}$ could be a temporarily appearing foreground element or noise. If ${\mathrm{ABR}}_{i,j}^{u}$ is similar to ${\mathrm{SBR}}_{i,j}$, it is redundant and should be deleted to save memory space.

Figure 2 shows an example of a stable background record (SBR) and a 3-layer ABR. Figure 2(a) illustrates the stable background record after 300 learning frames. Figure 2(b) shows ${\mathrm{ABR}}^{1}$, in which shaking leaves, falling rain, and water were recorded correctly as astable background. Figures 2(c) and 2(d) show the second and third layers of the astable background record, respectively, which capture further shaking leaves and falling rain.
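The stable record in Fig. 2(a) results from the stable-time rule of Sec. 2.1. A minimal per-pixel sketch of that rule is given here under stated assumptions: the names `distance` and `learn_sbr` are hypothetical, the pixel difference is assumed to be the largest absolute per-channel difference, the value `TH_S = 50` is a placeholder because the text does not state $\mathrm{Th}\_S$, and the sketch keeps the most recent pixel value that satisfied the stability condition rather than clearing the record when stability is later lost.

```python
TH = 30      # similarity threshold Th given in the text
TH_S = 50    # hypothetical placeholder: the text does not state Th_S


def distance(a, b):
    """Assumed pixel difference: largest absolute per-channel difference."""
    return max(abs(p - q) for p, q in zip(a, b))


def learn_sbr(pixels, th=TH, th_s=TH_S):
    """Return the stable background value of ONE pixel location, or None.

    pixels: RGB triples x(1), ..., x(N) observed at this location.
    """
    st = 0        # stable time ST(t)
    sbr = None    # None plays the role of the empty record
    for prev, cur in zip(pixels, pixels[1:]):
        # Stable-time counter: grow while consecutive pixels stay similar,
        # otherwise reset to 0 and resume counting.
        st = st + 1 if distance(cur, prev) <= th else 0
        # Record the pixel once it has been stable for at least th_s frames.
        if st >= th_s:
            sbr = tuple(cur)
    return sbr
```

A pixel that never changes yields a stable record, whereas a pixel that flips between two distant colors on every frame never accumulates stable time and returns `None`; such pixels are the ones left to the ABR layers.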

## 2.3.

### Foreground Object Detection with Homogeneous Background Subtraction

After the HPB model has been constructed, foreground objects can be obtained by background subtraction. However, Figs. 2(b)–2(d) show that the astable background records are composed of homogeneous blob movements in the shaky areas. To reduce detection errors and save recording memory, this homogeneity within an area has to be taken into account when performing the background subtraction. Thus, the input ${x}_{i,j}(t)$ is compared to the neighbors ${\mathrm{ABR}}_{i\pm s,j\pm p}^{u}$ in an area ($s,p=1,2,\dots ,r$). The neighboring area can be $r\times r$ pixels, centered at ${x}_{i,j}(t)$. A foreground object (FO) can be detected by homogeneous background subtraction, as in Eq. (4):

$${\mathrm{FO}}_{i,j}(t)=\begin{cases}0,& \text{if }|{x}_{i,j}(t)-{\mathrm{SBR}}_{i,j}|\le \mathrm{Th}\\ 0,& \text{if }|{x}_{i,j}(t)-{\mathrm{ABR}}_{i\pm s,j\pm p}^{u}|\le \mathrm{Th}\\ 1,& \text{else}\end{cases}.\tag{4}$$

## 3.

## Background Updating

In the detection phase, we must update the HPB model over time to prevent detection errors caused by outdated background information. Since the SBR and ABR are established in different ways, they are also updated in different ways. The SBR is composed of stable pixels, so it only needs its background information to be updated. The ABRs, on the other hand, need a more complex replacement mechanism.
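Because the update rules act on the result of the detection step, it may help to see both side by side for one pixel. The sketch below is illustrative only and rests on several assumptions: `distance`, `is_foreground`, and `update_sbr` are hypothetical names, the pixel difference is taken as the largest absolute per-channel difference, the neighborhood of Eq. (4) is taken as all offsets from $-r$ to $+r$ around the pixel (including the pixel itself), and `ALPHA = 0.05` is a placeholder because the text does not give $\alpha$.

```python
TH = 30         # similarity threshold Th given in the text
ALPHA = 0.05    # hypothetical placeholder: the text does not give alpha


def distance(a, b):
    """Assumed pixel difference: largest absolute per-channel difference."""
    return max(abs(p - q) for p, q in zip(a, b))


def is_foreground(x, i, j, sbr, abr_layers, r=1, th=TH):
    """Decision of Eq. (4) for pixel x at (i, j): 0 = background, 1 = foreground.

    sbr:        2-D grid of RGB triples (None where the stable record is empty).
    abr_layers: list of 2-D grids, one per ABR layer, with None for empty cells.
    """
    if sbr[i][j] is not None and distance(x, sbr[i][j]) <= th:
        return 0                                 # matches the stable record
    rows, cols = len(sbr), len(sbr[0])
    for layer in abr_layers:
        for di in range(-r, r + 1):              # assumed window: offsets -r..r
            for dj in range(-r, r + 1):
                ii, jj = i + di, j + dj
                if 0 <= ii < rows and 0 <= jj < cols:
                    ref = layer[ii][jj]
                    if ref is not None and distance(x, ref) <= th:
                        return 0                 # matches a neighboring ABR entry
    return 1


def update_sbr(x, sbr_value, alpha=ALPHA, th=TH):
    """Running-average update of Eq. (5); the record is unchanged on a mismatch."""
    if sbr_value is not None and distance(x, sbr_value) < th:
        return tuple(alpha * p + (1 - alpha) * q for p, q in zip(x, sbr_value))
    return sbr_value
```

Searching the neighboring ABR cells lets a swaying leaf that was learned one pixel away still be classified as background, which is the homogeneity argument made in Sec. 2.3.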

As in Eq. (5), when ${\mathrm{SBR}}_{i,j}$ and ${x}_{i,j}(t)$ match, we use the running average to update the corresponding ${\mathrm{SBR}}_{i,j}$,

$${\mathrm{SBR}}_{i,j}=\alpha \times {x}_{i,j}(t)+(1-\alpha )\times {\mathrm{SBR}}_{i,j},\qquad \text{if }|{\mathrm{SBR}}_{i,j}-{x}_{i,j}(t)|<\mathrm{Th}.\tag{5}$$

Similarly, as in Eq. (6), when ${x}_{i,j}(t)$ and ${\mathrm{ABR}}_{i,j}^{u}$ match, we also use the running average to update the ABR. However, when ${x}_{i,j}(t)$ and ${\mathrm{ABR}}_{i,j}^{u}$ do not match, we use an exponential-distribution probability density function to determine whether ${\mathrm{ABR}}_{i,j}^{u}$ should be replaced; ${\mathrm{pr}}_{i,j}^{u}$ represents the probability used to decide whether the replacement should occur. The ABR update and replacement procedure is as follows. Compare the input pixel ${x}_{i,j}(t)$ with ${\mathrm{ABR}}_{i,j}^{u}$ for all $i$, $j$, $u$.

Step 1: If ${x}_{i,j}(t)$ matches ${\mathrm{ABR}}_{i,j}^{u}$ ($|{\mathrm{ABR}}_{i,j}^{u}-{x}_{i,j}(t)|<\mathrm{Th}$).

Step 2: If ${x}_{i,j}(t)$ does not match ${\mathrm{ABR}}_{i,j}^{u}$.

where $\mathrm{Th}\_p$ is a threshold for the probability. The probability value can then be obtained as

$${\mathrm{pr}}_{i,j}^{u}(t)=\frac{{\mathrm{MC}}_{i,j}^{u}}{N}\times \exp\left(\frac{-{\mathrm{MC}}_{i,j}^{u}}{N}\times {\mathrm{TI}}_{i,j}^{u}(t)\right),\qquad u=1,2,\dots ,m,\tag{7}$$

$${\mathrm{TI}}_{i,j}^{u}(t)=\begin{cases}{\mathrm{TI}}_{i,j}^{u}(t-1)+1,& \text{if }|{x}_{i,j}(t)-{\mathrm{ABR}}_{i,j}^{u}|\\ 0,& \text{else}\end{cases}\qquad u=1,2,\dots ,m.\tag{8}$$

## 4.

## Experimental Results and Comparison

To evaluate the performance of background subtraction, three test video sequences were used in the experiments: waving trees, torrential rain and wind, and PETS’2001 Dataset 3.^{33} The performance of the proposed method was compared with that of Codebook,^{13} MoG (Wu),^{23} SACON,^{3} Chien,^{26} and ViBe.^{31} A pixel-based error rate based on ground truth is a fair and often adopted assessment method,^{27} and it was used to evaluate each method’s performance. The error rate is given in Eq. (9),

where *fp* and *fn* are the sum of all false positives and the sum of all false negatives, respectively. A smaller error rate means that the detected result is more similar to the ground truth.
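As an illustration of how such a pixel-based error rate can be computed, the sketch below counts false positives and false negatives between a detected binary mask and its ground truth. The normalization is an assumption: Eq. (9) is not reproduced in this text, so the sketch divides *fp* + *fn* by the total number of pixels, and the function name `error_rate` is hypothetical.

```python
def error_rate(detected, truth):
    """Pixel-based error rate between two binary masks (lists of rows of 0/1).

    Assumption: the rate is (fp + fn) / number of pixels; the exact form of
    Eq. (9) is not available in this text.
    """
    fp = fn = total = 0
    for detected_row, truth_row in zip(detected, truth):
        for d, t in zip(detected_row, truth_row):
            total += 1
            if d == 1 and t == 0:
                fp += 1      # background pixel reported as foreground
            elif d == 0 and t == 1:
                fn += 1      # foreground pixel that was missed
    return (fp + fn) / total


if __name__ == "__main__":
    detected = [[1, 0], [0, 0]]
    truth = [[1, 1], [0, 0]]
    print(f"{error_rate(detected, truth):.2%}")
```

In this toy case one of four pixels is a missed foreground pixel, so the rate is one quarter.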

It is important to choose a proper number of ABR layers, trading off hardware memory against the requirements of the scene. Figures 3(c)–3(g) show the foreground detection results for the waving trees video, based on our proposed HPB model with 1- to 5-layer ABRs. In Fig. 4, the error rates of the 3-layer, 4-layer, and 5-layer ABRs are all small; however, the 3-layer ABR uses much less memory.

In Figs. 5–7, we demonstrate that the proposed approach exhibits better foreground detection than the other methods on the three benchmarks. Furthermore, Figs. 8–10 demonstrate that the proposed method yields a lower error rate in the ground truth comparison. The average error rates of the six methods for the various sequences are listed in Table 1, which shows that the proposed approach has a lower average error rate than the other methods.

## Table 1

Average error rate in various benchmarks.

| Video sequence | Codebook | MoG (Wu) | SACON | Chien | ViBe (N=20) | Proposed |
|---|---|---|---|---|---|---|
| Waving tree (160×120) | 4.11% | 2.42% | 2.93% | 7.75% | 4.41% | 1.05% |
| Raining (320×240) | 2.49% | 1.56% | 1.31% | 3.19% | 2.14% | 0.81% |
| PETS 2001 (768×576) | 0.93% | 1.73% | 0.92% | 1.37% | 0.78% | 0.75% |

As shown in Fig. 11, we use the TI TMS320DM6446 DaVinci development kit as our development platform, a dual-core device that includes an ARM926EJ-S and a C64x+ DSP. The resources of an embedded platform are limited, so the implementation has to consider memory consumption. Table 2 lists the measured memory utilization of all six methods for the different video sequences. It shows that the memory requirement of our proposed method is lower than that of the other approaches; thus, our approach can achieve real-time operation at 23 frames per second for the waving trees video. The proposed method is suitable for implementation on an embedded platform.

## Table 2

Memory comparison of background models.

| Video sequence | Codebook | MoG (Wu) | SACON | Chien | ViBe (N=20) | Proposed |
|---|---|---|---|---|---|---|
| Waving tree (160×120) | 1638 KB | 1920 KB | 1459 KB | 1094 KB | 1152 KB | 1056 KB |
| Raining (320×240) | 5330 KB | 7680 KB | 5836 KB | 4377 KB | 4608 KB | 4224 KB |
| PETS 2001 (768×576) | 26.5 MB | 44.2 MB | 33.6 MB | 25.2 MB | 26.5 MB | 24.3 MB |
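To make the figures in Table 2 easier to compare across resolutions, they can be normalized to bytes per pixel. The short calculation below does this for the two sequences reported in KB. It assumes decimal kilobytes (1 KB = 1000 bytes), which is an interpretation of the table rather than something the text states, and the name `bytes_per_pixel` is hypothetical.

```python
# Table 2 entries in KB for the 160x120 and 320x240 sequences, in that order.
TABLE_2_KB = {
    "Codebook": (1638, 5330),
    "MoG (Wu)": (1920, 7680),
    "SACON": (1459, 5836),
    "Chien": (1094, 4377),
    "ViBe (N=20)": (1152, 4608),
    "Proposed": (1056, 4224),
}
FRAME_PIXELS = (160 * 120, 320 * 240)
BYTES_PER_KB = 1000    # assumption: decimal kilobytes


def bytes_per_pixel(table=TABLE_2_KB, pixels=FRAME_PIXELS):
    """Memory per pixel for each method and sequence."""
    return {
        name: [kb * BYTES_PER_KB / n for kb, n in zip(sizes, pixels)]
        for name, sizes in table.items()
    }


if __name__ == "__main__":
    for name, values in bytes_per_pixel().items():
        print(name, [round(v, 1) for v in values])
```

Under that assumption the proposed model works out to 55 bytes per pixel for both sequences, compared with 60 for ViBe (N=20) and 100 for MoG (Wu), while the Codebook figure varies with the scene.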

## 5.

## Conclusions

An efficient and precise foreground object detection method was proposed in this paper. The proposed method applies a stable background record and multi-layer astable background records to construct an accurate background model. Using more layers records more background information and improves the precision, but it also requires more memory and more computation. Thus, it is important to choose a proper number of background layers, trading off the memory load against the number of dynamic background layers required by the scene. To save memory and calculation time, a 3-layer dynamic background model was used in our approach. According to our experimental results, the error rates of the 3-layer, 4-layer, and 5-layer HPB models are similar in many video benchmarks. The results demonstrate that the proposed method has a lower error rate with respect to the ground truth than the five other models tested. Furthermore, the proposed approach detects objects more precisely than the other methods for the various sequences. The final verification was done on a 2.66 GHz CPU at a video resolution of $768\times 576$, with an execution speed of 21 frames/second in a complex background scene. In addition, the proposed method can achieve real-time operation for complex scenes on a DaVinci embedded platform.

## Acknowledgments

We would like to thank Oliver Barnich and M. Van Droogenbroeck, who provided the C-like source code for their algorithm.

## References

^{4}: real-time surveillance of people and their activities,” IEEE Trans. Pattern Anal. Mach. Intel. 22(8), 809–830 (2000). http://dx.doi.org/10.1109/34.868683

## Biography

**Wen-Kai Tsai** is a PhD candidate at Graduate School of Engineering Science and Technology, National Yunlin University of Science and Technology, Taiwan. He received BS and MS degrees in electronics engineering from National Yunlin University of Science and Technology, Taiwan, in 2004 and 2006, respectively. His research interests include digital signal processing and image processing.

**Chung-Chi Lin** received the MS degree in computer science from University of Houston, TX, USA, in 1983, and the PhD degree in engineering science and technology from National Yunlin University of Science & Technology, Taiwan, in 2009. Since 2009, he has been an Associate Professor with the Department of Computer Science, Tunghai University, Taiwan. His current research interests include image processing, digital signal processing, and System-on-chip design.

**Ming-Hwa Sheu** received the BS degree in electronics engineering from National Taiwan University of Science and Technology, Taipei, Taiwan, in 1986, and the MS and PhD degrees in Electrical Engineering from National Cheng Kung University, Tainan, Taiwan, in 1989 and 1993, respectively. From 1994 to 2003, he was an Associate Professor at National Yunlin University of Science and Technology, Touliu, Taiwan. Currently, he is a Professor in the Department of Electronics Engineering, National Yunlin University of Science and Technology. His research interests include CAD/VLSI, digital signal processing, algorithm analysis, and system-on-chip design.