## 1. Introduction

Stereo matching algorithms find corresponding points in a pair of images to locate 3-D positions. They can be classified into local and global matching approaches.^{1} Local approaches, like correlation and dynamic programming methods, deal only with subimages. These approaches have the advantage of real-time speed,^{2, 3} but tend to produce high errors. In contrast, global approaches, like graph cuts and belief propagation (BP),^{4} deal with full images. These approaches have the advantage of low errors, but tend to incur huge computational loads. In real-time applications, like robot vision, the stereo matching system should be compact and fast. In this context, we present a very large scale integration (VLSI) architecture for stereo matching with BP.

## 2. Belief Propagation Formulation for Stereo Matching

Given the left and right images $g^r$, $g^l$, and the parameters $c_d$, $c_v$, $K_d$, and $K_v$, we describe the energy model for a 2-D Markov random field (MRF) as follows.^{4}

$$E(d)=\sum_{\mathbf{p}} c_d \min\!\big(|g^l(\mathbf{p})-g^r(\mathbf{p}-[0\;d_{\mathbf{p}}]^T)|,\,K_d\big)+\sum_{(\mathbf{p},\mathbf{q})} c_v \min\!\big(|d_{\mathbf{p}}-d_{\mathbf{q}}|,\,K_v\big), \tag{1}$$

$$m^{k+1}_{\mathbf{p}\mathbf{q}}(d_{\mathbf{q}})=\min_{d_{\mathbf{p}}}\Big[c_v\min\!\big(|d_{\mathbf{p}}-d_{\mathbf{q}}|,K_v\big)+c_d\min\!\big(|g^l(\mathbf{p})-g^r(\mathbf{p}-[0\;d_{\mathbf{p}}]^T)|,K_d\big)+\sum_{\mathbf{s}\in N(\mathbf{p})\setminus \mathbf{q}} m^{k}_{\mathbf{s}\mathbf{p}}(d_{\mathbf{p}})\Big]. \tag{2}$$
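To make the model concrete, the truncated-linear data cost and a single message update can be sketched in NumPy. This is a minimal illustration under assumptions, not the paper's fixed-point hardware: the parameter values, the `np.roll` border handling, and the function names are hypothetical.

```python
import numpy as np

# Hypothetical parameter values, for illustration only
C_D, K_D = 1.0, 20.0   # data-cost weight and truncation
C_V, K_V = 1.0, 2.0    # smoothness weight and truncation

def data_cost(g_l, g_r, D):
    """Truncated-linear data cost c_d * min(|g^l(p) - g^r(p-d)|, K_d)
    for one scan line and all D candidate disparities."""
    M = len(g_l)
    cost = np.empty((M, D))
    for d in range(D):
        shifted = np.roll(g_r, d)          # crude border handling for the sketch
        cost[:, d] = np.minimum(C_D * np.abs(g_l - shifted), K_D)
    return cost

def message_update(h):
    """One BP message: m(d_q) = min_dp [ V(d_p, d_q) + h(d_p) ], where h sums
    the data cost and incoming messages at p, and V is the truncated-linear
    smoothness term c_v * min(|d_p - d_q|, K_v)."""
    D = len(h)
    dp = np.arange(D)
    V = np.minimum(C_V * np.abs(dp[:, None] - dp[None, :]), C_V * K_V)
    m = (h[:, None] + V).min(axis=0)
    return m - m.min()                     # normalize, as with alpha in Sec. 4
```

A brute-force update like this costs $O(D^2)$ per message; Sec. 4 describes how the hardware reaches $O(D)$.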

## 3. Proposed Architecture of Stereo Matching

Generally, BP can be separated into two methods, mainly according to the update style. The first method updates all the nodes synchronously on the 2-D MRF.^{4} The second method updates all the nodes sequentially, first in the inward direction from the leaf to the root and then in the reverse direction.^{5, 6} The sequential method needs to update each node only once to obtain the final results; therefore, messages can be propagated quickly with a small number of operations. In Ref. 5, the authors reported that sequential updates based on spanning trees in the MRF can achieve fast convergence. We applied the tree structure to each scan line, as shown in Fig. 1. The tree messages are updated using messages from the neighboring scan lines that were determined in the previous iteration. For an image with $M\times N$ pixels in Fig. 1, the $N$ scan lines can be separated into $G$ groups, and the $H$ lines of each group $g\in[0,G-1]$ can be processed in parallel with $H$ processors. This observation leads to our VLSI parallel sequence as follows. A node located at a pixel is denoted by a 2-D vector $\mathbf{p}={\left[{p}_{0}\phantom{\rule{0.3em}{0ex}}{p}_{1}\right]}^{T}$.
For synchronous iteration $k$ from 1 to $K$, for group $g$ from 0 to $G-1\,(=N/H-1)$, for each parallel processor $h$ from 0 to $H-1$ $(\mathbf{p}={[gH+h\phantom{\rule{0.3em}{0ex}}{p}_{1}]}^{T})$:

1. Inward processing to root: for ${p}_{1}=0,\dots ,M-1$, message_update_function $(\mathbf{p},\mathbf{p}+{\left[0\phantom{\rule{0.3em}{0ex}}1\right]}^{T},k,k)$ for the rightward message.

2. Outward processing to leaf: for ${p}_{1}=M-1,\dots ,0$, message_update_function $(\mathbf{p},\mathbf{p}-{\left[0\phantom{\rule{0.3em}{0ex}}1\right]}^{T},k,k)$ for the leftward message; message_update_function $(\mathbf{p},\mathbf{p}+{\left[1\phantom{\rule{0.3em}{0ex}}0\right]}^{T},k,k+1)$ for the downward message; message_update_function $(\mathbf{p},\mathbf{p}-{\left[1\phantom{\rule{0.3em}{0ex}}0\right]}^{T},k,k+1)$ for the upward message; and decision_function $(\mathbf{p},K)$.
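The iteration order above can be replayed in plain Python. This sketch emulates only the schedule (the $H$ parallel processors become a sequential loop); `update` and `decide` are placeholders for the hardware functions, and calling `decide` once after all iterations is an assumption of the sketch.

```python
def run_schedule(M, N, H, K, update, decide):
    """Replay the parallel BP schedule: K synchronous iterations over
    G = N/H groups, with the H scan lines of each group handled by
    (emulated) parallel processors."""
    G = N // H
    for k in range(1, K + 1):
        for g in range(G):
            for h in range(H):                        # H processors in parallel
                row = g * H + h                       # p = [g*H + h, p1]^T
                for p1 in range(M):                   # inward pass, leaf -> root
                    update((row, p1), (row, p1 + 1), k, k, 'right')
                for p1 in range(M - 1, -1, -1):       # outward pass, root -> leaf
                    update((row, p1), (row, p1 - 1), k, k, 'left')
                    update((row, p1), (row + 1, p1), k, k + 1, 'down')
                    update((row, p1), (row - 1, p1), k, k + 1, 'up')
    for row in range(N):                              # final disparity decision
        for p1 in range(M):
            decide((row, p1), K)
```

Note that the up/down messages are written for iteration $k+1$, which is how a scan line reuses its neighbors' messages from the previous iteration.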

As shown in Fig. 2, the $H$ processors calculate the messages in parallel, receiving the left and right pixel data from the $H$ scan line buffers and reading from and writing to each message buffer. Each processor consists of the processing elements (PEs) ${\mathit{PE}}^{f}$, ${\mathit{PE}}^{b}$, ${\mathit{PE}}^{u}$, ${\mathit{PE}}^{d}$, and ${\mathit{PE}}^{o}$. Using the image data and the neighboring messages, ${\mathit{PE}}^{f}$ calculates the forward message during the inward pass, and ${\mathit{PE}}^{b}$, ${\mathit{PE}}^{u}$, ${\mathit{PE}}^{d}$, and ${\mathit{PE}}^{o}$ calculate each direction’s message and the disparity $\widehat{d}$ during the outward pass.

## 4. Architecture of Processing Element

The PE is the basic logic that calculates the new message at each node as follows. As in Ref. 4, the time complexity is $O\left(5D\right)$ for $D$ disparity levels. Due to our pipeline structure, $2D$ clocks are necessary for calculating the message ${m}_{\mathbf{p}\mathbf{q}}^{k+1}\left({d}_{\mathbf{p}}\right)$. In the forward initialization, ${D}_{1}(-1)=B$ and ${D}_{2}(-1)=B$ ($B$ is a sufficiently large constant).

For clock $t$ from 0 to $D-1$, the forward PE recursively computes the message ${m}_{f}(t)$ and the running minima ${D}_{1}(t)$ and ${D}_{2}(t)$ [Eq. (3)]. For clock $t$ from 0 to $D-1$, the backward PE computes the outgoing message [Eq. (4)]. In Eq. (3), the parameters for the backward pass are also calculated. In our system, the minimum cost of ${m}_{\text{sum}}\left(t\right)$ is used as the normalization parameter $\alpha$.

The backward processor reads ${m}_{f}(D-1-t)$ from the stack, calculates the minimum value ${D}_{3}\left(t\right)$ recursively, outputs the minimum of ${D}_{3}\left(t\right)$ and the parameter ${m}_{f}(-1)$, and then subtracts $\alpha$ from it for normalization.
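One $O(D)$ realization of such a forward/backward PE, in the spirit of the truncated-linear distance transform of Ref. 4, can be sketched as follows. The roles of $D_1$, $D_3$, the stack, and the normalization mirror the text, but the exact recurrences of the paper's Eqs. (3) and (4) are not reproduced here, so this is an assumed software model, not the actual circuit.

```python
def pe_message(m_sum, c_v, K_v):
    """O(D) sketch of the PE pipeline: computes
    m(t) = min_s [ m_sum(s) + min(c_v*|s-t|, c_v*K_v) ], normalized by alpha."""
    D = len(m_sum)
    B = float('inf')                        # "as big as possible"
    m_f, d1, stack = B, B, []
    for t in range(D):                      # forward PE, clocks 0 .. D-1
        m_f = min(m_sum[t], m_f + c_v)      # slope-limited running minimum
        d1 = min(d1, m_sum[t])              # global minimum, for truncation
        stack.append(m_f)
    trunc = d1 + c_v * K_v                  # truncation level (role of m_f(-1))
    out, d3 = [0.0] * D, B
    for t in range(D - 1, -1, -1):          # backward PE pops the stack
        d3 = min(stack.pop(), d3 + c_v)     # right-to-left running minimum
        out[t] = min(d3, trunc)
    alpha = min(out)                        # normalization parameter
    return [v - alpha for v in out]
```

The stack naturally reverses the order, so the backward pass reads $m_f(D-1-t)$ at clock $t$, matching the dataflow described above.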

## 5. Experimental Results

As shown in Fig. 3, we tested our system using four grayscale images and ground truth data from the Middlebury database.^{1} The error rate represents the percentage of pixels whose disparity error exceeds 1 between the output $d(x,y)$ and the ground truth ${d}_{\text{TRUE}}(x,y)$:

$$\text{error rate}=\frac{1}{N_{\text{pixel}}}\sum_{(x,y)}\big[\,|d(x,y)-{d}_{\text{TRUE}}(x,y)|>1\,\big]\times 100\,\%,$$

where $[\cdot]$ is 1 when the condition holds and 0 otherwise, and $N_{\text{pixel}}$ is the number of pixels.
We compared our results with those of the local methods^{2, 3} and BP in the Tsukuba image. Based on Fig. 4, our disparity error converges rapidly, within about 12 iterations.
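The error measure above can be written directly (a short NumPy sketch; the threshold of 1 disparity level follows the text, the function name is ours):

```python
import numpy as np

def error_rate(d, d_true):
    """Percentage of pixels whose absolute disparity error exceeds 1."""
    bad = np.abs(np.asarray(d, float) - np.asarray(d_true, float)) > 1.0
    return 100.0 * bad.mean()
```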

## Table 1

Error rate comparison in several images.

| Methods (iteration) | Tsukuba 384×288 | Venus 434×383 | Map 284×216 | Sawtooth 434×380 | Speed performance |
|---|---|---|---|---|---|
| Local case [i] | 4.3% | 1.5% | 0.8% | 1.3% | No real-time hardware |
| Local case [ii] | 4.3% | 2.2% | 0.8% | 2.1% | SIMD in Pentium 4, $320\times 240$, 7 fr/s |
| BP | 4.8% | 8.9% | 4.2% | 1.4% | No real-time hardware |
| Our chip (12) | 2.6% | 0.8% | 0.2% | 0.8% | FPGA, $256\times 240$, 25 fr/s |

Given an $M\times N$ image, $D$ disparity levels, and $T$ iterations, we need only $2D$ forward and backward processing clocks in the PE, and $2M$ inward and outward processing steps at each scan line. $G\,(=N/H)$ groups are iterated $T$ times by $H$ processors in parallel at $F$ frame rates. From Fig. 5, the total number of clocks needed to process $F$ frames in one second is $49\,\mathrm{M}$ clocks $[=2D\left(2M\right)(N/H)TF=(16\times 2)\times (256\times 2)\times (240/24)\times 12\times 25]$. Our system’s 65-MHz clock was sufficient for real-time processing.

As shown in Fig. 2, the overall memory is devoted to the scan line buffer and the message buffer. For real-time processing, our processors should be able to access a pair of images in the scan line buffer while the buffer continuously loads new images from the cameras. Hence, a buffer for four images is allocated on the field programmable gate array (FPGA), which amounts to $2\,\mathrm{Mbits}\,[=4\times (256\times 240\times 8)]$. Given $C\,(=4)$ bits as the size of the message at each disparity level and $H\,(=24)$ processors, the message buffer sizes are as follows: leftward message, $1.5\,\mathrm{kbits}=HCD$; rightward message, $0.4\,\mathrm{Mbits}=HCDM$; upward and downward messages, $8\,\mathrm{Mbits}=2HCDMG$.

In outward processing, the newly calculated leftward message is used only for the next pixel’s processing. One scan line of rightward messages computed in the inward pass must be stored for outward processing. Since the upward and downward messages are updated synchronously for the entire image, we need to store the messages of all pixels in the image. Due to this large memory size, 22 processors access the external memory through an $8\times 22$-bit data bus; the other two processors use the internal block RAMs on the FPGA. The overall memory resource usage is described in Fig. 5.
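The clock and memory budgets above follow directly from the quoted parameters; a quick numeric check in Python (only arithmetic on values stated in the text):

```python
# Parameter values quoted in the text
M, N, D, T, F = 256, 240, 16, 12, 25   # image width/height, disparities, iterations, frame rate
H, C = 24, 4                           # parallel processors; bits per message entry
G = N // H                             # 10 groups of scan lines

# Clock budget: 2D clocks per message, 2M steps per line, G groups, T iterations, F frames
clocks_per_second = 2 * D * (2 * M) * G * T * F   # ~49M clocks, within a 65-MHz budget

# Memory budget in bits
scan_line_buffer = 4 * (M * N * 8)     # four 8-bit images, ~2 Mbits
leftward = H * C * D                   # 1.5 kbits
rightward = H * C * D * M              # ~0.4 Mbits
up_down = 2 * H * C * D * M * G        # ~8 Mbits
```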

## 6. Conclusions

Although BP achieves good error performance in the area of image processing, its VLSI architecture has not been fully studied yet. In this context, we propose a parallel VLSI architecture for stereo matching. The test system supports only 16 disparity levels, which might not be satisfactory for 3-D recognition tasks. However, for applications like real-time Z-keying and target-background separation, where low disparity error at object boundaries is important, the proposed chip can be used effectively.