## 1. Introduction

Motion estimation is a critical component of various video processing tasks, especially video compression, allowing redundancy reduction in the temporal domain. International standards for video communications such as MPEG-1/2 and H.261/3/4 employ the well-established hybrid two-component architecture, which relies on motion estimation and compensation as well as on the lossy compression of the motion-compensated prediction error. Motion estimation in such standards is carried out by means of block matching in the data domain with one motion vector portraying the motion of each block. Such block-based approaches offer well-documented implementation advantages such as low complexity and low overheads mainly due to their regularity. On the other hand, they have a number of well-known disadvantages. A block may contain more moving objects than just one, in which case a single motion vector derived from the most dominant object will cause large motion compensation errors in areas occupied by the other objects. Conversely, a moving object may be contained in more blocks than one. Any errors in estimating motion vectors for those blocks are likely to cause blocking artefacts. Various attempts at departing from established block-based approaches have been made, most notably in video coding systems like MPEG-4, while in H.264 a variable block size approach was preferred to shape coding. While motion estimation is not a normative element, most compliant architectures implement arbitrary shape motion estimation in the pixel domain using suitable modifications to the well-known block-matching algorithm, such as extrapolation of the arbitrary-shaped area until the limits of a bounding rectangle are reached. Such pixel-domain modifications inherit some of the disadvantages of baseline block matching, such as complexity, assuming an exhaustive search strategy for the identification of the minimum error location.
In the wider literature, object-based motion estimation is well represented, although most approaches operate in the pixel domain. An exhaustive review would be outside the scope of this paper, but it is worth mentioning Refs. 1, 2, 3, 4, to name but a few.

Recently there has been a lot of interest in motion estimation techniques operating in the frequency domain because they offer well-documented advantages in terms of computational efficiency due to the employment of fast algorithms. Perhaps the best-known method in this class is phase correlation (PC),^{5} which has become one of the motion estimation methods of choice for a wide range of nonconsumer applications.^{6} In addition to computational efficiency, PC offers key advantages in terms of its strong response to edges and salient picture features, its immunity to illumination changes and moving shadows, and its ability to measure large displacements without sacrificing subpixel accuracy.

In this letter we propose an arbitrary-shape motion estimation algorithm based on PC. While in Ref. 7 we proposed a solution using conventional extrapolation, i.e., by padding with the average (mean) intensity of the arbitrary-shaped object, in this work we present an extrapolation-free algorithm using the shape adaptive discrete Fourier transform (SA-DFT) methodology.^{8}

This letter is organized as follows. In Sec. 2 we briefly review the principles underlying subpixel motion estimation using PC. In Sec. 3 we present the arbitrary-shape phase correlation algorithm. In Sec. 4 we report experimental results while in Sec. 5 we draw conclusions arising from this work.

## 2. Motion Estimation Using Phase Correlation

Baseline PC operates on a pair of images (or cosited blocks) ${f}_{t}$ and ${f}_{t+1}$ of identical dimensions belonging to consecutive frames or fields of a moving sequence sampled at $t$ , $t+1$ . The estimation of motion relies on the detection of the maximum of the cross-correlation function between ${f}_{t}$ and ${f}_{t+1}$ . Since all functions involved are discrete, cross-correlation is circular and can be carried out as a multiplication in the frequency domain using fast implementations. The real-valued correlation surface is defined as

$${c}_{t,t+1}(k,l)={F}^{-1}\left(\frac{{F}_{t}^{*}{F}_{t+1}}{\mid {F}_{t}^{*}{F}_{t+1}\mid}\right),\tag{1}$$

and the motion estimate to integer-pixel accuracy is given by the location of its peak:

$$({k}_{m},{l}_{m})=\arg\max\left\{{c}_{t,t+1}(k,l)\right\}.\tag{2}$$

Using the notation in Eq. (2) above, prototype functions are fitted^{6} to the triplets

$$\{{c}_{t,t+1}({k}_{m}-1,{l}_{m}),\,{c}_{t,t+1}({k}_{m},{l}_{m}),\,{c}_{t,t+1}({k}_{m}+1,{l}_{m})\}$$

and

$$\{{c}_{t,t+1}({k}_{m},{l}_{m}-1),\,{c}_{t,t+1}({k}_{m},{l}_{m}),\,{c}_{t,t+1}({k}_{m},{l}_{m}+1)\}.\tag{3}$$

Fitting to the first triplet in Eq. (3) yields a closed-form solution for the horizontal component of the motion estimate $dx$ as follows:

$$dx=\frac{{c}_{t,t+1}({k}_{m}+1,{l}_{m})-{c}_{t,t+1}({k}_{m}-1,{l}_{m})}{2[2{c}_{t,t+1}({k}_{m},{l}_{m})-{c}_{t,t+1}({k}_{m}+1,{l}_{m})-{c}_{t,t+1}({k}_{m}-1,{l}_{m})]},\tag{4}$$

and the vertical component $dy$ is obtained analogously from the second triplet.
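As an illustration, the procedure of Eqs. (1)–(4) can be sketched in a few lines of NumPy. This is our own minimal sketch, not part of the original method description; the function name is ours, and a small epsilon is assumed to guard the spectrum normalization against division by zero:

```python
import numpy as np

def phase_correlate(f_t, f_t1):
    """Estimate the subpixel (dy, dx) shift of f_t1 relative to f_t."""
    F_t, F_t1 = np.fft.fft2(f_t), np.fft.fft2(f_t1)
    cross = np.conj(F_t) * F_t1
    # Eq. (1): inverse transform of the normalized cross-power spectrum.
    c = np.real(np.fft.ifft2(cross / (np.abs(cross) + 1e-12)))
    # Eq. (2): integer-accuracy peak location on the correlation surface.
    km, lm = np.unravel_index(np.argmax(c), c.shape)
    N, M = c.shape

    def fit(center, plus, minus):
        # Eq. (4): closed-form quadratic fit to a triplet of surface samples.
        denom = 2.0 * (2.0 * center - plus - minus)
        return (plus - minus) / denom if abs(denom) > 1e-12 else 0.0

    dx = fit(c[km, lm], c[km, (lm + 1) % M], c[km, (lm - 1) % M])
    dy = fit(c[km, lm], c[(km + 1) % N, lm], c[(km - 1) % N, lm])
    # Map circular peak coordinates to signed displacements.
    ky = km - N if km > N // 2 else km
    kx = lm - M if lm > M // 2 else lm
    return ky + dy, kx + dx
```

For a pure circular shift the correlation surface is an impulse, so the quadratic correction is near zero and the integer peak location dominates.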

## 3. Arbitrary Shape Phase Correlation

The proposed arbitrary shape phase correlation (ASPC) scheme assumes that a segmentation mask is available (e.g., obtained using a segmentation algorithm) for an object of interest in the target (i.e., next) frame. The motion of this object needs to be estimated relative to a reference (i.e., current) frame. The ASPC algorithm consists of two main steps. First, the SA-DFT is applied to the pixel values inside the object in both the target and reference frames. Then, PC is applied using Eqs. (1)–(4) in order to obtain motion estimates of subpixel accuracy.

### 3.1. Shape Adaptive Discrete Fourier Transform

The SA-DFT is based on the notion that the DFT of a vector can be interpreted as the transform of a periodic signal, one period of which is processed. The signal does not need to be shifted; it suffices to extend it periodically and to compute the DFT starting from the beginning of a row/column.

We consider a one-dimensional vector of data samples $x\left(n\right)$ of size $N$ with $n=0,1,\dots ,N-1$. We assume that ${N}_{S}$ samples belong to the object of interest, with $n=m$, $m+1,\dots ,{N}_{S}+m-1$, while the remaining samples are considered to belong to another object or the background. The application of the SA-DFT consists of the following steps:

• Periodic extension of the ${N}_{S}$ object samples to span the entire range of values $n=0,1,\dots ,N-1$, i.e., set ${x}^{\prime}(n+k{N}_{S})=x\left(n\right)$ for $k=\dots ,-2,-1,0,1,2,\dots$ and $0\le n+k{N}_{S}\le N-1$.

• Computation of the ${N}_{S}$ -point DFT.

• Scaling of the results. In this case the scaling factor is $1/\sqrt{{N}_{S}}$, which makes the 2-D transform orthogonal.^{9}
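The three steps above can be sketched for a single row as follows. This is our own illustration, not from the original letter; the object samples are assumed to form one contiguous run of $N_S$ samples starting at index $m$:

```python
import numpy as np

def sa_dft_row(x, m, Ns):
    """SA-DFT of one row whose object samples occupy x[m : m + Ns]."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    obj = x[m:m + Ns]
    n = np.arange(N)
    # Step 1: periodic extension x'(n) = obj[(n - m) mod Ns] over n = 0..N-1,
    # keeping the object aligned with its original position m.
    x_ext = obj[(n - m) % Ns]
    # Step 2: Ns-point DFT of one period taken from the start of the row,
    # so the object's position is encoded as a phase shift.
    # Step 3: scale by 1/sqrt(Ns).
    X = np.fft.fft(x_ext[:Ns]) / np.sqrt(Ns)
    return X, x_ext
```

Because the one period taken from $n=0$ is a circular shift of the object samples, the SA-DFT magnitudes match those of the object's own DFT, while the phase carries the positional information.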

Given an arbitrary-shaped region in frame $t+1$ (i.e., the next frame), motion is estimated between the image data ${f}_{t+1}$ inside this region and the image data ${f}_{t}$ inside a cosited region in frame $t$ (i.e., the current frame). If ${F}_{t}$ and ${F}_{t+1}$ denote respectively the SA-DFT of ${f}_{t}$ and ${f}_{t+1}$, the final step is to apply the phase correlation method using Eqs. (1)–(4).
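To illustrate how the two steps fit together, the following sketch (ours, not from the letter) applies the periodic-extension idea and Eqs. (1) and (2) in the simplified special case where the object mask happens to be a rectangular region, cosited in both frames; the subpixel refinement of Eqs. (3) and (4) is omitted for brevity:

```python
import numpy as np

def aspc_shift(f_t, f_t1, box):
    """ASPC sketch for a rectangular mask box = (top, left, height, width),
    assumed cosited in both frames."""
    top, left, h, w = box

    def sa_dft2(f):
        region = f[top:top + h, left:left + w]
        # One period of the periodic extension taken from n = 0, applied
        # separably to rows and columns: a circular shift of the region
        # that encodes its position (top, left) as a phase term.
        region = np.roll(region, (top % h, left % w), axis=(0, 1))
        return np.fft.fft2(region) / np.sqrt(h * w)

    F_t, F_t1 = sa_dft2(f_t), sa_dft2(f_t1)
    cross = np.conj(F_t) * F_t1
    # Eqs. (1) and (2): normalized cross-power spectrum and peak detection.
    c = np.real(np.fft.ifft2(cross / (np.abs(cross) + 1e-12)))
    km, lm = np.unravel_index(np.argmax(c), c.shape)
    dy = km - h if km > h // 2 else km
    dx = lm - w if lm > w // 2 else lm
    return dy, dx
```

Only samples inside the mask contribute to the spectra, so a single peak corresponding to the object's motion emerges on the correlation surface.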

## 4. Experimental Results

In our experiments we used the well-known broadcast resolution ($720\times 576$ pixels, 50 fields per second) MPEG test sequences ‘Mobcal’ and ‘Basketball.’ Only the luminance component was considered and, to avoid complications due to interlacing, only even-parity field data were retained. We highlight the principles underlying our scheme using an artificial example. We manually segment the ‘ball’ object from a single frame of ‘Mobcal’ and superimpose it on the ‘calendar’ object from the same sequence without low-pass filtering object boundaries. We then create an artificial sequence by displacing the two objects relative to each other by progressively varying shifts of subpixel accuracy. These shifts are then used as ground truth. We consider two cases (shown in Fig. 1) corresponding respectively to accurate and inaccurate segmentation of the ‘ball’ object, to reflect the real possibility of a segmentation algorithm providing variable quality results. In Fig. 2 we show a magnified portion of the correlation surface for the proposed ASPC scheme. While for conventional phase correlation two peaks would have emerged (each corresponding to the motion of one of the two objects ‘ball’ and ‘calendar’), for ASPC only a single peak emerges, corresponding to the motion of the ‘ball’ object. This becomes a critical consideration when the motion parameters of two (or more) objects are similar (but not identical), in which case the proximity of the corresponding peaks on the correlation surface may prevent the accurate extraction of the dominant motion parameters.

Next, we compare the proposed scheme with conventional PC and the shape adaptive phase correlation (SAPC) algorithm presented in Ref. 7, which is based on padding with the average intensity of the object of interest. In Table 1 we show the average MSE relative to the ground truth subpixel displacement parameters (computed over the total number of frames) obtained for both accurate and inaccurate segmentation scenarios of object ‘calendar.’ Overall, these results demonstrate that the proposed scheme achieves consistently higher accuracy than both alternative schemes.

## Table 1

Average ground truth MSE performance comparison for artificial motion.

| Artificial Motion | ‘calendar’, accurate mask | ‘calendar’, inaccurate mask |
|---|---|---|
| Phase Correlation | 0.1363671 | 0.1363671 |
| SAPC^{7} | 0.1316538 | 0.1302917 |
| ASPC (proposed) | **0.1277144** | **0.1268891** |

Next we turn our attention to the estimation of real motion. We identify arbitrary-shaped objects of interest, such as the ‘train’ object in ‘Mobcal’ and the ‘player’ object in ‘Basketball.’ The object masks are determined manually and are shown in Fig. 3. We have deliberately allowed the masks to be inaccurate, i.e., not following closely the outline of the object under consideration or even omitting part(s) of it. This reflects the fact that, in a practical situation, object definition obtained from automatic segmentation will occasionally be inaccurate, whereas this may not be true for manual segmentation. Performance is compared to conventional PC using a rectangular block (of the same size in pixels as the arbitrary-shaped object), which includes a large section of the object. Performance is assessed using both the motion-compensated prediction MSE (average values computed over the total number of frames are shown in Table 2) and the peak signal-to-noise ratio (PSNR) as a function of frame number (shown in Fig. 4) for the two sequences under consideration. It is worth noting that Table 2 also includes a comparison with spatial-domain block matching, which further underlines the potential of our method. Our results confirm the superiority of the proposed scheme, by as much as 4 dB compared to conventional phase correlation.
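For reference, the two figures of merit used above can be computed as follows (a routine sketch of standard definitions, with a peak value of 255 assumed for 8-bit luminance data):

```python
import numpy as np

def mse(ref, pred):
    """Motion-compensated prediction mean squared error."""
    d = ref.astype(np.float64) - pred.astype(np.float64)
    return float(np.mean(d ** 2))

def psnr(ref, pred, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    e = mse(ref, pred)
    return float("inf") if e == 0 else 10.0 * np.log10(peak ** 2 / e)
```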

## Table 2

Average motion-compensated prediction MSE performance comparison for real motion.

| Real Motion | ‘train’ | ‘player’ |
|---|---|---|
| Block Matching | 281.949546 | 313.995271 |
| Phase Correlation | 312.856572 | 266.257170 |
| SAPC^{7} | 181.317667 | 202.621286 |
| ASPC (proposed) | **153.140966** | **201.528410** |

## 5. Conclusions

In this letter an arbitrary shape motion estimation algorithm based on phase correlation was presented. Because the scheme operates in the frequency domain, it enjoys a high degree of computational efficiency and can be implemented using fast algorithms such as the FFT. Our approach avoids extrapolation and uses information only from the moving objects of interest, thereby yielding higher-accuracy motion estimates. Our results have shown that the proposed method outperforms conventional phase correlation as well as shape adaptive phase correlation using extrapolation, both for artificially induced motion using manually extracted objects and for actual interframe motion.

## References

*BBC Res. Dept. Rep.* (1987).

*NORSIG-99* (Sept. 1999).