Key-frame extraction based on motion acceleration
Yanzhuo Ma, Yiling Chang, and Hui Yuan
Optical Engineering, 1 September 2008
Abstract
Building on the argument that a change of motion state attracts more attention than the motion itself, this letter develops a novel method for key-frame extraction based on motion acceleration vectors. Unlike traditional methods that use maximal or minimal motion energy, the proposed method uses the change of motion state, in magnitude and phase, of the main moving objects as the metric for key-frame extraction. Experimental results show that, although the proposed method achieves objective performance similar to that of a widely used motion-energy-based method, the key frames it extracts are more consistent with human perception.

1. Introduction

Among existing methods for key-frame extraction, those based on motion energy (ME) have proved to be both effective and computationally efficient.1, 2, 3, 4 These methods are based on the simple idea that the more motion a scene contains, the more attention it should attract. Accordingly, a local maximum or minimum of ME, which is related to the motion magnitude, is usually employed as the metric for key-frame extraction. However, key frames extracted by ME-based methods are often not representative, because substantial motion is present in most frames of a video sequence. Naturally, an object keeps its normal motion state, rest or uniform motion in a straight line, unless compelled by external forces to change that state. A change of motion state is unpredictable and therefore more interesting to people than the motion itself. Hence, frames in which an object changes its motion state, by starting, stopping, accelerating, decelerating, or changing direction, provide more information and attract more attention than frames containing uniform motion. Thus, extracting key frames based on changes of motion state, which can be uniformly represented as the acceleration of the moving objects, is more consistent with human perception.

Motivated by this consideration, we define the frames with the most significant acceleration (MSA) of the main moving object as key frames, and accordingly propose a novel key-frame extraction method. The key frames obtained by the proposed method reflect changes of motion state of the main moving objects, such as moving into or out of the scene, or starting to change motion direction or speed.

2. Proposed Method

In this letter, key frames are defined as the frames with the MSA. Generally, the acceleration $\mathbf{a}(t)$ is defined as

Eq. 1

$$\mathbf{a}(t) = \frac{d\mathbf{v}(t)}{dt} = \frac{dv_x(t)}{dt} + \frac{dv_y(t)}{dt} = \mathbf{a}_x(t) + \mathbf{a}_y(t),$$
where $v_x(t)$ and $v_y(t)$ are the horizontal and vertical components of the velocity $\mathbf{v}(t)$, respectively. Let $\mathbf{v}(t-1)$ and $\mathbf{v}(t)$ denote the motion vectors (MVs) of a moving object at times $t-1$ and $t$, respectively. Then the acceleration vector $\mathbf{a}(t)$ of the object can be expressed as

Eq. 2

$$\mathbf{a}(t) = \mathbf{v}(t) - \mathbf{v}(t-1) = [v_x(t) - v_x(t-1)] + [v_y(t) - v_y(t-1)] = \mathbf{a}_x(t) + \mathbf{a}_y(t).$$

In each frame of a video sequence, many blocks are often found to share the same MV. If the greatest number of blocks corresponding to a nonzero MV (denoted $N_{\max}$) exceeds a threshold $T_N$ (usually expressed as a percentage of the number of blocks in a frame, e.g., 2%), these blocks are considered to belong to the main moving object, and their MV is defined as the feature MV (FMV) of the frame. Otherwise, if $N_{\max}$ is less than $T_N$, the FMV is set to zero. The framework of the proposed method is shown in Fig. 1. The acceleration vector of the main moving object, denoted $\mathbf{a}_m(t)$, can therefore be computed as

Eq. 3

$$\mathbf{a}_m(t) = [mv_x(t) - mv_x(t-1)] + [mv_y(t) - mv_y(t-1)] = \mathbf{a}_{mx}(t) + \mathbf{a}_{my}(t),$$
where $mv_x$ and $mv_y$ are the components of the FMV. The vector $\mathbf{a}_m(t)$ in Eq. 3 can also be written as

Eq. 4

$$\mathbf{a}_m(t) = |a_m(t)| \exp[j\theta_m(t)],$$
where $|a_m(t)|$ and $\theta_m(t)$ represent the magnitude and angle of the acceleration vector $\mathbf{a}_m(t)$, respectively.
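For concreteness, the following Python sketch shows how the FMV of a frame and the resulting acceleration vector could be computed from block MVs produced by any block-matching motion estimator. The function names, the tuple representation of MVs, and the 2% default threshold are illustrative assumptions, not code from the original letter.

```python
import math
from collections import Counter

def feature_mv(block_mvs, t_n=0.02):
    """Pick the feature MV (FMV) of one frame from its block MVs.

    block_mvs : list of (mvx, mvy) tuples, one per block, assumed to come
                from any block-matching motion estimator.
    t_n       : threshold T_N as a fraction of the block count (2% here).
    """
    counts = Counter(mv for mv in block_mvs if mv != (0, 0))
    if not counts:
        return (0, 0)
    mv, n_max = counts.most_common(1)[0]
    # The dominant nonzero MV must cover enough blocks to qualify as the
    # main moving object; otherwise the FMV is set to zero.
    return mv if n_max >= t_n * len(block_mvs) else (0, 0)

def acceleration(fmv_t, fmv_prev):
    """Acceleration vector a_m(t) of Eq. 3, returned as the
    (magnitude, angle) pair of Eq. 4."""
    ax = fmv_t[0] - fmv_prev[0]
    ay = fmv_t[1] - fmv_prev[1]
    return math.hypot(ax, ay), math.atan2(ay, ax)
```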

Fig. 1

Key-frame extraction based on acceleration.


The frames with notably high values of $|a_m(t)|$ should be considered as candidate key frames. However, changes in the motion direction of an object should also be taken into account, given their importance to human perception. Therefore, $|a_m(t)|$ is weighted by a factor $w(t)$ to construct $a_w(t)$ as

Eq. 5

$$a_w(t) = w(t)\,|a_m(t)|.$$
For simplicity, w(t) is defined as
$$w(t) = \begin{cases} 4, & \text{if the FMV direction reverses either horizontally or vertically at time } t, \\ 2, & \text{if the FMV changes from still to moving or from moving to still}, \\ 1, & \text{otherwise (no direction change)}. \end{cases}$$
Now the frames with high values of $a_w(t)$ are considered as candidates.
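A minimal sketch of this weighting step follows, building on the helpers above; the exact tests for direction reversal and for the rest/motion transition are my reading of Eq. 5, not code from the letter:

```python
import math

def weight(fmv_t, fmv_prev):
    """Weighting factor w(t) of Eq. 5."""
    # Direction reversal: a sign flip in either FMV component.
    if fmv_t[0] * fmv_prev[0] < 0 or fmv_t[1] * fmv_prev[1] < 0:
        return 4
    # Transition between rest and motion (still -> moving, or the reverse).
    if (fmv_t == (0, 0)) != (fmv_prev == (0, 0)):
        return 2
    return 1

def weighted_acceleration(fmv_t, fmv_prev):
    """a_w(t) = w(t) * |a_m(t)|, with |a_m(t)| as in Eq. 3."""
    mag = math.hypot(fmv_t[0] - fmv_prev[0], fmv_t[1] - fmv_prev[1])
    return weight(fmv_t, fmv_prev) * mag
```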

A motion change may last for several frames. To determine which frame is the most important one, a new function $a_{c,m}(t)$, the convolution of $a_w(t)$ with a rectangular time window, is introduced as

Eq. 6

$$a_{c,m}(t) = \frac{1}{W} \sum_{\Delta t = -(W-1)/2}^{(W-1)/2} a_w(t + \Delta t),$$
where $W$ is the width of the time window. We choose an odd value for $W$ so that the window is symmetric about its center $t$. Then the frames corresponding to the peaks (i.e., local maxima) of $a_{c,m}(t)$ are extracted as key frames. Note that a proper window size $W$ can be selected according to the temporal resolution of the video sequence: for a sequence with a high frame rate and/or slow motion in its content, a larger $W$ should be selected.
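The final smoothing and peak-picking step could look like the sketch below. NumPy's convolve with mode='same' realizes the centered window of Eq. 6; the neighbor-comparison test for local maxima and the treatment of the first and last frames as default key frames (as in the experiments of Sec. 3) are assumptions of this sketch.

```python
import numpy as np

def key_frames(a_w, window=5):
    """Smooth a_w(t) with a centered window of odd width W (Eq. 6)
    and return the indices of its local maxima as key frames."""
    assert window % 2 == 1, "W must be odd so the window is centered"
    a_cm = np.convolve(a_w, np.ones(window) / window, mode='same')
    peaks = [t for t in range(1, len(a_cm) - 1)
             if a_cm[t - 1] < a_cm[t] >= a_cm[t + 1]]
    # The first and last frames are kept as default key frames,
    # as in the experiments of Sec. 3.
    return sorted({0, len(a_cm) - 1, *peaks})
```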

3. Experimental Results

Four test sequences, namely “Erik,” “Football,” “Claire,” and “Foreman,” were employed in the experiments. For performance evaluation, the proposed key-frame extraction method is compared with the ME-based method presented in Ref. 2 by extracting the same number of key frames. Both objective performance as measured by shot reconstruction degree (SRD) and subjective performance are compared in this letter.

The SRD is the average peak SNR of frames interpolated from the key frames by inertia-based interpolation, measured against the corresponding original frames of the sequence.2 Figure 2 shows the curves of the mean SRDs of the four sequences in quarter common intermediate format (QCIF) obtained by the two compared methods for extracted-key-frame ratios from 2% to 12%.
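As a rough illustration of how such a score can be computed, the sketch below reconstructs each non-key frame by linearly blending its two surrounding key frames and averages the resulting PSNR values. The linear blend is a deliberate simplification standing in for the inertia-based interpolation of Ref. 2, so its absolute numbers will not match the SRD values reported here.

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """Peak SNR in dB between a reference and a reconstructed frame."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def srd(frames, key_idx):
    """Average PSNR of the non-key frames reconstructed from key frames.

    frames  : list of numpy arrays (one per frame of the sequence).
    key_idx : sorted indices of the extracted key frames; assumed to
              include the first and last frame of the sequence.
    """
    scores = []
    for a, b in zip(key_idx, key_idx[1:]):
        for t in range(a + 1, b):
            alpha = (t - a) / (b - a)  # position between the two key frames
            rec = (1.0 - alpha) * frames[a] + alpha * frames[b]
            scores.append(psnr(frames[t], rec))
    return float(np.mean(scores)) if scores else float('inf')
```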

Fig. 2

SRD of proposed method and ME method.


It is observed from Fig. 2 that the two methods achieve similar SRD performance. When the percentage of key frames is below 6%, the proposed method outperforms the ME-based method2 by about 0.1 to 0.3 dB. For larger percentages of key frames, the ME-based method achieves slightly better performance, by up to 0.2 dB.

Although the overall difference in SRD performance between the two methods is very small, the subjective performance of the proposed method is better. Specifically, subjective results show that the frames in which changes of object movement (the head in “Claire”) or scene changes (in “Football” and “Foreman”) occur are well extracted from all three sequences by the proposed method, but not always by the ME-based method. Owing to space limitations, only the key frames extracted from “Erik” in QCIF and “Foreman” in CIF are shown in Figs. 3 and 4, respectively.

Fig. 3

Extracted key frames (“Erik,” QCIF, key-frame ratio 8%): (a) horizontal nose position, in pixels from the left edge of the picture; (b) key frames extracted by the ME-based method of Ref. 2; (c) key frames extracted by the proposed method.


Fig. 4

Extracted key frames (“Foreman,” CIF, key-frame ratio 2%): (a) ME-based method (frame numbers 0, 43, 107, 151, 171, 243, 299); (b) proposed method (frame numbers 0, 170, 178, 183, 195, 216, 299).


A detailed analysis of the results for the “Erik” sequence, which features typical alternations of uniform motion and motion changes, is given as an example. Figure 3a shows how the horizontal position of the nose changes with time. The whole track can be approximately recovered from several inflection points, viz., at frames 1, 14, 27, 39, and 50. These frames are therefore used as benchmark key frames to evaluate the performance of the key-frame extraction methods. The same number of key frames extracted by the ME-based method2 and by the proposed method are shown in Figs. 3b and 3c, respectively. The first and last frames were treated as default key frames in both methods. None of the frames extracted by the ME-based method,2 given in Fig. 3b, coincide with the benchmark frames shown in Fig. 3a. This is because the frames that yield local maximum ME, where the head has the largest speed, occur while the head is moving in a straight line, not at the inflection points of the track as in the key frames of Fig. 3a. The frames extracted by the proposed method, shown in Fig. 3c, are almost the same as those in Fig. 3a, which demonstrates the excellent subjective performance of the proposed acceleration-based method.

Figure 4 shows the key frames extracted from the “Foreman” sequence. As with the results for “Erik,” the key frames extracted by the proposed method are more distinct and describe the whole sequence better than those extracted by the ME-based method.

From the analysis, it can be concluded that the key frames selected by the proposed method are more consistent with the key frames determined by human perception than those selected by the ME-based method.

4. Conclusions

An acceleration-based key-frame extraction method has been proposed by constructing a new factor that reflects the motion change of the main moving object. Experimental results show that the proposed method selects key frames more consistent with human perception than the ME-based method.

Acknowledgment

We would like to acknowledge the support provided by the National Science Foundation of China (60772134) and the 111 Project (B08038).

References

1. B. T. Truong and S. Venkatesh, “Video abstraction: a systematic review and classification,” ACM Trans. Multimedia Comput. Commun. Appl. (2007).

2. T. Y. Liu, X. D. Zhang, J. Feng, and K. T. Lo, “Shot reconstruction degree: a novel criterion for key frame selection,” Pattern Recogn. Lett. 25, 1451–1457 (2004).

3. W. S. Chau, O. C. Au, and T. S. Chong, “Key frame selection by macroblock type and motion vector analysis,” pp. 575–578.

4. T. M. Liu, H. J. Zhang, and F. H. Qi, “A novel video key-frame-extraction algorithm based on perceived motion energy model,” IEEE Trans. Circuits Syst. Video Technol. 13(10), 1006–1013 (2003).
© 2008 Society of Photo-Optical Instrumentation Engineers (SPIE)
Yanzhuo Ma, Yiling Chang, and Hui Yuan, “Key-frame extraction based on motion acceleration,” Optical Engineering 47(9), 090501 (1 September 2008). https://doi.org/10.1117/1.2977795