Among existing methods for key-frame extraction, those based on motion energy (ME) have proved to be both effective and computationally efficient (Refs. 1-4). These methods rest on the simple idea that the more motion a scene contains, the more attention it attracts. Accordingly, a local maximum or minimum of the ME, which is related to the motion magnitude, is usually employed as the metric for key-frame extraction. However, the key frames extracted by an ME-based method are often not representative, because motion is present in most frames of a video sequence. Naturally, objects remain at rest or in uniform motion in a straight line unless compelled by external forces to change that state. A change of motion state is unpredictable and therefore more interesting to people than the motion itself. Consequently, frames in which an object changes its motion state, such as starting, stopping, accelerating, decelerating, or changing direction, provide more information and attract more attention than frames containing uniform motion. Thus, extracting key frames based on changes of motion state, which can be uniformly represented as the acceleration of the moving objects, is more consistent with human perception.
Motivated by this consideration, we define the frames with the most significant acceleration (MSA) of the main moving object as key frames, and accordingly propose a novel key-frame extraction method. The key frames obtained by the proposed method can reflect changes of motion states of the main moving objects such as moving in, moving out, and starting to change the motion direction or amplitude.
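In this letter, key frames are defined as the frames with the MSA. Generally, acceleration is defined as
$$\mathbf{a} = \frac{d\mathbf{v}}{dt} = \left(\frac{dv_x}{dt},\; \frac{dv_y}{dt}\right), \tag{1}$$
where $(v_x, v_y)$ are the horizontal and vertical components of the velocity $\mathbf{v}$, respectively. Let $\mathbf{MV}(t-1)$ and $\mathbf{MV}(t)$ denote the motion vectors (MVs) of a moving object at times $t-1$ and $t$, respectively. Then the acceleration vector of the object can be expressed as
$$\mathbf{a}(t) = \mathbf{MV}(t) - \mathbf{MV}(t-1). \tag{2}$$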
In each frame of a video sequence, many blocks are often found to have the same MV. If the greatest number of blocks sharing a nonzero MV (denoted as $N_{\max}$) is greater than a threshold $T$ (usually expressed as a percentage, for example 2%, of the number of blocks in a frame), then these blocks are considered as belonging to the main moving object, and their MV is defined as the feature MV (FMV) of the frame. Otherwise, if $N_{\max}$ is less than $T$, the FMV is set to zero. The framework of the proposed method is shown in Fig. 1. The acceleration vector of the main moving object, denoted as $\mathbf{A}(t)$, can therefore be computed as
$$\mathbf{A}(t) = \left(\mathrm{FMV}_x(t) - \mathrm{FMV}_x(t-1),\; \mathrm{FMV}_y(t) - \mathrm{FMV}_y(t-1)\right), \tag{3}$$
where $(\mathrm{FMV}_x, \mathrm{FMV}_y)$ are the components of the FMV. The vector in Eq. 3 can also be presented as
$$\mathbf{A}(t) = A(t)\, e^{j\theta(t)}, \tag{4}$$
where $A(t)$ and $\theta(t)$ represent the magnitude and angle of the acceleration vector $\mathbf{A}(t)$, respectively.
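The FMV selection and the frame-to-frame acceleration described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `feature_mv` and `acceleration`, the representation of block MVs as `(dx, dy)` tuples, and the toy frames are all assumptions made for illustration, and the case where the block count exactly equals the threshold (which the text leaves open) is treated here as "FMV set to zero".

```python
from collections import Counter
import math

def feature_mv(block_mvs, threshold_ratio=0.02):
    """Return the feature MV (FMV) of one frame.

    block_mvs: list of (dx, dy) tuples, one per block (assumed layout).
    threshold_ratio: threshold as a fraction of the block count (2% here).
    """
    # Count only nonzero MVs, since the FMV must be a nonzero vector.
    counts = Counter(mv for mv in block_mvs if mv != (0, 0))
    if not counts:
        return (0, 0)
    mv, n_max = counts.most_common(1)[0]
    # Keep the most frequent MV only if enough blocks share it.
    if n_max > threshold_ratio * len(block_mvs):
        return mv
    return (0, 0)

def acceleration(fmv_prev, fmv_curr):
    """FMV difference between consecutive frames as (magnitude, angle in rad)."""
    ax = fmv_curr[0] - fmv_prev[0]
    ay = fmv_curr[1] - fmv_prev[1]
    return math.hypot(ax, ay), math.atan2(ay, ax)

# Toy frames with 100 blocks each (values are illustrative only).
frame0 = [(2, 0)] * 10 + [(0, 0)] * 90   # main object moves along +x
frame1 = [(0, 2)] * 10 + [(0, 0)] * 90   # main object now moves along +y
frame2 = [(1, 1)] * 1 + [(0, 0)] * 99    # too few blocks: FMV falls back to zero

fmvs = [feature_mv(f) for f in (frame0, frame1, frame2)]
mag, ang = acceleration(fmvs[0], fmvs[1])
```

In this toy case the object turns from one axis to the other at constant speed, so the ME stays the same between `frame0` and `frame1` while the acceleration magnitude is clearly nonzero, which is the kind of event the method aims to capture.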
The frames with typically high values of the acceleration magnitude $A(t)$ should be considered as candidates of key frames. However, changes in the motion direction of an object should also be considered, because of their importance to human perception. Therefore, $A(t)$ is weighted by a direction-dependent factor to construct the weighted acceleration $A_w(t)$, and the frames with high values of $A_w(t)$ are considered as candidates.
A motion change may last for several frames. To determine which frame is the most important one, a new function $C(t)$, which is the convolution of the weighted acceleration $A_w(t)$ with a time window of width $W$ centered at $t$, is introduced. We choose an odd value for $W$ so that the time window is symmetric with respect to its center $t$. The frames corresponding to the peaks (i.e., local maxima) of $C(t)$ are then extracted as key frames. Note that a proper window size can be selected according to the time resolution of the video sequence: for a sequence with a high frame rate and/or slow-moving content, a large $W$ should be selected.
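The windowing and peak-picking step can be sketched as follows. This is only an illustration under stated assumptions: the window is taken to be rectangular (all ones), samples outside the sequence are treated as zero, and the names `windowed_sum` and `local_maxima` as well as the input series are hypothetical. The shape of the window used in the original method is not specified here.

```python
def windowed_sum(values, width):
    """Convolve a sequence with a rectangular window of odd width centered at t.

    Assumption: an all-ones window; samples outside the sequence count as zero.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window is symmetric about t")
    half = width // 2
    n = len(values)
    return [sum(values[max(0, t - half):min(n, t + half + 1)]) for t in range(n)]

def local_maxima(values):
    """Indices strictly greater than both neighbors (flat plateaus are skipped)."""
    return [t for t in range(1, len(values) - 1)
            if values[t - 1] < values[t] > values[t + 1]]

# Illustrative weighted-acceleration series: two motion changes, each lasting
# several frames.
weighted = [0, 0, 1, 3, 1, 0, 0, 0, 2, 4, 2, 0, 0]
smoothed = windowed_sum(weighted, width=3)
key_frames = local_maxima(smoothed)
```

With this input, each motion change yields a single peak, so one frame is selected per change even though the change spans several frames. A larger `width` merges peaks that lie closer together, which matches the guidance to use a larger window for high frame rates or slow motion.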
Four test sequences, namely “Erik,” “Football,” “Claire,” and “Foreman,” were employed in the experiments. For performance evaluation, the proposed key-frame extraction method is compared with the ME-based method presented in Ref. 2 by extracting the same number of key frames. Both objective performance as measured by shot reconstruction degree (SRD) and subjective performance are compared in this letter.
The SRD is the average peak signal-to-noise ratio (PSNR) of the frames interpolated on the basis of inertia, measured with respect to their original frames in the sequence (Ref. 2). Figure 2 shows the curves of the mean SRDs of the four sequences in quarter common intermediate format (QCIF) for the two compared methods, with extracted-key-frame ratios ranging from 2% to 12%.
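As a rough illustration of how such a score can be computed, the sketch below averages the PSNR of the non-key frames after rebuilding them from their neighboring key frames. It is a simplified stand-in, not the SRD procedure of Ref. 2: plain linear blending is assumed in place of the inertia-based interpolation, frames are assumed to be flat lists of 8-bit gray levels, and the names `psnr`, `interpolate`, and `srd` are hypothetical.

```python
import math

def psnr(original, reconstructed, peak=255.0):
    """Peak SNR in dB between two equally sized frames (flat lists of pixels)."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def interpolate(frame_a, frame_b, alpha):
    """Linear blend (assumption): alpha=0 gives frame_a, alpha=1 gives frame_b."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(frame_a, frame_b)]

def srd(frames, key_idx):
    """Mean PSNR of non-key frames rebuilt from their neighboring key frames.

    key_idx must be sorted and include the first and last frame indices.
    """
    scores = []
    for left, right in zip(key_idx, key_idx[1:]):
        for t in range(left + 1, right):
            alpha = (t - left) / (right - left)
            rebuilt = interpolate(frames[left], frames[right], alpha)
            scores.append(psnr(frames[t], rebuilt))
    if not scores:
        raise ValueError("no interpolated frames between the given key frames")
    return sum(scores) / len(scores)

# Three tiny 4-pixel frames (illustrative values); frames 0 and 2 are key frames.
frames = [[0, 0, 0, 0], [12, 12, 12, 12], [20, 20, 20, 20]]
score = srd(frames, key_idx=[0, 2])
```

A higher score means the selected key frames allow the remaining frames to be reconstructed more faithfully, which is why the measure favors key frames placed where the content departs from steady change.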
It is observed from Fig. 2 that the two methods achieve similar SRD performance. Where the percentage of key frames is below 6%, the proposed method outperforms the ME-based method (Ref. 2) by a small margin. For larger percentages of key frames, the ME-based method achieves slightly better performance.
Although the overall difference in SRD performance between the two methods is very small, the subjective performance of the proposed method is better. Specifically, subjective results show that the frames in which the object movement changes (the head in “Claire”) or a scene change occurs (in “Football” and “Foreman”) are reliably extracted from all three sequences by the proposed method, but not always by the ME-based method. Because of space limitations, only the key frames extracted from “Erik” in QCIF and “Foreman” in CIF are shown in Figs. 3 and 4, respectively.
As an example, a detailed analysis is given for the “Erik” sequence, which is representative of typical alternations between uniform motion and motion changes. Figure 3a shows the track of the horizontal nose position over time. The whole track can be approximately recovered from several inflection points, namely those at frames 1, 14, 27, 39, and 50. These frames are therefore used as the benchmark key frames for evaluating the key-frame extraction methods. The same number of key frames extracted by the ME-based method (Ref. 2) and by the proposed method are shown in Figs. 3b and 3c, respectively. The first and the last frame were treated as default key frames in both methods. The frames extracted by the ME-based method (Ref. 2), given in Fig. 3b, do not coincide with the benchmark frames shown in Fig. 3a at all. This is because the ME reaches its local maximum when the head has the largest speed, at which point the head is moving in a straight line rather than passing through an inflection point of the track. The frames extracted by the proposed method, shown in Fig. 3c, are almost the same as those in Fig. 3a, which demonstrates the excellent subjective performance of the proposed acceleration-based method.
Figure 4 shows the key frames extracted from the “Foreman” sequence. Similarly to the results for “Erik,” the key frames extracted by the proposed method are more distinct, and can describe the whole sequence better than those extracted by the ME-based method.
From the analysis, it can be concluded that the key frames selected by the proposed method are more consistent with the key frames determined by human perception than those selected by the ME-based method.
An acceleration-based key-frame extraction method is proposed by constructing a new factor that reflects the motion change of the primary moving object. Experimental results show that the proposed method selects key frames that are more consistent with human perception than those selected by the ME-based method.
We would like to acknowledge the support provided by the National Science Foundation of China (60772134) and the 111 Project (B08038).