Motion estimation (ME) is used in many video coding standards, such as MPEG-1, MPEG-4, and H.264/AVC, to remove interframe redundancies. There are two main factors that determine the performance of ME. One is the size of the search range (SR), and the other is the number of reference frames (RFs). The values of these two factors are set manually in the configuration file of the encoder and are used as fixed parameters in the whole ME process. Although larger SR and RF values enable better coding performance, they also impose a heavy encoding complexity. In the overall coding system, about 50% of the computation time and 90% of the memory access are dedicated to ME.1,2 Indeed, the memory access bandwidth required to read reference pixels from external memory has gradually become the bottleneck of real-time applications.
Because the coding video differs considerably, fixed SR and RF settings are not suitable for all situations. For example, if the video sequence is motionless, a small SR and single RF are sufficient to determine the best motion vector (MV). On the other hand, if the sequence contains a high degree of movement, we can adaptively adjust the above settings to achieve better video quality. In such cases, dynamic SR adjustment and multiframe selection algorithms have become important in improving coding efficiency and satisfying the needs of real-time applications.
Many algorithms that adaptively adjust the SR size and select the RFs have been proposed. For instance, the influence of the SR on the compression rate and coding quality has been examined in an extensive series of experiments.3 The maximum-likelihood estimator has been used to model the probability density function of the motion vector difference (MVD), with the optimal SR then calculated for a given hitting probability.4,5 However, the integral operation and likelihood estimator make the algorithm somewhat complicated. In Ref. 6, images are divided into motionless and moving regions, and different SR selection methods are given for each. Using the spatial correlation, the SR can be set as a function of the maximum of the neighboring matching errors.7 Similarly, the block can be classified as motionless, slow, or violent based on the magnitude of the predicted motion vector (PMV), allowing a , , or search area to be selected accordingly.8,9 In Ref. 10, three different SRs are selected based on certain thresholds. However, with only a few fixed SRs and thresholds, the adaptability of these algorithms89.–10 is insufficient to cope with the diversity of video sequences. In Ref. 11, the correlation between different block partitions is used to predict the SR size. The SR of the current block has been calculated based on the relationship between the probability density function of MVD and the MV distribution of the neighboring blocks.12 An analysis of the relationship between the compression ratio and the size of the SR has been shown to allow a smaller SR for macroblocks (MBs) with smaller compression ratios.13 In Ref. 14, a linear SR model is defined based on the 1-bit image plane.
Existing multiframe selection algorithms are generally classified into three categories. In the first, the MV composition technique is used to reduce the calculation load of multiframe ME, which uses the MVs from the previous RFs to generate those for the current RF. The activity dominant vector selection criterion can be used to obtain the composited MV.15 The MVs of two successive frames can synthesize the MV of the current RF,16 with only a small range around the composited MV needed for ME. In Ref. 17, the MVs with the same characteristics are first merged and then the two most relevant MVs are selected to generate the MV of the current block. To apply the MV composition algorithm, the MVs of different block types must be stored for further calculation. Therefore, this approach inevitably involves a large additional memory space.
In the second category, the candidate reference frames (CRFs) are predicted from the temporal and spatial neighbors. The priorities of all RFs can be computed from the spatial and temporal correlation of the RF index and MVs,18 or the coding modes and RFs from some reference region may be used to determine the CRFs for the current block.19 In Ref. 20, the initial RF is dynamically selected using the mean of the RF indexes of spatially neighboring MBs.
The third category speeds up the multiframe selection procedure by employing an early termination strategy. For instance, based on the spatial and temporal consistencies of the motion field, multiframe ME may only be selected for a small region of the MBs,21 or the experimentally obtained sum of absolute difference (SAD) curve could be used to skip the third and fourth RFs.22 In Ref. 23, the relationship between the SAD and the all-zero quantized transform coefficients was deduced. If the SAD is less than some threshold, the multiframe ME procedure should be terminated immediately. Alternatively, the MVs and SADs from previous RFs can be used to automatically confirm whether to search the last two RFs.24
Different from the above-mentioned methods, we propose a different dynamic SR and multiframe selection algorithm for the H.264/AVC ME process. For the SR adjustment scheme, we first analyze the relationship between the PMV and the SR and then use this relationship to adjust the SR size. For the multiframe selection scheme, we focus on the correlation between different block sizes of the MB to adaptively select RFs. Simulation results show that the proposed algorithm achieves a higher speedup while maintaining almost the same video quality and total bit rate.
The rest of this paper is organized as follows. Sections 2 and 3 describe the algorithms for dynamic SR adjustment and multiframe selection, respectively. Our simulation results are presented in Sec. 4, and we give our conclusions in Sec. 5.
Proposed SR Adjustment Algorithm
Motivation and Analysis
The SR size can be successfully reduced using the algorithms mentioned in Sec. 1, but the efficiency can be further improved. In the H.264/AVC coding standard, the PMV decides the center of a search area. Therefore, the SR size depends on the location of the PMV more than any other factors, such as the maximum matching error7 or the motion activity.8 The relationship between the SR and the PMV is illustrated in Fig. 1. The vector denotes the search center and is calculated as the median operator over the neighboring MVs, and is the final MV of block obtained by ME. The vector is the difference between and . Intuitively, when PMV () is accurate (i.e., when is close to ), fewer regions need to be searched before the final motion vector () is obtained. Hence, we can consider an adaptive SR adjustment algorithm which depends on the PMV.
To determine an MV within the search window, SAD has been widely used to measure the prediction distortion between the current and the candidate blocks. For a block of size SAD is defined asFig. 2. In this figure, represents the magnitude of the difference between MV and PMV, and is the average Dif value of blocks with the same . A clear positive correlation can be seen, especially for values up to 11. Although does not increase monotonously over the whole range, there is a high probability that will become larger as increases. Hence, we can develop an adaptive SR selection algorithm, which uses Dif as the criterion for determining the SR. That is, the small Dif means the accurate PMV and the small SR value, whereas the large Dif means the large SR value.
For the current block in Fig. 1, is not available until the ME process has been performed across the whole search area, so we must set a predictor for before it can be used. The influence of this predictor on coding quality and efficiency is briefly analyzed as follows. In Eq. (2), if a larger predictor is selected for , a smaller Dif value will be obtained. According to Fig. 2, this should result in a smaller SR size. Although a higher coding speed can be achieved with a smaller SR, the coding quality may be lost if the true MV falls outside this smaller SR. In contrast, a smaller predictor gives a larger SR and a higher coding quality. Hence, our purpose is not to get an accurate estimate for but to achieve a tradeoff between the coding quality and the coding efficiency. Therefore, we replace with a smaller predictor:
To estimate the SR size of block , let , , , , and denote the Dif values of the current block and its four neighbors (left, upper, upper-right, and colocated block), respectively. , , , and are subtracted from , , and , and the four differences are collected in the set :
The SR is then adjusted according to the following conditions.
Condition 1: If any element in is smaller than 0, is smaller than the Dif value of (at least) one of its four neighbors. Thus, it can be determined that the PMV of block is closer to its MV than its neighbor, and therefore the following smaller SR should be set for block :
In Eq. (5), , , , and are the SRs of blocks A, B, C, and E, respectively.
Condition 2: If all elements in are greater than 0 and , where is an adjustment parameter, the PMV of the current block is far from its MV, which is likely to be due to complex motion. In such cases, in order to guarantee the coding quality, a larger SR should be set as follows:
Condition 3: If conditions 1 and 2 are not satisfied, the two minimum elements in (corresponding to the blocks that have the closest Dif values to the current block ) are selected. These blocks are marked and , and the SR of block is set according to
For clarity and completeness, we summarize the detailed description of the proposed algorithm as follows:
After finding the MV within the new SR defined by our algorithm, the Dif and SR values of the block are saved for the next coding block.
In order to verify the correctness of the proposed algorithm, we analyzed the hit ratio to see whether our new SR included the result of ME. The hit ratios for different quantization parameters (QPs) are shown in Table 1. It can be seen that our algorithm gives a high accuracy of 93.28% to 99.61%. The Bus, Mobile, and Stefan sequences contain a high level of motion and thus obtain relatively low hit ratios. For the Akiyo, Paris, and Flower sequences, which contain slower motion, the proposed algorithm yields a hit ratio of around 99%.
Hit ratio of the motion vector (MV) within the new search range (SR) (test conditions: encoded frame=100; number of reference frames(RFs)=1; QP=28, 32, 36, 40; SR=16; group structure is IPP).
|Sequence||Hit rate (%)|
Proposed Multiframe Selection Algorithm
Motivation and Analysis
Figure 3 describes the relationship between multiframe and variable block size ME. As shown in Fig. 3, the multiframe ME process is performed on each block size of the MB. These blocks are different partitions of the same MB, so their best RFs are highly correlated. According to this feature, the best RFs of the upper divisions can be used as CRFs for the lower divisions. However, using this up-to-down method to predict the CRFs of the lower divisions has two shortcomings. First, if the MB contains the object boundary or the complex content, then the probability that the lower and upper divisions will share the same best RF will be small. Second, the lower partitions have more sub-blocks than the upper partitions. Therefore, this up-to-down prediction method is not very efficient. For example, the division contains 36 sub-blocks in total. If the number of CRFs is 2, then the 36 sub-blocks will require 72 ME procedures, whereas for the division, only two ME procedures are required. Therefore, reducing the number of CRFs in the lower divisions is critical for improving the efficiency of ME.
In summary, the block is used as the basic unit in this study to determine the best RFs for the other coding modes. We use the block as it better reflects the motion and texture characteristics of the MB and improves the efficiency as much as possible. For convenience, the RFs are hereafter marked as in accordance with their distance from the current frame.
Due to the temporal consistency of a video sequence, it is highly probable that blocks in static or slow-moving regions will select ref0 as the optimal RF. In this work, the movement complexity (MC) of the MB is defined as
To inspect the relationship between movement complexity and , various video sequences with different motion activities are used as the test material. Table 2 documents the average experimental results for different QPs. is the number of blocks with the same movement complexity, and RF_ratio represents the percentage of blocks whose ultimate RF is . For the static blocks (which MC is equal to 0), an average of 97.87% of them select as the best RF. For the News, Container, and Akiyo sequences, the percentage of blocks choosing among is greater than 99%. Table 2 also shows that the decreases as the value of MC increases. For the Stefan sequence, in particular, when the MC is 3, only 68.32% of select as their ultimate RF. Therefore, we can conclude that when the value of MC is less than some threshold Th (set to 2 in this paper), there is a very high probability that only will be used as the best RF. From this result, we conclude that the search across other RFs can be skipped.
Relationship between the movement complexity and ref0 (test conditions: encoded frame=60; number of RFs=5; QP=28, 32, 36, 40; SR=16; group structure is IPP).
|T_B/RF_ratio (%)||T_B/RF_ratio (%)||T_B/RF_ratio (%)||T_B/RF_ratio (%)|
If MC is larger than Th, it indicates that the motion is complex. Thus, we perform ME on the remaining RFs for each block. The reference maps are then generated according to the correlation between the block and the other blocks. Figure 4 shows the reference maps for other blocks, where , , , and denote the optimal RFs of the first, second, third, and fourth block, respectively. The CRFs for each partition are described as follows:
a. The mode is used to encode regions containing slow motion. Thus, if more than two RFs are contained in Fig. 4(a), the motion of the current block is complex and the mode cannot be selected. In this case, only is selected as the best RF; otherwise all RFs contained in Fig. 4(a) are selected as CRFs for the block.
b. For the block, and are regarded as CRFs for the upper division, whereas and are CRFs for the lower division in Fig. 4(b).
c. For the block, and are regarded as CRFs for the left division, whereas and are CRFs for the right division in Fig. 4(c).
d. For the three subdivisions, the best RF for the block is directly used as the best RF for its subdivisions in Fig. 4(d).
Based on the above description, the pseudocode for our multiframe selection algorithm can be written as shown in Fig. 5. The initial number of CRFs is usually set to 5.
As shown in Fig. 5, ME is first performed on for the four blocks, and the MVs of the four blocks are used to calculate the motion complexity according to Eq. (8). If MC is less than the threshold Th (set as 2), then is set as the CRF. Otherwise, ME is conducted on from to for the four blocks, respectively. The next step is to determine a candidate reference frames set (CRFS) for other partitions based on the reference maps in Fig. 4. ME is then performed on the RFs in the CRFS for the sub-block, block, block, and the block. In the ME process for H.264/AVC, the coding order is , , , and then . As the proposed algorithm changes the coding order, i.e., the block is processed first, the MVs of the second, third, and fourth blocks must be repredicted after finishing the whole ME process, otherwise there will be some inconsistency between the decoder and encoder.
Using the pseudocode above, we investigate the number of RFs under the best- and worst-case scenarios. In the best case, only one RF () is required for all block divisions. In the worst case, all five RFs are required for the blocks; each of the block, block, and block have two CRFs. However, subpartitions such as , , and will always have one RF. Therefore, the average number of RFs required for the entire coding of the MB is .
To verify the effectiveness of the above method, the average hit ratio for different QPs at each block size was compared to the full RF search method. The results are shown in Table 3. As can be seen, our algorithm achieves a high average accuracy of 91% to 98% for all block partitions. For the block size, the hit ratio averages above 95%. Even the sub-block averages over 91%, with a worst-case performance of 84.75% with the Mobile sequence.
Average hit ratios of the proposed algorithm for each block size (test conditions are the same as Table 2).
Simulation Results and Analysis
To evaluate the performance of the proposed approach, it was implemented on the H.264/AVC reference software JM15.1. Various standard sequences were used, including CIF, 4CIF, and HD sizes, at 30 fps for 100 frames. The simulations were performed with rate distortion (RD) optimization and Hadamard transform enabled, , 32, 36, and 40, IPPP sequence types, CAVLC entropy coding, and one or five RFs. The default SR was set to 16 for CIF, 32 for 4CIF, and 64 for HD sequences.
For a numerical comparison, the Bjontegaard delta peak signal-to-noise ratio (BDPSNR) and Bjontegaard delta bit rate (BDBR) were used to evaluate the coding quality. These quantities are recommended for comparing the difference between two RD curves.25 A negative BDPSNR (dB) or positive BDBR (%) indicates a coding loss over the full-search algorithm in JM15.1. For an objective comparison, the average used reference frame (URF) and average computational complexity (CPX) are used to measure the complexity reduction. CPX (%)4 is the speed-up ratio of the SR adjustment algorithm, which is defined as the ratio of the number of search points in the proposed method to the number of search points in the full-search motion estimation (FME) algorithm. For example, a CPX value of 21% implies that the number of search points in the proposed method is 21% of the number in the FME algorithm. The average URF represents the average number of RFs used by each MB, with a smaller URF signifying fewer RFs.
For an objective comparison of coding performance, the proposed approach was compared with corresponding methods. Our SR adjustment method was compared to two recent schemes.4,11 It can be seen from Table 4 that the proposed algorithm reduced the number of search points by an average of 2.56%, albeit at the cost of a 0.14% increase in bit rate and a 0.06 dB reduction in peak signal-to-noise ratio (PSNR). When the hitting probability of MVDs was set to 0.7, the method of Ko et al.4 exhibited a good performance, with an average CPX of 4.94%. However, the implementation cost of this method is much higher than that of the proposed scheme due to the overheads inherent in the integral operation and likelihood estimation. For several sequences, such as Stefan, , and , the method of Ref. 11 exhibited better coding quality than our method in terms of BDPSNR and BDBR, but our approach can further deduct an average of 11% search points with only a slight bit rate increase and small PSNR decrease.
Performance comparison of the proposed algorithm on JM 15.1 (with one RF).
|Sequence||CPX||BDBR (%)||BDPSNR (dB)|
|Ours (Refs. 4, 11)||Ours (Refs. 4, 11)||Ours (Refs. 4, 11)|
For further comparison, Table 5 depicts the ME time of the proposed method compared to the UMHexagonS algorithm in JM15.1. This shows that the proposed method reduces the ME time by 9.15% on average, with a marginal increase in BDBR () and a small loss of BDPSNR (). For sequences with static and slow motion activity, such as and , UMHexagonS achieves a high speed-up factor, because the ME process is terminated after finishing the small and middle diamond pattern search. In contrast, our algorithm works well for CIF sequences with high motion levels and 4CIF and HD sequences with larger SR values.
Performance comparison with UMHexagonS algorithm (test conditions are the same as Table 4).
We compared our multiframe selection algorithm with two well-known efficient algorithms.19,23 The comparison results are listed in Table 6. For most sequences, the method of Lee et al.19 yielded a high bit rate increment and PSNR loss (average of 0.09 dB BDPSNR decrease and 0.61% BDBR increase), with 1.88 frames searched compared to the JM15.1 reference encoding. The method of Ref. 23 achieved good PSNR and bit rate performance (average 0.06 dB BDPSNR decrease and 0.45% BDBR increase) with a higher speed-up factor (1.52 RFs used). The proposed algorithm gave superior results to both of these techniques in terms of coding complexity reduction. It outperformed the method of Lee et al.19 in all aspects, with a 0.63-frame reduction in the number of searched RFs, which implies a reduction of 12% in the number of ME operations. At the same time, the proposed method improved the URF value by 0.31 frames compared to the method of Huang et al.,23 with a better bit rate and the same PSNR.
Performance comparison of the proposed algorithm on JM 15.1 (with five RFs).
|Sequence||URF||BDBR (%)||BDPSNR (dB)|
|Ours (Refs. 19, 23)||Ours (Refs 19, 23)||Ours (Refs. 19, 23)|
It should be mentioned that the URF and CPX parameters are virtually independent of the programming quality or hardware platform. Therefore, it is clear that the performance of the proposed methods is superior to the corresponding methods evaluated above. It must also be noted that the proposed methods can be combined with any fast ME method, such as Fast Full search, EPZS, and UMHexagonS, in the JM reference encoder.
In this paper, we presented an efficient adaptive SR adjustment and multiframe selection algorithm for ME. The proposed SR adjustment algorithm adjusts the SR for ME based on the accuracy of PMV. We also designed an RF selection scheme with the aim of reducing the complexity of multiframe ME based on the relationship between the different block sizes of the MB. The described technique significantly reduces the CPX of ME in H.264/AVC. Through a comparative analysis, it was found that the proposed approach gave an average RF saving of 3.75 frames (1.25 URFs) with a setting of five RFs and a reduction of 97% in the number of search points with a setting of one RF, with a negligible bit rate increment and a minimal loss of image quality. These features will be important in the development of real-time coding applications for video conferencing and telephony. In future work, we will combine our two algorithms to further speed up the ME process.
Yingzhe Liu received BS and MS degrees from YanShan University of China in 2002 and 2004, respectively. He is currently working toward a PhD degree at the Microelectronics Center of Harbin Institute of Technology, China. His current research interests include audio and video coding and digital hardware design for video compression.
Jinxiang Wang received his MS and PhD degrees in semiconductor physics and communication and information systems from Harbin Institute of Technology, China, in 1993 and 1999, respectively. He is now a professor in the School of Astronautics, Harbin Institute of Technology. His current research interests mainly include multimedia data compression, 3G wireless communication, systems on chips (SOC), and networks on chips (NOC).
Fangfa Fu received his MS and PhD degrees in semiconductor physics and communication and information systems from Harbin Institute of Technology, China, in 2006 and 2011, respectively. He is now a lecturer in the School of Astronautics, Harbin Institute of Technology. His research interests include systems on chips (SOC) and networks on chips (NOC).