In video coding, the key to achieving high coding efficiency is to make full use of the correlations in video. Short-term temporal correlation has been well exploited by current video coding. However, a substantial portion of videos is characterized by long-term temporal correlation, since they contain scenes that appear repeatedly. For example, in news programs, studio scenes and logo clips may emerge at intervals. In talk shows, shots of the hosts, the guests, and the audience alternate repeatedly. In movies and TV series, many scenes in dialogue clips and flashback episodes appear repeatedly. Video coding efficiency would be greatly improved if this long-term correlation were well exploited.
During the development of video coding, the temporal correlation in video has been more and more adequately used owing to the improvement of temporal reference techniques. In early video coding standards, such as MPEG1, H.261, H.262/MPEG2,1 and H.263,2 only short-term temporal correlation was exploited because the compression algorithms made reference to only one previously decoded picture. Later, a multiple reference frame technique was introduced by providing a long-term memory that stores seconds of previously decoded frames (up to 50) to the codec, so that temporal dependencies in video containing repetitive motion, uncovered background, noninteger pixel displacement, etc., can be exploited.3,4 The multiple reference frame idea was further developed and adopted in the later video coding standards H.264/Advanced Video Coding5 and High Efficiency Video Coding (HEVC),6 in which two kinds of frames, short-term reference (STR) frames and long-term reference (LTR) frames, are stored in the decoded picture buffer and used for motion compensation. STR frames, the neighboring frames of the encoding frame, are used to eliminate short-term temporal correlation, while LTR frames, which are from the distant past, are employed to make use of long-term temporal correlation. In practical applications, the numbers of both STR frames and LTR frames are limited because of (1) the syntax overhead of signaling reference frames and (2) the high computational cost introduced by motion estimation (ME). Typically, the number of LTR frames is set to one or two. For example, in the well-known codec VP8,7 only one LTR frame, called the golden frame, is enabled. In the latest Chinese video coding standard AVS2,8 a long-term background reference frame is used to assist the coding of surveillance video.
Although LTR frames have been supported by many video coding standards, long-term temporal correlation in videos with repeated scenes still remains. Previous works have mostly focused on exploiting the long-term temporal correlation within a single long scene.9–15 For example, in Refs. 9–12, a background frame was generated for a surveillance video and used as an LTR frame to improve background prediction. In Refs. 13 and 14, the optimal LTR frame interval was investigated, and LTR frames were adaptively selected according to the accumulation of change in a scene. In Ref. 15, the former two ideas were integrated: a background frame was generated as the LTR frame and then updated when more background regions were exposed. However, how to select LTR frames for multiple repeated scenes in a video so as to exploit their correlation has not been well explored. In addition, random access (RA) is desirable in many applications. It enables seek, fast-forward, and fast-backward operations in locally stored video streams (e.g., DVD, BD, etc.). In video-on-demand (VOD) streaming, it allows the servers to respond to client requests. To support RA functionality, the video to be encoded is divided into independent RA segments (RASs) by RA points (RAPs), and the correlation between RASs cannot be exploited. Moreover, the RAP frames are all intracoded; an intraframe typically requires several times more bits than an interframe of the same quality since it takes no advantage of temporal correlation. For these reasons, the rate-distortion performance of current video coding is degraded.
In Ref. 16, we proposed a method for the coding of RAP frames in video with repeated scenes. The representative frames of the video were extracted and stored in a scene library in advance. Then, the RAP frames were coded by referencing only the scene library. The method in Ref. 16 can improve the coding efficiency of RAP frames and partially exploit the correlation between repeated scenes. However, some problems remain. First, the construction of the scene library was based on a front-to-back analysis of the video: the first frame of a nonrepetitive scene was always chosen as a library frame, which might not be efficient for reference. Second, the method did not take advantage of the scene library to optimize the coding of non-RAP frames, so the correlation between RASs was not fully used.
To make the best use of the long-term temporal correlation in videos containing repeated scenes, a scene-library-based video coding (SLBVC) scheme is proposed in this paper. The main contributions of this study are as follows: (1) A video coding framework, in which the reference frames are stored in a scene library that is not cleared at RAPs, is introduced to exploit the correlations between RASs as well as between similar scenes. (2) A method to build the scene library is designed. The method is based on clustering and constructs a compact scene library to improve video coding efficiency. (3) A layered coding structure based on the scene library is proposed for the coding of each RAS. In the layered coding structure, the library frames are intracoded, and each RAS is coded by referencing only one library frame to improve coding efficiency while preserving the RA capability. The proposed scheme is suited to stored video and VOD streaming.
The rest of this paper is organized as follows: Sec. 2 presents the motivation and theoretical performance analysis on the proposed scheme. The SLBVC scheme is described in Sec. 3. The experimental results are discussed in Sec. 4. Finally, Sec. 5 concludes the paper.
Motivation and Theoretical Analysis
As mentioned above, many videos are composed of repeated scenes. Figure 1 gives an example extracted from the movie “The Man from Earth.”17 The example video can be divided into 12 clips by scene change, denoted $c_1, c_2, \dots, c_{12}$, together with the time points at which each clip appears. Since some clips visually belong to the same scene, there are five scenes in total, denoted $s_1, s_2, \dots, s_5$. In current video coding, to support RA, the video to be encoded is divided into independent RASs by RAPs. Without loss of generality, we discuss the coding of one scene, which appears as several clips at a set of time points. The coding structure of this scene is shown in Fig. 2: each of its clips is divided into RASs by RAPs. Neither the temporal correlation between consecutive RASs within a clip nor the long-term temporal correlation between the clips of the same scene can be used, so the coding efficiency of the scene is limited. The RASs are independent from each other because the decoded picture buffer is cleared at each RAP picture. Naturally, we introduce a scene library that stores reference frames and is not cleared at RAPs, as shown in Fig. 3. Then, the isolation between RASs is broken: for the example video, the long-term temporal correlation between consecutive RASs, as well as between the clips of the same scene, can be eliminated by referencing the library. As a result, the coding efficiency of the scene can be improved.
More generally, for a sequence $S$, assume the RAPs are inserted evenly according to the specified RA interval (RAI). As shown in Fig. 4, sequence $S$ is divided into $N$ independent RASs $S_i$, $i = 1, \dots, N$; $S$ and the $S_i$ can be regarded as vector sources. According to rate-distortion theory,18 since the RASs are coded independently, the rate-distortion function of $S$ in current video coding can be written as

$$R_{\mathrm{cur}}(D) = \sum_{i=1}^{N} R_{S_i}(D).$$
In SLBVC, we employ a scene library that is not cleared at RAPs to exploit the long-term temporal correlation between RASs from similar scenes. Let $F_L$ denote the reference frames in the scene library. $F_L$ is extracted from a subset of the RASs, which forms a set denoted $\Omega$. The coding process of SLBVC is analogous to encoding those RASs first and then letting the extracted reference frames be referenced by the remaining RASs. The rate-distortion function of $S$ in SLBVC can therefore be written as18,19

$$R_{\mathrm{SLB}}(D) = R_{F_L}(D) + \sum_{i=1}^{N} R_{S_i \mid F_L}(D).$$

Since conditioning on a reference cannot increase the rate required at a given distortion, we have $R_{S_i \mid F_L}(D) \le R_{S_i}(D)$. Finally, the rate saving of SLBVC can be expressed as

$$R_{\mathrm{cur}}(D) - R_{\mathrm{SLB}}(D) = \sum_{i=1}^{N} \left[ R_{S_i}(D) - R_{S_i \mid F_L}(D) \right] - R_{F_L}(D).$$
Scene-Library-Based Video Coding Scheme
The framework of the SLBVC scheme is shown in Fig. 5. It is composed of a traditional codec and two scene libraries. The scene library at the encoder is built by extracting representative frames of similar scenes in a video. Then, the library frames are encoded into a separate stream and transmitted to the decoder, so that the scene library can be reconstructed at the decoder. In the encoding/decoding process, each frame is coded using its most similar library frame (MSLF) as the LTR frame.
Clustering-Based Scene Library Construction
In practical implementation, we first extract the library frames $F_L$ from sequence $S$. Then, $F_L$ is coded, and its reconstruction is used as the reference for all RASs in $S$. So the rate-distortion function of $S$ is

$$R_{\mathrm{SLB}}(D) = R_{F_L}(D) + \sum_{i=1}^{N} R_{S_i \mid F_L}(D).$$
If similar RAP frames can be clustered together, a single cluster representative can serve as a good reference frame for the others, which is a good optimization of Eq. (5). To build the library, we classify the RAP frames into different clusters, and the center frame of each cluster is specified as a library frame. The K-means algorithm is employed for clustering because of its simplicity and efficiency.20 Since RAP frames from the same scene have similar color histograms, the clustering is based on the color histogram feature.21,22 An RAP frame $f_i$, $i = 1, \dots, M$, can be described by a color histogram

$$H_i = [h_i(1), h_i(2), \dots, h_i(B)],$$

where $h_i(b)$ is the normalized number of pixels falling into color bin $b$ and $B$ is the number of bins.
The K-means algorithm is run with a given cluster number $K$ and initial cluster centers $\mu_k$, $k = 1, \dots, K$, as follows:
Step 1: Assign each RAP frame $f_i$, $i = 1, \dots, M$, to the cluster whose center has the minimum difference:

$$k^*(i) = \arg\min_{k} \; d(H_i, H_{\mu_k}),$$

where $d(\cdot,\cdot)$ denotes the histogram difference measure.
Step 2: Update the cluster centers. For cluster $k$, the center frame $\mu_k$ is updated as the member frame with the minimum sum of differences to the other frames in cluster $k$:

$$\mu_k = \arg\min_{i \in \mathcal{C}_k} \sum_{j \in \mathcal{C}_k} d(H_i, H_j),$$

where $\mathcal{C}_k$ is the set of frames in cluster $k$ and $d(\cdot,\cdot)$ is the histogram difference measure.
Repeat steps 1 and 2 until the cluster centers $\mu_k$, $k = 1, \dots, K$, no longer change.
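As a concrete illustration, the two steps above can be sketched as follows (a minimal sketch with hypothetical helper names; histograms are assumed to be 1-D normalized color histograms, and the sum of absolute bin differences stands in for the histogram difference measure). Note that because each cluster center must be an actual RAP frame, the update step is medoid-style rather than a mean computation.

```python
def hist_diff(h1, h2):
    """Sum of absolute bin differences between two normalized histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def cluster_rap_frames(histograms, init_centers):
    """Cluster RAP-frame histograms around K center frames.

    histograms   -- list of per-frame color histograms
    init_centers -- list of K frame indices used as initial centers
    Returns (centers, labels) once the centers stop changing.
    """
    centers = list(init_centers)
    while True:
        # Step 1: assign each frame to the center with minimum difference.
        labels = [min(range(len(centers)),
                      key=lambda k: hist_diff(h, histograms[centers[k]]))
                  for h in histograms]
        # Step 2: re-pick each center as the member frame with the minimum
        # sum of differences to the other frames in its cluster.
        new_centers = []
        for k in range(len(centers)):
            members = [i for i, lab in enumerate(labels) if lab == k]
            members = members or [centers[k]]  # keep center if cluster empties
            new_centers.append(min(members, key=lambda i: sum(
                hist_diff(histograms[i], histograms[j]) for j in members)))
        if new_centers == centers:
            return centers, labels
        centers = new_centers
```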
The number of clusters must be explicitly specified for the K-means algorithm. However, the number of library frames that should be extracted is not known a priori. To find the optimal number of clusters, we traverse all clustering options with $K$ varying from 1 to $M$ (the number of RAP frames). The clustering cost of each option is calculated, and the minimum clustering cost corresponds to the optimal cluster number $K^*$.
We define the clustering cost of a clustering option as the sum of the information content of all clusters; for a $K$-cluster option, the clustering cost is computed over the $K$ clusters accordingly.23
By integrating Eqs. (11)–(13) into Eq. (10), the clustering cost can be derived. We traverse $K$ from 1 to $M$ to calculate the cost of all clustering options, and the cluster number $K^*$ is chosen as the one with the minimum clustering cost. After classifying the RAP frames into $K^*$ clusters, the center frames are extracted as library frames. Note that some clusters may contain only one frame, which has little correlation with the other RAP frames. Therefore, we only extract the center frames of clusters containing multiple frames as library frames.
The value of $K^*$ ranges from 1 to $M$. In the extremes, when $K^*$ equals 1, all RAP frames are similar to each other, so only one library frame is extracted. In contrast, when $K^*$ equals $M$, the RAP frames are all different from each other; it is not efficient to use any frame as the reference of the others, so no scene library frames are extracted.
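The traversal over all cluster numbers can be sketched as follows. The information-content cost of Eqs. (10)–(13) is not reproduced here; as a stand-in (an assumption for illustration only), this sketch uses the within-cluster histogram dispersion plus a fixed per-cluster penalty, which likewise trades cluster compactness against the number of clusters.

```python
def hist_diff(h1, h2):
    """Sum of absolute bin differences between two normalized histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def medoid_cluster(hists, k):
    """Compact medoid-style K-means over histograms (first k frames as seeds)."""
    centers = list(range(k))
    while True:
        labels = [min(range(k), key=lambda c: hist_diff(h, hists[centers[c]]))
                  for h in hists]
        new = []
        for c in range(k):
            members = [i for i in range(len(hists)) if labels[i] == c]
            members = members or [centers[c]]      # guard against empty cluster
            new.append(min(members, key=lambda i: sum(
                hist_diff(hists[i], hists[j]) for j in members)))
        if new == centers:
            return centers, labels
        centers = new


def optimal_clustering(hists, penalty=0.5):
    """Traverse K = 1..M and keep the option with the minimum cost."""
    best = None
    for k in range(1, len(hists) + 1):
        centers, labels = medoid_cluster(hists, k)
        cost = penalty * len(set(labels)) + sum(
            hist_diff(h, hists[centers[labels[i]]]) for i, h in enumerate(hists))
        if best is None or cost < best[1]:
            best = (k, cost, centers, labels)
    return best
```

The per-cluster penalty plays the role of the model cost: without it, the dispersion term alone would always be minimized by putting every frame in its own cluster.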
Take the clustering of the scene change frames of the example video in Fig. 1 as an example. There are 12 scene change frames, denoted $f_1, \dots, f_{12}$. After traversing all clustering options, the curve of clustering cost versus $K$ is drawn in Fig. 6. It can be seen that the optimal cluster number is five, and the corresponding clustering result is shown in Fig. 7. Five center frames are obtained; since cluster 2 contains only a single frame, the remaining four center frames are finally extracted as the library frames.
Coding Structure Based on Scene Library
Given the specified RAI, the distance between two consecutive RAP pictures should be less than or equal to the specified interval. The simplest option is to insert RAP pictures with a fixed RAI (FRAI). However, a RAP picture also serves, directly or indirectly, as the reference frame of the following interframes in the same RAS, so RAP pictures are usually coded with higher quality for better interprediction efficiency. For a video with multiple scenes, if a scene change occurs within an RAS, frames of the old scene are poor references for the new scene because of their low similarity, and the coding efficiency of the RAS is limited. We therefore adopt an adaptive RAI (ARAI) coding structure24 in the SLBVC scheme: when a scene change happens, the scene change frame is coded as a RAP frame, and subsequent RAP frames are inserted at the specified interval until the next scene change occurs. The ARAI structure improves interprediction efficiency in the RAS at the scene change position and thus improves coding efficiency. Table 1 shows the coding efficiency of the ARAI structure compared to the FRAI structure in terms of Bjøntegaard delta bit rate (BD-rate).25 We employed six sequences containing multiple scene changes for the test; their details can be found in Sec. 4. The test is conducted under the RA common test condition26 with the RAI specified as 32. A 2.1% coding gain is achieved by the ARAI coding structure, which confirms the adoption of the ARAI structure in the proposed scheme.
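The ARAI insertion rule above can be sketched as follows (a hypothetical helper, assuming the scene-change positions are known in advance and frame 0 is always a RAP):

```python
def arai_rap_positions(num_frames, scene_changes, rai):
    """List the RAP frame indices under the adaptive-RAI rule: every
    scene-change frame starts a new RAS, and further RAPs are inserted
    at the fixed interval until the next scene change."""
    starts = sorted(set([0] + list(scene_changes)))  # frame 0 is a RAP
    raps = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else num_frames
        raps.extend(range(start, end, rai))          # RAPs within this scene
    return raps
```

For example, with 100 frames, a scene change at frame 37, and RAI 32, the RAPs fall at frames 0, 32, 37, and 69, so the RAS starting at the scene change is aligned with the new scene.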
The performance of ARAI over FRAI with RAI 32.
|Sequence||BD-rate Y (%)||BD-rate U (%)||BD-rate V (%)|
Figure 8(a) shows the coding structure of two RASs from the same scene in conventional video coding (for simplicity, multiframe reference and B-frames are not shown). The RAP frames are all intracoded, while the non-RAP frames are coded as interframes by referencing neighboring frames to exploit short-term correlation. A traditional LTR scheme uses the RAP frame as the LTR frame of the following non-RAP frames in the same RAS, as shown in Fig. 8(b). In this way, the long-term correlation within an RAS can be exploited. Now, with the availability of the scene library, the long-term correlation between RASs can also be exploited.
In SLBVC, a layered coding structure is proposed, as shown in Fig. 8(c). The library frames are coded as intraframes. Each RAP frame is coded as an interframe by referencing only its MSLF, which is retrieved by color histogram comparison.27 Then, the MSLF of the RAP frame is used as the LTR frame of the following non-RAP frames. This is reasonable because, with the ARAI coding structure, all frames in an RAS belong to the same scene; the complexity of retrieving MSLFs for non-RAP frames is thus saved, while the RA property of the RAS is preserved. As a result, the long-term temporal correlation between RASs can be effectively exploited; in addition, the long-term temporal correlation within RASs can also be removed to some extent.
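The MSLF retrieval for a RAP frame can be sketched as a nearest-neighbor search over the library histograms (an illustrative sketch; the sum of absolute bin differences stands in for the histogram comparison of Ref. 27):

```python
def find_mslf(frame_hist, library_hists):
    """Return the index of the most similar library frame, i.e., the one
    whose color histogram differs least from the RAP frame's histogram."""
    def diff(h1, h2):
        return sum(abs(a - b) for a, b in zip(h1, h2))
    return min(range(len(library_hists)),
               key=lambda i: diff(frame_hist, library_hists[i]))
```

The retrieved index then only needs to be signaled once per RAS, since all non-RAP frames in the RAS reuse the same MSLF as their LTR frame.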
Scene Library Management
Due to the limited storage capacity of the codec, the scene library is built for a video clip of several minutes in length to limit the number of library frames. In other words, the frames in the scene library do not accumulate indefinitely: after encoding/decoding a video clip, the scene library is cleared, and a new scene library is built for the next clip. The video clip and the corresponding scene library are encoded into two separate streams. In the video stream, a signal is added for each frame to indicate its MSLF in the library stream. Here, we only discuss the library management strategies for the VOD streaming application; for locally stored video streams, the strategies are the same except that the stream request operation is omitted.
At the decoder, the video stream and the library stream are downloaded and decoded synchronously. When the decoder is capable of storing all the decoded library frames, they are stored in the scene library and never removed; the stored library frames can then be directly reused by later frames in the video clip. In contrast, when the decoder can only store part of the decoded library frames, e.g., on mobile devices, the downloaded library stream is stored locally, and the scene library works in a first-in, first-out manner to keep the most recently decoded library frames. Later frames can then find their MSLFs either in the scene library or in the library stream. For RA within the current video clip, if the RAP frame to be decoded cannot find its MSLF in the scene library, it searches the stored library stream; if the MSLF cannot be found there either, it is requested from the server. For RA across video clips, the scene library and the stored library stream are cleared, and the streams of the RAP frame and its MSLF are requested from the server.
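For the capacity-limited case, the decoder-side management described above can be sketched as follows (hypothetical class and method names; `fetch` stands in for reading the locally stored library stream or requesting the server):

```python
from collections import OrderedDict


class DecoderSceneLibrary:
    """First-in, first-out scene library for a capacity-limited decoder."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.frames = OrderedDict()  # library_id -> decoded library frame
        self.fetch = fetch           # fallback: local library stream / server

    def store(self, library_id, frame):
        if library_id in self.frames:
            return
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)  # evict the oldest library frame
        self.frames[library_id] = frame

    def get_mslf(self, library_id):
        """Return the MSLF, fetching and caching it if it was evicted."""
        if library_id not in self.frames:
            self.store(library_id, self.fetch(library_id))
        return self.frames[library_id]
```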
Experiments are conducted on six sequences containing repeated scenes to evaluate the performance of the SLBVC scheme. The test sequences were extracted from six movies and TV series: “The Big Bang Theory,” “House of Cards,” “Sherlock,” “The Man from Earth,” “2 Broke Girls,” and “Game of Thrones.” The details of the sequences, including their lengths, are shown in Table 2. All sequences are composed of multiple scenes, a large portion of which appear repeatedly.
Description of test sequences.
|Sequence||Resolution||Fps (Hz)||Length||Scene change times|
The proposed scheme is implemented on an HM12.128 encoder. We compare it with HEVC, the LTR scheme, and the method in Ref. 16 to demonstrate its performance. The sequences are encoded under the RA common test condition, which is designed for RA applications; Table 3 shows the details. The quantization parameters (QPs) of the test sequences are set as 22, 27, 32, and 37, while the QP of the scene library frames is set empirically, which usually leads to the best performance among all QP values. In our experiments, the ARAI coding structure is also employed by HEVC and the LTR scheme for a fair comparison. Two different RAI values are tested: 32, as specified in the RA common test condition, and 152 (5 s at 30 fps), which is a typical RAI used by online video websites. In the following, RAI 32 and RAI 152 denote the conditions in which the RAI is set as 32 and 152, respectively.
The random-access configurations of HM.
|Profile||Main||LCU Size||64||Fast search||Enable|
|Frame structure||Hierarchical B||Search range||64||SAO||Enable|
|GOP size||8||Hadamard ME||Enable||Rate control||Disable|
Results of Library Construction
The numbers of RAP frames and clusters of the test sequences are shown in Table 4. It can be seen that the RAP frames are classified into a much smaller number of clusters. The optimal cluster number depends on the video content and the number of RAP frames. The curves of clustering cost versus cluster number for Bigbang are shown in Fig. 9; the optimal cluster numbers are 41 and 13 with RAI 32 and RAI 152, respectively. After clustering, the scene library is constructed by extracting the center frame of each cluster. As only the center frames of clusters containing more than one frame are chosen as library frames, the number of library frames is less than or equal to the number of clusters, as also shown in Table 4.
The numbers of RAP frames, clusters, and library frames with RAI 32 and 152.
|Sequence||RAI 32||RAI 152|
|RAP frame||Cluster||Library frame||RAP frame||Cluster||Library frame|
Performance of Proposed Scheme
BD-rate is used to evaluate the rate-distortion performance of the proposed scheme. Note that each non-RAP frame also uses a library frame as its LTR frame, so the coding performance is evaluated on the whole sequence, and the bits of the scene library are taken into account when calculating the bitrate. The gains of the LTR scheme, the scheme in Ref. 16, and the proposed scheme with respect to HEVC are shown in Table 5. With RAI 152, some long-term temporal correlation within an RAS can be exploited, so the LTR scheme achieves a 0.5% coding gain. With RAI 32, however, little long-term temporal correlation is left because of the short RAS length, so the LTR scheme brings no coding gain. In contrast, the scheme in Ref. 16 and the proposed scheme achieve much higher coding gains because they are capable of removing the long-term temporal correlation between RASs and between repeated scenes. For both schemes, the performance with RAI 32 is better than with RAI 152; this is because much more long-term temporal correlation between RASs remains with the shorter RAI but can now be removed. Compared to Ref. 16, the bit savings of the proposed scheme are 8.1% with RAI 32 and 7.2% with RAI 152. These gains come from the more efficient scene library built with the clustering-based method and the better coding of non-RAP frames.
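For reference, BD-rate figures such as those quoted here follow the standard Bjøntegaard computation: fit log-rate as a cubic polynomial of PSNR for each codec, then compare the average log-rate over the overlapping PSNR range. A sketch using NumPy, assuming four rate/PSNR points per codec as produced by the four QPs:

```python
import numpy as np


def bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_test):
    """Average bit-rate change (%) of the test codec against the anchor."""
    # Fit log-rate as a cubic polynomial of PSNR for each codec.
    p_a = np.polyfit(psnr_anchor, np.log(rates_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rates_test), 3)
    # Integrate over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a, int_t = np.polyint(p_a), np.polyint(p_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)
    # Convert the average log-rate difference back to a percentage.
    return (np.exp(avg_t - avg_a) - 1) * 100.0
```

A negative value means the test codec needs fewer bits at the same quality; e.g., a curve with exactly half the rate at every PSNR point yields −50%.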
The performances of LTR scheme, scheme in Ref. 16, and SLBVC scheme compared to HEVC.
|Sequence||RAI 32||RAI 152|
|LTR (%)||Ref. 16 (%)||SLBVC (%)||LTR (%)||Ref. 16 (%)||SLBVC (%)|
Figures 10 and 11 show the rate-distortion curves of the four schemes, HEVC, the LTR scheme, the scheme in Ref. 16, and the SLBVC scheme, for all six test sequences with RAI 32 and 152, respectively. The rate-distortion curves of the LTR scheme and HEVC are hardly distinguishable because of their close performance, whereas the curves of the SLBVC scheme lie clearly above those of the other three schemes, which reveals the remarkable performance of the proposed scheme. For the sequences Cards and Throne, the maximal bitrates are only around 200 and 100 kbps because the images contain many dark regions and lack high-frequency components; it does not take many bits to encode them even at very high quality. For the other sequences, with rich colors and textures, the bitrate reaches up to hundreds of kbps. In summary, the proposed scheme can deal with different types of video content and shows an advantage over a wide range of bitrates.
In addition to the improvement in objective performance, visual quality is also significantly improved by the proposed scheme: coding artifacts caused by aggressive compression, such as blockiness and blurriness, are greatly alleviated. Figure 12 shows a set of images cropped from the middle frame of an RAS in Sherlock coded with RAI 32. The first image is coded by the proposed scheme, the second by HEVC, and the third is the original image for comparison. The bitrates of the proposed scheme and HEVC are 72 and 86 kbps, respectively, and the PSNR values of the images range between 39 and 41 dB. The contour of the actor’s face (eyes and nose) is much better preserved by the proposed scheme. Similarly, Fig. 13 shows images cropped from Bigbang coded with RAI 152; the textures of the clothes are visually clearer with the proposed scheme. We can therefore conclude that the subjective quality of the proposed scheme is much better than that of HEVC.
Discussion of Random Access
In VOD streaming, users sometimes do not want to watch the entire video sequence and may request only part of it from the server via RA. When only part of the video is transmitted to the client, only the referenced library frames, instead of the whole library, need to be transmitted along with it. This is feasible because the library frames are intraframes and independent of each other. As the length of the transmitted part decreases, the number of frames that reference each library frame also decreases, so the performance of the SLBVC scheme may deteriorate. However, since about 75% of users watch the entire video in practice,29,30 the proposed scheme is still valuable for the VOD streaming application.
We simulate the performance when only part of the sequence is transmitted in VOD streaming. Assume the test sequence contains $N_R$ RASs with RAI 152. We test a series of transmitted lengths $\mathrm{round}(N_R/2^n)$, $n = 0, 1, \dots, n_{\max}$, where $\mathrm{round}(\cdot)$ rounds its argument to the nearest integer and $n_{\max}$ is the least integer satisfying $2^{n_{\max}} \ge N_R$. In each test option, the test sequence is divided into homogeneous parts of the specified transmitted length. The coding performance (BD-rate) of each part is calculated, taking into account the bits of the referenced library frames, and the performances of all parts are averaged to obtain the final result. Figure 14 shows the curves of coding performance versus $n$ for all sequences. The value of $n_{\max}$ is 5 for Cards and Earthman and 6 for the other sequences. The performance deteriorates as $n$ increases, and there is even a performance loss when $n$ is too large. However, a performance improvement can still be observed over a large range of transmitted lengths. For Bigbang, Sherlock, Girls, and Throne, there is a coding efficiency improvement when $n$ is between 0 and 3, which corresponds to transmitting 12.5% to 100% of the whole sequence. For Cards, a larger range of $n$, between 0 and 4, shows a coding gain. For Earthman, although the range of $n$ with coding gain is narrower (0 to 2), the proposed scheme still outperforms HEVC as long as no less than 25% of the sequence is transmitted.
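The series of tested transmitted lengths can be generated as follows (a small sketch; the stopping condition $2^{n_{\max}} \ge N_R$ is an assumption reconstructed from the $n_{\max}$ values reported for the test sequences):

```python
def transmitted_lengths(num_ras):
    """Transmitted-length series round(N_R / 2**n) for n = 0..n_max, where
    n_max is taken as the least integer with 2**n_max >= num_ras (assumed)."""
    lengths, n = [], 0
    while True:
        lengths.append(round(num_ras / 2 ** n))
        if 2 ** n >= num_ras:  # n_max reached: a part is a single RAS
            return lengths
        n += 1
```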
The complexity of the proposed scheme comes from scene library construction (RAP frame clustering and library frame encoding), MSLF selection, and video coding. As the HM encoder is written in C++, the algorithms for scene library construction and MSLF selection are also implemented in C++ for a fair comparison. The experiments are conducted on an Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90 GHz with 190 GB of RAM. The complexity of each process is measured by its running time.
The computational complexity of SLBVC can be expressed as

$$T_{\mathrm{SLBVC}} = T_{\mathrm{cluster}} + T_{\mathrm{lib}} + T_{\mathrm{MSLF}} + T_{\mathrm{coding}},$$

where the four terms denote the times of RAP frame clustering, library frame encoding, MSLF selection, and video coding, respectively. The complexity distribution of each process is shown in Tables 6 and 7. With RAI 152, $T_{\mathrm{cluster}}$, $T_{\mathrm{lib}}$, and $T_{\mathrm{MSLF}}$ occupy about 0.04%, 0.05%, and 0.001% of the total time, while with RAI 32, their percentages increase to 0.79%, 0.13%, and 0.002% because of the larger number of RAP frames. The increase is especially pronounced for $T_{\mathrm{cluster}}$: with $M$ RAP frames, the number of frame-distance computations grows on the order of $M^2$, so its complexity increases much faster than $T_{\mathrm{lib}}$ and $T_{\mathrm{MSLF}}$. Overall, compared with the complexity of video coding, $T_{\mathrm{cluster}}$, $T_{\mathrm{lib}}$, and $T_{\mathrm{MSLF}}$ are rather small and can be neglected.
The complexity distribution of each process in SLBVC with RAI 32.
|Sequence||RAP frame clustering (%)||Library frame encoding (%)||MSLF selection (%)||Video coding (%)|
The complexity distribution of each process in SLBVC with RAI 152.
|Sequence||RAP frame clustering (%)||Library frame encoding (%)||MSLF selection (%)||Video coding (%)|
Using the video coding time of HEVC as the anchor, the video coding times of the LTR scheme and the proposed scheme are shown in Table 8. About 24% to 34% extra video coding complexity is introduced by the LTR scheme and the proposed scheme, because an additional LTR frame is employed for each frame in both schemes, which increases the ME complexity. The video coding complexity of SLBVC is also slightly higher than that of the LTR scheme, by 6% on average, as the RAP frames are coded as interframes. Considering the high performance of the SLBVC scheme, the complexity increase is acceptable.
The video coding time of LTR and SLBVC compared to HEVC.
|Sequence||RAI 152||RAI 32|
|LTR (%)||SLBVC (%)||LTR (%)||SLBVC (%)|
In this paper, an SLBVC scheme is presented to improve the coding efficiency of videos with repeated scenes. The proposed scheme is capable of exploiting the long-term temporal correlation between RASs that belong to similar scenes. It can be applied to stored video applications (e.g., DVD, BD, etc.) and VOD streaming. Compared to the state-of-the-art method, the scheme achieves 8.1% and 7.2% coding performance improvement on the whole sequence with RAI 32 and 152, respectively. Although the performance may degrade when only part of a sequence is transmitted in streaming, the proposed scheme still shows an advantage over a large range of transmitted lengths.
Several extensions of the current work can be explored in the future. For example, more accurate clustering criteria can be investigated to build a more efficient scene library, and the coding quality of library frames can be adaptively optimized according to the number of times they are referenced to improve overall coding efficiency. We will work on these issues to further improve the efficiency of the SLBVC scheme.
This work was supported by the National Natural Science Foundation of China (No. 61371162). The authors declare that there are no conflicts of interest regarding the publication of this paper.
Xuguang Zuo received his BS degree in information and communication engineering from Zhejiang University, Hangzhou, China, in 2010, where he is currently pursuing a PhD. His research interests include video processing, coding, and streaming.
Lu Yu is currently a professor at the Institute of Information and Communication Engineering, Zhejiang University. She received her BS degree in radio engineering and her PhD in communication and electronic systems from Zhejiang University, Hangzhou, China, in 1991 and 1996, respectively. Her research area includes video coding, multimedia communication, and relative ASIC design.
Hualong Yu is pursuing his PhD in the College of Information Science and Electronic Engineering at Zhejiang University, Hangzhou, China. He received his BS degree in Information and Communication Engineering from Zhejiang University in 2013. Currently, he is a member of Multimedia and Communication Laboratory at Zhejiang University and his research interests include video coding, image processing, and media streaming.
Jue Mao received her BS degree in the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China, in 2014. She has been awarded excellent undergraduate in Zhejiang University, 2014. She is currently a PhD candidate in Zhejiang University, supervised by Professor Lu Yu. Her research interests include video compression, image, and video processing.
Yin Zhao is with Huawei Technologies Co., Ltd., Hangzhou, China. He received his BS and PhD degrees in information engineering from Zhejiang University, Hangzhou, China, in 2008 and 2013, respectively. He was a visiting student at Nanyang Technological University, Singapore. His research interests include 3-D video processing, video quality assessment, and video coding.