Scene-library-based video coding scheme exploiting long-term temporal correlation

In movies and TV shows, it is common for several scenes to repeat alternately. Such videos exhibit long-term temporal correlation, which can be exploited to improve video coding efficiency. However, in applications supporting random access (RA), a video is typically divided into a number of RA segments (RASs) by RA points (RAPs), and different RASs are coded independently. As a result, the long-term temporal correlation among RASs with similar scenes cannot be used. We present a scene-library-based video coding scheme for videos with repeated scenes. First, a compact scene library is built by clustering similar scenes in the encoding video and extracting representative frames. Then, the video is coded using a layered, scene-library-based coding structure in which the library frames serve as long-term reference frames. The scene library is not cleared at RAPs, so the long-term temporal correlation between RASs from similar scenes can be exploited. Furthermore, the RAP frames are coded as interframes referencing only library frames, improving coding efficiency while maintaining the RA property. Experimental results show that the proposed scheme achieves significant coding gain over state-of-the-art methods. © The Authors. Published by SPIE under a Creative Commons Attribution 3.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI. [DOI: 10.1117/1.JEI.26.4.043026]


Introduction
In video coding, the key to achieving high coding efficiency is to make full use of the correlations in video. Short-term temporal correlation has been well exploited by current video coding. However, a substantial portion of videos is characterized by long-term temporal correlation, since they contain scenes that appear repeatedly. For example, in news programs, shots of the studio and logo clips may emerge at intervals. In talk shows, the images of the hosts, the guests, and the audience repeat alternately. In movies and TV series, many scenes in dialogue clips and flashback episodes appear repeatedly. Video coding efficiency could be greatly improved if this long-term correlation were well exploited.
During the development of video coding, the temporal correlation in video has been used more and more adequately owing to improved temporal reference techniques. In early video coding standards, such as MPEG-1, H.261, H.262/MPEG-2, 1 and H.263, 2 only short-term temporal correlation was exploited because the compression algorithms referenced only one previously decoded picture. Later, a multiple-reference-frame technique was introduced: a long-term memory storing seconds of previously decoded frames (up to 50) was provided to the codec so that temporal dependencies in video containing repetitive motion, uncovered background, noninteger pixel displacement, etc. could be exploited. 3,4 The multiple-reference-frame idea was further developed and adopted in the later video coding standards H.264/Advanced Video Coding 5 and High Efficiency Video Coding (HEVC), 6 in which two kinds of frames, short-term reference (STR) frames and long-term reference (LTR) frames, are stored in the decoded picture buffer and used for motion compensation. STR frames, the neighboring frames of the encoding frame, are used to eliminate short-term temporal correlation, while LTR frames, which come from the distant past, are employed to exploit long-term temporal correlation. In practical applications, the numbers of both STR frames and LTR frames are limited because of (1) the syntax overhead of signaling reference frames and (2) the heavy computational cost introduced by motion estimation (ME). Typically, the number of LTR frames is set to one or two. For example, in the well-known codec VP8, 7 only one LTR frame, called the golden frame, is enabled. In the latest Chinese video coding standard, AVS2, 8 a long-term background reference frame is used to assist the coding of surveillance video.
Although LTR frames are supported by many video coding standards, the long-term temporal correlation in videos with repeated scenes remains underexploited. Previous works have mostly focused on exploiting the long-term temporal correlation within a single long scene. [9][10][11][12][13][14][15] For example, in Refs. 9-12, a background frame was generated for a surveillance video and used as an LTR frame to improve background prediction. In Refs. 13 and 14, the optimal LTR frame interval was investigated, and LTR frames were adaptively selected according to the accumulation of change in a scene. In Ref. 15, the former two ideas were integrated: a background frame was generated for LTR and then updated when more background regions were exposed. However, how to select LTR frames for multiple repeated scenes in a video so as to exploit their correlation has not been well explored. In addition, random access (RA) is desirable in many applications. It enables seek, fast-forward, and fast-backward operations in locally stored video streams (e.g., DVD, BD, etc.). In video-on-demand (VOD) streaming, it allows servers to respond to client requests. To support RA functionality, the encoding video is divided into independent RA segments (RASs) by RA points (RAPs), and the correlation between RASs cannot be exploited. Moreover, the RAP frames are all intracoded. An intraframe typically requires several times more bits than an interframe of the same quality since it takes no advantage of temporal correlation. For these reasons, the rate-distortion performance of current video coding is degraded.
In Ref. 16, we proposed a method for coding RAP frames in video with repeated scenes. The representative frames of the video were extracted and stored in a scene library in advance. Then, the RAP frames were coded by referencing only the scene library. The method in Ref. 16 improves the coding efficiency of RAP frames and partially exploits the correlation between repeated scenes. However, some problems remain. First, the construction of the scene library was based on a front-to-back analysis of the video. The first frame of a nonrepetitive scene was always chosen as a library frame, which might not be efficient for reference. Second, the method did not take advantage of the scene library to optimize the coding of non-RAP frames, so the correlation between RASs was not fully used.
To make the best use of long-term temporal correlation in videos containing repeated scenes, a scene-library-based video coding (SLBVC) scheme is proposed in this paper. The main contributions of this study are as follows: (1) A video coding framework in which the reference frames are stored in a scene library that is not cleared at RAPs is introduced to exploit the correlations between RASs, as well as between similar scenes. (2) A clustering-based method is designed to build a compact scene library that improves video coding efficiency. (3) A layered coding structure based on the scene library is proposed for the coding of each RAS. In the layered coding structure, the library frames are intracoded, and each RAS is coded by referencing only one library frame to improve coding efficiency while preserving the RA capability. The proposed scheme is suited to stored video and VOD streaming.
The rest of this paper is organized as follows: Sec. 2 presents the motivation and theoretical performance analysis on the proposed scheme. The SLBVC scheme is described in Sec. 3. The experimental results are discussed in Sec. 4. Finally, Sec. 5 concludes the paper.

Motivation and Theoretical Analysis
As noted above, many videos are composed of repeated scenes. Figure 1 gives an example, extracted from the movie "The Man from Earth." 17 The example video can be divided into 12 clips by scene change, denoted P_i, i = 1, 2, …, 12. The appearance time points of the clips are denoted T_i, i = 1, 2, …, 12. Since some clips visually belong to the same scene, there are five scenes in total, marked S_i, i = 1, 2, …, 5. In current video coding, to support RA, the encoding video is divided into independent RASs by RAPs. Without loss of generality, we discuss the coding of S_1, which appears at the set of time points {T_1, T_3, T_11}. The coding structure of S_1 is shown in Fig. 2. It can be seen that P_1, P_3, and P_11, the clips of S_1, are divided by RAPs into RASs, denoted X_{1,i}, X_{3,i}, and X_{11,i}, i = 1, 2, …. Neither the temporal correlation between consecutive RASs in P_1, P_3, and P_11 nor the long-term temporal correlation between P_1, P_3, and P_11 can be used. As a result, the coding efficiency of S_1 is limited. The RASs are independent of each other because the decoded picture buffer is cleared at each RAP picture. Naturally, we introduce a scene library that stores reference frames and is not cleared at RAPs, as shown in Fig. 3. The isolation between RASs is then removed. For S_1 in the example video, the long-term temporal correlation between consecutive RASs, as well as between clips P_1, P_3, and P_11, can be eliminated by referencing the library. As a result, the coding efficiency of S_1 can be improved.
More generally, for a sequence X, assume the RAPs are inserted evenly according to the specified RA interval (RAI). As shown in Fig. 4, sequence X is divided into N independent RASs X_n, n = 1, 2, …, N. X and X_n, n = 1, 2, …, N, can be regarded as vector sources. According to rate-distortion theory, 18 the rate-distortion function of X in current video coding, R_X^C(D_X), can be written as

R_X^C(D_X) = Σ_{n=1}^{N} R_{X_n}(d_{X_n}) ≥ R_X(D_X),   (1)

where R_{X_n}(d_{X_n}) and R_X(D_X) denote the classical Shannon rate-distortion functions of X_n and X, respectively. It can be seen that the lower bound on the rate of X is higher than the Shannon lower bound because the correlations between RASs cannot be exploited. In SLBVC, we employ a scene library that is not cleared at RAPs to exploit the long-term temporal correlation between RASs from similar scenes. Let Y denote the reference frames in the scene library. Y is extracted from a subset of the RASs, which forms a set denoted Φ. The coding process of SLBVC is analogous to encoding the RASs X_n ∈ Φ first and then letting the extracted reference frames Y be referenced by the remaining RASs. The rate-distortion function of X in SLBVC can be written as

R_X^L(D_X) = Σ_{X_n ∈ Φ} R_{X_n}(d_{X_n}) + Σ_{X_n ∉ Φ} R_{X_n|Y}(d_{X_n}),   (2)

where R_{X_n|Y}(d_{X_n}) is the conditional rate-distortion function of X_n given Y. Clearly, since the reference frames Y can help the coding of X_n, 18,19 we have R_{X_n|Y}(d_{X_n}) ≤ R_{X_n}(d_{X_n}). Finally, R_X^L(D_X) can be bounded as

R_X^L(D_X) ≤ R_X^C(D_X).   (3)

Compared with current video coding, the proposed scheme pushes the rate-distortion bound of X downward toward the Shannon rate-distortion bound.

Scene-Library-Based Video Coding Scheme
The framework of the SLBVC scheme is shown in Fig. 5. It is composed of a traditional codec and two scene libraries.
The scene library at the encoder is built by extracting representative frames of similar scenes in a video. The library frames are encoded into a separate stream and transmitted to the decoder, so that the scene library can be reconstructed at the decoder. In the encoding/decoding process, each frame is coded using its most similar library frame (MSLF) as an LTR frame.

Clustering-Based Scene Library Construction
In practical implementation, we first extract the library frames Y from sequence X. Then, Y is coded, and the reconstruction Ŷ is used as reference for all RASs in X. The rate-distortion function of X is thus

R_X(D_X) = R_Y(d_Y) + Σ_{n=1}^{N} R_{X_n|Ŷ}(d_{X_n}),   (4)

where R_Y(d_Y) denotes the rate-distortion function of Y and R_{X_n|Ŷ}(d_{X_n}) represents the conditional rate-distortion function of X_n given Ŷ. The aim of library construction is to minimize the rate-distortion function in Eq. (4). However, intracoded RAP frames use no temporal correlation, and their coding efficiency is low, so the scene library has a much stronger impact on the RAP frames. It is therefore more efficient to build a library carrying high correlation with the RAP frames rather than with the entire encoding video. In addition, to simplify the construction of the scene library, we ignore the impact of distortion. As a result, the goal of library construction can be restated as

min_Y [ H(Y) + Σ_{n=1}^{N} H(L_n|Y) ],   (5)

where H(Y) is the entropy of Y; L_n, n = 1, 2, …, N, are the RAP frames; and H(L_n|Y) is the conditional entropy of L_n given Y. To minimize Eq. (5), on one hand, Y should share as much correlation as possible with the RAP frames; on the other hand, the entropy of Y should be kept as low as possible. As the library frames are used as references for RAP frames, they should all be intracoded to ensure the RA capability of the RAP frames. So, we directly extract the library frames from the RAP frames. Then, the problems become: (1) How many RAP frames should be extracted?

(2) Which RAP frames should be extracted? If similar RAP frames are clustered together, a single cluster representative can serve as a good reference frame for the others, which provides a good approximate solution to Eq. (5). To build the library, we classify the RAP frames into clusters, and the center frame of each cluster is designated a library frame. The K-means algorithm is employed for clustering because of its simplicity and efficiency. 20 Since RAP frames from the same scene have similar color histograms, the clustering is based on the color histogram feature. 21,22 An RAP frame L_n, 1 ≤ n ≤ N, is described by its color histogram

L_n = {f_n^0, f_n^1, …, f_n^{767}},   (6)

where f_n^p, 0 ≤ p < 256, is the number of Y-component samples with value p; f_n^p, 256 ≤ p < 512, is the number of U-component samples with value p − 256; and f_n^p, 512 ≤ p < 768, is the number of V-component samples with value p − 512. Accordingly, the difference between two RAP frames L_i and L_j, 1 ≤ i, j ≤ N, is calculated as

D(L_i, L_j) = Σ_{p=0}^{767} |f_i^p − f_j^p|.   (7)

The K-means algorithm, given the cluster number K and initial cluster centers μ_k, k = 1, 2, …, K, proceeds as follows:

Step 1: Assign each RAP frame L_n, 1 ≤ n ≤ N, to the cluster k̂ with the minimum difference:

k̂ = arg min_{1≤k≤K} D(L_n, μ_k).   (8)

Step 2: Update the cluster centers. For cluster k, the center is updated to the frame with the minimum sum of differences to the other frames in the cluster:

μ_k = arg min_{L_i^k} Σ_{j=1}^{n_k} D(L_i^k, L_j^k),   (9)

where n_k is the number of frames belonging to cluster k, and L_i^k and L_j^k are the i'th and j'th frames, respectively, in cluster k.
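The clustering of Eqs. (6)-(9) can be sketched as follows. This is a minimal illustration, not the paper's implementation: frames are represented directly by their 768-bin histogram lists, the difference D is taken as the sum of absolute bin differences, and all function names are ours.

```python
def hist_diff(h1, h2):
    """Difference D(L_i, L_j): sum of absolute bin differences (Eq. 7)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def kmedoids_rap(frames, centers, iters=10):
    """Cluster RAP-frame histograms around medoid centers.

    frames  : list of color histograms, one per RAP frame
    centers : K initial center histograms
    Returns (labels, centers) after the assignment/update rounds.
    """
    labels = []
    for _ in range(iters):
        # Step 1 (Eq. 8): assign each frame to the nearest center.
        labels = [min(range(len(centers)),
                      key=lambda k: hist_diff(f, centers[k]))
                  for f in frames]
        # Step 2 (Eq. 9): the new center of each cluster is the member
        # with the minimum summed difference to the other members.
        new_centers = []
        for k in range(len(centers)):
            members = [f for f, lab in zip(frames, labels) if lab == k]
            if not members:
                new_centers.append(centers[k])  # keep an empty cluster's center
                continue
            medoid = min(members,
                         key=lambda f: sum(hist_diff(f, g) for g in members))
            new_centers.append(medoid)
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return labels, centers
```

Because the center update picks an actual member frame (a medoid) rather than an averaged histogram, the chosen centers can serve directly as intracoded library frames.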
The number of clusters K must be explicitly specified for the K-means algorithm, but the number of library frames to extract is not known a priori. To find the optimal number of clusters, we traverse all clustering options with K varying from 1 to N. The clustering cost of each option is calculated, and the minimum clustering cost identifies the optimal cluster number.
We define the clustering cost of each clustering option as the sum of the information content of all clusters. For K-cluster clustering, the cost is computed as

Cost(K) = Σ_{k=1}^{K} C_k,  C_k = H(μ_k) + Σ_{i=1}^{n_k} H(L_i^k|μ_k),   (10)

where C_k is the cost of the k'th cluster, H(μ_k) is the entropy of μ_k, and H(L_i^k|μ_k) is the conditional entropy of L_i^k given μ_k. Let H(L_i^k, μ_k) denote the joint entropy of L_i^k and μ_k. Then

H(L_i^k|μ_k) = H(L_i^k, μ_k) − H(μ_k).   (11)

We use the luma histogram of μ_k and the joint luma histogram of L_i^k and μ_k to estimate H(μ_k) and H(L_i^k, μ_k), respectively: 23

H(μ_k) = −Σ_p (f_{μ_k}^p / TN) log_2 (f_{μ_k}^p / TN),   (12)

H(L_i^k, μ_k) = −Σ_{p,q} (f_{L_i^k,μ_k}^{p,q} / TN) log_2 (f_{L_i^k,μ_k}^{p,q} / TN).   (13)

In Eqs. (12) and (13), TN is the total number of pixels in a picture, f_{μ_k}^p is the number of pixels with luma value p in μ_k, and f_{L_i^k,μ_k}^{p,q} is the number of pixel pairs (p, q), i.e., co-located pixels whose luma values in L_i^k and μ_k are p and q, respectively. By substituting Eqs. (11)-(13) into Eq. (10), the clustering cost Cost(K) can be derived. We traverse K from 1 to N to calculate the cost of all clustering options and choose the cluster number K_opt that corresponds to the minimum cost. After classifying the RAP frames into K_opt clusters, the center frames are extracted as library frames. Note that some clusters may contain only one frame, which has little correlation with the other RAP frames. Therefore, we extract only the center frames of clusters containing multiple frames as library frames.
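The entropy-based cost of Eqs. (10)-(13) can be sketched as below. This is an illustrative reading, not the paper's code: frames are flat sequences of luma values, a cluster is given as its center frame plus its member frames, and names are ours.

```python
import math
from collections import Counter

def entropy(pixels):
    """H estimated from a frame's luma histogram (Eq. 12)."""
    tn = len(pixels)
    counts = Counter(pixels)
    return -sum(c / tn * math.log2(c / tn) for c in counts.values())

def joint_entropy(a, b):
    """H(a, b) estimated from the joint luma histogram of
    co-located pixels in frames a and b (Eq. 13)."""
    tn = len(a)
    counts = Counter(zip(a, b))
    return -sum(c / tn * math.log2(c / tn) for c in counts.values())

def cluster_cost(clusters):
    """Cost(K) per Eq. (10): for each cluster, H(center) plus the
    conditional entropy H(member | center) of every member, where
    H(L|mu) = H(L, mu) - H(mu) (Eq. 11).

    `clusters` maps each center frame (a tuple of luma values)
    to the list of its member frames."""
    cost = 0.0
    for center, members in clusters.items():
        h_center = entropy(center)
        cost += h_center
        for m in members:
            cost += joint_entropy(m, center) - h_center
    return cost
```

Evaluating `cluster_cost` for every K from 1 to N and keeping the minimum reproduces the K_opt search described above; identical frames contribute zero conditional entropy, so merging similar RAP frames into one cluster lowers the cost.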
The value of K_opt ranges from 1 to N. In the extreme case K_opt = 1, all RAP frames are similar to each other, so only one library frame is extracted. In contrast, when K_opt = N, the RAP frames all differ from each other; it is not efficient to use any frame as a reference for the others, so no library frames are extracted.
Consider the clustering of the scene change frames of the example video in Fig. 1. There are 12 scene change frames, denoted L_i, i = 1, 2, …, 12. After traversing all clustering options, the curve of clustering cost Cost(K) versus K is drawn in Fig. 6. It can be seen that the optimal cluster number is five, and the corresponding clustering result is shown in Fig. 7. In the example, frames L_1, L_2, L_6, L_9, and L_10 are the center frames. As frame L_2 is the only frame in cluster 2, frames L_1, L_6, L_9, and L_10 are finally extracted as the library frames.

Coding Structure Based on Scene Library
Given the specified RAI, the distance between two consecutive RAP pictures should be less than or equal to the specified interval. The simplest option is to insert RAP pictures with a fixed RAI (FRAI). However, a RAP picture also serves, directly or indirectly, as a reference frame for the following interframes in the same RAS, so RAP pictures are usually coded with higher quality for better interprediction efficiency. For a video with multiple scenes, if a scene change occurs within an RAS, frames of the old scene are poor references for the new scene because of their low similarity, and the coding efficiency of the RAS is limited. We therefore adopt an adaptive RAI (ARAI) coding structure 24 in the SLBVC scheme. When a scene change happens, the scene change frame is coded as a RAP frame; subsequent RAP frames are inserted at the specified interval until the next scene change occurs. The ARAI structure improves interprediction efficiency in RASs at scene change positions and thus improves coding efficiency. Table 1 shows the coding efficiency of the ARAI structure compared to the FRAI structure in terms of Bjøntegaard delta bit rate (BD-Rate). 25 We employed six test sequences containing multiple scene changes; their details can be found in Sec. 4. The test is conducted under the RA common test condition, 26 and the RAI is specified as 32.
It can be seen that a 2.1% coding gain is achieved by the ARAI coding structure, which justifies its adoption in the proposed scheme. Figure 8(a) shows the coding structure of two RASs from the same scene in conventional video coding (for simplicity, multiframe reference and B-frames are not shown). The RAP frames are all intracoded, while the non-RAP frames are coded as interframes referencing neighboring frames to exploit short-term correlation. A traditional LTR scheme uses the RAP frame as the LTR frame for the following non-RAP frames in the same RAS, as shown in Fig. 8(b). In this way, the long-term correlation within an RAS can be exploited. With the availability of the scene library, the long-term correlation between RASs can also be exploited.
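The ARAI RAP placement described above can be sketched as follows. This is a minimal illustration under our own naming; scene-change positions are assumed to be detected beforehand and passed in as frame indices.

```python
def arai_rap_positions(num_frames, scene_changes, rai):
    """Return the RAP frame indices under the adaptive-RAI rule:
    a RAP is placed at every scene-change frame, and between scene
    changes RAPs recur at the fixed interval `rai`. Frame 0 is
    always a RAP."""
    changes = sorted(set(scene_changes) | {0})
    raps = []
    for i, start in enumerate(changes):
        # The segment runs from this scene change up to the next one
        # (or to the end of the sequence).
        end = changes[i + 1] if i + 1 < len(changes) else num_frames
        raps.extend(range(start, end, rai))
    return raps
```

With a fixed RAI, a scene change falling mid-segment would leave old-scene frames as references for new-scene frames; restarting the interval at each scene change avoids that mismatch.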
In SLBVC, a layered coding structure is proposed, as shown in Fig. 8(c). The library frames are coded as intraframes. Each RAP frame is coded as an interframe referencing only its MSLF, which is retrieved by color histogram comparison. 27 The MSLF of the RAP frame is then used as the LTR frame of the following non-RAP frames. This works because, with the ARAI coding structure, all frames in an RAS belong to the same scene, so the complexity of retrieving MSLFs for the non-RAP frames is saved while the RA property of the RAS is guaranteed. As a result, the long-term temporal correlation between RASs can be exploited. In addition, the long-term temporal correlation within RASs can also be removed to some extent.
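The layered reference assignment can be sketched as below. This is an illustrative reading with our own names: the MSLF is chosen by minimum color-histogram difference, and every frame of the RAS inherits the RAP frame's MSLF as its LTR frame.

```python
def most_similar_library_frame(frame_hist, library):
    """Index of the library frame whose color histogram is closest
    to `frame_hist` (sum of absolute bin differences)."""
    def diff(h1, h2):
        return sum(abs(a - b) for a, b in zip(h1, h2))
    return min(range(len(library)),
               key=lambda i: diff(frame_hist, library[i]))

def ras_reference_plan(rap_hist, library):
    """Layered structure for one RAS: the RAP frame is inter-coded
    from its MSLF, and every following non-RAP frame reuses that
    same MSLF as its LTR frame. No per-frame retrieval is needed,
    since an ARAI RAS contains a single scene."""
    mslf = most_similar_library_frame(rap_hist, library)
    return {"rap_reference": mslf, "non_rap_ltr": mslf}
```

Because the RAP frame references only one intracoded library frame, decoding an RAS after a random seek requires fetching just that single library frame.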

Scene Library Management
Due to the limited storage capacity of the codec, the scene library is built per video clip of several minutes' length to limit the number of library frames. In other words, the frames in the scene library do not accumulate indefinitely. After a video clip is encoded/decoded, the scene library is cleared, and a new scene library is built for the next clip. The video clip and the corresponding scene library are encoded into two separate streams. In the video stream, a signal is added for each frame to indicate its MSLF in the library stream. Here, we discuss only the library management strategies for the VOD streaming application; for locally stored video streams, the strategies are the same except that the stream request operation is omitted.
At the decoder, the video stream and the library stream are downloaded and decoded synchronously. When the decoder is capable of storing all the decoded library frames, they are placed in the scene library and never removed, so the stored library frames can be directly reused by later frames in the video clip. In contrast, when the decoder can store only a part of the decoded library frames, e.g., on mobile devices, the downloaded library stream is stored locally, and the scene library works in a first-in, first-out manner to keep the most recently decoded library frames. Later frames can then find their MSLFs either in the scene library or in the library stream. For RA within the current video clip, if the RAP frame to be decoded cannot find its MSLF in the scene library, the decoder searches the stored library stream; if the MSLF is not found there either, it is requested from the server. For RA across video clips, the scene library and the stored library stream are cleared, and the streams of the RAP frame and its MSLF are requested from the server.
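The decoder-side lookup order just described (bounded FIFO library, then locally stored library stream, then server request) can be sketched as follows; the class and method names are ours, and the server is modeled as a plain callback.

```python
from collections import OrderedDict

class SceneLibrary:
    """Bounded decoder-side scene library that keeps the most
    recently decoded library frames in first-in, first-out order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = OrderedDict()  # frame_id -> decoded frame

    def add(self, frame_id, frame):
        self.frames[frame_id] = frame
        if len(self.frames) > self.capacity:
            self.frames.popitem(last=False)  # evict the oldest entry

    def lookup(self, frame_id, stored_stream, request_from_server):
        """MSLF lookup order: library -> stored stream -> server.
        Returns (frame, source) so the caller can see where it came from."""
        if frame_id in self.frames:
            return self.frames[frame_id], "library"
        if frame_id in stored_stream:
            frame = stored_stream[frame_id]  # decode from the local stream
            self.add(frame_id, frame)
            return frame, "stored_stream"
        frame = request_from_server(frame_id)  # RA fallback
        self.add(frame_id, frame)
        return frame, "server"
```

Random access within a clip thus degrades gracefully: the server round-trip is needed only when the MSLF is in neither the library nor the locally stored library stream.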

Experimental Set-up
Experiments are conducted on six sequences containing repeated scenes to evaluate the performance of the SLBVC scheme. The test sequences were extracted from six movies and TV series: "The Big Bang Theory," "House of Cards," "Sherlock," "The Man from Earth," "2 Broke Girls," and "Game of Thrones." The details of the sequences are shown in Table 2. Each sequence is longer than 4000 frames. All sequences are composed of multiple scenes, a large portion of which appear repeatedly.
The proposed scheme is implemented on an HM12.1 28 encoder. We compare it with HEVC, the LTR scheme, and the method in Ref. 16 to demonstrate its performance. The sequences are encoded under the RA common test condition, which is designed for RA applications (see Table 3).

Results of Library Construction
The numbers of RAP frames and clusters of the test sequences are shown in Table 4. It can be seen that the RAP frames are classified into a smaller number of clusters. The optimal cluster number depends on the video content and the number of RAP frames. The curves of clustering cost versus cluster number for Bigbang are shown in Fig. 9. The optimal cluster numbers are 41 and 13 with RAI 32 and 152, respectively. After clustering, the scene library is constructed by extracting the center frame of each cluster. As only the center frames of clusters containing more than one frame are chosen as library frames, the number of library frames is less than or equal to the number of clusters, as also shown in Table 4.

Performance of Proposed Scheme
BD-Rate is used to evaluate the rate-distortion performance of the proposed scheme. Note that each non-RAP frame also uses a library frame as its LTR frame, so the coding performance is evaluated on the whole sequence. The bits brought by the scene library are taken into account when calculating the bitrate. The gains of the LTR scheme, the scheme in Ref. 16, and the proposed scheme with respect to HEVC are shown in Table 5. With RAI 152, some long-term temporal correlation within an RAS can be exploited, so the LTR scheme achieves a 0.5% coding gain. With RAI 32, however, little long-term temporal correlation is left because of the short RASs, so the LTR scheme brings no coding gain. In contrast, the scheme in Ref. 16 and the proposed scheme achieve much higher coding gains because they remove the long-term temporal correlation between RASs and between repeated scenes. For both schemes, the performance with RAI 32 is better than with RAI 152, because with the shorter RAI much more long-term temporal correlation between RASs remains and can now be removed. Compared to Ref. 16, the bit savings of the proposed scheme are 8.1% with RAI 32 and 7.2% with RAI 152. These gains come from the more efficient scene library built with the clustering-based method and the better coding of non-RAP frames. Figures 10 and 11 show the rate-distortion curves of all six test sequences for the four schemes (HEVC, the LTR scheme, the scheme in Ref. 16, and the SLBVC scheme) with RAI 32 and 152, respectively. The rate-distortion curves of the LTR scheme and HEVC are hardly distinguishable because of their close performance, whereas the curves of the SLBVC scheme lie clearly above those of the other three schemes, which demonstrates the strong performance of the proposed scheme.
For the sequences Cards (1280 × 720) and Throne (640 × 360), the maximum bitrates are around 200 and 100 kbps, respectively, because the images contain many dark regions and lack high-frequency components; they do not take many bits to encode even at very high quality. For the other sequences, with rich colors and textures, the bitrate reaches several hundred kbps. In summary, the proposed scheme can deal with different types of video content and shows an advantage over a wide range of bitrates.
In addition to the improvement in objective performance, visual quality is also significantly improved by the proposed scheme. When the SLBVC scheme is applied, coding artifacts caused by aggressive compression, such as blockiness and blurriness, are greatly alleviated. Figure 12 shows a set of images cropped from the middle frame of an RAS in Sherlock coded with RAI 32. The first image is coded by the proposed scheme, the second by HEVC, and the third is the original image for comparison. The bitrates of the proposed scheme and HEVC are 72 and 86 kbps, respectively, and the PSNR values of the images range between 39 and 41 dB. It can be seen that the contour of the actor's face (eyes and nose) is much better preserved by the proposed scheme. Similarly, Fig. 13 shows images cropped from Bigbang coded with RAI 152; the textures of the clothes are visually clearer with the proposed scheme.
We can therefore conclude that the subjective quality of the proposed scheme is much better than that of HEVC.

Discussion of Random Access
In VOD streaming, users sometimes do not want to watch the entire video sequence; they may request only part of it from the server via RA. When only a part of the video is transmitted to the client, only the referenced library frames, instead of the whole library, need to be transmitted along with it. This is feasible because the library frames are intraframes and independent of each other. As the length of the transmitted part decreases, the number of frames that reference each library frame also decreases. Nevertheless, given typical viewing behavior in practice, 29,30 the proposed scheme is still valuable for the VOD streaming application. We simulate the performance when part of the sequence is transmitted in VOD streaming. Assume the test sequence contains N RASs with RAI 152. We test a series of transmitted lengths round(N/2^m), m = 0, 1, …, m*, where round(·) rounds its argument to the nearest integer and m* is the least integer satisfying round(N/2^{m*}) = 1. In each test option, the test sequence is divided into equal-length parts of the specified transmitted length. The coding performance (BD-Rate) of each part is calculated, taking into account the bits of the referenced library frames; the performances of all parts are then averaged to derive the final result. Figure 14 shows the curves of coding performance versus m for all sequences. The value of m* is 5 for Cards and Earthman and 6 for the other sequences. Performance deteriorates as m increases, and there is even a performance loss when m is too large. However, performance improvement is still observed over a large range of transmitted lengths. For Bigbang, Sherlock, Girls, and Throne, there is a coding efficiency improvement for m between 0 and 3, which corresponds to transmitting 12.5% to 100% of the whole sequence. For Cards, a larger range of m, between 0 and 4, shows a coding gain.
For Earthman, although the range of m with coding gain is narrower (0 to 2), the proposed scheme still outperforms HEVC as long as no less than 25% of the sequence is transmitted.
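The transmitted-length schedule round(N/2^m), m = 0, 1, …, m*, used in this simulation can be sketched as follows (a minimal illustration with our own function name; round-half-up is assumed for round(·)):

```python
def transmitted_lengths(n_ras):
    """Lengths round(N / 2^m) for m = 0, 1, ..., m*, where m* is the
    least integer with round(N / 2^{m*}) == 1."""
    lengths = []
    m = 0
    while True:
        length = int(n_ras / 2 ** m + 0.5)  # round half up
        lengths.append(length)
        if length == 1:  # reached m*
            break
        m += 1
    return lengths
```

The number of entries is m* + 1, matching the observation that m* is 5 or 6 for the test sequences depending on their RAS count.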

Complexity Analysis
The complexity of the proposed scheme comes from scene library construction (RAP frame clustering and library frame encoding), MSLF selection, and video coding. As the HM encoder is written in C++, the scene library construction and MSLF selection algorithms are also implemented in C++ for a fair comparison. The experiments are conducted on an Intel Xeon E5-2690 CPU @ 2.90 GHz with 190 GB of RAM. The complexity of each process is measured by its running time.
The computational complexity of SLBVC, T_SLBVC, can be expressed as

T_SLBVC = T_1 + T_2 + T_3 + T_4,   (14)

where T_i, i = 1, …, 4, represent the running times of RAP frame clustering, library frame encoding, MSLF selection, and video coding, respectively. The complexity distributions of the processes with RAI 32 and 152 are presented in Tables 6 and 7. With RAI 152, T_1, T_2, and T_3 occupy about 0.04%, 0.05%, and 0.001% of the total time, respectively. With RAI 32, the percentages of T_1, T_2, and T_3 increase to 0.79%, 0.13%, and 0.002% because there are more RAP frames. For T_1 in particular, if the number of RAP frames is N, the number of interframe distance computations is O(N^3), so its complexity grows much faster than T_2 and T_3. Overall, compared with the video coding time T_4, the times T_1, T_2, and T_3 are rather small and can be neglected. Using the video coding time of HEVC as the anchor, the video coding times of the LTR scheme and the proposed scheme are shown in Table 8. About 24% to 34% extra video coding complexity is introduced by the LTR scheme and the proposed scheme because an additional LTR frame is employed by each frame, which increases the ME complexity. The video coding complexity of SLBVC is slightly higher than that of the LTR scheme, by 6% on average, as the RAP frames are coded as interframes. Considering the high performance of the SLBVC scheme, the complexity increase is acceptable.

Conclusion
In this paper, an SLBVC scheme is presented to improve the coding efficiency of videos with repeated scenes. The proposed scheme exploits the long-term temporal correlation between RASs that belong to similar scenes. It can be applied to stored video applications (e.g., DVD, BD, etc.) and VOD streaming. Compared to the state-of-the-art method, the scheme achieves 8.1% and 7.2% coding performance improvement on the whole sequence with RAI 32 and 152, respectively. Although the performance may degrade when only part of a sequence is transmitted in streaming, the proposed scheme still shows an advantage over a large range of transmitted lengths.
Several extensions of the current work can be explored in the future. For example, more accurate clustering criteria can be investigated to build a more efficient scene library. Also, the coding quality of library frames can be adaptively optimized according to the number of times they are referenced, to improve overall coding efficiency. We will work on these issues to further improve the efficiency of the SLBVC scheme.