The growing consumer interest in video content has brought the world closer than ever to universal digital video provision. The recent market success of media services such as internet protocol television and video on demand on consumer electronic products has considerably increased network traffic, which has led network operators to apply restriction policies, fueling the controversy over network neutrality and traffic differentiation.
This situation has created the need for more advanced encoding techniques (e.g., H.265/HEVC) and adaptation schemes1 which achieve higher compression ratios and make the adaptation of the media service more agile,2 relieving network operators of part of the media-related network traffic. Thus, during the service provision, the video stream may need to be dynamically transcoded to different formats/profiles (e.g., as in the paradigm of mobile-edge computing), resulting in adapted media services that dynamically fit the current network conditions and the terminal device specifications. However, this in-service transcoding process, which today can be supported by the emerging software-defined networking and network function virtualization techniques,3 introduces a wide variety of encoding impairments that degrade the resulting quality level of the encoded media service. This quality degradation creates the need for flexible video quality assessment (VQA) methods that are able to evaluate the quality with the same accuracy as the well-established video quality metrics [e.g., structural similarity (SSIM) index4]. These VQA methods would be able to assess the quality not only during the initial coding process of the source signal, but also across the media service delivery path to the end user, providing useful feedback both to the content provider for service adaptation actions and to the network operator for optimal traffic steering decisions.
To monitor the quality in real time and across the service provision chain,5 it is necessary to use flexible VQA tools, which are suitable for in-service integration, evaluating the media service along its network delivery path.
Currently, the available VQA methods are divided into two categories: the subjective and the objective ones. More specifically, the subjective VQA methods6 are based upon the opinion score of a group of viewers regarding the visual degradation of an encoded video sequence compared to the original uncompressed sequence, establishing them as the primary choice for video quality evaluation tests in terms of reliability, but also practically from a commercial perspective. However, the subjective video quality evaluation methods are expensive and time-consuming, mainly because they demand a controlled room/environment with sophisticated apparatus. As a result, they cannot be commercially exploited, especially within the service provision chain for monitoring the delivered video service.
Correspondingly, the objective VQA techniques are mathematical computational models that utilize various image characteristics (e.g., luma and chroma),7–16 or other image statistics (e.g., blockiness),17 to approximate the subjective test results as closely as possible in an efficient and cost-effective way. The objective methods are categorized into three groups determined by their approach and the metric used for the quality assessment: the full reference (FR) ones, the reduced reference (RR) ones, and the no-reference (NR) ones.
The FR methods evaluate the video quality by comparing the frames of the original video and the target video. These methods perform a multiple-channel decomposition of the video signal and apply the objective method on each channel, each of which carries a different weight factor according to the characteristics of the human visual system (HVS), using contrast sensitivity functions, channel decomposition, error normalization, weighting, and finally Minkowski error pooling to combine the error measurements into a single perceived quality estimate.18 FR methods for a single channel have also been proposed in the literature, where the objective metric is applied on the video signal without considering varying weight functions. Some FR metrics that are based on the video structural distortion have been proposed,19 among which is the widely known SSIM index, which is applicable across many different fields.20–22 All FR methods, including SSIM, provide higher accuracy and credibility than the other categories (RR and NR), but their evaluation process requires both the original and the encoded video sequences at the same site, making them inappropriate for integration in the service provision chain, where the original signal is not available at the end-user site.
The RR methods are able to evaluate the video quality level based on metrics which use only some extracted structural features from the original signal.23,24 The concept of the RR metrics was introduced by Refs. 7, 24, and 25, where the RR metric was based upon the extraction of various spatial and temporal features of the reference video, which are sensitive to the distortions added by the standard video compression process. RR metrics can be roughly divided into three categories.
The first category includes all methods based upon models of low-level statistical properties of the original natural image. An RR metric that belongs to this category is described in Ref. 26, which provides a condensed amount of RR information obtained by comparing the marginal probability distribution of wavelet coefficients in different wavelet subbands with the probability density function of the wavelet coefficients of the decoded signal, using the Kullback–Leibler divergence as the distance between distributions.
The second category of RR metrics includes methods that capture visual distortions to quantify the decoded signal’s quality.27–30 However, this type of metric performs well only when there is sufficient knowledge about the degradation process that the signal underwent. It is not efficient to apply these techniques to general cases whose distortion has not been previously assumed.
The third and last category of RR metrics is based upon models of the viewer’s perception, e.g., the HVS.31–34 These models exploit and apply different physiological and psychological vision studies on the end users in an attempt to imitate the behavior of subjective test groups.
RR methods are more flexible for in-service integration since they require only partial information of the original video signal, but they have reduced accuracy and credibility in comparison to the FR metrics.
Finally, the NR methods evaluate the video quality on the basis of processing the frames of the target video alone. As they do not require any information from the original video sequence, they can be easily integrated within the service provision chain. However, their performance is limited to specific visual artifacts (e.g., tiling), restricting their range of applicability to special cases only. Thus, from the family of the objective methods, the RR and the NR metrics are more suitable for in-service integration, but they suffer from limited efficiency in comparison to the FRs, which offer high accuracy, but applicability limitations, since they require the original video sequence for assessing the video quality.
It is evident that RR methods require an additional communication channel within the network architecture to transmit the extracted features from the video provider site to the end-user terminals. It is obvious that the required bandwidth depends on the specific RR method. A bandwidth in the range of 1 to 150 kbps is usually required, depending on the RR method and the feature extraction type.23,35 However, the method described in Ref. 35, which requires under 1-kbps bandwidth, performs very poorly in terms of absolute difference compared to subjective tests.
An alternative technique is to encapsulate the extracted features (or, in general, the reference information) inside the forward link, along with the video transmission. The more features that are extracted and transmitted to the end user, the more accurate the objective video quality estimate becomes. However, more features require a higher bandwidth from the communication network. So in RR methods, there is a trade-off between the accuracy of the VQA and the constraints on the network bandwidth.
In this paper, an RR method is proposed that is suitable for in-service use. The features extracted from both the original and the target video frames are based on the evaluation of the SSIM index. In this respect, the SSIM index is calculated for each frame of the original video using as reference a static white pattern, i.e., a video whose frames are all white. The SSIM index for each frame is transmitted to the end-user site. At the end-user terminal, the SSIM index is calculated between each frame of the received (target) video and the same static white pattern. The proposed RR metric is the ratio of the two SSIM indices. Through experimental measurements, it is shown that the proposed metric has a value very close to the SSIM index as calculated from the comparison of the original and the target video frames. The accuracy of the proposed metric as compared to SSIM depends on the video degradation, i.e., the higher the degradation, the worse is the accuracy. Experimental measurements show that for an acceptable video quality level,36,37 the mean absolute percentage deviation (MAPD) of the proposed method is lower than 2.56%.
Another advantage of the proposed method is the low bit rate reference information signal that needs to be sent to the end user, which ranges between 400 and 600 bps.
The rest of the paper is organized as follows: Sec. 2 describes the use of SSIM as a feature extraction method from both the original and the target videos and introduces the metric SSIM RR (SRR), which can be directly compared to the original SSIM index. Section 3 presents a qualitative interpretation of the proposed video quality metric SRR. In Sec. 4, the performance evaluation of the proposed method is presented and analyzed, and finally Sec. 5 concludes the paper.
Feature Extraction Using Structural Similarity
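Among the most reliable objective evaluation metrics is the SSIM index, which measures the structural similarity between two image sequences, exploiting the general principle that the main function of the HVS is the extraction of structural information from the viewing field. If $x$ and $y$ are two video signals, then SSIM is defined as

$$\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \tag{1}$$

where $\mu_x$ and $\mu_y$ are the mean values of $x$ and $y$, $\sigma_x^2$ and $\sigma_y^2$ are their variances, $\sigma_{xy}$ is their covariance, and $C_1$ and $C_2$ are small constants that stabilize the division.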
In the typical SSIM index evaluation process, it is assumed that both the original and the target video sequences are available at the same site, as shown in Fig. 1. The SSIM evaluation for every frame can be based on any software implementation of Eq. (1), where $x$ is the original video sequence in Fig. 1, $y$ is the target video sequence, and $i$ is their frame index.
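According to Ref. 4, where the SSIM index is defined and introduced, SSIM comprises three image comparison components: the luminance, the contrast, and the structure comparison. Their combination results in the widely used SSIM index:

$$\mathrm{SSIM}(x,y) = [l(x,y)]^{\alpha}\,[c(x,y)]^{\beta}\,[s(x,y)]^{\gamma}, \tag{2}$$

where $\alpha$, $\beta$, and $\gamma$ are positive weights. The three separate comparison components are

$$l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \tag{3}$$

$$c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \tag{4}$$

$$s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}, \tag{5}$$

with $\mu$, $\sigma$, and $\sigma_{xy}$ denoting the mean, the standard deviation, and the covariance of the signals. The authors of Ref. 4 have selected the case that $\alpha = \beta = \gamma = 1$ (with $C_3 = C_2/2$), which results in the well-known expression of the SSIM index in Eq. (1). However, the decision that the three components participate equally in the final calculation of the SSIM index is not sufficiently justified by the authors of Ref. 4, and it seems to be a decision reached only for reasons of simplicity.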
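The decomposition into luminance, contrast, and structure terms can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a single global window over a flat list of 8-bit luma samples (practical SSIM implementations average over local windows), the commonly used constants $K_1 = 0.01$ and $K_2 = 0.03$, and $C_3 = C_2/2$. The function names are hypothetical.

```python
import math


def _stats(x, y):
    """Means, sample variances and sample covariance of two equal-length lists."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    return mu_x, mu_y, var_x, var_y, cov


def ssim_components(x, y, dynamic_range=255, k1=0.01, k2=0.03):
    """Luminance, contrast and structure terms (assumed constants, C3 = C2 / 2)."""
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    c3 = c2 / 2
    mu_x, mu_y, var_x, var_y, cov = _stats(x, y)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    lum = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    con = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    struct = (cov + c3) / (sd_x * sd_y + c3)  # only this term needs both signals jointly
    return lum, con, struct


def ssim_weighted(x, y, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted product of the three terms.

    Note: the structure term can be negative, so non-integer gamma values
    are only meaningful when that term is positive.
    """
    lum, con, struct = ssim_components(x, y)
    return (lum ** alpha) * (con ** beta) * (struct ** gamma)


def ssim_direct(x, y, dynamic_range=255, k1=0.01, k2=0.03):
    """Closed-form SSIM obtained for alpha = beta = gamma = 1 and C3 = C2 / 2."""
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    mu_x, mu_y, var_x, var_y, cov = _stats(x, y)
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


frame_a = [10, 50, 90, 130, 170, 210]
frame_b = [12, 48, 95, 120, 175, 200]
print(ssim_weighted(frame_a, frame_b), ssim_direct(frame_a, frame_b))
```

With equal unit weights the product of the three terms and the closed form agree up to floating-point rounding, and only the structure term requires both signals at the same site.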
This motivated us to further research the sensitivity of the SSIM accuracy under different $\alpha$, $\beta$, and $\gamma$ weights. More specifically, considering that the purpose of this paper is to develop a flexible metric suitable for in-service applicability, we notice that the parameter $\gamma$ specifies the importance of the structure comparison between signals $x$ and $y$, which is the most influential factor for the FR requirement in the applicability of the SSIM, since according to the following expression, the covariance factor $\sigma_{xy}$, which involves both signals, needs to be calculated for the measurement of Eq. (5):

$$\sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y).$$
In the proposed method, in order to investigate the in-service applicability of the SSIM [i.e., in in-service cases where the calculation of Eq. (5) is not feasible due to the lack of the reference signal], we research in this paper the case in which the weight $\gamma$ of Eq. (5) is reduced, so that the relative importance of Eq. (5) in the SSIM calculation is limited. Therefore, the requirement for the reference signal to be available together with the encoded signal for the estimation of Eq. (5) ceases to exist, allowing the decomposition of the SSIM index exclusively into the parameters of Eqs. (3) and (4), none of which requires both the reference and the encoded signal at the same place.
Based on this analysis, in the proposed method (see Fig. 2), the SSIM index is used as a tool to extract features from both the original and the target video sequences using a reference video pattern. In this respect, an initial value $\mathrm{SSIM}_{o,r}$ is evaluated for every frame at the service provider site, by comparing the original video with a reference video pattern, e.g., a video sequence of white video frames of the same resolution and frame rate, which is artificially generated [later in this section, the primary colors (i.e., white, black, red, green, and blue) are tested as reference video patterns and it is shown that the white video pattern performs better than the others]. The evaluation of $\mathrm{SSIM}_{o,r}$ can be based on any software implementation of Eq. (1), where $x$ is the original video frame, $y$ is the reference pattern frame, and $i$ refers to each frame. $\mathrm{SSIM}_{o,r}$ can be considered as a feature of each frame of the original video and is sent to the end-user site by any means, i.e., either through the same communication channel as the video, or any other communication channel with a sufficient bandwidth, or by embedding it inside the transport stream of the video sequence. In any case, it is considered that the $\mathrm{SSIM}_{o,r}$ value is recovered at the end-user site.
Referring to Fig. 2, a value $\mathrm{SSIM}_{t,r}$ is evaluated at the end-user site by comparing the received (target) video signal with a reference video pattern, which is identical to the one used at the service provider site and is also artificially generated at the end-user site. The evaluation of $\mathrm{SSIM}_{t,r}$ is based on the same software implementation of Eq. (1) that is used at the transmitter site and refers to each frame of the target video sequence. $\mathrm{SSIM}_{t,r}$ can be considered as a feature of each frame of the target video.
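The ratio of these two SSIM values, i.e., the value obtained for the original frame against the reference pattern ($\mathrm{SSIM}_{o,r}$) over the value obtained for the target frame against the same pattern ($\mathrm{SSIM}_{t,r}$), can be considered as a new metric based on SSIM, namely SRR, i.e.,

$$\mathrm{SRR} = \frac{\mathrm{SSIM}_{o,r}}{\mathrm{SSIM}_{t,r}}. \tag{6}$$

The experimental results of Sec. 4 show that the SRR efficiently approximates the SSIM, with an MAPD lower than 2.56% and satisfactory correlation coefficient values with subjective mean opinion scores (MOS).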
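The two-site procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: frames are flat lists of 8-bit luma samples, SSIM is computed over one global window (rather than the local windows of a production implementation), the reference pattern is a uniform white frame with value 255, and all function names are hypothetical.

```python
def global_ssim(x, y, dynamic_range=255, k1=0.01, k2=0.03):
    """Single-window SSIM of two equal-length luma lists (simplifying assumption)."""
    n = len(x)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def provider_feature(original_frame, pattern_value=255):
    """Provider side: SSIM of the original frame against the white pattern."""
    pattern = [pattern_value] * len(original_frame)
    return global_ssim(original_frame, pattern)


def end_user_srr(received_feature, target_frame, pattern_value=255):
    """End-user side: ratio of the received feature to the locally computed one."""
    pattern = [pattern_value] * len(target_frame)
    return received_feature / global_ssim(target_frame, pattern)


original = [10, 50, 90, 130, 170, 210]
degraded = [60, 80, 100, 120, 140, 160]  # same mean, less spatial detail

feature = provider_feature(original)     # the only value sent per frame
print(end_user_srr(feature, original))   # undistorted target
print(end_user_srr(feature, degraded))   # degraded target
```

Because the pattern is uniform, its variance and its covariance with any frame are zero, so each feature depends only on the mean and variance of a single frame. That is what removes the need for the original frame at the end-user site in this sketch.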
Concerning the kind of reference video pattern, the applicability of the SRR method was evaluated based on using the primary colors (i.e., white, black, red, green, and blue) as reference video patterns. The selected primary colors portray a distinct RGB diversity, which is essential to demonstrate and evaluate the behavior of the proposed method, maintaining a simplicity in the implementation and representation of the pattern.
In this framework, the proposed method was applied on the experimental set of the 40 test signals, which are encoded with H.264 at three distinct quantization parameter (QP) values, specifically QP = 12, 22, and 32, each time utilizing a different primary color as a reference video pattern. The QP value regulates how much spatial detail is maintained during the encoding/compression process (by modifying the quantization step accordingly). A low QP value denotes a small quantization step, so almost all the spatial detail of the video is retained, while a high QP value corresponds to a large quantization step, so the spatial detail of the video is aggregated, resulting in increased distortion and degradation of the video quality. For each color and QP value, the proposed SRR method was applied and each time the MAPD value relative to the SSIM was calculated.
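To give a feel for how the three tested QP values relate to the quantization step, the sketch below uses the commonly cited H.264 relation in which the step size doubles for every increase of 6 in QP, starting from six base values for QP 0 to 5. The base table and the function name are taken as assumptions for illustration and are not part of the proposed method.

```python
# Base quantization step sizes for QP = 0..5 (assumed from the H.264 design,
# where the step size doubles every 6 QP values).
_BASE_QSTEP = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def h264_qstep(qp):
    """Approximate quantization step size for an H.264 QP in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("H.264 QP is expected in the range 0..51")
    return _BASE_QSTEP[qp % 6] * 2 ** (qp // 6)


for qp in (12, 22, 32):
    print(qp, h264_qstep(qp))
```

Under this relation, moving from QP = 12 to QP = 32 increases the quantization step roughly tenfold, which is consistent with the growing loss of spatial detail described above.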
The experimental results of the process are provided in Table 1, where it is observed that among the primary colors, the white video pattern provides a better performance across different QP values, while the rest of the primary colors perform notably worse than the white. Therefore, considering its better performance than the rest of the colors and its main characteristic as the easiest representation with only one luminance parameter, the white reference video pattern is recommended for implementing the proposed SRR method and it is selected for executing the experimental parts of this paper.
Table 1 MAPD for 40 test signals with different reference video patterns.
| Reference video pattern | R | G | B | QP: 12 | QP: 22 | QP: 32 |
Concerning the channel requirements and the overhead of the proposed method, the per-frame feature $\mathrm{SSIM}_{o,r}$ (the SSIM value between the original frame and the reference pattern) is a single number, and it can be represented by 2 bytes per frame for an accuracy of four decimal places ($2^{16} = 65{,}536 > 10^4$). In this case, the required bit rate to be transmitted to the end-user site is 400 bps. For an increased accuracy per frame of six decimal places ($2^{24} = 16{,}777{,}216 > 10^6$), 3 bytes are required, resulting in 600 bps. Even 600 bps is significantly lower than the range of 1 to 150 kbps required for other RR methods, as mentioned in Sec. 1.
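The overhead figures can be reproduced with simple arithmetic. The sketch below assumes a frame rate of 25 frames/s, which is inferred from the quoted bit rates (400 bps at 2 bytes per frame) and is not stated explicitly in the text. The fixed-point encoding shown is a hypothetical example of how a value in [0, 1] could be packed.

```python
def reference_bitrate_bps(bytes_per_frame, frames_per_second=25):
    """Bit rate of the side channel; the 25 frames/s default is an assumption."""
    return bytes_per_frame * 8 * frames_per_second


def encode_feature(value, decimals=4, nbytes=2):
    """Pack an SSIM value in [0, 1] as a fixed-point unsigned integer."""
    quantized = round(value * 10 ** decimals)
    return quantized.to_bytes(nbytes, "big")


def decode_feature(payload, decimals=4):
    """Inverse of encode_feature."""
    return int.from_bytes(payload, "big") / 10 ** decimals


print(reference_bitrate_bps(2))  # 400
print(reference_bitrate_bps(3))  # 600
print(decode_feature(encode_feature(0.8731)))
```

Four decimal places need integers up to 10,000, which fit in 2 bytes, and six decimal places need integers up to 1,000,000, which fit in 3 bytes.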
Qualitative Interpretation of the Proposed Video Quality Metric
In order to interpret the proposed quality metric and its relation to the SSIM index, we consider a video sequence which is encoded at various QP values, resulting in different bit rates and quality levels. The SSIM index between the original and the target video sequence for each frame is denoted as $\mathrm{SSIM}_{o,t}$, which is a descending function of QP. Given that the maximum value of $\mathrm{SSIM}_{o,t}$ is equal to 1, the SSIM variation is modeled as an exponential function of QP, as shown in Ref. 38.
The plot of $\mathrm{SSIM}_{o,t}$ (the SSIM index between the original and the target frames) versus QP is shown in Fig. 3, curve a. As QP increases, more and more information from the original video frame is lost, i.e., the spatial information (color and intensity of each pixel) of the original frame is lost. In the extreme case, where the information is completely lost, all pixels are equal in color and intensity, i.e., the target video is degraded to a sequence of uniform frames, as for example a white video pattern (similar to the reference video pattern used). In this extreme case, the lowest value of $\mathrm{SSIM}_{o,t}$ is the SSIM index between the original video sequence and the reference video pattern, which can be denoted as $\mathrm{SSIM}_{o,r}$.
The previous analysis refers to the comparison between the original and the target video frames. If we consider the comparison between the target video frames and the reference video (i.e., the white video pattern explained earlier), the corresponding index $\mathrm{SSIM}_{t,r}$ versus QP will be an ascending function.
This is because at low QPs the target frame is almost identical to the original and, therefore, will greatly differ from the (white) reference video pattern, resulting in a low $\mathrm{SSIM}_{t,r}$ value. Furthermore, as QP increases, the target frame will gradually become similar to the reference white video pattern, resulting in a higher value of $\mathrm{SSIM}_{t,r}$. For very high QP, the target frame is identical to the reference one and the maximum $\mathrm{SSIM}_{t,r}$ is equal to 1.
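Considering an exponential variation of the SSIM index between the target and the reference frames, $\mathrm{SSIM}_{t,r}$ (similar to that of the index between the original and the target frames), it is deduced that $\mathrm{SSIM}_{t,r}$ follows an ascending exponential curve, as shown in Fig. 3, curve b.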
From Eqs. (8) and (9), it can be deduced that
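Comparing Eqs. (7) and (10), it is evident that SRR approximates the SSIM index between the original and the target video frames [Eq. (11)]. In practice, however, the measured variation deviates from the ideal exponential one, as indicated in Fig. 3 (dotted line). This deviation will affect the accuracy of Eq. (11), and SRR will differ from the SSIM index by an amount that depends on the difference between curves b and c of Fig. 3.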
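The qualitative argument of this section can be illustrated with a short worked derivation. This is only a sketch under assumptions that are not taken from the paper: both curves are taken to be exponentials with a common rate $a > 0$, and $\mathrm{QP}_{\max}$ is a hypothetical QP value at which the target frame has degraded to the uniform pattern. The expressions below are not claimed to be the exact form of Eqs. (7) to (11).

```latex
% Illustrative assumption: common exponential rate a > 0 for both curves;
% QP_max is a hypothetical QP at which the target frame becomes uniform.
\begin{align*}
\mathrm{SSIM}_{o,t}(\mathrm{QP}) &= e^{-a\,\mathrm{QP}}, \\
\mathrm{SSIM}_{o,r} &= \mathrm{SSIM}_{o,t}(\mathrm{QP}_{\max}) = e^{-a\,\mathrm{QP}_{\max}}, \\
\mathrm{SSIM}_{t,r}(\mathrm{QP}) &= e^{-a\,(\mathrm{QP}_{\max} - \mathrm{QP})}, \\
\mathrm{SRR} = \frac{\mathrm{SSIM}_{o,r}}{\mathrm{SSIM}_{t,r}(\mathrm{QP})}
  &= e^{-a\,\mathrm{QP}} = \mathrm{SSIM}_{o,t}(\mathrm{QP}).
\end{align*}
```

Under these assumptions the ratio equals 1 at QP = 0, where the target coincides with the original, and falls to $\mathrm{SSIM}_{o,r}$ at $\mathrm{QP}_{\max}$, matching the end points described in the text. Any departure of the real curves from this idealized shape appears as a gap between SRR and the SSIM index.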
Performance Comparison of Structural Similarity Reduced Reference to Structural Similarity
The performance of the proposed method is evaluated by comparing SRR with the original SSIM index between the original and the target frames ($\mathrm{SSIM}_{o,t}$) for a large number of video frames. In this respect, a wide range of video sets was selected, comprising 40 video sequences of various length, resolution, and content. The selected video sequences include 11 reference video sequences,39 2 long-duration sequences (Bigbuckbunny and Elephantsdream), and 27 nonreference video sequences retrieved from movie trailers. The total number of unique frames used for evaluating the proposed method is 60,866.
The original uncompressed video sequences were encoded at three QP values: 12, 22, and 32, which satisfactorily cover the achieved video quality range of the encoded/compressed video signals. Higher QP values were not examined because they lead to unacceptable video quality.36 An initial qualitative comparison between SRR and the SSIM index is shown in Fig. 4, which shows the variation of SRR and the SSIM index for each frame of a video sequence (Kristen&Sara). From Fig. 4, the qualitative similarity of SRR and the SSIM index is obvious.
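For the quantitative measurement of the performance of the proposed method, the MAPD for each frame between SRR and the SSIM index is calculated. MAPD is a widely used metric for measuring the accuracy of a prediction method, specifically in trend estimation, such as the proposed method:

$$\mathrm{MAPD} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|, \tag{12}$$

where $A_t$ is the actual value, $F_t$ is the predicted value, and $n$ is the number of frames.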
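A minimal sketch of the MAPD computation follows. It takes the per-frame SSIM index as the actual value and the per-frame SRR as the predicted value. The function name and the sample numbers are purely illustrative and are not measurements from the paper.

```python
def mapd(actual, predicted):
    """Mean absolute percentage deviation, in percent, over paired samples."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical per-frame values, for illustration only.
ssim_per_frame = [0.98, 0.97, 0.95, 0.96]
srr_per_frame = [0.97, 0.97, 0.96, 0.95]
print(mapd(ssim_per_frame, srr_per_frame))
```

Because each term is normalized by the actual SSIM value, the same absolute error weighs more heavily on frames with a low SSIM index.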
Table 2 presents the MAPD for the experimental set of the 40 video sequences at the three QP values (i.e., 12, 22, and 32) and the mean value of MAPD for each QP.
Table 2 MAPD for SRR and SSIM index for a white video reference pattern.
Table 2 shows that the MAPD of the proposed method ranges from 0.62% (QP = 12) to 2.56% (QP = 32), which represents the worst-case performance, showing that the proposed SRR method maintains a satisfactory performance across the whole range of QP values, although better accuracy is achieved at lower QP values. The comparison between the correlation of QP and ground truth (i.e., MOS) versus the correlation of SRR scores and ground truth provides the result of 28.65, showing the advantage and the better performance of the proposed metric.
The MAPD is calculated from Eq. (12), where the actual value equals $\mathrm{SSIM}_{o,t}$, as calculated from Eq. (8), and the predicted value equals SRR, i.e., the ratio of the two SSIM values evaluated against the reference pattern (see also Fig. 3, which refers to an exemplary practical case).
Performance Comparison of Structural Similarity Reduced Reference to Subjective Difference Mean Opinion Scores and Other Assessment Methods
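According to the video quality experts group research,40 in order to obtain a linear relationship between an objective assessment method score and its corresponding subjective score, each metric score $Q$ is mapped to $Q'$. The nonlinear best-fitting logistic function is given as follows:

$$Q' = \beta_1\left[\frac{1}{2} - \frac{1}{1 + e^{\beta_2 (Q - \beta_3)}}\right] + \beta_4 Q + \beta_5.$$

The parameters ($\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, and $\beta_5$) are calculated by minimizing the sum of squared differences between the subjective and the mapped scores. To compare the performance of the newly proposed SRR method with the existing ones, performance evaluation metrics are used, such as Pearson’s linear correlation coefficient (LCC), which is the linear correlation between the predicted MOS and the subjective MOS. LCC is a measure of the prediction accuracy of an objective assessment metric, i.e., the capability of the metric to predict the subjective scores with low error. The LCC can be calculated using the equation below:

$$\mathrm{LCC} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},$$

where $x_i$ and $y_i$ are the mapped objective scores and the subjective scores, respectively, and $\bar{x}$ and $\bar{y}$ are their mean values.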
Moreover, the Spearman rank order correlation coefficient (SROCC), which measures the monotonicity of the proposed method against subjective human scores, was also applied. SROCC is a nonparametric measure of statistical dependence between two variables, which assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation (equal to 1) occurs when each of the variables is a perfect monotone function of the other.
Therefore, in order to evaluate the performance of the SRR method and derive the LCC, following the aforementioned methodology, the Laboratory for Image and Video Engineering (LIVE) video quality dataset6,41 was used. The LIVE video quality database (VQDB) uses 10 uncompressed high-quality videos with a wide variety of content as reference videos. A set of 150 distorted videos were created from these reference videos (15 distorted videos per reference) using H.264-based compression. Then each degraded video in the LIVE VQDB was assessed by 38 human subjects in a single stimulus study with hidden reference removal, where the subjects scored the video quality on a continuous quality scale. The mean and variance of the difference mean opinion scores (DMOS) were obtained from the subjective evaluations.
A scatter plot of proposed objective SRR scores versus DMOS for all H.264 videos in the LIVE VQDB is shown in Fig. 6 along with the best-fitting logistic function.
The SROCC and the LCC are computed between the objective and the subjective scores. Table 3 shows the performance of the proposed SRR model against other VQA methods, both FR and RR, in terms of the SROCC and LCC.
Table 3 Comparison of the performance of VQA algorithms: LCC and SROCC.
| VQA method | Type | LCC | SROCC |
| Peak signal-to-noise ratio (PSNR) | FR | 0.5493 | 0.4585 |
| Yang’s RR VQA43 | RR | 0.5654 | 0.5366 |
| Visual signal-to-noise ratio (VSNR)44 | FR | 0.6216 | 0.6460 |
| Proposed SRR method | RR | 0.6260 | 0.5862 |
| Video quality metric (VQM)34 | FR | 0.6459 | 0.6520 |
According to Table 3, the proposed SRR method is more accurate than the RR VQA methods RR-LHS,42 J.246,33 and Yang’s RR VQA43 both in terms of LCC and monotonicity (SROCC), except for the RR metric of Ref. 35, which provides better results. Similarly, the accuracy of the proposed SRR method is better than that of the PSNR and VSNR FR VQA methods in terms of LCC, while it is slightly lower than that of VQM and the SSIM index, as expected, due to the reduced reference nature of the proposed methodology. In terms of monotonicity (SROCC), the proposed method performs better than the PSNR, but worse than the rest of the VQA methods, without, however, significantly deviating from their performance range.
This paper proposes an RR VQA method using SSIM index as a tool to extract features from both the original and the target video sequences, using a reference video pattern. The method is suitable for monitoring the video quality in real time and across the service provision chain. The performance of the proposed method was evaluated using a large experimental set of 40 reference and nonreference video sequences, with spatial resolution ranging from common intermediate format up to high definition, utilizing a static white video as a relative reference pattern. The proposed method maintains a satisfactory performance across all the potential range of QP values, although better accuracy is achieved at lower QP values. Moreover, comparison to subjectively evaluated scores of LIVE video quality dataset shows that the accuracy of the proposed method is better than the average performance of RR VQA methods and is within the performance range of the FR VQA methods. Finally, another advantage of the proposed method is the low bit rate reference information signal that must be sent to the end user, which ranges between 400 and 600 bps.
This work was undertaken under the Information Communication Technologies (FP7-ICT-2013-11) EU FP7 T-NOVA project, which is funded by the European Commission under the Grant No. 619520.
Michail-Alexandros Kourtis received his diploma and master’s degree in computer science from Athens University of Economics and Business in 2011 and 2014, respectively. Since January 2010, he has been working at the National Center of Scientific Research (NCSR) “Demokritos” in various research projects. Currently, he is pursuing his PhD at the University of the Basque Country [Universidad del País Vasco/Euskal Herriko Unibertsitatea (UPV/EHU)]. His research interests include video processing, video quality assessment, image processing, network function virtualization, and software-defined networking.
Harilaos Koumaras received his BSc degree in physics in 2002, his MSc degree in electronic automation and information systems in 2004 and his PhD in 2007 in computer science, all awarded from the University of Athens. He is a research associate at the NCSR “Demokritos,” with publications at international conferences, journals, and book chapters. Currently, he is the author or coauthor of more than 80 scientific papers, numbering at least 350 nonself-citations.
Fidel Liberal received his MSc degree in telecommunication engineering in 2001. He received his PhD from the University of the Basque Country (UPV/EHU) in 2005 for his work in the area of holistic management of quality in telecommunications services. He has coauthored more than 80 conference and journal papers with more than 300 citations. His current research interests include quality management and multicriteria optimization for multimedia services in 5G.