No-reference image quality metrics are of fundamental interest as they can be embedded in practical applications.
The main goal of this paper is to perform a comparative study of seven well-known no-reference learning-based
image quality algorithms. To test the performance of these algorithms, three public databases are used. As a
first step, the trial algorithms are compared when no new learning is performed. The second step investigates
how the training set influences the results. The Spearman Rank Ordered Correlation Coefficient (SROCC) is
utilized to measure and compare the performance. In addition, a hypothesis test is conducted to evaluate the
statistical significance of the performance of each tested algorithm.
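As an illustration of the performance measure named above, the following is a minimal sketch of SROCC computed as the Pearson correlation of rank-transformed scores. The function names and the average-rank handling of ties are assumptions of this sketch, not details taken from the study.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srocc(predicted, subjective):
    """Spearman rank-order correlation between metric outputs and subjective scores."""
    return pearson(average_ranks(predicted), average_ranks(subjective))
```

Because only ranks are compared, any monotonic mapping between predicted and subjective scores yields a coefficient of 1 in magnitude, which is why this measure is convenient for comparing metrics with different output scales.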
While objective and subjective quality assessment of images and video has been an active research topic in recent years, multimedia technologies require new quality metrics and methodologies that take into account the fundamental differences in human visual perception and the typical distortions of both the video and audio modalities. Because of the importance of faces, and especially talking faces, in video sequences, this paper presents an audiovisual database that contains a different talking scenario. In addition to the video, the database also provides subjective quality scores obtained using a tailored single-stimulus test method (absolute category rating, ACR). The resulting mean opinion scores (MOS) can be used to evaluate the performance of audiovisual quality metrics as well as for the comparison and design of new models.
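To make the notion of a mean opinion score concrete, here is a small sketch that averages the ratings given to one stimulus and attaches an approximate 95% confidence interval. The function name, the example ratings and the normal-approximation factor 1.96 are assumptions of this sketch; the database's actual screening and scoring procedure is not described in the abstract.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    n = len(ratings)
    mos = statistics.mean(ratings)
    if n < 2:
        # A single rating gives no estimate of the spread.
        return mos, 0.0
    # Normal approximation (assumed); small panels may call for a Student t factor.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(n)
    return mos, half_width


# Hypothetical ratings from five observers for one sequence.
mos, ci = mean_opinion_score([5, 4, 4, 3, 4])
```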
During the last decade, the significant advances and widespread availability of mobile technology (operating systems, GPUs, terminal resolution and so on) have encouraged the fast development of voice and video services such as video-calling. While multimedia services have grown considerably on mobile devices, the resulting increase in data consumption is leading to the saturation of mobile networks. In order to provide data at high bit-rates and maintain performance as close as possible to that of traditional networks, the 3GPP (3rd Generation Partnership Project) worked on a high-performance mobile standard called Long Term Evolution (LTE). In this paper, we aim to provide recommendations on audio and video media profiles (selection of audio and video codecs, bit-rates, frame-rates, audio and video formats) for typical video-calling services held over LTE/4G mobile networks. These profiles are defined according to the targeted devices (smartphones, tablets), so as to ensure the best possible quality of experience (QoE). The results indicate that for the CIF format (352 x 288 pixels), which is usually used for smartphones, the VP8 codec provides better image quality than the H.264 codec at low bit-rates (from 128 to 384 kbps). However, for sequences with high motion, H.264 in slow mode is preferred. Regarding audio, better results are globally achieved using wideband codecs offering good quality, except for the Opus codec (at 12.2 kbps).
Rate control is a critical issue in the H.264/AVC video coding standard because it suffers from some shortcomings that make the bit allocation process suboptimal. This leads to a video quality that may vary significantly from frame to frame. Our aim is to enhance the rate control efficiency in the H.264/AVC baseline profile by handling two of its defects: the initial quantization parameter (QP) estimation for Intra-Frames (I-Frames) and the target number of bits determination for Inter-Frames (P-Frames) encoding. First, we propose a Rate-Quantization (R-Q) model for the I-Frame constructed empirically after extensive experiments. The optimal initial QP calculation is based on both target bit-rate and I-Frame complexity. The I-Frame target bit-rate is derived from the global target bit-rate by using a new non-linear model. Secondly, we propose an enhancement of the bit allocation process by exploiting frame complexity measures. The target number of bits determination for P-Frames is adjusted by combining two temporal measures: the first is a motion ratio based on actual bits used to encode previous frames; the second measure exploits the difference between two consecutive frames and the histogram of this difference. The simulation results, obtained using the JM15.0 reference software and the JVT-O016 rate control algorithm, show that the right choice of initial QP for the I-Frame and the first P-Frame improves both the bit-rate and the peak signal-to-noise ratio (PSNR). Finally, the Inter-Frame bit allocation process further improves the bit-rates while keeping the same PSNR improvement (up to +1.33 dB/+2 dB for QCIF/CIF resolutions). Moreover, this process reduces the buffer level variation, leading to a more consistent quality of reconstructed videos.
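The PSNR gains quoted above are expressed in decibels; the following sketch shows how PSNR is conventionally computed from the mean squared error between a reference frame and its reconstruction. The flat-list frame representation and the default peak value of 255 (8-bit samples) are assumptions of this sketch.

```python
import math


def psnr(reference, reconstructed, peak=255.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(reconstructed):
        raise ValueError("frames must have the same number of samples")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        # Identical frames: the ratio is unbounded.
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)
```

On this logarithmic scale, a gain of 2 dB corresponds to a reduction of the mean squared error by a factor of about 1.58.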
In the H.264/AVC rate control algorithm, the bit allocation process and the quantization parameter (QP) determination are not optimal.
At the frame layer, there is an implicit assumption that the video sequence is more or less stationary
and that neighbouring frames consequently have similar characteristics. The target Bit-Rate for each frame
is therefore estimated using a straightforward process that allocates an equal bit budget to each frame regardless of its
temporal and spatial complexities. This uniform allocation is clearly not suited to all types of video
sequences. The target bit determination at the macroblock layer uses the MAD (Mean Absolute Difference) ratio
as a complexity measure in order to promote interesting macroblocks, but this measure remains inefficient in
handling macroblock characteristics. In previous work, we proposed Rate-Quantization (R-Q) models
for Intra and Inter frames to deal with the QP determination shortcoming. In this paper, we seek to
overcome the limitations of the bit allocation process at the frame and macroblock layers. At the frame
level, we enhance the bit allocation process by exploiting frame complexity measures. Thereby, the target bit
determination for P-frames is adjusted by combining two temporal measures: The first one is a motion ratio
determined from actual bits used to encode previous frames. The second measure exploits both the difference
between two consecutive frames and the histogram of this difference. At macroblock level, the visual saliency
is used in the bit allocation process. The basic idea is to promote salient macroblocks. Hence, a saliency map,
based on a Bottom-Up approach, is generated and a macroblock classification is performed. This classification
is then used to accurately adjust UBitsH264 which represents the usual bit budget estimated by H.264/AVC
bit allocation process. For salient macroblocks the adjustment leads to a bit budget which is always larger
than UBitsH264. The extra bits added to code these macroblocks are deducted from the bit budget allocated
to the non-salient macroblocks. Simulations have been carried out using the JM15.0 reference software, several
video sequences and different target Bit-Rates. In comparison with the JM15.0 algorithm, the proposed approach
improves the coding efficiency in terms of PSNR/PSNR-HVS (up to +2 dB/+3 dB). Furthermore, the bandwidth
constraint is always satisfied because the actual Bit-Rate is always lower than or equal to the target Bit-Rate.
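The macroblock-level idea described above (salient macroblocks receive more than UBitsH264, and the extra bits are deducted from non-salient ones) can be sketched as a budget-preserving redistribution. The proportional rule and the `boost` parameter below are hypothetical choices made for illustration; the abstract does not give the actual adjustment formula or the classification thresholds.

```python
def redistribute_bits(ubits_h264, salient, boost=0.2):
    """Shift bits from non-salient to salient macroblocks, keeping the total.

    ubits_h264: per-macroblock budgets from the usual allocation.
    salient:    one boolean per macroblock (True if classified as salient).
    boost:      hypothetical fraction added to the salient budgets.
    """
    if len(ubits_h264) != len(salient):
        raise ValueError("one saliency flag is needed per macroblock")
    salient_total = sum(b for b, s in zip(ubits_h264, salient) if s)
    other_total = sum(b for b, s in zip(ubits_h264, salient) if not s)
    extra = boost * salient_total
    if extra == 0 or other_total == 0:
        # Nothing to give, or nothing to take from: keep the usual budgets.
        return list(ubits_h264)
    extra = min(extra, other_total)       # never take more than is available
    gain = 1.0 + extra / salient_total    # factor applied to salient macroblocks
    scale = 1.0 - extra / other_total     # factor applied to the others
    return [b * gain if s else b * scale for b, s in zip(ubits_h264, salient)]
```

Because the added and removed amounts are equal, the frame-level budget is unchanged, which is consistent with the bandwidth constraint mentioned in the text.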
Rate control plays a key role in video coding standards. Its goal is to achieve a good quality at a given target
bit-rate. In H.264/AVC, the rate control algorithm for both Intra- and Inter-frames suffers from some defects. In
the Intra-frame rate control, the initial quantization parameter (QP) is mainly adjusted according to a global
target bit-rate and the length of the GOP. This determination is inappropriate and generates errors throughout the
video sequence. For an Inter coding unit (frame or macroblock), the use of the MAD (Mean Absolute Difference) as
a complexity measure remains inefficient, resulting in improper QP values because the MAD captures image
characteristics only locally. QP miscalculations may also result from the linear prediction model, which assumes
similar complexity from one coding unit to another. To overcome these defects, we propose in this paper a new
Rate-Quantization (R-Q) model resulting from extensive experiments. This model is divided into two sub-models.
The first one is an Intra R-Q model used to determine an optimal initial quantization parameter for Intra-frames.
The second one is an Inter R-Q model that aims at determining the QP of an Inter coding unit according
to the statistics of the previously coded ones. It does not use any complexity measure and replaces both the
linear and quadratic models used in the H.264/AVC rate controller. Objective and subjective evaluations have been
carried out using the JM15.0 reference software. Compared to the latter, the global R-Q model (Intra and Inter
models combined) improves the coding efficiency in terms of PSNR, objectively (up to +2.01 dB), subjectively
(by psychophysical experiments) and in terms of computational complexity.
According to the literature, automatic video summarization techniques can be classified into two categories according to the
nature of their output: "video skims", which are generated using portions of the original video, and "key-frame sets", which
correspond to images, selected from the original video, that have significant semantic content. The difference between
these two categories is reduced when we consider automatic procedures. Most of the published approaches are based on
the image signal and use pixel characterization, histogram techniques or block-based image decomposition.
However, few of them integrate properties of the Human Visual System (HVS). In this paper, we propose to extract key-frames
for video summarization by studying the variations of salient information between two consecutive frames. For
each frame, a saliency map is produced simulating the human visual attention by a bottom-up (signal-dependent)
approach. This approach includes three parallel channels for processing three early visual features: intensity, color and
temporal contrasts. For each channel, the variations of the salient information between two consecutive frames are
computed. These outputs are then combined to produce the global saliency variation which determines the key-frames.
Psychophysical experiments have been defined and conducted to analyze the relevance of the proposed key-frame
The development of metrics for assessing the quality of compressed images has given rise to considerable effort and has led to a general layout of the perceptual metrics based on models of the human visual system. Instead of another comparative study of the global performance of these systems, we sought to perform an individual evaluation of the main visual model components in order to propose a unified fidelity metric with low complexity and high performance. Our effort has focused on the frequency selectivity of human vision, perceived contrast, masking effects, pooling, and visual attention. We used the correlation between the obtained fidelity measures and the subjective assessment scores to analyze the different configurations.
At high compression ratios, current lossy compression algorithms introduce distortions that are generally exploited by no-reference quality assessment. For JPEG-2000 compressed images, the blurring and ringing effects are the principal annoyance for a human observer. However, the Human Visual System does not carry out a systematic, local search for these impairments over the whole image; rather, it identifies some regions of interest for judging the perceptual quality. In this paper, we propose to use both of these distortions (ringing and blurring effects), locally weighted by an importance map generated by a region-based attention model, to design a new reference-free quality metric for JPEG-2000 compressed images. For the blurring effect, the impairment measure depends on spatial information contained in the whole image while, for the ringing effect, only the local information around strong edges is used. To predict the regions in the scene that potentially attract human attention, one stage of the proposed metric generates an importance map from a region-based attention model defined by Osberger et al. First, explicit regions are obtained by color image segmentation. The segmented image is then analyzed according to different factors known to influence human attention. The resulting importance map is finally used to locally weight each distortion measure. The predicted scores have been compared, on the one hand, to the subjective scores and, on the other hand, to previous results based only on the artefact measurement. This comparative study demonstrates the efficiency of the proposed quality metric.
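The local weighting step described above can be sketched as an importance-weighted average of a distortion map. The flat-list representation and the normalization by the sum of the weights are assumptions of this sketch; the abstract does not specify how the blurring and ringing measures are combined after weighting.

```python
def weighted_distortion(distortion_map, importance_map):
    """Importance-weighted mean of local distortion values (flat lists)."""
    if len(distortion_map) != len(importance_map):
        raise ValueError("maps must have the same number of samples")
    total_weight = sum(importance_map)
    if total_weight == 0:
        raise ValueError("importance map must contain a non-zero weight")
    weighted_sum = sum(d * w for d, w in zip(distortion_map, importance_map))
    return weighted_sum / total_weight
```

With uniform importance this reduces to a plain average, so any difference from the unweighted score comes only from where the impairments fall relative to the regions of interest.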
The use of computational metrics to control and assess the visual quality of digital images is well known. This paper presents a quality metric including a visual channels representation and a new contrast masking model. Based on the measure of maximum quantization steps without visual impairments, the model considers both intrachannel and interchannel masking and is derived from extensive experiments conducted on noise and texture images instead of simple sinusoidal stimuli. The metric parameters are optimized in order to maximize the linear correlation coefficient as well as the Spearman rank-order correlation between the computed quality measures and the mean opinion score.
This paper presents a new approach to the selection of auspicious sites to be watermarked. The selection takes into account human visual system properties including luminance adaptation, contrast sensitivity and spatio-frequential selectivity. The paper also exploits the local band-limited contrast to determine the maximum watermark strength that can be applied without inducing visible degradations. Compared to well-known approaches, a contrast masking model is used here to adjust the watermark strength site by site. To test the efficiency of the approach, the obtained results are considered in the context of an adaptive watermarking algorithm. The performance of this algorithm is evaluated in terms of watermark invisibility. Robustness to the most common attacks, such as JPEG compression, cropping (with zero padding) and low-pass filtering, is also considered.
The work presented here deals with watermarking algorithms. The goal is to show how the properties of the Human Visual System (HVS) can be taken into account in the design of such algorithms. The construction of the watermarking algorithm presented in this paper involves three steps. In the first one, the selection of auspicious sites for the watermark embedding is described. The selection exploits a multi-channel model of the Human Visual System which decomposes the visual input into seventeen perceptual components. Medium and high frequencies are then selected to generate a sites map. This map is improved by considering some high level uniform areas. The second step deals with the choice of the strength to apply to the selected sites. The strength is determined by considering the HVS sensitivity to the local band-limited contrast. In the third step, examples of spatial watermark embedding and extraction are given. The same perceptual mask has been successfully used in other studies. The watermark results from a binary pseudo-random sequence, of length 64, which is circularly shifted so as to occupy all the sites mentioned above. The watermark extraction exploits detection theory and requires both the perceptual mask and the original watermark. The extracted watermark is then compared to the original and a normalized correlation coefficient is computed. The value of this coefficient allows the detection of the copyright.
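The embedding and detection steps described above can be sketched as follows: a length-64 pseudo-random sequence is repeated cyclically over the selected sites, and detection relies on a normalized correlation coefficient. The bipolar (+1/-1) encoding, the seed and the reading of "circularly shifted" as cyclic repetition are assumptions of this sketch, and the function names are hypothetical.

```python
import random


def make_watermark(length=64, seed=0):
    """Pseudo-random binary sequence, encoded here as +1/-1 (assumed encoding)."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(length)]


def tile_over_sites(watermark, n_sites):
    """Repeat the sequence cyclically so that every selected site gets one value."""
    return [watermark[i % len(watermark)] for i in range(n_sites)]


def normalized_correlation(extracted, original):
    """Normalized correlation coefficient between two equally long sequences."""
    dot = sum(x * y for x, y in zip(extracted, original))
    norm_e = sum(x * x for x in extracted) ** 0.5
    norm_o = sum(y * y for y in original) ** 0.5
    if norm_e == 0 or norm_o == 0:
        return 0.0
    return dot / (norm_e * norm_o)
```

In use, the coefficient would be compared with a detection threshold: values close to 1 indicate that the extracted sequence matches the original watermark, while an unrelated sequence gives a value near 0.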
Today, the use of objective quality metrics is well known for the optimization of digital image processing systems. The work presented in this paper concerns the algorithmic construction of an image quality criterion. This criterion takes into account the properties of the human visual system (HVS) in order to ensure the correspondence between the objective measures given by the criterion and the subjective scores given by a group of observers, and is decomposed functionally into three principal blocks. The first one corresponds to a perceptual image representation: a set of 17 frequency channels models the radial and angular selectivity of the HVS. The second block concerns the construction of the adaptation function of perception thresholds due to masking effects. Through psychophysical experiments, the visibility thresholds of impairments are first measured in each individual channel, then in the presence of masking signals from other channels. The aim of this paper is to present these results and the masking model, which takes into account masking effects both within channels and between channels. Finally, in the third block, both frequency and spatial pooling are performed.
Objective image quality assessment techniques are currently based on the properties of the human visual system (HVS), essentially using an early vision model. This type of approach makes it possible to obtain the differences between the original and distorted images in a perceptual space, so the outputs of the early vision model are perceptual distortion maps. In order to obtain a single score for the overall distortion, spatial pooling, and frequency pooling in the case of a spatial frequency decomposition, should be applied to these maps. In this paper, we present various methods to perform this pooling. In order to represent the distorted image in a perceptual space, we use a multi-channel early vision model including an amplitude nonlinearity, a contrast sensitivity function (CSF), a subband decomposition and a masking function. For the pooling, Minkowski summation with various exponents is tested first, as it is the most common pooling in the literature. The second type of pooling proposed sums all the degradations weighted by a function of the probability of their occurrence. Finally, we propose a summation that takes into account some higher-level perceptual factors in order to identify the regions of interest used to weight the errors. The results are compared by measuring the correlation between the distortion scores and the MOS.
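The Minkowski summation mentioned above can be sketched in a few lines. Normalizing by the number of samples is one common convention and is an assumption here, as are the function and parameter names; the exponents actually tested are not listed in the abstract.

```python
def minkowski_pool(errors, exponent=2.0):
    """Minkowski pooling: (mean(|e| ** exponent)) ** (1 / exponent)."""
    if not errors:
        raise ValueError("at least one error value is required")
    if exponent <= 0:
        raise ValueError("exponent must be positive")
    total = sum(abs(e) ** exponent for e in errors)
    return (total / len(errors)) ** (1.0 / exponent)
```

An exponent of 1 gives the mean absolute error and 2 gives the root mean square; larger exponents give progressively more weight to the strongest local distortions.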
In human color vision, it is well accepted that signals from the three types of receptors (L, M, S) are combined into two opponent color components and one achromatic component. In this paper, we are concerned with the cardinal directions A, Cr1 and Cr2 defined by Krauskopf. We study in particular the interactions between luminance and chromatic components. These interactions should be taken into account in visual coding since they modify the visibility thresholds. We present here results that show the influence of the two chromatic components on the optimal perceptual quantizer of the achromatic component in particular subbands. On the subband called III-1 of luminance, we show the influence of Cr1 and Cr2 sinusoidal maskers. Other results are also presented on the subband called II-1 with Cr1 and Cr2 maskers.
In order to achieve color image coding based on features of the human visual system, we have been interested in the design of a perceptually based quantizer. The cardinal directions Ach, Cr1 and Cr2, defined by Krauskopf from habituation experiments and validated in our lab from spatial masking experiments, have been used to characterize color images. The achromatic component, already considered in a previous study, is not considered here. The same methodology has been applied to the two chromatic components to specify the decision thresholds and the reconstruction levels which ensure that the induced degradations remain below their visibility thresholds. Two observers took part for each of the two components. From the values obtained for the Cr1 component, one should notice that the decision thresholds and reconstruction levels follow a linear law even at higher levels. However, for the Cr2 component, the values seem to follow a monotonically increasing function. To determine whether these behaviors are frequency dependent, further experiments have been conducted with stimulus frequencies varying from 1 cy/deg to 4 cy/deg. The measured values show no significant variations. Finally, instead of sinusoidal stimuli, filtered textures have been used to take into account the spatio-frequential combination. The same laws (linear for Cr1 and monotonically increasing for Cr2) have been observed, even though a variation in the quantization intervals is reported.
A vector quantization based on a psychovisual lattice is used in a visual components image coding scheme to achieve a high compression ratio with an excellent visual quality. The vector construction methodology preserves the main properties of the human visual system concerning the perception of quantization impairments and takes into account the masking effect due to interaction between subbands with the same radial frequency but with different orientations. The vector components are the local band-limited contrasts Cij, defined as the ratio between the luminance Lij at a point, which belongs to the radial subband i and angular sector j, and the average luminance at this location corresponding to the radial frequencies up to subband i-1. Hence the vector dimension depends on the orientation selectivity of the chosen decomposition. The low-pass subband, which is nondirectional, is scalar quantized. The performance of the coding scheme has been evaluated on a set of images in terms of peak SNR, true bit rates and visual quality. Regarding visual quality, no impairments are visible at a distance of 4 times the height of a high quality TV monitor. The SNR values are about 6 to 8 dB below those of classical subband image coding schemes when producing the same visual quality. Due to the use of the local band-limited contrast, the particularity of this approach lies in the structure of the reconstruction image error, which is found to be highly correlated to the structure of the original image.
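The verbal definition of the local band-limited contrast given above can be written compactly as follows. The spatial coordinates (m, n) and the expression of the average luminance as a sum over all lower radial subbands and their angular sectors are assumptions of this sketch, chosen to match the usual form of such contrast definitions rather than taken from the abstract.

```latex
C_{ij}(m,n) = \frac{L_{ij}(m,n)}{\bar{L}_{i-1}(m,n)},
\qquad
\bar{L}_{i-1}(m,n) = \sum_{k=0}^{i-1} \sum_{l} L_{kl}(m,n)
```

Here k = 0 denotes the low-pass subband and l runs over the angular sectors of radial subband k, so the denominator is the image content at this location restricted to radial frequencies up to subband i-1.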
A new subband coding scheme is proposed in this paper. The two main functions in such schemes, decomposition and quantization, are entirely based on psychovisual aspects. The visual subbands have been estimated by using the variation of the masking function. These masking effects in the case of sinusoidal gratings show that the peripheral part of the visual system may be modelled by a set of sixteen filters and a low frequency residue. The quantizers associated with such a decomposition have been designed using a methodology developed for this purpose. The main finding of the conducted experiments is that the decision thresholds and the reconstruction levels follow a linear law, with a quantization interval varying with frequency and orientation. This result, highly dependent on the way the signals have been characterized, justifies the choice of the local band-limited contrast. The results, obtained with a coding scheme which includes these basic features of the visual system, show that, at low signal-to-noise ratios, the visual quality of the reconstructed image remains much better than for the 'classical' schemes. Another particularity of the approach lies in the structure of the reconstruction image error. Indeed, the latter is found to be highly correlated to the structure of the original image.
In order to specify the optimal psychovisual quantizer associated with a given visual subband decomposition scheme, a new methodology has been developed. Psychovisual experiments based on the visibility of the quantization noise have been conducted. The complex signals used have been characterized by the local band-limited contrast. The main finding of this study is that the quantizers are, in the chosen contrast space, of a linear type. The contrast quantization intervals, obtained with a given observer, are 0.039 for the subband called III-1 (radial selectivity 1.5 cy/deg to 5.7 cy/deg, angular selectivity -22.5° to 22.5°), 0.031 for the subband called IV-1 (radial selectivity 5.7 cy/deg to 14.1 cy/deg, angular selectivity -15° to 15°) and 0.117 for the subband called V-1 (radial selectivity 14.1 cy/deg to 28.2 cy/deg, angular selectivity -15° to 15°). To evaluate the importance of the 'angular' aspect in this approach, further measurements have been made with the subband IV-2 (radial selectivity 5.7 cy/deg to 14.1 cy/deg, angular selectivity 15° to 30°). Linearity is also observed, and the contrast quantization interval, for the same observer, is 0.122.
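A linear quantizer in contrast space, as reported above, amounts to a uniform quantizer with a fixed interval per subband. The sketch below uses the 0.039 interval quoted for subband III-1; placing a reconstruction level at zero (a mid-tread layout) is an assumption, since the abstract gives only the interval values.

```python
import math


def quantize_contrast(contrast, interval):
    """Uniform (linear) quantizer: return the reconstruction level for a contrast.

    Assumes a mid-tread layout with reconstruction levels at integer
    multiples of the interval and decision thresholds halfway between them.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    index = math.floor(contrast / interval + 0.5)
    return index * interval


# Interval reported for subband III-1 with the given observer.
level = quantize_contrast(0.10, 0.039)
```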
In order to characterize the spatial frequency mechanisms of the visual system, we measured the visibility threshold elevation as a function of the spatial frequency of cosine maskers. The stimulus and the maskers used were spatially localized and temporally weighted. The results show that the relative bandwidth (defined as the ratio between the estimated bandwidth and the frequency of the masker) varies from 3 at low masker frequencies (1 cy/deg) to 1.15 at high masker frequencies (10 cy/deg). This is consistent with a model having five classes of spatial frequency mechanisms covering the band 0 - 30 cy/deg. These results allow the definition of a sub-band decomposition of images into twenty-one 'visual components.'