## 1. Introduction

Objective assessment of image quality is crucial for image processing applications because it allows the results of different methods to be compared and their performance evaluated. Objective metrics are used to measure the performance of image correction, compression, and enhancement methods such as denoising, JPEG compression, super-resolution, and frame rate upconversion.^{1–7} However, almost all objective evaluation metrics do not completely agree with the subjective quality perceived by humans, while subjective evaluation is usually too inconvenient, time-consuming, and expensive.^{8}

The simplest and most widely used metrics are the mean squared error (MSE) and the peak signal-to-noise ratio (PSNR). MSE is computed by averaging the squared differences between two signals, and PSNR is the logarithmic ratio between the squared maximum value (Max) of a signal and the MSE:

$$\mathrm{PSNR}=10\,{\log}_{10}\left(\frac{{\mathrm{Max}}^{2}}{\mathrm{MSE}}\right).\tag{2}$$^{9–13}

Many image quality assessment methods based on error sensitivity have been proposed,^{14–19} and they use the human visual system (HVS), the contrast sensitivity function, the discrete cosine transform, the wavelet transform, and so forth. However, the errors they measure may differ considerably from the perceived loss of quality, so some clearly visible distortions are not clearly reflected in these metrics.^{8}
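The two baseline metrics described above can be sketched in a few lines. This is a minimal illustration assuming 8-bit images (Max = 255) held in NumPy arrays; the function names `mse` and `psnr` are chosen here for illustration and are not taken from the paper.

```python
import numpy as np


def mse(x, y):
    """Mean of the squared differences between two equally sized signals."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return float(np.mean((x - y) ** 2))


def psnr(x, y, max_value=255.0):
    """PSNR in dB following Eq. (2); infinite when the signals are identical."""
    err = mse(x, y)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))


# Toy 2x2 "images": every pixel differs by 5 gray levels, so MSE = 25.
a = np.array([[10, 20], [30, 40]], dtype=np.uint8)
b = a + 5
print(mse(a, b), round(psnr(a, b), 2))
```

Because both functions only look at pixel-wise differences, a one-pixel shift and a strong blur can receive similar scores even though they look very different, which is the mismatch with perception discussed above.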

Recently, structural similarity (SSIM) has been widely used to measure visible quality.^{8,20} It is a full-reference image quality assessment method that indicates how similar an image is to the original image. It has three main components: structure, luminance, and contrast. However, these components, especially the structure component, are highly sensitive to translation, scaling, and rotation of an image. This means that even when an image is translated or rotated by an imperceptibly small amount, the SSIM decreases markedly.^{21} Moreover, it may overestimate the quality of images that have undergone regional distortions such as JPEG compression.

In this paper, we aim to develop an improved structural similarity metric that outperforms the conventional SSIM and overcomes these drawbacks. The proposed metric uses an improved structure comparison and adds a sharpness comparison.

## 2. SSIM and Its Drawbacks

Since humans usually rely on contrast, color, and frequency changes when judging image quality,^{22} the SSIM uses the luminance, contrast, and structure comparisons shown in Fig. 1.^{8,22} The SSIM of two images $\mathbf{x}$ and $\mathbf{y}$ is defined by the combination $f(\cdot)$ of three components as follows:^{8}

$$\mathrm{SSIM}(\mathbf{x},\mathbf{y})=f[l(\mathbf{x},\mathbf{y}),c(\mathbf{x},\mathbf{y}),s(\mathbf{x},\mathbf{y})],\tag{3}$$

$$l(\mathbf{x},\mathbf{y})=\frac{2{\mu}_{x}{\mu}_{y}+{C}_{1}}{{\mu}_{x}^{2}+{\mu}_{y}^{2}+{C}_{1}},\tag{4}$$

$$c(\mathbf{x},\mathbf{y})=\frac{2{\sigma}_{x}{\sigma}_{y}+{C}_{2}}{{\sigma}_{x}^{2}+{\sigma}_{y}^{2}+{C}_{2}}.\tag{5}$$

The combination of all comparisons between two images $\mathbf{x}$ and $\mathbf{y}$ is

$$\mathrm{SSIM}(\mathbf{x},\mathbf{y})={[l(\mathbf{x},\mathbf{y})]}^{\alpha}\cdot{[c(\mathbf{x},\mathbf{y})]}^{\beta}\cdot{[s(\mathbf{x},\mathbf{y})]}^{\gamma}.\tag{10}$$

Setting $\alpha=\beta=\gamma=1$ and ${C}_{3}={C}_{2}/2$^{8,21} results in a specific form of the SSIM index as follows:

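$$\mathrm{SSIM}(\mathbf{x},\mathbf{y})=\frac{(2{\mu}_{x}{\mu}_{y}+{C}_{1})(2{\sigma}_{xy}+{C}_{2})}{({\mu}_{x}^{2}+{\mu}_{y}^{2}+{C}_{1})({\sigma}_{x}^{2}+{\sigma}_{y}^{2}+{C}_{2})}.\tag{11}$$

To obtain a single overall quality measure of the entire image, a mean SSIM (MSSIM) index is used:

$$\mathrm{MSSIM}(\mathbf{X},\mathbf{Y})=\frac{1}{M}\sum _{i=1}^{M}\mathrm{SSIM}({\mathbf{x}}_{i},{\mathbf{y}}_{i}),\tag{12}$$

where ${\mathbf{x}}_{i}$ and ${\mathbf{y}}_{i}$ are the image contents in the $i$'th local window and $M$ is the number of local windows.^{8} MSSIM can be interpreted as the mean value of the SSIM index map.^{23} SSIM values are at most 1 and are non-negative whenever the local covariance ${\sigma}_{xy}$ is non-negative, so SSIM, and hence MSSIM, typically lies in [0, 1].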
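The relationship between the three comparison functions and the closed form in Eq. (11) can be checked numerically. The sketch below is a simplified illustration, not the authors' implementation. It assumes the standard structure comparison from Ref. 8, $s=({\sigma}_{xy}+{C}_{3})/({\sigma}_{x}{\sigma}_{y}+{C}_{3})$, which the text cites as Eq. (6) but which is not reproduced in this section. It also assumes unweighted local statistics and non-overlapping tiles in place of a sliding Gaussian window. All function names are hypothetical.

```python
import numpy as np

# Constants used later in the experiments for 8-bit images (following Ref. 8).
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2
C3 = C2 / 2


def _stats(x, y):
    """Unweighted means, standard deviations, and covariance of two patches."""
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    mu_x, mu_y = x.mean(), y.mean()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    return mu_x, mu_y, x.std(), y.std(), cov_xy


def ssim_components(x, y):
    """Luminance, contrast, and structure comparisons for one patch pair."""
    mu_x, mu_y, sd_x, sd_y, cov_xy = _stats(x, y)
    l = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)  # Eq. (4)
    c = (2 * sd_x * sd_y + C2) / (sd_x ** 2 + sd_y ** 2 + C2)  # Eq. (5)
    s = (cov_xy + C3) / (sd_x * sd_y + C3)  # assumed standard form from Ref. 8
    return l, c, s


def ssim_patch(x, y):
    """Closed form of Eq. (11) for one patch pair."""
    mu_x, mu_y, sd_x, sd_y, cov_xy = _stats(x, y)
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sd_x ** 2 + sd_y ** 2 + C2)
    return num / den


def mssim(X, Y, size=8):
    """Eq. (12) over non-overlapping size x size tiles (tiling is an assumption)."""
    X = np.asarray(X, dtype=np.float64)
    Y = np.asarray(Y, dtype=np.float64)
    scores = [
        ssim_patch(X[i:i + size, j:j + size], Y[i:i + size, j:j + size])
        for i in range(0, X.shape[0] - size + 1, size)
        for j in range(0, X.shape[1] - size + 1, size)
    ]
    return float(np.mean(scores))


# Synthetic 32x32 test pair: a random image and a noisy copy of it.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(32, 32)).astype(np.float64)
Y = np.clip(X + rng.normal(0.0, 10.0, size=X.shape), 0, 255)

l, c, s = ssim_components(X[:8, :8], Y[:8, :8])
# With alpha = beta = gamma = 1 and C3 = C2 / 2, the product equals Eq. (11).
print(np.isclose(l * c * s, ssim_patch(X[:8, :8], Y[:8, :8])))
print(mssim(X, X), mssim(X, Y))
```

The product form is convenient when individual components need to be inspected or replaced, which is how the structure term is examined in the next section.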

The SSIM and MSSIM can be used to measure the similarity of two images. However, they have some drawbacks as shown in Fig. 2 and Table 2. First, images filtered by a low pass filter, such as a mean filter (MF), a median filter (MedF), and JPEG compression, are evaluated as having high similarity scores. Second, images that have been slightly distorted by some geometric transformations, such as spatial translation (ST) and rotation (RT), are evaluated as having low similarity scores.

## 3. New Structural Similarity

The main component of the SSIM that causes these drawbacks is the structure comparison defined by Eq. (6). When only Eqs. (4) and (5) are combined in Eq. (3), images that are slightly geometrically transformed do not receive low similarities, as shown in Fig. 3 and Table 1, where $\overline{l}(\mathbf{x},\mathbf{y})$, $\overline{c}(\mathbf{x},\mathbf{y})$, and $\overline{s}(\mathbf{x},\mathbf{y})$ are the means of $l(\mathbf{x},\mathbf{y})$ in Eq. (4), $c(\mathbf{x},\mathbf{y})$ in Eq. (5), and $s(\mathbf{x},\mathbf{y})$ in Eq. (6), respectively. In Table 1, $\overline{s}(\mathbf{x},\mathbf{y})$ of the ST image is very low (0.528), while that of the JPEG image is considerably higher (0.771). This example shows the limitation of SSIM: it is overly sensitive to ST, scaling, and RT.

## Table 1

Comparison of MSSIM and its components with MISSIM-S and its components for the images in Fig. 3.

Images | MSSIM | $\overline{l}(\mathbf{x},\mathbf{y})$ | $\overline{c}(\mathbf{x},\mathbf{y})$ | $\overline{s}(\mathbf{x},\mathbf{y})$ | MISSIM-S | Mean of $\tilde{s}(\mathbf{x},\mathbf{y})$ | $\overline{h}(\mathbf{x},\mathbf{y})$ |
---|---|---|---|---|---|---|---|
ST | 0.500 | 0.965 | 0.898 | 0.528 | 0.660 | 0.819 | 0.825 |
JPEG | 0.706 | 0.995 | 0.917 | 0.771 | 0.640 | 0.822 | 0.822 |

To reduce the adverse effect of $s(\mathbf{x},\mathbf{y})$, we define the structure comparison in a new way as follows:

$$\tilde{s}(\mathbf{x},\mathbf{y})=\frac{(2{\sigma}_{x-}{\sigma}_{y-}+{C}_{2})(2{\sigma}_{x+}{\sigma}_{y+}+{C}_{2})}{({\sigma}_{x-}^{2}+{\sigma}_{y-}^{2}+{C}_{2})({\sigma}_{x+}^{2}+{\sigma}_{y+}^{2}+{C}_{2})},\tag{13}$$

where ${\sigma}_{x-}$ and ${\sigma}_{x+}$ are the standard deviations of the pixels of $\mathbf{x}$ having negative and positive standard scores, respectively, and ${\sigma}_{y-}$ and ${\sigma}_{y+}$ are defined likewise for $\mathbf{y}$. The original $s(\mathbf{x},\mathbf{y})$ is the correlation between the standard scores^{24} $(x-{\mu}_{x})/{\sigma}_{x}$ and $(y-{\mu}_{y})/{\sigma}_{y}$. However, we define $\tilde{s}(\mathbf{x},\mathbf{y})$ as the correlation between standard deviations for pixels having positive/negative standard scores because ${\sigma}_{x-}$ and ${\sigma}_{x+}$ can represent the structure of objects by dividing them into locally darker and brighter regions. As shown in Fig. 3 and Table 1, the adverse effect of $s(\mathbf{x},\mathbf{y})$ is reduced compared to the original SSIM; however, the similarity of the ST image is still lower than that of the JPEG image. That is to say, the SSIM still overestimates blurred images even when $\tilde{s}$ is used as the structure comparison. Therefore, we add a new component, the sharpness comparison $h(\mathbf{x},\mathbf{y})$, which is the correlation between the normalized digital Laplacians, defined as

$$h(\mathbf{x},\mathbf{y})=\frac{2|{\nabla}^{2}\mathbf{x}||{\nabla}^{2}\mathbf{y}|+{C}_{2}}{{|{\nabla}^{2}\mathbf{x}|}^{2}+{|{\nabla}^{2}\mathbf{y}|}^{2}+{C}_{2}}.\tag{14}$$

The new similarity components $\tilde{s}(\mathbf{x},\mathbf{y})$ and $h(\mathbf{x},\mathbf{y})$ satisfy the following properties of a similarity measure:

1. Symmetry: $S(\mathbf{x},\mathbf{y})=S(\mathbf{y},\mathbf{x})$;

2. Boundedness: $S(\mathbf{x},\mathbf{y})\le 1$;

3. Unique maximum: $S(\mathbf{x},\mathbf{y})=1$, if and only if $\mathbf{x}=\mathbf{y}$.
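The two new components can be sketched for a single pair of patches as follows. This is a rough illustration under several assumptions that the text does not settle: pixels are segmented by whether they lie below or at-or-above the patch mean, each group's standard deviation is taken about its own mean, the digital Laplacian is the unnormalized 4-neighbour kernel, and no Gaussian weighting is applied. The function names are illustrative only.

```python
import numpy as np

C2 = (0.03 * 255) ** 2


def _split_std(p):
    """Standard deviations of the pixels below and at-or-above the patch mean."""
    p = np.asarray(p, dtype=np.float64).ravel()
    mean = p.mean()
    lower, upper = p[p < mean], p[p >= mean]
    sd_minus = lower.std() if lower.size else 0.0
    sd_plus = upper.std() if upper.size else 0.0
    return sd_minus, sd_plus


def s_tilde(x, y):
    """Structure comparison in the form of Eq. (13) for one patch pair."""
    xm, xp = _split_std(x)
    ym, yp = _split_std(y)
    minus = (2 * xm * ym + C2) / (xm ** 2 + ym ** 2 + C2)
    plus = (2 * xp * yp + C2) / (xp ** 2 + yp ** 2 + C2)
    return minus * plus


def abs_laplacian(img):
    """Absolute 4-neighbour digital Laplacian on interior pixels (unnormalized)."""
    f = np.asarray(img, dtype=np.float64)
    lap = (f[:-2, 1:-1] + f[2:, 1:-1] + f[1:-1, :-2] + f[1:-1, 2:]
           - 4.0 * f[1:-1, 1:-1])
    return np.abs(lap)


def h_map(x, y):
    """Pixel-wise sharpness comparison in the form of Eq. (14)."""
    gx, gy = abs_laplacian(x), abs_laplacian(y)
    return (2 * gx * gy + C2) / (gx ** 2 + gy ** 2 + C2)


# Synthetic 16x16 test pair: a random patch and a noisy copy of it.
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(16, 16)).astype(np.float64)
y = np.clip(x + rng.normal(0.0, 15.0, size=x.shape), 0, 255)

print(np.isclose(s_tilde(x, y), s_tilde(y, x)), s_tilde(x, y) <= 1.0)
print(np.isclose(s_tilde(x, x), 1.0), np.allclose(h_map(x, x), 1.0))
```

The final lines check symmetry, boundedness, and that identical inputs score 1 on one random example; they illustrate the listed properties but do not prove them in general.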

As shown in Fig. 4, the mean of $h(\mathbf{x},\mathbf{y})$ of the ST image is higher than that of the JPEG image. Finally, the improved SSIM which includes the sharpness comparison (ISSIM-S) can be defined as

$$\mathrm{ISSIM}\text{-}\mathrm{S}=l(\mathbf{x},\mathbf{y})\cdot c(\mathbf{x},\mathbf{y})\cdot\tilde{s}(\mathbf{x},\mathbf{y})\cdot h(\mathbf{x},\mathbf{y}).\tag{16}$$

To obtain a single overall quality measure of the entire image, a mean ISSIM-S (MISSIM-S) index may be used, defined analogously to Eq. (12) as

$$\mathrm{MISSIM}\text{-}\mathrm{S}(\mathbf{X},\mathbf{Y})=\frac{1}{M}\sum _{i=1}^{M}\mathrm{ISSIM}\text{-}\mathrm{S}({\mathbf{x}}_{i},{\mathbf{y}}_{i}).$$

The values of ISSIM-S and MISSIM-S are also in [0, 1], and values closer to 1 indicate higher similarity.

## 4. Experimental Results

To evaluate the proposed similarity metric against the PSNR and the SSIM, we tested several distorted images, shown in Fig. 2. In this test, we used an $11\times 11$ circular-symmetric Gaussian weighting function with a standard deviation of 1.5, normalized to unit sum. The constants were selected to be ${C}_{1}={(0.01\cdot 255)}^{2}$, ${C}_{2}={(0.03\cdot 255)}^{2}$, and ${C}_{3}={C}_{2}/2$, as was done in Ref. 8. These values seem somewhat arbitrary, but Wang et al. found in their experiments that the performance of the SSIM index algorithm is fairly insensitive to variations of these values.
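The window and constants described above can be written down directly. The sketch below builds the $11\times 11$ Gaussian weights and evaluates Eq. (11) at a single window position using weighted local statistics. It is an illustrative sketch only: the helper names `gaussian_window` and `ssim_at` are hypothetical, and the use of weighted means, variances, and covariance follows the usual practice of Ref. 8 rather than anything stated explicitly in this section.

```python
import numpy as np

C1 = (0.01 * 255) ** 2  # about 6.5025
C2 = (0.03 * 255) ** 2  # about 58.5225
C3 = C2 / 2


def gaussian_window(size=11, sigma=1.5):
    """Circular-symmetric Gaussian weights normalized to unit sum."""
    ax = np.arange(size, dtype=np.float64) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    w = np.outer(g, g)  # separable product gives exp(-(u^2 + v^2) / (2 sigma^2))
    return w / w.sum()


def ssim_at(x, y, w):
    """Eq. (11) for one window position, using w as the local weights."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mu_x, mu_y = np.sum(w * x), np.sum(w * y)
    var_x = np.sum(w * (x - mu_x) ** 2)
    var_y = np.sum(w * (y - mu_y) ** 2)
    cov_xy = np.sum(w * (x - mu_x) * (y - mu_y))
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den


w = gaussian_window()
rng = np.random.default_rng(1)
p = rng.integers(0, 256, size=(11, 11)).astype(np.float64)
q = np.clip(p + 20.0, 0, 255)  # mean luminance shift with clipping at 255
print(w.shape, w.sum(), ssim_at(p, p, w), ssim_at(p, q, w))
```

Sliding this window over every pixel position and averaging the resulting values gives the MSSIM of Eq. (12).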

The local variances of the original and the histogram-equalized images differ considerably because histogram equalization (HE) is a nonlinear intensity transform. Nevertheless, the SSIM assigns the HE image a high similarity score, whereas our new metric assigns it a lower one. The ISSIM-S values of the images filtered by low-pass filters, such as MF, MedF, and JPEG compression, are also lower than the corresponding SSIM values. In addition, the ISSIM-S values of images that have been slightly geometrically transformed by ST and RT are higher than the SSIM values. For the mean luminance shifting (MLS) and impulsive noise (IN) images, the SSIM and the ISSIM-S give close but not necessarily identical values (Table 2).

To compare the index maps of the SSIM and the ISSIM-S, the results for HE, MedF, JPEG, and MF are shown in Fig. 5. The pixel values of each index map are the normalized SSIM or ISSIM-S values. The index maps differ, and those of the ISSIM-S are darker than those of the SSIM because the MISSIM-S values are lower than the MSSIM values. In contrast, the index maps of the ISSIM-S for IN, ST, and RT are brighter than those of the SSIM because the ISSIM-S similarities are higher than the SSIM similarities, as shown in Fig. 6. The index maps for MLS are very similar, as shown in Fig. 7.

To compare the metrics with mean opinion scores (MOSs), the ranks of the PSNR, the mean SSIM, the mean ISSIM-S, and the MOS are shown in Table 2. To measure MOSs, we showed subjects each processed image together with the original image and collected their opinion scores, which range from 1 (not similar) to 5 (very similar). Each comparison was made one-on-one with the original image, and we randomized the order of the distorted images to minimize order effects. There were 17 test subjects, none of whom had any vision problems. The experiments were conducted under controlled illumination and display conditions.

## Table 2

Comparison of the PSNR, mean of the SSIM, mean of the ISSIM-S, and MOS rank of “Lena” image (the rank for each metric is shown in parentheses).

Images | PSNR (dB) | Mean of the SSIM | Mean of the ISSIM-S | MOS | MOS rank |
---|---|---|---|---|---|
HE | 16.781 (6) | 0.908 (1) | 0.766 (5) | 3.182 | 4 |
MLS | 15.879 (8) | 0.901 (2) | 0.901 (1) | 3.273 | 3 |
MedF | 25.757 (3) | 0.785 (5) | 0.693 (6) | 1.636 | 6 |
IN | 16.098 (7) | 0.297 (8) | 0.313 (8) | 1.545 | 7 |
JPEG | 27.293 (1) | 0.805 (4) | 0.773 (4) | 2.818 | 5 |
MF | 23.888 (4) | 0.711 (7) | 0.623 (7) | 1.273 | 8 |
ST | 25.912 (2) | 0.832 (3) | 0.871 (2) | 5.000 | 1 |
RT | 23.474 (5) | 0.759 (6) | 0.832 (3) | 4.909 | 2 |
$\rho$ | 0.048 | 0.595 | 0.881 | — | — |

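The scores themselves are subjective and not convincing in absolute terms, but they are meaningful in relative comparison. Therefore, we used MOS ranks instead of the MOS itself. The rank correlations with the MOS rank are also shown, where the rank correlation is computed by Spearman’s rank correlation coefficient ($\rho$),^{25} which is defined as follows:

$$\rho =1-\frac{6\sum _{i=1}^{n}{d}_{i}^{2}}{n({n}^{2}-1)},$$

where ${d}_{i}$ is the difference between the two ranks of the $i$'th image and $n$ is the number of images.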
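The rank correlations in Table 2 can be reproduced from the listed ranks using the standard no-ties form of Spearman's coefficient. The short sketch below does this in plain Python; the function name `spearman_rho` is illustrative, and the rank lists are copied from the parenthesized ranks and the MOS rank column of Table 2.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's coefficient for two rankings without ties:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("rankings must have the same length")
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))


# Ranks from Table 2, in row order HE, MLS, MedF, IN, JPEG, MF, ST, RT.
mos = [4, 3, 6, 7, 5, 8, 1, 2]
psnr = [6, 8, 3, 7, 1, 4, 2, 5]
ssim = [1, 2, 5, 8, 4, 7, 3, 6]
issim_s = [5, 1, 6, 8, 4, 7, 2, 3]

for name, ranks in [("PSNR", psnr), ("SSIM", ssim), ("ISSIM-S", issim_s)]:
    print(name, round(spearman_rho(ranks, mos), 3))
# Prints 0.048, 0.595, and 0.881, matching the rho row of Table 2.
```

A value near 1 means the metric orders the distorted images almost exactly as the human subjects did, while a value near 0 means the two orderings are essentially unrelated.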

We also compared PSNR, SSIM, ISSIM-S, and MOS on another image, shown in Fig. 8; the results are given in Table 3. The types of distortion are exactly the same as those in Table 2; the only difference is the filter size. The test images in Table 2 have a resolution of $256\times 256$ and the filter size is $11\times 11$, whereas the test images in Fig. 8 have a resolution of $128\times 128$, so we set the filter size to $5\times 5$.

## Table 3

Comparison of the PSNR, mean of the SSIM, mean of the ISSIM-S, and MOS rank of “Einstein” image (the rank for each metric is shown in parentheses).

Images | PSNR (dB) | Mean of the SSIM | Mean of the ISSIM-S | MOS | MOS rank |
---|---|---|---|---|---|
HE | 23.278 (3) | 0.924 (2) | 0.795 (2) | 3.583 | 4 |
MLS | 23.028 (6) | 0.986 (1) | 0.986 (1) | 4.417 | 2 |
MedF | 29.536 (1) | 0.827 (3) | 0.746 (4) | 2.333 | 7 |
IN | 23.174 (5) | 0.781 (5) | 0.787 (3) | 2.917 | 5 |
JPEG | 23.237 (4) | 0.557 (6) | 0.447 (8) | 1.000 | 8 |
MF | 27.286 (2) | 0.790 (4) | 0.654 (7) | 2.417 | 6 |
ST | 18.729 (8) | 0.393 (8) | 0.680 (6) | 4.833 | 1 |
RT | 20.517 (7) | 0.555 (7) | 0.704 (5) | 4.333 | 3 |
$\rho$ | $-0.643$ | $-0.119$ | 0.429 | — | — |

To evaluate the performance at different distortion levels, we tested a few more images: images blurred with MFs of different sizes, images with various degrees of loss from JPEG compression, and images translated by different amounts (ST), as shown in Fig. 9 and Table 4. As the distortion level increases, PSNR, MSSIM, and mean ISSIM-S decrease regardless of the processing type. However, PSNR and MSSIM take their lowest values in Table 4 for the image translated by only 3 pixels along the $y$ axis, whereas the mean ISSIM-S does not. ISSIM-S is also affected by translation, but it is less sensitive than PSNR and SSIM.

## Table 4

Comparison of the PSNR, mean of the SSIM, and mean of the ISSIM-S for different distortion levels.

Images | PSNR (dB) | Mean of the SSIM | Mean of the ISSIM-S |
---|---|---|---|
MF ($3\times 3$) | 29.280 | 0.896 | 0.822 |
MF ($5\times 5$) | 25.725 | 0.792 | 0.697 |
MF ($7\times 7$) | 23.888 | 0.711 | 0.623 |
JPEG (20) | 29.936 | 0.871 | 0.844 |
JPEG (10) | 27.701 | 0.806 | 0.772 |
JPEG (5) | 25.109 | 0.706 | 0.640 |
ST (1) | 25.912 | 0.832 | 0.871 |
ST (2) | 21.881 | 0.690 | 0.806 |
ST (3) | 20.060 | 0.607 | 0.754 |

We conducted two additional experiments. First, a comparison of ST, MF, and JPEG compression for various scene contents is shown in Fig. 10 and Table 5. The resolution of the images tested in this experiment is $256\times 256$. For each image, the PSNR and the mean SSIM are ordered as $\mathrm{ST}<\mathrm{MF}<\mathrm{JPEG}$. However, the mean ISSIM-S shows another pattern, $\mathrm{MF}<\mathrm{JPEG}<\mathrm{ST}$, which is more reasonable than the order given by PSNR or SSIM. This result shows that the proposed image quality assessment method does not overestimate blurred images and is much less sensitive to geometric transformations, which were the identified drawbacks of SSIM. Second, as shown in Fig. 11 and Table 6, we compared the PSNR, the mean SSIM, and the mean ISSIM-S for various combinations of degradations. The excessive sensitivity of SSIM to geometric translation is also apparent when degradations are combined. The results further show that MSSIM overvalues HE+IN, while MISSIM-S gives a more moderate score. This means that MISSIM-S is much closer to the HVS because, like the HVS, it is less sensitive to a small amount of geometric translation.

## Table 5

Comparison of the PSNR, mean of the SSIM, and mean of the ISSIM-S for different scene contents.

Images | PSNR (dB) | Mean of the SSIM | Mean of the ISSIM-S |
---|---|---|---|
Goldhill (ST) | 21.865 | 0.489 | 0.756 |
Goldhill (MF) | 24.191 | 0.535 | 0.388 |
Goldhill (JPEG) | 26.397 | 0.701 | 0.694 |
Boat (ST) | 19.550 | 0.423 | 0.740 |
Boat (MF) | 22.072 | 0.513 | 0.383 |
Boat (JPEG) | 25.062 | 0.704 | 0.681 |
Airplane (ST) | 20.141 | 0.664 | 0.801 |
Airplane (MF) | 21.962 | 0.675 | 0.573 |
Airplane (JPEG) | 26.356 | 0.805 | 0.751 |
House (ST) | 24.755 | 0.676 | 0.839 |
House (MF) | 26.362 | 0.766 | 0.636 |
House (JPEG) | 30.557 | 0.825 | 0.759 |

## Table 6

Comparison of the PSNR, mean of the SSIM, and mean of the ISSIM-S for various combinations of degradations.

Images | PSNR (dB) | Mean of the SSIM | Mean of the ISSIM-S |
---|---|---|---|
Goldhill (HE + IN) | 11.581 | 0.409 | 0.235 |
Goldhill (ST + HE) | 11.151 | 0.325 | 0.253 |
Goldhill (IN + ST) | 17.659 | 0.390 | 0.490 |
Boat (HE + IN) | 15.982 | 0.538 | 0.306 |
Boat (ST + HE) | 14.300 | 0.250 | 0.329 |
Boat (IN + ST) | 17.639 | 0.276 | 0.535 |
Airplane (HE + IN) | 16.413 | 0.604 | 0.389 |
Airplane (ST + HE) | 15.550 | 0.372 | 0.447 |
Airplane (IN + ST) | 18.905 | 0.322 | 0.555 |
House (HE + IN) | 16.378 | 0.394 | 0.185 |
House (ST + HE) | 16.363 | 0.275 | 0.239 |
House (IN + ST) | 20.125 | 0.365 | 0.439 |

In addition, we tested how MSSIM and MISSIM-S vary with the size of the Gaussian window, as shown in Fig. 12. The $11\times 11$ window is large enough because the variations are very small when the window size is larger than 11.

## 5. Conclusion

In this paper, we have proposed an improved structural similarity metric that uses structure and sharpness comparison functions to overcome the drawbacks of the SSIM metric. The structure comparison uses standard deviations segmented by the mean, and the sharpness comparison uses the normalized digital Laplacian. The proposed metric assigns high similarities to geometrically transformed images and does not overestimate blurred images, such as JPEG-compressed images. The experimental results indicate that our similarity metric is superior to existing methods with respect to the quality perceived by humans. Therefore, our method can be used to evaluate the performance of various methods such as image enhancement, frame rate upconversion, image compression, super-resolution, and image restoration.

## Acknowledgments

This research was partly supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2015R1D1A1A01059091), and Institute for Information and communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. B0101-16-0033, Research and Development of 5G Mobile Communications Technologies using CCN-based Multi-dimensional Scalability).

## References

## Biography

**Daeho Lee** received his MS and PhD degrees in electronics engineering from Kyung Hee University, Republic of Korea, in 2001 and 2005, respectively. He has been an associate professor in the Humanities College at Kyung Hee University, Republic of Korea, since 2005. His research interests include computer vision, pattern recognition, machine learning, image processing, image fusion, 3-D image reconstruction, computer games, ITS, HCI, electrical impedance tomography analysis, and digital signal processing.

**Sungsoo Lim** received his BS degrees in electronics and radio engineering and biomedical engineering and his MS degree in electronics and radio engineering from Kyung Hee University, Republic of Korea, in 2014 and 2016. He is currently pursuing his PhD in electronic engineering at the Kyung Hee University. His research interests include computer vision, image processing, intelligent transportation systems (ITS), human computer interaction (HCI), and medical image processing.