## 1.

## Introduction

Models of human visual perception are an important component of image compression, rendering, retargeting, and editing. A typical application is prediction of differences in image pairs or detection of salient regions. Such predictions are based on the perception of luminance patterns alone and ignore that a difference might also be well explained by a transformation. As an example, the Hamming distance of the binary strings 1010 and 0101 is the same as between 1111 and 0000; however, the first pair is more similar in the sense of an edit distance, as 1010 is just a rotated, i.e., transformed version of 0101. We apply this idea to images, e.g., comparing an image and its rotated copy.

In current models of visual perception, transformation is not represented, leading to several difficulties. For image similarity or quality evaluation approaches, it is typically assumed that the image pair is perfectly aligned (registered), which is directly granted in image compression, restoration, denoising, broadcasting, and rendering. However, in many other applications, such as visual equivalence judgement,^{1} comparison of rendered and photographed scenes,^{2} rephotography,^{3} or image retargeting,^{4} the similarity of images should be judged in the presence of distortions caused by transformations. Ecologically valid transformation^{5} is a nonstructural distortion^{6} and as such should be separated from others. However, current image difference metrics will report images that differ by such a transformation to be very dissimilar.^{6} In the same vein, computational models of image saliency are based on luminance alone, or in the case of video, on the principle that motion has a “pop-out” effect.^{7} However, for an image pair that differs by a spatially varying transformation, some transformations might be more salient, not because they are stronger, but because they are distinct from others. Finally, motion parallax is compensated for easily and not perceived as a distortion but as a depth cue (Ref. 8, Ch. 28). We will show that all the difficulties in predicting the perception of transformed images can be overcome by an explicit model of human perception of transformations such as we propose.

In this work, we assume the optical flow^{5} of an image pair to be given, either by producing it using three-dimensional (3-D) graphics or (typically with a lower precision) using computer vision techniques and focus on how the human visual system (HVS) represents transformations. We decompose the flow field into a field of elementary transformations,^{9} a process that is likely to also happen in the dorsal visual pathway of the primate brain.^{10} From this representation, we can model the effect of transformations on the perception of images. For comparing images, strong or incoherent transformations generally make the perception of differences increasingly difficult. We model this effect using a measure of transformation entropy. When given an image pair that differs by a transformation, we predict where humans will perceive differences and where not (Fig. 1). Using our representation, we can compare transformations and predict which transformations are salient compared to others. Finally, spatially varying transformations result in motion parallax, which can serve as a depth cue.

In this work, we make the following contributions:

## 2.

## Background

In this section, we review the perceptual background of our approach. We recall the idea of mental transformation and its relation to optical flow and saliency, as well as the basics of entropy in human psychology. Previous work on the two main applications we propose (image differences and saliency) is discussed in Sec. 4.

## 2.1.

### Mental Transformation

Mental transformations of space play an important role in everyday tasks such as object recognition, spatial orientation, and motion planning. Such transformations involve both objects in the space as well as the egocentric observer position. Mental rotation is the best understood mental transformation,^{11} where the time required to judge the similarity between a pair of differently oriented objects of arbitrary shape is proportional to the rotation angle both for two-dimensional (2-D) (image plane) and 3-D rotations, irrespective of the chosen rotation axis. Similar observations have been made for the size scaling and translation (in depth),^{12} where the reaction time is shorter than for rotation. Moreover, in combined scaling and rotation^{13} as well as translation and rotation^{12} the response time is additive with respect to each component transformation. This may suggest that there are independent routines for such elementary transformations, which jointly form a procedure for handling any sort of movement that preserves the rigid structure of an object.^{12} Another observation is that the mental transformation passes through a continuous trajectory of intermediate object positions, not just the beginning and end positions.^{14}

A more advanced mental transformation is perspective transformation.^{15} From our own experience, we know that observing a cinema screen from a moderately off-angle perspective does not reduce perceived quality, even if the retinal image underwent a strong transformation. One explanation for this ability is that humans compensate for the perspective transformation by mentally inverting it.^{16}

Apparent motion in the forward and backward directions is induced when two 3-D-transformed (e.g., rotated) copies of the same object are presented alternately at proper rates. As the transformational distance (e.g., rotation angle) increases, the alternation rate must be reduced to maintain the motion illusion. Again, this observation strongly suggests that the underlying transformations require time to go through intermediate stages, that a 3-D representation is utilized internally,^{17} and that elementary transformations are individually sequential-additive.^{18}

The HVS is able to recover depth and rigid 3-D structure from two views (e.g., binocular vision and apparent motion) irrespective of whether perspective or orthographic projection is used, and adding more views has little impact.^{19} This indicates that the HVS might use some perceptual heuristics to derive such information, as the structure-from-motion theorem stipulates that at least three views are needed in the case of orthographic projection (or under weak perspective).^{20}

The 3-D internal representation in the HVS and the rigidity hypothesis in correspondence finding while tracking moving objects are still a matter of scientific debate. Eagle et al.^{21} have found a preference toward translation in explaining competing motion transformations in a two-frame sequence, with little regard for the projective shape transformations.

## 2.2.

### Optical Flow

The idea of optical flow dates back to Gibson^{5} and has become an essential part of computer vision and graphics where it is mostly formalized as a 2-D vector field that maps locations in one image to locations in a second image, taken from a different point in time or space. Beyond the mapping from points to points, Koenderink^{9} conducted a theoretical analysis of elementary transformations, such as expansion/contraction (radial motion), rotation (circular motion), and shear (two-component deformation), which can be combined with translation into a general affine transformation. Such transformations map differential area to differential area. Electrophysiological recordings have shown that specialized cells in the primate brain are selective for each elementary transformation component alone or combined with translation^{10} (refer also to Ref. 8, Ch. 5.8.4). A spatially varying optical flow field does not imply a spatially varying field of transformations: a global rotation that has small displacements in the center and larger displacements in the periphery can serve as an example. For this reason, our perceptual model operates on a field of elementary transformations computed from homographies instead of a dense optical flow. Homography estimation is commonly used in video-based 3-D scene analysis, and the best results are obtained when multiple views are considered.^{20}

In computer graphics, the use of elementary transformation fields is rare, with the exception of video stabilization and shape modeling. In video stabilization, spatially varying warps of handheld video frames into their stabilized version are performed with a desired camera path. Typically a globally reconstructed homography is applied to the input frame, before the optimization-driven local warping is performed,^{22} which is conceptually similar to our local homography decomposition step (Sec. 3.2). Notably, the concept of subspace stabilization^{23} constructs a lower-dimensional subspace of the 2-D flow field, i.e., a space with a lower number of different flows, i.e., lower entropy. In shape modeling, flow fields are decomposed into elementary transformations to remove all but the desired transformations, i.e., to remain as-rigid-as-possible when seeking to preserve only rotation.^{24}

## 2.3.

### Visual Attention

Moving objects and “pop-out” effects are strong attractors of visual attention.^{7} The classic visual attention model proposed by Itti et al.,^{25} apart from the common neuronal features, such as intensity contrast, color contrast, and pattern orientation, can also handle four oriented motion energies (up, down, left, and right). In contrast, in our work, we detect saliency of motion that pops out not just because it is present while the rest is static, but because it is different from other motion in the scene, such as many rotating objects where one rotates differently. As humans understand motion in the form of elementary transformations,^{10} our analysis is needed to find those differences.

## 2.4.

### Motion Parallax

For a moving viewer, objects at one depth undergo a different flow than objects in their spatial neighborhood at different depth. This effect, called “motion parallax,” is both a depth cue^{26} and a grouping Gestalt cue.^{27} In relation with translation and the four elementary transformations over the flow field as derived by Koenderink,^{9} the corresponding motion parallax components can be distinguished: linear motion, expansion or contraction, rotation, shear-deformation, and compression- or expansion-deformation parallax (Ref. 8, Ch. 28.1.3). We propose a motion parallax measure, which approximates each of those components, although in our applications we found that the linear motion parallax plays the key role.

## 2.5.

### Entropy

Information entropy is a measure of complexity in the sense of how much a signal is expected or not.^{28} If it is expected, the entropy is low, otherwise it is high. In our approach, we are interested in the entropy of transformations, which tells apart uniform transformations from incoherent ones, such as disassembling a puzzle. Assembling the puzzle is hard, not because the transformation is large, but because it is incoherent, i.e., it has a high entropy. This view is supported by studies of human task performance:^{29}^{,}^{30} Sorting cards with a low entropy layout can be performed faster than sorting with high entropy. In computer graphics, entropy of luminance is used for the purpose of alignment,^{31} best-view selection,^{32} light source placement,^{33} and feature detection but was not yet applied to transformations.

## 3.

## Our Approach

## 3.1.

### Overview

Our system consists of two layers (Fig. 2): a model layer, described in this section, and an application layer, described in Sec. 4. The input to the model layer is two images, where the second image differs from the first one by a known mapping, which is assumed to be available as a spatially dense optical flow field. This requires either the use of optical flow algorithms that support large displacements^{34} and complex mappings^{23} or the use of computer-generated optical flow (a.k.a. motion field). The output of our method is a field of perceptually scaled elementary transformations and a field of transformation entropy, ready to be used in different applications.

Our approach proceeds as follows (Fig. 2). In the first step (Sec. 3.2), we convert the optical flow field that maps positions to positions into an overcomplete field of local homographies, describing how a differential patch from one image is mapped to the other image. While classic flow only captures translation, the field of homographies also captures effects such as rotation, scaling, shear, and perspective. Next, we factor the local homographies into “elementary” translation, scaling, rotation, shear, and perspective transformations (Sec. 3.3). Also, we compute the local entropy of the transformation field, i.e., how difficult it is to understand the transformation (Sec. 3.4). Finally, the magnitude of elementary transformations is mapped to scalar perceptual units, such that the same value indicates roughly the same sensitivity (Sec. 3.5).

Using the information above allows for several applications. Most importantly, we propose an image difference metric (Sec. 4.1) that is transformation-aware. We model the threshold elevation, which determines how much the smallest perceivable difference between two images increases as a function of transformation strength and complexity, i.e., entropy. The second application is a visual attention model that can detect what transformations are salient (Sec. 4.2). Finally, the amount of perceived parallax in the image pair can be computed from the above information (Sec. 4.3).

## 3.2.

### Homography Estimation

Input is two images with luminances ${g}_{1}$ and ${g}_{2}(\mathbf{x})\in {\mathbb{R}}^{2}\to \mathbb{R}$ and $\mathbf{x}$ as spatial location as well as a flow $f(\mathbf{x})\in {\mathbb{R}}^{2}\to {\mathbb{R}}^{2}$ from ${g}_{1}$ to ${g}_{2}$. First, the flow field is converted into a field of homography transformations.^{20} A homography maps a differential image patch into the second image while optical flow maps single pixel positions to other pixel positions. In human vision research, this Helmholtz decomposition was conceptually proposed by Koenderink^{9} and later confirmed by physiological evidence.^{10} Examples of homographies are shown in Fig. 3(a). In our case, homographies are 2-D projective $3\times 3$ matrices. While $2\times 3$ matrices can express translation, rotation, and scaling, the perspective component allows for perspective foreshortening.

We estimate a field of homographies, i.e., a map that describes for every pixel where its surrounding patch is going. We compute this field $\mathbf{M}(\mathbf{x})\in {\mathbb{R}}^{2}\to {\mathbb{R}}^{3\times 3}$ by solving a motion discontinuity-aware moving least-squares problem for every pixel using a normalized eight-point algorithm.^{35} The best transformation $\mathbf{M}(\mathbf{x})$ in the least squares sense minimizes

## (1)

$${\int}_{{\mathbb{R}}^{2}}w(\mathbf{x},\mathbf{y}){\left\Vert f(\mathbf{y})-\phi \left[\mathbf{M}(\mathbf{x})\left(\begin{array}{c}\mathbf{y}\\ 1\end{array}\right)\right]\right\Vert}_{2}^{2}\,\mathrm{d}\mathbf{y},$$

where

## (2)

$$w(\mathbf{x},\mathbf{y})=\mathrm{exp}(-{\Vert \mathbf{x}-\mathbf{y}\Vert}_{2}^{2}/{\sigma}_{\mathrm{d}})\,\mathrm{exp}(-{\Vert f(\mathbf{x})-f(\mathbf{y})\Vert}_{2}^{2}/{\sigma}_{\mathrm{r}})$$

is a weighting function^{36} that accounts more for locations that are spatially close (domain weight) and have a similar flow (range weight). The parameters ${\sigma}_{\mathrm{r}}$ and ${\sigma}_{\mathrm{d}}$ control the locality of the weight. The range weighting ensures that different flows are not mixed into one wrong estimate of the homography but are kept separate [Fig. 3(b)], resulting in a pixel-accurate, edge-aware field.

In the discrete case of Eq. (1), for pixel $\mathbf{x}$ we find one $\mathbf{M}$ that minimizes

## (3)

$$\sum _{i\in \mathcal{N}}{w}_{i}{\left\Vert {\mathbf{f}}^{i}-\phi \left[\mathbf{M}\left(\begin{array}{c}{\mathbf{y}}^{i}\\ 1\end{array}\right)\right]\right\Vert}_{2}^{2}.$$

We solve this as a homogeneous linear least-squares problem of the form $\mathbf{B}\mathbf{m}=0$. For one flow direction ${\mathbf{f}}^{i}$ at position ${\mathbf{y}}^{i}$ and a matrix $\mathbf{M}$, we require

^{35}

The procedure is similar to the fitting of a single homography in computer vision.^{20} Our setting is more general, as the flow field is not explained by a single rigid camera motion, so one homography has to be found for each pixel. To ensure a consistent and piecewise smooth output, we combine a regularizing smooth kernel with an edge-aware component $w$ [Eq. (2)].
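The per-pixel fit can be illustrated with a minimal sketch of a weighted direct linear transform (DLT). Several points are assumptions of this sketch rather than statements about our implementation: $f(\mathbf{y})$ is taken to be the position that $\mathbf{y}$ maps to, $\phi$ is taken to be the division by the third homogeneous coordinate, the coordinate normalization of the eight-point algorithm is omitted, and the DLT minimizes an algebraic error instead of the geometric error of Eq. (3). The function names `bilateral_weights` and `fit_local_homography` are hypothetical.

```python
import numpy as np

def bilateral_weights(x, fx, ys, fs, sigma_d, sigma_r):
    """Eq. (2): spatial (domain) weight times flow-similarity (range) weight."""
    domain = np.exp(-np.sum((ys - x) ** 2, axis=1) / sigma_d)
    rng = np.exp(-np.sum((fs - fx) ** 2, axis=1) / sigma_r)
    return domain * rng

def fit_local_homography(ys, fs, w):
    """Weighted DLT: M (up to scale) such that phi(M [y, 1]) is close to f."""
    rows = []
    for (x, y), (u, v), wi in zip(ys, fs, w):
        s = np.sqrt(wi)  # squared residuals are then weighted by w_i
        rows.append(s * np.array([x, y, 1, 0, 0, 0, -u * x, -u * y, -u]))
        rows.append(s * np.array([0, 0, 0, x, y, 1, -v * x, -v * y, -v]))
    B = np.asarray(rows)
    # The right singular vector of the smallest singular value minimizes
    # ||B m|| subject to ||m|| = 1, i.e., it solves B m = 0 in the least-squares sense.
    m = np.linalg.svd(B)[2][-1]
    M = m.reshape(3, 3)
    return M / M[2, 2]

# Synthetic check: a 7x7 patch of samples mapped by a known homography.
M_true = np.array([[1.10, 0.05, 2.0],
                   [-0.03, 0.95, -1.0],
                   [1e-3, -2e-3, 1.0]])
grid = np.stack(np.meshgrid(np.arange(-3, 4), np.arange(-3, 4)), -1)
ys = grid.reshape(-1, 2).astype(float)
hom = np.c_[ys, np.ones(len(ys))] @ M_true.T
fs = hom[:, :2] / hom[:, 2:]          # phi: divide by the third coordinate
center = len(ys) // 2                 # index of the sample at (0, 0)
w = bilateral_weights(ys[center], fs[center], ys, fs, sigma_d=10.0, sigma_r=10.0)
M_est = fit_local_homography(ys, fs, w)
```

On noise-free data the weights do not change the solution; they matter when the neighborhood straddles a motion discontinuity, where the range weight suppresses samples that belong to a different flow.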

We implement the entire estimation in parallel over all pixel locations using graphics hardware (GPUs) allowing us to estimate the homography field in less than 3 s for a high-definition image.

## 3.3.

### Transformation Decomposition

For perceptual scaling the per-pixel transformation $\mathbf{M}$ is decomposed into multiple elementary transformations: translation (${\mathbf{e}}_{\mathrm{t}}$), rotation (${e}_{\mathrm{r}}$), uniform scaling (${e}_{\mathrm{s}}$), aspect ratio change (${e}_{\mathrm{a}}$), shear (${\mathbf{e}}_{\mathrm{h}}$), and perspective (${\mathbf{e}}_{\mathrm{p}}$) (cf. Fig. 4). The relative difficulty of each transformation will later be determined in a perceptual experiment (Sec. 3.5).

We assume that our transformations are the result of a 2-D transformation followed by a perspective transformation. This order is arbitrary, but we chose it because it is closer to the usual understanding of transformations of 3-D objects in a 2-D world: it is more natural to imagine objects living in their (perspective) space and moving in their 3-D oriented plane before being projected to the image plane than to understand them as 2-D entities undergoing possibly complex nonlinear and nonrigid transformations in the image plane.

The decomposition happens independently for the matrix $\mathbf{M}$ at every pixel location. As $\mathbf{M}$ is unique up to a scalar, we first divide it by one element, which is chosen to be ${m}_{33}$. In the next five steps, each elementary component $\mathrm{T}$ will be found first by extracting it from $\mathbf{M}$, and then removing it from $\mathbf{M}$ by multiplying with ${\mathrm{T}}^{-1}$.

First, perspective is extracted by computing horizontal and vertical focal length as ${\mathbf{d}}_{p}={({\mathbf{M}}_{\mathrm{a}}^{\mathrm{T}})}^{-1}\cdot ({m}_{31},{m}_{32},0)$ where ${\mathbf{M}}_{\mathrm{a}}$ is the affine part of $\mathbf{M}$. The multiplication removes dependency of ${\mathbf{d}}_{p}$ on other transformations in $\mathbf{M}$. To define the perceptual measure of the elementary transformation corresponding to the perspective change, we later convert the focal length into the $x$-axis and $y$-axis field of view ${\mathbf{e}}_{\mathrm{p}}=2\,\mathrm{arctan}({\mathbf{d}}_{p}/2)$ expressed in radians. To remove the perspective from $\mathbf{M}$, we multiply it by the inverse of a pure perspective matrix in the form

Second, a 2-D vector of translation transformation ${\mathbf{e}}_{\mathrm{t}}=({m}_{13},{m}_{23})$ in visual angle degrees is found. It is removed from $\mathbf{M}$ by multiplying with the inverse of a translation matrix.

Next, we find the rotation transformation corresponding to the angle ${e}_{\mathrm{r}}=\operatorname{atan2}({m}_{21},{m}_{11})$ in radians and remove it by multiplying with an inverse rotation matrix.

2-D scaling power is recovered as ${\mathbf{d}}_{s}=\mathrm{log}({m}_{11},{m}_{22})$ and removed from the matrix. For the purpose of later perceptualization, we define a uniform scaling ${e}_{\mathrm{s}}=\mathrm{max}(|{\mathbf{d}}_{s,x}|,|{\mathbf{d}}_{s,y}|)$ and a change of aspect ratio ${e}_{\mathrm{a}}=|{\mathbf{d}}_{s,x}-{\mathbf{d}}_{s,y}|$. The assumption is that anisotropic scaling requires more effort to undo than simple isotropic size change and two separate descriptors are, therefore, needed.

The last component is shear defined by a scalar angle as ${e}_{\mathrm{h}}=\mathrm{arctan}({m}_{12})$ in radians.
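The extraction order for the affine components can be sketched as follows. This is an illustration under stated assumptions, not our implementation: the perspective step is skipped because the form of the pure perspective matrix is not reproduced here, the input is assumed to be affine with positive scale factors (no reflection), and the translation is returned in matrix units instead of visual angle degrees. The function name `decompose_affine` is hypothetical.

```python
import numpy as np

def decompose_affine(M):
    """Factor an affine 3x3 matrix as T * R * S * H and return the elementary measures."""
    M = np.asarray(M, dtype=float)
    M = M / M[2, 2]                        # M is unique only up to a scalar
    e_t = M[:2, 2].copy()                  # translation (m13, m23)
    A = M[:2, :2]                          # linear part once translation is removed
    e_r = np.arctan2(A[1, 0], A[0, 0])     # rotation angle in radians
    c, s = np.cos(e_r), np.sin(e_r)
    A = np.array([[c, s], [-s, c]]) @ A    # multiply with the inverse rotation
    d_s = np.log(np.array([A[0, 0], A[1, 1]]))  # scaling power per axis
    A = np.diag(1.0 / np.diag(A)) @ A      # remove the scaling
    e_h = np.arctan(A[0, 1])               # shear angle in radians
    return {
        "e_t": e_t,
        "e_r": e_r,
        "e_s": np.max(np.abs(d_s)),        # uniform scaling
        "e_a": abs(d_s[0] - d_s[1]),       # change of aspect ratio
        "e_h": e_h,
    }

# Compose a matrix from known parameters and recover them.
theta, sx, sy, k = 0.3, 1.5, 0.8, 0.2
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
A_true = R @ np.diag([sx, sy]) @ np.array([[1.0, k], [0.0, 1.0]])
M_demo = np.eye(3)
M_demo[:2, :2] = A_true
M_demo[:2, 2] = [4.0, -2.0]
parts = decompose_affine(M_demo)
```

Because each component is removed by left-multiplication with its inverse, the remaining matrix is upper triangular after the rotation step, which is why scaling can be read from the diagonal and shear from the off-diagonal entry.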

## 3.4.

### Transformation Field Entropy

**Definition** The ease or difficulty of dealing with a transformation depends not only on its type and magnitude but, as is often the case in human perception, also on its context. Compensating for one large coherent translation might be easy compared to compensating for many small and incoherent translations. We model this effect using the notion of transformation entropy of an area in an elementary transformation field. Transformation entropy is high if many different transformations are present in a spatial area, and it is low if the field is uniform. Note that entropy is not proportional to the magnitude of the transformations in a spatial region but to the incoherence of their probability distribution.

We define the transformation entropy $H$ of an elementary transformation at location $\mathbf{x}$ in a neighborhood $s$ using standard entropy equation as

The probability distribution $p(\omega |\mathbf{x},s)$ of elementary transformations at neighborhood $\mathbf{x}$, $s$ has to be computed using density estimation, i.e.,

where $K$ is an appropriate kernel such as the Gaussian and $t(\mathbf{x})\in \mathrm{\Omega}$ is a field of elementary transformations of the same type as $\omega $. Depending on the size of the neighborhood $s$, entropy is more or less localized. If the neighborhood size is varied, entropy changes as well, resulting in a scale space of entropy,^{37} studied in computer vision as an image structure representation. For our purpose, we pick the entropy scale space maximum as the local entropy of each pixel and disregard the scale at which it occurred. The difficulty of transformations was found to sum linearly.^{11} For this reason, we sum the entropy of all elementary transformations into a single scalar entropy value.
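For reference, the two definitions described at the beginning of this section can be sketched as follows. The first expression is the standard entropy of the distribution $p(\omega |\mathbf{x},s)$; the second is a kernel density estimate in which the normalization constant $Z$ and the neighborhood symbol $\mathcal{N}(\mathbf{x},s)$ are assumptions of this sketch, chosen to agree with the discrete form used in the implementation.

$$H(\mathbf{x},s)=-{\int}_{\mathrm{\Omega}}p(\omega |\mathbf{x},s)\,\mathrm{log}\,p(\omega |\mathbf{x},s)\,\mathrm{d}\omega ,\qquad p(\omega |\mathbf{x},s)=\frac{1}{Z}{\int}_{\mathcal{N}(\mathbf{x},s)}K[t(\mathbf{y})-\omega ]\,\mathrm{d}\mathbf{y}.$$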

The decomposition into elementary transformations is the key to the successful computation of entropy; without it, a rotation field would result in a flat histogram, as all flow directions are present. This would indicate a high entropy, which is wrong. Instead, the HVS would explain the observation using very little information: a single rotation with low entropy.
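The rotation example can be checked numerically with a small sketch. It compares the histogram entropy of the flow directions of a global rotation with the entropy of the corresponding rotation-angle field. The grid size, the 16 bins, and the 20 deg rotation are arbitrary choices for illustration, and a plain histogram is used in place of the kernel density estimate.

```python
import numpy as np

def shannon_entropy(values, bins, value_range):
    """Entropy in bits of a plain histogram of the given samples."""
    hist, _ = np.histogram(values, bins=bins, range=value_range)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

theta = np.deg2rad(20.0)
xs, ys = np.meshgrid(np.linspace(-1, 1, 32), np.linspace(-1, 1, 32))

# Displacement of every grid point under a global rotation about the origin.
c, s = np.cos(theta), np.sin(theta)
u = c * xs - s * ys - xs
v = s * xs + c * ys - ys

flow_direction = np.arctan2(v, u).ravel()   # varies over the full circle
rotation_angle = np.full(xs.size, theta)    # identical at every pixel

h_flow = shannon_entropy(flow_direction, 16, (-np.pi, np.pi))
h_rot = shannon_entropy(rotation_angle, 16, (-np.pi, np.pi))
```

Here `h_flow` comes out close to the 4-bit maximum for 16 bins, whereas `h_rot` is zero because every sample falls into the same bin.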

## 3.4.1.

#### Implementation

In the discrete case, the integral to compute the entropy of the pixel $\mathbf{x}$ becomes

## (4)

$$\widehat{H}(\mathbf{x})=-\sum _{j=0}^{{n}_{\mathrm{b}}}\sum _{k\in \mathcal{N}(s)}K({t}_{k}-{\omega}_{j})\,\mathrm{log}\sum _{k\in \mathcal{N}(s)}K({t}_{k}-{\omega}_{j}).$$

Due to the finite size of our ${n}_{\mathrm{b}}$ histogram bins and the overlap of the Gaussian kernel $K$, we systematically overestimate the entropy; even when only a single transformation is present, it will cover more than one bin, creating a nonzero entropy. To address this, we estimate the bias in entropy due to a single Dirac pulse and subtract it. We know that 0.99 of the area under a Gaussian distribution is within 3.2 standard deviations $\sigma $. That means that a conservative estimate of the response to a Dirac pulse is a uniform distribution of the value over $3.2\sigma $ bins. That yields the entropy ${H}_{\text{bias}}\approx -3.2\sigma (1/3.2\sigma )\mathrm{log}(1/3.2\sigma )=-\mathrm{log}(1/3.2\sigma )$. For our $\sigma =0.5$ this evaluates to ${H}_{\text{bias}}\approx 0.2$. We approximate the entropy by subtracting this value from $\widehat{H}(\mathbf{x})$.

Computing the entropy [Eq. (5)] in a naïve way would require us to iterate over large neighborhoods $\mathcal{N}(s)$ (up to the entire image) for each pixel $\mathbf{x}$ and every scale $s$. Instead, we use smoothed local histograms^{38} for this purpose. In the first pass, the 2-D image is converted into a 3-D image with $n$ layers. Layer $i$ contains the discrete smooth probability of that pixel taking this value. Histograms of larger areas as well as their entropy can now be computed in constant time, by constructing a pyramid on the histograms.
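The bias-corrected entropy of a single neighborhood can be sketched as below. This is the naïve per-neighborhood computation, not the smoothed-local-histogram pyramid. The assumptions of the sketch are: the kernel responses are normalized to sum to one, $\sigma$ is measured in bins, the result is clamped at zero, and the logarithm is base 10, which is inferred from the fact that $-\mathrm{log}_{10}(1/1.6)\approx 0.204$ reproduces the stated bias of 0.2. The function name `transformation_entropy` is hypothetical.

```python
import numpy as np

def transformation_entropy(t, centers, sigma_bins=0.5, log=np.log10):
    """Kernel-smoothed histogram entropy of the samples t, minus the single-pulse bias."""
    width = centers[1] - centers[0]
    # Gaussian kernel K between every sample t_k (rows) and bin center w_j (columns).
    k = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / (sigma_bins * width)) ** 2)
    p = k.sum(axis=0)
    p = p / p.sum()                      # normalize to a distribution (assumption)
    nz = p > 0
    h_hat = float(-(p[nz] * log(p[nz])).sum())
    h_bias = float(-log(1.0 / (3.2 * sigma_bins)))
    return max(h_hat - h_bias, 0.0), h_bias  # clamping at zero is an assumption

centers = np.linspace(0.0, 1.0, 16)
coherent = np.full(200, 0.4)             # one transformation everywhere
incoherent = np.linspace(0.0, 1.0, 200)  # many different transformations

h_coherent, h_bias = transformation_entropy(coherent, centers)
h_incoherent, _ = transformation_entropy(incoherent, centers)
```

With these inputs the coherent neighborhood yields a much lower value than the incoherent one, although it does not reach exactly zero because the uniform-pulse model only approximates the Gaussian response.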

## 3.5.

### Perceptual Scaling

All elementary transformations as well as the entropy are physical values and need to be mapped to perceptual qualities. Psychological experiments indicate that elementary transformations such as translation, rotation, and scaling require time (or effort) that is close to linear in the relevant $x$-axis variable^{11}^{,}^{13}^{,}^{17}^{,}^{18}^{,}^{39} and that the effect of multiple elementary transformations is additive.^{13}^{,}^{18} A linear relation was also suggested for entropy in Hick’s law.^{30} Therefore, we scale elementary transformation and entropy using a linear mapping (Fig. 7) and treat them as additive.

## 3.5.1.

#### Transformation

To find the scaling, an experiment was performed similar to the one that Shepard and Metzler^{11} conducted for rotation but extended to all elementary transformations as defined in Sec. 3.3, including shear and perspective. Objective of the experiment is to establish a relationship of transformation strength and difficulty, measured in response time increase in the mental transformation tasks.

Subjects were shown two abstract 2-D images (Fig. 8) generated from patterns in Fig. 9. The two patterns were either different or identical. One out of the two patterns was transformed using a single elementary transformation of intensity ${e}_{i}$ as listed in the upper part of Table 1. Subjects were asked to indicate as quickly as possible if the two patterns are identical by pressing a key. Auditory feedback was provided to indicate if the answer was correct. The time $t({e}_{i})$ until the response was recorded for all correct answers where the two patterns were identical up to a transformation. The choice of pattern to transform (left or right), the elementary transformation $i$ and its magnitude ${e}_{i}$ were randomized in each trial.

## Table 1

Results of our perceptual scaling experiment (Sec. 3.5). Input domain corresponds to the range of the transformation parameter presented to users. The response time functions were fitted to our data for elementary transformations and theoretically derived for entropy (Fig. 7). Refer to Secs. 3.3 and 3.4 for the definition of input variables and their units.

Transformation | Input domain | Response time fit | $R^{2}$ |
---|---|---|---|
Translation | ${e}_{\mathrm{t}}\in [0,180]\,\mathrm{deg}$ | $t({e}_{\mathrm{t}})=0.00265{e}_{\mathrm{t}}+0.987$ | 0.171 |
Rotation | ${e}_{\mathrm{r}}\in [0,\pi ]\,\mathrm{rad}$ | $t({e}_{\mathrm{r}})=0.00280{e}_{\mathrm{r}}+1.053$ | 0.846 |
Scaling | ${e}_{\mathrm{s}}\in [0,2]\,\text{log units}$ | $t({e}_{\mathrm{s}})=0.12100{e}_{\mathrm{s}}+0.999$ | 0.933 |
Aspect | ${e}_{\mathrm{a}}\in [0,2]\,\text{log units}$ | $t({e}_{\mathrm{a}})=0.12100{e}_{\mathrm{a}}+0.984$ | 0.972 |
Shear | ${e}_{\mathrm{h}}\in [0,0.32\pi ]\,\mathrm{rad}$ | $t({e}_{\mathrm{h}})=0.00640{e}_{\mathrm{h}}+0.973$ | 0.968 |
Perspective | ${e}_{\mathrm{p}}\in [0,0.4\pi ]\,\mathrm{rad}$ | $t({e}_{\mathrm{p}})=0.00342{e}_{\mathrm{p}}+0.989$ | 0.805 |
Entropy | $H\in [0,\infty )\,\mathrm{bits}$ | $t(H)=0.60000\widehat{H}+0.998$ | n/a |

Twenty-one subjects (17 M/4 F) completed 414 trials of the experiment in three sessions. For each elementary transformation, we fit a linear function to map strength to response time (Fig. 7). We found a good fit of increasing linear functions of the transformation magnitude for all transformations except translation. See Table 1 for the derived model functions. We do not have a definitive answer as to why the correlation for translation is lower than for the other transformations. A hypothetical explanation is that eye motions can be used to mechanically compensate for translation without mental effort, while there is no anatomical option to compensate for rotation, scaling, and so on. An improved design could employ other ways of multiplexing stimuli, e.g., across time, to identify how much translation is confounded by mental or mechanical aspects, respectively. The linear fits agree with findings for rotation^{11} or scaling,^{12} and our different bias or slope is likely explained by the influence of stimulus complexity also found by Shepard and Metzler.^{11}

## 3.5.2.

#### Entropy

We assume the effect of entropy can be measured similarly to the task of Hick,^{30} where a logarithmic relationship between the number of choices (blinking lamps) and the response time (verbal report of count) was found. He reports a logarithmic time increase with a slope of 0.6 when comparing a visual search task with 10 choices to a single-choice task. The negative logarithm of the inverse number of choices with equal probability is proportional to entropy, so entropy can be directly used for scaling (Fig. 7). We define the response time function of entropy as $t(H)=0.6H+0.998$, where the bias constant was obtained as the mean response time for the zero-transformation case in the mental transformation experiment (Table 1); it is reported only for completeness, as only the slope is relevant for our applications.
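The link between the number of choices and entropy can be made explicit with a short worked equation. For $n$ equally probable choices, each with probability $1/n$, the entropy is

$$H=-\sum _{i=1}^{n}\frac{1}{n}\,\mathrm{log}\,\frac{1}{n}=-\mathrm{log}\,\frac{1}{n}=\mathrm{log}\,n,$$

so a response time that grows logarithmically with the number of choices grows linearly with entropy. As an illustration, and assuming a base-2 logarithm, $n=10$ choices correspond to $H={\mathrm{log}}_{2}10\approx 3.32$ bits.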

## 4.

## Applications

The key applications of our model are an image metric (Sec. 4.1), image saliency (Sec. 4.2) and a measure of motion parallax (Sec. 4.3).

## 4.1.

### Image Difference Metric

The most direct application of our transformation decomposition and its perceptual scaling is building an image difference metric. Our metric both compares luminance patterns in corresponding regions of the image and evaluates the strength and the complexity of the spatial relation between them. This way, it accounts for the difficulty that the same matching task would cause to the HVS. In combination with a chosen traditional image metric, our perceptual transformation measure works as a threshold elevation factor that modifies the visibility of image differences (Figs. 1 and 10).

The inputs are two images ${g}_{1}$ and ${g}_{2}$ and their optical flow $f$ as explained in Sec. 3.2. Initially, the second image ${g}_{2}$ is aligned to the first one using the inverse flow ${f}^{-1}$. Next, the images can be compared using an arbitrary image metric (we experiment with DSSIM^{6}), with the only modification that occluded pixels are skipped from all computations. As a result, a map $\widehat{\mathcal{D}}$ is created that contains abstract visual differences [a unitless quality measure in the range from 0 to 1 for $\widehat{\mathcal{D}}({g}_{1},{g}_{2})=DSSIM({g}_{1},{g}_{2})$ as used in our examples]. This map does not account for the effect of the transformation strength and entropy while we have seen from our experiments that large or incoherent transformations make comparing two images more difficult.

Next, for every pixel we express the increase of difficulty ${d}_{i}=t({e}_{i})-t(0)$ (response time minus optimal response time, i.e., with no transformation or entropy) due to each elementary transformation: translation (${d}_{\mathrm{t}}$), rotation (${d}_{\mathrm{r}}$), scale (${d}_{\mathrm{s}}$), shear (${d}_{\mathrm{h}}$), and perspective (${d}_{\mathrm{p}}$) as well as the entropy (${d}_{\mathrm{H}}$), resulting in a transformation difficulty factor

## (6)

$$\delta ={(1+{d}_{\mathrm{t}}+{d}_{\mathrm{r}}+{d}_{\mathrm{s}}+{d}_{\mathrm{h}}+{d}_{\mathrm{p}}+{d}_{\mathrm{H}})}^{-1}.$$

The summation is motivated by the finding that response time, besides being linear, also sums in a linear fashion^{12}^{,}^{13} (if scaling adds one second and rotation adds another one, the total time is 2 s).

Finally, we use $\delta $ as a factor masking the otherwise potentially perceivable differences in $\widehat{\mathcal{D}}$

As difficulty is in units of time, the resulting unit is visual difference per time. If the original difference map $\widehat{\mathcal{D}}$ differed by three units and was subject to a transformation that increased response time by 1 s (e.g., a rotation by about 180 deg), the difference per unit time is $3/(1+1)=1.5$, whereas for a change increasing response time by 3 s (e.g., a shuffling with high entropy) the difference per unit time is $3/(1+3)=0.75$. In Fig. 10, we show the outcome of correcting the DSSIM index by considering our measure of transformation strength and entropy.

Image transformations that contain local scaling power larger than 0 (zooming) might reveal details in ${g}_{2}$ that were not perceivable or not represented in the first image ${g}_{1}$. Such differences could be reported, as they indeed show something in the second image that was not in the first. However, we decided not to consider such differences, as a change from nothing into something might not be a relevant change. This can be achieved by blurring the image ${g}_{2}$ with a blur kernel of a bandwidth inversely proportional to the scaling. Occlusions are handled in the same way: no perceived difference is reported for regions only visible in one image.
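The difficulty factor and its use as a mask can be sketched as follows. The sketch assumes that the masked difference is the per-pixel product $\delta \cdot \widehat{\mathcal{D}}$, which matches the worked numbers above, and that each ${d}_{i}$ equals the Table 1 slope times the input magnitude, since $t({e}_{i})-t(0)$ cancels the intercept. The slopes are taken at face value, so the inputs must be in whatever units each fit expects; aspect ratio is left out because it does not appear in Eq. (6). The names `SLOPES`, `difficulty_factor`, and `masked_difference` are hypothetical.

```python
# Slopes of the linear response-time fits in Table 1 (seconds per input unit).
SLOPES = {
    "t": 0.00265,  # translation
    "r": 0.00280,  # rotation
    "s": 0.12100,  # scaling
    "h": 0.00640,  # shear
    "p": 0.00342,  # perspective
    "H": 0.60000,  # entropy
}

def difficulty_factor(magnitudes):
    """Eq. (6): delta = 1 / (1 + sum_i d_i) with d_i = slope_i * e_i."""
    added_time = sum(SLOPES[name] * value for name, value in magnitudes.items())
    return 1.0 / (1.0 + added_time)

def masked_difference(d_hat, delta):
    """Scale the difference map value by the difficulty factor (assumed product)."""
    return d_hat * delta

# Worked numbers from the text: three units of difference, +1 s and +3 s.
one_second = masked_difference(3.0, 1.0 / (1.0 + 1.0))                  # 1.5
three_seconds = masked_difference(3.0, difficulty_factor({"H": 5.0}))   # 0.6 * 5 = 3 s
no_transform = masked_difference(3.0, difficulty_factor({}))            # unchanged
```

Both functions also accept NumPy arrays of equal shape, so the same code can be applied to whole maps of elementary transformation magnitudes and to the full difference map.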

## 4.1.1.

#### Validation

We validate our approach by measuring human performance in perceiving differences in an image pair and analyzing its correlation with transformation magnitude and entropy. Subjects were shown image pairs that differed by a flow field as well as a change in content. Two image pairs show 3-D renderings of 16 cubes with different textures [see Figs. 11(a) and 11(c)]. The transformation between the images of each pair included a change of 3-D viewpoint and a variety of 3-D motions for each cube. Larger transformations were chosen on the right side of the image [Fig. 11(e)], and a similar trend also applies to the entropy introduced by swapping several of the cubes in the grid [Fig. 11(f)].

The images were distorted by adding noise and color quantization to randomly chosen textured cubes, so that the corresponding cubes could differ either by the presence of distortion or by their intensity. The intensities of distortions were chosen so that without the geometrical transformation the artifacts are just visible. Ten subjects were asked to mark the cubes that appear different using a 2-D painting interface, with unlimited time. We record the error rate of each object as the relative number of cases in which the subjects gave wrong answers, i.e., where there was an image difference that they did not mark or where they marked a distortion while there was none [Figs. 11(b) and 11(d)]. Difficult areas have an error rate of 0.5 (chance level), while areas where the subjects were confident have a value as low as 0. The error rate is averaged over all subjects for one distortion and one scene.
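The error rate described above counts both kinds of wrong answers: a changed cube that was left unmarked and an unchanged cube that was marked. A minimal sketch of this bookkeeping for a single cube follows; the function name `error_rate` and the boolean encoding of the responses are assumptions made for illustration only.

```python
def error_rate(marked, changed):
    """Fraction of subject responses that disagree with the ground truth.

    marked  -- list of bools, one per subject: did the subject mark the cube?
    changed -- bool: did the cube actually differ between the two images?
    """
    if not marked:
        raise ValueError("at least one response is required")
    # A response is wrong if it is a miss (changed but unmarked)
    # or a false alarm (unchanged but marked).
    wrong = sum(1 for m in marked if m != changed)
    return wrong / len(marked)


# Ten hypothetical subjects, cube actually changed:
confident = error_rate([True] * 10, changed=True)             # everyone is right
at_chance = error_rate([True] * 5 + [False] * 5, changed=True)  # half are wrong
print(confident, at_chance)  # 0.0 0.5
```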

Our transformation difficulty metric consists of the transformation magnitude measures and entropy, and we pool it for each cube by averaging to obtain 16 scalar values [Fig. 11(g)]. As the geometrical layout is the same for both types of distortions and each was set up to have a similar visibility, the assumption is that the subjects would find the same geometrical transformations difficult in both cases. This means that the error rates in Figs. 11(b) and 11(d) should be similar, and we correlate our metric with both of them together.

We analyzed the correlation of the error rate [Figs. 11(b) and 11(d)] with transformation magnitude and entropy [Figs. 11(e)–11(g)] and found an average Pearson’s $r$ correlation of 0.701 (Table 2), which is significant according to the $t$-test with $p<0.05$. The transformation magnitude has a lower correlation of $r=0.441$ compared to the transformation entropy with $r=0.749$ (significant for $p<0.05$). This difference could possibly be explained by the design of our experiment. Given unlimited time, the subjects were eventually able to undo all transformations and resolve all shuffling between the cubes. It may be that the short-term memory requirement made the shuffling problem harder than the one resulting from the transformations. This would increase the importance of entropy for the performance prediction and lead to a higher correlation. A time-constrained version of the experiment could answer this question. Another possible explanation points to a relatively low magnitude of the transformations applied to our stimuli compared to the entropy introduced by shuffling many similar cubes. An experiment design with different combinations of both factors could be used to verify this theory. Despite this asymmetry, we conclude that both transformation magnitude and entropy correlate with the ability to detect distortions; in the presence of strong or complex transformations, the increase in human detection error can be fit using a linear model.
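For readers who want to repeat this kind of analysis, Pearson's $r$ and the associated $t$ statistic can be computed as sketched below. The helper names are hypothetical and the sample values are invented for illustration; they are not the measurements of this study. The resulting $t$ value would be compared with the critical value of Student's $t$ distribution with $n-2$ degrees of freedom.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Invented example data: pooled difficulty per cube vs. observed error rate.
difficulty = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
errors = [0.05, 0.10, 0.20, 0.15, 0.35, 0.40]
r = pearson_r(difficulty, errors)
print(r, t_statistic(r, len(difficulty)))
```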

## Table 2

Correlations of image differences from variants of our metric [Figs. 11(e)–11(g)] and the average error rate of our study participants [Figs. 11(b) and 11(d) together] as described in Sec. 4.1. Stars (⋆) denote significance according to the $t$-test with $p<0.05$.

Metric | Correlation |
---|---|
Only transformation magnitude | 0.441 |
Only transformation entropy | 0.749^{⋆} |
Our full metric | 0.701^{⋆} |

The final performance of our approach is limited by the image metric used. The correlation of image metrics and quantitative user responses is low and difficult to measure^{40} even without transformations. Therefore, the evaluation of the full metric, in particular for suprathreshold conditions, is relegated to future work.

## 4.1.2.

#### Discussion

Here we discuss the relation between our metric and existing image and video metrics, in particular how they deal with transformations. For a more general survey of image quality metrics, we refer to Ref. 6.

Standard image difference (fidelity) metrics, such as per-pixel MSE, peak signal-to-noise ratio, the per-patch structural similarity (SSIM) index,^{6} or the perception-based visible differences predictor,^{41} are extremely sensitive to any geometric distortion (Ref. 6, Figs. 1.1 and 3.8). The requirement of perfect image registration is lifted for the pixel correspondence metric,^{42} closest distance metric,^{43} or point-to-closest-point MSE, which account for edge distances. Natural images can be handled by the complex wavelet CW-SSIM index, but mostly only small translations are supported (Ref. 6, Ch. 3.2.3).

Liu and Chen^{44} describe a method for JPEG artifact assessment that compares the power spectrum in the frequency domain. Although it does not aim for complete transformation invariance as in our case, some degree of tolerance for imperfect alignment can be expected. A similar analysis was also demonstrated using wavelets.^{45} Zhou et al.^{46} combined a comparison of mutual differences between the reference and test image with an analysis of self-similarity within each of the images. Such internal similarity can be better preserved than mutual similarity, especially when the complexity of the transformation is relatively low. A machine learning approach can be used to improve the adaptability of a metric to various content. Jin et al.^{47} train a metric that selects a specialized approach for a content structure and a distortion category. A potential extension would teach the metric to recognize local transformations and enable it to undo them. An application where the spatial alignment of images cannot be taken for granted is quality assessment for stereoscopic 3-D. Changes in disparity will cause the left and right images to shift, often in a spatially nonuniform way. Li et al.^{48} showed how the disparity information can be used to undo the stereoscopic projection and merge the left- and right-eye images into a cyclopean image where luminance features can be compared in a similar way as in SSIM.

All these approaches model local deformation invariance, which is a low-level (C1 cortical area) process. Our transformation-aware quality metric attempts to compensate for transformations of much larger magnitude, including perspective transformation; such compensation occurs at higher levels.^{10}

Video quality assessment typically considers the temporal domain as a straightforward extension of 2-D spatial filter banks into 3-D.^{49} This precludes any reasonable motion analysis based on its direction and velocity, which requires optical flow computation. A notable exception is the work of Seshadrinathan and Bovik,^{50} where the optical flow is derived using 3-D Gabor filters that span both the spatial and the temporal domain in order to evaluate the spatial quality of video frames as well as the motion quality. Dominant motion increases the perception uncertainty and suppresses distortion visibility, while relative motion can make video degradations easier to notice.^{49} Our homography decomposition enables the analysis of dominant and relative motion, and our transformation entropy accounts for their local complexity, which we utilize in our transformation-aware quality metric.

The visual image equivalence^{1} measures whether a scene’s appearance is the same rather than predicting if a difference is perceivable. Perceivably different scenes can result in the same impression, as the HVS compensates for irrelevant changes. Our method can be considered another form of visual equivalence, as we model compensation for transformations. Comparing two aggregates of objects^{51} is also related to entropy but goes beyond, if the aggregates differ by more than a transformation, i.e., deletion or insertion of objects.

## 4.1.3.

#### Limitations

Our metric relies heavily on the quality of the optical flow estimation. We analyze the transformation and entropy in the image by fitting a homography locally to a small neighborhood. This can potentially lead to an amplification of the noise in the original optical flow and, consequently, to an overestimation of the transformation and its entropy. Special care has to be taken with textureless or occluded regions where the flow estimate is unreliable. Such regions should conservatively be ignored in computing the final metric by setting $\delta =1$.

Another limitation of our metric is its focus on transformation properties alone. It is not clear how luminance properties, such as local contrast or texture distinctiveness, influence the ability of the HVS to understand the transformation, which has a direct influence on the perceivability of image differences. Our metric could therefore overestimate the performance of the HVS in regions where luminance patterns make the motion hard to understand. Dazzle camouflage is one such example of a luminance pattern that makes estimation of an object's shape and position very difficult.^{52}

## 4.2.

### Saliency

Saliency estimation has many applications in computer graphics (Sec. 2.3). It can substitute for direct eye tracking and enable image processing optimized for important regions of the image that are more likely to be observed than others. While some other saliency models also consider motion, our method is unique in decomposing motion into elementary transformations (Fig. 12). This allows for an easier detection of distinct motion patterns that can be obscured in the original optical flow. Such unique features easily cause a “pop-out” effect, which attracts the user's attention and hence increases saliency.

Different from common image saliency, our approach takes two images instead of one as an input. It outputs saliency, i.e., how much attention an image region receives. We largely follow the basic, but popular, model of Itti et al.^{25} and replace its motion detection component by our component that detects salient transformations.

First, we compute a feature map for every elementary transformation signal ${e}_{i}$ (Sec. 3.3) as

where $\ominus $ is an operator computing contrast between two levels $c\in \{2,3,4\}$ and $s=c+\{3,4\}$ of a Gaussian pyramid for an elementary transformation signal ${e}_{i}$. Typically, $\ominus $ is a simple difference, but special care has to be taken for the periodicity of angular values in the case of rotation. Transformations with vector values, such as translation ${\mathbf{e}}_{\mathrm{t}}$, are treated as separate entities for each dimension. The six resulting feature maps per transformation are then combined into corresponding conspicuity maps^{25} ${\overline{E}}_{i}$ by summation

Finally, all conspicuity maps are normalized and averaged to get the final motion saliency score

where $\mathcal{N}(\cdot )$ is the normalization operator by Itti et al.^{25}
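The enumeration of pyramid level pairs and the periodic special case of the contrast operator can be illustrated with a short sketch. It is not the implementation used in this work: the names `scale_pairs` and `contrast` are hypothetical, rotation angles are assumed to be given in radians with period $2\pi$, and taking the magnitude of the difference follows the original model of Itti et al. as an assumption.

```python
import math


def scale_pairs(centers=(2, 3, 4), offsets=(3, 4)):
    """All (c, s) pyramid level pairs with s = c + offset."""
    return [(c, c + d) for c in centers for d in offsets]


def contrast(center, surround, period=None):
    """Sketch of the contrast operator for one pair of values.

    For ordinary signals this is the magnitude of a plain difference.
    If `period` is given (assumed: 2*pi for rotation angles in radians),
    the difference is first wrapped into [-period/2, period/2), so that
    angles on either side of the wrap-around point count as close.
    """
    diff = center - surround
    if period is not None:
        diff = (diff + period / 2.0) % period - period / 2.0
    return abs(diff)


pairs = scale_pairs()
print(len(pairs), pairs[0], pairs[-1])  # 6 (2, 5) (4, 8)

# 350 deg vs. 10 deg: 340 deg apart as plain numbers, 20 deg apart as angles.
plain = contrast(math.radians(350), math.radians(10))
wrapped = contrast(math.radians(350), math.radians(10), period=2.0 * math.pi)
```

The six pairs returned by `scale_pairs` correspond to the six feature maps per transformation mentioned in the text.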

## 4.2.1.

#### Discussion

We compare our approach for scenes that contain complex elementary transformations to approaches by Le Meur et al.,^{53} Zhai and Shah,^{54} and Itti et al.^{25} (with the original motion detection component) in Fig. 12.

A vast majority of saliency models that handle the temporal domain are focused on motion detection with respect to the static environment,^{7} but motion pop-out may also arise from nonconsistent relative motion. Therefore, the global motion (e.g., due to camera motion) or consistent and predictable object motion should be factored out, and the remaining relative motion is likely to be a strong attention attractor. Along this line, Le Meur et al.^{53} derive the global motion in terms of an affine transformation using robust statistics and remove it from the optical flow. The remaining outlier flow is compared to its median magnitude as a measure of saliency. Such per-pixel statistics make it difficult to detect visually consistent object transformations, such as rotations, where the variability of the motion magnitude and direction might be high. Zhai and Shah^{54} derived local homographies that model different motion segments in the frame. In this work, we compute transformation contrast similar to the translation-based motion contrast in Ref. 54, but we perform it for all elementary transformations, and we account for neighboring homographies in a multiresolution fashion instead of considering all homographies at once. This gives us a better locality of transformation contrast. Also, through the decomposition into elementary transformations, we are able to account for the ability of the HVS to compensate for numerous comparable (nonsalient) transformation components akin to camera or large object motion and to detect highly salient unique motion components. This way, instead of detecting local variations of optical flow, we are able to see more global relations between moving objects (such as the relative rotation in Fig. 12). The edge-stopping component of homography estimation enables us to find per-pixel boundaries of regions with inconsistent motion, which further improves the accuracy of saliency maps. Finally, our saliency model is computationally efficient and can be performed at near-interactive rates.

## 4.2.2.

#### Limitations

Similarly to our image metric application, our saliency predictor depends on the availability of a good optical flow estimate. Unreliable and noisy optical flow would reduce the effectiveness of our method, as an increase of entropy in the image generally lowers the prominence of the salient feature. Note that this makes it difficult to validate our method using standard saliency datasets with ground truth attention data gathered using eye tracking. Such datasets usually do not contain pairs of images as required by our method. A video input could potentially be used, but reliable optical flow data are missing in existing datasets. We conclude that a saliency validation dataset containing spatial transformations and registration information in the form of optical flow is highly desirable.

## 4.3.

### Motion Parallax

Motion parallax is defined by relative retinal motion due to a rigid geometrical transformation of the scene during motion of the object or the observer. Analyzing variance of optical flow directly can easily lead to an overestimation of the relative motion since nonlinear transformations, such as rotation, will yield incoherent optical flow. Our transformation decomposition removes this problem by attributing each elementary transformation to its proper magnitude map ${e}_{i}$. This way relative motion can be analyzed more robustly by investigating each map separately.

We propose a measure of motion parallax (Sec. 2.4), which relies on a combination of spatial change of flow (motion contrast) and spatial change of luminance (luminance contrast) (Fig. 13, Middle). First, an elementary transformation is computed for each level of the pyramid. The resulting pyramids contain at every pixel the difference in motion between a pixel and its spatial context of a size that depends on the level. Then, we compute the absolute values of such differences.

First, similar to a Laplacian image pyramid,^{55} we build a contrast pyramid ${\mathcal{C}}_{i}$ for each elementary transformation ${e}_{i}$ (Sec. 3.3)

## 4.3.1.

#### Adaptive parallax occlusion mapping

Real-world objects often exhibit complex surface structures that would be too costly to model directly for computer graphics rendering applications. The sheer geometric complexity would quickly surpass the memory capacity or the rendering throughput of any hardware. That is why the objects get simplified and surface details are replaced by a single plane. This is efficient, but it reduces the realism, as neither shading nor structural details can be correctly reproduced [Fig. 14(a)]. Normal mapping^{56} uses an additional surface raster to encode normal details, which are then used to reconstruct high frequencies of the shading at the cost of a small memory and performance overhead [Fig. 14(b)]. The results are convincing for a static scene, but adding motion reveals that the motion parallax cue is completely missing. Parallax occlusion mapping^{57} (POM) tackles this issue by adding a surface-displacement map and performing a simple ray tracing to determine visibility and occlusions inside the surface plane [Fig. 14(c)]. Although significantly more costly, this effect is popular in current games as it greatly improves the realism. The importance of this effect has also been noticed in head-mounted displays for virtual reality applications, where the mismatch between head motion and the lack of motion parallax is strongly objectionable. As the computation happens inside the originally flat surface, its outline cannot be modified. This means that the shape of the object can still reveal the underlying simplified model. A remedy for this is provided by displacement mapping, which uses the hardware capability of modern GPUs to generate a detailed geometry matching the height map on the fly and, therefore, without a memory footprint [Fig. 14(d)]. As both of the latter techniques are quite costly to compute, their use should be driven by the benefit that they bring to the user. We demonstrate how such a decision can be supported by our method in the case of POM.

In the case of POM, the height field needs to be ray-traced for every pixel, which limits its use in interactive applications, especially on low-power, i.e., mobile, devices. Using our approach, we can detect where a given displacement map will actually result in perceivable motion parallax for a given view direction when this view is slightly changed, and adjust the quality of the effect per pixel, saving considerable compute time and bandwidth (Fig. 15). To this end, we prerender the displacement map, including texturing, for all view directions and compute the motion parallax for a differential motion along each spherical direction for all view directions (refer to the inset in Fig. 15). The result is a lookup function that indicates how much a pixel benefits from POM for a certain view. At runtime, we look up the value for a given view direction and adjust the size of the ray-tracing step for each pixel. This modifies the number of iterations required to evaluate the effect and, therefore, the required computation time.
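How a looked-up benefit value could steer the ray-tracing step is sketched below. The text only states that the step size is adjusted per pixel; the linear mapping, the normalization of the benefit to $[0,1]$, the step limits, and the function names are all assumptions of this illustration and not the shader used in this work.

```python
import math


def ray_step_size(benefit, min_step=1.0 / 64.0, max_step=1.0 / 8.0):
    """Map a precomputed parallax benefit in [0, 1] to a ray-marching step.

    High benefit -> small step (many iterations, high quality);
    low benefit  -> large step (few iterations, cheap).
    The linear mapping and both step limits are hypothetical choices.
    """
    b = min(max(benefit, 0.0), 1.0)  # clamp out-of-range lookup values
    return max_step + b * (min_step - max_step)


def iteration_count(step, depth_range=1.0):
    """Number of ray-marching iterations needed to traverse the height range."""
    return math.ceil(depth_range / step)


# A pixel with strong perceivable parallax vs. one with none.
print(iteration_count(ray_step_size(1.0)))  # 64
print(iteration_count(ray_step_size(0.0)))  # 8
```

In this toy setting, a pixel that does not benefit from POM is evaluated with one eighth of the iterations spent on a pixel with maximal benefit.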

## 4.3.2.

#### Motion parallax-based viewpoint selection

Motion parallax is an effective depth cue,^{26} and cinematographers know how to use it to convey the layout of a scene. To our knowledge, however, there is not yet an automated way to pick the right camera or object motion such that the resulting motion parallax is most effective. Consequently, casual users who need to place a camera will have difficulties selecting it effectively.

Using our approach, we can derive an extended view point+motion selection approach (Fig. 16) that, given a 3-D scene, picks a view direction and a change of view position, such that the resulting image pair features optimal motion parallax. The image pair can directly be used in a stereo flip animation or as two key frames of a very slow camera motion, akin to the Ken Burns’ effect in 2-D (Ref. 58, page 512). Optionally, the view position can be fixed, and only the direction of change is suggested, or vice versa. To compute the pair, we use the same approach as for POM. We densely sample the motion parallax of the entire image for the 2-D set of all view directions and their $\phi $ and $\theta $ derivatives in spherical coordinates. We return the pair where the motion parallax is maximal, optionally combined with other viewpoint criteria.
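The selection itself amounts to an exhaustive search over the sampled view directions and the two derivative directions. The sketch below shows that search under stated assumptions: `parallax` stands for a hypothetical callable returning the image-wide motion parallax score for a view and a direction of change, and the sampling resolution is arbitrary.

```python
import math


def select_view(parallax, n_theta=8, n_phi=16):
    """Return (score, theta, phi, direction) with the maximal parallax score.

    `parallax(theta, phi, direction)` is a hypothetical callable that
    renders the image pair for a view direction (theta, phi) and a small
    view change along `direction` ("theta" or "phi"), and returns the
    pooled motion parallax measure of that pair.
    """
    best = None
    for i in range(n_theta):
        theta = math.pi * (i + 0.5) / n_theta  # polar angle, poles excluded
        for j in range(n_phi):
            phi = 2.0 * math.pi * j / n_phi    # azimuth
            for direction in ("theta", "phi"):
                score = parallax(theta, phi, direction)
                if best is None or score > best[0]:
                    best = (score, theta, phi, direction)
    return best


# Toy stand-in for the renderer: prefers views near theta = 1, phi = 3
# and a view change along phi.
def toy_parallax(theta, phi, direction):
    bonus = 1.0 if direction == "phi" else 0.0
    return bonus - (theta - 1.0) ** 2 - (phi - 3.0) ** 2


best = select_view(toy_parallax)
```

Fixing the view position and suggesting only the direction of change, as mentioned above, would correspond to restricting the two outer loops to a single sample.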

## 5.

## Conclusion and Future Works

We propose a model of human perception of transformations between a pair of images. The model converts the underlying optical flow into a field of homographies, which is further decomposed into elementary transformations that can be perceptually scaled and that allow for the notion of transformation entropy. Our model enables a number of applications for the first time. We extend perceptual image metrics to handle images that differ by a transformation. We extend visual attention models to detect conspicuous relative object motion while ignoring predictable motion, such as that due to view changes or consistent object motion. Finally, we provide a measure of motion parallax based on optical flow and demonstrate the utility of this measure in rendering applications to steer adaptive POM and viewpoint selection.

Our transformation-aware perceptual scaling may have other interesting applications, which we relegate to future work. In image change blindness,^{59} the same view has been considered so far, and our approach could be beneficial to predict the increased level of difficulty in the visual search task due to perspective changes. Also, the concept of visual equivalence^{1} can be extended to handle different scene views, as well as minor deformations of the object geometry and their relocation. Our quality metric could be applicable to rephotography and rerendering,^{3} allowing for a better judgement of structural image differences while ignoring minor misregistration problems. This is also the case in image retargeting, where all image distance metrics, such as SIFT flow, bidirectional similarity, or earth mover’s distance,^{4} account for some form of the energy cost needed to transform one image into another. While semantic and cognitive elements of image perception seem to be the key missing factors in those metrics, it would be interesting to see whether our decomposition of the deformation into elementary transformations and perceptual scaling of their magnitudes could improve the existing energy-based formulations.

At the end of the day, the basic question is: “what is an image?” In most cases, “image” does not refer to a matrix of physical values but to the mental representation of a scene. This mental representation is created by compensating for many variations in physical appearance. The ability to compensate for transformation, as well as its limitations, is an important part of this process and has been modeled computationally in this work.

## References

## Biography

**Petr Kellnhofer** has been a PhD candidate under the joint supervision of Prof. Karol Myszkowski and Prof. Hans-Peter Seidel at MPI Informatik, Saarbrücken, Germany, since 2012. His research interests cover application of perception to computer graphics with a focus on stereoscopic 3-D. During his PhD he visited the group of Prof. Wojciech Matusik at MIT CSAIL, where he investigated eye-tracking methods and their applications.

**Tobias Ritschel** is a senior lecturer (associate professor) in the VECG group at University College London. His interests include interactive and nonphotorealistic rendering, human perception, and data-driven graphics. He received the Eurographics PhD dissertation award in 2011 and the Eurographics Young Researcher Award in 2014.

**Karol Myszkowski** is a tenured senior researcher at the MPI Informatik, Saarbrücken, Germany. From 1993 till 2000, he served as an associate professor in the Department of Computer Software at the University of Aizu, Japan. His research interests include global illumination and rendering, perception issues in graphics, high dynamic range imaging, and stereo 3-D.

**Hans-Peter Seidel** is a scientific director and chair of the Computer Graphics Group at the MPI Informatik and a professor of computer science at Saarland University. In 2003, he received the Leibniz Preis, the most prestigious German research award, from the German Research Foundation (DFG). He is the first computer graphics researcher to receive such an award.