Impact of deep learning-based image super-resolution on binary signal detection

Abstract. Purpose: Deep learning-based image super-resolution (DL-SR) has shown great promise in medical imaging applications. To date, most of the proposed methods for DL-SR have only been assessed using traditional measures of image quality (IQ) that are commonly employed in the field of computer vision. However, the impact of these methods on objective measures of IQ that are relevant to medical imaging tasks remains largely unexplored. We investigate the impact of DL-SR methods on binary signal detection performance. Approach: Two popular DL-SR methods, the super-resolution convolutional neural network and the super-resolution generative adversarial network, were trained using simulated medical image data. Binary signal-known-exactly with background-known-statistically and signal-known-statistically with background-known-statistically detection tasks were formulated. Numerical observers (NOs), which included a neural network-approximated ideal observer and common linear NOs, were employed to assess the impact of DL-SR on task performance. The impact of the complexity of the DL-SR network architectures on task performance was quantified. In addition, the utility of DL-SR for improving the task performance of suboptimal observers was investigated. Results: Our numerical experiments confirmed that, as expected, DL-SR improved traditional measures of IQ. However, for many of the study designs considered, the DL-SR methods provided little or no improvement in task performance and even degraded it. It was observed that DL-SR improved the task performance of suboptimal observers under certain conditions. Conclusions: Our study highlights the urgent need for the objective assessment of DL-SR methods and suggests avenues for improving their efficacy in medical imaging applications.


Introduction
Single-image super-resolution (SISR) is a classic image restoration operation that seeks to estimate a high-resolution (HR) image from an observed low-resolution (LR) one. 1 A variety of methods have been developed to achieve this goal, such as filtering and interpolation-based approaches 2 and more formal regularized inverse problem-based formulations, 3,4 to name a few. Recently, deep learning-based image super-resolution (DL-SR) methods have been widely employed and have shown great promise for SISR in terms of traditional image quality (IQ) metrics such as mean square error (MSE), structural similarity index metric (SSIM), and peak signal-to-noise ratio (PSNR). [5][6][7][8] In medical imaging, images are often acquired for specific purposes, and the use of objective measures of IQ is widely advocated for assessing imaging systems and image processing algorithms. [9][10][11][12][13][14][15] Although DL-SR algorithms can improve traditional IQ metrics, [16][17][18][19][20][21] it is well-known that such metrics may not always correlate with objective task-based IQ measures. [22][23][24][25] Despite this, relatively few studies have objectively assessed image super-resolution methods. 19,[26][27][28] Dai et al. 27 evaluated six image super-resolution methods on popular vision tasks such as edge detection and semantic image segmentation and found that the standard perceptual metrics correlated well with the usefulness of image super-resolution to these tasks. Jaffe et al. 28 conducted a study in which the aesthetic IQ that DL-SR methods sought to improve did not necessarily increase classification accuracy. However, none of these studies were carried out with images, tasks, or observers relevant to medical imaging. Additionally, the data processing inequality indicates that the performance of an ideal observer (IO) on a particular task cannot be improved using image processing transformations. 29 The scenarios under which DL-SR may improve the performance of a suboptimal observer on a specified task have not been thoroughly investigated.
The purpose of this work is to evaluate DL-SR methods using task-based measures as a preliminary attempt to address the issues raised above. For this study, two canonical DL-SR networks were identified for the analysis. A variety of mathematical and learning-based numerical observers (NOs) were computed on the HR images, the LR images, and the images resolved by the DL-SR methods. Receiver operating characteristic (ROC) analysis was employed to quantify the performance of these NOs. Two stylized binary signal detection tasks were designed to evaluate the DL-SR networks systematically and comprehensively under known statistical conditions. Specifically, a signal-known-exactly and background-known-statistically (SKE/BKS) Rayleigh discrimination task 30,31 was employed to assess the ability of a DL-SR network to resolve two small adjacent objects. The inherent detectability of the signal was varied, and its effect on the utility of DL-SR for improving detection task performance was studied. The impact of the depth of a DL-SR network on NO performance was investigated to see if the deep learning mantra "deeper is better" holds true for signal detection performance. 32 Additionally, a signal-known-statistically and background-known-statistically (SKS/BKS) microcalcification (MC) cluster detection task was employed to investigate under what circumstances DL-SR techniques may improve the binary signal detection performance of a suboptimal observer.
The remainder of this paper is organized as follows. Section 2 describes the relevant background on linear imaging systems, the basic theory relating to binary signal detection tasks, NOs, and DL-SR. Section 3 describes the setup for the numerical studies, and Sec. 4 describes the results of the proposed evaluation. Section 5 presents a discussion on the salient findings, and Sec. 6 concludes this paper.

Background
Many imaging systems are approximately described by a continuous-to-discrete (C-D) linear imaging model: 9

$$\mathbf{g} = \mathcal{H} f(\mathbf{r}) + \mathbf{n}, \tag{1}$$

where $f \in L_2(\mathbb{R}^d)$ is the true object of interest that is a function of the $d$-dimensional spatiotemporal coordinate $\mathbf{r}$ and $\mathbf{g} \in \mathbb{E}^m$ is a vector that describes the measurement data. The mapping $\mathcal{H}: L_2(\mathbb{R}^d) \to \mathbb{E}^m$ denotes the C-D forward operator that represents the data-acquisition process, and $\mathbf{n} \in \mathbb{E}^m$ denotes the measurement noise. In practice, discrete-to-discrete (D-D) models for the imaging system are often employed, in which case the object $f(\mathbf{r})$ is approximated by a vector $\mathbf{f} \in \mathbb{E}^n$, $n \in \mathbb{N}$, and a D-D approximation $\mathbf{H} \in \mathbb{E}^{m \times n}$ is employed in place of $\mathcal{H}$. 9

Binary Signal Detection Tasks
A binary signal detection task requires an observer to classify the image as satisfying either hypothesis $H_0$ or hypothesis $H_1$:

$$H_0: \mathbf{g} = \mathbf{H}(\mathbf{f}_b + \mathbf{f}_{s_0}) + \mathbf{n}, \tag{2}$$

$$H_1: \mathbf{g} = \mathbf{H}(\mathbf{f}_b + \mathbf{f}_{s_1}) + \mathbf{n}, \tag{3}$$

where $\mathbf{f}_b \in \mathbb{E}^n$ denotes the background, $\mathbf{f}_{s_0} \in \mathbb{E}^n$ and $\mathbf{f}_{s_1} \in \mathbb{E}^n$ represent the signal under the two hypotheses, $\mathbf{H} \in \mathbb{E}^{m \times n}$ refers to the D-D imaging operator, and $\mathbf{n} \in \mathbb{E}^m$ denotes the measurement noise. The special case of $\mathbf{f}_{s_0} = \mathbf{0}$ corresponds to a task of detecting the presence or absence of the signal $\mathbf{f}_{s_1}$ in an image. When $\mathbf{f}_b$ is a random vector drawn from a certain nondegenerate distribution and $\mathbf{f}_{s_0}$ and $\mathbf{f}_{s_1}$ are fixed known signals, the detection task is known as an SKE/BKS detection task. Alternatively, if $\mathbf{f}_{s_0}$ and $\mathbf{f}_{s_1}$ are also random, then the detection task is known as a signal-known-statistically and background-known-statistically (SKS/BKS) detection task. Both of these tasks are considered in this work.

Numerical Observers for IQ Assessment
An NO for a signal detection task maps a given set of measurements $\mathbf{g}$ or, alternatively, an image estimate $\hat{\mathbf{f}} \in \mathbb{E}^n$ of the object obtained from $\mathbf{g}$ to a scalar test statistic $t$ that is used to determine whether $\mathbf{g}$ or $\hat{\mathbf{f}}$ satisfies $H_0$ or $H_1$ based on comparison with a predetermined threshold $\tau$. The NOs employed in this study are described below.

Ideal observer and ResNet-based observer
The IO is an observer that utilizes all available statistical information about the task at hand to maximize task performance. An IO test statistic $t_{\mathrm{IO}}(\hat{\mathbf{f}})$ is any monotonic function of the likelihood ratio: 9

$$\Lambda(\hat{\mathbf{f}}) = \frac{p(\hat{\mathbf{f}} \mid H_1)}{p(\hat{\mathbf{f}} \mid H_0)}, \tag{4}$$

where $p(\hat{\mathbf{f}} \mid H_0)$ and $p(\hat{\mathbf{f}} \mid H_1)$ are the conditional probability density functions that describe the image estimate $\hat{\mathbf{f}}$ under hypotheses $H_0$ and $H_1$. The exact computation of an IO test statistic based on $\Lambda(\hat{\mathbf{f}})$ is intractable in general, and Markov-chain Monte Carlo techniques have been proposed to approximate it. 33,34 Recently, it has been empirically shown that the IO can be approximated by a neural network-based observer. 14 In this study, a residual neural network-based (ResNet-based) classifier of sufficient capacity trained on a large labeled training dataset was employed to approximate the IO. This will henceforth be referred to as the ResNet-IO. Note that, if this network does not possess the capacity to accurately approximate $t_{\mathrm{IO}}(\hat{\mathbf{f}})$, the resulting NO will be simply referred to as a ResNet-based observer. In this case, the ResNet-based observer is a suboptimal observer.

Hotelling observer and regularized Hotelling observer
The Hotelling observer (HO) is the optimal NO under the condition that the employed test statistic is a linear function of the data. 9 The test statistic for the HO is defined as

$$t_{\mathrm{HO}}(\hat{\mathbf{f}}) = \mathbf{w}_{\mathrm{HO}}^{T} \hat{\mathbf{f}}, \tag{5}$$

where

$$\mathbf{w}_{\mathrm{HO}} = [\mathbf{K}(\hat{\mathbf{f}})]^{-1} \Delta\hat{\mathbf{f}} \tag{6}$$

is known as the Hotelling template and

$$\mathbf{K}(\hat{\mathbf{f}}) = \frac{1}{2}\left(\mathbf{K}_0(\hat{\mathbf{f}}) + \mathbf{K}_1(\hat{\mathbf{f}})\right). \tag{7}$$

Here, $\mathbf{K}_0(\hat{\mathbf{f}})$ and $\mathbf{K}_1(\hat{\mathbf{f}})$ denote the covariance matrices of $\hat{\mathbf{f}}$ under the hypotheses $H_0$ and $H_1$, respectively, and $\Delta\hat{\mathbf{f}} = E(\hat{\mathbf{f}} \mid H_1) - E(\hat{\mathbf{f}} \mid H_0)$ is the difference between the conditional means of $\hat{\mathbf{f}}$ under the two hypotheses. In some cases, the covariance matrix $\mathbf{K}(\hat{\mathbf{f}})$ can be ill-conditioned, and therefore its inverse cannot be stably computed. To address this, a regularized Hotelling observer (RHO) is employed. The singular value decomposition of $\mathbf{K}$ is written as

$$\mathbf{K} = \sum_{i=1}^{R} \sigma_i \mathbf{u}_i \mathbf{v}_i^{\dagger}, \tag{8}$$

where $R$ is the rank of $\mathbf{K}$, $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_R$ are the singular values of $\mathbf{K}$, $\mathbf{u}_i$ and $\mathbf{v}_i$ are the left and right singular vectors, respectively, and $\dagger$ denotes the complex conjugate transpose operation. The truncated pseudoinverse $\mathbf{K}_{\lambda}^{+}$ of $\mathbf{K}$ is employed as a stable approximation of $\mathbf{K}^{-1}$:

$$\mathbf{K}_{\lambda}^{+} = \sum_{i=1}^{P} \frac{1}{\sigma_i} \mathbf{v}_i \mathbf{u}_i^{\dagger}, \tag{9}$$

where $\lambda$ is a threshold for the singular values and $P$ is chosen to satisfy $\sigma_P \geq \lambda \sigma_1 > \sigma_{P+1}$. The truncated pseudoinverse is then used to construct the RHO template $\mathbf{w}_{\mathrm{RHO}} = \mathbf{K}_{\lambda}^{+} \Delta\hat{\mathbf{f}}$, which is then used to obtain the RHO test statistic:

$$t_{\mathrm{RHO}}(\hat{\mathbf{f}}) = \mathbf{w}_{\mathrm{RHO}}^{T} \hat{\mathbf{f}}. \tag{10}$$
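The regularized Hotelling template described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `rho_template` and `rho_statistic`, the toy 16-pixel "images", and the sample sizes are all hypothetical, and the class means and covariances are simple sample estimates.

```python
import numpy as np

def rho_template(class0, class1, lam):
    """Regularized Hotelling template from two (n_samples, n_pixels) arrays."""
    # Difference of the class means under the two hypotheses.
    delta_f = class1.mean(axis=0) - class0.mean(axis=0)
    # Average of the two class covariance matrices.
    K = 0.5 * (np.cov(class0, rowvar=False) + np.cov(class1, rowvar=False))
    U, s, Vh = np.linalg.svd(K)  # singular values are sorted in descending order
    # Keep the P singular values satisfying sigma_i >= lam * sigma_1.
    P = int(np.sum(s >= lam * s[0]))
    # Truncated pseudoinverse: sum over i <= P of (1 / sigma_i) v_i u_i^dagger.
    K_pinv = (Vh[:P].conj().T / s[:P]) @ U[:, :P].conj().T
    return K_pinv @ delta_f

def rho_statistic(template, images):
    """Linear test statistic, one value per row of `images`."""
    return images @ template

# Toy data (hypothetical): white noise, with a weak known signal under H1.
rng = np.random.default_rng(0)
n_pixels = 16
signal = 0.5 * np.exp(-0.5 * ((np.arange(n_pixels) - 8) / 2.0) ** 2)
class0 = rng.normal(size=(2000, n_pixels))
class1 = rng.normal(size=(2000, n_pixels)) + signal

w = rho_template(class0, class1, lam=1e-8)
t0 = rho_statistic(w, class0)
t1 = rho_statistic(w, class1)
```

For a well-conditioned covariance matrix and a very small threshold, no singular values are discarded and the template reduces to the ordinary Hotelling template; larger thresholds discard more components and trade noise amplification against loss of signal specificity.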

Gabor channelized Hotelling observer
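To compute a channelized Hotelling observer (CHO) template, the image data $\hat{\mathbf{f}}$ is first transformed into a vector $\mathbf{v} \in \mathbb{E}^q$, $q < n$, known as the channel output, via a transformation $\mathbf{v} = \mathbf{T}\hat{\mathbf{f}}$, where $\mathbf{T} \in \mathbb{E}^{q \times n}$ is known as the channel matrix. The test statistic of the CHO is then computed as

$$t_{\mathrm{CHO}}(\mathbf{v}) = \Delta\mathbf{v}^{T} \mathbf{K}_{\mathbf{v}}^{-1} \mathbf{v}, \tag{11}$$

where $\Delta\mathbf{v} = E(\mathbf{v} \mid H_1) - E(\mathbf{v} \mid H_0)$ is the difference between the conditional means of the channel outputs and $\mathbf{K}_{\mathbf{v}} = \frac{1}{2}(\mathbf{K}_{\mathbf{v},0} + \mathbf{K}_{\mathbf{v},1})$ is the covariance matrix of the channelized image data. Here $\mathbf{K}_{\mathbf{v},0}$ and $\mathbf{K}_{\mathbf{v},1}$ denote the covariance matrices of $\mathbf{v}$ under the two hypotheses $H_0$ and $H_1$. The CHO with Gabor channels (Gabor CHO) can be considered an anthropomorphic observer. 9,[35][36][37] The channel matrix $\mathbf{T}$ employed in the Gabor CHO is specified as follows.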
A Gabor function $C_i$ corresponding to the $i$'th row of $\mathbf{T}$ is defined in the spatial domain by multiplying a sinusoidal wave with a Gaussian function:

$$C_i(x, y) = \exp\left[-\frac{4 \ln 2 \,(x^2 + y^2)}{w_i^2}\right] \cos\left[2\pi \nu_i (x \cos\theta_i + y \sin\theta_i) + \phi_i\right], \tag{12}$$

where $(x, y)$ are spatial coordinates measured from the channel center, $w_i$ is the channel width, $\nu_i$ is the central frequency, $\theta_i$ is the orientation, and $\phi_i$ is the phase. The element $v_i$ of the channel vector $\mathbf{v} = \mathbf{T}\hat{\mathbf{f}}$ is then given by the scalar product of the discretized version of $C_i$ with the 2D image representation of $\hat{\mathbf{f}}$.
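The channelization and the CHO test statistic can be illustrated with the following sketch. It assumes a Gabor function whose Gaussian envelope has a full width at half maximum equal to the channel width, which is a common convention but an assumption here; the function names, the 17 × 17 toy images, and the four-channel matrix are hypothetical and chosen only to keep the example small.

```python
import numpy as np

def gabor_channel(size, width, freq, theta, phase):
    """Discretized Gabor function centered on a size x size grid (assumed form)."""
    c = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size] - c
    # ASSUMPTION: Gaussian envelope with FWHM equal to `width`.
    envelope = np.exp(-4.0 * np.log(2.0) * (x ** 2 + y ** 2) / width ** 2)
    carrier = np.cos(2.0 * np.pi * freq * (x * np.cos(theta) + y * np.sin(theta)) + phase)
    return envelope * carrier

def cho_statistic(T, class0, class1, images):
    """Channelized Hotelling test statistic for each row of `images`."""
    v0, v1 = class0 @ T.T, class1 @ T.T              # channel outputs v = T f
    dv = v1.mean(axis=0) - v0.mean(axis=0)           # mean channel-output difference
    Kv = 0.5 * (np.cov(v0, rowvar=False) + np.cov(v1, rowvar=False))
    w = np.linalg.solve(Kv, dv)                      # channel-space template
    return (images @ T.T) @ w

# Toy example (hypothetical sizes and parameters).
rng = np.random.default_rng(0)
size = 17
params = [(th, ph) for th in (0.0, np.pi / 2) for ph in (0.0, np.pi / 2)]
T = np.stack([gabor_channel(size, 8.0, 0.1, th, ph).ravel() for th, ph in params])

yy, xx = np.mgrid[0:size, 0:size] - (size - 1) / 2.0
signal = 0.5 * np.exp(-(xx ** 2 + yy ** 2) / 8.0).ravel()
class0 = rng.normal(size=(1000, size * size))
class1 = rng.normal(size=(1000, size * size)) + signal

t0 = cho_statistic(T, class0, class1, class0)
t1 = cho_statistic(T, class0, class1, class1)
```

Because the covariance matrix is estimated in the low-dimensional channel space, it is typically much better conditioned than the full image covariance matrix, which is one practical reason channelized observers are attractive.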

Deep Learning-Based Image Super-Resolution
In the context of an image super-resolution problem, an LR image $\mathbf{f}_{\mathrm{LR}} \in \mathbb{E}^{n'}$, $n' \in \mathbb{N}$, $n' \leq n$, can be formally thought of as being related to the sought-after HR image $\mathbf{f}_{\mathrm{HR}} \in \mathbb{E}^n$ via the following equation:

$$\mathbf{f}_{\mathrm{LR}} = \mathbf{H}_{\mathrm{blur}} \mathbf{f}_{\mathrm{HR}} + \mathbf{n}, \tag{13}$$

where $\mathbf{H}_{\mathrm{blur}} \in \mathbb{E}^{n' \times n}$ represents a degradation operator that removes the higher spatial frequencies from $\mathbf{f}_{\mathrm{HR}}$ and $\mathbf{n}$ denotes the noise. Given a specific LR image, an estimate $\mathbf{f}_{\mathrm{SR}} \in \mathbb{E}^n$ of the original HR image is obtained using image super-resolution methods. However, this is a challenging ill-posed inverse problem. In recent years, deep learning has been widely applied to achieve image super-resolution. [5][6][7][8] A popular class of deep learning-based approaches calls for establishing a mapping from the space of LR images to the space of HR images:

$$\mathbf{f}_{\mathrm{SR}} = S_{\theta}(\mathbf{f}_{\mathrm{LR}}), \tag{14}$$

where $S_{\theta}$ is a deep neural network parametrized by $\theta$.
The network parameters $\theta$ are learned by minimizing a loss function $L$ that measures the discrepancy between the network output $S_{\theta}(\mathbf{f}_{\mathrm{LR}})$ and the corresponding HR image over a training dataset. Various loss functions such as the $\ell_1$ or $\ell_2$ loss, or a perceptual loss, 38 can be used to define $L$. Additionally, an adversarial loss that attempts to match the distribution of SR images to the distribution of original HR images can also be employed. 8 The two DL-SR networks considered in this study are the super-resolution convolutional neural network (SRCNN) 6 and the super-resolution generative adversarial network (SRGAN). 8 The architectures of these two networks are shown in Fig. 1. The architecture of the SRCNN consists of feed-forward convolutional layers interspersed with pointwise rectified linear unit (ReLU) nonlinearities. 6,39 The SRGAN architecture consists of a generative network, which is an image-to-image mapping network consisting of convolutional residual blocks interspersed with pointwise ReLU nonlinearities. A discriminator network is jointly trained along with the generative network and provides the adversarial loss for matching the distribution of generated SR images to the distribution of HR images. 8

Numerical Studies
Computer-simulation studies were employed to objectively evaluate the DL-SR methods described above with two binary signal detection tasks: (i) a Rayleigh detection task and (ii) an MC cluster detection task. The NOs described in Sec. 2.2 were computed on the SR images, as well as the LR and true HR images, to objectively assess the impact of DL-SR on the considered tasks.

Clustered Lumpy Background
The clustered lumpy background (CLB) model was developed by Bochud et al. 40 to generate random backgrounds that resemble mammographic textures. The value of a CLB image at position $\mathbf{r}$ is

$$f_b(\mathbf{r}) = \sum_{k=1}^{K} \sum_{n=1}^{N_k} l\left(\mathbf{r} - \mathbf{r}_k - \mathbf{r}_{kn}, \mathbf{R}_{\theta_{kn}}\right), \qquad l(\mathbf{r}, \mathbf{R}_{\theta}) = \exp\left(-\alpha \frac{\|\mathbf{R}_{\theta}\mathbf{r}\|^{\beta}}{L(\mathbf{R}_{\theta}\mathbf{r})}\right). \tag{16}$$

Here $l(\mathbf{r}, \mathbf{R}_{\theta})$ is known as the blob function. The integer $K$ denotes the number of clusters, sampled from a Poisson distribution with a mean of $\bar{K}$: $K \sim \mathrm{Poiss}(\bar{K})$; $N_k$ specifies the number of blobs in the $k$'th cluster, sampled from a Poisson distribution with a mean of $\bar{N}$: $N_k \sim \mathrm{Poiss}(\bar{N})$; $\mathbf{r}_k$ indicates the center location of the $k$'th cluster, sampled uniformly over the field of view; and $\mathbf{r}_{kn}$ represents the center location of the $n$'th blob in the $k$'th cluster, sampled from a Gaussian distribution with a center of $\mathbf{r}_k$ and a standard deviation of $\sigma$. The matrix $\mathbf{R}_{\theta_{kn}}$ represents the rotation corresponding to the angle $\theta_{kn}$ sampled from a uniform distribution between 0 and $2\pi$, $L(\mathbf{r})$ refers to the radius of the ellipse with half-axes $L_x$ and $L_y$, and $\alpha$ and $\beta$ are adjustable coefficients. The parameters of the CLB model employed in both the Rayleigh detection task and the MC cluster detection task are shown in Table 1.

Rayleigh Detection Task with a Clustered Lumpy Background Model
The Rayleigh detection task is a natural task for assessing the resolution properties of imaging systems and has been employed previously for optimizing tomographic imaging systems. 30,31 This is a binary signal detection task, in which hypothesis $H_0$ corresponds to a signal $\mathbf{f}_{s_0}$ consisting of two adjacent point objects and hypothesis $H_1$ corresponds to a signal $\mathbf{f}_{s_1}$ consisting of a single-line object.

Simulated image data for Rayleigh detection task
Given the definition of signals $\mathbf{f}_{s_0}$ and $\mathbf{f}_{s_1}$ provided above, the generation of LR images under $H_0$ and $H_1$ is written as

$$H_0: \mathbf{f}_{\mathrm{LR}} = \mathbf{H}_{\mathrm{blur},1}(\mathbf{f}_b + \mathbf{f}_{s_0}) + \mathbf{n}, \tag{17}$$

$$H_1: \mathbf{f}_{\mathrm{LR}} = \mathbf{H}_{\mathrm{blur},1}(\mathbf{f}_b + \mathbf{f}_{s_1}) + \mathbf{n}, \tag{18}$$

where $\mathbf{f}_b$ denotes a CLB image of size 128 × 128 with parameters defined in Table 1 and $\mathbf{n}$ denotes the measurement noise. Given an adjustable parameter $L$, termed the signal length, $\mathbf{f}_{s_0}$ is specified by first defining two Kronecker delta functions separated by a distance of $L - 2$ and convolving them with a Gaussian function of standard deviation 1.375 pixels. The signal $\mathbf{f}_{s_1}$ is specified by first defining a horizontal line of length $L$, which is subsequently convolved with the same Gaussian function. The signals are inserted such that the centers of the signals coincide with the center of the image. The Rayleigh detection task was performed independently on the following datasets, where the HR dataset consists of images of the type

$$\mathbf{f}_{\mathrm{HR}} = \mathbf{f}_b + \mathbf{f}_{s_i} + \mathbf{n}, \tag{19}$$

the LR dataset consists of images of the type

$$\mathbf{f}_{\mathrm{LR}} = \mathbf{H}_{\mathrm{blur},1}(\mathbf{f}_b + \mathbf{f}_{s_i}) + \mathbf{n}, \tag{20}$$

and the SR dataset consists of images of the type

$$\mathbf{f}_{\mathrm{SR}} = S(\mathbf{f}_{\mathrm{LR}}), \tag{21}$$

where $\mathbf{H}_{\mathrm{blur},1}$ represents a Gaussian filter with a standard deviation of 1.5 pixels and $i = 0, 1$ indexes the hypothesis. Here, $S$ denotes the DL-SR operation performed by either the SRCNN or the SRGAN, and $\mathbf{n}$ denotes the sum of pixel-wise independent and identically distributed (IID) Poisson noise with a standard deviation scaled by $\sigma_p = 0.013$ and IID Gaussian noise with a standard deviation $\sigma_g = 0.35$. The simulation of an example LR image according to the described procedure is shown in Fig. 2. Two separate studies were formulated based on the Rayleigh detection task.
1. Signal length variation study. In this study, the signal length parameter $L$, which pertains to the distance between the two point objects in $\mathbf{f}_{s_0}$ or the length of the line in $\mathbf{f}_{s_1}$, was varied to investigate the resolving power of the DL-SR algorithms. The signal lengths $L \in \{5, 6, 7, 8, 9\}$ were employed in this study, as shown in Fig. 3.
2. Network complexity variation study. To investigate how the DL-SR network complexity correlates with the task performance for a fixed object model and task design, a network complexity variation study in which the number of layers of a DL-SR network was varied was conducted. The SRGAN employs an additional tunable parameter controlling the trade-off between the MSE loss and the discriminative loss, the optimal value of which may depend, among other factors, on the number of layers in the network. Hence, only SRCNN was employed in this study.
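The construction of the two Rayleigh signals and of a blurred, noisy LR image can be sketched as follows. Several details are assumptions made only for illustration and are labeled in the code: the pixel-level placement of the point and line objects, unit signal amplitudes, a Gaussian approximation of the signal-dependent (Poisson) noise term, and a smooth random field used as a stand-in for the CLB background.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about four standard deviations."""
    radius = int(np.ceil(4 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian filtering with zero padding at the borders."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")

def rayleigh_signals(size, length, sigma=1.375):
    """Two-point signal (H0) and line signal (H1) for a given signal length."""
    row, start = size // 2, size // 2 - length // 2
    # ASSUMPTION: unit amplitudes and this particular pixel placement.
    two_points = np.zeros((size, size))
    two_points[row, [start + 1, start + length - 1]] = 1.0  # separation L - 2
    line = np.zeros((size, size))
    line[row, start:start + length] = 1.0                   # L pixels long
    return gaussian_blur(two_points, sigma), gaussian_blur(line, sigma)

def simulate_lr(background, signal, rng, sigma_blur=1.5, sigma_p=0.013, sigma_g=0.35):
    """Blur the object and add signal-dependent plus Gaussian noise."""
    clean = gaussian_blur(background + signal, sigma_blur)
    # ASSUMPTION: Poisson-like term modeled as Gaussian noise whose standard
    # deviation is sigma_p * sqrt(intensity); the exact scaling is not specified.
    poisson_like = sigma_p * np.sqrt(np.clip(clean, 0.0, None)) * rng.standard_normal(clean.shape)
    gaussian = sigma_g * rng.standard_normal(clean.shape)
    return clean + poisson_like + gaussian

rng = np.random.default_rng(0)
f_s0, f_s1 = rayleigh_signals(size=128, length=7)
background = gaussian_blur(rng.random((128, 128)), 4.0)  # stand-in, NOT a CLB image
f_lr = simulate_lr(background, f_s1, rng)
```

Repeating the construction for each signal length from 5 to 9 would produce the family of signal pairs used in a signal length variation study of this kind.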

Training details for the DL-SR networks
For the signal length variation study, both the SRCNN and SRGAN were trained and evaluated. The training and validation data for the SRCNN consisted of 5000 and 625 class-balanced images, respectively. Because the SRGAN has more trainable parameters, 20,000 images were used for training and 2000 images were used for validation. Examples of HR, LR, and SR images produced by the networks are shown in Fig. 4(a). For the network complexity variation study, seven SRCNNs with the number of convolutional layers ranging from 2 to 8 were employed. For all of the SRCNNs, the filter size in the first layer was fixed to 9 × 9, whereas the filter size for the other layers was fixed to 5 × 5. The number of filters in all layers was fixed to 32, except the last layer, in which the number of filters was fixed to 1. All SRCNNs were trained on 15,000 images and validated on 3000 images with class balance.
The SRCNN was trained with an MSE loss, and the SRGAN was trained using an MSE loss and an adversarial loss. All DL-SR networks to be evaluated in the Rayleigh detection task were trained on mini-batches at each iteration using the Adam optimizer. 41 The DL-SR models that achieved the best performance on the validation set were used for evaluation. Both DL-SR networks were implemented under the TensorFlow 2.0 framework and trained on NVIDIA GPUs.

Microcalcification Cluster Detection Task with a Clustered Lumpy Background Model
Motivated by the clinical value of detecting MC clusters in mammograms that may be associated with malignancy in breast lesions, 42,43 a stylized SKS/BKS binary signal detection task of identifying an image with or without an MC cluster present was studied. The objective of this study was to determine how the capacity of an NO affects observer performance on SR images. In essence, whether or not SR aids the performance of suboptimal observers was systematically studied.

Simulated image data for MC cluster detection task
The HR MC cluster dataset was created as follows. First, 128 × 128 CLB images were created to simulate the mammographic backgrounds, as described in Sec. 3.1. The signal-absent HR images $\mathbf{f}_0$ correspond to the case in which $\mathbf{f}_{s_0} = \mathbf{0}$ and, hence, were kept equal to the CLB images. The signal insertion pipeline employed to generate the signal-present HR image $\mathbf{f}_1$ is described as follows.
The scalar $c$ represents a contrast factor uniformly sampled from the range [0.05, 0.06] that is chosen to visually match the contrast of real lesions. Given the generated HR image, the corresponding LR image was simulated as follows, based on the degradation model described by You et al.:

$$\mathbf{f}_{\mathrm{LR}} = \mathbf{H}_{\mathrm{blur},2}\mathbf{f}_i + \mathbf{n}, \quad i = 0, 1.$$

Here $\mathbf{H}_{\mathrm{blur},2}$ represents a Gaussian blurring operation with a standard deviation of 1.5 pixels, followed by downsampling by a factor of 2. Pixel-wise IID Poisson noise with a standard deviation scaled by a factor $\sigma_p = 0.0001$ and IID Gaussian noise with a standard deviation $\sigma_g = 0.001$ were added to both the HR and LR images. These noise values were chosen independently of the Rayleigh task so as to not saturate the observer performance on the LR images. To enable direct comparison with the HR and SR images, an additional operation $U$ representing upsampling by a factor of 2 was used on the LR images. Similar to the Rayleigh detection task, the MC cluster detection task was performed on the following datasets: (1) the HR dataset consisting of images of the type $\mathbf{f}_{\mathrm{HR}} = \mathbf{f}_i + \mathbf{n}$, where $i = 0, 1$ indexes the MC cluster-absent and MC cluster-present hypotheses; (2) the LR dataset consisting of images of the type $\mathbf{f}_{\mathrm{LR}} = \mathbf{H}_{\mathrm{blur},2}\mathbf{f}_i + \mathbf{n}$, $i = 0, 1$, along with the additional upsampling operation $U$ acting on $\mathbf{f}_{\mathrm{LR}}$; and (3) the SR dataset consisting of images of the type $\mathbf{f}_{\mathrm{SR}} = S(U\mathbf{f}_{\mathrm{LR}})$, where $S$ denotes the DL-SR operation performed by the SRCNN.

Training details for DL-SR networks
The SRCNN employed in this study was trained on a dataset of 40,000 images and validated on a dataset of 4000 images, both with balanced classes. The network was trained with the Adam optimizer 41 with a learning rate of $5 \times 10^{-5}$ for 1000 epochs to minimize the MSE loss. The SRCNN model with the best validation performance was used. Examples of the SR images produced by the SRCNN along with the HR and the LR images are shown in Fig. 4(b).

Objective evaluation metrics for the Rayleigh detection task
To evaluate the DL-SR networks with task-based metrics, three NOs, namely the RHO, Gabor CHO, and ResNet-IO, were employed. The test statistics for the three NOs were computed on the HR, LR, and SR images that were centrally cropped to a size of 64 × 64. ROC curves were computed, and the area under the ROC curve (AUC) was employed as a figure of merit. All evaluation metrics were computed on a balanced test dataset of 40,000 images. Nonparametric estimation of the AUC confidence intervals was carried out using DeLong's algorithm, 46,47 with the help of the pROC package in R. 48 Additionally, traditional IQ metrics such as PSNR and SSIM were computed on the LR and SR images.
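As a small illustration of the figure of merit, the nonparametric (Mann-Whitney) AUC estimate can be computed directly from the test statistics of the two classes. This sketch is not the pROC-based workflow used in the study and does not compute DeLong confidence intervals; the function name `auc_mann_whitney` is hypothetical.

```python
import numpy as np

def auc_mann_whitney(t0, t1):
    """Nonparametric AUC: P(t1 > t0) + 0.5 * P(t1 == t0) over all pairs."""
    t0 = np.sort(np.asarray(t0, dtype=float))
    t1 = np.asarray(t1, dtype=float)
    less = np.searchsorted(t0, t1, side="left")       # number of t0 values < each t1
    less_eq = np.searchsorted(t0, t1, side="right")   # number of t0 values <= each t1
    ties = less_eq - less
    return float((less.sum() + 0.5 * ties.sum()) / (t0.size * t1.size))

# Test statistics under H0 and H1 for a perfectly separable toy case.
auc_perfect = auc_mann_whitney([0.0, 1.0], [2.0, 3.0])
# Identical distributions give chance-level performance.
auc_chance = auc_mann_whitney([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
```

Sorting one class and using binary search keeps the cost at O(n log n), which matters for test sets with tens of thousands of images.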
To compute the RHO test statistic, 500,000 images containing two point objects and 500,000 images containing the line-shaped object were utilized to estimate the empirical covariance matrix $\mathbf{K}(\hat{\mathbf{f}})$. The threshold parameter $\lambda$ in Eq. (9) was swept from $10^{-9}$ to $10^{-4}$, and the detection performance was evaluated on a validation set of 4000 class-balanced images. The value of $\lambda$ that yielded the best RHO performance on the validation data was selected. This RHO with the selected parameter $\lambda$ was applied to a test set consisting of 40,000 class-balanced images.
The channel matrix corresponding to the Gabor CHO comprised a set of 60 Gabor channels. Each Gabor channel was associated with one out of six passbands, one out of five orientations, and one out of two phases. The six passbands each have a spatial frequency bandwidth of 1 octave with a center frequency $\nu = 3/256, 3/128, 3/64, 3/32, 3/16$, and $3/8$ cycles/pixel. The five orientations were $0, 2\pi/5, 4\pi/5, 6\pi/5$, and $8\pi/5$, and the two phases were $0$ and $\pi/2$. Examples of Gabor channel templates are shown in Fig. 5. The channelized covariance matrix was estimated using 100,000 images from each class with 500,000 noise realizations for each class.
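The channel set described above is the Cartesian product of the passbands, orientations, and phases, which can be enumerated as in the sketch below. The variable names are illustrative; the channel widths associated with the 1-octave bandwidth are not specified here and are therefore omitted.

```python
import itertools
import math

# Six center frequencies, each one octave apart: 3/256, 3/128, ..., 3/8 cycles/pixel.
center_freqs = [3.0 / 256 * 2 ** k for k in range(6)]
# Five orientations: 0, 2*pi/5, ..., 8*pi/5.
orientations = [2.0 * math.pi * k / 5 for k in range(5)]
# Two phases: 0 and pi/2.
phases = [0.0, math.pi / 2]

# One (frequency, orientation, phase) tuple per Gabor channel.
channel_params = list(itertools.product(center_freqs, orientations, phases))
```

Each tuple would then parameterize one row of the channel matrix.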
The ResNet-IO, as shown in Fig. 6(a), was employed to approximate the IO test statistic. To obtain a good approximation of the IO using ResNets, the optimum network capacity needs to be determined empirically by sweeping the number of layers used in the ResNet architecture and choosing the configuration that gives the best detection performance. A large training dataset must be used to correctly represent the data distribution. Here the network was initialized with the help of the RHO template to give the best performance and to speed up convergence. A family of ResNets comprising various numbers of residual blocks was trained on a dataset consisting of 100,000 training images and validated on 4000 images from each of the two classes. The binary cross-entropy loss was minimized using the Adam optimizer with a learning rate of $1 \times 10^{-6}$. Additionally, a "semi-online learning" method in which the measurement noise was generated on-the-fly, as described in Ref. 14, was utilized to mitigate the overfitting problem. The ResNet that had the best validation performance was chosen as the ResNet-IO.

Objective evaluation for the MC cluster detection task
As described previously, the objective of this study was to investigate the potential benefit of DL-SR as it relates to the capacity of an NO. A binary signal detection task was conducted to distinguish whether an image contains the MC cluster signal or not. To assess the task-based performance, a family of ResNet-based observers consisting of 2, 4, 6, or 8 residual blocks was employed in the detection task. The architecture of the ResNet-based observers is shown in Fig. 6(b). Each of these observers was trained on class-balanced datasets of sizes 5000, 10,000, 20,000, 50,000, and 100,000 by minimizing the binary cross-entropy loss until its detection performance converged. Each simulated MC cluster image in the training dataset was augmented four times by flipping. The AUC values produced by the trained ResNet-based observers on a held-out test set containing 20,000 images from each class were used to evaluate the signal detection performance. The ResNet-based observer that achieves the best test performance without further improvement with either a deeper network architecture or a larger training dataset could be considered an approximated IO. 14

Results

Impact of regularization on the Hotelling observer performance
In addition to introducing high-frequency features to an LR image, the DL-SR networks also suppress the per-pixel IID noise added to the LR images. Due to this, the covariance matrix $\mathbf{K}(\mathbf{f}_{\mathrm{SR}})$ of the SR images is ill-conditioned. Hence, as mentioned in Sec. 2.2.2, regularization is needed to stably invert it to obtain the Hotelling template. The performance of the RHO therefore depends upon the regularization parameter $\lambda$ employed for truncating the singular values of $\mathbf{K}$. Figure 7 shows the Hotelling templates of the HR images, the LR images, and the images super-resolved by the SRCNN and the SRGAN. It can be seen that, for low values of $\lambda$, the Hotelling template is noisy due to the unstable inversion of $\mathbf{K}$. On the other hand, for high values of $\lambda$, degradation of the signal specificity corresponding to the truncation of singular values can be seen.

Impact of signal length on observer performance
The traditional IQ metrics and AUC values for the signal length variation study, computed on a class-balanced test set consisting of 40,000 images, are plotted in Figs. 8 and 9, respectively. As seen in Fig. 8, the SR images generated by the SRCNN and SRGAN show an improvement in IQ across various signal lengths compared with their LR counterparts in terms of the traditional IQ metrics. Moreover, no significant changes in the traditional IQ metrics were observed among SR images when varying the signal length. This is due to the degradation model and DL-SR network architecture being consistent across different signal lengths and the physical difference among images with various signal lengths being minor. However, as shown in Fig. 9, NO performance provides different insights into the DL-SR behavior. First, it can be seen that the AUC values corresponding to all NOs increased consistently with the signal length for the HR, LR, and both types of SR images. This is due to the detection task becoming easier with an increasing signal length. Second, the AUC values corresponding to HR images were significantly greater than those on LR images and SR images. This suggests that the second- and potentially higher-order statistical properties of the images may not be recovered by the DL-SR networks. Third, it is worth noting that, in some cases, there was a small improvement in the AUC values of the RHO and a small but significant improvement in the AUC values of the Gabor CHO corresponding to the SR images as compared with the LR images. This could be explained by both linear observers, namely the RHO and the Gabor CHO acting on the SR images, having the benefit of a nonlinear preprocessing block in the form of the DL-SR network. Finally, as shown in Fig. 9(c), there was no improvement in the performance of the ResNet-IO as a result of the employed DL-SR networks, which is consistent with the data-processing inequality. 29

[Figure caption: DL-SR resulted in a small improvement in the CHO performance but no improvement in the RHO and ResNet-IO performance relative to the LR images. As such, the observer performance on the HR images is much higher than the performance on the LR and SR images.]

Impact of number of layers in DL-SR networks on observer performance
The traditional IQ metric MSE and the NO performance measured on the LR and SR images as the number of layers in the SRCNN was varied are shown in Figs. 10 and 11, respectively. As shown in Fig. 10, the MSEs decreased when the number of layers in the SRCNN increased, as expected. This indicates that the DL-SR networks improved certain first-order statistics of the images. However, this trend is not always consistent with the NO performance measured by AUC values. As shown in Fig. 11, it was observed that the AUC values for the RHO measured on SR images were no greater than those computed using the LR images. Moreover, the RHO performance decreased as the number of DL-SR network layers increased. This suggests that the second-order statistical properties of the images were degraded by the DL-SR networks. To further analyze this, the singular values of the covariance matrix $\mathbf{K}(\mathbf{f}_{\mathrm{SR}})$ of the SRCNN-resolved images were computed for networks having different numbers of layers. As shown in Fig. 12, the singular values indicate that, as the number of layers in the DL-SR network increased, $\mathbf{K}(\mathbf{f}_{\mathrm{SR}})$ became increasingly ill-conditioned.
On the other hand, the AUC values for the Gabor CHO on SR images were greater than those measured on LR images, and the performance of Gabor CHO on SR images increased as the number of layers increased from 2 to 6, after which it saturated and reduced slightly for the SRCNN composed of 7 and 8 layers. This suggests that the second-order statistics of the Gabor-channelized images were improved by the DL-SR networks but that this improvement reached a plateau as the number of layers increased. The singular values of the covariance matrix $K_{v}$ of the Gabor-channelized, SRCNN-resolved images were computed for the DL-SR networks with different numbers of layers. As shown in Fig. 13, the singular value decay of $K_{v}$ was faster for DL-SR networks with more layers, similar to the behavior observed for the RHO.
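The covariance analysis described above can be sketched as follows. This is a minimal illustration, assuming an ensemble of images (or channel outputs) stored as a NumPy array; the function names `covariance_singular_values`, `condition_number`, and `smooth`, as well as the `floor` parameter, are hypothetical, and the smoothing filter is only a stand-in for a processing operation, not an actual DL-SR network.

```python
import numpy as np


def covariance_singular_values(samples):
    """Singular values (descending) of the sample covariance of flattened samples."""
    x = np.asarray(samples, dtype=np.float64).reshape(len(samples), -1)
    x = x - x.mean(axis=0, keepdims=True)
    # The squared singular values of the centered data matrix, divided by (N - 1),
    # equal the leading min(N, P) singular values of the P x P sample covariance,
    # so the covariance matrix never has to be formed explicitly.
    s = np.linalg.svd(x, compute_uv=False)
    return s**2 / (x.shape[0] - 1)


def condition_number(singular_values, floor=1e-12):
    """Ratio of largest to smallest singular value; `floor` guards against division by zero."""
    s = np.asarray(singular_values, dtype=np.float64)
    return float(s[0] / max(s[-1], floor))


def smooth(images, w=0.2):
    """Stand-in processing step (assumption): separable circular 3-tap smoothing filter."""
    out = np.asarray(images, dtype=np.float64)
    for axis in (1, 2):
        out = (1 - 2 * w) * out + w * (np.roll(out, 1, axis=axis) + np.roll(out, -1, axis=axis))
    return out
```

Plotting the output of `covariance_singular_values` on a logarithmic axis for each processed ensemble gives a decay curve of the kind discussed above; a faster decay corresponds to a larger `condition_number` and therefore to a covariance matrix that is harder to invert stably in a Hotelling-type observer.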

Impact of Observer Capacity on Benefit of DL-SR for MC Cluster Detection Performance
The objective of this study was to determine how the capacity of an NO relates to its task performance on SR images. The traditional IQ metrics MSE, PSNR, and SSIM were computed for the LR and SR images generated by the SRCNN on the MC cluster dataset. As shown in Table 2, the IQ measured with these metrics improved for the SRCNN-resolved images compared with the LR counterparts. The capacity of a ResNet-based observer was varied by varying the number of residual blocks that constitute the ResNet. Figure 14 shows the performance of ResNet-based observers consisting of 2, 4, 6, and 8 residual blocks trained on a dataset of 50,000 images (200,000 considering fourfold flip-augmentation). It was observed that ResNet-based observers of smaller capacity benefited from the particular DL-SR network employed. In this case, the DL-SR network can be interpreted as an additional preprocessing block for the ResNet observer that effectively increases the capacity of the observer. However, as the capacity of the observer was increased, the SR operation gave diminishing returns toward improving the task performance. As the NO performance plateaued with increasing capacity, it approached that of the ResNet-IO, and the MC cluster detection performance on SR images was no greater than that on LR images. This behavior is consistent with the data processing inequality, 29 which suggests that postprocessing operations such as image super-resolution will not increase the information content in the image. As a result, the MC cluster detection performance of a ResNet-IO on SR images should not be expected to surpass that on the original LR images.
Next, ResNet-based observers of varying depths were trained on datasets of different sizes to fulfill their corresponding capacity for each resolution. For each dataset, the optimal ResNet-based observer was identified based on the best performance on the validation dataset. The results in Fig. 15 show the performance of the optimal ResNet-based observer for each dataset size. It was observed that, as the amount of available training data increased, the MC cluster detection performance of the ResNet-based observers increased. More interestingly, given small training datasets of 5000, 10,000, and 20,000 images, the DL-SR network did improve the detection performance on SR images compared with that on LR images. This demonstrates a situation in which the DL-SR operation aided the MC cluster detection performance. For training dataset sizes of 50,000 and beyond, the ResNet-based observer approached the ResNet-IO, and its performance on the images resolved by the DL-SR networks was no better than its performance on the LR images.
The observations in Figs. 14 and 15 both illustrate that, in the case of suboptimal neural network (NN)-based observers, such as those with limited capacity or those trained on limited data, DL-SR networks may be employed to improve the detection performance compared with that achieved on the LR images. However, if the NN-based observer approximates the IO, preprocessing the LR images using a DL-SR network will not improve the detection performance of the observer.
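The limit on IO performance invoked above can be stated compactly. The notation below is assumed for illustration only: $H$ denotes the binary hypothesis, $\mathbf{g}_{\mathrm{LR}}$ the LR image, and $T$ any super-resolution operator applied to it. The mutual-information bound is the data processing inequality itself; the AUC bound follows because any observer acting on $T(\mathbf{g}_{\mathrm{LR}})$ is also a valid observer acting on $\mathbf{g}_{\mathrm{LR}}$, and the IO maximizes the AUC among all observers.

```latex
% Assumed notation: H is the hypothesis, g_LR the low-resolution image,
% and T a super-resolution operator applied to g_LR.
H \;\rightarrow\; \mathbf{g}_{\mathrm{LR}} \;\rightarrow\; T(\mathbf{g}_{\mathrm{LR}})
\quad\Longrightarrow\quad
I\bigl(H;\, T(\mathbf{g}_{\mathrm{LR}})\bigr) \;\le\; I\bigl(H;\, \mathbf{g}_{\mathrm{LR}}\bigr),
\qquad
\mathrm{AUC}_{\mathrm{IO}}\bigl[T(\mathbf{g}_{\mathrm{LR}})\bigr] \;\le\; \mathrm{AUC}_{\mathrm{IO}}\bigl[\mathbf{g}_{\mathrm{LR}}\bigr].
```

No comparable bound applies to a suboptimal observer, which is why such an observer can perform better on SR images than on LR images.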

Discussion
Deep learning techniques have been adopted for a wide range of medical imaging applications, including image restoration. Although various traditional IQ metrics have been computed to assess the effect of these deep learning-based methods, a task-based evaluation of these approaches has been largely lacking. A recent study conducted by Li et al. 15 demonstrated that deep neural network-based image denoising methods can result in a loss of task-relevant information, despite an improvement in several traditional IQ metrics. In a similar vein, this work studies the impact of DL-SR on binary signal detection tasks. It is important to reiterate that the main goal of this work is to comprehensively study the impact of DL-SR on task performance for known tasks under known statistical conditions. It is not to explore whether DL-SR can be a viable practical solution to a particular real problem. Such a systematic and comprehensive evaluation is not possible with common clinical datasets, which have several different and unknown sources of variability that may act as confounding factors in our analysis. Therefore, for the purposes of this work, the stylized setup presented is appropriate.
A Rayleigh detection task was employed to assess the impact of the design of the signal and the depth of the DL-SR network, and an MC cluster detection task was employed to study how DL-SR affects NN-based observers of different capacities. The numerical results for the SKE/BKS Rayleigh detection task revealed that the loss of task-relevant information in LR images cannot be recovered by the DL-SR operation, even though a mild improvement in detection performance was observed with suboptimal observers. Furthermore, it was observed that, while increasing the depth of the DL-SR network improves the traditional IQ metrics, improved task performance does not always follow. This suggests that the mantra "deeper is better" for designing neural network architectures for image super-resolution is not necessarily applicable when task performance is considered. As such, seeking to minimize a loss function solely related to traditional IQ metrics may lead to a situation in which the image statistics important to the defined task are degraded.
Furthermore, it is of interest to investigate conditions under which the DL-SR improves the signal detection task performance. Using SRCNN as an example, an SKS/BKS MC cluster detection task was conducted to investigate the capacity of the NN-based observers on SR images, as compared with that on LR and HR images. It was observed that DL-SR improved the signal detection performance of suboptimal observers that do not accurately approximate IOs due to either a limited amount of training data or the limited complexity of the observer. Given sufficient training data and an observer with sufficient complexity for the particular task considered, an IO can be approximated, and the benefit of DL-SR toward improving the task performance is lost. This suggests that the impact of DL-SR on a binary signal detection task depends on a combination of factors such as the DL-SR networks, the observers, and the defined task. Thus a task-based evaluation of DL-SR methods is essential to accurately quantify the benefit of DL-SR for clinical practice.

Fig. 15 Performance of the optimal ResNet-based observer for a particular dataset size trained on HR, LR, and SR images.
Some important topics remain to be investigated in the future. The binary signal detection tasks considered in this study are simplistic compared with real-world clinical tasks. Future work could investigate the performance of DL-SR methods as preprocessing blocks on tasks such as multi-class classification, lesion segmentation, and image registration. Since the introduction of SRCNN and SRGAN, several deep learning-based methods that improve the super-resolution performance have been proposed. The task-based evaluation pipeline presented in this study can readily be applied to the newer DL-SR methods in which different network architectures or loss functions are employed. It is known that deep learning-based methods may lead to hallucinations, especially when acting on data outside the training distribution. 49 Hence, an objective assessment of the robustness of DL-SR methods to distribution shifts is also an important topic for future investigation. Additionally, it will be important to conduct human reader studies to assess the performance of DL-SR methods for specific clinical tasks. The results demonstrated in our study will motivate the development of DL-SR methods in directions in which the loss of task-specific information can be mitigated by incorporating such information in designing the network architecture or the loss functions. 50

Conclusion
In this paper, we presented a task-based evaluation to assess the impact of DL-SR methods on binary signal detection. An SKE/BKS Rayleigh detection task and an SKS/BKS MC cluster detection task were conducted on simulated image datasets with a CLB. Our results verify that the performance of an IO cannot be improved via DL-SR methods, which is consistent with the data processing inequality. Also, an improvement in traditional IQ metrics induced by DL-SR does not always correlate with the impact of DL-SR on observer performance. Despite this, the numerical experiments presented indicate that DL-SR methods improved the signal detection performance of suboptimal NOs in certain cases. The reported results emphasize the necessity of a task-based evaluation of DL-SR methods and suggest future avenues for developing effective DL-SR algorithms.

Disclosures
The authors declare no potential conflicts of interest.
Xiaohui Zhang received her BE degree in biomedical engineering from Beihang University, Beijing, China, in 2018. She is a PhD candidate in the Department of Bioengineering at the University of Illinois at Urbana-Champaign (UIUC). Her research interests include computational methods for neuroimaging and machine learning for medical imaging applications. She is also a member of SPIE.
Varun A. Kelkar received his MS degree in electrical and computer engineering from UIUC in 2019 and his BTech degree in engineering physics from the Indian Institute of Technology Madras, Tamil Nadu, India, in 2017. He is a PhD candidate in the Department of Electrical and Computer Engineering, UIUC, Illinois, USA. His research interests include computational imaging, inverse problems, signal processing, optics, and machine learning. He is a member of SPIE. He was a recipient of the 2019 SPIE Optics and Photonics Education Scholarship and the 2021 Oak Ridge Institute of Science and Education fellowship.
Jason Granstedt received his BS and MS degrees in electrical engineering from Virginia Polytechnic Institute and State University in 2015 and 2017, respectively. He is currently a PhD candidate in the Department of Computer Science at the UIUC. His research interests include task-based analysis of images and application of machine learning techniques to medical imaging. He is a member of SPIE.
Hua Li is a research associate professor in the Department of Bioengineering at the UIUC and a medical physicist at Carle Foundation Hospital, Urbana, Illinois, USA. Her research work focuses on developing innovative medical imaging and image analysis techniques to solve the challenges seen in clinical practice, toward improving personalized patient care. She serves as the deputy editor for the Journal of Medical Physics and a reviewer for a set of journals and NIH study sections.
Mark A. Anastasio is the Donald Biggar Willett Professor in Engineering and the head of the Department of Bioengineering at the UIUC. He is a fellow of SPIE, the American Institute for Medical and Biological Engineering, and the International Academy of Medical and Biological Engineering. His research addresses computational image science, inverse problems in imaging, and machine learning for imaging applications. He has contributed to emerging biomedical imaging technologies, including photoacoustic computed tomography and ultrasound computed tomography.