Super-resolution recurrent convolutional neural networks for learning with multi-resolution whole slide images

Abstract. We study a problem scenario of super-resolution (SR) algorithms in the context of whole slide imaging (WSI), a popular imaging modality in digital pathology. Instead of just one pair of high- and low-resolution images, which is typically the setup in which SR algorithms are designed, we are given multiple intermediate resolutions of the same image as well. The question remains how to best utilize such data to make the transformation learning problem inherent to SR more tractable and address the unique challenges that arises in this biomedical application. We propose a recurrent convolutional neural network model, to generate SR images from such multi-resolution WSI datasets. Specifically, we show that having such intermediate resolutions is highly effective in making the learning problem easily trainable and address large resolution difference in the low and high-resolution images common in WSI, even without the availability of a large size training data. Experimental results show state-of-the-art performance on three WSI histopathology cancer datasets, across a number of metrics.


Introduction
Many computational problems in medical imaging can be posed as a transformation learning problem, in that they receive some input image and transform it into an output image under domainspecific constraints. The image super-resolution (SR) problem is a typical problem in this category, where the goal is to reconstruct a high-resolution (HR) image given only a low-resolution (LR), typically degraded, image as input. Such problems are challenging to solve due to their highly ill-posed and underconstrained nature, since a large number of solutions exist for any given LR image, and the problem is especially magnified when the resolution ratio between the HR and LR images is large. Until recently, convolutional neural networks (CNN) have been the main driving tool for SR in computer vision applications, since they have the ability to learn highly nonlinear complex functions that constitute the mapping from the LR to the HR image. Several recent results have shown state-of-the-art results for the SR problem. [1][2][3] Since such CNN frameworks involve a large number of parameters, empirical evidence has shown that the corresponding models need to be trained on large datasets to show reproducible accuracy and avoid overfitting. This is not a problem for most applications in computer vision, where datasets in order of millions or larger (e.g., ImageNet, 4 TinyImages, 5 to name a few) are readily available. But for other application domains, particularly microscopic or medical imaging, such large sample sizes are hard to acquire, given that each image dataset has to be acquired individually, with significant human involvement. In this paper, we study an important SR application in the context of digital pathology and discuss how the limitations inherent to CNN-based SR methods can be addressed effectively in this context. This is described next.

Application Domain
The type of imaging modality we are interested in is called whole slide imaging (WSI) or virtual microscopy. WSI is a recent innovation in the field of digital pathology in which digital slide scanners are used to create images of entire histologic sections. Traditionally, the use of the optical capabilities of a microscope to "focus" the lens on a small subsection of the slide (based on the field of view of the device) to review and evaluate the specimen is often carried out by a trained pathologist. This process may need to be repeated for other sections of the slide depending on the scientific/clinical question of interest, toward obtaining consistently well-focused digital slides. WSI essentially automates this procedure for the whole slide. The ability to do so, automatically for a large number of slides, ideally capturing as much information as the pathologist may have been manually able to glean from the histological specimen via a light microscope, offers an immensely powerful tool with many immediate applications in clinical practice, research, and education. [6][7][8] For instance, WSI makes it feasible to solicit subspecialty disease expertise regardless of the location of the pathologist, integration of patient medical records in their health portfolio, pooling data between research institutions, and reducing long-term storage costs of histological specimens. However, given the relatively recent advent of WSI technology, there are several barriers that still need to be overcome, before it is widely accepted in clinical practice. The chief among these are the fact that HR WSI scanners, which have been shown to match images obtained from light microscopy in terms of quality for diagnostic capability, are typically very expensive, even for LR usage. In addition, the size of the files produced also generates a bottleneck. Typically, a virtual slide acquired at HR is about 1600 to 2000 megapixels, which results in a file size of several gigabytes. Such files are typically much larger that image files used by other clinical imaging specialties such as radiology. If one has to transport, share, or upload multiple such files or 3D stack, it results in a consequential increase of storage capacity and network bandwidth. Notwithstanding these issues, WSI offers tremendous potential and numerous advantages for pathologists, which is why it is important to find a way to alleviate the existing issues with WSI, such that it can be widely applicable. One potential way to address these issues is to use images from low magnification slide scanners, which are widely available, easy to use, comparatively inexpensive, and can also quickly produce images with smaller storage requirements. However, such LR images can increase the chance of misdiagnosis and false treatment if used as the primary source by a pathologist. For example, cancer grading normally requires identifying tumor cells based on size and morphology assessments, 9 which can be easily distorted in low magnification images. If such images were indeed to be used for research, clinical, or educational purposes, we need a way to convert such LR data and produce an output that closely resembles images acquired by a top-of-the-line WSI scanner, without substantial increase in storage and computational requirements.

Solution Strategies and Associated Challenges
One way to address the above issues is to dynamically improve the quality and resolution of LR images to render them comparable in quality to those acquired from HR scanners, as and when needed by the end user. Such a proposed workflow would need a fraction of the time and can yield near real-time quantifiably accurate results at a fraction of the setup cost of a standard WSI system. The implementation of such a system would require an SR approach that works well for WSI images. But still, we find that there is no off-the-shelf deep network architecture that can be used for our application directly. The reasons for this are numerous. First, most existing methods have been trained on databases of natural images. However, the WSI images under consideration do not have the same characteristics as natural images, such as, textures, edges, and other high-level exemplars, which are often leveraged by the SR algorithms. Second, such deep learning models are often trained using large training datasets (usually consisting of synthetic/resized HR-LR pairs), in the order of millions. This does not directly translate to our application for two reasons: (a) large training datasets are typically harder to acquire since each image pertains to a unique acquisition that requires significant human involvement, and (b) in our case, the LR images are acquired from a different scanner and is not just a resized version of HR image. Third and perhaps the most important limitation is that existing deep SR methods typically only reconstruct successfully up to a resolution increase factor of 4, whereas in case of WSI, the resolution (from a low-cost scanner to an expensive HR scanner) can increase up to a factor of 10×, since there can be wide variance in resolution between an LR scanner (4×) to HR scanners, which typically scan at 20× or 40×. We discuss this issue in detail in the following paragraph.

Our Contribution
Existing CNN-based methods have shown limited performance in scenarios when the resolution difference is high. The reason for this is that the complexity of the transformation that morphs the LR image to the HR one increases greatly in such situations, which in turn manifests in the CNN models taking longer to converge, or learning a function that generalizes poorly to unseen data. A typical way to address this issue is to make the CNN models deeper, by adding more layers (>5 layers) or increasing the number of examples required to learn a task to a given degree of accuracy while still keeping the network shallow. Both these directions pose challenges for our application: (a) a deeper network is associated with far more parameters, increasing the computational and memory footprint, to the extent that model may not be applicable in a real-time setup and (b) increasing the number of samples the extent needed would be impractical, due to associated time and monetary costs.
Our approach to solving this problem draws upon the inherent nature of the SR problem. While it is hard to acquire a large training dataset in this scenario, it is much more time and cost efficient to obtain the WSI images at different resolutions by varying the focus of the scanner. In this paper, we study how such multi-resolution data can be used effectively to make the transformation learning problem more tractable. Suppose I 1 and I h represent a particular LR and HR image pair. If I h is a significant resolution ratio higher than I 1 , learning their direct transformation function fðI 1 Þ ¼ I h can be challenging leading to a overparameterized system. But if we had access to some intermediate resolutions say I 2 ; : : : I h−1 (with a smaller resolution change between consecutive images), it makes intuitive sense that transformation that converts an image of a given resolution into the closest HR image would be roughly the same across all the resolutions considered, if we assume that resolution changes vary smoothly across the sequence. Having more image pairs ðI k−1 ; I k Þ for k ¼ 2: : : h to train, it may be computationally easier to learn a smooth functionf, such thatfðI k−1 Þ ≈ I k for all k. In this paper, we formalize this notion and develop a recurrent convolutional neural network (RCNN) [Note that the acronym RCNN is also used to refer to region-based CNNs, 10 but in the context of this paper, we use it to refer to recurrent convolutional neural network.] to learn from multi-resolution slide scanner images. Our main contributions are as follows. (1) We propose a new version of SR problem motivated from this problem, multi-resolution SR (MSR), where the aim is to learn the transformation function, given a sequence of resolutions, rather than simply the LR and HR images. To the best of our knowledge, this is new problem scenario for SR that has not been studied before. (2) We propose an RCNN model to solve the MSR problem.
(3) We demonstrate using experimental results on three WSI cell lines that the MSR idea can indeed reduce the need for large sample sizes and still learn the transformation that generalizes to unseen data.

Deep Network Models for SR
Stacked collaborative local autoencoders are used 11 to construct the LR image layer by layer. Reference 12 suggested a method for SR based on an extension of the predictive convolutional sparse coding framework. A multiple layer CNN, similar to our model, inspired by sparse-coding methods, is proposed in Refs. 1, 2, and 13. Chen and Pock 14 proposed to use multistage trainable nonlinear reaction diffusion as an alternative to CNN where the weights and the nonlinearity are trainable. Wang et al. 15 trained a cascaded sparse coding network from end to end inspired by learning iterative shrinkage and thresholding algorithm 16 to fully exploit the natural sparsity of images. Recently, Ref. 17 proposed a method for automated texture synthesis in reconstructed images by using a perceptual loss focusing on creating realistic textures. Several recent ideas have involved reducing the training complexity of the learning models using approaches, such as Laplacian pyramids, 18 removing unnecessary components of CNN, 19 and addressing the mutual dependencies of LR and HR images using deep back-projection networks. 20 In addition, generative adversarial networks (GAN) have also been used for the problem of single image SR, these include Refs. 21-24. Other deep network-based models for image SR problem include Refs. 25-28. We also briefly review SR approaches for sequence data such as videos. Most of the existing deep learning-based video SR methods using motion information inherent in video to generate a single HR output frame from multiple LR input frames. Kappeler et al. 29 warp video frames from the preceding and subsequent LR frames onto the current one using the optical flow method and pass them through a CNN that produces the output frame. Caballero et al. 30 followed the same approach but replaced the optical flow model with a trainable motion compensation network. Huang et al. 31 used a bidirectional recurrent architecture for video SR with shallow networks but do not use any explicit motion compensation in their model. Other notable works include Refs. 32 and 33.

Recurrent Neural Networks
A recurrent neural network (RNN) is a class of artificial neural network where connections between nodes form a directed graph along a sequence. This allows it to exhibit temporal dynamic behavior for a time sequence. Unlike feedforward neural networks such as CNNs, the input and outputs are not considered independent of each other, rather such models recompute the same/similar function for each element in the sequence, with the intermediate and final output of subsequent elements in the network being dependent on the previous computations on elements occurring earlier in the sequence. RNNs have most frequently been used in applications for language modeling, 34 speech recognition, 35 and machine translation. 36 But it can be applied to many learning tasks applied to sequence data, for more details, see survey paper by Ref. 37.

CNN Architectures for Small Sample Size Training
In this regard, Erhan et al. 38 devised unsupervised pretraining of deep architecture and showed that such weights of the network generalize better than randomly initialized weights. Mishkin and Matas 39 have proposed layer-sequential unit-variance that utilizes the orthonormal matrices to initialize the weights of CNN layer. Andén and Mallat 40 proposed scattering transform network (ScatNet), which is a CNN-like architecture where predefined Morlet filter bank is used to extract features. Other notable architectures in this regard include PCANet, 41 LDANet, 42 kernel PCANet, 43 MLDANet, 44 DLANet, 45 to name a few.

Deep Network Models in Microscopy
Since the application domain of this paper is in microscopy, we briefly review related papers that have used deep networks in this area. Most similar to our work is Rivenson et al., 46 who showed how to enhance the spatial resolution of LR optical microscopy images over a large field of view and depth of field. But unlike our model, this framework is meant for single-image SR, where the model is trained on pairs of HR and LR images and provided a single LR image at test time. The design of this model includes a single CNN architecture, though they show that feeding the output of the CNN, back to the input, can improve results further. Wang et al. 47 proposed a GAN-based approach for super-resolving confocal microscopy images to match the resolution acquired with a stimulated emission depletion microscopes. Grant-Jacob et al. 48 designed a neural lens model based on a network of neural networks, which is used to transform optical microscope images into a resolution comparable to a scanning electron microscope image. Sinha et al. 49 used deep networks to recover phase objects given their propagated intensity diffraction patterns. Other related methods that have used deep learning-based reconstruction approaches for various applications in microscopy include Nehme et al., 50 Wu et al., 51 and Nguyen et al. 52 A more detailed survey of deep learning methods in microscopy can be found in Ref. 53.

Main Model
Here, we discuss our main model for obtaining HR images from LR counterparts. First, we briefly outline the problem setting.
Let H and L denote the HR and LR image sets, respectively. In addition, we use two more intermediate resolutions of all images, we call these sets I 1 and I 2 , respectively. For notational ease, we refer to L as I 0 and H as I 3 , respectively. These image sets can be ordered in terms of increasing resolution, that is, to image size. For training/learning, we assume image to image correspondence among these four sets are known. For any pair of images ðI j ; I jþ1 Þ, we need to learn the transformation f j that maps I j to the corresponding higher resolution image I jþ1 . This can be done using a CNN architecture, with a number of intermediate convolutional layers, which we discuss shortly. The CNN pipeline then needs to be replicated for each of the three pairs ðI 0 ; I 1 Þ, ðI 1 ; I 2 Þ, and ðI 2 ; I 3 Þ. However, the main premise of this work is that each CNN subarchitecture can be informed by outputs of other CNN subarchitectures, since they are implementing similar functions. To do this, we propose an RCNN that uses three CNN subarchitectures to map each LR images to next HR one. These three CNN subnetworks are then interconnected in a recurrent manner. Furthermore, we impose that the CNN pipelines share similar weights, to ensure that function learned for each pair of images is roughly the same. We describe the details of our model next. First, we discuss the components of the CNN architecture in Sec. 1

CNN Subnetwork
Here, we describe the basic structure of each CNN subnetwork and its constituent layers briefly. A more detailed description can be found in Sec. 7. Note that the components/layers of each CNN pipeline is kept the same, see Fig. 1. The first layer is a feature extraction-mapping layer that extracts features from LR images (denoted by Y j 1 for the j'th pipeline), which are then served as input to the next layers. This is followed by three convolutional layers. We briefly elaborate on this, since they are useful to understand the RCNN terminology in the next section.

Convolutional Layers
The feature extraction layer is followed by three additional convolutional layers. We also refer to these as hidden layers, denoted by H j i , which is the i'th hidden layer (i ∈ f2;3; 4g) in the j'th CNN pipeline. The input to this layer is referred to as Y j i−1 , and the output is denoted by Y j i . The filter functions in these intermediate layers can be written as E Q -T A R G E T ; t e m p : i n t r a l i n k -; s e c 3 . 2 ; 6 3 ; 3 6 8 where θ j i and b j i represent the weights and biases of each layer, respectively. Each of the weights θ j i is composed of n i filters of size n i−1 × f i × f i . n 2 is set at 32 and n i ¼ n i−1 2 for i ∈ 3;4. This progressive reduction in the number of filters leads to computational benefits as observed in numerical experiments. The filter sizes f i are set to f3;2; 1g for each of the three layers, respectively, similar to hierarchical CNNs.
The last layer of the CNN architecture consists of a subpixel layer that upscales the LR feature maps to the size of the HR image.

Recurrent Convolution Network
The recurrent convolution network is built by interconnecting the hidden units of the CNN subarchitectures, see Fig. 2. We index the CNN pipeline components with superscript j ∈ 0;1; 2 with the j'th pipeline being given image I j as input and reconstructing the image I jþ1 . The basic premise of our RCNN model is that the hidden units processing the each of the images can be informed by the outputs of the hidden units in other CNN pipelines. We use one directional dependence (low to high) as it is more challenging to reconstruct HR images compared to lower resolution ones. We can introduce bidirectional dependencies as well, but in practice, this increases the number of parameters substantially and contributes to an increase in training time for the model.
Besides the feedforward connections already discussed as a part of the CNN subarchitectures, we introduce two additional connections to encode the dependencies among the various hidden units, see Fig. 2. These are as follows.

Recurrent convolutional connection
The first type of connection, called recurrent convolutions, is denoted by red lines and aims to model the dependency across images of different resolutions at the same hidden layer i. These convolutions connect adjacent hidden layers of successive images (ordered by resolutions), that is, the current hidden layer H j i is conditioned on the feature maps learned from the hidden layer at the previous LR image H j−1 i .

Prelayer convolutional connections
The second type of connections, called prelayer convolutions, is denoted by blue lines. This is used to model the dependency of a given hidden layer of the current image H j i on the hidden layer at the immediate previous layer corresponding to LR image H j−1 i−1 . This endows the hidden layer with not only the output of the previous layer but also information about how a lower resolution image has evolved in the previous layer.
Since the image sizes differ at each CNN pipeline, when implementing the dependence, we resize the higher-order images to match the size of images processed in the current pipeline. This resizing can be denoted by a function ηð:Þ.
Note that the first CNN pipeline (j ¼ 0), which processes I 0 as input, contains only feedforward connections, hence is identical to the network in Fig. 1. We incorporate the three types of connections (feedforward, recurrent, and prelayer) in the next two CNN pipelines (j ∈ 1;2). Let the output of hidden layer H j i be denoted by Y j i . Then, we can rewrite functions learned and the outputs at the hidden layers of the CNN pipelines j ∈ 1;2 as follows. We start with the functions learned at first hidden layer (i ¼ 2), which can be written as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 1 ; 3 2 6 ; 1 5 1 For the subsequent hidden layers (i ∈ 3;4), the function can be written as The variables θ j i andθ j i represent the weights of the recurrent and prelayer connections, respectively, whereas b j i represents the biases at the i'th layer of the j'th pipeline. Note that the ηð:Þ may be replaced by the subpixel layer, but this contributes to an increase in the training time. Therefore, we implemented the ηð:Þ as a simple bicubic interpolation.

Training and Loss Function
The complete architecture of our network can be seen in Fig. 3. The output from Eq. (2) (for pipelines j ¼ 1 and j ¼ 2) is then passed on as an input to the subpixel layer [described in Eq. (4)], which outputs the desired prediction (let this be denoted by R j ). For the pipeline (j ¼ 0), the prediction is simply R 0 ¼ Y 0 5 . This network is learned by minimizing a weighted function of mean square error (MSE) and structured similarity metric (SSIM) between the predicted HR and the ground truth at each pipeline E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 3 ; 6 3 ; 1 7 1 OðH; RÞ ¼ where MSEðI jþ1 ; R j Þ ¼ kI jþ1 − R j k 2 and the structured similarity objective is defined as SSIMðI jþ1 ; R j Þ ¼ LðI jþ1 ; R j Þ α CðI jþ1 ; R j Þ β SðI jþ1 ; R j Þ γ , where LðI jþ1 ; R j Þ is the luminance-based comparison, CðI jþ1 ; R j Þ is a measure of contrast difference, and SðI jþ1 ; R j Þ is the measure of structural differences between the two images I jþ1 and R j . α, β, and γ are kept constant and ρ is set to .75. The objective/loss function in Eq. (3) is optimized using stochastic gradient descent. During optimization, all the filter weights of recurrent and prelayer convolutions are initialized by randomly sampling from a Gaussian distribution with mean 0 and standard deviation 0.001, whereas the filter weights of feedforward convolution are set to 0. Note that one can also initialize these weights by pretraining the CNN pipeline on a small sized dataset. We experimentally find that using a smaller learning rate (1e − 5) for the weights of the filters is crucial to obtain good performance.

Datasets
We performed experiments to evaluate our RCNN approach on three previously published tissue microarray (TMA) cancer datasets, a breast TMA dataset consisting of 60 images, 54 and a kidney TMA dataset with 381 images, 55 and a pancreatic TMA dataset with 180 images. 56 The datasets are summarized in Table 1.

Imaging Systems
In the datasets we analyze, highest resolution images were acquired and digitalized at 40× using an Aperio CS2 digital pathology scanner (Leica Biosystems), 57 with 4 pixels∕μm, and

Evaluations
We evaluate various aspects of our RCNN model to determine the efficacy of our method. First, we evaluate the quality of reconstruction of our RCNN model compared to a single CNN pipeline (which is a baseline for our model) and other comparable methods for SR. We show both qualitative and quantitative results in this regard. Second, we study the advantage of having intermediate resolutions, by varying the number of resolutions available. We also analyze how useful our obtained reconstructions are toward end-user applications, such as segmentation and perform a user study, done by a pathologist to evaluate the utility of the reconstructed images for cancer grading. In addition, we study how the reconstruction accuracy is affected as a function of the training set size. Finally, we discuss the parameters used and the computational time requirements of our model. We discuss these issues next.
Experimental setup. Note that our model was trained by providing three sets of images (of three different resolutions) as input. However, our testing experiments can be done in the following two settings: (a) in the first case, we provide the images of the same three resolutions I 0 , I 1 , and I 2 as input to the trained model, which then outputs the reconstructed highest resolution image I 3 . We call this setup RCNN(full). (b) In the second case, we see how our model behaves given only the lowest resolution image I 0 . In this case, we first generate the two intermediate resolutions as follows. First, we only activate pipeline j ¼ 0, which outputs I 1 . Then, use this as input and activate both pipelines j ¼ 0 and j ¼ 1, which then reconstructs I 2 as well. Using I 0 and the reconstructed I 1 and I 2 as input, we activate all three pipelines and reconstruct I 3 . We call this setup RCNN(1 input). Comparable methods. To the best of our knowledge, we do not know of any other neural network framework to study the MSR problem presented in this paper. Therefore, to compare our method to other state-of-art methods, we choose other SR approaches that work in the two resolution setting (low and high). Our default baseline is the CNN architecture shown in Fig. 1. We refer to this method as CNN. In addition, we also compare with the following methods: (i) the CNN-based framework (FSRCNN) by Dong et al., 13

Segmentation results
Pathological diagnosis largely depends on nuclei localization and shape analysis. We used a simple color segmentation method to segment the nuclei using K-means clustering to segment the image into four different classes based on pixel values in Lab color space. 61 Following this, we use the Hadamard product of each class with the gray level image of the original brightfield image, computed average of pixel intensities in each class, and assigned the lowest value to the cell nuclei. To evaluate our results, we compare the segmentation of the reconstructed images with the results from HR images (ground truth) for 20 samples from each group by computing the misclassification error, which calculates the percentage of pixels misclassified. Results show that number of pixels misclassified from images generated using our RCNN(full) method generates segmentation masks with lower number of pixels misclassified, followed by RCNN(1 input) ( Table 6).

Grading user study by pathologists
Pathological assessment of tissue samples is usually considered the gold standard that requires large magnification for microscopic assessment or HR images. For example, in different types of cancer patient prognosis and treatment plans are predicated on cancer grade and stage. 62 Tumor grade is based on pathologic (microscopic) examination of tissue samples, conducted by trained pathologists. Specifically, it involves assessment of the degree of malignant epithelial differentiation, or percentage of gland-forming epithelial elements, and does not take into consideration characteristics of the stroma surrounding the cells. [63][64][65][66][67] Accuracy of pathological assessment has a vital importance in clinical workflow since the downstream treatment plans mainly rely on that. Lack of interobserver and intraobserver agreement Journal of Biomedical Optics 126003-8 December 2019 • Vol. 24 (12) in pathological tissue assessments is still of major concern, which has been reported for many diseases including pancreatic cancer, 68 intraductal carcinoma of the breast, 69 malignant non-Hodgkin's lymphoma, 70 and soft tissue sarcomas, 71 among others. SR algorithms can mitigate this effect by making second opinion and collaborative diagnosis easily accessible. However, this is achievable if able to reconstruct fine morphological details of the tissue image. For this project, we used reconstructed images of 35 TMA cores randomly selected from a TMA slide (PA 2072, US Biomax), which was graded on HR images more than a year ago and was now graded on the reconstructed images by our collaborator pathologist. These were from different grades of cancer and normal tissue. Grading for 22 TMA cores matched the previous grading by same Journal of Biomedical Optics 126003-9 December 2019 • Vol. 24 (12) pathologist. In general, the overall structure of the pancreatic tissue was reconstructed good enough so that normal and grade one was easier to identify based on overall gland shapes and in case of grade one cancer the stroma surrounding the gland was identifiable too. However, it was observed that differentiation between grade 2 and 3 was more difficult since it requires visualization of infiltrating individual tumor cells. Grading results specific to individual cores is provided in Sec. 8.

Effect of training set size
The necessary size of the training set for a particular learning task is generally hard to estimate and depends mostly on the hardness of the learning problem as well as on the type of model being trained. Here, we provide empirical evidence of how our model behaves wrt to increasing the size of the training set. For this purpose, we use the Kidney dataset, since it is the largest Journal of Biomedical Optics 126003-10 December 2019 • Vol. 24 (12) dataset considered in this paper. We vary the training size from f50;100;150;200;300g. The test set is set to 50 in each case. We analyze the quality of reconstruction by comparing the SNR in each case. As seen in Fig. 8, the SNR improves only slightly when the training set is increased, indicating that a higher training size does not significantly improve the image reconstruction metrics.

Parameters and running time
We implemented our model in TensorFlow using Python, which has inbuilt GPU utilization capability. We used a workstation with an AMD processor 6378 with a 2.4 GHz CPU, 48 GB RAM and NVIDIA GPU 1070 TI graphics card. All our experiments have been performed using GPU, which shows significant performance gains compared to CPU runtime. The training time of our models depends on various factors such as dataset volume, learning rate, batch size and number of training epochs. To report running times for training, we fix learning rate to 10 −5 , dataset volume to 300 images, batch size to 2 and number of training epochs to 10 5 . The training time of our model is approximately 20.9 hours. The time to generate a new HR image at 2048 × 2048 once the network is trained takes 1.4 minutes. The test-time speed of our model can be further accelerated by approximating or simplifying the trained networks with possible slight degradation in performance.

Future Directions
This paper provides an innovative approach to utilize multiresolution images to generate a high quality reconstruction of slide scanner images. In addition, this work also leads to several interesting ideas that we will pursue as future work. We discuss these briefly next.
1. We will study a variation of this model that is inspired by mixed effects models popular in statistics. Here the fixed effects will be modeled as function of other higher resolution inputs and the random effects are modelled as residual connection on the LR input in each pipeline. This leads to a Residual variation of the Recurrent Convolutional Network. We will study this network in detail and analyze its computational properties.
2. One of our future goal is also aimed at making our RCNN model scalable to large datasets and higher resolution ratios. In order to do so, we need a way of speeding up the RCNN model to produce HR images at a reduced computational and memory footprint. To do this, we will adopt recent developments in deep learning that show that one can substantially improve the running time of deep CNNs by approximating by linear filters and other related ideas. 72

Conclusion
This paper studies a new setup for the SR problem, where the input is multi-resolution slide scanner images, and the goal is to utilize these to make the learning problem associated with SR more tractable, so that it scales to an HR change in the target reconstruction. We propose a RCNN for this purpose.
Results show that this model performs favorably when compared to state-of-the-art approaches for SR. This provides a key contribution toward the overarching goal of making use of LR scanners over their HR counterparts, which opens up new opportunities in histopathology research and clinical application.

Appendix A: CNN Subnetwork
Here, we describe the basic structure of each CNN subnetwork and its constituent layers. The components/layers of each CNN pipeline are kept the same. The basic CNN architecture is similar to the model described in our earlier paper, 73 except it does not involve the nearest-neighbor-based enhancement, which was used to ensure that reconstructed image retains the finer details of the original HR image, which are otherwise lost using a CNN framework for learning the transformation. We avoid this step since it is also computationally expensive to search a large database of patches for nearest neighbors, especially for the output sizes of HR images at 2048 × 2048 we want to reconstruct. We also replace the ReLU function at the output of each convolutional layer with a leaky ReLU, which is known to have better performance in practice. We describe the layer of the CNN architecture next, see Fig. 1.

Feature Extraction-Mapping Layer
This layer is used as the first step to extract features from the LR input images (I j ) for the j'th CNN subarchitecture. The feature extraction process can be framed as a convolution operation and hence implemented as a single layer of the CNN. This can be expressed as E Q -T A R G E T ; t e m p : i n t r a l i n k -; s e c 7 . 1 ; 3 2 6 ; 5 5 7Ŷ where I j is the image of a given resolution and θ j 1 and b j 1 represent the weights and biases of the first layer of this CNN pipeline, respectively. The weights are composed of n 1 ¼ 64 convolutions on each image patch, with each convolution filter being of size 2 × 2. Therefore, this layer has 64 filters, each of size 2 × 2. The bias vector is of size b j i ∈ R n 1 . We keep filter sizes small at this level, so as it extracts more fine grained features from each patch. The σðxÞ function implements a leaky ReLU function, which can be written as σðxÞ ¼ 1ðx < 0ÞðαxÞ þ 1ðx >¼ 0ÞðxÞ, where α is a small constant. This is followed by a sum pooling layer, to obtain a weighted sum pool of features across various feature-maps of the previous layer. The output of this layer is referred to as Y j 1 .

Convolutional Layers
The feature extraction layer is followed by three additional convolutional layers. We also refer to these as hidden layers, denoted by H j i , which is the i'th hidden layer (i ∈ f2;3; 4g) in the j'th CNN pipeline. The input to this layer is referred to as Y j i−1 and the output is denoted by Y j i . The filter functions in these intermediate layers can be written as E Q -T A R G E T ; t e m p : i n t r a l i n k -; s e c 7 . 2 ; 3 2 6 ; 2 8 1 where θ j i and b j i represent the weights and biases of each layer, respectively. Each of the weights θ j i is composed of n i filters of size n i−1 × f i × f i . n 2 is set at 32 and n i ¼ n i−1 2 for i ∈ 3;4. This progressive reduction in the number of filters leads to computational benefits as observed in numerical experiments.
The filter sizes f i are set to f3;2; 1g for each of the three layers, respectively, similar to hierarchical CNNs.

Subpixel Layer
In our CNN architecture, we leave the final upscaling of the learned LR feature maps to match the size of the HR image, to be done at the last layer of the CNN. This is implemented as a subpixel layer similar to Ref. 74. The advantage of this is that is that layers prior to the last layer operate on the reduced LR image rather than HR size, which reduce the computational and memory complexity substantially. The upscaling of the LR image to the size of the HR image is implemented as a convolution with a filter θ j sub whose stride is 1 r (r is the resolution ratio between the HR and LR images). The size of the filter is denoted as f sub . A convolution with stride of 1 r in the LR space with a filter θ j sub (weight spacing 1 r ) would activate different parts of θ j sub for the convolution. The patterns are activated at periodic intervals of modða; rÞ and modðb; rÞ where a, b are the pixel position in HR space. This can be implemented as a filter θ j 5 , whose size is n 4 × r 2 × f 5 × f 5 , given that f 5 ¼ f sub r and modðf sub ; rÞ ¼ 0. This can be written as E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 4 ; 6 3 ; 2 2 1 Y j 5 ¼ γðθ j 5 × Y j 4 þ b j 5 Þ j ∈ 0;1; 2; where γ is periodic shuffling operator that rearranges r 2 channels of the output to the size of the HR image.

Appendix B: Grading of Cancer TMAs
Here, we provide the individual grading results on the subset of the pancreatic TMA cores graded by the pathologist. Note that these results are representative of the results on a typical reconstruction. For comparison purposes, our patholoist also graded the LR images of the same TMAs. The results can be seen in Table 7. The first column refers to the identifier for each cell.
To summarize the results, the overall structure of the pancreatic tissue was reconstructed well enough so that normal and grade one was easier to identify based on overall gland shapes. Specifically, in case of grade one cancer, the stroma surrounding the gland was identifiable too. However, it was observed that differentiation between grade 2 and 3 was more difficult since it requires visualization of infiltrating individual tumor cells.

Disclosures
The authors have no relevant financial interests in this article and no potential conflicts of interest to disclose. There were some irregular gland-like spaces at 11:00 and 4:00. Pathologist suspected that there are higher-grade malignant cells infiltrating through the stroma, but could not resolve what is a vessel, stromal cell, inflammatory cell, or malignant cell.

D5
This picture gave better definition of high-grade cancer infiltrating the stroma, but pathologist still could not call G2 from G3 According to pathologist, gland-like spaces were present near the center of the image and at 1 to 2:00. She could not resolve the dark spots in the stroma, whether these strips of high-grade cancer, inflammatory cells, or capillaries.

D6
The top of the image contained G1 glands, and there was a large complex conglomeration near the center of the field that the pathologist called G2. Toward the bottom of the field, dispersed nuclei was concerning for isolated cells or clusters of cells (G3).
The gland outlines were irregular enough that pathologist called this G2 cancer, but she could not tell if there was also G3 in the background. She did not want to make a diagnosis of cancer off this slide, since the image does not show any nuclear or architectural detail (proximity of the glands to nerves, arteries, and remnant acinar tissue).

D7
A cluster of G2 was present in the bottom half of the field. The pathologist thought the top was necrotic.
She also called this as G2 cancer, but could not tell if there is G3 cancer present, as well. There was not enough nucelar definition to be able to tell the degree of nuclear atypia.

D9
G3. The nuclei at the very middle of the field was bizarre enough to identify as high-grade carcinoma, and clusters of infiltrating glands contain enough of the same nuclei to identify the cells as epithelial (as opposed to lymphocytes sitting in stroma) Also G2 cancer. Necrosis could be seen in very irregularly shaped glands. The stroma could not be resolved at all, there may have been single cells in the background but she could not see them.
E3 and E9 These cancers were heavily infiltrated by lymphocytes, making identification of the malignant glands extremely difficult, even on tissue sections. According to pathologist, they were probably G2 tumor.
Irregularly shaped glands were present throughout the core. The background could not be resolved.