We study a problem scenario of super-resolution (SR) algorithms in the context of whole slide imaging (WSI), a popular imaging modality in digital pathology. Instead of just one pair of high- and low-resolution images, which is typically the setup in which SR algorithms are designed, we are given multiple intermediate resolutions of the same image as well. The question remains how to best utilize such data to make the transformation learning problem inherent to SR more tractable and address the unique challenges that arises in this biomedical application. We propose a recurrent convolutional neural network model, to generate SR images from such multi-resolution WSI datasets. Specifically, we show that having such intermediate resolutions is highly effective in making the learning problem easily trainable and address large resolution difference in the low and high-resolution images common in WSI, even without the availability of a large size training data. Experimental results show state-of-the-art performance on three WSI histopathology cancer datasets, across a number of metrics.
Many computational problems in medical imaging can be posed as a transformation learning problem, in that they receive some input image and transform it into an output image under domain-specific constraints. The image super-resolution (SR) problem is a typical problem in this category, where the goal is to reconstruct a high-resolution (HR) image given only a low-resolution (LR), typically degraded, image as input. Such problems are challenging to solve due to their highly ill-posed and underconstrained nature, since a large number of solutions exist for any given LR image, and the problem is especially magnified when the resolution ratio between the HR and LR images is large. Until recently, convolutional neural networks (CNN) have been the main driving tool for SR in computer vision applications, since they have the ability to learn highly nonlinear complex functions that constitute the mapping from the LR to the HR image. Several recent results have shown state-of-the-art results for the SR problem.1–3 Since such CNN frameworks involve a large number of parameters, empirical evidence has shown that the corresponding models need to be trained on large datasets to show reproducible accuracy and avoid overfitting. This is not a problem for most applications in computer vision, where datasets in order of millions or larger (e.g., ImageNet,4 TinyImages,5 to name a few) are readily available. But for other application domains, particularly microscopic or medical imaging, such large sample sizes are hard to acquire, given that each image dataset has to be acquired individually, with significant human involvement. In this paper, we study an important SR application in the context of digital pathology and discuss how the limitations inherent to CNN-based SR methods can be addressed effectively in this context. This is described next.
The type of imaging modality we are interested in is called whole slide imaging (WSI) or virtual microscopy. WSI is a recent innovation in the field of digital pathology in which digital slide scanners are used to create images of entire histologic sections. Traditionally, the use of the optical capabilities of a microscope to “focus” the lens on a small subsection of the slide (based on the field of view of the device) to review and evaluate the specimen is often carried out by a trained pathologist. This process may need to be repeated for other sections of the slide depending on the scientific/clinical question of interest, toward obtaining consistently well-focused digital slides. WSI essentially automates this procedure for the whole slide. The ability to do so, automatically for a large number of slides, ideally capturing as much information as the pathologist may have been manually able to glean from the histological specimen via a light microscope, offers an immensely powerful tool with many immediate applications in clinical practice, research, and education.6–8 For instance, WSI makes it feasible to solicit subspecialty disease expertise regardless of the location of the pathologist, integration of patient medical records in their health portfolio, pooling data between research institutions, and reducing long-term storage costs of histological specimens. However, given the relatively recent advent of WSI technology, there are several barriers that still need to be overcome, before it is widely accepted in clinical practice. The chief among these are the fact that HR WSI scanners, which have been shown to match images obtained from light microscopy in terms of quality for diagnostic capability, are typically very expensive, even for LR usage. In addition, the size of the files produced also generates a bottleneck. Typically, a virtual slide acquired at HR is about 1600 to 2000 megapixels, which results in a file size of several gigabytes. Such files are typically much larger that image files used by other clinical imaging specialties such as radiology. If one has to transport, share, or upload multiple such files or 3D stack, it results in a consequential increase of storage capacity and network bandwidth. Notwithstanding these issues, WSI offers tremendous potential and numerous advantages for pathologists, which is why it is important to find a way to alleviate the existing issues with WSI, such that it can be widely applicable. One potential way to address these issues is to use images from low magnification slide scanners, which are widely available, easy to use, comparatively inexpensive, and can also quickly produce images with smaller storage requirements. However, such LR images can increase the chance of misdiagnosis and false treatment if used as the primary source by a pathologist. For example, cancer grading normally requires identifying tumor cells based on size and morphology assessments,9 which can be easily distorted in low magnification images. If such images were indeed to be used for research, clinical, or educational purposes, we need a way to convert such LR data and produce an output that closely resembles images acquired by a top-of-the-line WSI scanner, without substantial increase in storage and computational requirements.
Solution Strategies and Associated Challenges
One way to address the above issues is to dynamically improve the quality and resolution of LR images to render them comparable in quality to those acquired from HR scanners, as and when needed by the end user. Such a proposed workflow would need a fraction of the time and can yield near real-time quantifiably accurate results at a fraction of the setup cost of a standard WSI system. The implementation of such a system would require an SR approach that works well for WSI images. But still, we find that there is no off-the-shelf deep network architecture that can be used for our application directly. The reasons for this are numerous. First, most existing methods have been trained on databases of natural images. However, the WSI images under consideration do not have the same characteristics as natural images, such as, textures, edges, and other high-level exemplars, which are often leveraged by the SR algorithms. Second, such deep learning models are often trained using large training datasets (usually consisting of synthetic/resized HR–LR pairs), in the order of millions. This does not directly translate to our application for two reasons: (a) large training datasets are typically harder to acquire since each image pertains to a unique acquisition that requires significant human involvement, and (b) in our case, the LR images are acquired from a different scanner and is not just a resized version of HR image. Third and perhaps the most important limitation is that existing deep SR methods typically only reconstruct successfully up to a resolution increase factor of 4, whereas in case of WSI, the resolution (from a low-cost scanner to an expensive HR scanner) can increase up to a factor of , since there can be wide variance in resolution between an LR scanner () to HR scanners, which typically scan at or . We discuss this issue in detail in the following paragraph.
Existing CNN-based methods have shown limited performance in scenarios when the resolution difference is high. The reason for this is that the complexity of the transformation that morphs the LR image to the HR one increases greatly in such situations, which in turn manifests in the CNN models taking longer to converge, or learning a function that generalizes poorly to unseen data. A typical way to address this issue is to make the CNN models deeper, by adding more layers ( layers) or increasing the number of examples required to learn a task to a given degree of accuracy while still keeping the network shallow. Both these directions pose challenges for our application: (a) a deeper network is associated with far more parameters, increasing the computational and memory footprint, to the extent that model may not be applicable in a real-time setup and (b) increasing the number of samples the extent needed would be impractical, due to associated time and monetary costs.
Our approach to solving this problem draws upon the inherent nature of the SR problem. While it is hard to acquire a large training dataset in this scenario, it is much more time and cost efficient to obtain the WSI images at different resolutions by varying the focus of the scanner. In this paper, we study how such multi-resolution data can be used effectively to make the transformation learning problem more tractable. Suppose and represent a particular LR and HR image pair. If is a significant resolution ratio higher than , learning their direct transformation function can be challenging leading to a overparameterized system. But if we had access to some intermediate resolutions say (with a smaller resolution change between consecutive images), it makes intuitive sense that transformation that converts an image of a given resolution into the closest HR image would be roughly the same across all the resolutions considered, if we assume that resolution changes vary smoothly across the sequence. Having more image pairs for to train, it may be computationally easier to learn a smooth function , such that for all . In this paper, we formalize this notion and develop a recurrent convolutional neural network (RCNN) [Note that the acronym RCNN is also used to refer to region-based CNNs,10 but in the context of this paper, we use it to refer to recurrent convolutional neural network.] to learn from multi-resolution slide scanner images. Our main contributions are as follows. (1) We propose a new version of SR problem motivated from this problem, multi-resolution SR (MSR), where the aim is to learn the transformation function, given a sequence of resolutions, rather than simply the LR and HR images. To the best of our knowledge, this is new problem scenario for SR that has not been studied before. (2) We propose an RCNN model to solve the MSR problem. (3) We demonstrate using experimental results on three WSI cell lines that the MSR idea can indeed reduce the need for large sample sizes and still learn the transformation that generalizes to unseen data.
We summarize the current literature on three main aspects, pertaining to our model in this section: (a) deep network models for SR, (b) recurrent neural networks, and (c) CNN architecture for small sample size training. We discuss them briefly next.
Deep Network Models for SR
Stacked collaborative local autoencoders are used11 to construct the LR image layer by layer. Reference 12 suggested a method for SR based on an extension of the predictive convolutional sparse coding framework. A multiple layer CNN, similar to our model, inspired by sparse-coding methods, is proposed in Refs. 1, 2, and 13. Chen and Pock14 proposed to use multistage trainable nonlinear reaction diffusion as an alternative to CNN where the weights and the nonlinearity are trainable. Wang et al.15 trained a cascaded sparse coding network from end to end inspired by learning iterative shrinkage and thresholding algorithm16 to fully exploit the natural sparsity of images. Recently, Ref. 17 proposed a method for automated texture synthesis in reconstructed images by using a perceptual loss focusing on creating realistic textures. Several recent ideas have involved reducing the training complexity of the learning models using approaches, such as Laplacian pyramids,18 removing unnecessary components of CNN,19 and addressing the mutual dependencies of LR and HR images using deep back-projection networks.20 In addition, generative adversarial networks (GAN) have also been used for the problem of single image SR, these include Refs. 2122.23.–24. Other deep network-based models for image SR problem include Refs. 2526.27.–28. We also briefly review SR approaches for sequence data such as videos. Most of the existing deep learning-based video SR methods using motion information inherent in video to generate a single HR output frame from multiple LR input frames. Kappeler et al.29 warp video frames from the preceding and subsequent LR frames onto the current one using the optical flow method and pass them through a CNN that produces the output frame. Caballero et al.30 followed the same approach but replaced the optical flow model with a trainable motion compensation network. Huang et al.31 used a bidirectional recurrent architecture for video SR with shallow networks but do not use any explicit motion compensation in their model. Other notable works include Refs. 32 and 33.
Recurrent Neural Networks
A recurrent neural network (RNN) is a class of artificial neural network where connections between nodes form a directed graph along a sequence. This allows it to exhibit temporal dynamic behavior for a time sequence. Unlike feedforward neural networks such as CNNs, the input and outputs are not considered independent of each other, rather such models recompute the same/similar function for each element in the sequence, with the intermediate and final output of subsequent elements in the network being dependent on the previous computations on elements occurring earlier in the sequence. RNNs have most frequently been used in applications for language modeling,34 speech recognition,35 and machine translation.36 But it can be applied to many learning tasks applied to sequence data, for more details, see survey paper by Ref. 37.
CNN Architectures for Small Sample Size Training
In this regard, Erhan et al.38 devised unsupervised pretraining of deep architecture and showed that such weights of the network generalize better than randomly initialized weights. Mishkin and Matas39 have proposed layer-sequential unit-variance that utilizes the orthonormal matrices to initialize the weights of CNN layer. Andén and Mallat40 proposed scattering transform network (ScatNet), which is a CNN-like architecture where predefined Morlet filter bank is used to extract features. Other notable architectures in this regard include PCANet,41 LDANet,42 kernel PCANet,43 MLDANet,44 DLANet,45 to name a few.
Deep Network Models in Microscopy
Since the application domain of this paper is in microscopy, we briefly review related papers that have used deep networks in this area. Most similar to our work is Rivenson et al.,46 who showed how to enhance the spatial resolution of LR optical microscopy images over a large field of view and depth of field. But unlike our model, this framework is meant for single-image SR, where the model is trained on pairs of HR and LR images and provided a single LR image at test time. The design of this model includes a single CNN architecture, though they show that feeding the output of the CNN, back to the input, can improve results further. Wang et al.47 proposed a GAN-based approach for super-resolving confocal microscopy images to match the resolution acquired with a stimulated emission depletion microscopes. Grant-Jacob et al.48 designed a neural lens model based on a network of neural networks, which is used to transform optical microscope images into a resolution comparable to a scanning electron microscope image. Sinha et al.49 used deep networks to recover phase objects given their propagated intensity diffraction patterns. Other related methods that have used deep learning-based reconstruction approaches for various applications in microscopy include Nehme et al.,50 Wu et al.,51 and Nguyen et al.52 A more detailed survey of deep learning methods in microscopy can be found in Ref. 53.
Here, we discuss our main model for obtaining HR images from LR counterparts. First, we briefly outline the problem setting. Let and denote the HR and LR image sets, respectively. In addition, we use two more intermediate resolutions of all images, we call these sets and , respectively. For notational ease, we refer to as and as , respectively. These image sets can be ordered in terms of increasing resolution, that is, w.r.t. to image size. For training/learning, we assume image to image correspondence among these four sets are known.
For any pair of images , we need to learn the transformation that maps to the corresponding higher resolution image . This can be done using a CNN architecture, with a number of intermediate convolutional layers, which we discuss shortly. The CNN pipeline then needs to be replicated for each of the three pairs , , and . However, the main premise of this work is that each CNN subarchitecture can be informed by outputs of other CNN subarchitectures, since they are implementing similar functions. To do this, we propose an RCNN that uses three CNN subarchitectures to map each LR images to next HR one. These three CNN subnetworks are then interconnected in a recurrent manner. Furthermore, we impose that the CNN pipelines share similar weights, to ensure that function learned for each pair of images is roughly the same. We describe the details of our model next. First, we discuss the components of the CNN architecture in Sec. 1, followed by the motivation and design choices for the RCNN framework in Sec. 2.
Here, we describe the basic structure of each CNN subnetwork and its constituent layers briefly. A more detailed description can be found in Sec. 7. Note that the components/layers of each CNN pipeline is kept the same, see Fig. 1. The first layer is a feature extraction-mapping layer that extracts features from LR images (denoted by for the ’th pipeline), which are then served as input to the next layers. This is followed by three convolutional layers. We briefly elaborate on this, since they are useful to understand the RCNN terminology in the next section.
The feature extraction layer is followed by three additional convolutional layers. We also refer to these as hidden layers, denoted by , which is the ’th hidden layer () in the ’th CNN pipeline. The input to this layer is referred to as , and the output is denoted by . The filter functions in these intermediate layers can be written as
The last layer of the CNN architecture consists of a subpixel layer that upscales the LR feature maps to the size of the HR image.
Recurrent Convolution Network
The recurrent convolution network is built by interconnecting the hidden units of the CNN subarchitectures, see Fig. 2. We index the CNN pipeline components with superscript with the ’th pipeline being given image as input and reconstructing the image . The basic premise of our RCNN model is that the hidden units processing the each of the images can be informed by the outputs of the hidden units in other CNN pipelines. We use one directional dependence (low to high) as it is more challenging to reconstruct HR images compared to lower resolution ones. We can introduce bidirectional dependencies as well, but in practice, this increases the number of parameters substantially and contributes to an increase in training time for the model.
Besides the feedforward connections already discussed as a part of the CNN subarchitectures, we introduce two additional connections to encode the dependencies among the various hidden units, see Fig. 2. These are as follows.
Recurrent convolutional connection
The first type of connection, called recurrent convolutions, is denoted by red lines and aims to model the dependency across images of different resolutions at the same hidden layer . These convolutions connect adjacent hidden layers of successive images (ordered by resolutions), that is, the current hidden layer is conditioned on the feature maps learned from the hidden layer at the previous LR image .
Prelayer convolutional connections
The second type of connections, called prelayer convolutions, is denoted by blue lines. This is used to model the dependency of a given hidden layer of the current image on the hidden layer at the immediate previous layer corresponding to LR image . This endows the hidden layer with not only the output of the previous layer but also information about how a lower resolution image has evolved in the previous layer.
Since the image sizes differ at each CNN pipeline, when implementing the dependence, we resize the higher-order images to match the size of images processed in the current pipeline. This resizing can be denoted by a function .
Note that the first CNN pipeline (), which processes as input, contains only feedforward connections, hence is identical to the network in Fig. 1. We incorporate the three types of connections (feedforward, recurrent, and prelayer) in the next two CNN pipelines (). Let the output of hidden layer be denoted by . Then, we can rewrite functions learned and the outputs at the hidden layers of the CNN pipelines as follows. We start with the functions learned at first hidden layer (), which can be written as
Training and Loss Function
The complete architecture of our network can be seen in Fig. 3. The output from Eq. (2) (for pipelines and ) is then passed on as an input to the subpixel layer [described in Eq. (4)], which outputs the desired prediction (let this be denoted by ). For the pipeline (), the prediction is simply . This network is learned by minimizing a weighted function of mean square error (MSE) and structured similarity metric (SSIM) between the predicted HR and the ground truth at each pipeline
The objective/loss function in Eq. (3) is optimized using stochastic gradient descent. During optimization, all the filter weights of recurrent and prelayer convolutions are initialized by randomly sampling from a Gaussian distribution with mean 0 and standard deviation 0.001, whereas the filter weights of feedforward convolution are set to 0. Note that one can also initialize these weights by pretraining the CNN pipeline on a small sized dataset. We experimentally find that using a smaller learning rate () for the weights of the filters is crucial to obtain good performance.
We performed experiments to evaluate our RCNN approach on three previously published tissue microarray (TMA) cancer datasets, a breast TMA dataset consisting of 60 images,54 and a kidney TMA dataset with 381 images,55 and a pancreatic TMA dataset with 180 images.56 The datasets are summarized in Table 1.
Summary of datasets.
|Dataset||Number of images||Source|
In the datasets we analyze, highest resolution images were acquired and digitalized at using an Aperio CS2 digital pathology scanner (Leica Biosystems),57 with , and lowest resolution images were acquired and digitized using PathScan Enabler IV58 with (). Besides these, we have acquired images at two different resolutions ( and from the Aperio CS2 digital pathology scanner, which served as our intermediate resolutions). These images are then resized (in the order of resolution) to , , , and , which provides us the dataset with four different resolutions.
We evaluate various aspects of our RCNN model to determine the efficacy of our method. First, we evaluate the quality of reconstruction of our RCNN model compared to a single CNN pipeline (which is a baseline for our model) and other comparable methods for SR. We show both qualitative and quantitative results in this regard. Second, we study the advantage of having intermediate resolutions, by varying the number of resolutions available. We also analyze how useful our obtained reconstructions are toward end-user applications, such as segmentation and perform a user study, done by a pathologist to evaluate the utility of the reconstructed images for cancer grading. In addition, we study how the reconstruction accuracy is affected as a function of the training set size. Finally, we discuss the parameters used and the computational time requirements of our model. We discuss these issues next.
Quality of reconstruction
We evaluate the reconstruction quality of the obtained images by our approach by evaluating it relative to HR ground truth image and calculating seven different metrics: (1) root mean square error (RMSE), (2) signal-to-noise ratio (SNR), (3) SSIM, and (4) mutual information (MI), (5) multiscale structured similarity (MSSIM), (6) noise quality measure (NQM),59 and (7) weighted peak signal-to-noise ratio (WSNR).60 RMSE should be as low as possible, whereas SNR, SSIM (1 being the maximum), MSSIM (1 being the maximum), and the remaining metrics, should be high for good reconstruction.
Note that our model was trained by providing three sets of images (of three different resolutions) as input. However, our testing experiments can be done in the following two settings: (a) in the first case, we provide the images of the same three resolutions , , and as input to the trained model, which then outputs the reconstructed highest resolution image . We call this setup RCNN(full). (b) In the second case, we see how our model behaves given only the lowest resolution image . In this case, we first generate the two intermediate resolutions as follows. First, we only activate pipeline , which outputs . Then, use this as input and activate both pipelines and , which then reconstructs as well. Using and the reconstructed and as input, we activate all three pipelines and reconstruct . We call this setup RCNN(1 input).
To the best of our knowledge, we do not know of any other neural network framework to study the MSR problem presented in this paper. Therefore, to compare our method to other state-of-art methods, we choose other SR approaches that work in the two resolution setting (low and high). Our default baseline is the CNN architecture shown in Fig. 1. We refer to this method as CNN. In addition, we also compare with the following methods: (i) the CNN-based framework (FSRCNN) by Dong et al.,13 (ii) a CNN model that uses a subpixel layer (ESCNN), and (iii) a GAN-based approach for SR.21
Results for the breast, kidney, and pancreatic datasets are shown in Tables 2Table 3–4, respectively. In each case, we see that the RCNN(full) setting outperforms all other methods, giving a significant improvement in all the metrics calculated. A qualitative analysis of the reconstructed images in Figs. 4Fig. 5–6 shows that the reconstructed images are indeed very similar to the HR images. The comparatively poorer performances of comparable methods, such as FSRCNN or SRGAN, are expected since these methods are not designed to learn a resolution ratio of 8 used in our experiments. Still we find that RCNN(1 input), which is trained on all three input resolutions but tested by providing the lowest resolution only, outperforms the other baselines, showing that the weights learned our model generalizes better than other neural network models for this difficult learning task. The qualitative results showing the performance of RCNN(1 input) is shown in Fig. 7.
Quantitative results from reconstructed breast images.
Quantitative results from reconstructed kidney images.
Quantitative results from reconstructed pancreatic images.
Effect of the number of intermediate resolutions
Here we study the effect of having intermediate resolution images toward the quality of reconstruction. For this purpose, besides the RCNN(full), which uses two intermediate resolution, we also trained a model with only one intermediate resolution , besides the LR and HR images and . That is, the RCNN models have only two pipelines with inputs and , respectively. We call this model RCNN(1 layer). We train and evaluate our model on each of the three datasets, see Table 5. The results show RCNN(full) shows superior performance compared to RCNN(1 layer), showing that each additional intermediate resolution adds toward the quality of the reconstructions produced.
Quantitative results from varying the number of intermediate resolutions.
|Breast TMA||Kidney TMA||Pancreas TMA|
|Metric||RCNN (1 layer)||RCNN (full)||RCNN (1 layer)||RCNN (full)||RCNN (1 layer)||RCNN (full)|
Pathological diagnosis largely depends on nuclei localization and shape analysis. We used a simple color segmentation method to segment the nuclei using -means clustering to segment the image into four different classes based on pixel values in Lab color space.61 Following this, we use the Hadamard product of each class with the gray level image of the original bright-field image, computed average of pixel intensities in each class, and assigned the lowest value to the cell nuclei. To evaluate our results, we compare the segmentation of the reconstructed images with the results from HR images (ground truth) for 20 samples from each group by computing the misclassification error, which calculates the percentage of pixels misclassified. Results show that number of pixels misclassified from images generated using our RCNN(full) method generates segmentation masks with lower number of pixels misclassified, followed by RCNN(1 input) (Table 6).
Quantitative results from segmentation on the three datasets.
|RCNN (1 input)||0.2358||0.1803||0.1630|
Grading user study by pathologists
Pathological assessment of tissue samples is usually considered the gold standard that requires large magnification for microscopic assessment or HR images. For example, in different types of cancer patient prognosis and treatment plans are predicated on cancer grade and stage.62 Tumor grade is based on pathologic (microscopic) examination of tissue samples, conducted by trained pathologists. Specifically, it involves assessment of the degree of malignant epithelial differentiation, or percentage of gland-forming epithelial elements, and does not take into consideration characteristics of the stroma surrounding the cells.63–67 Accuracy of pathological assessment has a vital importance in clinical workflow since the downstream treatment plans mainly rely on that. Lack of interobserver and intraobserver agreement in pathological tissue assessments is still of major concern, which has been reported for many diseases including pancreatic cancer,68 intraductal carcinoma of the breast,69 malignant non-Hodgkin’s lymphoma,70 and soft tissue sarcomas,71 among others. SR algorithms can mitigate this effect by making second opinion and collaborative diagnosis easily accessible. However, this is achievable if able to reconstruct fine morphological details of the tissue image. For this project, we used reconstructed images of 35 TMA cores randomly selected from a TMA slide (PA 2072, US Biomax), which was graded on HR images more than a year ago and was now graded on the reconstructed images by our collaborator pathologist. These were from different grades of cancer and normal tissue. Grading for 22 TMA cores matched the previous grading by same pathologist. In general, the overall structure of the pancreatic tissue was reconstructed good enough so that normal and grade one was easier to identify based on overall gland shapes and in case of grade one cancer the stroma surrounding the gland was identifiable too. However, it was observed that differentiation between grade 2 and 3 was more difficult since it requires visualization of infiltrating individual tumor cells. Grading results specific to individual cores is provided in Sec. 8.
Effect of training set size
The necessary size of the training set for a particular learning task is generally hard to estimate and depends mostly on the hardness of the learning problem as well as on the type of model being trained. Here, we provide empirical evidence of how our model behaves wrt to increasing the size of the training set. For this purpose, we use the Kidney dataset, since it is the largest dataset considered in this paper. We vary the training size from . The test set is set to 50 in each case. We analyze the quality of reconstruction by comparing the SNR in each case.
As seen in Fig. 8, the SNR improves only slightly when the training set is increased, indicating that a higher training size does not significantly improve the image reconstruction metrics.
Parameters and running time
We implemented our model in TensorFlow using Python, which has inbuilt GPU utilization capability. We used a workstation with an AMD processor 6378 with a 2.4 GHz CPU, 48 GB RAM and NVIDIA GPU 1070 TI graphics card. All our experiments have been performed using GPU, which shows significant performance gains compared to CPU runtime. The training time of our models depends on various factors such as dataset volume, learning rate, batch size and number of training epochs. To report running times for training, we fix learning rate to , dataset volume to 300 images, batch size to 2 and number of training epochs to . The training time of our model is approximately 20.9 hours. The time to generate a new HR image at once the network is trained takes 1.4 minutes. The test-time speed of our model can be further accelerated by approximating or simplifying the trained networks with possible slight degradation in performance.
This paper provides an innovative approach to utilize multi-resolution images to generate a high quality reconstruction of slide scanner images. In addition, this work also leads to several interesting ideas that we will pursue as future work. We discuss these briefly next.
1. We will study a variation of this model that is inspired by mixed effects models popular in statistics. Here the fixed effects will be modeled as function of other higher resolution inputs and the random effects are modelled as residual connection on the LR input in each pipeline. This leads to a Residual variation of the Recurrent Convolutional Network. We will study this network in detail and analyze its computational properties.
2. One of our future goal is also aimed at making our RCNN model scalable to large datasets and higher resolution ratios. In order to do so, we need a way of speeding up the RCNN model to produce HR images at a reduced computational and memory footprint. To do this, we will adopt recent developments in deep learning that show that one can substantially improve the running time of deep CNNs by approximating by linear filters and other related ideas.72
This paper studies a new setup for the SR problem, where the input is multi-resolution slide scanner images, and the goal is to utilize these to make the learning problem associated with SR more tractable, so that it scales to an HR change in the target reconstruction. We propose a RCNN for this purpose. Results show that this model performs favorably when compared to state-of-the-art approaches for SR. This provides a key contribution toward the overarching goal of making use of LR scanners over their HR counterparts, which opens up new opportunities in histopathology research and clinical application.
Appendix A: CNN Subnetwork
Here, we describe the basic structure of each CNN subnetwork and its constituent layers. The components/layers of each CNN pipeline are kept the same. The basic CNN architecture is similar to the model described in our earlier paper,73 except it does not involve the nearest-neighbor-based enhancement, which was used to ensure that reconstructed image retains the finer details of the original HR image, which are otherwise lost using a CNN framework for learning the transformation. We avoid this step since it is also computationally expensive to search a large database of patches for nearest neighbors, especially for the output sizes of HR images at we want to reconstruct. We also replace the ReLU function at the output of each convolutional layer with a leaky ReLU, which is known to have better performance in practice. We describe the layer of the CNN architecture next, see Fig. 1.
Feature Extraction-Mapping Layer
This layer is used as the first step to extract features from the LR input images () for the ’th CNN subarchitecture. The feature extraction process can be framed as a convolution operation and hence implemented as a single layer of the CNN. This can be expressed as
The feature extraction layer is followed by three additional convolutional layers. We also refer to these as hidden layers, denoted by , which is the ’th hidden layer () in the ’th CNN pipeline. The input to this layer is referred to as and the output is denoted by . The filter functions in these intermediate layers can be written as
In our CNN architecture, we leave the final upscaling of the learned LR feature maps to match the size of the HR image, to be done at the last layer of the CNN. This is implemented as a subpixel layer similar to Ref. 74. The advantage of this is that is that layers prior to the last layer operate on the reduced LR image rather than HR size, which reduce the computational and memory complexity substantially.
The upscaling of the LR image to the size of the HR image is implemented as a convolution with a filter whose stride is ( is the resolution ratio between the HR and LR images). The size of the filter is denoted as . A convolution with stride of in the LR space with a filter (weight spacing ) would activate different parts of for the convolution. The patterns are activated at periodic intervals of and where , are the pixel position in HR space. This can be implemented as a filter , whose size is , given that and . This can be written as
Appendix B: Grading of Cancer TMAs
Here, we provide the individual grading results on the subset of the pancreatic TMA cores graded by the pathologist. Note that these results are representative of the results on a typical reconstruction. For comparison purposes, our patholoist also graded the LR images of the same TMAs. The results can be seen in Table 7. The first column refers to the identifier for each cell. To summarize the results, the overall structure of the pancreatic tissue was reconstructed well enough so that normal and grade one was easier to identify based on overall gland shapes. Specifically, in case of grade one cancer, the stroma surrounding the gland was identifiable too. However, it was observed that differentiation between grade 2 and 3 was more difficult since it requires visualization of infiltrating individual tumor cells.
Results of grading done by pathologist on individual cells.
|Cell number||Grading on reconstructed images||Grading on LR images|
|C13||Pathologist identified this as higher grade cancer, based on condensations that are likely strips of malignant epithelium infiltrating the stroma. She could not see the epithelial cell profiles well enough to judge whether it is G2 or G3.||There were some irregular gland-like spaces at 11:00 and 4:00. Pathologist suspected that there are higher-grade malignant cells infiltrating through the stroma, but could not resolve what is a vessel, stromal cell, inflammatory cell, or malignant cell.|
|D5||This picture gave better definition of high-grade cancer infiltrating the stroma, but pathologist still could not call G2 from G3||According to pathologist, gland-like spaces were present near the center of the image and at 1 to 2:00. She could not resolve the dark spots in the stroma, whether these strips of high-grade cancer, inflammatory cells, or capillaries.|
|D6||The top of the image contained G1 glands, and there was a large complex conglomeration near the center of the field that the pathologist called G2. Toward the bottom of the field, dispersed nuclei was concerning for isolated cells or clusters of cells (G3).||The gland outlines were irregular enough that pathologist called this G2 cancer, but she could not tell if there was also G3 in the background. She did not want to make a diagnosis of cancer off this slide, since the image does not show any nuclear or architectural detail (proximity of the glands to nerves, arteries, and remnant acinar tissue).|
|D7||A cluster of G2 was present in the bottom half of the field. The pathologist thought the top was necrotic.||She also called this as G2 cancer, but could not tell if there is G3 cancer present, as well. There was not enough nucelar definition to be able to tell the degree of nuclear atypia.|
|D9||G3. The nuclei at the very middle of the field was bizarre enough to identify as high-grade carcinoma, and clusters of infiltrating glands contain enough of the same nuclei to identify the cells as epithelial (as opposed to lymphocytes sitting in stroma)||Also G2 cancer. Necrosis could be seen in very irregularly shaped glands. The stroma could not be resolved at all, there may have been single cells in the background but she could not see them.|
|E3 and E9||These cancers were heavily infiltrated by lymphocytes, making identification of the malignant glands extremely difficult, even on tissue sections. According to pathologist, they were probably G2 tumor.||Irregularly shaped glands were present throughout the core. The background could not be resolved.|
The authors have no relevant financial interests in this article and no potential conflicts of interest to disclose.
This research was supported by UW Laboratory for Optical and Computational Instrumentation, the Morgridge Institute for Research, UW Carbone Cancer Center, NIH R01CA199996, and Semiconductor Research Corporation.75
Lopamudra Mukherjee is an associate professor in the Department of Computer Science at the University of Wisconsin–Whitewater. She graduated with a PhD in computer science from the University at Buffalo in 2008. Her research interests include computer vision, machine learning, and their applications in biomedical image analysis; she has published a number of papers in these areas.
Huu Dat Bui is a senior software engineer at IBM. He received his master’s degree in computer science from the University of Wisconsin–Whitewater in 2019. Currently, he is working in the field of weather data and weather forecasts. He specializes in a data lake system. He is interested in machine learning and image processing areas. His passion lies at the junction of academic research meeting the needs and expectations of industry.
Adib Keikhosravi received his MSc and PhD degrees in biomedical engineering from the University of Wisconsin–Madison. During this time, he was working at the Laboratory for Optical and Computational Instrumentation (LOCI) to develop state of the art optical and computational imaging systems, image processing and machine learning tools for extracting stromal biomarkers from a variety of optical microscopy modalities during cancer progression. He has published several book chapters and research articles in peer reviewed journals.
Agnes Loeffler is currently the chair of the Department of Pathology at the MetroHealth System, Cleveland, Ohio. She received her medical degree from the University of Illinois, Urbana-Champaign and completed her residency in anatomic pathology at Dartmouth-Hitchcock Medical Center in New Hampshire. Her research interests are primarily in gastrointestinal pathology and informatics.
Kevin W. Eliceiri is the Walter H. Helmerich Professor of Medical Physics and Biomedical Engineering at the University of Wisconsin at Madison and investigator in the Morgridge Institute for Research in Madison, Wisconsin. He is also associate director of the McPherson Eye Research Institute. He has published over 200 papers on optical imaging instrumentation, open source image informatics, and role of the cellular microenvironment in disease. He is a member of both OSA and SPIE.