Fast whole-slide cartography in colon cancer histology using superpixels and CNN classification

Abstract. Purpose: Automatic outlining of different tissue types in digitized histological specimens provides a basis for follow-up analyses and can potentially guide subsequent medical decisions. The immense size of whole-slide images (WSIs), however, poses a challenge in terms of computation time. In this regard, the analysis of non-overlapping patches outperforms pixel-wise segmentation approaches but still leaves room for optimization. Furthermore, the division into patches, regardless of the biological structures they contain, is a drawback due to the loss of local dependencies. Approach: We propose to subdivide the WSI into coherent regions prior to classification by grouping visually similar adjacent pixels into superpixels. Afterwards, only a random subset of patches per superpixel is classified and the patch labels are combined into a superpixel label. We propose a metric for identifying superpixels with an uncertain classification and evaluate two medical applications, namely tumor area and invasive margin estimation and tumor composition analysis. Results: The algorithm has been developed on 159 hand-annotated WSIs of colon resections and its performance is compared with an analysis without prior segmentation. The algorithm shows an average speed-up of 41% and an increase in accuracy from 93.8% to 95.7%. By assigning a rejection label to uncertain superpixels, we further increase the accuracy by 0.4 percentage points. While tumor area estimation shows high concordance with the annotated area, the analysis of tumor composition highlights limitations of our approach. Conclusion: By combining superpixel segmentation and patch classification, we have designed a fast and accurate framework for whole-slide cartography that is AI-model agnostic and provides the basis for various medical endpoints.


Introduction
With the introduction of slide scanning systems into pathological workflows, the prerequisite has been met to introduce machine learning algorithms into diagnostic routines. Due to their large size of over ten billion pixels, however, digitized histopathological whole-slide images (WSIs) pose a challenge to automatic image analysis approaches. When working with such large images, technicians are oftentimes confronted with a trade-off between computational efficiency and segmentation and classification accuracy. In the clinical environment, however, both are equally desirable. This work focuses on how semantic segmentation of tissue classes can be executed efficiently. We present an algorithm for the analysis of large-scale microscopic images which utilizes local pixel dependencies in order to achieve high classification accuracy, whilst maintaining reasonable computational complexity. We propose to introduce clustering into superpixels prior to classification, which helps to model underlying biological structures. Furthermore, we present a technique for inferring superpixel labels from neural network patch classifications. Using supervised learning and a hand-annotated database of 159 slides of colon resection specimens stained with Hematoxylin & Eosin (H&E) dye, our solution is trained to distinguish seven tissue classes. The multi-class analysis of tissue facilitates a further evaluation of tumor composition and growth progression, such as deriving the invasion front, which we only touch upon in this work but do not cover in depth.
Beyond the general research question of how a whole-slide cartography can be performed efficiently, this work aims to answer the following more concrete questions. Can superpixel clustering prior to patch-based classification be utilized to achieve a speed-up? How large is the speed-up compared to a sole patch-based analysis and what is the impact on the segmentation accuracy? Does this approach work equally well for all tissue classes? Is it necessary and beneficial to classify all patches inside a superpixel or is it sufficient to classify only a subset?
If so, what is the impact on the speed-up and accuracy and where is a good balance point? Considering medical end points, can the generated tissue map already be used to derive the tumor invasive margin? How accurately can the tumor area be calculated? Is the tumor composition (necrosis, active tumor cells, tumor stroma, mucus) accurately differentiated?

Related work
In the following, an overview of recent work in the field of semantic image segmentation and applications to pathological image data is provided. Furthermore, technically related approaches that combine superpixel clustering and subsequent classification are briefly reviewed.

Semantic segmentation
Semantic image segmentation describes the process of inferring pixel-wise classification labels in order to generate a two-dimensional classification output. Due to their large size, WSIs are typically divided into smaller image patches which are analyzed individually. Generally, two approaches for the semantic segmentation of WSIs can be distinguished: each image patch can be analyzed by either a classification or a segmentation network. The former predicts a single class-label for the whole image patch, and after reassembling the classified patches, a segmentation mask of the WSI is obtained. This classification-based approach has been applied both in a non-overlapping manner 1,2 , creating coarse segmentation masks, and, at the cost of higher computation times, in a sliding-window manner as neighborhood around each image pixel 3 . In order to incorporate image information on various scales, multiple resolutions can be integrated into a classification-based analysis [4][5][6] .
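As a minimal illustration of the non-overlapping variant, the patch grid of a WSI region can be enumerated as follows (the function name and the patch size of 224 pixels are our own choices for this sketch):

```python
def patch_grid(width, height, patch=224):
    """Top-left corners of all non-overlapping patches that fit fully
    inside a width x height image (remainders at the border are dropped)."""
    return [(x, y)
            for y in range(0, height - patch + 1, patch)
            for x in range(0, width - patch + 1, patch)]

# a 1000 x 1000 pixel region yields a 4 x 4 grid of 224-pixel patches
corners = patch_grid(1000, 1000)
```

Each corner then indexes one patch that is fed to the classification network; a sliding-window variant would simply use a stride smaller than the patch size.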
For the latter approach, based on the segmentation of image patches, special Fully Convolutional Neural Network (FCN) 7 architectures such as U-Net 8 or SegNet 9 are typically used. These architectures employ encoder-decoder structures for the prediction of two-dimensional segmentation outputs and have been used for scene 9 and biomedical image segmentation 8,10-12 . Encoder-decoder-based approaches are able to generate a segmentation output with a granularity that classification-based approaches can only achieve by classifying each image pixel with its neighborhood as an individual patch. However, these approaches entail high computational complexity and require extensive hardware resources. Oskal et al. 8 , for instance, reported inference times of up to 18 minutes per WSI when using an NVIDIA Tesla P100 GPU, and Khened et al. 11 reported 30-75 minutes per WSI with an NVIDIA Titan-V GPU. Such hardware requirements might not be attainable in a clinical setting, where faster computation times are often desired.

Applications in digital pathology
In the field of digital pathology, machine learning algorithms have increasingly gained importance for answering pathological research questions. Bychkov et al. 13 , for instance, proposed a Convolutional Neural Network (CNN)-based approach for directly predicting 5-year disease-specific survival for patients with colorectal cancer merely from tissue microarray cores.
For the semantic segmentation of WSIs, two standard approaches can be distinguished: cell-based and texture-based methods. Sirinukunwattana et al. 14 designed a two-stage CNN-based cell detection and classification algorithm, which has been utilized by various approaches 15,16 . These approaches incorporated graph structures to represent cell communities and thereby created phenotypic signatures. By splitting WSIs into smaller patches and mapping each to its most similar phenotypic signature, a multi-class WSI cartography could be created. On colorectal cancer specimens, Sirinukunwattana et al. 15 scored an accuracy of 97.4 % averaged over nine tissue classes and Javed et al. 16 an F1 score of 92 % averaged over six classes. These high classification scores, however, were achieved at the expense of high computation times of up to 50 minutes per WSI for cell detection and classification 14 .
In the field of texture-based segmentation approaches, Signolle et al. 17 proposed a method that incorporated several binary hidden wavelet domain Markov tree classifiers whose outputs were combined using majority voting. The authors scored a class-averaged recall of 71.02 % on five tissue classes on ovarian carcinoma specimens with an inference time of up to 300 hours per WSI. Other texture-based methods grouped pixels into coherent regions, which were classified using texture-based feature representations, e.g. by Gorelick et al. 18 on prostate specimens.

Superpixel classification
Due to their large size, digitized microscopic images can challenge standard machine learning algorithms. Aiming to reduce computational complexity, a clustering into coherent image segments, e.g. superpixels, has proven advantageous. Zhang et al. 25 , for instance, used superpixel clustering to compute a probability map for nuclei pre-segmentation, which was used as auxiliary input to the subsequent tissue classification network. Nguyen et al. 26 directly segmented breast tissue samples into coherent tissue regions using a graph-based superpixel algorithm. The authors, however, merely performed a segmentation and did not infer labels for the computed superpixels. Other existing works extracted hand-crafted superpixel feature vectors which were then classified using machine learning-based classifiers, thereby enabling a binary 22,27 or multi-class 12,18,19,28 semantic segmentation of medical images. On histological image data, this approach has facilitated the binary segmentation of WSIs in 20-45 minutes by Bejnordi et al. 27 and up to 60 minutes by Balazsi et al. 22 , with good performance results indicated by Dice scores of 92.43 % 27 and 69 % 22 , respectively. Mehta et al. 12 segmented breast cancer tissues into eight classes by using superpixels and a Support Vector Machine (SVM) for classification. Since this combination was not the focus of their work, but merely served as a baseline for performance comparison of their proposed method, the usage of superpixels was not evaluated in much detail. Zormpas-Petridis et al. 28 applied a combination of superpixels and SVM-based classification to the task of segmenting melanoma WSIs. Their evaluation, however, was carried out on a randomly chosen set of superpixels, i.e., unlike in our work, the ground truth did not cover the entire annotated tissue.
Considering the classification of image data, there has been a trend towards the use of deep learning methods, specifically CNNs, in recent years. Bianconi et al. 29 provided a comprehensive overview from theory-driven (hand-crafted) to data-driven (deep-learning) color and texture descriptors. Tamang and Kim 30 summarized various deep learning-based and classical approaches, particularly for the application of colorectal cancer diagnostics. One significant advantage of deep learning is that it enables an end-to-end optimization of classification problems, whereas classification based on hand-crafted features typically requires the selection of the most characteristic features followed by a separate optimization of the classifier. In addition, CNNs often achieve more accurate classification results than traditional methods, especially when large amounts of labeled data are available for training, as shown, for example, in a comparison of different approaches for the classification of malaria pathogens in microscopic image data by Krappe et al. 31 .
Due to their irregular size and shape, however, superpixels can challenge CNN classifiers which require square input images of pre-defined size. Previous work in the field of histopathology can be categorized into two basic strategies to overcome this issue.
The first group of approaches 32-35 extracted bounding boxes around superpixels and resized them to a pre-defined input size. This strategy either requires equally-sized superpixels to maintain a similar down-scaling factor for all superpixels or loses proportions across the input images. The latter can lead to ignoring the valuable size property of biological structures, e.g. the typically enlarged size of tumor cells, which can be an indicator for neoplastic growth. The second group of approaches 36,37 classified a precomputed superpixel by extracting a patch of pre-defined size around the centroid of the superpixel. These approaches, however, relied on compact and square-like superpixels. Otherwise, the centroid might not lie within the given superpixel and the extracted patch will not be representative of this superpixel. Biological structures, however, are rarely square-shaped, and especially at tumor boundaries the interaction of tumor, healthy tissue and inflammatory or necrotic reactions can lead to very irregularly shaped superpixels. In order to meet these characteristics of biological tissue and tumor growth, approaches that can be applied to superpixels of varying shapes and sizes are highly desired. Moreover, all of these approaches 32-37 relied on a one-to-one relationship between a superpixel and the corresponding image patch, which is classified or processed by a CNN. Only Pati et al. 37 subsequently merged neighboring and similar superpixels and averaged their CNN feature vectors to use them in their tissue graph. In our approach, however, the superpixel shape is allowed to deviate greatly from a square shape, and the size of the superpixels is on average 20 times larger than the size of the image patches which are classified by the CNN. This opens up the possibility of classifying multiple image patches within a superpixel and combining the patch classification results into a superpixel label through majority voting.
Moreover, this one-to-many relationship between superpixel and image patches allows deducing a classification confidence measure from the individual patch classification results.

Material and methods
The proposed image analysis pipeline has been trained and evaluated on colon WSIs, provided by the Institute of Pathology of the University Hospital Erlangen (UKER). In the following sections, an overview of the datasets and a detailed description of the applied methods are given.

Datasets
For this work, two different datasets have been used. Dataset A comprises 159 annotated H&E-stained WSIs. The microscopic slides were digitized using a 3DHISTECH Pannoramic 250 slide scanner with an objective magnification of 20X and a resolution of 0.22 x 0.22 µm / pixel. Pathologist-approved manual annotations cover seven tissue classes: tumor cells, muscle tissue, connective tissue combined with adipose tissue, mucosa, necrosis, inflammation, and mucus. Figure 1 visualizes three representatives of each annotated class. Based on these annotations, patches of 224 x 224 pixels that were covered to at least 85 % by a single annotation class have been extracted and labeled accordingly. These patches have been used for training and validating a neural network for semantic image segmentation. Table 1 provides an overview of the dataset including the total number of patches and the corresponding area.
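A sketch of this patch extraction rule, assuming the annotation is given as a 2-D class-index map with 0 marking unannotated pixels (this encoding and the function name are our own conventions, not taken from the paper):

```python
import numpy as np

def patch_label(annotation, x, y, patch=224, min_coverage=0.85):
    """Return the majority annotation class of a patch if it covers at
    least `min_coverage` of the patch area, otherwise None (patch skipped).
    `annotation` is a 2-D integer map; 0 is treated as unannotated."""
    window = annotation[y:y + patch, x:x + patch]
    classes, counts = np.unique(window, return_counts=True)
    # ignore unannotated pixels when picking the dominant class
    mask = classes != 0
    if not mask.any():
        return None
    majority = classes[mask][np.argmax(counts[mask])]
    coverage = (window == majority).mean()
    return int(majority) if coverage >= min_coverage else None
```

Patches returning None are simply not used for training, mirroring the 85 % coverage criterion above.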

Image analysis pipeline
The developed image analysis pipeline is designed as a twofold approach: First, the WSI is segmented into superpixels using the Simple Linear Iterative Clustering (SLIC) algorithm 39 . Then, each superpixel is classified using a CNN-based approach.

Superpixel segmentation
With the goal of reducing the computational complexity of a pixel-based clustering algorithm, the input WSI is analyzed at a coarser resolution level (3.54 µm x 3.54 µm / pixel), corresponding to a down-scaling factor of 16 in each dimension with respect to the original resolution. Moreover, the WSI is cropped at the tissue's bounding box. The foreground (tissue) is determined at this resolution by applying a simple intensity threshold to identify white background pixels. Afterwards, the remaining input image is segmented into superpixels. We compared different established superpixel clustering algorithms by Achanta et al. 39 , Beucher 40 , Felzenszwalb and Huttenlocher 41 and Vedaldi and Soatto 42 . These experiments demonstrated the superiority of the SLIC algorithm regarding boundary detection of different tissue types and computational efficiency, which is consistent with the observations by Achanta et al. 39 . In this work we employ the SLIC implementation from the Python scikit-image module. In order to utilize prior knowledge about the histological staining (H&E), a color deconvolution 43 is performed on the input image and the SLIC algorithm has been modified by replacing the clustering in [l, a, b, x, y] T -space with a clustering in [H, E, x, y] T -space. In order to avoid overly jagged contours, the image is smoothed prior to segmentation using a Gaussian filter (σ=5). The SLIC's number of k-means iterations is limited to 10. The average superpixel size is set to 3,600 pixels at the down-scaled resolution level (i.e. a square superpixel would cover 0.2 x 0.2 mm 2 ). This average superpixel size was determined on a subset of Dataset A, which was solely used for parameter configuration (see Table 1, chapter 4.1). Accordingly, the input parameter for the number of superpixels to be generated by the SLIC algorithm is set to n_SP = (w x h) / 3,600, where w and h denote the width and height of the cropped input image in pixels. After segmentation, all superpixels that contain at least 50 % white pixels are labeled as background.
Background superpixels are excluded from any subsequent classification. The threshold of 50 % has been set as a compromise in order to achieve an accurate tissue-background separation whilst not disregarding superpixels that cover adipose tissue, which oftentimes also contains large white areas.

Superpixel classification
For classification, image patches are extracted at the native resolution (20X). All patches covered to at least 50 % by one superpixel are classified using a CNN. By lowering this threshold, the absolute number of patches that contribute to a superpixel's classification result increases, but so does the relative number of indistinctive border patches. The threshold of 50 % was found to be a good trade-off during preliminary experiments on the parameter optimization subset of Dataset A. After patch classification, all patch labels are combined to infer a superpixel classification. Various standard CNN architectures utilize a softmax layer to output a probability distribution over all classes. We propose to compute the combined probability distribution by summing up all patch softmax output vectors and normalizing by N, the number of patches that contribute to the classification result. The superpixel label L_SP is then defined by the class corresponding to the maximum entry in the superpixel's probability distribution:

L_SP = c_j with j = argmax_j (1/N · Σ_{n=1..N} s_n)_j ,

where s_n denotes the softmax output vector of the n-th classified patch and c_j is an element of the set of available class labels C.
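The label fusion just described can be written compactly (a minimal sketch; the helper name is ours):

```python
import numpy as np

def superpixel_label(softmax_vectors):
    """Combine patch-level softmax outputs into one superpixel label:
    average the probability vectors over all N patches, then take the
    argmax of the combined distribution."""
    mean_probs = np.mean(softmax_vectors, axis=0)
    return int(np.argmax(mean_probs)), mean_probs
```

Averaging the softmax vectors (rather than hard-voting on argmax labels) lets confident patches outweigh uncertain ones in the combined distribution.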
Preliminary experiments have shown that, due to the high variance in shape and size of the superpixels, sometimes up to 100 individual patches contribute to a superpixel label. Since the vast majority of patches within a superpixel will contain the same tissue type, we hypothesize that valuable computation time can be saved by analyzing only a random subset of patches without significantly impacting the overall accuracy. We propose to analyze at most ten patches. The influence of this restriction is investigated in chapter 4.1. Moreover, we propose a confidence measure (C_votes_diff) of a superpixel classification derived from the classification results of the patches within this superpixel. For this, we divide the difference between the patch votes for the most represented and the second most represented class by the number of all classified patches.
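A minimal sketch of the patch subsampling and the C_votes_diff measure (function names are ours):

```python
import random
from collections import Counter

def sample_patches(patches, limit=10, rng=None):
    """Classify at most `limit` randomly chosen patches per superpixel."""
    rng = rng or random.Random(0)
    return patches if len(patches) <= limit else rng.sample(patches, limit)

def votes_diff_confidence(patch_labels):
    """C_votes_diff: (votes for the winning class minus votes for the
    runner-up) divided by the number of classified patches."""
    counts = Counter(patch_labels).most_common()
    top = counts[0][1]
    second = counts[1][1] if len(counts) > 1 else 0
    return (top - second) / len(patch_labels)
```

A unanimous superpixel thus gets confidence 1.0, while a 7-vs-3 vote split over ten patches yields 0.4.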

CNN model
The developed pre-processing steps (color deconvolution, foreground detection, superpixel segmentation) are independent of the subsequent CNN structure, which is therefore interchangeable. However, the average superpixel size has to be adapted to the CNN input patch size in order to maintain a reasonable ratio of both measures. For the experiments elaborated hereafter, a ResNet50 architecture with 224 x 224 pixel input size has been chosen and trained using the training and validation sets of Dataset A (see Table 1). The network has been implemented using TensorFlow 2.2. We employ the color augmentation method described by Tellez et al. 44 , where the RGB image is converted to the H&E color space using a deconvolution. Then, the Hematoxylin and Eosin components are individually modified, simulating different staining intensities. Moreover, zero-centering is applied as a preprocessing step. Training is performed using cross-entropy loss and the Adam optimizer with a learning rate of 0.001. A batch size of 105 was chosen, and in each batch the different classes are represented equally. Class imbalances are hereby compensated by oversampling underrepresented classes such as necrosis and mucus.
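The stain augmentation can be sketched with scikit-image's H&E deconvolution as follows; the scaling and shifting ranges here are illustrative placeholders of ours, not the values used by Tellez et al.:

```python
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def he_color_augment(rgb, sigma=0.05, bias=0.05, rng=None):
    """Stain augmentation in the spirit of Tellez et al.: deconvolve to the
    HED space, randomly scale and shift the Hematoxylin and Eosin channels
    (simulating staining variations), and convert back to RGB."""
    rng = rng or np.random.default_rng()
    hed = rgb2hed(rgb)
    for ch in range(2):                              # H and E channels only
        hed[..., ch] = hed[..., ch] * rng.uniform(1 - sigma, 1 + sigma) \
                       + rng.uniform(-bias, bias)
    return np.clip(hed2rgb(hed), 0.0, 1.0)
```

Applied on the fly during training, this exposes the network to a wider range of staining intensities than the scanner output alone.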

Evaluation method for cartography results
For a visual validation of the annotation ground truths and classification outputs, the open-source software tool SlideRunner 45 has been used. The quantitative analysis is performed at an image resolution of 3.54 µm x 3.54 µm / pixel. We assign a class-label to each foreground pixel according to the manual ground truth annotation. The prediction map of the image is generated at the same resolution. Only pixels having both a ground truth and a prediction label are evaluated. Based on the confusion matrix, different classification measures, e.g. class-wise recall, are calculated. For all class-wise measures, the corresponding two-class problem is considered, whereby all negative classes are combined into a single negative class.
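The pixel-wise evaluation can be sketched as follows (a minimal version assuming ground truth and prediction are given as equally shaped integer class maps covering only the evaluated pixels):

```python
import numpy as np

def pixelwise_metrics(gt, pred, n_classes):
    """Confusion matrix over all pixels that carry both a ground-truth and
    a prediction label, plus class-wise recall from its rows."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)
    with np.errstate(invalid="ignore"):       # classes absent from gt give NaN
        recall = np.diag(cm) / cm.sum(axis=1)
    return cm, recall
```

Class-wise precision follows analogously from the columns (`np.diag(cm) / cm.sum(axis=0)`), and combining all negative classes reduces each class to the described two-class problem.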

Tumor area computation and invasive margin
The primary tumor area is determined based on the cartography results. First, binary maps for the classes "tumor cells", "necrosis" and "mucus" are generated at the same resolution level used for superpixel segmentation. The tumor invasive margin is derived from the auto-detected tumor area by extending the region in relation to the desired margin width. The intersection between this extended region and non-tumor tissue defines the basis of the invasive margin. Finally, the intersection area is again extended, as the invasive margin is situated at the border between tumor and surrounding tissue and stretches out into both.
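Assuming binary masks at the evaluation resolution, the margin construction described above might be sketched with morphological dilation (the helper name and the use of scipy's default structuring element are our own choices):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def invasive_margin(tumor_mask, half_width_px):
    """Sketch of the described margin construction: extend the tumor area
    outwards, intersect with non-tumor tissue, then extend that band again
    so the margin straddles the tumor border in both directions."""
    extended = binary_dilation(tumor_mask, iterations=half_width_px)
    outer_band = extended & ~tumor_mask
    return binary_dilation(outer_band, iterations=half_width_px)
```

The number of dilation iterations translates the desired margin width in µm into pixels at the chosen resolution level.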

Tumor composition
Quantitative analysis of the tumor microenvironment supports studies and diagnostics of tumor-infiltrating lymphocytes (TILs) in colon, as well as bladder and breast cancer 46,47 . The analysis in a region of interest like the invasive margin plays an important role for evaluating the immune response against tumor cells. In previous studies, a high correlation between CD3 and CD8 positive cell counts and patient outcome has been shown 48 . In the case of colon cancer, the immune response can be quantified using the immunoscore. Therefore, we use dataset B to compare the estimation of tumor component areas (active tumor, necrosis, mucus) from cartography results with the manual annotations. Graham et al. 49 have used a rotation equivariant network for the task of gland segmentation. We use this approach to separate the active tumor area from interconnecting tumor stroma by using the segmented glands' area as an approximation for the active tumor area. The ground truth area for necrosis and mucus is directly derived from the manual annotations. The ground truth for the active tumor area is obtained on serial sections stained with immunohistochemical markers (pan-cytokeratin, epithelial AE1/AE3) by applying a simple thresholding approach within the manually annotated tumor area. Again, color deconvolution was performed and solely the DAB channel was chosen for segmentation. Figure 3 shows a comparison of the manual annotations on the H&E-stained WSI and the segmentation result on the IHC-stained consecutive WSI. On the one hand, this approach is beneficial as it does not suffer from the human annotator's subjectivity. On the other hand, one has to keep in mind that the two consecutive sections are separated by a small spatial distance, which can nevertheless be large enough that a cell visible in one slide is not visible in the other.
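A hedged sketch of the IHC thresholding step, using scikit-image's color deconvolution; the threshold value and both helper names are illustrative only, not the paper's:

```python
import numpy as np
from skimage.color import rgb2hed

def dab_positive_mask(ihc_rgb, threshold=0.03):
    """Threshold the DAB channel of a colour-deconvolved IHC image to
    approximate the pan-cytokeratin-positive (active tumor) area.
    The threshold value is illustrative, not the one used in the paper."""
    dab = rgb2hed(ihc_rgb)[..., 2]      # channel order: H, E, DAB
    return dab > threshold

def area_mm2(mask, pixel_size_um=3.54):
    """Convert a binary mask to an area in mm^2 at the evaluation resolution."""
    return mask.sum() * (pixel_size_um / 1000.0) ** 2
```

The resulting mask, restricted to the manually annotated tumor area, yields the active-tumor ground truth area that the cartography estimate is compared against.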

Results and discussion
Several experiments were performed using dataset A to investigate the performance of the superpixel-based WSI cartography, ranging from the parameter configuration of the SLIC algorithm, over a comparison between a classical patch-based approach and our newly introduced superpixel approach, to an investigation of the uncertainty of the superpixel classification results. Though not the focus of this work, we carried out preliminary experiments on dataset B for two possible medical endpoints (tumor area and tumor composition) that will likely benefit from having available a detailed tissue map as generated by our proposed method.

Configuration of superpixel approach
In order to define an optimal average superpixel size as well as a threshold for the number of classified patches per superpixel, experiments were performed on the parameter test set of dataset A containing eight WSIs (see Table 1). Figure 4 visualizes the influence of increasing the average superpixel size (as start parameter for the SLIC algorithm) on the total number of superpixels per WSI, the classification accuracy and the average computation times for superpixel classification and WSI inference. For these experiments, the maximum number of classified patches per superpixel was limited to 30. Inference times have been measured using an NVIDIA GeForce GTX 1060 GPU with 6 GB RAM. As expected, a larger average size per superpixel results in fewer superpixels per WSI. However, larger superpixels cover a larger number of patches, which are classified and then combined to infer a superpixel class-label. Therefore, larger superpixels entail higher computational costs (Figure 4a). Nevertheless, the overall computation time for slide inference decreases due to the decreased number of superpixels on the WSI. The classification accuracy, however, also decreases (Figure 4b). Figure 5 visualizes the effect of smaller superpixel sizes (left) and larger superpixel sizes (right) on the segmentation result. As a compromise between low computational complexity for larger superpixel sizes and high accuracy for smaller superpixel sizes, an average superpixel size of 3,600 pixels, i.e. a square superpixel covering 0.2 x 0.2 mm 2 , was chosen for further experiments. However, the results of these experiments depend on various parameters (such as the threshold for the number of classified patches per superpixel) as well as the chosen CNN architecture, and there is still room for further optimization.
The biggest disadvantage of a greater superpixel size is that small details are neglected resulting in inaccurate segmentation results especially for classes like necrosis or tumor cells.
The histogram in Figure 6 shows that, with an average superpixel size of 3,600 pixels, some larger superpixels cover more than 30 individual patches. We hypothesized that it is sufficient to classify only a random subset of the patches within a superpixel. Table 2 summarizes the influence of various maximum patch limits on the computation time of slide inference and the overall accuracy. Whilst a smaller patch limit results in significantly lower computational costs, the slide accuracy only shows a marginal decrease. Therefore, we further reduced the limit from 30 to 10 patches for subsequent experiments.

Classification performance and run-time
In order to evaluate segmentation performance and computational complexity, the proposed algorithm is compared to a traditional classification-based approach with non-overlapping image patches. To isolate the effects produced by the proposed technique of introducing a superpixel clustering and inferring superpixel classification labels, the same CNN is used as part of both approaches. Results are collected on the remaining 29 slides of dataset A (test set), which have not been used for training, validation or adaptation of parameters. The classification performance is assessed pixelwise on a lower image resolution of 3.54 µm x 3.54 µm / pixel as described in chapter 3.3. Table 3 summarizes the total number of evaluated pixels on this resolution. Minor deviations of the overall sum of evaluated image pixels exist due to the irregular shape of the superpixels compared to the patch-wise approach.
On the 29 test slides of dataset A, the tissue bounding box contains on average 10.7 billion pixels at the native resolution (= 520 mm 2 ). Within these, the SLIC algorithm produces 4,060 ± 1,717 (µ ± σ) superpixels with an average size of 1,016,289 pixels (= 0.05 mm 2 ). The average number of patches per superpixel without introducing a maximum cut-off is 19.58 ± 6.39. Restricting the maximum number of patches to be classified to only 10 patches per superpixel affects 94.8 % of all superpixels and decreases the average number of classified patches per superpixel to 9.95 ± 0.36. When evaluating a multi-class semantic segmentation task, it is informative to look at which classes are frequently mistaken for one another. Figure 7 shows the relative confusion matrices for both approaches. They show similar behavior regarding the typical confusions of classes: e.g. necrosis is misclassified as tumor or inflammation as mucosa.
From the confusion matrices, class-based recall and precision values are calculated, which are displayed in Figure 8. The superpixel-based approach yields an overall accuracy of 95.7 % compared to 93.8 % obtained with the patch-based approach. The improvement in accuracy has been tested for statistical significance using a paired t-test based on the 29 slide-wise classification accuracies and was confirmed at a confidence level of 99 %. Due to differences in the background detection, which is performed per superpixel and per patch, respectively, the sum of classified pixels slightly differs between the two approaches. Figure 8 shows an improvement of the classification measures with the superpixel approach compared to the patch-based approach. The average improvement is 0.022 in recall, 0.019 in precision and 0.018 in the F1 score. Whilst this improvement can be observed for all classes with larger annotation areas, performance sometimes decreased for inflamed, necrotic and mucous areas. One possible reason for this might be that these classes often constitute very fine annotations. The chosen superpixel size sometimes creates clusters too coarse to accurately represent these minute structures.

Figure 8: Comparison between patch-based and superpixel-based approach by class-wise recall, precision and F1 score.

Figure 9 visualizes the cartography outputs of the compared approaches. Overall, the non-overlapping patch-based image analysis yields checkered classification outputs with many interruptions of connected components due to individual misclassifications. A prior segmentation into superpixels, on the other hand, yields smoother results which follow biological structures. It can be seen that the larger tissue classes are detected accurately and also smaller structures, e.g. inflammations and necrotic areas, are classified correctly in most cases.
However, this example also highlights limitations of the algorithm, where structures become too small to be accurately represented by the superpixels, e.g. small necrotic areas of comedo necrosis, which is consistent with the decrease in recall for necrosis compared to the patch-based approach. This drawback could be countered by choosing a smaller average superpixel size, albeit only at the cost of higher computation times. The relatively large superpixel size also causes tumor cell classifications to be rather generous and to incorporate surrounding tumor stroma. If a precise tumor/stroma separation is intended, the superpixel-based classification approach could be followed by a separate cell detection algorithm or simply by a second refinement run of the superpixel segmentation and classification restricted to the tumor areas.
Using an NVIDIA GeForce GTX 1060 GPU and TensorFlow 2.2, the standard classification-based segmentation approach with non-overlapping patches resulted in computation times of 12.8 ± 5.3 minutes per WSI. The superpixel-based segmentation pipeline achieved classification times of 6.7 ± 2.8 minutes with an additional 47 ± 18 seconds for the SLIC clustering, resulting in an overall run-time of 7.5 ± 3.0 minutes per WSI. Thereby, an average acceleration of 41 % could be achieved by the proposed image analysis approach. This acceleration is mainly the result of restricting the number of classified patches per superpixel. Without this restriction, the classification time increases to 13.4 ± 5.5 minutes and the overall run-time including SLIC clustering to 14.2 ± 5.7 minutes. This is slower than the patch-based approach but yields the highest overall accuracy of 96.0 %, an improvement of 0.3 percentage points compared to the superpixel cartography with a restricted number of classified patches per superpixel.
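The patch restriction that produces this speed-up can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cap of five patches and the `classify` callback are placeholders, and the label is inferred by simple majority vote over the sampled patches.

```python
import random
from collections import Counter

def superpixel_label(patch_ids, classify, max_patches=5, rng=None):
    """Infer a superpixel's class by majority vote over a random patch subset.

    classify maps a patch to a class label; max_patches caps how many patches
    per superpixel are actually classified (cap value is illustrative).
    """
    rng = rng or random.Random(0)
    if len(patch_ids) > max_patches:
        # Only a random subset is classified -- the source of the speed-up.
        patch_ids = rng.sample(patch_ids, max_patches)
    votes = Counter(classify(p) for p in patch_ids)
    return votes.most_common(1)[0][0]
```

Since a superpixel contains on average about twenty patches, classifying only a fixed-size subset roughly halves the classification workload observed above.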
When comparing computation times, it has to be considered that the patch-based approach was performed in the fastest possible way by using non-overlapping patches. Standard patch-based approaches, however, use overlapping image patches and interpolate the classification results. When choosing an overlap of half the patch dimension, the number of classifications already increases from n x n to (2n-1) x (2n-1). Even when using fast scanning architectures to avoid redundant computations in overlapping image regions, the overall computational costs are expected to increase further when using overlaps. This underlines the benefit of the proposed clustering prior to classification even further.
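The quadratic growth of the classification count with half-patch overlap can be checked with a two-line helper (a generic sketch, not code from the paper):

```python
def num_classifications(n, half_overlap=False):
    """Patch classifications needed for an n x n patch grid.

    With an overlap of half the patch dimension, each axis has
    2n - 1 patch positions instead of n.
    """
    side = 2 * n - 1 if half_overlap else n
    return side * side
```

For a 100 x 100 grid this gives 10,000 classifications without overlap versus 39,601 with half-patch overlap, i.e. nearly a fourfold increase.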

Introduction of rejection class based on classification confidence
Aiming to minimize the effect of misclassifications on the final cartography output, we attempt to detect superpixels with uncertain classification results so that a rejection label can be assigned to them. Our hypothesis is that the remaining classification results are more reliable and therefore yield a higher overall accuracy as well as higher average class-wise precision, recall and F1 score. This comes at the expense of introducing unclassified areas, which are not included in the calculation of the classification quality measures. Superpixels with a confidence lower than a defined threshold are assigned to the rejection class, and hence so are all pixels (resolution: 3.54 µm x 3.54 µm per pixel) inside them. All pixels of the remaining superpixels are evaluated as before (see 3.3). As a consequence of rejecting uncertain pixels, the number of classified pixels, and therefore the number of correctly and falsely predicted pixels, decreases.

We compared the confusion matrix and classification metrics with and without rejection of uncertain superpixels. As rejection threshold we chose 0.1, meaning that all superpixels with a C_votes_diff smaller than 0.1 are assigned to the rejection class. In total, 1.3 % of the pixels were rejected. The number of true predictions decreases by 0.8 % compared to the classification without rejection, while the number of false predictions decreases by 11.8 %. Overall, 1.9 million correctly classified pixels are discarded due to a low confidence value, alongside 1.3 million incorrectly classified pixels. The overall accuracy increases to 96.1 % compared to 95.7 % without rejection of superpixels. Likewise, there is an improvement for all classes in precision (average 0.009), recall (average 0.007) and F1 score (average 0.009). The highest impact is obtained for classes that are usually distributed over the whole tissue sample and cover very small sections, such as necrosis, inflammation and mucus.
These results support our hypothesis that the remaining classification results are more reliable at the expense of introducing areas without classification. Therefore, it depends on the application which aspect is prioritized.
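The rejection mechanism can be sketched as follows. Note this is an assumption-laden illustration: the paper's exact definition of C_votes_diff is not restated here, so the sketch assumes it is the difference between the vote fractions of the two most frequent classes among the classified patches of a superpixel.

```python
from collections import Counter

def c_votes_diff(votes):
    """Confidence of a superpixel classification, assumed here to be the
    difference between the vote fractions of the two most frequent classes."""
    top_two = Counter(votes).most_common(2)
    total = len(votes)
    best = top_two[0][1] / total
    runner_up = top_two[1][1] / total if len(top_two) > 1 else 0.0
    return best - runner_up

def label_or_reject(votes, threshold=0.1, rejection="rejected"):
    """Majority label, or the rejection class if the confidence is too low."""
    if c_votes_diff(votes) < threshold:
        return rejection
    return Counter(votes).most_common(1)[0][0]
```

With the threshold of 0.1 used above, a superpixel whose patch votes are split evenly between two classes (confidence 0) is rejected, while any clear majority is kept.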
Besides the quantitative evaluation, the question arises which areas in a WSI tend to receive uncertain classifications. We only touch upon this question with one qualitative example: in Figure 10d, superpixels with uncertain classification results (based on C_votes_diff with a threshold of 0.45) are highlighted. This example reveals two typical constellations that lead to an uncertain classification. Superpixels containing a high amount of background pixels, e.g. located at or near fissures or at the rim of the tissue section, tend to be misclassified. The same applies to superpixels in the transition between two tissue types, e.g. located near the invasive margin or in slightly inflamed tissue. Moreover, ground truth annotations are only provided for regions that can be assigned clearly to one tissue type, except for the tumor cell class, where it was not feasible to annotate each small necrotic area, as can be seen in Figure 10b.

Tumor area
Dataset B was used to evaluate the computation of the tumor area. On average, the estimated and the annotated tumor area differ by 6 %, with a mean IoU of 89.4 % and a mean Dice coefficient of 94.3 % (per-slide results in Figure 11). In the visualization of the results (Figure 12), red marks areas that were mistaken as tumor (FPs), blue indicates tumor annotations not detected by the algorithm (FNs), and the remaining highlighted regions are tumor areas that have been found correctly (TPs). It can be seen that most misclassifications are located at tumor boundaries. Especially necrotic areas adjacent to the lumen were included in the tumor area by our approach but have been excluded by the pathologist. On the contrary, at the invasive margin our approach misses some tumor areas.
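The two overlap measures reported above are standard set-overlap statistics on binary masks; a minimal NumPy sketch:

```python
import numpy as np

def iou_and_dice(pred, gt):
    """IoU and Dice coefficient between two binary tumor masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union, 2 * inter / (pred.sum() + gt.sum())
```

Since Dice = 2·IoU/(1+IoU), the Dice coefficient is always at least as large as the IoU, consistent with the 94.3 % versus 89.4 % reported above.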
Looking at the slide results in detail, however, a few WSIs contain larger misclassified regions. Three examples are visualized in Figure 12b. One main source for deviations are again necrotic areas. In our approach, all adjacent necrotic areas are incorporated into the tumor area. This technically defined rule cannot perfectly represent the pathologist's annotation (ground truth) in individual cases, as it cannot sufficiently reflect the biological and complex morphological nature of the tumor. Moreover, two sections contained adenomas that were classified as tumor.

Invasive margin
By growing the tumor area evenly by a defined distance towards the surrounding healthy classes, the tumor invasive margin can be generated automatically (see Figure 14). The generated margins of all slides of dataset B were qualitatively evaluated by two pathologists using a point-based grading system from 1 to 5 (1 = very good, 5 = insufficient). On average, the margins were rated 1.6, composed of 18 ratings of "very good", 15 ratings of "good" and three of "satisfying". The two pathologists agreed for 13 WSIs, and their judgments differed by only one point for five WSIs. These first qualitative results seem promising and could enable further analyses, e.g. the determination of the invasion depth or the quantification of inflammation within the invasive margin.

Figure 13: The mixture of debris and destroyed mucosa tissue is classified as tumor (orange) and mucus (turquoise) and leads to a deviation in tumor area in slide number 11.
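Growing the tumor area evenly by a defined distance amounts to morphological dilation of the tumor mask; subtracting the original mask leaves the margin band. The sketch below uses plain 4-connected dilation with a width in pixels, which is a simplification: the paper defines the growth distance in physical units and may use a different structuring element.

```python
import numpy as np

def dilate_once(mask):
    """One step of 4-connected binary dilation (plain NumPy, no SciPy)."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def invasive_margin(tumor_mask, width_px):
    """Grow the tumor mask evenly by width_px pixels and subtract the
    original tumor area, leaving a band around the tumor."""
    tumor_mask = np.asarray(tumor_mask, dtype=bool)
    grown = tumor_mask
    for _ in range(width_px):
        grown = dilate_once(grown)
    return grown & ~tumor_mask
```

In practice the growth would be stopped at background and restricted to the surrounding healthy classes, which is omitted here for brevity.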

Tumor composition
Using dataset B, the tumor composition is evaluated by computing the ratios of tumor cells (Figure 15), necrosis and mucus within the ground truth tumor area. The results in Figure 15 show that both the superpixel approach and the patch-based approach overestimate the active tumor area for every slide, although the average deviation is smaller for the latter. The best estimation is obtained with the gland segmentation approach. To analyze these results further, the slides of dataset B have been divided into subsets according to their tumor grading. Table 4 breaks down the deviation of the estimated active tumor area from the ground truth as the mean over each subset. For tumors of grade 1 and grade 2, the gland segmentation approach provides good estimations of the active tumor area. As expected, the accuracy decreases for tumors of grade 3, where the growth becomes diffuse and the gland structure is destroyed.

Figure 16 shows the detected active tumor area (marked in orange) for a well differentiated tumor (grade 2, slide number 1) for all three approaches. This example illustrates that the superpixel approach overestimates the active tumor area due to misclassification of tumor stroma as tumor cells. The patch-based approach shows a similar behavior, however with smaller deviations from the ground truth. The gland detection approach is in good correspondence with the ground truth segmentation. The limitation of this approach becomes evident in the second example (Figure 17), showing a grade 3 tumor (slide number 11): here, the estimated active tumor area deviates significantly from the ground truth area (Table 4). In contrast, the deviation from the ground truth for the superpixel and patch-based approaches seems to be independent of the tumor grade.
Besides the active tumor area, the ratios of necrosis and mucus area within the tumor area are additional relevant parameters for characterizing the tumor microenvironment. Both the patch-based and the superpixel approach show similar results here, with a slight superiority of the patch-based approach for the determination of the necrotic area (see Table 5). Because the average superpixel size (0.048 mm² on dataset B) is significantly larger than the patch size (0.002 mm²), necrotic areas, which are oftentimes only small islands between tumor cells, seem to be better captured by the patches than by the superpixels.
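Given a per-pixel cartography output and the ground-truth tumor mask, the composition ratios reduce to simple area fractions. A minimal sketch (the class names are illustrative, not the paper's label encoding):

```python
import numpy as np

def tumor_composition(class_map, tumor_mask,
                      classes=("tumor_cells", "necrosis", "mucus")):
    """Area ratios of selected tissue classes within the tumor region.

    class_map holds a per-pixel class label (strings here for readability);
    tumor_mask is the boolean tumor area from the ground-truth annotation.
    """
    class_map = np.asarray(class_map)
    tumor_mask = np.asarray(tumor_mask, dtype=bool)
    inside = class_map[tumor_mask]
    return {c: float((inside == c).sum()) / inside.size for c in classes}
```

Because pixels are counted regardless of which superpixel or patch they came from, any systematic misclassification of stroma as tumor cells directly inflates the tumor-cell ratio, which is the overestimation discussed above.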

Conclusion
In this work, we presented an approach for histology whole-slide cartography using superpixels, by the example of colon carcinomas. Our work was motivated by the feasibility of the developed method in a clinical setting. Even though encoder-decoder-based approaches are sometimes considered superior to patch-based approaches regarding the granularity of segmentation outputs, they oftentimes require powerful hardware not attainable in a clinical setting. Therefore, our work focused on increasing the efficiency of a patch-based cartography, which can easily be transferred to e.g. a pathology institute and ensures fast inference. This increased efficiency was obtained by pre-segmenting the input image into superpixels and only classifying a subset of patches within these superpixels.
The evaluation results on our test set of 29 WSIs show the superiority of our approach compared to a classical patch-based approach, both in overall accuracy, with an increase from 93.8 % to 95.7 %, and in computing time, with an average speed-up of 41 % resulting in an average overall run-time of 7.5 minutes per WSI. The speed-up is mainly achieved by limiting the number of classified patches within each superpixel. This patch restriction only results in a marginal decrease in accuracy of 0.3 percentage points compared to the unrestricted approach. These results indicate that the superpixel clustering already segments the WSI into regions belonging to the same tissue type; only when this requirement is fulfilled can accurate cartography results be obtained.

The limitation of our approach lies in the relatively large size of superpixels compared to patches. On our test set, one superpixel on average covers 0.05 mm². Compared to fine-grained structures, such as small necrotic areas within the tumor, this size is too big to correctly capture these areas, which is also reflected in e.g. a lower recall for necrosis. Another limitation lies in the manual annotations: accurate and complete annotation of these fine-grained structures is also a challenge for the human annotator. Therefore, wherever possible, an alternative generation of the ground truth should be preferred, e.g. based on segmentation in immunohistochemically stained sections. Moreover, one has to keep in mind a general problem with the quantitative assessment of cartography results: although seven tissue classes are used, there are still areas that cannot be clearly assigned to one of these classes. These non-annotated areas are not included in the quantitative evaluation. Therefore, from our point of view, it is important to always apply the developed approaches to complete WSIs and to check the cartography results in these areas at least qualitatively.
The key difference of our method compared to other superpixel-based approaches for histopathology images is the one-to-many relationship between superpixels and corresponding image patches. In our setup, a superpixel contains on average twenty image patches, of which we classify a random subset. Utilizing the fact that a superpixel class label is inferred from a set of multiple individually classified patches, we investigated a measure for quantifying the uncertainty of a superpixel classification derived from the votes of the patches within the superpixel. This measure proved suitable for decreasing the relative number of incorrect predictions, at the cost of introducing unclassified tissue areas and rejecting some correct predictions, although to a smaller extent. Moreover, applying our uncertainty measure to WSIs and visualizing uncertain superpixels enables a plausibility check of the approach. As expected, classification results of superpixels in the transition between two tissue types, e.g. located near the invasive margin, tend to be uncertain. The uncertainty measurement also facilitates automatic improvement, e.g. by partitioning uncertain superpixels into smaller segments and re-classifying them, or by applying pixel-wise segmentation methods within these areas.
Whole-slide cartography by itself offers only limited support to the pathologist but provides a basis for subsequent analysis operations that can predict various medical endpoints. Within this work, we used the cartography results to determine the tumor area and composition as well as to derive the invasive margin. While the tumor area is in good agreement with the ground truth, the tumor composition analysis highlights weaknesses of the approach. Again, due to the size of our superpixels, the separation between fine-grained active tumor cells and tumor stroma is not adequate. However, a combination with further methods (in our case gland segmentation for well differentiated tumors) yields good results with an average deviation of only 11.4 %.
Being able to reliably detect and outline the tumor area is very valuable from a clinical perspective. In a routine workflow, such a functionality could be used as an assistance system that draws the pathologist's attention to a specific region. Alternatively, such a system could be introduced as a quality control mechanism that provides a second opinion. Another potential application is ensuring that samples for molecular testing are taken from an area that actually contains a high ratio of tumor cells. In the context of computational pathology, recent literature has shown that it is possible to predict genetic alterations directly from WSIs 50,51. A pre-requirement here is that only tumor areas are analyzed. In the context of colon carcinoma, an example would be the detection of microsatellite instability (MSI) to validate the presence of Lynch syndrome. The tumor composition in terms of the ratios of active tumor cells, tumor stroma, necrosis and mucus has been shown to be of prognostic relevance 52 and could, at least for well differentiated tumors, be assisted by the proposed approach.

Acknowledgments
This work was partially funded by the Bavarian Ministry of Economic Affairs, Regional Development and Energy through the Center for Analytics - Data - Applications (ADA-Center) within the framework of "BAYERN DIGITAL II" (20-3410-2-9-8). This work was partially supported by the Federal Ministry of Education and Research under the project reference numbers 16FMD01K, 16FMD02 and 16FMD03. Parts of the research have been funded by the German Federal Ministry of Education and Research (BMBF) under the project TraMeExCo (011S18056A).
We also want to thank Christa Winkelmann and Natascha Leicht for technical assistance in the laboratory with sectioning and staining. We want to thank Nicole Fuhrich and the working group for digital pathology at the Institute of Pathology, in particular Tatjana Kulok and Jonas Plum, for assistance with all scans of the cohort.