Cell2Grid: an efficient, spatial, and convolutional neural network-ready representation of cell segmentation data

Abstract. Purpose: Cell segmentation algorithms are commonly used to analyze large histologic images as they facilitate interpretation, but on the other hand they complicate hypothesis-free spatial analysis. Therefore, many applications train convolutional neural networks (CNNs) on high-resolution images that resolve individual cells instead, but their practical application is severely limited by computational resources. In this work, we propose and investigate an alternative spatial data representation based on cell segmentation data for direct training of CNNs. Approach: We introduce and analyze the properties of Cell2Grid, an algorithm that generates compact images from cell segmentation data by placing individual cells into a low-resolution grid and resolves possible cell conflicts. For evaluation, we present a case study on colorectal cancer relapse prediction using fluorescent multiplex immunohistochemistry images. Results: We could generate Cell2Grid images at 5-μm resolution that were 100 times smaller than the original ones. Cell features, such as phenotype counts and nearest-neighbor cell distances, remain similar to those of original cell segmentation tables (p<0.0001). These images could be directly fed to a CNN for predicting colon cancer relapse. Our experiments showed that test set error rate was reduced by 25% compared with CNNs trained on images rescaled to 5 μm with bilinear interpolation. Compared with images at 1-μm resolution (bilinear rescaling), our method reduced CNN training time by 85%. Conclusions: Cell2Grid is an efficient spatial data representation algorithm that enables the use of conventional CNNs on cell segmentation data. Its cell-based representation additionally opens a door for simplified model interpretation and synthetic image generation.

Ideally, the total assignment problem of all cells in an image would be solved by a single use of Munkres' algorithm. However, the runtime of Munkres' algorithm⁹² increases as $O(n^3)$ with the size $n$ of the cost matrix. The resulting computation time of several hours for a single image is impractical. Instead, we follow the approach outlined in the main text: we first assign all cells to their closest grid node and subsequently resolve assignment conflicts locally, one grid node at a time, using Munkres' algorithm.

S.1.2 Local conflict resolution
After initial binning of cells to their closest grid node, grid nodes with conflicts are resolved sequentially in random order. At each node, we use Munkres' algorithm to find an optimal local assignment of all cells within a small conflict resolution window of size $w_{max}$ around the grid node (Figure S-1). For large $w_{max}$, computation time may be unnecessarily high in cases where the conflict could be resolved in a smaller local neighborhood. For each conflicting grid node we therefore start by considering the 3 × 3 local neighborhood around the conflict. If the number of cells $n_c$ within this local neighborhood is smaller than or equal to the total number of grid nodes $n_g$ in this window ($n_c \le n_g$), we use Munkres' algorithm to optimize the assignment of all cells (using their original locations) to all available grid nodes within this window. If this window contains too many cells ($n_c > n_g$), we increase the local neighborhood size to the next symmetric window (5 × 5, 7 × 7, etc.). We continue until a window size with $n_c \le n_g$ is found or until the maximum window size $w_{max} \times w_{max}$ is reached. The entire procedure is outlined as pseudo code in the following. We implemented this method using the Python package munkres, version 1.1.4.
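As a minimal sketch of the window-growing step described above (not the authors' implementation), the following Python function picks the smallest odd window whose cell count fits its grid-node count. The names `window_nodes` and `choose_window`, the `counts` mapping from grid node to number of binned cells, and the clipping of windows at the grid border are assumptions made for illustration. The assignment itself, which the text performs with Munkres' algorithm, is left out.

```python
from collections import Counter


def window_nodes(node, w, shape):
    """Grid nodes of the w x w window centred on `node`, clipped to the grid."""
    r0, c0 = node
    half = w // 2
    return [(r, c)
            for r in range(max(0, r0 - half), min(shape[0], r0 + half + 1))
            for c in range(max(0, c0 - half), min(shape[1], c0 + half + 1))]


def choose_window(node, counts, shape, w_max=7):
    """Smallest odd window (3, 5, ..., w_max) whose cells fit its grid nodes.

    Returns (w, fits). `fits` is False when even the w_max window holds more
    cells than grid nodes, so that some cells cannot be placed.
    Assumes w_max is odd and at least 3.
    """
    for w in range(3, w_max + 1, 2):
        nodes = window_nodes(node, w, shape)
        n_cells = sum(counts[n] for n in nodes)
        if n_cells <= len(nodes):
            return w, True
    return w_max, False


# Made-up counts that only reproduce the totals of Figure S-1:
# 15 cells in the central 3 x 3 block and 30 cells on the 5 x 5 grid.
counts = Counter({(2, 2): 2, (1, 1): 13, (0, 0): 15})
print(choose_window((2, 2), counts, (5, 5), w_max=5))  # (5, False)
```

In this example the 3 × 3 window is rejected (15 cells, 9 nodes) and the 5 × 5 window is used although it still holds more cells than nodes, which corresponds to the situation in Figure S-1 where some cells are deleted.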

Pseudo code for cell assignment and conflict resolution with predefined $w_{max}$
bin all cell coordinates to target grid
conflict_nodes = list of all grid nodes with conflicts
for each grid node g in conflict_nodes do:
    w = 3
    while (cells in w × w window around g) > (grid nodes in window) and w < w_max: w = w + 2
    assign cells in window to grid nodes in window using Munkres' algorithm

While both solutions in Figure S-2 are valid, solution 1 involves a relatively long assignment distance for cell 2. Additionally, this solution swaps the location of the cells along the x-axis, creating an unnecessary local distortion of the cell arrangement. In contrast, the total Euclidean travel distance of solution 2 is higher but split more equally between the two cells.
Calculating the total assignment cost of both solutions using the conventional Euclidean distance (ED) shows that the cost of solution 1 is lower than that of solution 2. However, using the squared Euclidean distance (SED) assigns a higher cost to individual long travel distances and consequently gives solution 1 the higher cost, so that solution 2 is selected.
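The effect of the two cost definitions can be reproduced with a small sketch. The cell coordinates below are hypothetical and are not taken from Figure S-2; they are only chosen so that one pairing contains a single long travel distance. For two cells, the optimal assignment is found here by brute force over all pairings rather than with Munkres' algorithm, which would return the same optimum.

```python
import math
from itertools import permutations


def best_assignment(cells, nodes, squared):
    """Brute-force the cheapest one-to-one assignment of cells to grid nodes."""
    def cost(cell_xy, node_xy):
        dist = math.dist(cell_xy, node_xy)
        return dist ** 2 if squared else dist

    best = min(
        permutations(list(nodes), len(cells)),
        key=lambda perm: sum(cost(cells[c], nodes[n])
                             for c, n in zip(cells, perm)),
    )
    return dict(zip(cells, best))


nodes = {"A": (0.0, 0.0), "B": (1.0, 0.0)}  # grid spacing d = 1
cells = {1: (0.9, 0.1), 2: (1.2, 0.8)}      # hypothetical cell positions

print(best_assignment(cells, nodes, squared=False))  # {1: 'B', 2: 'A'}
print(best_assignment(cells, nodes, squared=True))   # {1: 'A', 2: 'B'}
```

With plain Euclidean costs, cell 2 is sent across the grid to node A (total cost about 1.58 versus 1.73). With squared costs, that single long move dominates (2.10 versus 1.50) and both cells stay in their local neighborhood.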

S.2 Comparison of conflict resolution settings
In this section, we explore alternative settings for the conflict resolution method to justify its design; see Table S-1 and Table S-2. In the following, our final method is named Adaptive Munkres' Algorithm with Squared Euclidean Distances with $w_{max} = 7$, or AMASED7 for short.
[Table S-2, partial rows: ... 0.0 (0.0-0.1), 1.6 ± 1.5; AMASED7 (our method): 0.0 (0.0-0.0), 1.7 ± 3.9]

Conflict resolution method
Figure S-3 shows the cell loss and processing time of each method. Notably, lossyBin is the fastest method, while both hungarian5 and hungarian7 are slower than AMASED5 and AMASED7, as expected. AMASED7 was used as the conflict resolution method for Cell2Grid in all shown experiments as it had the lowest cell loss with acceptable processing time. Figure S-4 shows that the processing time of this method increases exponentially with the number of cells in each image.

S.3 Choosing the target grid spacing 𝑑
As illustrated, the target grid spacing parameter $d$ defines the compression ratio of Cell2Grid. In our experiments, we used a target grid spacing of $d = 5$ µm, which is comparable to the size of a lymphocyte⁹³, but other values are possible. Large values of $d$ (i.e., coarse grids) provide a higher compression ratio but lead to an increased number of assignment conflicts, some of which may lead to the deletion of cells. Smaller values, on the other hand, create increasingly sparse Cell2Grid images and small compression ratios. Ideally, the largest value of $d$ without excessive cell loss should be used. This section provides several boundaries for an adequate choice of $d$. For illustration, this section uses empirical values from the data set introduced in the main text.

Using a desired compression ratio
If a specific spatial compression ratio $R$ of a single image channel is desired, the target grid spacing is simply $d = r\sqrt{R}$, with $r$ being the original image pixel resolution. When the original image consists of $C$ color channels and $F$ cell features are extracted during Cell2Grid, this gets modified to $d = r\sqrt{R\,F/C}$. In our example with $C = F = 6$ and $r = 0.5$ µm, a desired compression ratio of $R = 100$ leads to $d = 5$ µm.
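The arithmetic of this example can be checked with a few lines of Python. The sketch assumes the relation $d = r\sqrt{R\,F/C}$ between grid spacing, pixel resolution, compression ratio, number of features, and number of channels; the function and parameter names are chosen for illustration only.

```python
import math


def grid_spacing(ratio, pixel_res_um, n_channels=1, n_features=1):
    """Target grid spacing d (in µm) for a desired compression ratio R."""
    return pixel_res_um * math.sqrt(ratio * n_features / n_channels)


# Values quoted in the text: R = 100, r = 0.5 µm, six channels, six features
print(grid_spacing(100, 0.5, n_channels=6, n_features=6))  # 5.0
```

Extracting fewer features than there are input channels allows a smaller grid spacing for the same compression ratio, because part of the compression then comes from the channel dimension.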

Using the empirical average cell area
Using the empirical average cell area $\bar{A}$ obtained from cell segmentation, we may assume dense packing of spherical cells with a diameter of $2\sqrt{\bar{A}/\pi}$. We divide by $\sqrt{2}$ to account for cells lined up unfavorably (diagonally) to the grid and obtain the target grid spacing estimate $d = \sqrt{2\bar{A}/\pi}$. Using our empirical data ($\bar{A} = 42$ µm²) yields an upper boundary of $d < 5.17$ µm.
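The quoted boundary follows directly from the average cell area, as this short check illustrates. The function name is hypothetical; the two steps are the diameter of a circle with the given area and the division by $\sqrt{2}$ described above.

```python
import math


def spacing_from_cell_area(mean_area_um2):
    """Upper bound on the grid spacing (µm) from the mean cell area (µm²)."""
    diameter = 2.0 * math.sqrt(mean_area_um2 / math.pi)  # circle of equal area
    return diameter / math.sqrt(2.0)                     # diagonal worst case


print(round(spacing_from_cell_area(42.0), 2))  # 5.17
```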

Using expected inflection point of empty grid nodes and cells in conflict
For any given cell density $\rho$, the inflection point of the expected fraction of cells in conflict $f_c(d)$ (Eq. 3) can be used as an upper boundary for $d$. This point coincides with the inflection point of the expected fraction of empty grid nodes and therefore represents the point of steepest increase in conflicts and steepest decrease of empty grid nodes. At higher values of $d$, the target grid becomes densely populated and cell loss starts to increase simultaneously. For our empirical data, this upper boundary is $d < 6.3$ µm.
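Eq. 3 is not reproduced in this supplement. As a rough illustration only, the sketch below assumes that cells are scattered as a homogeneous Poisson process, so that a grid node of area $d^2$ receives on average $\rho d^2$ cells. This model is an assumption made here and may differ from the expression behind Eq. 3. Under it, the fraction of empty grid nodes is $e^{-\rho d^2}$, the fraction of cells sharing a node is $1 - e^{-\rho d^2}$, and both curves have their inflection point at $d = 1/\sqrt{2\rho}$. The density value is hypothetical and chosen so that the result lands near the 6.3 µm quoted above; it is not taken from the data set.

```python
import math


def frac_empty_nodes(d, rho):
    """Expected fraction of empty grid nodes (Poisson assumption)."""
    return math.exp(-rho * d * d)


def frac_cells_in_conflict(d, rho):
    """Expected fraction of cells sharing their grid node (Poisson assumption)."""
    return 1.0 - math.exp(-rho * d * d)


def inflection_spacing(rho):
    """Grid spacing at which the curvature of both curves changes sign."""
    return 1.0 / math.sqrt(2.0 * rho)


rho = 0.0126  # hypothetical cell density in cells per µm²
print(round(inflection_spacing(rho), 1))  # 6.3
```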

Maximum tolerable cell loss
Using the expression for expected cell loss (Eq. 7), we can define a maximum tolerable cell loss, leading to the largest acceptable target grid spacing $d$. For our data, 1% acceptable cell loss leads to an upper boundary of $d < 6.7$ µm.

S.4 Using additional cell features
Figure 8 in the main text visualizes the mean marker values over entire cells as a false-color image. However, other cell features can be used in Cell2Grid output, including the marker distribution over the cell (minimum, maximum, and standard deviation) as well as the size and shape of the cell and its nucleus. The possible features for Cell2Grid output channels include every cell-based feature calculated during cell segmentation. As an example, Figure S-5 illustrates CD3 marker distribution features and cell shape parameters. As can be seen in Figure S-7, per-image average values for cell shape properties remain stable after applying Cell2Grid, with more than 98% of values falling within the ±1.96 standard deviation interval for all features. Notably, nucleus and cell size average values tend to be shifted towards higher values in Cell2Grid, indicating that cell loss typically occurs in high-density regions that contain smaller cells on average. This tendency to lose small cells results in slightly higher average cell sizes in Cell2Grid data.

S.5 Data augmentation for Cell2Grid images
As outlined in the main text, Cell2Grid images use a one-pixel-equals-one-cell concept. While this simplifies interpretation of the images, it complicates conventional data augmentation methods that involve interpolation of pixels, such as rotations, shearing, zooming, and other deformations. However, some of them can be replaced with discrete versions that only move pixels to new locations without altering their values by interpolation.

Discrete shearing
Shearing an image parallel to the x-axis using an angle $\theta$ can be expressed using a transformation matrix:
$$\begin{pmatrix} 1 & a \\ 0 & 1 \end{pmatrix}$$
with $a = \tan\theta$⁹⁴. This can be discretized by calculating the row-wise displacement $\Delta x = a\,y$ of each pixel. The displacement of each row is rounded to the next integer such that each row is shifted by a pixel-discrete distance. This raster shearing⁹⁵ ensures that no pixel value needs to be interpolated but that pixels of the input image are simply moved to a different location.
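A discrete shear of this kind can be sketched in a few lines of Python. The function below is an illustration under stated assumptions, not the implementation used for the experiments: the image is a list of equal-length rows, the shear is taken about the central row, the output is widened so that no pixel is cut off, and empty positions receive a `fill` value.

```python
import math


def discrete_shear_x(img, theta_deg, fill=0):
    """Shear `img` parallel to the x-axis by shifting whole rows."""
    a = math.tan(math.radians(theta_deg))
    height, width = len(img), len(img[0])
    y_centre = (height - 1) / 2.0
    # One integer shift per row: pixel values are moved, never interpolated
    shifts = [round(a * (y - y_centre)) for y in range(height)]
    lo, hi = min(shifts), max(shifts)
    out = [[fill] * (width + hi - lo) for _ in range(height)]
    for y, row in enumerate(img):
        offset = shifts[y] - lo
        out[y][offset:offset + width] = row
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
for row in discrete_shear_x(image, 45):
    print(row)
# [1, 2, 3, 0, 0]
# [0, 4, 5, 6, 0]
# [0, 0, 7, 8, 9]
```

Every input value appears exactly once in the output, so the one-pixel-equals-one-cell property is preserved.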

Discrete rotation
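Arbitrary rotations by an angle $\theta$ can be expressed as a chain of three independent shearing operations, known as raster rotation⁹⁵:
$$\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} 1 & a \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ b & 1 \end{pmatrix} \begin{pmatrix} 1 & c \\ 0 & 1 \end{pmatrix}$$
with $b = \sin\theta$ for the shearing in y-direction and $a = c = -\tan(\theta/2)$ for the two shearing operations in x-direction. Each individual shearing can be carried out as a discrete shearing.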
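The three-shear rotation can be sketched as follows. This is a minimal illustration rather than the implementation used for the experiments: it assumes a list-of-rows image, rotates about the image centre, and applies the shears to pixel coordinates, using $-\tan(\theta/2)$ for both x-shears and $\sin\theta$ for the y-shear. Because each shear moves whole rows or whole columns by an integer number of pixels, no two pixels can land on the same position.

```python
import math


def raster_rotate(img, theta_deg, fill=0):
    """Rotate `img` (list of equal-length rows) by three whole-pixel shears."""
    theta = math.radians(theta_deg)
    a = -math.tan(theta / 2.0)  # used for both x-shears
    b = math.sin(theta)         # used for the y-shear
    height, width = len(img), len(img[0])
    xc, yc = (width - 1) / 2.0, (height - 1) / 2.0
    placed = {}
    for y in range(height):
        for x in range(width):
            u, v = x - xc, y - yc
            u += round(a * v)  # x-shear: each row moves by an integer
            v += round(b * u)  # y-shear: each column moves by an integer
            u += round(a * v)  # second x-shear
            placed[(u, v)] = img[y][x]
    u_min = min(u for u, _ in placed)
    u_max = max(u for u, _ in placed)
    v_min = min(v for _, v in placed)
    v_max = max(v for _, v in placed)
    out = [[fill] * int(u_max - u_min + 1)
           for _ in range(int(v_max - v_min + 1))]
    for (u, v), value in placed.items():
        out[int(v - v_min)][int(u - u_min)] = value
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(raster_rotate(image, 90))  # [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
```

For angles that are not multiples of 90°, the output is larger than the input and contains `fill` values, but every input pixel is still present exactly once.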

S.6 Additional neural network experiments
This section presents additional results for different model settings for the colon cancer relapse prediction task. We tested different learning rates for both CNN architectures and explored different weight initialization settings of the convolutional layers in VGG; see Figure S-8.

Figure S-1: Illustration of resolving assignment conflicts with $w_{max} = 5$. Left: 25 grid nodes (green circles, letters) and 30 biological cells (blue dots, numbered). The two conflicting cells 20 and 23 at the central grid node M triggered the conflict resolution. Right: Solution to the assignment problem. The local 3 × 3 neighborhood around M contains more cells (15) than grid nodes (9); the problem is therefore solved in the 5 × 5 neighborhood instead. Black lines indicate final cell assignment, red crosses indicate deleted cells.
Using conventional Euclidean distances, Munkres' algorithm minimizes the sum of individual travel distances of all cells, occasionally creating solutions that include unnecessarily large distances for one or more cells. Using the squared Euclidean distance makes long individual travel distances increasingly expensive, effectively producing solutions in which each cell gets assigned to a grid node in its local neighborhood. This is illustrated in Figure S-2, showing an example of two free grid nodes A and B and two cells 1 and 2 (for simplicity we assume a grid spacing $d = 1$ and that no other grid nodes are available). This assignment problem has only two solutions, one for each way of pairing cells 1 and 2 with grid nodes A and B (solution 1 and solution 2 in Figure S-2).

Figure S-2: Two possible solutions to an assignment problem with two available grid nodes (A, B, green dots) and two cells to be assigned to them (1, 2, black dots). Solutions are visualized with blue and red lines.

Figure S-3: Cell loss and processing time for different conflict resolution methods using a target grid of 5 µm. Zero cell loss is represented by $10^{-3}$ on the log scale. We chose AMASED7 as the default method for conflict resolution in Cell2Grid for all experiments shown in the main text.

Figure S-4: Cell2Grid processing time for each individual image (blue dots) depending on the number of cells in the image using AMASED7 for conflict resolution.

Figure S-5: Additional, non-default cell features for the same image as shown in Figure 8 of the main text. Top two rows show CD3 marker distribution features (minimum, mean, maximum, standard deviation), bottom two rows show cell shape features (area in µm² and axis ratio of nucleus and entire cell, respectively).
In addition to the results shown in the main text, we investigate below how the features of Figure S-5 change for our entire image data set after applying Cell2Grid (see Figure S-6 and Figure S-7). Since these features are only dependent on the presence of cells and not their position, they are only influenced by potential cell loss during conflict resolution.

Figure S-6: Regression (left column) and Bland-Altman plots (right column) for CD3 marker distribution features of cells shown as average values per image. From top to bottom: minimum, mean, maximum, and standard deviation.

Figure S-7: Regression (left column) and Bland-Altman plots (right column) for cell shape distributions shown as average (AVG) values per image. Top to bottom: nucleus area (µm²), nucleus axis ratio, entire cell area (µm²), entire cell axis ratio.

Initialization of all convolutional layers except for the first one (Figure S-8) was either random or with weights pre-trained on the ImageNet data set⁶,⁶⁵,⁶⁶. Pre-trained layer weights were either kept fixed or were trained alongside all other network weights. Experiments with learning rate e-5 were conducted but are not shown here due to low prediction accuracy. Results are shown in Figure S-9.

Figure S-8: Experiment setup for additional model variations.

Figure S-9: Validation and test set error rates for all trained models (lower is better); 10 repeated runs are shown by a single boxplot. Gray boxes indicate model settings with best validation set performance in their respective group and are shown in the main text. Abbreviations: bil (bilinear rescaling); c2g (Cell2Grid); new (network weights initialized randomly); pre (pretrained network weights); trn (pretrained weights are trainable); fix (pretrained weights kept fixed during training); e-3 and e-4 indicate learning rate.
Munkres' algorithm solves assignment problems defined by a cost matrix $C$. We define its elements $C_{ij}$ using the squared Euclidean distance between cell $i$ and grid node $j$. Here, we provide an example that illustrates how this choice minimizes local distortions compared to conventional Euclidean distances.

Table S-1: Tested conflict resolution settings.
Table S-2: Comparison of conflict resolution settings. Color codes indicate poor (red), average (white), good (yellow) and very good (green) performance.