Deformable multi-modal image registration for the correlation between optical measurements and histology images

Abstract. Significance: The accurate correlation between optical measurements and pathology relies on precise image registration, which is often hindered by deformations in histology images. We investigate an automated multi-modal image registration method using deep learning to align breast specimen images with corresponding histology images. Aim: We aim to explore the effectiveness of an automated image registration technique based on deep learning principles for aligning breast specimen images with histology images acquired through different modalities, addressing the challenges posed by intensity variations and structural differences. Approach: Unsupervised and supervised learning approaches, employing the VoxelMorph model, were examined using a dataset featuring manually registered images as ground truth. Results: Evaluation metrics, including Dice scores and mutual information, demonstrate that the unsupervised model significantly outperforms both the supervised and manual approaches, achieving superior image alignment. The findings highlight the efficacy of automated registration in enhancing the validation of optical technologies by reducing human errors associated with manual registration processes. Conclusions: This automated registration technique offers promising potential to enhance the validation of optical technologies by minimizing human-induced errors and inconsistencies associated with manual image registration processes, thereby improving the accuracy of correlating optical measurements with pathology labels.


Introduction
Optical technologies have revolutionized the field of oncologic surgery in recent years by providing non-invasive and innovative ways to assess resection margins during surgical procedures. By providing real-time visualization of tissue characteristics, optical technologies such as diffuse reflectance spectroscopy (DRS) [1,2], fluorescence lifetime imaging (FLIm) [3], and hyperspectral imaging [4,5] can help to assess whether all cancerous tissue has been removed while minimizing damage to surrounding healthy tissue. This lowers the number of positive resection margins and thereby reduces the need for additional treatments such as surgical re-excision or radiotherapy. However, the accuracy and reliability of these technologies are crucial to ensure safety and efficacy during surgical procedures. Validation of optical technologies is therefore essential to establish their clinical significance and to give healthcare professionals the confidence to use them in practice. This involves assessing the clinical outcomes of these technologies against the current ground truth [6].
Ground truth validation of optical tissue measurements refers to the process of comparing the acquired measurements with the gold standard: histopathological analysis of tissue samples. This is provided by hematoxylin and eosin (H&E) stained tissue sections, from which the measured tissue can be characterized microscopically [7]. To ensure an accurate correlation between the measurement locations and their corresponding H&E sections, it is essential to track the locations so they can be located back microscopically [other paper]. Subsequently, a registration between a specimen snapshot image (with tracked measurement locations) and the corresponding H&E sections (microscopic histology images) is necessary, enabling validation of the measured tissue types against the ground truth. This process is crucial for establishing the reliability and reproducibility of optical techniques used for tissue diagnosis, and especially for the accurate development of tissue classification algorithms.
However, the registration of a specimen snapshot image with the corresponding histology image encounters several challenges. The histopathological processing of tissue specimens, which involves steps such as fixation, dehydration, clearing, embedding, and cutting, causes tissue deformation in H&E sections. This deformation may include shrinkage, stretching, compression, tearing, and even loss of tissue [8]. Other factors also influence deformation. For example, breast tissue, due to the presence of fat, is more prone to shrinkage or compression during processing than muscle tissue. Similarly, the size and thickness of tissue sections can affect the degree of tissue deformation, with thicker sections more susceptible to distortion. In addition, over-staining or prolonged staining influences the amount of deformation, while under-staining may lead to poor visualization of tissue structures [9]. When validating optical measurements, it is crucial to take these tissue deformations into account [10], especially when using labeled optical data for the development of machine learning models, where incorrectly labeled data will degrade the performance of tissue classification and ultimately impact clinical outcomes.
However, labeling and validating optical measurements with histopathology is a subject that has received limited attention in the existing research literature [11,12,13]. In some studies, tissue deformation is not even taken into consideration [14,15]. In the method proposed by de Boer et al., a manual point-based deformable registration between specimen snapshot and H&E sections was performed by looking for identical landmarks in both images [16]. The proposed method addressed the need for a deformable registration method when labeling and validating optical measurements, since it demonstrated a higher accuracy compared to a method that neglected such deformations. However, a manual point-based registration lacks objectivity, since the identification of corresponding landmarks can vary between different users. This can lead to inconsistent results and reduce the reliability and accuracy of the registration. Moreover, this labor-intensive process can be time-consuming, particularly when dealing with large datasets. The approach is also not suitable for images with a limited number of paired distinguishable landmarks, which is often the case when registering multi-modal images.
In general, multi-modal image registration is a complex task that encounters various difficulties. One major challenge arises from the differences in intensity and contrast between images acquired with different modalities. These variations make it challenging to establish accurate paired landmarks between images. Another obstacle is the structural dissimilarity between modalities, resulting in differences in the shape, size, and appearance of visible corresponding structures. Nonlinear deformations and limited overlapping information further complicate the registration process [17]. When registering specimen snapshots with histology images, microscopic artifacts such as tears, holes, and loss of tissue introduce additional complexities. Overcoming these challenges requires the development of advanced algorithms capable of handling variations in intensity, contrast, and shape, as well as deformations.
Automating the registration process, using advanced algorithms and computational techniques, shows potential to address current limitations, enhance overall registration efficiency, and thereby improve the accuracy of validating optical technologies [18]. The purpose of this study is therefore to develop a multi-modal image registration model that can also compensate for tissue deformations automatically. The proposed approach is based on the VoxelMorph model, which has been adapted to the needs of a multi-modal 2D registration between specimen snapshot images and microscopic histology images. With this deformable multi-modal image registration model, we aim to achieve a faster and more precise method for labeling optical measurements with a ground truth, which can lead to a more accurate development of tissue classification algorithms, enhancing their practical use in clinical settings.

Materials
The dataset used in this study consists of 113 breast tissue slices, each of which comprises three distinct images: a snapshot image of the breast tissue slice captured with a camera, a corresponding microscopic hematoxylin and eosin (H&E) histology image, and a manually registered histology image. Example images of one tissue slice are shown in Figure 1. The manual registration was performed by manually selecting approximately 60 paired control points, followed by a deformable registration using a nonrigid local weighted mean transformation, as described in de Boer et al. [16]. Despite the possibility of some misalignment and registration errors, we regard the manually registered histology image as the ground truth image in this study.

Method
In this paper, a multi-modal image registration technique was developed that is capable of automatically addressing tissue deformations, leading to a precise alignment of a snapshot image of a breast tissue slice and the corresponding histology image. The proposed multi-modal image registration methods are based on the VoxelMorph medical image registration framework, which is explained in the following subsection. In this study, we intended to expand the application of VoxelMorph, as the input images were acquired from different modalities. The development of this multi-modal deformable image registration involves a series of steps, beginning with dataset preparation. This is followed by two different deep-learning approaches for multi-modal image registration, using unsupervised and supervised learning models, which are evaluated separately.

VoxelMorph implementation
The VoxelMorph framework uses an unsupervised deep-learning model for deformable medical image registration.
The model was initially designed to work with 3D medical image volumes, such as MRI or CT scans, and can register two volumes of different shapes and sizes without requiring explicit ground truth registration fields or anatomical landmarks [19]. The architecture of VoxelMorph is based on a deep convolutional neural network, g_θ(F, M), similar to the UNet model [20]. The network takes a moving image M and a fixed image F as input and computes a dense displacement field φ based on a set of learnable parameters θ. The network uses this set of parameters to compute the kernels of the convolutional layers, and employs a spatial transformation function to evaluate the similarity between the predicted image M(φ) and the fixed image F. This allows the model to refine its estimation of the optimal spatial transformation and update its parameters [21]. The generated dense displacement field (DDF) represents the displacement of each pixel in the moving image relative to the corresponding pixel in the fixed image. This dense map of vectors, with the same dimensions as the moving image, describes the spatial transformation required to align M with F, resulting in the predicted image M(φ).
The network is trained on an image dataset by minimizing the loss function L in each epoch, as described in Equation 1:

L(F, M, φ) = L_sim(F, M(φ)) + λ L_smooth(φ)    (1)

The loss function L consists of two components: L_sim penalizes the difference between the fixed image F and the warped moving image M(φ), and L_smooth is a regularization on the dense displacement field φ. The regularization parameter λ defines the weighting of the two components. The VoxelMorph network is compatible with any differentiable loss function L [22].
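As an illustrative sketch (not the authors' implementation), the combination of similarity and smoothness terms in Equation 1 might look as follows for a 2D displacement field, with the similarity loss passed in as a function:

```python
import numpy as np

def smoothness_loss(ddf):
    """L_smooth: penalize spatial gradients of the dense displacement field.

    ddf: array of shape (H, W, 2) holding per-pixel (dx, dy) displacements.
    """
    dy = np.diff(ddf, axis=0)  # finite differences along rows
    dx = np.diff(ddf, axis=1)  # finite differences along columns
    return np.mean(dy ** 2) + np.mean(dx ** 2)

def total_loss(fixed, warped, ddf, sim_loss, lam=0.01):
    """Equation 1: L = L_sim(F, M(phi)) + lambda * L_smooth(phi).

    lam is the regularization weight; its value here is illustrative.
    """
    return sim_loss(fixed, warped) + lam * smoothness_loss(ddf)
```

A spatially constant displacement field has zero smoothness penalty, so only the similarity term contributes; in training, both terms are computed with differentiable framework ops rather than NumPy.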

Data preparation
Data augmentation was used to increase the number of images in the training set, as well as the variation in deformations, which could improve the learning process of the network. Synthetic deformed images were generated from the existing dataset to simulate additional deformation variations that occur during the pathology process. The augmented images were generated using randomly created dense displacement fields (DDFs), in which a number between -1 and +1 was generated for every pixel in both the x- and y-directions, resulting in displacement fields ∆x and ∆y. The ∆x and ∆y displacements were then convolved with a Gaussian filter with defined filter size F and standard deviation σ, where σ serves as the elasticity coefficient. A scaling factor α was then applied to the DDF to control the intensity of the deformation [23]. The deformation variables (σ, α, F) were chosen randomly within a specified range based on the chosen level of deformation intensity, resulting in a total of 565 deformed specimen snapshots and histology images. Figure 2 illustrates some examples of artificial deformations for different deformation intensity levels.
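The elastic-deformation augmentation described above can be sketched roughly as follows; the SciPy-based implementation and the default parameter values are illustrative assumptions, not the authors' exact code:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=30.0, sigma=4.0, seed=None):
    """Apply a random elastic deformation to a 2D grayscale image.

    A random displacement in [-1, 1] is drawn per pixel for x and y,
    smoothed with a Gaussian filter (sigma = elasticity coefficient),
    and scaled by alpha (deformation intensity).
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    # Sample the image at the displaced coordinates (bilinear interpolation).
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([y + dy, x + dx])
    return map_coordinates(image, coords, order=1, mode="reflect")
```

Drawing (σ, α) randomly per image, as in the paper, yields a range of deformation intensities from one source image.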
Since the input of the VoxelMorph network consists of a 2-channel image representing the fixed (F) and moving (M) images, both the RGB specimen snapshot and histology images had to be converted to one-channel grayscale images. For the histology images, a weighted average of the color channels (red, green, blue) determined the final grayscale representation, allowing for selective emphasis on certain colors and structures. To ensure a similar intensity level between both images, the specimen snapshot images were converted to grayscale using saturation values only (Figure 3). This conversion method enhances the visual correspondence between connective and tumor tissue and is hypothesized to improve the performance of the model. Finally, the computational effort and training time were reduced by resizing the histology and snapshot input images to 256×192 pixels.
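A minimal sketch of the two conversions, assuming RGB values in [0, 1]; the channel weights shown are the common luminance weights and stand in for whatever weighting the authors chose:

```python
import numpy as np

def weighted_grayscale(rgb, weights=(0.299, 0.587, 0.114)):
    """Weighted channel average; the weights can be tuned to emphasize
    particular colors/structures in the histology image."""
    return (rgb[..., 0] * weights[0]
            + rgb[..., 1] * weights[1]
            + rgb[..., 2] * weights[2])

def saturation_channel(rgb):
    """Saturation as in the HSV model: (max - min) / max per pixel,
    used here to convert the specimen snapshot to one channel."""
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    return np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-8), 0.0)
```

A strongly colored pixel (e.g. pure red) maps to saturation 1, while gray pixels map to 0, which is what makes stained versus unstained structures easier to correspond across modalities.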

Unsupervised learning model
In the unsupervised learning approach (Figure 4), the input to the model comprises pairs of synthetic deformed histology images (F), which imitate the deformations occurring during the pathology process, together with the specimen snapshot images (M). The trained network is similar to the original VoxelMorph model (as explained in Section 2.2.1) and entails training g_θ(F, M) using the input images F and M to compute the optimal learnable parameters θ. M is then transformed using the estimated DDF (φ) in combination with a spatial transformer function, resulting in the predicted image M(φ).
In this study, the input images were acquired through different modalities. Considering the variations in intensity and structure visibility in both images, it cannot be assumed that the relationship between intensities in these two images is linear. Therefore, mutual information is used as the loss function (L_sim) to quantify the statistical dependence between the two images based on their joint distribution. Mutual information (MI) measures the amount of information shared between the two images. In the context of the developed model, the goal is to find a deformation field that maximizes the MI between the two input images. To compute the MI, a histogram-based mutual information (HMI) was used, which computes the probability distribution of the intensity values of the two input images and estimates the joint probability distribution between their histograms [24]. Specifically, HMI is defined as:

HMI(F, M) = Σ_i Σ_j p(i, j) log( p(i, j) / (p(i) p(j)) )    (2)

where p(i, j) is the joint probability of the intensity values i and j in images F and M, and p(i) and p(j) are the marginal probabilities of intensity values i and j in images F and M, respectively. By replacing L_sim in Equation 1 with HMI (Equation 2), the predicted image M(φ) was optimized by maximizing the mutual information between F and M.
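A histogram-based MI estimate of this form might be sketched as follows; the bin count is an illustrative assumption, and note that for training the loss must be differentiable, which this NumPy version is not:

```python
import numpy as np

def histogram_mutual_information(f, m, bins=32):
    """Histogram-based mutual information between two grayscale images.

    Estimates the joint distribution p(i, j) from a 2D intensity histogram
    and applies Equation 2: sum over p(i,j) * log(p(i,j) / (p(i) p(j))).
    """
    joint, _, _ = np.histogram2d(f.ravel(), m.ravel(), bins=bins)
    p_ij = joint / joint.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)  # marginal over fixed image
    p_j = p_ij.sum(axis=0, keepdims=True)  # marginal over moving image
    nz = p_ij > 0                          # avoid log(0)
    return float(np.sum(p_ij[nz] * np.log(p_ij[nz] / (p_i @ p_j)[nz])))
```

As a sanity check, an image shares more information with itself than with unrelated noise, and the estimate is always non-negative.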

Supervised learning model
The VoxelMorph model was originally designed for unsupervised image registration, allowing it to learn without ground truth labels. In this study, however, our dataset contains manually registered histology images, which can be utilized to train the model in a supervised manner. To achieve this, the moving images (M) consist of artificially deformed specimen snapshot images, the fixed images (F) consist of the manually registered histology images, and the ground truth labels (γ) are the original specimen snapshot images. The VoxelMorph network g_θ(F, M) is trained via the loss function L to transform M to F using the predicted DDF (φ) in combination with the spatial transformer function. The modified model, with an example of our data, is illustrated in Figure 5.
In the supervised approach, the loss function was calculated by comparing the predicted registered image M(φ) and the ground truth label image γ, which have the same modality, similar image intensity distributions, and similar local contrast. Therefore, the mean squared error (MSE) is used as L_sim in Equation 1, as described in Equation 3 [22]:

MSE(γ, M(φ)) = (1/n) Σ_{i=1}^{n} (γ_i − M(φ)_i)²    (3)

where n is the total number of samples, γ_i the ground truth image for the i-th sample (original specimen snapshot image), and M(φ)_i the predicted registered image for the i-th sample. The network outputs a dense displacement field (DDF, φ) that defines the mapping from moving image coordinates to the fixed image and is used to register M with F, resulting in the predicted image M(φ).

Training
We utilized Python (version 3.10.4) along with the TensorFlow [25] and Keras [26] libraries for data manipulation and analysis. The augmented dataset was split into three subsets: 360 paired deformed specimen snapshot images were allocated to the training set, 90 paired images to the validation set, and 115 paired images to the test set. To train the network, the Adam [79] optimizer with a learning rate of 0.001 was used. The configuration involved setting the number of epochs to 200, with 100 steps per epoch and a batch size of 16.
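The 360/90/115 split of the 565 augmented pairs could be sketched as follows; the shuffling strategy and seed are illustrative assumptions, since the paper does not state how the subsets were drawn:

```python
import numpy as np

def split_dataset(n_pairs=565, n_train=360, n_val=90, n_test=115, seed=42):
    """Shuffle paired-image indices and split into train/val/test subsets.

    Sizes follow the paper's split of the 565 augmented pairs; returns
    three disjoint index arrays covering all pairs.
    """
    assert n_train + n_val + n_test == n_pairs
    idx = np.random.default_rng(seed).permutation(n_pairs)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])
```

In practice, one would want to split by original tissue slice rather than by augmented pair, so that deformed copies of the same slice do not leak between the training and test sets.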

Evaluation matrices
In order to assess the performance of the automatic deformable registration models described in Sections 2.2.3 and 2.2.4, various evaluation metrics were employed. This evaluation was carried out for all images in the test set, both before and after applying the registration models. The Dice score was used to measure the degree of overlap between two binary images (A and B) by comparing twice the number of common pixels with the total number of pixels in both images, as described in Equation 4:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)    (4)

This metric is especially suited to evaluating the overlap of the boundaries of the images.
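A minimal sketch of the Dice score for binary masks:

```python
import numpy as np

def dice_score(a, b):
    """Dice coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|).

    Returns 1.0 for two empty masks by convention.
    """
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
```

Identical masks score 1, disjoint masks score 0, and partial overlap falls in between.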
The histogram-based mutual information (HMI) was used to measure the similarity between images by comparing their histograms (Equation 2). The HMI between two images is the amount of information shared between their histograms. Specifically, it measures how much the joint histogram of the two images deviates from the product of their individual histograms. As a result, the optimal alignment of the two images can be determined.
For both the obtained Dice and HMI metrics, statistical analysis was performed between the unregistered and registered results using IBM SPSS Statistics v27 (SPSS Inc., United States). Statistical analysis for non-normally distributed data was performed using a Mann-Whitney test, where a p-value ≤0.05 was considered statistically significant.
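Using SciPy instead of SPSS, the same comparison might be sketched as follows (the function name and return shape are illustrative):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_registration_metrics(unregistered, registered, alpha=0.05):
    """Two-sided Mann-Whitney U test between two metric distributions
    (e.g. Dice or HMI values before vs. after registration).

    Returns the U statistic, the p-value, and whether p <= alpha.
    """
    stat, p = mannwhitneyu(unregistered, registered,
                           alternative="two-sided")
    return stat, p, p <= alpha
```

The Mann-Whitney test is rank-based, so it needs no normality assumption, which matches the non-normally distributed metric values reported here.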

Evaluation unsupervised and supervised models
Figure 6 visualizes the Dice score and mutual information for the results of the unsupervised and supervised approaches compared to the manual registration.
The violin plots in Figure 6 show the distribution of evaluation metrics for the same pairs of specimen images from the test set. For the unsupervised and supervised approaches, metric values were calculated between the fixed image (F) and the predicted image M(φ). For the manual registration, metric values between the manually registered histology image (F) and the specimen snapshot image (label γ) are reported. The width of these plots reflects the relative frequency with which each value occurs: the plot becomes wider where values occur more frequently.
Figures 7 and 8 display multiple registration examples from the test set for the unsupervised and supervised approaches applied to the same paired specimen images.

Discussion
By achieving a precise registration between optically measured tissue and histopathology, the development of tissue classification algorithms can be optimized, thereby improving the effectiveness of optical technologies in clinical practice. However, registration difficulties arise when dealing with deformed multi-modal images, such as histology and tissue specimen images. Utilizing sophisticated algorithms and computational methods to optimize deformable registration processes holds promise for overcoming current inaccuracies in the validation of optical technologies. In this paper, we explored unsupervised and supervised implementations based on the VoxelMorph model to achieve a deformable registration between 2D multi-modal images. We used a previously acquired in-house dataset of manually registered breast specimen images to train the models.
The efficacy of the developed models was assessed by computing both Dice scores and mutual information, measuring the overlap between F and M(φ), for all 115 registered images in the test set (Figure 6). The unsupervised method significantly outperformed the other approaches. Specifically, as indicated by the Dice score, a more accurate overlap between the general shapes of the input images was achieved. Mutual information (MI) functioned as a metric for assessing the similarity between distinct image modalities. As illustrated by the violin plot, the unsupervised results demonstrated a prominently increased distribution within the 0.6–1.0 range, indicating an improved alignment of internal structures compared to the other approaches. Unlike mono-modal image registration, where the ground truth transformation is available, multi-modal images do not have a direct one-to-one correspondence due to differences in imaging modalities. This makes it difficult to define an objective reference for evaluating registration accuracy.
The dataset used in this study is unique in that it contains ground truth manually registered histology images, which are not commonly available in similar datasets. This makes the dataset particularly well-suited for developing and testing multi-modal image registration algorithms and other image analysis techniques. However, training a model in a supervised manner, with labels derived from manually registered ground truth images, showed only a slight improvement over the manual registration approach. This can be explained by the fact that the model was trained on images that possibly contain small manual registration errors. Besides the use of labels, the main difference between the training of the supervised and unsupervised models is the loss function (L). For the purpose of this study, the mutual information-based loss function demonstrated superior performance compared to the mean squared error (MSE).
The adoption of algorithms for automating deformable registration processes represents a paradigm shift in image registration, offering distinct advantages over manual methods [13,12,16]. Our results demonstrate that the unsupervised algorithm achieved significantly superior performance compared with the ground truth manual point-based registration, which emphasizes the applicability of computational techniques for multi-modal image registration. In comparison, manual registration methods, and the corresponding pre-processing steps, can be prone to human errors and inconsistencies, making the automated approach a significantly more reliable option. Recently, there has been growing acknowledgment among studies of the essential requirement to account for tissue deformations when correlating optical measurements with a ground truth pathology label [27,14]. Multi-modal registration is often complicated by the lack of corresponding landmarks between images. The use of fiducial markers has therefore been investigated, but it involves invasive procedures, such as the placement of burn marks on the tissue surface, that could potentially damage delicate tissue structures [10]. Moreover, manual tasks are characterized by their labor-intensive nature, demanding considerable time investment to ensure accurate alignment. In contrast, the inherent efficiency of automatic registration accelerates the alignment process, minimizing the potential for discrepancies and enhancing the overall quality of results.
While advancements in registration algorithms have significantly improved the accuracy and robustness of image alignment, the selection of appropriate evaluation metrics remains a challenging and nuanced task. The complexities inherent to multi-modal registration pose a range of difficulties in identifying evaluation metrics that accurately assess the quality of registration outcomes. Multi-modal registration often involves non-linear transformations to account for differences in anatomical structures and intensities across modalities. Conventional metrics such as mean squared error or mutual information, which are effective for linear transformations, may inadequately capture the intricate deformations and intensity variations inherent to multi-modal registration. Our findings indicate a precise registration that effectively compensates for deformation, even for the visible internal structures. However, this achievement is associated with relatively low mutual information (MI) values, explicable by the inherent variations in contrast among multi-modal images and the employed preprocessing procedures. The challenge lies in devising metrics that can appropriately quantify alignment accuracy across diverse spatial and intensity changes. Additional metrics, such as the target registration error, could be considered to provide a more conclusive assessment of the model's performance. The complex task of registering microscopic histology images with their corresponding tissue slices in RGB encounters challenges arising from the fundamental differences between these imaging modalities. Microscopic histology images, revealing details at the cellular level, are typically acquired through staining and specialized imaging techniques. In contrast, RGB images offer a macroscopic perspective of tissue slices under conventional optics, capturing color information at a larger scale. The presence of tears and holes disrupts the natural continuity of cellular structures in histology images, introducing gaps and inconsistencies that challenge the registration process. Registration techniques therefore struggle to establish trustworthy correspondences between regions that are distorted by these artifacts. Tears introduce non-local deformations, while holes disrupt the continuity of anatomical features, making it challenging for algorithms to accurately match corresponding areas in the RGB images. It is therefore essential to acknowledge that suboptimal performance of the developed model can, at times, be influenced by the degree of deformation and the presence of artifacts in the histology images.
The presented approach has the potential to optimize registration efficiency, for breast tissue specifically, ultimately leading to an enhancement in the precision of correlating optical measurements with a correct pathology label used for the development of tissue classification algorithms. Further research should also focus on exploring the suitability of the developed model for deformation problems in histology images across different tissue types.

Figure 1 :
Figure 1: Dataset example: (a) microscopic histology image, (b) snapshot image of the corresponding breast tissue slice captured with a camera, and (c) the manually registered histology image.

Figure 2 :
Figure 2: Examples of synthetic deformation applied to histology (a) and specimen snapshot images (b). In this study, artificially deformed histology images were used for training the unsupervised model, whereas artificially deformed specimen snapshot images were used to train the supervised model.

Figure 3 :
Figure 3: Example of the preprocessing of the input images: (a) original RGB histology image, (b) grayscale-converted histology image, (c) original RGB specimen snapshot image, (d) grayscale-converted specimen snapshot image, and (e) converted specimen snapshot image using saturation values only.

Figure 4 :
Figure 4: Unsupervised learning model: the specimen snapshot image (M) and the artificially deformed histology image (F) are used as input images for the unsupervised deep convolutional neural network g_θ(F, M). Mutual information is used as the loss function (L). The network outputs a dense displacement field (DDF, φ) that defines the mapping from moving image coordinates to the fixed image and is used to register M with F, resulting in the predicted image M(φ).

Figure 5 :
Figure 5: Supervised learning model: the artificially deformed specimen snapshot image (M) and the manually registered histology image (F) are used as input images for the supervised deep convolutional neural network g_θ(F, M). The original specimen snapshot images are used as ground truth labels (γ). Mean squared error is used as the loss function (L). The network outputs a dense displacement field (DDF, φ) that defines the mapping from moving image coordinates to the fixed image and is used to register M with F, resulting in the predicted image M(φ).

Figure 6 :
Figure 6: Evaluation of the automatic deformable image registration methods for the unsupervised, supervised, and manual approaches, where (a) Dice scores and (b) mutual information values are displayed for 115 specimen pairs after registration. The solid line represents the median, whereas the dashed lines represent the interquartile range (IQR).

Figure 7 :
Figure 7: Results of the unsupervised model: (a) specimen snapshot image (M), (b) artificially deformed histology image (F), (c) unregistered images: overlap between M and F, (d) predicted image M(φ), (e) registered images: overlap between F and M(φ). Dice and mutual information values are shown for the unregistered and registered examples.

Figure 8 :
Figure 8: Results of the supervised model: (a) artificially deformed specimen snapshot image (M), (b) manually registered histology image (F), (c) unregistered images: overlap between M and F, (d) predicted image M(φ), (e) registered images: overlap between F and M(φ). Dice and mutual information values are shown for the unregistered and registered examples.