Deep conditional generative model for longitudinal single-slice abdominal computed tomography harmonization

Abstract. Purpose: Two-dimensional single-slice abdominal computed tomography (CT) provides a detailed, high-resolution tissue map that allows quantitative characterization of relationships between health conditions and aging. However, longitudinal analysis of body composition changes using these scans is difficult due to positional variation between slices acquired in different years, which leads to different organs/tissues being captured. Approach: To address this issue, we propose C-SliceGen, which takes an arbitrary axial slice in the abdominal region as a condition and generates a pre-defined vertebral level slice by estimating structural changes in the latent space. Results: Our experiments on 2608 volumetric CT scans from two in-house datasets and 50 subjects from the 2015 Multi-Atlas Abdomen Labeling Challenge Beyond the Cranial Vault (BTCV) dataset demonstrate that our model can generate high-quality images that are realistic and structurally similar to the target slices. We further evaluate our method's capability to harmonize longitudinal positional variation on 1033 subjects from the Baltimore Longitudinal Study of Aging (BLSA) dataset, which contains longitudinal single abdominal slices, and confirm that our method can harmonize the slice positional variance in terms of visceral fat area. Conclusion: This approach provides a promising direction for mapping slices from different vertebral levels to a target slice and reducing positional variance for single-slice longitudinal analysis. The source code is available at: https://github.com/MASILab/C-SliceGen.


Introduction
Body composition analysis, which quantifies the proportions of fat, muscle, and bone in the human body, is an important tool for assessing an individual's health condition. 1 Studying how body composition changes with aging enables better prognosis and earlier detection of various diseases, such as heart disease, 2 sarcopenia, 3 and diabetes. 4 Computed tomography (CT) is a widely employed technique for assessing body composition. 5,9-12 To minimize radiation exposure in longitudinal imaging and the potential risk associated with contrast administration, two-dimensional (2D) non-contrast axial single-slice CT is acquired instead of the three-dimensional (3D) volumetric CT commonly acquired in clinical practice. However, it is difficult to locate the same cross-sectional location in longitudinal imaging, and thus there is substantial variation in the organs and tissues captured in different years, as shown in Fig. 1. The organs and tissues scanned in 2D abdominal slices strongly correlate with body composition measures. Therefore, increased positional variance makes it challenging to analyze body composition accurately. Despite this issue, no method has been proposed to address the problem of positional variance in 2D slices.
Our goal is to decrease the effects of positional variance in body composition analysis to facilitate more precise longitudinal interpretation. A major challenge is that the distance between the scans taken in different years is unknown, as the slice can be taken anywhere in the abdominal region. Image registration is commonly used in other contexts to correct pose or positioning errors. However, this approach is not suitable for addressing out-of-plane motion in 2D acquisitions, where the tissues/organs that appear in one scan may not appear in the other scan. Based on Ref. 13, image harmonization methods fall into two main groups: deep learning methods and statistical methods. Notable statistical methods include ComBat 14 and its variants, 15-17 ConvBat, 18 and Bayesian factor regression. 19 However, statistical methods lack the generative capability that is crucial for our scenario.
1][22][23][24][25][26] The fundamental concept of generative modeling is to train a model to learn a distribution so that the generated samples $\tilde{x} \sim p_\theta(\tilde{x})$ come from the same distribution as the training data $x \sim p_d(x)$. 27 By learning the joint distribution between the input and target slices, these models can address the limitations of registration. Variational autoencoders (VAEs), 28 a type of generative model, consist of an encoder and a decoder. The encoder encodes inputs to an interpretable latent distribution, and the decoder decodes samples from the latent distribution into new data. Generative adversarial networks (GANs) 20 are another type of generative model and contain two sub-models: a generator that generates new data and a discriminator that distinguishes between real and generated images. By playing this two-player min-max game, GANs can generate realistic images. VAEGAN 29 incorporates a GAN into the VAE framework: by using the discriminator to distinguish between real and generated images, VAEGAN can generate more realistic, higher-quality images than traditional VAE models. However, the original VAEs and GANs offer no control over the generated images. Conditional GAN (cGAN) 30 and conditional VAE (cVAE) 31 address this issue by generating images given a condition, which provides more control over the outputs. However, most of these conditional methods require specific target information, such as a target class, semantic map, or heatmap, 32 as a condition during the testing phase, which is not feasible in our scenario because no direct target information is available. To provide a condition during testing, we have the network generate a slice at a pre-determined vertebral level, which serves as the generation target. By defining the target slice, the generative model implicitly learns the organ/tissue composition of the target slice, so that this condition is learned during training. We hypothesize that, given an arbitrary abdominal slice, the model will generate the slice at the target vertebral level while preserving subject-specific information derived from the conditional image, such as body habitus.
Inspired by Refs. 32-34, we introduce the conditional SliceGen (C-SliceGen) model based on VAEGAN, which generates subject-specific target vertebral level slices from an arbitrary abdominal slice input. We use 3D volumetric data to train and validate our model because in 3D data the target slice [ground truth (GT)] is available for direct comparison with the generated images. The training datasets include an in-house portal venous phase CT dataset and an in-house non-contrast phase CT dataset with 1120 and 1488 subjects, respectively. We further evaluate on the 2015 Multi-Atlas Abdomen Labeling Challenge Beyond the Cranial Vault (BTCV) dataset 35 for external validation. Structural similarity index (SSIM), 36 peak signal-to-noise ratio (PSNR), 37 learned perceptual image patch similarity (LPIPS), 38 and normalized mutual information (NML) 39 are used for image quality assessment. We further apply our trained model to the Baltimore Longitudinal Study of Aging (BLSA) dataset, comprising 1033 subjects, to illustrate our model's capability to reduce longitudinal variance caused by positional variation. We do so by comparing changes in body composition metrics before and after harmonization.
This paper is an extension of our conference version. 40 We focus on improving the generalizability of our model and validate our model's harmonization capability on longitudinal single-slice data. The differences can be summarized as follows:
• We revisit the target slice selection method and propose a semi-BPR method that improves the structural similarity of the selected target slice.
• We collect an in-house non-contrast phase CT dataset and validate model performance on different contrast phases to minimize the domain shift problem mentioned in the limitations section of the conference version.
• More metrics are introduced to evaluate our model and generated images on both the 3D datasets and the 2D single-slice dataset.
• We conduct a comprehensive longitudinal evaluation on the BLSA single-slice dataset with 1033 subjects, which is a significant increase compared to the 20 subjects evaluated in the conference version.
• We conduct an ablation study to determine the most effective distance range between the given and target slices for our proposed method.
Our contributions in this work can be summarized as follows. We present C-SliceGen, a VAEGAN-based generative model for generating subject-specific abdominal slices at predefined vertebral levels. Using an arbitrary axial slice as input, C-SliceGen can implicitly incorporate unknown target slices during testing, producing realistic and structurally similar images. Our experiments demonstrate that the proposed method can harmonize variance in body composition metrics caused by positional variation in the longitudinal setting, facilitating accurate longitudinal analysis.
posterior distribution, and $\theta$ represents the decoder parameters. However, it is not feasible to directly find decoder parameters $\theta$ that maximize the log-likelihood. Instead, VAEs optimize encoder parameters $\phi$ by approximating the posterior $p_\theta(z \mid x)$ with $q_\phi(z \mid x)$, which is assumed to be a Gaussian distribution with $\mu$ and $\sigma$ as the outputs of the encoder. VAEs are trained by optimizing the evidence lower bound (ELBO):

$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi, x) = \mathbb{E}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}[q_\phi(z \mid x) \,\|\, p_\theta(z)], \tag{1}$$

where $\mathbb{E}[\log p_\theta(x \mid z)]$ represents the reconstruction loss and $D_{\mathrm{KL}}[q_\phi(z \mid x) \,\|\, p_\theta(z)]$ represents the KL-divergence, which encourages the approximate posterior distribution to stay close to the prior distribution $p(z)$. During testing, new data can be generated by sampling $z \sim \mathcal{N}(0, 1)$ from the standard normal distribution and feeding it into the decoder. The conditional VAE can be optimized with the same ELBO with little modification:

$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi, x, c) = \mathbb{E}[\log p_\theta(x \mid z, c)] - D_{\mathrm{KL}}[q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c)]. \tag{2}$$
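To make the two ingredients of the ELBO concrete, the following is a minimal NumPy sketch of the reparameterized sampling step and the closed-form KL term for a diagonal Gaussian posterior against a standard normal prior. The function names and the log-variance parameterization are assumptions made for illustration; they are not taken from the released code.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, sigma^2) || N(0, I)], summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)

rng = np.random.default_rng(0)
mu = np.zeros((2, 4))        # hypothetical encoder means, batch of 2, latent size 4
log_var = np.zeros((2, 4))   # hypothetical encoder log-variances
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)  # zero when the posterior equals the prior
```

The KL term vanishes only when the encoder outputs exactly match the prior, so any departure of the mean or variance from (0, 1) is penalized.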

GAN
GANs consist of two parts: a discriminator and a generator. Given input noise variables $z \sim p_z(z)$, the generator maps the noise to data space as $G(z)$, and the generated samples are mixed with the real data $x$.
The discriminator $D$, on the other hand, maps an image to a probability indicating whether the image belongs to the real data distribution or the generator distribution. 41 More specifically, the discriminator and the generator play a two-player min-max game with value function $V(D, G)$: 42

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \tag{3}$$

The Wasserstein GAN with gradient penalty (WGAN-GP) 43 is an alternative to the traditional GAN that enhances the stability of the model during training and addresses problems such as mode collapse. The loss function of WGAN-GP can be written as

$$\mathcal{L} = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\right], \tag{4}$$

where $P_g$, $P_r$, and $P_{\hat{x}}$ represent the generator distribution, data distribution, and random sample distribution, respectively, and $\lambda$ weights the gradient penalty.
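As an illustration only, the WGAN-GP critic objective can be evaluated without automatic differentiation for a linear critic D(x) = x · w, whose gradient with respect to its input is w everywhere. The sketch below uses NumPy; the linear critic, the function name, and the penalty weight are assumptions chosen to keep the example self-contained and do not reflect the discriminator architecture used in this work.

```python
import numpy as np

def wgan_gp_critic_loss(w, real, fake, lam, rng):
    """WGAN-GP critic loss for a linear critic D(x) = x @ w.

    Computes E[D(fake)] - E[D(real)] + lam * E[(||grad D(x_hat)||_2 - 1)^2],
    where x_hat are random interpolates between real and generated samples.
    """
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake        # random sample distribution
    # For a linear critic the input gradient equals w at every x_hat.
    grad = np.broadcast_to(w, x_hat.shape)
    grad_norm = np.linalg.norm(grad, axis=1)
    penalty = np.mean((grad_norm - 1.0) ** 2)
    return float(np.mean(fake @ w) - np.mean(real @ w) + lam * penalty)

rng = np.random.default_rng(0)
real = np.array([[1.0, 0.0], [1.0, 0.0]])  # toy "real" samples
fake = np.zeros((2, 2))                    # toy "generated" samples
loss_unit = wgan_gp_critic_loss(np.array([1.0, 0.0]), real, fake, 10.0, rng)
loss_steep = wgan_gp_critic_loss(np.array([2.0, 0.0]), real, fake, 10.0, rng)
```

With a unit-norm weight vector the penalty is zero and only the Wasserstein term remains; doubling the weight improves the Wasserstein term but is outweighed by the penalty, which is how the gradient penalty keeps the critic close to 1-Lipschitz.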

C-SliceGen
Our task involves generating a new slice at a pre-defined vertebral level from an arbitrary slice obtained at any vertebral level within the abdominal region. Our proposed method, shown in Fig. 2, comprises two encoders, one decoder, and one discriminator.

Target Slice Selection
The first step is to select a target slice for each subject such that the selected slices are as similar as possible in organ/tissue structure and appearance across all subjects. Choosing a comparable target slice for each subject is challenging because it must account for subject-specific variations in organ structure and body composition. We use two methods to select similar target slices for each subject:
BPR-based method: We select slices with similar body part regression (BPR) scores 44 as the target slices across subjects. BPR assigns different scores to different slices in the abdominal region and is efficient in locating slices. We first select a target slice in a reference subject and record its BPR score, and then, in every other subject, we select the slice whose BPR score is closest to that reference score as the target slice.
Registration-based method: We first select a reference subject's slice as the reference target slice. We then register the axial slices of every subject's volume to the reference target slice and identify the slice with the largest NML score as that subject's target slice.
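Both selection rules reduce to picking one slice index per subject. A minimal sketch is given below, assuming that per-slice BPR scores (or per-slice similarity scores after registration) have already been computed elsewhere; the function names and the example values are hypothetical.

```python
import numpy as np

def select_target_by_bpr(bpr_scores, reference_score):
    """Index of the slice whose BPR score is closest to the reference score."""
    scores = np.asarray(bpr_scores, dtype=float)
    return int(np.argmin(np.abs(scores - reference_score)))

def select_target_by_similarity(similarities):
    """Index of the registered slice with the largest similarity score."""
    return int(np.argmax(np.asarray(similarities, dtype=float)))

# Hypothetical per-slice scores for one subject's volume.
bpr_scores = [-2.0, -1.0, 0.1, 1.2, 2.5]
target_bpr = select_target_by_bpr(bpr_scores, reference_score=0.0)

similarities = [1.05, 1.10, 1.30, 1.20]
target_reg = select_target_by_similarity(similarities)
```

The BPR rule needs only one score per slice, whereas the similarity rule presupposes that every slice has been registered to the reference target slice first, which is considerably more expensive.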

Training
The input image for the model is the arbitrary slice, which provides subject-specific information, including organ shape and tissue localization. Note that we assume the input is informative about the target, that is, it is not a random input. We believe that this information remains interpretable after being encoded into latent variables $z_c$ by encoder1. All selected target slices ($x$) should have similar organ/tissue structures and appearances. This information is encoded in the latent variables $z_t$. The distribution of $z_t$ can be expressed as $q_\phi(z_t \mid x)$, where $\phi$ denotes the encoder2 parameters. We combine the organ/tissue structure and appearance of the target slice with the subject-specific information by concatenating the latent variables $z_c$ and $z_t$. This combination allows the decoder to reconstruct the target slice for the given subject. To regularize the reconstruction process, we compute the L1-norm between the target slice ($x$) and the reconstructed slice ($x_{\mathrm{recon}}$):

$$\mathcal{L}_{\mathrm{recon}} = \|x - x_{\mathrm{recon}}\|_1. \tag{5}$$

Since no target slice is available during testing, we follow a similar approach to that of VAEs. We assume that $q_\phi(z_t \mid x)$ is a Gaussian distribution with parameters $\mu_t$ and $\sigma_t$, which are the outputs of encoder2. We optimize the KL-divergence to encourage $q_\phi(z_t \mid x)$ to be close to the prior distribution $z_{\mathrm{prior}} \sim \mathcal{N}(0, 1)$, which can be written as

$$\mathcal{L}_{\mathrm{KL}} = D_{\mathrm{KL}}[q_\phi(z_t \mid x) \,\|\, \mathcal{N}(0, 1)] = \frac{1}{2} \sum_{k=1}^{K} \left(\mu_{t,k}^2 + \sigma_{t,k}^2 - \log \sigma_{t,k}^2 - 1\right), \tag{6}$$

where $K$ represents the dimension of the latent space. To mimic the process of image generation in the testing phase, we add another input to the decoder by concatenating $z_c$ with $z_{\mathrm{prior}}$ for target slice generation. We denote these generated images as $x_{\mathrm{gen}}$. $x_{\mathrm{gen}}$ is also regularized by the L1-norm:

$$\mathcal{L}_{\mathrm{gen}} = \|x - x_{\mathrm{gen}}\|_1. \tag{7}$$

The combined loss function of the above-mentioned steps can be expressed as

$$\mathcal{L}_{\mathrm{cVAE}} = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{KL}} + \mathcal{L}_{\mathrm{gen}}. \tag{8}$$

However, the major drawback of VAEs is that they tend to generate blurry images. On the other hand, GANs can produce images with sharp edges. Following Ref. 29, we add GAN regularization to our model. The discriminator in our proposed C-SliceGen model classifies both the generated and reconstructed images as fake images, while the target images are considered real images. The decoder acts as the generator for the GAN part. The GAN loss adds another constraint that forces the generated and target images to be similar. The total loss function for our C-SliceGen model can be written as follows:

$$\mathcal{L} = \mathcal{L}_{\mathrm{cVAE}} + \beta \, \mathcal{L}_{\mathrm{GAN}}. \tag{9}$$

The adversarial regularization is adjustable by the weighting factor $\beta$.
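The way the individual terms combine can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the reconstruction, generation, and KL terms are summed with equal weights, the adversarial term is supplied as a precomputed scalar, and all function and argument names are hypothetical rather than taken from the released code.

```python
import numpy as np

def l1_norm(a, b):
    """L1-norm of the difference between two images."""
    return float(np.sum(np.abs(a - b)))

def kl_term(mu_t, log_var_t):
    """Closed-form KL between N(mu_t, sigma_t^2) and N(0, I)."""
    return float(0.5 * np.sum(mu_t ** 2 + np.exp(log_var_t) - log_var_t - 1.0))

def total_loss(x, x_recon, x_gen, mu_t, log_var_t, gan_term, beta):
    """Equal-weight sum of the cVAE terms plus the beta-weighted adversarial term."""
    cvae = l1_norm(x, x_recon) + kl_term(mu_t, log_var_t) + l1_norm(x, x_gen)
    return cvae + beta * gan_term

x = np.ones((2, 2))          # toy target slice
x_recon = np.zeros((2, 2))   # toy reconstruction from [z_c, z_t]
x_gen = np.ones((2, 2))      # toy generation from [z_c, z_prior]
loss = total_loss(x, x_recon, x_gen, np.zeros(4), np.zeros(4), gan_term=2.0, beta=0.01)
```

Setting `beta` to zero reduces the objective to the conditional VAE loss alone, which is the setting in which blurrier outputs would be expected.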
The weights of the conditional VAE and the discriminator are updated alternately. The conditional VAE is updated using the loss in Eq. (8). Subsequently, the conditional VAE switches to inference mode, and its outputs $x_{\mathrm{gen}}$ and $x_{\mathrm{recon}}$ are used to update the discriminator parameters. The discriminator is then updated based on Eq. (4).

Testing
During testing, the encoded conditional image $z_c$ is combined with $z_{\mathrm{prior}}$, which is sampled from a standard normal distribution, and then fed into the decoder to generate the target slice.
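The test-time path can be sketched as follows. The decoder here is a stand-in linear map and the latent sizes are arbitrary; both are assumptions for illustration, since only the concatenation of the conditional code with a prior sample is described in the text.

```python
import numpy as np

def generate_target(z_c, decoder, prior_dim, rng):
    """Concatenate the conditional code with a standard normal sample and decode."""
    z_prior = rng.standard_normal((z_c.shape[0], prior_dim))
    z = np.concatenate([z_c, z_prior], axis=1)
    return decoder(z)

rng = np.random.default_rng(0)
z_c = np.zeros((1, 8))                      # hypothetical output of encoder1
weights = np.ones((16, 64))                 # stand-in for a trained decoder
toy_decoder = lambda z: z @ weights         # maps 16-dim latent to 64 "pixels"
x_gen = generate_target(z_c, toy_decoder, prior_dim=8, rng=rng)
```

Because encoder2 is not used at test time, the only subject-specific information reaching the decoder comes from `z_c`; the prior sample stands in for the target-slice code learned during training.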
Implementation Details

Dataset
We train and evaluate our methods on 3D volumetric CT datasets in both the portal venous phase and the non-contrast phase, as well as on a 2D single-slice CT dataset in the non-contrast phase.

In-house portal venous dataset
This dataset contains 1120 3D portal venous CT volumes from 1120 de-identified subjects from Vanderbilt University Medical Center (VUMC). The data have been approved by the Institutional Review Board (IRB) with IRB #160764. A quality check is performed on every CT scan to ensure normal abdominal anatomy. The dataset is divided into training, validation, and testing sets with 1029, 8, and 83 subjects, respectively.

In-house non-contrast dataset
To minimize the domain shift problem when applying models trained with the portal venous dataset to non-contrast single-slice data, we further train and validate our method using a 3D non-contrast CT dataset with IRB #172167. This dataset contains 1488 subjects. We split the dataset into training, validation, and testing sets with 1059, 117, and 312 subjects, respectively.

BTCV dataset
The MICCAI 2015 Multi-Atlas Abdomen Labeling Challenge (BTCV) dataset, consisting of 30 portal venous CT volumes for training and 20 for testing, is used for evaluation. We fine-tune the model trained with the in-house portal venous dataset, using 22 volumes for training and 8 for validation.

BLSA dataset
We assess the effectiveness of our method in harmonizing the positional variation on single-slice data with the BLSA dataset. To minimize radiation exposure during longitudinal imaging, the BLSA CT protocol captures single-slice data at specific anatomical landmarks instead of acquiring 3D CT data as is typically done in clinical settings. A total of 1033 subjects have more than one visit over the past 15 years, with some subjects having up to 12 visits and a median of three scans per subject. The total number of CT axial scans is 4223, and all the scans are in the non-contrast phase.

Metrics
We perform a quantitative evaluation of our C-SliceGen generative models using different target slice selection approaches and varying values of $\beta$ [as defined in Eq. (9)].

Metrics for 3D datasets
For the 3D volumetric datasets, where the target slices are available for comparison, we use four metrics to assess image quality: SSIM, PSNR, LPIPS, and NML. SSIM assesses the image based on three factors (luminance, contrast, and structure), and the final score is the product of those three factors. PSNR is determined primarily by the mean squared error. LPIPS assesses the perceptual similarity of two images by comparing the activations of their respective patches in a pre-defined network; this measure is highly correlated with human perception. NML measures the degree to which the information present in one image is contained in the other image. 45
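Two of these metrics are simple enough to sketch directly. The NumPy code below computes PSNR from the mean squared error and a histogram-based normalized mutual information. The normalization used here, (H(A) + H(B)) / H(A, B), is one common definition and is an assumption; the bin count, function names, and test images are likewise illustrative, and the evaluation in this work may rely on a different implementation.

```python
import numpy as np

def psnr(target, generated, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    mse = np.mean((target - generated) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

def entropy(p):
    """Shannon entropy of a discrete distribution, ignoring empty bins."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def normalized_mutual_information(a, b, bins=32):
    """(H(A) + H(B)) / H(A, B) estimated from a joint intensity histogram."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return (entropy(px) + entropy(py)) / entropy(pxy.ravel())

rng = np.random.default_rng(0)
img_a = rng.random((64, 64))
img_b = rng.random((64, 64))
psnr_value = psnr(np.zeros((8, 8)), np.full((8, 8), 0.1))
nmi_same = normalized_mutual_information(img_a, img_a)
nmi_diff = normalized_mutual_information(img_a, img_b)
```

Under this definition the score ranges from 1 for statistically independent images to 2 for identical images, so higher values indicate more shared information.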

Metrics for 2D single-slice dataset
In the BLSA dataset, each subject has only one axial abdominal CT scan per visit, so no GT is available for direct comparison with the generated target slices. We use NML and the coefficient of variation (CV) to evaluate the model's performance in harmonizing the positional variation. CV indicates the amount of difference between scans, 46,7 and is defined as

$$\mathrm{CV} = \frac{\sigma}{\mu} \times 100\%,$$

where $\sigma$ and $\mu$ are the standard deviation and mean of the measurement across the scans of the same subject.
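As a worked illustration, the CV of a subject's visceral fat areas across visits can be computed as below, assuming the usual definition of CV as the standard deviation divided by the mean, expressed as a percentage. The use of the population standard deviation and the example values are assumptions.

```python
import numpy as np

def coefficient_of_variation(values):
    """Standard deviation divided by mean, as a percentage."""
    v = np.asarray(values, dtype=float)
    return float(np.std(v) / np.mean(v) * 100.0)

# Hypothetical visceral fat areas (cm^2) for one subject over two visits.
cv_original = coefficient_of_variation([90.0, 110.0])
cv_harmonized = coefficient_of_variation([99.0, 101.0])
```

A lower CV after harmonization indicates that the measurements of the same subject agree more closely across visits.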

Training and Testing
BPR 44 is used to ensure a consistent field of view in the abdominal region for all the 3D volumetric data. We preprocess the data with a soft-tissue CT window of [−125, 275] Hounsfield units (HU) and rescale the data to the range of 0 to 1. The 2D axial CT scans are resized from 512 × 512 to 256 × 256 before being fed into the models. The proposed methods are implemented in PyTorch. When training the model from scratch, we optimize the network's total loss with the Adam optimizer, a learning rate of 1e-4, and a weight decay of 1e-4. When fine-tuning on the BTCV dataset, the learning rate is reduced to 1e-5. The encoder, decoder, and discriminator structures are modified from Refs. 47 and 48. We adopt common data augmentation methods, such as shift, rotation, and flip, with a probability of 0.5 to facilitate training.
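The intensity preprocessing can be sketched in NumPy as follows. The window limits and image sizes come from the text; the 2 × 2 average pooling used for downsampling is an assumption standing in for whatever interpolation the actual pipeline uses, and the function names are hypothetical.

```python
import numpy as np

def window_and_rescale(hu, lo=-125.0, hi=275.0):
    """Clip to the soft-tissue window in HU and rescale linearly to [0, 1]."""
    clipped = np.clip(hu, lo, hi)
    return (clipped - lo) / (hi - lo)

def downsample_2x(img):
    """Halve each dimension by 2x2 average pooling (requires even sizes)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

slice_hu = np.full((512, 512), 75.0)   # toy slice with a constant HU value
slice_hu[0, 0] = -1000.0               # air, below the window
slice_hu[0, 1] = 3000.0                # dense material, above the window
scaled = window_and_rescale(slice_hu)
model_input = downsample_2x(scaled)
```

Clipping before rescaling keeps air and dense bone from compressing the soft-tissue contrast that the body composition measurements depend on.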

3D Datasets Evaluation
We present the quantitative performance of our model with various metrics on different datasets in Table 1. Comparing the target slices selected by the registration-based and BPR-based methods, the registration-based method achieves better performance on the in-house portal venous dataset, while the BPR-based method performs slightly better on the BTCV dataset. This might indicate that the two methods have comparable performance in selecting target slices on portal venous phase CT scans. We show qualitative results on the BTCV test set in Fig. 3, which demonstrates that our model is capable of generating target slices irrespective of whether the conditional slice is at a higher, lower, or similar vertebral level.
On the non-contrast dataset, however, our empirical results indicate that the generated images are not at the intended vertebral level. We trace this back to the selected target slices in the non-contrast dataset, which are not at similar vertebral levels to begin with; this makes it hard for the network to learn the target location and results in significant noise in the generated images. To address this issue, we use a semi-BPR method wherein we compare the target slice with the eight axial slices preceding and succeeding it to select the new target slice. The results with the manually corrected target slices are presented in Table 1. Comparing the results from the non-contrast phase dataset and the portal venous phase dataset, two out of four metrics show slightly worse performance, while the other two metrics show comparable or even better performance. This may indicate that our model generalizes well across different CT phases.

2D Single-Slice Evaluation
We use NML and CV to evaluate our model's capability to harmonize longitudinal variation. We evaluate the model's performance using the visceral fat area as the primary metric, because visceral fat is highly susceptible to positional variation, as mentioned in Ref. 7, and it is a crucial component of body composition that indicates an individual's health condition. 7,49 The BLSA single-slice CT scans are fed into the model trained with non-contrast 3D CT volumes. The generated images are resized to the original image size of 512 × 512. We use the method in Ref. 7 to extract the segmentation mask of the visceral fat, which involves feeding the data into a pre-trained model for inner/outer abdominal wall segmentation and using fuzzy c-means 50,51 to extract the adipose tissues. In the inner abdominal wall segmentation, we observe that the model performs unsatisfactorily in excluding the retroperitoneum from both real and generated images. The retroperitoneum is an anatomical region situated behind the abdominal cavity; it contains the aorta and the left and right kidneys and often lacks well-defined boundaries, making it difficult to segment accurately. We follow the practice in Ref. 7 and manually assess the results from both the real and generated images to ensure that the retroperitoneum is segmented correctly. We mask the inner abdominal wall with the adipose tissues to get the final visceral fat mask.
We calculate the NML between every two scans of the same subject on both the original and harmonized images; the results are shown in Fig. 4(a). The harmonized images have a higher NML than the original images, which indicates that the generated images of the same subject share more information than the original images do. However, we observe a higher CV in the harmonized images than in the original images, as shown in Fig. 4(b), which implies that the differences between slices increase after generation. As the metrics show contradictory results, we conduct a human assessment of the harmonization results. We find that 431 out of 1033 subjects have at least one scan taken at an obviously different vertebral level from the other scans. For those 431 subjects, our model helps harmonize the positional variation, resulting in a significantly lower CV than the original images with p < 0.01 under the Wilcoxon signed-rank test, as shown in Fig. 5. Our model reduces the variance in subjects with both fewer and more longitudinal visits, as shown in Figs. 6 and 7, respectively. In Fig. 6, our model reduces the variance by 36.3% and 42.5%, respectively, and in Fig. 7, the variance is reduced by 37.8% and 76.9%, respectively. However, for subjects whose original slices were already taken at similar vertebral levels, our model can introduce additional noise and result in larger variance among scans, as shown in Fig. 8.
higher with $\beta = 0$. This observation supports that SSIM and PSNR scores may not completely reflect human perception, as mentioned in Refs. 52 and 53.

Distance impact
In our method, we aim to map an abdominal axial scan at an arbitrary vertebral level to a predefined target vertebral level, where the distance between the given scan and the target scan is unknown. This is also the case in the BLSA single-slice dataset. We assume that as the distance between the scans increases, the scans undergo more structural changes, making the generation process more challenging. To evaluate the impact of distance on our model's performance, we conduct validation experiments. Specifically, we assess the performance of models trained on scans from varying distance ranges and compare them to models trained using known distances. We also include the model trained on the whole abdominal region with unknown distance for comparison (the model in Table 1). All the models are trained and tested with $\beta = 0.01$, using the in-house portal venous dataset and the BPR-based target slice selection method. We evaluate the model performance with the LPIPS and SSIM scores. The results are shown in Fig. 9.
To assess the performance of the model with fixed known spacing, we do not ask the model to predict the target slice, because in most of the CT volumes the spacing in the z dimension is 3 mm. In this case, if we train with a fixed distance of 3 mm, we can only get two conditional and target slice pairs for each subject, which leads to a data deficiency problem and prevents a proper evaluation of the model. Therefore, instead of predicting the target slice, we design the system to generate, for each given slice, the corresponding slices at intervals of 3, 6, and up to 75 mm upward in the abdominal region. For the models trained with a fixed spacing range (3, 6, and up to 75 mm), each model is trained with slices up to the specified distance from the target slice. The line labeled unknown distance refers to the results of applying the trained model in Table 1 to a fixed distance test set.
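The pairing of given and generated slices in the fixed-distance setting can be sketched as follows. The 3 mm spacing and the 3 to 75 mm distances come from the text; the function name, the convention that a larger slice index lies further upward, and the toy volume size are assumptions.

```python
def fixed_distance_pairs(num_slices, distance_mm, spacing_mm=3.0):
    """(given, target) slice index pairs separated by a fixed physical distance."""
    step = round(distance_mm / spacing_mm)
    if step < 1:
        raise ValueError("distance must be at least one slice spacing")
    return [(i, i + step) for i in range(num_slices - step)]

# Distances evaluated in the ablation: 3, 6, ..., 75 mm.
distances_mm = list(range(3, 76, 3))

# Toy volume with 5 axial slices at 3 mm spacing.
pairs_6mm = fixed_distance_pairs(num_slices=5, distance_mm=6)
pairs_per_distance = {d: len(fixed_distance_pairs(40, d)) for d in distances_mm}
```

Pairing every slice with the slice a fixed distance above it yields many training pairs per volume, in contrast to pairing only with a single predefined target slice.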
According to Fig. 9, models trained with a fixed distance range of 3 or 6 mm perform worst among all the distance ranges. This can be explained by the data deficiency problem mentioned before. The performance of models trained with a fixed distance is best when the slice distance is small but gradually drops as the distance increases. This indicates that the model performs better with a smaller distance between the given and target slices. The model trained using the entire abdominal region (with unknown spacing) consistently performs poorly, starting from a distance of 6 mm. On the other hand, the models trained using a fixed range of 15 to 60 mm show stable performance in terms of LPIPS and SSIM. These results indicate that the model's most effective distance range is between 15 and 60 mm, and training with a wider range can lead to decreased overall performance.
Fig. 9 Assessing the impact of the distance between the given slice and target slices on the model performance. Fixed distance represents the model trained with a known distance, and the fixed distance range line represents the model trained with up to a distance of the given point. Unknown distance represents the inference performance of the model trained on the entire abdominal region when tested on data with a fixed distance between the input and target slices. We observe that the model's most effective distance range is between 15 and 60 mm, where performance is stable in terms of both LPIPS and SSIM.

Limitations and Discussion
In this work, we mitigate the domain shift problem observed in our previous publication, 40 in which the model trained with the portal venous phase had limited performance when applied to the non-contrast phase BLSA data. We reduce this issue by using non-contrast 3D volumetric data for training together with semi-BPR-based target slice selection for accurate target slice selection. From Figs. 4 and 5, we observe that our model can help reduce the longitudinal variance on data taken at obviously different vertebral levels. However, when it is applied to data taken at similar vertebral levels, our model can introduce additional noise by predicting different heterogeneous soft tissues, as shown in Fig. 8. Predicting and synthesizing heterogeneous soft tissues, such as the colon and stomach, is challenging because the size and shape of these tissues depend largely on individual conditions and on position at the time of the CT scan, making it hard for the generative model to find a subject-specific distribution of such tissues. Hence, this remains a critical limitation of this study. Generating heterogeneous soft tissue while preserving its shape, size, and boundary information is a direction for future work.
In addition, we validate our method's most effective distance range. Models trained with data up to 60 mm away from the target slice have performance comparable to that of the model trained with data up to 15 mm away from the target slice. Furthermore, the model trained within the 60 mm range performs markedly better than the model presented in Table 1, which is trained using slices from the entire abdominal region. Therefore, for the model to be most effective, it would be preferable in future data collection to acquire data within a range of no more than 60 mm, or roughly ±3 vertebral levels. 54

Conclusion
Herein, we present our C-SliceGen model, which takes an arbitrary 2D axial abdominal CT slice as input and generates a subject-specific slice at a desired vertebral level. Our model can effectively capture changes in the organs across different vertebral levels and generate images that are realistic and structurally similar. In addition, we demonstrate our model's effectiveness in harmonizing longitudinal body composition variance caused by positional differences among different visits in the BLSA single-slice CT dataset. Specifically, in subjects with scans taken at different vertebral levels, our model effectively harmonizes positional variation, resulting in a significantly lower CV compared to the original images (p < 0.01, Wilcoxon signed-rank test). Overall, this approach offers a promising solution for managing imperfect single-slice CT abdominal data in longitudinal analysis.

Fig. 1 An example of a subject with slices acquired at different vertebral levels in different visits.The blue line represents the approximate axial position where the CT scan is taken.The yellow masks represent visceral fat.The shape and size of the captured organs and tissues vary largely among different visits leading to large variations in the visceral fat area.

Fig. 2 The input image is an arbitrarily acquired slice in the abdominal region. During the training phase, target images ($x$) are used as the GT for the generation and reconstruction process. Latent variables $z_c$, $z_t$, and $z_{\mathrm{prior}}$ are derived from conditional images, target images, and the standard normal distribution, respectively. $x_{\mathrm{gen}}$ and $x_{\mathrm{recon}}$ are considered fake images and target images are considered real images for the discriminator.

Fig. 3 The image enclosed within a pink bounding box depicts the input slice, while the image enclosed within a light blue bounding box represents the model outputs and target slice from four different subjects in the BTCV test set.The pink and light blue lines in the rightmost column indicate the axial position of the input and target slices, respectively.The structural differences in organs between the input and target slices are highlighted by orange arrows.The results indicate our model can implicitly learn the subject-specific target slices and generate realistic and structurally similar slices given input slices from arbitrary vertebral levels.

Fig. 4 Quantitative results of applying the trained model on 1033 subjects from the BLSA dataset. (a) The results with NML as the metric, and (b) the results with CV as the metric. We observe higher NML and CV among the harmonized images. NML and CV show contradictory results: higher NML suggests that the generated images of the same subjects share a higher degree of similar information compared to those of the original images, while higher CV implies that the differences between slices are increased after harmonization.

Fig. 5 CV results of applying the trained model on subjects that have at least one scan taken at an obviously different vertebral level. Note: * represents statistical significance (p < 0.01) by the Wilcoxon signed-rank test. The result demonstrates that our model is effective in harmonizing the variance caused by positional variation in longitudinal imaging.

Table 1 Quantitative results on two in-house test sets and the BTCV test set using different target slice selection methods with different $\beta$ in Eq. (9) for training.
Note: bold values indicate the best in each column/dataset.
Yu et al.: Deep conditional generative model for longitudinal single-slice. . .