Homogenization of multi-institutional chest x-ray images in various data transformation schemes

Abstract. Purpose: Although there are several options for improving the generalizability of learned models, a data instance-based approach is desirable when stable data acquisition conditions cannot be guaranteed. Despite the wide use of data transformation methods to reduce data discrepancies between different data domains, detailed analysis for explaining the performance of data transformation methods is lacking.
Approach: This study compares several data transformation methods in the tuberculosis detection task with multi-institutional chest x-ray (CXR) data. Five different data transformations, including normalization, standardization with and without lung masking, and multi-frequency-based (MFB) standardization with and without lung masking, were implemented. A tuberculosis detection network was trained using a reference dataset, and the data from six other sites were used for the network performance comparison. To analyze data harmonization performance, we extracted radiomic features and calculated the Mahalanobis distance. We visualized the features with a dimensionality reduction technique. Through similar methods, deep features of the trained networks were also analyzed to examine the models' responses to the data from various sites.
Results: From various numerical assessments, the MFB standardization with lung masking provided the highest network performance for the non-reference datasets. From the radiomic and deep feature analyses, the features of the multi-site CXRs after MFB with lung masking were found to be well homogenized to the reference data, whereas the others showed limited performance.
Conclusions: Conventional normalization and standardization showed suboptimal performance in minimizing feature differences among various sites. Our study emphasizes the strengths of MFB standardization with lung masking in terms of network performance and feature homogenization.


Introduction
[2][3] A network architecture trained on a given domain data set may perform poorly when data from other domains are tested. A host of research articles address this domain discrepancy, and so-called "domain adaptation" is a subfield of transfer learning 4,5 that addresses this problem. A good summary and review of the domain adaptation techniques in chest x-ray (CXR) imaging can be found in the paper of Çallı et al. [9][10][11] These approaches, however, assume that the source domain, from which the samples for testing come, can be appropriately formed and specified. [14] The situation in which we have a particular interest arises from the perspective of global healthcare disparity, and it does not lend itself to forming such a domain. In medically underserved areas and populations, CXR is often performed without satisfying the scanning protocols. An insufficient electric power supply, inadequate data acquisition settings, and the absence of licensed technologists are possible causes of inconsistent CXR image quality. Given such unpredictable scanning conditions, the CXR images at hand may not constitute a well-defined domain. Artificial intelligence (AI)-enabled techniques are evolving fast in medical fields, including the automated detection and diagnosis of diseases. They will surely help reduce global healthcare service disparity. Deploying such tools in the field without compromising their optimal performance is considered an important area of research and development. The purpose of this study is to implement and compare instance-based data transformation methods in that line of research, which is therefore highly relevant to the special issue of global health, equity, bias, and diversity in AI in medical imaging.
An out-of-distribution detection task that verifies whether a test sample belongs to the predefined source domain can help check the availability of the trained network. 6,15 However, to increase the utility of the learned models, a single data instance-based approach, such as input data transformation, is desirable. In this work, we focus on the input transformation approaches and provide a missing link that can explain which method would be more powerful in deep-learning-based detection tasks.
Preprocessing methods are commonly used for deep network performance enhancement 16 and efficient deep network training. 12 To reduce data discrepancies, data transformations in the scope of histogram modification techniques, including histogram equalization, matching, clipping, and normalization, 12,17 have been investigated. Nevertheless, it has been claimed that such global histogram modification methods cannot harmonize texture differences. 17 Indeed, data preprocessing steps embracing the multi-frequency characteristics or local features of x-ray images have been reported to improve computer-aided detection [18][19][20] and deep-learning-based detection tasks. 17,21,22 [25] This study compares several data transformation methods in the CXR-based tuberculosis (TB) detection task and provides strong evidence for why one method is superior to the others through radiomics analysis and deep feature analysis.

Data Transformation
We implemented three types of transformation algorithms: data normalization, data standardization, and multi-frequency-based (MFB) data standardization. For data standardization and MFB data standardization, we further split each method into two schemes: with and without lung masks. Therefore, five methods were implemented in total. Data normalization and standardization use the transformations in Eqs. (1) and (2), respectively:

$$\hat{X}_i = \frac{X_i - X_{\min}}{X_{\max} - X_{\min}}, \tag{1}$$

$$\hat{X}_i = \frac{X_i - \mu}{\sigma}\,\sigma_{\mathrm{ref}} + \mu_{\mathrm{ref}}, \tag{2}$$

where $X_i \in \mathbb{R}$ represents a CXR image input pixel value and $\hat{X}_i$ is the transformed value. For an effective normalization of CXR images coming from different systems, a dynamic normalization with histogram analysis is desirable. 12 Specifically, a DICOM image may contain a constant bias or a letter mark having outlying pixel values. Although some DICOM images provide information that can be utilized to confine pixel value ranges, not all DICOM images provide this information, and windowing parameters are therefore user-specific. In this study, as an attempt to remove such outlying pixels, we first calculated the cumulative histogram $f: \mathbb{R} \to [0, 1]$ of a given CXR image $X \in \mathbb{R}^N$. Then, $X_{\min}$ and $X_{\max}$ are such that $f(X_{\min}) = 0.02$ and $f(X_{\max}) = 0.98$. In the following homogenization processes, we first applied the above normalization and then calculated image statistics so that images have similar pixel value ranges regardless of the transformation method. Pixels with values below $X_{\min}$ or above $X_{\max}$ were identified as outliers and excluded from the statistics calculation. In Eq. (2), $\mu$ and $\sigma$ indicate the mean and the standard deviation of the input image after this preprocessing, which are then adjusted to the reference mean $\mu_{\mathrm{ref}}$ and standard deviation $\sigma_{\mathrm{ref}}$. When standardizing images with lung masks, image statistics were calculated only within the masked lung region, whereas the transformation was applied to all of the image pixels.
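The normalization and standardization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, the 2% and 98% points of the cumulative histogram are taken with `np.quantile`, outlier pixels are excluded from the statistics but not clipped, and the image is assumed not to be constant.

```python
import numpy as np

def robust_limits(img, lo=0.02, hi=0.98):
    """Pixel values at which the cumulative histogram reaches lo and hi."""
    x_min, x_max = np.quantile(img, [lo, hi])
    return float(x_min), float(x_max)

def normalize(img, lo=0.02, hi=0.98):
    """Min-max normalization with the robust limits X_min and X_max."""
    x_min, x_max = robust_limits(img, lo, hi)
    return (img - x_min) / (x_max - x_min)

def standardize(img, mu_ref, sigma_ref, lung_mask=None, lo=0.02, hi=0.98):
    """Match mean and standard deviation to reference values.

    Statistics exclude outlier pixels (outside [X_min, X_max]) and, if a
    lung mask is given, pixels outside the lung. The affine transformation
    itself is applied to every pixel.
    """
    x = normalize(img, lo, hi)
    inlier = (x >= 0.0) & (x <= 1.0)
    if lung_mask is not None:
        inlier &= lung_mask.astype(bool)
    mu, sigma = x[inlier].mean(), x[inlier].std()
    return (x - mu) / sigma * sigma_ref + mu_ref
```

With a lung mask, only the pixels used for `mu` and `sigma` change; the output still covers the whole image, which mirrors the description that the transformation was applied to all pixels.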
The MFB data standardization starts with the Laplacian pyramid decomposition, 26 which is widely used in CXR image enhancement. 27 The Laplacian pyramid decomposition results in the multiscale representation of a CXR image, and its reverse reconstruction process reproduces the original image. Conventional MFB data enhancement techniques strengthen specific frequency bands of a CXR image, whereas the MFB data standardization aims to make a CXR image similar to the target domain at each multi-frequency band. 18 As mentioned above, pixels inside the masked lung regions were used for the statistics calculation when the lung masking scheme was adopted. The iterative procedures are summarized in Algorithm 1. We set the total number of iterations N to 50 in this study. We empirically set the maximum level of the Laplacian pyramid to 5, which results in a 16 × 16 array size at the fifth level. The ResNet-18 28 downsamples an input image with an array size of 512 × 512 to 16 × 16, which equals the smallest array size produced when level 5 is used in the Laplacian pyramid decomposition.
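A rough sketch of an iterative, band-wise standardization of this kind is given below. It is an illustration under several assumptions rather than a reproduction of Algorithm 1: the pyramid uses 2 × 2 block averaging and pixel replication instead of a Gaussian kernel, the low-pass residual is standardized in the same way as the Laplacian bands, the lung mask is downsampled and thresholded at each level, and all function names are hypothetical. Image sides are assumed to be divisible by 2 to the power of `levels`.

```python
import numpy as np

def down2(a):
    """Halve the resolution by 2x2 block averaging (stand-in for blur + decimation)."""
    h, w = a.shape
    return a.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2(a):
    """Double the resolution by pixel replication."""
    return np.repeat(np.repeat(a, 2, axis=0), 2, axis=1)

def decompose(img, levels):
    """Return [L_0, ..., L_{levels-1}, G_levels] (Laplacian bands + residual)."""
    bands, g = [], img
    for _ in range(levels):
        g_next = down2(g)
        bands.append(g - up2(g_next))
        g = g_next
    bands.append(g)
    return bands

def reconstruct(bands):
    """Invert decompose(): add each band back onto the upsampled coarser level."""
    g = bands[-1]
    for lap in reversed(bands[:-1]):
        g = lap + up2(g)
    return g

def mask_levels(mask, levels):
    """Boolean mask at every pyramid level."""
    masks, m = [], mask.astype(float)
    for k in range(levels + 1):
        masks.append(m > 0.5)
        if k < levels:
            m = down2(m)
    return masks

def band_stats(bands, masks):
    """Per-band mean and std inside the mask (whole band if the mask is empty)."""
    mus, sigmas = [], []
    for band, m in zip(bands, masks):
        vals = band[m] if m.any() else band.ravel()
        mus.append(vals.mean())
        sigmas.append(vals.std())
    return np.array(mus), np.array(sigmas)

def reference_stats(images, levels=5):
    """Average the per-band statistics over a set of reference images."""
    full = mask_levels(np.ones(images[0].shape, dtype=bool), levels)
    stats = [band_stats(decompose(im.astype(float), levels), full) for im in images]
    return (np.mean([s[0] for s in stats], axis=0),
            np.mean([s[1] for s in stats], axis=0))

def mfb_standardize(img, mu_ref, sigma_ref, mask=None, levels=5, n_iter=50, eps=1e-8):
    """Repeatedly decompose, match each band to the reference statistics, reconstruct."""
    if mask is None:
        mask = np.ones(img.shape, dtype=bool)
    masks = mask_levels(mask, levels)
    out = img.astype(float)
    for _ in range(n_iter):
        bands = decompose(out, levels)
        mu, sigma = band_stats(bands, masks)
        bands = [(b - mu[k]) / (sigma[k] + eps) * sigma_ref[k] + mu_ref[k]
                 for k, b in enumerate(bands)]
        out = reconstruct(bands)
    return out
```

The loop is kept because re-decomposing a reconstructed image does not in general return the modified bands, particularly when the statistics come from a lung mask, so the band statistics are re-matched on each pass.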

Lung Segmentation
To compare the effects of image transformation methods on the detection performance in conjunction with masked lung statistics, we need a lung segmentation tool. Lung segmentation was conducted using a deep neural network separate from the network used for TB detection. A network with a Res-UNet structure 29 was trained with the JSRT CXR dataset 30 and its mask dataset. 31 We also used the China and Montgomery datasets 32 for additional validation. The training code was implemented using PyTorch libraries 33 on a system with a GeForce RTX 3090. We summarize the details of the datasets used for lung segmentation in Table 1.

Data Preparation for TB Detection
We used multi-institutional CXR data from seven clinical sites to train and test the deep neural network for TB detection. Anonymized datasets were collected from the clinically cooperative institutions of the Radisen AI Research Center. This study was approved by the institutional review boards, and informed consent was waived. We summarize dataset information in Table 2. It should be noted that the reference site data provide the target domain examples and the other sites provide the source domain examples, and that the source domains are diverse in terms of CXR modality, bit depth, and scanning protocol, the details of which are unavailable. The reference dataset is composed of half normal CXRs and half abnormal CXRs diagnosed with TB. For non-reference data, there are class imbalances between TB sample sizes and normal sample sizes. Because data skewness affects performance metrics, 34 we calculated the performance metrics after under-sampling normal data so that the sample sizes of the two classes become equal. The under-sampling was performed randomly and repeated so that statistical analyses are feasible. Meanwhile, we used all of the data for the feature analyses because the class imbalance itself does not have a significant influence on the feature extraction.

Algorithm 1 MFB standardization algorithm.

TB Detection Network Training
The overall workflow of the CXR-based TB detection in this work is shown in Fig. 2. In the training phase, lung regions were first identified using the trained lung segmentation network. After appropriate image cropping, each image was downsized to an array of 512 × 512 using bilinear interpolation. Each data transformation method was then applied to each image, and the lung regions of the processed images were used for the network training. In Fig. 2, we omitted the image cropping process for simplicity of presentation. The ResNet-18 architecture was used for the TB detection network. Training details are presented in Appendix A.

TB Detection Performance Evaluation
In the inference phase, CXR images from the seven different sites in Table 2 were tested. For testing each TB detection network trained on data that went through a specific data transformation, the same data transformation method was applied to the input CXR image. For example, we applied the MFB data standardization with lung masking to the test dataset when we evaluated the performance of a TB detection network that was trained on data transformed by the MFB standardization with lung masking. For the evaluation, receiver operating characteristic (ROC) and precision-recall (PR) curves were used. We also calculated the F1 score of the reference test results with 20 different threshold values within [0, 1]. The threshold value that provided the highest F1 score in the reference test results was applied to the other datasets to calculate F1 scores and recalls. For the reference dataset, we bootstrapped the performance metrics for statistical analysis. Random bootstrapping was performed up to a thousand times, a number determined so that the standard deviation of the bootstrapping results stays within 2% of DeLong's estimated standard deviation of the area under the curve (AUC) value of the ROC curves. 35,36 For non-reference data (from sites A to F), we randomly repeated the under-sampling a thousand times, which is the same number of times that bootstrapping was performed. Localization of suspicious regions in the CXR image was also performed using Grad-CAM++, 37 which can provide the activation map for TB detection. Bounding boxes were drawn by radiologists for suspicious areas in the entire dataset, and we evaluated how well the obtained saliency maps match the radiologists' insights. We obtained Grad-CAM++ images at the deepest feature layer and calculated the weighted intersection over attention (WIOA) values with respect to the radiologists' bounding box information. Grad-CAM++ images and WIOA values were produced with the M3d-CAM PyTorch library. 38 Details on the procedure for calculating the WIOA can be found in Appendix B.
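The threshold selection on the reference data and the repeated balanced under-sampling on a non-reference site can be sketched as below. This is a minimal illustration with hypothetical function names; the assumptions are that the 20 thresholds are evenly spaced on [0, 1], that labels are 1 for TB and 0 for normal, and that the larger class is sampled down to the size of the smaller one.

```python
import numpy as np

def f1_and_recall(y_true, scores, thr):
    """F1 score and recall of the positive (TB) class at a given threshold."""
    pred = scores >= thr
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return float(f1), float(recall)

def best_threshold(y_true, scores, n_thresholds=20):
    """Threshold on an even grid over [0, 1] that maximizes F1 on the reference data."""
    grid = np.linspace(0.0, 1.0, n_thresholds)
    f1s = [f1_and_recall(y_true, scores, t)[0] for t in grid]
    return float(grid[int(np.argmax(f1s))])

def undersampled_metrics(y_true, scores, thr, n_repeats=1000, seed=0):
    """Mean and std of (F1, recall) over repeated class-balanced under-sampling."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_true == 1)
    neg = np.flatnonzero(y_true == 0)
    n = min(len(pos), len(neg))
    results = []
    for _ in range(n_repeats):
        idx = np.concatenate([rng.choice(pos, n, replace=False),
                              rng.choice(neg, n, replace=False)])
        results.append(f1_and_recall(y_true[idx], scores[idx], thr))
    results = np.asarray(results)
    return results.mean(axis=0), results.std(axis=0)
```

The threshold is fixed on the reference test results and then reused unchanged for each site, so differences in F1 and recall across sites reflect the shift in network outputs rather than a per-site recalibration.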

Radiomic Features
To observe the distributions of various datasets, we first extracted the radiomic features [39][40][41][42] of CXR images transformed by the aforementioned methods. For each processed CXR image, we masked out the region outside of the lung. PyRadiomics 43 was used to extract features from the preprocessed lung images. We calculated 93 radiomic features, which can be grouped into six classes: first-order statistics, gray-level co-occurrence matrix (GLCM), gray-level dependence matrix (GLDM), gray-level run-length matrix, gray-level size zone matrix, and neighboring gray tone difference matrix. The last five classes can be grouped into second-order statistics. We did not include shape features because the shape of the lung in the CXR is not the target of data homogenization.
We calculated the Mahalanobis distance 44,45 as a numerical assessment of the homogenization ability of the radiomic features. The Mahalanobis distance $D_M$ is a distance between a point $x$ and a distribution $D$ with a mean of $\mu$ and a covariance matrix $S$ and is defined as

$$D_M(x) = \sqrt{(x - \mu)^{\mathsf{T}} S^{-1} (x - \mu)}. \tag{3}$$

We calculated the distances between each sample of the test sites (A to F) and the distribution of the reference data. The distances of first- and second-order radiomics were calculated separately.
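As a minimal sketch of this distance calculation, the function below estimates the mean and covariance from the reference feature matrix and returns one distance per test sample. The function name is hypothetical, and the use of a pseudo-inverse (to tolerate collinear radiomic features) is an assumption, not something stated in the text.

```python
import numpy as np

def mahalanobis_to_reference(samples, reference):
    """Mahalanobis distance of each row of `samples` to the reference distribution.

    samples:   (n_samples, n_features) array of test-site features
    reference: (n_ref, n_features) array of reference-site features
    """
    mu = reference.mean(axis=0)
    cov = np.atleast_2d(np.cov(reference, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse: an assumption for singular S
    diff = samples - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))
```

Calling it once with the first-order feature columns and once with the second-order columns would give the two sets of distances that are reported separately.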
For a better visualization of the features, we performed a principal component analysis (PCA) of the calculated radiomics. We chose PCA for visualization to compare the effects of data harmonization on radiomic features in the common coordinate system. Because the radiomic features of images were extracted in the same way regardless of the dataset and harmonization methods, such a comparison is legitimate.
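One way to obtain such a common coordinate system is to pool the feature vectors of all sites and methods, fit a single PCA, and plot the projected scores by group. The sketch below does this with a singular value decomposition; the function name and the z-scoring of each feature before the PCA are assumptions for illustration.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project pooled feature vectors onto their leading principal components.

    features: (n_samples, n_features) array pooled over all sites and methods.
    Returns the scores and the explained-variance ratio of each kept component.
    """
    z = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-12)
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return z @ vt[:n_components].T, explained[:n_components]
```

Because the principal axes are computed once from the pooled data, points from different sites and transformations share the same axes and can be compared directly in one scatter plot.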

Deep Features
The architecture of a typical convolutional neural network (CNN) 46,47 consists of a series of convolutional layers and pooling layers. We focused on the final layer of the CNN because it stores all of the essential information extracted from the input image in the form of a feature vector. In this work, we used the latent feature vector of the final layer of the TB detection network (Sec. 2.4), which is a vector with a size of 512. We used t-distributed stochastic neighbor embedding (t-SNE) 48 to visualize the latent feature vector. The t-SNE helps with understanding how the network responds to the test data under various data transformations by visualizing the feature clusters.

Data Transformation
Figure 3 shows example network inputs of lung-segmented images from various sites processed by the different transformation methods. N, S, SL, MFB, and MFBL indicate data normalization, data standardization without and with lung masking, and MFB data standardization without and with lung masking, respectively. In Fig. 3, the N method resulted in non-uniform lung region brightness. It is observed that the MFB methods generally provide an overall appearance more similar to the reference than N, S, and SL in the same display window setting. N, S, and SL resulted in rather coarse image textures in the other site datasets compared with those in the reference, whereas the MFB methods produced finer textures similar to the reference. However, the MFB without a lung masking scheme provides suboptimal visual texture similarity in some data, for example, site D in Fig. 3. We included a full breakdown of the time spent on the various preprocessing techniques in Appendix C.

TB Detection Performance
In Fig. 4, we show the ROC and the PR curves of the network outputs from different data transformation methods. The AUC values of the ROC and the average precision (AP) values of the PR curves for the reference data showed marginal network performance differences among various data transformations. Network performances, however, largely varied for the six other site datasets. The MFBL method resulted in the minimum gap between the reference curve and the site-average curve in both ROC and PR, whereas S and SL showed poorer results in both AUC and AP.
Table 3 summarizes the network performance evaluation results over the sites in terms of F1 score, recall, and WIOA. Note that the average values in the table are average scores from sites A to F, and we present the standard deviation of those six scores. In the F1 score result, the MFBL method showed the highest values for every non-reference site except F. The site variation of the scores is smallest in the MFB and MFBL results, which implies more robust network performance. The average recall scores of the N, S, and SL methods did not exceed 0.5, which means that these methods missed at least half of the true TB cases. Considering that we balanced the classes by under-sampling, this detection rate is lower than that of random calls. The WIOA value of a non-trained network, specified as random in Table 3, was around 0.31. The WIOA values of the reference dataset with different networks were around 0.60, and the MFBL method provided slightly lower values in the non-reference datasets as well. Figure 5 shows examples of bounding boxes and Grad-CAM++ images. As shown in Fig. 5(f), a hot spot in the saliency map from the MFBL network agrees well with the radiologist's bounding box.

Radiomic Features
For a visual comparison, we calculated the Z-scores of the features and presented a heatmap of the Z-scores. Figure 6 shows the corresponding heatmap, with each row representing one feature and each column representing one CXR image. The heatmap is divided into five sections on the horizontal axis; each section corresponds to one data transformation method. If the Z-scores of a certain radiomic feature are similar across the samples from different sites, it can be said that the homogenization performance of the method is adequate in terms of that specific radiomic feature.
N cannot harmonize the multi-site data. S and SL worked as intended for the first-order statistics. However, there was a clear discrepancy between reference test data and other multi-site data in terms of higher-order radiomic features. MFB succeeds in reducing the discrepancies for various sites in both first- and second-order radiomic features. The use of lung masks seems to be more effective in terms of harmonization performance. The heatmap suggests that the MFBL can harmonize not only the histogram characteristics but also textural features.
Figure 7 is a box-and-whisker plot of Mahalanobis distances between the features of test and reference sites. A larger distance implies that the corresponding features are not well harmonized even after a given data transformation method, and vice versa. The implications of the box plots of distances are in the same vein as the results shown qualitatively in the heatmap (Fig. 6).
Fig. 6 Heatmap depicting z-scores of 93 radiomic features for various datasets with a conventional N and four different homogenization methods. The vertical axis denotes radiomic features, divided into first-order statistics and second-order statistics including GLCM and GLDM. The horizontal axis denotes samples. Note that the MFBL method showed the best homogenization results throughout various features. The N has a limited performance on feature harmonization overall. S and SL can harmonize only first-order statistics and not second-order features. MFBL resulted in the shortest feature distances, implying a superior homogenization performance.
Note that a smaller discrepancy in the radiomic feature space does not always guarantee better performance of the classification network. For example, although S improved the first-order features in all cases compared with N, the performance of the network is generally better for the normalized images (see Fig. 4). Nevertheless, we confirmed that conventional N and S failed to harmonize the textural features of the images as compared to the MFB and thus cannot substantially reduce the effect of reference data bias.
Figure 8 contains visualizations of radiomic features after five different data transformation processes. There are distinct differences between the radiomic features of the reference site and other sites that have undergone conventional N and S. However, after MFB, especially with lung masks [Fig. 8(e)], it is clear that the radiomic features of the samples from different sites are well harmonized with the reference group.

Deep Features
Table 4 lists the calculated Mahalanobis distances of the multi-site features to the reference distribution used for the network training. Figure 9 shows the box-and-whisker plots corresponding to Table 4. The MFBL method shows the shortest distance regardless of the test data. This implies that the deep feature harmonization ability of the MFBL method is the highest and that the other harmonization methods may have failed to match the data distributions. Similar to the result of the radiomic feature analysis, the distance in the deep feature space is not necessarily inversely correlated with the network classification performance. However, the distance can be used as a metric to determine the network's trustworthiness for data having different distributions from the training data. The t-SNE visualization results are presented in Fig. 10. After N, S, and SL [Figs. 10(a)-10(c)], the intrasite differences were reduced, but the homogenization between reference data and other sites was not successful; the multi-site data form a cluster distinct from the reference. From the radiomic feature analysis (Sec. 3.3), we found that N, S, and SL could not reduce the data distribution gap between the reference and the other sites. The t-SNE result implies that the network was not able to handle the reference data bias and finally resulted in a biased model. From the point of view of the TB detection network learned from the reference dataset, it can be seen that data from other sites are still treated as out-of-distribution. Although the network may have achieved a somewhat satisfying performance after N, S, or SL (Sec. 3.2), the latent vector analysis does not provide strong support for such an improvement. Meanwhile, both versions of the MFB methods can harmonize multi-site data in terms of the network's deep features, not forming any distinct cluster [Figs. 10(d

Discussion
In this study, we used a single dataset from a single clinical site for training and tried to harmonize other datasets with the reference data by exploiting data transformation methods. Using multi-institutional datasets for training in conjunction with data augmentation is also a viable option for enhancing a network's generalizability. This approach can be interpreted as one that increases the diversity of the feature space of the training dataset, hopefully covering the feature space of a new dataset. Pixel-level transformations, including contrast or brightness adjustment, and spatial-level transformations, such as shift and rotation, were used in an example work. 13 Our study shows the importance of image features at multiscale representations in the detection network training. Developing data augmentation strategies with image frequency modulation is planned as our future study.
It is observed that radiomic and deep features from the multi-site dataset after the N, S, or SL method did not agree well with the features of the reference test data. However, it is shown repeatedly that the features within the multi-sites become similar to each other after N, S, and SL [see Figs. 8(a)-8(c) and Figs. 10(a)-10(c)]. It is perhaps due to the fact that the textures of test data from A to F are similar in the original images. As shown in Fig. 4, the reference image is rather sharp and accordingly high-frequency-emphasized, whereas the images of other sites after N, S, and SL are blurrier in a similar fashion.
Deep learning has achieved great success in various image processing tasks, and domain adaptation is one of the applications that have benefited. Unsupervised deep-neural-network-based domain adaptation methods aim to transform the data distribution of the source domain to the target domain. 7,10,49 By such a domain transformation, data harmonization between chest radiographs acquired under various conditions can be achieved. However, there are two shortcomings with such translation methods, which can be major obstacles, especially in medically underserved regions.
One shortcoming is that deep learning techniques require a considerable amount of data. [53][54] The performance of generative models heavily depends on the number of images for training, and the data-efficient models still require hundreds of images. 55 Another shortcoming is that deep neural networks are computationally heavy. The problem remains for the few-shot learning techniques because the CNNs have a large number of parameters to be trained. The computational burden of training the network in a new environment inevitably becomes an obstacle to rapid diagnosis. If the chest x-ray scanner is placed on a mobile system, it is impractical to mount such a high-performance computing device on the system.
There are alternative versions of GAN, such as one- or few-shot domain transfer models, that require smaller datasets to train. Because we are interested in instance-based data harmonization, we implemented a one-shot GAN learning model. 56 A network was trained to align the domain features of dataset A to the reference data. Lung-masked images were used for training. The process was accelerated by a single NVIDIA GeForce GTX 1080 Ti graphics processing unit, and it took two days to train a single one-shot domain translation network. Two distinct networks were trained on two different samples of dataset A, and generative-model-based domain translation resulted in significantly varying outcomes, depending on which instance was used for one-shot training (see Fig. 11). However, the transformations in this paper's scope are fully instance-based and thus free from such complications.
In this study, we analyzed the data features after various harmonizations. For the quantitative analysis, we calculated the Mahalanobis distance between the deep features of reference data and the test data points. The conclusion was that the MFBL, which resulted in the shortest distance, is the most efficient harmonization. Unfortunately, it was difficult to establish a direct correlation between the distance in feature space and the degree of harmonization. For example, the average deep feature distance of N (13.73) is larger than that of S (12.80) in the case of dataset E (see Table 4), whereas the network performance in terms of AUC and AP was superior for the N case. Still, there is an overall tendency that the shorter the distance is, the better the network performance metric becomes. Although further research is needed to determine the criteria for a reliable and credible model that is less affected by the reference data distribution, our suggested feature analysis may provide a basis for such discussion.
We would like to emphasize that this study contributes to understanding the data homogenization processes through feature analyses and recommends the MFBL method as an instance-based data transformation method, potentially for CXR images acquired in medically underserved areas. In line with the special issue on global health, equity, bias, and diversity in AI in medical imaging, this contribution will help to reduce global healthcare disparity.

Conclusions
In this work, we implemented various instance-based data transformations to reduce data discrepancy for the multi-institutional use of a trained deep-learning prediction model. A CNN-based TB detection network was trained using the reference site data, and the TB detection performance was tested for the remaining six sites after applying N, S, SL, MFB, and MFBL. MFBL outperformed other methods in terms of numerical criteria including AUC, AP, F1 score, and WIOA. For the radiomic feature analysis, we calculated the Mahalanobis distance and performed dimensionality reduction. We found that S and SL match the histogram-based features well but fail to match the texture-related second-order statistics. On the other hand, the textural features of the multi-site CXRs after MFBL were well homogenized to the reference data. The deep features of the trained network were analyzed through the same method, and the MFBL showed the best harmonization performance. Conventional N and S, on the other hand, did not lessen the distribution gap between multi-site datasets and were accordingly unsuccessful in deep feature harmonization. Our study emphasizes the strengths of the MFBL, especially its comparative advantage in network performance and its ability to lessen the disparity between various data distributions.

Appendix A: TB Detection Network Training Details
We initialized every model corresponding to each data transformation method with the pretrained ResNet-18 provided by PyTorch, and the same random seed was used for training. A dropout layer with a dropout ratio of 0.6 was added before the last linear layer. We used the same computing resources that were used for the lung segmentation network training, and training details are summarized in Table 5. The early stopping condition was determined by monitoring the validation losses with the reference test dataset.

Appendix B: Calculation of WIOA
In Fig. 12, we show the procedure for calculating the WIOA value in an exemplary case. The Grad-CAM++ image is first segmented by the Otsu method, 58 which results in a binarized attention map ATT. Then, a weighted attention map WATT is generated by multiplying the Grad-CAM++ image and ATT. The multiplication of the WATT and the bounding boxes provides the weighted intersection area WINT, and the WIOA value is finally calculated by

$$\mathrm{WIOA} = \frac{\sum_x \mathrm{WINT}(x)}{\sum_x \mathrm{WATT}(x)}. \tag{4}$$

Appendix C: Data Preprocessing Time
Table 6 provides a detailed analysis of the time required for each data-transforming procedure.
It is important to mention that we preprocessed the entire dataset prior to the actual training and saved the results separately. Through this strategy, we could avoid increasing the training time. Specifically, the preprocessing of ∼2000 reference images required an additional 4 minutes to complete. As summarized in Table 6, in addition to the reference computation time (0.28 s), only a few hundred milliseconds are additionally required for the lung segmentation and MFB. We believe this increase in computation time would not hamper its use in clinical practice.

Appendix D: Learning Curve
In terms of convergence in the training phase, all of the implemented harmonization methods showed a similar speed of convergence, as shown in Fig. 13. The training phase used 1200 epochs in all cases. Solid lines and dashed lines represent the training phase and the validation phase, respectively.

Fig. 13 Training and validation loss plots of the networks.

Figure 1 shows an example Laplacian pyramid of a CXR image. We denote the Gaussian and the Laplacian pyramid images at the kth level by $G_k$ and $L_k$, respectively. By design, lower frequency information tends to be stored at the higher levels of the Laplacian pyramid. For MFB standardization of an input image, we first applied the Laplacian pyramid decomposition to a set of training CXR images. For each training image, the mean $\mu_k$ and standard deviation $\sigma_k$ of $L_k$ were then calculated. Finally, the reference mean $\mu_k^{\mathrm{ref}}$ and standard deviation $\sigma_k^{\mathrm{ref}}$ were calculated by averaging those $\mu_k$ and $\sigma_k$ values. The MFB standardization process iteratively adjusts the $\mu_k$ and $\sigma_k$ values of an input image to the $\mu_k^{\mathrm{ref}}$ and $\sigma_k^{\mathrm{ref}}$ values.

Fig. 1 Example Laplacian pyramid of a CXR image.

Fig. 2 Diagram of overall procedures for comparison study.

Fig. 3 Patches extracted from inputs to TB detection networks with different data transformations. Images in the same column are displayed with the same display window.

Fig. 4 ROC and PR curves with different data transformations. Each line shows the ROC or PR curve for one site. The area values in the legend are the AUC and AP values for the ROC and PR curves, respectively. Values in square brackets represent the 95% confidence interval. Average lines were produced by collecting every result from site A to site F.

Fig. 7 Box-and-whisker plots of the Mahalanobis distances between test radiomic features and the reference radiomic features. (a)-(f) Sites A, B, C, D, E, and F, respectively. N, S, and SL cannot harmonize the second-order statistics well, whereas the performance of the MFBL was the best for all sites.

Fig. 9 Box-and-whisker plot of the Mahalanobis distance of the deep features. Distances were measured from each point in a dataset to the reference distribution.

Fig. 10 t-SNE visualizations of deep-embedded features after conventional N and four different data transformations. (a) N, (b) S, (c) SL, (d) MFB, and (e) MFBL. Note that, for N and S, the deep features are not well harmonized.
8(a)-10(c), Figs.10(a)-10(c)].It is perhaps due to the fact that the textures of test data from A to F are similar in the original images.As shown in Fig.

Fig. 11 Comparison of the deep-learning-based one-shot domain translation and the MFBL transformation.Each image corresponds to (a, e) normalized data from dataset A, (b, f) outputs of a network trained on (a), (c, g) outputs of a network trained on (e), and (d, h) MFBL.

Fig. 12 (a) Example input image to explain the WIOA value calculation, (b) the Grad-CAM++ image of the corresponding input image, obtained at the deepest feature layer, (c) the bounding box of TB regions drawn by a radiologist, (d) the binarized Grad-CAM++ image after thresholding, (e) the weighted attention map, i.e., (b) multiplied with (d), and (f) the weighted intersection area, i.e., (e) multiplied with (c). The WIOA value was defined as the summation of the pixel values of (f) divided by that of (e).
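The steps in panels (b) to (f) can be sketched as follows. This is a minimal NumPy illustration and not the M3d-CAM implementation: the function names are hypothetical, the Otsu threshold is computed from a 256-bin histogram, the Grad-CAM++ map is assumed to be non-negative, and returning 0 when the weighted attention map is empty is an assumption.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Threshold that maximizes the between-class variance of a histogram."""
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)                 # count at or below each bin
    w1 = w0[-1] - w0                     # count above each bin
    s0 = np.cumsum(hist * centers)
    m0 = s0 / np.maximum(w0, 1)          # mean of the lower class
    m1 = (s0[-1] - s0) / np.maximum(w1, 1)  # mean of the upper class
    between = w0 * w1 * (m0 - m1) ** 2
    return centers[int(np.argmax(between))]

def wioa(cam, bbox_mask):
    """Weighted intersection over attention.

    cam:       2D non-negative saliency map (e.g., a Grad-CAM++ image)
    bbox_mask: 2D boolean map, True inside the radiologist's bounding boxes
    """
    att = cam > otsu_threshold(cam.ravel())   # binarized attention map ATT
    watt = cam * att                          # weighted attention map WATT
    wint = watt * bbox_mask                   # weighted intersection WINT
    denom = watt.sum()
    return float(wint.sum() / denom) if denom > 0 else 0.0
```

A value of 1 means that all of the thresholded saliency falls inside the bounding boxes, and a value of 0 means that none of it does.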

Table 1
Dataset information for lung segmentation network training.

Table 2
Multi-institutional dataset information.

Table 3
Network quantitative evaluation summary.

Table 4
Mahalanobis distances ($D_M$) for multi-site data homogenization. The target distribution was the reference train dataset. A larger distance indicates an outlier to the target distribution.

Table 5
TB detection network training details.

Table 6
Data preprocessing time. All measurements were made on an Intel Xeon® CPU E3-1270v5/GTX 1080 Ti platform.