Open Access
7 March 2023

UNet and MobileNet CNN-based model observers for CT protocol optimization: comparative performance evaluation by means of phantom CT images
Federico Valeri, Maurizio Bartolucci, Elena Cantoni, Roberto Carpi, Evaristo Cisbani, Ilaria Cupparo, Sandra Doria, Cesare Gori, Mauro Grigioni, Lorenzo Lasagni, Alessandro Marconi, Lorenzo Nicola Mazzoni, Vittorio Miele, Silvia Pradella, Guido Risaliti, Valentina Sanguineti, Diego Sona, Letizia Vannucchi, Adriana Taddeucci
Abstract

Purpose

The aim of this work is the development and characterization of a model observer (MO) based on convolutional neural networks (CNNs), trained to mimic human observers in image evaluation in terms of detection and localization of low-contrast objects in CT scans acquired on a reference phantom. The final goal is automatic image quality evaluation and CT protocol optimization to fulfill the ALARA principle.

Approach

Preliminary work was carried out to collect localization confidence ratings of human observers for signal presence/absence from a dataset of 30,000 CT images acquired on a polymethyl methacrylate (PMMA) phantom containing inserts filled with iodinated contrast media at different concentrations. The collected data were used to generate the labels for the training of the artificial neural networks. We developed and compared two CNN architectures, based on UNet and MobileNetV2, respectively, specifically adapted to achieve the double task of classification and localization. The CNN evaluation was performed by computing the area under the localization ROC curve (LAUC) and accuracy metrics on the test dataset.

Results

The mean absolute percentage error between the LAUCs of the human observer and the MO was found to be below 5% for the most significant test data subsets. A high inter-rater agreement was achieved in terms of the S-statistics and other common statistical indices.

Conclusions

Very good agreement was measured between the human observer and MO, as well as between the performance of the two algorithms. Therefore, this work is highly supportive of the feasibility of employing CNN-MO combined with a specifically designed phantom for CT protocol optimization programs.

1. Introduction

Computed tomography (CT) is one of the most well-established diagnostic tools in current medical imaging; it is capable of providing very detailed anatomical images of many biological tissues at one time due to its large dynamic range. With the widespread availability of CT equipment and the increasing number of patient examinations,1 the issues of quantifying the risks related to X-ray exposure, and consequently the need for further optimization of CT protocols to fulfill the ALARA ("As Low As Reasonably Achievable") principle, have arisen.2,3 The main international organizations dealing with ionizing radiation protection and safety standards, the International Commission on Radiological Protection (ICRP) and the International Atomic Energy Agency (IAEA), have provided patient dose management recommendations and have identified lacunae in justification and optimization, thus providing guidance and improving practice.4–7 In Directive 2013/59/EURATOM,8 the European Union Council stated the need to develop and put into action optimization programs to achieve the best compromise between radiation dose and image quality, with the aim of reducing patient dose to the minimum level compatible with diagnostic accuracy.

The choice of the optimum dose level requires the evaluation of CT image quality, which can be measured by receiver operating characteristic (ROC) analysis in reader studies in which trained medical staff perform a specific clinical task. Such an approach is especially suitable in the case of iterative reconstruction techniques because the standard physical quantities are no longer adequate for a thorough image quality assessment.9 However, the extremely high number of different CT protocols in use even within small radiological facilities makes the evaluation of all of the necessary ROC curves de facto impracticable, as it would require too much observation time from the medical staff. In the recent past, this fundamental limitation has been addressed by replacing human observers with algorithmic approaches (i.e., model observers); in particular, the channelized Hotelling observer (CHO) model10–12 demonstrated great potential, but it is still limited by poor generalization capability to different CT settings.13,14 The appreciable results obtained through such algorithmic methods have encouraged researchers to proceed by employing artificial intelligence (AI) algorithms, which are seemingly more powerful than the CHO.15–21 Recently, the increasing availability of computational resources has driven scientific research toward the latter approach, which has shown remarkable effectiveness in mimicking the human observers' performances in different diagnostic imaging evaluation tasks.17,21–24 To take into account the inefficiency and variability of human responses, several strategies have been proposed. Previously adopted approaches consist of the introduction of an internal noise component in the output statistics of convolutional neural networks (CNNs)13,16,23 and the use of human-labeled data for training.17,21,25 When actual patient CT data are used, the dose level dependency is commonly studied by introducing appropriate noise into the images.16,21–23,26–28

Within this context, our goal is to build a solid model observer (MO) framework based on CNNs that is capable of reproducing the performances of human observers in the identification of low contrast-to-noise ratio (CNR) objects in reference phantom CT images. Compared with the current state-of-the-art methods, our work is characterized by the concurrence of large dataset variability in terms of size and CNR of the imaged objects, CT acquisitions at eight different dose indices, two reconstruction techniques, and labels by a large group of 30 human observers, including 19 radiologists from four different radiological departments.

The use of a specifically designed phantom allowed for the collection of a large dataset of 30,000 images at various dose indices and under controlled acquisition conditions.

Two intrinsically different CNN architectures were optimized for the double task of localization and classification of low CNR objects within the phantom CT images to get insight into the relation between CNN behavior and architectures.

We performed an extended statistical analysis of the results to address the overall observers' performances in terms of the area under the localization ROC curve (LAUC), which is expected to be more accurate than the conventional AUC metric because it takes into account both localization and classification capabilities.29

Several statistical indices and the accuracy metric were computed to obtain a better understanding of the CNNs' response and the limitations of this AI approach.

The results are very promising: both approaches are capable of mimicking human detectability performance in phantom CT images. We believe that these CNN-based MOs, combined with specifically designed phantoms, may effectively support the optimization of CT protocols, avoiding the time-consuming limitations of medical staff evaluations.

2. Materials and Methods

2.1. Image Dataset

The annotated dataset used to train, validate, and test the proposed CNNs is a subset of the large dataset extensively described in our previous work.30 The dataset consists of CT images of a specifically manufactured polymethyl methacrylate (PMMA) phantom (Fig. 1) containing 10 cylindrical inserts of different diameters (3, 4, 5, 6, and 7 mm); each pair of inserts with the same diameter provides two different contrast values (45 and 55 HU) with respect to the PMMA background, obtained by filling the inserts with aqueous solutions of iodinated contrast media at two distinct concentrations. The phantom consists of three adjacent blocks, each with an ellipsoidal shape with a major axis of 31 cm, a minor axis of 21 cm, and a thickness of 7 cm: two blocks have five inserts each, and the third, with no inserts, provides homogeneous background images.

Fig. 1

Lateral view (a) and top view (b) of one of the two blocks containing five inserts filled with iodinated contrast media.


Acquisition was performed with a 128-slice CT scanner (Somatom Definition Flash, Siemens Healthcare) at eight different volumetric CT dose index settings (CTDIvol [mGy] = 4.4, 5.1, 6.0, 6.9, 7.8, 8.6, 9.6, and 10.2), with the following abdomen protocol selected: 120 kVp, AEC on, helical mode, pitch = 1, beam collimation = 38.4 mm, and slice thickness = 2 mm.

Both filtered back projection (FBP) and iterative reconstruction (IR, SAFIRE force 3) image reconstruction techniques were applied to the acquired data, with convolution kernels B41s and IF41s, respectively. The CT image reconstruction FoV (RFoV) was chosen to be 5 cm (512×512 pixels per image) to produce reconstructed images containing one single insert each. Images without inserts were also similarly reconstructed and added to the dataset. To further increase the dataset variability, data augmentation techniques consisting of 90-deg rotations and horizontal and vertical flips were applied to all images. Out of this large dataset, 30,000 images were selected as a tradeoff between having an adequate amount of training data for the CNNs and keeping acceptable the time spent collecting confidence scores and insert position coordinates (when detected) for each single image by visual inspection. On the basis of the knowledge acquired in a previous work,30 the selected subensemble was chosen as described in Table 1. It is worth pointing out that the dataset is not balanced in terms of diameters and contrasts of the imaged objects: the abundance of each imaged object type was selected according to its detectability (as quantified by the human LAUC analysis in Sec. 3): the larger the detectability, the smaller the subensemble; moreover, images containing inserts of 6 and 7 mm in diameter at the higher CNR were excluded because the visibility of such objects was too high across the entire CTDIvol range. A reference subset was chosen to evaluate the observers' performances; it consists of images containing 4-mm diameter objects, with contrast C = 45 HU and the iterative reconstruction (IR) algorithm: the CNR computed on this subset monotonically increases with CTDIvol from 1.9 to 3.1.
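As a minimal sketch, the augmentation step described above can be written with NumPy as follows (the exact set of variants retained per image is an assumption of this illustration):

```python
import numpy as np

def augment(image: np.ndarray) -> list:
    """Variants described in the text: 90-deg rotations plus
    horizontal and vertical flips of one reconstructed CT image."""
    rotations = [np.rot90(image, k=k) for k in range(1, 4)]  # 90, 180, 270 deg
    flips = [np.fliplr(image), np.flipud(image)]             # horizontal, vertical
    return [image] + rotations + flips
```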

Table 1

Selected subensembles from the original image dataset characterized by insert (object) size and contrast.

No. images    Object diameter d (mm)    Contrast C (HU)
10,000        Homogeneous images        —
3000          3                         45
3000          3                         55
3000          4                         45
3000          4                         55
2800          5                         45
2800          5                         55
1200          6                         45
1200          7                         45

For the human observers' image visualization step and the subsequent algorithm optimization steps, the dataset, originally reconstructed with a 512×512 matrix, was resized to 256×256 pixels per image to optimize computational resources, after testing to ensure that resizing the images did not affect the CNN performances.

Figure 2 shows an example set of reconstructed images of two inserts (4 and 7 mm diameter) at the lower contrast (45 HU) for different CTDIvol values. It is noticeable that the visibility of the inserts increases with CTDIvol due to the corresponding increase of CNR.

Fig. 2

Example of reconstructed images (iterative reconstruction technique) with (a) 4 mm and (b) 7 mm inserts at the lower contrast (45 HU); it is noticeable that object visibility depends both on size and CTDIvol.


2.2. Confidence Scores Collection

To collect the labels to train the MOs, the detection task was represented as a multiclass ranking task: an ordinal score was assigned by the human observer, corresponding to the confidence attributed to the presence (or absence) of the object, in a range from 0 to 3 (0 = object surely not present; 1 = object unlikely to be present; 2 = object likely to be present; 3 = object surely present). At the same time, the operator was asked to identify the location of the object (if the assigned score was not 0). A graphical Python-based interface was developed to automatically save the score and the coordinates assigned to each identified object by the human observers. A representation of the screen window generated by the software and presented to the operator for image evaluation is reported in Fig. 3.
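A minimal, hypothetical sketch of such an interface is shown below (matplotlib-based; the actual software, its widget layout, and its storage format are not described in detail here):

```python
import matplotlib.pyplot as plt

def collect_response(image, image_id, results):
    """Show one CT image; a key press in 0-3 records the confidence
    score and, for nonzero scores, a mouse click records the object
    coordinates (hypothetical sketch of the collection workflow)."""
    record = {"id": image_id, "score": None, "xy": None}

    def on_key(event):
        if event.key in ("0", "1", "2", "3"):
            record["score"] = int(event.key)
            if record["score"] == 0:          # object surely not present:
                plt.close(fig)                # no localization is requested

    def on_click(event):
        if record["score"] not in (None, 0) and event.inaxes is not None:
            record["xy"] = (event.xdata, event.ydata)
            plt.close(fig)

    fig, ax = plt.subplots()
    ax.imshow(image, cmap="gray")
    ax.set_title("Score: press 0-3; then click the object (if score > 0)")
    fig.canvas.mpl_connect("key_press_event", on_key)
    fig.canvas.mpl_connect("button_press_event", on_click)
    plt.show()
    results.append(record)
```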

Fig. 3

Screenshot of the software interface developed to collect the human observer response to CT images visual inspection.


A total of 30 human observers contributed, each visually examining 1000 images; following a strategy already proposed in previous works,12,16 both radiologists (20) and medical physicists (10) were included as evaluators to obtain a larger variability of evaluation performances and make the CNN-MOs more reliable and robust. Considering the simplicity of the task and of the dataset content, ingredients that risk inducing overfitting in CNN training, as well as the very time-consuming nature of the reader study, we decided to favor dataset size over multiple ratings of single images: there was no intersection between the subsets of 1000 images evaluated by each observer. However, for the same contrast, size, reconstruction technique, and CTDIvol, a multitude of images with similar properties (noise pattern and signal detectability) were generated from different slices of the same CT scan, among which the signal location was varied by data augmentation (see also Sec. 2.1).

2.3. Convolutional Neural Networks

Two specific CNNs were developed and optimized to perform the MO task: a UNet-based architecture and a MobileNetV2-based architecture.

Both CNNs were trained from scratch on the training dataset, which consisted of noisy images that previously underwent visual inspection, labeled with the corresponding confidence scores and coordinates assigned by the human observers. Despite a few recent applications,25 this labeling choice represents an alternative training strategy with respect to most of the MO algorithms reported in the literature,16,24,31 which are often based on an a posteriori correction of the output statistics of CNNs trained with impartial labels representing the actual presence and location of the object within the images.

To ensure a robust statistical analysis of the results, a fivefold cross-validation procedure was applied: five training experiments were carried out using randomly assigned train and test subsets (80% and 20%, respectively, for each experiment), as sketched below.
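A minimal sketch of this procedure, assuming scikit-learn is available and with build_and_train and evaluate as placeholders for the model-specific routines:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(images, labels, build_and_train, evaluate, seed=0):
    """Five experiments with disjoint random 20% test folds (hence
    80%/20% train/test splits per experiment); returns the mean metric
    and its standard error across the five experiments."""
    results = []
    splitter = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in splitter.split(images):
        model = build_and_train(images[train_idx], labels[train_idx])
        results.append(evaluate(model, images[test_idx], labels[test_idx]))
    results = np.asarray(results)
    return results.mean(), results.std(ddof=1) / np.sqrt(len(results))
```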

In the following, the developed CNNs are described in detail.

2.3.1. UNet-based architecture

The first architecture was designed for the MO tasks by customizing a UNet-based CNN, previously developed by the authors30 for denoising and segmentation of phantom CT images. The double-task strategy, successfully employed in the cited work, was implemented in this context to achieve object localization and confidence score prediction at the same time.

The UNet is a CNN based on an autoencoder architecture that is already well documented in the literature32–42; in particular, it has been employed for segmentation and localization tasks in the postprocessing of medical images and has already demonstrated high performance as an MO.42

The original architecture, consisting of a combination of max pooling, convolution, and fully connected layers, was reduced to a total of nine layers and four skip connections. A scheme of the UNet used is reported in Fig. 4 (architecture details and the layer sequence are reported in Fig. S1 in the Supplementary Material). At the end of the encoder stream, a dense layer is connected to produce a scalar output representing the confidence score prediction (implemented as a multiclass classification task). A mean square error loss $\mathrm{Loss}_{\mathrm{MO}}$ is implemented to let the CNN learn the scores given by the human observers (used as score labels for training).

Fig. 4

Schematic illustration of the developed UNet-based CNN architecture.


The decoder stream is fully devoted to the localization task. The idea behind the implementation of the localization task originated from the CNN architecture proposed by Newell et al.,43 in which concatenated autoencoders were used to estimate the pose of a human body through the generation of a series of heatmaps, one for each identified body joint. In our case, a single autoencoder is implemented to generate the heatmap corresponding to the object identified within the image. The heatmap is a 256×256 matrix whose maximum is assumed to be the prediction of the object center. Following Newell et al.,43 two additional losses are implemented and devoted to the localization task. The first one is a Kullback–Leibler divergence loss ($\mathrm{Loss}_{\mathrm{KLD}}$)44–48 between the heatmap produced by the network and the ground truth, represented by a 2D Gaussian (normalized to unity and with FWHM equal to the object diameter) centered at the coordinates picked by the human observers. In the case of images classified as "object surely not present" by the human observer (score = 0), the ground truth consists of a matrix filled with zeros. The second loss is a mean square error loss ($\mathrm{Loss}_{\mathrm{LOC}}$) between the predicted coordinates and those actually picked by the human observers (the latter being used as coordinate labels for training). The contribution of this loss was set to zero for images classified as "object surely not present" by the human observer.
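A minimal sketch of the ground-truth heatmap construction follows; here "normalized to unity" is read as unit peak, although a unit-sum normalization, more natural for a KLD loss, is an equally plausible reading:

```python
import numpy as np

FWHM_TO_SIGMA = 1.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # sigma = FWHM / 2.355

def gt_heatmap(center_xy, diameter_px, size=256):
    """2D Gaussian with FWHM equal to the object diameter, centered at
    the human-picked coordinates; an all-zero map when the score is 0."""
    if center_xy is None:                     # "object surely not present"
        return np.zeros((size, size), dtype=np.float32)
    sigma = diameter_px * FWHM_TO_SIGMA
    y, x = np.mgrid[0:size, 0:size]
    cx, cy = center_xy
    r2 = (x - cx) ** 2 + (y - cy) ** 2
    return np.exp(-r2 / (2.0 * sigma ** 2)).astype(np.float32)
```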

A weighted sum of the three losses was tuned during the optimization procedure for the final training as

Eq. (1)

$$\mathrm{LOSS}_{\mathrm{UNet}} = \mathrm{Loss}_{\mathrm{MO}} + 100\cdot\mathrm{Loss}_{\mathrm{KLD}} + 0.1\cdot\mathrm{Loss}_{\mathrm{LOC}}.$$

It is worth noting that a specific function based on the softmax operation was built to compute the maximum of the heatmap by means of a differentiable function, an essential requirement for backpropagation to occur properly during CNN training.49
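One standard way to realize such a differentiable maximum is a soft-argmax; the sketch below (TensorFlow, with the sharpness factor beta and static heatmap sizes as assumptions) illustrates the idea:

```python
import tensorflow as tf

def soft_argmax(heatmap, beta=100.0):
    """Differentiable surrogate of argmax on a batch of (H, W) heatmaps:
    a sharp softmax yields quasi one-hot weights, and the weighted average
    of the pixel coordinates approximates the position of the maximum."""
    h, w = heatmap.shape[-2], heatmap.shape[-1]          # static sizes (e.g., 256)
    flat = tf.reshape(heatmap, [-1, h * w])
    weights = tf.nn.softmax(beta * flat, axis=-1)        # quasi one-hot, differentiable
    ys = tf.cast(tf.repeat(tf.range(h), w), tf.float32)  # row index of each pixel
    xs = tf.cast(tf.tile(tf.range(w), [h]), tf.float32)  # column index of each pixel
    cy = tf.reduce_sum(weights * ys, axis=-1)
    cx = tf.reduce_sum(weights * xs, axis=-1)
    return tf.stack([cx, cy], axis=-1)                   # predicted (x, y) centers
```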

The batch size is 48, the learning rate is 0.0001, and the Adam algorithm is employed as the optimizer.

2.3.2. MobileNetV2-based architecture

The second strategy is based on the MobileNetV2 architecture,50 whose complexity was reduced: we used fewer convolution layers than in the original architecture, i.e., only the first 11 layers, up to the layer called block_3_depthwise_relu, during the optimization procedure to limit overfitting. MobileNetV2 has already been exploited in the medical imaging field, mostly for the classification and detection of lesions,51–56 and recently it was successfully implemented for COVID-19 diagnosis.57–60
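A minimal sketch of this truncation with the Keras implementation of MobileNetV2 (trained from scratch, hence weights=None; the three-channel input, e.g., a replicated grayscale image, is an assumption of this sketch):

```python
import tensorflow as tf

def truncated_mobilenetv2(input_shape=(256, 256, 3)):
    """Keep the MobileNetV2 front end up to block_3_depthwise_relu."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=None)
    cut = base.get_layer("block_3_depthwise_relu").output
    return tf.keras.Model(inputs=base.input, outputs=cut)

backbone = truncated_mobilenetv2()  # output: e.g., (None, 32, 32, 144) feature maps
```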

Two different MobileNetV2-based CNNs are implemented for the prediction of the confidence score (represented as a multiclass classification task) and of the object coordinates; their architectures are reported in Figs. 5 and 6, respectively. Two distinct CNNs were built because their architectures are not exactly identical: they differ in the final two layers, and the input data in the training phase have different sizes in the two cases.

1. The CNN devoted to the classification task (Fig. 5) takes a CT image as input and, after the 11th layer of the original MobileNetV2, ends with a global average pooling layer followed by a densely connected layer, consisting of four units and a softmax activation function, to predict the confidence score of the human observers. The sparse categorical cross-entropy function is used as the loss during the training phase.

2. The CNN devoted to the localization task (Fig. 6) is trained using 48×48 images, obtained by cropping the original 256×256 images around the coordinates picked by the human observers (or random coordinates when the assigned score is 0). The crop size is chosen to be large enough to include the largest insert diameter (7 mm = 36 pixels). After the 11th layer of the original MobileNetV2, an average pooling layer (pool size 4, stride 1, no padding) followed by a convolutional layer (1 filter, 3×3 kernel size, and linear activation function) produces a real number as output, which is then rounded to the closest integer. In this training phase, a mean squared error loss is used to predict the confidence score.

Once the training is completed, a convolutional implementation of the sliding window approach61 is used in the test phase to predict the object coordinates: the trained CNN takes the original 256×256 images as input and produces a 27×27 heatmap as output. Each pixel of the heatmap corresponds to a delimited region of the input test image and represents the probability that the object is located in the center of that region; the position of the probability maximum provides the predicted coordinates (a minimal sketch of this inference scheme is given after this list).
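A minimal sketch of this fully convolutional sliding-window scheme, under the same assumptions as above: with a size-agnostic input, a 48×48 crop shrinks to a single output value (training), whereas a 256×256 image yields a 27×27 heatmap (testing), so the same weights serve both phases. The stride factor of 8 used to map heatmap indices back to image pixels is an approximation of this sketch.

```python
import numpy as np
import tensorflow as tf

def localization_net():
    """Truncated MobileNetV2 front end, average pooling (pool 4, stride 1,
    no padding), and one 3x3 single-filter linear convolution; the input
    size is left free so that training crops and full test images can
    both be processed."""
    inp = tf.keras.Input(shape=(None, None, 3))
    base = tf.keras.applications.MobileNetV2(
        input_tensor=inp, include_top=False, weights=None)
    x = base.get_layer("block_3_depthwise_relu").output
    x = tf.keras.layers.AveragePooling2D(pool_size=4, strides=1, padding="valid")(x)
    out = tf.keras.layers.Conv2D(1, kernel_size=3, activation="linear")(x)
    return tf.keras.Model(inp, out)

def predict_center(model, image_256):
    """Each heatmap pixel scores one region of the input image; the position
    of the maximum, mapped back through the network stride, gives the
    predicted object coordinates (approximate)."""
    heatmap = model.predict(image_256[None, ...])[0, :, :, 0]   # (27, 27)
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return 8 * ix, 8 * iy
```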

Fig. 5

Schematic illustration of the MobileNetV2-based CNN architecture used for the classification task.


Fig. 6

Schematic illustration of the MobileNetV2-based CNN architecture developed for the localization task.


The batch size is 32, the learning rate is 0.001, and the Adam algorithm is employed as the optimizer.

2.4. CNNs Evaluation

Performance statistics were computed on each of the five experiments (fivefold cross validation) mentioned above and then averaged to get the final statistics and associated standard errors.

The performances of the human observer and the MOs in detecting and localizing the object in each image were evaluated by complementary approaches that emphasize different aspects of the CNNs' behaviors. The method most widely adopted in clinical practice is receiver operating characteristic (ROC) analysis.62,63

The ROC curve shows the tradeoff between sensitivity (the true positive rate, TPR) and the false positive rate (FPR, equal to 1 − specificity), thus measuring the performance of a classification model. In the case of a double detection-localization task, each image is classified as true positive (TP), true negative (TN), false positive (FP), or false negative (FN) by taking the localization accuracy into consideration. The resulting curve is the localization ROC (LROC).

The computation of the LROC requires choosing an upper threshold for the distance between the actual center of the contrast object and the location indicated by the observers (human and model) to discriminate between correct and incorrect localizations.12
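A minimal sketch of this construction (standard LROC bookkeeping, not the authors' actual code; the score thresholds and array conventions are assumptions):

```python
import numpy as np

def localization_ok(pred_xy, true_xy, threshold_mm, mm_per_px):
    """Correct when the indicated location falls within the
    diameter-dependent threshold distance (Table 2) of the actual
    insert center."""
    d_px = np.hypot(pred_xy[0] - true_xy[0], pred_xy[1] - true_xy[1])
    return d_px * mm_per_px <= threshold_mm

def lroc_points(ratings, has_signal, loc_ok):
    """One (FPF, TPF) operating point per confidence threshold: a true
    positive requires both a rating at or above threshold and a correct
    localization; false positives come from signal-absent images."""
    points = [(1.0, np.mean(loc_ok[has_signal]))]       # everything called positive
    for t in (1, 2, 3):                                 # score thresholds
        called = ratings >= t
        tpf = np.mean(called[has_signal] & loc_ok[has_signal])
        fpf = np.mean(called[~has_signal])
        points.append((fpf, tpf))
    return points
```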

An accurate analysis of the distribution of the human observers' localization responses, reported in Fig. S2 in the Supplementary Material, was carried out to establish the threshold distance values for the different insert diameters (summarized in Table 2). The knee algorithm was used to accurately determine these threshold values.64

Table 2

Selected thresholds for the LROC curves computation for different insert diameters.

Insert diameter (mm)    3      4      5      6      7
Threshold (mm)          2.3    2.3    2.5    3.0    3.5

The LROC curve was calculated for different image subsets, each characterized by one fixed parameter, to highlight the dependence of the observer capability on that parameter (i.e., object contrast, diameter, image reconstruction technique, and CTDIvol).

Then, the area under the LROC curve (LAUC), a measure of the overall detection performance of the observers, was calculated for each object size and contrast, reconstruction technique, and CTDIvol and was averaged over the five cross-validation experiments conducted on the training dataset (see Sec. 2.3), with the associated standard deviation.

The differences between the LAUC curve of the human observer and those of the MOs are evaluated by the mean absolute percentage error (MAPE),65 which is a measure of prediction accuracy. The MAPE is calculated according to the following formula:

Eq. (2)

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|\mathrm{LAUC}_i^{\mathrm{Model}} - \mathrm{LAUC}_i^{\mathrm{Human}}\right|}{\mathrm{LAUC}_i^{\mathrm{Human}}}\cdot 100,$$
where $i$ is an index for the CTDIvol level and $N=8$ is the total number of CTDIvol levels.
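In code, Eq. (2) reduces to a few lines (a sketch; the inputs are the eight per-CTDIvol LAUC values):

```python
import numpy as np

def mape(lauc_model, lauc_human):
    """Mean absolute percentage error of Eq. (2), with one LAUC value
    per CTDIvol level (N = 8) for the model and the human observer."""
    m = np.asarray(lauc_model, dtype=float)
    h = np.asarray(lauc_human, dtype=float)
    return float(np.mean(np.abs(m - h) / h) * 100.0)
```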

The LAUC analysis was complemented by the evaluation of inter-rater indices,66 which quantify the level of agreement between two or more evaluators of the same observed situations. The multirater Krippendorff's alpha67 (with an interval level of measurement), the intraclass correlation coefficient (ICC,68 single random rating), the widespread Cohen's kappa, and the more robust S-statistics69,70 were estimated to compare the agreement between the models and the human observers. The first two indices are conventional statistical indices that, however, suffer from a limitation related to unbalanced datasets. The kappa and S-statistics are normalized to the baseline of random chance: they describe how much better a classifier performs than one that simply guesses at random according to the frequency of each class. A reference table with interpretation guidelines for the considered indices is given in Table S1 in the Supplementary Material.71–73
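As an illustration, Cohen's kappa is available in scikit-learn, and the S-statistics can be computed directly from the observed agreement (a sketch; Krippendorff's alpha and the ICC are assumed to come from dedicated packages such as krippendorff and pingouin):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_and_s(human_scores, model_scores, n_classes=4):
    """Cohen's kappa (frequency-based chance baseline) and Bennett's S
    statistic, S = (k*Po - 1)/(k - 1), which corrects the observed
    agreement Po against a uniform 1/k chance baseline."""
    human = np.asarray(human_scores)
    model = np.asarray(model_scores)
    kappa = cohen_kappa_score(human, model)
    po = np.mean(human == model)                       # observed agreement
    s_stat = (n_classes * po - 1.0) / (n_classes - 1.0)
    return kappa, s_stat
```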

Moreover, the accuracy metric was computed to address the performance of the trained CNNs for confidence score and localization prediction separately. Accuracy, defined as the ratio between the number of correct predictions and the total number of human-evaluated images,74 was calculated as a function of the relevant image parameters (diameter, contrast, CTDIvol, and reconstruction technique).

In the case of localization accuracy, only the images containing the low-contrast object and having a score >0 were analyzed. The same threshold distance values used to discriminate the true positive localizations in the LROC computation (see Table 2) were applied to evaluate the localization accuracy.

3. Results

As a preliminary analysis, the performance of the human observers in the task of identifying low-contrast objects within the images was evaluated in terms of the LAUCs reported in Fig. 7, as a function of CTDIvol, for the two reconstruction techniques (FBP top panel, IR bottom panel) and for the different contrast values C, expressed in terms of the HU difference from the PMMA background (C = 45 HU left panel, C = 55 HU right panel). Within each panel in Fig. 7, different curves refer to images with inserts of different diameters.

Fig. 7

Human observer performances quantified by LAUC versus CTDIvol, for different object sizes, contrasts (left: 45 HU, right: 55 HU), and reconstruction techniques (a) FBP and (b) IR.


As expected, the human observer performance improves as CTDIvol increases, due to the CNR increasing at larger radiation dose values. The detectability of the smallest objects (3 mm diameter) is poor for both contrasts C and remains below 90% even at high CTDIvol. The noise in the CT images is correlated, which means that the noise at any point of the image is affected to some extent by the noise values of the neighboring points. Small objects are affected the most by noise correlations. The correlation distance calculated for the highest CTDIvol images in our dataset, following Refs. 75 and 76, is about 0.7 mm, which can significantly change the appearance of the 3 mm inserts (1.5 mm radius) and thus make them difficult to detect even at high radiation doses.

The curve related to the 4 mm diameter objects shows a significant increase with CTDIvol, and saturation of the human observer performances, corresponding to LAUC approaching 1, is reached above 6 to 7 mGy of CTDIvol, though this slightly depends on the contrast and reconstruction technique: objects in IR reconstructed images are better recognized than in FBP reconstructed images.

In the case of inserts with diameters >4 mm, saturation of the human observer performance occurs even at low CTDIvol, with LAUC values always above 80%. This result justifies the preliminary selection of the number of images for each contrast and size reported in Table 1, in which the more populated subsets are those relative to the inserts of 3 and 4 mm diameter.

Additional statistical analysis was performed to evaluate the differences among the human observers and between the two professional categories that took part in the visual inspection of the CT dataset: radiologists and medical physicists. The LAUC computed for the two classes, reported in Fig. S3 in the Supplementary Material, shows slightly higher performances for the radiologists, especially in the case of the least visible inserts (those of 3 mm diameter). This difference would, of course, be much more significant in the case of complex images, but this work does not involve diagnostic images: we aim to exploit the advantages of using a simple phantom, which can be acquired under different user-defined CT settings (such as protocols and CTDIvol) and on different CT scanners.

Given the above assertions, for the optimization of CNNs, which are notoriously affected by overfitting and by biases due to limited data selection, the increased dataset and label variability can be considered an added value, given that the significant results of this research originate from the analysis of the inserts >3 mm in diameter (and, in particular, of the reference dataset).

The performances of the trained convolutional neural networks were quantified by computing the LAUC as well. The comparison of the LAUCs extracted from the whole test dataset among the three observers (two CNNs and the human observer) is shown in Fig. 8, with associated standard errors, for the two reconstruction techniques (FBP left panel, IR right panel). A very good agreement between the two CNNs and the human observer is noticeable, especially in the case of IR reconstruction. At the highest CTDIvol (10 mGy), the LAUC values of MobileNet show a decrease in performance, an anomalous trend that is also present in some LAUC curves of the human observer in the case of signals with 3 mm diameter (see Fig. 7). This trend is under investigation, and further analyses are under way, including a CNR evaluation of the image dataset. Our first hypothesis is that it may be related to the MobileNet noise modeling.

Fig. 8

Overall MO and human observer performances quantified by LAUC versus CTDIvol, for the two reconstruction techniques (a) FBP and (b) IR, with associated standard errors.


According to the previous considerations on the human observer performances, a reference LAUC was selected to evaluate the agreement between the human observer and the MOs: the one extracted from images containing 4 mm inserts at the lower contrast (C = 45 HU), which is the most representative of the human observer behavior as a function of CTDIvol. In addition, the iterative reconstruction technique, being the algorithm most commonly used by clinicians in CT protocols, was selected as the reference.

Figure 9 shows the LAUC values for the three observers as a function of CTDIvol in the case of the reference image subset (4 mm insert, C = 45 HU, IR reconstruction). The LAUC comparison in the case of the full dataset, i.e., at different diameters, contrasts, and reconstruction techniques, is reported in Figs. S4 and S5 in the Supplementary Material. A very good agreement between the observers is qualitatively noticeable.

Fig. 9

Comparison of human observer and MO LAUCs versus CTDIvol for the images with an object size of 4 mm, C=45  HU, and IR reconstruction, with associated standard errors.


To assess the utility of the full training dataset, we performed two additional CNN experiments using reduced datasets, excluding the 6 to 7 mm inserts and the 3 mm inserts, respectively. The results of these new experiments are reported in the Supplementary Material (Figs. S8–S10), and they support the evidence that the dataset variability and size are essential and contribute to the success of the training as a whole.

In the following, the level of agreement among the observers is quantitatively addressed by means of appropriate metrics and statistical indices. The MAPE evaluated between the LAUC of the human observer and the LAUCs of the two MOs is summarized in Table 3 for the full IR dataset, the full FBP dataset, and the reference subset. Excellent agreement is found between the trained CNNs and the human observer, with an MAPE below 2% in the case of IR reconstruction, slightly above 2% in the case of FBP reconstruction, and in general below 5% when considering all of the image subsets, each related to a different relevant parameter (diameter, contrast, and reconstruction technique), as reported in Tables S2 and S3 in the Supplementary Material. An exception is represented by the 3 mm inserts, which are barely recognizable by either the human observer or the MO.

Table 3

MAPE (%) between the human observer and MO LAUCs for the full IR and FBP datasets and a representative case (Fig. 9).

CNN            FBP     IR      4 mm, IR, C = 45 HU
UNet           2.36    1.43    1.19
MobileNetV2    2.49    1.52    2.48

The accuracy metric, as defined in Sec. 2.4, was evaluated separately for the localization and score prediction tasks: the results of the analysis as a function of the different variables (insert diameter, contrast C, CTDIvol, and reconstruction technique) are reported in Figs. 10 and 11, respectively.

Fig. 10

MOs localization accuracy metric versus each of the independent parameters, from left to right: insert diameter, insert contrast, CTDIvol, and reconstruction techniques.


Fig. 11

MOs score prediction accuracy metric versus each of the independent parameters, as in Fig. 10.


By looking at Figs. 10 and 11, it is noticeable that the UNet is able to localize slightly better than the MobileNetV2, whereas the latter classifies with a slightly higher overall accuracy than the UNet. This behavior appears consistent with the intrinsic character of the two CNNs, which were initially designed, as reported in the literature, for localization/segmentation and classification tasks, respectively.

The naively expected trends can be observed: the accuracy in both tasks increases with object size (insert diameter), contrast C, and CTDIvol (radiation dose) and is higher for IR reconstruction.

In general, the localization accuracy is above 80%, and score prediction accuracy is well above 50%; once again an exception occurs for those images containing the 3 mm inserts.

Furthermore, other common multiclass statistical indices were computed to address the inter-rater agreement in the score prediction task. The values of Cohen's kappa, S-statistics, Krippendorff's alpha, and ICC, evaluated for the whole image dataset, are summarized in Table 4. The Cohen's kappa and S-statistics77 indices show fair to good agreement between the MO and human observer score predictions, whereas the alpha and ICC indices show good to excellent agreement (see also Table S1 in the Supplementary Material).

Table 4

Human-model inter-raters statistical indices over the entire dataset.

CNN            Cohen's kappa    S-statistics    Krippendorff's alpha    ICC
UNet           0.50             0.56            0.77                    0.77
MobileNetV2    0.53             0.57            0.83                    0.83

Moreover, consistent with the previous LROC analysis, a strong correlation between the S-statistics and CTDIvol is found for both CNNs, as shown in Figs. S6 and S7 in the Supplementary Material. These plots indicate that, when increasing CTDIvol and the insert diameter, the ability of the CNNs to predict the scores in agreement with the human observer increases as well. A very low value of the S-statistics and no correlation with CTDIvol are found in the case of the 3 mm inserts: the very poor CNR in those images misleads the CNNs because they cannot learn from the human observers' answers, which are rather imprecise (see also Fig. 7). If the images containing 3 mm inserts are excluded from the dataset used to compute the S-statistics, the index values are found to be 0.64 for the UNet and 0.66 for the MobileNet, indicating substantial agreement.

4. Discussion

In this work, we developed and characterized MOs based on artificial intelligence for the automatic quality evaluation of phantom CT images. Two CNNs were trained to mimic the human observers' image assessment in terms of object detection and localization in CT images acquired on a specifically designed and manufactured phantom.

First, we collected a large dataset of phantom CT images containing objects of different sizes and contrasts, acquired at different CTDIvol settings and reconstructed by means of different techniques (FBP and IR). The dataset was initially submitted to visual evaluation by human observers to collect the labels necessary for algorithm training and testing. In this way, the labels fully reflect the human observers' interpretation of the CT images, regardless of the correctness of that interpretation, and no internal noise component is necessary to calibrate the CNN-MO on the average human performance.

To verify the viability of our ultimate goal, i.e., CT protocol optimization by means of CNN-MOs in a way that is almost independent of the chosen CNN, we implemented two different architectures: UNet and MobileNetV2 were originally built and optimized for the tasks of segmentation and classification, respectively. To this purpose, the relation between CNN architecture and performance was also investigated. We found that the two models performed quite similarly, suggesting that there are no critical aspects preventing their application as MOs.

In the case of human observers, the confidence score (classification task) and the localization task are intrinsically interconnected and cannot be disentangled in the image evaluation process, whereas the CNNs need to be specifically trained to carry out the two tasks, which are partially independent, using different loss functions. The accuracy metrics (Figs. 10 and 11) and the inter-rater agreement statistics (Table 4) show a trend in accordance with this observation: the two CNNs perform differently in the two tasks of classification (MobileNet is slightly superior) and localization (UNet is slightly superior), as expected. However, when the predictions of the two tasks are combined, both CNN-MOs achieve very good overall performances, measured in terms of the LAUC metric. This result supports the robustness of the proposed approach and its fair independence of the CNN used. The quality of the trained CNNs was quantified by several statistical indices describing the inter-rater agreement between the MOs and the human observer in the confidence score task. The statistics computed on the full dataset, excluding the 3 mm inserts, whose human detectability is affected by strong noise correlation, give values of the robust S-statistics above 0.64 for both CNNs, indicating good general CNN performances.

The evaluation of the overall performance of the proposed algorithms in reproducing the human observer response was carried out by computing the LAUC, a more accurate metric than the AUC because it takes into consideration both localization and classification capabilities.29 The MAPE calculated between the LAUCs extracted from the human observer and MO responses was found to be below 2.5%, with slightly higher performances in the case of IR-reconstructed images. In addition to the LAUC averaged on the full dataset, we chose a reference subset of images (4 mm insert, C = 45 HU, IR reconstruction) reflecting a significant trend as a function of CTDIvol: the LAUC extracted from the human observer data of the reference subset (Fig. 9) covers a wide range of values, showing poor detectability performances at low CTDIvol and then rising until saturation. This curve is a suitable starting point for developing an optimization strategy for the current CT protocol: the value of 6 mGy can be considered the optimum CTDIvol, above which there is no increase in detectability performance, and thus it is reasonable to expect a plateau in diagnostic accuracy as well.

In this work, we demonstrated the viability of an image quality assessment approach based on phantom acquisitions and CNN-MOs, which has a remarkable potential for improvements toward the final goal of dose optimization.

There are several pitfalls and perspectives to consider, and we acknowledge several limitations in this study that we plan to address in future works. To finally achieve and implement a CT optimization program, a much more variable CT image dataset, acquired by different CT scanners with well-defined setting parameters of a chosen CT protocol, is needed. From this perspective, the proposed algorithms need to be retrained on the new dataset after the collection of new human-labeled data. However, we believe that, given the potential of deep learning methods, this effort, which represents the next step of the ongoing research, combined with the high generalization capability of the algorithms, will make it possible to avoid repeating the time-consuming reader studies for each protocol. Other limitations that we plan to address in future studies are the decrease of the MobileNet performances at the highest CTDIvol, which may be correlated to the CNR and/or the noise modeling by the CNN, and the optimization of the phantom design in terms of CNR, which in turn affects the objects' detectability.

5. Conclusion

In this work, we have developed and investigated the applicability of two MO algorithms based on CNNs trained to mimic human observer performances in a phantom CT image detection task. We have demonstrated that two very different AI algorithms are both able to achieve very good results, thus indicating that the proposed approach is robust and fairly independent of the CNN used.

The positive results encourage the continued exploitation of the proposed methodology toward an automatic image quality assessment based on the evaluation of CT images acquired on a specifically designed phantom. This should foster a systematic optimization, and possible standardization, of the large number of CT protocols currently used in radiological facilities, with the final goal of reaching the best tradeoff between radiation dose and image quality, an issue of utmost relevance in diagnostic radiology, as emphasized by the international organizations (ICRP, IAEA, EURATOM) focused on ionizing radiation risk and radiological protection.

Disclosures

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors are very grateful to the medical staff members of the Radiology Departments who evaluated the CT image dataset: Careggi University Hospital (Director Dr. Vittorio Miele) and Santa Maria Nuova Hospital (Director Dr. Roberto Carpi) in Florence; San Jacopo Hospital in Pistoia (Director Dr. Letizia Vannucchi); and Santo Stefano Hospital in Prato (Director Dr. Maurizio Bartolucci). The authors acknowledge the support and the CT resources provided by the Radiology Department of the Careggi University Hospital in Florence, Italy. The authors also acknowledge the Physics Department of Florence University and UNISER (Polo Pluridisciplinare, Pistoia e Pescia, Italy) for allowing the use of the computational resources essential for the neural network training and optimization tasks.

References

1. 

European Commission, “Medical radiation exposure of the European population,” Rad. Prot., 180 1 –181 (2015). Google Scholar

2. 

International Commission On Radiological Protection (ICRP), “ICRP PUBLICATION 26: 1977 recommendations of the international commission on radiological protection,” Ann. ICRP, 1 (3), 1 –77 ANICD6 0146-6453 (1977). Google Scholar

3. 

W. R. Hendee and F. M. Edwards, “ALARA and an integrated approach to radiation protection,” Semin. Nucl. Med., 16 (2), 142 –150 https://doi.org/10.1016/S0001-2998(86)80027-7 SMNMAB 0001-2998 (1986). Google Scholar

4. 

M. Rehani, “ICRP and IAEA actions on radiation protection in computed tomography,” Ann. ICRP, 41 (3-4), 154 –160 https://doi.org/10.1016/j.icrp.2012.06.029 ANICD6 0146-6453 (2012). Google Scholar

5. 

M. Rehani, “Managing patient dose in computed tomography,” (2000). Google Scholar

6. 

M. Rehani, “Managing patient dose in multi-detector computed tomography (MDCT),” (2007). Google Scholar

7. 

M. Rehani, “Dose reduction in CT while maintaining diagnostic confidence: a feasibility/demonstration study,” (2009). Google Scholar

8. 

The Council of the European Union, “Council directive 2013/59/EURATOM,” (2014). Google Scholar

9. 

J. Vaishnav et al., “Objective assessment of image quality and dose reduction in CT iterative reconstruction,” Med. Phys., 41 (7), 071904 https://doi.org/10.1118/1.4881148 MPHYA6 0094-2405 (2014). Google Scholar

10. 

L. Noferini et al., “CT image quality assessment by a channelized Hotelling observer (CHO): application to protocol optimization,” Phys. Med., 32 1717 –1723 https://doi.org/10.1016/j.ejmp.2016.11.002 (2016). Google Scholar

11. 

D. Racine et al., “Objective assessment of low contrast detectability in computed tomography with channelized Hotelling observer,” Phys. Med., 32 76 –83 https://doi.org/10.1016/j.ejmp.2015.09.011 (2016). Google Scholar

12. 

S. Leng et al., “Correlation between model observer and human observer performance in CT imaging when lesion location is uncertain,” Med. Phys., 40 (8), 081908 https://doi.org/10.1118/1.4812430 MPHYA6 0094-2405 (2013). Google Scholar

13. 

M. Han, B. Kim and J. Baek, “Human and model observer performance for lesion detection in breast cone beam CT images with the FDK reconstruction,” PLoS One, 13 1 –16 https://doi.org/10.1371/journal.pone.0194408 POLNCL 1932-6203 (2018). Google Scholar

14. 

M. Han et al., “Investigation on slice direction dependent detectability of volumetric cone beam CT images,” Opt. Express, 24 3749 –3764 https://doi.org/10.1364/OE.24.003749 OPEXFF 1094-4087 (2016). Google Scholar

15. 

H. Gong et al., "Deep-learning-based model observer for a lung nodule detection task in computed tomography," J. Med. Imaging, 7 (4), 042807 https://doi.org/10.1117/1.JMI.7.4.042807 JMEIET 0920-5497 (2020). Google Scholar

16. 

H. Gong et al., “A deep learning- and partial least square regression-based model observer for a low-contrast lesion detection task in CT,” Med. Phys., 46 (5), 2052 –2063 https://doi.org/10.1002/mp.13500 MPHYA6 0094-2405 (2019). Google Scholar

17. 

F. Kopp et al., “CNN as model observer in a liver lesion detection task for x-ray computed tomography: a phantom study,” Med. Phys., 45 (10), 4439 –4447 https://doi.org/10.1002/mp.13151 MPHYA6 0094-2405 (2018). Google Scholar

18. 

F. H. Reith and B. A. Wandell, “Comparing pattern sensitivity of a convolutional neural network with an ideal observer and support vector machine,” (2019). Google Scholar

19. 

W. Zhou, H. Li and M. Anastasio, "Approximating the ideal observer and Hotelling observer for binary signal detection tasks by use of supervised learning methods," IEEE Trans. Med. Imaging, 38, 2456 –2468 https://doi.org/10.1109/TMI.2019.2911211 ITMID4 0278-0062 (2019). Google Scholar

20. 

M. Alnowami et al., "A deep learning model observer for use in alternative forced choice virtual clinical trials," Proc. SPIE, 10577 105770Q https://doi.org/10.1117/12.2293209 (2018). Google Scholar

21. 

F. Massanes and J. Brankov, "Evaluation of CNN as anthropomorphic model observer," Proc. SPIE, 10136 101360Q https://doi.org/10.1117/12.2254603 (2017). Google Scholar

22. 

C. Castella et al., “Mass detection on mammograms: influence of signal shape uncertainty on human and model observers,” J. Opt. Soc. Am. A, 26 425 –436 https://doi.org/10.1364/JOSAA.26.000425 JOAOD6 0740-3232 (2009). Google Scholar

23. 

Y. Zhang, B. T. Pham and M. P. Eckstein, “Evaluation of internal noise methods for Hotelling observer models,” Med. Phys., 34 (8), 3312 –3322 https://doi.org/10.1118/1.2756603 MPHYA6 0094-2405 (2007). Google Scholar

24. 

M. Han and J. Baek, “A convolutional neural network-based anthropomorphic model observer for signal-known-statistically and background-known-statistically detection tasks,” Phys. Med. Biol., 65 225025 https://doi.org/10.1088/1361-6560/abbf9d PHMBA7 0031-9155 (2020). Google Scholar

25. 

F. Kopp et al., “CNN as model observer in a liver lesion detection task for x-ray computed tomography: a phantom study,” Med. Phys., 45 4439 –4447 https://doi.org/10.1002/mp.13151 MPHYA6 0094-2405 (2018). Google Scholar

26. 

G. Kim et al., “A convolutional neural network-based model observer for breast CT images,” Med. Phys., 47 (4), 1619 –1632 https://doi.org/10.1002/mp.14072 MPHYA6 0094-2405 (2020). Google Scholar

27. 

R. D. Man et al., "Comparison of deep learning and human observer performance for detection and characterization of simulated lesions," J. Med. Imaging, 6 (2), 025503 https://doi.org/10.1117/1.JMI.6.2.025503 JMEIET 0920-5497 (2019). Google Scholar

28. 

I. Lorente, C. K. Abbey and J. G. Brankov, "Understanding CNN based anthropomorphic model observer using classification images," Proc. SPIE, 11599 115990C https://doi.org/10.1117/12.2581121 PSISDG 0277-786X (2021). Google Scholar

29. 

R. Swensson, “Unified measurement of observer performance in detecting and localizing target objects on images,” Med. Phys., 23 1709 –1725 https://doi.org/10.1118/1.597758 MPHYA6 0094-2405 (1996). Google Scholar

30. 

S. Doria et al., “Addressing signal alterations induced in CT images by deep learning processing: a preliminary phantom study,” Phys. Med., 83 88 –100 https://doi.org/10.1016/j.ejmp.2021.02.022 (2021). Google Scholar

31. 

H. Gong et al., “Deep-learning model observer for a low-contrast hepatic metastases localization task in computed tomography,” Med. Phys., 49 (1), 70 –83 https://doi.org/10.1002/mp.15362 MPHYA6 0094-2405 (2022). Google Scholar

32. 

R. Yang and Y. Yu, “Artificial convolutional neural network in object detection and semantic segmentation for medical imaging analysis,” Front. Oncol., 11 638182 https://doi.org/10.3389/fonc.2021.638182 FRTOA7 0071-9676 (2021). Google Scholar

33. 

Y. Weng et al., “NAS-Unet: Neural architecture search for medical image segmentation,” IEEE Access, 7 44247 –44257 https://doi.org/10.1109/ACCESS.2019.2908991 (2019). Google Scholar

34. 

X. Li et al., “H-DenseUNet: Hybrid densely connected unet for liver and tumor segmentation from CT volumes,” IEEE Trans. Med. Imaging, 37 (12), 2663 –2674 https://doi.org/10.1109/TMI.2018.2845918 ITMID4 0278-0062 (2018). Google Scholar

35. 

S. Qamar et al., “A variant form of 3d-UNet for infant brain segmentation,” Future Gener. Comput. Syst., 108 613 –623 https://doi.org/10.1016/j.future.2019.11.021 FGSEVI 0167-739X (2020). Google Scholar

36. 

J. Tian et al., “Automatic couinaud segmentation from CT volumes on liver using GLC-UNet,” Lect. Notes Comput. Sci., 11861 274 –282 https://doi.org/10.1007/978-3-030-32692-0_32 LNCSD9 0302-9743 (2019). Google Scholar

37. 

A. Lou, S. Guan and M. Loew, "DC-UNet: rethinking the U-Net architecture with dual channel efficient CNN for medical image segmentation," Proc. SPIE, 11596 758 –768 https://doi.org/10.1117/12.2582338 PSISDG 0277-786X (2021). Google Scholar

38. 

J. Dolz, C. Desrosiers and I. Ben Ayed, “IVD-Net: intervertebral disc localization and segmentation in MRI with a multi-modal UNet,” Lect. Notes Comput. Sci., 11397 130 –143 https://doi.org/10.1007/978-3-030-13736-6_11 LNCSD9 0302-9743 (2019). Google Scholar

39. 

P. Ahmad et al., “Context aware 3D UNet for brain tumor segmentation,” Lect. Notes Comput. Sci., 12658 207 –218 https://doi.org/10.1007/978-3-030-72084-1_19 LNCSD9 0302-9743 (2021). Google Scholar

40. 

U. Latif et al., “An end-to-end brain tumor segmentation system using multi-inception-unet,” Int. J. Imaging Syst. Technol., 31 (4), 1803 –1816 https://doi.org/10.1002/ima.22585 IJITEG 0899-9457 (2021). Google Scholar

41. 

D. T. Kushnure and S. N. Talbar, “MS-UNet: a multi-scale unet with feature recalibration approach for automatic liver and tumor segmentation in CT images,” Comput. Med. Imaging Graph., 89 101885 https://doi.org/10.1016/j.compmedimag.2021.101885 (2021). Google Scholar

42. 

I. Lorente et al., "Deep learning based model observer by U-Net," Proc. SPIE, 11316 113160F https://doi.org/10.1117/12.2549687 (2020). Google Scholar

43. 

A. Newell, K. Yang and J. Deng, “Stacked hourglass networks for human pose estimation,” (2016). Google Scholar

44. 

T. van Erven and P. Harremos, “Rényi divergence and Kullback–Leibler divergence,” IEEE Trans. Inf. Theory, 60 (7), 3797 –3820 https://doi.org/10.1109/TIT.2014.2320500 IETTAW 0018-9448 (2014). Google Scholar

45. 

X. Yang et al., “Learning high-precision bounding box for rotated object detection via Kullback–Leibler divergence,” (2021). Google Scholar

46. 

S. Ji et al., “Kullback–Leibler divergence metric learning,” IEEE Trans. Cybern., 52 (4), 2047 –2058 https://doi.org/10.1109/TCYB.2020.3008248 (2022). Google Scholar

47. 

F. Martín et al., “Kullback–Leibler divergence-based global localization for mobile robots,” Rob. Auton. Syst., 62 (2), 120 –130 https://doi.org/10.1016/j.robot.2013.11.006 RASOEJ 0921-8890 (2014). Google Scholar

48. 

48. D. I. Belov and R. D. Armstrong, "Distributions of the Kullback–Leibler divergence with applications," Br. J. Math. Stat. Psychol., 64(2), 291–309, https://doi.org/10.1348/000711010X522227 (2011).
49. Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature, 521, 436–444, https://doi.org/10.1038/nature14539 (2015).
50. M. Sandler et al., "MobileNetV2: inverted residuals and linear bottlenecks," (2018).
51. M. Akay et al., "Deep learning classification of systemic sclerosis skin using the MobileNetV2 model," IEEE Open J. Eng. Med. Biol., 2, 104–110, https://doi.org/10.1109/OJEMB.2021.3066097 (2021).
52. C. Buiu, V.-R. Dănăilă and C. N. Răduţă, "MobileNetV2 ensemble for cervical precancerous lesions classification," Processes, 8(5), 595, https://doi.org/10.3390/pr8050595 (2020).
53. A. Kanadath, J. A. A. Jothi and S. Urolagin, "Histopathology image segmentation using MobileNetV2 based U-net model," in Int. Conf. Intell. Technol. (CONIT), 1–8, https://doi.org/10.1109/CONIT51480.2021.9498341 (2021).
54. R. Roslidar et al., "A study of fine-tuning CNN models based on thermal imaging for breast cancer classification," in IEEE CyberneticsCom Int. Conf., 77–81, https://doi.org/10.1109/CYBERNETICSCOM.2019.8875661 (2019).
55. R. Indraswari, R. Rokhana and W. Herulambang, "Melanoma image classification based on MobileNetV2 network," Proc. Comput. Sci., 197, 198–207, https://doi.org/10.1016/j.procs.2021.12.132 (2022).
56. S. Taufiqurrahman et al., "Diabetic retinopathy classification using a hybrid and efficient MobileNetV2-SVM model," in IEEE Region 10 Conf. (TENCON), 235–240, https://doi.org/10.1109/TENCON50793.2020.9293739 (2020).
57. T. Kaur and T. K. Gandhi, "Automated diagnosis of COVID-19 from CT scans based on concatenation of MobileNetV2 and ResNet50 features," in Computer Vision and Image Processing, 149–160, Springer, Singapore (2021).
58. M. M. Ahsan et al., "Detection of COVID-19 patients from CT scan and chest X-ray data using modified MobileNetV2 and LIME," Healthcare, 9(9), 1099, https://doi.org/10.3390/healthcare9091099 (2021).
59. S. Serte, M. A. Dirik and F. Al-Turjman, "Deep learning models for COVID-19 detection," Sustainability, 14(10), 5820, https://doi.org/10.3390/su14105820 (2022).
60. S. Aggarwal et al., "Automated COVID-19 detection in chest X-ray images using fine-tuned deep learning architectures," Expert Syst., 39(3), e12749, https://doi.org/10.1111/exsy.12749 (2022).
61. W. Zhiqiang and L. Jun, "A review of object detection based on convolutional neural network," in 36th Chin. Control Conf. (CCC), 11104–11109, https://doi.org/10.23919/ChiCC.2017.8029130 (2017).
62. T. Fawcett, "An introduction to ROC analysis," Pattern Recognit. Lett., 27, 861–874, https://doi.org/10.1016/j.patrec.2005.10.010 (2006).
63. F. R. Verdun et al., "Image quality in CT: from physical measurements to model observers," Phys. Med., 31, 823–843, https://doi.org/10.1016/j.ejmp.2015.08.007 (2015).
64. V. Satopaa et al., "Finding a 'kneedle' in a haystack: detecting knee points in system behavior," in 31st Int. Conf. Distrib. Comput. Syst. Workshops, 166–171, https://doi.org/10.1109/ICDCSW.2011.20 (2011).
65. U. Khair et al., "Forecasting error calculation with mean absolute deviation and mean absolute percentage error," J. Phys. Conf. Ser., 930(1), 012002, https://doi.org/10.1088/1742-6596/930/1/012002 (2017).
66. M. Warrens, "Inequalities between multi-rater kappas," Adv. Data Anal. Classif., 4(4), 271–286, https://doi.org/10.1007/s11634-010-0073-4 (2010).
67. K. Krippendorff, "Computing Krippendorff's alpha-reliability," (2011).
68. P. E. Shrout and J. L. Fleiss, "Intraclass correlations: uses in assessing rater reliability," Psychol. Bull., 86(2), 420–428, https://doi.org/10.1037/0033-2909.86.2.420 (1979).
69. D. Marasini, P. Quatto and E. Ripamonti, "Assessing the inter-rater agreement for ordinal data through weighted indexes," Stat. Methods Med. Res., 25(6), 2611–2633, https://doi.org/10.1177/0962280214529560 (2016).
70. D. J. Arenas, "Inter-rater: software for analysis of inter-rater reliability by permutating pairs of multiple users," (2018).
71. B. Dawson, Basic & Clinical Biostatistics, 4th ed., 57–61, McGraw-Hill (2004).
72. J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, 33(1), 159–174, https://doi.org/10.2307/2529310 (1977).
73. P. Armitage, "The design and analysis of clinical experiments," Biometrics, 43(4), 1028, https://doi.org/10.2307/2531561 (1987).
74. H. Gifford and M. King, "Implementing visual search in human-model observers for emission tomography," in IEEE Nucl. Sci. Symp. Conf. Rec. (NSS/MIC), 2482–2485, https://doi.org/10.1109/NSSMIC.2009.5402080 (2009).
75. K. M. Hanson, "Detectability in computed tomographic images," Med. Phys., 6(5), 441–451, https://doi.org/10.1118/1.594534 (1979).
76. R. F. Wagner, "Fast Fourier digital quantum mottle analysis with application to rare earth intensifying screen systems," Med. Phys., 4(2), 157–162, https://doi.org/10.1118/1.594304 (1977).
77. C. H. Shweta and R. C. Bajpai, "Evaluation of inter-rater agreement and inter-rater reliability for observational data: an overview of concepts and methods," J. Indian Acad. Appl. Psychol., 41(3), 20–27 (2015).

Biography

Federico Valeri received his MS degree in physics and his postgraduate diploma in medical physics from Florence University in 2019 and 2022, respectively. His research interests include CT physics, radiomics, computer vision, and machine learning.

Elena Cantoni is a student at the Medical Physics Specialization School of the University of Bologna, Italy. In 2022, she was a research fellow at the University of Florence, working on the development and optimization of innovative AI methods for CT medical imaging. In particular, she focused on the statistical analysis of model observers and on the development of imaging software for the early detection of hepatocellular carcinoma.

Evaristo Cisbani received his PhD in physics; his research concerns the development of innovative instrumentation for nuclear medicine, radiation therapy, and experimental nuclear physics. He has experience in the design and realization of Cherenkov detectors, gaseous chambers, and gamma imaging devices, combined with and supported by simulation, data acquisition, image processing, and data analysis. He is involved in the development of new approaches, based on artificial intelligence, for the optimization of instrumentation design, the improvement of medical image quality, and the performance evaluation of AI-based systems.

Ilaria Cupparo is a medical physicist. She is currently working as a research fellow on a radiation protection project in collaboration with the radiation protection expert at the University of Florence. She is involved in several research projects concerning the development of neural networks and machine learning algorithms for medical image analysis. Within the field of artificial intelligence, her work focuses in particular on optimizing CT protocols.

Sandra Doria received her PhD in physics in 2017 and her specialization in medical physics in 2021. She is now a permanent researcher at the National Research Council of Italy. Her main scientific activities are based at LENS (European Laboratory for Nonlinear Spectroscopy), a center of excellence at the University of Florence. Her areas of expertise include deep learning in medical imaging, dose optimization of computed tomography protocols, radiomics, statistical image analysis, Monte Carlo modeling, and radiation protection.

Cesare Gori, formerly director of the Health Physics Department at the University Hospital "Careggi" in Florence, Italy, is now appointed by the University of Florence as a radiation protection expert. He is a founding partner and honorary member of the Italian Association of Medical Physics (AIFM) and is also the AIFM delegate to the International Organization for Medical Physics (IOMP) Council.

Lorenzo Lasagni is a research fellow at the University of Florence. He received his specialization in medical physics at the University of Florence in 2022. His main area of expertise is in developing software for medical image analysis, with a focus on segmentation, computer-aided diagnosis, and explainable artificial intelligence. He provides support for teaching at the University of Florence and co-supervises thesis activities.

Lorenzo Nicola Mazzoni is a medical physics expert at the Medical Physics Unit of Prato-Pistoia, AUSL Toscana Centro. His activity is mainly focused on medical imaging and radiation protection. He is a member of the European Federation of Organisations for Medical Physics (EFOMP) European and International Matters Committee and the current primary contact of EFOMP with the International Commission on Radiological Protection (ICRP).

Valentina Sanguineti received her BSc degree in electronic engineering and information technology and her MSc degree in internet and multimedia engineering from the University of Genoa, Italy, in 2016 and 2018, respectively. In 2022, she received her PhD in computer vision, pattern recognition, and machine learning, carried out in collaboration between the University of Genoa and the Italian Institute of Technology (IIT), where she has been involved in research on audio and video processing using deep neural networks.

Diego Sona received his PhD in computer science from the University of Pisa, Italy, in 2002. He then joined the Adaptive Advisory Systems Group at the Istituto Trentino di Cultura, and in 2008 he moved to the Neuroinformatics Laboratory at FBK. From 2011 to 2020, he was a visiting scientist in the Pattern Analysis and Computer Vision Department at IIT. In 2021, he moved to the Data Science for Health Unit. His research has always focused on machine learning.

Adriana Taddeucci is a medical physicist at Florence University Hospital, Italy. Her main activity concerns the implementation of the optimization principle in diagnostic radiology. She started working with model observers (CHO) applied to CT protocols 10 years ago. She is also an adjunct professor at the University of Florence, where she teaches the principles of radiological equipment for medical imaging, including related quality assurance programs.

Biographies of the other authors are not available.

© 2023 Society of Photo-Optical Instrumentation Engineers (SPIE)
Federico Valeri, Maurizio Bartolucci, Elena Cantoni, Roberto Carpi, Evaristo Cisbani, Ilaria Cupparo, Sandra Doria, Cesare Gori, Mauro Grigioni, Lorenzo Lasagni, Alessandro Marconi, Lorenzo Nicola Mazzoni, Vittorio Miele, Silvia Pradella, Guido Risaliti, Valentina Sanguineti, Diego Sona, Letizia Vannucchi, and Adriana Taddeucci "UNet and MobileNet CNN-based model observers for CT protocol optimization: comparative performance evaluation by means of phantom CT images," Journal of Medical Imaging 10(S1), S11904 (7 March 2023). https://doi.org/10.1117/1.JMI.10.S1.S11904
Received: 6 October 2022; Accepted: 9 February 2023; Published: 7 March 2023
KEYWORDS: Computed tomography, Education and training, Molybdenum, Mathematical optimization, Image restoration, Infrared imaging, Image quality
