Variability and accuracy of different software packages for dynamic susceptibility contrast magnetic resonance imaging for distinguishing glioblastoma progression from pseudoprogression

Abstract. Determining whether glioblastoma multiforme (GBM) is progressing despite treatment is challenging due to the pseudoprogression phenomenon seen on conventional MRIs, but relative cerebral blood volume (CBV) has been shown to be helpful. As CBV’s calculation from perfusion-weighted images is not standardized, we investigated whether there were differences between three FDA-cleared software packages in their CBV output values and subsequent performance regarding predicting survival/progression. Forty-five postradiation therapy GBM cases were retrospectively identified as having indeterminate MRI findings of progression versus pseudoprogression. The dynamic susceptibility contrast MR images were processed with different software and three different relative CBV metrics based on the abnormally enhancing regions were computed. The intersoftware intraclass correlation coefficients were 0.8 and below, depending on the metric used. No statistically significant difference in progression determination performance was found between the software packages, but performance was better for the cohort imaged at 3.0 T versus those imaged at 1.5 T for many relative CBV metric and classification criteria combinations. The results revealed clinically significant variation in relative CBV measures based on the software used, but minimal interoperator variation. We recommend against using specific relative CBV measurement thresholds for GBM progression determination unless the same software or processing algorithm is used.

Introduction
Glioblastoma multiforme (GBM) is the most common primary brain cancer in adults and carries a poor prognosis. Median survival in patients treated on clinical trials with radiation therapy and temozolomide ranges from 15 to 20 months. 1,2 After biopsy or maximal safe resection, patients typically receive ∼60 Gy of radiation over 30 fractions, concurrent with daily low-dose temozolomide. Patients then receive adjuvant temozolomide for 6 to 12 months or longer, while being imaged every two to three months to assess tumor status. If the tumor is determined to be progressing, second-line agents are introduced. However, progression can be difficult to determine confidently from imaging, since a treatment response can transiently mimic the imaging appearance of tumor progression, 3 a phenomenon often referred to as pseudoprogression. It is important to distinguish the two conditions: if there is true progression, a second-line agent may extend survival; if there is pseudoprogression, there may be a survival advantage 4 and temozolomide should not be discontinued. It is difficult to distinguish pseudoprogression from tumor progression by clinical symptoms, especially early in the postradiation period. 5 In a study examining the incidence of progression versus pseudoprogression in 63 GBM patients, 5 28 (44.4%) of the patients had lesion enlargement on the first postradiation follow-up MR exam. Each of these cases underwent salvage surgery and pathologic analysis, resulting in 12 (42.9%) being classified as pseudoprogression, with the other 16 (57.1%) exhibiting true tumor progression.
In the largest study of conventional imaging of progression versus pseudoprogression, qualitative features were analyzed for their ability to distinguish between the two phenomena. 6 With 63 progression and 30 pseudoprogression patients, the only feature found to have predictive value regarding progression was subependymal enhancement. However, this feature was present in only 26 of the 93 cases, producing a negative predictive value of 41.8%, and it is not a good candidate for extracting further value through quantitative assessment, since a radiologist would have no trouble recognizing subependymal enhancement and determining it to be new tumor growth.
Relative cerebral blood volume (rCBV) has attracted much interest as a functional measurement potentially representing tumor-related vascular changes beyond those visible in conventional MR characteristics. 7 In addition to many studies investigating its utility in distinguishing between tumor grades, various studies have analyzed its use in distinguishing between tumor progression and pseudoprogression. The mean rCBV in progression is higher than the mean rCBV in pseudoprogression, consistent with the understanding that active tumor elicits angiogenesis and consequently higher blood volumes. Accordingly, many authors have reported optimal rCBV thresholds for separating progression from pseudoprogression cases. [8][9][10][11] CBV images are generated through postprocessing of a perfusion-sensitive image acquisition, which tracks signal change over time due to the transit of a contrast bolus. Dynamic susceptibility contrast (DSC) MR is commonly used to produce the perfusion-weighted images in brain tumor imaging. The CBV for each voxel is calculated from an integral of the relaxivity change (derived from the MR signal using the echo time) measured during bolus transit relative to a prebolus baseline level (see Fig. 1). The starting and ending time points of this integration, the baseline estimate, model fitting, the integration method used, and correction for contrast agent extravasation are all sources of variation in the CBV calculation. 12,13 As a measurement with arbitrary units, the need for normalization has been investigated, with the most common approach being to divide by the mean contralateral white matter value to produce relative or rCBV values. 14 Efforts have been made to correct the DSC signal corruption caused by contrast agent extravasation due to blood-brain barrier disruption, both by bolus preload dose administration and by correction using mathematical models during the CBV calculation. 13 Previous studies have shown that both preload dosing and modeling are needed for maximal rCBV accuracy. 12,15 If these corrections are insufficient, then results cannot be translated between studies using different software packages. The potential for variability has been recognized, 12,16 with recent reports of variability in measurements of mean rCBV between FDA-cleared software packages using clinical DSC-MR images. 17,18
The purpose of this study was to determine whether there were significant differences in multiple rCBV metrics computed from the same DSC-MR images by three FDA-cleared software packages and, if so, how much disagreement exists at the various rCBV thresholds used to predict tumor progression. Then, using clinical or outcome-based information to classify whether the analyzed tumors were progressing or not, we investigated whether one software package performed better than the others for distinguishing between GBM progression and pseudoprogression. Finally, we analyzed whether there are clinically significant differences between the optimal rCBV metric thresholds found for each software package.
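The per-voxel integration just described can be sketched as follows. This is a minimal illustration of the uncorrected calculation, without gamma-variate fitting or leakage correction; the function and variable names are illustrative and do not reflect any specific package's algorithm.

```python
import numpy as np

def cbv_from_dsc(signal, te, dt, baseline_idx, bolus_idx):
    """Integrate the relaxivity-change curve of one voxel's DSC time series.

    signal       : 1-D array of DSC signal intensity over time
    te           : echo time (same time units as dt)
    dt           : time between successive acquisitions
    baseline_idx : indices of the prebolus baseline time points
    bolus_idx    : indices spanning bolus arrival to end of integration
    """
    s0 = signal[baseline_idx].mean()          # prebolus baseline estimate
    delta_r2 = -np.log(signal / s0) / te      # relaxivity change per time point
    return delta_r2[bolus_idx].sum() * dt     # simple Riemann-sum integration
```

The choices this sketch makes explicit, such as which time points form the baseline and where integration starts and stops, are exactly the steps the packages implement differently.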

Patients
Our institutional review board reviewed and approved this retrospective study and granted a waiver of informed consent. The patient image files were anonymized prior to processing.
We identified the set of potential subjects through a medical record query for patients who had been treated at this institution with radiation and had a histologic diagnosis of GBM (SNOMED code: M-94403). From this initial set of 148 patients, further inclusion criteria were treatment with temozolomide concurrent with radiation and continuing afterward, and sufficient follow-up to determine whether, within six months postradiation, a decision was made to discontinue temozolomide and initiate alternate therapy because of some appearance of progression, including notations of enlarging contrast enhancement. From this set of 58 patients, 10 did not have perfusion-weighted images, and three were excluded due to software incompatibility, leaving 45 cases for this study. The images used were from the first MR exam obtained within six months postradiation therapy that demonstrated signs of possible progression; the exam of interest for a given patient was therefore obtained, for example, one month, four months, or six months after radiation completion.

MR Images
Each imaging exam was acquired using one of several clinical General Electric MR scanners (GE Healthcare, Milwaukee, Wisconsin), operating at 1.5 T (n = 29) or 3 T (n = 16). For both the 1.5 and 3 T scans, the DSC images were obtained using a spin-echo echo-planar sequence with axial orientation and TR/TE/FA of 2217 to 2225 ms/60 ms/90 deg. The matrix was 128 × 96, field of view (FOV) 240 × 240 mm, slice thickness 5 mm, and slice gap 5 mm. Forty successive time points were imaged with ∼2 s between acquisitions. The number of slices ranged from 10 to 26, covering the entire tumor in all cases. For the DSC imaging, 2 ml of gadolinium-based contrast agent were introduced as a preloading bolus to decrease the T1 leakage effects from contrast extravasation through the disrupted blood-brain barrier 15 during the main bolus of 18 ml. Except for two cases, the T1w postcontrast images were acquired at an oblique axial angle using either spin-echo or fast spin-echo sequences ∼10 min after gadolinium injection. The T1w parameters for the 1.5 T spin-echo sequence were TR/TE/FA of 433 to 683 ms/20 to 21 ms/90 deg. The matrix was 256 × 192, FOV 220 × 220 to 250 × 250 mm, slice thickness 4 mm, and no slice gap. For the 1.5 T fast spin-echo sequence, the TR/TE/FA was 5067 ms/98 ms/90 deg. The matrix was 256 × 192, FOV 220 × 220 mm, and the echo train length was 8. For the two-dimensional (2-D) 3 T spin-echo acquisitions, the TR/TE/FA was 467 to 700 ms/20 ms/90 deg. The matrix was 320 × 192, FOV 220 × 220 mm, slice thickness 4 mm, and no slice gap. For the three-dimensional 3 T fast spin-echo acquisitions, the TR/TE/FA was 600 ms/11.039 to 20.157 ms/90 deg. The matrix was 256 × 256, FOV 240 × 240 mm, and the echo train length was 24. The two nonaxial postcontrast image volumes were 2-D 3 T fast spin-echo acquisitions obtained in the coronal plane, with TR/TE/FA of 600 to 767 ms/13.1 to 13.5 ms/90 deg. The matrix was 320 × 224, FOV 180 × 180 mm, slice thickness 4 mm, slice gap 5 mm, and the echo train length was 3.

DSC-MRI Processing
Three operators created CBV images from the DSC-MRIs using IB Neuro ver. 1.1 (Imaging Biometrics, Elm Grove, Wisconsin), FuncTool ver. 4.5.3 (GE Healthcare, Milwaukee, Wisconsin), and nordicICE ver. 2.3.13 (NordicNeuroLab, Bergen, Norway). Each of the three operators processed all of the cases using FuncTool and nordicICE, attempting to operate each package with similar parameters, although exact matching was not possible due to proprietary aspects of each software. Just one operator using IB Neuro was sufficient to represent all three operators since its algorithm is automatic, requiring no manual intervention. We confirmed with a subset of images that multiple runs with IB Neuro produced identical results. FuncTool required manual selection of the prebolus baseline and integration starting and stopping time points, whereas nordicICE required manual specification of the prebolus baseline only when its automatic selection algorithm failed (7 of the 45 cases). Gamma-variate fitting and leakage correction were the only nondefault settings used for nordicICE. IB Neuro's leakage correction was activated, and for FuncTool, the baseline was interpolated between the integration time points. For both FuncTool and nordicICE, the noise threshold was adjusted to maximize brain coverage for rCBV calculation without processing excessive background voxels. For nordicICE, this was done after the prebolus baseline determination. We did test a subset with and without gamma-variate fitting with nordicICE and did not find a significant difference in values.

Registration and Tumor Segmentation
We defined a region of interest (ROI) representing abnormal contrast enhancement on the postcontrast T1-weighted images. The ROI was created by one author (Z.S.K.), who manually drew a generous boundary around each slice of enhancing tumor using ITK-SNAP v. 2.4.0, 19 aiming for a roughly 50/50 mix of enhancing voxels and a second, nonenhancing tissue intensity distribution. Then, on a per-slice basis, custom software used an Otsu threshold 20 to segment out the enhancing voxels. Voxels with intensities above the Otsu threshold were assigned the label "tumor" for enhancing tissue (see Fig. 2), although it is possible this tissue was not tumor but pseudoprogression.
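As a concrete illustration of the per-slice step above, Otsu's method picks the intensity cut that maximizes the between-class variance of the voxel intensities inside the hand-drawn ROI. The sketch below is a plain-NumPy rendering of that idea with illustrative names, not the study's actual custom software.

```python
import numpy as np

def otsu_threshold(values, nbins=256):
    """Otsu's method: choose the threshold maximizing between-class variance."""
    hist, edges = np.histogram(values, bins=nbins)
    p = hist.astype(float) / hist.sum()          # intensity probabilities
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                            # class-0 (below-cut) weight
    w1 = 1.0 - w0
    mu0 = np.cumsum(p * centers)                 # unnormalized class-0 mean
    mu_t = mu0[-1]                               # grand mean
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mu_t * w0 - mu0) ** 2 / (w0 * w1)
    between[~np.isfinite(between)] = 0.0
    return centers[np.argmax(between)]

# Within the generous ROI, voxels above the cut would be labeled "tumor":
#   roi_values = t1_post[roi_mask]
#   tumor_mask = roi_values > otsu_threshold(roi_values)
```

Because the threshold is recomputed per slice from the ROI's own histogram, the segmentation adapts to slice-to-slice intensity variation without operator-tuned parameters.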
To avoid registration-induced modification of the raw rCBV values, we registered the T1w volume to the perfusion-weighted space. To do this, we used FSL ver. 5.0's 21 linear registration tool FLIRT 22 after manual editing of the segmented brain produced by the brain extraction tool. 23 In a few cases, an additional pathology mask had to be used during the registration step. Thus, the tumor ROI was specified on the T1w postcontrast image, which had been registered to the perfusion-weighted image space, and was then used for sampling the rCBV image voxels.

rCBV Metrics
We calculated three different metrics that have been reported in the literature: mean tumor rCBV, tumor 95th percentile rCBV, and percent of tumor voxels with CBV greater than the normal-appearing white matter (NAWM) mean (rCBV > 1.0). 24 This NAWM mean was calculated from an ROI drawn on the NAWM voxels in the hemisphere contralateral to the tumor, guided by the T1w postcontrast images. The slice nearest to the tumor with a large number of NAWM voxels was targeted, if not the same slice. Normalization was conducted by dividing the tumor CBV values by the NAWM mean to create the rCBV values. Then, the rCBV metrics were computed over the tumor ROI. The 95th percentile rCBV value represents a form of the hotspot method proposed by Kim et al., 25 but one that can be calculated more automatically and objectively. Summary metrics for the tumors were used instead of direct voxel comparison since rCBV analyses are performed for ROIs in practice. Since CBV values are not computed for all image voxels, care was taken to exclude nonprocessed (CBV = 0) values from the measurements. Custom code written using Python ver. 2.7.3 and the modules Numpy ver. 1.6.2, Scipy ver. 0.11.0, SimpleITK ver. 0.6.0, and Pandas ver. 0.10.1 was used for calculations and data management.
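The three summary metrics can be sketched in a few lines of NumPy. This is an illustrative reconstruction under the definitions above (normalization by the NAWM mean, exclusion of nonprocessed voxels), not the study's original code.

```python
import numpy as np

def rcbv_metrics(tumor_cbv, nawm_cbv):
    """Compute the three rCBV summary metrics described in the text.

    tumor_cbv : CBV values sampled from the tumor ROI
    nawm_cbv  : CBV values from contralateral normal-appearing white matter
    Nonprocessed voxels (CBV == 0) are excluded before normalization.
    """
    tumor = tumor_cbv[tumor_cbv != 0]
    nawm = nawm_cbv[nawm_cbv != 0]
    rcbv = tumor / nawm.mean()                  # normalize by the NAWM mean
    return {
        "mean_rcbv": rcbv.mean(),
        "p95_rcbv": np.percentile(rcbv, 95),    # automated "hotspot" surrogate
        "pct_above_nawm": 100.0 * (rcbv > 1.0).mean(),
    }
```

Because all three metrics are computed from the same normalized voxel sample, any intersoftware difference in the underlying CBV maps propagates directly into each of them.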

rCBV Values
For measuring variability between the rCBV values, both intersoftware and interoperator, we calculated the intraclass correlation coefficients (ICCs) using the irr ver. 0.84 26 package for R ver. 3.0.1. 27 The two-way analysis of variance model was used, with both the absolute agreement and consistency coefficients computed. 28 The consistency measurement excludes software-specific additive bias, essentially allowing for an agreement measure after subtraction of the software-specific means. Favorable ICC values were considered to be >0.9, with the expectation that they should be ∼1.0 for this application.
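Both ICC forms follow directly from the two-way ANOVA mean squares. The sketch below implements the McGraw and Wong single-measurement formulas in NumPy as an illustration of what the irr package computes; it is not the package's own code.

```python
import numpy as np

def icc_two_way(data):
    """Two-way ANOVA ICCs for single measurements (McGraw & Wong).

    data : (n subjects) x (k raters or software packages) matrix.
    Returns (agreement ICC(A,1), consistency ICC(C,1)).
    """
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)                               # per subject
    col_means = data.mean(axis=0)                               # per rater
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)        # subjects MS
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)        # raters MS
    resid = data - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))              # error MS
    consistency = (msr - mse) / (msr + (k - 1) * mse)
    agreement = (msr - mse) / (msr + (k - 1) * mse + k / n * (msc - mse))
    return agreement, consistency
```

The agreement form penalizes a rater-specific (here, software-specific) additive bias through the msc term, while the consistency form ignores it, which is exactly the distinction drawn in the text.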
We computed, for each operator and software package, the classification of individual cases as progression or pseudoprogression based on an rCBV metric threshold. Due to a lack of biopsy proof of tumor status and no absolute consensus regarding classification criteria, we started with an outcome-agnostic analysis of differences in classification between software packages and operators over a range of rCBV metric values. We focused the disagreement analysis on the range of thresholds within which 25% to 75% of the brains were classified as cases of progression by each software package, as this is a particularly informative range given estimates of true progression incidence. 5 We do not have histologic confirmation of the tissue makeup, but the literature suggests that true progression is about as frequent as pseudoprogression in patients treated with temozolomide and radiation. 29 If that is a reasonable estimate for this cohort, then each software package's threshold that splits the patients in half should approximate the clinically relevant decision point. Results were calculated, however, for a continuum of thresholds to allow for visualization of global trends as well as analysis at any reader-preferred thresholds.
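The disagreement count at a given threshold can be sketched as follows; `metrics` is a hypothetical cases-by-packages array holding the same rCBV metric as computed by each of the three packages.

```python
import numpy as np

def outlier_disagreement_rate(metrics, threshold):
    """Fraction of cases where one package's progression call differs
    from the other two at a given rCBV-metric threshold.

    metrics : (n cases) x (3 packages) array of one rCBV metric.
    """
    calls = metrics > threshold                 # True = "progression"
    # With three packages, a case is discordant when the per-case sum of
    # True calls is 1 or 2 (i.e., not 0 and not 3).
    discordant = calls.sum(axis=1) % 3 != 0
    return discordant.mean()
```

Sweeping `threshold` over the 25% to 75% classification range described above traces out the disagreement curves shown later in the results.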

Outcome Prediction
For measuring the utility of the rCBV metrics for determining whether progression or pseudoprogression is occurring, each case needed a label as progression or pseudoprogression, making use of a postimage-acquisition outcome measure. Almost none of this patient cohort had biopsy proof of tissue, so clinical history alone was utilized. For the first labeling method, the criterion used was based on how long the patients survived after their first postradiation image exam with indications of progression or pseudoprogression. The days survived for each patient were aggregated, and the 40th and 60th percentile values (237.6 and 321.4 days) were calculated. This is based on literature reports suggesting that about one-half to two-thirds of patients with worrisome findings will have true progression and the remainder will have pseudoprogression. All patients who survived less than the 40th percentile threshold of 237.6 days were labeled as short-survivors, likely due to tumor progression. Those surviving longer than 321.4 days were labeled as long-survivors, or as likely having had tumor pseudoprogression. The patients who survived between 237.6 and 321.4 days were excluded from further analysis based on the survival criterion. Also, two patients with last follow-up at 94 and 162 days were removed from all outcome-based analysis, regardless of classification criterion, due to uncertainty regarding short- or long-survivor status, leaving a total of 43 patients for this portion of the study.
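The percentile-based labeling rule above can be sketched as follows; the function name and the string labels are illustrative.

```python
import numpy as np

def survival_labels(days, lo_pct=40, hi_pct=60):
    """Label cases by post-exam survival percentiles.

    Below the lo_pct percentile  -> "progression" (short-survivor);
    above the hi_pct percentile  -> "pseudoprogression" (long-survivor);
    in between                   -> "excluded" from the survival criterion.
    """
    days = np.asarray(days, dtype=float)
    lo, hi = np.percentile(days, [lo_pct, hi_pct])
    labels = np.where(days < lo, "progression",
             np.where(days > hi, "pseudoprogression", "excluded"))
    return labels, (lo, hi)
```

With the study's cohort, `lo` and `hi` would come out to the 237.6- and 321.4-day cutoffs quoted above.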
For the second method of defining pseudoprogression, we used the criterion published by Young et al. 6 If temozolomide was clinically determined to have failed within six months postradiation and a treatment change was necessary, the patient was classified as having had progression. Patients who did not have a change in treatment within six months were classified as having pseudoprogression, and those who died within six months with no treatment change were excluded.
Finally, as a third method, the two criteria were combined. If the survival-based and treatment-change-based classification methods agreed for a patient, then that patient was given a "combined" classification of progression or pseudoprogression. If there was disagreement between the two classification methods, or the survival-based method gave an "intermediate" label, then that patient was given an "indeterminate" combined-classification label and excluded from further analysis based on the combined labeling method. While all 43 patients could be classified by the treatment change criterion, the survival criterion allowed 34 patients in its group, with 20% of cases being excluded. Twenty-four cases remained in the combined-classification group after excluding the cases not meeting its criteria, representing what we believe is the most reliable labeling.

Results

rCBV Values
Significant differences were observed between software packages for the rCBV measurements. The intersoftware ICCs are shown in Table 1. The mean rCBV metric has the highest intersoftware agreement, in part due to smaller additive bias, as evidenced by the consistency ICC, than the other metrics. However, the agreement ICCs are around 0.8 or below, with none of the 95% confidence intervals topping 0.9. With additive bias negated, the consistency ICC for operator 3 reached 0.853 for the mean rCBV metric, but was 0.800 for operator 2. The "% voxels above NAWM mean" metric had the lowest ICC in all cases. The interoperator ICCs are shown in Table 2. FuncTool has lower ICCs for each of the metrics, perhaps due to a greater number of manual steps. Based on the confidence intervals, this difference is statistically significant for mean rCBV and "% voxels above NAWM mean," and almost significant for 95% rCBV. The interoperator ICCs are higher than the intersoftware ICCs, with statistical agreement shown for each software and metric except for the FuncTool/mean rCBV metric combination. While the "% voxels above NAWM mean" metric had the lowest intersoftware ICC, it had the highest intrasoftware, interoperator ICC for both FuncTool and nordicICE. Figure 3 displays the variation in rCBV values for both tumor and NAWM samples on a per-voxel basis for a selected case.
To assess the impact on clinical decision-making, classification analysis is shown in Figs. 4-6. We assessed a range of rCBV metrics and thresholds for classifying progression versus pseudoprogression. Figure 4 displays the percentage of cases above a range of rCBV metric thresholds on the x axis. Overall, each software package was the outlier, producing different measurements than the other two packages, for some rCBV metric of interest. For mean rCBV, IB Neuro was the outlier; nordicICE was the outlier for the 95% rCBV metric; and FuncTool was the outlier for % voxels above the NAWM mean.
With the assumption that pseudoprogression may occur in roughly half of this cohort, the mean rCBV value at which IB Neuro splits the cases in half is ∼0.2 below the value at which the other software split the cases in half. Alternatively, at a 95% rCBV threshold of 2.7, nordicICE classifies ∼50% more cases as above the threshold (progressing tumor) than FuncTool. There is no single value where all three packages had reasonably high agreement, other than extreme values, where all cases would be considered progression or pseudoprogression.
For case-by-case analysis, Fig. 5 displays the percentage of cases where one software provides a different classification result than the other two, depending on the rCBV metric threshold used. Particular thresholds of interest would be based on the estimated incidence of progression versus pseudoprogression among the cases. The percentage of cases with disagreement ranges from the 20s to the 50s. The mean rCBV and % voxels above NAWM metrics have similar disagreement curves, with 95% rCBV peaking higher. The interoperator differences are small. Figure 6 displays the percentage of cases with discordant classification for each software package for a range of thresholds. This figure uses data from operator 1, as the small interoperator difference in Fig. 5 suggests it is representative of other operators. As expected from Fig. 4, IB Neuro disagrees more for mean rCBV, nordicICE disagrees more for 95% rCBV, and IB Neuro or FuncTool disagrees more for % voxels above NAWM.

Outcome Prediction
The numbers of cases classified as progression or pseudoprogression using the three different criteria are shown in Table 3. More of the cases were classified as progression than pseudoprogression in the 1.5 T dataset, but less than half in the 3 T dataset. The area under the curve (AUC) measurements are shown in Table 4 for all the cases pooled together, with Table 5 displaying the results for the 1.5 and 3 T datasets analyzed separately. The pooled 1.5 and 3 T dataset showed poor performance of the rCBV measures, with no instance in which the lower bound of the 95% confidence interval (CI) exceeded 0.5. However, the 3 T dataset had multiple instances where the AUC was significantly >0.5 based on the 95% CI. Additionally, despite the low numbers, the 3 T dataset had statistically significantly higher AUCs for the mean rCBV metric than the 1.5 T dataset. When nordicICE was used, the % voxels above NAWM metric also resulted in significantly higher AUCs for the 3 T group than the 1.5 T group. There was no statistically significant difference for any of the three metrics between the software packages or operators. Additionally, none of the three metrics performed better than the others.
The sensitivity and specificity analysis is shown in Tables 6 through 8 for operator 1's data. (In Table 5, the marked 3 T values were statistically significantly higher, p < 0.05, than the corresponding 1.5 T values, using the DeLong method for comparison; Op., operator.) The optimal thresholds often differed between the software packages, and this resulted in differences in sensitivity and specificity that were statistically
significant in many cases. The optimal threshold for the 3 T dataset was always higher than for the 1.5 T dataset, except FuncTool's optimal thresholds for the % voxels above NAWM metric and the combined classification ground truth criterion. For the mean rCBV metric, the optimal threshold for 1.5 T ranged from 0.87 to 1.44, and that for 3.0 T from 1.10 to 1.52, depending on the software and classification criteria. For the 95% rCBV metric, the 1.5 T range was 2.0 to 3.04 and 3.0 T range was 2.64 to 4.00. The % voxels above NAWM metric produced optimal threshold ranges of 32.5 to 72.4 for 1.5 T (32.5 to 55.7 if the 72.4 threshold is removed), and 49.5 to 58.5 for 3.0 T.
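The "optimal threshold" analysis above amounts to sweeping candidate cutoffs and keeping the one that best trades off sensitivity against specificity. A common way to do this, sketched here as an assumption since the paper does not state its exact criterion, is to maximize Youden's J = sensitivity + specificity - 1 over the observed metric values.

```python
import numpy as np

def youden_optimal_threshold(values, is_progression):
    """Pick the rCBV-metric cutoff maximizing Youden's J (sens + spec - 1).

    values         : rCBV metric per case
    is_progression : boolean ground-truth labels (both classes assumed present)
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(is_progression, dtype=bool)
    best_t, best_j = None, -np.inf
    for t in np.unique(values):                  # each observed value is a candidate
        calls = values >= t                      # call "progression" at or above t
        sens = (calls & labels).sum() / labels.sum()
        spec = (~calls & ~labels).sum() / (~labels).sum()
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j
```

Because the winning cutoff depends on the empirical distribution of `values`, software-specific shifts in the rCBV maps move the optimal threshold, which is the effect reported in Tables 6 through 8.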

Discussion
DSC perfusion imaging is widely used in brain tumor imaging. In all cases, some form of postprocessing is required to convert the acquired images into a clinically relevant image, such as an rCBV image. The processing required to compute the rCBV includes identification of the time points where the bolus arrives and ends. The area under the relaxivity change curve created by this bolus is the basis of CBV determination. The challenge is that these images have a low signal-to-noise ratio (SNR), and contrast leakage can result in a different baseline intensity after the bolus compared to before the bolus; the postbolus baseline can also have a slope. Overall, the Boxerman et al. 15 modeling method represents the most widely used and accepted model to date. Yet, which models the software programs implement can vary, and the specific method of implementation is often not readily available. IB Neuro and nordicICE employ the Boxerman model as the basis of their algorithms, while GE FuncTool uses linear interpolation from the prebolus and postbolus baselines when calculating the area under the curve. Our study suggests that using different software packages results in clinically significant differences in CBV images, whereas using different operators produces only mild variability. It is important to note that the measurement comparisons we made were for the exact same voxels; the only variables were the software and the operator. While little operator variability was seen, substantial variation between software packages was seen. This variation was not something as simple as a scaling factor, which one could reasonably expect to see. The differences showed some patterns, with one software package being an outlier compared to the other two for each of the three metrics, but for each metric, a different package was the outlier.
The variation is not based on the selection of any particular threshold, but for a broad range of threshold values, the clinical interpretation of the enhancing tissue would be different, depending only on the software used. Normalization of rCBV values through the use of an NAWM mean appears insufficient as a postprocessing step to correct for variation. Normalization through removal of additive bias still did not increase the ICCs to over 0.9, and that is an optimistic correction that only works if the additive bias can be perfectly known on both an intersoftware and operator-specific basis. Regardless, a simple additive bias correction would have mostly empirical support rather than robust theoretical support.
One reassuring aspect of this study was the small variation between operators for a given software package. This suggests that if operators are given criteria for processing using a given software package, the results can be reproduced.
While previously published papers revealed differences in mean rCBV measurements from clinical images between software packages, 17,18 this work makes the additional contribution of analyzing the other previously published metrics: the 95th percentile rCBV and the % voxels above NAWM. While these other metrics did not prove more resistant to intersoftware variability, they had different, large effects on its magnitude without eliminating it. Additional new contributions were that the software packages investigated were expanded to include IB Neuro, and interoperator differences were analyzed. Finally, the results were analyzed within a threshold-based framework, allowing for better estimation of the effect of the measurement differences on clinical practice.
The outcome-based GBM progression classification performance analysis using three different definitions of pseudoprogression did not detect a difference between the software packages when receiver operating characteristic (ROC) curves were constructed with the software-specific threshold ranges. However, when an optimal threshold found for one software package was used for the others, there were many instances of statistically significant differences in sensitivity and specificity. Previously published optimal thresholds for determining tumor progression or recurrence have ranged from 0.71 9 to 1.47 11 to 1.8 10 to 2.6. 8 However, the discrepancy could be attributed to differences in the tumor types allowed in the studies, some allowing tumors other than GBM or high-grade gliomas, or differences in ROI approaches, with different numbers of hotspot voxels or entire tumors being used. It was unclear, though, how much of the difference could be due to the use of different software for CBV computation. For the survival classification criterion, 95% rCBV metric, and the 3 T images, this study's optimal thresholds ranged from 2.64 to 3.19 to 3.50 depending on the software used, with all other variables fixed. For the mean rCBV metric, this study's optimal thresholds were 1.10, 1.33, and 1.39 depending on the software package. While the ROI approach or rCBV metric used clearly has a significant effect on the optimal threshold values, the software effect itself is not negligible. One note of caution is that, in general, the 3 T rCBV values were higher than the 1.5 T values despite the 3 T group having more pseudoprogression cases than progression. This observation further confirms a study that imaged 21 patients at both 1.5 and 3 T and found that the 3 T rCBV was statistically significantly greater for the tumors (p = 0.0041). 30 While that study had differences in acquisition parameters besides field strength, our study confirms those findings with the magnet strength being the only variable. The superior performance of 3 T is likely due to the increased T1 and susceptibility weighting at 3 T versus 1.5 T. For this reason, 1.5 and 3 T data should not be pooled together for accuracy analysis, since they likely have different optimal thresholds. For the cases in this study, the optimal thresholds shown in Tables 6 through 8 were always higher for the 3 T dataset than the 1.5 T dataset, except for FuncTool's thresholds for the combined-classification criterion using the % voxels above NAWM metric. This anomaly could be due to the lower number of combined-classification cases as well as the AUC being 0.500 for that combination of software, metric, and ground truth criterion.
These data support the use of 3 T DSC-MR imaging of GBM patients as opposed to 1.5 T imaging for distinguishing tumor progression or pseudoprogression with spin-echo acquisitions using similar preload dosing. While the same patients were not imaged at both 1.5 and 3 T for direct comparison, and the numbers were small for 3 T, statistical significance was found for the 3 T performance advantage. Due to the importance of this finding, further investigation with larger numbers of cases is indicated.
The lack of statistically significant AUC differences between software packages could reflect inadequacy of the survival- and treatment-based classification criteria, or the number of cases. Additionally, the analysis can be susceptible to small differences in treatment. Since intermediate-survival cases were excluded, the ROC curves using the survival criterion are optimistic. However, because days survived is a surrogate for progression, a characteristic that is in reality continuous, the potential bias is somewhat muted. The combined-classification criterion similarly has the potential to produce a higher performance estimate than would occur in the analysis of new, unknown cases because of the excluded cases. However, it could also be considered stronger and more accurate than either of the other two criteria alone, since it eliminates cases with a more uncertain classification from the analysis. A common practice in the use of rCBV values, as described in other papers, is for users to select ROIs using the rCBV images. That practice is suboptimal because it introduces dependency on the user and makes the method difficult to reproduce. Because such ROIs are not matched to areas of enhancement, it is also unclear what the hotspots represent on conventional imaging. Nevertheless, the point of this paper is that the actual rCBV values that one would see on an rCBV image depend heavily on the software used.
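The software-specific optimal thresholds discussed above can be derived from an ROC analysis. A minimal sketch, assuming Youden's J as the optimality criterion and using hypothetical per-patient data (the actual criterion used by any given study may differ), is:

```python
import numpy as np

def youden_threshold(rcbv_values, is_progression):
    """Pick the rCBV cutoff maximizing Youden's J = sensitivity + specificity - 1.

    rcbv_values    : per-patient rCBV metric (e.g., mean rCBV of the ROI)
    is_progression : 1 for true progression, 0 for pseudoprogression
    """
    x = np.asarray(rcbv_values, dtype=float)
    y = np.asarray(is_progression, dtype=int)
    best_t, best_j = None, -np.inf
    for t in np.unique(x):                # each observed value is a candidate cutoff
        pred = x >= t                     # call "progression" at or above the cutoff
        sens = np.mean(pred[y == 1])      # true-positive rate
        spec = np.mean(~pred[y == 0])     # true-negative rate
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

Because each package produces rCBV values on a different effective scale, running this selection separately per package yields package-specific cutoffs; applying one package's cutoff to another package's values shifts sensitivity and specificity, consistent with the differences observed here.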
There is strong interest in promoting the use of quantitative imaging, but the results here suggest that how rCBV is calculated must be more thoroughly examined before quantitation can be broadly applied. Either a correction factor will need to be found for each software/rCBV metric, an approach not likely to succeed based on this study's data, or the studies published for a given software package and CBV metric will need to be reproduced with the other software packages and metric methods to determine the proper thresholds. We note that the three packages included in our study represent the vast majority of publications that use FDA-cleared software. Because these are proprietary commercial packages, certain details of the algorithms are not available, making it difficult to understand, characterize, or correct for the differences. While there is an accepted general model of the effect of gadolinium on DSC images, the specifics of how the baselines before and after the bolus are determined, how leakage rates after the bolus are estimated, what assumptions are made in correcting for the observed leakage, and how noise is estimated are all critical to computing the rCBV. Unless vendors share their specific algorithms, it will be difficult to explain the basis for the differences we found.
It should be noted that the analyzed images were spin-echo echo-planar T2-weighted acquisitions, and similar results may not occur with gradient recalled echo acquisitions or with spin-echo acquisitions using different contrast administration protocols. However, a decrease in variability was not seen when the 3 T data were compared with the 1.5 T data (see Table 9), suggesting that the precise acquisition method or signal-to-noise ratio is not a significant factor. Further studies are needed to evaluate postprocessing differences using gradient recalled echo data. Some have suggested that spin-echo acquisitions may be more appropriate for brain tumors because they emphasize the smaller vessels seen in brain tumors, as opposed to the large vessel occlusions seen in vascular disease. While this is a theoretical advantage, we are not aware of a study documenting it, and this question warrants further study. Regarding other aspects of the acquisition protocols, this was a retrospective study and the parameters could not be altered, since they were the clinical protocols in use. An increased matrix size might increase the software divergence due to increased noise, but the lower magnetic field strength (with lower SNR) did not show increased divergence. Regarding temporal resolution, we did examine whether there was a noticeable difference in variability when the cases were limited to those with subjectively better bolus curves, and did not find any; however, better temporal resolution might indeed decrease the variability. The same NAWM and tumor ROIs were used across operators and software, so sampling effects should not have influenced the measured variability.
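For reference, the intersoftware agreement summarized in Table 9 can be quantified with an intraclass correlation coefficient. A minimal sketch, assuming the two-way random effects, absolute agreement, single measures form ICC(2,1) and hypothetical data (the exact ICC variant used in a given study may differ), is:

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measures.

    scores : (n_cases, k_raters) array, e.g., one rCBV value per case
             from each software package acting as a "rater"
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row = x.mean(axis=1)   # per-case means
    col = x.mean(axis=0)   # per-software means
    msr = k * np.sum((row - grand) ** 2) / (n - 1)   # between-case mean square
    msc = n * np.sum((col - grand) ** 2) / (k - 1)   # between-software mean square
    mse = np.sum((x - row[:, None] - col[None, :] + grand) ** 2) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because ICC(2,1) penalizes systematic offsets between raters, a package that consistently reports higher rCBV values than another lowers the ICC even when the two packages rank cases identically.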
In this study, the ROI used covered only the enhancing component of the image. It is well known that the region of enhancement does not represent the true extent of infiltrating glioma. Therefore, while the ROI for this study may not be entirely representative, it is exactly the enhancing component that requires differentiation of progression versus pseudoprogression. Because the software is proprietary, we do not have access to the models used to estimate and correct for leakage, but these are likely one source of variability between the software packages. Detecting progression in areas of nonenhancement is clearly an important concern and could ultimately yield intersoftware performance differences in future studies. However, the regions of enhancement are presumed to show the greatest intersoftware differences because of the leakage correction variable. Similarly, lower-grade gliomas would be expected to show decreased rCBV variability due to decreased leakage, although this needs to be confirmed in future studies.
Limiting our patient cohort to GBM patients treated with radiation and temozolomide represents a select group. However, this treatment regimen is quite common and is associated with frequent MRI changes for which perfusion is an important interpretive characteristic. Antiangiogenic agents substantially alter perfusion and enhancement characteristics, and although they are commonly used in this patient group, this very different clinical situation would potentially confound our findings rather than refine them and deserves separate attention.
rCBV values were found to be useful for distinguishing GBM progression from pseudoprogression, as previously shown in the literature. However, as no single software package or rCBV metric provided more useful information than the others, we cannot recommend a specific software package for use in multicenter studies based on these results. Further studies are needed to evaluate DSC data acquired through other methods (such as gradient recalled echo). It is critical, though, that individual trials use the same software package and the same DSC acquisition methods to generate each patient's rCBV images. Additionally, these data show that acquiring images at 3.0 T produces both different optimal thresholds and more valuable information for determining GBM progression than 1.5 T for spin-echo acquisitions. As more research is conducted on the use of rCBV, clinicians are relying upon it more frequently for help with diagnosis and treatment planning. Consequently, the accuracy and precision of rCBV measurements become increasingly important as the analysis becomes more quantitative. This study's implication for clinical practice is clear: if thresholds based on the literature are used, care must be taken to ensure that the same software and processing methods are applied. Additionally, when comparing exams for the same patient or pooling exams for an rCBV study, the same CBV calculation software should be used. This report raises serious doubt about the ability to use quantitative rCBV measures without requiring a specific, consistent software package for processing.

Table 9: Intersoftware ICC, operator 1's data.