More than 90% of patients with advanced prostate cancer develop bone metastases,1 which can produce some of the most severe complications of the disease and are associated with shorter survival times.2–6 New drugs are under development for metastatic castration-resistant prostate cancer (mCRPC), and there is a need for biomarkers to identify target populations and to evaluate treatment effects early, as an alternative to overall survival, which requires long studies and is becoming problematic as an endpoint because of subject crossover and contamination from multiple therapies.
Whole-body bone scintigraphy is the accepted standard imaging modality for detection of bone metastases and assessment of treatment outcomes. Response Evaluation Criteria in Solid Tumors (RECIST), the standard guideline used to assess outcomes in solid tumor malignancies, treats bone lesions as “nonmeasurable” and is therefore of limited usefulness in the setting of prostate cancer treatments.7 Therefore, the Prostate Cancer Working Group 2 (PCWG2) developed visually based criteria for assessing disease progression on bone scans based on counting new lesions.8 PCWG2 does not address the significance of changes in intensity, size, or area of individual lesions, all of which are limited by the challenges of subjective, visual lesion detection. The simple conventional metric of lesion counting is of limited value when assessing treatment effects, as lesions may decrease in size without changing in number or may break into smaller components and, thus, superficially appear as an increased metastatic burden. This motivated the development of computer-aided quantitative measures of disease burden on bone scans.
The two quantitative bone imaging biomarkers that have undergone the most study and use in oncologic clinical trials are the bone scan index (BSI)9 and bone scan lesion area (BSLA).10 The BSI sums the product of the estimated weight and the fractional involvement of each bone, determined visually or from lesion segmentation on the bone scan. BSI was first evaluated as a prognostic biomarker11 and has had a number of more recent follow-up studies.12,13 BSLA was the first quantitative imaging biomarker to be developed and evaluated primarily for treatment response assessment in prostate cancer. The calculation and its ongoing clinical validation will be described in this paper.
In development of an imaging biomarker, there are two important phases: (1) development and analytic validation (including training of classifiers, determination of cut points, assessment of reproducibility, and evaluation against radiologist measurements) and (2) clinical validation in which the system and its cut points are fixed and it is evaluated against outcomes in new clinical trial data. BSLA is an imaging biomarker computed from whole-body scintigraphic imaging as a measure of overall bone tumor burden. Initial development and analytic validation, including evaluation against manual tumor segmentation14 and determination of response thresholds using trial cohorts, are from a single drug treatment (cabozantinib) with controls in subjects with metastatic CRPC.10,15 A 30% increase/decrease in BSLA relative to baseline was defined as progression/response on bone scan based on the data from these previous cohorts. Because of the promising results and urgent need, the BSLA imaging biomarker was rapidly adopted in clinical trials, such as Ref. 16, which used the drug for which the biomarker was initially developed, and Ref. 17, an ongoing trial using a different treatment. Therefore, rather than an investigation into threshold or other algorithm parameters, this paper is focused on clinical validation of an existing test that has been adopted by the research community in prostate cancer clinical trials. In the computer-aided diagnosis research community, the majority of papers involve analytic validation, and this clinical validation is the next step in putting biomarker translation into practice. As a clinical validation, the analysis approach in this paper is consistent with those used in clinical trials.
We seek to establish the clinical value of an existing imaging test based on the quantitative BSLA and will investigate and evaluate the biomarker as a prognostic factor, predictive factor, and surrogate outcome marker. A prognostic factor is a clinical or biologic characteristic that is objectively measurable and that provides information on the likely outcome of the cancer disease in an untreated individual. A predictive factor is a clinical or biologic characteristic that provides information on the likely benefit from treatment (either in terms of tumor shrinkage or survival). Such predictive factors can be used to identify subpopulations of patients who are most likely to benefit from a given therapy. Importantly, prognostic factors define the effects of patient or tumor characteristics on the patient outcome, whereas predictive factors define the effect of treatment on the tumor.18 A surrogate outcome marker can be defined as a laboratory measurement that is used in therapeutic trials as a substitute for a clinically meaningful endpoint, such as survival, and is expected to predict the effect of the therapy.19,20
We hypothesize that, when applied to an independent treatment trial cohort with a different mechanism of drug action, a week 12 change posttreatment using this prespecified threshold for progression is predictive of a subject’s overall survival, i.e., can be used as a surrogate outcome marker. Second, we evaluated the potential of baseline BSLA (disease burden on the baseline scan) as a predictive biomarker used to identify patients most likely to benefit from treatment.
From an anonymized imaging research database, we identified a cohort of 198 mCRPC subjects who enrolled in a multicenter treatment trial of abiraterone acetate (127 treated and 71 placebo) that used a standardized imaging protocol. Subjects were included if original whole-body DICOM images and survival data were available. This cohort was independent of those used for development of the biomarker criteria for progression/response and involved a different mechanism of drug action. Subjects underwent standard-of-care whole-body bone scintigraphy with 99mTc-methylene diphosphonate (99mTc-MDP) at baseline and week 12 posttreatment.
Bone Scan Image Processing
A CADrx system for bone scan assessment was developed within the imaging biomarker information system (IBIS) for image markup and analysis (MedQIA, LLC, Los Angeles, California). The IBIS markup system combines image review capabilities with computer-aided tools for region segmentation, quantitative analysis, and data export for clinical trials. In the CADrx system, anterior and posterior bone scan images are processed with pixel intensity normalization and lesion segmentation, followed by quantitative assessment of lesion burden. The image analysis method was previously described in detail,10,21 and the steps are summarized here.
Anatomical region segmentation
Atlas-based segmentation was performed to label seven anatomical regions on the bone scan: sternum/spine, ribs/head, extremities (arms and legs), pelvis, shoulders, kidney search region, and bladder search region. Registration to the atlas involved affine registration using the Mattes mutual information metric22 followed by a multiresolution demons deformable registration.23 An example of the output of the anatomical region segmentation is shown in Fig. 1.
Image intensity normalization
Image intensity normalization was applied to reduce inter- and intrapatient variations due to differences in body habitus, radiotracer dosing levels, and scan acquisition parameters. A region of normal bone in the extremities was identified automatically based on the anatomical region segmentation. Then all pixel values were linearly rescaled to set the intensity of this normal bone to a reference intensity. After normalization, the pixel intensities of normal bone are consistent between subjects and across time points for a given subject, allowing for reproducible lesion segmentation and quantitative assessment in serial patient images.
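The linear rescaling described above can be sketched in a few lines of Python. This is a minimal illustration, not the IBIS implementation: the function name `normalize_intensity`, the use of the mean as the summary of the normal-bone region, and the reference value of 100 are all assumptions made here for the example.

```python
def normalize_intensity(pixels, normal_bone_mask, reference_intensity=100.0):
    """Linearly rescale an image so normal bone sits at a reference intensity.

    pixels: 2D list of pixel values.
    normal_bone_mask: 2D list of booleans marking the normal-bone region.
    reference_intensity: hypothetical target value (not given in the source).
    """
    normal_values = [
        value
        for row, mask_row in zip(pixels, normal_bone_mask)
        for value, is_normal in zip(row, mask_row)
        if is_normal
    ]
    if not normal_values:
        raise ValueError("normal-bone region is empty")
    # Assumption: the region is summarized by its mean intensity.
    normal_mean = sum(normal_values) / len(normal_values)
    if normal_mean <= 0:
        raise ValueError("normal-bone intensity must be positive")
    scale = reference_intensity / normal_mean
    return [[value * scale for value in row] for row in pixels]


# Toy 2x2 image: the top row is the normal-bone region.
image = [[10.0, 20.0], [30.0, 40.0]]
mask = [[True, True], [False, False]]
normalized = normalize_intensity(image, mask)
```

Because the rescaling is a single multiplicative factor, ratios between lesion and normal-bone intensities are preserved, which is what allows fixed intensity thresholds to be reused across subjects and time points.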
Automated lesion detection
Based on the atlas-based segmentation, anatomical region-specific lesion intensity thresholds for lesion detection were learned previously using receiver operating characteristic (ROC) analysis on a training set of images.10 The ROC analysis was used to set the region-specific thresholds to maximize segmentation accuracy against expert delineated reference segmentations on the training set.14 An additional classification stage was applied to candidate lesions generated by the thresholding to remove false positives related to bladder uptake, kidney uptake, and symmetric degenerative joint disease.21 False positives related to uptake in these anatomical regions were identified and removed based on overlap with corresponding regions from the atlas-based segmentation. Symmetric degenerative joint disease removal involved computing features of lesion candidates: lesion area, mean intensity, perpendicular distance from the midline, and vertical distance along the midline. Lesion candidates were compared in a pairwise manner and symmetric pairs identified based on feature difference thresholds. Parameters in the false positive reduction were trained using a multistart local optimization method using the Nelder–Mead simplex.24
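The two stages above, region-specific thresholding and symmetric-pair detection, can be illustrated with a simplified Python sketch. Everything named here is hypothetical: the threshold values, the tolerance dictionary, the `Candidate` fields, and the convention that the perpendicular distance from the midline is signed (negative on one side, positive on the other). The published method trains these parameters (ROC analysis and Nelder–Mead optimization) and removes kidney and bladder candidates by region overlap; the sketch simply masks those regions pixel by pixel.

```python
from dataclasses import dataclass
from itertools import combinations


def threshold_lesions(pixels, region_labels, thresholds,
                      excluded_regions=("kidney", "bladder")):
    """Mark pixels whose normalized intensity reaches their region's threshold."""
    lesion_mask = []
    for row, label_row in zip(pixels, region_labels):
        mask_row = []
        for value, label in zip(row, label_row):
            if label in excluded_regions or label not in thresholds:
                mask_row.append(False)  # physiologic uptake or unlabeled pixel
            else:
                mask_row.append(value >= thresholds[label])
        lesion_mask.append(mask_row)
    return lesion_mask


@dataclass
class Candidate:
    area: float              # lesion area
    mean_intensity: float    # mean normalized intensity
    midline_distance: float  # signed perpendicular distance (assumption)
    midline_position: float  # vertical distance along the midline


def find_symmetric_pairs(candidates, tol):
    """Return index pairs of candidates that mirror each other across the midline."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(candidates), 2):
        if a.midline_distance * b.midline_distance >= 0:
            continue  # same side of the midline, cannot be a mirrored pair
        if (abs(a.area - b.area) <= tol["area"]
                and abs(a.mean_intensity - b.mean_intensity) <= tol["intensity"]
                and abs(abs(a.midline_distance) - abs(b.midline_distance))
                <= tol["distance"]
                and abs(a.midline_position - b.midline_position)
                <= tol["position"]):
            pairs.append((i, j))
    return pairs


pixels = [[50, 120], [200, 90]]
labels = [["ribs", "ribs"], ["bladder", "pelvis"]]
mask = threshold_lesions(pixels, labels, {"ribs": 100, "pelvis": 80})

cands = [
    Candidate(4.0, 150.0, -10.0, 50.0),
    Candidate(4.5, 155.0, 10.5, 51.0),
    Candidate(20.0, 300.0, 3.0, 20.0),
]
tol = {"area": 1.0, "intensity": 10.0, "distance": 1.0, "position": 2.0}
pairs = find_symmetric_pairs(cands, tol)
```

In this toy input the first two candidates are flagged as a symmetric pair, the pattern the text associates with degenerative joint disease, while the third, larger and brighter candidate is retained.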
Segmentation review and approval
For each bone scan, the results of the automated lesion segmentation were reviewed by a nuclear medicine physician and manually edited (lesion pixels added or removed) as needed. This editing typically involved removal of any remaining false-positive regions (e.g., areas of degenerative joint disease) and took on the order of minutes per scan. Previous studies showed 89% pixel accuracy of the lesion segmentation method against manual expert annotations, so the amount of editing required for a given case is typically minimal.14
Treatment Response Assessment Using Bone Scan Lesion Area
BSLA is computed as $\mathrm{BSLA} = \sum_{p \in L} a_p$, where $L$ is the set of pixels identified as bone lesion and $a_p$ is the physical area of pixel $p$ (in cm$^2$). The BSLA measure thus represents a quantification of the size and number of active regions on the bone scan. BSLA was calculated at baseline and week 12 posttreatment for all subjects in the study. A prespecified 30% increase in BSLA from baseline to week 12 was used to identify subjects with progressive disease (PD). Subjects with an increase of less than 30%, or with a decrease, in BSLA were categorized as nonprogressive disease (non-PD). For evaluation as a prognostic factor, the dataset was dichotomized about the median baseline BSLA. Figure 2 shows examples of bone scans with lesions semiautomatically segmented in red and changes in BSLA computed from baseline to week 12. The examples reflect PD (an increase in lesion burden) and non-PD (stable and reduction in lesion burden).
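As a concrete illustration of the BSLA sum and the 30% progression rule, the following Python sketch computes lesion area from a binary mask and classifies a subject from two time points. The function names and the assumption of a uniform pixel spacing given in millimetres are ours; how the anterior and posterior views are combined is not specified in this section and is not modeled.

```python
def bsla_cm2(lesion_mask, pixel_spacing_mm):
    """Sum the physical area of all lesion pixels, in square centimetres.

    Assumes every pixel has the same spacing (row_mm, col_mm).
    """
    row_mm, col_mm = pixel_spacing_mm
    pixel_area_cm2 = (row_mm * col_mm) / 100.0  # 100 mm^2 per cm^2
    n_lesion = sum(1 for row in lesion_mask for is_lesion in row if is_lesion)
    return n_lesion * pixel_area_cm2


def percent_change(baseline, followup):
    """Percentage change in BSLA relative to baseline."""
    if baseline <= 0:
        raise ValueError("baseline BSLA must be positive")
    return 100.0 * (followup - baseline) / baseline


def classify_progression(baseline, followup, threshold_pct=30.0):
    """PD if BSLA increased by threshold_pct or more; otherwise non-PD."""
    change = percent_change(baseline, followup)
    return "PD" if change >= threshold_pct else "non-PD"


mask = [[True, False], [True, True]]
area = bsla_cm2(mask, (2.0, 2.0))                  # three pixels of 0.04 cm^2
progressed = classify_progression(120.0, 162.0)    # +35%
stable = classify_progression(120.0, 100.0)        # about -17%
```

A subject with no measurable lesion area at baseline has an undefined percentage change; the sketch raises an error in that case rather than guessing how such subjects were handled.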
BSLA was evaluated as a prognostic factor, predictive biomarker, and a surrogate outcome biomarker. Subjects were grouped as PD versus non-PD and multivariate Cox regression was used to test whether (1) baseline BSLA and (2) early changes in BSLA (12 weeks posttreatment) were predictive of overall survival. Landmark survival analyses were used to assess early changes. Kaplan–Meier plots and hazard ratios were used to evaluate differences among groups defined by the BSLA biomarker.
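For readers unfamiliar with these methods, the sketch below shows a landmark cohort construction followed by a Kaplan–Meier estimate, written in plain Python. It is illustrative only: the function names, the tuple layout, and the choice of day 84 for the week 12 landmark are assumptions, and the study's actual analysis (including the Cox regression) would normally be run in a statistics package.

```python
def landmark_cohort(subjects, landmark_day=84):
    """Keep subjects still under follow-up at the landmark and reset time zero.

    subjects: list of (survival_days, died, group) measured from baseline.
    """
    return [
        (days - landmark_day, died, group)
        for days, died, group in subjects
        if days > landmark_day
    ]


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct death time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored subjects leave the risk set
    return curve


def median_survival(curve):
    """First time at which estimated survival drops to 50% or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached


cohort = landmark_cohort([(50, 1, "PD"), (184, 1, "PD"), (484, 0, "non-PD")])
curve = kaplan_meier([100, 200, 200, 300, 400], [1, 1, 0, 1, 0])
median = median_survival(curve)
```

The landmark step matters because PD status is only known at week 12: measuring survival from that point, among subjects still in follow-up, avoids crediting the non-PD group with time that every classified subject had to survive anyway.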
Prognostic and Predictive Biomarker Evaluation
Median BSLA at baseline was . BSLA at baseline was a prognostic factor for delaying progression ( and ) and predictive of longer survival ( and ). Figure 3 shows Kaplan–Meier plots of the proportion of subjects surviving a given number of days beyond baseline when separated into groups based on the baseline BSLA score.
Figure 3(a) shows BSLA as a prognostic factor including all subjects, both treatment and control groups. It shows that subjects with low baseline BSLA scores () have a better overall prognosis in terms of survival time.
Figure 3(b) shows BSLA as a predictive biomarker with subjects separated into treatment and control groups. It shows that subjects with high baseline BSLA scores () can be predicted to experience treatment benefit relative to controls (red versus brown survival curves). Subjects with low baseline BSLA scores undergoing treatment () can be predicted to have a better survival outcome than those with high BSLA scores. Subjects with low baseline BSLA scores have a relatively good overall prognosis, irrespective of whether they are treated (blue and black survival curves).
Early Surrogate Outcome Evaluation
Overall survival rates between PD and non-PD groups were statistically different ( and ). Subjects without PD by BSLA at week 12 had longer survival than subjects with PD: median 398 days versus 280 days (378 days versus 228 days after adjustment for baseline BSLA ). Similar differences were seen within the treatment and placebo groups (see Table 1). The corresponding Kaplan–Meier survival curves are shown in Fig. 4, and multivariate Cox regression analysis for survival is shown in Table 2.
Table 1. Median survival in days after week 12, with number of subjects in each group (adjusting for baseline BSLA score <200 cm$^2$).
|Median (±IQR) survival after week 12 (N=number of subjects)||PD by BSLA||Non-PD by BSLA|
|Placebo group||186 () days ()||170 () days ()|
|Treatment group||260 () days ()||392 () days ()|
|All||228 () days ()||378 () days ()|
Table 2. Multivariate Cox regression for survival.
|Coefficient||HR (±SE)||p-value||95% CI|
|Treatment||0.49 ()||0.002||[0.32, 0.76]|
|Baseline BSLA||0.34 ()||[0.20, 0.58]|
|Interaction between treatment and BSLA||2.15 ()||0.019||[1.14, 4.08]|
|Non-PD||0.64 ()||0.007||[0.46, 0.88]|
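As a reading aid for the Cox regression table, the sketch below back-calculates log-scale quantities from a reported hazard ratio and its 95% confidence interval. It assumes the intervals are Wald intervals of the form exp(β ± 1.96·SE), which the source does not state; the function name is ours, and the derived standard errors and p-values are approximations limited by the two-decimal rounding of the published values.

```python
import math


def wald_summary(hr, ci_low, ci_high, z_crit=1.959964):
    """Recover log-scale quantities from a hazard ratio and its 95% CI.

    Assumes a Wald interval: exp(log_hr +/- z_crit * se_log_hr).
    """
    log_hr = math.log(hr)
    se_log_hr = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    # A Wald interval is symmetric on the log scale, so its geometric
    # centre should reproduce the reported hazard ratio.
    ci_center = math.sqrt(ci_low * ci_high)
    z = log_hr / se_log_hr
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return {
        "log_hr": log_hr,
        "se_log_hr": se_log_hr,
        "ci_center": ci_center,
        "z": z,
        "p_value": p_value,
    }


treatment = wald_summary(0.49, 0.32, 0.76)
interaction = wald_summary(2.15, 1.14, 4.08)
non_pd = wald_summary(0.64, 0.46, 0.88)
```

A hazard ratio below 1 corresponds to a negative log-hazard coefficient (lower risk of death), while the interaction term above 1 has a positive coefficient. For each row the geometric centre of the interval lands close to the tabulated hazard ratio, which is what a log-symmetric interval would produce.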
As a prognostic and predictive biomarker, the BSLA can facilitate patient management and prospective determination of those most likely to benefit from a given therapy, rather than beginning a therapy and waiting months to see if the disease progresses or not, which is particularly problematic for advanced prostate cancer. Specifically, subjects with high BSLA should be treated (low BSLA has a relatively good prognosis regardless of whether treated or not). In addition to using the median baseline BSLA (50th centile) as a prognostic cut point, we performed a sensitivity analysis by testing the 33rd and 67th centiles. Both of these also gave statistical significance as a prognostic factor, indicating that definitions of “low” and “high” baseline BSLA are likely to be robust.
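The cut-point sensitivity analysis can be sketched as follows. The nearest-rank percentile definition, the function names, and the toy BSLA values are assumptions for illustration; the study's percentile method is not specified, and in practice each grouping would be passed to a survival comparison rather than simply counted.

```python
import math


def nearest_rank_percentile(values, pct):
    """Percentile by the nearest-rank method (one of several definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def dichotomize(values, cut_point):
    """Label each baseline BSLA as 'low' (below the cut point) or 'high'."""
    return ["low" if v < cut_point else "high" for v in values]


baseline_bsla = [10, 20, 30, 40, 50, 60]  # toy values in cm^2
groupings = {}
for pct in (33, 50, 67):
    cut = nearest_rank_percentile(baseline_bsla, pct)
    groupings[pct] = (cut, dichotomize(baseline_bsla, cut))
```

Repeating the prognostic test for each grouping shows whether the conclusion depends on where the line between low and high burden is drawn.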
Table 2 shows that treatment and non-PD are factors in lowering the hazard ratio, i.e., having longer survival. This is reflected in the Kaplan–Meier curves of Fig. 4; the treated, non-PD subjects have the longest survival times and the control, PD subjects the shortest. As a surrogate outcome measure, the BSLA can be used in clinical trials to speed up drug development by determining utility without waiting for survival. This can be particularly useful in adaptive designs and dose-ranging studies.10 It can thus be used to develop and evaluate new mCRPC therapies more quickly.
In this study, we dichotomized subjects based on PD only (i.e., PD versus non-PD, rather than responders versus nonresponders) to obtain similar numbers of subjects in each of the two groups. However, BSLA also allows classification of subjects as responders to therapy (reduction in BSLA of 30% or more) as described in Ref. 10. In a previous study,15 BSLA was used to group subjects into responders and nonresponders, with a significant difference in hazard between the groups (HR 0.47, 95% CI 0.28 to 0.79, and ). We did not form a separate responder group in this study since the number of such subjects was relatively low.
As described in the Introduction, BSI was originally evaluated as a prognostic biomarker. More recently, changes in BSI have also been investigated retrospectively, with various cut points for BSI groupings being explored rather than prespecified. In an mCRPC cohort, Reza et al.25 found that patients with an increase in BSI at follow-up of at most 0.30 had a significantly longer median survival time than those with a larger increase in BSI. They note that the retrospective design (choice of BSI cut point) was a limitation. In another mCRPC cohort,26 in which a different cut point of no increase from baseline BSI was applied, the group without an increase had a significantly longer time to progression in bone than those who had a BSI increase during treatment. These studies differ from ours in that we prespecified the criterion for disease progression, an increase in BSLA of 30% or more, and then applied it prospectively to this and other new cohorts to demonstrate robustness across different therapeutic protocols.
The focus of this paper has been on clinical validation of an existing algorithm already adopted in trials. However, as more data are becoming available, there will be an opportunity to update parameters in the algorithm, such as the intensity thresholds and response/progression cut points, and to include more advanced classifiers to further improve performance. For example, the currently used 30% cut point for progression/response was set conservatively in a small developmental set such that all control subjects had BSLA changes less than this threshold,10 and, as further reproducibility studies are performed, we may be able to reduce the threshold and increase the sensitivity of the biomarker. Because the initial developmental set was relatively small, the subsequent larger clinical validation studies, such as Ref. 16 and this one in a different drug treatment, are particularly important to show that the current algorithm in use in clinical trials provides an effective surrogate for overall survival.
BSLA is calculated semiautomatically from bone scans and provides a quantitative and objective treatment response assessment. Baseline BSLA and early changes posttreatment were found to be predictive of overall survival in patients with mCRPC. BSLA has now been demonstrated to be an early surrogate outcome for overall survival in different prostate cancer drug treatments.
Conflict of interest disclosure: Matthew S. Brown and Jonathan G. Goldin are directors of MedQIA Imaging CRO.
The authors acknowledge the support of MedQIA, LLC in collaboration and data sharing and Ms. Eloisa Rodriguez-Mena for article preparation and formatting.
Matthew S. Brown received his PhD in computer science from the University of New South Wales, Sydney, Australia, in 1997. Currently, he is the director of the Center for Computer Vision and Imaging Biomarkers and a professor of radiological sciences at the University of California, Los Angeles.
Grace Hyun J. Kim received her PhD in biostatistics from the University of California, Los Angeles in 2007. She became an assistant professor at the University of California, Los Angeles in 2009 and currently is an associate professor. Her research involves statistical approaches to discover and validate imaging biomarkers derived from advanced image analysis. The goal of this research is early detection, diagnosis, and treatment of cancer.
Gregory H. Chu received his master’s degree in physics and biology in medicine from the University of California, Los Angeles in March 2016. His research interests include computer vision, machine learning, imaging biomarkers, disease detection, and measurements.
Bharath Ramakrishna received his PhD in computer engineering from the University of Maryland Baltimore County in 2008. He is a technology leader with over 8 years of experience productizing cutting-edge research from concept to launch in the regulated health tech space. He is passionate about turning breakthrough technologies into innovative products and building high-performance collaborative teams.
Martin Allen-Auerbach received his MD degree from the Universitat Wien Medizinische Fakultat in 1995. Currently, he is the director of nuclear medicine at the Westwood and Santa Monica Hospitals of University of California, Los Angeles, as well as an associate clinical professor of molecular and medical pharmacology.
Cheryce P. Fischer received her MD degree from Robert Wood Johnson Medical School in 1994. Her specialty is diagnostic radiology in the Department of Radiological Sciences at University of California, Los Angeles.
Benjamin Levine received his MD degree in 2003 from Loyola University of Chicago. He has been an associate clinical professor in the Department of Radiology at University of California, Los Angeles since 2009. He has also been the dual energy CT-Gout program director since 2014 and is the chair of the medical student course at University of California, Los Angeles.
Pawan K. Gupta received his MD degree from the University College of Medical Sciences in 1985. His specialty is nuclear medicine in the pharmacology department of the University of California, Los Angeles.
Christiaan W. Schiepers received his MD degree from the Universiteit Utrecht in 1982. His specialty is diagnostic radiology in the pharmacology department at the University of California, Los Angeles.
Jonathan G. Goldin received his MD degree and PhD from the University of Cape Town Faculty of Medicine. Currently, he is the executive chief of clinical care, chief of radiology department, professor of radiology and biomedical physics program at the University of California, Los Angeles, as well as director of Santa Monica multispecialty radiology.