Emerging research into certain disease risk factors has led to an increase in risk awareness among the medical community and general population.1, 2, 3 Consequently, there is a growing desire to use risk factors to monitor and maintain health and possibly delay or prevent future disease. Current examples of commonly monitored risk factors include blood pressure, cholesterol, weight or body mass index (BMI) and prostate specific antigen (PSA), to name but a few.4, 5, 6, 7 However, to be effective in monitoring health, these risk factors should be attainable noninvasively, or through sampling of body fluids only, and they should also be measurable at a frequency enabling a quantitative assessment of change due to environmental impact, preventive interventions, and/or aging.
In the field of preventive oncology, there is a shift toward the development of quantitative models for risk assessment.8 With respect to breast cancer, numerous studies have demonstrated that mammographic density (MD), obtained from standard x-ray mammography, is a strong independent risk factor for the disease.9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 As a result, MD is often used in epidemiological and pharmaceutical studies as a surrogate marker for breast cancer risk when measurements are executed on an interval scale, using threshold modeling or similar means of quantification.12, 20, 21 However, with the exception of clinical trials, MD is generally not reported on an interval scale to the physician and/or to the patient, as such an analysis would require a trained individual to extract this information from analogue or digital mammograms. Instead, radiologists commonly only classify MD as low, medium, or high, rendering it insufficient for breast health/risk monitoring. Furthermore, because mammography involves exposure to ionizing radiation, and its sensitivity in premenopausal women is reduced, it is not appropriate for young women or women at increased risk (e.g., mutation carriers), nor for repeated assessments, as would be required, for example, in chemoprevention trials.
Transillumination breast spectroscopy (TiBS) is a nonimaging, noninvasive optical technique that provides information about bulk breast tissue properties (light scattering, water, lipid, and hemoglobin content) based on the spectral dependence of a photon’s probability to pass through this tissue.22, 23, 24, 25, 26, 27 Recently, we demonstrated the ability of TiBS to differentiate women with MD from those with MD with an associated area under the curve (AUC) of 0.92 using receiver operating characteristic (ROC) analysis, indicating good sensitivity and specificity of this technique in assessing MD on a categorical scale.28 In this analysis, we evaluated the ability of TiBS to predict MD on an interval scale, as required for quantification of change, for example, in chemoprevention trials, among 232 pre- and postmenopausal women. MD was obtained from analogue mammograms using a computer-assisted threshold program12 (i.e., Cumulus) and subsequently used as the target data in the training of a partial least-squares (PLS) regression based model,29 with the TiBS spectra acting as the input data.
Materials and Methods
Study participants were recruited between March 1, 2000, and September 30, 2004, from the Marvelle Koffler Breast Imaging Centre at Mount Sinai Hospital in Toronto. This study was approved by the Research Ethics Boards of the University of Toronto, Mount Sinai Hospital, and the University Health Network. Inclusion criteria were an analogue standard screening mammogram within approximately prior to recruitment, but not exhibiting any radiologically suspicious lesions. Exclusion criteria included prior fine-needle aspiration, core biopsies, or any other type of breast surgery including breast reduction or augmentation and any type of tattoos on the breast(s). Additionally, women showing left and right asymmetry in MD, based on classification of mammograms by an expert radiologist ( difference), were excluded, thus limiting participants to women whose breast tissue retained symmetric tissue aging.
Information concerning participants’ age, menopausal status, height, and weight was collected by means of a self-administered questionnaire. Postmenopausal status was defined as having had no menstrual period for at least . Height and weight were used to calculate BMI, defined as weight in kilograms divided by the square of the height in meters.
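The BMI definition above can be written as a short function. This is a minimal sketch; the function name and the example participant are illustrative and are not taken from the study data.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = weight in kilograms divided by the square of the height in meters."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical participant (not from the study): 70 kg, 1.65 m.
print(round(body_mass_index(70.0, 1.65), 1))  # prints 25.7
```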
Quantification of MD from Mammograms
Mammographic breast tissue density (MD) was used as the gold standard to evaluate the potential for TiBS to estimate breast cancer risk. All film mammograms in cranial-caudal view (2 volunteers, ) were digitized using a Lumisys Digital Scanner (Kodak, Rochester, New York, USA) at resolution and a pixel pitch of . Digitized mammograms were examined using Cumulus,12 an interactive density-threshold software. For each image, the trained reader interactively selects two pixel level values using a special outlining tool. The first level selected separates the outer edge of the breast from the image background. The second level defines the edges of x-ray dense tissue regions, where all pixels within the outlined region of interest are considered to represent mammographic densities. Additionally, the program enables the user to outline the pectoral muscle thus excluding contributions of the x-ray dense muscle from MD. Pixels between the delineated pectoral muscle and the edge threshold represent total breast area, while those above the density threshold represent dense tissue areas. The percent dense area (MD) is defined by the ratio of dense area(s) to the total breast area, multiplied by 100.
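The two-threshold definition of percent dense area can be illustrated with a small sketch. It is not the Cumulus implementation: the function name, the toy image, and the assumption that brighter pixels are more x-ray dense are all ours, and in practice both thresholds and the pectoral muscle outline are set interactively by the reader.

```python
def percent_density(pixels, edge_threshold, density_threshold, excluded=None):
    """Percent dense area from a grayscale image and two gray-level thresholds.

    pixels: 2D list of gray levels (assumed: brighter = more x-ray dense).
    edge_threshold: level separating the breast from the image background.
    density_threshold: level above which breast pixels count as dense tissue.
    excluded: optional set of (row, col) pixels outlined as pectoral muscle.
    """
    excluded = excluded or set()
    breast = 0
    dense = 0
    for r, row in enumerate(pixels):
        for c, value in enumerate(row):
            # Skip background and any pixel outlined as pectoral muscle.
            if (r, c) in excluded or value <= edge_threshold:
                continue
            breast += 1
            if value > density_threshold:
                dense += 1
    if breast == 0:
        raise ValueError("no breast pixels above the edge threshold")
    return 100.0 * dense / breast


# Toy 3 x 4 "image" with hypothetical gray levels.
image = [
    [0, 0, 40, 60],
    [0, 50, 120, 200],
    [0, 45, 130, 250],
]
print(percent_density(image, edge_threshold=10, density_threshold=100))  # prints 50.0
```

Passing `excluded={(1, 3), (2, 3)}` removes the two brightest pixels from both the dense and the total area, mimicking the exclusion of the pectoral muscle.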
MD measurements on all mammograms were performed by two individual raters (KMB, LL) after being trained in the use of Cumulus by an expert rater (NB). Mammograms were presented in a randomized order over eight different sessions ( mammograms per session). To assess the reliability of MD measurements, a randomly selected subset of mammograms (15%) were interspersed throughout the eight sessions ( or nine films per session). For the trained raters, the reading of MD was repeated three times (reads 1 to 3) for the entire data set with a period of at least separating each read. The expert rater only read the mammograms once, including repeats. Table 1 displays the breakdown of study volunteers, as well as MD, demographic, reproductive, and anthropometric information.
Table 1. MD, demographic, reproductive, and anthropometric information of all participants with analogue mammograms.
|MD (%) mean (sd)||35.5 (19.7)||23.1 (18.6)||28.9 (20.1)|
|Age (years) mean (sd)||45.9 (4.1)||55.4 (6.3)||50.9 (7.1)|
|BMI mean (sd)3||26.3 (6.4)||26.7 (5.1)||26.6 (5.7)|
Optical Setup and Procedure
The instrumentation used to gather transillumination spectra was previously described in detail.30 A halogen lamp served as broadband light source. UV, part of the visible spectrum, and mid-IR radiation were eliminated using a cut-on and a heat rejection filter, respectively. A total power of , covering the bandwidth, was delivered to the skin via a light guide. Transmitted light was collected via a -diam optical fiber bundle (140 fibers, core diameter, numerical aperture: 0.36, P & P Optica, Kitchener, Canada), pointed coaxially toward the light source. The interoptode distance was provided by a caliper holding the optodes. The source fiber was placed against the skin on the top surface of the breast with minimal compression. Wavelength-resolved detection was achieved using a spectrophotometer (Kaiser Optical Systems, Inc., Ann Arbor, MI, USA) with holographic transillumination grating ( blazed at ) and a CCD (Photometrics, Tucson, AZ, USA) with a spectral resolution of better than from ( wavelengths). These wavelengths were selected as they include the absorption spectra of the primary tissue chromophores, namely, hemoglobin (deoxy- and oxy-), water, and lipid. Health Canada Investigational New Device Class II approval was obtained.
All measurements were taken in the dark, with the participant seated comfortably in an upright position and the breast resting on the support platform containing the attached caliper. A total of eight measurements in cranial-caudal projection were taken per individual, four per breast (center: midline close to the pectoral muscle; medial: from the inner edge; distal: behind the nipple; lateral: from the outer edge). With the measurement of four positions on each breast, an ovoid shaped volume of approximately for a -thick breast is sampled.30, 31 Numerical modeling demonstrated that 70% of the total breast tissue contributed to 98% of the optical signal. Temporal and spatial reproducibility of the optical measurements is good, as previously addressed.30, 31 The entire TiBS procedure takes approximately .
Preparation of Spectra for Data Analysis
All spectra were corrected for variations in the wavelength-dependent signal transfer function of the optical system and the thickness of the interrogated tissue, such that all spectra used in further data processing are independent of the instrument and interoptode distance resulting in units of optical density per centimeter (OD ) for the spectra. For correction of the signal transfer function, spectra were referenced to a -thick ultra-high-density polyurethane transmission standard (Gigahertz Optics, Munich, Germany). Its optical properties (OD to 2.3 over the wavelength range of interest) were measured separately using an integrating sphere diffuse reflectance setup.30
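One way to picture the two corrections is the sketch below. The exact formula used in the study is not given in the text, so this assumes the common form in which the tissue optical density is the known optical density of the transmission standard minus the base-10 logarithm of the ratio of tissue to standard counts, followed by division by the interoptode distance. All names and numbers are hypothetical.

```python
import math


def od_per_cm(tissue_counts, reference_counts, reference_od, distance_cm):
    """Convert raw detector counts to optical density per centimeter.

    Assumed model (not confirmed by the source), per wavelength:
        OD_tissue = OD_reference - log10(I_tissue / I_reference)
    The result is divided by the interoptode distance in centimeters.
    """
    if distance_cm <= 0:
        raise ValueError("interoptode distance must be positive")
    if not (len(tissue_counts) == len(reference_counts) == len(reference_od)):
        raise ValueError("all spectra must share the same wavelength grid")
    corrected = []
    for i_tissue, i_ref, od_ref in zip(tissue_counts, reference_counts, reference_od):
        od = od_ref - math.log10(i_tissue / i_ref)
        corrected.append(od / distance_cm)
    return corrected


# Hypothetical two-wavelength example with a 5 cm interoptode distance.
spectrum = od_per_cm([10.0, 1.0], [100.0, 100.0], [2.0, 2.0], 5.0)
```

Referencing every measurement to the same standard removes the wavelength-dependent transfer function of the instrument, and the division makes spectra from breasts of different thickness comparable.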
Prior to PLS analysis, the error in MD measurements for each rater and between raters (intra- and interrater error, respectively) was determined as this can directly affect the quality of PLS algorithm training and hence the strength of the attainable correlation between TiBS spectra and MD (see the following). Intraclass correlation coefficients (ICCs) were calculated to assess intrarater repeatability for each read (1 to 3) for the two trained raters (KMB and LL) and for the expert rater (NB, one read only). To assess interrater error, two methods were used. First, the mean absolute MD difference between each trained rater and that of the expert was calculated for each mammogram and each read using a mixed linear model (PROC MIXED in SAS 9.1). Second, an interclass correlation coefficient was calculated between each trained rater and the expert rater for each read.
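The two agreement measures can be sketched as follows. The study used a mixed linear model in SAS and does not state which form of the ICC was computed, so the one-way random-effects ICC below is an assumption chosen for illustration, and the readings are invented.

```python
def mean_absolute_difference(reads_a, reads_b):
    """Mean absolute difference in MD between two sets of paired reads."""
    if len(reads_a) != len(reads_b) or not reads_a:
        raise ValueError("reads must be non-empty and paired")
    return sum(abs(a - b) for a, b in zip(reads_a, reads_b)) / len(reads_a)


def icc_one_way(ratings):
    """One-way random-effects ICC (assumed form, not confirmed by the source).

    ratings: list of rows; each row holds the k reads of one mammogram.
    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW)
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical percent-density reads of three mammograms by two raters.
trained = [10.0, 30.0, 50.0]
expert = [12.0, 28.0, 52.0]
print(mean_absolute_difference(trained, expert))  # prints 2.0
agreement = icc_one_way([[a, b] for a, b in zip(trained, expert)])
```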
Training of the PLS algorithm was executed on a subset of randomly selected spectra ( or 75%) and the predictive power of the algorithm tested by including the remaining 25% of spectra in a validation data set . The PLS function in MATLAB29 extracts a common spectrum, called a vector , from the entire training spectral data set that when multiplied as an inner (dot) product with an individual’s spectrum (OD ) produces a scalar or the target, MD where . The vector identifies those wavelengths, and indirectly also morphological and structural traits, that contribute positively, negatively, or not appreciably to MD . The 75 versus 25% ratio for training and validation ensures that the training set covers a sufficient range of the variation within the population without overtraining the system, thereby permitting the PLS algorithm to retain validity on the validation set.29 Because the PLS algorithm uses absolute intensities, spectra were not mean centered as in our previous analysis.28
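The sketch below illustrates the two ideas in this paragraph: a random 75/25 split and a regression vector whose inner product with a spectrum yields a predicted MD. It is not the MATLAB routine used in the study. For brevity it fits a PLS1 model with a single latent component on uncentered data (the number of components used in the study is not stated), and all names and data are hypothetical.

```python
import math
import random


def split_indices(n, train_fraction=0.75, seed=0):
    """Randomly split range(n) into training and validation indices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n))
    return indices[:cut], indices[cut:]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pls1_one_component(spectra, targets):
    """Regression vector of a one-component PLS1 fit (no mean centering).

    Weight vector w is proportional to X^T y, scores are t = X w, and the
    targets are regressed on the scores, giving the vector b = q * w.
    """
    n_wavelengths = len(spectra[0])
    w = [sum(x[j] * y for x, y in zip(spectra, targets)) for j in range(n_wavelengths)]
    norm = math.sqrt(dot(w, w))
    w = [value / norm for value in w]
    scores = [dot(x, w) for x in spectra]
    q = dot(targets, scores) / dot(scores, scores)
    return [q * value for value in w]


def predict_md(regression_vector, spectrum):
    """Predicted MD is the inner product of the spectrum and the vector."""
    return dot(spectrum, regression_vector)


# Toy data: every spectrum is a scaled copy of one shape, and MD = 2 * scale.
shape = [1.0, 2.0, 0.5]
scales = [1.0, 2.0, 3.0, 4.0]
toy_spectra = [[a * s for s in shape] for a in scales]
toy_md = [2.0 * a for a in scales]
vector = pls1_one_component(toy_spectra, toy_md)
train_idx, validation_idx = split_indices(8)
```

A full analysis would add further latent components and choose their number on the training set only, so that the validation set remains an independent check.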
Since only MD of the entire breast (global assessment) is an established marker for breast cancer risk, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 the four spatially collected spectra per breast were considered as a single vector for PLS training with the corresponding Cumulus MD as the target . Therefore, spectra from all quadrants on each breast (center, medial, distal, lateral) were appended to create a single spectrum to approximate global information (Fig. 1 ). Hence, for each individual two spectra with corresponding MD (left and right breasts were treated as separate events) were used as the input and target variables, respectively.
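Appending the four position spectra into one input vector per breast amounts to a concatenation in a fixed order. The sketch below assumes the order center, medial, distal, lateral listed above; the dictionary layout and the values are hypothetical.

```python
POSITIONS = ("center", "medial", "distal", "lateral")


def append_spectra(spectra_by_position):
    """Concatenate the four position spectra of one breast in a fixed order."""
    appended = []
    for name in POSITIONS:
        appended.extend(spectra_by_position[name])
    return appended


# Hypothetical two-wavelength spectra for one breast.
one_breast = {
    "center": [0.61, 0.58],
    "medial": [0.55, 0.52],
    "distal": [0.70, 0.66],
    "lateral": [0.59, 0.57],
}
appended_spectrum = append_spectra(one_breast)
```

Keeping the order identical for every breast matters, because each element of the regression vector is tied to one wavelength at one measurement position.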
We considered multiple PLS models (Table 2 ) to determine which spectral processing technique on the input data and/or the target data best predicts individual global MD . For the target variable , we examined a model that included only Cumulus MD of the expert rater (NB), as well as models that used the average Cumulus MD of all three raters, to approximate the “gold standard” or best available read (see the following results). We also considered models that excluded mammograms (and associated appended spectra) with greater than 10% absolute disagreement in Cumulus MD between any two raters and models where the Cumulus MD of each rater was not averaged and hence each appended spectrum was used up to three times in the training of our PLS algorithm. In some models, detection shot or CCD readout noise seen in the TiBS spectra was reduced using either one of two MATLAB defined functions: Boxcar (window ) or Savitzky-Golay (Savgol) smoothing and differentiation with a filter width or a third-order polynomial.29
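Boxcar smoothing is a moving average over a fixed window. The sketch below is a generic version, not the MATLAB function used in the study; the window width in the example and the shrinking of the window at the edges of the spectrum are our assumptions.

```python
def boxcar_smooth(spectrum, window):
    """Moving-average (boxcar) smoothing with an odd window width.

    Near the ends of the spectrum the window is truncated to the available
    points (an assumed edge treatment).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(spectrum)):
        start = max(0, i - half)
        stop = min(len(spectrum), i + half + 1)
        segment = spectrum[start:stop]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


print(boxcar_smooth([1.0, 2.0, 3.0, 4.0, 5.0], 3))  # prints [1.5, 2.0, 3.0, 4.0, 4.5]
```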
Table 2. Results of linear regression analysis [Cumulus MD (Yi) = independent variable (x) and PLS predicted MD (Ŷi) = dependent variable (y)] and Spearman’s rho correlation coefficients for different PLS models for the training and validation data sets.
|MD (Yi)||Spectra (Xi), Shot-Noise Minimized||N||Data Set||Intercept||95% CI||β||95% CI||R2||Spearman’s rho|
Spectral processing and PLS algorithm training were executed in MATLAB version 12.0 and statistical analyses were conducted using SAS version 9.1. For each PLS model, the association between the PLS predicted MD and the target, Cumulus MD, was established using Spearman’s rank correlation and linear regression analysis [where Cumulus MD was the independent variable (x) and PLS predicted MD the dependent variable (y)] for both the training and validation data sets. For each model, we also examined the residuals [i.e., Cumulus MD minus PLS predicted MD] plotted against Cumulus MD. For statistical testing, values were considered significant.
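The three summaries named here (regression of predicted on Cumulus MD, Spearman’s rank correlation, and residuals) can be sketched in plain Python. This is an illustration rather than the SAS analysis: confidence intervals and significance tests are omitted, and the example values are invented.

```python
import math


def linear_fit(x, y):
    """Ordinary least squares of y on x: returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x = sum(rx) / n
    mean_y = sum(ry) / n
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    return sxy / math.sqrt(sxx * syy)


def residuals(cumulus_md, predicted_md):
    """Residual = Cumulus MD minus PLS predicted MD, as defined in the text."""
    return [c - p for c, p in zip(cumulus_md, predicted_md)]


# Hypothetical percent-density values for four breasts.
cumulus = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 41.0]
intercept, slope, r_squared = linear_fit(cumulus, predicted)
rho = spearman_rho(cumulus, predicted)
```

Regressing the residuals on Cumulus MD with the same `linear_fit` gives the residual slope used to check whether the prediction error grows with density.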
Intra- and Interrater Error in Cumulus MD
The intrarater ICC for the expert (NB) was 0.97. For each trained rater (KMB and LL), the intrarater ICC improved with each read and was highest for the third read (0.96 and 0.93, respectively). The mean absolute difference in Cumulus MD for all mammograms between each rater and the expert also decreased with each read (i.e., from read 1 to read 3). The mean absolute difference between rater 1 and the expert and rater 2 and the expert for the third read was 6.0% (95% CI: 5.4%–6.4%) and 5.4% (95% CI: 4.9%–6.4%), respectively, and both were significantly different from zero . The corresponding interrater ICC for rater 1 was 0.92, while for rater 2 it was 0.93.
PLS Predicted MD
Because the third read showed the highest intra- and interrater ICC and the smallest absolute difference in Cumulus MD between each trained rater and the expert, the target used in all PLS modeling was MD from Cumulus read 3 for the trained raters.
Table 2 displays the results of linear regression analysis and Spearman’s rank correlation coefficients for the association between Cumulus MD (the independent variable, ) and PLS predicted MD (the dependent variable, ) for both the training and validation data sets.
Training the PLS algorithm using Cumulus MD of the expert rater only (model 1) or Cumulus MD of all three raters (not averaged) (models 7 to 9) as the target yielded the lowest R2 values (0.61 to 0.74) and Spearman’s rank correlation coefficients (rho: 0.80 to 0.84) for the validation data set. Furthermore, for models 7 to 9 the slope of the regression line in the validation set was significantly less than 1.0, suggesting a bias in model development. Averaging Cumulus MD of all three raters (model 2) improved both the R2 value and Spearman’s rho correlation coefficient of the validation set (i.e., versus models 1, 7 to 9) and the 95% CI of the regression slope included 1.0. Conversely, excluding mammograms (and associated spectra) that displayed greater than 10% disagreement between any two raters (model 3) or reducing detection or CCD readout shot noise using either the Boxcar or Savgol MATLAB functions (models 4 to 6) only improved the results for the training data set, not the prediction of MD (i.e., versus model 2). Because model 2 had the fewest assumptions regarding spectral and target processing and because it demonstrated both a large R2 and Spearman’s correlation coefficient in the validation data set, it was considered the most parsimonious model. The PLS vector associated with this model is shown in Fig. 2. Figure 3a is a scatter plot of PLS predicted MD versus Cumulus MD (model 2, validation set only) and Fig. 3b is a plot of the residuals [i.e., Cumulus MD minus PLS predicted MD] versus Cumulus MD. Overall, for the majority of women (80%) the estimation of individual MD was within 10% of Cumulus MD, although for the entire data set it ranged between and [Fig. 3b]. The slope of the residuals for model 2 was (95% CI: 0.08% to 0.24%), significantly larger than zero , suggesting that as Cumulus MD increases so does the error associated with the prediction of MD [Fig. 3b].
The performance of the PLS algorithm is given by the actual strength of the correlation between the target (Yi, Cumulus MD) and the input (Xi, TiBS spectra) on which it is trained, as well as the accuracy of the target data and the quality of the spectral data. To assess the influence of target accuracy and spectral quality on PLS prediction of MD, various models were evaluated. Among these models, we considered model 2 to be the “best” model because it had the fewest assumptions regarding spectral and target processing, the R2 and Spearman rho values for the validation set were among the highest (0.78 and 0.88, respectively), and the results for the training and validation data sets were comparable, suggesting no over- or undertraining of the PLS algorithm.
Although we used MD from the final Cumulus read as the target in PLS training, as it demonstrated the highest intra- (0.93 to 0.97) and interrater ICC (0.92 and 0.93) of all three reads, it was not exact. In addition, the absolute difference in Cumulus MD between each trained rater and the expert rater ranged between 5 and 6%, which was significantly different from zero. Employing MD of the expert rater only as the target (model 1) resulted in a less favorable PLS algorithm; both the R2 values and correlation coefficients for the training data set were small and the prediction of MD in the validation set limited. Conversely, models that employed the average MD of all three raters as the target (models 2 to 6) demonstrated improved results for both the training and validation sets. Hence, there is a potential for bias to be introduced in the estimation of MD when MD from only one or different raters is used as the target in PLS training. Although the expert rater (NB) has demonstrated a high odds ratio (OR) between MD derived by himself and breast cancer incidence,12 it is possible that the trained raters (KMB, LL) utilized image attributes (e.g., coarseness, brightness, contrast, and noise) when measuring MD that were also more common to the TiBS spectra, but that may potentially impact the TiBS-derived MD relationship with breast cancer risk.
For some mammograms, the absolute difference in Cumulus MD between raters exceeded 10%. Although excluding these mammograms (and associated appended spectra) (models 3, 5, and 6) resulted in a slight improvement in PLS training, it did not improve the estimation of MD in the validation set. This is because removal of these mammograms likely resulted in exclusion of more cases with higher MD compared to those with lower MD, producing a more homogeneous training set that no longer covered the variance seen in the validation spectra data set. Additionally, despite the fact that participants were selected based on bilateral symmetry of MD by an expert radiologist, left-right symmetry of Cumulus MD was within 10% for only 80% of mammogram pairs (data not shown). Consequently, we treated MD from the left and right breasts as separate events, rather than averaging them. This decision was also based on the fact that Cumulus MD for each breast was derived separately and that TiBS spectra were obtained from each breast. Treating each breast as a separate event would enable future use of TiBS in women with bilateral variations in MD.
The spectral noise at short wavelengths (i.e., wave number 0 to 50; Fig. 1) is due to limited photons traversing several centimeters of tissue, while the noise at long wavelengths (i.e., wave number 400 to 450; Fig. 1) is due to the low quantum efficiency of the CCD detector. Although some of this noise was translated to the PLS vector (Fig. 2), reducing spectral noise using smoothing functions (either Boxcar or Savgol, models 4 to 6) only improved training of the PLS algorithm and not the estimation of MD in the validation data set. Hence, the noisy spectral components did not contribute significantly to the sensitivity/specificity of PLS predicted MD. Instead, it is likely that the noisy components of the vector are associated with the step function at longer wavelengths, which likely resulted from appending individual spectra into a single spectrum (Fig. 1). We did consider training the PLS algorithm on each position individually, thereby avoiding the contribution of additional noise to the vector; however, the prediction of MD did not improve (data not shown). Furthermore, even though an appended spectrum was used, the PLS vector still captured variations in the contribution of absorbers to each measurement position due to the different anatomical structures present in each breast quadrant (Fig. 2). Removing very noisy spectra altogether and/or limiting the wavelength range to omit noisy spectral components may be beneficial and should be explored in future work.
The estimation of MD was poorest when Cumulus MD of each rater contributed independently to the training of the PLS algorithm, but the same appended TiBS spectrum was used each time in training the algorithm (models 7 to 9). Although the 75 versus 25% split for the training and validation sets, respectively, was maintained in these models, overtraining of the algorithm still occurred such that the resulting PLS algorithm no longer retained validity on the validation set (i.e., compared to the other models, the R2 values and correlation coefficients were much larger for the training set relative to the validation set).
Despite the inherent limitations in the definition of our PLS target and the contribution of noise to some spectra and the resulting vector, the R2 and Spearman’s correlation coefficients for the validation sets did not vary greatly between models (R2: 0.61 to 0.80; Spearman’s rho: 0.79 to 0.90). This suggests that the limited accuracy of the target and the quality of the spectral data did not contribute substantially to PLS prediction of MD by TiBS and that estimation of MD might be improved with additional data. Most encouraging is the observation that in the majority of cases the regression slopes were not significantly different from 1.0 (except in models 7 to 9), indicating that TiBS derived MD does not demonstrate a significant bias. However, even our optimal PLS algorithm (model 2) tended to underestimate MD in women with MD , as suggested by our residual analysis. This is likely due to the fact that only a limited number of women with high MD were available on which to train our PLS algorithm; this should be considered a priority in future studies.
We showed that TiBS can estimate MD on an interval scale within 10% of Cumulus measured MD in the majority of women without stratification on age, BMI, and menopausal status. The limitations in the generation of a gold standard (i.e., Cumulus MD), which was used as the target for PLS training, do not appear to restrict the overall strength of the attainable correlation between Cumulus MD and PLS predicted MD, possibly indicating that a similar association can be demonstrated in larger multicenter studies. Hence, TiBS has the potential to become a noninvasive, nonionizing-radiation-based method to determine MD without the requirement for highly trained individuals, such as a radiologist and/or expert raters of mammograms. Compared to mammography, it can be applied at higher sampling intervals and can potentially be used to detect changes in MD and possibly the rate of this change. This latter application of TiBS would be important for monitoring high-risk populations (e.g., mutation carriers) and/or women enrolled in chemoprevention trials. While TiBS can determine MD values, its ability to demonstrate changes relative to actual breast cancer risk must still be demonstrated, which would require a prospective longitudinal study.
The authors wish to express their gratitude to Dr. Norman Boyd (NB) for his help in the use of the Cumulus program and for his contribution to the development of the ideas that were the basis of this analysis. The authors also thank Dr. Martin Yaffe for the use of the Cumulus program and Dr. Gina Lockwood for discussion relating to the statistical analysis of the data. The authors express their sincere thanks to all study participants for their time and effort. The study was funded in part by the Susan Komen Foundation (BCTR0402530) and the Canadian Breast Cancer Research Alliance (017467).