Volumetric breast density measurement for personalized screening: accuracy, reproducibility, consistency, and agreement with visual assessment

Assessment of breast density at the point of mammographic examination could lead to optimized breast cancer screening pathways. The onsite breast density information may offer guidance on when to recommend supplemental imaging for women in a screening program. A software application (Insight BD, Siemens Healthcare GmbH) for fast onsite quantification of volumetric breast density is evaluated. The accuracy of the method is assessed using breast tissue equivalent phantom experiments, resulting in a mean absolute percentage error of 3.84%. Reproducibility of measurement results is analyzed using 8427 exams in total, comparing for each exam (if available) the densities determined from left and right views, from cranio-caudal and medio-lateral oblique views, from full-field digital mammograms (FFDM) and digital breast tomosynthesis (DBT) data, and from two subsequent exams of the same breast. Pearson correlation coefficients of 0.937, 0.926, 0.950, and 0.995 are obtained. Consistency of the results is demonstrated by evaluating the dependency of the breast density on women’s age. Furthermore, the agreement between breast density categories computed by the software and those determined visually by 32 radiologists is shown by an overall percentage agreement of 69.5% for FFDM and 64.6% for DBT data. These results demonstrate that the software delivers accurate, reproducible, and consistent measurements that agree well with the visual assessment of breast density by radiologists. © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI. [DOI: 10.1117/1.JMI.6.3.031406]


Clinical Background
Breast density is an important topic in breast cancer screening for two reasons. First, a high amount of dense (fibroglandular) breast tissue is considered to be an independent risk factor for developing breast cancer, and second, the sensitivity of mammography is lower in dense breasts due to the masking effect. 1 Supplemental imaging for dense breasts [e.g., breast ultrasound, breast magnetic resonance imaging (MRI)] can thus be useful to increase cancer detection rates in breast cancer screening. 2 Nowadays, most US states require that women be informed if they have dense breast tissue and that they receive information about supplemental imaging options. 3,4 Radiologists typically estimate breast density during interpretation of the mammograms. However, visual breast density assessment is known to have considerable intra- and inter-reader variability. 5 Automated breast density assessment by computer software is increasingly used to assist radiologists in reporting breast density more objectively and consistently.
The time at which breast density is assessed has a considerable impact on a personalized screening workflow. When breast density is assessed by radiologists, the woman has usually already left the screening center. Supplemental imaging then requires the woman to be called back for an extra assessment.
If, however, automated breast density assessment is provided during the screening appointment, then the workflow might be sped up considerably. If supplemental imaging is recommended, it could be initiated before the woman leaves the screening center (Fig. 1). The woman could then receive the result of the recommended supplemental imaging test on the day of screening, which would reduce psychological distress. This procedure, though, would require an organizational change in scheduling, a problem that should be solvable once experience has been gained and the procedure has become routine.

Automated Breast Density Assessment
Several techniques for automated breast density measurement from mammographic x-ray images have been proposed. 6,7 These techniques either calculate the projected two-dimensional (2-D) area (in cm²) of dense tissue in the x-ray image or quantify the three-dimensional (3-D) volume (in cm³) of the dense tissue in the breast. The calculation of the projected 2-D area of dense tissue requires a segmentation of the dense tissue areas in the image. 8 For quantification of the 3-D volume of the dense tissue, the physics of the image acquisition process is modeled, and it can involve either a precalibration of the system, 9 a calibration object in the acquired image, 10 or an image-based self-calibration step. 11 A breast density (percentage) value can be computed by dividing the area or volume of dense tissue by the total area or volume of the breast. Because areal and volumetric breast density (VBD) values are computed from different quantities, they have different value ranges and they cannot be compared directly. 12
Breast density classification has a high clinical relevance. A breast density category can be determined from a measured breast density value by using cut points. Furthermore, machine learning-based algorithms exist that assign a breast density category directly based on extracted image features, such as parenchymal texture or histogram information. 7 Recently, deep machine learning techniques have also been applied for breast density classification. 13
Before an automated breast density assessment software is used for clinical decision support, it should be validated comprehensively. Ng and Lau 7 have identified six requirements (denoted in their paper as "sanity checks") that should be fulfilled by an automated breast density measurement software:
1. "Density should be the same for the identical image of the breast."
2. "Density should be similar for a breast no matter what the view, in particular cranio-caudal (CC) and mediolateral oblique (MLO) views."
3. "Density should be similar for the same breast no matter the imaging equipment, in particular, it should not matter if the equipment is GE, Siemens, Hologic, or if the imaging is done on mammography, tomosynthesis, MRI, or CT."
4. "Density should be invariant to breast compression."
5. "Left and right breast densities should be highly correlated but not identical."
6. "Density should, over a population, generally reduce with age."
Some studies exist that validate existing software applications for automated breast density assessment. 14-19 Typical aspects assessed by these studies are accuracy (comparing measured breast density to an objective ground truth), reproducibility (see Ng and Lau's requirements 1 to 5), consistency (see Ng and Lau's requirement 6), and agreement with visual assessment (comparing classified breast density to a subjective reference).
Recently, an automated breast density measurement software (Insight BD, Siemens Healthcare GmbH) has been integrated into the acquisition work station of a mammography system (MAMMOMAT Revelation, Siemens Healthcare GmbH; Insight BD and MAMMOMAT Revelation are not commercially available in all countries. Due to regulatory reasons, the future availability cannot be guaranteed). This allows objective evaluation of breast density directly after the mammographic exam. In this work, we evaluate the performance of the software Insight BD in measuring VBD. In particular, we evaluate whether the software satisfies the six requirements identified by Ng and Lau. This paper is a substantially expanded version of a previously published conference paper. 20

Material and Methods

Volumetric Breast Density Measurement
Insight BD measures VBD based on a physics model of the image acquisition process and an image-based self-calibration. 11,21 This model appears to be the basis of most commercial software implementations to assess breast density. 7 The model assumes that the breast consists of two types of tissue, fibroglandular and fatty tissue, with known energy-dependent x-ray attenuation values.
The algorithm receives an unprocessed full-field digital mammogram (FFDM) or an unprocessed central digital breast tomosynthesis (DBT) projection image as input along with the image acquisition parameters such as compressed breast thickness and peak tube voltage. For each detector pixel location, the amount of fibroglandular tissue (measured in mm) located above the pixel is calculated and a 2-D breast density map is created. The total amount of fibroglandular tissue (V_fg, measured in cm³) is determined from the map by numerical integration over the projected breast area. The volume of the breast, V_breast, is determined using the known compressed thickness of the breast, its projected surface area, and a 3-D shape model. For determining V_fg and V_breast, the pectoral muscle region is excluded.
The VBD is calculated by dividing V_fg by the total breast volume (V_breast, measured in cm³):
$$\mathrm{VBD} = \frac{V_{\mathrm{fg}}}{V_{\mathrm{breast}}} \times 100\%. \qquad (1)$$
Breast density categories (a, b, c, and d) correlating with those from the ACR BI-RADS fifth ed. atlas 22 are assigned by using VBD cut points. To aid classification between categories "b" and "c," considered to be a nondense and dense breast, respectively, the distribution of dense tissue is also taken into account as described by Fieselmann et al. 21
The software framework used in this work consists of a core module for the VBD measurement and a wrapper around this module allowing for batch processing of many image files at once. The core module is the same as the one implemented in the Insight BD application of the MAMMOMAT Revelation mammography system.
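The integration and division steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Insight BD implementation: the function names, the uniform pixel area, and in particular the cut-point values are hypothetical placeholders (the cut points used by the software are not given here), and the additional use of the dense tissue distribution for separating categories "b" and "c" is not modeled.

```python
from bisect import bisect_right


def volumetric_breast_density(thickness_map_mm, pixel_area_mm2, breast_volume_cm3):
    """Return (V_fg in cm^3, VBD in percent).

    thickness_map_mm: 2-D list with the fibroglandular tissue thickness (mm)
    above each detector pixel inside the breast region (pectoral muscle
    already excluded). A uniform pixel area is assumed for simplicity.
    """
    # Numerical integration: sum of thickness times pixel area gives mm^3.
    volume_mm3 = sum(sum(row) for row in thickness_map_mm) * pixel_area_mm2
    v_fg_cm3 = volume_mm3 / 1000.0  # 1 cm^3 = 1000 mm^3
    vbd_percent = 100.0 * v_fg_cm3 / breast_volume_cm3
    return v_fg_cm3, vbd_percent


def density_category(vbd_percent, cut_points=(5.0, 10.0, 20.0)):
    """Map a VBD value to a category using cut points.

    The default cut points are hypothetical placeholders for illustration
    only; they are NOT the values used by the evaluated software.
    """
    return "abcd"[bisect_right(cut_points, vbd_percent)]


# Toy example: four pixels of 25 mm^2 each with 10 mm of dense tissue
# in a breast of 8 cm^3 total volume.
v_fg, vbd = volumetric_breast_density([[10.0, 10.0], [10.0, 10.0]], 25.0, 8.0)
print(v_fg, vbd, density_category(vbd))
```

Keeping the category assignment separate from the VBD computation mirrors the text: the continuous VBD value is the measurement, and the category is a derived quantity that depends on the chosen cut points.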

Evaluation of Accuracy
Accuracy of breast density measurement is evaluated using phantoms with physical characteristics similar to those of breast tissue (phototimer compensation plates; CIRS Inc., Norfolk, Virginia). Plates simulating 100% fatty breast tissue and plates simulating fibroglandular breast tissue are placed on the left and right sides, respectively, of the breast support table (Fig. 2). Different glandularities (right side only: 30%, 50%, and 70%) and plate heights (left and right sides: 30 mm, 50 mm, and 70 mm) lead to nine different combinations for evaluation. Images are acquired with a MAMMOMAT Inspiration mammography system (Siemens Healthcare GmbH) using a W/Rh anode/filter combination, with the antiscatter grid in place and automatic exposure control enabled. The tube voltages are chosen automatically depending on the compression paddle height.
The average VBD is measured in two square regions of interest (side length 27.54 mm, Fig. 2) in the 2-D breast density map, one placed in the fatty tissue region and one placed in the fibroglandular tissue region. To measure the accuracy, two quantities are computed. The mean absolute deviation [MAD, measured in percentage points (pp)] is defined as
$$\mathrm{MAD} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - x_i \right|. \qquad (2)$$
The mean absolute percentage error (MAPE, measured in %) is calculated as
$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - x_i}{x_i} \right|. \qquad (3)$$
In the above equations, x_i and y_i (i = 1, ..., N) denote the known ground truth values and the measured values, respectively. N denotes the number of samples.
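As a worked illustration of the two accuracy measures, the following Python sketch computes MAD and MAPE, assuming their standard definitions (mean of absolute differences, and mean of absolute differences relative to the ground truth). The function names and the sample values are made up for illustration and are not the phantom measurements of this study.

```python
def mean_absolute_deviation(truth, measured):
    """MAD in percentage points: mean of |y_i - x_i|."""
    return sum(abs(y - x) for x, y in zip(truth, measured)) / len(truth)


def mean_absolute_percentage_error(truth, measured):
    """MAPE in percent: mean of |(y_i - x_i) / x_i| times 100.

    Undefined when a ground truth value is zero (e.g., a purely fatty
    region with 0% density), so that case is rejected explicitly.
    """
    if any(x == 0 for x in truth):
        raise ValueError("MAPE is undefined for ground truth values of zero")
    total = sum(abs((y - x) / x) for x, y in zip(truth, measured))
    return 100.0 * total / len(truth)


# Illustrative values only (ground truth vs. measured VBD in percent).
truth = [30.0, 50.0, 70.0]
measured = [31.0, 49.0, 72.0]
print(mean_absolute_deviation(truth, measured))
print(mean_absolute_percentage_error(truth, measured))
```

The explicit check for zero ground truth values reflects why a MAPE can only be reported for regions with nonzero true density.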

Same woman, different FFDM views
The reproducibility of the breast density measurement is evaluated using different FFDM views of the same woman. A total of 8150 exams were selected from the Malmö Breast Tomosynthesis Screening Trial (MBTST). 23 The exams were selected based on the availability of raw data (for processing FFDM images and DBT projection images) in the database. Selection of cases was not influenced by a woman's breast cancer status. Characteristics of this data set are shown in Table 1 in column "data set 1." Each exam contains anonymized four-view FFDM raw images. The MBTST is an ethics committee-approved prospective trial investigating the accuracy of DBT in a population-based screening program in the city of Malmö in Sweden. 23 In this trial, four FFDM images (CC and MLO views of each breast) and two DBT scans (MLO view of each breast) have been acquired from each participant with a MAMMOMAT Inspiration system.
The average VBD value of both views (CC and MLO) of the left breast is compared with the average VBD of both views of the right breast. This assumes that there is bilateral mammographic density symmetry, and left and right breast densities are correlated but not identical. Furthermore, the average VBD from both CC views of the exam is compared to the average VBD from both MLO views of the exam. Reproducibility is quantified using the Pearson correlation coefficient (PCC) 24 and the MAD of VBD values.
Similarly, reproducibility of breast volume measurement is evaluated. Reproducibility of fibroglandular tissue volume is not assessed, for the sake of brevity, because it is not statistically independent from VBD and breast volume.
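The left/right comparison can be sketched as follows, assuming each exam is given as a dictionary of per-view VBD values. The key names ("LCC", "LMLO", "RCC", "RMLO") and the numbers are hypothetical and chosen only for this example; the Pearson correlation coefficient is written out by hand so that the sketch depends on nothing beyond the standard library.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def left_right_reproducibility(exams):
    """Compare per-exam left and right VBD (each averaged over CC and MLO).

    Returns (PCC, MAD in percentage points).
    """
    left = [(e["LCC"] + e["LMLO"]) / 2.0 for e in exams]
    right = [(e["RCC"] + e["RMLO"]) / 2.0 for e in exams]
    mad = sum(abs(l - r) for l, r in zip(left, right)) / len(exams)
    return pearson_correlation(left, right), mad


# Made-up VBD values in percent for three exams.
exams = [
    {"LCC": 8.0, "LMLO": 9.0, "RCC": 8.5, "RMLO": 9.5},
    {"LCC": 15.0, "LMLO": 16.0, "RCC": 14.0, "RMLO": 15.0},
    {"LCC": 5.0, "LMLO": 5.5, "RCC": 5.5, "RMLO": 6.0},
]
pcc, mad = left_right_reproducibility(exams)
print(pcc, mad)
```

The CC versus MLO comparison follows the same pattern with the averages taken over the two CC views and the two MLO views instead.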

Same woman, FFDM and DBT exams
In the second reproducibility evaluation, FFDM and DBT exams of the same woman acquired during the same breast compression are analyzed. Two data sets are used in this analysis: one data set (denoted as "data set 2" in Table 1) contains 108 exams acquired with a MAMMOMAT Inspiration in Tokyo, Japan; the other data set (denoted as "data set 3" in Table 1) contains 95 exams acquired with a MAMMOMAT Inspiration in Vienna, Austria. For each exam, anonymized four-view FFDM raw images and anonymized four-view DBT projection images are available.
Breast density measures (VBD and breast volume) are calculated by taking the average of the values of all views using FFDM and DBT data, respectively. Reproducibility is quantified using PCC and MAD between the sample values from the two imaging modalities.

Same woman, two FFDM acquisitions
In the third reproducibility evaluation, two FFDM acquisitions of the same woman acquired during the same breast compression with a MAMMOMAT Inspiration are analyzed. The first image was acquired with the antiscatter grid in place; the second one was acquired without the antiscatter grid but with software-based scatter correction and reduced x-ray dose. The exams were part of an ethics committee-approved study, 25 and 74 anonymized image pairs are available. This data set is denoted as "data set 4" in Table 1. This evaluation allows assessment of the reproducibility of breast density measurement when different image acquisition conditions (with and without antiscatter grid) are employed.

Evaluation of Consistency
The calculated breast density in a large population is analyzed with respect to the women's age. With postmenopausal alteration of fibroglandular breast tissue, it is expected that the density of a woman's breast will decrease with increasing age. 26 The images from data set 1 (Table 1) are used for this analysis. Breast density is calculated from all 8150 four-view exams on a per breast basis (averaging results from CC and MLO views) giving 16,300 separate values for VBD and breast density category, respectively. Mean and standard deviation of VBD as well as frequency of breast density categories are calculated depending on a woman's age.
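A minimal sketch of the age-dependent summary is given below. It assumes per-breast records of the form (age in years, VBD in percent) and groups them into age bins; the 10-year bin width, the function name, and the sample values are assumptions for illustration, and the sample standard deviation is used because the text does not specify which variant was applied.

```python
from collections import defaultdict
from statistics import mean, stdev


def vbd_by_age_group(records, bin_width=10):
    """Group (age, VBD) records into age bins.

    Returns {bin_start: (mean VBD, standard deviation, count)}.
    bin_width is a hypothetical choice for this example.
    """
    groups = defaultdict(list)
    for age, vbd in records:
        groups[(age // bin_width) * bin_width].append(vbd)
    summary = {}
    for start in sorted(groups):
        values = groups[start]
        # The sample standard deviation needs at least two values.
        sd = stdev(values) if len(values) > 1 else 0.0
        summary[start] = (mean(values), sd, len(values))
    return summary


# Made-up per-breast values: (age at examination, VBD in percent).
records = [(45, 12.0), (47, 14.0), (62, 6.0), (68, 8.0)]
for start, (m, sd, n) in vbd_by_age_group(records).items():
    print(f"{start}-{start + 9} years: mean={m:.1f} sd={sd:.2f} n={n}")
```

A decreasing mean across successive bins would correspond to the expected reduction of density with age over a population.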

Evaluation of Agreement with Radiologists' Visual Assessment
A total of 600 four-view anonymized FFDM exams had been randomly selected from the MBTST, and 32 experienced radiologists from the US and Canada provided individual breast density classifications for these exams according to the ACR BI-RADS ® fifth ed. atlas. Nine radiologists labeled the first set of 200 exams (set "1 to 200"), 10 radiologists labeled the second set of 200 exams (set "201 to 400"), and 13 radiologists labeled the third set of 200 exams (set "401 to 600"). The most frequently chosen category for a certain exam is defined to be the reference density category by the radiologists for this exam (panel majority vote). 21 The software calculates density categories for each of the 600 FFDM exams. Of these, 512 exams have DBT raw projection images (MLO views only) available that were acquired in a different breast compression. For these DBT exams, density categories are calculated by the software as well. These categories, determined by the software using FFDM and DBT exams, respectively, are compared to the reference density categories by the radiologists. Overall percentage agreement and Cohen's linearly weighted kappa 27 values are computed.
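The agreement measures can be illustrated with a short Python sketch covering the panel majority vote, the overall percentage agreement, and Cohen's kappa with linear weights w_ij = 1 - |i - j| / (k - 1) for k ordered categories. The function names and labels are illustrative only. How ties in the panel vote were resolved is not stated in the text; this sketch simply keeps the label that was encountered first, which is an assumption.

```python
from collections import Counter

CATEGORIES = ("a", "b", "c", "d")


def majority_vote(labels):
    """Most frequent label; ties go to the label seen first (assumption)."""
    return Counter(labels).most_common(1)[0][0]


def percentage_agreement(software, reference):
    """Share of exams (in percent) with identical categories."""
    matches = sum(s == r for s, r in zip(software, reference))
    return 100.0 * matches / len(reference)


def linear_weighted_kappa(software, reference, categories=CATEGORIES):
    """Cohen's kappa with linear agreement weights on ordered categories."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(reference)
    observed = [[0.0] * k for _ in range(k)]
    for s, r in zip(software, reference):
        observed[index[s]][index[r]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    p_o = p_e = 0.0
    for i in range(k):
        for j in range(k):
            weight = 1.0 - abs(i - j) / (k - 1)
            p_o += weight * observed[i][j]    # weighted observed agreement
            p_e += weight * row[i] * col[j]   # weighted chance agreement
    return (p_o - p_e) / (1.0 - p_e)


# Made-up example: three readers per exam, then comparison with software.
panel = [["a", "a", "b"], ["b", "c", "c"], ["c", "c", "d"], ["d", "d", "c"]]
reference = [majority_vote(votes) for votes in panel]
software = ["a", "b", "c", "d"]
print(reference, percentage_agreement(software, reference),
      linear_weighted_kappa(software, reference))
```

With linear weights, a disagreement by one category (e.g., "b" versus "c") is penalized less than a disagreement by three categories, which suits ordered density categories better than unweighted kappa.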

Evaluation of Accuracy
The results for the accuracy evaluation are shown in Fig. 3. The measured VBD values are plotted against the ground truth VBD values. One sample point has a ground truth VBD value of 33% instead of 30%, which corresponds to 60-mm plate height with 30% glandularity plus 10-mm plate height with 50% glandularity. The MAD are 3.38 and 1.65 pp for the fatty tissue and dense tissue regions, respectively. MAPE is 3.84% for the dense tissue region. MAPE was not calculated for the fatty tissue region as the denominator would be zero.

Evaluation of Consistency
In Fig. 9, the breast density per breast depending on the age at examination is shown. It is presented as VBD value and as breast density category dichotomized as nondense (a, b) and dense (c, d) categories. A histogram of the age at examination for data set 1 is shown in Fig. 10. Proportions of calculated breast density categories for different age groups are shown in Fig. 11. For comparison, proportions for the same age groups as reported by the Breast Cancer Surveillance Consortium (BCSC) are shown as well.

Comparison with Radiologists' Visual Assessment
Confusion matrices for the evaluation of agreement of the software density categories with radiologists' visual assessment are shown in

Discussion
Different studies were carried out to evaluate the performance of breast density measurement with Insight BD. Each evaluation focused on one of these four aspects: accuracy, reproducibility, consistency, or agreement with visual assessment. To support interpretation and comparison, Table 3 displays the results obtained in our evaluations together with results from previously published studies using existing software for automated breast density measurement. A strength of our work is that it addresses all these different relevant aspects of validation in one study. Accuracy was evaluated based on phantom data, where the breast density is known. An important aspect is the linearity of measured quantities. As can be seen from Fig. 3, the calculated quantities show a high level of linearity. Our results are comparable to those shown in a previous study, 14 where an MAD of 1.1 pp and an MAPE of 6.94% were obtained (Table 3). The current evaluation is based on phantoms that have a very homogeneous distribution of fibroglandular tissue and not a realistic shape in the breast periphery region. It is known that algorithms for breast density assessment may not work well for phantoms that do not have realistic compressed breast edge shapes. 7 Therefore, we have used regions of interest in the central breast area for the analysis to avoid effects caused by the unrealistic shape in the breast periphery. This restriction to the central breast area applies only to the phantom analysis; in clinical breast images, the full breast is evaluated. Future studies could also evaluate phantoms with a more realistic heterogeneous distribution of fibroglandular tissue.
Reproducibility was evaluated based on clinical data using three different experimental setups. A strong correlation between the results from the left and right breast and also between the two views of the same breast is evident. It should be considered that an existing or developing breast cancer in the exam images may influence the correlation values. However, the cancer prevalence in the data set 1 is expected to be low (breast cancer was detected in 137 of 14,848 women participating in the MBTST 23 ), and this influence is considered to be negligible. Breast volume is slightly higher when estimated from MLO views compared to CC views (Fig. 5), which can be explained by the different ways the breast is positioned and visible in the mammograms. In data sets 2 and 3, FFDM and DBT images were acquired in the same breast compression. The estimation of breast volume is thus not influenced by breast positioning, and the breast volume shows a higher correlation compared to the results from data set 1.
For the measurements described in Secs. 2.3.1 and 2.3.2, previous studies exist that show similar correlation values (Table 3). The study by Förnvik et al. 29 also investigated the agreement between VBD calculated from FFDM and DBT data based on an initial prototype version of the software assessed in this work. The results in that study (Spearman's correlation coefficient 24 = 0.83) were based on a different data set but indicate high correlation as do the results from our study (PCC = 0.900 to 0.950). For the setup described in Sec. 2.3.3, no previous publication could be identified.
Consistency was evaluated based on a sample of 8150 four-view FFDM images. Results show that VBD decreases with age until it reaches a steady state at about 60 years of age (Fig. 9). Also, the frequency of the classification with breast density category "c" or "d" decreases with age until about 60 years of age (Fig. 9). These results are consistent with the expected behavior that the density of a woman's breasts will decrease with increasing age. 26 Studies evaluating consistency of breast density calculation using other breast density measurement software have also shown a decrease of the woman's breast density until 60 to 65 years of age. 16 The trend visible in the proportions of calculated breast density categories depending on age group is also similar to the trend visible in the data from the BCSC (Fig. 11). Small differences in the proportions may be explained by the different screening populations (Sweden and USA). The study by Förnvik et al. 29 investigated dependency of mean VBD on age using a subset of the MBTST data and a weak correlation has been found (Spearman's correlation coefficient ranging from −0.28 to −0.20). This result is consistent with the results from our study analyzing the age-dependent proportions of breast density categories as well.
The evaluation of agreement with radiologists' visual assessment is based on the radiologists' categories according to the ACR BI-RADS ® fourth ed. atlas in the comparison studies (Table 3) and the more recent ACR BI-RADS ® fifth ed. atlas in our study. Our results for the radiologists' agreement are similar to those reported in a previous study 30 (four category agreement: 63% to 70%). In that study, an initial prototype version of the software assessed in this work has been evaluated, and the labels were provided by Swedish radiologists.
In Sec. 1.2, the six requirements identified by Ng and Lau for an automated breast density measurement software are quoted. Requirement 1 is satisfied by Insight BD since it is based on a deterministic algorithm. The results from the reproducibility evaluations (Sec. 3.2) show that requirement 2 (density values for CC and MLO views are similar), requirement 3 (density values obtained with mammography and tomosynthesis are similar), and requirement 5 (density values for the left and right breast are highly correlated) are satisfied as well. Requirement 4 has been evaluated implicitly and is also met: in the data sets used for the evaluations, the mean breast compression force was different (Table 1). Finally, requirement 6 is also fulfilled: over a population, breast density values decrease with age as expected (Sec. 3.3).
To conclude, a comprehensive performance evaluation of Insight BD has been carried out. It could be shown that this software satisfies all six requirements identified in the work by Ng and Lau. 7 It may provide onsite breast density measurement in the exam room for screening pathway guidance. The integration of the software into the acquisition work station of the mammography system makes this information directly available to the radiographer. Other existing software applications for breast density measurement provide this information primarily to the radiologist during image interpretation.
A limitation of this work is that it focuses on a pure technical performance evaluation. The practical impact of onsite breast density evaluation on a screening workflow has not been investigated. Furthermore, the evaluation of accuracy is limited to experiments with simple phantoms. In future studies, accuracy could be evaluated using more realistic breast phantoms and also involve tomographic images (e.g., from breast MRI) providing the ground truth data for comparison.

Summary
A software application for VBD measurement (Insight BD) has been evaluated. The results of the performance evaluation show that the software delivers accurate, reproducible, and consistent results that correlate well with the visual assessment done by radiologists. As a feature, this software is directly integrated into the acquisition work station of the mammography system.