1. Introduction

Studies have shown a strong relationship between breast density and the risk of developing breast cancer.1–3 Breast density is generally defined as the proportion of fibro-glandular tissue within the breast; however, there are various methods to estimate this measure, and these show different levels of correlation with cancer risk, with breast density assessed subjectively by radiologists demonstrating a stronger link to breast cancer than other methods.4,5 Exactly why this subjective type of measure produces improved performance is not clear, although presumably radiologists utilize years of knowledge and experience to go beyond simple estimates of ratios of dense to nondense tissue. Their knowledge and experience can be harnessed by training an automated method to produce similar density assessments. If we can accurately measure breast density over time, we can also measure whether risk-reducing interventions are working effectively.6

Deep learning is now the dominant method in general image analysis tasks due to the higher accuracy it tends to achieve over more traditional methods.7 The main alternative to deep learning is to hand-craft features and then apply traditional machine learning methods. The advantage of deep learning is that the features are automatically extracted from the data itself. This is appealing for breast density estimation because we do not completely understand why subjective expert judgement seems to outperform other methods. Deep learning does have significant downsides, such as the requirement for large amounts of data and computing power. In addition, the interpretation of results is challenging.8 Medical imaging problems often have far smaller amounts of data than the datasets deep learning is usually trained on. One of the most commonly used nonmedical imaging datasets is ImageNet,9 which consists of over a million images and a thousand classes. Conversely, medical imaging datasets tend to be considered large if they contain tens of thousands of images, with many datasets consisting of far fewer.10 Even with these challenges, deep learning is increasingly being used in medical imaging studies11 with good outcomes.

There have been many approaches to breast density estimation. One example12 used convolutional neural networks (CNNs) with support vector machines (SVMs) to classify breasts into four density categories. Other work used unsupervised CNNs at different scales to extract features before fine-tuning on known labels.13 Another method used fuzzy c-means before applying an SVM.14 Deep learning has been used to make binary (dense and nondense) and four-way (fatty, scattered, heterogeneous, dense) classifications of mammograms.15 Other four-way density classifiers have been trained on synthetic and real mammograms.16 Other work has involved segmenting the tissue into dense and fatty regions before calculating the density.17 One recent approach18 built a deep learning model and trained it from scratch to estimate breast density using domain expert labeled images as targets. The authors showed that deep learning models could make estimates that correlate well with the expert labels.
Deep learning can be performed either by designing and training a model from scratch or with a transfer learning approach.19 At the moment, there is little agreement across the medical imaging field about which option produces better outcomes.10 In a field as new as deep learning, with little solid underlying theory, and with estimates made on highly complex datasets using complex models, it is difficult to draw strong conclusions about which method to follow. A sensible approach is to test multiple methods on different problems and attempt to determine which models work better in certain situations. A transfer learning method to estimate breast density for low dose mammograms recently showed good performance,20 albeit on a small dataset; in this paper, we will demonstrate that transfer learning using deep networks produces good performance on a large full dose dataset. We present a transfer learning method based upon two independent deep learning models trained on ImageNet,9 with regression models trained using a dataset with visual labels produced by domain experts.21 These models are combined using a multilayer perceptron (MLP) to make a final ensemble prediction. We compare these results with those of a previous method trained on this dataset.18 Each image in the dataset was previously assessed by two independent readers, which gives us the opportunity to analyze the quality of the labels themselves.21 We will show there are challenges with using data with this level of noise, both in terms of training and in how we assess model performance. The key contributions of this work are:
2. Data

The dataset is formed of full-field digital mammogram images with associated density assessments by domain experts (radiologists, advanced practitioner radiographers, and breast physicians). The images are from the Predicting Risk Of Cancer At Screening (PROCAS)21 study. All images were produced using GE mammography machines and come in three different sizes (see Table 1). There are four views for each woman: craniocaudal (CC) and mediolateral oblique (MLO) views of the left and right breasts. In the PROCAS study, every image was independently viewed by two domain experts, out of a total pool of 19, who assigned a density value on a visual analogue scale (VAS) between 0 and 100 for each image. Therefore, each woman received eight estimates of breast density across both CC and MLO views and right and left breasts. Before performing any preprocessing, we removed any images that did not have labels from two readers. In Table 1, we show the number of images at the three different sizes in our dataset. The images are drawn from 39,357 women, although not all the women have the full set of image views. When considering predictions per image, we will use the entire dataset, but when analyzing density estimates per woman, we will only consider those women who have all four views. In Fig. 1, panels (a)–(c) show examples of CC images and (d)–(f) show examples of MLO images with [(a), (d)] low, [(b), (e)] medium, and [(c), (f)] high densities as defined by the average of the two reader scores. These are images after preprocessing (see Algorithm 1 in Sec. 3.1 for details).

Table 1. Number of images at each size and aspect ratio.
Algorithm 1. Preprocessing.
We partition the dataset into training, validation, and testing sets by woman, so that all the available views for a woman are in the same partition. All the women from a previous case control set,4 discussed below, are placed in the testing set. Otherwise, the rest of the training, validation, and testing sets are chosen at random from the PROCAS dataset. In total, we have 33,011 women in the training set, 3649 women in the validation set, and 2697 women in the testing set. In addition, to compare with a previous study,18 we have a second partition with 19,048 women in the training set, 769 women in the validation set, and 19,844 women in the testing set. Similar to the previous partition, all women from the previous case control set4 are in the testing set.

We also investigated the log ratios for developing cancer from a previously created subset of the data4 called the priors, which consists of women who did not have a cancer detected when their screen was taken but subsequently went on to develop a cancer. All these data were held in the test set and not used for training or validation. These cancer risk predictions (on the priors) are found by taking the breast density estimates and splitting them into quintiles. Three control (no breast cancer) mammograms are matched with one breast cancer prior mammogram by matching on other known risk factors (age, body mass index, hormone replacement therapy use, menopausal status, and year of mammogram). This attempts to isolate breast density as a risk factor while controlling for potential confounding variables. It therefore allows for estimates of the ratio of the probability of developing cancer for women with high breast density compared with low breast density. The log ratio is calculated using conditional logistic regression on quintiles of the density. The higher the risk ratios at high density compared with low density, the better the density model is at assessing risk. For further details of the approach, see Astley et al.4

3. Prediction Methods

The objective is to take in a mammogram image and output an estimate of the density score of that image. The procedure we follow has four parts: (1) a preprocessing stage, (2) using pretrained deep learning models to extract features from the processed image, (3) mapping the features to a set of density scores, and (4) using an ensemble approach to take the multiple scores and produce a final density estimate. Throughout this paper, we will use “density mapping” to refer to the third step, where individual feature vectors are mapped to a density score, and “ensemble prediction” to refer to the fourth step, which takes those density estimates and combines them into a final ensemble prediction. In Fig. 2, we show this process in a schematic format.

3.1. Preprocessing

We preprocessed the images to size 224 × 224 (50,176 input pixels) for processing by the feature extractors. The downscaling reduces the number of input pixels to 1.1%, 0.68%, and 0.22% of the original counts for the small, medium, and large images, respectively (see Table 1 for the three image sizes). We also enhance the contrast of the images to make it easier for the methods to extract information. In Algorithm 1, we lay out the steps we take to preprocess the images. We first rescale the image from its original size (see Table 1) down to 224 × 224 using cubic interpolation. We then clip all element values to 75% of the image maximum, subtract the minimum, and divide by the new maximum. The values are inverted, and any image that is positioned on the left hand side is flipped horizontally. We perform histogram equalization and normalize the image to contain values between 0 and 1. We show examples of these preprocessed images in Fig. 1.
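As a concrete illustration, a minimal Python sketch of these preprocessing steps follows. This is our reconstruction of Algorithm 1 from the description above; the choice of OpenCV for resizing and scikit-image for histogram equalization is an assumption, as the paper does not name libraries.

```python
import cv2
import numpy as np
from skimage import exposure

def preprocess(image: np.ndarray, is_left: bool) -> np.ndarray:
    """Map a raw mammogram to a 224 x 224 array with values in [0, 1]."""
    img = cv2.resize(image.astype(np.float32), (224, 224),
                     interpolation=cv2.INTER_CUBIC)  # cubic downscaling
    img = np.minimum(img, 0.75 * img.max())          # clip to 75% of maximum
    img = img - img.min()                            # subtract the minimum
    img = img / img.max()                            # divide by the new maximum
    img = 1.0 - img                                  # invert the values
    if is_left:                                      # left-side images flipped
        img = np.fliplr(img)
    img = exposure.equalize_hist(img)                # histogram equalization
    return (img - img.min()) / (img.max() - img.min())  # rescale to [0, 1]
```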
3.2. Feature Extraction

To perform the feature extraction part of our procedure, we use two pretrained deep networks: ResNet22 and DenseNet.23 Both were trained on the ILSVRC 2012 version of ImageNet,9 a large database of 1.2 million images across 1000 classes. ResNet and DenseNet are both popular in the literature, are available as easily accessible models from PyTorch,24 and produce modest-sized feature vectors, which keeps the computational requirements manageable. However, there are other potential feature extraction networks that could be further investigated, such as VGG25 or Inception.26 To extract features from the preprocessed images, we remove the final fully connected classification layer from both networks, which alters the output from 1000 classes to 2208- and 512-dimensional feature vectors for DenseNet and ResNet, respectively. Details of our implementation are in Appendix A. We do not adjust the weights of the networks or perform any form of fine-tuning. The training of the feature extractors was performed using ImageNet data, which consists of natural images with three channels, whereas our mammograms have only one channel. We therefore copy the same image across to make a repeated three-channel tensor. Both networks require the inputs to be normalized across the channels to have means of (0.485, 0.456, 0.406) and standard deviations of (0.229, 0.224, 0.225), respectively. In Algorithm 2, we lay out the steps of our feature extraction method.

Algorithm 2. Feature extraction method.
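A sketch of this step in PyTorch, using the "ResNet18" and "DenseNet161" weights named in Appendix A (the `extract` helper and its tensor names are ours; `pretrained=True` follows the older torchvision API):

```python
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(pretrained=True)      # older torchvision API
resnet.fc = nn.Identity()                      # 512-d features, not 1000 classes
densenet = models.densenet161(pretrained=True)
densenet.classifier = nn.Identity()            # 2208-d features

mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

@torch.no_grad()
def extract(img224: torch.Tensor, net: nn.Module) -> torch.Tensor:
    """img224: preprocessed mammogram in [0, 1] with shape (224, 224)."""
    x = img224.expand(3, -1, -1)               # repeat the single channel
    x = (x - mean) / std                       # ImageNet channel normalization
    net.eval()                                 # frozen weights, no fine-tuning
    return net(x.unsqueeze(0)).squeeze(0)      # 512- or 2208-d feature vector
```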
3.3. Density Mapping

To map the deep feature vectors to a density estimate, we utilize two methods: linear regression with regularization and the application of MLPs. The linear regression approach enables us to see how well a simple model, with one consistent solution, performs. An MLP can (in principle) approximate any function,27 and allows us to explore whether nonlinear mappings are necessary.

A procedure for our linear regression method is shown in Algorithm 3. During training, we add a bias term to the feature vectors from the training set, $x_i$, then stack them into a feature matrix, $X$. We utilize standard ridge regression, forming an objective function, $E = \|Xw - y\|_2^2 + \lambda \|w\|_2^2$, where $y$ are the known labels and $w$ are the weights. To find the optimal regularization term, we perform five-fold cross-validation on the training data, finding the value of $\lambda$ that minimizes the combined held-out error. We then retrain the weights on the entire training dataset using the optimal $\lambda$. To solve for $w$, during both cross-validation and for the final optimal $\lambda$, we use the pseudoinverse: $w = (X^{T}X + \lambda I)^{-1}X^{T}y$.

Algorithm 3. Density mapping: linear regression training.
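A compact NumPy sketch of this training procedure; the $\lambda$ candidate grid and the toy data at the end are our own illustrations, the closed form follows the text.

```python
import numpy as np
from sklearn.model_selection import KFold

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam I)^(-1) X^T y."""
    return np.linalg.pinv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ y

def train_density_regression(feats, y):
    X = np.hstack([feats, np.ones((len(feats), 1))])    # append a bias term
    best_lam, best_err = None, np.inf
    for lam in np.logspace(-3, 4, 15):                  # candidate grid (assumed)
        folds = KFold(n_splits=5, shuffle=True, random_state=0).split(X)
        err = np.mean([np.mean((X[va] @ ridge_fit(X[tr], y[tr], lam) - y[va]) ** 2)
                       for tr, va in folds])
        if err < best_err:                              # keep lambda with lowest
            best_lam, best_err = lam, err               # combined held-out error
    return ridge_fit(X, y, best_lam)                    # refit on all train data

# toy usage with random stand-ins for deep features and VAS labels
rng = np.random.default_rng(0)
w = train_density_regression(rng.normal(size=(200, 32)), rng.uniform(0, 100, 200))
```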
The linear regression approach assumes that the feature extractors map the images onto a reasonably linear space in which there is no need to consider nonlinear correlations. As this assumption may not be correct, and it may be possible to improve performance with a more complex mapping, we also consider the use of an MLP. In Algorithm 4, we show the procedure we follow for applying the MLP to the data. We have the same input training vectors as for the linear regression. The details of the architecture and training are in Appendix A. In summary, we train the MLP on the same training partition as the linear regression, use three objective functions, including the L1 loss and the mean squared error (MSE), and select the learning rates and training epochs by trial and error until the training error converges reasonably smoothly to a steady state.

Algorithm 4. Density mapping: MLP.
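A minimal PyTorch sketch consistent with the layer sizes given in Appendix A; the learning rate, its decay factor, and the full-batch training loop are illustrative assumptions, as the original values did not survive.

```python
import torch
import torch.nn as nn

def make_density_mlp(in_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(in_dim, 200), nn.ReLU(),
        nn.Linear(200, 300), nn.ReLU(),
        nn.Linear(300, 1),                    # single VAS density output
    )

def train_mlp(feats, labels, loss_fn=nn.L1Loss(), epochs=200, lr=1e-4):
    model = make_density_mlp(feats.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        if epoch == epochs // 2:              # one mid-training LR reduction
            for g in opt.param_groups:
                g["lr"] = lr / 10             # decay factor assumed
        opt.zero_grad()
        loss = loss_fn(model(feats).squeeze(1), labels)
        loss.backward()
        opt.step()
    return model

# toy usage with random stand-ins for deep features and VAS labels
mlp = train_mlp(torch.rand(128, 512), torch.rand(128) * 100)
```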
3.4. Ensemble Approach

The two deep feature extracting models, together with the linear regression and MLP mappings, produce different predictions for each image. In addition, as will be discussed in Sec. 5.1, we can train on different labels, producing another set of density mapping models. Ensemble methods tend to outperform individual models,28 so if we can combine the individual predictions, we would expect to improve model performance. In total, there are 16 separate predictions: two feature extractors (ResNet and DenseNet) and four density mappings (linear regression and three MLPs trained with different objective functions) give eight models, and each of those eight can be trained on either averaged or individual labels (see Sec. 5.1 for details), giving 16 separate sets of predictions. To produce our final ensemble prediction, we start by splitting the training data into two sets: a larger one that each individual model is trained upon (see Algorithms 3 and 4) and a smaller one to train the ensemble on. We train the individual models on the first training set and then apply each to the second training set to produce predictions. We stack the predictions into a new training set, on which we train a new MLP to find a final model. To make predictions, we run all models on an image, produce the output for each one, and make a final prediction by feeding the 16-dimensional vector of predictions into our ensemble model. In Appendix A, we provide details of the architecture and training procedure. There are many ensemble approaches available,29 including simple methods such as averaging across individual results. We show just one approach that demonstrates that we can improve on individual results using an ensemble method.

Algorithm 5. Ensemble method.
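A sketch of this stacking step follows. Appendix A specifies only that the hidden sizes depend on $n$, the number of stacked predictions; the multiples 4n and 8n, the loss, and the toy training loop here are assumptions.

```python
import torch
import torch.nn as nn

def make_ensemble_mlp(n_models: int = 16) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(n_models, 4 * n_models), nn.ReLU(),   # hidden multiples assumed
        nn.Linear(4 * n_models, 8 * n_models), nn.ReLU(),
        nn.Linear(8 * n_models, 1),                     # final density estimate
    )

# rows: the 16 per-image predictions from the individual models, evaluated on
# the held-out part of the training data (stand-in values shown here)
stacked = torch.rand(64, 16) * 100
targets = torch.rand(64) * 100
ensemble = make_ensemble_mlp(16)
opt = torch.optim.Adam(ensemble.parameters(), lr=1e-4)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.l1_loss(ensemble(stacked).squeeze(1), targets)
    loss.backward()
    opt.step()
```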
4. Assessing Predictive Performance

4.1. Metrics

To assess the predictive performance of our models, we consider a range of metrics. The global measures we use to compare the quality of the VAS predictions with the labels are the Pearson correlation coefficient, root mean squared error (RMSE), mean absolute error (MAE), and median absolute error (MedAE). We also show results of the risk ratios on the priors, as discussed in Sec. 2.

4.2. Perfect Predictor Estimates

Due to label unreliability, even a “perfect” model that correctly estimates the VAS scores for all images will have a nonzero error when compared with the noisy labels. Therefore, we need to produce an estimate of the metrics a “perfect” model would produce so that we can assess the quality of our approach. We take the average of the two reader scores to be the “true” values and then assume that the actual reader scores are drawn from a Gaussian distribution around the real score. The Gaussian error distribution varies across the score distribution, with smaller average errors at both low and high densities when considering averaged reader scores. We calculate the error distribution for small ranges of densities. We then use the “true” values, the averaged reader scores, and create a pair of modeled reader scores by adding Gaussian noise to the “true” value. In detail, we take the average reader scores and call these our perfect modeled estimates. We bin the average reader scores into small bins (4% each), which provides a reasonable number of data points to calculate the Gaussian parameters, except above 80%, where there is little data, so we use the range 80% to 100% as a single bin. The distributions approximate a Gaussian (see Fig. 9 in the appendix), but the Gaussian tails are longer than in the real results, and modeled reader estimates can fall below 0 or above 100. To correct for these two effects, we resample from the distribution if the reader estimate is outside the 0 to 100 range or if the deviation of the sample from the mean is greater than a threshold $\sigma_{\max}$. These corrections make the modeled distribution a plausible match for the real data. To keep the model as simple as possible, the only parameter we alter is $\sigma_{\max}$. To check whether the model produces sensible estimates, we compare the differences between the pairs of real reader estimates and the modeled pair differences. We can then adjust this tuneable parameter to equalize this comparison for the different metrics we consider. In this manner, we can estimate the range of metrics that would occur for an optimal prediction method. We summarize our method in Algorithm 6.

Algorithm 6. Perfect predictor model.
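A sketch of this simulation follows; it is our reading of Algorithm 6. In particular, estimating the per-bin sigma from reader deviations around the pair average, and treating $\sigma_{\max}$ as a multiple of that sigma, are assumptions.

```python
import numpy as np

def modelled_reader_pairs(avg, r1, r2, sigma_max, rng=np.random.default_rng(0)):
    """Simulate a pair of reader scores around each 'true' (averaged) score.

    avg: averaged reader scores, treated as the true densities; r1, r2: the
    real individual reader scores, used only to estimate the per-bin spread.
    """
    edges = np.r_[np.arange(0, 80, 4), 80, 100.01]    # 4% bins, 80-100 pooled
    idx = np.clip(np.digitize(avg, edges) - 1, 0, len(edges) - 2)
    dev = np.r_[r1 - avg, r2 - avg]                   # reader deviation from avg
    sig = np.array([dev[np.r_[idx, idx] == b].std()
                    for b in range(len(edges) - 1)])  # Gaussian sigma per bin
    pairs = np.empty((len(avg), 2))
    for i, (t, s) in enumerate(zip(avg, sig[idx])):
        for j in range(2):
            while True:                               # rejection sampling:
                x = rng.normal(t, s)                  # resample outside [0, 100]
                if 0 <= x <= 100 and abs(x - t) <= sigma_max * s:
                    break                             # or beyond the threshold
            pairs[i, j] = x
    return pairs
```

In use, $\sigma_{\max}$ would be adjusted until a chosen metric (e.g., RMSE) between the two modeled readers matches the same metric between the two real readers, as described in the next paragraph.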
We compare the four metrics (correlation, RMSE, MAE, and MedAE) between the pair of modeled reader scores and match them to the metrics between the pair of real reader scores. For example, we alter $\sigma_{\max}$ until the modeled pair of readers has the same RMSE as the real pair of readers. We do the same for the other metrics and record the optimal metrics from the lowest to the highest. We then show the range of these values when we perform the simulation.

5. Results

We split our results section into three subsections. In Sec. 5.1, we analyze the labels to gain insight into their reliability and how we need to train our models. In Sec. 5.2, we analyze the performance of our models compared to the ground truth as produced by the radiologists' labels of the images. Finally, in Sec. 5.3, we make a range of comparisons, both between different versions of our models and between our models and previous work, to gain further insight into our model performance.

5.1. Reader Density Label Analysis

In Fig. 3, we show (a) the distribution of the density scores and (b) the distribution of the absolute differences between reader scores. For the density scores, we show both the average of the two readers (averaged) and the individual reader estimates (individual). As there are twice as many individual scores as averaged scores, we normalized the distributions to make them directly comparable. The distributions of density are highly skewed, with little data at high breast densities and relatively large amounts at a density of around 20. In addition, the averaged reader score tends to compress the distribution away from both low and high densities compared with the individual scores. The reader absolute differences [Fig. 3(b)] show a distribution with a relatively long tail: many images have similar density scores, but some show large differences.

In Fig. 4, we show how the RMSE between the pair of readers differs per decile of VAS scores for averaged and individual reader scores. The averaged and individual scores are used to define the deciles (bins), and the differences per decile are then calculated using the images in each bin. Equivalent scores for the MAE and median absolute error per decile are shown in Table 6 in the appendix. The variability of reader scores means that some of the images will be labeled inaccurately. To gain some intuition about the scale of this effect, in Fig. 5, we show the distribution of the differences for the reader estimates using individual reader estimates as the scores. We bin the images into their deciles using the individual labels and then plot the distribution of the differences for each decile. The differences shown are between an individual VAS score and its pair; we therefore see skewed distributions.

In Sec. 4.2, we presented a simulation method for producing quantitative estimates of the errors for a perfect set of predictions. In Table 2, we show error measures between the modeled density scores and the reader estimates for correlation, RMSE, MAE, and MedAE for the entire dataset (all data) and for the test set. The ranges arise from adjustments of $\sigma_{\max}$ (see Sec. 4.2 and Algorithm 6). There are further plots and analysis of these results in the appendices.

Table 2. Expected metrics per image for a model predicting the true VAS values, if the assumptions we specify are correct. This can be seen as an estimate of the metrics we would find if we produced a highly accurate model.
The range is found by matching the metrics of the modeled pair of reader scores, via $\sigma_{\max}$, to each of the four metrics in turn. The best score (high correlation and low errors) is found when matching RMSE and the worst when matching the correlation.
5.2. Model Predictions

In Fig. 6, we plot our ensemble CC image density estimates against the reader average. Figure 6(a) is a direct comparison of prediction against reader average, and Fig. 6(b) is a Bland–Altman plot of the difference between prediction and reader average against the average of the two. Plots per woman are shown in the appendix in Fig. 14 and show a similar pattern, with smaller variation. These plots allow for some intuition about the quality of the predictions compared to the labels. They can also be compared to plots of the individual reader scores against one another as well as the modeled scores, all in the appendix (Fig. 11). These plots also show a considerable similarity to the modeled optimal plots in the appendix (Fig. 12).

In Table 3, we show the four metrics produced by comparing our predictions against the averaged labels for the CC images. The equivalent results for the MLO images are in Table 7 in Appendix B. All the results are shown against the averaged reader scores of the test set. The label column refers to which label was used when training the models on the training data. The general trend is for the DenseNet model to perform better than ResNet and for the MLPs to outperform the linear regression. The ensemble predictor slightly outperforms the individual models on most of the metrics.

Table 3. Comparison metrics between our models and the average labeled data for the CC images.
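For readers wanting to reproduce a panel like Fig. 6(b), a small matplotlib sketch follows; the variable names are illustrative, and the 1.96σ limits of agreement are the usual Bland–Altman convention rather than a choice stated in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

def bland_altman(pred: np.ndarray, label: np.ndarray) -> None:
    mean, diff = (pred + label) / 2, pred - label
    plt.scatter(mean, diff, s=4, alpha=0.3)
    plt.axhline(diff.mean(), linestyle="--")              # mean difference
    for k in (-1.96, 1.96):                               # limits of agreement
        plt.axhline(diff.mean() + k * diff.std(), linestyle=":")
    plt.xlabel("Mean of prediction and reader average (VAS)")
    plt.ylabel("Prediction minus reader average (VAS)")
    plt.show()
```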
Figure 7(a) shows differences in labels per decile, and Fig. 7(b) shows prediction errors per decile. The label differences differ slightly from Fig. 4 (though the trends are the same), as those were computed on the entire dataset and these on the test set. The images are binned either by their averaged labels (crosses) or individual labels (dots). The error bars are the 95% confidence intervals found by performing 1000 sets of bootstrapping. The prediction errors are taken as the difference between the ensemble predictions and the averaged labels; the individual labels are used only to bin the images, not to estimate the accuracy of the model. The key point is that while the prediction errors do increase with breast density, so do the differences between the pairs of readers. At higher density, the models are both trained on and compared with more variable labels.

In Fig. 8, we show the results of cancer risk predictions for the same 16 model variants and ensemble predictor as in Table 3. The odds ratios (ORs) are in comparison to the first quintile. The most relevant is $OR_5$, which shows the OR of the highest density women compared with the lowest density women. All the model predictions show a substantial OR between the first and fifth quintiles, with no differences outside of the uncertainty bounds. These are also comparable to those of the averaged reader VAS scores, with an OR of around 4.5.4

5.3. Model Comparisons

We make comparisons between models trained on averaged and individual labels, between models with different feature extractors (ResNet and DenseNet), and between models with density mappings from linear regression and MLPs. In addition, we compare our predictions to those of a variant of a previous method trained on this data, pVAS.18 In Table 4 (comparison 1), we show metrics for the differences in prediction between models trained on averaged and individual labels for the DenseNet MLP method with the L1 objective function. Plots relating to these results are in the appendix (Fig. 15). The similarity in predictions is high, which might not be expected considering the substantial differences in some reader scores, implying there is a strong density signal in the data.

Table 4. Comparison metrics between the predictions made by our density mapping models. We show the standard metrics with the addition of the RMSE per quintile (labeled Q1 to Q5), with quintiles defined by the average of the two predictions. AvQ is the mean of the RMSEs per quintile. Comparison 1 is between models trained on averaged and individual labels. Comparison 2 is between MLPs trained on DenseNet and ResNet extracted features. Comparison 3 is between an MLP and a linear regression model trained on the DenseNet extracted features. Comparison 4 is between our ensemble model trained on both individual and averaged labels and pVAS from a previous study.18 The test set for this comparison is different from that used in the rest of this paper.
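The OR computation itself follows the matched case-control design described in Sec. 2: quintile the density estimates, then fit a conditional logistic regression within the matched sets. A sketch using statsmodels' ConditionalLogit is below; the function and variable names are ours, and the use of statsmodels rather than the authors' own tooling is an assumption.

```python
import numpy as np
import pandas as pd
from statsmodels.discrete.conditional_models import ConditionalLogit

def quintile_odds_ratios(density, is_case, strata):
    """density: per-woman density estimates; is_case: 1 for priors, 0 for
    controls; strata: id of each matched 1-case/3-control set."""
    q = pd.Series(pd.qcut(density, 5, labels=False))       # quintiles 0..4
    X = pd.get_dummies(q, prefix="Q", drop_first=True)     # Q1 as reference
    fit = ConditionalLogit(np.asarray(is_case), X.astype(float),
                           groups=np.asarray(strata)).fit()
    return np.exp(fit.params)                              # ORs for Q2..Q5 vs Q1
```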
Metrics for the comparison between model predictions made using the ResNet and DenseNet feature extractors, both with MLPs trained on the feature vectors, are shown in Table 4 (comparison 2) (with related Fig. 16). Both models are trained on individual labels with the L1 objective function. The differences are larger than those between the pair of predictions trained on individual and averaged labels, but they appear random in character; there does not appear to be much systematic variation. We show the results for a comparison between model predictions made using the linear regression and MLP density estimators, both using the DenseNet feature extractor, in Table 4 (comparison 3) (Fig. 17 in the appendix). These are for averaged labels and the L1 objective function for the MLP. They show some systematic differences in predictions, likely due to the linear regression underfitting the data.

The final comparison we make is between previous work,18 labeled pVAS, and the final ensemble predictions produced in this paper. Table 5 shows metrics for the pVAS estimates and our ensemble estimates. We also show these results as comparison 4 in Table 4. Plots are shown in Fig. 18. The test set is different from that used for the results in the rest of the paper so that it coincides with the test set used by the pVAS model. Our model outperforms the pVAS model even though it is never trained end-to-end, which demonstrates the power of the representation formed by the pretrained models.

Table 5. Metrics of predictions made by our ensemble method and by pVAS18 per image. The results cover all images, both CC and MLO. These results are from the second test set partition discussed in Sec. 2. Uncertainties are at the 95% confidence level, found via bootstrapping with 1000 repeats.
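A sketch of the bootstrap behind these confidence intervals; the statistic shown (RMSE) is one example, and the same resampling applies to the other metrics.

```python
import numpy as np

def bootstrap_ci(pred, label, n_boot=1000, rng=np.random.default_rng(0)):
    """95% CI of the RMSE via resampling with replacement (1000 repeats)."""
    stats = []
    for _ in range(n_boot):
        i = rng.integers(0, len(pred), len(pred))   # resample image indices
        stats.append(np.sqrt(np.mean((pred[i] - label[i]) ** 2)))
    return np.percentile(stats, [2.5, 97.5])
```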
6. Discussion

Breast density, as measured on a VAS by experienced radiologists, shows a strong relationship with breast cancer risk.4 We designed and implemented a framework based on pretrained deep networks to find a mapping between a mammogram and its associated breast density. The core part of the system is the feature extractor, which is taken from deep networks trained on ImageNet.9

Reader density scores are known to be variable.30,31 One consequence is that we train our models on data with inter-reader variability; a second is that even a highly accurate predictor would appear to produce inaccurate predictions when compared with the labels. We analyzed the labels, in Sec. 5.1, to gain insight into how this variability affects our conclusions. An important consideration is how the variability changes with the density scores. The averages of the two reader estimates (Fig. 4 and Table 6) appear to show significantly smaller differences between the readers at both lower and higher densities, with a maximum difference in the middle. For example, in the 50 to 60 decile, we have a mean difference of 18.0, and in the 0 to 10 decile, a mean difference of 4.5. This might imply that the readers are more consistent at the two extremes and produce more variable results in the middle. However, if we look at the individual reader estimates, the average differences between the readers tend to increase with increasing density across the deciles.

Table 6. The fraction of density scores (number fraction) and the average (both mean and median) absolute differences between reader estimates per decile. These are shown for deciles defined either by averaged reader scores (averaged) or individual reader scores (individual).
The reason the differences appear to peak in the middle of the density distribution and are lower at the two ends for the averaged reader estimates is likely to be, at least partially, a statistical artefact. If we consider the first decile (1 to 10) and take an example of an averaged reader score of 5.0, the maximum difference possible for this average is one reader marking 1 and the other marking 9, a difference of 8. Conversely, the maximum possible difference for an average value of 50.0 is one reader marking 1 and the other marking 99, giving a difference of 98. Part of the reason the average differences appear lower at low and high densities is that, when considering the average scores, results with larger differences cannot produce averages at the extremes. There might be more label consistency at the low and high ends of the density distribution, but it is hard to separate that from this statistical artefact.

As the performance of our models is measured against the quality of the labels, if the label variability increases with higher density, we would expect to see an apparent fall in model performance due to the increasing variability rather than a failure of the model. This is the effect we see in Fig. 7: an apparent reduction in model performance at higher density scores. It is likely that our models are genuinely less accurate at higher densities, as we also see large variability when comparing model predictions to each other at those densities (Table 4), but the effect is smaller than a direct comparison of predictions with labels would suggest. These effects are further discussed in the appendix.

The errors between our density predictions (Table 3) and the labels fall close to the range of errors we see with a modeled optimal estimator (Table 2), which perfectly predicts the density of every image. There is still room for improvement in the quality of the predictions, which may best be achieved by adapting the feature extractors to better represent mammography data, either through fine-tuning or other approaches. However, there is a limit above which it will become difficult to assess whether improved models genuinely perform better, as they approach the metrics shown for the simulated optimal model.

We compared our models both to a previous method, pVAS,18 and to variants of our method. We find considerable similarities between all of our models and pVAS. This similarity implies that we are finding true structure in the data. There is greater divergence of predictions at higher VAS scores, although it is still fairly small considering the uncertainty in the labels and the small amount of data at those higher densities. There is little difference in results when training on averaged or individual labels (Table 3), which is an interesting finding considering the low consistency of the pairs of reader estimates. We might expect a significant improvement when training on the averaged labels because averaging should reduce some of the noise in the data. The fact that we do not see this suggests that the models are able to effectively extract a true density-related signal. The ensemble predictor produces a small improvement in performance, giving slightly lower errors and higher correlations than the individual density mapping models.
The linear regression produced reasonable accuracy, implying that the feature extractors map to a fairly linear subspace; however, some nonlinearity is clearly required, as the MLPs do perform better (Table 3). DenseNet performs better than ResNet, which may be because the DenseNet version used performs better on ImageNet than the specific ResNet version used (see Appendix A for details). Alternatively, it may be due to the larger feature space of the DenseNet model. The model predictors (Fig. 8) are all comparable with the VAS labels in terms of risk prediction. Although there is variation between the models in the ORs found, these are all within the uncertainty bounds, and we cannot make any strong statements about the quality of the predictive models compared with one another.

This usage of transfer learning allows us to leverage the long training times and large computational power of other research groups. There is a debate over whether transfer learning or learning from scratch is more appropriate.10 One answer is that a transfer learning approach like the one demonstrated here is far quicker to implement than designing and then training networks from scratch. If the results from a transfer learning approach are poor, then it may be necessary to pursue other approaches. A potential issue with transfer learning in medical imaging is that the models used tend to be trained on unrelated images. It might be expected that this would result in extracted features that are unrelated to the medical domain. However, we have shown that reasonably accurate results can be obtained using these features, even with only a linear mapping. Perhaps with some domain adaptation, these results would improve further.

We reduced the size of the images to 224 × 224, both to match the size of the images the models were originally trained on and to reduce computational requirements. If we can utilize smaller images and achieve good results, it is a significant advantage in terms of the computing power and time required. Saving computational time is a major advantage: it allows researchers and groups without access to large computing resources to perform analysis, it allows multiple different runs of algorithms to be performed, and it allows multiple other facets of the data to be investigated. We also did not preserve the aspect ratios of the images, correct for the three different image sizes being distorted by differing amounts, or crop the images to focus on the breast. Yet we still achieve accurate predictions, although further research is required to investigate whether results could be improved by correcting these issues, or whether they do not adversely affect the quality of the models. We also do not utilize the three-channel nature of the transferred networks, something that might enable more predictive capacity to be extracted from the pretrained models.

Overall, our method produces predictive accuracy close to the maximum that can be assessed with the ground truth labels we have access to. We do so using a method that requires modest computer resources, both in terms of memory and time. In particular, once the feature extractors have been run, the computational requirements of the density predictors are very low, especially for the linear regression. This enables both much faster training and training on a small dataset or subsets of the larger dataset.
We also demonstrate the perhaps surprising ability of deep learning models trained on a different image domain to produce good performance on this medical dataset. There is a direct benefit of accurate prediction of breast density: it can provide information to medical practitioners or to research projects without access to radiologists. Another potential value is as an input to other automated models. Density provides considerable information that models attempting to predict cancers, perform segmentation, or carry out other tasks might utilize to improve performance on their specific problem.

7. Conclusions

In this paper, we have demonstrated that a transfer learning approach with deep features results in accurate breast density predictions. This approach is computationally fast and cheap, which enables more analysis to be done and smaller datasets to be used. However, the deep feature models were trained on a nonmedical dataset, which implies that the features extracted could be considerably improved. If we can train deep models on medical imaging data, then we might expect improvements in performance when those models are used for transfer learning across a wide range of medical imaging applications. We have also demonstrated the issues associated with data where reader assessments are variable. Finding ways to reduce the variability of the labels would enable us to train on more reliable data and to better assess which models perform better than others. If we could improve the quality of the labels, we could also more systematically investigate which measures improve performance.

8. Appendix A: Model Details

8.1. Feature Extraction

We use two networks for the feature extraction: ResNet22 and DenseNet.23 Specifically, we use the "ResNet18" and "DenseNet161" architectures downloaded with pretrained weights from the PyTorch24 Torchvision repositories. For both, we replace the final classification layer ("fc" in ResNet and "classifier" in DenseNet) with an identity mapping. We ran the images through the two networks on an NVIDIA Quadro P400 2 GB GPU, a process that takes around 10 to 50 s per 100 images for ResNet and DenseNet, respectively. In total, we have around 160,000 images, giving a total run time of around 4.5 to 22 h for ResNet and DenseNet, respectively. Once this stage is completed, there is no need to repeat it, as the feature vectors remain the same and can be used in any required permutation.

8.2. MLP Density Mapping

Our MLP is small and simple, with d input neurons (512 for ResNet and 2208 for DenseNet), followed by 200 neurons and a rectified linear unit (ReLU),32 then 300 neurons and a ReLU to the output. We trained on the same NVIDIA Quadro P400 GPU using PyTorch24 with around 200 epochs, the Adam optimizer,33 and a starting learning rate that was reduced once halfway through training. Due to the small and simple nature of the MLP, multiple other training approaches would suffice; this one was found through trial and error, checking that the training error reduced as expected. We use three objective functions, including the L1 loss and the MSE, all available as standard loss functions in PyTorch.

8.3. Ensemble Prediction

The ensemble MLP is built using PyTorch24 with two hidden layers whose sizes are set in proportion to n, the number of density predictions fed into the system. We apply a ReLU to both hidden layers and train using the Adam optimizer.
9. Appendix B: Extra Results

In Fig. 9, we show the distribution of the differences between individual label scores for the bins we use to make our simulated predictions. The bin centers are noted at the top of each plot.

9.1. Label Analysis

In Fig. 10, we graphically demonstrate the challenge of having high reader uncertainty together with a skewed distribution. We plot all the binned individual reader scores per decile along with the distribution of the differences from the center of the decile; these are the blue bars. We also plot the total number of individual labels above 80% as orange bars. The bottom right two plots show the entire distribution (left of the pair) and a zoomed-in version (right). The number of images falsely labeled as high density may be comparable to the number of genuinely high-density images. This makes it difficult to confidently assess how well our models perform at high VAS scores, as we cannot assess whether a high VAS score is accurately labeled. If we were to perform oversampling or data augmentation on the VAS scores classed as high by the averaged reader estimate, we would likely oversample from the least reliable part of the distribution, with the most falsely labeled data points and the lowest signal-to-noise ratio.

In Fig. 11, we show plots of the pairs of VAS labels for the modeled and real estimates; we show 3000 random pairs so that the structure of the distribution is visible. The dotted line shows a perfect relationship between the pair of readers. Figure 11(a) is the lower error (higher correlation) end of our modeled range, and Fig. 11(c) is the higher error (lower correlation) end of our modeled range. Figure 11(b) shows the real pairs of reader labels plotted against one another; these points are very slightly perturbed so that individual points can be seen. In Fig. 12, we show the average modeled optimal scores versus the average of the two modeled labels. Figure 12(a) is a direct comparison, and Fig. 12(b) is the Bland–Altman plot of the difference between the modeled optimal score and the modeled average reader scores versus the average of the optimal and modeled scores.

Previous work has shown that VAS density scores correlate strongly with the risk of developing cancer.4 In this paper, we have demonstrated that the variability of the reader estimates means we have to be cautious about the level of confidence we place in results derived from these labels. We therefore repeat the analysis for the priors dataset from that previous work to calculate the ORs.4,18 In addition, we provide a further piece of analysis by perturbing the averaged scores. The purpose is to give some intuition about how much the ORs might change with small variations in the reader scores. We added a random amount to all the averaged scores by sampling from a Gaussian distribution with three different standard deviations: 1, 2, and 5. We resampled any scores that went above 100 or below 1 until we obtained a score within the correct range. We performed each set of perturbations five times. The perturbation we apply to the reader averaged scores is small when compared with the variability between the pairs of reader scores, which have an RMSE between readers of 16.2. The results of the range of ORs found are shown in Fig. 13. We show OR plots for the second ($OR_2$), third ($OR_3$), fourth ($OR_4$), and fifth ($OR_5$) quintiles. On the left side of the plots, labeled Orig., are the nonperturbed ORs. The next five, labeled with the standard deviation of 1, show the five repeats for that noise level; the results for standard deviations of 2 and 5 are shown in the next two sections, separated by the dashed line. The crosses show the ORs and the error bars the upper and lower 95% confidence intervals found via bootstrapping.
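A sketch of this perturbation procedure; the resampling loop mirrors the description above, and the stand-in scores are illustrative.

```python
import numpy as np

def perturb_scores(avg, sigma, rng=np.random.default_rng(0)):
    out = avg + rng.normal(0, sigma, len(avg))
    bad = (out < 1) | (out > 100)
    while bad.any():                          # resample out-of-range scores
        out[bad] = avg[bad] + rng.normal(0, sigma, bad.sum())
        bad = (out < 1) | (out > 100)
    return out

# five repeats for each of the three noise levels (stand-in scores shown)
avg_scores = np.random.default_rng(1).uniform(1, 100, 1000)
perturbed = {s: [perturb_scores(avg_scores, s) for _ in range(5)]
             for s in (1, 2, 5)}
```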
We therefore consider two aspects of uncertainty here: via bootstrapping we see the uncertainty relating to the sampling of the data, and via the perturbations we see the uncertainty due to the unreliability of the labels caused by reader variability. The uncertainty due to sampling (the bootstrap error bars) is large and shows the wide range of possible ORs that could occur with a different sample. The perturbations alter both the OR and its uncertainty. We will see in the next section that the ORs found by the models are comparable with these ORs for the readers. Although this is a positive result for our models, the problem is that the high level of uncertainty in the OR values means we cannot easily assess whether our models perform better than one another. This means that we cannot bypass the metrics considered earlier in this section to assess the quality of our models by looking directly at cancer risk, because the uncertainty involved is too large. From these perturbation results, we would not want to trust that one model is better than another without quite significant differences in estimates. However, what this also shows is that the VAS scores do produce a robust set of cancer risk predictions. In previous work, the other measures of density studied did not produce a ratio of above 3.4 Therefore, although we cannot be confident that results in a fairly broad range are an improvement on other results, we can be reasonably confident in the overall ability of VAS to make good risk estimates in comparison to other density scores.

9.2. Model Predictions

In Table 7, we show the prediction results for the MLO images, equivalent to Table 3.

Table 7. Comparison metrics between our models and the average labeled data for the MLO images.
Plots of the final ensemble predictions versus labels per woman are displayed in Fig. 14. There are 2682 women in the test set who have all labels and all predictions intact; 15 women are missing labels or images and are removed. We see the same general pattern as in the per-image plots of Fig. 6.

9.3. Model Comparisons

In Fig. 15, we show a comparison of predictions made by training on the individual labels against training on the averaged labels, both with a DenseNet feature vector and an MLP trained using the L1 objective function. Metrics related to these plots are shown in Table 4 (comparison 1). In Fig. 16, we show a comparison of predictions made using MLPs with the L1 objective function on the DenseNet and ResNet feature vectors, both trained on individual labels. Metrics related to these plots are presented in Table 4 (comparison 2). In Fig. 17, we show a comparison of predictions made using an MLP (with the L1 objective function) against linear regression, both on feature vectors from the DenseNet model, with averaged labels. Related metrics are in Table 4 (comparison 3). In Fig. 18, we show predictions made by our ensemble predictor compared to the pVAS model. Related metrics are in Table 4 (comparison 4).

Disclosures

The authors declare no conflicts of interest. Ethics approval for the PROCAS study was through the North Manchester Research Ethics Committee (09/H1008/81). Informed consent was obtained from all participants on entry to the PROCAS study.

Acknowledgments

Dafydd Gareth Evans, Elaine Harkness, and Susan M. Astley are supported by the National Institute for Health Research (NIHR) Manchester Biomedical Research Centre (Grant No. IS-BRC-1215-20007). Steven Squires was supported by CRUK AI-informed screening (Grant No. A29024).

References
1. N. F. Boyd et al., "Breast tissue composition and susceptibility to breast cancer," J. Natl. Cancer Inst. 102(16), 1224–1237 (2010). https://doi.org/10.1093/jnci/djq239
2. C. Huo et al., "Mammographic density-a review on the current understanding of its association with breast cancer," Breast Cancer Res. Treat. 144(3), 479–502 (2014). https://doi.org/10.1007/s10549-014-2901-2
3. V. A. McCormack and I. dos Santos Silva, "Breast density and parenchymal patterns as markers of breast cancer risk: a meta-analysis," Cancer Epidemiol. Biomarkers Prev. 15(6), 1159–1169 (2006). https://doi.org/10.1158/1055-9965.EPI-06-0034
4. S. M. Astley et al., "A comparison of five methods of measuring mammographic density: a case-control study," Breast Cancer Res. 20(1), 10 (2018). https://doi.org/10.1186/s13058-018-0932-z
5. A. R. Brentnall et al., "Long-term accuracy of breast cancer risk assessment combining classic risk factors and breast density," JAMA Oncol. 4(9), e180174 (2018). https://doi.org/10.1001/jamaoncol.2018.0174
6. J. Cuzick et al., "Tamoxifen-induced reduction in mammographic density and breast cancer risk reduction: a nested case-control study," J. Natl. Cancer Inst. 103(9), 744–752 (2011). https://doi.org/10.1093/jnci/djr079
7. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Adv. Neural Inf. Process. Syst., 1097–1105 (2012).
8. Z. C. Lipton, "The mythos of model interpretability," Queue 16(3), 31–57 (2018). https://doi.org/10.1145/3236386.3241340
9. J. Deng et al., "ImageNet: a large-scale hierarchical image database," in IEEE Conf. Comput. Vis. and Pattern Recognit., 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
10. G. Litjens et al., "A survey on deep learning in medical image analysis," Med. Image Anal. 42, 60–88 (2017). https://doi.org/10.1016/j.media.2017.07.005
11. D. Shen, G. Wu, and H.-I. Suk, "Deep learning in medical image analysis," Annu. Rev. Biomed. Eng. 19, 221–248 (2017). https://doi.org/10.1146/annurev-bioeng-071516-044442
12. P. Fonseca et al., "Automatic breast density classification using a convolutional neural network architecture search procedure," Proc. SPIE 9414, 941428 (2015). https://doi.org/10.1117/12.2081576
13. M. Kallenberg et al., "Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring," IEEE Trans. Med. Imaging 35(5), 1322–1331 (2016). https://doi.org/10.1109/TMI.2016.2532122
14. B. M. Keller et al., "Estimation of breast percent density in raw and processed full field digital mammography images via adaptive fuzzy c-means clustering and support vector machine segmentation," Med. Phys. 39(8), 4903–4917 (2012). https://doi.org/10.1118/1.4736530
15. C. D. Lehman et al., "Mammographic breast density assessment using deep learning: clinical implementation," Radiology 290(1), 52–58 (2019). https://doi.org/10.1148/radiol.2018180694
16. T. P. Matthews et al., "A multisite study of a breast density deep learning model for full-field digital mammography and synthetic mammography," Radiol. Artif. Intell. 3(1), e200015 (2020). https://doi.org/10.1148/ryai.2020200015
17. O. H. Maghsoudi et al., "Deep-LIBRA: an artificial-intelligence method for robust quantification of breast density with independent validation in breast cancer risk assessment," Med. Image Anal. 73, 102138 (2021). https://doi.org/10.1016/j.media.2021.102138
18. G. V. Ionescu et al., "Prediction of reader estimates of mammographic density using convolutional neural networks," J. Med. Imaging 6(3), 031405 (2019). https://doi.org/10.1117/1.JMI.6.3.031405
19. S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2009). https://doi.org/10.1109/TKDE.2009.191
20. S. Squires et al., "Automatic density prediction in low dose mammography," Proc. SPIE 11513, 115131D (2020). https://doi.org/10.1117/12.2564714
21. D. G. R. Evans et al., "Assessing individual breast cancer risk within the UK National Health Service Breast Screening Program: a new paradigm for cancer prevention," Cancer Prev. Res. 5(7), 943–951 (2012). https://doi.org/10.1158/1940-6207.CAPR-11-0458
22. K. He et al., "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. and Pattern Recognit., 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
23. G. Huang et al., "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. and Pattern Recognit., 4700–4708 (2017). https://doi.org/10.1109/CVPR.2017.243
24. A. Paszke et al., "Automatic differentiation in PyTorch," (2017).
25. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," (2014).
26. C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. and Pattern Recognit., 1–9 (2015). https://doi.org/10.1109/CVPR.2015.7298594
27. K. Hornik et al., "Multilayer feedforward networks are universal approximators," Neural Netw. 2(5), 359–366 (1989). https://doi.org/10.1016/0893-6080(89)90020-8
28. L. Rokach, "Ensemble-based classifiers," Artif. Intell. Rev. 33(1–2), 1–39 (2010). https://doi.org/10.1007/s10462-009-9124-7
29. D. Opitz and R. Maclin, "Popular ensemble methods: an empirical study," J. Artif. Intell. Res. 11, 169–198 (1999). https://doi.org/10.1613/jair.614
30. B. L. Sprague et al., "Variation in mammographic breast density assessments among radiologists in clinical practice: a multicenter observational study," Ann. Intern. Med. 165(7), 457–464 (2016). https://doi.org/10.7326/M15-2934
31. M. Sperrin et al., "Correcting for rater bias in scores on a continuous scale, with application to breast density," Stat. Med. 32(26), 4666–4678 (2013). https://doi.org/10.1002/sim.5848
32. V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th Int. Conf. Mach. Learn. (ICML-10), 807–814 (2010).
33. D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," (2014).
Biography

Steven Squires received his PhD in machine learning at the University of Southampton, worked as a research associate applying machine learning methods to medical imaging at the University of Manchester, and is currently working at the University of Exeter. He is a postdoctoral research fellow interested in the development and application of machine learning and statistical methods to medical problems.

Dafydd Gareth Evans is chair of medical genetics and cancer epidemiology at the University of Manchester. He has established a national and international reputation in clinical and research aspects of cancer genetics, particularly in neurofibromatosis and breast cancer. He has published 1012 peer-reviewed research publications (first/senior author on 370) and, in addition, more than 150 reviews and chapters. He has an ISI WoK H-index of 129 and a Google Scholar H-index of 170. He is the theme leader of Manchester NIHR Biomedical Research Centre Cancer Prevention Early Detection.

Susan M. Astley is chair of intelligent medical imaging at the University of Manchester, with research interests in breast density, early detection, and the prediction of risk of breast cancer. She has published more than 290 research publications with over 6700 citations and is currently co-chair of the SPIE Medical Imaging CAD conference.