Automated quality assessment in three-dimensional breast ultrasound images

Abstract. Automated three-dimensional breast ultrasound (ABUS) is a valuable adjunct to x-ray mammography for breast cancer screening of women with dense breasts. High image quality is essential for proper diagnostics and computer-aided detection. We propose an automated image quality assessment system for ABUS images that detects artifacts at the time of acquisition. To this end, we study three aspects that can corrupt ABUS images: the nipple position relative to the rest of the breast, the shadow caused by the nipple, and the shape of the breast contour on the image. Image processing and machine learning algorithms are combined to detect these artifacts based on 368 clinical ABUS images that have been rated manually by two experienced clinicians. At a specificity of 0.99, 55% of the images that were rated as low quality are detected by the proposed algorithms. The areas under the ROC curves of the single classifiers are 0.99 for the nipple position, 0.84 for the nipple shadow, and 0.89 for the breast contour shape. The proposed algorithms are fast and reliable, which makes them suitable for online evaluation of image quality during acquisition. The presented concept may be extended to further image modalities and quality aspects.


Introduction
Three-dimensional (3-D) automated breast ultrasound (ABUS) is gaining importance in breast cancer screening programs as an adjunct to x-ray mammography. 1 It has been shown that its use may lead to early detection of small invasive cancers that are occult on mammography in women with dense breasts. [2][3][4] Furthermore, ABUS is a radiation-free technique, which is relatively inexpensive and effective, since images are acquired by technicians and interpreted later by radiologists, in contrast to hand-held ultrasound, which needs to be performed by experienced clinicians.
However, the quality of the images highly depends on the acquisition procedure. Bad skin contact or slight misplacement of the transducer during ABUS acquisition produces imaging artifacts, which may obstruct a complete diagnostic evaluation. This may lead to a recall of the woman for subsequent additional imaging, which increases screening costs. Recall rates of up to 19% due to BI-RADS category 0 rated images (Breast Imaging Reporting and Data System of the American College of Radiology) have been reported, 5 which means that these images were incomplete or of low quality and that a possible abnormality could not be clearly seen or defined. These numbers can be explained by the fact that technicians need some time to train before they are able to produce artifact-free images since the positioning of the transducer frame is an essential factor for image quality. Automated image quality assessment (AQUA) could support the technicians in recognizing image artifacts during or directly after image acquisition. By doing so, technicians could repeat the scan with corrected parameters while the woman is still in the examination room. In the described scenario, correctly detected artifacts would help to anticipate and potentially avoid recalls caused by insufficient image quality.
While there are several studies investigating image quality assessment of (breast) MRI [6][7][8] and of hand-held ultrasound images, 9 very little work has been performed to investigate image quality assessment of ABUS images. In Ref. 10, an algorithm to reduce motion artifacts in ABUS images based on nonlinear registration was developed. Generally, ultrasound image quality is considered from the technical point of view. The descriptions in Refs. 11-13 focus on the functionality of the equipment (beam former, transducer) but not on the usage of the system in daily routine. More recently, we investigated the incidence and influence of diverse ABUS artifacts in a reader study. 14 In that previous work, the investigated artifacts had been defined by radiologists, technicians, and physicists, aiming at those that were disturbing diagnostics. In the present work, we concentrate on three of the most relevant aspects that could be avoided in the majority of cases by rescanning: the acoustic shadow caused by the nipple, the position of the nipple relative to the rest of the breast in the image, and the shape of the breast contour on the image. If we manage to achieve high specificity in artifact detection, avoiding unnecessary rescans, such a tool could not only lower the number of recalls that cost time and money but also help to train the technicians.
The contribution of this work is the development of an automated image quality assessment system to automatically detect the previously mentioned artifacts. Such a system will support technicians during image acquisition by giving a warning if imaging artifacts disturbing the clinical interpretation of the images are present. A repetition of the affected scans with corrected parameters can then be performed while the patient is still in the examination room.

Background

Automated Breast Ultrasound Imaging
ABUS images are acquired by a wide linear array ultrasound transducer sliding continuously over one breast, which is gently compressed by a dedicated membrane while the patient lies in a supine position. During the sliding motion of the transducer, the ultrasound scanner acquires more than 300 transversal images covering a large segment of the breast. These single slices are stacked to form a 3-D ultrasound image that can be examined in multiplanar reconstructions. 15 Depending on the size of the breast, three to five views of each breast are acquired. The positioning and compression of the breast are standardized to some extent and include anterior-posterior (AP), lateral (LAT), medial (MED), superior (SUP), or inferior (INF) views, the breast being gently pushed in these directions, respectively. The last of these (INF) is acquired very rarely and was not contained in our datasets.

Automated Breast Ultrasound Image Quality Aspects
The focus was put on the three most frequent quality aspects that could be avoided by a repeated scan. The first problem is an incorrect nipple position within the image. In some cases, the nipple is pushed very close to the edge of the breast in coronal view [see Fig. 1(a)]. This may cause severe posterior acoustic shadows, obscuring anatomical structures behind the nipple, which can usually be avoided by proper repositioning of the transducer. The second issue is the shadow of the nipple [see Fig. 1(b)]. In the area around the nipple, there is commonly no perfect contact between transducer and skin, resulting in an acoustic shadow behind the nipple on the ultrasound image. Air-filled ducts may contribute to this effect. In most cases, the image is nevertheless usable for diagnostics, but sometimes the shadow covers noteworthy parts of the breast tissue. Applying more contact gel in a repeated scan often resolves this problem. The third aspect is also related to the positioning of the transducer and the breast. If the breast is not supported correctly by the provided cushions, there might be a lack of contact and the outer regions of the breast will not be imaged [see Fig. 1(c)]. This results in large background areas in the image as well as irregular breast contour lines.

Automated Image Quality Assessment
In this work, we propose an automated image quality assessment system checking the images during or directly after the acquisition. The current standard and the proposed additional workflow step are indicated in Fig. 2. The early automatic detection of image quality issues will initiate a repeated acquisition if indicated. This will only take a few minutes. If a problem that disturbs the diagnosis was only detected later by the radiologist, the woman would have to be recalled, which would take several days. In order to build a convenient application for clinical practice, we first gathered expert definitions of artifacts and had real image data annotated by clinicians. Approved image processing algorithms were employed to extract features characteristic of distinct quality aspects. Feature design was based on a training dataset (dataset A, introduced below) and aimed at translating the physical properties of the artifacts into computable values taking into account the radiologists' descriptions. Classifiers were used to reproduce the manual annotations based on the most meaningful subset of available features. In order to be used in clinical environments and produce results before the patient leaves the facility, the algorithms had to have a low run time (a few minutes at most). Similarly, in order not to produce unnecessary inconveniences for patients, the false positive rate was sought to be very low (clearly below 10%).

Methodology
The presented software development approach is based on machine learning and evaluated against the expert assessment of two clinicians. In what follows, we first explain the features computed for the detection of each image quality aspect and subsequently present the learning algorithm used. All image processing routines were implemented in C++ using the open source National Library of Medicine Insight Segmentation and Registration Toolkit (ITK, www.itk.org). All computations were performed on a Windows 7 machine with an Intel® Core™ i7-2627M processor at 2.7 GHz and with 6 GB of RAM.

Relative Nipple Position
The position of the nipple relative to the rest of the breast in the image is important because it relates to acoustic shadows that hamper the clinical interpretation of the image. The absolute nipple position in the image was given by the technician during image acquisition and stored as a private DICOM tag as specified by the standard acquisition protocol. The ABUS images were prepared for feature extraction in several preprocessing steps. First, a two-dimensional (2-D) coronal breast mask was computed similarly to the approach proposed by Wang et al. 16 To this end, a coronal mean projection of a stack of 120 slices close to the skin was performed. However, the top 50 slices from the skin were excluded from the breast mask computation to avoid responses from skin tissue. The projection image was smoothed using a Gaussian filter with a sigma of 0.2 mm and binarized by applying Otsu's thresholding method. 17 In order to close holes within the breast mask or along its edges, the binary image was dilated and holes were filled before it was eroded again. Finally, the breast contour line was computed in 2-D based on the mask image, as shown in Fig. 3. Note that the nipple coordinates x_T and y_T are generally assumed to be the same for all coronal slices, and the z-coordinate of the nipple is always on top of the image, since there is direct contact between transducer and nipple. The breast contour line is the same in all slices due to the compression of the breast and the properties of ultrasound. Using this contour and the given nipple position, nine features were extracted.
• c_view : The view of the considered image strongly influences the absolute nipple position and may affect the impact of a nipple being close to the contour line of the breast. Thus, a categorical feature c_view that can be one of the four available standard views (AP, LAT, MED, SUP) was extracted from the information provided in the header of the DICOM file.
• x_T and y_T : The given nipple coordinates (x_T, y_T), which are the same for all coronal slices, were considered possibly important features since the absolute nipple position in the image may correlate with the position relative to the breast. As the appearance of ABUS images differs a lot depending on the breast size and the transducer position, the absolute nipple position is, however, not coupled directly to the nipple position relative to the breast image.
• d_min : The shortest Euclidean distance d_min between the nipple position (x_T, y_T) and the breast mask contour line was computed.
• c_io : It was determined whether the nipple was located inside or outside the breast mask. The latter case can occur when the shadow around the nipple is very dark and close to the breast contour such that this region is, by mistake, not included in the breast mask. A categorical feature c_io ∈ {1, −1} was included.
• d*_min : The signed distance between nipple position and contour line was computed as d*_min = c_io · d_min.
• A_B : The total 2-D physical area of the breast A_B was computed using the pixel size and the number of pixels within the breast mask.
• A_B/I : The ratio of the physical 2-D area of the breast to the total image size was calculated as A_B/I = A_B / A_Image.
• d_COM : The center of mass (x_COM, y_COM) of the breast area was determined, and the Euclidean distance d_COM between (x_COM, y_COM) and (x_T, y_T) was computed.
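The geometric features in the list above can be illustrated with a minimal Python sketch. It is not the authors' C++/ITK implementation. It assumes a binary breast mask given as a nested list, integer nipple pixel coordinates, and isotropic pixels. The function name, the 4-neighbour contour definition, and the convention that c_io = 1 means "inside" are assumptions made for illustration only.

```python
import math

def nipple_position_features(mask, nipple_xy, pixel_mm=1.0):
    """Sketch of d_min, c_io, d*_min, A_B, A_B/I and d_COM.

    mask      : 2-D list of 0/1 values, indexed as mask[y][x] (hypothetical layout)
    nipple_xy : integer pixel coordinates (x_T, y_T)
    pixel_mm  : isotropic pixel size in mm (assumption)
    """
    height, width = len(mask), len(mask[0])
    x_t, y_t = nipple_xy

    def in_mask(x, y):
        return 0 <= x < width and 0 <= y < height and bool(mask[y][x])

    inside = [(x, y) for y in range(height) for x in range(width) if mask[y][x]]
    # Contour pixels: mask pixels with at least one 4-neighbour outside the mask.
    contour = [(x, y) for x, y in inside
               if not all(in_mask(x + dx, y + dy)
                          for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))]

    d_min = min(math.hypot(x - x_t, y - y_t) for x, y in contour) * pixel_mm
    c_io = 1 if in_mask(x_t, y_t) else -1          # sign convention is an assumption
    area = len(inside) * pixel_mm ** 2             # A_B
    x_com = sum(x for x, _ in inside) / len(inside)
    y_com = sum(y for _, y in inside) / len(inside)
    return {
        "d_min": d_min,
        "c_io": c_io,
        "d_signed": c_io * d_min,                  # d*_min
        "A_B": area,
        "A_BI": len(inside) / (height * width),    # A_B / A_Image
        "d_COM": math.hypot(x_com - x_t, y_com - y_t) * pixel_mm,
    }
```

With this convention, a nipple lying just outside the mask yields a negative signed distance, which lets a classifier separate that case from a nipple lying just inside the contour.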

Nipple Shadow
In order to estimate the size of a possible nipple shadow, it was assumed that the shape of the shadow could be approximated by a cylinder around the nipple with the axis going in the anteroposterior direction. As the nipple is (approximately) a disk in the coronal plane, once it has stopped the ultrasound wave, it produces a cylindrical acoustic shadow. The nipple position (x_T, y_T) was obtained from the DICOM header as given by the technician during acquisition. The size of the dark cylindrical region around the nipple position was estimated by counting cylinder segments (rings) that had low pixel intensity. The radius of the different cylinder segments varied from 4.0 to 20.0 mm in steps of 4.0 mm (see Fig. 4). In the anteroposterior direction, the height of each cylinder segment was ∼2.0 mm. The highest layer was positioned starting at 6.0 mm below the skin, avoiding potentially disturbing high-intensity signals due to skin fat or sound reflections within the coupling layers of the transducer. The deepest layer ended at 26.0 mm below the skin. It was empirically determined that these measures were useful to describe the extent of the nipple shadow. The following seven features were extracted:
• c_view : The view of the considered image affects the absolute nipple position and the possibilities of supporting the breast properly by cushions.
• x_T and y_T : The coordinates (x_T, y_T) describing the absolute position of the nipple in the coronal plane were included.
• N_I<50 and N_I<60 : The segments showing a lower mean intensity than a specific threshold value were counted. The intensity threshold was set to 50 and 60, respectively, yielding two features, N_I<50 and N_I<60, for every image. In the present 8-bit grayscale images, these threshold values yielded reasonable differentiation between tissue and shadow signals.
• N_Pix : The number of pixels N_Pix in the cylinder segments that had a mean intensity below 60 was counted. This number accounted for the different sizes of the considered cylinder segments.
• σ²_bright : The variance σ²_bright of brightness in one cylindrical region of 4.0 mm radius around the nipple was calculated since ultrasound shadow signals tend to have a lower variance than signals reflected from structured tissue. The cylinder went from the skin to a depth of 25.0 mm in the anteroposterior direction.
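The segment-counting features can be sketched as follows. This is a simplified illustration under stated assumptions, not the original implementation: the volume is a nested list indexed as volume[z][y][x] with z measured from the skin, the cylinder segments are treated as concentric annuli (one possible reading of "rings"), and the function and parameter names are hypothetical.

```python
import math

def shadow_segment_counts(volume, nipple_xy, spacing_mm, thresholds=(50, 60),
                          radii_mm=(4.0, 8.0, 12.0, 16.0, 20.0),
                          depth_start_mm=6.0, depth_end_mm=26.0, layer_mm=2.0):
    """Count dark ring segments around the nipple (sketch of N_I<50, N_I<60, N_Pix).

    volume     : volume[z][y][x] with 8-bit intensities, z = depth index from the skin
    spacing_mm : (sx, sy, sz) voxel size in mm
    Returns ({threshold: number of dark segments}, N_Pix for the largest threshold).
    """
    sx, sy, sz = spacing_mm
    x_t, y_t = nipple_xy
    n_layers = int(round((depth_end_mm - depth_start_mm) / layer_mm))
    sums = {}  # (layer, ring) -> [intensity sum, voxel count]
    for z, plane in enumerate(volume):
        depth = z * sz
        if depth < depth_start_mm:
            continue                      # skip skin and coupling-layer signals
        layer = int((depth - depth_start_mm) // layer_mm)
        if layer >= n_layers:
            continue                      # deeper than the last layer
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                r = math.hypot((x - x_t) * sx, (y - y_t) * sy)
                ring = next((i for i, rad in enumerate(radii_mm) if r <= rad), None)
                if ring is None:
                    continue              # outside the largest cylinder
                acc = sums.setdefault((layer, ring), [0, 0])
                acc[0] += value
                acc[1] += 1
    means = {key: total / count for key, (total, count) in sums.items()}
    counts = {t: sum(1 for m in means.values() if m < t) for t in thresholds}
    n_pix = sum(sums[key][1] for key, m in means.items() if m < max(thresholds))
    return counts, n_pix
```

With the stated geometry (five radii and ten 2.0-mm layers between 6.0 and 26.0 mm), at most 50 segments are evaluated per image, so both counts lie between 0 and 50.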

Breast Contour Shape
In order to extract the breast mask and its contour line, several preprocessing steps were performed. They were similar to those described in Sec. 3.1 but with a focus on the breast contour line. A 4.0-mm stack of coronal slices starting at a distance of 7.0 mm from the skin was used for breast mask generation. The top 7.0 mm of coronal slices were excluded since they often contain spurious signals caused by sound reflections within contact fluid on parts of the transducer that do not have skin contact. Coronal slices lying deeper than 11.0 mm were not included in order to avoid signals from the ribs that can already appear from this depth on, depending on the breast size and the transducer positioning. A total of 17 features were extracted:
• c_view : The view direction was taken into account since breast positioning and cushion support depend on the intended view.
• A_B : The physical area A_B in 2-D coronal view of the breast mask was assessed as a first indicator for the amount of tissue being imaged.
• A_B/I : The relative size of the breast mask A_B/I = A_B / A_Image compared to the total size of the image was computed. The higher this value, the higher the probability that the breast was imaged completely.
• x_C and y_C : The position (x_C, y_C) of the breast mask centroid was computed as an indicator for the position and "mass distribution" of the breast within the image.
• l_1, l_2, and F: An ellipsoid was fitted to the breast contour line, and the lengths l_1 and l_2 of the ellipsoid axes were determined. The flatness F was computed as the ratio l_1 / l_2 to indicate whether the breast contour was extremely elongated in one direction or rather roundish.
• p_Mask : The perimeter p_Mask of the breast mask was determined and corresponded to the length of the breast contour line. The higher the p_Mask, the more curves and irregularities might be in the contour line.
• p_Circle and r_Circle : The perimeter p_Circle and the radius r_Circle of a circle that has the same area as the breast mask were computed.
• N_Border and p_Border : The number of pixels N_Border that belong to the breast mask and are touching the edges of the image, as well as the physical length p_Border of these pixels (perimeter on border), were measured. The higher these measures, the higher the probability that the imaged breast is very large.
• R_Border : The ratio R_Border = p_Border / p_Mask of the breast mask perimeter along the border to the total breast mask perimeter was computed.
• R_Round : The roundness R_Round = p_Circle / p_Mask was determined as the inverse ratio between the actual perimeter of the mask and the perimeter of a circle with the same area. Since the circle is the geometrical shape with the lowest ratio between perimeter and area, R_Round being close to 1 is a strong indicator for a round and smooth breast contour line. If R_Round is very small, the determined breast contour line is considered "inefficient," meaning that it has many turns and irregularities.
• p_1 and p_2 : The first two principal moments p_1 and p_2 of the breast mask were determined.
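The ratio features in the list above follow directly from the mask area and perimeter. The short sketch below shows the arithmetic only. It assumes that the area, the perimeters, and the ellipsoid axis lengths have already been measured, and the function name and argument names are hypothetical.

```python
import math

def contour_shape_ratios(area_mm2, p_mask_mm, p_border_mm, l1_mm, l2_mm):
    """Derive r_Circle, p_Circle, R_Round, R_Border and F from measured quantities."""
    r_circle = math.sqrt(area_mm2 / math.pi)   # radius of the circle with equal area
    p_circle = 2.0 * math.pi * r_circle        # its perimeter
    return {
        "r_Circle": r_circle,
        "p_Circle": p_circle,
        "R_Round": p_circle / p_mask_mm,       # 1 for a perfect circle, smaller otherwise
        "R_Border": p_border_mm / p_mask_mm,   # share of the contour lying on the image border
        "F": l1_mm / l2_mm,                    # flatness of the fitted ellipsoid
    }
```

As a sanity check, a square mask with 10 mm sides (area 100 mm², perimeter 40 mm) gives R_Round = √π / 2 ≈ 0.89, whereas a contour with many indentations at the same area has a longer perimeter and therefore a smaller R_Round.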

Learning Step
We first evaluated each of the three image quality aspects individually and afterward merged all above described features in order to detect images of generally insufficient quality, i.e., a fourth classifier was trained. This joint classification approach was motivated by the fact that a large portion of the positive images was affected by more than one artifact. The manual annotation of two experienced clinicians served as ground truth for classifier training. Classification tasks were performed on dataset A (introduced below) using the random forests classifier, 18 as provided by the OpenCV library (version 2.4.10). 19 While the number of trees was set to 100, the number of considered random features for decision tree construction was determined internally by the classifier, as proposed in Ref. 18 as log_2(M) + 1, where M is the number of given features. The maximum depth of each tree was set to 15, and the minimum sample count required at each node to be split was set to 10% of the total number of samples. Ten repetitions of 10-fold stratified cross-validation (CV) were conducted to evaluate the performance of the classification. For each repetition, the instances were randomly partitioned into 10 folds under the constraint that images of the same patient were within one fold to avoid bias.
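Two details of this setup can be sketched in a few lines of Python: the number of random features per split and a patient-wise fold assignment. The sketch is illustrative only. Truncating log_2(M) to an integer is an assumption, and the round-robin assignment below only approximates stratification; it is not the partitioning procedure used in the study. Both function names are hypothetical.

```python
import math
import random
from collections import defaultdict

def features_per_split(n_features):
    """log2(M) + 1, truncated to an integer (the truncation is an assumption)."""
    return int(math.log2(n_features)) + 1

def patient_wise_folds(patient_ids, labels, n_folds=10, seed=0):
    """Assign a fold index to every image so that a patient never spans two folds.

    Patients are shuffled, ordered by their number of positive images, and dealt
    round-robin to the folds, which roughly balances positives across folds.
    """
    positives = defaultdict(int)
    for pid, label in zip(patient_ids, labels):
        positives[pid] += int(label)
    patients = sorted(positives)
    random.Random(seed).shuffle(patients)
    # The sort is stable, so the shuffled order is kept among equal counts.
    patients.sort(key=lambda pid: positives[pid], reverse=True)
    fold_of_patient = {pid: i % n_folds for i, pid in enumerate(patients)}
    return [fold_of_patient[pid] for pid in patient_ids]
```

Under this reading of the formula, the nine nipple position features give 4 candidate features per split, and the 17 breast contour features give 5. Repeating the cross-validation with different seeds corresponds to the ten repetitions described above.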
The resulting receiver-operating characteristic (ROC) curves were fitted by a binormal function, as implemented in MATLAB and Statistics Toolbox (Release 2011a, The MathWorks, Inc., Natick, Massachusetts, United States). First, for each repetition of CV, the merged ROC curve of all 10 folds was computed by sorting all instances into one curve. These were used to determine the mean ROC curve and the 95% confidence interval (CI) of all 10 repetitions. The area under the ROC curve (AUC) was estimated from the fitted curves, whereas single values of sensitivity and specificity were retrieved from the original (unfitted) classifier outputs. To compare the performance of the joint approach to that of the single classifiers, the significance of the difference between the corresponding AUCs was computed as a p-value using the method described in Ref. 20, as well as the Bonferroni correction 21 to account for multiple (three) comparisons. This means that the computed p-values were multiplied by 3 and then compared to a significance level of α = 0.05. The number of actually positive and negative instances was used to compare two ROC curves.
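The Bonferroni step amounts to simple arithmetic, as the following sketch shows. The function name is hypothetical, the p-values in the usage note are made-up illustration values rather than results of this study, and capping the adjusted value at 1 is a common convention that is not stated in the text.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Multiply each p-value by the number of comparisons and test against alpha."""
    n_comparisons = len(p_values)
    adjusted = [min(1.0, p * n_comparisons) for p in p_values]  # cap at 1 (convention)
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]
```

For three hypothetical raw p-values of 0.01, 0.02, and 0.2, the adjusted values are 0.03, 0.06, and 0.6, so only the first comparison would remain significant at α = 0.05.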

Additional Performance Tests
In order to investigate the robustness and potential overfitting of the trained classifiers, an independent test dataset (called B) was employed. These data were acquired in a different clinic and manually annotated by a different reader group than dataset A. After training the four classifiers on the complete dataset A, they were applied to dataset B and compared to the manual rating results. Classifier decision thresholds were chosen such that the specificity in the training step was 97%. To obtain statistics, the data were bootstrapped 100 times. The data were also used to examine the difference between the joint classification, which is based on all features at once, and the straightforward combination of the three single classifier outputs to a combined rating. Ground truth for this comparison was the combination of the manual annotations, i.e., if at least one artifact was detected concordantly by both readers, the case was considered positive. As an additional performance measure, the inter-rater agreement between the two readers (R2 versus R3) as well as between the automated image quality assessment and the manual rating (AQUA versus R2&R3) was computed as Cohen's κ. 22
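Two of the steps in this paragraph can be sketched compactly: choosing a decision threshold that reaches a target specificity on the training data, and computing Cohen's κ between two sets of ratings. The code is a minimal illustration with hypothetical function names. It assumes that an image is rated positive when its classifier score is greater than or equal to the threshold, which is not specified in the text.

```python
def threshold_for_specificity(negative_scores, target=0.97):
    """Lowest threshold t with (fraction of negatives scoring below t) >= target.

    An image is assumed to be rated positive if its score is >= t.
    """
    n = len(negative_scores)
    candidates = sorted(set(negative_scores)) + [float("inf")]
    for t in candidates:
        specificity = sum(score < t for score in negative_scores) / n
        if specificity >= target:
            return t

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e); undefined if p_e equals 1."""
    n = len(ratings_a)
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    categories = set(ratings_a) | set(ratings_b)
    p_expected = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
                     for c in categories)
    return (p_observed - p_expected) / (1.0 - p_expected)
```

For example, the ratings [1, 1, 0, 0] and [1, 0, 0, 0] agree in three of four cases (p_o = 0.75) with a chance agreement of p_e = 0.5, which gives κ = 0.5.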

Dataset
In total, 815 ABUS volumes acquired from 114 women were obtained in routine clinical care and split up into two datasets, A and B. The images were acquired using either the Somo-v automated 3-D breast ultrasound system (U-systems, Sunnyvale, California) or the ACUSON s2000 ABVS (Siemens, Erlangen, Germany). Details on the size and spatial resolution of the images are given in Table 1. According to the acquisition protocol, the nipple position (in coronal view) was indicated manually by the technicians after each measurement and stored in the DICOM header of the corresponding file so that it could easily be used for further image processing. In some images, the nipple is not visible at all. These cases were excluded from the analysis of the relative nipple position and the nipple shadow. Detailed description of the datasets is given in Table 2. The Institutional Review Board waived the need for informed consent and approved the use of anonymized images for this study.
All images were classified separately by two clinicians with several years of experience in ABUS imaging. Dataset A was annotated by Readers 1 and 2, whereas dataset B was annotated by Readers 2 and 3; i.e., one reader was the same and one was different for the two datasets. Among others, the above-mentioned quality aspects (nipple position, nipple shadow, and breast contour shape) were taken into account during manual classification. 14 The detailed rating results for dataset A are shown in Fig. 5. The distribution of artifacts was similar in dataset B. Considerable inter-rater disagreement has already been observed in another study dealing with quality rating of ultrasound images. 23 It renders classifier training difficult, but excluding the unclear, i.e., differently rated, cases from the study would mean excluding the critical cases and might bias the results. As the focus of the proposed application was put on a high specificity, we decided to consider only those cases "positive" that were rated as such by both readers. For the joint rating, an image was considered positive if at least one artifact was detected concordantly by both readers. All other cases were considered "negative" and hence usable for diagnostics. Throughout this report, "positive" and "negative" only refer to the rating of the image quality and are not related to any diagnostic findings, i.e., tumors or lesions.
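The labeling rule described above can be written down in a few lines. The sketch below is illustrative only. The dictionary keys and the function name are hypothetical, and each reader's rating is assumed to be available as one boolean per artifact.

```python
ARTIFACTS = ("nipple_position", "nipple_shadow", "contour_shape")

def consensus_labels(reader_a, reader_b):
    """Per-artifact label: positive only if both readers marked the artifact.
    Joint label: positive if at least one artifact is positive by consensus."""
    per_artifact = {name: bool(reader_a[name] and reader_b[name])
                    for name in ARTIFACTS}
    joint = any(per_artifact.values())
    return per_artifact, joint
```

Note that under this rule an image stays negative in the joint rating when each reader reports an artifact but the two never agree on the same one, which is consistent with the emphasis on high specificity.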

Relative Nipple Position
Repeated cross-validation yielded an AUC of 0.99 [see Fig. 6(a)]. Different operating points on the ROC curve can be chosen for the final application by varying the decision threshold for the classifier. Depending on the intended purpose, the user may give more weight to specificity or sensitivity. As summarized in Table 3, at a specificity of 0.99, the sensitivity was 0.36. The point closest to the upper left corner of the plot ("best operating point") represents a specificity of 0.905 ± 0.009 (mean ± 95% confidence interval) and a sensitivity of 0.93 ± 0.01. For comparison, the performance of each reader when compared to the other, respectively, is displayed in the plots. It can be seen that the automatic classification performed very similarly to the readers. This is a general trend that also holds for the other considered quality aspects. In Fig. 7, extreme outlier cases are shown. A false positive case is shown in Fig. 7(a), where the breast is very large and not completely visible in the image. In this case, the breast mask fails to describe the true contour of the breast. The breast in Fig. 7(b) is small and skinny, which impedes proper ultrasound coupling. As a consequence, a bright rectangle caused by reflections is visible in the upper right corner of the image, and breast mask segmentation using the Otsu filter fails. Figure 7(c) shows a false negative case caused by the irregular contour shape of the breast, which in turn produces an erroneous breast mask. The average computing time for all nine features was 3 s ± 2 s per volumetric image, whereas the computing time for the classification was in the order of milliseconds (also for the other quality aspects) and thus negligible.
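The "best operating point" criterion can be made explicit with a small sketch. In ROC coordinates (1 − specificity, sensitivity), the upper left corner is the point (0, 1), so the Euclidean distance to it is sqrt((1 − specificity)² + (1 − sensitivity)²). The function name is hypothetical, and the third point in the usage note is a made-up value for illustration.

```python
import math

def best_operating_point(operating_points):
    """Return the (specificity, sensitivity) pair closest to the ideal corner (1, 1)."""
    return min(operating_points,
               key=lambda point: math.hypot(1.0 - point[0], 1.0 - point[1]))
```

Among the points (0.99, 0.36), (0.905, 0.93), and a hypothetical (0.50, 0.99), the distances are roughly 0.64, 0.12, and 0.50, so the second point is selected.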

Nipple Shadow
Automatic classification yielded an AUC of 0.84 [see Fig. 6(b)]. At a specificity of over 0.99, sensitivity was 0.24. The best operating point was described by a specificity of 0.82 ± 0.02 and a sensitivity of 0.73 ± 0.02. Figure 8 shows three sample outlier cases. The false positive case in Fig. 8(a) shows a small and skinny breast with a clearly visible nipple shadow close to the breast contour line. However, it was rated as negative by the readers since it is hardly possible to get better images of such a small breast in the present view and a repeated scan probably would not enhance the image. Figure 8(b) shows a false negative case, where the dark region is not directly below the nipple but rather in a half ring around it. In Fig. 8(c), the false negative classification was caused by a relatively bright and fuzzy shadow. However, the algorithm was designed to detect very prominent, low-intensity nipple shadows, as shown in Figs. 3(a) and 3(b). On average, it took 5 s ± 2 s per ABUS image to compute all possibly relevant features.

Breast Contour Shape
The classification of irregular breast contour shapes achieved an AUC of 0.89 [see Fig. 6(c)]. At a specificity of 0.99, the sensitivity was 0.15. At the best operating point, specificity was 0.82 ± 0.04 and sensitivity was 0.79 ± 0.04. Figure 9(a) shows a sample false positive case. The breast as such is imaged correctly, but parts of the axilla and the arm cause atypical contour lines, which are misinterpreted by the classifier. Figures 9(b) and 9(c) show false negative cases, where parts of the breast are not imaged correctly. Nevertheless, the breast mask has smooth contours, obscuring missing parts and misleading the classifier. The average computing time was 6 s ± 4 s.

Joint Classification
In this approach, the manual ratings for the three distinct artifacts were combined to a joint quality measure. If, according to both readers, at least one artifact was present, an image was assigned the positive class. As summarized in Table 3, the AUC of the joint approach was significantly (p < 0.05) smaller than that of the nipple position classification and significantly larger than that of the nipple shadow classification. The difference between the AUCs of the joint approach and the breast contour shape classification was not significant.

Additional Performance Tests
Applying classifiers trained on dataset A to the independent test dataset B resulted in the values shown in Table 4. Here we also show the results of a simple combination of classifier outputs (named "combination" in the table) compared to using all features in one single classifier (named "joint"). The "combination" approach combined the outputs of the three distinct classifiers (nipple position, nipple shadow, breast contour) into one rating in the same way as the manual ground truth annotation was combined for the global quality rating: If at least one of the three artifacts was detected by the distinct classifier, the image was rated positive.

Discussion
In this work, we presented automatic techniques to assess the quality of ABUS images. The algorithms have short run times and can be applied to the images right after acquisition such that impaired scans could be repeated while the patient is still in the examination room. The focus of the proposed algorithms was on high specificity in order to avoid unnecessary rescans, but depending on the preferred application, other classifier settings could be chosen. The algorithm to rate the nipple position performed very well. Based on measures such as the distance between nipple and breast contour line, this algorithm had a high specificity and sensitivity at the same time. The good performance may be related to the fact that the manual rating of the nipple position was essentially driven by the same parameters as the automatic classification. This means that the clinicians marked the nipple as being "too close to the edge of the breast" if the distance between nipple and breast contour line was very small. Exactly the same distance measure, d_min, was used as a feature for classifier training; i.e., the semantic gap between human perception and computed attributes was very small in this case. Outliers were generally caused by an erroneous breast mask due to irregular breast contour shapes. In some other cases, the algorithm was not able to reproduce the complex decision process that a human reader performs. Even if the above described features were determined as expected, the reader might have anticipated and considered other aspects, e.g., parts of the breast that were not visible in the image, as shown in Fig. 7(a).
The proposed algorithms to detect prominent nipple shadows and irregular breast contour shapes had a similar performance with an AUC slightly smaller than that of the nipple position classification. This might be due to the larger variance in the physical appearance of an acoustic nipple shadow and an irregular breast contour when compared to the relative nipple position. This variance might also be correlated to the larger disagreement between the readers, making this classification an "ill-posed" problem.
Evaluations on independent test data resulted in slightly lower sensitivities and specificities than estimated from repeated cross-validation. Nevertheless, the overall good performance suggests that the classifiers did not overfit in the training step. Note that the images of the test dataset had been acquired in a different clinic and that one of the readers was different from those for the training data.
Finally, the joint evaluation of all three artifacts yielded a sensitivity and a specificity of 0.55 and 0.99 in the training dataset as well as 0.82 and 0.81 in the test dataset, respectively, which is promising and justifies the next step toward clinical application. Even if the sensitivity is only moderate, the proposed method has high potential to improve the current standard, as outlined in Sec. 2. As a reasonable number of corrupt images were detected while the specificity of the automatic image quality rating was very high, the technicians could rely on the rating without the risk of producing too many unnecessary rescans. It will be evaluated in clinical practice whether precise information on the kind of the detected artifact is relevant. As shown in Table 4, the inter-rater agreement of R2 versus R3 is in all cases higher than for AQUA versus R2&R3. Nevertheless, the trend of both measures is similar; i.e., if the agreement of both readers is high, the agreement of AQUA versus R2&R3 is also high, showing the classifiers' dependency on the clarity of the manual ground truth annotation. Thus, more readers as well as a clearer definition and separation of the single artifacts might be beneficial. Note that the joint classification approach (κ = 0.58, AUC = 0.91) slightly outperforms the simple combination of single classifiers (κ = 0.44) as well as the single classification alone (AUC of 0.86 to 0.91). Although the effect is not as pronounced as expected, the joint approach can provide a higher sensitivity and specificity than the single classifiers, however, at the expense of detailed information on a specific quality aspect.
About 28 cases of the training dataset were excluded from the analysis of the first two algorithms because the nipple was not visible at all in the images. In some rare cases, e.g., for very large breasts, this is inevitable. It is, however, unclear how to handle these cases in clinical practice. One possibility is to use an automatic nipple detection algorithm 16 to determine whether the nipple is visible or not. Another option is to find an agreement with the technicians on how to handle the cases where they do not see the nipple in the image. So far, this case is not covered by the standard acquisition protocol.
Apart from the computations described in this work, we also tested the classifier performance by only using those cases that were given the same class by both readers. For all considered artifacts, the sensitivity, specificity, and AUC of the algorithms were slightly higher, showing that the presented approach partly relies on the used image data and the manual annotations. For the nipple position classification, e.g., the AUC was raised to 0.99 leading to a sensitivity of 0.46 at a specificity of more than 0.99. Nevertheless, we decided to include the cases with disagreement as negatives, in order not to bias the database by excluding the difficult cases. In another experiment (not explicitly shown in this manuscript), the classifiers were trained only on the clear, i.e., concordantly annotated, cases of dataset A and tested on the unclear cases. The classifier accuracies (sum of true positives and true negatives divided by all cases) when compared to Reader 1 and Reader 2, respectively, were 0.52 and 0.48 for the nipple position, 0.27 and 0.73 for the nipple shadow, as well as 0.45 and 0.55 for the breast contour shape. Thus, for all considered quality aspects, the trained classifiers were consistently more in line with Reader 2 than with Reader 1.
Table 4 Rating results retrieved from test dataset. "Combination" refers to the simple combination of the three single rating results, whereas "joint" describes the classifier that was trained on all features at once. R2 and R3 refer to Readers 2 and 3. κ refers to Cohen's inter-rater agreement. Given values are mean (stdev) from 100 times bootstrapping.
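The accuracy defined above and a "mean (stdev) from 100 times bootstrapping" summary can be sketched as follows. This is a hedged illustration rather than the evaluation code of this work: it assumes simple case-level resampling with replacement, and the function names, the fixed seed, and the toy labels are hypothetical.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def accuracy(truth: Sequence[int], pred: Sequence[int]) -> float:
    """(true positives + true negatives) / all cases."""
    correct = sum(1 for t, p in zip(truth, pred) if t == p)
    return correct / len(truth)


def bootstrap_mean_std(
    truth: Sequence[int],
    pred: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float] = accuracy,
    n_resamples: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Resample cases with replacement and summarize the metric as mean and stdev."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    n = len(truth)
    values = []
    for _ in range(n_resamples):
        # Draw n case indices with replacement; truth and prediction stay paired.
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([truth[i] for i in idx], [pred[i] for i in idx]))
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical toy labels, not data from this study.
reader = [1, 0, 1, 0, 0, 1, 0, 0]
classifier = [1, 1, 1, 0, 0, 0, 0, 0]
mean_acc, std_acc = bootstrap_mean_std(reader, classifier)
```

Any other paired metric, such as Cohen's κ, can be passed through the `metric` argument to obtain the same kind of summary.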
Besides the random forests, other classifiers like the K* instance-based learner using an entropic distance measure 24 or the J48 decision tree were tested. The latter is an open-source Java implementation of the C4.5 decision tree, 25 which uses normalized information gain (difference in entropy) as a splitting criterion. However, they were outperformed by the random forests, which yielded the best results in terms of AUC and correlated measures while still being fast enough for the planned application. Furthermore, random forests are robust against overfitting, which was observed in single decision tree classification, and against dependent features.
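To make the splitting criterion concrete, the following minimal sketch computes the normalized information gain (gain ratio) of a single categorical feature. It is not the J48 code: thresholding of continuous features and handling of missing values are omitted, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of the values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of splitting on the feature, divided by the split entropy."""
    n = len(labels)
    # Group the class labels by feature value.
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Entropy remaining after the split, weighted by group size.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information penalizes features with many distinct values.
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: a feature that separates the two classes perfectly.
feature_values = ["low", "low", "high", "high"]
quality_labels = [1, 1, 0, 0]
ratio = gain_ratio(feature_values, quality_labels)
```

A feature that takes only one value yields a ratio of zero, so it would never be chosen for a split.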
The average total computing time was determined to be 14 s ± 5 s per image. Since a typical ABUS examination takes several minutes, the algorithms are fast enough for the planned online feedback application. To our knowledge, there is no previous work that our results could be directly compared to.
According to the manual rating and considering the three discussed image quality aspects, in 40 out of all 83 provided examinations, there was no or only one corrupt image, while in 43 examinations, there were two or more corrupt images. This means that early feedback to the technician after the first scan that showed problems might have helped to avoid another image with incorrect settings. However, throughout this work, the ABUS volumes were considered as independent images. Their correlation to the other images of one examination and the consequences for the usefulness of this examination were not investigated in detail and will be subject to further studies.
Concerning patient age and breast density, no direct influence on the image processing routines or on the image quality rating was detected during this work. Evaluating a first clinical installation of a prototype, it turned out, however, that the breast size has an essential impact on the reliability of the rating of the relative nipple position: in large breasts, transducer positioning often has to be performed such that the nipple is pushed toward the edges of the breast in order to capture the whole breast volume with the available views (AP, MED, LAT, and so on). Therefore, the breast volume is an important additional feature, but computing the actual 3-D volume of the breast based on an ABUS scan is not trivial and, to the authors' knowledge, has not yet been performed completely automatically by any other group. First steps like fully automatic chest wall segmentation have been presented by Tan et al., 26 who approximated the chest wall by a cylinder. However, computing time was reported to be 6 min and 30 s per breast image, which would be too slow for the application we were aiming at. Thus, extracting information from 3-D images by projecting them to 2-D was more reliable, i.e., reproducible, and reduced complexity and computational costs.

Conclusion
In this work, a computerized approach for image quality assessment in ABUS imaging was presented. We have shown that the proposed algorithms have the potential to detect up to 55% of images (at a specificity of 99%) that are currently accepted but present diminished diagnostic value. Apart from the potential to train and support the technicians and to save the time and money spent on patient recalls, the presented algorithms will also help to filter and prepare data for further computer-assisted detection. 26 Although the sensitivity for the single quality aspects is only moderate, the described algorithms are fast and accurate enough to be tested in clinical practice, as the high specificity prevents too many false positive cases and unnecessary rescans. In conclusion, by using classifiers, expert knowledge was turned into algorithms that can be used in clinical practice. The contribution of this work is not only to provide a fully working application for ABUS but also to test the methodology and the general concept of AQUA software development based on clinical image data. More image quality assessment algorithms for ABUS and other imaging devices such as MRI will be developed in the future in order to complement and upgrade the presented pipeline.