The purpose was to investigate the repeatability and bias of the output of two classifiers commonly used in computer-aided diagnosis for the task of distinguishing benign from malignant lesions. Classifier training and testing were
performed within a bootstrap approach using a dataset of 125 sonographic breast lesions (54 malignant, 71 benign). The
classifiers investigated were linear discriminant analysis (LDA) and a Bayesian neural network (BNN) with 5 hidden units.
Both used the same four input lesion features. The bootstrap .632+ area under the ROC curve (AUC) was used as a
summary performance metric. On an individual case basis, the variability of the classifier output was used in a detailed
performance evaluation of repeatability and bias. The LDA obtained an AUC value of 0.87 with 95% confidence interval
[0.81; 0.92]. For the BNN, those values were 0.86 and [0.76; 0.93], respectively. The classifier outputs for individual cases showed better repeatability (less variability) for the LDA than for the BNN. For the LDA, the maximum repeatability (lowest variability) lay in the middle of the range of possible outputs, whereas the BNN was least repeatable (highest variability) in that region. The LDA output showed a small but significant systematic bias, however, whereas the bias for the BNN appeared to be weak. In summary, while ROC analysis suggested similar classifier performance,
there were substantial differences in classifier behavior on a case-by-case basis. Knowledge of this behavior is crucial for
successful translation and implementation of computer-aided diagnosis in clinical decision making.
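The bootstrap .632+ AUC estimation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic 4-feature, 125-case dataset, the number of bootstrap replicates, and the use of scikit-learn's `LinearDiscriminantAnalysis` are all assumptions; the .632+ weighting follows the standard Efron-Tibshirani form, with 0.5 taken as the no-information AUC.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the 125-lesion, 4-feature dataset (assumption,
# not the study's data): 54 "malignant" and 71 "benign" cases.
rng = np.random.default_rng(0)
n_mal, n_ben, d = 54, 71, 4
X = np.vstack([rng.normal(0.8, 1.0, (n_mal, d)),
               rng.normal(0.0, 1.0, (n_ben, d))])
y = np.concatenate([np.ones(n_mal), np.zeros(n_ben)])

def auc_632plus(X, y, n_boot=200, seed=1):
    """Bootstrap .632+ estimate of the AUC for an LDA classifier."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Resubstitution (train-on-all, test-on-all) AUC: optimistically biased.
    clf = LinearDiscriminantAnalysis().fit(X, y)
    auc_resub = roc_auc_score(y, clf.decision_function(X))
    # Out-of-bag AUCs: train on a bootstrap sample, test on left-out cases.
    oob_aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        oob = np.setdiff1d(np.arange(n), idx)
        # Skip degenerate replicates that lack both classes.
        if len(np.unique(y[idx])) < 2 or len(np.unique(y[oob])) < 2:
            continue
        clf = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
        oob_aucs.append(roc_auc_score(y[oob], clf.decision_function(X[oob])))
    auc_oob = float(np.mean(oob_aucs))
    # .632+ combination, expressed in error units (err = 1 - AUC);
    # gamma = 0.5 is the no-information error rate for AUC.
    err_resub, err_oob, gamma = 1 - auc_resub, 1 - auc_oob, 0.5
    R = np.clip((err_oob - err_resub) / max(gamma - err_resub, 1e-12), 0, 1)
    w = 0.632 / (1 - 0.368 * R)
    return 1 - ((1 - w) * err_resub + w * err_oob)

print(f".632+ AUC estimate: {auc_632plus(X, y):.3f}")
```

The .632+ weight pulls the estimate between the optimistic resubstitution AUC and the pessimistic out-of-bag AUC, with the relative-overfitting term R guarding against classifiers that memorize the training data. A confidence interval such as the [0.81; 0.92] reported for the LDA would additionally require percentiles over bootstrap replicates.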