It is generally recognized that recent advancements in computer vision, especially the development of deep convolutional neural networks, have substantially improved the performance of computerized algorithms in medical imaging for classification tasks such as cancer detection and diagnosis. These advancements underscore the importance of the question of how an algorithm’s stand-alone performance compares with that of physicians. The current literature often relies on descriptive statistics or visual inspection of plots for this comparison, without quantitative and rigorous statistical inference. In this work, we developed a U-statistic-based approach to estimate the variance of the performance difference between an algorithm and a group of human observers in a binary classification task. The performance metric considered in this work is percent correct (PC), e.g., sensitivity or specificity. Our variance estimation treats both human observers and patient cases as random samples and accounts for both sources of variability, thereby allowing conclusions to generalize to both the patient and the physician populations. Moreover, we investigated a z-statistic method based on our variance estimator for hypothesis testing. Our simulation results show that our variance estimator for the PC performance difference is unbiased. The normal approximation method using our variance estimator for hypothesis testing appears useful for large sample sizes.
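As a rough illustration of the quantities involved, the following Python sketch computes the PC difference between an algorithm and a group of readers on the same cases, and forms a two-sided z-test from it. It is a minimal sketch under stated assumptions, not the estimator developed in this work: the function names `pc_difference` and `z_test` and the 0/1 score layout are hypothetical, and the variance is taken as an input because the U-statistic variance estimator itself is not specified here.

```python
import math
from statistics import NormalDist


def pc_difference(algo_correct, reader_correct):
    """Percent-correct (PC) difference: algorithm minus reader average.

    algo_correct:   list of 0/1 scores, one per case (assumed layout).
    reader_correct: list of lists, reader_correct[r][c] is the 0/1 score
                    of reader r on case c (assumed fully crossed design).
    """
    n_cases = len(algo_correct)
    if n_cases == 0 or not reader_correct:
        raise ValueError("need at least one case and one reader")
    if any(len(scores) != n_cases for scores in reader_correct):
        raise ValueError("every reader must score every case")

    pc_algo = sum(algo_correct) / n_cases
    # Average PC over all reader-case pairs.
    total_reader_scores = sum(sum(scores) for scores in reader_correct)
    pc_readers = total_reader_scores / (len(reader_correct) * n_cases)
    return pc_algo - pc_readers


def z_test(diff, variance):
    """Two-sided z-test of H0: difference = 0, using a normal approximation.

    variance is assumed to come from a separate estimator that accounts
    for both reader and case variability; it is not computed here.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = diff / math.sqrt(variance)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Toy data for illustration only: 4 cases, 2 readers.
algo = [1, 1, 1, 0]
readers = [[1, 0, 1, 0], [1, 1, 0, 0]]
diff = pc_difference(algo, readers)
# The variance value below is a placeholder, not an estimate from data.
z, p = z_test(diff, variance=0.015625)
```

The toy scores and the placeholder variance are illustrative inputs only; in practice the variance would be supplied by an estimator that treats both readers and cases as random samples.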