Validation accuracy and test accuracy are necessary, but not sufficient, measures of a neural network classifier’s quality. A model judged successful by these metrics alone may nevertheless reveal serious flaws upon closer examination, such as vulnerability to adversarial attacks or a tendency to misclassify, with high confidence, real-world data that differs from its training set. It may also be incomprehensible to a human, basing its decisions on seemingly arbitrary criteria or overemphasizing one feature of the dataset while ignoring others of equal importance. Although these problems have been the focus of substantial recent research, they are rarely prioritized during model development, which almost always maximizes validation accuracy to the exclusion of everything else. The product of such an approach is likely to fail in unexpected ways outside the training environment. We believe that the model development process must give other performance metrics, such as explainability, resistance to adversarial attacks, and classification of out-of-distribution data, the same weight as validation accuracy. We incorporate these assessments into the model design process, using free, readily available tools, to differentiate between convolutional neural network classifiers trained on the notMNIST character dataset. Specifically, we show that ensemble and ensemble-like models with high cardinality outperform simpler models with identical validation accuracy by up to a factor of 5 on these other metrics.