Validation accuracy and test accuracy are necessary, but not sufficient, measures of a neural network classifier’s quality. A model judged successful by these metrics alone may nevertheless reveal serious flaws upon closer examination, such as vulnerability to adversarial attacks or a tendency to misclassify, with high confidence, real-world data that differs from its training set. It may also be incomprehensible to a human, basing its decisions on seemingly arbitrary criteria or overemphasizing one feature of the dataset while ignoring others of equal importance. Although these problems have been the focus of a substantial amount of recent research, they are not prioritized during the model development process, which almost always maximizes validation accuracy to the exclusion of everything else. The product of such an approach is likely to fail in unexpected ways outside the training environment. We believe that the model development process must give other performance metrics, such as explainability, resistance to adversarial attacks, and classification of out-of-distribution data, the same weight as validation accuracy. We incorporate these assessments into the model design process, using free, readily available tools to differentiate between convolutional neural network classifiers trained on the notMNIST character dataset. Specifically, we show that ensemble and ensemble-like models with high cardinality outperform simpler models with identical validation accuracy by up to a factor of 5 on these other metrics.
The Army Rapid Capabilities Office (RCO) sponsored a Blind Signal Classification Competition seeking algorithms to automatically identify the modulation schemes of RF signals from complex-valued in-phase/quadrature (IQ) samples. Traditional spectrum sensing technology uses energy detection to determine whether RF signals are present, but the RCO competition went further, aiming to identify the modulation scheme of signals without prior information. Machine learning (ML) technologies have been widely used for blind signal classification problems. Traditional ML methods usually have two stages: in the first, subject matter experts manually extract features from the IQ symbols, and in the second, those features are fed to an ML algorithm (e.g., a support vector machine) to develop the classifier. The state-of-the-art approach is to apply deep learning technologies such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) directly to the complex-valued IQ symbols to train a multi-class classifier. Our team, dubbed Deep Dreamers, participated in the RCO competition and placed 3rd out of 42 active teams across industry, academia, and government. In this work we share our experience and lessons learned from the competition. Deep learning architectures such as CNNs, Residual Neural Networks (ResNets), and Long Short-Term Memory (LSTM) networks were the building blocks we used to develop multi-class classifiers. None of our individual models achieved a competitively high ranking on its own. The key to our success was to use ensemble learning to average the outputs of multiple diverse classifiers. For an ensemble method to be more accurate than any of its base models, the base learners have to be as accurate as possible. We found that while the ResNet was more accurate than the LSTM, the LSTM was less sensitive to deviations in the test set.
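The output-averaging step mentioned above can be illustrated with a minimal sketch. It assumes each base model emits a per-class probability vector for every sample; the function names (`average_ensemble`, `predict_labels`) and the probability values are hypothetical and are not taken from the competition entry, which may have combined its models differently.

```python
from typing import List, Sequence


def average_ensemble(
    prob_sets: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average per-class probabilities across models.

    prob_sets[m][i][c] is model m's probability that sample i is class c.
    All models are assumed to score the same samples and classes.
    """
    if not prob_sets:
        raise ValueError("need at least one model's outputs")
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    averaged = []
    for i in range(n_samples):
        row = [
            sum(prob_sets[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        averaged.append(row)
    return averaged


def predict_labels(probs: Sequence[Sequence[float]]) -> List[int]:
    """Pick the index of the highest averaged probability per sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Hypothetical outputs of two base models on two samples, three classes.
resnet_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
lstm_probs = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]

ensemble_probs = average_ensemble([resnet_probs, lstm_probs])
labels = predict_labels(ensemble_probs)
```

In this made-up example the two models disagree on the second sample, and the averaged probabilities favor the class about which the more confident model is sure. Unweighted averaging is only one option; weighting each model by its validation accuracy is a common variation.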