In machine learning, a good predictive model is the one that generalizes well over future unseen data. In general, this problem is ill-posed. To mitigate this problem, a predictive model can be constructed by simultaneously minimizing an empirical error over training samples and controlling the complexity of the model. Thus, the regularized least squares (RLS) is developed. RLS requires matrix inversion, which is expensive. And as such, its “big data” applications can be adversely affected. To address this issue, we have developed an efficient machine learning algorithm for pattern recognition that approximates RLS. The algorithm does not require matrix inversion, and achieves competitive performance against the RLS algorithm. It has been shown mathematically that RLS is a sound learning algorithm. Therefore, a definitive statement about the relationship between the new algorithm and RLS will lay a solid theoretical foundation for the new algorithm. A recent study shows that the spectral norm of the kernel matrix in RLS is tightly bounded above by the size of the matrix. This spectral norm becomes a constant when the training samples have independent centered sub-Gaussian coordinators. For example, typical sub-Gaussian random vectors such as the standard normal and Bernoulli satisfy this assumption. Basically, each sample is drawn from a product distribution formed from some centered univariate sub-Gaussian distributions. These new results allow us to establish a bound between the new algorithm and RLS in finite samples and show that the new algorithm converges to RLS in the limit. Experimental results are provided that validate the theoretical analysis and demonstrate the new algorithm to be very promising in solving “big data” classification problems.
Ensemble methods provide a principled framework in which to build high performance classifiers and represent
many types of data. As a result, these methods can be useful for making inferences about biometric and biological
events. We introduce a novel ensemble method for combining multiple representations (or views). The method
is a multiple view generalization of AdaBoost. Similar to AdaBoost, base classifiers are independently built from
each represetation. Unlike AdaBoost, however, all data types share the same sampling distribution computed
from the base classifier having the smallest error rate among input sources. As a result, the most consistent
data type dominates over time, thereby significantly reducing sensitivity to noise. The method is applied to
the problem of facial and gender prediction based on biometric traits. The new method outperforms several
competing techniques including kernel-based data fusion, and is provably better than AdaBoost trained on any
single type of data.
Many classifiers have been proposed for ATR applications. Given a set of training data, a classifier is built from the labeled training data, and then applied to predict the label of a new test point. If there is enough training data, and the test points are drawn from the same distribution (i.i.d.) as training data, then many classifiers perform quite well. However, in reality, there will never be enough training data or with limited computational resources we can only use part of the training data. Likewise, the distribution of new test points might be different from that of the training data, whereby the training data is not representative of the test data. In this paper, we empirically compare several classifiers, namely support vector machines, regularized least squares classifiers, C4.4, C4.5, random decision trees, bagged C4.4, and bagged C4.5 on IR imagery. We reduce the training data by half (less representative of the test data) each time and evaluate the resulting classifiers on the test data. This allows us to assess the robustness of classifiers against a varying knowledge base. A robust classifier is the one whose accuracy is the least sensitive to changes in the training data. Our results show that ensemble methods (random decision trees, bagged C4.4 and bagged C4.5) outlast single classifiers as the training data size decreases.
In ATR applications, each feature is a convolution of an image with a filter. It is important to use most discriminant features to produce compact representations. We propose two novel subspace methods for dimension reduction to address limitations associated with Fukunaga-Koontz Transform (FKT). The first method, Scatter-FKT, assumes that target is more homogeneous, while clutter can be anything other than target and anywhere. Thus, instead of estimating a clutter covariance matrix, Scatter-FKT computes a clutter scatter matrix that measures the spread of clutter from the target mean. We choose dimensions along which the difference in variation between target and clutter is most pronounced. When the target follows a Gaussian distribution, Scatter-FKT can be viewed as a generalization of FKT. The second method, Optimal Bayesian Subspace, is derived from the optimal Bayesian classifier. It selects dimensions such that the minimum Bayes error rate can be achieved. When both target and clutter follow Gaussian distributions, OBS computes optimal subspace representations. We compare our methods against FKT using character image as well as IR data.
Nearest neighbor classifiers are one of most common techniques for
classification and ATR applications. Hastie and Tibshirani propose a
discriminant adaptive nearest neighbor (DANN) rule for computing a
distance metric locally so that posterior probabilities tend to be
homogeneous in the modified neighborhoods. The idea is to enlongate or
constrict the neighborhood along the direction that is parallel or
perpendicular to the decision boundary between two classes. DANN
morphs a neighborhood in a linear fashion. In this paper, we extend
it to the nonlinear case using the kernel trick. We demonstrate the
efficacy of our kernel DANN in the context of ATR applications using a
number of data sets.