When is it safe to use synthetic training data in supervised classification? Trainable classifier technologies require
large representative training sets consisting of samples labeled with their true class. Acquiring such training sets
is difficult and costly. One way to alleviate this problem is to enlarge training sets by generating artificial,
synthetic samples. Of course this immediately raises many questions, perhaps the first being "Why should we
trust artificially generated data to be an accurate representation of the real distributions?" Other questions
include "When will training on synthetic data work as well as, or better than, training on real data?"
We distinguish among sample space (the set of real samples), parameter space (all samples that can be
generated synthetically), and feature space (samples represented as vectors of finite numerical values). In
this paper, we discuss a series of experiments in which we produced synthetic data in parameter space, that is,
by convex interpolation among the generating parameters of real samples, and showed that we could amplify real data
to produce a classifier as accurate as one trained on real data. Specifically, we explored the
feasibility of varying the generating parameters of Knuth's Metafont system to see whether previously unseen fonts
could also be recognized. We also varied the parameters of an image quality model.
We have found that training on interpolated data is for the most part safe, that is, it never produced
more errors than training on real data. Furthermore, a classifier trained on interpolated data often improved per-class accuracy.
We offer a preliminary report on a research program to investigate versatile algorithms for <i>document image content extraction</i>, that is, locating regions containing handwriting, machine-print text,
graphics, line-art, logos, photographs, noise, etc. To solve this problem in its full generality requires coping with a vast diversity of document and image types. Automatically trainable methods are highly desirable, as is extremely high speed, in order to process large collections. Significant obstacles include the expense of preparing correctly labeled ("ground-truthed") samples, unresolved methodological questions in specifying the domain (<i>e.g.</i>, what is a representative collection of document images?), and a lack of consensus among researchers on how to evaluate content-extraction performance. Our research strategy emphasizes <i>versatility first</i>: that is, we concentrate at the outset on designing methods that promise to work across the broadest possible range of cases.
This strategy has several important implications: the classifiers must be trainable in reasonable time on vast data sets, and expensive ground-truthed data sets must be complemented by amplification using generative models. These and other design and architectural issues are discussed. We propose a trainable classification methodology that marries k-d trees and hash-driven table lookup, and we describe preliminary experiments.
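One plausible reading of marrying hash-driven table lookup with tree-based nearest-neighbor search is sketched below; this is an assumption about the general idea, not the authors' algorithm, and all class and method names are hypothetical. A hash table over coarsely quantized feature vectors answers most queries in constant time, with an exact nearest-neighbor search as the fallback (a linear scan stands in here for a k-d tree):

```python
from collections import Counter

class HashedNNClassifier:
    """Sketch: nearest-neighbor classification accelerated by a hash
    table keyed on quantized feature vectors (hypothetical design)."""

    def __init__(self, bin_width=1.0):
        self.bin_width = bin_width
        self.samples = []   # (feature_vector, label) pairs
        self.table = {}     # quantized key -> majority label

    def _key(self, x):
        # Quantize each feature into a bin of width `bin_width`.
        return tuple(int(v // self.bin_width) for v in x)

    def fit(self, X, y):
        self.samples = list(zip(X, y))
        buckets = {}
        for xi, yi in self.samples:
            buckets.setdefault(self._key(xi), []).append(yi)
        # Precompute the majority label for each occupied bin.
        self.table = {k: Counter(v).most_common(1)[0][0]
                      for k, v in buckets.items()}

    def _nearest(self, x):
        # Exact nearest-neighbor fallback; a k-d tree would replace
        # this linear scan in a real implementation.
        return min(self.samples,
                   key=lambda s: sum((a - b) ** 2
                                     for a, b in zip(s[0], x)))[1]

    def predict(self, x):
        # Fast path: O(1) lookup on the quantized feature vector.
        label = self.table.get(self._key(x))
        if label is not None:
            return label
        # Slow path: fall back to exact nearest-neighbor search.
        return self._nearest(x)
```

The attraction of this hybrid for vast training sets is that table construction is a single linear pass, and only queries falling in empty bins pay the cost of a full nearest-neighbor search.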