19 January 2009 Using synthetic data safely in classification
Proceedings Volume 7247, Document Recognition and Retrieval XVI; 72470G (2009) https://doi.org/10.1117/12.805619
Event: IS&T/SPIE Electronic Imaging, 2009, San Jose, California, United States
Abstract
When is it safe to use synthetic training data in supervised classification? Trainable classifier technologies require large, representative training sets consisting of samples labeled with their true class. Acquiring such training sets is difficult and costly. One way to alleviate this problem is to enlarge training sets by generating artificial, synthetic samples. This immediately raises many questions, perhaps the first being "Why should we trust artificially generated data to be an accurate representation of the real distributions?" Another is "When will training on synthetic data work as well as, or better than, training on real data?" We distinguish between sample space (the set of real samples), parameter space (all samples that can be generated synthetically), and feature space (the set of samples expressed as finite numerical values). In this paper, we discuss a series of experiments in which we produced synthetic data in parameter space, that is, by convex interpolation among the generating parameters of samples, and showed that we could amplify real data to produce a classifier as accurate as one trained on real data. Specifically, we explored the feasibility of varying the generating parameters of Knuth's Metafont system to see whether previously unseen fonts could also be recognized. We also varied the parameters of an image quality model. We found that training on interpolated data is for the most part safe; that is to say, it never produced more errors. Furthermore, the classifier trained on interpolated data often improved class accuracy.
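To make "convex interpolation among the generating parameters" concrete, the following is a minimal sketch, not the authors' implementation. It assumes each real sample is described by a vector of numeric generating parameters (for example, a few Metafont or image-quality settings); the function names, the choice of two seeds per synthetic point, and the example values are all hypothetical. A convex combination uses non-negative weights that sum to 1, so every synthetic parameter vector lies inside the convex hull of the seeds it was built from.

```python
import random

def sample_convex_weights(k, rng):
    """Draw k non-negative weights that sum to 1 (uniform over the simplex)."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def convex_interpolate(param_vectors, weights):
    """Return sum_i w_i * p_i for equal-length parameter vectors p_i."""
    if len(param_vectors) != len(weights):
        raise ValueError("need exactly one weight per parameter vector")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(param_vectors[0])
    return [sum(w * p[j] for w, p in zip(weights, param_vectors))
            for j in range(dim)]

def amplify(seed_params, n_synthetic, k=2, seed=0):
    """Generate synthetic parameter vectors from k randomly chosen seeds each.

    Assumption: interpolating between k=2 seeds at a time; the paper's
    abstract does not state how many seeds were combined.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        chosen = rng.sample(seed_params, k)
        synthetic.append(convex_interpolate(chosen, sample_convex_weights(k, rng)))
    return synthetic

# Hypothetical two-parameter seeds, e.g. (stroke width, blur) for two fonts.
print(convex_interpolate([[0.0, 10.0], [2.0, 20.0]], [0.25, 0.75]))  # [1.5, 17.5]
```

Each synthetic parameter vector would then be passed to the generator (Metafont or the image quality model) to render a labeled training image; that rendering step is outside the scope of this sketch.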
© (2009) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jean Nonnemaker and Henry S. Baird "Using synthetic data safely in classification", Proc. SPIE 7247, Document Recognition and Retrieval XVI, 72470G (19 January 2009); https://doi.org/10.1117/12.805619
KEYWORDS
Cardiovascular magnetic resonance imaging
Image quality
Error analysis
Image processing
Feature extraction
Data acquisition
Prototyping