18 January 2010 Time and space optimization of document content classifiers
Abstract
Scaling up document-image classifiers to handle an unlimited variety of document and image types poses serious challenges to conventional trainable classifier technologies. Highly versatile classifiers demand representative training sets which can be dauntingly large: in investigating document content extraction systems, we have demonstrated the advantages of employing as many as a billion training samples in approximate k-nearest neighbor (kNN) classifiers sped up using hashed K-d trees. We report here on an algorithm, which we call online bin-decimation, for coping with training sets that are too big to fit in main memory, and we show empirically that it is superior to offline pre-decimation, which simply discards a large fraction of the training samples at random before constructing the classifier. The key idea of bin-decimation is to enforce an approximate upper bound on the number of training samples stored in each K-d hash bin; an adaptive statistical technique allows this to be accomplished online and in linear time, while reading the training data exactly once. An experiment on 86.7M training samples reveals a 23-times speedup with less than 0.1% loss of accuracy (compared to pre-decimation); or, for another value of the upper bound, a 60-times speedup with less than 5% loss of accuracy. We also compare bin-decimation to four other related algorithms.
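The per-bin cap described in the abstract can be illustrated with a minimal single-pass sketch. This is not the authors' algorithm: the abstract does not specify their adaptive statistical technique, so the sketch substitutes plain per-bin reservoir sampling, which enforces an exact rather than approximate bound. The names `bin_key`, `bin_decimate`, `cap`, and `cell_width` are hypothetical, and the grid quantization merely stands in for the K-d hash.

```python
import random
from collections import defaultdict


def bin_key(features, cell_width):
    """Hypothetical stand-in for the K-d hash: quantize each feature to a grid cell."""
    return tuple(int(x // cell_width) for x in features)


def bin_decimate(stream, cap, cell_width=1.0, seed=0):
    """Keep at most `cap` samples per bin while reading `stream` exactly once.

    Assumption: per-bin reservoir sampling is swapped in for the adaptive
    statistical technique mentioned in the abstract, which is not described there.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)   # bin key -> retained (features, label) samples
    seen = defaultdict(int)    # bin key -> number of samples seen so far
    for features, label in stream:
        key = bin_key(features, cell_width)
        seen[key] += 1
        bucket = bins[key]
        if len(bucket) < cap:
            # Bin is below its cap: always keep the sample.
            bucket.append((features, label))
        else:
            # Bin is full: replace a random slot with probability cap / seen[key].
            j = rng.randrange(seen[key])
            if j < cap:
                bucket[j] = (features, label)
    return bins


# Usage: one crowded bin and one sparse bin.
stream = [((0.5, 0.5), "text")] * 1000 + [((5.2, 5.1), "photo")] * 3
bins = bin_decimate(stream, cap=10)
```

In this sketch the crowded bin is cut down to the cap while the sparse bin keeps all of its samples. Uniform random pre-decimation, by contrast, would thin both bins in proportion, which may be one intuition for why capping per bin preserves accuracy better at the same memory budget.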
© (2010) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Dawei Yin, Henry S. Baird, and Chang An, "Time and space optimization of document content classifiers", Proc. SPIE 7534, Document Recognition and Retrieval XVII, 753409 (18 January 2010); doi: 10.1117/12.838957; https://doi.org/10.1117/12.838957
PROCEEDINGS
11 PAGES

