High recall document content extraction
24 January 2011
Abstract
We report methodologies for computing high-recall masks for document image content extraction, that is, the location and segmentation of regions containing handwriting, machine-printed text, photographs, blank space, etc. The resulting segmentation is pixel-accurate, which accommodates arbitrary zone shapes (not merely rectangles). We describe experiments showing that iterated classifiers can increase recall of all content types, with little loss of precision. We also introduce two methodological enhancements: (1) a multi-stage voting rule; and (2) a scoring policy that treats blank pixels as a "don't care" class relative to the other content classes. These enhancements improve both recall and precision, achieving at least 89% recall and at least 87% precision among three content types: machine-print, handwriting, and photo.
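To make the per-class recall and precision figures concrete, the following is a minimal sketch of pixel-level scoring with a "don't care" class. It is not the authors' scoring policy, which the abstract does not specify. It assumes, purely for illustration, that a pixel is left out of scoring when its ground-truth label is blank. The function name `per_class_scores` and the labels `"mp"`, `"hw"`, `"photo"`, and `"blank"` are hypothetical.

```python
from collections import Counter

BLANK = "blank"  # hypothetical label for the "don't care" class


def per_class_scores(truth, predicted, dont_care=BLANK):
    """Return {class: (recall, precision)} over paired per-pixel labels.

    Assumption (not taken from the paper): pixels whose ground-truth label
    is the don't-care class are skipped entirely, so labeling a blank pixel
    as content is not penalized.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    true_pos = Counter()     # pixels where prediction matches ground truth
    truth_count = Counter()  # scored pixels per ground-truth class
    pred_count = Counter()   # scored pixels per predicted class

    for t, p in zip(truth, predicted):
        if t == dont_care:
            continue  # ground truth is blank: do not score this pixel
        truth_count[t] += 1
        if p != dont_care:
            pred_count[p] += 1
        if t == p:
            true_pos[t] += 1

    scores = {}
    for c in set(truth_count) | set(pred_count):
        recall = true_pos[c] / truth_count[c] if truth_count[c] else None
        precision = true_pos[c] / pred_count[c] if pred_count[c] else None
        scores[c] = (recall, precision)
    return scores
```

In this sketch, a content pixel that is predicted blank still lowers that class's recall, while a blank pixel predicted as content leaves precision unchanged. Other readings of "don't care" are possible and would give different numbers.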
© (2011) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Chang An, Henry S. Baird, "High recall document content extraction", Proc. SPIE 7874, Document Recognition and Retrieval XVIII, 787405 (24 January 2011); doi: 10.1117/12.876706; https://doi.org/10.1117/12.876706
PROCEEDINGS
8 PAGES

