29 January 2007 Document image content inventories
Abstract
We report an investigation into strategies, algorithms, and software tools for document image content extraction and inventory, that is, the location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc. We have developed automatically trainable methods, adaptable to many kinds of documents represented as bilevel, greylevel, or color images, that offer a wide range of useful tradeoffs of speed versus accuracy using methods for exact and approximate k-Nearest Neighbor classification. We have adopted a policy of classifying each pixel (rather than regions) by content type: we discuss the motivation and engineering implications of this choice. We describe experiments on a wide variety of document-image and content types, and discuss performance in detail in terms of classification speed, per-pixel classification accuracy, per-page inventory accuracy, and subjective quality of page segmentation. These show that even modest per-pixel classification accuracies (of, e.g., 60-70%) support usefully high recall and precision rates (of, e.g., 80-90%) for retrieval queries of document collections seeking pages that contain a given minimum fraction of a certain type of content.
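The pipeline the abstract outlines (classify each pixel, tally a per-page inventory, then answer minimum-fraction queries) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the two-dimensional feature vectors, the labels, the value of k, and the function names `knn_classify`, `page_inventory`, and `pages_matching` are all assumptions made for the example, and only exact brute-force k-NN is shown, not the approximate variants the abstract mentions.

```python
from collections import Counter


def knn_classify(feature, training, k=3):
    """Exact brute-force k-NN: majority label among the k nearest training samples.

    `training` is a list of (feature_vector, label) pairs. The feature
    vectors here are hypothetical; the abstract does not specify them.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(training, key=lambda sample: sq_dist(sample[0], feature))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def page_inventory(pixel_features, training, k=3):
    """Classify every pixel, then report the fraction of pixels per content type."""
    labels = [knn_classify(f, training, k) for f in pixel_features]
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def pages_matching(inventories, content_type, min_fraction):
    """Return ids of pages whose inventory holds at least `min_fraction` of `content_type`."""
    return [
        page_id
        for page_id, inventory in inventories.items()
        if inventory.get(content_type, 0.0) >= min_fraction
    ]


if __name__ == "__main__":
    # Toy training set with made-up 2-D features (assumption for illustration).
    training = [
        ((0.10, 0.90), "text"), ((0.20, 0.80), "text"), ((0.15, 0.85), "text"),
        ((0.90, 0.10), "blank"), ((0.95, 0.05), "blank"), ((0.85, 0.10), "blank"),
        ((0.50, 0.50), "photo"), ((0.55, 0.45), "photo"), ((0.45, 0.50), "photo"),
    ]
    pages = {
        "page_a": [(0.12, 0.88), (0.18, 0.82), (0.90, 0.10), (0.50, 0.50)],
        "page_b": [(0.90, 0.10), (0.92, 0.08), (0.88, 0.10), (0.15, 0.85)],
    }
    inventories = {pid: page_inventory(px, training) for pid, px in pages.items()}
    print(inventories)
    print(pages_matching(inventories, "text", 0.4))
```

Because the query thresholds an aggregate over all pixels of a page, isolated per-pixel mistakes shift the measured fraction only slightly, which is one plausible reading of why the abstract reports retrieval rates higher than per-pixel accuracy.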
© (2007) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Henry S. Baird, Michael A. Moll, Chang An, Matthew R Casey, "Document image content inventories", Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000X (29 January 2007); doi: 10.1117/12.705094; https://doi.org/10.1117/12.705094
PROCEEDINGS
12 PAGES

