29 January 2007
Document image content inventories
Abstract
We report an investigation into strategies, algorithms, and software tools for document image content extraction and inventory, that is, the location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc. We have developed automatically trainable methods, adaptable to many kinds of documents represented as bilevel, greylevel, or color images, that offer a wide range of useful tradeoffs of speed versus accuracy through exact and approximate k-Nearest Neighbor classification. We have adopted a policy of classifying each pixel (rather than each region) by content type, and we discuss the motivation and engineering implications of this choice. We describe experiments on a wide variety of document-image and content types, and discuss performance in detail in terms of classification speed, per-pixel classification accuracy, per-page inventory accuracy, and subjective quality of page segmentation. These experiments show that even modest per-pixel classification accuracies (e.g., 60-70%) support usefully high recall and precision rates (e.g., 80-90%) for retrieval queries over document collections seeking pages that contain a given minimum fraction of a certain type of content.
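To make the idea of a per-page inventory and a minimum-fraction query concrete, the following is a minimal sketch in Python. It is not the authors' implementation. It assumes that some per-pixel classifier has already produced a label for every pixel, and the label names, the function names `page_inventory` and `pages_with_content`, and the toy pages are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

# A page is modeled as a 2-D grid of per-pixel content-type labels.
# The label strings used here are illustrative assumptions, not the paper's class set.
LabelMap = Sequence[Sequence[str]]


def page_inventory(label_map: LabelMap) -> Dict[str, float]:
    """Return the fraction of the page's pixels assigned to each content type."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def pages_with_content(
    inventories: Dict[str, Dict[str, float]],
    content_type: str,
    min_fraction: float,
) -> List[str]:
    """Return the ids of pages whose inventory holds at least min_fraction of content_type."""
    return sorted(
        page_id
        for page_id, inventory in inventories.items()
        if inventory.get(content_type, 0.0) >= min_fraction
    )


# Two toy pages with hypothetical per-pixel labels.
page_a = [
    ["text", "text", "text", "blank"],
    ["text", "photo", "photo", "blank"],
]
page_b = [
    ["handwriting", "handwriting"],
    ["blank", "blank"],
]

inventories = {
    "page_a": page_inventory(page_a),
    "page_b": page_inventory(page_b),
}

# Query: pages that are at least 40% machine-printed text.
text_pages = pages_with_content(inventories, "text", 0.4)
```

Because the query depends only on aggregate fractions, individual pixel errors can partly cancel out across a page, which is one plausible reading of why modest per-pixel accuracy can still support useful page-level retrieval.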
© (2007) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Henry S. Baird, Michael A. Moll, Chang An, Matthew R Casey, "Document image content inventories", Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000X (29 January 2007); doi: 10.1117/12.705094; https://doi.org/10.1117/12.705094
PROCEEDINGS
12 PAGES

