We describe a methodology for retrieving document images
from large, extremely diverse collections. First we perform
content extraction, that is, the location and measurement
of regions containing handwriting, machine-printed
text, photographs, blank space, etc., in documents represented
as bilevel, greylevel, or color images. Recent experiments
have shown that even modest per-pixel content classification accuracies can support usefully high recall and precision rates (of, e.g., 80-90%) for retrieval queries within document collections seeking pages that contain a given minimum fraction of a certain type of content. When the distribution of content and the error rates are uniform across the entire collection, it is possible to derive IR measures from classification measures and vice versa. Our largest experiments to date, consisting of 80 training images totaling over 416 million pixels, are presented to illustrate these conclusions. This data set is more representative than those of previous experiments, containing a more balanced distribution of content types. It also contains images of text captured with handheld digital cameras, and we discuss how well existing methods, with no modification, classify these images. Initial experiments in discriminating line art from the four classes mentioned above are also described. We also discuss methodological issues that affect both ground-truthing and evaluation measures.
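To make the connection between per-pixel classification rates and page-level retrieval performance concrete, the following sketch simulates the kind of query described above (pages containing at least a given fraction of a target content type) under the stated uniformity assumption. It is an illustration only, not the experimental setup of the paper; all parameter values (per-pixel true/false positive rates, content coverage, retrieval threshold) are invented for the example.

import random

def measured_coverage(true_coverage, pixels=2000, tpr=0.65, fpr=0.05):
    # Fraction of pixels labeled as the target type on one page, given
    # modest per-pixel true-positive (tpr) and false-positive (fpr) rates.
    n_target = int(true_coverage * pixels)
    hits = sum(random.random() < tpr for _ in range(n_target))
    alarms = sum(random.random() < fpr for _ in range(pixels - n_target))
    return (hits + alarms) / pixels

def retrieval_rates(n_pages=2000, prevalence=0.3, threshold=0.12):
    # Recall and precision for the query "pages whose target-content
    # coverage is at least `threshold`", assuming uniform content
    # distribution and uniform error rates across the collection.
    tp = fp = fn = 0
    for _ in range(n_pages):
        positive = random.random() < prevalence
        coverage = random.uniform(0.1, 0.5) if positive else 0.0
        if measured_coverage(coverage) >= threshold:
            tp += positive
            fp += not positive
        else:
            fn += positive
    return tp / (tp + fn), tp / (tp + fp)

print("recall, precision:", retrieval_rates())

With these invented parameters, page-level recall and precision come out well above the per-pixel accuracy, which is the qualitative effect the experiments above report.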
We report an investigation into strategies, algorithms, and software tools for document image content extraction
and inventory, that is, the location and measurement of regions containing handwriting, machine-printed text,
photographs, blank space, etc. We have developed automatically trainable methods, adaptable to many kinds
of documents represented as bilevel, greylevel, or color images, that offer a wide range of useful tradeoffs of
speed versus accuracy using methods for exact and approximate k-Nearest Neighbor classification. We have
adopted a policy of classifying each pixel (rather than regions) by content type: we discuss the motivation
and engineering implications of this choice. We describe experiments on a wide variety of document-image and
content types, and discuss performance in detail in terms of classification speed, per-pixel classification accuracy,
per-page inventory accuracy, and subjective quality of page segmentation. These show that even modest per-pixel
classification accuracies (of, e.g., 60-70%) support usefully high recall and precision rates (of, e.g., 80-90%)
for retrieval queries of document collections seeking pages that contain a given minimum fraction of a certain
type of content.
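As a concrete illustration of the per-pixel policy described above, the sketch below classifies each pixel of a greylevel document image with a k-NN classifier over simple local features. The feature set (greylevel value plus local mean and standard deviation) and the use of scikit-learn's exact k-NN are assumptions made for the example; they are not the feature set, classifier implementation, or speed/accuracy tradeoffs reported in the work.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def pixel_features(img, win=7):
    # Per-pixel features: greylevel value plus local mean and standard
    # deviation over a win x win neighborhood (reflect-padded at borders).
    pad = win // 2
    padded = np.pad(img.astype(float), pad, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (win, win))
    feats = np.stack([img.astype(float),
                      windows.mean(axis=(2, 3)),
                      windows.std(axis=(2, 3))], axis=-1)
    return feats.reshape(-1, 3)

def train_pixel_knn(train_imgs, label_maps, k=5):
    # label_maps assign each training pixel a content type, e.g.
    # 0=blank, 1=machine print, 2=handwriting, 3=photograph.
    # In practice the training pixels would be subsampled heavily.
    X = np.vstack([pixel_features(im) for im in train_imgs])
    y = np.concatenate([lm.ravel() for lm in label_maps])
    return KNeighborsClassifier(n_neighbors=k).fit(X, y)

def classify_pixels(knn, img):
    # Return a per-pixel content-type label map for a new document image.
    return knn.predict(pixel_features(img)).reshape(img.shape)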
We offer a preliminary report on a research program to investigate versatile algorithms for document image content extraction, that is, locating regions containing handwriting, machine-printed text,
graphics, line-art, logos, photographs, noise, etc. Solving this problem in its full generality requires coping with a vast diversity of document and image types. Automatically trainable methods are highly desirable, as is extremely high speed in order to process large collections. Significant obstacles include the expense of preparing correctly labeled ("ground-truthed") samples, unresolved methodological questions in specifying the domain (e.g., what is a representative collection of document images?), and a lack of consensus among researchers on how to evaluate content-extraction performance. Our research strategy emphasizes versatility first: that is, we concentrate at the outset on designing methods that promise to work across the broadest possible range of cases.
This strategy has several important implications: the classifiers must be trainable in reasonable time on vast data sets, and expensive ground-truthed data sets must be complemented by amplification using generative models. These and other design and architectural issues are discussed. We propose a trainable classification methodology that marries k-d trees with hash-driven table lookup, and describe preliminary experiments.
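One plausible reading of marrying k-d trees with hash-driven table lookup is sketched below: exact k-NN decisions computed with a k-d tree are memoized in a hash table keyed on coarsely quantized feature vectors, so repeated or near-identical queries are answered by a single table lookup. The quantization scheme, cache policy, and use of SciPy's cKDTree are assumptions for illustration, not the architecture actually proposed.

import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

class HashedKNN:
    def __init__(self, X, y, k=5, quantum=8):
        self.tree = cKDTree(X)    # exact k-NN structure over training features
        self.y = np.asarray(y)
        self.k = k
        self.quantum = quantum    # coarseness of the hash key
        self.table = {}           # quantized feature vector -> cached class label

    def _key(self, x):
        return tuple((np.asarray(x) // self.quantum).astype(int))

    def classify(self, x):
        key = self._key(x)
        label = self.table.get(key)
        if label is None:         # cache miss: fall back to exact k-NN
            _, idx = self.tree.query(x, k=self.k)
            label = Counter(self.y[np.atleast_1d(idx)]).most_common(1)[0][0]
            self.table[key] = label   # memoize for subsequent lookups
        return label

On highly repetitive per-pixel feature vectors, most queries would hit the hash table, which is one way such a hybrid could trade a small loss in accuracy for a large gain in speed.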