We report methodologies for computing high-recall masks for document image content extraction, that is, the
location and segmentation of regions containing handwriting, machine-printed text, photographs, blank space,
etc. The resulting segmentation is pixel-accurate, which accommodates arbitrary zone shapes (not merely rectangles).
We describe experiments showing that iterated classifiers can increase recall of all content types, with
little loss of precision. We also introduce two methodological enhancements: (1) a multi-stage voting rule; and (2) a scoring policy that treats blank pixels as a "don't care" class relative to the other content classes. These enhancements improve both recall and precision, achieving at least 89% recall and at least 87% precision across three content types: machine print, handwriting, and photographs.
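As an illustration of the "don't care" scoring policy, the sketch below (in Python, with a hypothetical label convention, not the code used in these experiments) computes per-class recall and precision while excluding disagreements that involve the blank class:

```python
import numpy as np

BLANK = 0  # hypothetical label for the blank / "don't care" class

def recall_precision(truth, pred, n_classes):
    """Per-class recall and precision over pixel label maps, treating blank
    as "don't care": a disagreement involving the blank class (in either the
    ground truth or the prediction) is not counted against content classes."""
    stats = {}
    for c in range(1, n_classes):            # skip the blank class itself
        truth_c = truth == c
        pred_c = pred == c
        tp = np.sum(truth_c & pred_c)                        # agreed on class c
        fn = np.sum(truth_c & ~pred_c & (pred != BLANK))     # missed, blamed only if pred is content
        fp = np.sum(pred_c & ~truth_c & (truth != BLANK))    # spurious, blamed only if truth is content
        recall = tp / (tp + fn) if tp + fn else 1.0
        precision = tp / (tp + fp) if tp + fp else 1.0
        stats[c] = (recall, precision)
    return stats
```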
To accurately recognize the textual images of a book, we often employ several models, including an iconic model (for character classification), a dictionary (for word recognition), a character-segmentation model, etc., all derived from prior knowledge. Imperfections in these models inevitably affect recognition performance. In this paper, we propose an unsupervised learning technique that adapts multiple models on the fly to a homogeneous input data set, fully automatically, to achieve better overall recognition accuracy. The major challenge in this unsupervised learning process is how to make the models improve, rather than damage, one another. In our framework, models measure disagreements between their input data and output data, and we propose a disagreement-based policy for adapting multiple models simultaneously (or alternately) in a safe manner. We construct a book-recognition system based on this framework and demonstrate its feasibility.
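A minimal sketch of such a disagreement-driven policy is shown below; the function names (`measure_disagreement`, `adapt`) are placeholders for whatever models and adaptation steps a system provides, and the accept-only-if-disagreement-decreases rule is an illustrative simplification of the policy described above, not the paper's code.

```python
def adapt_models_safely(models, pages, measure_disagreement, adapt, max_rounds=10):
    """Illustrative adaptation loop: each model is adapted in turn, and an
    adaptation is kept only if it does not increase the disagreement measured
    between the models on the same input data."""
    score = measure_disagreement(models, pages)
    for _ in range(max_rounds):
        improved = False
        for name in list(models):
            candidate = dict(models)
            candidate[name] = adapt(models[name], models, pages)
            new_score = measure_disagreement(candidate, pages)
            if new_score < score:          # accept only "safe" adaptations
                models, score = candidate, new_score
                improved = True
        if not improved:
            break
    return models
```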
Scaling up document-image classifiers to handle an unlimited variety of document and image types poses serious
challenges to conventional trainable classifier technologies. Highly versatile classifiers demand representative
training sets which can be dauntingly large: in investigating document content extraction systems, we have
demonstrated the advantages of employing as many as a billion training samples in approximate k-nearest
neighbor (kNN) classifiers sped up using hashed K-d trees. We report here on an algorithm, which we call online
bin-decimation, for coping with training sets that are too big to fit in main memory, and we show empirically that
it is superior to offline pre-decimation, which simply discards a large fraction of the training samples at random
before constructing the classifier. The key idea of bin-decimation is to enforce, approximately, an upper bound on the number of training samples stored in each K-d hash bin; an adaptive statistical technique allows this to be
accomplished online and in linear time, while reading the training data exactly once. An experiment on 86.7M
training samples reveals a 23-times speedup with less than 0.1% loss of accuracy (compared to pre-decimation);
or, for another value of the upper bound, a 60-times speedup with less than 5% loss of accuracy. We also compare
it to four other related algorithms.
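The following one-pass sketch illustrates the core idea of bounding the number of samples retained per K-d hash bin; it uses per-bin reservoir sampling as a simple stand-in for the adaptive statistical technique described above.

```python
import random
from collections import defaultdict

def online_bin_decimation(samples, hash_bin, cap, seed=0):
    """One-pass sketch of bin-decimation: keep at most `cap` training samples
    per K-d hash bin. Each bin is filled here by reservoir sampling, which
    bounds the bin size online, in linear time, while reading the training
    data exactly once; the paper's adaptive rule is more refined than this."""
    rng = random.Random(seed)
    bins = defaultdict(list)     # hash bin -> retained samples
    seen = defaultdict(int)      # hash bin -> samples seen so far
    for s in samples:
        b = hash_bin(s)
        seen[b] += 1
        if len(bins[b]) < cap:
            bins[b].append(s)
        else:
            j = rng.randrange(seen[b])
            if j < cap:          # replace a retained sample uniformly at random
                bins[b][j] = s
    return bins
```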
We describe a technique of linguistic post-processing of whole-book recognition results. Whole-book recognition is a
technique that improves the recognition of book images using fully automatic cross-entropy-based model adaptation. In previously published work, word recognition was performed on individual words separately, without taking into account passage-level information such as word-occurrence frequencies. As a result, words that are rare in real texts may appear far more often in recognition results, and vice versa. Differences between word frequencies in recognition results and in prior knowledge may indicate recognition
errors on a long passage. In this paper, we propose a post-processing technique to enhance whole-book recognition
results by minimizing differences between word frequencies in recognition results and prior word frequencies. This technique
works better when operating on longer passages, and it drives the character error rate down by 20%, from 1.24% to 0.98%.
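A simplified illustration of the frequency comparison underlying this post-processing (not the paper's actual objective function) flags word types whose frequency in the recognition output greatly exceeds their prior frequency:

```python
import math
from collections import Counter

def frequency_mismatch(recognized_words, prior_freq, smoothing=1e-9):
    """Compare the empirical frequency of each recognized word type with its
    prior frequency; word types occurring far more often in the recognition
    output than the prior predicts are flagged as likely recognition errors.
    The paper's technique minimizes the overall difference between the two
    frequency distributions; this is only the passage-level check."""
    counts = Counter(recognized_words)
    total = sum(counts.values())
    flagged = {}
    for word, n in counts.items():
        observed = n / total
        prior = prior_freq.get(word, smoothing)
        score = math.log(observed / prior)   # large positive -> suspicious
        if score > math.log(10):             # e.g. 10x more frequent than expected
            flagged[word] = score
    return flagged
```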
When is it safe to use synthetic training data in supervised classification? Trainable classifier technologies require
large representative training sets consisting of samples labeled with their true class. Acquiring such training sets
is difficult and costly. One way to alleviate this problem is to enlarge training sets by generating artificial,
synthetic samples. Of course this immediately raises many questions, perhaps the first being "Why should we trust artificially generated data to be an accurate representation of the real distributions?" Other questions include "When will training on synthetic data work as well as, or better than, training on real data?"
We distinguish between sample space (the set of real samples), parameter space (all samples that can be generated synthetically), and feature space (the set of samples represented by finite numerical feature values). In this paper, we discuss a series of experiments in which we produced synthetic data in parameter space, that is, by convex interpolation among the generating parameters of real samples, and show that we could amplify real data to produce a classifier as accurate as one trained on real data. Specifically, we have explored the
feasibility of varying the generating parameters for Knuth's Metafont system to see if previously unseen fonts
could also be recognized. We also varied parameters for an image quality model.
We have found that training on interpolated data is, for the most part, safe, in the sense that it never produced more errors. Furthermore, a classifier trained on interpolated data often improved class accuracy.
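A minimal sketch of the parameter-space amplification step, assuming the generating parameters of two real samples are available as numeric vectors (the rendering step, e.g. Metafont or the image quality model, is not shown):

```python
import numpy as np

def interpolate_parameters(params_a, params_b, n_points=5, rng=None):
    """Convex interpolation between the generating-parameter vectors of two
    real samples. Each interpolated vector can then be rendered into a new,
    synthetic training sample. Parameter names and ranges are assumptions
    made only for illustration."""
    rng = rng or np.random.default_rng(0)
    a, b = np.asarray(params_a, float), np.asarray(params_b, float)
    lambdas = rng.uniform(0.0, 1.0, size=n_points)
    return [lam * a + (1.0 - lam) * b for lam in lambdas]
```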
We describe an approach to unsupervised high-accuracy recognition of the textual contents of an entire book using fully automatic mutual-entropy-based model adaptation. Given images of all the pages of a book together with approximate models of image formation (e.g. a character-image classifier) and linguistics (e.g. a word-occurrence probability model), we detect evidence for disagreements between the two models by analyzing the mutual entropy between two kinds of probability distributions: (1) the a posteriori probabilities of character classes (the recognition results from image classification alone), and (2) the a posteriori probabilities of word classes (the recognition results from image classification combined with linguistic
constraints). The most serious of these disagreements are identified as candidates for automatic corrections to one or the other of the models. We describe a formal information-theoretic framework for detecting model disagreement and for proposing corrections. We illustrate this approach on a small test case selected from real book-image data. This reveals that a sequence of automatic model corrections can drive improvements in both models, and can achieve a lower recognition error rate. The importance of considering the contents of the whole book is motivated by a series of studies, over the last decade, showing that isogeny can be exploited to achieve unsupervised improvements in recognition accuracy.
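The following sketch illustrates the kind of disagreement detection involved, using cross entropy between the two per-position posterior distributions as a proxy for the mutual-entropy measure described above; the array shapes and names are assumptions for illustration only.

```python
import numpy as np

def disagreement_scores(p_iconic, p_linguistic, eps=1e-12):
    """For each character position, measure the disagreement between the
    character-class posterior from image classification alone (p_iconic) and
    the posterior after linguistic constraints are applied (p_linguistic),
    using cross entropy. Positions with the largest scores are candidates for
    model correction. Both arrays are assumed to have shape
    (n_positions, n_classes) with rows summing to one."""
    p = np.clip(np.asarray(p_iconic, float), eps, 1.0)
    q = np.clip(np.asarray(p_linguistic, float), eps, 1.0)
    scores = -(q * np.log(p)).sum(axis=1)      # cross entropy H(q, p) per position
    return np.argsort(scores)[::-1], scores    # positions ranked by disagreement
```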
We describe a methodology for retrieving document images
from large extremely diverse collections. First we perform
content extraction, that is, the location and measurement
of regions containing handwriting, machine-printed
text, photographs, blank space, etc., in documents represented
as bilevel, greylevel, or color images. Recent experiments
have shown that even modest per-pixel content classification accuracies can support usefully high recall and precision rates (of, e.g., 80-90%) for retrieval queries within document collections seeking pages that contain a given minimum fraction of a certain type of content. When the distribution of content and the error rates are uniform across the entire collection, it is possible to derive IR measures from classification measures, and vice versa. Our largest experiments to date, consisting of 80 training images totaling over 416 million pixels, are presented to illustrate these conclusions. This data set is more representative than those of previous experiments, containing a more balanced distribution of content types. It also contains images of text captured with handheld digital cameras, and we discuss how well existing methods (with no modification) classify these images. Initial experiments in discriminating line art from the four classes mentioned above are also described. We also discuss methodological issues that affect both ground-truthing and evaluation measures.
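A minimal sketch of such a retrieval query over per-pixel classification output, assuming label maps are available for both predictions and ground truth:

```python
import numpy as np

def retrieve_pages(pred_maps, truth_maps, content_class, min_fraction):
    """Illustrative page-retrieval query over per-pixel classification output:
    a page is retrieved if at least `min_fraction` of its pixels are classified
    as `content_class`. Retrieval recall and precision are then scored against
    the ground-truth pixel maps."""
    retrieved, relevant, hits = 0, 0, 0
    for pred, truth in zip(pred_maps, truth_maps):
        is_retrieved = np.mean(pred == content_class) >= min_fraction
        is_relevant = np.mean(truth == content_class) >= min_fraction
        retrieved += is_retrieved
        relevant += is_relevant
        hits += is_retrieved and is_relevant
    recall = hits / relevant if relevant else 1.0
    precision = hits / retrieved if retrieved else 1.0
    return recall, precision
```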
We report an investigation into strategies, algorithms, and software tools for document image content extraction
and inventory, that is, the location and measurement of regions containing handwriting, machine-printed text,
photographs, blank space, etc. We have developed automatically trainable methods, adaptable to many kinds
of documents represented as bilevel, greylevel, or color images, that offer a wide range of useful tradeoffs of
speed versus accuracy using methods for exact and approximate k-Nearest Neighbor classification. We have
adopted a policy of classifying each pixel (rather than regions) by content type: we discuss the motivation
and engineering implications of this choice. We describe experiments on a wide variety of document-image and
content types, and discuss performance in detail in terms of classification speed, per-pixel classification accuracy,
per-page inventory accuracy, and subjective quality of page segmentation. These show that even modest per-pixel
classification accuracies (of, e.g., 60-70%) support usefully high recall and precision rates (of, e.g., 80-90%)
for retrieval queries of document collections seeking pages that contain a given minimum fraction of a certain
type of content.
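A brute-force sketch of the per-pixel classification policy, using the grey-level window around each pixel as its feature vector and plain kNN voting; the reported systems replace the brute-force search with exact and approximate kNN sped up by hashed K-d trees.

```python
import numpy as np
from collections import Counter

def classify_pixels(image, training_feats, training_labels, k=5, win=7):
    """Per-pixel kNN classification: the feature vector for each pixel is the
    flattened grey-level window centred on it, and the assigned label is the
    majority class among the k nearest training samples."""
    pad = win // 2
    padded = np.pad(image, pad, mode="edge")
    labels = np.empty(image.shape, dtype=training_labels.dtype)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            feat = padded[r:r + win, c:c + win].ravel().astype(float)
            d = np.linalg.norm(training_feats - feat, axis=1)
            nearest = training_labels[np.argsort(d)[:k]]
            labels[r, c] = Counter(nearest.tolist()).most_common(1)[0][0]
    return labels
```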
We offer a preliminary report on a research program to investigate versatile algorithms for document image content extraction, that is, locating regions containing handwriting, machine-printed text,
graphics, line art, logos, photographs, noise, etc. Solving this problem in its full generality requires coping with a vast diversity of document and image types. Automatically trainable methods are highly desirable, as is extremely high speed in order to process large collections. Significant obstacles include the expense of preparing correctly labeled ("ground-truthed") samples, unresolved methodological questions in specifying the domain (e.g., what is a representative collection of document images?), and a lack of consensus among researchers on how to evaluate content-extraction performance. Our research strategy emphasizes versatility first: that is, we concentrate at the outset on designing methods that promise to work across the broadest possible range of cases.
This strategy has several important implications: the classifiers must be trainable in reasonable time on vast data sets; and expensive ground-truthed data sets must be complemented by amplification using generative models. These and other design and architectural issues are discussed. We propose a trainable classification methodology that marries k-d trees and hash-driven table lookup and describe preliminary experiments.
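A toy sketch of this marriage of k-d-style quantization with hash-driven table lookup follows; the choice of dimensions and bin widths is illustrative, not the system's actual design.

```python
import numpy as np
from collections import defaultdict

class KdHashIndex:
    """Sketch of hash-driven table lookup for approximate kNN: feature vectors
    are quantized along a few chosen dimensions, the quantized tuple is used
    as a hash key, and queries search only the samples that fall in the same
    bin."""

    def __init__(self, dims, bin_width):
        self.dims, self.w = dims, bin_width
        self.bins = defaultdict(list)

    def _key(self, x):
        return tuple(int(x[d] // self.w) for d in self.dims)

    def add(self, x, label):
        self.bins[self._key(x)].append((np.asarray(x, float), label))

    def query(self, x, k=5):
        x = np.asarray(x, float)
        candidates = self.bins.get(self._key(x), [])
        candidates = sorted(candidates, key=lambda fl: np.linalg.norm(fl[0] - x))
        return [label for _, label in candidates[:k]]
```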
We propose a design methodology for "implicit" CAPTCHAs to relieve drawbacks of present technology. CAPTCHAs are tests administered automatically over networks that can distinguish between people and machines and thus protect web services from abuse by programs masquerading as human users. All existing CAPTCHAs' challenges require a significant conscious effort by the person answering them -- e.g. reading and typing a nonsense word -- whereas implicit CAPTCHAs may require as little as a single click. Many CAPTCHAs distract and interrupt users, since the challenge is perceived as an irrelevant intrusion; implicit CAPTCHAs can be woven into the expected sequence of browsing using cues tailored to the site. Most existing CAPTCHAs are vulnerable to "farming-out" attacks in which challenges are passed to a networked community of human readers; by contrast, implicit CAPTCHAs are not "fungible" (in the sense of easily answerable in isolation) since they are meaningful only in the specific context of the website that is protected. Many existing CAPTCHAs irritate or threaten users since they are obviously tests of skill: implicit CAPTCHAs appear to be elementary and inevitable acts of browsing. It can often be difficult to detect when CAPTCHAs are under attack: implicit CAPTCHAs can be designed so that certain failure modes are correlated with failed bot attacks. We illustrate these design principles with examples.
A reading-based CAPTCHA designed to resist character-segmentation attacks, called 'ScatterType,' is described. Its challenges are pseudorandomly synthesized images of text strings rendered in machine-print typefaces: within each image, characters are fragmented
using horizontal and vertical cuts, and the fragments are scattered by vertical and horizontal displacements. This scattering is designed to defeat all methods known to us for automatic segmentation into characters. As in the BaffleText CAPTCHA, English-like but unspellable text-strings are used to defend against known-dictionary attacks. In contrast to the PessimalPrint and BaffleText CAPTCHAs (and others), no physics-based image degradations, occlusions, or extraneous patterns are employed. We report preliminary results from a human legibility trial with 57 volunteers that yielded 4275 CAPTCHA challenges and responses. ScatterType human legibility remains remarkably high even on extremely degraded cases. We speculate that this is due to Gestalt perception abilities
assisted by style-specific (here, typeface-specific) consistency
among primitive shape features of character fragments. Although recent efforts to automate style-consistent perceptual skills
have reported progress, the best known methods do not yet pose a threat to ScatterType. The experimental data also show that subjective rating of difficulty is strongly (and usefully) correlated with illegibility. In addition, we present early insights emerging from these data as we explore the ScatterType design space -- choice of typefaces, 'words', cut positioning, and displacements -- with the goal of locating regimes in which ScatterType challenges remain comfortably legible to almost all people but strongly resist machine-vision methods for automatic segmentation into characters.
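A toy sketch of the scattering idea, applied here to a single glyph bitmap; the actual generator works on whole rendered word images and applies both horizontal and vertical cuts and displacements.

```python
import numpy as np

def scatter_glyph(glyph, n_cuts=2, max_shift=4, rng=None):
    """Cut a glyph bitmap into vertical strips and displace each strip
    vertically by a pseudorandom amount, so that no contiguous character
    image survives for a segmenter to find."""
    rng = rng or np.random.default_rng(0)
    h, w = glyph.shape
    canvas = np.zeros((h + 2 * max_shift, w), dtype=glyph.dtype)
    cuts = sorted(rng.choice(np.arange(1, w), size=n_cuts, replace=False))
    edges = [0, *cuts, w]
    for left, right in zip(edges[:-1], edges[1:]):
        dy = int(rng.integers(-max_shift, max_shift + 1))
        canvas[max_shift + dy:max_shift + dy + h, left:right] = glyph[:, left:right]
    return canvas
```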
Internet services designed for human use are being abused by programs. We present a defense against such attacks in the form of a CAPTCHA (Completely Automatic Public Turing test to tell Computers and Humans Apart) that exploits the difference in ability between humans and machines in reading images of text. CAPTCHAs are a special case of 'human interactive proofs,' a broad class of security protocols that allow people to identify themselves over networks as members of given groups. We point out vulnerabilities of reading-based CAPTCHAs to dictionary and computer-vision attacks. We also draw on the literature on the psychophysics of human reading, which suggests fresh defenses available to CAPTCHAs. Motivated by these considerations, we propose BaffleText, a CAPTCHA which uses non-English pronounceable words to defend against dictionary attacks, and Gestalt-motivated image-masking degradations to defend against image restoration attacks. Experiments on human subjects confirm the human legibility and user acceptance of BaffleText images. We have found an image-complexity measure that correlates well with user acceptance and assists in engineering the generation of challenges to fit the ability gap. Recent computer-vision attacks, run independently by Mori and Malik, suggest that BaffleText is stronger than two existing CAPTCHAs.
We describe a technique for modeling the character recognition accuracy of an OCR system -- treated as a black box -- on a particular page of printed text based on an examination only of the output top-choice character classifications and, for each, a confidence score such as is supplied by many commercial OCR systems. Latent conditional independence (LCI) models perform better on this task, in our experience, than naive uniform thresholding methods. Given a sufficiently large and representative dataset of OCR (errorful) output and manually proofed (correct) text, we can automatically infer LCI models that exhibit a useful degree of reliability. A collaboration between a PARC research group and a Xerox legacy conversion service bureau has demonstrated that such models can significantly improve the productivity of human proofing staff by triaging -- that is, selecting to bypass manual inspection -- pages whose estimated OCR accuracy exceeds a threshold chosen to ensure that a customer-specified per-page accuracy target will be met with sufficient confidence. We report experimental results on over 1400 pages. Our triage software tools are running in production and will be applied to more than 5 million pages of multi-lingual text.
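The triage step can be illustrated with a simplified conditional-independence estimate of page accuracy; the confidence-score distributions and prior below are assumptions that would be fit from proofed training pages, and this sketch is not the LCI model itself.

```python
def estimate_page_accuracy(confidences, p_conf_given_correct, p_conf_given_error,
                           prior_correct=0.95):
    """Treat each character's correctness as a hidden variable, assume its
    confidence score is conditionally independent given correctness, and
    estimate page accuracy as the mean posterior probability of correctness."""
    posteriors = []
    for c in confidences:
        pc = p_conf_given_correct(c) * prior_correct
        pe = p_conf_given_error(c) * (1.0 - prior_correct)
        posteriors.append(pc / (pc + pe))
    return sum(posteriors) / len(posteriors)

def triage(pages, accuracy_target=0.995, **model):
    """Bypass manual inspection for pages whose estimated accuracy meets the target."""
    return [p for p in pages if estimate_page_accuracy(p, **model) >= accuracy_target]
```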
We present an automatic method for constructing high-performance preclassification decision trees for OCR. Good preclassifiers prune the set of alternative classes to a much smaller one without erroneously pruning the correct class. We build the decision tree using greedy entropy minimization on pseudo-randomly generated training samples derived from a model of imaging defects, and then `populate' the tree with many more samples to drive down the error rate. In [BM94] we presented a statistically rigorous stopping rule for population that enforces a user-specified upper bound on error: this works in practice, but is too conservative, driving the error far below the bound. Here, we describe a refinement that achieves the user-specified accuracy more closely and thus improves the pruning rate of the resulting tree. The method exploits the structure of the tree: the essential technical device is a leaf-selection rule based on Good's Theorem [Good53]. We illustrate its effectiveness through experiments on a pan-European polyfont classifier.
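A minimal sketch of the greedy entropy-minimization step used to grow such a tree (feature handling is simplified, and tree population and the stopping rule are not shown):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(samples, labels, n_features):
    """Greedy entropy-minimizing split for one decision-tree node: choose the
    (feature, threshold) pair that minimizes the weighted entropy of the two
    children."""
    best = (None, None, float("inf"))
    for f in range(n_features):
        for t in sorted({s[f] for s in samples}):
            left = [l for s, l in zip(samples, labels) if s[f] <= t]
            right = [l for s, l in zip(samples, labels) if s[f] > t]
            if not left or not right:
                continue
            h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if h < best[2]:
                best = (f, t, h)
    return best
```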
We have developed a practical scheme to take advantage of local typeface homogeneity to improve the accuracy of a character classifier. Given a polyfont classifier which is capable of recognizing any of 100 typefaces moderately well, our method allows it to specialize itself automatically to the single -- but otherwise unknown -- typeface it is reading. Essentially, the classifier retrains itself after examining some of the images, guided at first by the preset classification boundaries of the given classifier, and later by the behavior of the retrained classifier. Experimental trials on 6.4 M pseudo-randomly distorted images show that the method improves on 95 of the 100 typefaces. It reduces the error rate by a factor of 2.5, averaged over 100 typefaces, when applied to an alphabet of 80 ASCII characters printed at ten point and digitized at 300 pixels/inch. This self-correcting method complements, and does not hinder, other methods for improving OCR accuracy, such as linguistic contextual analysis.
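A schematic sketch of this self-specialization loop follows; the `classify` and `retrain` methods are hypothetical stand-ins for the given polyfont classifier and its retraining procedure.

```python
def specialize_classifier(classifier, page_images, confidence_threshold=0.95, rounds=3):
    """Typeface self-specialization: the polyfont classifier labels the
    character images it is reading, the most confidently labeled images are
    used as pseudo-labeled training data, and the classifier is retrained on
    them; later rounds are guided by the retrained classifier."""
    for _ in range(rounds):
        pseudo_labeled = []
        for img in page_images:
            label, confidence = classifier.classify(img)
            if confidence >= confidence_threshold:
                pseudo_labeled.append((img, label))
        classifier = classifier.retrain(pseudo_labeled)
    return classifier
```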