A large collection of reproductions of calligraphy on paper was scanned into images to enable web access for both the academic community and the public. The technology for digitizing calligraphic pages is mature, but technologies for segmentation, character coding, style classification, and identification of calligraphy are lacking. Therefore, computational tools for the classification and quantification of calligraphic style are proposed and demonstrated on a statistically characterized corpus. A subset of 259 historical page images is segmented into 8719 individual character images. Calligraphic style is revealed and quantified by visual attributes (i.e., appearance features) of character images sampled from historical works. A style space is defined with the features of five main classical styles as basis vectors. Cross-validated error rates of 10% to 40% are reported for conventional and conservative sampling into training/test sets and for same-work voting with a range of voter participation. Beyond its immediate applicability to education and scholarship, this research lays the foundation for style-based calligraphic forgery detection and for the discovery of latent calligraphic groups induced by mentor-student relationships.
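A minimal sketch of the style-space idea, assuming each character image has already been reduced to a numeric feature vector; the style names are the five classical styles, but the feature dimensionality and prototype values below are illustrative placeholders, not the features or basis vectors used in the study:

```python
import numpy as np

# Hypothetical feature vectors for the five classical styles, e.g., means of
# appearance features over sampled characters per style (values are made up).
style_basis = {
    "seal":     np.array([0.62, 0.18, 0.45]),
    "clerical": np.array([0.55, 0.30, 0.20]),
    "regular":  np.array([0.40, 0.25, 0.35]),
    "running":  np.array([0.33, 0.40, 0.50]),
    "cursive":  np.array([0.20, 0.55, 0.60]),
}

def style_coordinates(x):
    """Express a character's feature vector in the style space by
    least-squares projection onto the five style basis vectors."""
    B = np.stack(list(style_basis.values()), axis=1)   # features x styles
    coords, *_ = np.linalg.lstsq(B, x, rcond=None)
    return dict(zip(style_basis.keys(), coords))

def nearest_style(x):
    """Assign the style whose basis vector is closest in cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(style_basis, key=lambda s: cos(style_basis[s], x))

print(nearest_style(np.array([0.35, 0.28, 0.33])))   # -> "regular" for this toy input
```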
Revealing related content among heterogeneous web tables is part of our long-term objective of formulating queries over
multiple sources of information. Two hundred HTML tables from institutional web sites are segmented and each table
cell is classified according to the fundamental indexing property of row and column headers. The categories that
correspond to the multi-dimensional data cube view of a table are extracted by factoring the (often multi-row/column)
headers. To reveal commonalities between tables from diverse sources, the Jaccard distances between pairs of category
headers (and also table titles) are computed. We show that about one third of our heterogeneous collection can be
clustered into a dozen groups that exhibit table-title and header similarities that can be exploited for queries.
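The Jaccard distance on header word sets can be computed directly; the sketch below, with hypothetical table names and a grouping threshold chosen only for illustration, shows one simple greedy way such distances could drive the grouping described above:

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of header (or title) words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def group_tables(headers, threshold=0.5):
    """Greedy single-link grouping: a table joins the first existing group
    containing a table whose header set is within the distance threshold."""
    groups = []
    for name, words in headers.items():
        for g in groups:
            if any(jaccard_distance(words, headers[m]) <= threshold for m in g):
                g.append(name)
                break
        else:
            groups.append([name])
    return groups

tables = {
    "enrollment_2010": {"year", "department", "students"},
    "enrollment_2011": {"year", "department", "students", "count"},
    "budget_2010":     {"year", "category", "amount"},
}
print(group_tables(tables))   # the two enrollment tables group together
```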
Ruling gap ratios are an affine-invariant characterization of parallel ruling configurations in scanned documents. This report quantifies the advantage of simultaneous extraction of horizontal and vertical rulings. It demonstrates that every ruling gap ratio can be derived from a minimal set of basis ratios. The effect of noise in the radial coordinates of individual rulings on the basis ratios is analyzed, and the dependence of basis-ratio variability on random-phase sampling noise is determined as a function of the spatial sampling rate. The analysis provides insight into previously reported small-scale experimental results on form classification and offers guidance for future work that requires the extraction of parallel lines from scanned or photographed images.
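A minimal illustration of ruling gap ratios, assuming the rulings' radial coordinates have already been extracted; it shows that ratios of consecutive gaps are unchanged by scaling and translation of the coordinates, which is the basis-ratio property mentioned above (the coordinate values are made up):

```python
import numpy as np

def gap_ratios(positions):
    """Ratios of consecutive gaps between parallel rulings.
    Invariant to translation and uniform scaling of the coordinates."""
    p = np.sort(np.asarray(positions, dtype=float))
    gaps = np.diff(p)
    return gaps[1:] / gaps[:-1]              # n - 2 basis ratios for n rulings

rulings = [102.0, 150.0, 246.0, 318.0]       # radial coordinates of detected rulings
print(gap_ratios(rulings))                   # e.g., [2.0, 0.75]
print(gap_ratios(0.5 * np.array(rulings) + 37))   # same ratios after scale + shift
```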
Geometric invariants are combined with edit distance to compare the ruling configuration of noisy filled-out forms. It is
shown that gap-ratios used as features capture most of the ruling information of even low-resolution and poorly scanned
form images, and that the edit distance is tolerant of missed and spurious rulings. No preprocessing is required and the
potentially time-consuming string operations are performed on a sparse representation of the detected rulings. Based on
edit distance, 158 Arabic forms are classified into 15 groups with 89% accuracy. Since the method was developed for an
application that precludes public dissemination of the data, it is illustrated on public-domain death certificates.
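One way to realize an edit distance over gap-ratio sequences is a standard Levenshtein-style dynamic program in which a substitution is free when two ratios agree within a tolerance, so missed or spurious rulings appear mainly as insertions and deletions; the templates, tolerance, and costs below are illustrative, not the paper's parameters:

```python
def ratio_edit_distance(a, b, tol=0.1, indel=1.0):
    """Levenshtein-style distance between two gap-ratio sequences."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if abs(a[i - 1] - b[j - 1]) <= tol else 1.0
            d[i][j] = min(d[i - 1][j] + indel,      # deletion (missed ruling)
                          d[i][j - 1] + indel,      # insertion (spurious ruling)
                          d[i - 1][j - 1] + sub)    # (mis)match
    return d[n][m]

# Classify a form by the reference template with the smallest distance.
templates = {"form_A": [2.0, 0.75, 1.3], "form_B": [1.0, 1.0, 1.0]}
query = [2.05, 0.73, 0.9, 1.28]              # contains one spurious gap ratio
print(min(templates, key=lambda t: ratio_edit_distance(query, templates[t])))
```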
Integrity tests are proposed for image processing algorithms that should yield essentially the same output under 90-degree rotations, edge-padding, and monotonic gray-scale transformations of scanned documents. The tests are
demonstrated on built-in functions of the Matlab Image Processing Toolbox. Only the routine that reports the area of the
convex hull of foreground components fails the rotation test. Ensuring error-free preprocessing operations like size and
skew normalization that are based on resampling an image requires more radical treatment. Even if faultlessly
implemented, resampling is generally irreversible and may introduce artifacts. Fortunately, advances in storage and
processor technology have all but eliminated any advantage of preprocessing or compressing document images by
resampling them. Using floating point coordinate transformations instead of resampling images yields accurate run-length,
moment, slope, and other geometric features.
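A rotation-integrity test of the kind proposed above can be written as a small wrapper that applies a candidate feature function to all four 90-degree rotations of an image and checks that the results agree; the feature used here (foreground pixel count) is just a trivially invariant example, not one of the Toolbox routines tested:

```python
import numpy as np

def rotation_invariant(feature_fn, image, atol=1e-6):
    """Integrity test: a rotation-invariant feature should give essentially
    the same value on all four 90-degree rotations of the image."""
    values = [feature_fn(np.rot90(image, k)) for k in range(4)]
    return np.allclose(values, values[0], atol=atol), values

def foreground_area(binary_image):
    """Trivially rotation-invariant feature: number of foreground pixels."""
    return int(binary_image.sum())

img = np.zeros((8, 8), dtype=np.uint8)
img[2:5, 3:7] = 1
ok, vals = rotation_invariant(foreground_area, img)
print(ok, vals)   # True, [12, 12, 12, 12]
```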
Confirming the labels of automatically classified patterns is generally faster than entering new labels or correcting
incorrect labels. Most labels assigned by a classifier, even if trained only on relatively few pre-labeled patterns, are
correct. Therefore the overall cost of human labeling can be decreased by interspersing labeling and classification. Given
a parameterized model of the error rate as an inverse power law function of the size of the training set, the optimal splits
can be computed rapidly. Projected savings in operator time are over 60% for a range of empirical error functions for
hand-printed digit classification with ten different classifiers.
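A single-split simplification of the interspersed scheme illustrates the computation, assuming an inverse power-law error model e(n) = a * n^(-b) and illustrative per-pattern times for labeling, confirming, and correcting (all constants below are hypothetical):

```python
def total_operator_cost(n_label, n_total, a, b,
                        t_label=3.0, t_confirm=0.5, t_correct=4.0):
    """Expected operator time when the first n_label patterns are labeled by
    hand and the rest are classified and then confirmed or corrected.
    Assumes an inverse power-law error rate e(n) = a * n**(-b)."""
    e = a * n_label ** (-b)
    n_rest = n_total - n_label
    return n_label * t_label + n_rest * ((1 - e) * t_confirm + e * t_correct)

def best_split(n_total, a, b):
    """Search for the training-set size that minimizes total operator time."""
    return min(range(1, n_total), key=lambda n: total_operator_cost(n, n_total, a, b))

n_opt = best_split(10000, a=0.8, b=0.4)
print(n_opt, total_operator_cost(n_opt, 10000, 0.8, 0.4))
```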
In spite of a hundredfold decrease in the cost of relevant technologies, the role of document image processing systems is
gradually declining due to the transition to an on-line world. Nevertheless, in some high-volume applications, document
image processing software still saves millions of dollars by accelerating workflow, and similarly large savings could be
realized by more effective automation of the multitude of low-volume personal document conversions. While potential
cost savings, based on estimates of costs and values, are a driving force for new developments, quantifying such savings
is difficult. The most important trend is that the cost of computing resources for document image analysis (DIA) is becoming insignificant compared
to the associated labor costs. An econometric treatment of document processing complements traditional performance
evaluation, which focuses on assessing the correctness of the results produced by document conversion software.
Researchers should look beyond the error rate for advancing both production and personal document conversion.
For this research, calligraphic style is defined as the visual attributes of images of calligraphic characters sampled randomly from a "work" created by a single artist. It is independent of page layout and textual content. An experimental design is developed to investigate to what extent the source of a single character image, or of a few pairs of character images, can be assigned either to the same work or to two different works. The experiments are conducted on the 13,571 segmented and labeled 600-dpi character images of the CADAL database. The classifier is not trained on the works tested, only on other works. Even when only a few samples of same-class pairs are available, the difference vector of a few simple features extracted from each image of a pair yields over 80% classification accuracy for the same-work vs. different-work dichotomy. When many pairs from different character classes are available for each comparison, the accuracy, using the same features, is almost the same.
These style-verification experiments are part of our larger goal of style identification and forgery detection.
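A sketch of the pair-difference idea, with made-up two-dimensional features and a nearest-mean dichotomizer standing in for whatever classifier was actually used in the experiments:

```python
import numpy as np

def pair_difference(f1, f2):
    """Feature for a pair of character images: absolute difference of the
    per-image feature vectors (illustrative features, not the paper's)."""
    return np.abs(np.asarray(f1) - np.asarray(f2))

class NearestMeanDichotomizer:
    """Minimal same-work vs. different-work classifier on difference vectors."""
    def fit(self, diffs, labels):
        diffs, labels = np.asarray(diffs), np.asarray(labels)
        self.mean_same = diffs[labels == 1].mean(axis=0)
        self.mean_diff = diffs[labels == 0].mean(axis=0)
        return self
    def predict(self, d):
        d = np.asarray(d)
        return int(np.linalg.norm(d - self.mean_same) <
                   np.linalg.norm(d - self.mean_diff))

# Toy training pairs: label 1 = same work, 0 = different works.
train = [([0.5, 0.2], [0.52, 0.21], 1), ([0.5, 0.2], [0.8, 0.6], 0),
         ([0.3, 0.7], [0.31, 0.69], 1), ([0.3, 0.7], [0.1, 0.2], 0)]
X = [pair_difference(a, b) for a, b, _ in train]
y = [lbl for _, _, lbl in train]
clf = NearestMeanDichotomizer().fit(X, y)
print(clf.predict(pair_difference([0.4, 0.5], [0.41, 0.52])))   # 1 = same work
```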
The essential layout attributes of a visual table can be defined by the location of four critical grid cells. Although these
critical cells can often be located by automated analysis, some means of human interaction is necessary for correcting
residual errors. VeriClick is a macro-enabled spreadsheet interface that provides ground-truthing, confirmation,
correction, and verification functions for CSV tables. All user actions are logged. Experimental results from seven subjects
on one hundred tables suggest that VeriClick can provide a ten- to twenty-fold speedup over performing the same
functions with standard spreadsheet editing commands.
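One plausible reading of the critical-cell idea, assuming two of the four cells give the row and column indices of the first and last data cells of a CSV grid; the file name and indices below are hypothetical, and the actual definition used by VeriClick may differ:

```python
import csv

def segment_table(rows, first_data, last_data):
    """Slice a rectangular CSV grid into column-header, row-header, and data
    regions, given the (row, col) indices of the first and last data cells."""
    r0, c0 = first_data
    r1, c1 = last_data
    column_headers = [row[c0:c1 + 1] for row in rows[:r0]]
    row_headers = [row[:c0] for row in rows[r0:r1 + 1]]
    data = [row[c0:c1 + 1] for row in rows[r0:r1 + 1]]
    return column_headers, row_headers, data

with open("table.csv", newline="") as f:   # hypothetical input file
    grid = list(csv.reader(f))
col_hdr, row_hdr, data = segment_table(grid, first_data=(2, 1), last_data=(10, 5))
```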
Photocopies of the ballots challenged in the 2008 Minnesota elections, which constitute a public
record, were scanned on a high-speed scanner and made available on a public radio website. The
PDF files were downloaded, converted to TIF images, and posted on the PERFECT website. Based
on a review of relevant image-processing aspects of paper-based election machinery and on
additional statistics and observations on the posted sample data, robust tools were developed for
determining the underlying grid of the targets on these ballots regardless of skew, clipping, and
other degradations caused by high-speed copying and digitization. The accuracy and robustness of
a method based on both index-marks and oval targets are demonstrated on 13,435 challenged
ballot page images.
Analyzing paper-based election ballots requires finding all marks added to the base ballot. The position, size, shape,
rotation and shade of these marks are not known a priori. Scanned ballot images have additional differences from the
base ballot due to scanner noise. Different image processing techniques are evaluated to see under what conditions they
are able to detect which sorts of marks. Basing mark detection on the difference of raw images was found to be much more sensitive to the darkness of the marks. Converting the raw images to foreground and background and then removing the form produced better results.
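The two approaches compared above can be sketched as follows, assuming aligned 8-bit grayscale images of the filled ballot and the blank base ballot; the thresholds are illustrative:

```python
import numpy as np

def binarize(img, threshold=128):
    """Foreground (ink) = 1, background = 0, for an 8-bit grayscale image."""
    return (np.asarray(img) < threshold).astype(np.uint8)

def marks_by_raw_difference(filled, blank, threshold=30):
    """Baseline: threshold the difference of the raw images. This is the
    variant found to be more sensitive to how dark the added marks are."""
    diff = np.abs(filled.astype(int) - blank.astype(int))
    return (diff > threshold).astype(np.uint8)

def marks_by_form_removal(filled, blank):
    """Binarize both images, then remove the blank form's foreground from the
    filled ballot's foreground; what remains are the added marks."""
    fg_filled = binarize(filled).astype(int)
    fg_blank = binarize(blank).astype(int)
    return np.clip(fg_filled - fg_blank, 0, 1).astype(np.uint8)
```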
The fifteenth anniversary of the first SPIE symposium on Document Recognition and Retrieval (titled Character Recognition Technologies)
provides an opportunity to examine DRR's contributions to
the development of document technologies. Many of the tools taken for granted today, including
workable general purpose OCR, large-scale, semi-automatic forms processing, inter-format table
conversion, and text mining, followed research presented at this venue. This occasion also affords an
opportunity to offer tribute to the conference organizers and proceedings editors and to the coterie of
professionals who regularly participate in DRR.
The error rate can be considerably reduced on a style-consistent document if its style is identified and the right
style-specific classifier is used. Since in some applications both machines and humans have difficulty in identifying
the style, we propose a strategy to improve the accuracy of style-constrained classification by enlisting the human
operator to identify the labels of some characters selected by the machine. We present an algorithm to select the
set of characters that is likely to reduce the error rate on unlabeled characters by utilizing the labels to reclassify
the remaining characters. We demonstrate the efficacy of our algorithm on simulated data.
Binary classifiers (dichotomizers) are combined for multi-class classification. Each region formed by the pairwise
decision boundaries is assigned to the class with the highest frequency of training samples in that region. With
more samples and classifiers, the frequencies converge to increasingly accurate non-parametric estimates of the
posterior class probabilities in the vicinity of the decision boundaries. The method is applicable to non-parametric
discrete or continuous class distributions dichotomized by either linear or non-linear classifiers (like support
vector machines). We present a formal description of the method and place it in context with related methods.
We present experimental results on machine-printed and handwritten digits that demonstrate the viability of
frequency coding in a classification task.
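A small sketch of frequency coding with two illustrative linear dichotomizers: each training sample's region is encoded by the signs of all pairwise decision functions, and each region is then assigned to its most frequent class (data, boundaries, and class labels below are made up):

```python
import numpy as np
from collections import Counter, defaultdict

def region_signature(x, dichotomizers):
    """Encode the region a sample falls in by the signs of all pairwise
    decision functions (here, linear dichotomizers f(x) = w.x + b)."""
    return tuple(int(np.dot(w, x) + b > 0) for w, b in dichotomizers)

def fit_frequency_code(X, y, dichotomizers):
    """Assign each region to the class with the highest frequency of
    training samples falling in that region."""
    counts = defaultdict(Counter)
    for x, label in zip(X, y):
        counts[region_signature(x, dichotomizers)][label] += 1
    return {sig: c.most_common(1)[0][0] for sig, c in counts.items()}

def classify(x, dichotomizers, region_to_class, default=None):
    return region_to_class.get(region_signature(x, dichotomizers), default)

# Two toy linear dichotomizers on 2-D points; three classes 0, 1, 2.
dichos = [(np.array([1.0, 0.0]), -0.5), (np.array([0.0, 1.0]), -0.5)]
X = [np.array(p) for p in [(0.2, 0.2), (0.8, 0.2), (0.2, 0.8), (0.9, 0.9)]]
y = [0, 1, 2, 2]
table = fit_frequency_code(X, y, dichos)
print(classify(np.array([0.85, 0.1]), dichos, table))   # -> 1
```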
Symbolic indirect correlation (SIC) is a new approach for bringing lexical context into the recognition of unsegmented signals that represent words or phrases in printed or spoken form. One way of viewing the SIC problem is to find the correspondence, if one exists, between two bipartite graphs, one representing the matching of the two lexical strings and the other representing the matching of the two signal strings. While perfect matching cannot be expected with real-world signals and while some degree of mismatch is allowed for in the second stage of SIC, such errors, if they are too numerous, can present a serious impediment to a successful implementation of the concept. In this paper, we describe a framework for evaluating the effectiveness of SIC match graph generation and examine the relatively simple, controlled case of synthetic images of text strings typeset both normally and in a highly condensed fashion. We quantify and categorize the errors that arise and present a variety of techniques we have developed for visualizing the intermediate results of the SIC process.
The notion of assigning every piece of paper that passes through a printer a unique ID encoded either on the surface or in the substrate of the page, regardless of its intended use or perceived importance, could prove to be a breakthrough of magnitude comparable to the now ubiquitous concept of referencing a webpage through the use of its Uniform Resource Locator (URL). We see many opportunities for using chipless ID in the world of everyday documents, but also many challenges. In this paper, we begin to explore the ways this new technology can be used to enable advanced document management functions, along with its implications for the ways in which people use documents.
Symbolic Indirect Correlation (SIC) is a new classification method for unsegmented patterns. SIC requires two levels of comparisons. First, the feature sequences from an unknown query signal and a known multi-pattern reference signal are matched. Then, the order of the matched features is compared with the order of matches between every lexicon symbol-string and the reference string in the lexical domain. The query is classified according to the best matching lexicon string in the second comparison. Accuracy increases as classified feature-and-symbol strings are added to the reference string.
We offer a perspective on the performance of current OCR systems by illustrating and explaining actual OCR errors made by three commercial devices. After discussing briefly the character recognition abilities of humans and computers, we present illustrated examples of recognition errors. The top level of our taxonomy of the causes of errors consists of Imaging Defects, Similar Symbols, Punctuation, and Typography. The analysis of a series of 'snippets' from this perspective provides insight into the strengths and weaknesses of current systems, and perhaps a road map to future progress. The examples were drawn from the large-scale tests conducted by the authors at the Information Science Research Institute of the University of Nevada, Las Vegas. By way of conclusion, we point to possible approaches for improving the accuracy of today's systems. The talk is based on our eponymous monograph, recently published in The Kluwer International Series in Engineering and Computer Science, Kluwer Academic Publishers, 1999.
We have developed a practical scheme to take advantage of local typeface homogeneity to improve the accuracy of a character classifier. Given a polyfont classifier which is capable of recognizing any of 100 typefaces moderately well, our method allows it to specialize itself automatically to the single -- but otherwise unknown -- typeface it is reading. Essentially, the classifier retrains itself after examining some of the images, guided at first by the preset classification boundaries of the given classifier, and later by the behavior of the retrained classifier. Experimental trials on 6.4 M pseudo-randomly distorted images show that the method improves on 95 of the 100 typefaces. It reduces the error rate by a factor of 2.5, averaged over 100 typefaces, when applied to an alphabet of 80 ASCII characters printed at ten point and digitized at 300 pixels/inch. This self-correcting method complements, and does not hinder, other methods for improving OCR accuracy, such as linguistic contextual analysis.
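The self-specialization loop can be approximated by decision-directed adaptation: the polyfont classifier pseudo-labels the document's own character images and retrains on its most confident decisions. The sketch below assumes a scikit-learn-style classifier exposing fit, predict_proba, and classes_; it is a simplification of the method described above, not its actual implementation:

```python
import numpy as np

def self_specialize(classifier, images, confidence_threshold=0.9, rounds=3):
    """Decision-directed adaptation: pseudo-label the document's own character
    images with the current classifier and retrain on the confident ones.
    A real implementation would also keep the original training data, so the
    classifier is only nudged toward the single typeface, not overwritten."""
    for _ in range(rounds):
        proba = classifier.predict_proba(images)
        confident = proba.max(axis=1) >= confidence_threshold
        if not confident.any():
            break
        pseudo_labels = classifier.classes_[proba.argmax(axis=1)]
        classifier.fit(images[confident], pseudo_labels[confident])
    return classifier
```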
Characterization of document image blocks using projection profiles is proposed. We have collected statistical information on scanned pages of technical articles as a by-product of digitized document analysis. Specifically, 65 training and test pages from two publications were used in our hierarchical (syntactic) block segmentation and labeling approach. Additional information on compression and profiles was also collected. Pixel-level information is required as input regardless of whether the analyzing tool is an expert system or some other method. The issues covered are: (1) profile characteristics of document objects such as text, line drawings, tables, and half-tones, and the variation of these profiles with block size, type size, and direction of scan; (2) speckle noise: sizes and distribution; (3) for the hierarchical (syntactic) approach, the number of tree nodes at each level along with their areas, and a comparison with node areas derived from transition-cut trees; (4) CCITT Group 4 compression statistics on document sub-blocks and whole pages; (5) the size of PostScript files and the PostScript commands used in printing these page files. We believe that these results allow prediction of some characteristics of a printed page digitized at any specified sampling rate.
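Projection profiles of a binary block are simply the row and column sums of its foreground pixels; a minimal example with a synthetic block (the "text lines" are made up):

```python
import numpy as np

def projection_profiles(block):
    """Horizontal and vertical projection profiles of a binary image block
    (foreground = 1): counts of foreground pixels per row and per column."""
    block = np.asarray(block)
    return block.sum(axis=1), block.sum(axis=0)

block = np.zeros((10, 20), dtype=np.uint8)
block[2:4, 1:18] = 1      # a "text line"
block[6:8, 1:18] = 1      # another "text line"
rows, cols = projection_profiles(block)
print(rows)               # peaks at the two text lines, zeros in the gaps
```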