This PDF file contains the front matter associated with SPIE
Proceedings Volume 6500, including the Title Page, Copyright
information, Table of Contents, Introduction (if any), and the
Conference Committee listing.
Optical Character Recognition is much more than character classification. An industrial OCR application combines
algorithms studied in detail by different researchers in the area of image processing, pattern recognition, machine
learning, language analysis, document understanding, data mining, and other artificial intelligence domains. There is no
single perfect algorithm for any of the OCR problems, so modern systems try to adapt themselves to the actual features
of the image or document to be recognized. This paper describes the architecture of a modern OCR system with an
emphasis on this adaptation process.
In this paper, we propose a shape representation and description well adapted to pattern recognition, particularly in the
context of affine shape transformations. The proposed approach operates from a single closed contour. The
parameterized contour is convolved with a Gaussian kernel. The curvature is calculated to determine the inflexion
points and the main significant ones are kept by using a threshold defined by observing a segment-length between two
curvature zero-crossing points. Then this filtered and simplified shape is registered with the original one. Finally, we
separately calculate the areas between the two segments corresponding to these two scale-space representations. The
proposed descriptor is a vector whose components are derived from each segment and its corresponding area. This article
develops three new concepts: 1) it compares the same segment under different scale representations; 2) it chooses the
appropriate scales by applying a threshold to the shape's shortest segment; 3) it proposes the algorithm and the
conditions for merging and removing short segments. An experimental evaluation of robustness under affine
transformations is presented on a shape database.
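The first stages of the pipeline described above (Gaussian smoothing of a parameterized closed contour, curvature computation, and inflexion detection at curvature zero-crossings) can be sketched as follows. This is a minimal illustration under assumed choices (discrete central differences, a simple truncated Gaussian kernel); it does not implement the paper's segment-length thresholding or area computation.

```python
import math

def gaussian_kernel(sigma, radius=None):
    # Discrete Gaussian kernel, normalized to sum to 1.
    radius = radius or int(3 * sigma)
    k = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def smooth_closed(signal, kernel):
    # Circular convolution: the contour is closed, so indices wrap around.
    n, r = len(signal), len(kernel) // 2
    return [sum(kernel[j + r] * signal[(i + j) % n] for j in range(-r, r + 1))
            for i in range(n)]

def curvature(xs, ys):
    # k = (x'y'' - y'x'') / (x'^2 + y'^2)^(3/2), via central differences.
    n = len(xs)
    ks = []
    for i in range(n):
        xp = (xs[(i + 1) % n] - xs[i - 1]) / 2.0
        yp = (ys[(i + 1) % n] - ys[i - 1]) / 2.0
        xpp = xs[(i + 1) % n] - 2 * xs[i] + xs[i - 1]
        ypp = ys[(i + 1) % n] - 2 * ys[i] + ys[i - 1]
        denom = (xp * xp + yp * yp) ** 1.5 or 1e-12
        ks.append((xp * ypp - yp * xpp) / denom)
    return ks

def zero_crossings(ks):
    # Indices where curvature changes sign: candidate inflexion points.
    n = len(ks)
    return [i for i in range(n) if ks[i] * ks[(i + 1) % n] < 0]
```

On a convex contour such as a sampled circle, the curvature stays positive and no inflexion points are reported, which is the expected sanity check before applying the method to real shapes.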
Binary classifiers (dichotomizers) are combined for multi-class classification. Each region formed by the pairwise
decision boundaries is assigned to the class with the highest frequency of training samples in that region. With
more samples and classifiers, the frequencies converge to increasingly accurate non-parametric estimates of the
posterior class probabilities in the vicinity of the decision boundaries. The method is applicable to non-parametric
discrete or continuous class distributions dichotomized by either linear or non-linear classifiers (like support
vector machines). We present a formal description of the method and place it in context with related methods.
We present experimental results on machine-printed and handwritten digits that demonstrate the viability of
frequency coding in a classification task.
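The region-frequency idea above can be illustrated with a toy sketch: each region is identified by the sign pattern of the dichotomizers' decision functions, and is assigned the majority class of the training samples falling in it. The 1-D linear decision functions in the usage below are hypothetical, chosen only to make the regions easy to see.

```python
from collections import Counter, defaultdict

def region_code(x, dichotomizers):
    # The sign pattern of all binary decision functions identifies the region.
    return tuple(1 if f(x) > 0 else 0 for f in dichotomizers)

def fit_region_table(samples, labels, dichotomizers):
    # Count training labels per region; each region gets its majority class,
    # a non-parametric estimate of the locally most probable class.
    counts = defaultdict(Counter)
    for x, y in zip(samples, labels):
        counts[region_code(x, dichotomizers)][y] += 1
    return {r: c.most_common(1)[0][0] for r, c in counts.items()}

def classify(x, table, dichotomizers, default=None):
    # A test point inherits the label of the region it falls in.
    return table.get(region_code(x, dichotomizers), default)
```

With more training samples, each region's majority vote approaches the true posterior-maximizing class near the decision boundaries, which is the convergence property claimed in the abstract.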
Although modern OCR technology is capable of handling a wide variety of document images, there is no single
OCR engine that performs equally well on all documents for a given single language script. Naturally, each OCR
engine has its strengths and weaknesses, and therefore different engines tend to differ in the accuracy on different
documents, and in the errors on the same document image. While the idea of using multiple OCR engines
to boost output accuracy is not new, most of the existing systems do not go beyond variations on majority
voting. While this approach may work well in many cases, it has limitations, especially when OCR technology
used to process a given script has not yet fully matured. Our goal is to develop a system called MEMOE (for
"Multi-Evidence Multi-OCR-Engine") that combines, in an optimal or near-optimal way, output streams of
one or more OCR engines together with various types of evidence extracted from these streams as well as from
original document images, to produce output of higher quality than that of the individual OCR engines, or of
majority voting applied to multiple OCR output streams. Furthermore, we aim to improve the accuracy of OCR
output on images that might otherwise have low accuracy that significantly impacts downstream processing.
The MEMOE system functions as an OCR engine taking document images and some configuration parameters
as input and producing a single output text stream. In this paper, we describe the design of the system, various
evidence types and how they are incorporated into MEMOE in the form of filters. Results of initial tests that
involve two corpora of Arabic documents show that, even in its initial configuration, the system is superior to a
voting algorithm and that even more improvement may be achieved by incorporating additional evidence types
into the system.
The error rate can be considerably reduced on a style-consistent document if its style is identified and the right
style-specific classifier is used. Since in some applications both machines and humans have difficulty in identifying
the style, we propose a strategy to improve the accuracy of style-constrained classification by enlisting the human
operator to identify the labels of some characters selected by the machine. We present an algorithm to select the
set of characters that is likely to reduce the error rate on unlabeled characters by utilizing the labels to reclassify
the remaining characters. We demonstrate the efficacy of our algorithm on simulated data.
We present a distributed system to extract text contained in natural scenes within consumer photographs. The
objective is to automatically annotate pictures in order to make consumer photo sets searchable based on the image content. The system is designed to process a large volume of photos by quickly isolating candidate text
regions, and successively cascading them through a series of text recognition engines which jointly make a decision
on whether or not the region contains text that is readable by OCR. In addition, a dedicated rejection engine is
built on top of each text recognizer to adapt its confidence measure to the specifics of the task. The resulting
system achieves a very high text retrieval rate and data throughput with a very small false detection rate.
The objective of the character recognition effort for the Archimedes Palimpsest is to provide a tool that allows scholars of ancient Greek mathematics to retrieve as much information as possible from the remaining degraded text. With this in mind, the current pattern recognition system does not output a single classification decision, as in typical target detection problems, but has been designed to provide intermediate results that allow the user to apply his or her own decisions (or evidence) to arrive at a conclusion. To achieve this result, a probabilistic network has been incorporated into our previous recognition system, which was based primarily on spatial correlation techniques. This paper reports on the revised tool and its recent success in the transcription process.
Post-processing of OCR is a bottleneck of the document image processing system. Proofreading is necessary since the
current recognition rate is not high enough for publishing. The OCR system provides every recognition result with a confident
or unconfident label. People only need to check unconfident characters while the error rate of confident characters is low
enough for publishing. However, the current algorithm marks too many unconfident characters, so optimization of OCR
results is required. In this paper we propose an algorithm based on pattern matching to decrease the number of
unconfident results. If an unconfident character matches a confident character well, its label could be changed into a
confident one. Pattern matching makes use of original character images, so it can reduce problems caused by image
normalization and scanning noise. We introduce WXOR, WAN, and four-corner-based pattern matching to improve the
effect of matching, and introduce confidence analysis to reduce the errors of similar characters. Experimental results
show that our algorithm achieves improvements of 54.18% in the first image set that contains 102,417 Chinese
characters, and 49.85% in the second image set that contains 53,778 Chinese characters.
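The WXOR (weighted XOR) matching mentioned above can be illustrated with one common formulation, in which each mismatching pixel is weighted by the number of mismatches in its 3x3 neighbourhood, so isolated scanning noise counts less than solid structural differences. The exact weighting the authors use may differ; this is a sketch of the general idea.

```python
def wxor(a, b):
    # Weighted XOR dissimilarity between two same-size binary images.
    # Each mismatching pixel contributes the count of mismatches in its
    # 3x3 neighbourhood, penalizing clustered (structural) differences.
    h, w = len(a), len(a[0])
    diff = [[a[y][x] != b[y][x] for x in range(w)] for y in range(h)]
    score = 0
    for y in range(h):
        for x in range(w):
            if diff[y][x]:
                score += sum(diff[yy][xx]
                             for yy in range(max(0, y - 1), min(h, y + 2))
                             for xx in range(max(0, x - 1), min(w, x + 2)))
    return score
```

Under this scheme an unconfident character whose WXOR distance to some confident character falls below a threshold could be promoted to confident, which is the relabeling step the abstract describes.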
Distortion correction methods for digital camera document images of thick volumes or curved papers become important
for camera-based document recognition technologies. In this paper we propose a novel distortion correction method for
digital camera document images based on "shape from parallel geodesics." This method considers the following features:
parallel lines corresponding to character strings or ruled lines of tables on extended surface become parallel geodesics on
a curved paper surface, and a smoothly curved paper can be modeled by a ruled surface, which is a swept surface of rulings.
The projected geodesics and rulings exist in the input image derived from perspective transformation. The presented
method extracts the projected geodesics, estimates the projected rulings in the input image, estimates the ruled surface
that models the curved paper, and generates the corrected image, in this order. The projected rulings are estimated by the
condition derived from only parallelism of geodesics without the requirements for equal spacing. This method can
estimate the ruled surface model directly by numerical operations of differentiation, integration and matrix inversion
without any iterative calculation. We also report on experiments that show the effectiveness of the proposed method.
The Archimedes Palimpsest is one of the most significant texts in the history of science. Much of
the text has been read using images of reflected visible light and visible light produced by
ultraviolet fluorescence. However, these techniques do not perform well on the four pages of the
manuscript that are obscured by forged icons that were painted over these pages during the first
half of the 20th century. X-ray fluorescence images of one of these pages have been processed
using spectral pattern recognition techniques developed for environmental remote sensing to
recover the original texts beneath the paint.
Poor quality documents are obtained in various situations such as historical document collections, legal archives,
security investigations, and documents found in clandestine locations. Such documents are often scanned for
automated analysis, further processing, and archiving. Due to the nature of such documents, degraded document
images are often hard to read, have low contrast, and are corrupted by various artifacts. We describe
a novel approach for the enhancement of such documents based on probabilistic models which increases the
contrast, and thus, readability of such documents under various degradations. The enhancement produced by
the proposed approach can be viewed under different viewing conditions if desired. The proposed approach was
evaluated qualitatively and compared to standard enhancement techniques on a subset of historical documents
obtained from the Yad Vashem Holocaust museum. In addition, quantitative performance was evaluated based
on synthetically generated data corrupted under various degradation models. Preliminary results demonstrate
the effectiveness of the proposed approach.
The aim of this scientific work is to propose a suitable assistance tool for palaeographers and
historians to help them in their intuitive and empirical work of identification of writing styles (for medieval
handwritings) and authentication of writers (for humanistic manuscripts). We propose a global approach to
writer classification based on Curvelet features related to two discriminative shape properties, curvature and
orientation. These features reveal structural and directional micro-shapes, as well as concavities that capture
the finest variations in the contour. The Curvelet-based analysis leads to the construction
of a compact Log-polar signature for each writing. The relevance of the signature is quantified with a CBIR
(content-based image retrieval) system that compares query images against candidate database images. The main
experimental results are very promising and show 78% of good retrieval (as precision) on the Middle-Ages
database and 89% on the humanistic database.
We present a method of interactive training for handwriting recognition in collections of documents. As the user transcribes (labels) the words in the training set, words are automatically skipped if they appear to match words that are already transcribed. By reducing the amount of redundant training, better coverage of the data is achieved, resulting in more accurate recognition. Using word-level features for training and recognition in a collection of George Washington's manuscripts, the recognition ratio is approximately 2%-8% higher after training with our interactive method than after training the same number of words sequentially. Using our approach, less training is required to achieve an equivalent recognition ratio. A slight improvement in recognition ratio is also observed when using our method on a second data set, which consists of several pages from a diary written by Jennie Leavitt Smith.
We describe a system for recognizing online, handwritten mathematical expressions. The system is designed with a user-interface
for writing scientific articles, supporting the recognition of basic mathematical expressions as well as integrals,
summations, matrices etc. A feed-forward neural network recognizes symbols which are assumed to be single-stroke and
a recursive algorithm parses the expression by combining neural network output and the structure of the expression.
Preliminary results show that writer-dependent recognition rates are very high (99.8%) while writer-independent symbol
recognition rates are lower (75%). The interface associated with the proposed system integrates the built-in recognition
capabilities of Microsoft's Tablet PC API for recognizing textual input and supports conversion of hand-drawn
figures into PNG format. This enables the user to enter text, mathematics and draw figures in a single interface. After
recognition, all output is combined into a single LaTeX document and compiled into a PDF file.
We investigate in this paper the application of dynamic Bayesian networks (DBNs) to the recognition
of handwritten digits. The main idea is to couple two separate HMMs into various architectures. First,
a vertical HMM and a horizontal HMM are built observing the evolving streams of image columns and
image rows respectively. Then, two coupled architectures are proposed to model interactions between
these two streams and to capture the 2D nature of character images. Experiments performed on the
MNIST handwritten digit database show that coupled architectures yield better recognition performances
than non-coupled ones. Additional experiments conducted on artificially degraded (broken)
characters demonstrate that coupled architectures cope better with such degradation than both non-coupled
ones and discriminative methods such as SVMs.
Google Book Search is working with libraries and publishers around the world to digitally scan books. Some of those
works are now in the public domain and, in keeping with Google's mission to make all the world's information useful and
universally accessible, we wish to allow users to download them all.
For users, it is important that the files are as small as possible and of printable quality. This means that a single codec
for both text and images is impractical. We use PDF as a container for a mixture of JBIG2 and JPEG2000 images which
are composed into a final set of pages.
We discuss both the implementation of an open source JBIG2 encoder, which we use to compress text data, and the
design of the infrastructure needed to meet the technical, legal and user requirements of serving many scanned works.
We also cover the lessons learnt about dealing with different PDF readers and how to write files that work on most of the
readers, most of the time.
This paper reports on novel and traditional pixel and semantic operations using a recently standardized document representation called JPM. The JPM representation uses compressed pixel arrays for all visible elements on a page. Separate data containers called boxes provide the layout and additional semantic information. JPM and related image-based document representation standards were designed to obtain the most rate efficient document compression. The authors, however, use this representation directly for operations other than compression typically performed either on pixel arrays or semantic forms. This paper describes the image representation used in the JPM standard and presents techniques to (1) perform traditional raster-based document analysis on the compressed data, (2) transmit semantically meaningful portions of compressed data between devices, (3) create multiple views from one compressed data stream, and (4) edit high resolution document images with only low resolution proxy images.
In order to present most XML documents for human consumption, formatting information must be introduced and
applied. Formatting is typically done through a style sheet, however, it is conceivable that one could wish to view the
document without having a style sheet (either because a style sheet does not exist, or is unavailable, or is inappropriate
for the display device). This paper describes a method for formatting structured documents without a provided style
sheet. The idea is to first analyze the document to determine structures and features that might be relevant to style
decisions. A transformation can be constructed to convert the original document to a generic form that captures the
semantics that will be expressed through formatting and style. In the second stage styling is applied to the structures and
features that have been discovered by applying a pre-defined style sheet for the generic form. The document instance,
and if available, the corresponding schema or DTD can be analyzed in order to construct the transformation. This paper
will describe the generic form used for formatting and techniques for generating transformations to it.
Digital publishing workflows usually require composition and balance within the document: certain
photographs must be chosen according to the overall layout of the document they are to be placed in, i.e., the
composition within the photograph will have a relationship and balance with the rest of the document layout.
This paper presents a novel image retrieval method, in which the document where the image is to be inserted is used as
query. The algorithm calculates a balance measure between the document and each of the images in the collection,
retrieving the ones that have a higher balance score. The image visual weight map, used in the balance calculation, has
been successfully approximated by a new image quality map that takes into consideration sharpness, contrast and
Professional authoring environments are used by Graphic Artists (GA) during the design phase of any publication type.
With the increasing demand for supporting Variable Data Print (VDP) designs, these authoring environments require
new capabilities. The recurring challenge is to provide flexible VDP features that can be represented using several
VDP-enabling XML-based formats. Considering the different internal structures of the authoring environments, a
common platform needs to be established. The solution must, at the same time, empower the GA with a rich VDP
feature set while generating a range of output formats that drive their respective VDP workflows.
We have designed a common architecture to collect the required data from the hosting application and a generic internal
representation that enables multiple XML output formats.
The purpose of this study is to document current cost-estimating practices used in commercial digital printing. A
research study was conducted to determine the use of cost-estimating in commercial digital printing companies. This
study answers the questions: 1) What methods are currently being used to estimate digital printing? 2) What is the
relationship between estimating and pricing digital printing? 3) To what extent, if at all, do digital printers use full-absorption,
all-inclusive hourly rates for estimating?
Three different digital printing models were identified: 1) Traditional print providers, who supplement their offset
presswork with digital printing for short-run color and versioned commercial print; 2) "Low-touch" print providers, who
leverage the power of the Internet to streamline business transactions with digital storefronts; 3) Marketing solutions
providers, who see printing less as a discrete manufacturing process and more as a component of a complete marketing
campaign. Each model approaches estimating differently.
Understanding and predicting costs can be extremely beneficial. Establishing a reliable system to estimate those costs
can be somewhat challenging though. Unquestionably, cost-estimating digital printing will increase in relevance in the
years ahead, as margins tighten and cost knowledge becomes increasingly critical.
List fusion is a critical problem in information retrieval. Approaches that use uniform weights for list fusion ignore the correctness, importance and individuality of the various detectors for a concrete application. In this paper, we propose a nonuniform, rational, optimized paradigm for TRECVid list fusion, which is expected to faithfully preserve the precision of the outcomes and reach the maximum Average Precision (A.P.). We exhaustively search for the parametric set yielding the best A.P. in the space spanned by the feature vectors. To accelerate the fusion of the input score lists, we train our model on the training data set and apply the learnt parameters to fuse new vectors. We employ nonuniform rational blending functions; the advantage of this fusion is that the problem of weight selection is converted into one of parameter selection in the space of nonuniform rational functions. The high precision, multiple resolution, and controllable, stable attributes of rational functions are helpful in parameter selection, so the space for fusion-weight selection becomes large. The correctness of our proposal is compared with and verified against average and linear fusion results.
MEDLINE(R) is the premier bibliographic online database of the National Library of Medicine, containing approximately
14 million citations and abstracts from over 4,800 biomedical journals. This paper presents an automated method based
on support vector machines to identify a "comment-on" list, which is a field in a MEDLINE citation denoting previously
published articles commented on by a given article. For comparative study, we also introduce another method based on
scoring functions that estimate the significance of each sentence in a given article. Preliminary experiments conducted
on HTML-formatted online biomedical documents collected from 24 different journal titles show that the support vector
machine with polynomial kernel function performs best in terms of recall and F-measure rates.
Application-relevant text data are very useful in various natural language applications. Using them can achieve
significantly better performance in vocabulary selection and language modeling, which are widely employed in automatic
speech recognition, intelligent input methods, etc. In some situations, however, relevant data are hard to collect, and
the scarcity of application-relevant training text hampers these natural language processing tasks. In this paper,
using only a small set of application-specific text and combining unsupervised text clustering with text retrieval
techniques, the proposed approach finds relevant text in an unorganized large-scale corpus and thereby adapts the
training corpus towards the application area of interest. We evaluate the relevance of the acquired text, and thus
validate the effectiveness of our corpus adaptation approach, with the performance of an n-gram statistical language
model trained on the retrieved text and tested on the application-specific text. The language models trained
from the ranked text bundles present well-discriminated perplexities on the application-specific text. Preliminary
experiments on short-message text and an unorganized large corpus demonstrate the performance of the proposed methods.
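The ranking step above can be sketched with a tiny n-gram example: train a language model on the application-specific seed text and order candidate text bundles by their perplexity under it (lower perplexity = more relevant). The bigram order and add-alpha smoothing here are assumptions for illustration, not the paper's exact model.

```python
import math
from collections import Counter

def bigram_lm(tokens, alpha=1.0):
    # Add-alpha smoothed bigram model trained on the application-specific seed text.
    vocab_size = len(set(tokens))
    bi = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens[:-1])
    def logprob(w1, w2):
        return math.log((bi[(w1, w2)] + alpha) / (uni[w1] + alpha * vocab_size))
    return logprob

def perplexity(lm, tokens):
    # Standard per-bigram perplexity of a candidate text bundle.
    pairs = list(zip(tokens, tokens[1:]))
    ll = sum(lm(a, b) for a, b in pairs)
    return math.exp(-ll / len(pairs))

def rank_bundles(seed_tokens, bundles):
    # Lower perplexity under the seed LM = more relevant to the target domain.
    lm = bigram_lm(seed_tokens)
    return sorted(bundles, key=lambda b: perplexity(lm, b))
```

A bundle sharing vocabulary and word order with the seed text scores a markedly lower perplexity than an off-domain bundle, which is the "well-discriminated perplexities" effect the abstract reports.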
Document recognition advances have improved the lives of people with print disabilities, by providing accessible
documents. This invited paper provides perspectives on the author's career progression from document recognition
professional to social entrepreneur applying this technology to help people with disabilities. Starting with initial
thoughts about optical character recognition in college, it continues with the creation of accurate omnifont character
recognition that did not require training. It was difficult to make a reading machine for the blind in a commercial setting,
which led to the creation of a nonprofit social enterprise to deliver these devices around the world. This network of
people with disabilities scanning books drove the creation of Bookshare.org, an online library of scanned books.
Looking forward, the needs for improved document recognition technology to further lower the barriers to reading are
discussed. Document recognition professionals should be proud of the positive impact their work has had on some of
society's most disadvantaged communities.
Extraction of metadata from documents is a tedious and expensive process. In general, documents are manually reviewed for structured data such as title, author, date, organization, etc. The purpose of extraction is to build metadata for documents that can be used when formulating structured queries. In many large document repositories such as the National Library of Medicine (NLM)1 or university libraries, the extraction task is a daily process that spans decades. Although some automation is used during the extraction process, generally, metadata extraction is a manual task. Aside from the cost and labor time, manual processing is error-prone and requires many levels of quality control. Recent advances in extraction technology, as reported at the Message Understanding Conference (MUC),2 are comparable with extraction performed by humans. In addition, many organizations use historical data for lookup to improve the quality of extraction. For the large government document repository we are working with, the task involves extraction of several fields from millions of OCR'd and electronic documents. Since this project is time-sensitive, automatic extraction turns out to be the only viable solution. There are more than a dozen fields associated with each document that require extraction. In this paper, we report on the extraction and generation of the title field.
We address the problem of content-based image retrieval in the context of complex document images. Complex
documents typically start out on paper and are then electronically scanned. These documents have rich internal
structure and might only be available in image form. Additionally, they may have been produced by a combination
of printing technologies (or by handwriting); and include diagrams, graphics, tables and other non-textual
elements. Large collections of such complex documents are commonly found in legal and security investigations.
The indexing and analysis of large document collections is currently limited to textual features based on OCR data
and ignores the structural context of the document as well as important non-textual elements such as signatures,
logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the
inherent complexity of offline handwriting recognition. We address important research issues concerning content-based
document image retrieval and describe a prototype we are developing for integrated retrieval and aggregation of
diverse information contained in scanned paper documents. Such complex document information
processing combines several forms of image processing together with textual/linguistic processing to enable
effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype
automatically generates rich metadata about a complex document and then applies query tools to integrate
the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are
developing a test collection containing millions of document images.
A new technique to segment a handwritten document into distinct lines of text is presented. Line segmentation
is the first and the most critical pre-processing step for a document recognition/analysis task. The proposed
algorithm starts by obtaining an initial set of candidate lines from the piece-wise projection profile of the
document. The lines traverse around any obstructing handwritten connected component by associating it with the
line above or below. The decision to associate such a component is made by (i) modeling the lines as bivariate
Gaussian densities and evaluating the probability of the component under each Gaussian, or (ii) the probability
obtained from a distance metric. The proposed method is robust to handle skewed documents and those with
lines running into each other. Experimental results show that on 720 documents (which includes English, Arabic
and children's handwriting) containing a total of 11,581 lines, 97.31% of the lines were segmented correctly. On
an experiment over 200 handwritten images with 78,902 connected components, 98.81% of them were associated
to the correct lines.
The paper describes the use of Conditional Random Fields (CRF) utilizing contextual information in automatically
labeling extracted segments of scanned documents as Machine-print, Handwriting and Noise. The result of
such a labeling can serve as an indexing step for a context-based image retrieval system or a biometric signature
verification system. A simple region growing algorithm is first used to segment the document into a number of
patches. A label for each such segmented patch is inferred using a CRF model. The model is flexible enough
to include signatures as a type of handwriting and isolate it from machine-print and noise. The robustness of
the model is due to the inherent nature of modeling neighboring spatial dependencies in the labels as well as
the observed data using CRF. Maximum pseudo-likelihood estimates for the parameters of the CRF model are
learnt using conjugate gradient descent. Inference of labels is done by computing the probability of the labels
under the model with Gibbs sampling. Experimental results show that this approach assigns correct labels to
95.75% of the data. The CRF-based model is shown to be superior to Neural Networks and Naive Bayes classifiers.
We describe a physical and logical layout analysis algorithm, which is applied to segment and label online medical
journal articles (regular HTML and PDF-Converted-HTML files). For these articles, the geometric layout of the Web
page is the most important cue for physical layout analysis. The key to physical layout analysis is then to render the
HTML file in a Web browser, so that the visual information in zones (composed of one or a set of HTML DOM nodes),
especially their relative position, can be utilized. The recursive X-Y cut algorithm is adopted to construct a hierarchical
zone tree structure. In logical layout analysis, both geometric and linguistic features are used. The HTML documents are
modeled by a Hidden Markov Model with 16 states, and the Viterbi algorithm is then used to find the optimal label
sequence, completing the logical layout analysis.
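The decoding step above is standard Viterbi over per-zone observations. The sketch below uses a made-up two-state model ('title'/'body' with invented emission probabilities) purely to show the mechanics; the paper's actual model has 16 states and real geometric/linguistic features.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    # Most probable label sequence for a sequence of zone observations.
    # log_emit[s] is a function returning the log-probability of an observation.
    V = [{s: log_start[s] + log_emit[s](obs[0]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
            ptr[s] = prev
            row[s] = V[-1][prev] + log_trans[prev][s] + log_emit[s](o)
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Hypothetical two-state layout model: 'title' zones tend to contain
# large text ('big'), 'body' zones small text ('small').
lg = math.log
STATES = ['title', 'body']
START = {'title': lg(0.9), 'body': lg(0.1)}
TRANS = {'title': {'title': lg(0.2), 'body': lg(0.8)},
         'body':  {'title': lg(0.1), 'body': lg(0.9)}}
EMIT_P = {'title': {'big': 0.8, 'small': 0.2},
          'body':  {'big': 0.1, 'small': 0.9}}
EMIT = {s: (lambda o, s=s: lg(EMIT_P[s][o])) for s in STATES}
```

Given a page whose first zone looks like a title and whose remaining zones look like body text, the decoder returns the label sequence one would expect.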
Handwriting recognition research requires large databases of word images each of which is labeled with the word it contains. Full images scanned in, however, usually contain sentences or paragraphs of writing. The creation of labeled databases of images of isolated words is usually tedious, requiring a person to drag a rectangle around each word in the full image and type in the label. Transcript mapping is the automatic alignment of words in a text file with word locations in the full image. It can ease the creation of databases for research. We propose the first transcript mapping method for handwritten Arabic documents. Our approach is based on Dynamic Time Warping (DTW) and offers two primary algorithmic contributions. First is an extension to DTW that uses true distances when mapping multiple entries from one series to a single entry in the second series. Second is a method to concurrently map elements of a partially aligned third series within the main alignment. Preliminary results are provided.
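The alignment core of the transcript-mapping approach above is DTW. The sketch below is the classic formulation only, without the paper's two extensions (true distances for multi-entry mappings and concurrent third-series alignment), and the scalar "word features" are placeholders for real image/transcript features.

```python
def dtw(a, b, dist):
    # Classic dynamic time warping between two sequences (e.g. transcript-word
    # features and image-word features); returns total cost and warping path.
    n, m = len(a), len(b)
    INF = float('inf')
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack the minimum-cost path from the end of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((D[i - 1][j - 1], i - 1, j - 1),
                   (D[i - 1][j], i - 1, j),
                   (D[i][j - 1], i, j - 1))
        i, j = step[1], step[2]
    return D[n][m], list(reversed(path))
```

Note that the path can map multiple entries of one series to a single entry of the other (two image words to one transcript word, say), which is exactly the case the paper's "true distances" extension refines.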
We report an investigation into strategies, algorithms, and software tools for document image content extraction
and inventory, that is, the location and measurement of regions containing handwriting, machine-printed text,
photographs, blank space, etc. We have developed automatically trainable methods, adaptable to many kinds
of documents represented as bilevel, greylevel, or color images, that offer a wide range of useful tradeoffs of
speed versus accuracy using methods for exact and approximate k-Nearest Neighbor classification. We have
adopted a policy of classifying each pixel (rather than regions) by content type: we discuss the motivation
and engineering implications of this choice. We describe experiments on a wide variety of document-image and
content types, and discuss performance in detail in terms of classification speed, per-pixel classification accuracy,
per-page inventory accuracy, and subjective quality of page segmentation. These show that even modest per-pixel
classification accuracies (of, e.g., 60-70%) support usefully high recall and precision rates (of, e.g., 80-90%)
for retrieval queries of document collections seeking pages that contain a given minimum fraction of a certain
type of content.
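The per-pixel classification policy above can be sketched with a brute-force exact k-NN over grey-level window features. This is an illustration only: the paper's methods use trainable exact and approximate k-NN structures for speed, and real features would be richer than a raw 3x3 window.

```python
from collections import Counter

def window_feature(img, y, x, r=1):
    # Feature for one pixel: the flattened (2r+1)x(2r+1) grey-level window,
    # clamped at the image border.
    h, w = len(img), len(img[0])
    return [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
            for dy in range(-r, r + 1) for dx in range(-r, r + 1)]

def knn_classify(feature, train, k=3):
    # Brute-force exact k-Nearest Neighbor vote over labeled feature vectors.
    dists = sorted((sum((f - g) ** 2 for f, g in zip(feature, feat)), label)
                   for feat, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Even this crude classifier separates dark text pixels from blank background in a synthetic image, and the speed/accuracy tradeoff the authors study corresponds to swapping the exact search here for approximate neighbor structures.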