This PDF file contains the front matter associated with SPIE Proceedings Volume 8658, including the Title Page, Copyright Information, Table of Contents, and the Conference Committee listing.
This paper describes the development history of the Tesseract OCR engine, and compares the methods to general
changes in the field over a similar time period. Emphasis is placed on the lessons learned with the goal of providing a
primer for those interested in OCR research.
This article presents a method to recognize and to localize semi-structured documents such as ID cards, tickets,
invoices, etc. Standard object recognition methods based on interest points work well on natural images but fail
on document images because of repetitive patterns like text. In this article, we propose an adaptation of object
recognition for image documents. The advantage of our method is that it uses neither character recognition nor segmentation and is robust to rotation, scale, illumination, blur, noise, and local distortions. Furthermore, tests show that an average precision of 97.2% and an average recall of 94.6% are obtained for matching 7 different kinds of documents in a database of 2155 documents.
This paper presents, for the first time, a unified recognition and retrieval system for isolated offline printed mathematical symbols. The system is based on a nearest-neighbor scheme and uses a modified Turning Function together with Grid Features to compute the distance between two symbols as a Sum of Squared Differences. An unwrap process and an alignment process are applied to the Turning Function to handle the horizontal and vertical shifts caused by changes of starting point and by rotation. This modified Turning Function makes our system robust against rotation of the symbol image. The system obtains a top-1 recognition rate of 96.90% and a 47.27% Area Under Curve (AUC) of the precision/recall plot on the InftyCDB-3 dataset. Experimental results show that the system with the modified Turning Function performs significantly better than the system with the original Turning Function on the rotated InftyCDB-3 dataset.
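The turning-function idea can be illustrated with a minimal sketch: sample a polygon's edge direction as a function of normalized arc length and compare two such samplings by a sum of squared differences. The mean subtraction below is only a crude stand-in for the paper's unwrap and alignment processes; the function names and sample count are illustrative.

```python
import math

def turning_function(poly, samples=64):
    """Sample the edge-direction variant of the turning function
    (edge angle vs. normalized arc length) of a closed polygon."""
    n = len(poly)
    lengths, angles = [], []
    for i in range(n):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % n]
        lengths.append(math.hypot(x1 - x0, y1 - y0))
        angles.append(math.atan2(y1 - y0, x1 - x0))
    total = sum(lengths)
    # cumulative arc-length fraction at the start of each edge
    cum, acc = [], 0.0
    for L in lengths:
        cum.append(acc / total)
        acc += L
    out = []
    for k in range(samples):
        t = k / samples
        # edge containing arc-length fraction t
        i = max(j for j in range(n) if cum[j] <= t)
        out.append(angles[i])
    return out

def ssd_distance(f, g):
    """Sum of squared differences between two sampled functions,
    after removing each function's mean (a crude rotation alignment,
    not the paper's unwrap/alignment procedure)."""
    mf = sum(f) / len(f)
    mg = sum(g) / len(g)
    return sum(((a - mf) - (b - mg)) ** 2 for a, b in zip(f, g))
```

Because arc length is normalized, a uniformly scaled copy of a shape yields the same sampled function and a zero distance.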
In this paper, we report a breakthrough result on the difficult task of segmentation and recognition of coloured
text from the word image dataset of ICDAR robust reading competition challenge 2: reading text in scene images.
We split the word image into individual colour, gray and lightness planes and enhance the contrast of each of
these planes independently by a power-law transform. The discrimination factor of each plane is computed as
the maximum between-class variance used in Otsu thresholding. The plane that has maximum discrimination
factor is selected for segmentation. The trial version of Omnipage OCR is then used on the binarized words for
recognition. Our recognition results on ICDAR 2011 and ICDAR 2003 word datasets are compared with those
reported in the literature. As a baseline, images binarized by simple global and local thresholding techniques were also recognized. The word recognition rate obtained by our non-linear enhancement and plane-selection
method is 72.8% and 66.2% for ICDAR 2011 and 2003 word datasets, respectively. We have created ground-truth
for each image at the pixel level to benchmark these datasets using a toolkit developed by us. The recognition
rate of benchmarked images is 86.7% and 83.9% for ICDAR 2011 and 2003 datasets, respectively.
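The plane-selection step can be sketched as follows: gamma-enhance each plane with a power-law transform, score it by the maximum between-class variance from Otsu's method, and keep the best-scoring plane. Function names, the bin count, and the gamma value are illustrative, not taken from the paper.

```python
import numpy as np

def otsu_discrimination(plane, bins=256):
    """Maximum between-class variance over all thresholds (the
    criterion maximized by Otsu's method), used as the plane's
    discrimination factor.  Values are assumed to lie in [0, 1]."""
    hist, _ = np.histogram(plane, bins=bins, range=(0, 1))
    p = hist.astype(float) / hist.sum()
    omega = np.cumsum(p)                  # class-0 probability
    mu = np.cumsum(p * np.arange(bins))   # class-0 cumulative mean
    mu_t = mu[-1]                         # global mean
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = np.where(denom > 0, (mu_t * omega - mu) ** 2 / denom, 0.0)
    return sigma_b.max()

def select_plane(planes, gamma=1.5):
    """Power-law (gamma) enhance each plane, then pick the plane
    with the largest Otsu discrimination factor."""
    enhanced = [np.clip(p, 0, 1) ** gamma for p in planes]
    scores = [otsu_discrimination(e) for e in enhanced]
    best = int(np.argmax(scores))
    return best, enhanced[best]
```

A strongly bimodal plane scores high; a flat plane scores zero, so it is never selected for binarization.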
Forensic identification is the task of determining whether or not observed evidence arose from a known source. It involves
determining a likelihood ratio (LR): the ratio of the joint probability of the evidence and source under the identification hypothesis (that the evidence came from the source) to that under the exclusion hypothesis (that the evidence did not arise from the source). In LR-based decision methods, particularly handwriting comparison, a variable number of pieces of input evidence is used. A decision based on many pieces of evidence can result in nearly the same LR as one based on few pieces of evidence.
We consider methods for distinguishing between such situations. One of these is to provide confidence intervals together
with the decisions and another is to combine the inputs using weights. We propose a new method that generalizes the
Bayesian approach and uses an explicitly defined discount function. Empirical evaluation with several data sets including
synthetically generated ones and handwriting comparison shows greater flexibility of the proposed method.
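The likelihood ratio and its combination over multiple pieces of evidence can be written down directly. The discount function below is a placeholder for the paper's explicitly defined one, which the abstract does not reproduce; the identity discount recovers the naive Bayesian log-product.

```python
import math

def likelihood_ratio(p_identification, p_exclusion):
    """LR = P(evidence | identification hypothesis)
          / P(evidence | exclusion hypothesis)."""
    return p_identification / p_exclusion

def combined_log_lr(lrs, discount=lambda k: 1.0):
    """Combine per-evidence LRs in log space, weighting the k-th
    piece by a discount function.  The identity discount gives the
    plain Bayesian product of independent LRs."""
    return sum(discount(k) * math.log(lr) for k, lr in enumerate(lrs))
```

A decaying discount (e.g. `lambda k: 0.5 ** k`) down-weights later evidence, so many weak pieces no longer accumulate the same combined LR as a few strong ones.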
Handwriting recognition systems are typically trained using publicly available databases, where data have been
collected in controlled conditions (image resolution, paper background, noise level, etc.). Since this is not often
the case in real-world scenarios, classification performance can be affected when novel data is presented to the
word recognition system. To overcome this problem, we present in this paper a new approach called database
adaptation. It consists of processing one set (training or test) in order to adapt it to the other set (test or training,
respectively). Specifically, two kinds of preprocessing, namely stroke thickness normalization and pixel intensity normalization, are considered. The advantage of such an approach is that we can re-use the existing recognition
system trained on controlled data. We conduct several experiments with the Rimes 2011 word database and
with a real-world database. We adapt either the test set or the training set. Results show that training set
adaptation achieves better results than test set adaptation, at the cost of a second training stage on the adapted
data. The accuracy of data-set adaptation increases by 2% to 3% absolute over no adaptation.
Confirming the labels of automatically classified patterns is generally faster than entering new labels or correcting
incorrect labels. Most labels assigned by a classifier, even if trained only on relatively few pre-labeled patterns, are
correct. Therefore the overall cost of human labeling can be decreased by interspersing labeling and classification. Given
a parameterized model of the error rate as an inverse power law function of the size of the training set, the optimal splits
can be computed rapidly. Projected savings in operator time are over 60% for a range of empirical error functions for
hand-printed digit classification with ten different classifiers.
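The cost trade-off described above can be sketched with an inverse power-law error model and a simple cost scan. The cost constants and model parameters below are illustrative assumptions, not the paper's empirical values.

```python
def error_rate(n, a=1.0, b=0.5):
    """Inverse power-law model of classifier error vs. training-set size."""
    return a * n ** (-b)

def labeling_cost(n_train, total, c_confirm=1.0, c_correct=3.0, a=1.0, b=0.5):
    """Operator cost: hand-label n_train seed patterns (priced like a
    correction), then confirm correct machine labels and correct wrong
    ones on the remainder.  Cost constants are illustrative."""
    rest = total - n_train
    e = error_rate(n_train, a, b)
    return n_train * c_correct + rest * ((1 - e) * c_confirm + e * c_correct)

def best_split(total, **kw):
    """Scan all seed-set sizes and return the cheapest split."""
    return min(range(1, total), key=lambda n: labeling_cost(n, total, **kw))
```

Because the model is a smooth function of the split size, this scan (or a closed-form minimization) is cheap, which is what makes interspersing labeling and classification practical.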
The transcription of handwritten words remains a challenging and difficult task. When processing full pages, approaches are limited by the trade-off between automatic recognition errors and the tedium of human verification. In this article, we present our investigations into improving the capabilities of an automatic recognizer so that it can reject unknown words (and thus avoid wrong decisions) while correctly recognizing as much as possible from the lexicon of known words.
This is the active research topic of developing a verification system that optimizes the trade-off between
performance and reliability. To minimize the recognition errors, a verification system is usually used to accept
or reject the hypotheses produced by an existing recognition system. Thus, we re-use our novel verification
architecture1 here: the recognition hypotheses are re-scored by a set of support vector machines, and validated
by a verification mechanism based on multiple rejection thresholds. In order to tune these (class-dependent)
rejection thresholds, an algorithm based on dynamic programming has been proposed that focuses on maximizing the recognition rate for a given error rate.
Experiments have been carried out on the RIMES database in three steps. The first two showed that this approach achieves performance superior or equal to that of other state-of-the-art rejection methods. We focus here on the third one, which shows that this verification system also greatly improves keyword extraction from a set of handwritten words, with strong robustness to lexicon-size variations (21 lexicons were tested, from 167 entries up to 5,600 entries). This robustness is particularly relevant to our application context of cooperation with humans, and is made possible only by the rejection ability of the proposed system. Compared to an HMM with simple rejection, the proposed verification system improves the recognition rate on average by 57% (resp. 33% and 21%) for a given error rate of 1% (resp. 5% and 10%).
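The accept/reject mechanism with class-dependent thresholds can be sketched as below. The greedy per-class tuner is a simple stand-in for the paper's dynamic-programming algorithm, and all names and values are illustrative.

```python
def verify(hypothesis_class, score, thresholds, default=0.5):
    """Accept a recognition hypothesis only if its verification score
    clears the class-dependent rejection threshold."""
    return score >= thresholds.get(hypothesis_class, default)

def tune_threshold(scored, max_errors):
    """For one class, pick the smallest threshold whose accepted set
    contains at most `max_errors` wrong hypotheses (a greedy stand-in
    for the paper's dynamic-programming tuner).
    `scored` is a list of (score, is_correct) pairs."""
    for t in sorted({s for s, _ in scored}):
        wrong = sum(1 for s, ok in scored if s >= t and not ok)
        if wrong <= max_errors:
            return t
    return float("inf")
```

Tuning each class separately lets easy classes keep low thresholds (high recall) while hard classes reject more aggressively, which is the point of class-dependent thresholds.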
Comic image understanding aims to automatically decompose scanned comic page images into storyboards and then identify their reading order, which is the key technique for producing digital comic documents suitable for reading on mobile devices. In this paper, we propose a novel comic image understanding method based on polygon detection. First, we segment a comic page image into storyboards by finding the polygonal enclosing box of each storyboard. Then, each storyboard can be represented by a polygon, and the reading order is determined by
analyzing the relative geometric relationship between each pair of polygons. The proposed method is tested on 2000
comic images from ten printed comic series, and the experimental results demonstrate that it works well on different
types of comic images.
Free-form online handwritten documents contain highly diverse content, organized without constraints imposed on the user. The lack of prior knowledge about content and layout makes the modeling of contextual
information of crucial importance for interpretation of such documents. In this work, we present a comprehensive
investigation of the sources of contextual information that can benefit the task of discerning textual
from non-textual strokes in handwritten online documents. An in-depth analysis of interactions between strokes
is conducted through the design of various pairwise clique systems that are combined within a Conditional
Random Field formulation of the stroke labeling problem. Our results demonstrate the benefits of combining
complementary sources of context for improving the text/non-text recognition performance.
Regions of interest (ROIs) that are pointed to by overlaid markers (arrows, asterisks, etc.) in biomedical images
are expected to contain more important and relevant information than other regions for biomedical article
indexing and retrieval. We have developed several algorithms that localize and extract the ROIs by recognizing
markers on images. Cropped ROIs then need to be annotated with contents describing them best. In most cases
accurate textual descriptions of the ROIs can be found from figure captions, and these need to be combined
with image ROIs for annotation. The annotated ROIs can then be used to, for example, train classifiers that
separate ROIs into known categories (medical concepts), or to build visual ontologies, for indexing and retrieval
of biomedical articles.
We propose an algorithm that pairs visual and textual ROIs that are extracted from images and figure
captions, respectively. This algorithm, based on dynamic time warping (DTW), clusters recognized pointers into groups, each of which contains pointers with identical visual properties (shape, size, color, etc.). Then a rule-based
matching algorithm finds the best matching group for each textual ROI mention. Our method yields a
precision and recall of 96% and 79%, respectively, when ground truth textual ROI data is used.
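A textbook DTW distance and a simple threshold-based grouping illustrate the clustering step; the paper's actual pointer feature sequences and rule-based matching are not reproduced here, so the threshold and feature values are hypothetical.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic-time-warping distance between two sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # insertion
                              D[i][j - 1],      # deletion
                              D[i - 1][j - 1])  # match
    return D[n][m]

def cluster(seqs, threshold):
    """Greedy clustering: place each sequence in the first cluster
    whose representative is within the DTW threshold."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if dtw_distance(c[0], s) <= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

DTW tolerates sequences of different lengths, which is why it suits comparing pointer outlines that are sampled at different resolutions.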
Converting PDF books to a re-flowable format has recently attracted wide interest in the area of e-book reading. Robust graphic segmentation is highly desirable for increasing the practicability of PDF converters. To cope with various layouts, a multi-layer concept is introduced to segment graphic composites, including photographic images and drawings with text insets or surrounded by text elements. Both image-based analysis and the inherent advantages of born-digital documents are exploited in this multi-layer layout analysis method. By combining low-level clustering of page elements in the PDF document with connected-component analysis on a synthetically generated PNG image of the document, graphic composites can be segmented in PDF documents with complex layouts. The experimental results on
graphic composite segmentation of PDF document pages have shown satisfactory performance.
In this paper, a classification-free word-spotting system, appropriate for the retrieval of printed historical document images, is proposed. The system skips many of the procedures of a common approach: it does not include segmentation,
feature extraction or classification. Instead it treats the queries as compact shapes and uses image processing techniques
in order to localize a query in the document images. Our system was tested on a historical document collection with
many problems and a Google book, printed in 1675. Moreover, some comparative results are given for a traditional word
spotting system.
Symbol spotting is important for automatic interpretation of technical line drawings. Current spotting methods
are not reliable enough for such tasks due to low precision rates. In this paper, we combine a geometric matching-based
spotting method with an SVM classifier to improve the precision of the spotting. In symbol spotting, a
query symbol is to be located within a line drawing. Candidate matches can be found; however, these matches may be true or false. To distinguish false matches, an SVM classifier is used. The classifier is trained on true and false matches of a query symbol. The matches are represented as vectors that indicate how well the query features are matched; these match qualities are obtained via geometric matching. Using the
classification, the precision of the spotting improved from an average of 76.6% to an average of 97.2% on a
database of technical line drawings.
We propose a segmentation-free word-spotting framework using a dynamic background model. The proposed approach extends our previous work, in which the dynamic background model was introduced and integrated with a segmentation-based recognizer for keyword spotting. The dynamic background model uses local character matching scores and global word-level hypothesis scores to separate keywords from non-keywords. We integrate and evaluate this model on a Hidden Markov Model (HMM) based segmentation-free recognizer, which works at the line level without any need for word segmentation. We outperform the state-of-the-art line-level word-spotting system on the IAM dataset.
Data extraction from engraved text is discussed rarely, and nothing in the open literature discusses data extraction
from cemetery headstones. Headstone images present unique challenges such as engraved or embossed characters
(causing inner-character shadows), low contrast with the background, and significant noise due to inconsistent
stone texture and weathering. Current systems for extracting text from outdoor environments (billboards, signs,
etc.) make assumptions (i.e. clean and/or consistently-textured background and text) that fail when applied to
the domain of engraved text. The ability to extract the data found on headstones is of great historical value. This
paper describes a novel and efficient feature-based text zoning and segmentation method for the extraction of
noisy text from a highly textured engraved medium. This paper also demonstrates the usefulness of constraining
a problem to a specific domain. The transcriptions of images zoned and segmented through the proposed system
have a precision of 55% compared to 1% precision without zoning, a 62% recall compared to 39%, and an error
rate of 78% compared to 8303%.
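An error rate above 100%, such as the 8303% quoted above, is possible because word-level error rates count insertions: noisy segmentation can emit many spurious words per reference word. A standard edit-distance word error rate makes this concrete; it is a generic metric, not the paper's specific evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length.
    Insertions are counted, so heavy noise can push WER far above 100%."""
    r, h = reference.split(), hypothesis.split()
    n, m = len(r), len(h)
    # Levenshtein distance over words
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D[n][m] / n
```

A one-word reference transcribed as four wrong words yields a WER of 400%, so an unzoned system that hallucinates text across a textured stone can reach error rates in the thousands of percent.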
We describe a system for indexing of census records in tabular documents with the goal of recognizing the content
of each cell, including both headers and handwritten entries. Each document is automatically rectified, registered
and scaled to a known template following which lines and fields are detected and delimited as cells in a tabular
form. Whole-word or whole-phrase recognition of noisy machine-printed text is performed using a glyph library,
providing greatly increased efficiency and accuracy (approaching 100%), while avoiding the problems inherent
with traditional OCR approaches. Constrained handwriting recognition results for a single author reach as high
as 98% and 94.5% for the Gender field and Birthplace respectively. Multi-author accuracy (currently 82%) can
be improved through an increased training set. Active integration of user feedback in the system will accelerate
the indexing of records while providing a tightly coupled learning mechanism for system improvement.
Recent progress in the digitization of heterogeneous collections of ancient documents has rekindled new challenges
in information retrieval in digital libraries and document layout analysis. Therefore, in order to control
the quality of historical document image digitization and to meet the need for a characterization of their content
using intermediate level metadata (between image and document structure), we propose a fast automatic layout
segmentation of old document images based on five descriptors. Those descriptors, based on the autocorrelation
function, are obtained by multiresolution analysis and used afterwards in a specific clustering method. The
method proposed in this article has the advantage that it is performed without any hypothesis on the document
structure, either about the document model (physical structure), or the typographical parameters (logical
structure). It is also parameter-free since it automatically adapts to the image content. In this paper, firstly,
we detail our proposal to characterize the content of old documents by extracting the autocorrelation features
in the different areas of a page and at several resolutions. Then, we show that it is possible to automatically find
the homogeneous regions defined by similar indices of autocorrelation without knowledge about the number of
clusters using adapted hierarchical ascendant classification and consensus clustering approaches. To assess our
method, we apply our algorithm on 316 old document images, which encompass six centuries (1200-1900) of
French history, in order to demonstrate the performance of our proposal in terms of segmentation and characterization
of heterogeneous corpus content. Moreover, we define a new evaluation metric, the homogeneity measure,
which aims at evaluating the segmentation and characterization accuracy of our methodology. We obtain a mean homogeneity accuracy of 85%. These results help to represent a document by a hierarchy of layout structure
and content, and to define one or more signatures for each page, on the basis of a hierarchical representation of
homogeneous blocks and their topology.
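The core computation behind the descriptors, the autocorrelation of an image block, can be sketched via the Wiener-Khinchin theorem (inverse FFT of the power spectrum). The five multiresolution descriptors the paper derives from it are not reproduced here; this is just the base feature.

```python
import numpy as np

def autocorrelation(block):
    """Normalized 2-D circular autocorrelation of an image block,
    computed as the inverse FFT of the power spectrum."""
    x = block - block.mean()          # remove DC component
    F = np.fft.fft2(x)
    ac = np.fft.ifft2(F * np.conj(F)).real
    peak = ac.flat[0]                 # zero-shift energy
    return ac / peak if peak != 0 else ac
```

On a periodic texture (e.g. ruled lines or regular type), the autocorrelation peaks again at the period, which is what lets these features distinguish text areas from drawings and background without any document model.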
We report on a case study on OCR of eighteenth century books conducted in the IMPACT project. After introducing the
IMPACT project and its approach to lexicon building and deployment, we zoom in on the application of IMPACT tools
and data to the Dutch EDBO collection. The results are exemplified by detailed discussion of various practical options to
improve text recognition beyond a baseline of running an uncustomized Finereader 10. In particular, we discuss improved recognition of the long s.
A framework is proposed in this paper to effectively generate a new hybrid character type by integrating the local contour features of Chinese calligraphy with the structural features of a computer font. To capture the traditional artistic manifestation of calligraphy, a multi-directional spatial filter is applied for local contour feature extraction. The contour of the character image is then divided into sub-images, and the sub-images in the identical position across various characters are modeled by a Gaussian distribution. According to this probability distribution, dilation and erosion operators are designed to adjust the boundary of the font image. New Chinese character images are then generated that possess both the contour features of artistic calligraphy and the elaborate structural features of the font. Experimental results demonstrate that the new characters are visually acceptable, and that the proposed framework is an effective and efficient strategy for automatically generating new hybrid characters of calligraphy and font.
In this paper, we present a generic Optical Character Recognition system for Arabic script languages called
Nabocr. Nabocr uses OCR approaches specific for Arabic script recognition. Performing recognition on Arabic
script text is more difficult than on Latin text due to the nature of Arabic script, which is cursive and
context sensitive. Moreover, Arabic script has different writing styles that vary in complexity. Nabocr is initially
trained to recognize both Urdu Nastaleeq and Arabic Naskh fonts. However, it can be trained by users to be
used for other Arabic script languages. We have evaluated our system's performance for both Urdu and Arabic.
In order to evaluate Urdu recognition, we have generated a dataset of Urdu text called UPTI (Urdu Printed Text Image Database), which exercises different aspects of a recognition system. The performance of our system
for Urdu clean text is 91%. For Arabic clean text, the performance is 86%. Moreover, we have compared the
performance of our system against Tesseract's newly released Arabic recognition, and the performance of both
systems on clean images is almost the same.
Digitization of historical Chinese documents includes two key technologies, character segmentation and character
recognition. This paper focuses on developing a character segmentation algorithm. As a preprocessing step, we combine several effective measures to remove noise from a historical Chinese document image. After binarization, a new character segmentation algorithm segments single characters based on projections of a cost image in local
windows. The cost image is constructed by utilizing the information of stroke bounding boxes and a skeleton
image extracted from the binarized image. We evaluate the proposed algorithm based on matching degrees of
character bounding boxes between segmentation results and ground-truth data, and achieve a recall rate of 74.3%
on a test set, which shows the effectiveness of the proposed algorithm.
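Projection-based segmentation of the kind described can be illustrated on a plain binary image: project foreground counts onto the horizontal axis and split at empty-column runs. The paper projects a cost image built from stroke bounding boxes and skeletons inside local windows, which this sketch does not model.

```python
def segment_columns(binary):
    """Split a binary image (list of rows of 0/1 values) into
    character-candidate column ranges at gaps in the vertical
    projection profile.  Returns [start, end) column intervals."""
    width = len(binary[0])
    # vertical projection: foreground pixels per column
    proj = [sum(row[c] for row in binary) for c in range(width)]
    segments, start = [], None
    for c, v in enumerate(proj):
        if v and start is None:
            start = c              # segment begins
        elif not v and start is not None:
            segments.append((start, c))
            start = None           # segment ends at a blank column
    if start is not None:
        segments.append((start, width))
    return segments
```

Projecting a cost image instead of the raw binarization, as the paper does, keeps touching or overlapping characters from merging into a single interval.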
Optical character recognition is widely used for converting document images into digital media. Existing OCR
algorithms and tools produce good results on high-resolution, good-quality document images. In this paper,
we propose a machine learning based super resolution framework for low resolution document image OCR. Two
main techniques are used in our proposed approach: a document page segmentation algorithm and a modified
K-means clustering algorithm. Using this approach, by exploiting coherence in the document, we reconstruct
from a low resolution document image a better resolution image and improve OCR results. Experimental results
show substantial gain in low resolution documents such as the ones captured from video.
Pointers (arrows and symbols) are frequently used in biomedical images to highlight specific image regions of
interest (ROIs) that are mentioned in figure captions and/or text discussion. Detection of pointers is the first
step toward extracting relevant visual features from ROIs and combining them with textual descriptions for a
multimodal (text and image) biomedical article retrieval system.
Recently we developed a pointer recognition algorithm based on an edge-based pointer segmentation method, and subsequently reported improvements on our initial approach involving the use of Active Shape Models (ASM) for pointer recognition and a region-growing-based method for pointer segmentation. These methods improved the recall of pointer recognition but did little for the precision. The method discussed in this article is our recent effort to improve the precision rate. Evaluations performed on two datasets, compared with other pointer segmentation methods, show significantly improved precision and the highest F1 score.
For noisy, historical documents, a high optical character recognition (OCR) word error rate (WER) can render the OCR
text unusable. Since image binarization is often the method used to identify foreground pixels, a body of research seeks
to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method
incorporates information from multiple simple thresholding binarizations of the same image to improve text output. Using
a new corpus of 19th century newspaper grayscale images for which the text transcription is known, we observe WERs of
13.8% and higher using current binarization techniques and a state-of-the-art OCR engine. Our novel approach combines
the OCR outputs from multiple thresholded images by aligning the text output and producing a lattice of word alternatives
from which a lattice word error rate (LWER) is calculated. Our results show an LWER of 7.6% when aligning two threshold images and an LWER of 6.8% when aligning five. From the word lattice we commit to one hypothesis by applying the methods of Lund et al. (2011), achieving an 8.41% WER on this data set, an improvement over the original OCR output.
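A minimal sketch of the lattice construction, assuming just two OCR hypotheses and using `difflib` in place of the paper's alignment; `word_lattice` and `lattice_errors` are hypothetical helpers:

```python
import difflib

def word_lattice(hyp_a, hyp_b):
    """Align two OCR token streams; wherever they disagree, keep every
    alternative from both streams in a single lattice slot."""
    sm = difflib.SequenceMatcher(a=hyp_a, b=hyp_b, autojunk=False)
    lattice = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            lattice.extend({w} for w in hyp_a[i1:i2])
        else:
            lattice.append(set(hyp_a[i1:i2]) | set(hyp_b[j1:j2]))
    return lattice

def lattice_errors(lattice, truth):
    """Toy LWER numerator: reference words absent from their slot
    (assumes slots and reference words line up one-to-one)."""
    return sum(1 for slot, word in zip(lattice, truth) if word not in slot)
```

The LWER is an oracle rate: it counts an error only when no hypothesis in the slot matches the reference, which is why it lower-bounds the WER of any single threshold image.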
Binarization is of significant importance in document analysis systems. It is an essential first step, prior to further stages such as Optical Character Recognition (OCR), document segmentation, or enhancement of the readability of the document after restoration stages. Hence, proper evaluation of binarization methods to verify their effectiveness is of great value to the document analysis community. In this work, we perform a detailed, goal-oriented evaluation of the image quality of the 18 binarization methods that participated in the DIBCO 2011 competition, using the 16 historical document test images from the contest. We are interested in the image quality of the outputs generated by the different binarization algorithms as well as in OCR performance, where possible. We compare our evaluation of the algorithms, based on human perception of quality, to the DIBCO evaluation metrics. The results provide insight into the effectiveness of these methods with respect to human perception of image quality as well as OCR performance.
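For reference, one of the standard DIBCO evaluation metrics is pixel-level PSNR against the ground-truth binarization; a minimal NumPy version (assuming images scaled to a dynamic range of 1) might look like:

```python
import numpy as np

def psnr(pred, gt):
    """Peak signal-to-noise ratio between a binarization and its ground
    truth (dynamic range assumed to be 1), one of the DIBCO metrics."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mse = ((pred - gt) ** 2).mean()
    if mse == 0:
        return float("inf")
    return float(10 * np.log10(1.0 / mse))
```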
This paper presents a novel solution for the layout segmentation of graphical elements in Business Intelligence
documents. We propose a generalization of the recursive X-Y cut algorithm, which allows for cutting along
arbitrary oblique directions. An intermediate processing step consisting of line and solid region removal is also
necessary due to the presence of decorative elements. The output of the proposed segmentation is a hierarchical
structure which allows for the identification of primitives in pie and bar charts. The algorithm was tested on a
database composed of charts from business documents. Results are very promising.
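The baseline that the paper generalizes, the classic axis-aligned recursive X-Y cut, can be sketched as follows (a minimal version that cuts at interior whitespace valleys of row/column projections; oblique cuts and decoration removal are not shown):

```python
import numpy as np

def xy_cut(img, min_gap=1):
    """Recursively split a binary page image (1 = ink) at interior
    whitespace valleys of its row/column projection profiles; returns
    leaf bounding boxes as (y0, y1, x0, x1)."""
    boxes = []

    def valley(profile):
        # first interior run of >= min_gap empty rows/columns
        start = None
        for i, v in enumerate(profile):
            if v == 0 and start is None:
                start = i
            elif v != 0 and start is not None:
                if start > 0 and i - start >= min_gap:
                    return start, i
                start = None
        return None

    def recurse(y0, y1, x0, x1):
        sub = img[y0:y1, x0:x1]
        for axis in (1, 0):                      # try row cuts, then column cuts
            cut = valley(sub.sum(axis=axis))
            if cut is not None:
                if axis == 1:                    # horizontal cut between rows
                    recurse(y0, y0 + cut[0], x0, x1)
                    recurse(y0 + cut[1], y1, x0, x1)
                else:                            # vertical cut between columns
                    recurse(y0, y1, x0, x0 + cut[0])
                    recurse(y0, y1, x0 + cut[1], x1)
                return
        boxes.append((y0, y1, x0, x1))

    recurse(0, img.shape[0], 0, img.shape[1])
    return boxes
```

The recursion tree itself is the hierarchical structure mentioned above; the paper's contribution is allowing the cutting direction to be oblique rather than fixed to the axes.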
Integrity tests are proposed for image processing algorithms that should yield essentially the same output under 90
degree rotations, edge-padding and monotonic gray-scale transformations of scanned documents. The tests are
demonstrated on built-in functions of the Matlab Image Processing Toolbox. Only the routine that reports the area of the
convex hull of foreground components fails the rotation test. Ensuring error-free preprocessing operations like size and
skew normalization that are based on resampling an image requires more radical treatment. Even if faultlessly
implemented, resampling is generally irreversible and may introduce artifacts. Fortunately, advances in storage and
processor technology have all but eliminated any advantage of preprocessing or compressing document images by
resampling them. Using floating point coordinate transformations instead of resampling images yields accurate run-length,
moment, slope, and other geometric features.
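A 90-degree rotation integrity test of the kind described can be sketched generically: apply the feature extractor to all four rotations of an image and check that the outputs agree. The `rotation_integrity` helper is hypothetical and assumes a scalar-valued feature:

```python
import numpy as np

def rotation_integrity(feature_fn, img, atol=1e-9):
    """Check that a scalar feature is unchanged (up to atol) under the
    four 90-degree rotations of the input image."""
    ref = feature_fn(img)
    return all(abs(feature_fn(np.rot90(img, k)) - ref) <= atol for k in range(4))
```

Ink area passes such a test by construction, while a feature tied to absolute pixel coordinates fails on any asymmetric image, which is exactly the kind of discrepancy the proposed tests are meant to surface.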
In this paper we present a novel method for multilingual artificial text extraction from still images. We propose a lexicon-independent, block-based technique that employs a combination of spatial transforms and texture-, edge-, and gradient-based operations to detect unconstrained textual regions in still images. Finally, morphological and geometrical constraints are applied for fine localization of textual content. The proposed method was evaluated on two standard and three custom-developed datasets comprising a wide variety of images with artificial text occurrences in five different languages, namely English, Urdu, Arabic, Chinese, and Hindi.
In this paper, we propose a computer-assisted transcription system of old registers, handwritten in Arabic from
the 19th century onwards, held in the National Archives of Tunisia (NAT). The proposed system assists the
human supervisor to complete the transcription task as efficiently as possible. This assistance is provided at all recognition levels. Our system addresses different approaches for the transcription of document images. It
also implements an alignment method to find mappings between word images of a handwritten document and
their respective words in its given transcription.
A necessary step for the recognition of scanned documents is binarization, which is essentially the segmentation
of the document. In order to binarize a scanned document, we can find several algorithms in the literature.
What is the best binarization result for a given document image? To answer this question, a user needs to check different binarization algorithms for suitability, since different algorithms may work better for different types of documents. Manually choosing the best from a set of binarized documents is time consuming. To automate the selection of the best segmented document, we either need ground truth for the document or an evaluation metric. If ground truth is available, then precision and recall can be used to choose the best binarized document. But what about the case when ground truth is not available? Can we come up with a metric that evaluates these binarized documents? Hence, we propose a metric to evaluate binarized document images using eigenvalue decomposition. We have evaluated this measure on the DIBCO and H-DIBCO datasets. The proposed method chooses the best binarized document, that is, the one closest to the ground truth of the document.
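The ground-truth case mentioned above is straightforward: with a reference binarization, precision and recall combine into an F-measure and the best candidate is simply the arg-max. This sketch (with a hypothetical `best_binarization` helper) is the reference-based selection that the proposed no-reference metric tries to approximate:

```python
import numpy as np

def best_binarization(candidates, gt):
    """Return the index of the candidate with the highest pixel-level
    F-measure against the ground-truth binarization (1 = ink)."""
    gt = np.asarray(gt, dtype=bool)

    def f1(pred):
        pred = np.asarray(pred, dtype=bool)
        tp = (pred & gt).sum()                  # correctly detected ink pixels
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(gt.sum(), 1)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    return max(range(len(candidates)), key=lambda i: f1(candidates[i]))
```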
Symbol retrieval is important for content-based search in digital libraries and for automatic interpretation of
line drawings. In this work, we present a complete symbol retrieval system. The proposed system has an
off-line content-analysis stage, where the contents of a database of line drawings are represented as a symbol
index, which is a compact indexable representation of the database. Such representation allows efficient on-line
query retrieval. Within the retrieval system, three methods are presented. First, a feature grouping method for
identifying local regions of interest (ROIs) in the drawings. The found ROIs represent symbols' parts. Second,
a clustering method based on geometric matching is used to cluster similar parts from all the drawings
together. A symbol index is then constructed from the clusters' representatives. Finally, the ROIs of a query
symbol are matched to the clusters' representatives. The matching symbols' parts are retrieved from the clusters,
and spatial verification is performed on the matching parts. By using the symbol index, we are able to achieve a query look-up time that is independent of the database size and depends only on the size of the symbol index.
The retrieval system achieves higher recall and precision than state-of-the-art methods.
Mathematical expression recognition is still a very challenging task for the research community, mainly because of the two-dimensional (2D) structure of mathematical expressions (MEs). In this paper, we present a novel approach for the structural analysis between two on-line handwritten mathematical symbols of an ME, based on spatial features of the symbols. We introduce six features to represent the spatial affinity of the symbols and compare two multi-class classification methods that employ support vector machines (SVMs), one based on the “one-against-one” technique and one based on “one-against-all”, in identifying the relation between a pair of symbols (e.g., subscript, numerator). A dataset containing 1906 spatial relations derived from the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) 2012 training dataset is constructed to evaluate the classifiers and compare them with the rule-based classifier of the ILSP-1 system that participated in the contest. The experimental results give an overall mean error rate of 2.61% for the “one-against-one” SVM approach, 6.57% for the “one-against-all” SVM technique, and 12.31% for the ILSP-1 classifier.
The work reported in this paper concerns the problem of mathematical expression recognition, a task known to be very hard. We propose to alleviate its difficulties by exploiting two complementary modalities: handwriting and audio. To combine the signals coming from both modalities, various fusion methods are explored. Performance evaluated on the HAMEX dataset shows a significant improvement compared to a single-modality (handwriting) system.
In this paper we describe a modified classification method intended for extractive summarization. The classification in this method does not need a learning corpus; it uses the input text itself. First, we cluster the document sentences to exploit the diversity of topics; then we apply a learning algorithm (here, Naive Bayes) to each cluster, treating it as a class. After obtaining the classification model, we calculate the score of a sentence in each class using a scoring model derived from the classification algorithm. These scores are then used to rank the sentences and extract the top-ranked ones as the output summary.
We conducted experiments using a corpus of scientific papers and compared our results to another summarization system, UNIS.1 We also examined the impact of tuning the clustering threshold on the resulting summary, as well as the impact of adding more features to the classifier. We found that this method gives good performance and that the addition of new features (which is simple with this method) can improve the summary's accuracy.
Supervised topic models are promising tools for text analytics that simultaneously model topical patterns in document
collections and relationships between those topics and document metadata, such as timestamps. We examine empirically the
effect of OCR noise on the ability of supervised topic models to produce high quality output through a series of experiments
in which we evaluate three supervised topic models and a naive baseline on synthetic OCR data having various levels of
degradation and on real OCR data from two different decades. The evaluation includes experiments with and without
feature selection. Our results suggest that supervised topic models are no better, or at least not much better in terms of their
robustness to OCR errors, than unsupervised topic models and that feature selection has the mixed result of improving topic
quality while harming metadata prediction quality. For users of topic modeling methods on OCR data, supervised topic
models do not yet solve the problem of finding better topics than the original unsupervised topic models.
Current systems for automatic extraction of index terms from business documents either take a rule-based
or training-based approach. As both approaches have their advantages and disadvantages it seems natural to
combine both methods to get the best of both worlds. We present a combination method with the steps selection,
normalization, and combination based on comparable scores produced during extraction. Furthermore, novel
evaluation metrics are developed to support the assessment of each step in an existing extraction system. Our
methods were evaluated on an example extraction system with three individual extractors and a corpus of 12,000
scanned business documents.
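One simple instantiation of the normalization and combination steps, with hypothetical names and min-max normalization standing in for whatever score mapping the extraction system actually uses:

```python
def normalize(scores):
    """Min-max normalize a dict of candidate -> raw score into [0, 1]
    so that scores from different extractors become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {c: (s - lo) / span for c, s in scores.items()}

def combine(extractor_scores):
    """Sum normalized scores per candidate across extractors and
    return the best candidate for the index field."""
    total = {}
    for scores in extractor_scores:
        for candidate, s in normalize(scores).items():
            total[candidate] = total.get(candidate, 0.0) + s
    return max(total, key=total.get)
```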
In this paper, we present the implementation and evaluation of first-order and second-order Hidden Markov Models to identify and correct OCR errors in the post-processing of books. Our experiments show that the first-order model corrects approximately 10% of the errors with 100% precision, while the second-order model corrects a higher percentage of errors with much lower precision.
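A first-order correction model of this general kind can be decoded with the Viterbi algorithm, hidden states being intended characters and observations being OCR characters. The toy probabilities below (confusing 'l' and '1') are illustrative, not the trained values from the paper:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """First-order HMM decoding: return the most likely hidden character
    sequence for an observed OCR string."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))
```

With transition probabilities favouring "ll" over "l1", the decoder rewrites an OCR output like "l1" to "ll"; a second-order model would condition each transition on the two preceding characters instead of one.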
Investigators are people who are listed as members of corporate organizations but not entered as authors in an article.
Beginning with journals published in 2008, investigator names are required to be included in a new bibliographic field in
MEDLINE citations. Automatic extraction of investigator names is necessary due to the increase in collaborative
biomedical research and consequently the large number of such names. We implemented two discriminative SVM
models, i.e., SVM and structural SVM, to identify named entities such as the first and last names of investigators from
online medical journal articles. Both approaches achieve good performance at the word and name chunk levels. We
further conducted an error analysis and found that SVM and structural SVM can offer complementary information about
the patterns to be classified. Hence, we combined the two independently trained classifiers where the SVM is chosen as
a base learner with its outputs enhanced by the predictions from the structural SVM. The overall performance, especially the recall rate of investigator name retrieval, exceeds that of the standalone SVM model.
The French National Library (BnF) has launched many mass digitization projects in order to give access to its collections. The indexing of digital documents on Gallica (the digital library of the BnF) relies on their textual content, obtained from service providers that use Optical Character Recognition (OCR) software. OCR systems have become increasingly complex, composed of several subsystems dedicated to the analysis and recognition of the elements in a page. However, the reliability of these systems remains an issue. Indeed, in some cases we find errors in OCR outputs that arise from an accumulation of several errors at different levels of the OCR process. One frequent error in OCR outputs is missed text components, whose presence may lead to severe defects in digital libraries. In this paper, we investigate the detection of missed text components to control the OCR results for the collections of the French National Library. Our verification approach uses local information inside the pages, based on Radon transform descriptors and Local Binary Pattern (LBP) descriptors coupled with the OCR results, to check their consistency. The experimental results show that our method detects 84.15% of the missed textual components when comparing the ALTO OCR output files (produced by the service providers) to the images of the documents.
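The LBP descriptor itself is standard: each interior pixel gets an 8-bit code from comparing its eight neighbours to the centre. A minimal NumPy version, independent of the BnF pipeline:

```python
import numpy as np

def lbp_codes(img):
    """8-neighbour Local Binary Pattern codes for the interior pixels of
    a grayscale image: each neighbour >= centre contributes one bit."""
    img = np.asarray(img)
    h, w = img.shape
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = img[1:h - 1, 1:w - 1]
    codes = np.zeros((h - 2, w - 2), dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= centre).astype(int) << bit
    return codes
```

Histograms of these codes over a local window give a texture signature: text regions produce characteristic LBP distributions, which is what makes the descriptor useful for flagging zones the OCR output left empty.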
Currently, structural pattern recognizer evaluations compare graphs of detected structure to target structures
(i.e. ground truth) using recognition rates, recall and precision for object segmentation, classification and
relationships. In document recognition, these target objects (e.g. symbols) are frequently comprised of multiple
primitives (e.g. connected components, or strokes for online handwritten data), but current metrics do not
characterize errors at the primitive level, from which object-level structure is obtained. Primitive label graphs
are directed graphs defined over primitives and primitive pairs. We define new metrics obtained by Hamming
distances over label graphs, which allow classification, segmentation and parsing errors to be characterized
separately, or using a single measure. Recall and precision for detected objects may also be computed directly
from label graphs. We illustrate the new metrics by comparing a new primitive-level evaluation to the symbol-level
evaluation performed for the CROHME 2012 handwritten math recognition competition. A Python-based
set of utilities for evaluating, visualizing and translating label graphs is publicly available.
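The core metric reduces to a Hamming distance over labelled primitives and primitive pairs; a minimal sketch with dict-encoded label graphs (the encoding and the "_" no-relation label are assumptions of this example, not the published format):

```python
def label_graph_distance(ref, hyp):
    """Hamming distance between two primitive label graphs, each given as
    a dict mapping a primitive id (node) or an (id, id) pair (edge) to a
    label; entries missing from one graph count as the '_' no-relation label."""
    NONE = "_"
    keys = set(ref) | set(hyp)
    return sum(ref.get(k, NONE) != hyp.get(k, NONE) for k in keys)
```

Restricting the key set to nodes isolates classification errors, while restricting it to edges isolates segmentation and parsing errors, which is how the single measure decomposes into the separate error types described above.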
This work proposes several approaches for generating correspondences between real scanned books and their transcriptions, which may contain modifications and layout variations, while also taking OCR errors into account. Our approaches to the alignment between the manuscript and the transcription are based on weighted finite state transducers (WFSTs). In particular, we propose adapted WFSTs to represent the transcription to be aligned with the OCR lattices. The character-level alignment has edit rules that allow edit operations (insertion, deletion, substitution). These edit operations allow the transcription model to deal with OCR segmentation and recognition errors, and also with the task of aligning different text editions. We augmented the alignment model with a hyphenation model so that it can adapt non-hyphenated transcriptions. Our models also handle Fraktur ligatures, which are typically found in historical Fraktur documents. We evaluated our approach on Fraktur documents from the "Wanderungen durch die Mark Brandenburg" volumes (1862-1889) and observed the performance of these models under OCR errors. We compare the performance of our model for four different scenarios: having no information about the correspondence at the word (i), line (ii), sentence (iii), or page (iv) level.
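The character-level edit rules correspond to the classic Levenshtein alignment; a small dynamic-programming version that also returns the operation sequence (a plain stand-in for the WFST composition used in the paper):

```python
def align(ref, hyp):
    """Levenshtein alignment of two strings: returns the edit distance and
    the operation sequence (match / substitute / insert / delete)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all of ref[:i]
    for j in range(m + 1):
        d[0][j] = j                      # insert all of hyp[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # backtrace from the bottom-right corner
    ops, i, j = [], n, m
    while i or j:
        if i and j and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append("match" if ref[i - 1] == hyp[j - 1] else "substitute")
            i, j = i - 1, j - 1
        elif i and d[i][j] == d[i - 1][j] + 1:
            ops.append("delete")
            i -= 1
        else:
            ops.append("insert")
            j -= 1
    return d[n][m], ops[::-1]
```

Encoding the same edit costs as transducer weights is what lets the paper compose the transcription model directly with OCR lattices instead of a single hypothesis string.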