Optical character recognition (OCR) is a challenging task because most existing preprocessing approaches are
sensitive to writing style, writing material, noise, and image resolution, so no single recognition system can
address all the factors present in real document images. In this paper, we describe an approach that combines
diverse recognition systems using iVector-based features, a method recently developed in the field of speaker verification.
Prior to system combination, document images are preprocessed and text-line images are extracted with a different
approach for each system; an iVector is then derived from a high-dimensional supervector of each text line
and used to predict OCR accuracy. We merge hypotheses from the multiple recognition systems according
to the overlap ratio and the predicted OCR score of the text-line images. We present evaluation results on an Arabic
document database, where the proposed method is compared against the single best OCR system using the word
error rate (WER) metric.
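The merging step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Hypothesis` class, the intersection-over-union overlap measure, and the keep-the-higher-score rule are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    bbox: tuple   # (x0, y0, x1, y1) text-line bounding box
    score: float  # predicted OCR accuracy (e.g., from an iVector-based regressor)

def overlap_ratio(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def merge(hyps_a, hyps_b, min_overlap=0.5):
    """For each line from system A, keep the better-scoring hypothesis
    among sufficiently overlapping lines from system B; otherwise keep A's."""
    merged = []
    for ha in hyps_a:
        best = ha
        for hb in hyps_b:
            if overlap_ratio(ha.bbox, hb.bbox) >= min_overlap and hb.score > best.score:
                best = hb
        merged.append(best)
    return merged
```

In practice the hypothesis selection could also interpolate scores or vote at the word level; the point here is only the overlap-gated comparison of per-line OCR scores.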
In this paper, we present a novel text line segmentation framework following the divide-and-conquer paradigm:
we iteratively identify and re-process regions of ambiguous line segmentation from an input document image
until there is no ambiguity. To detect ambiguous line segmentation, we introduce two complementary
line descriptors, referred to as the underline and highlight descriptors, and identify ambiguities where their
patterns mismatch. As a result, we can easily identify already-good line segmentations and largely simplify the
original line segmentation problem by reprocessing only the ambiguous regions. We evaluate the performance of the
proposed line segmentation framework on the ICDAR 2009 handwritten document dataset, where it is close to the
top-performing systems submitted to the competition. Moreover, the proposed method is also robust against
skewness, noise, variable line heights and touching characters. The proposed idea can also be applied to other
text analysis tasks such as word segmentation and page layout analysis.
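The iterative divide-and-conquer loop can be sketched as a small driver, shown below. This is a hypothetical skeleton: the two descriptor functions, the equality test for agreement, and the `refine` splitter stand in for the paper's underline/highlight descriptors and its region re-processing step.

```python
def segment_lines(image, detect_a, detect_b, refine, max_iters=10):
    """Iteratively re-segment regions where two independent line
    descriptors disagree (illustrative skeleton, not the paper's code).

    detect_a, detect_b: region -> list of candidate line segmentations
    refine:             region -> list of smaller sub-regions to retry
    """
    regions = [image]
    lines = []
    for _ in range(max_iters):
        ambiguous = []
        for region in regions:
            la, lb = detect_a(region), detect_b(region)
            if la == lb:                 # descriptors agree: accept these lines
                lines.extend(la)
            else:                        # mismatch: split the region and retry
                ambiguous.extend(refine(region))
        if not ambiguous:                # no ambiguity left: done
            break
        regions = ambiguous
    return lines
```

The key property is that regions where both descriptors already agree are accepted immediately, so only the (typically few) ambiguous regions pay the cost of re-processing.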
In this paper, we present a novel method for extracting handwritten and printed text zones from noisy document
images with mixed content. We use Triple-Adjacent-Segment (TAS) based features, which encode local shape
characteristics of text in a consistent manner. We first construct two codebooks of the shape features extracted
from a set of handwritten and printed text documents respectively. We then compute the normalized histogram
of codewords for each segmented zone and use it to train a Support Vector Machine (SVM) classifier. The
codebook based approach is robust to the background noise present in the image and TAS features are invariant
to translation, scale and rotation of text. In experiments, we show that a pixel-weighted zone classification
accuracy of 98% can be achieved for noisy Arabic documents. Further, we demonstrate the effectiveness of our
method for document page classification and show that a high precision can be achieved for the detection of
machine-printed documents. The proposed method is robust to zone size; zones may contain text content
at the line or paragraph level.
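The codebook step described above, assigning each local shape feature to its nearest codeword and histogramming the assignments, can be sketched as below. The function name, distance metric, and L1 normalization are assumptions for illustration; the actual TAS feature extraction and codebook construction are not shown.

```python
import numpy as np

def codeword_histogram(feature_vectors, codebook):
    """Assign each local shape feature to its nearest codeword (Euclidean
    distance) and return the L1-normalized histogram of codeword counts."""
    # Pairwise distances: shape (n_features, n_codewords)
    d = np.linalg.norm(feature_vectors[:, None, :] - codebook[None, :, :], axis=2)
    idx = d.argmin(axis=1)
    hist = np.bincount(idx, minlength=len(codebook)).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

# The resulting fixed-length histograms could then be fed to a classifier,
# e.g. scikit-learn's SVC:
#   X = np.stack([codeword_histogram(f, codebook) for f in zone_features])
#   clf = SVC(kernel="rbf").fit(X, labels)
```

Because the histogram is normalized, the representation is insensitive to the number of features in a zone, which is consistent with the reported robustness to zone size.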
Proc. SPIE. 5676, Document Recognition and Retrieval XII
KEYWORDS: Detection and tracking algorithms, Data modeling, Imaging systems, Feature extraction, Data processing, Optical character recognition, Optimization (mathematics), Electronic imaging, Performance modeling, Systems modeling
The BBN Byblos OCR system implements a script-independent methodology for OCR using Hidden Markov Models (HMMs). We have successfully ported the system to Arabic, English, Chinese, Pashto, and Japanese. In this paper, we report on our recent effort in training the system to perform recognition of Hindi (Devanagari) documents. The initial experiments reported in this paper were performed using a corpus of synthetic (computer-generated) document images along with slightly degraded versions of the same that were generated by scanning printed versions of the document images and by scanning faxes of the printed versions. On a fair test set consisting of synthetic images alone we measured a character error rate of 1.0%. The character error rate on a fair test set consisting of scanned images (scans of printed versions of the synthetic images) was 1.40% while the character error rate on a fair test set of fax images (scans of printed and faxed versions of the synthetic images) was 8.7%.
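The character error rates quoted above are the standard edit-distance metric: insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the reference length. A compact single-row dynamic-programming implementation:

```python
def character_error_rate(ref, hyp):
    """CER = Levenshtein edit distance (insertions + deletions +
    substitutions) between hypothesis and reference, divided by len(ref)."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))           # DP row for the empty reference prefix
    for i in range(1, m + 1):
        prev, d[0] = d[0], i         # prev holds d[i-1][j-1]
        for j in range(1, n + 1):
            cur = d[j]               # d[i-1][j] before overwriting
            d[j] = min(d[j] + 1,                           # deletion
                       d[j - 1] + 1,                       # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution/match
            prev = cur
    return d[n] / m if m else 0.0
```

For example, "kitten" versus "sitting" has edit distance 3, so the CER against the six-character reference is 0.5.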
We present a language-independent optical character recognition system that is capable, in principle, of recognizing printed text from most of the world's languages. For each new language or script the system requires sample training data along with ground truth at the text-line level; there is no need to specify the location of either the lines or the words and characters. The system uses hidden Markov modeling technology to model each character. In addition to language independence, the technology enhances performance for degraded data, such as fax, by using unsupervised adaptation techniques. Thus far, we have demonstrated the language-independence of this approach for Arabic, English, and Chinese. Recognition results are presented in this paper, including results on faxed data.
In recent years, the wavelet paradigm has become a popular area of research and has proved to be a very powerful technique for a variety of applications ranging from communication protocols to image compression. While most of the published literature deals with real-valued wavelets, there has recently been some interest in wavelets over finite fields, especially GF(2). This paper describes a framework for a logical-wavelet-based multiresolution analysis of binary images.
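To give a flavor of wavelets over GF(2): since addition in GF(2) is XOR, a one-level lifting-style decomposition can split a binary signal into even samples (approximation) and odd-XOR-even residuals (detail), and the transform is exactly invertible. This is only a sketch of the general idea, not the specific construction of the paper.

```python
def gf2_lift(row):
    """One level of a lifting-style wavelet over GF(2): even-indexed
    samples form the approximation; details are odd XOR even
    (the prediction residual, with XOR playing the role of subtraction)."""
    even, odd = row[0::2], row[1::2]
    detail = [o ^ e for o, e in zip(odd, even)]
    return even, detail

def gf2_unlift(even, detail):
    """Invert gf2_lift: recover the odd samples and re-interleave."""
    odd = [d ^ e for d, e in zip(detail, even)]
    out = []
    for e, o in zip(even, odd):
        out.extend([e, o])
    return out
```

Applying `gf2_lift` recursively to the approximation band (and along both image axes) yields a multiresolution pyramid for binary images in which every coefficient remains a single bit.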