In this paper, we present a generic Optical Character Recognition system for Arabic script languages called
Nabocr. Nabocr uses OCR techniques tailored to Arabic script. Recognizing Arabic script text is more difficult
than recognizing Latin text because Arabic script is cursive and context-sensitive. Moreover, Arabic script has
several writing styles that vary in complexity. Nabocr is initially
trained to recognize both Urdu Nastaleeq and Arabic Naskh fonts. However, it can be trained by users to be
used for other Arabic script languages. We have evaluated our system's performance for both Urdu and Arabic.
In order to evaluate Urdu recognition, we have generated a dataset of Urdu text images called UPTI (Urdu
Printed Text Image Database), designed to measure different aspects of a recognition system. Our system achieves
91% accuracy on clean Urdu text and 86% on clean Arabic text. Moreover, we have compared the
performance of our system against Tesseract's newly released Arabic recognition, and the performance of both
systems on clean images is almost the same.
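Accuracy figures such as the ones reported above are usually computed as one minus the normalized edit distance between the OCR output and the ground-truth transcription. The following is a minimal sketch of that standard evaluation, not the paper's own evaluation code; the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def char_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Character accuracy in percent: 100 * (1 - normalized edit distance)."""
    errors = levenshtein(ground_truth, ocr_output)
    return 100.0 * (1 - errors / max(len(ground_truth), 1))
```

For example, `char_accuracy("abcd", "abed")` counts one substitution out of four characters, i.e. 75% accuracy.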
Page segmentation into text and non-text elements is an essential preprocessing step before optical character
recognition (OCR). If segmentation is poor, an OCR classification engine produces garbage characters due to
the presence of non-text elements. This paper describes modifications to the text/non-text segmentation
algorithm presented by Bloomberg,<sup>1</sup> which is also available in his open-source Leptonica library.<sup>2</sup>
The modifications yield significant improvements, achieving better segmentation accuracy than the original
algorithm on the UW-III, UNLV, and ICDAR 2009 page segmentation competition test images and on circuit-diagram images.
Ideally, digital versions of scanned documents should be represented in a format that is searchable, compressed,
highly readable, and faithful to the original. These goals can theoretically be achieved through OCR and font
recognition, re-typesetting the document text with original fonts. However, OCR and font recognition remain
hard problems, and many historical documents use fonts that are not available in digital form. It is therefore
desirable to reconstruct fonts whose vector glyphs approximate the shapes of the letters in the original
document. In this work, we address the grouping of tokens in a token-compressed document into candidate fonts.
This permits us to incorporate font information into token-compressed images even when the original fonts are
unknown or unavailable in digital format. This paper extends previous work in font reconstruction by proposing
and evaluating an algorithm to assign a font to every character within a document. This is a necessary step
to represent a scanned document image with a reconstructed font. Through our evaluation method, we have
measured a 98.4% accuracy for the assignment of letters to candidate fonts in multi-font documents.
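The paper's algorithm assigns each character token to a candidate font; as a toy stand-in for that step, the sketch below greedily groups same-size glyph bitmaps by normalized Hamming distance. The function name and threshold are hypothetical and chosen only to illustrate shape-based grouping.

```python
import numpy as np

def group_tokens(tokens, threshold=0.15):
    """Greedy grouping of same-size boolean glyph bitmaps: a token joins
    the first group whose representative it matches within the given
    normalized Hamming distance, otherwise it starts a new group."""
    groups = []  # list of (representative bitmap, member indices)
    for idx, tok in enumerate(tokens):
        for rep, members in groups:
            if np.mean(rep != tok) <= threshold:  # fraction of differing pixels
                members.append(idx)
                break
        else:
            groups.append((tok, [idx]))
    return groups
```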
In the current study, we examine how letter permutation affects visual word recognition in two orthographically
dissimilar languages, Urdu and German. We hypothesize that reading permuted and non-permuted words involves
two distinct mental processes, and that people use different strategies when handling permuted words than when
handling normal words. A comparison of reading behavior across the two languages is also presented. We frame
our study in the context of dual-route theories of reading and observe that dual-route theory is consistent with
our hypothesis of distinct underlying cognitive processes for reading permuted and non-permuted words. We
conducted three lexical decision experiments to analyze how reading is degraded by letter permutation. We
performed analysis of variance (ANOVA), a distribution-free rank test, and t-tests to determine the significance
of differences in response-time latencies between the two classes of data. Results showed that recognition
accuracy for permuted words decreases by 31% for Urdu and by 11% for German. We also found a considerable
difference in reading behavior between the cursive and alphabetic languages: reading Urdu is slower than
reading German due to the characteristics of its cursive script.
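The statistical comparison described above can be sketched as follows with synthetic response-time data; the numbers are invented for illustration and are not the study's measurements. A parametric t-test and a distribution-free rank test (Mann-Whitney U) are applied to the two latency samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical response-time samples in milliseconds; real data would
# come from the lexical decision experiments described above.
normal   = rng.normal(650, 80, 40)   # latencies for normal words
permuted = rng.normal(780, 90, 40)   # latencies for permuted words

# Welch's t-test: do the mean latencies differ significantly?
t, p = stats.ttest_ind(normal, permuted, equal_var=False)

# Distribution-free alternative: Mann-Whitney U rank test.
u, p_rank = stats.mannwhitneyu(normal, permuted)
```

With effect sizes like those reported above, both tests reject the null hypothesis of equal latencies.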
In large-scale scanning applications, detecting the orientation of the digitized page is necessary for subsequent
processing steps to work correctly. Several existing methods for orientation detection exploit the fact that, in
Roman script, ascenders occur more frequently than descenders. In this paper, we propose a different approach
to page orientation detection that uses the same information. The main advantage of our method is that it is
more accurate than widely used methods while being independent of the scan resolution. Another interesting
aspect of our method is that it can be combined with our previously published method for skew detection to
have a single-step skew and orientation estimate of the page image. We demonstrate the effectiveness of our
approach on the UW-I dataset and show that our method achieves an accuracy of above 99% on this dataset. We
also show that our method is robust to different scanning resolutions and can reliably detect page orientations
for documents rendered at 150, 200, 300, and 400 dpi.
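The ascender/descender asymmetry mentioned above can be turned into a simple orientation vote per text line, as in the sketch below. This is a generic illustration of the cue, not the paper's method; the function name and the components' bounding-box representation are assumptions.

```python
import numpy as np

def orientation_vote(boxes):
    """Orientation vote from the ascender/descender cue.
    boxes: sequence of (top, bottom) y-coordinates of character
    components in one text line, with y increasing downward.
    Returns +1 if the line looks upright, -1 if upside down."""
    boxes = np.asarray(boxes, dtype=float)
    tops, bottoms = boxes[:, 0], boxes[:, 1]
    x_top = np.median(tops)      # approximate x-height line
    base = np.median(bottoms)    # approximate baseline
    ascenders = np.sum(tops < x_top - 1)      # reach above x-height
    descenders = np.sum(bottoms > base + 1)   # reach below baseline
    # In upright Roman text, ascenders outnumber descenders.
    return 1 if ascenders >= descenders else -1
```

Because the decision uses only relative coordinates within a line, it is naturally independent of the scan resolution.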
Adaptive binarization is an important first step in many document analysis and OCR processes. This paper
describes a fast adaptive binarization algorithm that yields the same quality of binarization as the Sauvola
method,<sup>1</sup> but runs in time close to that of global thresholding methods (like Otsu's method<sup>2</sup>), independent of
the window size. The algorithm combines the statistical constraints of Sauvola's method with integral images.<sup>3</sup>
Testing on the UW-I dataset demonstrates a 20-fold speedup compared to the original Sauvola algorithm.
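The combination described above is well known: Sauvola's threshold t = m * (1 + k * (s/R - 1)) needs the local mean m and standard deviation s in a w x w window, and integral images of the image and its square deliver both in constant time per pixel regardless of w. The sketch below illustrates that technique; it is not the authors' exact implementation, and the parameter defaults are common choices rather than values from the paper.

```python
import numpy as np

def sauvola_integral(img, w=15, k=0.2, R=128.0):
    """Sauvola thresholding with integral images: local mean and std
    in O(1) per pixel, independent of the window size w.
    Returns a binary image with 1 = background, 0 = ink."""
    img = img.astype(np.float64)
    h, wd = img.shape
    pad = w // 2
    # Zero-padded integral images of the image and its square, so that
    # any rectangular window sum is four table lookups.
    p = np.pad(img, ((1, 0), (1, 0)))
    s1 = p.cumsum(0).cumsum(1)
    s2 = (p ** 2).cumsum(0).cumsum(1)
    # Window bounds, clipped at the image borders.
    ys = np.clip(np.arange(h) - pad, 0, h)
    ye = np.clip(np.arange(h) + pad + 1, 0, h)
    xs = np.clip(np.arange(wd) - pad, 0, wd)
    xe = np.clip(np.arange(wd) + pad + 1, 0, wd)
    area = (ye - ys)[:, None] * (xe - xs)[None, :]

    def winsum(s):
        return s[ye][:, xe] - s[ye][:, xs] - s[ys][:, xe] + s[ys][:, xs]

    mean = winsum(s1) / area
    var = winsum(s2) / area - mean ** 2
    std = np.sqrt(np.maximum(var, 0))          # guard against rounding
    thresh = mean * (1 + k * (std / R - 1))    # Sauvola's formula
    return (img > thresh).astype(np.uint8)
```

Since the two integral images are built once in a single pass, the per-pixel cost no longer grows with the window size, which is what makes the method competitive with global thresholding.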