To accurately recognize the text images of a book, we often employ several models, including an iconic
model (for character classification), a dictionary (for word recognition), a character segmentation model, etc.,
all derived from prior knowledge. Imperfections in these models inevitably degrade recognition performance.
In this paper, we propose an unsupervised learning technique that adapts multiple models on the fly
on a homogeneous input data set to achieve better overall recognition accuracy fully automatically. The
major challenge for this unsupervised learning process is how to make the models improve rather than damage
one another. In our framework, models measure disagreements between their input data and output data.
We propose a disagreement-based policy for safely adapting multiple models simultaneously (or alternately).
We will construct a book recognition system based on this framework, and demonstrate its feasibility.
We describe a technique of linguistic post-processing of whole-book recognition results. Whole-book recognition is a
technique that improves recognition of book images using fully automatic cross-entropy-based model adaptation. In previously
published work, word recognition was performed on each word separately, without regard to passage-level information
such as word-occurrence frequencies. As a result, some words that are rare in real text may appear much more often in recognition
results, and vice versa. Differences between word frequencies in the recognition results and those in prior knowledge may therefore indicate recognition
errors over a long passage. In this paper, we propose a post-processing technique to enhance whole-book recognition
results by minimizing differences between word frequencies in recognition results and prior word frequencies. This technique
works better when operating on longer passages, and it reduces the character error rate by roughly 20% in relative terms, from 1.24% to 0.98%, in a
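The frequency-matching idea in this abstract can be illustrated with a minimal sketch. It assumes the "difference" between observed and prior word frequencies is measured by a KL divergence. The abstract does not say which measure the authors use, so the function `frequency_divergence`, the `floor` smoothing value, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def frequency_divergence(recognized_words, prior_freq, floor=1e-6):
    """KL(observed || prior) in bits over the words seen in a passage.

    `floor` is an assumed smoothing value for words missing from the prior.
    """
    counts = Counter(recognized_words)
    total = sum(counts.values())
    divergence = 0.0
    for word, count in counts.items():
        observed = count / total
        expected = max(prior_freq.get(word, 0.0), floor)
        divergence += observed * math.log2(observed / expected)
    return divergence

# Toy prior and two candidate transcriptions of the same passage (made-up data).
prior = {"the": 0.5, "modern": 0.3, "book": 0.199, "modem": 0.001}
with_errors = ["the", "modem", "the", "modem", "the", "book"]
corrected = ["the", "modern", "the", "modern", "the", "book"]

print(frequency_divergence(with_errors, prior))
print(frequency_divergence(corrected, prior))
```

In this toy case the transcription that over-uses the rare word "modem" scores a much larger divergence than the corrected one. A post-processor in this spirit would prefer the alternative word hypotheses that lower the score over the whole passage.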
Proc. SPIE. 6815, Document Recognition and Retrieval XV
KEYWORDS: Detection and tracking algorithms, Data modeling, Image acquisition, Associative arrays, Image classification, Optical character recognition, Probability theory, Statistical modeling, Performance modeling, Current controlled current source
We describe an approach to unsupervised high-accuracy recognition of the textual contents of an entire book using fully automatic mutual-entropy-based model adaptation. Given images of all the pages of a book together with approximate models of image formation (e.g. a character-image classifier) and linguistics (e.g. a word-occurrence probability model), we detect evidence for disagreements between the two models by analyzing the mutual entropy between two kinds of probability distributions: (1) the a posteriori probabilities of character classes (the recognition results from image classification alone), and (2) the a posteriori probabilities of word classes (the recognition results from image classification combined with linguistic
constraints). The most serious of these disagreements are identified as candidates for automatic corrections to one or the other of the models. We describe a formal information-theoretic framework for detecting model disagreement and for proposing corrections. We illustrate this approach on a small test case selected from real book-image data. This reveals that a sequence of automatic model corrections can drive improvements in both models, and can achieve a lower recognition error rate. The importance of considering the contents of the whole book is motivated by a series of studies, over the last decade, showing that isogeny can be exploited to achieve unsupervised improvements in recognition accuracy.
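One way to picture the disagreement measure is the sketch below, which compares the two kinds of posteriors for a single word image. It is an assumption-laden illustration, not the authors' formulation: it uses a KL divergence as a stand-in for the mutual entropy named in the abstract, it assumes all candidate words have the same length as the list of character posteriors, and the names `induced_char_posterior`, `kl_bits`, and `word_disagreement` are hypothetical.

```python
import math

def induced_char_posterior(word_posterior, position):
    """Character distribution at one position implied by a word posterior."""
    marginal = {}
    for word, prob in word_posterior.items():
        ch = word[position]
        marginal[ch] = marginal.get(ch, 0.0) + prob
    return marginal

def kl_bits(p, q, floor=1e-9):
    """KL(p || q) in bits; `floor` (assumed) avoids division by zero."""
    return sum(
        pv * math.log2(pv / max(q.get(c, 0.0), floor))
        for c, pv in p.items()
        if pv > 0.0
    )

def word_disagreement(char_posteriors, word_posterior):
    """Sum, over character positions, of the divergence between the
    linguistically constrained result and the image-only result."""
    return sum(
        kl_bits(induced_char_posterior(word_posterior, i), p_char)
        for i, p_char in enumerate(char_posteriors)
    )

# Made-up posteriors for one three-letter word image.
word_posterior = {"cat": 0.9, "eat": 0.1}
agreeing_chars = [{"c": 0.9, "e": 0.1}, {"a": 1.0}, {"t": 1.0}]
conflicting_chars = [{"c": 0.1, "e": 0.9}, {"a": 1.0}, {"t": 1.0}]

agree = word_disagreement(agreeing_chars, word_posterior)
conflict = word_disagreement(conflicting_chars, word_posterior)
print(agree, conflict)
```

Ranking the words of a book by such a score would surface the most serious disagreements first, which the abstract describes as the candidates for correcting one model or the other.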
The digitization of ancient Chinese documents presents new challenges to the OCR (Optical Character Recognition) research field because of the large character set of ancient Chinese, the varied font types, and the diverse document layout styles; these documents are a historical reflection of thousands of years of Chinese civilization. After analyzing the general characteristics of ancient Chinese documents, we present a solution for recognizing ancient Chinese documents with regular font types and layout styles. Building on previous work on multilingual OCR in the TH-OCR system, we focus on the design and development of two key technologies: character recognition and page segmentation. Experimental results show that the developed character recognition kernel, which covers 19,635 Chinese characters, outperforms our original traditional Chinese recognition kernel. A benchmark test on printed ancient Chinese books shows that the proposed system is effective for regular ancient Chinese documents.