Historical Chinese character recognition is essential to large-scale digitization of historical documents, but it remains a challenging problem because labeled training samples are scarce. This paper proposes a novel non-linear transfer learning method, Gaussian Process Style Transfer Mapping (GP-STM). GP-STM extends the traditional linear Style Transfer Mapping (STM) by using Gaussian processes and kernel methods. With GP-STM, existing printed Chinese character samples are used to help recognize historical Chinese characters. To demonstrate this framework, we compare feature extraction methods, train a modified quadratic discriminant function (MQDF) classifier on printed Chinese character samples, and apply the GP-STM model to Dunhuang historical documents. Various kernels and parameters are explored, and the impact of the number of training samples is evaluated. Experimental results show that GP-STM increases accuracy by nearly 15 percentage points (from 42.8% to 57.5%), an improvement of more than 8 percentage points (from 49.2% to 57.5%) over the STM approach.
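The idea of a non-linear, kernel-based style mapping can be illustrated with a minimal sketch. It assumes the mapping is the posterior mean of a Gaussian process regression from source-style feature vectors to target-style feature vectors. The RBF kernel, the noise level, and the names `rbf`, `solve`, and `fit_gp_mapping` are illustrative assumptions, not the paper's actual formulation.

```python
import math

def rbf(a, b, length_scale=1.0):
    """RBF kernel between two feature vectors (assumed kernel choice)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * length_scale ** 2))

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

def fit_gp_mapping(sources, targets, noise=1e-3, length_scale=1.0):
    """Return x -> k(x, S) (K + noise * I)^-1 T, the GP posterior mean."""
    n = len(sources)
    d = len(targets[0])
    K = [[rbf(sources[i], sources[j], length_scale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, targets)  # n x d weight matrix

    def mapping(x):
        k = [rbf(x, s, length_scale) for s in sources]
        return [sum(k[i] * alpha[i][j] for i in range(n)) for j in range(d)]

    return mapping
```

In this sketch, `sources` would hold features of a few labeled historical samples and `targets` the corresponding printed-style features; the returned function would then be applied to every historical sample before classification.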
As smartphones and touch screens become increasingly popular, on-line signature verification can serve as
a means of personal identification for mobile computing. In this paper, a novel on-line signature verification
method based on Laplacian Spectral Analysis (LSA) is presented, and a framework integrating LSA with Dynamic Time
Warping (DTW) based methods is proposed for practical application. In the LSA-based method, a Laplacian matrix is
constructed by regarding the on-line signature as a graph. The signature’s writing speed information is encoded in the
Laplacian matrix of the graph, and the eigenvalue spectrum of the Laplacian matrix is analyzed and used for signature
verification. DTW is integrated at two stages.
First, it provides stroke matching results that help the LSA method construct the corresponding graph more accurately.
Second, the on-line signature verification results of DTW are fused with those of the LSA method. Experimental results
on a public signature database and on practical signature data collected on mobile phones demonstrate the effectiveness of the proposed
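The DTW component mentioned above can be sketched as the textbook dynamic-programming recurrence. The sketch assumes each signature is reduced to a one-dimensional sequence (for example, pen speed over time) and uses absolute difference as the local cost; both choices, and the name `dtw_distance`, are assumptions made for illustration rather than details taken from the paper.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # a[i-1] matched again
                                 D[i][j - 1],      # b[j-1] matched again
                                 D[i - 1][j - 1])  # both sequences advance
    return D[n][m]
```

A verification system would typically compare a test signature with enrolled references and accept it when the distance falls below a threshold; how that score is fused with the LSA score is not specified here.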
Digitization of historical Chinese documents involves two key technologies: character segmentation and character
recognition. This paper focuses on developing a character segmentation algorithm. As a preprocessing step, we
combine several effective measures to remove noise from a historical Chinese document image. After binarization,
a new character segmentation algorithm segments single characters based on projections of a cost image in local
windows. The cost image is constructed from the stroke bounding boxes and a skeleton
image extracted from the binarized image. We evaluate the proposed algorithm by the degree of matching between
character bounding boxes in the segmentation results and in the ground-truth data, and achieve a recall rate of 74.3%
on a test set, which shows the effectiveness of the proposed algorithm.
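The bounding-box evaluation described above can be sketched as follows. The abstract does not define its matching degree, so this sketch assumes intersection-over-union (IoU) with a threshold of 0.5 and a greedy one-to-one assignment; the functions `iou` and `recall` and the threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall(predicted, ground_truth, threshold=0.5):
    """Fraction of ground-truth boxes matched by a distinct predicted box."""
    used = set()
    hits = 0
    for g in ground_truth:
        best_j, best_score = None, threshold
        for j, p in enumerate(predicted):
            if j in used:
                continue
            score = iou(g, p)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)  # each predicted box may match only once
            hits += 1
    return hits / len(ground_truth) if ground_truth else 0.0
```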
Proc. SPIE. 7874, Document Recognition and Retrieval XVIII
KEYWORDS: FDA class I medical device development, Detection and tracking algorithms, Image segmentation, Image processing, Feature extraction, Digital imaging, Machine learning, Optical character recognition, Intelligence systems, Selenium
A SemiBoost-based character recognition method is introduced in order to incorporate the information in unlabeled
practical samples into the training stage. One of the key problems in semi-supervised learning is the criterion for selecting
unlabeled samples. In this paper, a criterion based on pair-wise sample similarity is adopted to guide the SemiBoost learning
process. At each iteration, unlabeled examples are selected and assigned labels. The selected samples are used
along with the original labeled samples to train a new classifier. The trained classifiers are combined to form the final
classifier. An empirical study on several pairs of similar Arabic characters with different degrees of similarity shows that the proposed
method improves performance, as the unlabeled samples reveal the distribution of practical samples.
OCR technology for Chinese historical documents is still an open problem. As these documents are hand-written or
hand-carved in various styles, overlapping and touching characters make character segmentation very difficult.
This paper presents an over-segmentation-based method to handle overlapping and touching Chinese
characters in historical documents. The segmentation process consists of two parts: over-segmentation and segmentation
path optimization. In the former, touching strokes are found and segmented by analyzing the geometric
information of the white and black connected components. The segmentation cost of the touching strokes is estimated
from the shape and location of the connected components, as well as the touching stroke width. The latter uses
dynamic programming with local optimization to find the best segmentation path. An HMM is used to represent the multiple choices of
segmentation paths, and the Viterbi algorithm is used to search for a locally optimal solution. Experimental results on practical
Chinese documents show that the proposed method is effective.
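The path optimization step can be illustrated with a simplified dynamic program over candidate cut positions. This is a stand-in for the idea, not the HMM and Viterbi formulation of the paper: the cost (squared deviation of each resulting character size from an expected size), the `max_merge` limit, and the name `best_segmentation` are all assumptions.

```python
def best_segmentation(cuts, expected_size, max_merge=4):
    """Choose a subset of sorted candidate cuts minimizing total cost.

    cuts[0] and cuts[-1] are the fixed start and end of the text line.
    Returns (chosen cut positions, total cost).
    """
    n = len(cuts)
    inf = float("inf")
    best = [inf] * n   # best[j] = minimal cost of segmenting up to cuts[j]
    back = [-1] * n    # back[j] = previous chosen cut on the best path
    best[0] = 0.0
    for j in range(1, n):
        # A character may span at most max_merge over-segmented pieces.
        for i in range(max(0, j - max_merge), j):
            size = cuts[j] - cuts[i]
            cost = best[i] + (size - expected_size) ** 2
            if cost < best[j]:
                best[j], back[j] = cost, i
    # Trace the best path back from the final cut.
    path = [n - 1]
    while path[-1] != 0:
        path.append(back[path[-1]])
    path.reverse()
    return [cuts[k] for k in path], best[n - 1]
```

For example, with candidate cuts at 0, 10, 14, 20, and 30 and an expected size of 10, the spurious cut at 14 is dropped because keeping it would produce two undersized pieces.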
Mongolian is one of the major ethnic languages in China. A large number of printed Mongolian documents need to be
digitized for digital libraries and various applications. Traditional Mongolian script has a unique writing style and multi-font-type
variations, which bring challenges to Mongolian OCR research. Because traditional Mongolian script has some special
characteristics (for example, one character may be part of another character), we define the character set for recognition
according to the segmented components, and the components are combined into characters by a rule-based post-processing
module. For character recognition, a method based on visual directional features and multi-level classifiers is presented.
For character segmentation, a scheme is used to find the segmentation points by analyzing the properties of projections and
connected components. As Mongolian font-types fall into two major groups, the
segmentation parameter is adjusted for each group. A font-type classification method for the two font-type groups is
introduced. For recognition of Mongolian text mixed with Chinese and English, language identification and the relevant
character recognition kernels are integrated. Experiments show that the presented methods are effective. The text
recognition rate is 96.9% on test samples from practical documents with multiple font-types and mixed scripts.
Arabic is a cursive script, and the characteristics of Arabic text differ greatly from those of Latin or Chinese. For example, an Arabic character has up to four written forms, and characters that can be joined are always joined on the baseline. Arabic document recognition therefore requires specialized methods, and character segmentation is the most critical problem. In this paper, a printed Arabic document recognition system is presented, which is composed of text line segmentation, word segmentation, character segmentation, character recognition, and post-processing stages. First, a hybrid top-down and bottom-up method based on connected component classification is proposed to segment Arabic text into lines and words. Subsequently, characters are segmented by analyzing the word contour: the baseline position of a given word is estimated, a function denoting the distance between the contour and the baseline is analyzed to find all candidate segmentation points, and structural rules are then applied to merge over-segmented characters. After character segmentation, both statistical features and structural features are used for character recognition. Finally, a lexicon is used to improve the recognition results. Experiments show that the recognition accuracy of the system reaches 97.62%.
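The contour-based character segmentation described above can be sketched on a small binary word image (a list of rows, 1 for ink). The sketch assumes the baseline is the row with the most ink, measures how far the upper contour rises above it in each column, and proposes the midpoint of every interior run of low columns as a candidate cut. The functions `estimate_baseline`, `contour_distance`, and `candidate_points`, and the `max_height` parameter, are hypothetical; the merging rules for over-segmented characters are not modeled.

```python
def estimate_baseline(img):
    """Row index with the largest horizontal projection (most ink)."""
    sums = [sum(row) for row in img]
    return max(range(len(img)), key=lambda r: sums[r])

def contour_distance(img, baseline):
    """Per column: height of the topmost ink pixel above the baseline."""
    h, w = len(img), len(img[0])
    dist = []
    for c in range(w):
        top = next((r for r in range(h) if img[r][c]), None)
        dist.append(None if top is None else baseline - top)
    return dist

def candidate_points(dist, max_height=0):
    """Midpoints of interior runs where the contour stays near the baseline."""
    points, start = [], None
    for c, d in enumerate(dist + [None]):  # sentinel closes the last run
        low = d is not None and 0 <= d <= max_height
        if low and start is None:
            start = c
        elif not low and start is not None:
            end = c - 1
            # Runs touching the word border are not cuts between characters.
            if start > 0 and end < len(dist) - 1:
                points.append((start + end) // 2)
            start = None
    return points
```

Candidates found this way would still need the structural rules mentioned in the abstract, since strokes within a single character can also dip to the baseline.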
Logical structure extraction from book documents is important for the automatic construction of electronic document databases. The table of contents of a book plays an important role in representing the overall logical structure and reference information of the book. In this paper, a new method is proposed to extract the hierarchical logical structure of book documents, in addition to the reference information, by combining spatial and semantic information from the table of contents. Experimental results on various book documents demonstrate the effectiveness and robustness of the proposed approach.
The digitization of ancient Chinese documents presents new challenges to the OCR (Optical Character Recognition) research field because of the large character set of ancient Chinese, the variety of font types, and the diversity of document layout styles; these documents reflect thousands of years of Chinese civilization. After analyzing the general characteristics of ancient Chinese documents, we present a solution for recognizing ancient Chinese documents with regular font-types and layout-styles. Building on previous work on multilingual OCR in the TH-OCR system, we focus on the design and development of two key technologies: character recognition and page segmentation. Experimental results show that the developed character recognition kernel, which covers 19,635 Chinese characters, outperforms our original traditional Chinese recognition kernel. A benchmark test on printed ancient Chinese books shows that the proposed system is effective for regular ancient Chinese documents.