Mongolian is one of the major ethnic languages in China. Large amount of Mongolian printed documents need to be
digitized in digital library and various applications. Traditional Mongolian script has unique writing style and multi-font-type
variations, which bring challenges to Mongolian OCR research. As traditional Mongolian script has some
characteristics, for example, one character may be part of another character, we define the character set for recognition
according to the segmented components, and the components are combined into characters by rule-based post-processing
module. For character recognition, a method based on visual directional feature and multi-level classifiers is presented.
For character segmentation, a scheme is used to find the segmentation point by analyzing the properties of projection and
connected components. As Mongolian has different font-types which are categorized into two major groups, the
parameter of segmentation is adjusted for each group. A font-type classification method for the two font-type group is
introduced. For recognition of Mongolian text mixed with Chinese and English, language identification and relevant
character recognition kernels are integrated. Experiments show that the presented methods are effective. The text
recognition rate is 96.9% on the test samples from practical documents with multi-font-types and mixed scripts.
Optical performance monitoring is a very import issue in the optical transparent network. We present an optical performance monitoring method, based on asynchronous sampling technology, and Q value and bit error rate can be calculated by the asynchronous histogram. Experiment result shows that the optical performance method is bit rate transparent and modulate format transparent.
As a cursive script, the characteristics of Arabic texts are different from Latin or Chinese greatly. For example, an Arabic character has up to four written forms and characters that can be joined are always joined on the baseline. Therefore, the methods used for Arabic document recognition are special, where character segmentation is the most critical problem. In this paper, a printed Arabic document recognition system is presented, which is composed of text line segmentation, word segmentation, character segmentation, character recognition and post-processing stages. In the beginning, a top-down and bottom-up hybrid method based on connected components classification is proposed to segment Arabic texts into lines and words. Subsequently, characters are segmented by analysis the word contour. At first the baseline position of a given word is estimated, and then a function denote the distance between contour and baseline is analyzed to find out all candidate segmentation points, at last structure rules are proposed to merge over-segmented characters. After character segmentation, both statistical features and structure features are used to do character recognition. Finally, lexicon is used to improve recognition results. Experiment shows that the recognition accuracy of the system has achieved 97.62%.
Although about 300 million people worldwide, in several different languages, take Arabic characters for writing, Arabic OCR has not been researched as thoroughly as other widely used characters (Latin or Chinese). In this paper, a new statistical method is developed to recognize machine-printed Arabic characters. Firstly, the entire Arabic character set is pre-classified into 32 sub-sets in terms of character forms (Isolated, Final, Initial, Medial), special zones (divided according to the headline and the baseline of a text line) that characters occupy and component information (with or without secondary parts, say, diacritical marks, movements, etc.). Then 12 types of directional features are extracted from character profiles. After dimension reduction by linear discriminant analysis (LDA), features are sent to modified quadratic discriminant function (MQDF), which is utilized as the final classifier. At last, similar characters are discriminated before outputting recognition results. Selecting involved parameters properly, encouraging experimental results on test sets demonstrate the validity of proposed approach.
Tibetan optical character recognition (OCR) system plays a crucial role in the Chinese multi-language information processing system. This paper proposed a new statistical method to perform multi-font printed Tibetan/English character recognition. A robust Tibetan character recognition kernel is elaborately designed. Incorporating with previous English character recognition techniques, the recognition accuracy on a test set containing 206,100 multi-font printed characters reaches 99.67%, which shows the validity of the proposed method.
Text segmentation plays a crucial role in a text recognition system. A comprehensive method is proposed to solve Tibetan/English text segmentation. 2 algorithms based on Tibetan inter-syllabic tshegs and discirminant function, respectively, are presented to perform skew detection before text line separation. Then a dynamic recursive character segmentation algorithm integrating multi-level information is developed. The encouraging experimental results on a large-scale Tibetan/English mixed text set show the validity of proposed method.