To test an OCR system for Chinese characters, a very large number of handwritten Chinese character samples is required, and collecting them is time-consuming and strenuous. In this paper, we present a new method for generating handwritten Chinese character samples. We introduce an algebra of geometric shapes in which shapes support composite operations such as Fourier addition and Fourier multiplication. Our method requires only a small set of Chinese character samples. We use a Fourier series descriptor to describe the shape of these specific Chinese characters. Handwritten Chinese character samples with various styles may then be generated automatically by the following means: (1) changing the number of harmonics N of the Fourier series; (2) adding a random shape to the shape of an existing Chinese character with the Fourier addition operation; (3) combining the styles of two or more kinds of handwritten Chinese character samples. Experimental results show that our method is effective.
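The harmonic-truncation and Fourier-addition ideas above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the contour representation (complex points), the blending weight, and all function names are assumptions.

```python
import numpy as np

def fourier_descriptor(contour, n_harmonics):
    """Keep only the first N harmonics of a closed contour given as
    complex points x + iy; varying N changes the reconstructed style."""
    coeffs = np.fft.fft(contour) / len(contour)
    kept = np.zeros_like(coeffs)
    kept[0] = coeffs[0]                          # DC term (centroid)
    kept[1:n_harmonics + 1] = coeffs[1:n_harmonics + 1]
    if n_harmonics > 0:
        kept[-n_harmonics:] = coeffs[-n_harmonics:]
    return kept

def reconstruct(coeffs):
    """Inverse transform back to contour points."""
    return np.fft.ifft(coeffs) * len(coeffs)

def fourier_add(c1, c2, weight=0.1):
    """'Fourier addition': blend a second shape's coefficients into the
    first, producing a randomly perturbed variant of a character shape."""
    return c1 + weight * c2
```

With a large enough N the reconstruction is exact; reducing N smooths the stroke outlines, which is one source of style variation.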
A rejection rule is called class-selective if it does not reject an ambiguous pattern from all classes but only from those classes that are most unlikely to have issued the pattern. A class-selective rejection rule makes a correct decision if the true class of the pattern is among the selected classes; otherwise, i.e., if the true class is rejected, it commits an error. The risk of making an error can be reduced by increasing the number of selected classes. Thus the power of a class-selective rejection rule is characterized by the tradeoff between the error rate and the average number of selected classes. Many class-selective rejection rules have been proposed in the literature, but the optimal rule was discovered only recently. The rule is optimal in the sense that, for any given average number of classes, it minimizes the error rate, and vice versa. The optimal rule consists in selecting all classes whose posterior probability exceeds a prespecified threshold; if no such class exists, the rule simply selects the single best class. This paper presents an experimental comparison of the optimal class-selective rejection rule and two other heuristic rules. The experiments are performed on isolated handwritten numerals from the NIST databases. In particular, the tradeoff powers of the three rules are compared using a neural network based classifier as the estimator of posterior probabilities. The experiments show that the theoretically optimal rule does outperform the heuristic rules in practice.
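The optimal rule described above is simple enough to state directly in code. This is a sketch of the stated decision rule only; the threshold value and the posterior estimates would come from the trained classifier.

```python
def select_classes(posteriors, threshold):
    """Optimal class-selective rejection: keep every class whose
    posterior probability exceeds the threshold; if no class
    qualifies, fall back to the single best class."""
    selected = [c for c, p in enumerate(posteriors) if p > threshold]
    if not selected:
        best = max(range(len(posteriors)), key=lambda c: posteriors[c])
        selected = [best]
    return selected

# An ambiguous pattern: two classes survive a 0.3 threshold,
# while a stricter 0.5 threshold falls back to the single best class.
ambiguous = [0.45, 0.40, 0.10, 0.05]
loose = select_classes(ambiguous, 0.3)   # classes 0 and 1 selected
strict = select_classes(ambiguous, 0.5)  # only the best class selected
```

Raising the threshold shrinks the selected set on average, trading a higher error rate for fewer candidate classes, which is exactly the tradeoff the experiments measure.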
Building upon the utility of connected components, NIST has designed a new character segmentor based on statistically modeling the style of a person's handwriting. Simple spatial features capture the characteristics of a particular writer's style of handprint, enabling the new method to maintain a traditional character-level segmentation philosophy without the integration of recognition or the use of oversegmentation and linguistic postprocessing. Estimates for stroke width and character height are used to compute aspect ratio and standard stroke count features that adapt to the writer's style at the field level. The new method has been developed with a predetermined set of fuzzy rules, making the segmentor much less fragile and much more adaptive, and the new method successfully reconstructs fragmented characters as well as splits touching characters. The new segmentor was integrated into the NIST public domain form-based handprint recognition system and then tested on a set of 490 handwriting sample forms found in NIST Special Database 19. When compared to a simple component-based segmentor, the new adaptable method improved the overall recognition of handprinted digits by 3.4 percent and field level recognition by 6.9 percent, while effectively reducing deletion errors by 82 percent. The same program code and set of parameters successfully segments sequences of uppercase and lowercase characters without any context-based tuning. While not as dramatic as for digits, the recognition of uppercase and lowercase characters improved by 1.7 percent and 1.3 percent respectively. The segmentor maintains a relatively straightforward and logical process flow, avoiding the convolutions of encoded exceptions common in expert systems. As a result, the new segmentor operates very efficiently, and throughput as high as 362 characters per second can be achieved. Letters and numbers are constructed from a predetermined configuration of a relatively small number of strokes.
Results in this paper show that capitalizing on this knowledge through the use of simple adaptable features can significantly improve segmentation, whereas recognition-based and oversegmentation methods fail to take advantage of these intrinsic qualities of handprinted characters.
Generally speaking, a recognition system should be insensitive to the translation, rotation, scaling, and distortion found in the data set. Non-linear distortion is difficult to eliminate. This paper discusses a method based on dynamic programming that copes with feature normalization subject to small non-linear distortions. Combined with k-means clustering, it yields a statistical classification algorithm suitable for pattern recognition problems. In order to assess the classifier, it has been integrated into a handwritten character recognition system. Dynamic features have been extracted from a database of 1248 isolated Roman characters. The recognition rates are, on average, 91.67 percent and 94.55 percent. The classifier might also be tailored to any pattern recognition application.
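The dynamic-programming idea of absorbing small non-linear distortions can be sketched as a DTW-style alignment between two feature sequences. This is an assumption about the general technique, not the paper's exact recurrence, which the abstract does not give.

```python
def dtw(a, b):
    """Dynamic-programming alignment cost between two 1-D feature
    sequences: local stretches and compressions (small non-linear
    distortions) incur no penalty beyond the matched-feature cost."""
    INF = float('inf')
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allow a match, an insertion, or a deletion step.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

A sequence and a locally stretched copy of it align with zero cost, which is why such a distance combines naturally with k-means-style clustering of training samples.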
Principal component analysis (PCA) has been a major field of study in image compression, coding techniques, and pattern recognition, particularly for classification and feature subset selection. Based on its success in these domains, character recognition methods using PCA have attracted considerable attention in recent years. In this paper, we propose a novel scheme for gray-scale handwritten character recognition based on principal features: the character images of the training set are projected onto the subspaces defined by their most important eigenvectors. Here, the significant eigenvectors of each class are chosen as those with the largest associated eigenvalues. These eigenvectors can be thought of as a set of feature vectors, that is, principal features. In this paper, we consider the minimum error subspace classifier for classification. It is a discriminant function derived from the PCA. We discriminate an unknown test character during the recognition phase by projection and classification. The recognition is performed by projecting a test image onto the subspace defined by the dominant eigenvectors of each class and then choosing the class corresponding to the subspace with the minimum error as the class of the test character. In order to verify the performance of the proposed scheme for gray-scale handwritten character recognition, experiments with the IPTP CDROM1 database have been carried out. Of the 12,000 samples available on this CD, 9,000 and 3,000 have been used for training and testing, respectively. In this paper, we investigated the influence of the number of eigencharacters used to define the subspace as well as the number of training characters for each character class. Experimental results reveal that the proposed scheme based on principal features has advantages over other character recognition approaches in its speed and simplicity, learning capacity, and insensitivity to variations in the handwritten character images.
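The minimum-error subspace rule can be sketched compactly. This is a generic PCA-subspace classifier under the assumptions stated in the abstract (per-class eigenvectors, residual-error decision); the function names and the tiny synthetic data are illustrative.

```python
import numpy as np

def fit_subspace(X, k):
    """Per-class principal features: the class mean and the top-k
    principal directions of the training images (rows of X)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def subspace_error(x, mu, V):
    """Norm of the residual after projecting x onto the class subspace."""
    d = x - mu
    return np.linalg.norm(d - V.T @ (V @ d))

def classify(x, subspaces):
    """Minimum-error subspace rule: the class whose dominant
    eigenvectors reconstruct x best wins."""
    return min(subspaces, key=lambda c: subspace_error(x, *subspaces[c]))
```

In practice each row of X would be a flattened gray-scale character image; here two 2-D point clouds stand in for two character classes.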
We present a method to provide estimates of font attributes in an OCR system, using detectors of individual attributes that are error-prone. For an OCR system to preserve the appearance of a scanned document, it needs accurate detection of font attributes. However, OCR environments contain noise and other sources of error that tend to make font attribute detection unreliable. Certain assumptions about font use can greatly enhance accuracy. Attributes such as boldness and italics are more likely to change between neighboring words, while attributes such as serifness are less likely to change within the same paragraph. Furthermore, the document as a whole tends to have a limited number of sets of font attributes. These assumptions allow better use of context than either the raw data alone or simpler methods that would oversmooth the data.
A significant amount of text now present in World Wide Web documents is embedded in image data, and a large portion of it does not appear elsewhere at all. To make this information available, we need to develop techniques for recovering textual information from in-line Web images. In this paper, we describe two methods for Web image OCR. Recognizing text extracted from in-line Web images is difficult because characters in these images are often rendered at a low spatial resolution. Such images are typically considered to be 'low quality' by traditional OCR technologies. Our proposed methods utilize the information contained in the color bits to compensate for the loss of information due to low sampling resolution. The first method uses a polynomial surface fitting technique for object recognition. The second method is based on the traditional n-tuple technique. We collected a small set of character samples from Web documents and tested the two algorithms. Preliminary experimental results show that our n-tuple method works quite well. However, the surface fitting method performs rather poorly due to the coarseness and small number of color shades used in the text.
A system has been built that embeds arbitrary digital data in an iconic representation of a text image. For encoding, a page image containing text is analyzed for the text regions. A highly reduced image of the page is generated, with an iconic version of the text that encodes an input data stream substituting for the text regions. The data is encoded into modulations of rectangular iconic representations of text words, where the length, height and vertical positioning of rectangles, as well as the spacing between rectangles, can all be independently varied. No correspondence need be maintained between the words in the document and the word icons. Word icons or other marks on each line can be used for identifying, calibrating and justifying iconic text. Decoding proceeds by finding iconic lines and determining the iconic word sizes and locations. Word icons printed with 8x reduction are reliably decoded from 300 ppi binary scans. One application is to present iconified first pages of many documents on a sheet of paper, where the URL of each document is encoded in its icon. Icon scanning and selection then allows retrieval of the full document. Another use is to print an icon on every page of a document, containing meta-information about the document or the specific page, such as the version, revision history, keywords, authorization, or a signed hashing of the full image for authentication.
A method is proposed for detecting whether two CCITT Group 4 images were scanned from the same document. Features are extracted from rectangular patches of text and compared with a modified Hausdorff distance measure. Two images are said to be 'equivalent' if the Hausdorff measure finds that a specified number of features are located within a given distance of one another in both images. This paper explains the technique and presents experimental results that demonstrate its effectiveness. It is shown that features extracted from a one-inch square patch of image data provide better than 95 percent correct retrieval accuracy with no false positives on a database of 800 documents.
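A generic modified Hausdorff distance over point-feature sets illustrates the core comparison. The paper's measure additionally thresholds on how many features fall within a given distance; this sketch (Dubuisson-Jain style mean of nearest-neighbour distances) shows only the underlying idea, and the point sets are illustrative.

```python
def modified_hausdorff(A, B):
    """Modified Hausdorff distance between two sets of (x, y) feature
    points: the mean nearest-neighbour distance, computed in both
    directions, with the larger of the two taken as the result."""
    def directed(P, Q):
        total = 0.0
        for px, py in P:
            total += min(((px - qx) ** 2 + (py - qy) ** 2) ** 0.5
                         for qx, qy in Q)
        return total / len(P)
    return max(directed(A, B), directed(B, A))
```

Two scans of the same patch yield a small distance even when individual features jitter slightly, which is what makes the measure robust for duplicate detection.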
In document image filing applications it is important to be able to recognize whether a particular document has already been entered into the system, either as an individual document or as an inclusion in another document. Document images could be matched on the basis of layout or contents. However, matching of layout may not be effective when style is strictly controlled. We develop a document 'handle' which is stored along with the document image. The handle is simply a character shape coded representation of the image after the figures and tables have been removed. Character shape coding is a method of identifying individual character images as members of one of a small number of classes. This process is computationally inexpensive and tolerant of differing generations of photocopying, skew, and scanner characteristics. When a new document is entered into the system, its handle is computed and compared against all of the extant handles using a normalized Levenshtein metric. We demonstrate the ability to detect duplicate documents comprising single and multiple pages.
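The normalized Levenshtein comparison of two handles can be sketched as follows; the normalization by the longer string's length is an assumption (the abstract does not specify which normalization is used).

```python
def normalized_levenshtein(a, b):
    """Levenshtein edit distance divided by the longer string's length,
    giving a 0..1 dissimilarity between two shape-coded handles."""
    n, m = len(a), len(b)
    D = list(range(m + 1))          # row i-1 of the DP table
    for i in range(1, n + 1):
        prev, D[0] = D[0], i        # prev holds D[i-1][j-1]
        for j in range(1, m + 1):
            cur = min(D[j] + 1,           # deletion
                      D[j - 1] + 1,       # insertion
                      prev + (a[i - 1] != b[j - 1]))  # substitution
            prev, D[j] = D[j], cur
    return D[m] / max(n, m, 1)
```

Identical handles score 0; a new document whose handle scores below some threshold against a stored handle would be flagged as a likely duplicate.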
This paper proposes a new approach to the detection of local orientation and skew in document images. It is based on the observation that there are many documents for which a single global estimate of the page skew is not sufficient. These documents require local adaptation to deal robustly with today's complex configurations of components on the page. The approach attempts to identify regions in the image which exhibit locally consistent physical properties and consistent orientation. To do this, we rapidly compute a coarse segmentation and delineate regions which differ with respect to layout and/or physical content. Each region is classified as text, graphics, mixed text/graphics, image, or background using local features, and additional features are extracted to estimate orientation. The local orientation decisions are propagated where appropriate to resolve ambiguity and to produce a global estimate of the skew for the page. The implementation of our algorithms is demonstrated on a set of images which have multiple regions with different orientations.
Restoration and enhancement of digital documents typically involve the use of binary filters. Various methods have been developed to facilitate automatic design of optimal binary filters in the context of morphological image processing. Among these are iterative increasing filters, for designing optimal increasing filters in stages, and paired-representation filters, in which a nonincreasing filter is represented as a union of anti-extensive and extensive increasing filters. The present paper applies these filters in three different modalities for digital document enhancement. Iterative design is illustrated in the context of document restoration for dilated, background-noise images; iterative, paired design is illustrated for restoration of edge-degraded characters; and paired design is used for integer resolution conversion.
Arc segmentation is essential for high-level engineering drawing understanding, and it is very difficult due to arcs' higher-order geometric form and their mixture with other geometric lines. We present an arc segmentation method that is implemented in the current version of the machine drawing understanding system. The method follows closely the perpendicular bisector tracing algorithm we developed previously. Several important variations are explored to improve efficiency, lower the number of false arc detections, and improve the accuracy of the detected arcs. Experimental results are presented and discussed.
In this paper, we discuss a performance evaluator for line-drawing recognition systems on images that contain binary digital logic schematic diagrams, a restricted subclass of engineering line drawings. The evaluator accepts inputs of IGES files containing IGES primitives of straight lines, circles, partial arcs of circles, and IGES label block objects. Our evaluator takes two IGES files: one is the recognition algorithm's output and the other is the corresponding ground truth. The first step of processing involves parsing each IGES file and extracting IGES entities and their parameter information according to the IGES file format specification. The evaluator performs the evaluation for each pair of entities within these two files based on their types and the matching protocols and matching criteria defined in this paper. The result of our evaluator is a table of numbers which, when weighted by application-specific weights, can be summed to produce an overall score relevant to the application.
A performance evaluation protocol for layout analysis is discussed in this paper. In the University of Washington English Document Image Database-III, there are 1600 English document images that come with manually edited ground truth of entity bounding boxes. These bounding boxes enclose text and non-text zones, text-lines, and words. We describe a performance metric for the comparison of the detected entities and the ground truth in terms of their bounding boxes. The Document Attribute Format Specification is used as the standard data representation. The protocol is intended to serve as a model for using the UW-III database to evaluate document analysis algorithms. A set of layout analysis algorithms which detect different entities have been tested on the data set with the performance metric. The evaluation results are presented in this paper.
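One common building block for such bounding-box comparisons is an area-overlap ratio between a detected box and a ground-truth box. The exact matching criteria belong to the protocol itself; this is a generic illustrative sketch with assumed box coordinates.

```python
def box_overlap(a, b):
    """Intersection-over-union of two axis-aligned bounding boxes,
    each given as (x0, y0, x1, y1); a simple instance of the kind of
    bounding-box comparison a layout-evaluation metric builds on."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0
```

A detected zone would typically be matched to the ground-truth zone with the highest overlap, and unmatched boxes counted as misses or false alarms.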
As part of the Department of Energy document declassification program, we have developed a numerical rating system to predict the OCR error rate that we expect to encounter when processing a particular document. The rating algorithm produces a vector containing scores for different document image attributes such as speckle and touching characters. The OCR error rate for a document is computed from a weighted sum of the elements of the corresponding quality vector. The predicted OCR error rate will be used to screen documents that would not be handled properly with existing document processing products.
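The weighted-sum prediction described above is straightforward to express. The attribute order, weights, and bias below are illustrative placeholders, not the values used by the DOE rating system.

```python
def predicted_error_rate(quality, weights, bias=0.0):
    """Predicted OCR error rate as a weighted sum of the elements of a
    document's quality vector (e.g. a speckle score and a
    touching-character score). Weights and bias are hypothetical."""
    return bias + sum(w * q for w, q in zip(weights, quality))

# A hypothetical document scoring 0.2 on speckle and 0.5 on touching
# characters, with illustrative weights of 10 and 4 percentage points.
rate = predicted_error_rate([0.2, 0.5], [10.0, 4.0])
```

Documents whose predicted rate exceeds a cutoff would then be screened out as unsuitable for automatic processing.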
Early work in document image decoding was based on a bilevel imaging model in which an observed image is formed by passing an ideal bilevel image through a memoryless asymmetric bit-flip channel. While this simple model has proven useful in practice, there are many situations in which the bit-flip channel is an inadequate degradation model. This paper presents a multilevel generalization of the bilevel model in which the pixels of the ideal image are assigned values from a finite set of L discrete 'levels'. Level 0 is a background color and the remaining levels are foreground colors. The observed image is bilevel and is modelled as the output of a memoryless L-input-symbol, 2-output-symbol channel. The multilevel model is motivated in part by the intuition that pixels in a character image are more or less reliably black, depending on their distance from an edge. In addition, the multilevel model supports both 'write-black' and 'write-white' levels, and thus can be used to implement a probabilistic analog of morphological 'hit-miss' filtering. In experiments with the University of Washington UW-II English journal database, the character error rate with multilevel templates was about a factor of four less than the error rate with bilevel templates.
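The channel model can be sketched as a per-pixel likelihood computation. This is a minimal illustration of a memoryless L-input, 2-output channel; the level values and flip probabilities below are assumptions, not the paper's estimated parameters.

```python
import math

def log_likelihood(observed, ideal, p_black):
    """Log-likelihood of an observed bilevel image given a multilevel
    ideal image: each ideal level l independently produces a black
    output pixel (1) with probability p_black[l]. Levels near an edge
    would get p_black close to 0.5; interior levels close to 0 or 1."""
    ll = 0.0
    for y, l in zip(observed, ideal):
        p = p_black[l]
        ll += math.log(p if y == 1 else 1.0 - p)
    return ll
```

Decoding would compare such likelihoods across candidate templates, with multilevel templates encoding the edge-distance reliability the abstract describes.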
We propose a novel segmentation algorithm called SMART for color, complex documents. It decomposes a document image into 'binarizable' and 'non-binarizable' components. The segmentation procedure includes color transformation, halftone texture suppression, subdivision of the image into 8 by 8 blocks, classification of the 8 by 8 blocks as 'active' or 'inactive', formation of macroblocks from the active blocks, and classification of the macroblocks as binarizable or non-binarizable. The classification processes involve the DCT coefficients and a histogram analysis. SMART is compared to three well-known segmentation algorithms: CRLA, RXYC, and SPACE. SMART can handle image components of various shapes, multiple backgrounds of different gray levels, different relative grayness of text to its background, tilted image components, and text of different gray levels. To compress the segmented image, we apply JPEG to the non-binarizable macroblocks and the Group 4 coding scheme to the binary image representing the binarizable macroblocks and to the bitmap storing the configuration of all macroblocks. Data about the representative gray values, the color information, and other descriptors of the binarizable macroblocks and the background regions are also sent to allow image reconstruction. The gain in using our compression algorithm over using JPEG for the whole image is significant, and it increases as the proportion of binarizable content in the image increases. Subjects prefer the reconstructed images from our compression algorithm to those from bitrate-matched JPEG images. In a series of test images, this document segmentation and compression system achieves compression ratios two to six times better than standard methods.
Conventional electronic document filing systems are inconvenient because the user must specify the keywords in each document for later searches. To solve this problem, automatic keyword extraction methods using natural language processing and character recognition have been developed. However, these methods are slow, especially for Japanese documents. To develop a practical electronic document filing system, we focused on the extraction of keyword areas from a document by image processing. Our fast title extraction method can automatically extract titles as keywords from business documents. All character strings are evaluated for similarity by rating points associated with title similarity. We classified these points into four items: character string size, position of character strings, relative position among character strings, and string attribution. Finally, the character string that has the highest rating is selected as the title area. The character recognition process is carried out on the selected area. It is fast because this process must recognize a small number of patterns in the restricted area only, and not throughout the entire document. The mean performance of this method is an accuracy of about 91 percent and a 1.8 sec. processing time for an examination of 100 Japanese business documents.
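The highest-rating selection over the four items can be sketched as follows; the feature names and weights are hypothetical stand-ins for the paper's rating points.

```python
def title_score(feats, weights):
    """Weighted sum of title-likeness items (character string size,
    position, relative position, string attribution). Both the
    feature keys and the weights here are illustrative."""
    return sum(weights[k] * feats[k] for k in weights)

def pick_title(candidates, weights):
    """Select the character string with the highest title rating;
    OCR then runs only on this area."""
    return max(candidates, key=lambda c: title_score(c, weights))
```

Because recognition runs only on the winning string's area, the per-document cost stays low regardless of how much text the page contains.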
Rather than follow the trend in form recognition procedures toward more and more sophisticated analysis suitable for archival or enhancement of the form, I have chosen to address the most common use of forms: filling them out immediately. My efforts have focused on permitting the use of the form electronically as early in the recognition process as possible. The interactive procedure described here yields no delay at all, even on older office and home computers. To achieve this, no preprocessing is performed - all computations are made at the instant an area of the form is selected for use. Furthermore, only the addition of new data is permitted, to be aligned with pre-existing markings, precluding the complex analysis needed to replace the image with a symbolic reproduction. Interactivity is the key to the success of this approach. Performing analyses only on demand eliminates the time the human must wait before he is allowed to use the form, and takes advantage of the human element by dividing the tasks. The human locates the areas of interest - a task suited to his pattern-recognizing biological brain - and the computer determines precise alignment within each area - a task more suited to rule-based logical algorithms.
This paper describes an approach to retrieving information from document images stored in a digital library by means of knowledge-based layout analysis and logical structure derivation techniques. Queries on document image content are categorized in terms of the type of information that is desired, and are parsed to determine the type of document from which information is desired, the syntactic level of the information desired, and the level of analysis required to extract the information. Using these clauses in the query, a set of salient documents are retrieved, layout analysis and logical structure derivation are performed on the retrieved documents, and the documents are then analyzed in detail to extract the relevant logical components. A 'document browser' application, being developed based on this approach, allows a user to interactively specify queries on the documents in the digital library using a graphical user interface, provides feedback about the candidate documents at each stage of the retrieval process, and allows refinements of the query based on the intermediate results of the search. Results of a query are displayed either as an image or as formatted text.
We present a pattern matching based compression (PMBC) system which compresses scanned documents into PostScript format. The output of a PMBC system is a pattern library, or font, and a series of pattern indices and positions. PMBC represents scanned documents in the same way that word processing programs represent their output pages. We explore various PostScript representations of this output file, choosing the one resulting in the smallest output after compression with gzip. The resulting PostScript file does not require a separate decompression program to view and print, and is at least 50 percent smaller than the PostScript files generated by other conventional programs, such as tifftops.