The problem of writer identification based on similarity is formalized by defining a distance between character- or word-level features and finding the most similar writings, or all writings within a given threshold distance. Among many possible features, we consider the stroke-direction and pressure-sequence strings of a character as character-level image signatures for writer identification. As the conventional definition of edit distance is not directly applicable, we present newly defined and modified edit distances that depend on the measurement types. Finally, we present a prototype stroke-direction and pressure-sequence string extractor used in writer identification. The main contribution of this study is the attempt to define a distance between two characters based on these two types of strings.
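A modified edit distance of the kind described can be sketched as follows. This is a minimal illustration, not the paper's exact definition: the assumption here is that direction codes are 8-direction chain codes and that substitution cost grows with angular difference, so confusing nearby directions is cheaper than confusing opposite ones.

```python
def direction_edit_distance(a, b, n_dirs=8):
    """Edit distance between two stroke-direction strings.

    Directions are integers 0..n_dirs-1 (assumed 8-direction chain codes).
    Substitution cost is proportional to the angular difference between the
    codes, normalized to [0, 1]; insertions and deletions cost 1.
    """
    def sub_cost(x, y):
        d = abs(x - y) % n_dirs
        return min(d, n_dirs - d) / (n_dirs // 2)  # 0.0 .. 1.0

    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)
    for j in range(1, n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1.0,             # deletion
                           dp[i][j - 1] + 1.0,             # insertion
                           dp[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return dp[m][n]
```

With this cost, two direction strings that differ only by slight directional jitter stay close, while a missing or extra stroke segment still costs a full unit.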
The segmentation of degraded characters is a challenging problem, and many optical character recognition systems remain weak at it. Broken and touching characters are two types of degradation frequently encountered in old documents, in newspapers where the print is very condensed, and elsewhere. We therefore propose in this article efficient techniques for segmenting and recognizing broken and touching characters. These techniques use several mathematical tools, such as fuzzy logic and statistics. An algorithm combining all of these techniques is presented at the end of the article. Based on cooperation between classification and segmentation, our algorithm succeeds in processing chains of characters containing any number of broken or touching characters, without a priori knowledge of either the width of the characters or their number.
The printed areas of a handprinted character with thick strokes are replaced by a frame formed of bent ellipses, to represent the character efficiently and to emulate the high-order receptive fields of the later visual system. Each bent ellipse maximally fits the local stroke pattern and captures the position, orientation, and topology information it contains. Complex stroke structures are represented by concept neurons, each of which contains several bent ellipses. This design of concept neurons provides a uniform representation for receptive fields of any order. The model uses these concept neurons to search for their corresponding neurons in the template frame. To obtain the correspondence, a global affine transform followed by a local distortion process is used to align the two frames. To ensure topology preservation, the topological order of the character is generated explicitly. Classification is achieved by examining the similarity between the topological order of the handprinted pattern and those of the template patterns.
A structural method for the analysis of Chinese characters is presented, with the purpose of handwritten character recognition. First, a line-following and thinning process is used to obtain the thinned shape of the character. This process includes a specific treatment of singular regions that allows the detection of branching points. In a second stage, an extended direction code is assigned to each point of the thinned line. Median filtering of the extended codes then eliminates much of the quantization noise without altering significant direction changes. This splits the character into a list of straight line segments, each characterized by a main direction attribute. In a third stage, strokes are extracted by grouping adjoining segments with neighboring directions. To compare two characters, we first try to associate with each stroke of the first character the nearest stroke of the second. The distance between the two characters is then obtained from the sum of the distances between the paired strokes, and takes into account the possible presence of non-paired strokes in both characters.
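The pairing-and-summing comparison above can be sketched as follows. The stroke feature vector (e.g. midpoint, direction, length) and the greedy nearest-stroke pairing are illustrative assumptions; the paper does not fix either, and an optimal assignment would also be possible.

```python
def stroke_set_distance(strokes_a, strokes_b, unpaired_penalty=1.0):
    """Distance between two characters given as lists of stroke features.

    Each stroke of the first character is greedily paired with the nearest
    unused stroke of the second; the total distance is the sum over pairs
    plus a fixed penalty for every stroke left unpaired in either character.
    """
    def d(s, t):
        return sum((u - v) ** 2 for u, v in zip(s, t)) ** 0.5

    used = set()
    total = 0.0
    for s in strokes_a:
        best_j, best_d = None, None
        for j, t in enumerate(strokes_b):
            if j in used:
                continue
            dist = d(s, t)
            if best_d is None or dist < best_d:
                best_j, best_d = j, dist
        if best_j is None:                 # no strokes left to pair with
            total += unpaired_penalty
        else:
            used.add(best_j)
            total += best_d
    total += unpaired_penalty * (len(strokes_b) - len(used))
    return total
```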
Latin and Chinese OCR systems have been studied extensively in the literature, yet little work has been done on Arabic character recognition. This is due to the technical challenges posed by Arabic text: because of its cursive nature, a powerful and stable text segmentation is needed, and features capturing the characteristics of the rich Arabic character shapes are needed to build an Arabic OCR. In this paper a novel segmentation technique that is font- and size-independent is introduced. This technique can segment a cursive text line even if the line suffers from slight skew. It is not sensitive to the location of the centerline of the text line and can segment different font sizes and types (for different character sets) occurring on the same line. Feature extraction is one of the most important phases of a text reading system. Ideally, the features extracted from a character image should capture the essential characteristics of the character independently of font type and size; in such an ideal case, the classifier stores a single prototype per character. In practice, however, it is challenging to find such an ideal set of features. In this paper, a set of features reflecting the topological aspects of Arabic characters is proposed. These features, integrated with a topological matching technique, yield an Arabic text reading system that is semi-omni.
In this paper a structural method for recognizing Persian handwritten digits is proposed. Because of the different writing styles of some Persian digits, they were divided into 18 classes in this work. In the recognition algorithm, a heuristically designed binary decision tree recognizes the digits via a series of intuitive structural features. The aim of the research was to choose structural features that are useful in terms of both accuracy and speed. A set of 6180 digits was used for testing, of which about 93.1 percent were recognized correctly. The average recognition speed of the algorithm is 24 digits per second on a 486DX4 100 MHz.
We offer a perspective on the performance of current OCR systems by illustrating and explaining actual OCR errors made by three commercial devices. After discussing briefly the character recognition abilities of humans and computers, we present illustrated examples of recognition errors. The top level of our taxonomy of the causes of errors consists of Imaging Defects, Similar Symbols, Punctuation, and Typography. The analysis of a series of 'snippets' from this perspective provides insight into the strengths and weaknesses of current systems, and perhaps a road map to future progress. The examples were drawn from the large-scale tests conducted by the authors at the Information Science Research Institute of the University of Nevada, Las Vegas. By way of conclusion, we point to possible approaches for improving the accuracy of today's systems. The talk is based on our eponymous monograph, recently published in The Kluwer International Series in Engineering and Computer Science, Kluwer Academic Publishers, 1999.
This paper describes a memory-efficient representation for connected-component-labeled images. Connected-component labeling enables powerful segmentation methods, but the labeled images require a great deal of memory. To address this problem, we developed a hierarchical rectangular representation for connected-component-labeled images that avoids generating the full label image. Experimental results show the effectiveness of our method.
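The paper's hierarchical rectangular structure is not spelled out in the abstract, but the memory-saving idea can be illustrated with a simpler cousin: encoding the label map as per-row runs instead of a per-pixel array. Components occupy long horizontal runs, so storing run endpoints is far more compact than storing every pixel's label.

```python
def runs_from_labels(label_image):
    """Encode a 2-D label image as per-row runs (start, end, label).

    Background is label 0 and is not stored. `end` is exclusive.
    """
    rows = []
    for row in label_image:
        runs, start, cur = [], None, None
        for x, lab in enumerate(row):
            if lab != 0 and start is None:
                start, cur = x, lab
            elif start is not None and lab != cur:
                runs.append((start, x, cur))
                start, cur = (x, lab) if lab != 0 else (None, None)
        if start is not None:
            runs.append((start, len(row), cur))
        rows.append(runs)
    return rows

def label_at(rows, x, y):
    """Recover the label of pixel (x, y) from the run representation."""
    for start, end, lab in rows[y]:
        if start <= x < end:
            return lab
    return 0
```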
Two statistical context-based filters are proposed for enhancing binary document images for compression and recognition. The Simple Context Filter unconditionally changes uncommon pixels in low-information contexts, whereas the Gain-Loss Filter changes pixels conditionally, depending on whether the gain in compression outweighs the loss of information. Evaluation methods and comparisons with some traditional filtering methods are presented. The proposed filters alleviate the loss in compression performance caused by digitization noise while preserving image quality, measured as OCR accuracy. The Gain-Loss Filter approximately reaches the compression limit estimated by compressing the noiseless digital original.
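The flavor of a context-based filter can be sketched as below. The 3x3 neighbourhood and the majority threshold are illustrative assumptions, not the published parameters: a pixel that disagrees with a strongly one-sided context is treated as digitization noise and flipped.

```python
def simple_context_filter(image, threshold=7):
    """One pass of a simple context-based filter over a binary image.

    For each interior pixel, the 3x3 neighbourhood (excluding the pixel
    itself) forms its context. If at least `threshold` of the 8 neighbours
    agree on a value different from the pixel's own, the pixel is flipped.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ones = sum(image[y + dy][x + dx]
                       for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                       if (dy, dx) != (0, 0))
            if image[y][x] == 0 and ones >= threshold:
                out[y][x] = 1
            elif image[y][x] == 1 and (8 - ones) >= threshold:
                out[y][x] = 0
    return out
```

A gain-loss variant would additionally estimate, per pixel, the coding-cost reduction of the flip against an information-loss measure, and flip only when the former dominates.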
We describe the design of document analysis procedures to separate mathematics from ordinary text on a scanned page of mixed material. It is easy to observe that the accuracy of commercial OCR programs is helped by separating mixed material into two (or more) streams, with conventional non-math text handled by the usual OCR text-based-heuristics analysis. The second stream, consisting of material judged to be mathematics, can be fed to a specialized recognizer. If that fails to decode it, it can be passed on to yet a third stream including diagrams, logos, or other miscellaneous material, perhaps including halftones. We explore the extent to which this separation can be automated in the context of scanning archival material for a digital library project including mathematical and scientific journal pages.
Most document forms use straight lines as reference positions for filled-in information. Automated data-entry systems for such documents must be able to locate these reference lines so that the positions of the information in the forms can be determined. This paper proposes a wavelet-based algorithm for extracting such reference lines from business forms. A stationary wavelet transform is used to decompose a gray-level document image into different frequency-band images. The horizontal detail subband is then selected and post-processed to produce a binary bitmap of the reference lines. Experimental results on synthetic and real document images illustrate the usefulness of the algorithm.
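The role of the horizontal detail subband can be illustrated with a single decimated Haar step, used here as a stand-in for the stationary wavelet transform of the paper (the SWT is undecimated; everything else about this sketch is an assumption). Averaging along rows and differencing along columns makes horizontal ruling lines produce large coefficients.

```python
def horizontal_detail(image):
    """Single-level Haar 'horizontal detail' of a gray-level image.

    Low-pass along x (row averaging), high-pass along y (column
    differencing), so sharp horizontal edges such as ruling lines give
    large coefficient magnitudes.
    """
    h, w = len(image), len(image[0])
    detail = []
    for y in range(0, h - 1, 2):
        row = []
        for x in range(0, w - 1, 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            row.append(((a + b) - (c + d)) / 2.0)
        detail.append(row)
    return detail

def line_rows(detail, thresh):
    """Rows of the detail subband whose mean magnitude exceeds thresh."""
    return [y for y, row in enumerate(detail)
            if sum(abs(v) for v in row) / len(row) > thresh]
```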
Automatic reading of form documents with a gray-level background requires a preliminary step that prepares a clean image of the data to be recognized by OCR. In this paper, we present a data extraction process for such documents. First, the preprinted background is removed by decomposing the histogram of the input image. Reference lines are then subtracted from the image. Finally, the parts of character images lost where they overlap the reference lines are restored. Experiments carried out on a bank cheque illustrate the usefulness of the algorithm.
The automation of document storage and retrieval has become an important issue in the restructuring of many institutions; in governmental agencies, official forms processing has attracted the attention of decision-makers, and electronic storage of documents is essential for improving document processing. In this paper, the improvement of official forms processing is considered. Each form consists of a static portion (the preprinted fields) and a dynamic portion (the handwritten filling). By splitting the forms into these two portions, the static portion can be discarded and only the dynamic portion, which contains the information, is stored; the splitting depends on the color differentiation of the two portions. Only one static portion is kept as a reference for each form, which results in a reduced size for the original document. The resulting dynamic portion of the document is then binarized into background and foreground, and a lossless compression procedure (e.g., run-length coding) is applied to the binarized image. To retrieve a form, the compressed variable portion is expanded to its original size, the handwritten fields are aligned with the static fields of the reference document, and the original document is reconstructed. The developed procedure was applied to the official identification card, and compression reduced the stored data by more than 90% of the original size. This makes more efficient use of storage media and guarantees faster transfer of the compressed file between sites over computer networks.
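The run-length coding mentioned above is easy to make concrete. This is a generic sketch, not the system's actual coder: the output alternates 0-runs and 1-runs starting with a 0-run (possibly of length zero), which works well because a binarized form is mostly background.

```python
def rle_encode(bits):
    """Run-length encode a binary sequence (e.g. a flattened binarized form).

    Returns a list of run lengths, alternating 0-runs and 1-runs and
    starting with a 0-run (length zero if the sequence begins with 1).
    """
    runs, current, count = [], 0, 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

def rle_decode(runs):
    """Invert rle_encode."""
    bits, value = [], 0
    for count in runs:
        bits.extend([value] * count)
        value ^= 1
    return bits
```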
In this paper, we present a logical representation of form documents to be used for identification and retrieval. A hierarchical structure, built from lines, is proposed to represent the logical structure of a form. The approach is top-down, and no domain knowledge such as the preprinted or filled-in data is used. Logically identical forms are associated with the same hierarchical structure. This representation can handle geometrical modifications and slight variations.
The paper considers the problem of automatic input and vectorization of engineering drawings (EDs). One of the most difficult tasks in this problem is the recognition of ED entities. We propose an algorithm to recognize one class of ED entities, namely blocks, or closed 'thick' polygons, in engineering drawings. The algorithm is based on vectorization of the ED image; it takes into account the parameters of the vectorized segments, tests several alternative extensions of a path, and chooses the best among them. It has been verified experimentally on real ED images and has shown good results. The paper concludes with some technical parameters of the experiments.
This paper describes a method for automatically routing unconstrained faxes to mail recipients. Incoming faxes come in many formats, vary in how names are represented, and can be either machine- or hand-printed. Our approach combines multiple OCR engines, geometry-based name location, and error-correcting name matching. Candidates for a name match are scored by several components, such as the quality of the name image, the confidence of the name text, and the confidence of the keyword that appears in front of the name. Multiple name matches are then filtered using weights based on the location of the name in the document, the distance between the keyword and the name, and so on. As a result, our system routes faxes accurately to the right destination, or sends a fax to an operator if the recipient cannot be recognized with sufficient confidence. An initial prototype fax routing system has been deployed and tested at DARPA. In this paper we discuss our approach, the results of testing at a live site, and directions for future work.
We report on the history, composition, and characteristics of the UNLV-ISRI document collection. We further provide a short summary of research projects that were conducted using subsets of this collection; these projects were designed to address retrieval effectiveness on OCR-generated collections. Along with this report, ISRI is making the collection available to researchers for further study of OCR and information retrieval.
The decomposition of a document into segments such as text regions and graphics is a significant part of the document analysis process, and systematic evaluation is the basic requirement for rating and improving page segmentation algorithms. The approaches known from the literature have the disadvantage that manually generated reference data (zoning ground truth) are needed for the evaluation task, and the effort and cost of creating these data are very high. This paper describes the evaluation system SEE. The system requires only the OCR-generated text and the original text of the document in correct reading order (text ground truth) as input; no manually generated zoning ground truth is needed. The implicit structure information contained in the text ground truth is used to evaluate the automatic zoning. To this end, an assignment (matching) between the text regions in the text ground truth and those in the OCR-generated text is sought. A fault-tolerant string matching algorithm is used to make the method tolerate OCR errors in the text. The segmentation errors are determined from the evaluation of this matching. Subsequently, the edit operations necessary to correct the recognized segmentation errors are computed to estimate the correction costs.
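The core of fault-tolerant matching of a ground-truth zone against OCR output is approximate substring search, sketched below with the classic Sellers dynamic programme (an illustrative stand-in; SEE's actual algorithm is not specified in the abstract). Unlike plain edit distance, an occurrence may start anywhere in the text for free, so a zone can be located inside a long OCR stream despite recognition errors.

```python
def approx_find(pattern, text):
    """Minimal edit cost of matching `pattern` against any substring of `text`.

    Sellers' algorithm: row 0 of the edit-distance table is all zeros,
    making the start of the occurrence free. The returned value is the
    number of edits of the best approximate occurrence.
    """
    m = len(pattern)
    prev = [0] * (len(text) + 1)        # free start anywhere in text
    for i in range(1, m + 1):
        cur = [i] + [0] * len(text)
        for j in range(1, len(text) + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete from pattern
                         cur[j - 1] + 1,       # insert into pattern
                         prev[j - 1] + cost)   # match / substitute
        prev = cur
    return min(prev)
```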
Text categorization in the form of topic identification is a capability of current interest. This paper is concerned with the categorization of electronic document images. Previous work on the categorization of document images has relied on optical character recognition (OCR) to provide the transformation between the image domain and a domain where pattern recognition techniques are more readily applied. Our work uses a different technology for this transformation: character shape coding, a computationally efficient and extraordinarily robust means of providing access to the character content of document images. While this transform is lossy, sufficient salient information is retained to support many applications. Furthermore, shape coding is particularly advantageous over OCR when processing page images of poor quality. In this study we found that topic identification performance was maintained or slightly improved when using character shape codes derived from images.
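Character shape coding can be illustrated with a toy coder: each letter is mapped to a coarse class determined by its vertical extent, which is cheap to estimate from an image without full recognition. The class table below is an assumption for illustration; the paper's actual code alphabet is richer.

```python
# Illustrative shape classes: ascenders/capitals/digits, descenders,
# and x-height letters. Not the paper's exact alphabet.
ASCENDERS = set("bdfhklt") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
DESCENDERS = set("gjpqy")

def shape_code(text):
    """Map text to a coarse shape-code string.

    'A' marks a character with an ascender (or a capital/digit), 'g' a
    descender, 'x' any other letter (x-height); spaces and punctuation
    pass through unchanged, so word boundaries survive the coding.
    """
    out = []
    for ch in text:
        if ch in ASCENDERS:
            out.append("A")
        elif ch in DESCENDERS:
            out.append("g")
        elif ch.isalpha():
            out.append("x")
        else:
            out.append(ch)
    return "".join(out)
```

The coding is lossy ("dog" and "bay" share the code "Axg"-style shape pattern only if their classes align), yet word-shape token statistics remain distinctive enough for topic identification.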
Searching for documents by type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify the document's type in the absence of domain-specific models. A document type or genre can be defined by the user, based primarily on layout structure. Our classification approach builds a supervised classifier for the 'visual similarity' of layout structure, given examples of each class. We use image features such as the percentages of text and non-text (graphics, image, table, and ruling) content regions, column structure, variation in font point sizes, density of the content area, and various statistics on connected-component features, all of which can be derived from class samples without class knowledge. To obtain class labels for training samples, we conducted a user relevance test in which subjects ranked UW-I document images with respect to 12 representative images. We implemented our classification scheme using OC1, a decision tree classifier, and report our findings.
Text categorization is useful for indexing documents for information retrieval, filtering parts of documents for document understanding, and summarizing the contents of documents of special interest. We describe a text categorization task and an experiment using documents from the Reuters and OHSUMED collections. We applied the Decision Forest classifier and compared its accuracy to those of the C4.5 and kNN classifiers, using both category-dependent and category-independent term selection schemes. We found that Decision Forest outperforms both C4.5 and kNN in all cases, and that category-dependent term selection yields better accuracy. The performance of all three classifiers degrades from the Reuters collection to the OHSUMED collection, but Decision Forest remains superior.
An IETM (Interactive Electronic Technical Manual) is intended to be the functional equivalent of a paper-based Technical Manual (TM), and in most cases a total replacement for the paper manual. In this paper, we describe some of the document image understanding technologies applied to the conversion of paper-based TMs to IETMs. Using these advanced technologies allows us to convert paper-based TMs to class 1/2 IETMs. However, this was not sufficient for an automated integrated logistics support system in the ROC Department of Defense, so a more advanced IETM system is required. Such a class 4/5 IETM system could provide intelligent display of information and other user applications such as diagnostics, intelligent design and manufacturing, or computer-managed training. The author has developed some of these advanced functions, and examples are shown to demonstrate the new aspects of IETMs.
This paper presents an experimental evaluation of several text-based methods for detecting duplication in document image databases using uncorrected OCR output. This task is challenging because of both the wide range of degradations printed documents can suffer, and conflicting interpretations of what it means to be a 'duplicate.' We report results for five sets of experiments exploring various aspects of the problem space. While the techniques studied are generally robust in the face of most types of OCR errors, there are nonetheless important differences which we identify and discuss in detail.
Today, the paper document is fast becoming a thing of the past. With the rapid development of fast, inexpensive computing and storage devices, many government and private organizations are archiving their documents in electronic form (e.g., personnel records, medical records, patents, etc.). Many of these organizations are converting their paper archives to electronic images, which are then stored in a computer database. Because of this, there is a need to efficiently organize this data into comprehensive and accessible information resources and to provide rapid access to the information contained within these imaged documents. To meet this need, Litton PRC and Litton Data Systems Division are developing a system, the Imaged Document Optical Correlation and Conversion System (IDOCCS), to provide a total solution to the problem of managing and retrieving textual and graphic information from imaged document archives. At the heart of IDOCCS, optical correlation technology provides a means for the search and retrieval of information from imaged documents. IDOCCS can be used to rapidly search for key words or phrases within the imaged document archives and has the potential to determine the types of languages contained within a document. In addition, IDOCCS can automatically compare an input document with the archived database to determine if it is a duplicate, thereby reducing the overall resources required to maintain and access the document database. Embedded graphics on imaged pages can also be exploited; e.g., imaged documents containing an agency's seal or logo can be singled out. In this paper, we present a description of IDOCCS as well as preliminary performance results and theoretical projections.
We introduce a new document image segmentation algorithm, HMTseg, based on wavelets and the hidden Markov tree (HMT) model. The HMT is a tree-structured probabilistic graph that captures the statistical properties of the coefficients of the wavelet transform. Since the HMT is particularly well suited to images containing singularities (edges and ridges), it provides a good classifier for distinguishing between different document textures. Utilizing the inherent tree structure of the wavelet HMT and its fast training and likelihood computation algorithms, we perform multiscale texture classification at a range of different scales. We then fuse these multiscale classifications using a Bayesian probabilistic graph to obtain reliable final segmentations. Since HMTseg works on the wavelet transform of the image, it can directly segment wavelet-compressed images, without the need for decompression into the space domain. We demonstrate HMTseg's performance with both synthetic and real imagery.
The optical character recognition (OCR) system selected by the National Library of Medicine (NLM) as part of its system for automating the production of MEDLINE records frequently segments the scanned page images into zones that are inappropriate for NLM's application. Software has been created in-house to correct the zones using character coordinate and character attribute information provided as part of the OCR output data. The software correctly delineates over 97% of the zones of interest tested to date.
Wavelet transforms have been widely used as effective tools for texture segmentation over the past decade. Segmentation of document images, which usually contain three types of texture information (text, picture, and background), can be regarded as a special case of texture segmentation. B-spline wavelets possess desirable properties such as good localization in time and frequency and compact support, which make them well suited to texture analysis. In this paper, cubic B-spline wavelets are applied to document images; each texture is then characterized by several regional and statistical features estimated from the outputs of the high-frequency bands of the spline wavelet transform. A three-means classification is then applied to group pixels with similar features. We also examine and evaluate the contributions of different factors to the segmentation results, from the viewpoints of decomposition level, frequency band, and feature selection.
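The three-means step is plain k-means with k fixed at 3, one cluster per texture type. The sketch below assumes at least three feature vectors and seeds the centroids with the first, middle, and last vectors, an arbitrary choice for illustration.

```python
def three_means(features, iters=20):
    """Cluster feature vectors into three classes (text/picture/background).

    Plain k-means with k = 3. `features` is a list of equal-length
    numeric tuples; returns (labels, centroids).
    """
    k = 3
    cents = [features[0], features[len(features) // 2], features[-1]]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    labels = [0] * len(features)
    for _ in range(iters):
        # assignment step: nearest centroid for each feature vector
        labels = [min(range(k), key=lambda c: dist2(f, cents[c]))
                  for f in features]
        # update step: recompute each centroid as its members' mean
        for c in range(k):
            members = [f for f, l in zip(features, labels) if l == c]
            if members:
                cents[c] = [sum(vals) / len(members)
                            for vals in zip(*members)]
    return labels, cents
```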
We present a method for extracting text from images where the text plane is not necessarily fronto-parallel to the camera. Initially, we locate local image features such as borders and page edges. We then use perceptual grouping on these features to find rectangular regions in the scene; these regions are hypothesized to be pages or planes that may contain text. Edge distributions are then used to assess these candidate regions, providing a measure of confidence. We show that the text can then be transformed to a fronto-parallel view suitable, for example, for an OCR system or other higher-level recognition. The proposed method is independent of the scale (size) of the text. We illustrate the algorithm on various examples.
An algorithm for extracting character strings from color documents is described. Most characters in color documents are printed in the same color and font size within each word or text line. Blobs of pixels with similar color are extracted by clustering in a color space. Although these blobs correspond to both characters and background patterns, the two can be discriminated using features of the circumscribing rectangles of the blobs: their sizes, aspect ratios, and pitches. Experimental results are also presented and show that our algorithm is applicable to color document OCR.
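The bounding-rectangle discrimination step can be sketched as a simple filter. The thresholds below are illustrative values, not the paper's, and in practice would be tuned to the scan resolution; background patterns tend to be very large, very small, or very elongated, so they fall outside the accepted ranges.

```python
def character_blobs(blobs, min_size=4, max_size=100, max_aspect=8.0):
    """Keep blobs whose circumscribing rectangles look like characters.

    Each blob is a rectangle (x0, y0, x1, y1) with x1 > x0 and y1 > y0.
    Blobs outside the size range or with extreme aspect ratios are
    rejected as background patterns.
    """
    keep = []
    for x0, y0, x1, y1 in blobs:
        w, h = x1 - x0, y1 - y0
        if w <= 0 or h <= 0:
            continue
        aspect = max(w, h) / min(w, h)
        if (min_size <= w <= max_size and min_size <= h <= max_size
                and aspect <= max_aspect):
            keep.append((x0, y0, x1, y1))
    return keep
```

A pitch test (regular spacing of surviving rectangles along a line) would then confirm that the kept blobs form words or text lines.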
This paper describes in detail the LuraDocument technique, a recently developed, high-performance technique for compressing and archiving scanned documents, particularly those containing both text and images. LuraDocument offers higher compression rates and better quality than traditional document compression methods, preserving text legibility even at extremely high compression rates. The various stages of LuraDocument compression are described in detail, including the image quantization and text detection procedures.
In this paper, a novel on-line handwritten Chinese character recognition (OLHCCR) system is proposed. Several methods and efficient techniques that had been successfully implemented in an off-line multi-font Chinese character recognition system are adopted in the proposed OLHCCR system. An improved structural representation of the handwritten Chinese character (HCC) is introduced for pre-classification, where an index search scheme is used to quickly screen out candidates; an iterative scheme is then used for detailed matching. The character to be recognized can be stroke-order and stroke-number free, with tolerance for 2 to 3 unimportant missing or added strokes, and flexible in size, within the constraints of normal handwriting. A prototype OLHCCR system was implemented in Visual C++ 6.0 on Windows 98. The recognition results are based on 3755 frequently used Chinese characters, and the good experimental results show that the proposed approach is very promising.
An important step towards the goal of table understanding is a method for reliable table detection. This paper describes a general solution for detecting tables based on computing an optimal partitioning of a document into some number of tables. A dynamic programming algorithm is given to solve the resulting optimization problem. This high-level framework is independent of any particular table quality measure and independent of the document medium. Moreover, it does not rely on the presence of ruling lines or other table delimiters. We also present table quality measures based on white space correlation and vertical connected component analysis. These measures can be applied equally well to ASCII text and scanned images. We report on some preliminary experiments using this method to detect tables in both ASCII text and scanned images, yielding promising results. We present a detailed evaluation of these results using three different criteria, which themselves pose interesting research questions.
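The optimal-partitioning framework can be sketched as a standard dynamic programme over line indices. Any real-valued block quality measure can be plugged in (e.g. one based on white-space correlation); the interface below is an assumption for illustration, not the paper's exact formulation.

```python
def best_partition(n, quality):
    """Optimal partition of lines 0..n-1 into contiguous blocks.

    quality(i, j) scores lines i..j-1 as one block (a table or a
    non-table region). best[j] is the best total score over all
    partitions of the first j lines; cut[j] records where the last
    block begins, so the partition can be recovered by backtracking.
    """
    best = [0.0] * (n + 1)
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        best[j], cut[j] = max(
            (best[i] + quality(i, j), i) for i in range(j))
    # backtrack to recover the blocks
    blocks, j = [], n
    while j > 0:
        blocks.append((cut[j], j))
        j = cut[j]
    return best[n], blocks[::-1]
```

Because `quality` is a black box, the same driver detects tables in ASCII text and in scanned images; only the measure changes.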
Document page segmentation is a crucial preprocessing step in an optical character recognition (OCR) system. While numerous segmentation algorithms have been proposed, there is relatively little literature on their comparative evaluation, whether empirical or theoretical. We use the following five-step methodology to quantitatively compare the performance of page segmentation algorithms: (1) we first create mutually exclusive training and test datasets with groundtruth; (2) we then select a meaningful and computable performance metric; (3) an optimization procedure is used to automatically search for the optimal parameter values of the segmentation algorithms; (4) the algorithms are then evaluated on the test dataset; and finally (5) a statistical error analysis is performed to establish the statistical significance of the experimental results. We apply this methodology to five segmentation algorithms, three of which are representative research algorithms and two of which are well-known commercial products. The three research algorithms are Nagy's X-Y cut, O'Gorman's Docstrum, and Kise's Voronoi-diagram-based algorithm; the two commercial products are Caere Corporation's and ScanSoft Corporation's segmentation algorithms. The evaluations are conducted on 978 images from the University of Washington III dataset. We find that the performances of the Voronoi-based, Docstrum, and Caere algorithms are not significantly different from one another, but all are significantly better than ScanSoft's algorithm, which in turn is significantly better than the X-Y cut algorithm. Furthermore, we see that the commercial and research segmentation algorithms have comparable performance.