The UC Berkeley Environmental Digital Library Project is one of six university-led projects that were initiated in the fall of 1994 as part of a four-year digital library initiative sponsored by the NSF, NASA, and ARPA. The Berkeley project is particularly interesting from a document image analysis perspective because its testbed collection consists almost entirely of scanned materials. As a result, the Berkeley project is making extensive use of document recognition and other image analysis technology to provide content-based access to the collection. The Document Image Decoding (DID) group at Xerox PARC is a member of the Berkeley team and is investigating the application of DID techniques to providing high-quality (accurate and properly structured) transcriptions of scanned documents in the collection. This paper briefly describes the Berkeley project, discusses some of its recognition requirements and presents examples of online structured documents created using DID technology.
An approach to supervised training of document-specific character templates from sample page images and unaligned transcriptions is presented. The template estimation problem is formulated as one of constrained maximum likelihood parameter estimation within the document image decoding (DID) framework. This leads to a two-phase iterative training algorithm consisting of transcription alignment and aligned template estimation (ATE) steps. The ATE step is the heart of the algorithm and involves assigning template pixel colors to maximize likelihood while satisfying a template disjointedness constraint. The training algorithm is demonstrated on a variety of English documents, including newspaper columns, 15th century books, degraded images of 19th century newspapers, and connected text in a script-like font. Three applications enabled by the training procedure are described -- high accuracy document-specific decoding, transcription error visualization and printer font generation.
In our recent research, we found that visual inter-word relations can be useful at different stages of English text recognition, such as character segmentation and postprocessing. Different methods have been designed for different stages. In this paper, we propose a unified approach to using visual contextual information for text recognition. Each word image has a lattice, a data structure that keeps the results of segmentation, recognition and visual inter-word relation analysis. A lattice allows ambiguity and uncertainty at different levels. A lattice-based unification algorithm is proposed to analyze the information in the lattices of two or more visually related word images and upgrade their contents. Under this approach, different stages of text recognition can be accomplished by the same set of operations -- inter-word relation analysis and lattice-based unification. The segmentation and recognition results for a word image can be propagated to visually related word images and contribute to their recognition. In this paper, the formal definition of the lattice, the unification operators and their uses are discussed in detail.
Printed mathematics has a number of features which distinguish it from conventional text. These include structure in two dimensions (fractions, exponents, limits), frequent font changes, symbols with variable shape (quotient bars), and notational conventions that differ substantially from source to source. When compounded with more generic problems such as noise and merged or broken characters, printed mathematics offers a challenging arena for recognition. Our project was initially driven by the goal of scanning and parsing some 5,000 pages of elaborate mathematics (tables of definite integrals). While our prototype system successfully translates noise-free typeset equations into Lisp expressions appropriate for further processing, a more semantic, top-down approach appears necessary for higher levels of performance. Such an approach may also benefit the incorporation of these programs into a more general document-processing framework. We intend to release our somewhat refined prototypes to the public as utility programs, in the hope that they will be of general use in the construction of custom OCR packages. These utilities are quite fast even as originally prototyped in Lisp, and may be of particular interest to those working on 'intelligent' optical processing. Some routines have been rewritten in C++ as well. Additional programs providing formula recognition and parsing also form part of this system. It is important, however, to realize that distinct, conflicting grammars are needed to cover variations in contemporary and historical typesetting, and thus a single simple solution is not possible.
Cherry Blossom is a machine-printed Japanese document recognition system developed at CEDAR in recent years. This paper focuses on the character recognition part of the system. For Japanese character classification, two feature sets are used in the system: one is the local stroke direction feature; the other is the gradient, structural and concavity feature. Based on each of these features, two different classifiers are designed: one is the so-called minimum-error (ME) subspace classifier; the other is the fast nearest-neighbor (FNN) classifier. Although the original version of the FNN classifier uses Euclidean distance, its new version uses both Euclidean distance and the distance calculation defined in the ME subspace method. This integration improved performance significantly. The number of character classes handled by these classifiers is about 3,300 (including alphanumeric, kana and level-1 JIS Kanji). The classifiers were trained and tested on 200 ppi character images from the CEDAR Japanese character image CD-ROM.
Based on the capabilities of morphological operators in extracting shape features, a new method for character recognition in Persian machine-printed documents is introduced. Given that the image of a printed character is available with a high enough SNR that its regular shape is preserved, some common primitive patterns can always be found after thinning different images of a single character. This property has inspired the development of our morphological processing, in which the hit-or-miss operator is used to determine which patterns exist or do not exist in the input images of the recognition system. All the processing required before feature extraction, including image enhancement, segmentation, and thinning, is also performed using the hit-or-miss operator. With the input words described in terms of some pre-defined patterns, the system knowledge base, holding descriptions for all characters, is searched for possible matches. Finding a match results in the recognition of a character. This approach has proved fast and reliable in practice.
Many text recognition systems recognize text imagery at the character level and assemble words from the recognized characters. An alternative approach is to recognize text imagery at the word level, without analyzing individual characters. This approach avoids the problem of individual character segmentation, and can overcome local errors in character recognition. A word-level recognition system for machine-printed Arabic text has been implemented. Arabic is a script language, and is therefore difficult to segment at the character level. Character segmentation has been avoided by recognizing text imagery of complete words. The Arabic recognition system computes a vector of image-morphological features on a query word image. This vector is matched against a precomputed database of vectors from a lexicon of Arabic words. Vectors from the database with the highest match score are returned as hypotheses for the unknown image. Several feature vectors may be stored for each word in the database. Database feature vectors generated using multiple fonts and noise models allow the system to be tuned to its input stream. Used in conjunction with database pruning techniques, this Arabic recognition system has obtained promising word recognition rates on low-quality multifont text imagery.
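The database-matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature values are placeholders for the image-morphological features computed on word images, and the match score is assumed here to be a negative L1 distance.

```python
# Sketch of word-level matching: a query word's feature vector is
# compared against a precomputed lexicon database, and the closest
# entries are returned as hypotheses. Several vectors may be stored
# per word (e.g. one per font or noise model).

def match_score(query, entry):
    """Negative L1 distance: higher score = better match."""
    return -sum(abs(q - e) for q, e in zip(query, entry))

def recognize_word(query_vec, database, top_n=3):
    """database maps each word to a list of feature vectors.
    Return the best-scoring words as hypotheses."""
    scored = []
    for word, vectors in database.items():
        best = max(match_score(query_vec, v) for v in vectors)
        scored.append((best, word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]

db = {
    "kitab":  [[4, 1, 2], [4, 1, 3]],   # multiple vectors per word
    "qalam":  [[7, 0, 5]],
    "madina": [[2, 6, 1]],
}
print(recognize_word([4, 1, 2], db, top_n=2))
```

Database pruning would narrow the set of words scored before this exhaustive comparison; it is omitted here for brevity.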
A system has been built that selects excerpts from a scanned document for presentation as a summary, without using character recognition. The method relies on the idea that the most significant sentences in a document contain words that are both specific to the document and have a relatively high frequency of occurrence within it. Accordingly, and entirely within the image domain, each page image is deskewed and the text regions are found and extracted as a set of textblocks. Blocks with font size near the median for the document are selected and then placed in reading order. The textlines and words are segmented, and the words are placed into equivalence classes of similar shape. The sentences are identified by finding baselines for each line of text and analyzing the size and location of the connected components relative to the baseline. Scores can then be given to each word, depending on its shape and frequency of occurrence, and to each sentence, depending on the scores for the words in the sentence. Other salient features, such as textblocks that have a large font or are likely to contain an abstract, can also be used to select image parts that are likely to be thematically relevant. The method has been applied to a variety of documents, including articles scanned from magazines and technical journals.
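The frequency-based sentence scoring can be sketched as follows; it is a simplified illustration in which word-shape equivalence classes are represented by opaque class ids (the real system derives them from image similarity, not from strings), and the scoring rule is assumed to be a mean of class frequencies.

```python
# Sketch of shape-based sentence scoring: words are grouped into
# equivalence classes of similar shape, and a sentence scores by the
# frequency of its word classes within the document. No character
# recognition is needed at any point.
from collections import Counter

def score_sentences(sentences):
    """sentences: list of lists of word-shape class ids.
    Returns one score per sentence (mean class frequency)."""
    freq = Counter(w for s in sentences for w in s)
    return [sum(freq[w] for w in s) / len(s) for s in sentences]

doc = [
    ["shape7", "shape2", "shape9"],
    ["shape2", "shape2", "shape4"],   # repeats the frequent class
    ["shape1", "shape5", "shape8"],
]
scores = score_sentences(doc)
print(scores)  # the second sentence scores highest
```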
This paper presents a novel signature feature representation for retrieving degraded binary document images based on graphical content that is rotation, resolution and translation insensitive. We use logos as an example of graphical regions in document images. Logos are arbitrarily complex in geometry and tend to be highly degraded. The first stage of signature extraction normalizes the logo with respect to geometrical variations using principal component analysis. The second stage extracts the wavelet projection signature representation which consists of the low-pass wavelet transform coefficients of the projections of the normalized image. Images are retrieved based on L1 distance in the wavelet projection signature space from the query. We present an exhaustive performance evaluation of retrieval performance on a database of over 2000 real-world degraded logo images. The retrieval performance as quantified by the percentage of queries where the target is in the top 16 logos retrieved from the database (in terms of distance from the query) ranges between 88 and 95%. We also synthetically degrade the logo images to study retrieval performance as a function of rotation, resolution and pixel noise introduced using the Baird document defect model and present the results of these evaluations in the paper.
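The signature construction can be sketched as below. This is a reduced illustration under stated assumptions: the PCA-based geometric normalization is omitted, a single-level Haar low-pass (pairwise averaging) stands in for the paper's wavelet transform, and the tiny binary "logo" is purely synthetic.

```python
# Sketch of the wavelet projection signature: project a binary image
# onto its rows and columns, keep only the low-pass Haar coefficients
# of each projection, and compare signatures by L1 distance.

def haar_lowpass(seq, levels=1):
    """Low-pass Haar coefficients: repeated pairwise averaging."""
    for _ in range(levels):
        seq = [(seq[i] + seq[i + 1]) / 2 for i in range(0, len(seq) - 1, 2)]
    return seq

def signature(image, levels=1):
    rows = [sum(r) for r in image]         # horizontal projection
    cols = [sum(c) for c in zip(*image)]   # vertical projection
    return haar_lowpass(rows, levels) + haar_lowpass(cols, levels)

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

logo = [[0, 1, 1, 0],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [0, 1, 1, 0]]
print(signature(logo))
```

Retrieval then ranks every database signature by `l1` distance from the query signature and returns the nearest entries.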
A formulation of a hierarchical page decomposition technique for technical journal pages using attribute grammars is presented. In this approach, block-grammars are recursively applied until a page is classified into its most significant sub-blocks. While a grammar devised for each block depends on its logical function, it is possible to formulate a generic description for all block grammars using attribute grammars. This attribute grammar formulation forms a generic framework on which this syntactic approach is based, while the attributes themselves are derived from publication-specific knowledge. The attribute extraction process and the formulation itself are covered in this paper. We discuss an application of attribute grammars to a document analysis problem, the extraction of logical, relational information from the image of tables.
An important aspect of document understanding is document logical structure derivation, which involves knowledge-based analysis of document images to derive a symbolic description of their structure and contents. Domain-specific as well as generic knowledge about document layout is used in order to classify, logically group, and determine the read-order of the individual blocks in the image, i.e., translate the physical structure of the document into a layout-independent logical structure. We have developed a computational model for the derivation of the logical structure of documents. Our model uses a rule-based control structure, as well as a hierarchical multi-level knowledge representation scheme in which knowledge about various types of documents is encoded into a document knowledge base and is used by reasoning processes to make inferences about the document. An important issue addressed in our research is the kind of domain knowledge that is required for such analysis. A document logical structure derivation system (DeLoS) has been developed based on the above model, and has achieved good results in deriving the logical structure of complex multi-articled documents such as newspaper pages. Applications of this approach include its use in information retrieval from digital libraries, as well as in comprehensive document understanding systems.
A particularly effective method for analyzing document images, which consist of large numbers of binary pixels, is to generate reduced images whose pixels represent enhancements of textural densities in the full-resolution image. These reduced images are generated using an integrated combination of filtering and subsampling. Previously reported methods used thresholding over a square grid, and cascaded these threshold reduction operations. Here, the approach is generalized to a sequence of arbitrary filtering/subsample operations, with emphasis on several particular filtering operations that respond to salient textural qualities of document images, such as halftones, lines or blocks of text, and horizontal or vertical rules. As with threshold reductions, these generalized 'textured reductions' are performed with no regard for connected components. Consequently, the results are typically robust to noise processes that can vitiate analysis based on connected components. Examples of image analysis and segmentation operations using textured reductions are given. Some properties can be determined very quickly; for example, the existence or absence of halftone regions in a full page image can be established in about 10 milliseconds.
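The basic threshold reduction operation can be sketched as follows; the 2x2 tile size and the tiny test image are illustrative, and cascading means feeding the output back through the same operation with a possibly different threshold.

```python
# Sketch of a 2x threshold reduction: each output pixel is ON if at
# least `t` pixels in its 2x2 source tile are ON. t=1 behaves like
# dilation-then-subsample, t=4 like erosion-then-subsample, and the
# intermediate thresholds respond to textural density.

def threshold_reduce(image, t):
    h, w = len(image), len(image[0])
    out = []
    for y in range(0, h - 1, 2):
        row = []
        for x in range(0, w - 1, 2):
            tile = (image[y][x] + image[y][x + 1] +
                    image[y + 1][x] + image[y + 1][x + 1])
            row.append(1 if tile >= t else 0)
        out.append(row)
    return out

img = [[1, 1, 0, 0],
       [1, 0, 0, 0],
       [1, 1, 0, 1],
       [1, 1, 1, 0]]
print(threshold_reduce(img, 1))  # aggressive: any ON pixel survives
print(threshold_reduce(img, 4))  # conservative: needs a full tile
```

A generalized textured reduction would replace the fixed counting rule with a filter tuned to a specific texture (halftone, text, rules) before subsampling.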
Traditional document analysis systems often adopt a top-down framework, i.e., they are composed of various locally interacting functional components, guided by a central control mechanism. The design of each component is determined by a human expert and is optimized for a given class of inputs. Such a system can fail when confronted by an input that falls outside its anticipated domain. This paper investigates the use of a genetic-based adaptive mechanism in the analysis of complex text formatting. Specifically, we explore a genetic approach to the binarization problem. As opposed to a single, pre-defined, 'optimal' thresholding scheme, the genetic-based process applies various known methods and evaluates their effectiveness on the input image. Individual regions are treated independently, while the genetic algorithm attempts to optimize the overall result for the entire page. Advantages and disadvantages of this approach are discussed.
In this paper, we develop statistical models to characterize text line and text block structures on document images using the text word bounding boxes. We pose the extraction problem as finding the text lines and text blocks that maximize the Bayesian probability of the text lines and text blocks given the observed text word bounding boxes. We derive the so-called probabilistic linear displacement model (PLDM) to model text line structures from text word bounding boxes. We also develop an augmented PLDM to characterize text block structures from text line bounding boxes. By systematically gathering statistics from a large population of document images, we are able to validate our models experimentally and determine the proper model parameters. We design and implement an iterative algorithm that utilizes these probabilistic models to extract the text lines and text blocks. The quantitative performance of the algorithm, in terms of the rates of miss, false, correct, splitting, merging and spurious detections of text lines and text blocks, is reported.
This paper proposes a bottom-up method for recognizing tables within a document. This method is based on the paradigm of graph-rewriting. First, the document image is transformed into a layout graph whose nodes and edges represent document entities and their interrelations respectively. This graph is subsequently rewritten using a set of rules designed based on a priori document knowledge and general formatting conventions. The resulting graph provides a logical view of the document content. It can be parsed to provide general format analysis information.
Both cognitive processes and artificial recognition systems may be characterized by the forms of representation they build and manipulate. This paper looks at how handwriting is represented in current recognition systems and at the psychological evidence for its representation in the cognitive processes responsible for reading. Empirical psychological work on feature extraction in early visual processing is surveyed to show that a sound psychological basis for feature extraction exists and to describe the features this approach leads to. We report the first stage in the development of an architecture for a handwriting recognition system that has been strongly influenced by the psychological evidence for the cognitive processes and representations used in early visual processing. This architecture builds a number of parallel low-level feature maps from raw data. These feature maps are thresholded, and a region labeling algorithm is used to generate sets of features. Fuzzy logic is used to quantify the uncertainty in the presence of individual features.
This paper describes a method that recognizes handwritten addresses for automated FAX mail distribution. The addressee field on FAX cover sheets, indicated by a double line and the keyword 'TO:', is identified and recognized. Recognizing the addressee enables automatic distribution of received faxes to that individual's terminal through a router on a LAN system. In order to extract a handwritten double underline, which is often slanted and fluctuates, a method using combinations of shifts, logical operations and run-length features has been developed. A pair of parallel solid lines is recognized as a candidate for the double-underlined addressee field. In order to determine this field, the keyword 'TO:' must also be recognized. Individual capital characters comprising the addressee information are extracted using a priori knowledge about stroke separation. After character recognition, word matching using a user directory and the recognition results is carried out. The matching process comprises re-segmentation and re-recognition steps. An automated FAX mail distribution system has been developed on a DOS/V PC. In an experiment using over 300 examples, 79% of the addressee fields were correctly extracted and 82% of those were correctly distributed without error. The total automatic distribution ratio for the data is 65%.
The holistic paradigm in HWR has been applied to recognition scenarios involving small, static lexicons, such as the check amount recognition task. In this paper, we explore the possibility of using holistic information for lexicon reduction when the lexicons are large or dynamic, and training, in the traditional sense of learning decision surfaces from training samples of each class, is not viable. Two experimental lexicon reduction methods are described. The first uses perceptual features such as ascenders, descenders and length and achieves consistent reduction performance with cursive, discrete and mixed writing styles. A heuristic feature-synthesis algorithm is used to 'predict' holistic features of lexicon entries, which are matched against image features using a constrained bipartite graph matching scheme. With essentially unconstrained handwritten words, this system achieves reduction of 50% with less than 2% error. More effective reduction can be achieved if the problem can be constrained by making assumptions about the nature of input. The second classifier described operates on pure cursive script and achieves effective reduction of large lexicons of the order of 20,000 entries. Downstrokes are extracted from the contour representation of cursive words by grouping local extrema using a small set of heuristic rules. The relative heights of downstrokes are captured in a string descriptor that is syntactically matched with lexicon entries using a set of production rules. In initial tests, the system achieved high reduction (99%) at the expense of accuracy (75%).
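The first reduction method, predicting holistic features of lexicon entries and matching them against image features, can be sketched as below. This is a simplified stand-in: feature prediction here is a direct letter-class lookup rather than the paper's heuristic feature-synthesis algorithm, a per-feature tolerance test replaces the constrained bipartite graph matching, and the tolerances and tiny lexicon are illustrative.

```python
# Sketch of holistic lexicon reduction: each lexicon entry is reduced
# to predicted (ascenders, descenders, length); entries whose
# prediction is far from the features measured on the word image are
# pruned before any word recognizer is invoked.

ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")

def predict_features(word):
    return (sum(c in ASCENDERS for c in word),
            sum(c in DESCENDERS for c in word),
            len(word))

def reduce_lexicon(lexicon, observed, tol=(1, 1, 2)):
    """observed: (ascenders, descenders, length) measured on the image;
    tol: per-feature tolerance allowing for detection errors."""
    keep = []
    for word in lexicon:
        pred = predict_features(word)
        if all(abs(p - o) <= t for p, o, t in zip(pred, observed, tol)):
            keep.append(word)
    return keep

lex = ["apple", "banana", "grape", "kiwi", "plum"]
print(reduce_lexicon(lex, observed=(1, 2, 5), tol=(0, 1, 1)))
```

Looser tolerances trade reduction for a lower risk of pruning the true word, which mirrors the reduction-versus-error trade-off reported above.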
Increasing the sample size plays an important role in improving recognition accuracy. When it is difficult to collect additional character data written by new writers, distorted characters artificially generated from the original characters by a distortion model can serve as the additional data. This paper proposes a model for selecting those distorted characters that improve recognition accuracy. Binary images are used as a feature vector. In the experiments, recognition based on the k nearest neighbor rule is made for the handwritten zip code database, called IPTP CD-ROM1. Distorted characters are generated using a new model of nonlinear geometrical distortion. New learning samples consisting of the original ones and the distorted ones are generated iteratively. In this model, distortion parameter range is investigated to yield improved recognition accuracy. The results show that the iterative addition of slightly distorted characters improves recognition accuracy.
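The idea of generating distorted training characters can be sketched as follows; the sinusoidal warp used here is an assumed stand-in for the paper's nonlinear geometrical distortion model, and the amplitude plays the role of the distortion parameter whose range is tuned.

```python
# Sketch of generating distorted training samples: each ON pixel of a
# binary character image is displaced by a smooth sinusoidal warp
# whose amplitude is the distortion parameter. Small amplitudes yield
# the "slightly distorted" characters that are added to the training set.
import math

def distort(image, amplitude=1.0, period=8.0):
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if image[y][x]:
                dx = round(amplitude * math.sin(2 * math.pi * y / period))
                dy = round(amplitude * math.sin(2 * math.pi * x / period))
                nx = min(max(x + dx, 0), w - 1)
                ny = min(max(y + dy, 0), h - 1)
                out[ny][nx] = 1
    return out

glyph = [[0, 1, 0],
         [0, 1, 0],
         [0, 1, 0]]
print(distort(glyph, amplitude=1.0, period=4.0))
```

In the iterative scheme, distorted samples generated this way are appended to the learning set and kept only if they improve k-NN recognition accuracy on a validation set.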
The hand-printed address recognition system described in this paper is a part of the Name and Address Block Reader (NABR) system developed by the Center of Excellence for Document Analysis and Recognition (CEDAR). NABR is currently being used by the IRS to read address blocks (hand-print as well as machine-print) on fifteen different tax forms. Although machine- print address reading was relatively straightforward, hand-print address recognition has posed some special challenges due to demands on processing speed (with an expected throughput of 8450 forms/hour) and recognition accuracy. We discuss various subsystems involved in hand- printed address recognition, including word segmentation, word recognition, digit segmentation, and digit recognition. We also describe control strategies used to make effective use of these subsystems to maximize recognition accuracy. We present system performance on 931 address blocks in recognizing various fields, such as city, state, ZIP Code, street number and name, and personal names.
In this paper, we propose an efficient method for integrated segmentation and recognition of connected handwritten characters with a recurrent neural network. In the proposed method, a new type of recurrent neural network is developed for learning the spatial dependencies in connected handwritten characters. This recurrent neural network differs from Jordan's and Elman's recurrent networks in both function and architecture, because it was originally extended from the multilayer feedforward neural network to improve discrimination and generalization power. In order to verify the performance of the proposed method, experiments with the NIST database have been performed and the performance of the proposed method has been compared with that of previous integrated segmentation and recognition methods. The experimental results reveal that the proposed method is superior to the previous integrated segmentation and recognition methods in terms of discrimination and generalization ability.
Efficient image handling in handwritten document recognition is an important research issue in real-time applications. Image manipulation procedures for a fast handwritten word recognizer, including pre-processing, segmentation, and feature extraction, have been implemented using the chain code representation and are presented in this paper. Pre-processing includes noise removal, slant correction and smoothing of contours. The slant angle is estimated by averaging the orientation angles of vertical strokes. Smoothing removes jaggedness on contours. Segmentation points are determined using ligatures and concavity features. The average stroke width of an image is used in an adaptive fashion to locate ligatures. Concavities are located by examining slope changes in contours. Feature extraction efficiently converts a segment into feature vectors. Experimental results demonstrate the efficiency of the algorithms developed. Three thousand word images captured from real mail pieces, with an average size of 217 by 82 pixels, are used in the experiments. The average processing times for each module are 10, 15, and 34 msec on a single Sparc 10 for pre-processing, segmentation, and feature extraction, respectively.
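The slant estimation step can be sketched as follows; it assumes the vertical strokes have already been extracted from the chain-code contour and are given as (dx, dy) vectors, and the tolerance used to separate vertical from horizontal strokes is illustrative.

```python
# Sketch of slant estimation: orientation angles of near-vertical
# strokes are averaged, and the deviation from vertical is taken as
# the slant angle to be corrected.
import math

def estimate_slant(strokes, vertical_tol_deg=45):
    """strokes: list of (dx, dy) stroke vectors; returns average
    deviation from vertical in degrees (0 = perfectly upright)."""
    angles = []
    for dx, dy in strokes:
        angle = math.degrees(math.atan2(dx, dy))  # 0 deg = vertical
        if abs(angle) < vertical_tol_deg:         # keep vertical strokes only
            angles.append(angle)
    return sum(angles) / len(angles) if angles else 0.0

# Strokes leaning ~14 degrees right of vertical, plus one nearly
# horizontal stroke that the tolerance test discards.
strokes = [(1, 4), (1, 4), (2, 8), (10, 1)]
print(round(estimate_slant(strokes), 1))
```

Slant correction then shears the image by the negative of this angle before segmentation.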
This paper discusses a method for binary morphological filter design to restore document images degraded by subtractive or additive noise, given a constraint on the size of the filters. With a filter size restriction (for example, 3 by 3), each pixel in the output image depends only on its 3 by 3 neighborhood in the input image. Therefore, we can construct a look-up table between input and output; each output image pixel is determined by this table, and the filter design becomes a search for the optimal look-up table. By considering the degradation condition of the input image, we provide a methodology for knowledge-based look-up table design that achieves computational tractability. The methodology can be applied iteratively, so that the final output image is the input image transformed through successive 3 by 3 operations. An experimental protocol is developed for restoring degraded document images and improving the corresponding recognition accuracy of an OCR algorithm. We present results for a set of real images which are manually ground-truthed. The performance of each filter is evaluated by the OCR accuracy.
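The look-up-table mechanism can be sketched as below. The table contents are an assumption: a simple majority rule stands in for the optimal table that the paper's design methodology would search for.

```python
# Sketch of a 3x3 look-up-table filter: every 3x3 binary neighborhood
# is encoded as a 9-bit index into a 512-entry table of output pixel
# values, so a designed restoration filter reduces to one table
# access per pixel.

def neighborhood_index(image, y, x):
    """Pack the 3x3 neighborhood of (y, x) into a 9-bit integer."""
    idx = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            idx = (idx << 1) | image[y + dy][x + dx]
    return idx

# Demonstration table: output 1 iff at least 5 of the 9 pixels are 1
# (a majority denoiser; the designed table would be learned instead).
TABLE = [1 if bin(i).count("1") >= 5 else 0 for i in range(512)]

def apply_filter(image, table):
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]       # borders pass through unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = table[neighborhood_index(image, y, x)]
    return out

noisy = [[0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
print(apply_filter(noisy, TABLE))  # the isolated noise pixel is removed
```

Iterative application, as described above, simply feeds the output back through `apply_filter`, possibly with a different table at each pass.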
We are developing a portable text-to-speech system for the vision impaired. The input image is acquired with a lightweight CCD camera that may be poorly focused and aimed, and perhaps taken under inadequate and uneven illumination. We therefore require efficient and effective thresholding and segmentation methods which are robust with respect to character contrast, font, size, and format. In this paper, we present a fast thresholding scheme which combines a local variance measure with a logical stroke-width method. An efficient post-thresholding segmentation scheme, which uses Fisher's linear discriminant to distinguish noise from character components, serves as an effective pre-processing step for the application of commercial segmentation and character recognition methods. The performance of this fast new method compared favorably with other methods for the extraction of characters from omnifont scene images under uncontrolled illumination. We demonstrate the suitability of this method for use in an automated portable reader through a software implementation running on a laptop 486 computer in our prototype device.
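The local-variance component of the thresholding scheme can be sketched as follows; the stroke-width logic is omitted, and the window radius, variance gate and toy grayscale patch are all illustrative assumptions rather than the system's actual parameters.

```python
# Sketch of variance-gated thresholding: a pixel is binarized against
# its local mean only where the local variance is high enough to
# signal text strokes; flat background regions are forced to white.

def local_stats(image, y, x, r=1):
    vals = [image[j][i]
            for j in range(max(0, y - r), min(len(image), y + r + 1))
            for i in range(max(0, x - r), min(len(image[0]), x + r + 1))]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var

def threshold(image, var_min=100.0):
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            mean, var = local_stats(image, y, x)
            # 1 = ink: darker than the local mean in a busy region
            out[y][x] = 1 if var >= var_min and image[y][x] < mean else 0
    return out

gray = [[200, 200, 200, 200],
        [200,  40, 200, 200],
        [200, 200, 200, 200]]
result = threshold(gray)
print(result)  # only the dark stroke pixel becomes ink
```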
This paper presents a methodology for model based restoration of degraded document imagery. The methodology has the advantages of being able to adapt to nonuniform page degradations and of being based on a model of image defects that is estimated directly from a set of calibrating degraded document images. Further, unlike other global filtering schemes, our methodology filters only words that have been misspelled by the OCR with a high probability. In the first stage of the process, we extract a training sample of candidate misspelled word subimages from the set of calibration images before and after the degradation that we wish to undo. These word subimages are registered to extract defect pixels. The second stage of our methodology uses a vector quantization based algorithm to construct a summary model of the defect pixels. The final stage of the algorithm uses the summary model to restore degraded document images. We evaluate the performance of the methodology for a variety of parameter settings on a real world sample of degraded FAX transmitted documents. The methodology eliminates up to 56.4% of the OCR character errors introduced as a result of FAX transmission for our sample experiment.
A new technique for intelligent form removal has been developed, along with a new method for evaluating its impact on optical character recognition (OCR). All the dominant lines in the image are automatically detected using the Hough line transform and intelligently erased, while overlapping character strokes are simultaneously preserved by computing line width statistics and keying off certain visual cues. This new method of form removal operates on loosely defined zones with no image deskewing. Any field in which the writer is provided a horizontal line to enter a response can be processed by this method. Several examples of processed fields are provided, including a comparison of results between the new method and a commercially available form removal package. Even if this new form removal method did not improve character recognition accuracy, it would still be a significant improvement to the technology because the requirement of a priori knowledge of the form's geometric details has been greatly reduced. This relaxes the recognition system's dependence on rigid form design, printing, and reproduction by automatically detecting and removing some of the physical structures (lines) on the form. Using the National Institute of Standards and Technology (NIST) public domain form-based handprint recognition system, the technique was tested on a large number of fields containing randomly ordered handprinted lowercase alphabets, as these letters (especially those with descenders) frequently touch and extend through the line along which they are written. Preserving character strokes improves overall lowercase recognition performance by 3%, which is a net improvement, but a single performance number like this does not communicate how the recognition process was really influenced. Trade-offs are to be expected with the introduction of any new technique into a complex recognition system.
To understand both the improvements and the trade-offs, a new analysis was designed to compare the statistical distributions of individual confusion pairs between two systems. As OCR technology continues to improve, sophisticated analyses like this are necessary to reduce the errors remaining in complex recognition problems.
Determining the readability of documents is an important task. Human readability pertains to the scenario in which a document image is ultimately presented to a human to read. Machine readability pertains to the scenario in which the document is subjected to an OCR process. In either case, poor image quality might render a document unreadable. A document image which is human readable is often not machine readable. It is often advisable to filter out documents of poor image quality before sending them to either machine or human for reading. This paper is about the design of such a filter. We describe various factors which affect document image quality, and the extent to which human and machine readability can be predicted using metrics based on document image quality. We illustrate the interdependence of image quality measurement and enhancement by means of two applications that have been implemented: (1) reading handwritten addresses on mailpieces and (2) reading handwritten U.S. Census forms. We also illustrate the degradation of OCR performance as a function of image quality. On an experimental test set of 517 document images, the image quality metric (measuring fragmentation due to poor binarization) correctly predicted 90% of the time that certain documents were of poor quality (fragmented characters) and hence not machine readable.
Recently there has been an increased interest in document image skew detection algorithms. Most of the papers relevant to this problem include some experimental results. However, there is no universally accepted methodology for evaluating the performance of such algorithms. We have implemented four types of skew detection algorithms in order to investigate possible testing methodologies. We then tested each algorithm on a sample of 460 page images randomly selected from a collection of approximately 100,000 pages. This collection contains a wide variety of typographical features and styles. In our evaluation we examine several issues relevant to the establishment of a uniform testing methodology. First, we begin with a clear definition of the problem and the ground truth collection process. Then we examine the need for pre-processing and parameter optimization specific to each technique. Next, we investigate the problem of establishing meaningful statistical measurements of the performance of these algorithms and the use of non-parametric comparison methods to perform pairwise comparisons of methods. Lastly, we look at the sensitivity of each algorithm to particular typographical features, which indicates the need for the adoption of a stratified sampling paradigm for accurate analysis of performance.
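As a concrete example of the kind of algorithm under evaluation, a projection-profile skew detector can be written in a few lines: shear the black pixels at each candidate angle and pick the angle whose row projection is most sharply peaked. This is a common textbook formulation, not one of the four implementations from the study.

```python
import numpy as np

def estimate_skew(binary, angles_deg):
    """Estimate page skew by shearing black pixels at candidate angles and
    selecting the angle whose row projection has the highest variance."""
    ys, xs = np.nonzero(binary)
    best_angle, best_score = 0.0, -1.0
    for a in angles_deg:
        proj = np.round(ys - xs * np.tan(np.deg2rad(a))).astype(int)
        counts = np.bincount(proj - proj.min())
        score = np.var(counts)   # a concentrated profile has high variance
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle
```

Ground truth for such a detector, as the evaluation above requires, would come from pages of known skew; here a synthetic page suffices to show the mechanics.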
This paper describes the basic design principles for a new series of bar code scanners from Symbol Technologies. Traditional bar code scanners include an edge detector which has several innate limitations. We propose replacing this edge detector with a selective sampling circuit. While the superiority of decoding the analog signal has been demonstrated, its implementation is too costly because of the need for considerable additional memory. Selective sampling achieves most of the advantages of analog decoding at a cost comparable to that of conventional decoders. Instead of being sampled periodically, the signal is sampled only when a certain event (e.g. an edge) is detected. At each edge two data values are produced: the edge time and the sampled value, often referred to as the edge strength. This strength value gives a measure of the intensity of the edge. Using selective sampling the new scanners can read poorly printed and noisy bar codes that cannot be read by traditional scanners. Another innate limitation of bar code laser scanners is the density of bar code that can be read. This limitation is due to the blurring of the bar code when scanned by a laser beam with a finite spot size. We propose the addition of an edge enhancement filter to the scanner, which compensates for the finite width of the optical beam. The proposed filter is designed to enhance the edges of the bar code so that for a given optical focusing it is possible to read higher density bar codes.
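The event-driven sampling idea above can be illustrated on a toy one-dimensional signal: instead of periodic samples, emit an (edge time, edge strength) pair only where the first difference of the signal peaks. This discrete sketch is an illustrative stand-in for the analog circuit, and the threshold is an assumed parameter.

```python
import numpy as np

def selective_sample(signal, threshold=1.0):
    """Emit (edge_time, edge_strength) pairs only where the signal's first
    difference has a local peak, instead of sampling periodically."""
    d = np.diff(np.asarray(signal, dtype=float))
    events = []
    for i in range(len(d)):
        left = abs(d[i - 1]) if i > 0 else 0.0
        right = abs(d[i + 1]) if i < len(d) - 1 else 0.0
        if abs(d[i]) >= threshold and abs(d[i]) >= left and abs(d[i]) > right:
            events.append((i, d[i]))   # strength sign encodes edge direction
    return events
```

A strong rising edge and a weaker falling edge each produce exactly one event, so a decoder downstream sees a sparse stream of timed, graded edges rather than raw periodic samples.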
In this paper we present a vision system that is capable of interpreting schematic logic diagrams, i.e., determining the output as a logic function of the inputs. The system is composed of a number of modules, each designed to perform a specific subtask. Each module makes a minor contribution in the form of a new mixture of known algorithms or extensions to handle real-life image imperfections that researchers tend to ignore when they develop their theoretical foundations. The main contribution, thus, lies not in any individual module but rather in their integration to achieve the target job. The system is organized more or less in a classical fashion. Aside from the image acquisition and preprocessing modules, interesting modules include the segmenter, the identifier, the connector and the grapher. A good segmentation output is one reason for the success of the presented system. Several novelties exist in the presented approach. Following segmentation, the type and topological connectivity of each logic gate are determined. The logic diagram is then transformed into a directed acyclic graph in which the final node is the output logic gate. The logic function is then determined by backtracking techniques. The system is not aimed only at recognition applications. In fact its main usage may be to target other processing applications such as storage compression and graphics modification and manipulation of the diagram, as explained.
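The final step, evaluating the output gate of the directed acyclic graph, can be sketched with a memoized recursive walk. The gate set and the dictionary representation below are illustrative assumptions rather than the paper's data structures.

```python
def eval_node(node, gates, inputs, cache=None):
    """Evaluate a logic-circuit DAG. `gates` maps a node name to
    (gate_type, [source nodes]); `inputs` maps primary inputs to 0/1."""
    if cache is None:
        cache = {}
    if node in inputs:                 # primary input: read its value
        return inputs[node]
    if node in cache:                  # already evaluated on this walk
        return cache[node]
    kind, srcs = gates[node]
    vals = [eval_node(s, gates, inputs, cache) for s in srcs]
    if kind == "AND":
        out = int(all(vals))
    elif kind == "OR":
        out = int(any(vals))
    elif kind == "NOT":
        out = 1 - vals[0]
    else:
        raise ValueError(f"unknown gate type {kind!r}")
    cache[node] = out
    return out
```

For example, a recognized diagram computing AND(OR(a, b), NOT(c)) is fully determined by one such graph, and tabulating `eval_node` over all input combinations yields the diagram's logic function.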
Graphical designs are often used in Japanese newspaper headlines to indicate hot articles. However, conventional OCR software seldom recognizes characters in such headlines because of the difficulty of removing the designs. This paper proposes a method that recognizes these characters without needing removal of the graphical designs. First, the number of text-line regions and the averaged character heights are roughly extracted from the local distribution of the black and white runs observed in a rectangular window while the window is shifted pixel-by-pixel along the direction of the text-line. Next, text-line regions are normalized by scaling their heights to that of the binary reference patterns in a dictionary. Then, displacement matching is applied to the normalized text-line region for character recognition: a square window at each position is matched against binary reference patterns while being shifted pixel-by-pixel along the direction of the text-line. The complementary similarity measure, which is robust against graphical designs, is used as a discriminant function. When the maximum similarity value at each position exceeds the threshold, which is automatically determined from the degree of degradation in the square window, the character category associated with that similarity value is output as the recognition result. Experimental results for fifty Japanese newspaper headlines show that the method achieves recognition rates of over 90%, much higher than a conventional method (17%).
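One published formulation of the complementary similarity measure (due to Sawaki and Hagita) compares a binary input window f with a reference pattern t through the four overlap counts; the sketch below uses that formulation and should be read as illustrative rather than as the exact discriminant function of this paper.

```python
import numpy as np

def complementary_similarity(f, t):
    """Complementary similarity between two binary patterns: large positive
    when f matches t, negative when f resembles t's complement."""
    f = np.asarray(f, dtype=bool).ravel()
    t = np.asarray(t, dtype=bool).ravel()
    a = np.sum(f & t)      # black in both
    b = np.sum(f & ~t)     # black only in the input
    c = np.sum(~f & t)     # black only in the reference
    d = np.sum(~f & ~t)    # white in both
    denom = np.sqrt(float((a + c) * (b + d)))
    return float(a * d - b * c) / denom if denom > 0 else 0.0
```

Because the white-on-white count d enters symmetrically with a, extra black pixels contributed by a graphical background lower the score far less than they would a plain template correlation, which is the robustness property exploited above.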
In this paper, we describe a feature-based supervised zone classifier using only the knowledge of the widths and the heights of the connected components within a given zone. The distribution of the widths and the heights of the connected components is encoded into an n × m dimensional vector for the decision making. Thus, the computational complexity is on the order of the number of connected components within the given zone. A binary decision tree is used to assign a zone class on the basis of its feature vector. The training and testing data sets for the algorithm are drawn from the scientific document pages in the UW-I database. The classifier is able to classify each given scientific and technical document zone into one of eight classes in real time: text of font size 8-12, text of font size 13-18, text of font size 19-36, display math, table, halftone, line drawing, and ruling. The classifier is able to discriminate text from non-text with an accuracy greater than 97%.
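The feature vector described above, a two-dimensional histogram of connected-component widths and heights flattened to length n·m, can be sketched as follows; the bin edges, size cap, and normalization are assumptions for illustration.

```python
import numpy as np

def zone_feature(widths, heights, n=4, m=4, max_size=64):
    """Encode the joint width/height distribution of a zone's connected
    components as a flattened, normalized n*m histogram."""
    hist, _, _ = np.histogram2d(
        np.clip(widths, 0, max_size - 1),       # cap oversized components
        np.clip(heights, 0, max_size - 1),
        bins=(n, m), range=((0, max_size), (0, max_size)))
    total = hist.sum()
    return (hist / total).ravel() if total > 0 else hist.ravel()
```

A text zone puts nearly all its mass in a few character-sized bins, while a ruling puts its single component in a wide-and-short bin, so even a shallow decision tree can separate the classes from this vector.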
The Choquet fuzzy integral provides a useful mechanism for evidence aggregation. It is a flexible method which can represent weighted averages, medians, order statistics, and many other information aggregation mechanisms. In this paper, two applications to handwritten word recognition are described: as a match function in a dynamic-programming-based classifier and as a method for fusing the results from multiple word recognition algorithms. In the first case, the results are compared with traditional methods. In the second case, the results are compared with neural network and Borda count approaches.
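The discrete Choquet integral aggregates source scores h(x) with respect to a fuzzy measure g by sorting the sources in descending order of score and weighting each drop in score by the measure of the set of sources seen so far. The dictionary-based sketch below is the generic formulation, with the example measure chosen purely for illustration.

```python
def choquet_integral(h, g):
    """Discrete Choquet integral of source scores `h` (dict: source -> value)
    with respect to a fuzzy measure `g` (dict: frozenset of sources -> weight)."""
    order = sorted(h, key=h.get, reverse=True)   # h(x_(1)) >= ... >= h(x_(n))
    total = 0.0
    for i, x in enumerate(order):
        nxt = h[order[i + 1]] if i + 1 < len(order) else 0.0
        # weight the score drop by g(A_i), where A_i = {x_(1), ..., x_(i)}
        total += (h[x] - nxt) * g[frozenset(order[: i + 1])]
    return total
```

When g is additive the integral reduces to a weighted average, which is how it subsumes that special case; a non-additive g can additionally express redundancy or synergy between recognizers being fused.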