A signature verification method is described that combines recognition methods for one-dimensional signals, e.g., speech and on-line handwriting, with methods for two-dimensional images, e.g., holistic word recognition in OCR and off-line handwriting. In the one-dimensional approach, a sequence of data is obtained by tracing the exterior contour of the signature, which allows the application of string-matching algorithms. The upper and lower contours of the signature are first determined by ignoring small gaps between signature components. The contours are combined into a single sequence so as to define a pseudo-writing path. To match two signatures, a non-linear normalization method, viz., dynamic time warping, is applied to segment them into curves. Shape descriptors based on Zernike moments are extracted as features from each segment. A harmonic distance is used for measuring signature similarity. The two-dimensional approach is based on features describing the word shape. When the two methods are combined, the overall performance is significantly better than that of either method alone. With a database of 1320 genuine signatures and 1320 forgeries, the combined method has an accuracy of 90%.
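The dynamic time warping step described above can be sketched in a few lines. The following is a minimal DTW distance over one-dimensional sequences, assuming an absolute-difference local cost; the sample sequences are synthetic stand-ins for real contour data, not the paper's Zernike-moment features.

```python
# Minimal dynamic time warping (DTW): the non-linear normalization used
# to align two signature contour sequences of different lengths.

def dtw_distance(a, b):
    """Minimum cumulative alignment cost between sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])        # local dissimilarity
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

# Two sequences with the same shape but a time shift align at zero cost,
# which is the point of warping rather than comparing index-by-index.
s1 = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
s2 = [0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
print(dtw_distance(s1, s2))
```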
The design and performance of a system for spotting handwritten Arabic words in scanned document images is presented. The three main components of the system are a word segmenter, a shape-based matcher for words, and a search interface. The user types an English query into a search window; the system finds the equivalent Arabic word, e.g., by dictionary look-up, and locates word images in an indexed (segmented) set of documents. A two-step approach is employed in performing the search: (1) prototype selection: the query is used to obtain a set of handwritten samples of that word from a known set of writers (these are the prototypes), and (2) word matching: the prototypes are used to spot each occurrence of those words in the indexed document database. A ranking is performed on the entire set of test word images, where the ranking criterion is a similarity score between each prototype word and the candidate words based on global word-shape features. A database of 20,000 word images contained in 100 scanned handwritten Arabic documents written by 10 different writers was used to study retrieval performance. Using five writers to provide prototypes and the other five for testing, on manually segmented documents, 55% precision is obtained at 50% recall. Performance increases as more writers are used for training.
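The word-matching step, ranking candidates by their best similarity to any prototype, can be illustrated as follows. Cosine similarity over feature vectors is an assumed stand-in for the paper's word-shape similarity score, and the vectors themselves are hypothetical placeholders.

```python
# Sketch of prototype-based word spotting: each candidate word image is
# scored by its best similarity to any prototype of the query word, then
# candidates are ranked by that score.
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(prototypes, candidates):
    """Return candidate ids sorted by best-prototype similarity, best first."""
    scored = [(max(cosine(p, feat) for p in prototypes), cid)
              for cid, feat in candidates.items()]
    return [cid for score, cid in sorted(scored, reverse=True)]

prototypes = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]          # query word samples
candidates = {"w1": [1.0, 0.0, 1.0], "w2": [0.0, 1.0, 0.0]}
print(rank_candidates(prototypes, candidates))
```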
Offline handwritten Chinese character recognition is one of the difficult problems in pattern recognition because of large stroke distortion, writing anomalies, and the absence of stroke-order information. The basic characteristic of a Chinese character is that it is composed of four kinds of strokes: horizontal, vertical, 45-degree, and 135-degree. A Chinese character can be uniquely identified by the number of strokes in each of the four directions and their relative positions, and these features can be obtained from the character's contour. In this paper, we first modify an existing contour extraction algorithm to obtain a strict single-pixel contour of the Chinese character, and then give a contour-based elastic mesh fuzzy feature extraction method. Comparative experimental results show that the performance of our approach is encouraging and comparable to other algorithms.
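The four-direction stroke decomposition can be sketched by quantizing successive contour segments into the four stroke orientations. The contour points below are synthetic, and this simple nearest-bin quantization is only an illustration of the idea, not the paper's elastic mesh fuzzy features.

```python
# Sketch: quantize each contour segment's angle (mod 180 degrees) into
# the four stroke directions of Chinese characters: horizontal (0),
# 45-degree, vertical (90), and 135-degree.
import math

def direction_histogram(points):
    """Count contour segments falling into each of the four directions."""
    hist = {0: 0, 45: 0, 90: 0, 135: 0}
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        ang = math.degrees(math.atan2(y1 - y0, x1 - x0)) % 180.0
        # Nearest bin on the circular 0..180 degree scale.
        best = min(hist, key=lambda b: min(abs(ang - b), 180 - abs(ang - b)))
        hist[best] += 1
    return hist

# One horizontal, one vertical, and one 45-degree segment.
print(direction_histogram([(0, 0), (1, 0), (1, 1), (2, 2)]))
```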
Generally speaking, optical character recognition algorithms tend to perform better when presented with homogeneous data. This paper studies a method designed to increase the homogeneity of training data, based on an understanding of the types of degradations that occur during the printing and scanning process and how these degradations affect the homogeneity of the data. While it has been shown that dividing the degradation space by edge spread improves recognition accuracy over dividing it by threshold or point spread function width alone, the challenge is in deciding how many partitions to make and at what values of edge spread to make them. Clustering across different character features, fonts, sizes, resolutions, and noise levels shows that edge spread is indeed a strong indicator of the homogeneity of character data clusters.
Symbolic indirect correlation (SIC) is a new approach for bringing lexical context into the recognition of unsegmented signals that represent words or phrases in printed or spoken form. One way of viewing the SIC problem is to find the correspondence, if one exists, between two bipartite graphs, one representing the matching of the two lexical strings and the other representing the matching of the two signal strings. While perfect matching cannot be expected with real-world signals, and while some degree of mismatch is allowed for in the second stage of SIC, such errors, if they are too numerous, can present a serious impediment to a successful implementation of the concept. In this paper, we describe a framework for evaluating the effectiveness of SIC match graph generation and examine the relatively simple, controlled case of synthetic images of text strings typeset both normally and in a highly condensed fashion. We quantify and categorize the errors that arise, and present a variety of techniques we have developed to visualize the intermediate results of the SIC process.
Exploiting style consistency in groups of patterns (pattern fields) generated by the same source has been demonstrated to yield higher accuracies in OCR applications. The accuracy gains obtained by a style-consistent classifier depend on the amount of style in a dataset in addition to the classifier itself. The computational complexity of style-based classifiers precludes their applicability in situations where datasets have small amounts of style. In this paper, we propose a correlation-based measure to quantify the amount of style in a dataset and demonstrate its use in determining the suitability of a style-consistent classifier on both simulation and real data.
Optical Character Recognition (OCR) is a classical research field and has become one of the most successful applications in the area of pattern recognition. Feature extraction is a key step in the OCR process. This paper presents three feature extraction algorithms based on binary images: the Lattice with Distance Transform (DTL), Stroke Density (SD), and Co-occurrence Matrix (CM). The DTL algorithm improves the robustness of the lattice feature by using a distance transform to increase the separation between foreground and background, thus reducing the influence of stroke boundaries. The SD and CM algorithms extract robust stroke features based on the fact that humans recognize characters by their strokes, including length and orientation. SD reflects the quantized stroke information, including length and orientation; CM reflects the length and orientation of a contour; together, SD and CM sufficiently describe strokes. Since these three groups of feature vectors complement each other in expressing characters, we integrate them and adopt a hierarchical algorithm to achieve optimal performance. Our methods were tested on the USPS (United States Postal Service) database and the Vehicle License Plate Number Pictures Database (VLNPD). Experimental results show that the methods achieve a high recognition rate at reasonable average running time. Under similar conditions, we also compared our results to the box method proposed by Hannmandlu; our methods demonstrated better efficiency.
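The distance-transform-plus-lattice idea can be sketched on a tiny binary grid. This is a brute-force city-block distance transform and a coarse lattice average, assumed stand-ins for the paper's DTL feature; grid and lattice sizes are illustrative.

```python
# Sketch of a DTL-style feature: (1) compute each pixel's distance to
# the nearest foreground pixel, (2) average the transform within each
# cell of a coarse lattice laid over the character image.

def distance_transform(img):
    """Per-pixel city-block distance to the nearest foreground (1) pixel."""
    fg = [(r, c) for r, row in enumerate(img) for c, v in enumerate(row) if v]
    return [[min(abs(r - fr) + abs(c - fc) for fr, fc in fg)
             for c in range(len(img[0]))] for r in range(len(img))]

def lattice_features(dt, cells=2):
    """Mean transform value in each cell of a cells x cells lattice."""
    h, w = len(dt), len(dt[0])
    feats = []
    for i in range(cells):
        for j in range(cells):
            block = [dt[r][c]
                     for r in range(i * h // cells, (i + 1) * h // cells)
                     for c in range(j * w // cells, (j + 1) * w // cells)]
            feats.append(sum(block) / len(block))
    return feats

dt = distance_transform([[1, 0], [0, 0]])
print(dt)                     # distances grow away from the stroke pixel
print(lattice_features(dt))   # one averaged value per lattice cell
```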
Most researchers would agree that research in the field of document processing can benefit tremendously from a common software library through which institutions are able to develop and share research-related software and applications across academic, business, and government domains. However, despite several attempts in the past, the research community still lacks a widely-accepted standard software library for document processing. This paper describes a new library called DOCLIB, which tries to overcome the drawbacks of earlier approaches. Many of DOCLIB's features are unique either in themselves or in their combination with others, e.g. the factory concept for support of different image types, the juxtaposition of image data and metadata, or the add-on mechanism. We cherish the hope that DOCLIB serves the needs of researchers better than previous approaches and will readily be accepted by a larger group of scientists.
When mixed mail enters a postal facility, it must first be faced and oriented so that the address is readable by automated mail processing machinery. Existing US Postal Service (USPS) automated systems face and orient domestic mail by searching for fluorescing stamps on each mail piece. However, misplaced or partially fluorescing postage causes a significant fraction of mail to be rejected. Previously, rejected mail had to be faced and oriented by hand, increasing mail processing cost and time. Our earlier work successfully demonstrated the utility of machine-vision-based extraction of postal delimiters, such as cancellation marks and barcodes, for camera-based mail facing and orientation. Arguably, of all the localized information sources on the envelope image, the destination address block is the richest in content and the most structured in its form and layout. This paper focuses exclusively on the destination address block image and describes new vision-based features that can be extracted and used for mail orientation. Our results on real USPS datasets indicate robust performance. The algorithms described herein will be deployed nationwide on USPS hardware in the near future.
Detecting documents bearing a certain stamp instance is an effective and reliable way to retrieve documents associated with a specific source. However, this unique problem has essentially remained unaddressed. In this paper, we present a novel stamp detection framework based on parameter estimation of connected edge features. Using robust basic-shape detectors, the approach is effective for stamps with analytically shaped contours, even when only limited samples are available. For elliptic/circular stamps, it efficiently exploits the orientation information from pairs of edge points to determine the stamp's center position and area, without computing all five parameters of an ellipse. Our approach takes into account the unique characteristics of stamp patterns; in particular, we introduce effective algorithms to address the problem that stamps often spatially overlap their background content. These give our approach significant advantages in detection accuracy and computational complexity over the traditional Hough transform method for locating candidate ellipse regions. Experimental results on real degraded documents demonstrate the robustness of this retrieval approach on a large document database consisting of both printed text and handwritten notes.
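The geometric fact exploited for circular stamps can be sketched directly: on a circle, the edge gradient (normal) at every boundary point passes through the center, so the normals of any two edge points intersect there. The 2x2 line-intersection solve below and the synthetic points are illustrative, not the paper's full estimator.

```python
# Sketch: locate a circular stamp's center from a pair of edge points
# and their normal directions by intersecting the two normal lines
# p1 + t*d1 and p2 + s*d2 (Cramer's rule on the 2x2 system).

def intersect(p1, d1, p2, d2):
    """Intersection point of two parametric lines, or None if parallel."""
    det = d1[0] * (-d2[1]) - (-d2[0]) * d1[1]
    if abs(det) < 1e-12:
        return None                       # parallel normals carry no vote
    bx, by = p2[0] - p1[0], p2[1] - p1[1]
    t = (bx * (-d2[1]) - (-d2[0]) * by) / det
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

# Edge points of a circle centered at (3, 4): normals point at the center.
print(intersect((3, 6), (0, -1), (5, 4), (-1, 0)))
```

In a full detector, many such pairwise votes would be accumulated and the densest cluster of intersections taken as the candidate center, avoiding a five-parameter Hough accumulator.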
This paper describes new capabilities of ImageRefiner, an automatic image enhancement system based on machine learning (ML). ImageRefiner was initially designed as a pre-OCR cleanup filter for bitonal (black-and-white) document images. Using a single neural network, ImageRefiner learned which image enhancement transformations (filters) were best suited for a given document image and a given OCR engine, based on various image measurements (characteristics). The new release improves ImageRefiner in three major ways. First, to process grayscale document images, we have included three grayscale filters based on smart thresholding and noise filtering, as well as five image characteristics that are all byproducts of various thresholding techniques. Second, we have implemented additional ML algorithms, including a neural network ensemble and several "all-pairs" classifiers. Third, we have introduced a measure that evaluates overall performance of the system in terms of cumulative improvement of OCR accuracy. Our experiments indicate that OCR accuracy on enhanced grayscale images is higher than that of both the original grayscale images and the corresponding bitonal images obtained by scanning the same documents. We have noticed that the system's performance may suffer when document characteristics are correlated.
The JBIG2 (joint bi-level image group) standard for bi-level image coding is drafted to allow encoder designs by individuals. In JBIG2, text images are compressed by pattern matching techniques. In this paper, we propose a lossy text image compression method based on OCR (optical character recognition) which compresses bi-level images into the JBIG2 format. By processing text images with OCR, we obtain character recognition results and the confidence of these results. A representative symbol image can be generated for similar character image blocks from the OCR results, the block sizes, and the mismatches between blocks. This symbol image can replace all the similar image blocks, so a high compression ratio can be achieved. Experimental results show that our algorithm achieves improvements of 75.86% over lossless SPM and 14.05% over lossy PM and S on Latin character images, and 37.9% over lossless SPM and 4.97% over lossy PM and S on Chinese character images. Our algorithm produces far fewer substitution errors than previous lossy PM and S methods and thus preserves acceptable decoded image quality.
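The symbol-dictionary idea behind this kind of compression can be sketched as a greedy grouping of similar bitmaps. Counting mismatching pixels as the similarity criterion and the fixed threshold are simplifying assumptions; the paper additionally uses OCR results, confidences, and block sizes.

```python
# Sketch: build a symbol dictionary by assigning each character bitmap
# to an existing representative when few pixels differ, otherwise making
# it a new symbol. The decoder then stores one bitmap per symbol plus an
# index per block, which is where the compression comes from.

def mismatch(a, b):
    """Number of differing pixels between two equal-size bitmaps."""
    return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def build_dictionary(blocks, threshold=2):
    """Greedy grouping: returns (representatives, per-block symbol index)."""
    reps, assign = [], []
    for blk in blocks:
        for i, rep in enumerate(reps):
            if mismatch(blk, rep) <= threshold:
                assign.append(i)
                break
        else:
            assign.append(len(reps))
            reps.append(blk)
    return reps, assign

a = [[1, 1], [0, 0]]
b = [[1, 1], [0, 1]]   # one pixel off from a -> same symbol
c = [[0, 0], [1, 1]]   # four pixels off   -> new symbol
reps, assign = build_dictionary([a, b, c])
print(len(reps), assign)
```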
This paper introduces a novel Active Document Versioning system that can extract the layout template and constraints from an original document and then automatically adjust the layout to accommodate new content. "Active" reflects several unique features of the system. First, the need for handcrafting adjustable templates is largely eliminated through layout understanding techniques that convert static documents into Active Layout Templates and accompanying constraints. Second, through linear text block modeling and a two-pass constraint solving algorithm, it supports a rich set of layout operations, such as simultaneous optimization of text block width and height, integrated image cropping, and non-rectangular text wrapping. The system has been successfully applied to a wide range of professionally designed documents. This paper covers both the core algorithms and the implementation.
When designers develop a document layout, their objective is to convey a specific message and provoke a specific response from the audience. Design principles provide the foundation for identifying document components and the relations among them, and thus for extracting implicit knowledge from the layout. Variable Data Printing enables the production of personalized print jobs for which traditional proofing of all job instances can be unfeasible. This paper describes a rule-based system that uses design principles to segment and understand document content. The system uses the design principles of repetition, proximity, alignment, similarity, and contrast as the foundation of its strategy for document segmentation and understanding, which is closely tied to recognizing artifacts produced by violations of the constraints articulated in the document layout. There are two main modules in the tool: the geometric analysis module and the design rule engine. The geometric analysis module extracts explicit knowledge from the data provided in the document. The design rule module uses the information provided by the geometric analysis to establish logical units inside the document. We used a subset of XSL-FO sufficient for designing documents with an adequate amount of complexity. The system identifies components such as headers, paragraphs, lists, and images and determines the relations between them, such as header-paragraph, header-list, etc. The system provides accurate information about the geometric properties of the components, detects the elements of the documents, and identifies corresponding components between a proofed instance and the rest of the instances in a Variable Data Printing job.
Document authentication decides whether a given document is from a specific individual or not. In this paper, we propose a new document authentication method in the physical domain (after the document is printed out) based on embedding deformed characters. When an author writes a document for a specific individual or organization, a unique error-correcting code serving as his Personal Identification Number (PIN) is generated, and some characters in the text lines are deformed according to this PIN. By doing so, the writer's personal information is embedded in the document. When the document is received, it is first scanned and recognized by an OCR module, and then the deformed characters are detected to recover the PIN, which can be used to decide the originality of the document. Document authentication can thus be viewed as a kind of communication problem in which the identity of a document's writer is "transmitted" over a channel. The channel consists of the writer's PIN, the document, and the encoding rule. Experimental results on deformed-character detection are very promising, and the availability and practicability of the proposed method are verified by a practical system.
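The communication-channel view above can be made concrete with a toy code. Below, each PIN bit is protected by a 3-fold repetition code (a deliberately trivial error-correcting code, assumed here for illustration), with a 1 bit meaning "deform the character at this position"; majority voting at the decoder tolerates one detection error per bit.

```python
# Sketch of PIN embedding over a noisy detection channel: a repetition
# code stands in for the paper's (unspecified) error-correcting code.

def embed(pin_bits, repeat=3):
    """Expand each PIN bit into `repeat` deform/no-deform marks."""
    return [b for b in pin_bits for _ in range(repeat)]

def decode(channel_bits, repeat=3):
    """Majority-vote each group of `repeat` detected marks back to a bit."""
    return [int(sum(channel_bits[i:i + repeat]) * 2 > repeat)
            for i in range(0, len(channel_bits), repeat)]

sent = embed([1, 0])            # -> [1, 1, 1, 0, 0, 0]
noisy = [1, 0, 1, 0, 0, 0]      # one deformed character missed by OCR
print(decode(noisy))            # PIN recovered despite the detection error
```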
A CAPTCHA is a Completely Automated Public Test to tell Computers and Humans Apart. Typical CAPTCHAs present a challenge string consisting of a visually distorted sequence of letters and perhaps numbers, which in theory only a human can read. Attackers of CAPTCHAs have two primary points of leverage: Optical Character Recognition (OCR) can identify some characters, while nonuniform probabilities make other characters relatively easy to guess. This paper uses a mathematical theory of assurance to characterize the probability that a correct answer to a CAPTCHA is not just a lucky guess. We examine the three most common types of challenge strings, dictionary words, Markov text, and random strings, and find substantial weaknesses in each. We therefore propose improvements to Markov text, and new challenges based on the consonant-vowel-consonant (CVC) trigrams of psychology. Theory and experiment together quantify problems in current challenges and the improvements offered by modifications.
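The "lucky guess" concern can be quantified with a simple calculation. The figures below are illustrative assumptions (a 50,000-word dictionary, 6 uniform letters, two CVC trigrams from roughly 1,000 common ones), not the paper's measured values; assurance is taken here as the negative log of the attacker's best single-guess success probability.

```python
# Sketch: compare single-guess success probabilities for the three
# challenge-string types, expressed as bits of assurance.
import math

def assurance_bits(p_guess):
    """Bits of assurance against a single lucky guess."""
    return -math.log2(p_guess)

p_dictionary = 1 / 50000          # attacker guesses one dictionary word
p_random     = (1 / 26) ** 6      # 6 uniformly random letters
p_cvc        = (1 / 1000) ** 2    # two CVC trigrams, ~1000 common ones

for name, p in [("dictionary", p_dictionary),
                ("random", p_random),
                ("cvc", p_cvc)]:
    print(name, round(assurance_bits(p), 1), "bits")
```

Under these toy numbers, dictionary words offer the least assurance, which matches the abstract's point that nonuniform answer distributions make guessing attacks easier.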
This paper presents a novel automatic web publishing solution, PageView(R). PageView(R) is a complete working solution for document processing and management. The principal aim of this tool is to allow workgroups to share, access, and publish documents on-line on a regular basis. For example, suppose a person is working on some documents: the user will, in some fashion, organize his work either in his own local directory or on a shared network drive. Now extend that concept to a workgroup. Within a workgroup, some users are working together on some documents, saving them in a directory structure somewhere on a document repository. The next stage of this reasoning is that the workgroup wants to publish those documents routinely on-line. It may happen that they are using different editing tools, different software, and different graphics tools; the resulting documents may be in PDF, Microsoft Office(R), HTML, or WordPerfect format, to name a few. In general, this process requires the documents to be converted to HTML, after which a web designer needs to work on the collection to make it available on-line. PageView(R) takes care of this whole process automatically, making the document workflow clean and easy to follow. PageView(R) Server publishes documents, complete with the directory structure, for online use. The documents are automatically converted to HTML and PDF so that users can view the content without downloading the original files or browser plug-ins. Once published, other users can access the documents as if they were accessing them from their local folders. The paper describes the complete working system and discusses possible applications within document management research.
We report on an attempt to build an automatic redaction system by applying information extraction techniques to the identification of private dates of birth. We conclude that automatic redaction is a promising concept although information extraction is significantly affected by the presence of OCR error.
This paper introduces a document clustering method within a commercial document repository, FileShare(R). FileShare(R) is a commercial collaborative digital library offering facilities for sharing and accessing documents over a simple Internet browser (e.g. Microsoft(R) Internet Explorer(R), Netscape(R) or Opera(R)) within groups of people working on common projects. As the number of documents increases within a digital library, displaying these documents in this environment poses a huge challenge. This paper proposes a document clustering method that uses a modified version of the traditional K-Means algorithm to categorize documents by their themes using lexical chaining within the FileShare(R) repository. The proposed algorithm is unsupervised, and has shown very high accuracy in a typical experimental setup.
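The clustering step can be illustrated with a plain K-Means over term vectors. This is a minimal sketch of standard K-Means, not the paper's modified, lexical-chain-based variant; the vectors and parameters are synthetic.

```python
# Minimal K-Means: assign each document vector to its nearest center,
# then recompute centers as cluster means, and repeat.
import random

def kmeans(vectors, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(v, centers[i])))
            clusters[i].append(v)
        # Recompute each center; keep the old one if its cluster emptied.
        centers = [[sum(col) / len(c) for col in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, clusters = kmeans(docs, 2)
print([len(c) for c in clusters])   # two well-separated groups of two
```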
A method for extracting names in degraded documents is presented in this article. The documents targeted are images of photocopied scientific journals from various scientific domains. Due to the degradation, there is poor OCR recognition, and pieces of other articles appear on the sides of the image. The proposed approach relies on the combination of a low-level textual analysis and an image-based analysis. The textual analysis extracts robust typographic features, while the image analysis selects image regions of interest through anchor components. We report results on the University of Washington benchmark database.
Analysis of large collections of complex documents is an increasingly important need for numerous applications. Complex documents are documents that typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting); and include diagrams, graphics, tables and other non-textual elements. The state of the art today for a large document collection is essentially text search of OCR'd documents with no meaningful use of data found in images, signatures, logos, etc. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are also developing a roughly 42,000,000 page complex document test collection. The collection will include relevance judgments for queries at a variety of levels of detail and depending on a variety of content and structural characteristics of documents, as well as "known item" queries looking for particular documents.
This paper investigates and compares the application of Support Vector Machines (SVM), Principal Component Analysis (PCA), Individual Principal Component Analysis (iPCA), Linear Discriminant Analysis (LDA), and the Single-Nearest-Neighbor Method (1-NNM) to distorted-character recognition. SVM achieves a classification error rate of 2.15% on the Letter-Image Dataset [Frey and Slate 1991]. This error rate is statistically comparable to the best number in the literature on this dataset that the authors are aware of, which is 2%, achieved by a fully connected MLP neural network with adaboosting trained on 20 machines [Schwenk and Bengio 1997]. SVM, by contrast, takes less than 3.5 minutes to train on a single machine. The features of the dataset and the errors committed by SVM were analyzed in an attempt to combine classifiers and reduce the error rate. We report the results achieved for the different techniques used.
Most pattern classifiers are trained on data from multiple sources, so that they can accurately classify data from any source. However, in many applications, it is necessary to classify groups of test patterns, with patterns in each group generated by the same source. The co-occurring patterns in a group are statistically dependent due to the commonality of source. The dependence between these patterns introduces style context within a group that can be exploited to improve the classification accuracy. In this paper, we present a style consistent nearest neighbor classifier that exploits style context in groups of adjacent patterns to improve the classification accuracy. We demonstrate the efficacy of the proposed classifier on a dataset of machine-printed digits where the proposed classifier reduces the error rate by 64.5%.
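The style-context idea can be sketched as a joint nearest-neighbor decision: since all patterns in a group share one style, pick the single style whose prototypes best explain the whole group, then label each pattern within that style. The prototype dictionary and vectors below are toy assumptions, not the paper's classifier.

```python
# Sketch of a style-consistent nearest neighbor decision over a group.

def classify_group(group, prototypes):
    """prototypes: {style: [(vector, label), ...]}.
    Chooses one style for the whole group (smallest total squared
    nearest-neighbor distance), then labels each pattern in that style."""
    def nearest(v, protos):
        return min(protos,
                   key=lambda p: sum((a - b) ** 2 for a, b in zip(v, p[0])))
    best_style = min(prototypes, key=lambda s: sum(
        sum((a - b) ** 2 for a, b in zip(v, nearest(v, prototypes[s])[0]))
        for v in group))
    return [nearest(v, prototypes[best_style])[1] for v in group]

# Two "fonts" writing the classes "0" and "1" at different positions.
protos = {"bold":  [([0.0], "0"), ([10.0], "1")],
          "light": [([4.0], "0"), ([6.0], "1")]}
# Jointly, this group is clearly written in the "bold" style.
print(classify_group([[0.5], [9.5]], protos))
```

The benefit over pattern-by-pattern classification is that an ambiguous pattern is disambiguated by its group mates, since they pin down the shared style.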
Conventional approaches to combining classifiers improve accuracy at the cost of increased processing. We propose a novel search-based approach to automatically combine multiple classifiers in a cascade to obtain the desired tradeoff between classification speed and classification accuracy. The search procedure only updates the rejection thresholds (one for each constituent classifier) in the cascade; consequently, no new classifiers are added and no training is necessary. A branch-and-bound version of depth-first search with efficient pruning is proposed for finding the optimal thresholds for the cascade. It produces optimal solutions under arbitrary user-specified speed and accuracy constraints. The effectiveness of the approach is demonstrated on handwritten character recognition by finding (a) the fastest possible combination given an upper bound on classification error, and (b) the most accurate combination given a lower bound on speed.
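The threshold search can be illustrated on a two-stage cascade: samples the fast classifier scores below a rejection threshold `t` are passed on to the slow, accurate classifier, and we look for the cheapest `t` whose error stays within a bound. The grid search, costs, and synthetic samples below are simplifying assumptions; the paper uses branch-and-bound over all stage thresholds.

```python
# Sketch: tune one rejection threshold in a fast->slow cascade under an
# error constraint, minimizing average per-sample cost.

def evaluate(t, samples, fast_cost=1.0, slow_cost=10.0):
    """samples: list of (fast_score, fast_correct, slow_correct).
    Returns (mean cost, error rate) for rejection threshold t."""
    cost = err = 0.0
    for score, fast_ok, slow_ok in samples:
        if score >= t:                     # fast classifier accepts
            cost += fast_cost
            err += not fast_ok
        else:                              # rejected -> slow classifier
            cost += fast_cost + slow_cost
            err += not slow_ok
    n = len(samples)
    return cost / n, err / n

def best_threshold(samples, thresholds, max_error):
    """Cheapest threshold meeting the error bound, or None if infeasible."""
    feasible = [(evaluate(t, samples)[0], t) for t in thresholds
                if evaluate(t, samples)[1] <= max_error]
    return min(feasible)[1] if feasible else None

# Fast classifier is right only on its confident (high-score) samples.
samples = [(0.9, True, True), (0.8, True, True),
           (0.3, False, True), (0.2, False, True)]
print(best_threshold(samples, [0.0, 0.5, 1.0], max_error=0.0))
```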
We offer a preliminary report on a research program to investigate versatile algorithms for document image content extraction, that is, locating regions containing handwriting, machine-print text, graphics, line-art, logos, photographs, noise, etc. To solve this problem in its full generality requires coping with a vast diversity of document and image types. Automatically trainable methods are highly desirable, as is extremely high speed in order to process large collections. Significant obstacles include the expense of preparing correctly labeled ("ground-truthed") samples, unresolved methodological questions in specifying the domain (e.g., what is a representative collection of document images?), and a lack of consensus among researchers on how to evaluate content-extraction performance. Our research strategy emphasizes versatility first: that is, we concentrate at the outset on designing methods that promise to work across the broadest possible range of cases. This strategy has several important implications: the classifiers must be trainable in reasonable time on vast data sets, and expensive ground-truthed data sets must be complemented by amplification using generative models. These and other design and architectural issues are discussed. We propose a trainable classification methodology that marries k-d trees and hash-driven table lookup, and describe preliminary experiments.
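The hash-driven table lookup idea can be sketched as follows: quantize each feature vector to a coarse key, record the majority class per key at training time, and classify by dictionary lookup. The uniform-grid quantizer here is an assumed stand-in for the paper's k-d tree partitioning, and the features and labels are synthetic.

```python
# Sketch of trainable hash-table-lookup classification: training fills a
# table keyed by quantized feature vectors; classification is O(1) lookup.
from collections import Counter, defaultdict

def key(vec, step=0.5):
    """Quantize a feature vector onto a coarse grid (the hash key)."""
    return tuple(int(x // step) for x in vec)

def train(samples):
    """samples: [(feature_vector, label)] -> {key: majority label}."""
    table = defaultdict(Counter)
    for vec, label in samples:
        table[key(vec)][label] += 1
    return {k: c.most_common(1)[0][0] for k, c in table.items()}

def classify(table, vec, default="unknown"):
    return table.get(key(vec), default)

table = train([([0.1, 0.1], "text"), ([0.2, 0.2], "text"),
               ([3.0, 3.0], "photo")])
print(classify(table, [0.15, 0.15]))   # falls in the "text" cell
print(classify(table, [9.0, 9.0]))     # unseen region -> default
```

Lookup speed is the attraction for billion-pixel collections; the cost is that unseen cells need a fallback, which is where a k-d tree's nearest-cell search would take over.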