After decades of research, Optical Character Recognition (OCR) has reached a relatively mature stage. Commercial off-the-shelf (COTS) OCR software packages have become powerful tools in Document Recognition and Retrieval (DRR) applications. One question naturally arises: what areas are left for new DRR research beyond COTS OCR software? This question has been widely discussed at recent conferences. This paper attempts to answer it through a systematic survey of recently reported DRR projects as well as our own Digital Content Re-Mastering (DCRM) research at HP Labs. The survey shows that custom DRR research is still greatly needed for better accuracy and reliability, for complementary content, and for downstream information retrieval. Several concrete observations emerge from the survey. First, basic character/word recognition is mostly handled by COTS software, with a few exceptions. Second, system-level research on reliability and guaranteed accuracy can seldom be replaced by COTS software. Third, document-level structure understanding still has much room to grow. Fourth, post-OCR information retrieval still poses many challenging research topics.
The BBN Byblos OCR system implements a script-independent methodology for OCR using Hidden Markov Models (HMMs). We have successfully ported the system to Arabic, English, Chinese, Pashto, and Japanese. In this paper, we report on our recent effort to train the system to recognize Hindi (Devanagari) documents. The initial experiments reported here were performed using a corpus of synthetic (computer-generated) document images, along with slightly degraded versions generated by scanning printed versions of the document images and by scanning faxes of the printed versions. On a fair test set consisting of synthetic images alone, we measured a character error rate of 1.0%. The character error rate on a fair test set of scanned images (scans of printed versions of the synthetic images) was 1.4%, while the character error rate on a fair test set of fax images (scans of printed and faxed versions of the synthetic images) was 8.7%.
The World Wide Web is a vast information resource which can be useful for validating the results produced by document recognizers. Three computational steps are involved, all of them challenging: (1) use the recognition results in a Web search to retrieve Web pages that contain information similar to that in the document, (2) identify the relevant portions of the retrieved Web pages, and (3) analyze these relevant portions to determine what corrections (if any) should be made to the recognition result. We have conducted exploratory implementations of steps (1) and (2) in the business-card domain: we use fields of the business card to retrieve Web pages and identify the most relevant portions of those Web pages. In some cases, this information appears suitable for correcting OCR errors in the business card fields. In other cases, the approach fails due to stale information: when business cards are several years old and the business-card holder has changed jobs, then websites (such as the home page or company website) no longer contain information matching that on the business card. Our exploratory results indicate that in some domains it may be possible to develop effective means of querying the Web with recognition results, and to use this information to correct the recognition results and/or detect that the information is stale.
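As a rough illustration of steps (1)-(3), the following sketch (Python; the search_web helper is hypothetical and stands in for any Web search API returning text snippets) builds a query from the recognized business-card fields and uses fuzzy string matching to decide whether a retrieved candidate should replace the OCR output.

```python
import difflib

def correct_field(ocr_value, web_snippets, cutoff=0.75):
    """Pick the candidate from retrieved Web text that best matches the OCR output.

    ocr_value    -- a recognized business-card field, e.g. a company name
    web_snippets -- candidate strings harvested from retrieved Web pages
    Returns the closest Web candidate if it is similar enough, else the OCR value.
    """
    matches = difflib.get_close_matches(ocr_value, web_snippets, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_value

# Hypothetical usage: search_web() is assumed, not a real API.
# snippets = search_web('"Jane Doe" "Acme Corporatlon"')
# fixed = correct_field("Acme Corporatlon", snippets)   # -> "Acme Corporation"
```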
The National Library of Medicine has developed a system for the automatic extraction of data from scanned journal articles to populate the MEDLINE database. Although the 5-engine OCR system used in this process exhibits good performance overall, it does make character recognition errors that must be corrected for the process to achieve the requisite accuracy. The correction process works by feeding words that contain characters with less than 100% confidence (as determined automatically by the OCR engine) to a human operator, who must then verify the word or correct the error. The majority of these errors occur in the affiliation information zone, where the characters are in italics or small fonts; therefore, only affiliation information data is used in this research. This paper examines the correlation between OCR errors and various character attributes in the MEDLINE database, such as font size, italics, and bold. The motivation for this research is that if a correlation between character types and error types exists, it should be possible to use this information to improve operator productivity by increasing the probability that the correct word option is presented to the human editor. Using a categorizing program and confusion matrices, we have determined that this correlation exists, in particular for characters with diacritics.
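A minimal sketch of this kind of attribute-conditioned confusion analysis (Python; the aligned stream of ground-truth and OCR characters with their attributes is assumed to be available):

```python
from collections import Counter, defaultdict

def confusion_by_attribute(pairs):
    """Tally OCR confusions separately for each character attribute.

    pairs -- iterable of (true_char, ocr_char, attribute) tuples, where
             attribute is e.g. 'italic', 'bold', 'small-font', 'diacritic'.
    Returns {attribute: Counter({(true_char, ocr_char): count})}.
    """
    matrices = defaultdict(Counter)
    for true_char, ocr_char, attr in pairs:
        if true_char != ocr_char:          # record only misrecognitions
            matrices[attr][(true_char, ocr_char)] += 1
    return matrices

# Example: inspect the most common confusions on characters with diacritics.
# matrices = confusion_by_attribute(aligned_chars)   # aligned_chars is assumed
# print(matrices['diacritic'].most_common(5))
```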
We announce the availability of the UNLV/ISRI Analytic Tools for OCR Evaluation, together with a large and diverse collection of scanned document images and the associated ground-truth text. This combination of tools and test data allows anyone to conduct a meaningful test comparing the performance of competing page-reading algorithms. The value of this collection is enhanced by knowledge of the past performance of several systems measured using exactly these tools and data; these performance comparisons were published in previous ISRI Test Reports and are also provided. The tools can also be used to test the character accuracy of any page-reading OCR system for any language included in the Unicode standard. The paper concludes with a summary of the programs, test data, and documentation that are available, and gives the URL where they can be found.
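For readers unfamiliar with the metric, character accuracy is derived from an edit-distance alignment between ground truth and OCR output. The following is only an illustration of that definition (Python); the ISRI tools use a more elaborate alignment and reporting format.

```python
def char_accuracy(ground_truth, ocr_output):
    """Character accuracy = (n - edit_distance) / n, where n = len(ground_truth).
    Plain dynamic-programming edit distance, for illustration only.
    """
    m, n = len(ocr_output), len(ground_truth)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ocr_output[i - 1] == ground_truth[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return (n - d[m][n]) / n

# char_accuracy("recognition", "recoqnition")  # -> ~0.909
```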
As a cursive script, Arabic differs greatly from Latin or Chinese in its characteristics. For example, an Arabic character has up to four written forms, and characters that can be joined are always joined on the baseline. The methods used for Arabic document recognition are therefore specialized, and character segmentation is the most critical problem. In this paper, a printed Arabic document recognition system is presented, composed of text line segmentation, word segmentation, character segmentation, character recognition, and post-processing stages. First, a hybrid top-down and bottom-up method based on connected-component classification is proposed to segment Arabic text into lines and words. Characters are then segmented by analyzing the word contour: the baseline position of a given word is estimated, a function denoting the distance between the contour and the baseline is analyzed to find all candidate segmentation points, and finally structural rules are applied to merge over-segmented characters. After character segmentation, both statistical and structural features are used for character recognition. Finally, a lexicon is used to improve the recognition results. Experiments show that the recognition accuracy of the system reaches 97.62%.
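A simplified sketch of the contour-to-baseline segmentation idea (Python/NumPy). The baseline heuristic, the way the lower contour is obtained, and the threshold are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def candidate_cut_points(word_img, dist_thresh=2):
    """Find candidate character-segmentation columns in a binary word image.

    word_img -- 2D binary array (1 = ink). The baseline is estimated as the
    row containing the most ink (a common heuristic); columns where the lower
    contour stays close to the baseline are proposed as cut candidates.
    """
    baseline = int(np.argmax(word_img.sum(axis=1)))          # densest row as baseline
    cuts = []
    for x in range(word_img.shape[1]):
        ink_rows = np.flatnonzero(word_img[:, x])
        if ink_rows.size == 0:
            continue
        lower_contour = ink_rows.max()                        # bottom-most ink pixel
        if abs(lower_contour - baseline) <= dist_thresh:      # thin connection on baseline
            cuts.append(x)
    return baseline, cuts
```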
Despite recent developments in Tablet PC technology, there have been no applications for recognizing Turkish handwriting. In this paper, we present an online handwritten text recognition system for Turkish, developed using the Tablet PC interface. Although the system is developed for Turkish, the issues addressed are common to online handwriting recognition systems in general. Several dynamic features are extracted from the handwriting data for each recorded point, and Hidden Markov Models (HMMs) are used to train letter and word models. We experimented with various features and HMM topologies and report on the effects of these experiments. We started with the first and second derivatives of the x and y coordinates and the relative change in pen pressure as initial features. We found that two additional features, the number of neighboring points and the relative height of each point with respect to the baseline, improve the recognition rate. In addition, extracting features within strokes and using a skipping-state topology further improve system performance. The improved system achieves 94% accuracy in recognizing handwritten words from a 1000-word lexicon.
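A minimal sketch of the kind of per-point dynamic feature vector described above (Python/NumPy; the exact normalizations and neighbor counts used by the authors are not reproduced, so the details here are assumptions):

```python
import numpy as np

def point_features(x, y, pressure, baseline_y):
    """Per-point dynamic features for online handwriting (illustrative only).

    x, y, pressure -- 1D arrays of equal length sampled along the pen trajectory.
    Returns an (N, 6) array: dx, dy, ddx, ddy, relative pressure change, and
    relative height with respect to the baseline.
    """
    dx, dy = np.gradient(x), np.gradient(y)                   # first derivatives
    ddx, ddy = np.gradient(dx), np.gradient(dy)               # second derivatives
    dp = np.gradient(pressure) / (pressure + 1e-6)            # relative pressure change
    rel_height = y - baseline_y                               # height w.r.t. baseline
    return np.column_stack([dx, dy, ddx, ddy, dp, rel_height])
```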
Search aspects of a system for analyzing handwritten documents are described. Documents are indexed using global image features, e.g., stroke width, slant as well as local features that describe the shapes of words and characters. Image indexing is done automatically using page analysis, page segmentation, line separation, word segmentation and recognition of words and characters. Two types of search are permitted: search based on global features of entire document and search using features at local level. For the second type of search, i.e., local, all the words in the document are characterized and indexed by various features and it forms the basis of different search techniques. The paper focuses on local search and describes four tasks: word/phrase spotting, text to image, image to text and plain text. Performance in terms of precision/recall and word ranking is reported on a database of handwriting samples from about 1,000 individuals.
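At its simplest, the word/phrase-spotting task reduces to ranking indexed word images by feature-space distance to the query. A minimal sketch (Python/NumPy; the actual word-shape features and distance measure are left unspecified here):

```python
import numpy as np

def spot_word(query_vec, index, top_k=10):
    """Rank indexed word images by similarity to a query feature vector.

    query_vec -- 1D feature vector describing the query word shape
    index     -- list of (doc_id, word_id, feature_vector) for every indexed word
    Returns the top_k closest words by Euclidean distance.
    """
    scored = [(np.linalg.norm(query_vec - vec), doc_id, word_id)
              for doc_id, word_id, vec in index]
    return sorted(scored)[:top_k]
```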
A new feature extraction method based on sequence matching is proposed in this paper and applied to online signature verification. The signature is first represented as a point sequence in writing order. This sequence is then matched against a model sequence extracted from the model signature, using a modified DTW matching criterion. Based on the matching result, the sequence is divided into a fixed number of segments, and local shape features are extracted from each segment using direction and length information. Experiments show that this feature extraction method is more discriminative than other commonly used methods. When applied to an online signature verification system, it proves particularly beneficial for verifying users with large variations in their genuine signatures.
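For reference, a plain DTW alignment between the input point sequence and the model sequence looks as follows (Python/NumPy); the modified matching criterion used in the paper is not reproduced here.

```python
import numpy as np

def dtw_path(seq, model):
    """Standard DTW alignment between two 2D point sequences of shape (N, 2) and (M, 2).
    Returns the total alignment cost and the warping path as (i, j) index pairs.
    """
    n, m = len(seq), len(model)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq[i - 1] - model[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack to recover the alignment used to split the sequence into segments.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]
```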
Most challenges posed by handwritten text are either more severe than, or simply not encountered in, machine-printed text. In contrast to the traditional role of handwriting recognition in various applications, we explore a different perspective inspired by these challenges and introduce new applications based on security systems and Human Interactive Proofs (HIP). HIP has emerged as a very active research area focused on defending online services against abusive attacks. The approach uses a set of security protocols based on automatic reverse Turing tests, which virtually all humans can pass but current computer programs cannot. In this paper we exploit the fact that some recognition tasks are significantly harder for machines than for humans and describe a HIP algorithm that leverages the gap in ability between humans and computers in reading handwritten text images. We also present several promising applications of HIP for cyber security.
In this paper, we propose a new system for the segmentation and recognition of unconstrained handwritten numeral strings. The system uses a combination of foreground and background features for segmenting touching digits. The method introduces new algorithms for traversing the top and bottom foreground skeletons of touching digits, finding feature points on these skeletons, and matching these points to build all the segmentation paths. For the first time, a genetic representation is used to encode all the segmentation hypotheses. Our genetic algorithm searches and evolves the population of candidate segmentations to find the one with the highest combined segmentation and recognition confidence. We also use a new feature extraction method that reduces variation in digit shapes, and an MLP neural network then produces labels and confidence values for the segmented digits. The NIST SD19 and CENPARMI databases are used to evaluate the system. Our system achieves a correct segmentation-recognition rate of 96.07% with a rejection rate of 2.61%, which compares favorably with results reported in the literature.
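A toy sketch of evolving a population of segmentation hypotheses (Python). The candidate encoding, crossover, and fitness shown here are generic genetic-algorithm choices for illustration, not the paper's specific operators.

```python
import random

def evolve_segmentations(population, fitness, generations=50, mutate=None):
    """Tiny genetic-algorithm loop over candidate segmentations.

    population -- list of candidates, each a tuple of segmentation-path indices
    fitness    -- function scoring a candidate, e.g. the combined MLP confidence
                  over the digits produced by that segmentation
    mutate     -- optional function producing a perturbed copy of a candidate
    """
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: max(2, len(population) // 2)]   # keep the fittest half
        children = []
        while len(children) + len(parents) < len(population):
            a, b = random.sample(parents, 2)
            cut = random.randint(1, min(len(a), len(b)) - 1) if min(len(a), len(b)) > 1 else 1
            child = a[:cut] + b[cut:]                           # one-point crossover
            if mutate:
                child = mutate(child)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```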
This paper presents an OCR method that combines a Hopfield network with a two-layer perceptron (MLP) for degraded printed character recognition. The Hopfield network stores 35 prototype characters used as the main classes. After pre-processing, a character image is given to the Hopfield network, which after a fixed number of iterations yields a pattern that is subsequently fed to the MLP for classification. The main idea is to enhance or restore degraded character images with the Hopfield model, run for different numbers of iterations, so as to improve recognition accuracy on poor-quality bank checks. We report experimental results comparing three neural architectures: the Hopfield network, the MLP-based classifier, and the proposed combined architecture. Classification accuracy for ten digits and twenty-five alphabetic characters from a single font is also studied in the presence of additive Gaussian noise; the paper reports a 100% recognition rate at different levels of noise. Experimental results show a recognition rate of 99.35% on poor-quality bank check characters, confirming that the proposed approach can be used effectively for degraded printed character recognition.
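A compact sketch of the Hopfield restoration step (Python/NumPy, with Hebbian learning and synchronous updates); the paper's exact training scheme and iteration schedule are not reproduced here.

```python
import numpy as np

def hopfield_restore(pattern, prototypes, iterations=5):
    """Restore a degraded bipolar (+1/-1) character image with a Hopfield network.

    pattern    -- 1D array of +1/-1 pixels (the degraded character, flattened)
    prototypes -- 2D array, one stored prototype character per row
    Hebbian weights are built from the prototypes; the state is updated
    synchronously for a fixed number of iterations, then passed to the classifier.
    """
    n = pattern.size
    W = prototypes.T @ prototypes / n                 # Hebbian outer-product learning
    np.fill_diagonal(W, 0.0)                          # no self-connections
    state = pattern.astype(float)
    for _ in range(iterations):
        state = np.sign(W @ state)
        state[state == 0] = 1.0                       # break ties toward +1
    return state                                      # fed to the MLP classifier
```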
This paper discusses the implementation of an engine for performing optical character recognition of bi-tonal images using Gamera, an existing open-source framework for building document analysis applications. The OCR engine uses features based on the Fourier descriptor to distinguish characters, and is designed to handle character images that contain multiple boundaries. The algorithm works by assigning to each character image a signature that encodes the boundary types present in the image as well as the positional relationships between them. Under this approach, only images having the same signature are comparable. Effectively, a meta-classifier is used which first computes the signature of an input image and then dispatches the image to an underlying neural-network-based classifier trained to distinguish between images having that signature. The performance of the OCR engine is evaluated on a set of sample images taken from the newspaper domain, and compares well with other OCR engines. The source code for this engine and all supporting modules is currently available upon request, and will eventually be made available through an open-source project on the SourceForge website.
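For readers unfamiliar with the feature, a common way to compute a Fourier descriptor for a single closed boundary is sketched below (Python/NumPy). This is the textbook construction, not necessarily the exact variant used by the engine.

```python
import numpy as np

def fourier_descriptor(boundary, n_coeffs=16):
    """Translation/scale-invariant Fourier descriptor of one closed boundary.

    boundary -- array of (x, y) boundary points in order around the contour.
    The contour is treated as a complex signal; dropping the DC term removes
    translation, normalizing by the first harmonic removes scale, and keeping
    only magnitudes discards the starting point and rotation.
    """
    z = boundary[:, 0] + 1j * boundary[:, 1]
    coeffs = np.fft.fft(z)
    mags = np.abs(coeffs[1:n_coeffs + 1])
    return mags / (mags[0] + 1e-12)
```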
This paper presents the implementation and evaluation of a Hidden Markov Model to extract addresses from OCR text. Although Hidden Markov Models discover addresses with high precision and recall, this type of information extraction task appears to be negatively affected by the errors present in OCR output.
Although about 300 million people worldwide, across several languages, use Arabic characters for writing, Arabic OCR has not been researched as thoroughly as OCR for other widely used scripts such as Latin or Chinese. In this paper, a new statistical method is developed to recognize machine-printed Arabic characters. First, the entire Arabic character set is pre-classified into 32 subsets according to character form (isolated, final, initial, medial), the zones that characters occupy (divided by the headline and the baseline of a text line), and component information (with or without secondary parts such as diacritical marks and vowel signs). Then 12 types of directional features are extracted from character profiles. After dimension reduction by linear discriminant analysis (LDA), the features are passed to a modified quadratic discriminant function (MQDF), which serves as the final classifier. Finally, similar characters are discriminated before the recognition results are output. With the involved parameters selected properly, encouraging experimental results on test sets demonstrate the validity of the proposed approach.
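A rough sketch of the LDA-plus-quadratic-discriminant pipeline (Python/scikit-learn). Plain QDA stands in here for the MQDF of the paper, which additionally regularizes the minor eigenvalues of each class covariance; the feature matrix and labels are assumed to come from one of the 32 pre-classified subsets.

```python
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

def train_subset_classifier(X, y, n_components=40):
    """LDA dimension reduction followed by a quadratic discriminant classifier.

    X -- directional-feature vectors extracted from character profiles
    y -- character labels within one pre-classified subset
    """
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    Xr = lda.fit_transform(X, y)
    qdf = QuadraticDiscriminantAnalysis().fit(Xr, y)
    return lda, qdf

# At recognition time:
# label = qdf.predict(lda.transform(x.reshape(1, -1)))
```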
A new method is presented for restoring high-resolution binary images from low-resolution text images, improving both legibility and OCR accuracy. An initially restored image is generated by simple techniques and is then improved by integrating a variety of features obtained through image analysis. Missing character strokes are complemented based on topographic features. Character contours are then modified in terms of gradient magnitudes and curvatures along the contours. Finally, contours are beautified so that they look good to the human eye. The proposed method can deal with characters having complex structures, such as Kanji, and entails relatively simple computation. Experiments validate that the proposed method improves both OCR accuracy and legibility; in particular, smoothness and linearity along contours are significantly improved and strokes are restored correctly.
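For context, the trivial interpolate-then-binarize baseline that such methods improve upon can be written in a few lines (Python/OpenCV). It contains none of the stroke complementation or contour modification described above.

```python
import cv2

def restore_text_image(low_res_gray, scale=4):
    """Baseline restoration: bicubic upscaling, smoothing, and Otsu binarization.
    The paper's method additionally repairs strokes and contours using
    topographic features, gradients, and curvature; none of that is here.
    """
    up = cv2.resize(low_res_gray, None, fx=scale, fy=scale,
                    interpolation=cv2.INTER_CUBIC)
    up = cv2.GaussianBlur(up, (3, 3), 0)
    _, binary = cv2.threshold(up, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```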
In this paper, a new feature extraction operator, the grating cell operator, is applied to analyze texture features and classify the fonts of scanned document images. This operator is compared with the isotropic Gabor filter feature extractor, which has also been employed for document font classification. To improve performance, a back-propagation neural network (BPNN) classifier is applied to the extracted features and compared with a simple weighted Euclidean distance (WED) classifier. Experimental results show that the grating cell operator performs better than the isotropic Gabor filter, and that the BPNN classifier provides more accurate classification results than the WED classifier.
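A minimal sketch of Gabor-filter texture features, the comparison baseline named above (Python/OpenCV); the filter parameters here are illustrative, and the grating cell operator builds further structure on top of responses like these.

```python
import cv2
import numpy as np

def gabor_texture_features(gray, thetas=(0, 45, 90, 135), ksize=31,
                           sigma=4.0, lambd=10.0, gamma=0.5):
    """Texture features from a small Gabor filter bank.
    Returns the mean and standard deviation of each filtered image,
    concatenated into one feature vector for the classifier.
    """
    feats = []
    for theta in thetas:
        kernel = cv2.getGaborKernel((ksize, ksize), sigma,
                                    np.deg2rad(theta), lambd, gamma)
        response = cv2.filter2D(gray, cv2.CV_32F, kernel)
        feats.extend([response.mean(), response.std()])
    return np.array(feats)
```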
A handwritten codex often included an inscription that listed facts about its publication, such as the names of the scribe and patron, date of publication, the city where the book was copied, etc. These facts obviously provide essential information to a historian studying the provenance of the codex. Unfortunately, this page was sometimes erased after the sale of the book to a new owner, often by scraping off the original ink. The importance of recovering this information would be difficult to overstate. This paper reports on the methods of imaging, image enhancement, and character recognition that were applied to this page in a Hebrew prayer book copied in Florence in the 15th century.
This paper presents a new document binarization algorithm for camera images of historical documents, such as those found in the Library of Congress of the United States. The algorithm uses a background light-intensity normalization algorithm to enhance an image before a local adaptive binarization algorithm is applied. The image normalization algorithm uses an adaptive linear or non-linear function to approximate the uneven background of the image caused by the uneven surface of the document paper, aged color, or uneven light sources of the cameras used for image lifting. Our algorithm adaptively captures the background of a document image with a "best fit" approximation. The document image is then normalized with respect to the approximation before a thresholding algorithm is applied. The technique works for both grayscale and color historical handwritten document images, with significant improvement in readability for both humans and OCR.
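A rough sketch of the normalize-then-threshold pipeline described above (Python/OpenCV). The morphological background estimate here stands in for the paper's adaptive linear/non-linear best-fit approximation, and the block size and offset are illustrative.

```python
import cv2

def binarize_historical(gray):
    """Background normalization followed by local adaptive thresholding.
    The background is estimated with a large morphological closing, then the
    image is divided by it to flatten uneven illumination before binarization.
    """
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (51, 51))
    background = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)   # bright background estimate
    normalized = cv2.divide(gray, background, scale=255)           # flatten illumination
    return cv2.adaptiveThreshold(normalized, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 35, 10)
```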
Style is an important property of printed and handwritten characters, but it has not been studied as thoroughly as character recognition. In this paper, we try to learn how many typical styles exist in a collection of real-world form images. A hierarchical clustering method is developed and tested. A cross-recognition error rate constraint is proposed to reduce false merges during the hierarchical clustering process, and a cluster selection method is used to delete redundant or unsuitable clusters. The algorithm requires only a similarity measure between patterns; it is tested here with a template-matching-based similarity measure and can easily be extended to any other feature and distance measure. A detailed comparison of the effect of each step is presented in the paper. A total of 16 typical styles are identified, and by giving each character in each style a prototype for recognition, a 0.78% error rate is achieved on the test set.
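A minimal sketch of hierarchical clustering from a pairwise similarity matrix (Python/SciPy). The cross-recognition error constraint and the cluster-selection step described above are omitted, and the linkage method and cut distance are assumptions.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_styles(similarity, cut_distance=0.4):
    """Group character samples into styles from a pairwise similarity matrix.

    similarity -- square matrix with values in [0, 1], e.g. from template matching.
    It is converted to distances and clustered with average linkage; the result
    is one style label per sample.
    """
    distance = 1.0 - similarity
    Z = linkage(squareform(distance, checks=False), method="average")
    return fcluster(Z, t=cut_distance, criterion="distance")
```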
It has been recently demonstrated, in dramatic fashion, that sensitive information thought to be obliterated through the process of redaction can be successfully recovered via a combination of manual effort, document image analysis, and natural language processing techniques. In this paper, we examine what might be revealed through redaction, exploring how known methods might be employed to detect vestigial artifacts of the pre-redacted text. We discuss the process of redaction and circumstances under which sensitive information might leak, present an outline for experimental analyses of various approaches that could be used to recover redacted material, and describe a series of increasingly stringent countermeasures to address, and in some cases eliminate, the perceived threat.
We propose a design methodology for "implicit" CAPTCHAs to address drawbacks of present technology. CAPTCHAs are tests administered automatically over networks that can distinguish between people and machines and thus protect web services from abuse by programs masquerading as human users. All existing CAPTCHA challenges require a significant conscious effort by the person answering them -- e.g. reading and typing a nonsense word -- whereas implicit CAPTCHAs may require as little as a single click. Many CAPTCHAs distract and interrupt users, since the challenge is perceived as an irrelevant intrusion; implicit CAPTCHAs can be woven into the expected sequence of browsing using cues tailored to the site. Most existing CAPTCHAs are vulnerable to "farming-out" attacks in which challenges are passed to a networked community of human readers; by contrast, implicit CAPTCHAs are not "fungible" (in the sense of easily answerable in isolation) since they are meaningful only in the specific context of the website that is protected. Many existing CAPTCHAs irritate or threaten users since they are obviously tests of skill: implicit CAPTCHAs appear to be elementary and inevitable acts of browsing. It can often be difficult to detect when CAPTCHAs are under attack: implicit CAPTCHAs can be designed so that certain failure modes are correlated with failed bot attacks. We illustrate these design principles with examples.
A reading-based CAPTCHA designed to resist character-segmentation attacks, called 'ScatterType', is described. Its challenges are pseudorandomly synthesized images of text strings rendered in machine-print typefaces: within each image, characters are fragmented using horizontal and vertical cuts, and the fragments are scattered by vertical and horizontal displacements. This scattering is designed to defeat all methods known to us for automatic segmentation into characters. As in the BaffleText CAPTCHA, English-like but unspellable text strings are used to defend against known-dictionary attacks. In contrast to the PessimalPrint and BaffleText CAPTCHAs (and others), no physics-based image degradations, occlusions, or extraneous patterns are employed. We report preliminary results from a human legibility trial with 57 volunteers that yielded 4275 CAPTCHA challenges and responses. ScatterType human legibility remains remarkably high even on extremely degraded cases. We speculate that this is due to Gestalt perception abilities assisted by style-specific (here, typeface-specific) consistency among primitive shape features of character fragments. Although recent efforts to automate style-consistent perceptual skills have reported progress, the best known methods do not yet pose a threat to ScatterType. The experimental data also show that subjective ratings of difficulty are strongly (and usefully) correlated with illegibility. In addition, we present early insights emerging from these data as we explore the ScatterType design space -- choice of typefaces, 'words', cut positioning, and displacements -- with the goal of locating regimes in which ScatterType challenges remain comfortably legible to almost all people but strongly resist machine-vision methods for automatic segmentation into characters.
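A toy illustration of the cut-and-scatter idea (Python/Pillow). The real ScatterType generator also applies horizontal cuts, uses unspellable pseudo-words, and controls the scatter regime; the font path, strip width, and shift range here are arbitrary assumptions.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def scatter_challenge(word, font_path, strip_width=12, max_shift=6):
    """Render a word, cut it into vertical strips, and paste each strip with a
    random vertical and horizontal offset onto a slightly larger canvas.
    """
    font = ImageFont.truetype(font_path, 48)
    canvas = Image.new("L", (48 * len(word), 80), 255)
    ImageDraw.Draw(canvas).text((10, 10), word, font=font, fill=0)
    out = Image.new("L", (canvas.width + max_shift * 4, canvas.height + max_shift * 4), 255)
    for x in range(0, canvas.width, strip_width):
        strip = canvas.crop((x, 0, min(x + strip_width, canvas.width), canvas.height))
        dx = random.randint(-max_shift, max_shift)
        dy = random.randint(-max_shift, max_shift)
        out.paste(strip, (x + max_shift * 2 + dx, max_shift * 2 + dy))
    return out
```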
The notion of assigning every piece of paper that passes through a printer a unique ID encoded either on the surface or in the substrate of the page, regardless of its intended use or perceived importance, could prove to be a breakthrough of magnitude comparable to the now ubiquitous concept of referencing a webpage through the use of its Uniform Resource Locator (URL). We see many opportunities for using chipless IDs in the world of everyday documents, but also many challenges. In this paper, we begin to explore the ways this new technology can be used to enable advanced document management functions, along with its implications for the ways in which people use documents.