We present a deterministic text extraction algorithm that relies on three basic assumptions: color/luminance uniformity of the interior region, closed boundaries formed by sharp edges, and consistency of local contrast. The algorithm is largely independent of the character alphabet, text layout, font size, and orientation. At its heart is an edge-bounded averaging scheme for classifying smooth regions, which enhances robustness against noise without sacrificing boundary accuracy. We have also developed a verification process to clean up the residue of incoherent segmentation. Our framework treats regular and inverse text symmetrically. We propose three heuristics for identifying the type of text from a cluster consisting of two types of pixel aggregates. Finally, we demonstrate the advantages of the proposed algorithm over adaptive thresholding and block-based clustering methods in terms of boundary accuracy, segmentation coherency, and the ability to identify inverse text and separate characters from background patches.
Spatial covariances based on geostatistics are extracted as representative features of logo and trademark images. These spatial covariances differ from other statistical features for image analysis in that the structural information of an image is represented as spatial series, independent of pixel locations. We then design a classifier based on hidden Markov models to make use of these geostatistical sequential data to recognize the logos. High recognition rates are obtained when testing the method against a public-domain logo database.
Several dissimilarity measures for binary vectors are formulated and examined for their recognition capability in handwriting identification, for which binary micro-features are used to characterize handwritten character shapes. For eight dissimilarity measures, i.e., Jaccard-Needham, Dice, Correlation, Yule, Russell-Rao, Sokal-Michener, Rogers-Tanimoto, and Kulczynski, the discriminatory power of ten individual characters and their combination is exhaustively studied. Conclusions are drawn on how to choose a dissimilarity measure and how to combine hybrid features.
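As a sketch of how such measures are computed, each pair of binary micro-feature vectors reduces to a 2×2 contingency table (a = both bits 1, b = first only, c = second only, d = both 0), from which the dissimilarities follow. The formulas below use common conventions; variants of some measures exist in the literature, and the remaining measures follow the same pattern.

```python
def counts(x, y):
    # x, y: equal-length sequences of 0/1 micro-feature bits
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def jaccard_needham(x, y):
    a, b, c, _ = counts(x, y)
    return (b + c) / (a + b + c) if (a + b + c) else 0.0

def dice(x, y):
    a, b, c, _ = counts(x, y)
    return (b + c) / (2 * a + b + c) if (2 * a + b + c) else 0.0

def rogers_tanimoto(x, y):
    a, b, c, d = counts(x, y)
    denom = a + d + 2 * (b + c)
    return 2 * (b + c) / denom if denom else 0.0

def russell_rao(x, y):
    a, b, c, d = counts(x, y)
    n = a + b + c + d
    return (n - a) / n if n else 0.0
```

Identical vectors score 0 under each measure; the measures differ mainly in how they weight mismatches (b, c) against the 1-1 and 0-0 agreements.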
Handwritten and machine-printed characters are recognized separately in most OCR systems due to their distinct differences. In applications where both kinds of characters are involved, it is necessary to judge whether a character is handwritten or printed before feeding it into the appropriate recognition engine. In this paper, a new method to discriminate between handwritten and machine-printed characters is proposed. Unlike most previous work, the discrimination carried out in this paper is based entirely on a single character. Five kinds of statistical features are extracted from the character image; feature selection and classification are then implemented simultaneously by a learning algorithm based on AdaBoost. Experiments on large data sets have demonstrated the effectiveness of the method.
A prototype system has been designed to automate the extraction of bibliographic data (e.g., article title, authors, abstract, affiliation and others) from online biomedical journals to populate the National Library of Medicine’s MEDLINE database. This paper describes a key module in this system: the labeling module that employs statistics and fuzzy rule-based algorithms to identify segmented zones in an article’s HTML pages as specific bibliographic data. Results from experiments conducted with 1,149 medical articles from forty-seven journal issues are presented.
Video is an increasingly important and ever-growing source of information to the intelligence and homeland defense analyst. A capability to automatically identify the contents of video imagery would enable the analyst to index relevant foreign and domestic news videos in a convenient and meaningful way. To this end, the proposed system aims to help determine the geographic focus of a news story directly from video imagery by detecting and geographically localizing political maps from news broadcasts, using the results of videotext recognition in lieu of a computationally expensive, scale-independent shape recognizer. Our novel method for the geographic localization of a map is based on the premise that the relative placement of text superimposed on a map roughly corresponds to the geographic coordinates of the locations the text represents. Our scheme extracts and recognizes videotext, and iteratively identifies the geographic area, while allowing for OCR errors and artistic freedom. The fast and reliable recognition of such maps by our system may provide valuable context and supporting evidence for other sources, such as speech recognition transcripts. The concepts of syntax-directed content analysis of videotext presented here can be extended to other content analysis systems.
This paper describes an approach for the extraction of valid data sets in legal registers containing data that have been invalidated by invalidation lines, i.e., hand-drawn lines below the invalid words or text lines. In a first step, detection of horizontal lines and segmentation of the text objects (block, line, word, character) is performed based on a fast connected component analysis using sub-components, which is robust against touching lines. Invalidation is performed at the word or text line level using the neighborhood relation between text objects and invalidation lines. Invalidations are recognized with about 90% accuracy at about a 10% rejection threshold (false negatives). The error rate (i.e., invalidation of a valid word) is less than 0.5%. For most data sets it is sufficient to eliminate the invalidated text so that the valid data remain. In a second step, a syntactic analysis of the valid text strings is performed. This increases the accuracy to 99% at the data set level. Error detection and correction are done through a graphical user interface. Data capture time can be reduced by a factor of 2 to 3 compared with manual input.
In support of the goal of automatically selecting methods of enhancing an image to improve the accuracy of OCR on that image, we treat the decision of whether to apply each of a set of enhancement methods as a supervised classification problem for machine learning. We characterize each image by a combination of two sets of measures: a set intended to reflect the degree of particular types of noise present in documents in a single font of Roman or similar script, and a more general set based on connected component statistics. We consider several potential methods of image improvement, each of which constitutes its own two-class classification problem according to whether transforming the image with that method improves the accuracy of OCR. In our experiments, the results varied across the image transformation methods, but the system made the correct choice in 77% of the cases in which the decision affected the OCR score (in the range [0,1]) by at least 0.01, and it made the correct choice 64% of the time overall.
The Medical Article Records System (MARS) developed by the Lister Hill National Center for Biomedical Communications uses scanning, OCR and automated recognition and reformatting algorithms to generate electronic bibliographic citation data from paper biomedical journal articles. The OCR server incorporated in MARS performs well in general, but fares less well with text printed in small or italic fonts. Affiliations are often printed in small italic fonts in the journals processed by MARS. Consequently, although the automatic processes generate much of the citation data correctly, the affiliation field frequently contains incorrect data, which must be manually corrected by verification operators. In contrast, author names are usually printed in large, normal fonts that are correctly converted to text by the OCR server.
The National Library of Medicine’s MEDLINE database contains 11 million indexed citations for biomedical journal articles. This paper documents our effort to use the historical author/affiliation relationships from this large dataset to find potential correct affiliations for MARS articles based on the author and the affiliation in the OCR output. Preliminary tests using a table of about 400,000 author/affiliation pairs extracted from the corrected MARS data indicated that about 44% of the author/affiliation pairs were repeats and that about 47% of newly converted author names would be found in this set. A text-matching algorithm was developed to determine the likelihood that an affiliation found in the table corresponding to the OCR text of the first author was the current, correct affiliation. This matching algorithm compares an affiliation found in the author/affiliation table (looked up with the OCR text of the first author) to the OCR output affiliation, and calculates a score indicating the similarity of the two. Using a ground truth set of 519 OCR author/OCR affiliation/correct affiliation triples, the matching algorithm is able to select a correct affiliation for the author 43% of the time, with a false positive rate of 6%, a true negative rate of 44%, and a false negative rate of 7%.
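The paper does not spell out the form of the similarity score, but a minimal token-overlap sketch illustrates the idea of comparing a table affiliation against noisy OCR text; the function name and scoring choice here are assumptions, not the system's actual algorithm.

```python
import re

def token_similarity(table_affil, ocr_affil):
    """Score in [0, 1]: Jaccard overlap of word tokens between a
    candidate affiliation from the author/affiliation table and the
    OCR output affiliation. A stand-in for the paper's unspecified
    text-matching score; tolerant of word order and punctuation."""
    tok = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    t, o = tok(table_affil), tok(ocr_affil)
    if not t or not o:
        return 0.0
    return len(t & o) / len(t | o)
```

A threshold on this score would then decide whether the table affiliation is accepted as the current, correct affiliation; OCR character errors inside individual words would need a fuzzier comparison than exact token matching.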
MEDLINE citations with United States affiliations typically include the zip code. In addition to using author names as clues to correct affiliations, we are investigating the value of the OCR text of zip codes as clues to correct USA affiliations. Current work includes generation of an author/affiliation/zipcode table from the entire MEDLINE database and development of a daemon module to implement affiliation selection and matching for the MARS system using both author names and zip codes. Preliminary results from the initial version of the daemon module and the partially filled author/affiliation/zipcode table are encouraging.
Small image deformations such as those introduced by optical scanners significantly reduce the accuracy rate of optical character recognition (OCR) software. Characterization of the scanner used in the OCR process may diminish the impact on recognition rates. Theoretical methods have been developed to characterize a scanner based on the bi-level image resulting from scanning a high contrast document. Two bottlenecks in the naïve implementation of these algorithms have been identified, and techniques were developed to improve the execution time of the software. The algorithms are described and analyzed. Since approximations are used in one of the techniques, the error of the approximations is examined.
For over 10 years, the Information Science Research Institute (ISRI) at UNLV has worked on problems associated with the electronic conversion of archival document collections. Such collections typically have a large fraction of poor quality images and present a special challenge to OCR systems. Frequently, because of the size of the collection, manual correction of the output is not affordable. Because the output text is used only to build the index for an information retrieval (IR) system, the accuracy of non-stopwords is the most important measure of output quality. For these reasons, ISRI has focused on using document level knowledge as the best means of providing automatic correction of non-stopwords in OCR output. In 1998, we developed the MANICURE post-processing system that combined several document level corrections. Because of the high cost of obtaining accurate ground-truth text at the document level, we have never been able to quantify the accuracy improvement achievable using document level knowledge. In this report, we describe an experiment to measure the actual number (and percentage) of non-stopwords corrected by the MANICURE system. We believe this to be the first quantitative measure of OCR conversion improvement that is possible using document level knowledge.
In this paper we study boosting methods from a new perspective. We build on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion with an L1 constraint. For the two most commonly used loss criteria (exponential and logistic log-likelihood), we further show that as the constraint diminishes, or equivalently as the boosting iterations proceed, the solution converges in the separable case to an “L1-optimal” separating hyper-plane. This “L1-optimal” separating hyper-plane has the property of maximizing the minimal margin of the training data, as defined in the boosting literature. We illustrate through examples the regularized and asymptotic behavior of the solutions to the classification problem with both loss criteria.
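In notation assumed here (weak learners h, coefficient vector β, labels y_i ∈ {−1,+1}; not the paper's own symbols), the "L1-optimal" hyperplane can be stated as the margin-maximization problem:

```latex
% Separable case: as the L1 constraint grows (i.e., as boosting
% iterations proceed), the normalized solution \beta/\lVert\beta\rVert_1
% converges to the solution of
\max_{\beta} \; \min_{i} \; y_i \, \beta^{\top} h(x_i)
\quad \text{subject to} \quad \lVert \beta \rVert_1 = 1,
% i.e., the separating hyperplane maximizing the minimal L1 margin
% over the training pairs (x_i, y_i).
```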
A rule-based automatic text categorizer was tested to see if two types of thesaurus expansion, called query expansion and Junker expansion respectively, would improve categorization. The thesauri used were domain-specific to an OCR test collection focused on a single topic. Results show that neither type of expansion significantly improved categorization.
Due to the proliferation of various types of devices used to browse the web and the shift toward document access via web interfaces, it is becoming very important to classify web pages into pre-selected types. This often forms the pre-processing stage of a number of web applications. However, classification of web pages is known to be a difficult problem: it is inherently difficult to identify distinctive features of web pages, and therefore equally difficult to use a set of heuristics to accomplish the task. This paper describes a solution to the problem that combines a heuristic-based system with a Support Vector Machine (SVM). We find that such a hybrid system achieves very high accuracy compared to using SVMs on their own.
The difficulty with information retrieval for OCR documents lies in the fact that OCR documents contain a significant amount of erroneous words and unfortunately most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to “boost” retrieval performance. The basic idea of this correction model is to exploit the whole content of a document to supplement any other useful information provided by an existing OCR correction tool for word corrections. Instead of making an explicit correction decision for each erroneous word as typically done in a traditional approach, we consider the uncertainties in such correction decisions and compute an estimate of the original “uncorrupted” document language model accordingly. The document language model can then be used for retrieval with a language modeling retrieval approach. Evaluation using the TREC standard testing collections indicates that our method significantly improves the performance compared with simple word correction approaches such as using only the top ranked correction.
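The key step, estimating the "uncorrupted" document language model without committing to a single correction, can be sketched as a unigram model in which each erroneous word's count is spread over its weighted correction candidates. The interface below (the candidate dictionary supplied by an external correction tool) is an illustrative assumption; the paper's model is more elaborate.

```python
from collections import Counter

def document_language_model(words, candidates):
    """Unigram language model over the estimated 'uncorrupted' document.
    words: OCR token list for the document.
    candidates: hypothetical output of an OCR correction tool, mapping
    an erroneous OCR token to {correction: probability}."""
    counts = Counter()
    for w in words:
        if w in candidates:
            # Spread the count over candidate corrections instead of
            # committing to only the top-ranked correction.
            for corr, p in candidates[w].items():
                counts[corr] += p
        else:
            counts[w] += 1.0
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

This soft assignment is what distinguishes the approach from simple word correction: low-probability candidates still contribute mass to the model used for language-modeling retrieval.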
The World Wide Web provides an increasingly powerful and popular publication mechanism. Web documents often contain a large number of images serving various purposes. Identifying the functional categories of these images has important applications including information extraction, web mining, web page summarization, and mobile access. An important first step towards designing algorithms for automatic categorization of images on the web is to identify the common categories and examine their properties and characteristics. This paper describes results from such an initial study using data collected from news web sites. We describe the image categories found in such web pages and their distributions, and identify the main research issues involved in automatically classifying images into these categories.
A framework for real-time adaptive delivery of web images to resource-constrained devices is presented, bringing together techniques from image analysis, compression, rate-distortion optimization, and user interaction. Two fundamental themes in this work are: (1) a structured and scalable representation, obtained through content- and lower-level image analysis, that allows multiple descriptions of object regions, and (2) resource-optimized content adaptation in real time, facilitated by an algorithm for directly merging LZ77-compressed streams without the need for additional string matching.
Also introduced is a new distortion measure for image approximations based on a feature space distance. Using this measure, a color reduction algorithm is proposed. Simulation studies show that this algorithm can yield better results than previous approaches, both from a visual standpoint and in terms of feature space distortion.
The purpose of this paper is to propose a new robust algorithm to retrieve information in handwritten scripts. Digital pens have become popular in recent years on electronic devices such as Pocket PCs and Tablet PCs, which can capture the trace of the pen exactly, and it is helpful for users to be able to index information such as keywords and symbols in handwritten script. The basic idea of the algorithm is to calculate the “distance” between two scripts to determine whether they are similar enough to be the same. Compared with other systems, the new retrieval algorithm achieves higher accuracy. In addition, we present results obtained with an implementation of this algorithm.
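The abstract does not name the distance it uses; dynamic time warping (DTW) is one common choice for comparing pen trajectories of different lengths, and is sketched here purely as an illustration of the "distance between two scripts" idea.

```python
import math

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two pen traces, each a
    list of (x, y) sample points. Standard O(n*m) dynamic program;
    smaller values mean the traces are more alike."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])  # Euclidean point cost
            D[i][j] = cost + min(D[i - 1][j],      # skip a point in a
                                 D[i][j - 1],      # skip a point in b
                                 D[i - 1][j - 1])  # match the points
    return D[n][m]
```

A retrieval system would then threshold this distance (or rank candidates by it) to decide whether two scripts represent the same keyword or symbol.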
This paper introduces a robust algorithm to extract headers and footers from a variety of electronic documents, such as image files, Adobe PDF files, and files generated from OCR. In contrast to conventional methods based on page-level layout and format, the proposed strategy considers a page in the context of its neighboring pages. Through this page association, headers and footers in different patterns can be detected automatically, without human intervention or individual templates. In addition, fuzzy string matching makes the method robust against OCR errors.
We present an overview of an information extraction application in the health insurance invoice processing domain. The system is novel in that it is not constrained by the document type - it has no explicit document model or document type classification phase. The system relies on constraints derived from a domain model, constraints derived from world state, and simple models of layout, including the use of labeled fields and the proximity of related information.
In this paper, we present an approach to the bootstrap learning of a page segmentation model. The idea evolves from attempts to segment dictionaries, which often have a consistent page structure, and is extended to the segmentation of more general structured documents. In cases of highly regular structure, the layout can be learned from examples of only a few pages. The system is first trained using a small number of samples, and a larger test set is processed based on the training result. After corrections are made to a selected subset of the test set, these corrected samples are combined with the original training samples to generate bootstrap samples. The newly created samples are used to retrain the system, refine the learned features, and resegment the test samples. This procedure is applied iteratively until the learned parameters are stable. With this approach, we do not need to provide a large set of training samples initially. We have applied this segmentation approach to many structured documents, such as dictionaries, phone books, and spoken language transcripts, and obtained satisfactory segmentation performance.
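The iterative loop described above can be sketched generically; every callable below is a hypothetical placeholder standing in for the paper's training, segmentation, correction, and convergence steps.

```python
def bootstrap_train(train, test, fit, segment, correct_subset, stable):
    """Bootstrap training loop (illustrative sketch).
    fit(samples) -> model; segment(model, test) -> results;
    correct_subset(results) -> corrected samples to add to training;
    stable(old_model, new_model) -> True when parameters have settled."""
    model = fit(train)
    while True:
        results = segment(model, test)            # process the test set
        train = train + correct_subset(results)   # add corrected samples
        new_model = fit(train)                    # retrain on the union
        if stable(model, new_model):              # stop when stable
            return new_model
        model = new_model
```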
Content features extracted from recognized text are valuable for labeling logical elements in documents without a rigid layout structure, such as business letters. This paper discusses a model-based approach to combining content features with other geometric and presentation features for logical labeling. Models are automatically initialized and adaptively improved using training samples. Satisfactory experimental results are presented.
Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and can be modeled by a tree grammar which describes the page at the component level in terms of regions or blocks. This paper provides a detailed survey of past work on document structure analysis algorithms and summarizes the limitations of past approaches. In particular, we survey past work on document physical layout representations and algorithms, document logical structure representations and algorithms, and performance evaluation of document structure analysis algorithms. In the last section, we summarize this work and point out its limitations.
This paper presents a new type of ATM, called the image-ATM, and an image workflow system developed for banking applications. The system, including the image-ATM, captures the paper forms brought by clients at the very front end, identifies the type of form, and recognizes the data on the form automatically. The image-ATM can accept over 400 different kinds of forms. The system is presently in operation at some major Japanese banks, which have been able to reduce considerable human workforce at their branch offices by introducing the image workflow system and by centralizing back-office work at a few operation centers. Technically, form recognition, especially form type identification, was one of the keys to this success. This paper discusses a method for form type identification and its technical issues.
A printed character image contains not only information about the character but also information about its font. Font information is essential in layout analysis and reconstruction, and helps improve the performance of character recognition systems. An algorithm for font recognition of a single Chinese character is proposed in this paper. The aim is to analyze a single Chinese character and identify its font; no a priori knowledge of the character is required. The new algorithm can recognize the font of a single Chinese character, whereas existing methods are all based on a block of text. Stroke property features and stroke distribution features are extracted from a single Chinese character, and two classifiers are employed to distinguish the fonts. We combine these two classifiers by logistic regression to obtain the final result. Experiments show that our method can recognize the font of a single Chinese character effectively.
We describe a system for recognizing unconstrained Turkish handwritten text. Turkish has agglutinative morphology and, theoretically, an infinite number of words that can be generated by adding more suffixes to a word. This makes lexicon-based recognition approaches, where the most likely word is selected among all the alternatives in a lexicon, unsuitable for Turkish. We describe our approach to the problem using a Turkish prefix recognizer. First results of the system demonstrate the promise of this approach, with a top-10 word recognition rate of about 40% on a small test set of mixed handprint and cursive writing. The lexicon-based approach with a 17,000-word lexicon (with test words added) achieves a 56% top-10 word recognition rate.
Color histograms are widely used for content-based image retrieval. Their advantages are efficiency and insensitivity to small changes in camera viewpoint. However, a histogram is a coarse characterization of an image, so images with very different appearances can have similar histograms. This is particularly important for large image databases, in which many images can have similar color histograms. We show how to find a relationship between histograms and elliptic curves, in order to define a color similarity feature based on parametric elliptic equations. These equations are directly involved in Fermat's Last Theorem, and thus represent a solution that is interesting in terms of theory and parametric properties.
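For context, the baseline histogram comparison that such features refine is very simple; the sketch below shows plain histogram intersection (the elliptic-curve feature itself is not specified in the abstract and is not reproduced here).

```python
def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two normalized color histograms,
    given as equal-length lists of bin frequencies summing to 1.
    Identical histograms score 1; disjoint histograms score 0."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

The coarseness noted in the abstract is visible here: any two images whose colors fall into the same bins in similar proportions score near 1 regardless of spatial arrangement.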
Document image understanding techniques have been widely used in many application domains. Various kinds of documents have been researched and different methods developed for information retrieval purposes. In this paper we present a practical method to extract information items from Chinese business cards. Before information is retrieved from a business card, its image is segmented into small text regions and each text region is recognized. Because the typesetting of business cards is variable, and both English and Chinese characters are used, there are errors in the segmentation and recognition results. We focus on building a robust model that can tolerate errors and extract the syntactic pattern of each text line in the business card, using both layout information and logical information. With this model, many errors can be identified and adjusted. Finally, the correct property is assigned to each text region in the business card, and recognition errors are corrected.
Many document images contain both text and non-text (images, line drawings, etc.) regions. An automatic segmentation of such an image into text and non-text regions is extremely useful in a variety of applications. Identification of text regions helps in text recognition applications, while the classification of an image into text and non-text regions helps in processing the individual regions differently in applications like page reproduction and printing. One of the main approaches to text detection is based on modeling the text as a texture. We present a method based on a combination of neural networks (texture-based) and connected component analysis to detect text in color documents with busy foreground and background. The proposed method achieves an accuracy of 96% (by area) on a test set of 40 documents.
This paper introduces a newly designed general-purpose Chinese document data capture system: Tsinghua OCR Network Edition (TONE). The system aims to cut the high cost of digitizing large volumes of Chinese paper documents. Our first step was to divide the whole data-entry process into a few single-purpose procedures. Based on these procedures, a production-line-like system configuration was developed. By design, the management cost was reduced directly by substituting automated task scheduling for traditional manual assignment, and indirectly by adopting a well-designed quality control mechanism. Classification distances, character image positions, and context grammars are combined to reject questionable characters. Experiments showed that when 19.91% of the characters are rejected, the residual error rate can be as low as 0.0097% (below one error per ten thousand characters), which finally makes the error-rejection module practical. Given the cost distribution in data-entry companies (in particular, manual correction accounts for 70% of the total), the estimated total cost reduction could be over 50%.
In this paper we propose a novel method to bridge the 'semantic gap' between a user's information need and image content. The semantic gap describes the major deficiency of content-based image retrieval (CBIR) systems, which use visual features extracted from images to describe them. We address this deficiency by extracting the semantics of an image from the environmental texts around it. Since an image generally co-exists with accompanying text in various formats, we can rely on such environmental texts to discover the image's semantics. A text mining approach based on self-organizing maps is used to extract the semantics of an image from its environmental texts. We performed experiments on a small set of images and obtained promising results.
To recognize a handwritten check mark placed on a pre-printed character with OCR, we need to separate the pre-print from the superimposed check mark. In the previous method, an unmarked form image and a marked one are matched and the overlapping part is removed; the remaining pattern is regarded as the superimposed mark pattern. Consequently, when the mark pattern overlaps the pre-print, the overlapping part is removed together with the pre-print, which sometimes causes an error in deciding whether the check mark exists. In this paper, we propose a new method to separate a pattern superimposed on pre-prints, using directional decomposition of the image, for precise recognition of the check mark.
Information retrieval and knowledge acquisition form the basis of the modern information age. The internet provides the possibility of nearly worldwide, unlimited information search. However, for a user searching the internet, the huge mass of electronic information on offer requires a lot of time and effort to find the desired information. Because of their lack of context awareness, traditional internet search engines cannot satisfy the growing need for selective, high-quality filtering and extraction of topic- and user-oriented information. The aim of the project INFOX-I at the University of Applied Sciences Trier is to develop concepts to support users searching for information on the WWW. There is therefore an urgent need for methods that make it possible to automatically select relevant information on a given domain. Methods from the fields of document analysis and knowledge-based systems are used. In this paper we outline the concepts and the current state of the project.
This article presents a new segmentation method for complex postal envelopes based on a hybrid approach combining concepts from mathematical morphology and the co-occurrence matrix. Morphological segmentation techniques help interpret the information generated by the co-occurrence matrix in order to extract the contents of Brazilian postal envelopes. Very little a priori knowledge of the envelope images is required. The advantages of this approach are described and illustrated with tests carried out on 250 different complex envelope images in which there is no fixed position for the handwritten address block, postmarks, or stamps.
Internet services designed for human use are being abused by programs. We present a defense against such attacks in the form of a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) that exploits the difference in ability between humans and machines in reading images of text. CAPTCHAs are a special case of 'human interactive proofs,' a broad class of security protocols that allow people to identify themselves over networks as members of given groups. We point out the vulnerabilities of reading-based CAPTCHAs to dictionary and computer-vision attacks. We also draw on the literature on the psychophysics of human reading, which suggests fresh defenses available to CAPTCHAs. Motivated by these considerations, we propose BaffleText, a CAPTCHA that uses non-English pronounceable words to defend against dictionary attacks, and Gestalt-motivated image-masking degradations to defend against image restoration attacks. Experiments on human subjects confirm the human legibility and user acceptance of BaffleText images. We have found an image-complexity measure that correlates well with user acceptance and assists in engineering the generation of challenges to fit the ability gap. Recent computer-vision attacks, run independently by Mori and Malik, suggest that BaffleText is stronger than two existing CAPTCHAs.
PDF is a document format for final presentation. It preserves the original document layout but often not the document's logical structure. Graphic illustrations such as figures and tables in PDF often consist of ungrouped graphic primitives such as lines, curves, and small text elements. In this paper, we present a bottom-up approach to recognizing graphic illustrations in PDF documents. The proximity of page elements, both in 2D space and in their index order within a layer, is used to infer the logical connections between elements. Graphics recognition and element grouping for illustrations are an important part of understanding the document's logical structure. This technique can be used in automatic figure extraction, document re-flow, and document transformation.