This work aims at automatically identifying the scribes of historical Slavonic manuscripts. The quality of the ancient documents is partially degraded by faded-out ink and varying backgrounds. The writer identification method used is based on local image features, described with the Scale Invariant Feature Transform (SIFT). A visual vocabulary is used for the description of handwriting characteristics, whereby the features are clustered using a Gaussian Mixture Model and encoded with the Fisher kernel. The writer identification approach was originally designed for grayscale images of modern handwriting. Unlike modern documents, however, the historical manuscripts are partially corrupted by background clutter and water stains. As a result, SIFT features are also found on the background. Since the method also shows good results on binarized images of modern handwriting, the approach was additionally applied to binarized images of the ancient writings. Experiments show that this preprocessing step leads to a significant performance increase: the identification rate on binarized images is 98.9%, compared to an identification rate of 87.6% on grayscale images.
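The Fisher-kernel encoding described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the diagonal-covariance GMM parameters (`weights`, `means`, `sigmas`) are assumed to have been fitted to SIFT descriptors beforehand, and the power/L2 normalization is common Fisher-vector practice rather than a detail taken from the paper.

```python
import numpy as np

def fisher_vector(descriptors, weights, means, sigmas):
    """Encode local descriptors (e.g. SIFT) as a Fisher vector using the
    gradients of a diagonal-covariance GMM w.r.t. its means and sigmas."""
    X = np.atleast_2d(descriptors)              # (N, D) descriptors
    N, D = X.shape
    K = len(weights)
    # Posterior (soft assignment) of each descriptor to each Gaussian.
    log_p = np.empty((N, K))
    for k in range(K):
        diff = (X - means[k]) / sigmas[k]
        log_p[:, k] = (np.log(weights[k])
                       - 0.5 * np.sum(diff ** 2, axis=1)
                       - np.sum(np.log(sigmas[k]))
                       - 0.5 * D * np.log(2 * np.pi))
    log_p -= log_p.max(axis=1, keepdims=True)   # stabilized softmax
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Normalized gradients w.r.t. means and standard deviations.
    fv = []
    for k in range(K):
        diff = (X - means[k]) / sigmas[k]
        g_mu = (gamma[:, k:k + 1] * diff).sum(axis=0) / (N * np.sqrt(weights[k]))
        g_sig = (gamma[:, k:k + 1] * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * weights[k]))
        fv.extend([g_mu, g_sig])
    fv = np.concatenate(fv)                     # length 2*K*D
    # Power and L2 normalization, as commonly used with Fisher vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```

The resulting fixed-length vector can be compared across pages with any standard classifier or distance.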
Little work has been done on the analysis of children's handwriting, which can be useful in developing automatic evaluation systems and in quantifying handwriting individuality. We consider the statistical analysis of children's handwriting in the early grades. Samples of the handwriting of children in Grades 2-4 who were taught the Zaner-Bloser style were considered. The commonly occurring word "and", written in cursive style as well as hand-printed, was extracted from extended writing. The samples were assigned feature values by human examiners using a truthing tool. The human examiners looked at how the children constructed letter formations in their writing, looking for similarities and differences from the instructions taught in the handwriting copybook. These similarities and differences were measured using a feature-space distance measure. Results indicate that the handwriting develops towards greater conformity with the class characteristics of the Zaner-Bloser copybook, which, with practice, is the expected result. Bayesian networks were learnt from the data to enable answering various probabilistic queries, such as determining which students may continue to produce letter formations as taught during lessons in school, which students will develop different letter formations or variations of them, and the number of different types of letter formations.
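A feature-space distance measure of the kind used above can be sketched as follows; the feature encoding and the optional per-feature weights are hypothetical, since the paper does not specify them here.

```python
import numpy as np

def conformity_distance(sample_features, copybook_features, weights=None):
    """(Weighted) Euclidean distance between a child's feature vector for a
    word (e.g. "and") and the prototype vector of the copybook style.
    Smaller distances indicate closer conformity with class characteristics."""
    s = np.asarray(sample_features, dtype=float)
    c = np.asarray(copybook_features, dtype=float)
    w = np.ones_like(s) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(w * (s - c) ** 2)))
```

Tracking this distance across grades would show whether a child's writing converges towards the copybook prototype.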
We propose a Bayesian framework for keyword spotting in handwritten documents. This work extends our previous work, in which we proposed the dynamic background model (DBM) for keyword spotting, which takes into account local character-level scores and global word-level scores to learn a logistic regression classifier that separates keywords from non-keywords. In this work, we add a Bayesian layer on top of the DBM, called the variational dynamic background model (VDBM). The logistic regression classifier uses the sigmoid function to separate keywords from non-keywords. Because the sigmoid function is neither convex nor concave, exact inference in the VDBM is intractable, so an expectation-maximization step is proposed for approximate inference. The advantages of the VDBM over the DBM are manifold: being Bayesian, it prevents over-fitting of the data, and it provides better modeling of the data and improved prediction on unseen data. The VDBM is evaluated on the IAM dataset, and the results show that it outperforms our prior work and other state-of-the-art line-based word spotting systems.
Boosting over decision stumps has proved its efficiency in Natural Language Processing, essentially with symbolic features, and its good properties (speed, few non-critical parameters, robustness to over-fitting) could be of great interest in the numeric world of pixel images. In this article we investigate the use of boosting over small decision trees in image classification for the discrimination of handwritten versus printed text. Experiments comparing it with the usual SVM-based classification reveal convincing results: very close performance, but faster predictions and far less black-box behavior. These promising results encourage the use of this classifier in more complex recognition tasks such as multiclass problems.
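Boosting over decision stumps can be sketched as follows. This is textbook AdaBoost in NumPy, shown only to illustrate the technique the abstract names; the paper's actual features and tree depth are not reproduced.

```python
import numpy as np

def train_stump(X, y, w):
    """Best single-feature threshold stump under sample weights w.
    Labels y are in {-1, +1}."""
    n, d = X.shape
    best = (np.inf, 0, 0.0, 1)                 # (error, feature, threshold, polarity)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=10):
    """AdaBoost over decision stumps; returns a list of weighted stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = train_stump(X, y, w)
        err = max(err, 1e-10)                  # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        pred = pol * np.where(X[:, j] <= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    score = np.zeros(len(X))
    for alpha, j, thr, pol in ensemble:
        score += alpha * pol * np.where(X[:, j] <= thr, 1, -1)
    return np.sign(score)
```

Prediction cost is a handful of comparisons per stump, which is why boosting predicts faster than kernel SVMs.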
Geometric invariants are combined with edit distance to compare the ruling configuration of noisy filled-out forms. It is
shown that gap-ratios used as features capture most of the ruling information of even low-resolution and poorly scanned
form images, and that the edit distance is tolerant of missed and spurious rulings. No preprocessing is required and the
potentially time-consuming string operations are performed on a sparse representation of the detected rulings. Based on
edit distance, 158 Arabic forms are classified into 15 groups with 89% accuracy. Since the method was developed for an
application that precludes public dissemination of the data, it is illustrated on public-domain death certificates.
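The edit-distance comparison of ruling configurations can be sketched as follows; the tolerance value and the representation of rulings as a sequence of gap ratios are illustrative assumptions.

```python
def ruling_edit_distance(a, b, tol=0.1):
    """Levenshtein distance between two sequences of ruling gap ratios.
    Two gap ratios match when they differ by less than `tol`, and missed or
    spurious rulings are absorbed as deletions/insertions."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if abs(a[i - 1] - b[j - 1]) < tol else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion (missed ruling)
                          d[i][j - 1] + 1,         # insertion (spurious ruling)
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]
```

Forms can then be grouped by nearest-neighbor classification on this distance.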
In this paper, a document form classification and retrieval method using a Bag of Words model and newly introduced local shape features of form lines is proposed. In a preprocessing step the document is binarized and the form lines (solid and dotted) are detected. The shape features are based on the line information and describe local line structures, e.g. line endings, crossings, and boxes. The dominant line structures build a vocabulary for each form class. Using this vocabulary, an occurrence histogram of the structures in a form document can be calculated for classification and retrieval. The proposed method has been tested on a set of 489 documents and 9 different form classes.
In this paper, we report on our experiments for detection and correction of OCR errors with web data. More specifically, we utilize Google search to access the big data resources available to identify possible candidates for correction. We then use a combination of the Longest Common Subsequences (LCS) and Bayesian estimates to automatically pick the proper candidate. Our experimental results on a small set of historical newspaper data show a recall and precision of 51% and 100%, respectively. The work in this paper further provides a detailed classification and analysis of all errors. In particular, we point out the shortcomings of our approach in its ability to suggest proper candidates to correct the remaining errors.
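The LCS component of the candidate-ranking step can be sketched as follows. The ranking score (LCS length normalized by the longer string) is an illustrative choice; in the paper it is combined with Bayesian estimates, which are omitted here.

```python
def lcs_length(a, b):
    """Length of the Longest Common Subsequence between an OCR token `a`
    and a candidate correction `b` (classic dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rank_candidates(ocr_token, candidates):
    """Rank correction candidates (e.g. gathered from web search results)
    by normalized LCS similarity with the OCR token."""
    score = lambda c: lcs_length(ocr_token, c) / max(len(ocr_token), len(c))
    return sorted(candidates, key=score, reverse=True)
```

A Bayesian prior over candidate frequencies would simply be multiplied into the score before sorting.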
As the digitization of historical documents, such as newspapers, becomes more common, the archive patron's need for accurate digital text from those documents increases. Building on our earlier work, the contributions of this paper are: (1) demonstrating the applicability of novel methods for correcting optical character recognition (OCR) on disparate data sets, including a new synthetic training set; (2) enhancing the correction algorithm with novel features; and (3) assessing the data requirements of the correction learning method. First, we correct errors using conditional random fields (CRF) trained on synthetic training data sets in order to demonstrate the applicability of the methodology to unrelated test sets. Second, we show the strength of lexical features from the training sets on two unrelated test sets, yielding a relative reduction in word error rate (WER) on the test sets of 6.52%. New features capture the recurrence of hypothesis tokens and yield an additional relative reduction in WER of 2.30%. Further, we show that only 2.0% of the full training corpus of over 500,000 feature cases is needed to achieve correction results comparable to those using the entire training corpus, effectively reducing both the complexity of the training process and the learned correction model.
Text in video is important for indexing and retrieving video documents efficiently and accurately. In this paper, we present a new method of text detection using a combined dictionary consisting of wavelets and a recently introduced transform called shearlets. Wavelets provide optimally sparse expansions for point-like structures, while shearlets provide optimally sparse expansions for curve-like structures. By combining these two features, we compute a high-frequency sub-band to brighten the text part. K-means clustering is then used to obtain text pixels from the Standard Deviation (SD) of the combined wavelet and shearlet coefficients as well as from the union of the wavelet and shearlet features. Text parts are obtained by grouping neighboring regions based on geometric properties of the classified output frame of the unsupervised K-means classification. The proposed method, tested on a standard as well as a newly collected database, is shown to be superior to some of the existing methods.
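The two-class K-means step can be sketched as follows: a scalar clustering that separates high-response (text) values from low-response (background) values. The initialization at the extremes is an illustrative choice, not a detail from the paper.

```python
import numpy as np

def kmeans_1d_two_class(values, iters=50):
    """Two-class K-means on scalar values (e.g. the per-block standard
    deviation of combined wavelet/shearlet coefficients); returns a
    boolean mask that is True for the high-response (text) cluster."""
    v = np.asarray(values, dtype=float)
    c = np.array([v.min(), v.max()])          # initialize centers at the extremes
    for _ in range(iters):
        labels = np.abs(v[:, None] - c[None, :]).argmin(axis=1)
        new_c = np.array([v[labels == k].mean() if np.any(labels == k) else c[k]
                          for k in range(2)])
        if np.allclose(new_c, c):             # converged
            break
        c = new_c
    return labels == np.argmax(c)
```

The resulting mask would then be grouped into connected regions and filtered by geometric properties.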
In this paper, we present a novel text line segmentation framework following the divide-and-conquer paradigm:
we iteratively identify and re-process regions of ambiguous line segmentation from an input document image
until there is no ambiguity. To detect ambiguous line segmentation, we introduce two complementary
line descriptors, referred to as the underline and highlight line descriptors, and identify ambiguities when their
patterns mismatch. As a result, we can easily identify already good line segmentations, and largely simplify the
original line segmentation problem by only reprocessing ambiguous regions. We evaluate the performance of the
proposed line segmentation framework using the ICDAR 2009 handwritten document dataset, and it is close to
top-performing systems submitted to the competition. Moreover, the proposed method is also robust against
skewness, noise, variable line heights and touching characters. The proposed idea can also be applied to other
text analysis tasks such as word segmentation and page layout analysis.
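As a generic baseline for finding candidate line regions (not the paper's underline/highlight descriptors, which are specific to this work), a projection-profile segmentation might look like:

```python
import numpy as np

def profile_line_bands(binary_img):
    """Split a binarized page (nonzero = ink) into horizontal text bands
    using the row projection profile: ink-free rows separate bands.
    Returns a list of (first_row, last_row) tuples."""
    profile = np.asarray(binary_img).sum(axis=1)   # ink count per row
    bands, start = [], None
    for row, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = row                            # band begins
        elif ink == 0 and start is not None:
            bands.append((start, row - 1))         # band ends
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands
```

Such a baseline fails exactly where the paper's framework re-processes: skewed, touching, or overlapping lines, where the two descriptors disagree.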
In this paper, we present our new method for the segmentation of handwritten text pages into lines, which was submitted to the ICDAR'2013 handwritten segmentation competition. This method is based on two levels of perception of the image: a rough perception based on a blurred image, and a precise perception based on the presence of connected components. The combination of these two levels of perception makes it possible to deal with the difficulties of handwritten text segmentation: curvature, irregular slope and overlapping strokes. Thus, the analysis of the blurred image is efficient on images with a high density of text, whereas the use of connected components makes it possible to connect the text lines on pages with low text density. The combination of these two kinds of data is implemented with a grammatical description, which externalizes the knowledge linked to the page model. The page model contains a strategy of analysis that can be associated with an applicative goal. Indeed, the text line segmentation is linked to the kind of data being analysed: homogeneous text pages, separated text blocks or unconstrained text. This method obtained a recognition rate of more than 98% in the ICDAR'2013 competition.
A system is presented for optical recognition of music scores. The system processes a document page in three
main phases. First it performs a hierarchical decomposition of the page, identifying systems, staves and measures.
The second phase, which forms the heart of the system, interprets each measure found in the previous phase as a
collection of non-overlapping symbols including both primitive symbols (clefs, rests, etc.) with fixed templates,
and composite symbols (chords, beamed groups, etc.) constructed through grammatical composition of primitives
(note heads, ledger lines, beams, etc.). This phase proceeds by first building separate top-down recognizers for
the symbols of interest. Then, it resolves the inevitable overlap between the recognized symbols by exploring the
possible assignment of overlapping regions, seeking globally optimal and grammatically consistent explanations.
The third phase interprets the recognized symbols in terms of pitch and rhythm, focusing on the main challenge
of rhythm. We present results that compare our system to the leading commercial OMR system using MIDI
ground truth for piano music.
The aim of this paper is to propose a supervised segmentation approach for document flows, applied to real-world heterogeneous documents. Our algorithm treats the flow of documents as couples of consecutive pages and studies the relationship that exists between them. First, sets of features are extracted from the pages, and we propose an approach that models the couple of pages as a single feature vector representation. This representation is provided to
a binary classifier which classifies the relationship as either segmentation or continuity. In case of segmentation, we
consider that we have a complete document and the analysis of the flow continues by starting a new document. In case
of continuity, the two pages are assigned to the same document and the analysis continues along the flow. If there is uncertainty as to whether the relationship between the two pages should be classified as continuity or segmentation, a rejection is decided and the pages analyzed up to this point are considered a "fragment". The first classification stage already provides good results, approaching 90% on certain documents, which is high at this level of the analysis.
The analysis of 2D structured documents often requires localizing data inside a document during the recognition process. In this paper we present LearnPos, a new generic tool, independent of any document recognition system, that models and evaluates positioning from a learning set of documents. LearnPos helps the user define the physical structure of the document, who can then concentrate on the definition of the logical structure of the documents. LearnPos is able to furnish spatial information for both absolute and relative spatial relations, in interaction with the user. Our method can handle spatial relations composed of distinct zones and is able to furnish an appropriate order and point of view to minimize errors. We show that the resulting models can be successfully used for structured document recognition, while reducing the manual exploration of the data set of documents.
In this paper, a model is proposed to learn the logical structure of fixed-layout document pages by combining a support vector machine (SVM) and conditional random fields (CRF). Features related to each logical label and their dependencies are extracted from various original Portable Document Format (PDF) attributes. Both local evidence and contextual dependencies are integrated in the proposed model so as to achieve better logical labeling performance. With the merits of the SVM as a local discriminative classifier and the CRF modeling the contextual correlations of adjacent fragments, the model is capable of resolving the ambiguities of semantic labels. The experimental results show that the CRF-based models with both tree and chain graph structures outperform the SVM model, with an increase in macro-averaged F1 of about 10%.
Comic page image understanding aims to analyse the layout of comic page images by detecting the storyboards and identifying the reading order automatically. It is the key technique for producing digital comic documents suitable for reading on mobile devices. In this paper, we propose a novel comic page image understanding method based on edge segment analysis. First, we propose an efficient edge point chaining method to extract Canny edge segments (i.e., contiguous chains of Canny edge points) from the input comic page image; second, we propose a top-down scheme to detect line segments within each obtained edge segment; third, we develop a novel method to detect the storyboards by selecting the border lines, and further identify the reading order of these storyboards. The proposed method is evaluated on a data set consisting of 2000 comic page images from ten printed comic series. The experimental results demonstrate that the proposed method achieves satisfactory results on different comics and outperforms the existing methods.
Despite the explosion of text on the Internet, hard copy documents that have been scanned as images still play a significant
role for some tasks. The best method to perform ranked retrieval on a large corpus of document images, however, remains
an open research question. The most common approach has been to perform text retrieval using terms generated by optical
character recognition. This paper, by contrast, examines whether a scalable segmentation-free image retrieval algorithm,
which matches sub-images containing text or graphical objects, can provide additional benefit in satisfying a user’s
information needs on a large, real world dataset. Results on 7 million scanned pages from the CDIP v1.0 test collection
show that content-based image retrieval finds a substantial number of documents that text retrieval misses, and that, when used as a basis for relevance feedback, it can yield improvements in retrieval effectiveness.
Contours, object blobs, and specific feature points are utilized to represent object shapes and to extract shape descriptors that can then be used for object detection or image classification. In this research we develop a shape descriptor for biomedical image type (or modality) classification. We adapt a feature extraction method used in optical character recognition (OCR) for character shape representation, and apply various image preprocessing methods to successfully adapt the method to our application. The proposed shape descriptor is applied to radiology images (e.g., MRI, CT, ultrasound, X-ray, etc.) to assess its usefulness for modality classification. In our experiment we compare our method with other visual descriptors, such as CEDD, CLD, Tamura, and PHOG, that extract color, texture, or shape information from images. The proposed method achieved the highest classification accuracy, 74.1%, among all individual descriptors in the test, and when combined with the CSD (color structure descriptor) showed better performance (78.9%) than the shape descriptor alone.
In this paper a semi-automated document image clustering and retrieval approach is presented to create links between different documents based on their content. Ideally, the initial bundling of shuffled document images can be reproduced to explore large document databases. Structural and textural features, which describe the visual similarity, are extracted and used by experts (e.g. registrars) to interactively cluster the documents with a manually defined feature subset (e.g. checked paper, handwritten). The methods presented allow for the analysis of heterogeneous documents that contain printed and handwritten text, and for a hierarchical clustering with different feature subsets in different layers.
The structure of document images plays a significant role in document analysis; thus, considerable efforts have been made towards extracting and understanding document structure, usually in the form of layout analysis approaches. In this paper, we first employ the Distance Transform based MSER (DTMSER) to efficiently extract stable document structural elements in the form of a dendrogram of key-regions. A fast structural matching method is then proposed to query the structure of a document (dendrogram) based on a spatial database, which facilitates the formulation of advanced spatial queries. The experiments demonstrate a significant improvement in a document retrieval scenario when compared to the use of typical Bag of Words (BoW) and pyramidal BoW methods.
Document image analysis is a data-driven discipline. For a number of years, research was focused on small,
homogeneous datasets such as the University of Washington corpus of scanned journal pages. More recently, library
digitization efforts have raised many interesting problems with respect to historical documents and their recognition. In
this paper, we present the Lehigh Steel Collection (LSC), a new open dataset we are currently assembling which will be,
in many ways, unique to the field. LSC is an extremely large, heterogeneous set of documents dating from the 1960's
through the 1990's relating to the wide-ranging research activities of Bethlehem Steel, a now-bankrupt company that was
once the second-largest steel producer and the largest shipbuilder in the United States. As a result of the bankruptcy
process and the disposition of the company's assets, an enormous quantity of documents (we estimate hundreds of
thousands of pages) were left abandoned in buildings recently acquired by Lehigh University. Rather than see this
history destroyed, we stepped in to preserve a portion of the collection via digitization. Here we provide an overview of
LSC, including our efforts to collect and scan the documents, a preliminary characterization of what the collection
contains, and our plans to make this data available to the research community for non-commercial purposes.
Separation of keywords from non-keywords is the main problem in keyword spotting systems which has traditionally been approached by simplistic methods, such as thresholding of recognition scores. In this paper, we analyze this problem from a machine learning perspective, and we study several standard machine learning algorithms specifically in the context of non-keyword rejection. We propose a two-stage approach to keyword spotting and provide a theoretical analysis of the performance of the system which gives insights on how to design the classifier in order to maximize the overall performance in terms of F-measure.
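The F-measure objective discussed above, evaluated for the simple thresholding baseline, can be sketched as follows (the threshold-based acceptance rule is the baseline the paper argues against, shown here only to make the metric concrete):

```python
def f_measure_at_threshold(scores, is_keyword, threshold):
    """Precision, recall and F-measure of a spotter that accepts a detection
    when its score exceeds `threshold`; `is_keyword` is the ground truth."""
    tp = sum(1 for s, k in zip(scores, is_keyword) if s > threshold and k)
    fp = sum(1 for s, k in zip(scores, is_keyword) if s > threshold and not k)
    fn = sum(1 for s, k in zip(scores, is_keyword) if s <= threshold and k)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

A learned rejection classifier replaces the fixed threshold with a decision function tuned to maximize this F-measure.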
Accuracy of content-based image retrieval is affected by image resolution among other factors. Higher resolution images
enable extraction of image features that more accurately represent the image content. In order to improve the relevance
of search results for our biomedical image search engine, Open-I, we have developed techniques to extract and label
high-resolution versions of figures from biomedical articles supplied in the PDF format. Open-I uses the open-access
subset of biomedical articles from the PubMed Central repository hosted by the National Library of Medicine. Articles
are available in XML and in publisher supplied PDF formats. As these PDF documents contain little or no meta-data to
identify the embedded images, the task includes labeling images according to their figure number in the article after they
have been successfully extracted. For this purpose we use the labeled small size images provided with the XML web
version of the article. This paper describes the image extraction process and two alternative approaches to image labeling: one measures the similarity between two images based on their intensity projections onto the coordinate axes, and the other uses the normalized cross-correlation between the image intensities. Using image
identification based on image intensity projection, we were able to achieve a precision of 92.84% and a recall of 82.18%
in labeling of the extracted images.
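The projection-based matching can be sketched as follows; the concatenation of row and column projections into one signature, and the assumption that candidates are resized to the query's shape, are illustrative simplifications rather than the paper's exact procedure.

```python
import numpy as np

def projection_signature(img):
    """Concatenated row and column intensity projections, zero-mean."""
    a = np.asarray(img, dtype=float)
    sig = np.concatenate([a.sum(axis=0), a.sum(axis=1)])
    return sig - sig.mean()

def ncc(x, y):
    """Normalized cross-correlation between two equal-length signatures."""
    x = x - x.mean()
    y = y - y.mean()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom else 0.0

def best_match(query_img, candidate_imgs):
    """Index of the candidate (e.g. a labeled XML thumbnail) whose
    projection signature correlates best with the query image's."""
    q = projection_signature(query_img)
    scores = [ncc(q, projection_signature(c)) for c in candidate_imgs]
    return int(np.argmax(scores))
```

Each extracted PDF figure would be matched against the labeled web-version thumbnails to inherit its figure number.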
As the number of digital documents for educational purposes increases, we note that there is no retrieval application for mathematical plane geometry images. In this paper, we propose a method for retrieving plane geometry figures (PGFs), which often appear in geometry books and digital documents. First, detection algorithms are applied to detect common basic geometry shapes in a PGF image. Based on all basic shapes, we analyze the structural relationships between pairs of basic shapes and combine some of them into compound shapes to build the PGF descriptor. Afterwards, we apply a matching function to retrieve candidate PGF images with ranking. The main contribution of the paper is a structure analysis method that better describes the spatial relationships in images composed of many overlapping shapes. Experimental results demonstrate that our analysis method and shape descriptor obtain good retrieval results with relatively high effectiveness and efficiency.
As smartphones and touch screens become increasingly popular, on-line signature verification can serve as a means of personal identification for mobile computing. In this paper, a novel Laplacian Spectral Analysis (LSA)
based on-line signature verification method is presented and an integration framework of LSA and Dynamic Time
Warping (DTW) based methods for practical application is proposed. In LSA based method, a Laplacian matrix is
constructed by regarding the on-line signature as a graph. The signature’s writing speed information is utilized in the
Laplacian matrix of the graph. The eigenvalue spectrum of the Laplacian matrix is analyzed and used for signature
verification. The framework to integrate LSA and DTW methods is further proposed. DTW is integrated at two stages.
First, it is used to provide stroke matching results for the LSA method to construct the corresponding graph better.
Second, the on-line signature verification results of DTW are fused with those of the LSA method. Experimental results on a public signature database and on practical signature data from mobile phones prove the effectiveness of the proposed method.
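The Laplacian spectrum computation can be sketched as follows. This is a minimal illustration: consecutive pen samples are linked in a chain graph, with edge weights set to the pen displacement between samples as a crude proxy for writing speed; the paper's exact graph construction and weighting are not reproduced here.

```python
import numpy as np

def signature_spectrum(points):
    """Eigenvalue spectrum of a graph Laplacian built from an on-line
    signature. Each sampled pen point is a node; consecutive points are
    connected, with edge weight equal to the pen displacement between them."""
    P = np.asarray(points, dtype=float)
    n = len(P)
    W = np.zeros((n, n))
    for i in range(n - 1):
        w = np.linalg.norm(P[i + 1] - P[i])   # displacement per sample ~ speed
        W[i, i + 1] = W[i + 1, i] = w
    L = np.diag(W.sum(axis=1)) - W            # combinatorial Laplacian L = D - W
    return np.sort(np.linalg.eigvalsh(L))     # the spectrum used for verification
```

The sorted eigenvalues form a fixed-order feature vector that can be compared between a questioned signature and enrolled references.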
Slant removal is a necessary preprocessing task in many document image processing systems. In this paper, we describe a technique for removing the slant from an entire page, avoiding the segmentation procedure. The presented technique can be combined with most existing slant removal algorithms. Experimental results are presented on two datasets.
Historically significant documents are often discovered with defects that make them difficult to read and analyze. This
fact is particularly troublesome if the defects prevent software from performing an automated analysis. Image
enhancement methods are used to remove or minimize document defects, improve software performance, and generally
make images more legible. We describe an automated image enhancement method that is input-page independent and requires no training data. The approach applies to color or greyscale images with handwritten script, typewritten text,
images, and mixtures thereof. We evaluated the image enhancement method against the test images provided by the
2011 Document Image Binarization Contest (DIBCO). Our method outperforms all 2011 DIBCO entrants in terms of
average F1 measure – doing so with a significantly lower variance than top contest entrants. The capability of the
proposed method is also illustrated using select images from a collection of historic documents stored at Yad Vashem
Holocaust Memorial in Israel.
Video segmentation and indexing are important steps in multi-media document understanding and information
retrieval. This paper presents a novel machine learning based approach for automatic structuring and indexing
of lecture videos. By indexing video content, we can support both topic indexing and semantic querying of
multimedia documents. In this paper, our proposed approach extracts features from video images and then uses
these features to construct a model to label video frames. Using this model, we are able to segment and index videos with an accuracy of 95% on our test collection.