This PDF file contains the front matter associated with SPIE Proceedings Volume 9402, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and Conference Committee listing.
In this paper, we propose a new dataset and a ground-truthing methodology for layout analysis of historical
documents with complex layouts. The dataset is based on a generic model for ground-truth presentation of
the complex layout structure of historical documents. To extract the document contents uniformly, our
model defines five types of regions of interest: page, text block, text line, decoration, and comment.
Unconstrained polygons are used to outline the regions. A performance metric is proposed in order to evaluate
various page segmentation methods based on this model. We have analysed four state-of-the-art ground-truthing
tools: TRUVIZ, GEDI, WebGT, and Aletheia. From this analysis, we conceptualized and developed Divadia, a
new tool that overcomes some of the drawbacks of these tools, targeting simplicity and efficiency in the
layout ground-truthing process on historical document images. With Divadia, we have created a new public
dataset. This dataset contains 120 pages from three historical document image collections of different styles and
is made freely available to the scientific community for historical document layout analysis research.
Designing reliable and fast segmentation algorithms for ancient documents has been a topic of major interest for many libraries and a prime issue of research in the document analysis community. Thus, we propose in this article a fast ancient document enhancement and segmentation algorithm based on Simple Linear Iterative Clustering (SLIC) superpixels and Gabor descriptors in a multi-scale approach. Firstly, in order to obtain enhanced backgrounds of noisy ancient documents, a novel foreground/background segmentation algorithm based on SLIC superpixels is introduced. Once the SLIC technique has been carried out, the background and foreground superpixels are classified. Then, an enhanced and noise-free background is achieved by processing the background superpixels. Subsequently, Gabor descriptors are extracted only from the selected foreground superpixels of the enhanced gray-level ancient book document images by adopting a multi-scale approach. Finally, for ancient document image segmentation, a foreground superpixel clustering task is performed by partitioning Gabor-based feature sets into compact and well-separated clusters in the feature space. The proposed algorithm does not assume any a priori information regarding document image content and structure and provides interesting results on a large corpus of ancient documents. Qualitative and numerical experiments are given to demonstrate the enhancement and segmentation quality.
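The background/foreground superpixel classification step can be illustrated with a minimal sketch. The SLIC label map is assumed to be given (in practice it would come from a SLIC implementation such as `skimage.segmentation.slic`); `classify_superpixels` and its fixed gray-level threshold are hypothetical simplifications of the classification criterion, which the abstract does not fully specify.

```python
def classify_superpixels(gray, labels, threshold=128):
    """Classify each superpixel as foreground (ink, dark) or background
    (paper, bright) by its mean gray level.

    gray   -- 2D list of gray values in [0, 255]
    labels -- 2D list of the same shape assigning a superpixel id per pixel
    """
    sums, counts = {}, {}
    for row_g, row_l in zip(gray, labels):
        for g, l in zip(row_g, row_l):
            sums[l] = sums.get(l, 0) + g
            counts[l] = counts.get(l, 0) + 1
    # A superpixel whose mean intensity is below the threshold is treated as ink.
    return {l: ('foreground' if sums[l] / counts[l] < threshold else 'background')
            for l in sums}
```

The per-superpixel decision is what makes the subsequent steps cheap: Gabor descriptors are then computed only on the (typically few) foreground superpixels.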
Digital methods, tools and algorithms are gaining in importance for the analysis of digitized manuscript collections
in the arts and humanities. One example is the BMBF-funded research project “eCodicology” which
aims to design, evaluate and optimize algorithms for the automatic identification of macro- and micro-structural
layout features of medieval manuscripts. The main goal of this research project is to provide better insights into
high-dimensional datasets of medieval manuscripts for humanities scholars. The heterogeneous nature and size
of the humanities data and the need to create a database of automatically extracted reproducible features for
better statistical and visual analysis are the main challenges in designing a workflow for the arts and humanities.
This paper presents a concept of a workflow for the automatic tagging of medieval manuscripts. As a starting
point, the workflow uses medieval manuscripts digitized within the scope of the project “Virtual Scriptorium
St. Matthias”. Firstly, these digitized manuscripts are ingested into a data repository. Secondly, specific algorithms
are adapted or designed for the identification of macro- and micro-structural layout elements like page size,
writing space, number of lines, etc. Lastly, a statistical analysis and scientific evaluation of the manuscript
groups are performed. The workflow is designed generically to process large amounts of data automatically with
any desired algorithm for feature extraction. As a result, a database of objectified and reproducible features is
created which helps to analyze and visualize hidden relationships of around 170,000 pages. The workflow shows
the potential of automatic image analysis by enabling the processing of a single page in less than a minute.
Furthermore, the accuracy tests of the workflow on a small set of manuscripts with respect to features like page
size and text areas show that automatic and manual analysis are comparable. The usage of a computer cluster
will allow the highly performant processing of large amounts of data. The software framework itself will be
integrated as a service into the DARIAH infrastructure to make it adaptable for a wider range of communities.
We introduce a new method for indexing and retrieving mathematical expressions, and a new protocol for
evaluating math formula retrieval systems. The Tangent search engine uses an inverted index over pairs of
symbols in math expressions. Each key in the index is a pair of symbols along with their relative distance
and vertical displacement within an expression. Matched expressions are ranked by the harmonic mean of the
percentage of symbol pairs matched in the query, and the percentage of symbol pairs matched in the candidate
expression. We have found that our method is fast enough for use in real time and finds partial matches well, such
as when subexpressions are re-arranged (e.g. expressions moved from the left to the right of an equals sign) or
when individual symbols (e.g. variables) differ from a query expression. In an experiment using expressions from
English Wikipedia, student and faculty participants (N=20) found expressions returned by Tangent significantly
more similar than those from a text-based retrieval system (Lucene) adapted for mathematical expressions.
Participants provided similarity ratings using a 5-point Likert scale, evaluating expressions from both algorithms
one-at-a-time in a randomized order to avoid bias from the position of hits in search result lists. For the Lucene-based
system, precision for the top 1 and 10 hits averaged 60% and 39% across queries, respectively, while for
Tangent the corresponding means were 99% and 60%. A demonstration and source code are publicly available.
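The ranking function described above can be sketched directly. `tangent_score` is an illustrative name; symbol pairs are assumed to be hashable tuples such as (symbol, symbol, relative distance, vertical displacement), following the index keys described in the abstract.

```python
def tangent_score(query_pairs, candidate_pairs):
    """Harmonic mean of (a) the fraction of query symbol pairs matched and
    (b) the fraction of candidate symbol pairs matched, as described in
    the abstract."""
    q, c = set(query_pairs), set(candidate_pairs)
    matched = len(q & c)
    if matched == 0:
        return 0.0
    recall = matched / len(q)        # coverage of the query
    precision = matched / len(c)     # coverage of the candidate
    return 2 * recall * precision / (recall + precision)
```

Because the score penalizes both unmatched query pairs and extra candidate pairs, a huge expression containing the query as a tiny fragment does not outrank a near-exact match.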
Handwritten tabular documents, such as census, birth, death and marriage records, contain a wealth of information
vital to genealogical and related research. Much work has been done on segmenting freeform handwriting;
however, segmentation of cursive handwriting in tabular documents is still an unsolved problem. Tabular documents
present unique segmentation challenges caused by handwriting that overlaps cell boundaries and other
words, both horizontally and vertically, as “ascenders” and “descenders” extend into adjacent cells. This paper
presents a method for segmenting handwriting in tabular documents using a min-cut/max-flow algorithm on a
graph formed from a distance map and connected components of handwriting. Specifically, we focus on line,
word and first letter segmentation. Additionally, we include the angles of strokes of the handwriting as a third
dimension to our graph to enable the resulting segments to share pixels of overlapping letters. Word segmentation
accuracy is 89.5% evaluating lines of the data set used in the ICDAR2013 Handwriting Segmentation Contest.
Accuracy is 92.6% for a specific application of segmenting first and last names from noisy census records. Accuracy
for segmenting lines of names from noisy census records is 80.7%. The 3D graph cutting shows
promise in segmenting overlapping letters, although highly convoluted or overlapping handwriting remains an open problem.
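The min-cut/max-flow step at the heart of this segmentation can be illustrated with a small self-contained sketch. The Edmonds-Karp routine below is a standard stand-in (the abstract does not state which solver is used), and the toy graph omits the distance-map and connected-component construction; by the max-flow/min-cut theorem, the returned flow value equals the capacity of the minimum cut separating the two seed nodes.

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp max-flow on a graph given as {u: {v: capacity}}.
    The value returned equals the capacity of the minimum s-t cut."""
    residual = {u: dict(vs) for u, vs in capacity.items()}
    for u in list(residual):                 # add zero-capacity reverse edges
        for v in residual[u]:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent = {s: None}                   # BFS for a shortest augmenting path
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:                  # no augmenting path left: done
            return flow
        path, v = [], t                      # walk back from t to recover the path
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(residual[u][v] for u, v in path)
        for u, v in path:                    # push the bottleneck flow
            residual[u][v] -= aug
            residual[v][u] += aug
        flow += aug
```

In the segmentation setting, nodes would correspond to pixels or components, edge capacities would come from the distance map, and the cut itself (rather than the flow value) gives the boundary between words.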
Cross-references, such as footnotes, endnotes, figure/table captions, and bibliographic references, are a common and useful type of page element that further explains the corresponding entities in a document. In this paper, we focus on cross-reference identification in PDF documents, and present a robust method as a case study of identifying footnotes and figure references. The proposed method first extracts footnotes and figure captions, and then matches them with their corresponding references within a document. A number of novel features within a PDF document, i.e., page layout, font information, and lexical and linguistic features of cross-references, are utilized for the task. Clustering is adopted to handle features that are stable within one document but vary across different kinds of documents, so that the identification process adapts to the document type. In addition, the method leverages results from the matching process to provide feedback to the identification process and further improve accuracy. Preliminary experiments on real document sets show that the proposed method is promising for identifying cross-references in PDF documents.
We present Intelligent Indexing: a general, scalable, collaborative approach to indexing and transcription of non-machine-readable documents that exploits visual consensus and group labeling while harnessing human recognition and domain expertise. In our system, indexers work directly on the page, and with minimal context switching can navigate the page, enter labels, and interact with the recognition engine. Interaction with the recognition engine occurs through preview windows that allow the indexer to quickly verify and correct recommendations. This interaction is far superior to conventional, tedious, inefficient post-correction and editing. Intelligent Indexing is a trainable system that improves over time and can provide benefit even without prior knowledge. A user study was performed to compare Intelligent Indexing to a basic, manual indexing system. Volunteers report that using Intelligent Indexing is less mentally fatiguing and more enjoyable than the manual indexing system. Their results also show that it significantly reduces (by 30.2%) the time required to index census records, while maintaining comparable accuracy. (A video demonstration is available at http://youtube.com/gqdVzEPnBEw.)
This paper reports on the first phase of an attempt to create a full retro-engineering pipeline that aims to construct
a complete set of coherent typographic parameters defining the typefaces used in a homogeneous printed text. It
should be stressed that this process cannot reasonably be expected to be fully automatic and that it is designed
to include human interaction. Although font design is governed by a set of quite robust and formal geometric
rulesets, it still relies heavily on subjective human interpretation. Furthermore, different parameters applied to
the generic rulesets may result in quite similar and visually difficult-to-distinguish typefaces, making
the retro-engineering an inverse problem that is ill conditioned once shape distortions (related to the printing
and/or scanning process) come into play.
This work is the first phase of a long iterative process, in which we will progressively study and assess the
techniques from the state of the art that are most suited to our problem and investigate new directions when
they prove inadequate. As a first step, this is more of a feasibility proof of concept that will allow us
to clearly pinpoint the items that will require more in-depth research over the next iterations.
Redundancy of word and sub-word occurrences in large documents can be effectively utilized in an OCR system to improve recognition results. Most OCR systems employ language modeling techniques as a post-processing step; however, these techniques do not use important pictorial information that exists in the text image. In the case of large-scale recognition of degraded documents, this information is even more valuable. In our previous work, we proposed a sub-word image clustering method for applications dealing with large printed documents. In our clustering method, the ideal case is when all equivalent sub-word images lie in one cluster. To overcome the issues of low print quality, the clustering method uses an image matching algorithm for measuring the distance between two sub-word images. The measured distance, together with a set of simple shape features, was used to cluster all sub-word images. In this paper, we analyze the effects of adding more shape features on processing time, purity of clustering, and the final recognition rate. Previously published experiments have shown the efficiency of our method on one book. Here we present extended experimental results and evaluate our method on another book with a totally different font face. We also show that the number of newly created clusters on a page can be used as a criterion for assessing the print quality and evaluating preprocessing phases.
Historical Chinese character recognition is very important to large-scale historical document digitization, but it is a very challenging problem due to the lack of labeled training samples. This paper proposes a novel non-linear transfer learning method, namely Gaussian Process Style Transfer Mapping (GP-STM). GP-STM extends traditional linear Style Transfer Mapping (STM) by using Gaussian processes and kernel methods. With GP-STM, existing printed Chinese character samples are used to help the recognition of historical Chinese characters. To demonstrate this framework, we compare feature extraction methods, train a modified quadratic discriminant function (MQDF) classifier on printed Chinese character samples, and apply the GP-STM model to Dunhuang historical documents. Various kernels and parameters are explored, and the impact of the number of training samples is evaluated. Experimental results show that accuracy increases by nearly 15 percentage points (from 42.8% to 57.5%) using GP-STM, an improvement of more than 8 percentage points (from 49.2% to 57.5%) over the STM approach.
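For context, the linear STM that GP-STM generalizes is usually formulated as an affine map learned between styles; the notation below follows the common style-transfer-mapping formulation from the literature and is an assumption, not quoted from this paper:

```latex
% Linear STM: map a source-style feature vector s_i toward its
% target-style counterpart t_i. The matrix A and offset b minimize a
% confidence-weighted, regularized least-squares objective:
\min_{A,\, b} \; \sum_i f_i \,\lVert A s_i + b - t_i \rVert^2
  \;+\; \beta \,\lVert A - I \rVert_F^2
  \;+\; \gamma \,\lVert b \rVert^2
```

GP-STM replaces the affine map $x \mapsto Ax + b$ with the posterior mean of a Gaussian-process regression, making the transfer non-linear in the kernel-induced feature space.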
Optical character recognition (OCR) is a challenging task because most existing preprocessing approaches are
sensitive to writing style, writing material, noise, and image resolution. Thus, a single recognition system cannot
address all factors of real document images. In this paper, we describe an approach to combine diverse recognition
systems by using iVector based features, which is a newly developed method in the field of speaker verification.
Prior to system combination, document images are preprocessed and text line images are extracted with different
approaches for each system, where iVector is transformed from a high-dimensional supervector of each text line
and is used to predict the accuracy of OCR. We merge hypotheses from multiple recognition systems according
to the overlap ratio and the predicted OCR score of text line images. We present evaluation results on an Arabic
document database where the proposed method is compared against the single best OCR system using word
error rate (WER) metric.
The BLSTM-CTC is a novel recurrent neural network architecture that has outperformed previous state-of-the-art
algorithms in tasks such as speech recognition and handwriting recognition. It has the ability to
process long term dependencies in temporal signals in order to label unsegmented data. This paper describes
different ways of combining features using a BLSTM-CTC architecture. We explore not only low-level
combination (feature space combination), but also high-level combination (decoding combination)
and mid-level combination (internal system representation combination). The results are compared on the RIMES word
database. Our results show that the low level combination works best, thanks to the powerful data modeling
of the LSTM neurons.
In this article, we propose a hybrid model for spotting words and regular expressions (REGEX) in handwritten
documents. The model is made of the state-of-the-art BLSTM (Bidirectional Long Short-Term Memory) neural
network for recognizing and segmenting characters, coupled with an HMM to build line models able to spot the
desired sequences. Experiments on the Rimes database show very promising results.
In this paper, we present an Arabic handwriting recognition method based on recurrent neural networks. We use the Long Short-Term Memory (LSTM) architecture, which has proven successful in different printed and handwritten OCR tasks. Applications of LSTM to handwriting recognition employ the two-dimensional architecture to deal with variations along both the vertical and horizontal axes. However, we show that using a simple pre-processing step that normalizes the position and baseline of letters, we can make use of 1D LSTM, which is faster in learning and convergence, and yet achieves superior performance. In a series of experiments on the IFN/ENIT database for Arabic handwriting recognition, we demonstrate that our proposed pipeline can outperform 2D LSTM networks. Furthermore, we provide comparisons with 1D LSTM networks trained with manually crafted features to show that the automatically learned features in a globally trained 1D LSTM network with our normalization step can even outperform such systems.
We present a simple and accurate approach for aligning historical documents with their corresponding transcription.
First, a representative of each letter in the historical document is cropped. Then, the transcription
is transformed to synthetic word images by representing the letters in the transcription by the cropped letters.
These synthetic word images are aligned to groups of connected components in the original text, along each
line, using dynamic programming. For measuring image similarities we experimented with a variety of feature
extraction and matching methods. The presented alignment algorithm was tested on two historical datasets and
provided excellent results.
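The dynamic-programming alignment along each line can be sketched as a classical edit-distance-style recurrence. `align`, `cost`, and `gap` are illustrative names: `cost` stands in for the image dissimilarity between a synthetic word image and a group of connected components, and `gap` is a hypothetical penalty for leaving either side unmatched; the actual matching features are the subject of the paper's experiments.

```python
def align(words, groups, cost, gap=1.0):
    """Edit-distance-style DP aligning a sequence of synthetic word images
    to a sequence of connected-component groups along one text line.
    Returns the minimal total alignment cost."""
    n, m = len(words), len(groups)
    # D[i][j] = best cost aligning the first i words to the first j groups
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap                     # words left unmatched
    for j in range(1, m + 1):
        D[0][j] = j * gap                     # groups left unmatched
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + cost(words[i - 1], groups[j - 1]),  # match
                D[i - 1][j] + gap,            # skip a word
                D[i][j - 1] + gap,            # skip a group
            )
    return D[n][m]
```

Backtracking through `D` (not shown) recovers which word image was aligned to which component group, which is the mapping the transcription alignment needs.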
We propose an improved HMM formulation for offline handwriting recognition (HWR). The main contribution of this
work is using the modified quadratic discriminant function (MQDF) within the HMM framework. In an MQDF-HMM the
state observation likelihood is calculated by a weighted combination of MQDF likelihoods of the individual Gaussians of
a GMM (Gaussian Mixture Model). The quadratic discriminant function (QDF) of a multivariate Gaussian can be rewritten
to avoid the inverse of the covariance matrix by using its eigenvalues and eigenvectors. The MQDF is
derived from the QDF by substituting a few of the badly estimated smallest eigenvalues with an appropriate constant. This
approach controls the estimation errors of the non-dominant eigenvectors and eigenvalues of the covariance matrix for
which the training data is insufficient. MQDF has been shown to improve character recognition
performance. The usage of MQDF in an HMM improves the computation, storage, and modeling power of the HMM when
there is limited training data. We have obtained encouraging results on offline handwritten character (NIST database) and
word recognition in English using MQDF-HMMs.
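For reference, the standard form of the MQDF (due to Kimura et al.; the exact variant used in this paper may differ) keeps the $k$ largest eigenvalues of the class covariance matrix and replaces the rest with a constant $\delta$:

```latex
% x: feature vector; mu: class mean; lambda_i, phi_i: the i-th eigenvalue
% and eigenvector of the class covariance matrix; d: feature dimension;
% k: number of principal eigenvectors kept; delta: substituted constant.
g(\mathbf{x}) =
    \sum_{i=1}^{k} \frac{1}{\lambda_i}
      \left[ \boldsymbol{\varphi}_i^{\top}(\mathbf{x}-\boldsymbol{\mu}) \right]^2
  + \frac{1}{\delta}
      \left( \lVert \mathbf{x}-\boldsymbol{\mu} \rVert^2
        - \sum_{i=1}^{k}
          \left[ \boldsymbol{\varphi}_i^{\top}(\mathbf{x}-\boldsymbol{\mu}) \right]^2
      \right)
  + \sum_{i=1}^{k} \ln \lambda_i + (d-k)\ln \delta
```

Because only the top-$k$ eigenpairs are needed, the quadratic form is cheaper to evaluate than the full QDF and avoids the poorly estimated tail of the spectrum, which is what makes it attractive inside an HMM with limited training data.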
We describe a document image segmentation algorithm to classify a scanned document into different regions
such as text/line drawings, pictures, and smooth background. The proposed scheme is relatively independent
of variations in text font style, size, intensity polarity, and string orientation. It is intended for use in an
adaptive system for document image compression. The principal parts of the algorithm are the generation of
the foreground and background layers and the application of hierarchical singular value decomposition (SVD)
in order to smoothly fill the blank regions of both layers so that the high compression ratio can be achieved.
The performance of the algorithm, both in terms of its effectiveness and computational efficiency, was evaluated
using several test images and showed superior performance compared to other techniques.
No-reference image quality assessment (NR-IQA) aims at computing an image quality score that best correlates
with either human perceived image quality or an objective quality measure, without any prior knowledge of
reference images. Although learning-based NR-IQA methods have achieved state-of-the-art results so
far, those methods perform well only on the datasets on which they were trained. The datasets usually contain
homogeneous documents, whereas in reality, document images come from different sources. It is unrealistic to
collect training samples of images from every possible capturing device and every document type. Hence, we
argue that a metric-based IQA method is more suitable for heterogeneous documents. We propose a NR-IQA
method with the objective quality measure of OCR accuracy. The method combines distortion-specific quality
metrics. The final quality score is calculated taking into account the proportions of, and the dependencies among, the
different distortions. Experimental results show that the method achieves results competitive with learning-based
NR-IQA methods on standard datasets, and performs better on heterogeneous documents.
Revealing related content among heterogeneous web tables is part of our long term objective of formulating queries over
multiple sources of information. Two hundred HTML tables from institutional web sites are segmented and each table
cell is classified according to the fundamental indexing property of row and column headers. The categories that
correspond to the multi-dimensional data cube view of a table are extracted by factoring the (often multi-row/column)
headers. To reveal commonalities between tables from diverse sources, the Jaccard distances between pairs of category
headers (and also table titles) are computed. We show how about one third of our heterogeneous collection can be
clustered into a dozen groups that exhibit table-title and header similarities that can be exploited for queries.
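The Jaccard distance used to compare category headers (and table titles) is the standard set measure; `jaccard_distance` is an illustrative helper, with each header treated as a set of terms:

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of header terms:
    1 - |A ∩ B| / |A ∪ B|. Two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

Feeding the pairwise distances into any standard clustering routine then groups tables whose headers share vocabulary, which is how the commonalities across sources are revealed.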
In recent years, the retrieval of plane geometry figures (PGFs) has attracted increasing attention in the fields of mathematics
education and computer science. However, the high cost of matching complex PGF features leads to the low efficiency of
most retrieval systems. This paper proposes an indirect classification method based on multi-label learning, which
improves retrieval efficiency by reducing the scope of the comparison operation from the whole database to small candidate
groups. Label correlations among PGFs are taken into account in the multi-label classification task. The primitive feature
selection for multi-label learning and the feature description of visual geometric elements are conducted individually to
match similar PGFs. The experiment results show the competitive performance of the proposed method compared with
existing PGF retrieval methods in terms of both time consumption and retrieval quality.
In this paper a method to detect the electrical circuit elements from the scanned images of electrical drawings is proposed.
The method, based on histogram analysis and mathematical morphology, detects the circuit elements, for example, circuit
components and wires, and generates a connectivity matrix which may be used to find similar but spatially different-looking
circuits using graph isomorphism. The work may also be used for vectorization of the circuit drawings, utilising the information
on the segmented circuit elements and the corresponding connectivity matrix. The novelty of the method lies in its
simplicity, its adaptability to tolerably skewed images, and its capability to segment symbols irrespective of
their orientation. The proposed method is tested over a data-set containing more than one hundred scanned images of a
variety of electrical drawings. Some of the results are presented in this paper to show the efficacy and robustness of the proposed method.
Missing values make pattern analysis difficult, particularly with limited available data. In longitudinal research,
missing values accumulate, thereby aggravating the problem. Here we consider how to deal with temporal
data with missing values in handwriting analysis. In the task of studying development of individuality of
handwriting, we encountered the fact that feature values are missing for several individuals at several time
instances. Six algorithms, i.e., random imputation, mean imputation, most likely independent value imputation,
and three methods based on Bayesian network (static Bayesian network, parameter EM, and structural EM),
are compared on children's handwriting data. We evaluate the accuracy and robustness of the algorithms
under different ratios of missing data, and useful conclusions are given. Specifically, the static
Bayesian network is used for our data, which contain around 5% missing values, as it provides adequate accuracy and
low computational cost.
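Of the six algorithms compared, mean imputation is the simplest and can be sketched in a few lines. `mean_impute` is an illustrative name; missing entries are marked `None`, and each is replaced by the mean of the observed values in its feature column:

```python
def mean_impute(rows):
    """Mean imputation for a list of equal-length feature rows.
    Each None is replaced by the mean of the observed values in its column."""
    cols = list(zip(*rows))
    means = []
    for col in cols:
        observed = [v for v in col if v is not None]
        # If a column is entirely missing there is nothing to estimate from;
        # fall back to 0.0 as an arbitrary placeholder.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[v if v is not None else means[j] for j, v in enumerate(row)]
            for row in rows]
```

The Bayesian-network methods in the comparison instead exploit dependencies between features, which is why they can stay accurate at higher missing-data ratios where column means become unreliable.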