This PDF file contains the front matter associated with SPIE-IS&T Proceedings Volume 6815, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and the Conference Committee listing.
The fifteenth anniversary of the first SPIE symposium (titled Character Recognition Technologies)
on Document Recognition and Retrieval provides an opportunity to examine DRR's contributions to
the development of document technologies. Many of the tools taken for granted today, including
workable general purpose OCR, large-scale, semi-automatic forms processing, inter-format table
conversion, and text mining, followed research presented at this venue. This occasion also affords an
opportunity to offer tribute to the conference organizers and proceedings editors and to the coterie of
professionals who regularly participate in DRR.
In this paper we present a system for the off-line recognition of cursive Arabic handwritten words. This system is an enhanced version of our reference system presented in [El-Hajj et al., 05], which is based on Hidden Markov Models (HMMs) and uses a sliding-window approach. The enhanced version proposed here uses contextual character models. This approach is motivated by the fact that the set of Arabic characters includes many ascending and descending strokes that overlap with one or two neighboring characters. Additional character models are constructed according to the characters in their left or right neighborhood. Our experiments on images from the benchmark IFN/ENIT database of handwritten village/town names show that using contextual character models improves recognition. For a lexicon of 306 name classes, accuracy increases by 0.6% absolute, which corresponds to a 7.8% reduction in error rate.
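As a minimal sketch of how context-dependent character labels might be derived from a word transcription before training one HMM per label, consider the following; the labeling scheme and the `contextual_labels` helper are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch: derive context-dependent character labels from a word
# transcription, so that a separate HMM could be trained per label.
# The label scheme (neighbor-conditioned names) is hypothetical.

def contextual_labels(word, context="left"):
    """Return one label per character, conditioned on a neighboring character."""
    labels = []
    for i, ch in enumerate(word):
        if context == "left" and i > 0:
            labels.append(f"{word[i-1]}-{ch}")   # model of ch preceded by word[i-1]
        elif context == "right" and i < len(word) - 1:
            labels.append(f"{ch}+{word[i+1]}")   # model of ch followed by word[i+1]
        else:
            labels.append(ch)                    # context-free fallback model
    return labels

# Example: a Latin word stands in for an Arabic transcription.
print(contextual_labels("street", context="left"))
# ['s', 's-t', 't-r', 'r-e', 'e-e', 'e-t']
```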
Machine perception and recognition of handwritten text in any language is a difficult problem. Even for Latin script, most solutions are restricted to specific domains, such as the recognition of courtesy amounts on bank checks. Arabic script presents additional challenges for handwriting recognition systems due to its highly connected nature, the numerous forms of each letter, and other factors. In this paper we address the problem of offline Arabic handwriting recognition of pre-segmented words. Rather than focusing on a single classification approach and trying to perfect it, we propose to combine heterogeneous classification methodologies. We evaluate our system on the IFN/ENIT corpus of Tunisian village and town names and demonstrate that the combined approach yields results that are better than those of the individual classifiers.
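One common way to combine heterogeneous classifiers is a weighted sum of normalized per-class scores; the sketch below illustrates only that general idea, since the paper's actual combination rule is not reproduced here, and the lexicon entries, scores, and weights are made up:

```python
import numpy as np

def combine_scores(score_lists, weights=None):
    """Combine per-class scores from several classifiers by a weighted sum.

    score_lists: list of dicts mapping class label -> score (higher = better).
    Scores from each classifier are min-max normalized before combination,
    so classifiers with different score ranges can be mixed.
    """
    weights = weights or [1.0] * len(score_lists)
    combined = {}
    for w, scores in zip(weights, score_lists):
        vals = np.array(list(scores.values()), dtype=float)
        lo, hi = vals.min(), vals.max()
        for label, s in scores.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 0.0
            combined[label] = combined.get(label, 0.0) + w * norm
    return max(combined, key=combined.get)

# Hypothetical outputs of two word classifiers over a small lexicon.
hmm_scores = {"Tunis": -120.0, "Sousse": -150.0, "Sfax": -160.0}   # log-likelihoods
nn_scores  = {"Tunis": 0.40, "Sousse": 0.45, "Sfax": 0.15}         # softmax outputs
print(combine_scores([hmm_scores, nn_scores], weights=[0.5, 0.5]))  # 'Tunis'
```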
Writer adaptation or specialization is the adjustment of handwriting recognition algorithms to a specific writer's style of handwriting. Such adjustment yields significantly improved recognition rates over counterpart general recognition algorithms. We present the first unconstrained off-line handwriting adaptation algorithm for Arabic reported in the literature. We discuss an iterative bootstrapping model which adapts a writer-independent model to a writer-dependent model using a small number of words, achieving a large increase in recognition rate in the process. Furthermore, we describe a confidence weighting method which generates better results by weighting words based on their length. We also discuss script features unique to Arabic and how we incorporate them into our adaptation process. Even though Arabic has many more character classes than languages such as English, significant improvement was observed.
The test set, consisting of about 100 pages of handwritten text, had an initial average overall recognition rate of 67%. After the basic adaptation was finished, the overall recognition rate was 73.3%. As the improvement was most marked for the longer words, and the set of confidently recognized longer words contained many fewer false results, a second method was presented using them alone, resulting in a recognition rate of about 75%. Initially, these words had a 69.5% recognition rate, improving to about a 92% recognition rate after adaptation. A novel hybrid method is presented with a rate of about 77.2%.
We describe an approach to unsupervised high-accuracy recognition of the textual contents of an entire book using fully automatic mutual-entropy-based model adaptation. Given images of all the pages of a book together with approximate models of image formation (e.g. a character-image classifier) and linguistics (e.g. a word-occurrence probability model), we detect evidence for disagreements between the two models by analyzing the mutual entropy between two kinds of probability distributions: (1) the a posteriori probabilities of character classes (the recognition results from image classification alone), and (2) the a posteriori probabilities of word classes (the recognition results from image classification combined with linguistic
constraints). The most serious of these disagreements are identified as candidates for automatic corrections to one or the other of the models. We describe a formal information-theoretic framework for detecting model disagreement and for proposing corrections. We illustrate this approach on a small test case selected from real book-image data. This reveals that a sequence of automatic model corrections can drive improvements in both models, and can achieve a lower recognition error rate. The importance of considering the contents of the whole book is motivated by a series of studies, over the last decade, showing that isogeny can be exploited to achieve unsupervised improvements in recognition accuracy.
Digitally cleaning dirty, aged documents and binarizing them into black-and-white images can be a tedious process that is usually performed by experts. This article presents a method that is easy for the end user: untrained persons can now perform this task where an expert was previously needed. The method uses interactive evolutionary computing to program the image-processing operations that act on the document image.
OCR often performs poorly on degraded documents. One approach to improving performance is to determine a good filter
to improve the appearance of the document image before sending it to the OCR engine. Quality metrics have been
measured in document images to determine what type of filtering would most likely improve the OCR response for that
document image. In this paper those same quality metrics are measured for several word images degraded by known
parameters in a document degradation model. The correlation between the degradation model parameters and the quality
metrics is measured. High correlations do appear in many of the places where they were expected. They are also absent in some expected places, which offers a comparison of the quality-metric definitions proposed by different authors.
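A minimal sketch of this kind of correlation measurement is shown below: each row represents one degraded word image, the columns hold known degradation-model parameters and measured quality metrics, and the cross-correlation block shows which metric tracks which parameter. All variable names and values are synthetic illustrations, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
blur_sigma   = rng.uniform(0.5, 3.0, n)          # degradation parameter
speckle_rate = rng.uniform(0.0, 0.1, n)          # degradation parameter

# Hypothetical quality metrics that (noisily) respond to the degradations.
stroke_thickness_var  = 0.8 * blur_sigma + rng.normal(0, 0.2, n)
small_component_count = 50 * speckle_rate + rng.normal(0, 1.0, n)

data = np.column_stack([blur_sigma, speckle_rate,
                        stroke_thickness_var, small_component_count])
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr[:2, 2:], 2))   # parameters (rows) vs. metrics (columns)
```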
The fast evolution of scanning and computing technologies has led to the creation of large collections of scanned paper documents. Examples of such collections include historical collections, legal depositories, medical archives, and business archives. Moreover, in many situations, such as legal litigation and security investigations, scanned collections are being used to facilitate systematic exploration of the data. It is almost always the case that scanned documents suffer from some form of degradation. Large degradations make documents hard to read and substantially deteriorate the performance of automated document processing systems. Enhancement of degraded document images is normally performed assuming global degradation models. When the degradation is large, global degradation models do not perform well. In contrast, we propose to estimate local degradation models and use them in enhancing degraded document images. Using a semi-automated enhancement system we have labeled a subset of the Frieder diaries collection [1]. This labeled subset was then used to train an ensemble classifier. The component classifiers are based on lookup tables (LUT) in conjunction with the approximate nearest neighbor algorithm. The resulting algorithm is highly efficient. Experimental evaluation results are provided using the Frieder diaries collection [1].
A method is presented for automatically identifying and removing crossed-out text in off-line handwriting. It classifies connected components by simply comparing two scalar features with thresholds. The performance is quantified on manually labeled connected components from 250 pages of a forensic dataset: 47% of the connected components consisting of crossed-out text can be removed automatically while 99% of the normal text components are preserved. The influence of automatically removing crossed-out text on writer verification and identification is also quantified and found not to be significant.
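The decision rule described above reduces to thresholding two scalar features per connected component; the sketch below illustrates that structure only, with the particular features (ink density and branch-point count) and the threshold values chosen as assumptions for illustration:

```python
def is_crossed_out(ink_density, branch_points,
                   density_thresh=0.45, branch_thresh=8):
    """Return True if the component looks like a crossing-out scribble."""
    return ink_density > density_thresh and branch_points > branch_thresh

components = [
    {"id": 1, "ink_density": 0.30, "branch_points": 3},   # normal word
    {"id": 2, "ink_density": 0.62, "branch_points": 14},  # scribbled-over word
]
kept = [c["id"] for c in components
        if not is_crossed_out(c["ink_density"], c["branch_points"])]
print(kept)   # [1]
```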
In this paper, we present a hybrid approach to splitting a book document into individual chapters. We use multiple sources of information to obtain a reliable assessment of the chapter title pages. These sources are produced by four methods: blank space detection, font analysis, header and footer association, and table of contents (TOC) analysis. Finally, a combination component is used to score potential chapter title pages and select the best candidates. This approach takes full advantage of various kinds of information, such as page headers and footers, layout, and keywords. It works well even without TOC information, which is crucial for most previous, similar approaches. Experiments show that this approach is robust and reliable.
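One simple way to realize such a combination component is weighted voting over the pages each detector marks as chapter starts; the sketch below is only an illustration of that idea under assumed detector names, weights, and page numbers, not the paper's scoring function:

```python
def score_pages(votes, weights, threshold=0.6):
    """votes: {detector name -> set of page numbers it marks as chapter starts}."""
    scores = {}
    for detector, pages in votes.items():
        for p in pages:
            scores[p] = scores.get(p, 0.0) + weights[detector]
    return sorted(p for p, s in scores.items() if s >= threshold)

votes = {
    "blank_space":   {12, 40, 77},
    "font_analysis": {12, 40, 55},
    "header_footer": {12, 41, 77},
    "toc_analysis":  {12, 40},        # may be empty when no TOC is available
}
weights = {"blank_space": 0.2, "font_analysis": 0.3,
           "header_footer": 0.2, "toc_analysis": 0.3}
print(score_pages(votes, weights))   # [12, 40]
```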
Line segmentation is the first and the most critical pre-processing step for a document recognition/analysis task.
Complex handwritten documents with lines running into each other impose a great challenge for the line segmentation
problem due to the absence of online stroke information. This paper describes a method to disentangle
lines running into each other, by splitting and associating the correct character strokes to the appropriate lines.
The proposed method can be used along with the existing algorithm [1] that identifies such overlapping lines in
documents. A stroke tracing method is used to intelligently segment the overlapping components. The method
uses slope and curvature information of the stroke to disambiguate the course of the stroke at cross points. Once
the overlapping components are segmented into strokes, a statistical method is used to associate the strokes with
appropriate lines.
In recognizing characters written on forms, it often happens that characters overlap with pre-printed form lines. To recognize overlapped characters, the line is generally removed and the character strokes broken by line removal are restored. However, it is not easy to restore the broken character strokes accurately, especially when the directions of the line and the character stroke are almost the same. In this paper, a novel recognition method for line-touching characters without line removal is proposed in order to avoid the difficulty of the stroke restoration problem. A line-touching character is recognized as a whole by matching with reference character features which include a line feature. The reference features are synthesized dynamically from a character feature and a line feature, based on the touching condition of the input line-touching character string. We compared the performance of the proposed method with a conventional method in which the touching line is removed by mathematical morphology while leaving the overlapping character stroke. Experimental results show that the proposed method achieves a 96.26% character recognition rate, whereas the conventional method achieves 92.77%.
Word segmentation is the most critical pre-processing step for any handwritten document recognition and/or retrieval system. When the writing style is unconstrained (written in a natural manner), recognition of individual components may be unreliable, so they must be grouped together into word hypotheses before recognition algorithms can be used. This paper describes a machine learning approach based on gap metrics for separating a line of unconstrained handwritten text into words. Our approach uses a set of both local and global features, motivated by the ways in which human beings perform this kind of task. In addition, to overcome the shortcomings of individual distance computation methods, we propose a combined distance measure computed from three different methods. Classification is performed by a three-layer neural network. The algorithm is evaluated on an unconstrained handwriting database containing 50 pages (1,026 lines, 7,562 word images) of handwritten documents. The overall accuracy is 90.8%, an improvement over a previous method.
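As a minimal sketch of a combined gap distance, three gap measures per adjacent component pair can be normalized and averaged; which three measures the paper actually combines, and how, is not reproduced here, and the gap values below are illustrative:

```python
import numpy as np

def combined_gap(gaps_bbox, gaps_euclid, gaps_runlength):
    """Each argument: 1-D array of gap values for all adjacent component pairs
    on a text line. Returns one combined gap value per pair."""
    combined = []
    for gaps in (gaps_bbox, gaps_euclid, gaps_runlength):
        g = np.asarray(gaps, dtype=float)
        spread = g.max() - g.min()
        combined.append((g - g.min()) / spread if spread > 0 else np.zeros_like(g))
    return np.mean(combined, axis=0)

# Gaps (in pixels) between five adjacent component pairs on one line.
print(np.round(combined_gap([3, 4, 25, 5, 30],
                            [2, 3, 20, 4, 26],
                            [3, 5, 22, 6, 28]), 2))
# Large combined values suggest inter-word gaps; small ones, intra-word gaps.
```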
OCRopus is a new, open-source OCR system emphasizing modularity, easy extensibility, and reuse, aimed at both the research community and large-scale commercial document conversion. This paper describes the current status of the system and its general architecture, as well as the major algorithms currently being used for layout analysis and text line recognition.
Noise presents a serious challenge in optical character recognition, as well as in the downstream applications
that make use of its outputs as inputs. In this paper, we describe a paradigm for measuring the impact of
recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization,
and part-of-speech tagging. Employing a hierarchical methodology based on approximate string matching for
classifying errors, their cascading effects as they travel through the pipeline are isolated and analyzed. We
present experimental results based on injecting single errors into a large corpus of test documents to study their
varying impacts depending on the nature of the error and the character(s) involved. While most such errors are
found to be localized, in the worst case some can have an amplifying effect that extends well beyond the site of
the original error, thereby degrading the performance of the end-to-end system.
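A minimal sketch of this kind of error-injection experiment is shown below: a single character substitution is introduced into clean text, both versions are run through a sentence splitter and tokenizer, and the differences are counted. The splitter and tokenizer here are toy stand-ins, not the pipeline used in the paper:

```python
import re

def sentences(text):
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def tokens(text):
    return re.findall(r"\w+|[^\w\s]", text)

clean = "The meeting ended at 5 pm. Everyone left quickly."
# Substitute '.' with ',' -- a classic OCR confusion that merges two sentences.
pos = clean.index(". Everyone")
noisy = clean[:pos] + "," + clean[pos + 1:]

print(len(sentences(clean)), "->", len(sentences(noisy)))        # 2 -> 1
changed = sum(a != b for a, b in zip(tokens(clean), tokens(noisy)))
print(changed, "token(s) differ")   # the single error propagates downstream
```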
Writer identification is a process which aims to identify the writer of a given handwritten document. It is needed in applications such as forensic document analysis and document retrieval that involve the use of offline handwritten documents. With recent advances in technology, the invention of the digital pen and paper has extended the field of writer identification to cover online handwritten documents. In this communication, a methodology is proposed to solve the problem of text-independent writer identification using online handwritten documents. The proposed methodology strives to identify the writer of a given handwritten document regardless of its text content by comparing his or her handwriting with the samples stored in a reference database. The output of this process is a ranked list of the writers whose handwriting is stored in the reference database. The main idea is to measure distances between distributions of reference patterns defined at the character level. Very few, if any, attempts have been made at this level. Two handwritten document databases, each with 82 online documents contributed by 82 subjects, were used in the experiments. The reported Top-1 accuracy was 95%; only four writers were identified incorrectly, with the correct writers returned as the 2nd, 4th, 5th, and 12th choices.
Writer identification in offline handwritten documents is a difficult task with multiple applications such as authentication, identification, and clustering in document collections. For example, in the context of content-based document image retrieval, given a document with handwritten annotations it is possible to determine whether the comments were added by a specific individual and find other documents annotated by the same person. In contrast to online writer identification, in which temporal stroke information is available, such information is not readily available in offline writer identification. The basic approach and the main contribution of our work is the idea of using canonical stroke-frequency descriptors derived from handwritten text to identify writers. We show that a relatively small set of canonical strokes can be successfully employed for generating discriminative frequency descriptors. Moreover, we show that by using frequency descriptors alone it is possible to perform writer identification with a success rate comparable to the known state of the art in offline writer identification, at close to 90% accuracy. As frequency descriptors are independent of existing descriptors, the performance of offline writer identification may be improved by combining both standard and frequency descriptors. Quantitative experimental evaluation is provided using the IAM dataset [1].
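As a minimal sketch of identification with frequency descriptors, each document can be reduced to a normalized histogram over canonical stroke types and attributed to the writer with the closest histogram; the chi-square distance and the stroke counts below are assumptions for illustration, not the paper's descriptor or data:

```python
import numpy as np

def chi2(p, q, eps=1e-9):
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum((p - q) ** 2 / (p + q + eps)))

# Counts over six canonical stroke classes per reference writer.
reference = {
    "writer_A": [120, 30, 55, 10, 80, 5],
    "writer_B": [40, 90, 20, 70, 15, 65],
}
query = [60, 14, 28, 6, 41, 2]   # stroke counts from an unseen page

best = min(reference, key=lambda w: chi2(reference[w], query))
print(best)   # writer_A: similar relative stroke frequencies
```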
An efficient mail sorting system is based mainly on accurate optical recognition of the addresses on the envelopes. However, address block location (ABL) must be performed before the OCR recognition process. The location step is crucial, as it has a great impact on the overall performance of the system: a good location step leads to a better recognition rate. The limitation of current methods is mainly caused by the modular linear architectures used for ABL: their performance depends greatly on the performance of each independent module. In this paper we present a new approach to ABL based on a pyramidal data organization and on hierarchical graph coloring for the classification process. This approach has the advantage of guaranteeing good coherence between the different modules, and it reduces both the computation time and the rejection rate. The proposed method achieves a very satisfactory rate of 98% correct locations on a set of 750 envelope images.
In this paper, we revisit the problem of detecting the page numbers of a document. This work is motivated by the need for a generic method which applies to a large variety of documents, as well as the need for analyzing the document's page-numbering scheme rather than spotting one number per page. We propose a novel method, based on the notion of sequence, which goes beyond any previously described work, and we report on an extensive evaluation of its performance.
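To illustrate the notion of sequence, the sketch below keeps, among candidate numbers spotted on each physical page, the longest run that increases by one across consecutive pages; how candidates are extracted from the page layout is out of scope here, and the inputs are illustrative:

```python
def longest_numbering(candidates):
    """candidates[i] = set of numbers found on physical page i.
    Returns (start_page_index, list_of_numbers) of the longest +1 sequence."""
    best = (0, [])
    for start in range(len(candidates)):
        for seed in candidates[start]:
            run, value = [seed], seed
            for nxt in candidates[start + 1:]:
                if value + 1 in nxt:
                    value += 1
                    run.append(value)
                else:
                    break
            if len(run) > len(best[1]):
                best = (start, run)
    return best

pages = [{3, 1987}, {4}, {5, 12}, {6}, {2}]   # numbers spotted on 5 scanned pages
print(longest_numbering(pages))               # (0, [3, 4, 5, 6])
```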
We describe a methodology for retrieving document images from large, extremely diverse collections. First we perform content extraction, that is, the location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc., in documents represented as bilevel, greylevel, or color images. Recent experiments have shown that even modest per-pixel content classification accuracies can support usefully high recall and precision rates (of, e.g., 80-90%) for retrieval queries within document collections seeking pages that contain a fraction of a certain type of content. When the distribution of content and error rates are uniform across the entire collection, it is possible to derive IR measures from classification measures and vice versa. Our largest experiments to date, consisting of 80 training images totaling over 416 million pixels, are presented to illustrate these conclusions. This data set is more representative than those of previous experiments, containing a more balanced distribution of content types. The data set also contains images of text obtained from handheld digital cameras, and the success of existing methods (with no modification) in classifying these images is discussed. Initial experiments in discriminating line art from the four classes mentioned above are also described. We also discuss methodological issues that affect both ground-truthing and evaluation measures.
Transcript mapping or text alignment with handwritten documents is the automatic alignment of words in a text file with word images in a handwritten document. Such a mapping has several applications in fields ranging from machine learning, where large quantities of truth data are required for evaluating handwriting recognition algorithms, to data mining, where word image indexes are used in ranked retrieval of scanned documents in a digital library. The alignment also aids "writer identity" verification algorithms. Interfaces which display scanned handwritten documents may use this alignment to highlight manuscript tokens when a person examines the corresponding transcript word. We propose an adaptation of the True DTW dynamic programming algorithm for English handwritten documents. Our primary contribution is the integration of the dissimilarity scores from a word-model word recognizer and the Levenshtein distance between the recognized word and the lexicon word as a cost metric in the DTW algorithm, leading to a fast and accurate alignment. The results provided confirm the effectiveness of our approach.
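As a minimal sketch of the alignment step, the snippet below runs DTW over a cost matrix whose entries combine a recognizer dissimilarity with a Levenshtein term; the recognizer scores and the 0.5/0.5 mixing weights are illustrative assumptions, not the paper's values:

```python
import numpy as np

def levenshtein(a, b):
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1,
                          d[i-1, j-1] + (a[i-1] != b[j-1]))
    return int(d[-1, -1])

def dtw_align(cost):
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i-1, j-1] + min(acc[i-1, j], acc[i, j-1], acc[i-1, j-1])
    # Backtrack to recover the (word image, transcript word) pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i-1, j), (i, j-1), (i-1, j-1)], key=lambda p: acc[p])
    return path[::-1]

recognized = ["tne", "quick", "brvwn", "fox"]       # word recognizer outputs
transcript = ["the", "quick", "brown", "fox"]
recog_dissim = np.array([[0.2, 0.9, 0.8, 0.9],      # hypothetical per-pair scores
                         [0.9, 0.1, 0.9, 0.8],
                         [0.8, 0.9, 0.3, 0.9],
                         [0.9, 0.8, 0.9, 0.1]])
lev = np.array([[levenshtein(r, t) for t in transcript] for r in recognized])
cost = 0.5 * recog_dissim + 0.5 * lev / np.maximum(lev.max(), 1)
print(dtw_align(cost))   # [(0, 0), (1, 1), (2, 2), (3, 3)]
```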
Word-spotting techniques are usually based on detailed modeling of target words, followed by a search for the locations of such a target word in images of handwriting. In this study, the focus is on deciding on the presence of target words in lines of text, disregarding their horizontal position. Line strips are modeled with a Bag-of-Glyphs approach based on a self-organized map. This approach uses the presence of fragmented-connected-component shapes (glyphs) in a line strip to characterize the text passage, similar to the Bag-of-Words approach for 'ASCII'-encoded documents in regular information retrieval. Subsequently, the presence of a word or word category is trained with a support-vector machine in an iterative setup involving an active group of users. Results are promising for a large proportion of words and depend both on the number of labeled lines and on shape uniqueness. Particularly useful is the ability to train on abstract content classes such as proper names, municipalities, or word-bigram presence in the line-strip images.
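A minimal sketch of the Bag-of-Glyphs idea is given below: glyph features from each line strip are quantized against a codebook, the line is represented by the resulting histogram, and an SVM decides word presence. The paper uses a self-organized map for the codebook; k-means is used here as a simpler stand-in, and all feature values and labels are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def line_histogram(glyph_features, codebook):
    words = codebook.predict(glyph_features)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Synthetic glyph features (e.g. contour descriptors): 20 glyphs x 8 dims per line.
train_lines = [rng.normal(i % 2, 1.0, size=(20, 8)) for i in range(40)]
labels = [i % 2 for i in range(40)]        # 1 = target word present on the line

codebook = KMeans(n_clusters=16, n_init=10, random_state=0)
codebook.fit(np.vstack(train_lines))

X = np.array([line_histogram(l, codebook) for l in train_lines])
clf = SVC(kernel="rbf").fit(X, labels)

test_line = rng.normal(1.0, 1.0, size=(20, 8))
print(clf.predict([line_histogram(test_line, codebook)]))   # expected: [1]
```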
This paper describes an OCR-based technique for word spotting in Devanagari printed documents. The system accepts a Devanagari word as input and returns a sequence of word images that are ranked according to their similarity with the input query. The methodology involves line and word separation, pre-processing of document words, word recognition using OCR, and similarity matching. We demonstrate a Block Adjacency Graph (BAG) based document cleanup in the pre-processing phase. During word recognition, multiple recognition hypotheses are generated for each document word using a font-independent Devanagari OCR. The similarity matching phase uses a cost-based model to match the word input by a user and the OCR results. Experiments are conducted on document images from the publicly available ILT and Million Book Project dataset. The technique achieves an average precision of 80% for 10 queries and 67% for 20 queries for a set of 64 documents containing 5780 word images. The paper also presents a comparison of our method with template-based word spotting techniques.
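As a minimal sketch of the matching phase, each document word carries several OCR hypotheses, and word images are ranked by the best edit distance between the query and any hypothesis. The paper's cost model also weights individual character confusions; plain Levenshtein distance is used here as a simplification, and the hypothesis lists are illustrative:

```python
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j-1] + 1, prev[j-1] + (ca != cb)))
        prev = cur
    return prev[-1]

def rank_words(query, document_words):
    """document_words: list of (word_image_id, [OCR hypotheses])."""
    scored = [(min(edit_distance(query, h) for h in hyps), img_id)
              for img_id, hyps in document_words]
    return [img_id for _, img_id in sorted(scored)]

doc = [("img_017", ["bharat", "bharal"]),
       ("img_042", ["bhavan", "bhaven"]),
       ("img_103", ["sanskrit"])]
print(rank_words("bharat", doc))   # ['img_017', 'img_042', 'img_103']
```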
We describe a statistical machine learning method for extracting databank accession numbers (DANs) from online medical journal articles. Because the DANs are sparsely located in the articles, we take a hierarchical approach. The HTML journal articles are first segmented into zones according to text and geometric features. The zones are then classified as DAN zones or other zones by an SVM classifier. A set of heuristic rules is applied to the candidate DAN zones to extract DANs according to their edit distances to the DAN formats. An evaluation shows that the proposed method can achieve a very high recall rate (above 99%) and a significantly better precision rate compared to extraction through brute-force regular expression matching.
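A minimal sketch of the final extraction rule is given below: tokens from a candidate zone are compared against accession-number formats expressed as letter/digit templates, and tokens within a small edit distance of some template are accepted. The templates and the threshold are illustrative; the paper's rule set is richer:

```python
import re

TEMPLATES = ["LLDDDDDD", "LDDDDD", "LLLDDDDDDD"]   # e.g. GenBank-style shapes

def shape(token):
    return "".join("L" if c.isalpha() else "D" if c.isdigit() else "?" for c in token)

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def extract_dans(text, max_dist=1):
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t for t in tokens
            if min(edit_distance(shape(t), tpl) for tpl in TEMPLATES) <= max_dist]

zone = "Sequences were deposited under accession numbers AF123456 and U49845."
print(extract_dans(zone))   # ['AF123456', 'U49845']
```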
Essential information is often conveyed pictorially (images, illustrations, graphs, charts, etc.) in biomedical publications. A clinician's decision to access the full text when searching for evidence in support of a clinical decision is frequently based solely on a short bibliographic reference. We seek to automatically augment these references with images from the article that may assist in finding evidence.
In a previous study, the feasibility of automatically classifying images by their usefulness (utility) in finding evidence was explored using supervised machine learning; it achieved 84.3% accuracy using image captions for modality and 76.6% accuracy combining captions and image data for utility, on 743 images from two years of articles in a clinical journal. Our results indicated that automatic augmentation of bibliographic references with relevant images was feasible. Other research in this area has shown that user experience improves when images are shown in addition to the short bibliographic reference. However, the multi-panel images used in our study had to be manually pre-processed for image analysis, and all text on the figures was ignored.
In this article, we report on methods developed for automatic multi-panel image segmentation using not only image features but also clues from text analysis applied to figure captions. In initial experiments on 516 figure images we obtained 95.54% accuracy in correctly identifying and segmenting the sub-images. Errors were flagged as disagreements with the automatic parsing of the figure caption text, allowing for supervised segmentation. For localizing text and symbols, on a randomly selected test set of 100 single-panel images our methods achieved, on average, precision and recall of 78.42% and 89.38%, respectively, with an accuracy of 72.02%.
Interactive Paper and Symposium Demonstration Session-Tuesday
We propose a document categorization method based on a document model that can be defined externally for each task and that categorizes Web content or business documents into a target category according to their similarity to the model. The main feature of the proposed method is its two aspects of semantics extraction from an input document: the semantics of terms are extracted by semantic pattern analysis, and the implicit meanings of document substructures are specified by a bottom-up text clustering technique focusing on the similarity of text-line attributes. We have constructed a system based on the proposed method for trial purposes. The experimental results show that the system achieves more than 80% classification accuracy in categorizing Web content and business documents into 15 or 70 categories.
Building a system that allows searching a very large database of document images requires professionalization of hardware and software, e-science, and web access. In astrophysics there is ample experience dealing with large data sets due to an increasing number of measurement instruments. The digitization of historical documents of the Dutch cultural heritage poses a similar problem. This paper discusses the use of a system developed at the Kapteyn Institute of Astrophysics for the processing of large data sets, applied to the problem of creating a very large searchable archive of connected cursive handwritten texts. The system is adapted to the specific needs of processing document images. It shows that interdisciplinary collaboration can be beneficial in the context of machine learning, data processing, and the professionalization of image processing and retrieval systems.
The page body holds the central information of a page in most documents. This paper addresses the problem of automatically detecting the page body area in digital books or journals. A novel method based on font expansion and header and footer elimination is detailed. This method first extracts the body-text font (BFont) and the headers and footers from a document, and then draws two page-body bounding boxes for each page: one by analyzing the distribution of the BFont across pages, and the other by removing headers and footers from pages. Finally, the two bounding boxes are combined to obtain the resultant page-body bounding box. The test results demonstrate a very high recognition rate: up to 99.49% precision.
In this paper, we present a method to extract text lines from poorly structured documents. The text lines may have different orientations and considerably curved shapes, and a text line may contain a few wide inter-word gaps. Such text lines can be found in posters, address blocks, and artistic documents. Our method is an extension of traditional perceptual grouping. We develop novel solutions to overcome the problems of insufficient seed points and varied orientations in a single line. We assume that text lines consist of connected components, where each connected component is a set of black pixels within a letter or several touching letters. In our scheme, connected components closer than an iteratively incremented threshold are combined into chains of connected components. Elongated chains are identified as the seed chains of lines. The seed chains are then extended to the left and to the right according to the local orientations, which are re-evaluated at each side of a chain as it is extended. By this process, all text lines are finally constructed. The advantage of the proposed method over prior work on curved text-line extraction is that it is not restricted to a specific language and can extract text lines containing wide inter-word gaps. In our experiments, the proposed method performs well on considerably curved text lines from logos and slogans, achieving 98% and 94% for straight-line and curved-line extraction, respectively.
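As a minimal sketch of the chaining step, connected components (represented here only by their centroids) are merged when closer than a threshold that is increased iteratively; elongated chains would then seed the text lines. The representation and the threshold schedule are simplifying assumptions:

```python
import numpy as np

def build_chains(centroids, start=5.0, step=5.0, max_dist=40.0):
    parents = list(range(len(centroids)))

    def find(i):                              # union-find with path compression
        while parents[i] != i:
            parents[i] = parents[parents[i]]
            i = parents[i]
        return i

    threshold = start
    while threshold <= max_dist:
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                if np.linalg.norm(np.subtract(centroids[i], centroids[j])) < threshold:
                    parents[find(i)] = find(j)
        threshold += step

    chains = {}
    for i in range(len(centroids)):
        chains.setdefault(find(i), []).append(i)
    return list(chains.values())

# Two roughly horizontal groups of components.
pts = [(0, 0), (12, 1), (25, 2), (0, 60), (14, 61)]
print(build_chains(pts))   # e.g. [[0, 1, 2], [3, 4]]
```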
Recognition rate is traditionally used as the main criterion for evaluating the performance of a recognition system. High recognition reliability with a low misclassification rate is also a must for many applications. To handle the variability of the writing styles of different individuals, this paper employs decision trees and WRB AdaBoost to design a classifier with high recognition reliability for recognizing Bangla handwritten numerals. Experiments on numeral images obtained from real Bangladesh envelopes show that the proposed recognition method is capable of achieving high recognition reliability with an acceptable recognition rate.
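A minimal sketch of this classifier family is given below: decision trees boosted with AdaBoost, plus a rejection rule for unreliable outputs. It uses plain AdaBoost from scikit-learn (version 1.2 or later assumed for the `estimator` argument), not the paper's WRB variant, and scikit-learn's printed-digit dataset stands in for Bangla numerals; the rejection rule is likewise an assumption:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                         n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Reliability: reject the least confident samples, then report the recognition
# rate on the accepted samples only.
proba = clf.predict_proba(X_test)
threshold = np.quantile(proba.max(axis=1), 0.10)   # reject the 10% least confident
confident = proba.max(axis=1) >= threshold
accepted = clf.predict(X_test)[confident]
print("accepted fraction:", round(confident.mean(), 2),
      "accuracy on accepted:", round(float(np.mean(accepted == y_test[confident])), 3))
```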
This paper presents a system to extract the logical structure of handwritten mail documents. It consists of two joint tasks: the segmentation of documents into blocks and the labeling of those blocks. The main label classes considered are addressee details, sender details, date, subject, text body, and signature. This work must deal with the difficulties of unconstrained handwritten documents: variable structure and variable writing.
We propose a method based on a geometric analysis of the arrangement of elements in the document. We describe the document using a two-dimensional grammatical formalism, which makes it possible to easily introduce knowledge about mail into a generic parser. Our grammatical parser is LL(k), which means that several combinations are tried before the correct one is extracted. The main interest of this approach is that we can deal with loosely structured documents. Moreover, as the segmentation into blocks often depends on the associated classes, our method is able to retry a different segmentation until labeling succeeds.
We validated this method in the context of the French national project RIMES, which proposed a contest on a large base of documents. We obtain a recognition rate of 91.7% on 1,150 images.
There is a strong demand for automated tools for extracting pertinent information from the biomedical literature, which is a rich, complex, and dramatically growing resource that is increasingly accessed via the web. This paper presents a hybrid method based on contextual and statistical information to automatically identify two MEDLINE citation terms, NIH grant numbers and databank accession numbers, in HTML-formatted online biomedical documents. Their detection is challenging due to many variations and inconsistencies in their formats (although recommended formats exist), and also because of their similarity to other technical or biological terms. Our proposed method first extracts potential candidates for these terms using a rule-based method. These are scored, and the final candidates are submitted to a human operator for verification. The confidence score for each term is calculated from statistical, morphological, and contextual information. Experiments conducted on more than ten thousand HTML-formatted online biomedical documents show that most NIH grant numbers and databank accession numbers can be successfully identified by the proposed method, with recall rates of 99.8% and 99.6%, respectively. However, owing to the high false-alarm rate, the proposed method yields F-measure rates of 86.6% and 87.9% for NIH grant numbers and databank accession numbers, respectively.
The problem of form classification is to assign a single-page form image to one of a set of predefined form types or classes. We classify the form images using low-level pixel-density information from the binary images of the documents. In this paper, we solve the form classification problem with a classifier based on the k-means algorithm, supported by adaptive boosting. Our classification method is tested on the NIST scanned tax-form databases (special forms databases 2 and 6), which include machine-typed and handwritten documents. Our method improves on published results for the same databases while using a simple set of image features.
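As a minimal sketch of the feature and base classifier, a binary form image can be reduced to a grid of local pixel densities and assigned to the form type whose k-means centroids lie closest; the boosting stage and the NIST data are not reproduced, and the images below are tiny synthetic stand-ins:

```python
import numpy as np

def density_grid(binary_img, rows=4, cols=4):
    """Mean ink density in each cell of a rows x cols grid."""
    h, w = binary_img.shape
    cells = []
    for r in range(rows):
        for c in range(cols):
            cell = binary_img[r*h//rows:(r+1)*h//rows, c*w//cols:(c+1)*w//cols]
            cells.append(cell.mean())
    return np.array(cells)

def nearest_class(query_feat, class_centroids):
    dists = {cls: min(np.linalg.norm(query_feat - c) for c in cents)
             for cls, cents in class_centroids.items()}
    return min(dists, key=dists.get)

rng = np.random.default_rng(0)
form_a = (rng.random((64, 64)) < 0.2).astype(float)     # sparse form type
form_b = (rng.random((64, 64)) < 0.6).astype(float)     # dense form type
class_centroids = {"type_A": [density_grid(form_a)],    # k = 1 centroid per class here
                   "type_B": [density_grid(form_b)]}

query = (rng.random((64, 64)) < 0.25).astype(float)
print(nearest_class(density_grid(query), class_centroids))   # type_A
```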
Degraded documents are frequently obtained in various situations. Examples of degraded document collections include historical document depositories, documents obtained in legal and security investigations, and legal and medical archives. Degraded document images are hard to read and hard to analyze using computerized techniques. There is hence a need for systems that are capable of enhancing such images. We describe a language-independent, semi-automated system for enhancing degraded document images that is capable of exploiting inter- and intra-document coherence. The system is capable of processing document images with high levels of degradation and can be used for ground truthing of degraded document images. Ground truthing of degraded document images is extremely important in several respects: it enables quantitative performance measurement of enhancement systems and facilitates model estimation that can be used to improve performance. Performance evaluation is provided using the historical Frieder diaries collection [1].
Adaptive binarization is an important first step in many document analysis and OCR processes. This paper describes a fast adaptive binarization algorithm that yields the same quality of binarization as the Sauvola method [1], but runs in time close to that of global thresholding methods (like Otsu's method [2]), independent of the window size. The algorithm combines the statistical constraints of Sauvola's method with integral images [3]. Testing on the UW-1 dataset demonstrates a 20-fold speedup compared to the original Sauvola algorithm.
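A minimal sketch of this technique is shown below: the standard Sauvola threshold t = m * (1 + k * (s / R - 1)) is computed for every pixel, with the local mean m and standard deviation s obtained from integral images of the image and of its square, so the cost is independent of the window size. The parameter values follow common defaults and are not necessarily those used in the paper:

```python
import numpy as np

def sauvola_binarize(gray, window=15, k=0.2, R=128.0):
    gray = gray.astype(np.float64)
    h, w = gray.shape
    # Integral images with a leading zero row/column for easy window sums.
    ii  = np.pad(gray,      ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    ii2 = np.pad(gray ** 2, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

    r = window // 2
    ys, xs = np.mgrid[0:h, 0:w]
    y0, y1 = np.clip(ys - r, 0, h), np.clip(ys + r + 1, 0, h)
    x0, x1 = np.clip(xs - r, 0, w), np.clip(xs + r + 1, 0, w)

    area = (y1 - y0) * (x1 - x0)
    s1 = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
    s2 = ii2[y1, x1] - ii2[y0, x1] - ii2[y1, x0] + ii2[y0, x0]
    mean = s1 / area
    std = np.sqrt(np.maximum(s2 / area - mean ** 2, 0.0))

    threshold = mean * (1.0 + k * (std / R - 1.0))
    return (gray > threshold).astype(np.uint8)   # 1 = background, 0 = ink

# A dark stroke on a bright, slowly varying background.
img = np.full((32, 32), 200.0) + np.arange(32)   # illumination gradient
img[10:12, 4:28] = 40.0                          # the stroke
print(sauvola_binarize(img)[10, 4:28].sum())     # 0: stroke pixels mapped to ink
```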