Ideally, digital versions of scanned documents should be represented in a format that is searchable, compressed,
highly readable, and faithful to the original. These goals can theoretically be achieved through OCR and font
recognition, re-typesetting the document text with original fonts. However, OCR and font recognition remain
hard problems, and many historical documents use fonts that are not available in digital form. It is therefore
desirable to reconstruct fonts with vector glyphs that approximate the shapes of the letters that make up a
font. In this work, we address the grouping of tokens in a token-compressed document into candidate fonts.
This permits us to incorporate font information into token-compressed images even when the original fonts are
unknown or unavailable in digital format. This paper extends previous work in font reconstruction by proposing
and evaluating an algorithm to assign a font to every character within a document. This is a necessary step
to represent a scanned document image with a reconstructed font. Through our evaluation method, we have
measured a 98.4% accuracy for the assignment of letters to candidate fonts in multi-font documents.
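As a toy illustration of such a grouping step (a hypothetical sketch, not the paper's algorithm): if tokens are represented as small binary bitmaps, they can be greedily merged into candidate fonts whenever their shape distance to a group's representative falls below a threshold. The function names, the distance measure, and the threshold below are all illustrative assumptions.

```python
# Hypothetical sketch: group token bitmaps into candidate fonts by a
# thresholded shape distance. Not the paper's actual algorithm.

THRESHOLD = 0.2  # illustrative value

def shape_distance(a, b):
    """Normalized Hamming distance between two equal-size binary bitmaps."""
    diff = sum(pa != pb for pa, pb in zip(a, b))
    return diff / len(a)

def group_into_fonts(tokens):
    """Greedy grouping: assign each token to the first candidate font whose
    representative is within THRESHOLD, otherwise start a new font."""
    fonts = []       # list of (representative_bitmap, member_indices)
    assignment = []  # font id per token
    for i, tok in enumerate(tokens):
        for font_id, (rep, members) in enumerate(fonts):
            if shape_distance(tok, rep) < THRESHOLD:
                members.append(i)
                assignment.append(font_id)
                break
        else:
            fonts.append((tok, [i]))
            assignment.append(len(fonts) - 1)
    return assignment

# Tiny demo with 8-pixel "bitmaps": the first two differ in one pixel
# (same candidate font), the third is entirely different (new font).
tokens = [
    (1, 1, 0, 0, 1, 1, 0, 0),
    (1, 1, 0, 0, 1, 0, 0, 0),
    (0, 0, 1, 1, 0, 0, 1, 1),
]
print(group_into_fonts(tokens))  # [0, 0, 1]
```

A real system would of course use a shape distance robust to scan noise and size variation; the greedy pass here only shows the grouping structure.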

The wide availability of cheap, high-quality printing techniques makes document forgery a task that most people can
accomplish with standard computer and printing hardware. To prevent the use of color laser printers or color copiers
for counterfeiting, e.g., of money or other valuable documents, many of these machines print Counterfeit Protection System
(CPS) codes on the page. These small yellow dots encode information about the specific printer and allow the questioned
document examiner in cooperation with the manufacturers to track down the printer that was used to generate the document.
However, access to the methods for decoding the tracking-dot pattern is restricted. Exact decoding of a tracking pattern
is often not necessary, as narrowing the pattern down to the printer class may be enough. In this paper, we present a method
that detects which CPS pattern class was used in a given document. This can be used to identify the class of printer that the
document was printed on. The evaluation showed an accuracy of up to 91%.
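A minimal sketch of the pattern-class matching idea, under the assumption that dot positions have already been extracted from the scan and that per-class templates are available; the class names, templates, and the overlap score below are invented for illustration and are not the paper's method.

```python
# Illustrative sketch (not the paper's method): classify an extracted
# yellow-dot pattern by its overlap with known pattern-class templates.
# Dots are represented as sets of (row, col) grid cells.

def jaccard(a, b):
    """Jaccard similarity of two dot-position sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def classify_pattern(dots, templates):
    """Return the template class with the highest overlap score."""
    return max(templates, key=lambda name: jaccard(dots, templates[name]))

# Hypothetical class templates and a noisy observation missing one dot.
templates = {
    "class_A": {(0, 0), (0, 2), (1, 1), (2, 0)},
    "class_B": {(0, 1), (1, 0), (1, 2), (2, 2)},
}
observed = {(0, 0), (0, 2), (1, 1)}
print(classify_pattern(observed, templates))  # class_A
```

In practice the extraction step (finding the faint yellow dots and registering them to a grid) is the hard part; the matching itself can remain as simple as above.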

Detecting the correct orientation of document images is an important step in large-scale digitization processes, as most
subsequent document analysis and optical character recognition methods assume upright position of the document page.
Many methods have been proposed to solve the problem, most of which are based on computing the ascender-to-descender
ratio. Unfortunately, this cannot be used for scripts that have neither ascenders nor descenders. Therefore, we present a trainable
method using character similarity to compute the correct orientation. A connected component based distance measure is
computed to compare the characters of the document image to characters whose orientation is known. The orientation
for which the distance is lowest is then selected as the correct one. Training is easily achieved by replacing the
reference characters with characters of the script to be analyzed. Evaluation of the proposed approach showed an accuracy of
above 99% for Latin and Japanese script from the public UW-III and UW-II datasets. An accuracy of 98.9% was obtained
for Fraktur on a non-public dataset. A comparison of the proposed method with two methods using ascender/descender-ratio-based
orientation detection shows a significant improvement.
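The orientation selection described above can be sketched roughly as follows. The toy bitmaps, the pixel-wise distance, and all function names are illustrative stand-ins for the paper's connected-component-based measure: each candidate rotation is scored by how close the document's characters come to the nearest reference character, and the lowest-scoring rotation wins.

```python
# Rough sketch of orientation detection by character similarity:
# try all four page rotations and pick the one whose characters best
# match reference characters of known orientation.

def rot90(bitmap):
    """Rotate a 2D binary bitmap 90 degrees clockwise."""
    return [list(row) for row in zip(*bitmap[::-1])]

def bitmap_distance(a, b):
    """Per-pixel mismatch count between two equal-size bitmaps
    (stand-in for the paper's connected-component distance)."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def detect_orientation(chars, references):
    """Return the rotation (0, 90, 180, 270) minimizing the total distance
    of each rotated character to its nearest reference character."""
    best_rot, best_score = 0, float("inf")
    for k, rot in enumerate([0, 90, 180, 270]):
        score = 0
        for c in chars:
            for _ in range(k):
                c = rot90(c)
            score += min(bitmap_distance(c, r) for r in references)
        if score < best_score:
            best_rot, best_score = rot, score
    return best_rot

# Reference "L" shape, upright; the document character is the same shape
# rotated by 180 degrees, so 180 should be detected.
ref = [[1, 0, 0],
       [1, 0, 0],
       [1, 1, 1]]
doc_char = rot90(rot90(ref))
print(detect_orientation([doc_char], [ref]))  # 180
```

Retraining for a new script amounts to swapping the `references` list, which mirrors how the proposed method exchanges its reference characters.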

In large-scale scanning applications, orientation detection of the digitized page is necessary for the following
procedures to work correctly. Several existing methods for orientation detection use the fact that in Roman
script text, ascenders are more likely to occur than descenders. In this paper, we propose a different approach
for page orientation detection that uses this information. The main advantage of our method is that it is more
accurate than the widely used methods we compare against, while being independent of the scan resolution. Another interesting
aspect of our method is that it can be combined with our previously published method for skew detection to
obtain a single-step skew and orientation estimate of the page image. We demonstrate the effectiveness of our
approach on the UW-I dataset and show that our method achieves an accuracy of above 99% on this dataset. We
also show that our method is robust to different scanning resolutions and can reliably detect page orientations
for documents rendered at 150, 200, 300, and 400 dpi.
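For contrast, the ascender/descender cue that the existing methods mentioned above rely on can be reduced to a toy vote. This is a baseline illustration only, not the proposed resolution-independent method; the per-line counts are assumed to come from an earlier layout-analysis step.

```python
# Toy baseline: decide between upright (0) and upside-down (180) using
# the asymmetry of ascenders vs. descenders in Roman script text.

def orientation_vote(lines):
    """lines: list of (num_ascenders, num_descenders) per text line.
    Returns 0 (upright) if ascenders dominate overall, else 180."""
    asc = sum(a for a, d in lines)
    desc = sum(d for a, d in lines)
    return 0 if asc >= desc else 180

upright_page = [(5, 2), (7, 3), (4, 1)]  # ascender-heavy lines
flipped_page = [(2, 5), (3, 7), (1, 4)]  # counts swap under a 180 rotation
print(orientation_vote(upright_page))  # 0
print(orientation_vote(flipped_page))  # 180
```

Counting components above the x-height and below the baseline is where such a baseline becomes sensitive to scan resolution and noise, which is the weakness the paper's approach is designed to avoid.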