1 January 1996 Extraction of text words in document images based on a statistical characterization
Su S. Chen, Robert M. Haralick, Ihsin T. Phillips
Abstract
Text structures in document images are usually laid out in a structured manner—having preferred spatial relations. These spatial relations are rarely deterministic; however, they can be modeled by probabilities. Therefore, any realistic document layout analysis algorithm should utilize this type of probabilistic knowledge to optimize its performance. We first describe a method for automatically generating a large amount of nearly perfect layout ground truth data from the LaTeX device-independent (DVI) files, where the bounding boxes for the characters, words, text lines, and text blocks are represented in hierarchies. These ground truth data enable us to construct statistical models that characterize the various layout structures in document images. We demonstrate this concept through the development of a word segmentation algorithm, which employs the recursive morphological closing transform to model word shapes in document images. We also conducted systematic experiments to evaluate the performance of our algorithm using the synthetic images generated from the LaTeX DVI files and the real images from the UW-I and UW-II English document image databases. The results indicate that the correct word detection rate is about 95% on the synthetic images and more than 90% on most of the tested real images.
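The idea of merging characters into words by morphological closing can be illustrated with a minimal sketch. It is not the recursive closing transform or the statistical model from the paper: it applies a single one-dimensional closing of fixed size to the column-occupancy profile of one text line (1 where a column contains ink, 0 otherwise). The function names and the `max_gap` parameter are hypothetical; in the approach described above, the gap threshold would come from statistics estimated on ground truth data rather than being set by hand.

```python
def close_gaps(profile, max_gap):
    """Fill interior background runs of at most max_gap columns.

    Away from the borders this matches a 1-D morphological closing with a
    flat structuring element of length max_gap + 1 (an illustrative choice).
    """
    closed = list(profile)
    last_fg = None  # index of the most recent foreground column
    for i, value in enumerate(profile):
        if value:
            if last_fg is not None and 0 < i - last_fg - 1 <= max_gap:
                for j in range(last_fg + 1, i):
                    closed[j] = 1
            last_fg = i
    return closed


def runs(profile):
    """Return inclusive (start, end) column spans of foreground runs."""
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(profile) - 1))
    return spans


def segment_words(profile, max_gap):
    """Close small inter-character gaps, then read off word spans."""
    return runs(close_gaps(profile, max_gap))


# Gaps of one column (between characters) are bridged; the three-column
# gap (between words) is kept.
line_profile = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1]
print(segment_words(line_profile, max_gap=1))  # [(0, 3), (7, 10)]
```

With a fixed threshold, the result depends entirely on how well `max_gap` separates inter-character from inter-word spacing, which is the quantity a statistical characterization of layout is meant to estimate.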
Su S. Chen, Robert M. Haralick, and Ihsin T. Phillips "Extraction of text words in document images based on a statistical characterization," Journal of Electronic Imaging 5(1), (1 January 1996). https://doi.org/10.1117/12.227706
Published: 1 January 1996
CITATIONS
Cited by 1 scholarly publication and 1 patent.
KEYWORDS
Image segmentation
Image processing algorithms and systems
Algorithm development
Image processing
Databases
Binary data
LaTeX
