Extraction of text words in document images based on a statistical characterization
1 January 1996
J. of Electronic Imaging, 5(1), (1996). doi:10.1117/12.227706
Abstract
Text structures in document images are usually laid out in a structured manner, with preferred spatial relations among them. These spatial relations are rarely deterministic; however, they can be modeled by probabilities. Therefore, any realistic document layout analysis algorithm should use this type of probabilistic knowledge to optimize its performance. We first describe a method for automatically generating a large amount of nearly perfect layout ground truth data from LaTeX device-independent (DVI) files, in which the bounding boxes for the characters, words, text lines, and text blocks are represented in hierarchies. These ground truth data enable us to construct statistical models that characterize the various layout structures in document images. We demonstrate this concept through the development of a word segmentation algorithm, which employs the recursive morphological closing transform to model word shapes in document images. We also conducted systematic experiments to evaluate the performance of our algorithm using synthetic images generated from the LaTeX DVI files and real images from the UW-I and UW-II English document image databases. The results indicate that the correct word detection rate is about 95% on the synthetic images and more than 90% on most of the tested real images.
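To give a rough feel for how a closing transform can drive word segmentation, the sketch below works on a single binary scanline (for example, a vertically projected text line). It is a simplified one-dimensional illustration and not the authors' algorithm: the paper uses a two-dimensional recursive closing transform with statistically derived parameters, whereas the function names, the horizontal line structuring element, the `max_k` cap, and the hand-picked `threshold` here are all assumptions made for illustration.

```python
def closing_transform_row(row, max_k):
    """For each pixel of a 0/1 scanline, return the smallest length k
    (1..max_k) of a horizontal line structuring element whose closing
    covers that pixel, or 0 if no such k exists within max_k.

    Closing with a length-k line fills exactly the background gaps that
    are shorter than k and bounded by foreground on both sides, so a
    bounded gap of length g gets the value g + 1.
    """
    n = len(row)
    out = [0] * n
    i = 0
    while i < n:
        if row[i]:
            out[i] = 1  # foreground survives closing with any element
            i += 1
            continue
        j = i
        while j < n and not row[j]:
            j += 1
        gap = j - i
        bounded = i > 0 and j < n  # gaps touching the border never close
        if bounded and gap + 1 <= max_k:
            for p in range(i, j):
                out[p] = gap + 1
        i = j
    return out


def segment_words(row, threshold, max_k=64):
    """Threshold the closing transform and return (start, end) pixel
    spans, inclusive, of the merged runs. The threshold is hand-picked
    here; it stands in for a statistically estimated value."""
    ct = closing_transform_row(row, max_k)
    spans, start = [], None
    for x, v in enumerate(ct):
        inside = 1 <= v <= threshold
        if inside and start is None:
            start = x
        elif not inside and start is not None:
            spans.append((start, x - 1))
            start = None
    if start is not None:
        spans.append((start, len(ct) - 1))
    return spans


# Two "words" separated by a 3-pixel gap, with 1-pixel gaps inside each.
row = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0]
print(closing_transform_row(row, 64))
print(segment_words(row, threshold=2))
```

With a small threshold, the one-pixel gaps between characters are closed while the wider inter-word gap is kept, so the scanline splits into two spans; raising the threshold to 4 merges everything into one span. Choosing that threshold from ground-truth statistics rather than by hand is the role the abstract assigns to the statistical models.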
Su S. Chen, Robert M. Haralick, Ihsin T. Phillips, "Extraction of text words in document images based on a statistical characterization," Journal of Electronic Imaging 5(1), (1 January 1996). http://dx.doi.org/10.1117/12.227706
JOURNAL ARTICLE
12 PAGES


KEYWORDS
Image segmentation
Image processing algorithms and systems
Algorithm development
Image processing
Databases
Binary data
LaTeX
