1 January 1996 Extraction of text words in document images based on a statistical characterization
Text structures in document images are usually laid out in a structured manner, with preferred spatial relations. These spatial relations are rarely deterministic; however, they can be modeled by probabilities. Therefore, any realistic document layout analysis algorithm should use this type of probabilistic knowledge to optimize its performance. We first describe a method for automatically generating a large amount of nearly perfect layout ground truth data from LaTeX device-independent (DVI) files, in which the bounding boxes for the characters, words, text lines, and text blocks are represented in hierarchies. These ground truth data enable us to construct statistical models that characterize the various layout structures in document images. We demonstrate this concept through the development of a word segmentation algorithm, which employs the recursive morphological closing transform to model word shapes in document images. We also conducted systematic experiments to evaluate the performance of our algorithm using synthetic images generated from the LaTeX DVI files and real images from the UW-I and UW-II English document image databases. The results indicate that the correct word detection rate is about 95% on the synthetic images and more than 90% on most of the tested real images.
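As a rough illustration of how morphological closing can group characters into words, the sketch below works on a single binarized text line. It is a simplified stand-in, not the recursive closing transform or the statistical model of the paper: it closes the column projection of the line with a flat one-dimensional structuring element, so that gaps shorter than the element are filled. The function names `close_1d` and `word_spans` and the fixed element length `k` are assumptions made for this example; in the paper's setting such a parameter would be derived from the ground truth statistics rather than chosen by hand.

```python
def close_1d(profile, k):
    """Close a binary 1-D signal with a flat structuring element of length k.

    Interior background runs shorter than k are filled. Runs touching
    either border are left open (background is assumed outside the signal).
    """
    out = list(profile)
    n = len(out)
    i = 0
    while i < n:
        if out[i]:
            i += 1
            continue
        # Find the end of the current background run [i, j).
        j = i
        while j < n and not out[j]:
            j += 1
        # Fill the run only if it is bounded on both sides and short enough.
        if i > 0 and j < n and (j - i) < k:
            for t in range(i, j):
                out[t] = 1
        i = j
    return out


def word_spans(image, k):
    """Return (first_col, last_col) spans of candidate words in one text line.

    `image` is a list of rows of 0/1 pixels (1 = ink). `k` is a hypothetical
    threshold separating inter-character gaps from inter-word gaps.
    """
    cols = len(image[0])
    # Column projection: 1 where any row has ink in that column.
    profile = [1 if any(row[c] for row in image) else 0 for c in range(cols)]
    closed = close_1d(profile, k)

    spans = []
    start = None
    for c, v in enumerate(closed):
        if v and start is None:
            start = c
        elif not v and start is not None:
            spans.append((start, c - 1))
            start = None
    if start is not None:
        spans.append((start, cols - 1))
    return spans


# Two "characters" separated by a 1-pixel gap, then a 3-pixel gap, then a third.
line = [[1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0]]
print(word_spans(line, 2))
```

With `k = 2` the one-pixel gap is bridged and the three-pixel gap is kept, so the first two glyphs merge into one span and the third stays separate. A larger `k` merges everything into a single span, which is why the choice of this threshold matters.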
Su S. Chen, Robert M. Haralick, Ihsin T. Phillips, "Extraction of text words in document images based on a statistical characterization," Journal of Electronic Imaging 5(1), (1 January 1996). https://doi.org/10.1117/12.227706
