Extraction of text layout structures on document images based on statistical characterization
30 March 1995
Abstract
Textual structures on a document image, such as characters, words, text lines, and paragraphs, are usually laid out in a highly structured manner, with preferred spatial relations. These spatial relations are rarely deterministic; instead, they describe correlations and likelihoods. Therefore, any realistic document layout analysis algorithm should utilize this type of knowledge in order to optimize its performance. In this paper, we first describe a method for automatically generating a large amount of almost 100% correct ground-truth data for document layout analysis. The bounding boxes for the characters, words, text lines, and paragraphs are expressed in a hierarchy. Then, based on this layout ground truth, we build statistical models of the layout structures for the words, text lines, and paragraphs on document images. Finally, we describe an algorithm that utilizes these statistical models to extract the text words on document images. The performance of the algorithm is evaluated and reported.
© (1995) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Su S. Chen, Robert M. Haralick, Ihsin T. Phillips, "Extraction of text layout structures on document images based on statistical characterization", Proc. SPIE 2422, Document Recognition II, (30 March 1995); doi: 10.1117/12.205815; https://doi.org/10.1117/12.205815
PROCEEDINGS
12 PAGES

