21 March 2013 Non-Manhattan layout extraction algorithm
Proceedings Volume 8664, Imaging and Printing in a Web 2.0 World IV; 86640A (2013)
Event: IS&T/SPIE Electronic Imaging, 2013, Burlingame, California, United States
Automated publishing requires large databases containing document page layout templates. The number of layout templates that need to be created and stored grows exponentially with the complexity of the document layouts. A better approach for automated publishing is to reuse layout templates of existing documents for the generation of new documents. In this paper, we present an algorithm for template extraction from a document page image. We use the cost-optimized segmentation algorithm (COS) to segment the image, and Voronoi decomposition to cluster the text regions. Then, we create a block image where each block represents a homogeneous region of the document page. We construct a geometrical tree that describes the hierarchical structure of the document page. We also implement a font recognition algorithm to analyze the font of each text region. We present a detailed description of the algorithm and our preliminary results.
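The abstract does not say how the geometrical tree is built, so the following is only an illustrative sketch of one way a hierarchy of page blocks could be represented. It assumes each block has already been reduced to an axis-aligned bounding box and nests blocks by containment. This is a simplification that does not capture non-Manhattan region shapes and is not the authors' method. The names `Node`, `build_tree`, and `outline`, and the sample coordinates, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical block representation: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Node:
    label: str
    box: Box
    children: List["Node"] = field(default_factory=list)


def area(box: Box) -> int:
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer` (edges may touch)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def build_tree(page: Box, blocks: List[Tuple[str, Box]]) -> Node:
    """Nest blocks by containment (assumed rule, not taken from the paper)."""
    root = Node("page", page)
    # Insert larger blocks first so a potential parent exists before its children.
    for label, box in sorted(blocks, key=lambda b: area(b[1]), reverse=True):
        parent = root
        descended = True
        while descended:
            descended = False
            for child in parent.children:
                if contains(child.box, box):
                    parent = child  # go one level deeper and look again
                    descended = True
                    break
        parent.children.append(Node(label, box))
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, two spaces per level."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    # Made-up sample layout, for illustration only.
    sample = [
        ("title", (0, 0, 100, 15)),
        ("column", (0, 20, 60, 100)),
        ("paragraph", (5, 25, 55, 60)),
        ("image", (65, 20, 100, 100)),
    ]
    print("\n".join(outline(build_tree((0, 0, 100, 100), sample))))
```

In this sketch the paragraph ends up as a child of the column, while the title, column, and image sit directly under the page. A real non-Manhattan layout would need polygonal regions and a containment or adjacency test that works on them rather than on bounding boxes.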
© (2013) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Aziza Satkhozhina, Ildus Ahmadullin, Jan P. Allebach, Qian Lin, Jerry Liu, Daniel Tretter, Eamonn O'Brien-Strain, and Andrew Hunter "Non-Manhattan layout extraction algorithm", Proc. SPIE 8664, Imaging and Printing in a Web 2.0 World IV, 86640A (21 March 2013);

