21 March 2013 Non-Manhattan layout extraction algorithm
Automated publishing requires large databases containing document page layout templates. The number of layout templates that need to be created and stored grows exponentially with the complexity of the document layouts. A better approach for automated publishing is to reuse layout templates of existing documents for the generation of new documents. In this paper, we present an algorithm for template extraction from a document page image. We use the cost-optimized segmentation algorithm (COS) to segment the image, and Voronoi decomposition to cluster the text regions. Then, we create a block image where each block represents a homogeneous region of the document page. We construct a geometrical tree that describes the hierarchical structure of the document page. We also implement a font recognition algorithm to analyze the font of each text region. We present a detailed description of the algorithm and our preliminary results.
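The abstract does not say how the geometrical tree is built, so the following is only a minimal sketch of one plausible reading: each block is reduced to an axis-aligned bounding box and attached to the smallest block that fully contains it. The names `Node`, `contains`, `area`, and `build_tree` are hypothetical, and the bounding-box simplification is an assumption; true non-Manhattan regions would need polygon containment instead.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumption: a block is approximated by (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Node:
    label: str
    box: Box
    children: List["Node"] = field(default_factory=list)


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer` (edges may touch)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def area(box: Box) -> int:
    return (box[2] - box[0]) * (box[3] - box[1])


def build_tree(page_box: Box, blocks: List[Tuple[str, Box]]) -> Node:
    """Build a containment hierarchy rooted at the page."""
    root = Node("page", page_box)
    placed = [root]
    # Place larger blocks first so that potential parents already exist.
    for label, box in sorted(blocks, key=lambda b: area(b[1]), reverse=True):
        candidates = [n for n in placed if contains(n.box, box)]
        # The parent is the smallest enclosing node; fall back to the page.
        parent = min(candidates, key=lambda n: area(n.box)) if candidates else root
        node = Node(label, box)
        parent.children.append(node)
        placed.append(node)
    return root


if __name__ == "__main__":
    # Hypothetical block list, standing in for the output of segmentation.
    tree = build_tree(
        (0, 0, 100, 100),
        [
            ("header", (0, 0, 100, 20)),
            ("title", (10, 5, 90, 15)),
            ("body", (0, 25, 100, 100)),
            ("image", (5, 30, 45, 60)),
            ("caption", (5, 62, 45, 70)),
        ],
    )
    for child in tree.children:
        print(child.label, [c.label for c in child.children])
```

In this sketch the segmentation and clustering stages are assumed to have already produced the labeled blocks; only the hierarchy step is illustrated.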
© (2013) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Aziza Satkhozhina, Ildus Ahmadullin, Jan P. Allebach, Qian Lin, Jerry Liu, Daniel Tretter, Eamonn O'Brien-Strain, Andrew Hunter, "Non-Manhattan layout extraction algorithm", Proc. SPIE 8664, Imaging and Printing in a Web 2.0 World IV, 86640A (21 March 2013); doi: 10.1117/12.2009424; https://doi.org/10.1117/12.2009424


