21 March 2013 Graph-based layout analysis for PDF documents
To increase the flexibility and enrich the reading experience of e-books on small portable screens, a graph-based method is proposed to perform layout analysis on Portable Document Format (PDF) documents. Born-digital documents have inherent advantages, such as representing text and fractional images in explicit form, which can be exploited directly. To integrate traditional image-based document analysis with the metadata provided by a PDF parser, the page primitives, including text, image, and path elements, are processed to produce a text layer and a non-text layer that are analysed separately. The graph-based method is developed at the superpixel representation level, and page text elements serve as the vertices of an undirected graph. The Euclidean distance between adjacent vertices is applied in a top-down manner to cut the tree formed by Kruskal’s algorithm. Edge orientation is then used in a bottom-up manner to extract text lines from each subtree. Non-textual objects, in contrast, are segmented by connected component analysis. For each segmented text and non-text composite, a 13-dimensional feature vector is extracted for labelling. Experimental results on selected pages from PDF books are presented.
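The top-down cutting step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each text element is reduced to a 2-D centre point, that every pair of elements counts as adjacent (a complete graph), and that a single fixed distance threshold decides where the tree is cut. The names `kruskal_mst`, `cut_tree`, and `max_dist` are hypothetical and chosen only for this example.

```python
import math
from itertools import combinations


def _find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def kruskal_mst(points):
    """Return minimum spanning tree edges (weight, i, j) for 2-D points.

    Assumption: the graph is complete, so every pair of points is a
    candidate edge weighted by Euclidean distance.
    """
    parent = list(range(len(points)))
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    mst = []
    for weight, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:  # the edge joins two different components, so keep it
            parent[ri] = rj
            mst.append((weight, i, j))
    return mst


def cut_tree(n, mst, max_dist):
    """Drop tree edges longer than max_dist and return the resulting
    groups of vertex indices (one group per subtree)."""
    parent = list(range(n))
    for weight, i, j in mst:
        if weight <= max_dist:
            parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for v in range(n):
        groups.setdefault(_find(parent, v), []).append(v)
    return sorted(groups.values())


# Hypothetical centre points of five text elements: two clusters on one row.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
mst = kruskal_mst(points)
blocks = cut_tree(len(points), mst, max_dist=2.0)
```

In this toy input the only long tree edge is the one bridging the two clusters, so cutting at `max_dist=2.0` separates the elements into two subtrees. A real page would need a threshold derived from the document itself, for example from font size or typical spacing, which the abstract does not specify.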
© (2013) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Canhui Xu, Zhi Tang, Xin Tao, Yun Li, Cao Shi, "Graph-based layout analysis for PDF documents", Proc. SPIE 8664, Imaging and Printing in a Web 2.0 World IV, 866407 (21 March 2013); doi: 10.1117/12.2005608; https://doi.org/10.1117/12.2005608
