13 July 2012
Text, photo, and line extraction in scanned documents
We propose a page layout analysis algorithm that classifies a scanned document into regions such as text, photo, or strong lines. The proposed scheme consists of five modules. The first module applies several image preprocessing techniques, such as image scaling, filtering, color space conversion, and gamma correction, to enhance the scanned image quality and reduce the computation time in later stages. The second module performs text detection, wherein a wavelet transform and run-length encoding are employed to generate and validate text regions, respectively. The third module uses a Markov random field-based block-wise segmentation that employs a basis vector projection technique with maximum a posteriori probability optimization to detect photo regions. In the fourth module, edge detection, edge linking, line-segment fitting, and the Hough transform are used to detect strong edges and lines. In the last module, the resulting text, photo, and edge maps are combined to generate a page layout map using K-means clustering. The proposed algorithm has been tested on several hundred documents with simple and complex page layout structures and contents, such as articles, magazines, business cards, dictionaries, and newsletters, and compared against state-of-the-art page-segmentation techniques with benchmark performance. The results indicate that our methodology achieves an average classification accuracy of approximately 89% for text, photo, and background regions.
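As a rough illustration of the run-length encoding mentioned for the second module, the sketch below encodes a binarized scanline as (value, run length) pairs and applies a simple check to it. The abstract does not describe how the runs are used to validate text regions, so the rule in `looks_like_text`, its thresholds `max_run` and `min_transitions`, and the sample rows are all assumptions made for illustration, not the authors' method.

```python
from itertools import groupby


def run_lengths(row):
    """Encode a binary scanline as a list of (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(row)]


def looks_like_text(row, max_run=4, min_transitions=4):
    """Hypothetical validation rule (an assumption, not from the paper).

    A scanline that crosses text tends to alternate often between short
    foreground (1) runs and background (0) runs, whereas a photo or a
    solid line tends to produce a few long foreground runs.
    """
    runs = run_lengths(row)
    foreground = [length for value, length in runs if value == 1]
    if not foreground:
        return False  # no foreground pixels at all
    transitions = len(runs) - 1  # number of 0/1 changes along the row
    return transitions >= min_transitions and max(foreground) <= max_run


# Illustrative rows: 1 = foreground pixel, 0 = background pixel.
text_row = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0]
photo_row = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

print(run_lengths(text_row))
print(looks_like_text(text_row), looks_like_text(photo_row))
```

A real validator would presumably operate on two-dimensional candidate regions and tune its thresholds to the scan resolution; the one-dimensional check here only shows the encoding step.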
© 2012 SPIE and IS&T
M. Sezer Erkilinc, Mustafa I. Jaber, Eli Saber, Peter Bauer, Dejan Depalov, "Text, photo, and line extraction in scanned documents," Journal of Electronic Imaging 21(3), 033006 (13 July 2012). https://doi.org/10.1117/1.JEI.21.3.033006
