15 May 2015 Dealing with extreme data diversity: extraction and fusion from the growing types of document formats
The growth in text data available online is accompanied by a growth in the diversity of available documents. Corpora with extreme heterogeneity in file formats, document organization, page layout, text style, and content are common. Online and open-source data usually lack meaningful metadata describing their structure, so text extraction produces results that carry no information about document structure and are cluttered with page headers and footers, web navigation controls, advertisements, and other items that are typically considered noise. We describe an approach to document structure and metadata recovery that uses visual analysis of documents to infer the communicative intent of the author. Our algorithm identifies the components of documents, such as titles, headings, and body content, based on their appearance. Because it operates on an image of a document, our technique can be applied to any type of document, including scanned images. Our approach to document structure recovery considers a finer-grained set of component types than prior approaches. In this initial work, we show that a machine learning approach to document structure recovery using a feature set based on the geometry and appearance of images of documents achieves a 60% greater F1 score than a baseline random classifier.
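As a rough illustration of what "a feature set based on the geometry and appearance of images of documents" could look like, the following minimal Python sketch derives page-normalized features for one text block and computes a per-class F1 score. It is not the authors' implementation: the `Block` fields, the choice of features, and the function names are all assumptions made for this example, and the abstract does not specify the actual features or classifier.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text region detected on a page image (hypothetical structure)."""
    x: float            # left edge, in pixels
    y: float            # top edge, in pixels
    width: float        # box width, in pixels
    height: float       # box height, in pixels
    text_height: float  # estimated glyph height, in pixels (assumed available)
    ink_density: float  # fraction of dark pixels in the box, 0..1 (assumed available)


def features(block, page_width, page_height):
    """Return geometry and appearance features, normalized by page size."""
    center_x = (block.x + block.width / 2) / page_width
    return [
        block.y / page_height,            # vertical position of the top edge
        abs(center_x - 0.5),              # horizontal offset from the page center
        block.width / page_width,         # relative width
        block.height / page_height,       # relative height
        block.text_height / page_height,  # relative text size
        block.ink_density,                # rough proxy for bold or dense text
    ]


def f1_score(true_labels, predicted_labels, positive):
    """F1 for one component type: harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Feature vectors of this kind could be passed to any standard classifier, and the resulting per-class F1 scores compared with those of a classifier that assigns labels at random, which is the kind of baseline the abstract mentions.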
© (2015) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Peter David, Nichole Hansen, James J. Nolan, Pedro Alcocer, "Dealing with extreme data diversity: extraction and fusion from the growing types of document formats", Proc. SPIE 9499, Next-Generation Analyst III, 94990Q (15 May 2015); doi: 10.1117/12.2184171; https://doi.org/10.1117/12.2184171

