24 March 2014 The Lehigh Steel Collection: a new open dataset for document recognition research
Abstract
Document image analysis is a data-driven discipline. For a number of years, research focused on small, homogeneous datasets such as the University of Washington corpus of scanned journal pages. More recently, library digitization efforts have raised many interesting problems concerning the recognition of historical documents. In this paper, we present the Lehigh Steel Collection (LSC), a new open dataset that we are currently assembling and that will be, in many ways, unique in the field. LSC is an extremely large, heterogeneous set of documents dating from the 1960s through the 1990s and relating to the wide-ranging research activities of Bethlehem Steel, a now-bankrupt company that was once the second-largest steel producer and the largest shipbuilder in the United States. As a result of the bankruptcy process and the disposition of the company's assets, an enormous quantity of documents (we estimate hundreds of thousands of pages) was left abandoned in buildings recently acquired by Lehigh University. Rather than see this history destroyed, we stepped in to preserve a portion of the collection via digitization. Here we provide an overview of LSC, including our efforts to collect and scan the documents, a preliminary characterization of what the collection contains, and our plans to make the data available to the research community for non-commercial purposes.
© (2014) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Barri Bruno, Daniel Lopresti, "The Lehigh Steel Collection: a new open dataset for document recognition research", Proc. SPIE 9021, Document Recognition and Retrieval XXI, 90210O (24 March 2014); doi: 10.1117/12.2042615; https://doi.org/10.1117/12.2042615
PROCEEDINGS
9 PAGES

