WFST-based ground truth alignment for difficult historical documents with text modification and layout variations
4 February 2013
This work proposes several approaches for generating correspondences between real scanned books and their transcriptions, which may differ through text modifications and layout variations, while also taking OCR errors into account. Our approaches for aligning the manuscript with the transcription are based on weighted finite state transducers (WFSTs). In particular, we propose adapted WFSTs that represent the transcription to be aligned with the OCR lattices. The character-level alignment includes edit rules that allow edit operations (insertion, deletion, substitution). These edit operations allow the transcription model to cope with OCR segmentation and recognition errors, and also with the task of aligning different text editions. We implemented an alignment model with a hyphenation model so that it can adapt to a non-hyphenated transcription. Our models also handle the ligatures typically found in historical Fraktur documents. We evaluated our approach on Fraktur documents from the "Wanderungen durch die Mark Brandenburg" volumes (1862-1889) and observed the performance of those models under OCR errors. We compare the performance of our model for four different scenarios: having no information about the correspondence at the word (i), line (ii), sentence (iii) or page (iv) level.
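To make the role of the edit operations concrete, the following is a minimal sketch of character-level alignment between an OCR string and a transcription. It is not the WFST construction described in the abstract: it uses plain dynamic programming over two strings rather than transducer composition with OCR lattices. The function name `align_chars`, the operation labels, and the uniform unit costs are assumptions chosen for illustration only.

```python
def align_chars(ocr, truth, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Align an OCR string with a transcription using weighted edit operations.

    Returns (total_cost, ops). Each op is (kind, ocr_char, truth_char), where
    kind is "match", "sub", "del" (OCR character with no counterpart in the
    transcription) or "ins" (transcription character missing from the OCR).
    The costs are illustrative assumptions, not values from the paper.
    """
    n, m = len(ocr), len(truth)

    def step(i, j):
        # Cost of pairing ocr[i-1] with truth[j-1]: free if they are equal.
        return 0.0 if ocr[i - 1] == truth[j - 1] else sub_cost

    # cost[i][j] = cheapest way to align ocr[:i] with truth[:j].
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + del_cost
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + step(i, j),  # match or substitution
                cost[i - 1][j] + del_cost,        # deletion
                cost[i][j - 1] + ins_cost,        # insertion
            )

    # Trace back from the end to recover one cheapest sequence of operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + step(i, j):
            kind = "match" if ocr[i - 1] == truth[j - 1] else "sub"
            ops.append((kind, ocr[i - 1], truth[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or cost[i][j] == cost[i - 1][j] + del_cost):
            ops.append(("del", ocr[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", truth[j - 1]))
            j -= 1
    ops.reverse()
    return cost[n][m], ops
```

Calling `align_chars("Wanderungcn", "Wanderungen")` pairs every character one-to-one and reports a single substitution of `c` by `e`. In a WFST setting, the same three operations would presumably appear as weighted arcs, so that recognition confusions and edition differences can carry different weights; that weighting is not modeled in this sketch.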
© (2013) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Mayce Al Azawi, Marcus Liwicki, Thomas M. Breuel, "WFST-based ground truth alignment for difficult historical documents with text modification and layout variations", Proc. SPIE 8658, Document Recognition and Retrieval XX, 865818 (4 February 2013); doi: 10.1117/12.2003134; https://doi.org/10.1117/12.2003134
