4 February 2013 Combining multiple thresholding binarization values to improve OCR output
For noisy, historical documents, a high optical character recognition (OCR) word error rate (WER) can render the OCR text unusable. Since image binarization is often the method used to identify foreground pixels, a body of research seeks to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method incorporates information from multiple simple thresholding binarizations of the same image to improve text output. Using a new corpus of 19th-century newspaper grayscale images for which the text transcription is known, we observe WERs of 13.8% and higher using current binarization techniques and a state-of-the-art OCR engine. Our novel approach combines the OCR outputs from multiple thresholded images by aligning the text output and producing a lattice of word alternatives from which a lattice word error rate (LWER) is calculated. Our results show an LWER of 7.6% when aligning two threshold images and an LWER of 6.8% when aligning five. From the word lattice we commit to one hypothesis by applying the methods of Lund et al. (2011), achieving an improvement over the original OCR output and an 8.41% WER result on this data set.
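The pipeline described above (threshold the image at several values, run OCR on each binarization, align the outputs into a lattice of word alternatives, and score the lattice) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `binarize`, `build_lattice`, and `lattice_wer` are hypothetical, the OCR step is replaced by hard-coded word lists, the alignment uses the standard library's `difflib.SequenceMatcher` against the first hypothesis rather than the paper's alignment method, and the LWER here is assumed to be an edit distance in which a lattice position counts as correct when any of its alternatives matches the reference word.

```python
from difflib import SequenceMatcher


def binarize(gray, threshold):
    """Global thresholding: pixels darker than `threshold` become foreground (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def build_lattice(hypotheses):
    """Align every hypothesis to the first one and collect word alternatives.

    Simplification (an assumption of this sketch): only equal-length
    substitutions are merged; inserted or deleted words are ignored.
    """
    pivot = hypotheses[0]
    lattice = [{word} for word in pivot]
    for hyp in hypotheses[1:]:
        matcher = SequenceMatcher(a=pivot, b=hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                for k in range(i2 - i1):
                    lattice[i1 + k].add(hyp[j1 + k])
    return lattice


def lattice_wer(reference, lattice):
    """Word-level edit distance between the reference and the lattice,
    divided by the reference length. A substitution costs nothing when the
    reference word is among the alternatives at that lattice position."""
    n, m = len(reference), len(lattice)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] in lattice[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # reference word missing from lattice
                dist[i][j - 1] + 1,        # extra lattice position
                dist[i - 1][j - 1] + sub,  # match or substitution
            )
    return dist[n][m] / n


# Stand-ins for OCR output at two different thresholds (made-up strings).
reference = "the quick brown fox".split()
ocr_low = "the qnick brown fox".split()
ocr_high = "the quick brovvn fox".split()

lattice = build_lattice([ocr_low, ocr_high])
single_wer = lattice_wer(reference, [{w} for w in ocr_low])
combined_lwer = lattice_wer(reference, lattice)
```

In this toy case each OCR output contains a different error, so neither is correct on its own, but the lattice contains the correct word at every position. The LWER therefore acts as a lower bound on the WER that any method of committing to one hypothesis from the lattice could reach.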
© (2013) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
William B. Lund, Douglas J. Kennard, Eric K. Ringger, "Combining multiple thresholding binarization values to improve OCR output", Proc. SPIE 8658, Document Recognition and Retrieval XX, 86580R (4 February 2013); doi: 10.1117/12.2006228; https://doi.org/10.1117/12.2006228
