24 March 2014 Utilizing web data in identification and correction of OCR errors
Abstract
In this paper, we report on our experiments in detecting and correcting OCR errors with web data. More specifically, we use Google search to draw on the large volume of data available on the web and identify possible candidates for correction. We then use a combination of the Longest Common Subsequence (LCS) and Bayesian estimates to automatically pick the proper candidate. Our experimental results on a small set of historical newspaper data show a recall of 51% and a precision of 100%. The paper also provides a detailed classification and analysis of all errors. In particular, we point out where our approach fails to suggest proper candidates for the errors that remain uncorrected.
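The abstract does not say how the LCS measure and the Bayesian estimates are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the candidates and their web hit counts have already been collected (no search API is called), treats the normalized hit count as a prior, and treats the LCS length divided by the longer string's length as a stand-in for the likelihood. The names `lcs_length`, `pick_candidate`, and `candidate_counts` are hypothetical.

```python
from typing import Dict, Optional


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pick_candidate(ocr_token: str, candidate_counts: Dict[str, int]) -> Optional[str]:
    """Return the candidate with the highest prior * similarity score.

    candidate_counts maps each candidate word to an assumed hit count.
    Both the prior and the similarity term are illustrative assumptions.
    """
    total = sum(candidate_counts.values())
    if total == 0:
        return None
    best, best_score = None, 0.0
    for cand, count in candidate_counts.items():
        prior = count / total  # assumed prior: relative frequency of the candidate
        longest = max(len(ocr_token), len(cand))
        similarity = lcs_length(ocr_token, cand) / longest if longest else 0.0
        score = prior * similarity
        if score > best_score:
            best, best_score = cand, score
    return best


if __name__ == "__main__":
    # Hypothetical counts for the OCR token "tbe"
    print(pick_candidate("tbe", {"the": 900, "tube": 50, "tie": 50}))
```

In this toy run the frequent word "the" wins even though "tube" shares a longer subsequence with "tbe", because the prior dominates. A real system would need a threshold below which no correction is made, which is one way a high-precision, moderate-recall result such as the one reported could arise.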
© (2014) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Kazem Taghva, Shivam Agarwal, "Utilizing web data in identification and correction of OCR errors", Proc. SPIE 9021, Document Recognition and Retrieval XXI, 902109 (24 March 2014); doi: 10.1117/12.2042403; https://doi.org/10.1117/12.2042403
PROCEEDINGS
6 PAGES

