Measuring the impact of character recognition errors on downstream text analysis
28 January 2008
Abstract
Noise presents a serious challenge in optical character recognition, as well as in the downstream applications that take its outputs as inputs. In this paper, we describe a paradigm for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Employing a hierarchical methodology based on approximate string matching to classify errors, we isolate and analyze their cascading effects as they travel through the pipeline. We present experimental results based on injecting single errors into a large corpus of test documents to study their varying impacts, depending on the nature of the error and the character(s) involved. While most such errors are found to be localized, in the worst case some can have an amplifying effect that extends well beyond the site of the original error, degrading the performance of the end-to-end system.
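The single-error injection idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: a simple regex tokenizer stands in for the full analysis pipeline, naive positional alignment replaces the paper's approximate string matching, and all function names are illustrative. It shows the contrast the abstract describes, where a within-word substitution stays localized while a substitution that destroys a token boundary propagates far beyond the error site.

```python
import re

def tokenize(text):
    # Toy regex tokenizer standing in for the pipeline's tokenization stage:
    # runs of word characters, or single non-space punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def inject_error(text, pos, wrong_char):
    # Simulate a single OCR substitution error at a given character position.
    return text[:pos] + wrong_char + text[pos + 1:]

def token_diff(clean, noisy):
    # Crude proxy for error propagation: count positionally misaligned
    # tokens (the paper instead aligns outputs via approximate matching).
    a, b = tokenize(clean), tokenize(noisy)
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

clean = "The quick brown fox jumps over the lazy dog."

# Within-word substitution ('q' -> 'g'): only one token is affected.
local = inject_error(clean, clean.index("q"), "g")
print(token_diff(clean, local))      # localized: 1

# Substituting the space before "brown" merges two tokens, shifting
# every subsequent token out of alignment -- an amplifying error.
merged = inject_error(clean, clean.index(" brown"), "x")
print(token_diff(clean, merged))     # amplified: much larger than 1
```

Under this naive positional alignment, the merged-token case perturbs every token after the error site, which is exactly why the paper uses approximate string matching to separate genuine downstream damage from mere alignment drift.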
© 2008 Society of Photo-Optical Instrumentation Engineers (SPIE).
Daniel Lopresti, "Measuring the impact of character recognition errors on downstream text analysis", Proc. SPIE 6815, Document Recognition and Retrieval XV, 68150G (28 January 2008); https://doi.org/10.1117/12.767131