13 January 2003 Information retrieval for OCR documents: a content-based probabilistic correction model
Proceedings Volume 5010, Document Recognition and Retrieval X; (2003) https://doi.org/10.1117/12.472838
Event: Electronic Imaging 2003, Santa Clara, CA, United States
Abstract
Information retrieval for OCR documents is difficult because OCR output contains a significant number of erroneous words, while most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to “boost” retrieval performance. The basic idea is to exploit the whole content of a document to supplement the information that an existing OCR correction tool provides for word corrections. Instead of making an explicit correction decision for each erroneous word, as is typical in traditional approaches, we account for the uncertainty in such decisions and compute an estimate of the original “uncorrupted” document language model accordingly. This document language model can then be used for retrieval with a language modeling retrieval approach. Evaluation on the standard TREC test collections indicates that our method significantly improves retrieval performance compared with simple word correction approaches, such as using only the top-ranked correction.
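The idea of keeping correction uncertainty and estimating a document language model from it can be illustrated with a small sketch. This is not the authors' model, whose details the abstract does not give. It assumes a hypothetical input format (one dictionary of candidate corrections with positive scores per OCR token), uses a simple EM-style reweighting as a stand-in for the content-based estimate, and scores queries with Jelinek-Mercer smoothed query likelihood, a standard language modeling retrieval technique chosen here for illustration. The function names and the `lam` parameter are invented for this example.

```python
import math
from collections import defaultdict


def estimate_document_model(tokens, iterations=20):
    """Estimate P(word | document) from uncertain OCR corrections.

    tokens: one dict per OCR token, mapping each candidate correction to a
    positive score from the correction tool (hypothetical input format).
    """
    vocab = {w for candidates in tokens for w in candidates}
    model = {w: 1.0 / len(vocab) for w in vocab}  # uniform starting point
    for _ in range(iterations):
        expected = defaultdict(float)
        for candidates in tokens:
            # Reweight each candidate by how well the rest of the
            # document supports it, then normalize to a posterior.
            weights = {w: s * model[w] for w, s in candidates.items()}
            total = sum(weights.values())
            for w, weight in weights.items():
                expected[w] += weight / total
        # Re-estimate the document model from expected word counts.
        model = {w: expected[w] / len(tokens) for w in vocab}
    return model


def query_log_likelihood(query, doc_model, collection_model, lam=0.5):
    """Score a query with Jelinek-Mercer smoothed query likelihood."""
    return sum(
        math.log(lam * doc_model.get(w, 0.0)
                 + (1 - lam) * collection_model.get(w, 1e-9))
        for w in query
    )


# Toy document: the second token is ambiguous, and the correction tool
# ranks "interval" above "retrieval" for it.
ocr_tokens = [
    {"retrieval": 1.0},
    {"retrieval": 0.4, "interval": 0.6},
]
doc_model = estimate_document_model(ocr_tokens)
collection_model = {"retrieval": 0.01, "interval": 0.01}  # assumed values
print(doc_model)
print(query_log_likelihood(["retrieval"], doc_model, collection_model))
```

In this toy input, taking only the top-ranked correction would turn the second token into "interval", whereas the document-wide estimate shifts most of the probability mass to "retrieval" because the rest of the document supports it.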
© (2003) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Rong Jin, ChangXiang Zhai, and Alexander Hauptmann "Information retrieval for OCR documents: a content-based probabilistic correction model", Proc. SPIE 5010, Document Recognition and Retrieval X, (13 January 2003); https://doi.org/10.1117/12.472838
CITATIONS
Cited by 6 scholarly publications.
KEYWORDS
Optical character recognition
Expectation maximization algorithms
Performance modeling
Compound parabolic concentrators
Quantum wells
Systems modeling
Computer science
