28 January 2008
An OCR based approach for word spotting in Devanagari documents
Abstract
This paper describes an OCR-based technique for word spotting in printed Devanagari documents. The system accepts a Devanagari word as input and returns a sequence of word images ranked by their similarity to the input query. The methodology involves line and word separation, pre-processing of document words, word recognition using OCR, and similarity matching. We demonstrate a Block Adjacency Graph (BAG) based document cleanup in the pre-processing phase. During word recognition, multiple recognition hypotheses are generated for each document word using a font-independent Devanagari OCR. The similarity matching phase uses a cost-based model to match the word input by the user against the OCR results. Experiments are conducted on document images from the publicly available ILT and Million Book Project datasets. The technique achieves an average precision of 80% for 10 queries and 67% for 20 queries for a set of 64 documents containing 5780 word images. The paper also presents a comparison of our method with template-based word spotting techniques.
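The similarity matching phase can be illustrated with a minimal sketch. The abstract does not specify the cost model, so the example assumes a plain unit-cost Levenshtein distance computed over Unicode code points as a stand-in. The names `edit_distance` and `rank_word_images`, the image identifiers, and the sample hypotheses are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance over code points.

    Assumption: this is only a stand-in for the paper's cost-based model,
    whose actual costs are not given in the abstract.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def rank_word_images(query: str,
                     hypotheses: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Rank word images by the lowest cost among their OCR hypotheses."""
    scored = [
        (image_id, min(edit_distance(query, h) for h in hyps))
        for image_id, hyps in hypotheses.items()
        if hyps  # skip word images that produced no hypothesis
    ]
    # Lower cost means a closer match; sorted() is stable, so ties keep input order.
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical OCR output: several recognition hypotheses per word image.
    ocr_results = {
        "img_01": ["भारत", "मारत"],
        "img_02": ["भरत"],
        "img_03": ["नगर", "नगद"],
    }
    for image_id, cost in rank_word_images("भारत", ocr_results):
        print(image_id, cost)
```

Taking the minimum over all hypotheses means a word image is retrieved if any one of its recognition candidates is close to the query, which is one plausible way to benefit from multiple hypotheses. A model tuned for Devanagari would more likely operate on characters or aksharas with confusion-dependent costs rather than on raw code points.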
© (2008) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Anurag Bhardwaj, Suryaprakash Kompalli, Srirangaraj Setlur, Venu Govindaraju, "An OCR based approach for word spotting in Devanagari documents", Proc. SPIE 6815, Document Recognition and Retrieval XV, 68150O (28 January 2008); doi: 10.1117/12.767289; https://doi.org/10.1117/12.767289
PROCEEDINGS
9 PAGES