28 January 2008 An OCR based approach for word spotting in Devanagari documents
Abstract
This paper describes an OCR-based technique for word spotting in printed Devanagari documents. The system accepts a Devanagari word as input and returns a sequence of word images ranked by their similarity to the input query. The methodology involves line and word separation, pre-processing of document words, word recognition using OCR, and similarity matching. We demonstrate a Block Adjacency Graph (BAG) based document cleanup in the pre-processing phase. During word recognition, multiple recognition hypotheses are generated for each document word using a font-independent Devanagari OCR. The similarity matching phase uses a cost-based model to match the word input by a user against the OCR results. Experiments are conducted on document images from the publicly available ILT and Million Book Project datasets. The technique achieves an average precision of 80% for 10 queries and 67% for 20 queries on a set of 64 documents containing 5780 word images. The paper also presents a comparison of our method with template-based word spotting techniques.
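As a rough illustration of the similarity matching phase, the sketch below ranks word images by the lowest matching cost between the query and any of that word's OCR hypotheses. The abstract does not specify the cost model, so plain edit distance over Unicode code points with unit costs is swapped in here as an assumption; the function names `edit_cost` and `rank_word_images`, the image identifiers, and the sample hypotheses are all hypothetical.

```python
from typing import Dict, List, Tuple


def edit_cost(a: str, b: str, sub_cost: float = 1.0, indel_cost: float = 1.0) -> float:
    """Edit distance between two strings, computed over Unicode code points.

    Assumption: unit substitution and insertion/deletion costs. The paper's
    actual cost model is not given in the abstract.
    """
    # prev[j] holds the cost of turning the first i-1 symbols of a into b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                            # delete ca
                curr[j - 1] + indel_cost,                        # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]


def rank_word_images(query: str, hypotheses: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Return (image_id, cost) pairs sorted from best to worst match.

    Each word image is scored by its best (lowest-cost) OCR hypothesis;
    images with no hypotheses are skipped.
    """
    scored = [
        (image_id, min(edit_cost(query, h) for h in hyps))
        for image_id, hyps in hypotheses.items()
        if hyps
    ]
    # Ties are broken by image_id so the ordering is deterministic.
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))


# Hypothetical OCR output: several hypotheses per word image.
sample_hypotheses = {
    "w1": ["मारत", "भारत"],
    "w2": ["भरत"],
    "w3": ["नगर"],
}
ranking = rank_word_images("भारत", sample_hypotheses)
```

Taking the minimum over hypotheses means a word image is retrieved if any one of its OCR readings is close to the query, which is one plausible way to exploit multiple hypotheses; a weighted combination using OCR confidence scores would be another.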
© (2008) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Anurag Bhardwaj, Suryaprakash Kompalli, Srirangaraj Setlur, Venu Govindaraju, "An OCR based approach for word spotting in Devanagari documents", Proc. SPIE 6815, Document Recognition and Retrieval XV, 68150O (28 January 2008); doi: 10.1117/12.767289; https://doi.org/10.1117/12.767289
PROCEEDINGS
9 PAGES

