A unified approach for development of Urdu Corpus for OCR and demographic purpose
14 February 2015
Proceedings Volume 9445, Seventh International Conference on Machine Vision (ICMV 2014); 944526 (2015) https://doi.org/10.1117/12.2180903
Event: Seventh International Conference on Machine Vision (ICMV 2014), 2014, Milan, Italy
Abstract
This paper presents a methodology for developing an Urdu handwritten text image corpus and for applying corpus linguistics to OCR and information retrieval from handwritten documents. Compared with other scripts, Urdu script is somewhat complicated for data entry: entering a single character requires a combination of several keystrokes. Here, a mixed approach is proposed and demonstrated for building an Urdu corpus for OCR and demographic data collection. The demographic part of the database could be used to train a system to extract data automatically, which would help simplify the manual data processing currently involved in collecting data from input forms such as the passport, ration card, voting card, AADHAR, driving licence, Indian Railway reservation, and census forms. This would increase the participation of the Urdu language community in understanding and benefiting from Government schemes. To make the database available and applicable across a wide area of corpus linguistics, we propose a methodology for data collection, mark-up, digital transcription, and XML metadata information for benchmarking.
© (2015) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Prakash Choudhary, Neeta Nain, Mushtaq Ahmed, "A unified approach for development of Urdu Corpus for OCR and demographic purpose", Proc. SPIE 9445, Seventh International Conference on Machine Vision (ICMV 2014), 944526 (14 February 2015); https://doi.org/10.1117/12.2180903
PROCEEDINGS
5 PAGES

