This paper presents modifications to a previous character-level deciphering algorithm for OCR that enable it to handle touching characters and to tolerate mistakes made at the clustering stage. The objective of a character-level deciphering algorithm is to assign alphabetic identities to character patterns such that the character repetition pattern in an input text matches the letter repetition pattern provided by a language model. Degradation in document images usually causes touching characters and mistakes in clustering the character patterns, both of which pose difficulties for character-level deciphering algorithms. The proposed modifications tightly integrate visual constraints from characters and touching patterns with constraints from a language model to decode touching characters and to detect and reverse clustering mistakes. The result is a deciphering algorithm whose performance remains robust under image degradation.
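The matching objective stated above, aligning the repetition pattern of character clusters with the letter repetition pattern of a language model, can be illustrated with a toy sketch. This is not the algorithm of the paper: it assumes error-free clustering and no touching characters, and it reduces the language model to a small word list. The names `repetition_pattern`, `candidate_words`, and `decipher`, as well as the integer cluster IDs, are hypothetical and chosen only for illustration.

```python
def repetition_pattern(symbols):
    """Canonical repetition pattern: first distinct symbol -> 0, next -> 1, ..."""
    first_seen = {}
    return tuple(first_seen.setdefault(s, len(first_seen)) for s in symbols)


def candidate_words(cluster_word, lexicon):
    """Lexicon words whose letter repetition pattern matches the cluster word."""
    pattern = repetition_pattern(cluster_word)
    return [w for w in lexicon if repetition_pattern(w) == pattern]


def decipher(cluster_words, lexicon, mapping=None):
    """Backtracking search for a one-to-one cluster -> letter assignment.

    Assumption: every cluster ID stands for exactly one letter (no touching
    characters, no clustering mistakes), unlike the setting of the paper.
    """
    mapping = mapping or {}
    if not cluster_words:
        return mapping
    first, rest = cluster_words[0], cluster_words[1:]
    for word in candidate_words(first, lexicon):
        trial = dict(mapping)
        # Extend the assignment; reject it if a cluster would need two letters.
        consistent = all(
            trial.setdefault(c, letter) == letter
            for c, letter in zip(first, word)
        )
        # Also reject assignments where two clusters share one letter.
        if consistent and len(set(trial.values())) == len(trial):
            result = decipher(rest, lexicon, trial)
            if result is not None:
                return result
    return None


# Toy input: two words given as sequences of hypothetical cluster IDs.
lexicon = ["that", "hat", "at", "the", "cat"]
cluster_words = [(1, 2, 3, 1), (2, 3, 1)]
assignment = decipher(cluster_words, lexicon)
```

In this toy input, only "that" shares the repetition pattern of the first cluster word, which fixes the identities of clusters 1, 2, and 3; the second word is then checked for consistency against that assignment. Degraded images break the one-cluster-one-letter assumption made here, which is the difficulty the modifications in the paper address.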
Jonathan J. Hull,
"Modified character-level deciphering algorithm for OCR in degraded documents", Proc. SPIE 2422, Document Recognition II, (30 March 1995); doi: 10.1117/12.205843; https://doi.org/10.1117/12.205843