To date, most optical character recognition (OCR) systems process binary document images, and the quality of the input image strongly affects their performance. Because binarization is inherently lossy, different algorithms typically produce different binary images from the same grayscale image. The objective of this research is to study the effects of global binarization algorithms on the performance of OCR systems. Three binarization methods were examined: the best fixed threshold value for the data set, the ideal histogram method, and Otsu's algorithm. Four contemporary OCR systems and 50 hard-copy pages containing 91,649 characters were used in the experiments. The pages were digitized at 300 dpi and 8 bits/pixel, and each page was binarized at 36 different threshold values (ranging from 59 to 199 in increments of 4). The resulting 1,800 binary images were processed by all four OCR systems. All systems made approximately 40% more errors on images generated by Otsu's method than on those generated by the ideal histogram method. Two of the systems made approximately the same number of errors on images generated by the best fixed threshold value and by Otsu's method.
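To make the threshold sweep and Otsu's algorithm concrete, the following is a minimal Python sketch, not the implementation used in the study. It enumerates the 36 fixed thresholds stated above and computes an Otsu threshold from a 256-bin histogram by maximizing the between-class variance. The function names (`histogram`, `otsu_threshold`, `binarize`), the flat list-of-pixels representation, and the convention that pixels at or below the threshold become black (0) are all assumptions made for illustration.

```python
from typing import List, Sequence

# Fixed thresholds from the experiment: 59 to 199 in increments of 4.
THRESHOLDS = list(range(59, 200, 4))


def histogram(pixels: Sequence[int], levels: int = 256) -> List[int]:
    """Count how many pixels fall on each gray level (8 bits/pixel assumed)."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    return counts


def otsu_threshold(hist: Sequence[int]) -> int:
    """Return the gray level t that maximizes between-class variance.

    Class 0 holds levels <= t and class 1 holds levels > t. On ties,
    the lowest such t is returned.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in class 0 so far
    sum0 = 0    # sum of gray levels in class 0 so far
    for t, count in enumerate(hist):
        w0 += count
        sum0 += t * count
        if w0 == 0:
            continue            # class 0 is still empty
        w1 = total - w0
        if w1 == 0:
            break               # class 1 is empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(pixels: Sequence[int], t: int) -> List[int]:
    """Global binarization: 0 (black) if pixel <= t, else 1 (white)."""
    return [0 if p <= t else 1 for p in pixels]
```

With one binary image per page per threshold, `len(THRESHOLDS) * 50` gives the 1,800 images mentioned above. Note that Otsu's method picks a threshold per image from that image's own histogram, whereas the best fixed threshold is a single value chosen for the whole data set.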
Kevin O. Grover,
"Preliminary evaluation of histogram-based binarization algorithms", Proc. SPIE 2422, Document Recognition II, (30 March 1995); doi: 10.1117/12.205823; https://doi.org/10.1117/12.205823