This paper presents the limits of commercial character recognition engines (OCRs) and shows how to exceed these limits in order to meet industrial goals for document capture and coding performance. The recent integration of these OCRs into several industrial capture chains suggests that it is possible to reach electronically the same performance as human typists. After a global description of the problem and an account of the OCR limits, the paper focuses on the methodology used and details the successive steps proposed for improving the performance of an individual OCR. The first step is the individual evaluation of each OCR. This is done by comparing the OCR result with a ground truth, which highlights the defects of the OCR and makes it possible to catalogue its main errors on the documents processed. The second step increases this individual performance by combining the OCR with others. We chose to combine only two OCRs, both deemed very efficient and complementary on the same class of documents. The residual errors are handled in the last step, which proposes a list of heuristics that resolve the OCR defects case by case on the limit cases. To validate our approach, the second part of the paper presents a practical experiment aimed at reaching industrial performance. The approach has been tested within an industrial application for automatic document capture, with the target, imposed for one specific document class, of at most 1 error per 10,000 characters.
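The evaluation step described above can be illustrated with a minimal sketch. It assumes that the OCR output and the ground truth are available as plain strings and that errors are counted as character-level edit operations; the paper's own measurement protocol may differ. The function names (`edit_distance`, `error_rate`, `catalogue_errors`) and the sample strings are hypothetical and serve only as illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

# Target taken from the text: 1 error per 10,000 characters.
TARGET_ERROR_RATE = 1 / 10_000


def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance: minimal number of character insertions,
    deletions and substitutions turning `truth` into `ocr`."""
    previous = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, start=1):
        current = [i]
        for j, o in enumerate(ocr, start=1):
            cost = 0 if t == o else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def error_rate(truth: str, ocr: str) -> float:
    """Character error rate relative to the ground-truth length (assumed metric)."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)


def catalogue_errors(truth: str, ocr: str) -> Counter:
    """Count (expected, recognised) pairs for every span where the OCR
    output differs from the ground truth."""
    errors = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors[(truth[i1:i2], ocr[j1:j2])] += 1
    return errors


if __name__ == "__main__":
    truth = "Invoice 1001"   # hypothetical ground truth
    ocr = "lnvoice 1OO1"     # hypothetical OCR output
    rate = error_rate(truth, ocr)
    print(f"error rate: {rate:.4f}, target met: {rate <= TARGET_ERROR_RATE}")
    for (expected, got), count in catalogue_errors(truth, ocr).most_common():
        print(f"{expected!r} read as {got!r}: {count}")
```

Sorting the catalogue by frequency gives a first view of the dominant confusions of one OCR, which is the kind of information the later combination and heuristic steps would rely on.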