The aim of this paper is to propose a supervised segmentation approach for document flows, applied to real-world heterogeneous documents. Our algorithm treats the flow of documents as pairs of consecutive pages and studies the relationship that exists between them. First, sets of features are extracted from the pages, and we propose an approach that models each pair of pages as a single feature vector. This representation is then provided to a binary classifier, which classifies the relationship as either segmentation or continuity. In the case of segmentation, we consider that we have a complete document, and the analysis of the flow continues by starting a new document. In the case of continuity, the pair of pages is assigned to the same document and the analysis of the flow continues. If there is uncertainty about whether the relationship between the pair of pages should be classified as continuity or segmentation, a rejection is decided and the pages analyzed up to this point are considered a "fragment". This first classification already provides good results, approaching 90% on certain documents, which is high at this level of the analysis.
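As a rough illustration of this pair-wise decision scheme, here is a minimal Python sketch; the concatenation-plus-difference pairing, the random-forest classifier, and the rejection band are our own assumptions, not choices taken from the paper.

```python
# Minimal sketch of page-pair classification with rejection, assuming
# precomputed per-page feature vectors. All modelling choices are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pair_vector(feat_prev, feat_next):
    """Model a pair of consecutive pages as a single feature vector
    (here: concatenation plus absolute difference)."""
    return np.concatenate([feat_prev, feat_next, np.abs(feat_prev - feat_next)])

def train_pair_classifier(X_pairs, y_pairs):
    """y_pairs: 1 for 'segmentation', 0 for 'continuity' (our convention)."""
    return RandomForestClassifier(n_estimators=100).fit(X_pairs, y_pairs)

def segment_flow(page_features, clf, reject_band=(0.4, 0.6)):
    """Walk the flow, cutting on 'segmentation', continuing on 'continuity',
    and emitting a 'fragment' when the decision is too uncertain."""
    documents, current = [], [0]
    for i in range(1, len(page_features)):
        x = pair_vector(page_features[i - 1], page_features[i]).reshape(1, -1)
        p_segmentation = clf.predict_proba(x)[0, 1]
        if reject_band[0] < p_segmentation < reject_band[1]:
            documents.append(("fragment", current))   # rejection: uncertain pair
            current = [i]
        elif p_segmentation >= reject_band[1]:
            documents.append(("document", current))   # cut: a document is complete
            current = [i]
        else:
            current.append(i)                         # continuity: same document
    documents.append(("document", current))
    return documents
```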
In this paper, we propose a computer-assisted transcription system for old registers, handwritten in Arabic from the 19th century onwards and held in the National Archives of Tunisia (NAT). The proposed system assists the human supervisor in completing the transcription task as efficiently as possible. This assistance is provided at all recognition levels. Our system implements several approaches for the transcription of document images. It also implements an alignment method that finds mappings between the word images of a handwritten document and their respective words in its given transcription.
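The alignment step can be pictured with the following sketch, which assumes each segmented word image already carries a (possibly noisy) recognition hypothesis; the Needleman-Wunsch-style programme with an edit-distance cost is a generic stand-in, not the paper's actual method.

```python
# Hedged sketch: align word-image hypotheses to transcript words.
def edit_distance(a, b):
    """Plain character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def align_words(hypotheses, transcript, gap_cost=3):
    """Global alignment of the hypothesis sequence with the transcript words,
    using edit distance as the substitution cost."""
    n, m = len(hypotheses), len(transcript)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_cost
    for j in range(1, m + 1):
        cost[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + edit_distance(hypotheses[i - 1], transcript[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + gap_cost, cost[i][j - 1] + gap_cost)
    # Backtrack to recover the image-index -> transcript-index mapping.
    mapping, i, j = {}, n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + edit_distance(hypotheses[i - 1], transcript[j - 1]):
            mapping[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + gap_cost:
            i -= 1
        else:
            j -= 1
    return mapping
```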
This paper proposes a technique for the logical labelling of document images. It uses a decision-tree-based
approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR provides the physical
features needed by the system. Each block of text is extracted during layout analysis, and its raw physical
features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved
CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of
logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been
tested: the first uses one tree per logical element; the second uses a single tree for all the logical elements
we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage
for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested
on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about
16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.
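The second setup (a single tree for all labels, plus a rule override) might look roughly as follows; note that scikit-learn's CART tree stands in for I-CHAID here, and the label set, the feature order, and the "title" rule are purely illustrative assumptions.

```python
# Minimal sketch of tree-plus-rules labelling over block features extracted
# from ALTO. CART replaces I-CHAID; labels and rule are illustrative.
from sklearn.tree import DecisionTreeClassifier

LABELS = ["title", "author", "abstract", "body", "caption",
          "header", "footer", "reference", "affiliation"]   # 9 logical labels

def train_labeller(block_features, block_labels):
    """One tree for all logical elements (the paper's second setup);
    block_labels are assumed encoded as indices into LABELS."""
    tree = DecisionTreeClassifier(min_samples_leaf=5)
    tree.fit(block_features, block_labels)
    return tree

def label_block(tree, x, page_top_band=0.1):
    """Predict a logical label, then let a layout rule override the tree:
    e.g. a block in the top page band with the largest font is a 'title'."""
    label = LABELS[int(tree.predict([x])[0])]
    y_position, relative_font_size = x[0], x[1]   # assumed feature order
    if y_position < page_top_band and relative_font_size > 1.5:
        label = "title"                            # illustrative logical rule
    return label
```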
Recently, we have investigated the use of Arabic linguistic knowledge to improve the recognition of a wide Arabic word lexicon. A neural-linguistic approach was proposed to deal mainly with the canonical vocabulary of decomposable words derived from tri-consonantal sound roots. The basic idea is to factorize words by their roots and schemes. In this direction, we designed two neural networks, TNN_R and TNN_S, to recognize roots and schemes, respectively, from the structural primitives of words. The proposed approach achieved promising results. In this paper, we focus on how to reach better results in terms of accuracy and recognition rate. The current improvements concern the training stage in particular. They consist in 1) exploiting the order of word letters, 2) considering "sister letters" (letters sharing the same features), 3) supervising network behaviour, 4) splitting neurons to record letter occurrences, and 5) resolving observed ambiguities. With these improvements, experiments carried out on a 1,500-word vocabulary show a significant enhancement: the top-4 rate of TNN_R (resp. TNN_S) has risen from 77% to 85.8% (resp. from 65% to 97.9%). Enlarging the vocabulary from 1,000 to 1,700 words, adding 100 words each time, again confirmed the results without altering the networks' stability.
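The root/scheme factorization at the heart of this approach can be illustrated with a toy example (the networks TNN_R and TNN_S themselves are not reproduced here): a decomposable word is obtained by interleaving a tri-consonantal root into a scheme.

```python
# Toy illustration of root/scheme factorisation, with a romanised example.
def apply_scheme(root, scheme):
    """Instantiate a scheme by replacing its consonant slots C1, C2, C3
    with the three consonants of the root."""
    word = scheme
    for slot, consonant in zip(("C1", "C2", "C3"), root):
        word = word.replace(slot, consonant)
    return word

# Classic root k-t-b ("to write"):
print(apply_scheme("ktb", "C1aC2aC3a"))   # kataba  (he wrote)
print(apply_scheme("ktb", "maC1C2uC3"))   # maktub  (written)
```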
This paper presents a novel approach for multi-oriented text line extraction from historical handwritten Arabic documents. Because lines are multi-oriented and dispersed over the page, we use an image paving algorithm that progressively and locally determines the lines. The paving algorithm is initialized with a small window whose size is then corrected by extension until enough lines and connected components are found. An active contour (snake) model is used for line extraction. Once the paving is established, the orientation is determined using the Wigner-Ville distribution applied to the projection profile histogram. This local orientation is then extended to delimit the orientation in the neighborhood. Afterwards, the text lines are extracted locally in each zone, based on following the baselines and on the proximity of connected components. Finally, the connected components that overlap or touch across adjacent lines are separated; a morphological analysis of the terminal letters of Arabic words is considered here. The proposed approach has been evaluated on 100 documents, reaching a separation accuracy of about 98.6%.
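The paving and orientation steps could be sketched as follows; note that we substitute a cruder proxy (maximum projection-profile variance over candidate angles) for the Wigner-Ville distribution used in the paper, and the window-growth parameters are illustrative assumptions.

```python
# Simplified sketch of adaptive paving and local orientation estimation.
import numpy as np
from scipy.ndimage import label, rotate

def grow_window(page, x, y, size, min_components=3, step=20):
    """Extend the paving window until it holds enough connected components."""
    while True:
        window = page[y:y + size, x:x + size]
        n_components = label(window > 0)[1]
        if n_components >= min_components or size >= min(page.shape):
            return window, size
        size += step

def local_orientation(window, angles=np.arange(-45, 46, 5)):
    """Return the angle whose horizontal projection profile has the highest
    variance, i.e. the sharpest separation between text lines and gaps
    (a crude proxy for the Wigner-Ville analysis of the paper)."""
    best_angle, best_score = 0.0, -np.inf
    for angle in angles:
        rotated = rotate(window, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)          # row-wise ink histogram
        score = profile.var()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```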
This paper describes a segmentation method for continuous document flows. A document flow is a list of successively scanned pages, put into a production chain, representing several documents with no explicit separation mark between them. To separate the documents for their recognition, the content of the successive pages must be analyzed to point out the boundary pages of each document. The method proposed here is similar to the variable horizon models (VHM) or multi-grams used in speech recognition. It consists in maximizing the likelihood of the flow given all the Markov models of the constituent elements. As computing this likelihood over the whole flow is NP-complete, the solution consists in computing it over windows of reduced observations. The first results, obtained on homogeneous flows of invoices, reach more than 75% precision and 90% recall.
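The windowed likelihood maximization admits a simple dynamic-programming sketch; the function doc_loglik, which scores a page span under the constituent Markov models, is assumed rather than taken from the paper, and the programme below is a generic stand-in for VHM decoding.

```python
# Hedged sketch: choose document boundaries inside a reduced observation
# window by maximising the sum of per-document log-likelihoods.
def segment_window(pages, doc_loglik, max_doc_len=8):
    n = len(pages)
    best = [float("-inf")] * (n + 1)   # best[e]: best score of pages[:e]
    cut = [0] * (n + 1)                # cut[e]: start of the last document
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_doc_len), end):
            score = best[start] + doc_loglik(pages[start:end])
            if score > best[end]:
                best[end], cut[end] = score, start
    # Backtrack the document boundaries.
    bounds, end = [], n
    while end > 0:
        bounds.append((cut[end], end))
        end = cut[end]
    return list(reversed(bounds))
```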
This paper presents the limits of character recognition engines (commercial OCRs) and how to exceed these limits to achieve industrial goals in terms of document capture and coding performance. The recent integration of these OCRs into several industrial capture chains suggests that a solution is possible to reach, electronically, the same performance obtained by human typists. After a global description of the problems and an exposition of the OCR limits, the paper focuses on the methodology used and details the different steps proposed for improving individual performance. The first step consists in the individual evaluation of the OCRs. This is done by comparing the OCR result with a ground truth, which highlights its defects and catalogues its main errors on the processed documents. The second step increases these individual performances by combining the OCR with others. We chose to combine only two OCRs deemed highly efficient and complementary on the same class of documents. The residual errors are treated in the last step, which proposes a list of heuristics that punctually resolve the OCR defects on borderline cases. To validate our approach, we present in the second part of the paper a practical experiment to reach industrial performance. This approach has been tested in the framework of an industrial application for automatic document capture, by attempting to meet the lowest error rate imposed on one specific document class: 1 error per 10,000 characters.
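The combination step can be sketched as an align-and-vote procedure; the agreement and confidence heuristics below are generic illustrations, not the industrial rules of the paper, and per-word confidence scores are assumed to be available from both engines.

```python
# Hedged sketch: combine two OCR word streams by alignment and voting.
from difflib import SequenceMatcher

def combine_ocr(words_a, words_b, conf_a, conf_b):
    """Align the two OCR word streams, keep agreements, and fall back on the
    more confident engine when they disagree."""
    merged = []
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            merged.extend(words_a[i1:i2])                    # both engines agree
        else:
            score_a = sum(conf_a[i1:i2]) if i2 > i1 else -1.0  # disagreement zone
            score_b = sum(conf_b[j1:j2]) if j2 > j1 else -1.0
            merged.extend(words_a[i1:i2] if score_a >= score_b else words_b[j1:j2])
    return merged
```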