Online handwritten data, produced with Tablet PCs or digital pens, consists of a sequence of (x, y) points. As
the amount of data available in this form increases, algorithms for retrieval of online data are needed. Word
spotting is a common approach used for the retrieval of handwriting. However, from an information retrieval
(IR) perspective, word spotting is a primitive keyword-based matching and retrieval strategy. We propose a
framework for handwriting retrieval in which an arbitrary word spotting method is used, and a manifold
ranking algorithm is then applied to the initial retrieval scores. Experimental results on a database of more than
2,000 handwritten newswires show that our method can improve the performance of a state-of-the-art word
spotting system by more than 10%.
Proc. SPIE. 7534, Document Recognition and Retrieval XVII
KEYWORDS: Infrared imaging, Detection and tracking algorithms, Visualization, Databases, Computing systems, Image quality, Electronic imaging, Systems modeling, Current controlled current source, Data fusion
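The re-ranking idea in the abstract above (word spotting scores refined by manifold ranking) can be illustrated with a minimal sketch. It uses the standard iterative form of manifold ranking, f <- alpha * S * f + (1 - alpha) * y, where y holds the initial retrieval scores and S is the symmetrically normalised affinity matrix. The abstract does not say how document affinities are built or which parameters are used, so the affinity matrix, the value of `alpha`, the iteration count, and the function name `manifold_rank` are all illustrative assumptions, not the authors' implementation.

```python
import math


def manifold_rank(affinity, initial_scores, alpha=0.9, iterations=200):
    """Spread initial retrieval scores over a document similarity graph.

    affinity: symmetric n x n list of lists with non-negative similarities
              (hypothetical input; how it is computed is not specified here).
    initial_scores: list of n scores, e.g. from a word spotting system.
    """
    n = len(initial_scores)
    # Zero the diagonal so a document does not reinforce itself.
    w = [[0.0 if i == j else affinity[i][j] for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in w]

    # Symmetric normalisation: S = D^(-1/2) W D^(-1/2).
    s = [
        [
            w[i][j] / math.sqrt(degree[i] * degree[j])
            if degree[i] > 0 and degree[j] > 0
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]

    # Iterate f <- alpha * S f + (1 - alpha) * y for a fixed number of steps.
    f = list(initial_scores)
    for _ in range(iterations):
        f = [
            alpha * sum(s[i][j] * f[j] for j in range(n))
            + (1 - alpha) * initial_scores[i]
            for i in range(n)
        ]
    return f


# Toy graph: documents 0 and 1 are similar, as are documents 2 and 3.
toy_affinity = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.8],
    [0.0, 0.0, 0.8, 0.0],
]
# Only document 0 was matched by the (hypothetical) word spotter.
toy_scores = manifold_rank(toy_affinity, [1.0, 0.0, 0.0, 0.0])
```

In this toy setting, document 1 receives a high final score even though the initial system gave it zero, because it is strongly connected to the matched document 0. That is the kind of effect re-ranking on a similarity graph is meant to produce.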
In this work, we propose to combine two quite different approaches for retrieving handwritten documents. Our
hypothesis is that different retrieval algorithms should retrieve different sets of documents for the same query.
Therefore, significant improvements in retrieval performance can be expected from combining them. The first approach is based on
information retrieval techniques carried out on the noisy texts obtained through handwriting recognition, while
the second approach is recognition-free and uses a word spotting algorithm. Results show that for texts having
a word error rate (WER) lower than 23%, the performance obtained with the combined system is close to
the performance obtained on clean digital texts. In addition, for poorly recognized texts (WER > 52%), an
improvement of nearly 17% can be observed with respect to the best available baseline method.
Proc. SPIE. 7247, Document Recognition and Retrieval XVI
KEYWORDS: Detection and tracking algorithms, Visualization, Error analysis, Feature extraction, Signal processing, Optical character recognition, Algorithm development, Electronic imaging, Systems modeling, Current controlled current source
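One simple way to combine two retrieval systems, as the abstract above proposes, is score-level fusion. The sketch below min-max normalises each system's scores and takes a weighted sum. The abstract does not state which combination rule is used, so this particular scheme, the `weight` parameter, and the names `min_max` and `fuse` are assumptions chosen for illustration only.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(ir_scores, spotting_scores, weight=0.5):
    """Rank documents by a weighted sum of two normalised score lists.

    A document missing from one system contributes 0 for that system.
    Ties are broken by document id so the output is deterministic.
    """
    ir = min_max(ir_scores)
    ws = min_max(spotting_scores)
    docs = set(ir) | set(ws)
    fused = {
        doc: weight * ir.get(doc, 0.0) + (1 - weight) * ws.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Hypothetical scores: the two systems retrieve overlapping but different sets.
ir_scores = {"d1": 12.0, "d2": 7.0, "d3": 2.0}        # IR on recognised text
spotting_scores = {"d2": 0.9, "d3": 0.8, "d4": 0.1}   # recognition-free word spotting
ranking = fuse(ir_scores, spotting_scores)
```

Here `d2` ranks first because both systems support it, while `d4`, found only by word spotting with a low score, still enters the fused list. This mirrors the hypothesis that the two approaches retrieve different document sets.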
As new devices that accept or produce on-line documents emerge, facilities for managing these
kinds of documents, such as topic spotting, are required. This means that we should be able to perform text
categorization of on-line documents. The textual data available in on-line documents can be extracted through on-line
recognition, a process which introduces noise, i.e. errors, into the resulting text. This work reports experiments
on categorization of on-line handwritten documents based on their textual contents. We analyze the effect of the
word recognition rate on categorization performance by comparing the performance of a categorization
system over the texts obtained through on-line handwriting recognition and the same texts available as ground
truth. Two categorization algorithms (kNN and SVM) are compared in this work. A subset of the Reuters-21578
corpus consisting of more than 2000 handwritten documents has been collected for this study. Results show that
the accuracy loss is not significant, and that the precision loss is significant only for recall values of 60%-80%, depending on
the noise level.
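To make the comparison above concrete, the following sketch categorizes a clean text and a noisy version of it with a small kNN classifier over bag-of-words vectors and cosine similarity. The training sentences, the labels, the simulated recognition error, and the value of `k` are invented for illustration. The abstract does not describe the features or parameters of the kNN system, so this is not a reproduction of it.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, training, k=3):
    """Return the majority label among the k most similar training texts."""
    query = Counter(text.lower().split())
    scored = sorted(
        ((cosine(query, Counter(doc.lower().split())), label)
         for doc, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Tiny hypothetical training set with two topics.
training = [
    ("oil prices rise as crude output falls", "crude"),
    ("crude oil exports increase", "crude"),
    ("opec cuts oil production", "crude"),
    ("wheat harvest hits record", "grain"),
    ("grain exports and wheat prices fall", "grain"),
    ("corn and wheat crop forecast", "grain"),
]

clean_text = "crude oil output rises"
noisy_text = "crude oil outpul rises"  # one simulated recognition error

clean_label = knn_classify(clean_text, training)
noisy_label = knn_classify(noisy_text, training)
```

In this toy case both versions receive the same label, since the remaining correctly recognized words still carry the topic. With heavier noise the similarity to the right neighbours would shrink and the label could change.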