With the tremendous popularity of the PDF format, recognizing mathematical formulas in PDF documents has become a new
and important problem in the field of document analysis. In this paper, we present a method for identifying embedded
mathematical formulas in PDF documents, based on a Support Vector Machine (SVM). The method first segments text
lines into words, and then classifies each word into one of two classes: formula or ordinary text. Various features of
embedded formulas, including geometric layout, character content, and context, are used to build a robust and
adaptable SVM classifier. Embedded formulas are then extracted by merging the words labeled as formulas.
Experimental results show that the proposed method performs well. Furthermore, the method has been successfully
incorporated into a commercial software package for large-scale e-Book production.
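
The final extraction step, merging words labeled as formulas, can be illustrated with a minimal sketch. It assumes that each word in a text line already carries a binary label from the classifier, and that consecutive formula-labeled words are joined into a single embedded formula. The `Word` type, the `is_formula` field, and both function names are hypothetical and introduced only for illustration; the actual merging rule used by the method may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    is_formula: bool  # assumed output of the word-level classifier


def merge_labeled_words(words: List[Word]) -> List[Tuple[str, bool]]:
    """Group consecutive words of one text line that share the same label."""
    spans: List[Tuple[str, bool]] = []
    for word in words:
        if spans and spans[-1][1] == word.is_formula:
            # Same label as the previous span: extend that span.
            merged_text = spans[-1][0] + " " + word.text
            spans[-1] = (merged_text, word.is_formula)
        else:
            # Label changed (or first word): start a new span.
            spans.append((word.text, word.is_formula))
    return spans


def extract_formulas(words: List[Word]) -> List[str]:
    """Return only the merged spans labeled as formulas."""
    return [text for text, is_formula in merge_labeled_words(words) if is_formula]


# Illustrative line with hand-assigned labels (not real classifier output).
line = [
    Word("where", False),
    Word("x", True),
    Word("=", True),
    Word("y+1", True),
    Word("holds", False),
    Word("a_i", True),
]
formulas = extract_formulas(line)
```

In this sketch a single misclassified word splits a formula in two, so a practical system would likely need additional smoothing or context rules at this stage.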