The Medical Article Records System or MARS has been developed at the U.S. National Library of Medicine (NLM) for automated data entry of bibliographical information from medical journals into MEDLINE, the premier bibliographic citation database at NLM. Currently, a rule-based algorithm (called ZoneCzar) is used for labeling important bibliographical fields (title, author, affiliation, and abstract) on medical journal article page images. While rules have been created for medical journals with regular layout types, new rules have to be manually created for any input journals with arbitrary or new layout types. Therefore, it is of interest to label any journal articles independent of their layout styles. In this paper, we first describe a system (called ZoneMatch) for automated generation of crucial geometric and non-geometric features of important bibliographical fields based on string-matching and clustering techniques. The rule based algorithm is then modified to use these features to perform style-independent labeling. We then describe a performance evaluation method for quantitatively evaluating our algorithm and characterizing its error distributions. Experimental results show that the labeling performance of the rule-based algorithm is significantly improved when the generated features are used.
Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and can be modeled by a tree grammar which describes the page at the component level in terms of regions or blocks. This paper provides a detailed survey of past work on document structure analysis algorithms and summarize the limitations of past approaches. In particular, we survey past work on document physical layout representations and algorithms, document logical structure representations and algorithms, and performance evaluation of document structure analysis algorithms. In the last section, we summarize this work and point out its limitations.
Image segmentation is an important component of any document image analysis system. While many segmentation algorithms exist in the literature, very few i) allow users to specify the physical style, and ii) incorporate user-specified style information into the algorithm's objective function that is to be minimized. We describe a segmentation algorithm that models a document's physical structure as a hierarchical structure where each node describes a region of the document using a stochastic regular grammar. The exact form of the hierarchy and the stochastic language is specified by the user, while the probabilities associated with the transitions are estimated from groundtruth data. We demonstrate the segmentation algorithm on images of bilingual dictionaries.
Document page segmentation is a crucial preprocessing step in Optical Character Recognition (OCR) system. While numerous segmentation algorithms have been proposed, there is relatively less literature on comparative evaluation -- empirical or theoretical -- of these algorithms. We use the following five step methodology to quantitatively compare the performance of page segmentation algorithms: (1) First we create mutually exclusive training and test dataset with groundtruth, (2) we then select a meaningful and computable performance metric, (3) an optimization procedure is then used to automatically search for the optimal parameter values of the segmentation algorithms, (4) the segmentation algorithms are then evaluated on the test dataset, and finally (5) a statistical error analysis is performed to give the statistical significance of the experimental results. We apply this methodology to five segmentation algorithms, three of which are representative research algorithms and the rest two are well-known commercial products. The three research algorithms evaluated are: Nagy's X-Y cut, O'Gorman's Docstrum and Kise's Voronoi-diagram-based algorithm. The two commercial products evaluated are: Caere Corporation's segmentation algorithm and ScanSoft Corporation's segmentation algorithm. The evaluations are conducted on 978 images from the University of Washington III dataset. It is found that the performance of the Voronoi-based, Docstrum and Caere's segmentation algorithms are not significantly different from each other, but they are significantly better than ScanSoft's segmentation algorithm, which in turn is significantly better than the performance of the X-Y cut algorithm. Furthermore, we see that the commercial segmentation algorithms and research segmentation algorithms have comparable performances.