In this paper, a model is proposed to learn the logical structure of fixed-layout document pages by combining a support vector machine (SVM) with conditional random fields (CRF). Features related to each logical label and to the dependencies between labels are extracted from various original Portable Document Format (PDF) attributes. Both local evidence and contextual dependencies are integrated in the proposed model to achieve better logical labeling performance. With the SVM serving as a local discriminative classifier and the CRF modeling contextual correlations between adjacent fragments, the model is capable of resolving ambiguities among semantic labels. The experimental results show that CRF-based models with both tree and chain graph structures outperform the SVM model, increasing the macro-averaged F<sub>1</sub> by about 10%.
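
To illustrate how contextual dependencies can override an ambiguous local decision, the following is a minimal sketch of Viterbi decoding on a linear chain. It is not the implementation described above: the function name `viterbi`, the label names, and all scores are hypothetical, the `local_scores` merely stand in for per-fragment classifier outputs such as SVM scores, and the `transition` table stands in for learned pairwise CRF potentials.

```python
def viterbi(local_scores, transition, labels):
    """Return the label sequence with the highest total score.

    local_scores: one dict per fragment, mapping label -> local score
                  (hypothetical stand-in for per-fragment SVM outputs).
    transition:   dict mapping (previous_label, label) -> pairwise score
                  (hypothetical stand-in for chain CRF potentials).
    """
    if not local_scores:
        return []
    # best[i][y]: best score of any labeling of fragments 0..i that ends in y
    best = [{y: local_scores[0][y] for y in labels}]
    back = []  # back[i-1][y]: best predecessor label of y at fragment i
    for i in range(1, len(local_scores)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transition[(p, y)])
            scores[y] = best[-1][prev] + transition[(prev, y)] + local_scores[i][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical example: the second fragment is locally ambiguous.
labels = ["heading", "body"]
local_scores = [
    {"heading": 2.0, "body": 0.0},
    {"heading": 0.6, "body": 0.5},
    {"heading": 0.0, "body": 2.0},
]
transition = {
    ("heading", "heading"): -1.0,  # assumed: a heading rarely follows a heading
    ("heading", "body"): 0.5,
    ("body", "body"): 0.5,
    ("body", "heading"): 0.0,
}

local_only = [max(labels, key=lambda y: s[y]) for s in local_scores]
decoded = viterbi(local_scores, transition, labels)
print(local_only)  # ['heading', 'heading', 'body']
print(decoded)     # ['heading', 'body', 'body']
```

With these made-up scores, labeling each fragment independently picks `heading` for the second fragment, while decoding over the chain relabels it as `body` because the pairwise score penalizes two consecutive headings.
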
To increase the flexibility and enrich the reading experience of e-books on small portable screens, a graph-based method
is proposed to perform layout analysis on Portable Document Format (PDF) documents. Born-digital documents have
inherent advantages, such as representing text and fractional images in explicit form, which can be straightforwardly
exploited. To integrate traditional image-based document analysis with the inherent metadata provided by a PDF parser,
the page primitives, including text, image and path elements, are processed to produce text and non-text layers for
separate analysis. The graph-based method is developed at the superpixel representation level, and page text elements
corresponding to vertices are used to construct an undirected graph. The Euclidean distance between adjacent vertices is
applied in a top-down manner to cut the spanning tree formed by Kruskal’s algorithm. Edge orientation is then used in a
bottom-up manner to extract text lines from each subtree. Non-textual objects, on the other hand, are segmented by
connected component analysis. For each segmented text and non-text composite, a 13-dimensional feature vector is
extracted for labelling. The experimental results on selected pages from PDF books are presented.
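
The top-down cutting step can be sketched as follows. This is a minimal illustration under stated assumptions, not the method's actual implementation: text elements are reduced to hypothetical 2-D centre points, the graph is assumed to be complete (the abstract only says that adjacent vertices are connected), and the cut threshold `max_dist` is an invented parameter. The bottom-up extraction of text lines by edge orientation is not covered.

```python
import math


class UnionFind:
    """Disjoint-set structure used to detect cycles and to collect groups."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return False
        self.parent[ri] = rj
        return True


def kruskal_mst(points):
    """Return the spanning tree edges (distance, i, j) chosen by Kruskal's algorithm."""
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    uf = UnionFind(len(points))
    # Keep an edge only if it joins two previously separate components.
    return [(w, i, j) for w, i, j in edges if uf.union(i, j)]


def cut_tree(n, tree, max_dist):
    """Drop tree edges longer than max_dist and return the resulting vertex groups."""
    uf = UnionFind(n)
    for w, i, j in tree:
        if w <= max_dist:
            uf.union(i, j)
    groups = {}
    for v in range(n):
        groups.setdefault(uf.find(v), []).append(v)
    return sorted(groups.values())


# Hypothetical centres of five text elements lying on two distant rows.
points = [(0, 0), (10, 0), (20, 0), (0, 100), (10, 100)]
tree = kruskal_mst(points)
blocks = cut_tree(len(points), tree, max_dist=50.0)
print(blocks)  # [[0, 1, 2], [3, 4]]
```

In this toy input the only tree edge longer than the threshold is the one joining the two rows, so removing it splits the page into two groups of elements.
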
Converting PDF books to a re-flowable format has recently attracted considerable interest in the area of e-book reading.
Robust graphic segmentation is highly desirable for increasing the practicability of PDF converters. To cope with various
layouts, a multi-layer concept is introduced to segment graphic composites, including photographic images and drawings
with text insets or surrounded by text elements. Both image-based analysis and the inherent advantages of born-digital
documents are exploited in this multi-layer layout analysis method. By combining clustering of low-level page elements
in PDF documents with connected component analysis on a synthetically generated PNG image of the
document, graphic composites can be segmented in PDF documents with complex layouts. The experimental results on
graphic composite segmentation of PDF document pages show satisfactory performance.
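
Connected component analysis on a rendered page can be sketched as below. This is a minimal sketch under assumptions that the abstract does not state: the page image is taken to be already binarized into a grid of 0 (background) and 1 (foreground), 4-connectivity is used, and the function name `connected_components` and the sample grid are hypothetical.

```python
from collections import deque


def connected_components(grid):
    """Return one bounding box (top, left, bottom, right) per 4-connected region."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical binarized page fragment containing two separate objects.
page = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
boxes = connected_components(page)
print(boxes)  # [(0, 0, 1, 1), (1, 3, 3, 4)]
```

Each bounding box is a candidate graphic region that could then be matched against the clustered low-level page elements.
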