Cross-references, such as footnotes, endnotes, figure/table captions, and references, are common and useful page elements that further explain their corresponding entities in a document. In this paper, we focus on cross-reference identification in PDF documents and present a robust method, using footnotes and figure references as a case study. The proposed method first extracts footnotes and figure captions, and then matches them with their corresponding references within the document. A number of novel features of a PDF document, namely page layout, font information, and the lexical and linguistic features of cross-references, are utilized for the task. Clustering is adopted to handle features that are stable within one document but vary across document types, so that the identification process adapts to the document type. In addition, the method feeds results from the matching process back to the identification process to further improve accuracy. Preliminary experiments on real document sets show that the proposed method is promising for identifying cross-references in PDF documents.
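The extract-then-match idea can be illustrated with a minimal sketch that uses lexical cues only. It assumes plain text lines rather than PDF layout or font features, and the patterns `CAPTION_RE` and `MENTION_RE` and the function `match_figure_references` are hypothetical names introduced here, not part of the described method.

```python
import re

# Assumed lexical patterns (not from the paper): a caption starts a line with
# "Figure N" or "Fig. N" followed by "." or ":"; any other mention of the same
# label is treated as a reference to that figure.
CAPTION_RE = re.compile(r"^(?:Figure|Fig\.)\s+(\d+)[.:]\s*(.*)$")
MENTION_RE = re.compile(r"\b(?:Figure|Fig\.)\s+(\d+)\b")


def match_figure_references(lines):
    """Map each figure number to its caption text and the lines that cite it."""
    captions = {}
    for line in lines:
        m = CAPTION_RE.match(line.strip())
        if m:
            captions[int(m.group(1))] = m.group(2)

    references = {number: [] for number in captions}
    for index, line in enumerate(lines):
        if CAPTION_RE.match(line.strip()):
            continue  # a caption is not a reference to itself
        for m in MENTION_RE.finditer(line):
            number = int(m.group(1))
            if number in references:  # ignore mentions without a caption
                references[number].append(index)
    return captions, references


lines = [
    "The pipeline is summarized in Figure 1.",
    "Figure 1: Overview of the pipeline.",
    "As Fig. 2 shows, accuracy improves with feedback.",
    "Fig. 2. Accuracy with and without feedback.",
    "See Figure 3 for details.",
]
captions, references = match_figure_references(lines)
```

Here the mention of Figure 3 is left unmatched because no caption was extracted for it. Unmatched mentions of this kind are one plausible signal that a matching step could feed back to the extraction step.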
Comic page image understanding aims to analyse the layout of comic page images by automatically detecting the storyboards and identifying their reading order. It is the key technique for producing digital comic documents suitable for reading on mobile devices. In this paper, we propose a novel comic page image understanding method based on edge segment analysis. First, we propose an efficient edge point chaining method to extract Canny edge segments (i.e., contiguous chains of Canny edge points) from the input comic page image; second, we propose a top-down scheme to detect line segments within each obtained edge segment; third, we develop a novel method to detect the storyboards by selecting the border lines and then identify the reading order of these storyboards. The proposed method is evaluated on a data set of 2000 comic page images from ten printed comic series. The experimental results demonstrate that it achieves satisfactory results on different comics and outperforms existing methods.
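The abstract does not spell out the top-down line segment scheme. As a rough illustration of what a top-down scheme can look like, the sketch below uses a generic recursive split (in the style of Douglas-Peucker): a chain of edge points is kept as one line segment if every point lies close to the chord between its endpoints, and is otherwise split at the farthest point. The function names and the `tolerance` value are assumptions made for this example.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        # Degenerate chord (a == b): fall back to point-to-point distance.
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def split_into_segments(chain, tolerance=1.5):
    """Recursively split a non-empty ordered chain of edge points into straight pieces."""
    if len(chain) < 3:
        return [(chain[0], chain[-1])]
    first, last = chain[0], chain[-1]
    distances = [point_line_distance(p, first, last) for p in chain[1:-1]]
    # Index (within chain) of the interior point farthest from the chord.
    worst = max(range(len(distances)), key=distances.__getitem__) + 1
    if distances[worst - 1] <= tolerance:
        return [(first, last)]
    return (split_into_segments(chain[:worst + 1], tolerance)
            + split_into_segments(chain[worst:], tolerance))


# An L-shaped chain: along the x-axis to (10, 0), then up to (10, 10).
corner = [(x, 0) for x in range(11)] + [(10, y) for y in range(1, 11)]
segments = split_into_segments(corner)
```

For the L-shaped chain, the split happens at the corner point and yields one horizontal and one vertical segment, which is the kind of output a border-line selection step could consume.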
With the tremendous popularity of the PDF format, recognizing mathematical formulas in PDF documents has become a new
and important problem in the document analysis field. In this paper, we present a method of embedded mathematical
formula identification in PDF documents, based on Support Vector Machine (SVM). The method first segments text
lines into words, and then classifies each word as either formula or ordinary text. Various features of
embedded formulas, including geometric layout, character content, and context, are utilized to build a robust and
adaptable SVM classifier. Embedded formulas are then extracted by merging the words labeled as formulas.
Experimental results show good performance of the proposed method. Furthermore, the method has been successfully
incorporated into a commercial software package for large-scale e-Book production.
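The final merging step can be sketched as follows. This is a minimal illustration that assumes the classifier has already produced one boolean label per word; the labels below are made up for the example, and `merge_formula_words` is a hypothetical helper, not the authors' implementation.

```python
from itertools import groupby


def merge_formula_words(words, labels):
    """Join each run of consecutive formula-labeled words into one formula."""
    formulas = []
    # groupby yields maximal runs of adjacent words that share the same label.
    for is_formula, run in groupby(zip(words, labels), key=lambda pair: pair[1]):
        if is_formula:
            formulas.append(" ".join(word for word, _ in run))
    return formulas


# Hypothetical classifier output for one text line (True = formula word).
words = ["where", "x", "+", "y", "denotes", "the", "sum", "and", "z^2", "its", "square"]
labels = [False, True, True, True, False, False, False, False, True, False, False]
embedded = merge_formula_words(words, labels)
```

With these labels, the line yields two embedded formulas, `x + y` and `z^2`. A real system would likely also check geometric adjacency before merging, which this sketch omits.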
When electronic books are read on handheld devices, the content sometimes needs to be reflowed and recomposed to fit
small screens. In keeping with people's reading habits, it is reasonable to reflow the text content paragraph by
paragraph. Hence, this paper addresses this requirement and proposes a set of novel methods for paragraph recognition
for electronic books in PDF. The proposed methods consist of three steps, namely, physical structure analysis, paragraph
segmentation, and reading order detection. We make use of the locally ordered property of PDF documents and the layout
style of books to improve traditional page recognition results. In addition, we employ optimal bipartite graph matching
to detect the reading order of paragraphs. Experiments show that our methods achieve high accuracy. Notably, this
research has been applied in a commercial software package for Chinese e-book production.
Although many XML-based document formats are available for printing or publishing on the Internet, none of them is
well designed to support both high-quality printing and web publishing. Therefore, we propose a novel XML-based
document format for web publishing, called CEBX, in this paper. The proposed format is a fixed-layout document format
that supports high-quality printing, with document content organization, physical structure, and protection scheme
optimized for web publishing. CEBX documents have four noteworthy features: (1) CEBX preserves the original
fixed layout through graphic units for print quality. (2) The content of a CEBX document can be reflowed to fit the
display device based on content blocks and additional fluid information. (3) The XML Document Archiving model (XDA),
the packaging model used in CEBX, supports document linearization and incremental editing. (4) A segment-based
content protection scheme allows part of a document to be previewed directly while the remaining part is protected
effectively, so that readers need to purchase only the parts of a book that interest them. This is very helpful for
document distribution and supports flexible business models such as try-before-buy, on-demand reading, and
superdistribution.
The page body holds the central information of a page in most documents. This paper addresses the problem of automatically detecting the page body area in digital books and journals. A novel method based on font expansion and header and footer elimination is detailed. This method first extracts the body text font (BFont) and the headers and footers from a document, and then draws two page body bounding boxes for each page, one by analyzing the distribution of BFont on the page and the other by removing headers and footers from the page. Finally, the two bounding boxes are combined to obtain the resulting page body bounding box. Test results demonstrate a very high recognition rate, with precision of up to 99.49%.
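The two candidate boxes can be illustrated with a small sketch. Several things here are assumptions rather than details from the abstract: BFont is taken to be the font covering the most characters, text runs are dictionaries with hypothetical `text`, `font`, and `box` keys, coordinates use a top-left origin with boxes as `(x0, y0, x1, y1)`, and the header and footer bands are given as known y positions. How the two boxes are combined is not specified in the abstract, so that step is left out.

```python
from collections import Counter


def body_font(runs):
    """Pick the font that covers the most characters across all text runs."""
    counts = Counter()
    for run in runs:
        counts[run["font"]] += len(run["text"])
    return counts.most_common(1)[0][0]


def bounding_box(boxes):
    """Smallest (x0, y0, x1, y1) rectangle enclosing every box."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def strip_bands(page_box, header_bottom, footer_top):
    """Page rectangle with the header and footer bands cut away."""
    x0, _, x1, _ = page_box
    return (x0, header_bottom, x1, footer_top)


# Hypothetical text runs for one 600 x 800 page.
runs = [
    {"text": "Chapter 1", "font": "Sans-8", "box": (50, 20, 550, 40)},
    {"text": "Body text line one of the page.", "font": "Serif-10", "box": (60, 80, 540, 100)},
    {"text": "Body text line two of the page.", "font": "Serif-10", "box": (60, 110, 540, 130)},
    {"text": "12", "font": "Sans-8", "box": (290, 760, 310, 780)},
]
font = body_font(runs)
font_box = bounding_box([r["box"] for r in runs if r["font"] == font])
trimmed_box = strip_bands((0, 0, 600, 800), header_bottom=40, footer_top=760)
```

The first box is tight around body-font text but can miss body content set in other fonts, while the second is looser but depends on header and footer detection. That complementarity is a plausible reason for combining the two.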
In this paper, we present a hybrid approach to splitting a book document into individual chapters. We use multiple
sources of information to obtain a reliable assessment of the chapter title pages. These sources are produced by four
methods: blank space detection, font analysis, header and footer association, and table of contents (TOC) analysis.
Finally, a combination component is used to score potential chapter title pages and select the best candidates. This
approach takes full advantage of various kinds of information such as page headers and footers, layout, and keywords. It
works well even without TOC information, which is crucial for most previous similar approaches. Experiments
show that this approach is robust and reliable.
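One simple way to picture the combination component is a weighted average over whichever evidence sources are available for a page. The weights, the threshold, and the per-page scores below are hypothetical values chosen for illustration; the abstract does not state how the actual component scores candidates.

```python
# Hypothetical weights for the four evidence sources named in the abstract.
WEIGHTS = {"blank_space": 0.2, "font": 0.3, "header_footer": 0.2, "toc": 0.3}


def score_page(evidence):
    """Weighted average over the sources available for this page.

    A source whose value is None (for example, no TOC in the book) is
    skipped and the remaining weights are renormalized.
    """
    available = {k: v for k, v in evidence.items() if v is not None}
    total_weight = sum(WEIGHTS[k] for k in available)
    if total_weight == 0:
        return 0.0
    return sum(WEIGHTS[k] * v for k, v in available.items()) / total_weight


def select_title_pages(pages, threshold=0.6):
    """Return page numbers whose combined score reaches the threshold."""
    return [n for n, evidence in sorted(pages.items())
            if score_page(evidence) >= threshold]


# Made-up per-page scores in [0, 1]; this book has no TOC.
pages = {
    1: {"blank_space": 0.9, "font": 0.8, "header_footer": 1.0, "toc": None},
    2: {"blank_space": 0.1, "font": 0.0, "header_footer": 0.0, "toc": None},
    17: {"blank_space": 0.7, "font": 0.9, "header_footer": 0.5, "toc": None},
}
title_pages = select_title_pages(pages)
```

Renormalizing over the available sources lets the score stay comparable across books with and without a TOC, which matches the abstract's claim that the approach does not depend on TOC information.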