7 March 1996 Extraction of text-related features for condensing image documents
Abstract
A system has been built that selects excerpts from a scanned document for presentation as a summary, without using character recognition. The method relies on the idea that the most significant sentences in a document contain words that are both specific to the document and have a relatively high frequency of occurrence within it. Accordingly, and entirely within the image domain, each page image is deskewed and the text regions are found and extracted as a set of textblocks. Blocks with font size near the median for the document are selected and then placed in reading order. The textlines and words are segmented, and the words are placed into equivalence classes of similar shape. The sentences are identified by finding baselines for each line of text and analyzing the size and location of the connected components relative to the baseline. Scores can then be given to each word, depending on its shape and frequency of occurrence, and to each sentence, depending on the scores for the words in the sentence. Other salient features, such as textblocks that have a large font or are likely to contain an abstract, can also be used to select image parts that are likely to be thematically relevant. The method has been applied to a variety of documents, including articles scanned from magazines and technical journals.
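The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the word images have already been grouped into shape equivalence classes (represented here by integer ids), uses word-image width as a stand-in for "shape", and uses a hypothetical `min_width` threshold to discount narrow words, which are likely to be short function words. The function names and the averaging rule are likewise assumptions made for illustration.

```python
from collections import Counter


def score_sentences(sentences, class_width, min_width=40):
    """Score each sentence from the scores of its word-shape classes.

    sentences   : list of sentences, each a list of shape-class ids
    class_width : dict mapping class id -> word-image width in pixels
    min_width   : hypothetical cutoff; narrower classes score zero
    """
    # Frequency of each shape class over the whole document.
    counts = Counter(cid for sent in sentences for cid in sent)

    def word_score(cid):
        # Assumption: narrow word images are mostly short function words.
        return counts[cid] if class_width[cid] >= min_width else 0

    # Sentence score = mean word score (an assumed normalization).
    return [
        sum(word_score(cid) for cid in sent) / len(sent) if sent else 0.0
        for sent in sentences
    ]


def select_summary(sentences, class_width, k=2, min_width=40):
    """Return indices of the k best-scoring sentences, in reading order."""
    scores = score_sentences(sentences, class_width, min_width)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    # Toy document: three sentences over five shape classes.
    doc = [[1, 2, 3], [1, 4], [5, 2]]
    widths = {1: 20, 2: 80, 3: 60, 4: 50, 5: 30}
    print(score_sentences(doc, widths))
    print(select_summary(doc, widths, k=2))
```

Returning the selected indices in reading order reflects the goal of presenting excerpts as a readable summary; a real system would map those indices back to the corresponding sentence image regions.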
© (1996) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Dan S. Bloomberg, Francine R. Chen, "Extraction of text-related features for condensing image documents", Proc. SPIE 2660, Document Recognition III, (7 March 1996); doi: 10.1117/12.234726; https://doi.org/10.1117/12.234726
PROCEEDINGS
17 PAGES

