23 March 1994 Text characterization by connected component transformations
Many different scripts and languages are in common use worldwide. Finding text lines and, where present, character and word boundaries is a necessary primitive operation for most document processing applications. We have developed a method of handling text lines from several different languages that is robust in the presence of common printing and scanning artifacts. A technique is described by which information about the characteristics of a text line can be determined from a list of the connected pixel components that make up the image. This technique applies across many languages and scripts that are laid out horizontally. For text set in Roman type, the location and dimensions of each text line are augmented with the positions of the baseline and x-height. Where appropriate, coordinates of space-delimited words and individual character cells are determined. The technique incorporates a computationally inexpensive method for straightening curved lines and segmenting kerned characters, and a novel method, based on font weight and stress, for locating the boundaries of individual characters even if their images touch.
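The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of deriving text-line characteristics from a list of connected components; it is not the paper's method. It assumes each component is already reduced to a bounding box `(x0, y0, x1, y1)` with y growing downward, groups boxes into horizontal lines by vertical position, and estimates the baseline and x-height from medians of box bottoms and tops. The function names `group_into_lines` and `describe_line`, the grouping rule, and the median-based estimates are all illustrative assumptions.

```python
from statistics import median


def group_into_lines(boxes):
    """Group component bounding boxes (x0, y0, x1, y1) into horizontal lines.

    Assumed rule (not from the paper): a box joins the first existing line
    whose vertical extent contains the box's vertical centre; otherwise it
    starts a new line. Boxes are visited in order of vertical centre.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        _, y0, _, y1 = box
        centre_y = (y0 + y1) / 2
        for line in lines:
            if line["top"] <= centre_y <= line["bottom"]:
                line["boxes"].append(box)
                # Grow the line's extent to cover ascenders and descenders.
                line["top"] = min(line["top"], y0)
                line["bottom"] = max(line["bottom"], y1)
                break
        else:
            lines.append({"top": y0, "bottom": y1, "boxes": [box]})
    return lines


def describe_line(line):
    """Summarise one line: extent plus rough baseline and x-height estimates.

    Assumption: most components sit on the baseline and most have no
    ascender, so the median bottom approximates the baseline and the
    median top approximates the top of the x-height band.
    """
    boxes = line["boxes"]
    baseline = median(b[3] for b in boxes)
    x_top = median(b[1] for b in boxes)
    return {
        "left": min(b[0] for b in boxes),
        "right": max(b[2] for b in boxes),
        "top": line["top"],
        "bottom": line["bottom"],
        "baseline": baseline,
        "x_height": baseline - x_top,
    }
```

Medians are used here because a few ascenders, descenders, or noise components then have little effect on the estimates. A line that is curved or skewed would defeat this simple grouping, which is the situation the straightening step mentioned in the abstract addresses.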
© (1994) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Larry Spitz, "Text characterization by connected component transformations", Proc. SPIE 2181, Document Recognition, (23 March 1994); doi: 10.1117/12.171097; https://doi.org/10.1117/12.171097

