29 January 2007 Content-based document image retrieval in complex document collections
G. Agam, S. Argamon, O. Frieder, D. Grossman, D. Lewis
Proceedings Volume 6500, Document Recognition and Retrieval XIV; 65000S (2007) https://doi.org/10.1117/12.703163
Event: Electronic Imaging 2007, San Jose, CA, United States
Abstract
We address the problem of content-based image retrieval in the context of complex document images. Complex documents typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting) and may include diagrams, graphics, tables, and other non-textual elements. Large collections of such complex documents are commonly found in legal and security investigations. The indexing and analysis of large document collections is currently limited to textual features based on OCR data and ignores the structural context of the document as well as important non-textual elements such as signatures, logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the inherent complexity of offline handwriting recognition. We address important research issues concerning content-based document image retrieval and describe a prototype we are developing for integrated retrieval and aggregation of the diverse information contained in scanned paper documents. Such complex document information processing combines several forms of image processing with textual and linguistic processing to enable effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are developing a test collection containing millions of document images.
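The idea of integrating image-derived metadata with text search can be illustrated with a minimal sketch. It is not the authors' prototype: the record schema (`DocumentRecord`, `elements`), the `search` function, and the sample collection are all hypothetical, and simple keyword matching stands in for whatever retrieval model the real system uses.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentRecord:
    """One scanned document: OCR text plus image-derived metadata (hypothetical schema)."""
    doc_id: str
    ocr_text: str
    # Non-textual elements detected in the page image, e.g. {"signature", "logo"}.
    elements: set = field(default_factory=set)


def search(records, keywords, required_elements=()):
    """Return IDs of documents whose OCR text contains every keyword
    and whose metadata lists every required non-textual element."""
    wanted = set(required_elements)
    terms = [k.lower() for k in keywords]
    hits = []
    for rec in records:
        tokens = set(rec.ocr_text.lower().split())
        # Text condition: all keywords present; metadata condition: subset test.
        if all(t in tokens for t in terms) and wanted <= rec.elements:
            hits.append(rec.doc_id)
    return hits


# Illustrative data only; not drawn from the paper's test collection.
collection = [
    DocumentRecord("d1", "quarterly budget memo", {"signature", "table"}),
    DocumentRecord("d2", "budget memo draft", {"logo"}),
    DocumentRecord("d3", "meeting notes", {"signature"}),
]

print(search(collection, ["budget", "memo"], ["signature"]))  # ['d1']
```

Here a text-only query for "budget memo" would return both d1 and d2; adding the metadata constraint narrows the result to the signed document, which is the kind of combined query the abstract describes.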
© (2007) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis "Content-based document image retrieval in complex document collections", Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000S (29 January 2007); https://doi.org/10.1117/12.703163
CITATIONS
Cited by 6 scholarly publications and 2 patents.
KEYWORDS
Prototyping

Image processing

Optical character recognition

Image retrieval

Analytical research

Databases

Data processing
