Translator Disclaimer
16 January 2006 Complex document information processing: prototype, test collection, and evaluation
Author Affiliations +
Analysis of large collections of complex documents is an increasingly important need for numerous applications. Complex documents are documents that typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting); and include diagrams, graphics, tables and other non-textual elements. The state of the art today for a large document collection is essentially text search of OCR'd documents with no meaningful use of data found in images, signatures, logos, etc. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are also developing a roughly 42,000,000 page complex document test collection. The collection will include relevance judgments for queries at a variety of levels of detail and depending on a variety of content and structural characteristics of documents, as well as "known item" queries looking for particular documents.
© (2006) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis "Complex document information processing: prototype, test collection, and evaluation", Proc. SPIE 6067, Document Recognition and Retrieval XIII, 60670N (16 January 2006);

Back to Top