Current systems for the automatic extraction of index terms from business documents take either a rule-based
or a training-based approach. Because each approach has its own advantages and disadvantages, it seems natural to
combine them to get the best of both worlds. We present a combination method comprising the steps selection,
normalization, and combination, based on comparable scores produced during extraction. Furthermore, novel
evaluation metrics are developed to support the assessment of each step in an existing extraction system. Our
methods were evaluated on an example extraction system with three individual extractors and a corpus of 12,000
scanned business documents.
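
The three steps can be illustrated with a minimal sketch. The abstract does not specify how each step is realized, so everything here is an assumption made for illustration: selection keeps the top-scoring candidates of each extractor, normalization rescales each extractor's scores so that they sum to 1, and combination adds up the normalized scores per candidate value. The extractor names, candidate values, and the `top_k` parameter are hypothetical.

```python
from collections import defaultdict


def select(candidates, top_k=2):
    """Selection (assumed): keep the top_k highest-scoring candidates of one extractor."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]


def normalize(candidates):
    """Normalization (assumed): rescale one extractor's scores so that they sum to 1."""
    total = sum(score for _, score in candidates)
    if total == 0:
        return [(value, 0.0) for value, _ in candidates]
    return [(value, score / total) for value, score in candidates]


def combine(per_extractor):
    """Combination (assumed): sum normalized scores per value and return the best pair."""
    totals = defaultdict(float)
    for candidates in per_extractor:
        for value, score in candidates:
            totals[value] += score
    return max(totals.items(), key=lambda kv: kv[1]) if totals else None


def extract_index_term(outputs, top_k=2):
    """Run selection, normalization, and combination over all extractor outputs."""
    prepared = [normalize(select(cands, top_k)) for cands in outputs.values()]
    return combine(prepared)


# Hypothetical outputs of three extractors, each scoring on its own scale.
outputs = {
    "rule_based": [("INV-1001", 0.9), ("INV-1007", 0.3)],
    "trained_a": [("INV-1001", 40.0), ("INV-2001", 60.0)],
    "trained_b": [("INV-2001", 5.0), ("INV-1001", 15.0), ("INV-9999", 1.0)],
}
best = extract_index_term(outputs)
print(best)
```

In this toy input the extractors score on different scales, which is why the scores are made comparable before they are added; other normalization or voting schemes would fit the same structure.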
Archiving official written documents such as invoices, reminders, and account statements is becoming increasingly
important in both business and private settings. Creating appropriate index entries for document archives, such as
the sender's name, creation date, or document number, is tedious manual work. We present a novel approach to automatic
indexing of documents based on generic positional extraction of index terms. For this purpose, we use
knowledge about document templates stored in a common full-text search index to find the positions at which
index terms were successfully extracted in the past.
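
The idea of reusing past extraction positions can be sketched as follows. This is only an illustration under stated assumptions, not the system described above: a simple token-overlap score stands in for the full-text search index, positions are assumed to be stored as bounding boxes per index field, and all field names, coordinates, and helper functions are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity, used here as a stand-in for a full-text search score."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def find_template(doc_words, templates):
    """Return the stored template whose vocabulary best matches the new document."""
    tokens = [w["text"].lower() for w in doc_words]
    return max(templates, key=lambda t: jaccard(tokens, t["tokens"]), default=None)


def inside(word, box):
    """Check whether a word's anchor point lies within a bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return x0 <= word["x"] <= x1 and y0 <= word["y"] <= y1


def extract_at_positions(doc_words, template):
    """Read the text found at each position that worked for this template before."""
    return {
        field: " ".join(w["text"] for w in doc_words if inside(w, box))
        for field, box in template["positions"].items()
    }


# Hypothetical templates with positions of previously successful extractions.
templates = [
    {"tokens": ["acme", "invoice", "number", "date", "total"],
     "positions": {"document_number": (400, 90, 560, 110)}},
    {"tokens": ["globex", "reminder", "due", "amount"],
     "positions": {"document_number": (50, 200, 200, 220)}},
]

# Hypothetical OCR output of a new document: words with page coordinates.
doc_words = [
    {"text": "ACME", "x": 60, "y": 40},
    {"text": "Invoice", "x": 60, "y": 100},
    {"text": "Number", "x": 130, "y": 100},
    {"text": "INV-1001", "x": 420, "y": 100},
    {"text": "Total", "x": 60, "y": 500},
]

template = find_template(doc_words, templates)
fields = extract_at_positions(doc_words, template)
print(fields)
```

A real full-text index would replace `find_template` with a ranked query, but the flow stays the same: retrieve a similar, already indexed document and read the new document at the stored positions.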