1 April 1998 Duplicate document detection in DocBrowse
Abstract
Duplicate documents are frequently found in large databases of digital documents, such as those found in digital libraries or in the government declassification effort. Efficient duplicate document detection is important not only to allow querying for similar documents, but also to filter out redundant information in large document databases. We have designed three different algorithms to identify duplicate documents. The first algorithm is based on features extracted from the textual content of a document, the second algorithm is based on wavelet features extracted from the document image itself, and the third algorithm is a combination of the first two. These algorithms are integrated within the DocBrowse system for information retrieval from document images, which is currently under development at MathSoft. DocBrowse supports duplicate document detection by allowing (1) automatic filtering to hide duplicate documents, and (2) ad hoc querying for similar or duplicate documents. We have tested the duplicate document detection algorithms on 171 documents and found that the text-based method has an average 11-point precision of 97.7 percent, while the image-based method has an average 11-point precision of 98.9 percent. However, in general, the text-based method performs better when the document contains enough high-quality machine-printed text, while the image-based method performs better when the document contains little or no high-quality machine-readable text.
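The abstract reports results as average 11-point precision but does not say how the figure was computed. The sketch below shows the standard information-retrieval definition for a single query, on the assumption that this is the measure meant: interpolated precision is taken at the eleven recall levels 0.0, 0.1, ..., 1.0 and averaged. The function name, its arguments, and the document identifiers are hypothetical and do not come from the paper.

```python
from typing import Sequence, Set


def eleven_point_average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Standard 11-point interpolated average precision for one query.

    Assumption: this is the textbook IR measure, not necessarily the exact
    procedure used in the paper. `ranked` is the result list in rank order,
    `relevant` is the set of true duplicates for the query.
    """
    if not relevant:
        raise ValueError("need at least one relevant document")

    # Record (recall, precision) at every rank where a relevant document appears.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    # Interpolated precision at recall level r is the best precision
    # observed at any recall >= r (0.0 if that recall is never reached).
    levels = [i / 10 for i in range(11)]
    total = 0.0
    for level in levels:
        candidates = [p for r, p in points if r >= level - 1e-12]
        total += max(candidates, default=0.0)
    return total / len(levels)


# Hypothetical query: two true duplicates, retrieved at ranks 1 and 3.
score = eleven_point_average_precision(["a", "b", "c", "d"], {"a", "c"})
print(round(score, 4))
```

Averaging this per-query score over all test queries would give a single figure comparable in form to the percentages quoted above.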
© (1998) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Vikram Chalana, Andrew G. Bruce, Thien Nguyen, "Duplicate document detection in DocBrowse", Proc. SPIE 3305, Document Recognition V, (1 April 1998); doi: 10.1117/12.304630; https://doi.org/10.1117/12.304630
PROCEEDINGS
10 PAGES

