Proc. SPIE. 6072, Security, Steganography, and Watermarking of Multimedia Contents VIII
KEYWORDS: Data modeling, Sensors, Fingerprint recognition, Digital watermarking, Feature extraction, Associative arrays, Information security, Intellectual property, Steganography, Classification systems
Digital fingerprinting, watermarking, and tracking technologies have gained importance in the recent years in response to growing problems such as digital copyright infringement. While fingerprints and watermarks can be generated in many different ways, use of natural language processing for these purposes has so far been limited. Measuring similarity of literary works for automatic copyright infringement detection requires identifying and comparing creative expression of content in documents. In this paper, we present a linguistic approach to automatically fingerprinting novels based on their expression of content. We use natural language processing techniques to generate "expression fingerprints". These fingerprints consist of both syntactic and semantic elements of language, i.e., syntactic and semantic elements of expression. Our experiments indicate that syntactic and semantic elements of expression enable accurate identification of novels and their paraphrases, providing a significant improvement over techniques used in text classification literature for automatic copy recognition. We show that these elements of expression can be used to fingerprint, label, or watermark works; they represent features that are essential to the character of works and that remain fairly consistent in the works even when works are paraphrased. These features can be directly extracted from the contents of the works on demand and can be used to recognize works that would not be correctly identified either in the absence of pre-existing labels or by verbatim-copy detectors.