In previous papers, we documented success in identifying the key people of interest in a large corpus of real-world evidence. Our recent efforts focus on exploring additional domains and data sources. Internet data sources such as email, web pages, and news feeds make it easier to gather a large corpus of documents for various domains, but detecting people of interest in these sources introduces new challenges. Analyzing these massive sources magnifies entity resolution problems and demands a storage management strategy that supports efficient algorithmic analysis and visualization techniques. This paper discusses the techniques we used to analyze the ENRON email repository, which are also applicable to analyzing web pages returned from our "Buddy" meta-search engine.
Proc. SPIE. 5812, Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security 2005
KEYWORDS: Defense and security, Data mining, Human-machine interfaces, Visualization, Distance measurement, Analytical research, Optical character recognition, Algorithm development, Social networks, Information security
The challenge of identifying important individuals and their membership in a group is a continuing and ever-growing problem. In recent years, the data mining community has been identifying and discussing a new paradigm of data analysis using uni-party data. Within this paradigm, a methodology known as Link Discovery based on Correlation Analysis (LDCA) defines a process to compensate for the lack of relational data. CORAL, a specific implementation of LDCA, demonstrated the value of this methodology by identifying, with limited success, suspects involved in a Ponzi scheme. This paper introduces several new algorithms and analyzes their ability to generate a prioritized ranking of the individuals involved in the Ponzi scheme based on their individual activity. To compare the accuracy of the algorithms, we present their experimental results and conclude with a discussion of open issues and future activities.
In previous work, we introduced a new paradigm called Uni-Party Data Community Generation (UDCG) and a new methodology for discovering social groups (a.k.a. community models) called Link Discovery based on Correlation Analysis (LDCA). We further advanced this work by experimenting with a corpus of evidence obtained from a Ponzi scheme investigation. That work identified several UDCG algorithms, developed what we called "Importance Measures" to compare the accuracy of the algorithms based on ground truth, and presented a Concept of Operations (CONOPS) that criminal investigators could use to discover social groups. However, that work used a rather small random sample of manually edited documents because the evidence contained far too many OCR and other extraction errors. Deferring the evidence extraction errors allowed us to continue experimenting with UDCG algorithms, but it meant that we used only a small fraction of the available evidence. In an attempt to discover techniques that are more practical in the near term, our most recent work focuses on using an entire corpus of real-world evidence to discover social groups. This paper discusses the complications of extracting evidence, suggests a method of performing name resolution, presents a new UDCG algorithm, and discusses our future direction in this area.