25 February 1999 Discovering clusters in databases containing mixed continuous and discrete-valued attributes
Author Affiliations +
Abstract
Clustering is concerned with the unsupervised discovery of natural grouping of records in a database. Recently, considerable effort has been put into the development of effective clustering algorithms for spatial data mining. These algorithms are mainly developed to deal with continuous valued attributes. They typically use a distance measure, defined in the Euclidean space to determine if two records should be placed in the same or in disjoint clusters. Since such a distance measure cannot be validly defined in non-Euclidean space, these algorithms cannot be used to handle databases, which contain records also characterized by discrete qualitative or categorical values. Owing to the fact that there is usually a mixture of both continuous and discrete valued attributes in many real-world databases, it is important that a clustering algorithm should be developed to handle data mining tasks involving them. In this paper, we present such a clustering algorithm. The algorithm can be divided into two phases: a discretization phase and a genetic algorithm-based clustering phase. The first phase involves discretizing continuous attributes into discrete attributes, based on the use of an information theoretic measure, which minimizes loss of information during such process. These discretized and the discrete attributes are then used in the second phase. In this phase, clustering is carried out, using a genetic algorithm. By representing a specific grouping of records in a chromosome and using a weight of evidence measure as a fitness measure, to determine if such grouping is meaningful, we present here an effective GA for data clustering. To evaluate the effectiveness of the proposed techniques, we tested it using real data. The experimental results showed that it is very promising.
© (1999) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Keith C. C. Chan, Lewis L. H. Chung, "Discovering clusters in databases containing mixed continuous and discrete-valued attributes", Proc. SPIE 3695, Data Mining and Knowledge Discovery: Theory, Tools, and Technology, (25 February 1999); doi: 10.1117/12.339987; https://doi.org/10.1117/12.339987
PROCEEDINGS
9 PAGES


SHARE
Back to Top