In this paper, we propose a parallel clustering partition method based on the Davies-Bouldin index (DBI) to address the problem of determining the number of clusters for large-scale datasets. First, we calculate the dispersion of the samples within each cluster under the current K centroids. Second, following the MapReduce programming framework, we design a parallelized procedure that calculates the distance between every pair of clusters in the clustering result; this distance is measured by calculating the new center of mass formed by the data samples between different clusters. Third, for each cluster, the maximum of its similarity to all other clusters is taken as the similarity of that cluster. Finally, the similarities of all clusters are averaged to give the DBI under the current K value, which serves as the evaluation criterion for clustering performance. Experimental results on two datasets demonstrate the effectiveness and efficiency of our algorithm.
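The four steps above can be illustrated with a minimal serial sketch of the standard Davies-Bouldin index. It is not the parallel implementation proposed here: it assumes Euclidean distance, measures dispersion as the mean distance of a cluster's samples to its centroid, and takes the distance between two clusters to be the distance between their centroids. The names `centroid` and `davies_bouldin` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index for one labelling (lower is better).

    Assumes no two clusters share the same centroid; otherwise the
    centroid distance is zero and the ratio below is undefined.
    """
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    if len(clusters) < 2:
        raise ValueError("DBI needs at least two clusters")

    keys = sorted(clusters)
    centers = {k: centroid(clusters[k]) for k in keys}

    # Step 1: within-cluster dispersion = mean distance to the centroid.
    scatter = {
        k: sum(math.dist(p, centers[k]) for p in clusters[k]) / len(clusters[k])
        for k in keys
    }

    total = 0.0
    for i in keys:
        # Steps 2 and 3: similarity of cluster i to every other cluster,
        # keeping only the maximum.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in keys
            if j != i
        )

    # Step 4: average the per-cluster maxima to obtain the index.
    return total / len(keys)


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (4, 0), (4, 2)]
    print(davies_bouldin(pts, [0, 0, 1, 1]))  # compact, well-separated split
    print(davies_bouldin(pts, [0, 1, 0, 1]))  # looser split, larger index
```

To choose the number of clusters, one would evaluate this index for each candidate K and keep the K with the smallest value. In a MapReduce setting, the per-cluster sums behind the centroids and dispersions are natural candidates for the map and reduce phases, although the sketch does not attempt that decomposition.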