Abstract

The K-means algorithm is one of the most popular clustering algorithms. However, it is sensitive to the initial partition and to circular datasets. To address these problems, this paper introduces CK-means, a clustering algorithm that combines the K-means algorithm with the Canopy algorithm and is implemented with the MapReduce programming model on the Hadoop platform. The experimental results show that the CK-means algorithm has strong advantages for processing large datasets. The theoretical analysis shows that the complexity of the CK-means algorithm is of the same order of magnitude as that of the traditional algorithm. The experimental results on artificial data show that the improved algorithm outperforms the traditional algorithm in terms of speedup ratio, accuracy, and expansion rate. An experiment on real data is performed to obtain appropriate parameters.
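
To make the general idea concrete, the following is a minimal single-machine sketch of Canopy-seeded K-means. It is not the MapReduce implementation described in this paper. It assumes Euclidean distance and two thresholds with `t1 > t2`, and the function names (`canopy`, `kmeans`, `ck_means`), the threshold values, and the sample points are all hypothetical, chosen only for illustration.

```python
import math

def canopy(points, t1, t2):
    """Canopy pass (assumed form): t1 is the loose threshold, t2 the tight one."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # deterministic pick: first remaining point
        # Every point within t1 of the center belongs to this canopy.
        members = [p for p in points if math.dist(p, center) < t1]
        canopies.append((center, members))
        # Points within t2 can no longer become canopy centers.
        remaining = [p for p in remaining if math.dist(p, center) >= t2]
    return canopies

def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # clusters reflects the last assignment step that was run.
    return centers, clusters

def ck_means(points, t1, t2):
    """Use the canopy centers as the initial centers (and as k) for K-means."""
    seeds = [center for center, _ in canopy(points, t1, t2)]
    return kmeans(points, seeds)

# Hypothetical toy data: two well-separated groups.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = ck_means(sample, t1=5.0, t2=3.0)
```

In this sketch the Canopy pass determines both the number of clusters and their starting centers, so K-means no longer depends on a random initial partition. How the thresholds are chosen is left open here.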