Large-scale k-means clustering via variance reduction

作者:Zhao, Yawei*; Ming, Yuewei; Liu, Xinwang; Zhu, En; Zhao, Kaikai; Yin, Jianping
来源:Neurocomputing, 2018, 307: 184-194.
DOI:10.1016/j.neucom.2018.03.059

摘要

With the increase of the volume of data such as images in web, it is challenging to perform k-means clustering on millions or even billions of images efficiently. One of the reasons is that k-means requires to use a batch of training data to update cluster centers at every iteration, which is time-consuming. Conventionally, k-means is accelerated by using one or a mini-batch of instances to update the centers, which leads to a bad performance due to the stochastic noise. In the paper, we decrease such stochastic noise, and accelerate k-means by using variance reduction technique. Specifically, we propose a position correction mechanism to correct the drift of the cluster centers, and propose a variance reduced k-means named VRKM. Furthermore, we optimize VRKM by reducing its computational cost, and propose a new variant of the variance reduced k-means named VRKM++. Comparing with VRKM, VRKM++ does not have to compute the batch gradient, and is more efficient. Extensive empirical studies show that our methods VRKM and VRKM++ outperform the state-of-the-art method, and obtain about 2 x and 4 speedups for large-scale clustering, respectively. The source code is available at https://www.github.com/YaweiZhao/VRKM.sofia-ml.