Abstract

In this article, we focus on computational aspects of spectral clustering algorithms that have recently shown promising results in machine learning, statistics, and computer vision. These algorithms cluster n observations into groups by investigating eigenvectors of an affinity matrix or its Laplacian matrix, both of which are of size n × n. However, when the sample size is large, the computation involved in the matrix eigen-decomposition is expensive or even infeasible. To overcome this computational hurdle, subsampling techniques, such as the Nyström extension, have been used to approximate eigenvectors of large matrices. We study statistical properties of this approximation and their influence on the accuracy of various spectral clustering algorithms. We show that the perturbation of the spectrum due to subsampling can lead to a large discrepancy among clustering results. In order to provide accurate and stable results for large datasets, we propose a method to combine multiple subsamples using data spectroscopic clustering and the Nyström extension. In addition, we propose a sparse approximation of the eigenvectors to further speed up computation. Simulation and experiments on real datasets show that our approaches run quickly and provide reasonable results that are more stable across samples than the single-sample approach. This article has supplementary material online.
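The abstract's core computational idea is the Nyström extension: compute eigenvectors on an m-point subsample (m ≪ n) and extend them to all n observations, avoiding the full n × n eigen-decomposition. The sketch below is a minimal illustration of that standard technique, not the authors' specific algorithm; the Gaussian affinity, the function names, and the bandwidth parameter `gamma` are assumptions for the example.

```python
import numpy as np

def rbf_affinity(X, Y, gamma=1.0):
    # Gaussian (RBF) affinity between each row of X and each row of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_eigenvectors(X, m, k, gamma=1.0, seed=0):
    """Approximate the top-k eigenvectors of the n x n affinity matrix
    of X using an m-point subsample (Nystrom extension).

    Cost: O(n*m) affinity evaluations plus an m x m eigen-decomposition,
    instead of the full O(n^2) matrix and O(n^3) decomposition.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)   # landmark subsample
    C = rbf_affinity(X, X[idx], gamma)           # n x m cross-affinities
    W = C[idx]                                   # m x m landmark block
    vals, vecs = np.linalg.eigh(W)               # ascending eigenvalues
    vals = vals[::-1][:k]                        # top-k eigenvalues
    vecs = vecs[:, ::-1][:, :k]
    # Extend landmark eigenvectors to all n points: U ~ C V diag(1/lambda)
    U = C @ vecs / vals
    return vals * (n / m), U                     # rescaled eigenvalue estimates
```

Downstream, the columns of `U` would feed a k-means step exactly as the exact eigenvectors would in standard spectral clustering; the abstract's point is that different random subsamples `idx` perturb `U`, which motivates combining multiple subsamples.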

  • Publication date: 2012-6