Abstract

A method for document clustering based on locality preserving indexing (LPI) and support vector machines (SVM) is presented. The document space is generally high-dimensional, and clustering in such a space is often infeasible because of the curse of dimensionality. In this paper, LPI is used to project the documents into a lower-dimensional semantic space in which documents related to the same semantics lie close to each other. An SVM then maps the vectors in the semantic space, by means of a Gaussian kernel, to a high-dimensional feature space in which the minimal enclosing sphere is sought. When mapped back to the semantic space, the sphere separates into several independent components delimited by the support vectors, each enclosing a separate cluster of documents. Combining LPI and SVM yields not only higher clustering accuracy in a more effective unsupervised way, but also better generalization properties. Extensive experiments are performed on the Reuters-21578 and TDT2 data sets.
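The enclosing-sphere stage described above can be illustrated with a small sketch. It is not the implementation used in this work: it assumes the documents have already been projected into a low-dimensional space (the LPI step is omitted), it uses no slack variables and therefore does not handle outliers, and it swaps in a plain Frank-Wolfe iteration for a proper quadratic programming solver. The sphere radius is taken as the largest distance of any input point from the centre. All names (`gaussian_kernel`, `fit_sphere`, `sq_dist_to_center`, `cluster`, the kernel width `q`) and the toy data are hypothetical.

```python
import math


def gaussian_kernel(x, y, q):
    """Gaussian kernel K(x, y) = exp(-q * ||x - y||^2)."""
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, y)))


def fit_sphere(points, q, iters=500):
    """Approximate the minimal enclosing sphere in feature space.

    With a Gaussian kernel and no slack variables, the dual problem reduces
    to minimising beta^T K beta over the probability simplex. Frank-Wolfe is
    an assumption of this sketch, chosen because it needs no external solver.
    """
    n = len(points)
    K = [[gaussian_kernel(p, r, q) for r in points] for p in points]
    beta = [1.0 / n] * n
    for t in range(iters):
        grad = [2.0 * sum(K[i][j] * beta[j] for j in range(n)) for i in range(n)]
        i_min = min(range(n), key=grad.__getitem__)
        step = 2.0 / (t + 2.0)
        beta = [(1.0 - step) * b for b in beta]
        beta[i_min] += step
    # Constant term beta^T K beta of the squared distance to the centre.
    const = sum(beta[i] * K[i][j] * beta[j] for i in range(n) for j in range(n))
    return beta, const


def sq_dist_to_center(y, points, beta, const, q):
    """Squared feature-space distance from the image of y to the centre."""
    cross = sum(b * gaussian_kernel(p, y, q) for p, b in zip(points, beta))
    return 1.0 - 2.0 * cross + const  # K(y, y) = 1 for a Gaussian kernel


def cluster(points, q=1.0, samples=10):
    """Label points by connected components of the sphere's pre-image."""
    beta, const = fit_sphere(points, q)
    # Radius: the farthest input point, so every input lies inside the sphere.
    r2 = max(sq_dist_to_center(p, points, beta, const, q) for p in points)

    def connected(a, b):
        # Two points share a cluster if the segment between them stays inside.
        for k in range(1, samples + 1):
            t = k / (samples + 1)
            y = [u + t * (v - u) for u, v in zip(a, b)]
            if sq_dist_to_center(y, points, beta, const, q) > r2 + 1e-9:
                return False
        return True

    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over the connectivity graph
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and connected(points[i], points[j]):
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy data standing in for documents already projected into a 2-D space.
docs = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2),
        (10.0, 10.0), (10.2, 10.1), (10.1, 10.2), (10.2, 10.2)]
labels = cluster(docs, q=1.0)
print(labels)
```

In this sketch the kernel width `q` controls how finely the pre-image of the sphere splits: larger values tend to produce more, tighter components. The segment-sampling test is a common heuristic for deciding whether two points lie in the same component, and its cost grows quadratically with the number of points.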