Abstract

To improve the clustering quality of large collections of Extensible Markup Language (XML) documents, this paper proposes a novel XML document clustering method. First, the approach extracts hierarchical path sequences from the documents and uses them to map each document to a vector in Euclidean space. A clustering method based on particle swarm optimization (PSO) is then applied to these vectors. To improve the convergence of the algorithm, a C-means algorithm is applied in the final stage, yielding the enhanced hybrid algorithm MCPX. The advantage of MCPX is that it can escape local optima of the search space and reach a global optimum at a reasonable time cost. Experimental results show that the proposed technique achieves satisfactory clustering convergence and accuracy.
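
The pipeline summarized above (path extraction, vectorization, a PSO search, and a final C-means stage) can be illustrated with a minimal sketch. It is not the authors' MCPX implementation: the abstract does not specify the vector encoding, the fitness function, or the PSO parameters, so the path-count vectors, the sum-of-squared-distances fitness, the values of `w`, `c1`, and `c2`, and all function names below are assumptions chosen for illustration. "C-means" is read here as hard C-means (k-means-style centroid updates), which is also an assumption.

```python
import random
import xml.etree.ElementTree as ET

def extract_paths(xml_text):
    """Return the root-to-node tag path of every element in the document."""
    paths = []
    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.append(path)
        for child in node:
            walk(child, path)
    walk(ET.fromstring(xml_text), "")
    return paths

def vectorize(docs):
    """Assumed encoding: one path-count vector per document, shared vocabulary."""
    doc_paths = [extract_paths(d) for d in docs]
    vocab = sorted({p for paths in doc_paths for p in paths})
    return [[float(paths.count(p)) for p in vocab] for paths in doc_paths]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(vectors, centroids):
    """Index of the nearest centroid for each vector."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors]

def sse(vectors, centroids):
    """Assumed fitness: total squared distance to the nearest centroid."""
    return sum(min(sq_dist(v, c) for c in centroids) for v in vectors)

def pso_stage(vectors, k, n_particles=20, iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global search: each particle encodes k candidate centroids."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    pos = [[list(rng.choice(vectors)) for _ in range(k)] for _ in range(n_particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(n_particles)]
    pbest = [[c[:] for c in p] for p in pos]
    pbest_fit = [sse(vectors, p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = [c[:] for c in pbest[g]], pbest_fit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # standard velocity update: inertia + cognitive + social terms
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            fit = sse(vectors, pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = [c[:] for c in pos[i]], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = [c[:] for c in pos[i]], fit
    return gbest

def c_means_stage(vectors, centroids, iters=10):
    """Local refinement: hard C-means updates starting from the PSO result."""
    centroids = [c[:] for c in centroids]
    for _ in range(iters):
        labels = assign(vectors, centroids)
        for j in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(vectors, centroids), centroids

# Toy collection (hypothetical): two "book"-like and two "order"-like documents.
docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><author/></book>",
    "<order><item/><item/></order>",
    "<order><item/></order>",
]
vectors = vectorize(docs)
labels, centroids = c_means_stage(vectors, pso_stage(vectors, k=2))
print(labels)
```

In this sketch the PSO stage only has to land near a good set of centroids; the C-means stage then converges quickly from that starting point, which mirrors the division of labor the abstract attributes to the hybrid algorithm.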
