Abstract

With the rapid growth of Internet big data, new clustering approaches are urgently needed to handle large-scale semi-structured and unstructured text data. Existing methods have three drawbacks: the commonly used text datasets lack diversity, clustering accuracy on semi-structured and unstructured Web texts is low, and clustering efficiency cannot be guaranteed when the number of documents is very large. To address these drawbacks, a new clustering model based on swarm intelligence is proposed, called Switch (a Swarm intelligence based text clustering algorithm), which supports multiple languages, including Tibetan, Chinese, and English. The method first constructs the vector space model and obtains the feature vector set of the texts using natural language processing and data preprocessing techniques. It then initializes the parameters of the swarm-intelligence-based text clustering algorithm and lets the agents move randomly in a two-dimensional text space. Each agent computes the similarity between the text in the grid cell it currently occupies and the other texts, and uses a probability transition function to compute the probabilities of picking up and dropping texts. A multi-agent distributed architecture for clustering dynamic text streams is also proposed and applied to the swarm-intelligence-based text clustering approach. The distributed working environment of the swarm is modeled as a set of communicating software agents. Three kinds of agents are proposed: similarity calculation agents, state awareness agents, and text parsing agents.
By addressing agent state synchronization, the cost of communication between processors, and processor load balancing, the computation is partitioned into subtasks that the processors execute in a distributed fashion. In addition, the working mechanism of the proposed multi-agent distributed swarm intelligence clustering approach is introduced, together with the distributed communication scheme through which the agents communicate and collaborate to complete the text clustering task. Distributed clustering on a computer cluster is implemented with the JADE multi-agent middleware. This provides greater distributed computing power and larger memory capacity than stand-alone processing, and JADE handles the communication and cooperation among agents needed to complete text clustering efficiently. Experiments were conducted on real semi-structured Web text datasets in Tibetan, Chinese, and English. Taking Tibetan as an example, compared with k-means and with the stand-alone single-node version, the proposed distributed approach improves clustering accuracy by 12.2% and 3.8% on average and reduces time cost by 73.0% and 50.6% on average, respectively. The results also suggest that, when the number of agents is between 150 and 250 on a cluster of n nodes, the time cost of text clustering approaches 1/n of the time cost on a single stand-alone node.
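
The pick-up and drop step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the probability transition function, so the sketch assumes the classic Deneubourg-style forms used in ant-based clustering, with cosine similarity over feature vectors. The function names and the constants k1 and k2 are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def neighborhood_similarity(doc, neighbors):
    """Average similarity of `doc` to the texts near the agent's grid cell.

    Assumption: a plain average, clamped at 0; the paper may weight or
    scale this differently.
    """
    if not neighbors:
        return 0.0
    mean = sum(cosine_similarity(doc, n) for n in neighbors) / len(neighbors)
    return max(0.0, mean)


def pick_probability(f, k1=0.1):
    """Assumed pick-up rule: likely when local similarity f is low."""
    return (k1 / (k1 + f)) ** 2


def drop_probability(f, k2=0.15):
    """Assumed drop rule: likely when local similarity f is high."""
    return (f / (k2 + f)) ** 2


if __name__ == "__main__":
    doc = [1.0, 0.0, 2.0]
    similar = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0]]
    dissimilar = [[0.0, 3.0, 0.0]]
    for label, nbrs in (("similar", similar), ("dissimilar", dissimilar)):
        f = neighborhood_similarity(doc, nbrs)
        print(label, round(pick_probability(f), 3), round(drop_probability(f), 3))
```

Under these assumed rules, an agent carrying a text tends to drop it among similar texts and an unloaded agent tends to pick up a text that does not match its neighbors, which is the behavior that lets clusters form on the grid.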

Full Text