A Distributed Text Clustering Model Based on Multi-Agent

Qiao, Shao Jie; Han, Nan<sup>*</sup>; Jin, Che Qing; Gao, Yun Jun; Li, Tian Rui; Tang, Chang Jie; Kang, Jian

doi:10.11897/SP.J.1016.2018.01709

Abstract

As Internet big data grows rapidly, there is an urgent need for new clustering approaches that can handle large-scale semi-structured and unstructured text data. Existing methods have three drawbacks: the commonly used text datasets lack diversity, clustering accuracy on semi-structured and unstructured Web texts is low, and clustering efficiency cannot be guaranteed when the number of documents is very large. To address these drawbacks, a new clustering model based on swarm intelligence is proposed, called Switch (a Swarm intelligence based text clustering algorithm), which supports multiple languages, including Tibetan, Chinese, and English. The basic idea is as follows. The method first constructs the vector space model and obtains the feature vector set of the texts by applying natural language processing and data preprocessing techniques. The parameters of the swarm intelligence based text clustering algorithm are then initialized, and the agents move randomly in a two-dimensional text space. Each agent calculates the similarity between the text in the grid cell it currently occupies and other texts, and uses a probability transition function to calculate the probability of picking up or dropping a text. A distributed dynamic text stream clustering architecture based on multi-agent is also proposed and applied to the swarm intelligence based text clustering approach. The distributed working environment of the swarm intelligence is designed as a set of software agents that cooperate through communication. Three kinds of agents are proposed: similarity calculation agents, state awareness agents, and text parsing agents.
By addressing the problems of agent state synchronization, communication cost between processors, and load balancing across processors, the calculation tasks are partitioned into subtasks that the processors perform in a distributed fashion. In addition, the working mechanism of the proposed distributed swarm intelligence clustering approach based on multi-agent is introduced and the distributed communication schema is given, by which the agents communicate and collaborate with each other to complete the text clustering task. Distributed clustering on computer clusters is achieved with the multi-agent middleware JADE. Its advantages are that it offers greater distributed computing power and larger memory processing capacity than stand-alone processing, and that it uses the JADE middleware for communication and cooperation among agents so that text clustering completes efficiently. Experiments were conducted on real semi-structured Web text datasets in Tibetan, Chinese, and English. Taking Tibetan as an example, the results show that, compared with k-means and with stand-alone single-node clustering, the proposed distributed clustering approach improves clustering accuracy by 12.2% and 3.8% on average and reduces the time cost by 73.0% and 50.6% on average, respectively. The results also show that when the number of agents is between 150 and 250 on a computer cluster with n nodes, the time cost of text clustering can approach 1/n of the time cost on a stand-alone node.
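
The pick-up and drop step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the form of the probability transition function, so this sketch substitutes the Deneubourg-style response functions common in the ant-clustering literature; the constants `K_PICK` and `K_DROP`, the use of cosine similarity, and all function names are assumptions for illustration and are not taken from the paper.

```python
import math

# Hypothetical threshold constants; the paper's actual parameters are not given in the abstract.
K_PICK = 0.1
K_DROP = 0.15


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def neighborhood_similarity(doc, neighbors):
    """Average similarity of one text vector to the texts in nearby grid cells."""
    if not neighbors:
        return 0.0
    return sum(cosine_similarity(doc, n) for n in neighbors) / len(neighbors)


def pick_probability(f):
    """Assumed form: an agent is likely to pick up a text that is unlike its neighbors."""
    return (K_PICK / (K_PICK + f)) ** 2


def drop_probability(f):
    """Assumed form: an agent is likely to drop a text among similar neighbors."""
    return (f / (K_DROP + f)) ** 2


if __name__ == "__main__":
    doc = [1.0, 0.0]
    neighbors = [[1.0, 0.0], [0.0, 1.0]]
    f = neighborhood_similarity(doc, neighbors)
    print(f, pick_probability(f), drop_probability(f))
```

Under these assumed functions, an isolated text (similarity 0) is always picked up and never dropped, while a text surrounded by near-identical texts is rarely picked up, which is what drives similar texts to accumulate in the same region of the grid.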

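Publication date: 2018
Affiliations: Southwest Jiaotong University; Sichuan University; Zhejiang University; East China Normal University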

Last updated: 2023-10-18 05:24