Abstract

With the rise of microblogging, topic detection in microblog posts has become an active research area in natural language processing and text mining. Unlike regular text, a microblog post is short and informal. Each post carries little information, which makes topic detection challenging. To address this problem, a new single-pass algorithm based on a double-vector model (DVM), called SinglePass-DM, is proposed. First, a support vector machine (SVM) based classifier is employed to filter out irrelevant posts, thereby improving the accuracy of the algorithm. For the representation model, a DVM comprising an event vector and a keyword vector is built on top of the traditional vector space model. Next, a combination of Jaccard, cosine, and semantic similarity is used for similarity computation. Finally, some structural characteristics of microblog posts are exploited to support topic detection. To validate the proposed algorithm, experiments are conducted on a real-world dataset. The experimental results show that SinglePass-DM performs substantially better than three benchmark algorithms: SinglePass, Agglomerative Hierarchical Clustering (AHC), and Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
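The single-pass step with a two-vector post representation can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the blend weight `alpha`, the `threshold`, the helper `make_post`, and the choice to apply Jaccard to event terms and cosine to keyword terms are all assumptions made for illustration. The semantic similarity term, the SVM filter, and the structural features are omitted.

```python
import math
from collections import Counter


def make_post(event_terms, keyword_terms):
    """Hypothetical double-vector post: an event term set plus keyword counts."""
    return {"event": set(event_terms), "keywords": Counter(keyword_terms)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity of two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(post, topic, alpha=0.5):
    """Assumed blend: Jaccard on event terms, cosine on keyword terms."""
    return (alpha * jaccard(post["event"], topic["event"])
            + (1 - alpha) * cosine(post["keywords"], topic["keywords"]))


def single_pass(posts, threshold=0.3):
    """Scan posts once; join the most similar topic or open a new one."""
    topics, labels = [], []
    for post in posts:
        scores = [similarity(post, topic) for topic in topics]
        best = max(range(len(topics)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            # Merge the post into the matched topic's two vectors.
            topics[best]["event"] |= post["event"]
            topics[best]["keywords"] += post["keywords"]
            labels.append(best)
        else:
            topics.append(make_post(post["event"], post["keywords"]))
            labels.append(len(topics) - 1)
    return labels


posts = [
    make_post(["earthquake", "sichuan"], ["earthquake", "rescue", "sichuan"]),
    make_post(["earthquake", "sichuan"], ["earthquake", "rescue", "donation"]),
    make_post(["concert", "beijing"], ["concert", "tickets", "singer"]),
]
labels = single_pass(posts)
```

In this toy run the first two posts share event terms and most keywords, so they fall into one topic, while the third opens a new topic. Because single-pass clustering is order-dependent, a different post order or threshold may yield different topics.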

  • Publication date: 2016
