摘要

In recent years, microblogging sites like Twitter have become an important and popular source for real-time information and news dissemination, and they have become a prime target of spammers inevitably. A series of incidents have shown that the security threats caused by Twitter spam can reach far beyond the social media platform to impact the real world. To mitigate the threat, a lot of recent studies apply machine learning techniques to classify Twitter spam and promising results are reported. However, most of these studies overlook the class imbalance problem in real-world Twitter data. In this paper, we experimentally demonstrate that the unequal distribution between spam and non-spam classes has a great impact on spam detection rate. To address the problem, we propose FOS, a fuzzy-based oversampling method that generates synthetic data samples from limited observed samples based on the idea of fuzzy-based information decomposition. Moreover, we develop an ensemble learning approach that learns more accurate classifiers from imbalanced data in three steps. In the first step, the class distribution in the imbalanced data set is adjusted by using various strategies, including random oversampling, random undersampling and FOS. In the second step, a classification model is built upon each of the redistributed data sets. In the final step, a majority voting scheme is introduced to combine the predictions from all the classification models. We conduct experiments on real-world Twitter data for the purpose of evaluation. The results indicate that the proposed learning approach can significantly improve the spam detection rate in data sets with imbalanced class distribution.

  • 出版日期2017-8