Abstract

Objectives: Social media has become an increasingly prominent source of timely, large-scale information since the advent of Web 2.0. Multiple studies have illustrated the use of social media information to discover biomedical and health-related knowledge. Most methods proposed in the literature employ traditional document classification techniques that represent a document as a bag of words. These techniques work well when documents are rich in text and conform to standard English; however, they are not optimal for social media data, where sparsity and noise are the norm. This paper aims to address the limitations of traditional bag-of-words methods and proposes to use heterogeneous features in combination with ensemble machine learning techniques to discover health-related information. Such a capability could prove useful to multiple biomedical applications, especially those that need to discover health-related knowledge in large-scale social media data. Furthermore, the proposed methodology could be generalized to discover different types of information in various kinds of textual data.

Methodology: Social media data is characterized by an abundance of short, socially oriented messages that do not conform to standard language, either grammatically or syntactically. The problem of discovering health-related knowledge in social media data streams is cast as a text classification problem, in which a text is labeled positive if it is health-related and negative otherwise. We first identify the limitations of traditional methods that train classifiers on N-gram word features, then propose to overcome these limitations by combining multiple machine learning classifiers, each of which is trained to learn a semantically different aspect of the data. The parameter analysis for tuning each classifier is also reported.

Data sets: Three data sets are used in this research. The first data set comprises approximately 5000 hand-labeled tweets; it is used for cross validation of the classification models in the small scale experiment and for training the classifiers in the real-world large scale experiment. The second data set is a random sample of real-world Twitter data in the US. The third data set is a random sample of real-world Facebook Timeline posts.

Evaluations: Two sets of evaluations, small scale and large scale, are conducted to investigate the proposed model's ability to discover health-related information in the social media domain. The small scale evaluation employs 10-fold cross validation on the labeled data; it aims to tune the parameters of the proposed models and to compare them with the state-of-the-art method. The large scale evaluation tests the trained classification models on the native, real-world data sets, and is needed to verify the ability of the proposed model to handle the massive heterogeneity of real-world social media.

Findings: The small scale experiment reveals that the proposed method is able to mitigate the limitations of the well-established techniques in the literature, resulting in a performance improvement of 18.61% (F-measure). The large scale experiment further reveals that the baseline fails to perform well on larger data with higher degrees of heterogeneity, while the proposed method yields reasonably good performance and outperforms the baseline by 46.62% (F-measure) on average.
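The ensemble idea described under Methodology can be illustrated with a minimal sketch. The abstract does not specify the base classifiers, the feature types, or the combination rule, so everything below is an assumption made for illustration: three small Naive Bayes classifiers, each trained on a different feature view (word unigrams, word bigrams, character trigrams, which are purely lexical stand-ins for the paper's semantically different aspects), combined by strict majority vote and scored with the F-measure. The four training messages are hypothetical toy data, not taken from the paper's data sets.

```python
import math
from collections import Counter

# Three hypothetical feature views of the same short message.
def word_unigrams(text):
    return text.lower().split()

def word_bigrams(text):
    words = text.lower().split()
    return [a + " " + b for a, b in zip(words, words[1:])]

def char_trigrams(text):
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over one feature view."""

    def __init__(self, extract):
        self.extract = extract

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(self.extract(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def log_odds(self, text):
        # Log-odds of the positive (health-related) class; unseen features are skipped.
        size = len(self.vocab)
        total = {c: sum(self.counts[c].values()) for c in (0, 1)}
        score = math.log(self.docs[1] / self.docs[0])
        for feature in self.extract(text):
            if feature in self.vocab:
                p1 = (self.counts[1][feature] + 1) / (total[1] + size)
                p0 = (self.counts[0][feature] + 1) / (total[0] + size)
                score += math.log(p1 / p0)
        return score

    def predict(self, text):
        return 1 if self.log_odds(text) > 0 else 0

def ensemble_predict(classifiers, text):
    # Strict majority vote: positive only if more than half of the members agree.
    votes = sum(clf.predict(text) for clf in classifiers)
    return 1 if 2 * votes > len(classifiers) else 0

def f_measure(gold, predicted):
    # Balanced F-measure (harmonic mean of precision and recall) for the positive class.
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy data: 1 = health-related, 0 = not health-related.
train_texts = [
    "i have the flu and a fever",
    "my headache is killing me",
    "great game last night",
    "love this new song",
]
train_labels = [1, 1, 0, 0]

classifiers = [
    NaiveBayes(view).fit(train_texts, train_labels)
    for view in (word_unigrams, word_bigrams, char_trigrams)
]

if __name__ == "__main__":
    for message in ("i have a fever", "great song last night"):
        print(message, "->", ensemble_predict(classifiers, message))
```

An odd number of members keeps the strict majority vote free of ties. In a setting like the one the abstract describes, the toy training list would be replaced by the hand-labeled tweets, and `f_measure` would be computed per fold of the 10-fold cross validation.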

  • Publication date: 2014-6