A Quantitative Analysis of the Temporal Effects on Automatic Text Classification

Authors: Salles Thiago*; Rocha Leonardo; Goncalves Marcos Andre; Almeida Jussara M; Mourao Fernando; Meira Wagner Jr; Viegas Felipe
Source: Journal of the Association for Information Science and Technology, 2016, 67(7): 1639-1667.
DOI: 10.1002/asi.23452

Abstract

Automatic text classification (TC) continues to be a relevant research topic and several TC algorithms have been proposed. However, the majority of TC algorithms assume that the underlying data distribution does not change over time. In this work, we are concerned with the challenges imposed by the temporal dynamics observed in textual data sets. We provide evidence of the existence of temporal effects in three textual data sets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes. We then quantify, using a series of full factorial design experiments, the impact of these effects on four well-known TC algorithms. We show that these temporal effects affect each analyzed data set differently and that they restrict the performance of each considered TC algorithm to different extents. The reported quantitative analyses, which are the original contributions of this article, provide valuable new insights to better understand the behavior of TC algorithms when faced with nonstatic (temporal) data distributions and highlight important requirements for the proposal of more accurate classification models.

  • Publication date: 2016-7