A Term Weighting Scheme Based on the Measure of Relevance and Distinction for Text Categorization

Authors: Yang Jieming*; Wang Jing; Liu Zhiying; Qu Zhaoyang
Source: 16th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2015-06-01 to 2015-06-03.

Abstract

Feature selection is often considered a key step in text categorization. In this paper, we propose a new feature selection algorithm, named AD, which comprehensively measures the degree of relevance and distinction of the terms that occur in a document set. We evaluate AD on three benchmark document collections (20-Newsgroups, Reuters-21578 and WebKB) using two classification algorithms, Naive Bayes and Support Vector Machines. The experimental results, which compare AD with six classic feature selection algorithms, show that AD significantly outperforms Information Gain (IG), Mutual Information (MI), Odds Ratio (OR), the DIA association factor (DIA), Orthogonal Centroid Feature Selection (OCFS) and Ambiguity Measure (AM) with both the Naive Bayes and the Support Vector Machines classifiers.