A Text Categorization Method using Extended Vector Space Model by Frequent Term Sets

Yuan, Man; Ouyang, Yuan Xin<sup>*</sup>; Xiong, Zhang

Abstract

Text categorization is one of the most important research topics in Natural Language Processing and Information Retrieval, owing to the ever-increasing volume of electronic documents. This paper presents a new text categorization method using frequent term sets. A novel constraint measure, AD-Sup, is introduced to extract discriminative features from frequent term sets for the classification task. Text documents are then represented in a global feature space that contains both single terms and frequent term sets. To address the sparse instance problem, a term weighting strategy is implemented that assigns estimated weights based on feature similarity and substantially reduces the sparsity rate. The optimal proportion of single-term features to frequent term set features is determined empirically through extensive experiments. Classification results on the Reuters-21578 and WebKB corpora demonstrate that the AD-Sup constraint is effective at extracting useful frequent features, and that the combination strategy builds a better feature space and improves the SVM classifier.
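The idea of a feature space that holds both single terms and frequent term sets can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define AD-Sup or the similarity-based weighting, so both are omitted here. As stated assumptions, term sets are limited to pairs, a plain minimum document support stands in for the AD-Sup constraint, and the function names (`mine_frequent_pairs`, `build_feature_space`, `vectorize`) and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_pairs(docs, min_support):
    """Return term pairs that co-occur in at least `min_support` documents.

    Assumption: plain document support replaces the paper's AD-Sup measure,
    which the abstract does not define.
    """
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc))          # each pair counted once per document
        counts.update(combinations(terms, 2))
    return sorted(pair for pair, n in counts.items() if n >= min_support)


def build_feature_space(docs, min_support):
    """Global feature space: all single terms followed by frequent term pairs."""
    single_terms = sorted({term for doc in docs for term in doc})
    term_sets = mine_frequent_pairs(docs, min_support)
    return single_terms, term_sets


def vectorize(doc, single_terms, term_sets):
    """Term counts for single terms, then 0/1 indicators for term sets.

    A term set is marked present only if all of its terms occur in the document.
    """
    counts = Counter(doc)
    present = set(doc)
    single_part = [counts[term] for term in single_terms]
    set_part = [1 if set(ts) <= present else 0 for ts in term_sets]
    return single_part + set_part


# Toy tokenized documents (hypothetical data, not from the paper's corpora).
docs = [
    ["oil", "price", "market"],
    ["oil", "price", "export"],
    ["course", "student", "exam"],
]
single_terms, term_sets = build_feature_space(docs, min_support=2)
vectors = [vectorize(doc, single_terms, term_sets) for doc in docs]
```

In this sketch the term set features are binary, so many documents still receive zeros in those positions; that sparsity is the problem the paper's estimated-weight strategy targets. The resulting vectors could be passed to any SVM implementation.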

Publication date: 2013-1
Affiliation: Beihang University; Shenzhen Research Institute of Beihang University

Cited by: 9

