A Self-enriching Methodology for Clustering Narrow Domain Short Texts

Pinto David<sup>*</sup>; Rosso Paolo; Jimenez Salazar Hector

doi:10.1093/comjnl/bxq069

Abstract

Clustering narrow domain short texts is considered a complex task because of two intrinsic features of the corpus to be clustered: (i) the low frequencies of vocabulary terms in short texts, and (ii) the high vocabulary overlap associated with narrow domains. The aim of this paper is to introduce a self-term expansion methodology for improving the performance of clustering methods when dealing with corpora of this kind. This methodology enriches raw textual data by adding co-related terms from an automatically constructed lexical knowledge resource obtained from the same target data set (and not from an external resource). We also propose a set of supervised and unsupervised text assessment measures for evaluating different corpus features, such as shortness, stylometry and domain broadness. With the help of these measures, we may determine beforehand whether or not to use the methodology proposed in this paper. Finally, we integrate all these assessment measures in a freely available web-based system named Watermarking Corpora On-line System, which computer scientists may use to evaluate the different features associated with a given textual corpus.
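The self-term expansion idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how term relatedness is measured, so this example assumes plain document-level co-occurrence counts with a threshold as a stand-in. The function names `build_cooccurrence_resource` and `self_expand`, the `min_count` parameter, and the toy corpus are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_resource(docs, min_count=2):
    """Map each term to the terms it co-occurs with in at least min_count documents.

    Assumption: raw co-occurrence counts stand in for the relatedness
    measure, which the abstract does not specify.
    """
    pair_counts = Counter()
    for tokens in docs:
        # Count each unordered term pair once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1

    related = defaultdict(set)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            related[a].add(b)
            related[b].add(a)
    return related


def self_expand(docs, related):
    """Append to each document the related terms of its own tokens."""
    expanded = []
    for tokens in docs:
        extra = set()
        for t in tokens:
            extra |= related.get(t, set())
        # Keep the original tokens and add only terms not already present.
        expanded.append(list(tokens) + sorted(extra - set(tokens)))
    return expanded


# Toy corpus (hypothetical): short, narrow-domain "documents" as token lists.
docs = [
    ["protein", "folding", "simulation"],
    ["protein", "folding", "energy"],
    ["energy", "landscape", "folding"],
    ["protein", "structure"],
]

# The lexical resource is built from the same corpus that is later expanded.
resource = build_cooccurrence_resource(docs, min_count=2)
expanded_docs = self_expand(docs, resource)
```

The point of the sketch is that the resource comes from the target corpus itself, so no external thesaurus is needed. The expanded token lists could then be passed to any clustering method in place of the raw texts.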

Publication date: 2011-7

