A segment-based approach to clustering multi-topic documents

Tagarelli Andrea*; Karypis George

doi:10.1007/s10115-012-0556-z

Abstract

Document clustering has been recognized as a central problem in text data management. Such a problem becomes particularly challenging when document contents are characterized by subtopical discussions that are not necessarily relevant to each other. Existing methods for document clustering have traditionally assumed that a document is an indivisible unit for text representation and similarity computation, which may not be appropriate to handle documents with multiple topics. In this paper, we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments that are coherent with respect to the underlying subtopics. We propose a novel document clustering framework that is designed to induce a document organization from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our segment-based approach on large collections of multi-topic documents, and we compare it to conventional methods for document clustering.

Publication date: 2013-3
