摘要

In recent years, algebraic methods, more precisely matrix decomposition approaches, have become a key tool for tackling document summarization problem. Typical algebraic methods used in multi-document summarization (MDS) vary from soft and hard clustering approaches to low-rank approximations. In this paper, we present a novel summarization method AASum which employs the archetypal analysis for generic MDS. Archetypal analysis (AA) is a promising unsupervised learning tool able to completely assemble the advantages of clustering and the flexibility of matrix factorization. In document summarization, given a content-graph data matrix representation of a set of documents, positively and/or negatively salient sentences are values on the data set boundary. These extreme values, archetypes, can be computed using AA. While each sentence in a data set is estimated as a mixture of archetypal sentences, the archetypes themselves are restricted to being sparse mixtures, i.e., convex combinations of the original sentences. Since AA in this way readily offers soft clustering, we suggest to consider it as a method for simultaneous sentence clustering and ranking. Another important argument in favor of using AA in MDS is that in contrast to other factorization methods, which extract prototypical, characteristic, even basic sentences, AA selects distinct (archetypal) sentences and thus induces variability and diversity in produced summaries. Experimental results on the DUC generic summarization data sets evidence the improvement of the proposed approach over the other closely related methods.

  • 出版日期2014-12