Detecting near-duplicate documents using paragraph features

Authors: Wang Haitao*; Liu Shufen; Jia Zongpu
Source: Journal of Computational Information Systems, 2015, 11(4): 1295-1302.
DOI: 10.12733/jcis13338

Abstract

To improve search efficiency and user satisfaction on massive data sets, we present a novel weight-based approach to detecting duplicate documents in a large database. The approach consists of three parts: selecting the long sentences of each paragraph, obtaining the feature set of the paragraph by calculating the weight value of each sentence, and generating the fingerprint of the paragraph. After these steps, the similarity degree is calculated with the proposed formula, and the near-duplicate documents in the given database can be detected. To demonstrate the feasibility of the approach, we use the Sogou news data sets as the test object to examine how several factors affect the precision and recall of the algorithm. We then compare our approach with other algorithms in terms of precision, recall, and execution time under relatively ideal parameter values. The results show that our approach is effective and feasible for near-duplicate document detection.
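The three steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the weight formula, the length threshold, or the similarity formula, so everything concrete here is an assumption made for illustration: the sentence weight is a simple sum of term frequencies, the fingerprint is an MD5 digest, the similarity is the Jaccard overlap of paragraph fingerprints, and the names `min_len`, `top_k`, `paragraph_fingerprint`, and `similarity` are hypothetical. It is not the authors' implementation.

```python
import hashlib
import re
from collections import Counter


def tokenize(text):
    # Crude word tokenizer; Chinese text would need a real word segmenter.
    return re.findall(r"\w+", text.lower())


def split_sentences(paragraph):
    # Split on common Chinese and English sentence terminators.
    parts = re.split(r"[。！？!?.]+", paragraph)
    return [p.strip() for p in parts if p.strip()]


def sentence_weight(sentence, term_counts):
    # ASSUMPTION: stand-in weight, the summed document frequency of the
    # sentence's terms. The paper's actual weight formula is not in the abstract.
    return sum(term_counts[w] for w in tokenize(sentence))


def paragraph_fingerprint(paragraph, term_counts, min_len=20, top_k=2):
    sentences = split_sentences(paragraph)
    # Step 1: keep only the long sentences (min_len is a hypothetical threshold).
    long_sentences = [s for s in sentences if len(s) >= min_len] or sentences
    # Step 2: the feature set is the top_k heaviest long sentences.
    ranked = sorted(long_sentences,
                    key=lambda s: (-sentence_weight(s, term_counts), s))
    features = sorted(ranked[:top_k])
    # Step 3: hash the feature set into a single paragraph fingerprint.
    return hashlib.md5("\n".join(features).encode("utf-8")).hexdigest()


def document_fingerprints(text, min_len=20, top_k=2):
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    term_counts = Counter(tokenize(text))
    return {paragraph_fingerprint(p, term_counts, min_len, top_k)
            for p in paragraphs}


def similarity(doc_a, doc_b):
    # ASSUMPTION: Jaccard overlap of fingerprint sets, standing in for the
    # similarity formula proposed in the paper.
    fa, fb = document_fingerprints(doc_a), document_fingerprints(doc_b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because short sentences are dropped before hashing, two documents that differ only in a short interjection still produce the same paragraph fingerprints under this sketch, while a replaced paragraph lowers the score.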

  • Publication date: 2015
  • Affiliation: Polytechnic university
