Abstract

Most existing approaches to Chinese webpage duplicate elimination do not address noisy or fuzzy duplicates. In this paper, we propose an efficient, noise-tolerant approach to Chinese webpage duplicate elimination based on a Length-variable Feature Code. First, an Independent Extraction Unit is defined to eliminate the impact of short paragraphs on feature code extraction. Then, the concept of repeatability, based on the longest common substring, is introduced to enhance noise tolerance. Experimental results on a dataset of 10 million webpages show that the proposed approach efficiently handles duplicates in massive webpage collections, with a duplicate elimination precision of 99.03%.

  • Publication date: 2012
