A length-variable feature code based fuzzy duplicates elimination approach for large scale Chinese webpages

Guo Hongzhi<sup>*</sup>; Chen Qingcai; Xin Cong; Wang Xiaolong

doi:10.4304/jsw.7.11.2622-2629

Abstract

Most existing duplicate elimination approaches for Chinese webpages do not address noisy and fuzzy duplicates. In this paper, we propose an efficient and noise-tolerant Chinese webpage duplicate elimination approach based on a Length-variable Feature Code. First, an Independent Extraction Unit is defined to eliminate the impact of short paragraphs on feature code extraction. Then, the concept of repeatability, based on the longest common substring, is introduced to enhance noise tolerance. Experimental results on a dataset of 10 million webpages show that the proposed approach can efficiently eliminate duplicates from massive webpage collections, with a duplicate elimination precision of 99.03%.
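As a rough illustration of the longest common substring idea mentioned in the abstract, the sketch below compares two feature-code strings. The abstract does not define repeatability, so the `repeatability` function here (longest common substring length divided by the length of the shorter string) is a hypothetical stand-in rather than the authors' formula, and both function names are assumptions made for this example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def repeatability(a: str, b: str) -> float:
    """Hypothetical score (an assumption, not the paper's definition):
    longest common substring length over the shorter string's length."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return len(longest_common_substring(a, b)) / shorter
```

Under this assumed definition, two codes that differ only by a small amount of inserted noise still share a long common run and score close to 1, while unrelated codes score close to 0. A threshold on the score would then decide whether two pages count as fuzzy duplicates.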

Publication date: 2012

Last updated: 2018-08-03 17:08
