Abstract

Existing Chinese webpage duplicate-removal approaches are constrained by page scale and algorithm efficiency. To address these problems, we propose an efficient distributed parallel duplicate elimination approach based on a Linux cluster. The approach not only overcomes the memory limitations and heavy computation caused by the huge data scale, but also examines the time-sequence problems that arise in distributed parallel computing and gives an effective solution. Experimental results on a dataset of 10 million webpages show that the proposed approach removes duplicates from massive web pages accurately and effectively.
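The abstract does not describe the paper's actual signature or partitioning scheme, so the following is only a minimal sketch of one common way to parallelize duplicate removal across cluster nodes: each page is routed to a worker by a content signature, and each worker then removes duplicates locally within its partition. The MD5-based signature, the function names, and the toy data below are illustrative assumptions, not the authors' method.

import hashlib
from collections import defaultdict

# Hypothetical helper: compute a content signature for a page.
# The paper's actual fingerprinting scheme is not given in the abstract.
def page_signature(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def partition_by_signature(pages, num_workers):
    """Route each page to a worker by hashing its signature,
    so all potential duplicates land on the same node."""
    partitions = defaultdict(list)
    for page_id, text in pages:
        sig = page_signature(text)
        worker = int(sig, 16) % num_workers
        partitions[worker].append((page_id, sig))
    return partitions

def deduplicate_partition(records):
    """Local duplicate removal on one worker: keep the first page
    seen for each signature, report the rest as duplicates."""
    seen = {}
    duplicates = []
    for page_id, sig in records:
        if sig in seen:
            duplicates.append((page_id, seen[sig]))
        else:
            seen[sig] = page_id
    return duplicates

if __name__ == "__main__":
    pages = [
        ("p1", "北京新闻 今日要闻"),
        ("p2", "北京新闻 今日要闻"),   # exact duplicate of p1
        ("p3", "上海新闻 财经快讯"),
    ]
    for worker, records in partition_by_signature(pages, num_workers=4).items():
        for dup, original in deduplicate_partition(records):
            print(f"worker {worker}: {dup} duplicates {original}")

Partitioning by signature keeps each node's working set small enough to fit in memory, which is the kind of memory-limitation problem the abstract refers to; the exact near-duplicate detection logic used in the paper may differ.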

  • Publication date: 2011

Full text