Abstract

With the explosive growth of data, storage systems face heavy pressure from the large amounts of redundant data produced by duplicate copies or duplicate regions of files; deduplication of massive data has therefore become a research hot spot. After surveying the relevant work, this paper proposes a novel approach to massive-data deduplication: first, appropriate hash functions are chosen to deduplicate the URL data source; then, a thematic concept index is built to remove semantic duplicates from the result of the first stage. Each stage has its own corresponding algorithm. Finally, the algorithms are implemented under the MapReduce computing model, and their feasibility is demonstrated by comparison with other classification algorithms.
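The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the choice of MD5 as the hash function and the `topic_of` callback standing in for the thematic concept index are both assumptions made here for clarity.

```python
import hashlib


def hash_dedup(urls):
    """Stage 1: exact deduplication of the URL data source by hashing.

    MD5 is used here only as one possible hash-function choice (an
    assumption; the paper selects its own appropriate hash functions).
    """
    seen = set()
    unique = []
    for url in urls:
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(url)
    return unique


def semantic_dedup(records, topic_of):
    """Stage 2: keep one record per thematic concept.

    `topic_of` is a caller-supplied function mapping a record to its
    thematic concept; it is a hypothetical stand-in for the paper's
    thematic concept index.
    """
    index = {}
    for rec in records:
        # setdefault keeps the first record seen for each concept.
        index.setdefault(topic_of(rec), rec)
    return list(index.values())
```

For example, `hash_dedup(["a", "a", "b"])` removes the exact duplicate, and `semantic_dedup` with `topic_of=lambda r: r.split()[0]` would collapse records sharing the same leading keyword into one representative. In the paper, each stage would instead run as a MapReduce job over the massive data set.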

  • Publication date: 2014
  • Affiliation: Polytechnic university

Full text