Abstract

With the explosive growth of data, storage systems face heavy pressure from the large amounts of redundant data produced by duplicate copies or duplicate regions of files; deduplication of massive data has therefore become a research hot spot. After surveying the relevant work, this paper proposes a novel approach to massive-data deduplication: first, appropriate hash functions are chosen to deduplicate the URL data source; then, a thematic concept index is built to remove semantic duplicates from the result of the first stage. Each stage has its own corresponding algorithm. Finally, the algorithms are implemented under the MapReduce computing model, and their feasibility is demonstrated by comparison with other classification algorithms.
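The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the choice of MD5 as the hash function and the `topic_of` callback standing in for the thematic concept index are both assumptions made here for clarity.

```python
import hashlib


def hash_dedup(urls):
    """Stage 1: exact deduplication of the URL data source by hashing.

    MD5 is used here only as one possible hash-function choice (an
    assumption; the paper selects its own appropriate hash functions).
    """
    seen = set()
    unique = []
    for url in urls:
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(url)
    return unique


def semantic_dedup(records, topic_of):
    """Stage 2: keep one record per thematic concept.

    `topic_of` is a caller-supplied function mapping a record to its
    thematic concept; it is a hypothetical stand-in for the paper's
    thematic concept index.
    """
    index = {}
    for rec in records:
        # setdefault keeps the first record seen for each concept.
        index.setdefault(topic_of(rec), rec)
    return list(index.values())
```

For example, `hash_dedup(["a", "a", "b"])` removes the exact duplicate, and `semantic_dedup` with `topic_of=lambda r: r.split()[0]` would collapse records sharing the same leading keyword into one representative. In the paper, each stage would instead run as a MapReduce job over the massive data set.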

  • Publication date: 2014
  • Affiliation: Polytechnic university

Full text