Large-Scale Schema-Free Data Deduplication Approach with Adaptive Sliding Window Using MapReduce

Ma, Kun*; Dong, Fusen; Yang, Bo

doi:10.1093/comjnl/bxv052

Abstract

Data deduplication is the task of identifying all groups of objects within one or several data sets that refer to the same entity. This task becomes difficult in the context of big data. To address this difficulty, we propose a new parallel, schema-free data deduplication approach, applied to the deduplication of breeding data related to food safety. Although the MapReduce framework enables efficient parallel execution of data-intensive tasks, it cannot find duplicates that lie in adjacent blocks. Furthermore, current deduplication approaches with MapReduce are restricted to a fixed sliding window. Therefore, we investigate ways to improve current MapReduce-based deduplication approaches: we make the sliding window size adaptive by using an adaptive multiple duplicate count strategy with an alterable window step, and we find duplicates across block boundaries by overlapping the boundary objects of adjacent blocks. Moreover, we propose a multi-pass Partition-Sort-Map-Reduce approach with an adaptive sliding window to speed up the deduplication process. Finally, our experimental evaluation on large breeding datasets shows the high effectiveness and efficiency of the proposed approaches.
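The two ideas named in the abstract, a sliding window that grows when duplicates are found and an overlap of boundary objects between adjacent blocks, can be illustrated with a minimal single-process sketch. This is not the authors' implementation and does not use MapReduce: the function names, the `difflib`-based similarity test, the 0.8 threshold, and the rule for extending the window are all assumptions chosen for illustration, loosely following the general duplicate count idea of enlarging the window after each detected duplicate.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    # Hypothetical similarity test on whole records treated as strings
    # (schema-free: no per-field comparison is assumed).
    return SequenceMatcher(None, a, b).ratio() >= threshold


def adaptive_window_pairs(records, base_window=3, threshold=0.8):
    # Sorted-neighborhood pass: each record is compared with the records
    # that follow it inside a window. The window is extended whenever a
    # duplicate is found, so dense duplicate clusters get a larger window.
    ordered = sorted(records)
    pairs = set()
    for i, rec in enumerate(ordered):
        limit = min(i + base_window, len(ordered))  # exclusive window end
        j = i + 1
        while j < limit:
            if similar(rec, ordered[j], threshold):
                pairs.add((rec, ordered[j]))
                # Assumed extension rule: look base_window - 1 records
                # past the duplicate that was just found.
                limit = min(max(limit, j + base_window), len(ordered))
            j += 1
    return pairs


def dedup_in_blocks(records, num_blocks=2, base_window=3, threshold=0.8):
    # Split the sorted records into blocks (standing in for partitions that
    # separate workers would process) and prepend the last base_window - 1
    # records of the previous block, so duplicates that straddle a block
    # boundary are still compared.
    ordered = sorted(records)
    if not ordered:
        return set()
    size = -(-len(ordered) // num_blocks)  # ceiling division
    pairs = set()
    for start in range(0, len(ordered), size):
        lo = max(0, start - (base_window - 1))
        block = ordered[lo:start + size]
        pairs |= adaptive_window_pairs(block, base_window, threshold)
    return pairs
```

In this sketch, processing the blocks without the overlap would miss any duplicate pair whose two members land on opposite sides of a block boundary; the overlap trades a few repeated comparisons for recovering those pairs.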

Publication date: 2015-11
Affiliation: University of Jinan

