Abstract

Data deduplication is the task of identifying all groups of objects within one or several data sets that represent the same real-world entity. This task becomes difficult in the context of big data. To address this difficulty, we propose a new schema-free, parallel data deduplication approach, applied to the deduplication of breeding data related to food safety. Although the MapReduce framework enables efficient parallel execution of data-intensive tasks, it cannot by itself find duplicates that fall in adjacent blocks. Furthermore, current MapReduce-based deduplication approaches are restricted to a fixed sliding window size. We therefore investigate ways to improve current MapReduce-based deduplication approaches: we make the sliding window size adaptive by using an adaptive multiple duplicate count strategy with an alterable window step, and we find duplicates across block boundaries by overlapping the boundary objects of adjacent blocks. Moreover, we propose a multi-pass Partition-Sort-Map-Reduce approach with an adaptive sliding window to speed up the deduplication process. Finally, our experimental evaluation on large breeding data sets shows the high effectiveness and efficiency of the proposed approaches.