An efficient MapReduce algorithm for similarity join in metric spaces

Liu, Wen; Shen, Yanming*; Wang, Peng

doi:10.1007/s11227-016-1651-9

Abstract

Given a massive set of records, a similarity join finds all pairs of records whose similarity score exceeds a threshold. In this paper, we address the problem of scaling up similarity join for general metric distance functions using MapReduce. First, we propose a novel index structure, the Similarity Join Tree (SJT), which partitions data according to the underlying data distribution and assigns similar records to the same group. Unlike existing approaches, SJT can prune a large number of comparisons within reduce tasks by reusing intermediate results produced as a by-product of partitioning the data. Then, to avoid straggler reduce tasks, we design a graph partitioning algorithm that extends the well-known Fiduccia-Mattheyses algorithm and ensures load balancing while minimizing communication cost and redundancy across all reduce tasks. Experimental results on real data sets show that our approach is more effective and scalable than state-of-the-art algorithms.
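The abstract does not spell out how SJT prunes comparisons, so the following is only a minimal single-machine sketch of one common metric-space idea that fits the description, not the paper's algorithm. It assumes records are grouped by their nearest pivot, that the record-to-pivot distance is the "by-product" kept from partitioning, and that the join is stated in distance form (pairs within `eps`). The names `pivot_join`, `pivots`, and `eps` are hypothetical. Pairs that straddle two groups are deliberately ignored here; a complete join would also have to handle records near group boundaries.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance, used here as an example metric."""
    return math.dist(a, b)


def pivot_join(records, pivots, eps, dist=euclidean):
    """Within-group similarity join with triangle-inequality pruning.

    Returns (pairs, computed), where `computed` counts the exact
    distance evaluations that were not pruned.
    """
    # Map-like step: send each record to its nearest pivot and keep the
    # distance to that pivot as a by-product of partitioning.
    groups = {i: [] for i in range(len(pivots))}
    for r in records:
        d = [dist(r, p) for p in pivots]
        i = min(range(len(pivots)), key=d.__getitem__)
        groups[i].append((r, d[i]))

    # Reduce-like step: compare records inside each group only.
    pairs, computed = [], 0
    for members in groups.values():
        for (a, da), (b, db) in combinations(members, 2):
            # Triangle inequality: dist(a, b) >= |da - db|, so the pair
            # cannot be within eps if the pivot distances differ by more.
            if abs(da - db) > eps:
                continue
            computed += 1
            if dist(a, b) <= eps:
                pairs.append((a, b))
    return pairs, computed


records = [(0, 0), (0.5, 0), (3, 0), (10, 0), (10.4, 0)]
pairs, computed = pivot_join(records, pivots=[(0, 0), (10, 0)], eps=1.0)
```

In this toy input, the two candidate pairs involving `(3, 0)` are discarded from the stored pivot distances alone, so no exact distance is computed for them.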

Publication date: 2016-3
Affiliations: Fudan University; Dalian University of Technology
