String similarity joins

Yu Jiang; Guoliang Li; Jianhua Feng; Wen-Syan Li

doi:10.14778/2732296.2732299

Abstract

String similarity join is an important operation in data integration and cleansing that finds similar string pairs from two collections of strings. More than ten algorithms have been proposed to address this problem over the past two decades. However, existing algorithms have not been thoroughly compared under the same experimental framework. For example, some algorithms have been tested only on specific datasets. This makes it difficult for practitioners to decide which algorithms to use in a given scenario. To address this problem, in this paper we provide a comprehensive survey of a wide spectrum of existing string similarity join algorithms, classify them into categories based on their main techniques, and compare them through extensive experiments on a variety of real-world datasets with different characteristics. We also report comprehensive findings from the experiments and provide new insights into the strengths and weaknesses of existing similarity join algorithms, which can guide practitioners in selecting appropriate algorithms for various scenarios.

Publication date: 2014-04
Affiliation: Tsinghua University
