Approximate joins for XML at label level

Authors: Li Fei; Wang Hongzhi*; Hao Liang; Li Jianzhong; Gao Hong
Source: Information Sciences, 2014, 282: 237-249.
DOI: 10.1016/j.ins.2014.06.007

Abstract

In heterogeneous XML data sources, the same real-world object may not be represented in exactly the same way. Thus approximate join techniques are often applied, in which XML documents are joined based on similarity. Previous XML join methods treat each XML label as an atomic unit and entirely disregard the similarity between different labels. However, real-world data sets are often 'dirty', so labels should also be matched approximately in the join. To improve the join quality, our approach considers both XML structure and node label similarity by applying two tailored similarity measures. Min-hash, a probabilistic hash function, is employed to achieve scalability. Extensive experiments confirm that the join quality is fundamentally improved when label similarity is considered, and that our join efficiency is even higher than that of some of the most efficient methods.
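The abstract names min-hash as the tool used for scalability but does not spell out the construction. The following is a minimal, generic sketch of min-hash signatures for estimating the Jaccard similarity of two token sets; it is not the paper's algorithm. The function names, the use of seeded BLAKE2b as the hash family, the example labels, and the idea of feeding a flat set of labels per document are all assumptions made for illustration.

```python
import hashlib


def _seeded_hash(seed: int, token: str) -> int:
    """Hypothetical hash family: one 64-bit BLAKE2b hash per seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=128):
    """Return the min-hash signature of a non-empty collection of tokens."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("need at least one token")
    # For each hash function, keep only the smallest hash value over the set.
    return [min(_seeded_hash(seed, t) for t in token_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, used here only to sanity-check the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Assumed example: label sets taken from two XML fragments.
    doc_a = {"book", "title", "author", "year"}
    doc_b = {"book", "title", "writer", "year"}
    sig_a = minhash_signature(doc_a)
    sig_b = minhash_signature(doc_b)
    print("estimated:", estimate_jaccard(sig_a, sig_b))
    print("exact:    ", exact_jaccard(doc_a, doc_b))
```

Signatures have a fixed length regardless of document size, so comparing two documents costs time proportional to `num_hashes` rather than to the size of the documents. A larger `num_hashes` should give a tighter estimate at the cost of more hashing work.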

Full text