A Survey on Tree Edit Distance Lower Bound Estimation Techniques for Similarity Join on XML Data

作者:Li Fei; Wang Hongzhi*; Li Jianzhong; Gao Hong
来源:Sigmod Record, 2013, 42(4): 29-39.
DOI:10.1145/2590989.2590994

摘要

When integrating tree-structured data from autonomous and heterogeneous sources, exact joins often fail for the same object may be represented differently. Approximate join techniques are often used, in which similar trees are considered describing the same real-world object. A commonly accepted metric to evaluate tree similarity is the tree edit distance. While yielding good results, this metric is computationally complex, thus has limited benefit for large databases. To make the join process efficient, many previous works take filtering and refinement mechanisms. They provide lower bounds for the tree edit distance in order to reduce unnecessary calculations. This work explores some widely accepted filtering and refinement based methods, and combines them to form multi-level filters. Experimental results indicate that string-based lower bounds are tighter yet more computationally complex than set-based lower bounds, and multi-level filters provide the tightest lower bound efficiently.