Diagnosing and Minimizing Semantic Drift in Iterative Bootstrapping Extraction

Authors: Li, Zhixu; He, Ying; Gu, Binbin; Liu, An*; Li, Hongsong; Wang, Haixun; Zhou, Xiaofang
Source: IEEE Transactions on Knowledge and Data Engineering, 2018, 30(5): 852-865.
DOI: 10.1109/TKDE.2017.2782697

Abstract

Semantic drift is a common problem in iterative information extraction. Previous approaches to minimizing semantic drift may incur a substantial loss in recall. We observe that most semantic drift is introduced by a small number of questionable extractions in the early rounds of iteration. These extractions subsequently introduce a large number of questionable results, which leads to the semantic drift phenomenon. We call these questionable extractions Drifting Points (DPs). If erroneous extractions are the "symptoms" of semantic drift, then DPs are its "causes". In this paper, we propose a method that minimizes semantic drift by identifying the DPs and removing the effects they introduce. We use isA (concept-instance) extraction as an example to describe our approach to cleaning information extraction errors caused by semantic drift, and we perform experiments on different relation extraction processes over three large real-world data extraction collections. The experimental results show that our DP cleaning method removes around 90 percent of the incorrect instances or patterns with about 90 percent precision, outperforming the previous approaches we compare against.
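
The idea of removing the effect of a DP can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that each extraction records which earlier extractions triggered it, and that the set of DPs is already known. The function name `remove_drift_effects`, the provenance map `derived_from`, and the example instances are all hypothetical.

```python
def remove_drift_effects(seeds, derived_from, drifting_points):
    """Keep only extractions still reachable from the seeds without passing through a DP.

    seeds: trusted seed extractions (assumed correct).
    derived_from: maps each extraction to the set of extractions that triggered it
        (a hypothetical provenance record, not taken from the paper).
    drifting_points: extractions flagged as DPs (assumed given here).
    """
    kept = set(seeds) - set(drifting_points)
    changed = True
    while changed:  # iterate to a fixpoint
        changed = False
        for item, parents in derived_from.items():
            if item in kept or item in drifting_points:
                continue
            if parents & kept:  # at least one surviving supporter
                kept.add(item)
                changed = True
    return kept


# Toy isA extraction for the concept "animal" (illustrative data only).
seeds = {"dog", "cat"}
derived_from = {
    "chicken": {"dog"},            # ambiguous: animal or food
    "beef": {"chicken"},           # drifted toward food
    "pork": {"beef"},
    "horse": {"cat"},
    "duck": {"chicken", "horse"},  # also supported by a non-DP extraction
}
kept = remove_drift_effects(seeds, derived_from, drifting_points={"chicken"})
print(sorted(kept))
```

In this toy run, "beef" and "pork" are dropped because their only support traces back to the DP "chicken", while "duck" survives because "horse" still supports it. Keeping independently supported extractions is one way to limit the recall loss the abstract mentions.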