摘要

There has been a surge of interest in the development of probabilistic techniques to discover meaningful data facts across multiple datasets provided by different organizations. The key aim is to approximate the structure and content of the induced data into a concise synopsis in order to extract meaningful data facts. Performing sensible queries across unrelated datasets is a complex task that requires a complete understanding of each contributing database's schema to define the structure of its information. Alternative approaches that use data modeling enterprise tools have been proposed, in order to give users without complex schema knowledge the ability to query databases. Unfortunately, data modeling-based matching is a content-based technique and incurs significant query evaluation costs, due to attribute level pairwise comparisons. We propose a multi-faceted classification technique for performing structural analysis on knowledge domain clusters, using a novel Ontology Guided Data Linkage (OGDL) framework. This framework supports self-organization of contributing databases through the discovery of structural dependencies, by performing multi-level exploitation of ontological domain knowledge relating to tables, attributes and tuples. The framework thus automates the discovery of schema structures across unrelated databases, based on the use of direct and weighted correlations between different ontological concepts, using a h-gram (hash gram) record matching technique for concept clustering and cluster mapping. We demonstrate the feasibility of our OGDL algorithms through a set of accuracy, performance and scalability experimental tests run on real-world datasets, and show that our system runs in polynomial time and performs well in practice. To the best of our knowledge, this is the first attempt initiated to solve data linkage problems using a multi-faceted cluster mapping strategy, and we believe that our approach presents a significant advancement towards accurate query answering and future real-time online semantic reasoning capacity.

  • 出版日期2013-9

全文