A cleaning method for consistency and currency in related data

Authors: Du Yue-Feng; Shen De-Rong; Nie Tie-Zheng; Kou Yue; Yu Ge
Source: Chinese Journal of Computers, 2017, 40(1): 92-106.
DOI: 10.11897/SP.J.1016.2017.00092

Abstract

Data consistency and data currency are critical issues in big data quality management. Conditional functional dependencies (CFDs) and currency constraints (CCs) are two techniques for analyzing data consistency and data currency, respectively. However, real-world data often contains a mixture of latent inconsistency and non-currency errors that existing methods cannot detect, let alone repair, which results in low-quality data. Note that the contents expressed by such real-life data are related to each other, and this association helps uncover the latent errors. To solve this problem, we employ condition-combined functional dependencies (CCFDs), which bring related data together during error detection. In this paper, we propose a cleaning method for consistency and currency in related data. In practice, the detection and repair phases of data cleaning interact: accurate detection provides a high-quality basis for repairs, and the results of repairs feed back into detection. Hence, we design an automatic cleaning framework that detects and repairs data errors iteratively. Furthermore, we discuss the fundamental problems of data cleaning with mixed consistency and currency errors. We prove that the minimum-cost repair problem under CCFDs and CCs is $\Sigma_2^p$-complete ($\mathrm{NP}^{\mathrm{NP}}$), and we therefore propose a heuristic repair method that computes the minimum-cost target values for repairing the errors in each iteration. In addition, to improve the precision of data repair, we present the Repairing Sequences Graph, which determines the errors that should be repaired first. Our empirical evaluation on two real-life datasets shows that the solution is both effective and efficient.
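To make the detect-and-repair loop described in the abstract more concrete, the following is a minimal sketch for a single ordinary CFD, not the authors' CCFD/CC algorithm or their Repairing Sequences Graph. The function names (`cfd_violations`, `min_cost_value`, `clean`), the unit cost per changed cell, the majority-value heuristic, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict


def cfd_violations(rows, condition, lhs, rhs):
    """Return groups of row indices that violate the CFD (condition: lhs -> rhs)."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # The dependency applies only to rows matching the constant pattern.
        if all(row.get(attr) == value for attr, value in condition.items()):
            groups[tuple(row[attr] for attr in lhs)].append(i)
    # A group violates the CFD when its rows disagree on the rhs attribute.
    return [idx for idx in groups.values()
            if len({rows[i][rhs] for i in idx}) > 1]


def min_cost_value(rows, idx, rhs):
    """Pick the rhs value that changes the fewest cells (assumed unit cost per cell)."""
    counts = defaultdict(int)
    for i in idx:
        counts[rows[i][rhs]] += 1
    return max(counts, key=counts.get)


def clean(rows, condition, lhs, rhs):
    """Alternate detection and repair until no violation remains."""
    rows = [dict(r) for r in rows]  # work on a copy, leave the input untouched
    while True:
        violations = cfd_violations(rows, condition, lhs, rhs)
        if not violations:
            return rows
        for idx in violations:
            target = min_cost_value(rows, idx, rhs)
            for i in idx:
                rows[i][rhs] = target


# Hypothetical records: within the UK, the zip code should determine the city.
sample = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
    {"country": "US", "zip": "EH4", "city": "Boston"},
]
repaired = clean(sample, {"country": "UK"}, ["zip"], "city")
```

With one dependency the loop ends after a single pass; iteration matters once several constraints interact, because repairing one violation can expose or create another, which is the situation the paper's framework addresses.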

Full text