Abstract

Repairing obsolete data items to their up-to-date values is a major challenge in improving data quality. Previous data repairing methods are based on either quality rules or statistical techniques, and each type has its limitations. To overcome the shortcomings of these methods, this paper combines quality rules and statistical techniques to improve data currency. (1) A new class of currency repairing rules (CRR for short) is proposed to express both domain knowledge and statistical information. Domain knowledge is expressed by the rule pattern, and statistical information is described by the conditional probability distribution corresponding to each rule. (2) The problem of generating minimized CRRs is studied in both the static and the dynamic world. In the static world, the problem of generating minimized CRR patterns is proved to be NP-hard, and two approximation algorithms are provided to solve it. In the dynamic world, methods are provided to update the CRRs without recomputing the whole CRR set when the data changes. In some special cases, the updates can be finished in time. In both cases, methods for learning the conditional probabilities for each CRR pattern are provided. (3) Based on the CRRs, the problems of finding optimal repairing plans with and without a cost budget are studied, and methods are provided to solve them. (4) Experiments on both real and synthetic data sets show that the proposed methods are efficient and effective.