Existing work on data quality has so far focused on computing data characteristics as a means of quantifying different quality dimensions, such as freshness, consistency, accuracy, or completeness, all of which are defined with respect to some ideal (clean) dataset.
We claim that this approach falls short of providing a full specification of the quality of the data, since it takes into consideration neither the task for which the data is to be used nor any future instances of the dataset.
We argue that apart from the difference from the clean dataset, it is equally important to know the degree to which such difference affects the results of the task at hand.
Thus, we extend the existing data quality definition to include that degree.
Our approach not only allows data quality to be considered in the context of the intended task, but can also provide useful information even in the absence of the clean dataset, and offers an understanding of the effect of data quality in future dataset instances.
We describe a system and its implementation that computes this extended form of data quality through a principled approach of systematic noise generation and task result evaluation.
We perform numerous experiments illustrating the effectiveness of the approach and how it allows traditional data quality measures to be contextualized.
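As a rough illustration of the idea of systematic noise generation followed by task result evaluation, the following minimal sketch injects increasing amounts of one kind of noise into a dataset and records how far a task's result drifts from its result on the data as observed. It is not the system described here: the function names (`inject_missing`, `task_mean`, `effect_curve`), the choice of missing values as the noise type, the mean as the task, and the absolute difference as the deviation measure are all assumptions made for the example.

```python
import random
from statistics import mean


def inject_missing(values, rate, rng):
    """Hypothetical noise generator: replace roughly `rate` of the entries with None."""
    return [None if rng.random() < rate else v for v in values]


def task_mean(values):
    """Hypothetical task: the average of the non-missing values."""
    present = [v for v in values if v is not None]
    return mean(present) if present else float("nan")


def effect_curve(values, task, noise, rates, trials=20, seed=0):
    """Map each noise rate to the average deviation of the task result.

    The baseline is the task result on the data as given, so no clean
    reference dataset is required (an assumption of this sketch).
    """
    rng = random.Random(seed)
    baseline = task(values)
    curve = {}
    for rate in rates:
        deviations = [
            abs(task(noise(values, rate, rng)) - baseline)
            for _ in range(trials)
        ]
        curve[rate] = mean(deviations)
    return curve


if __name__ == "__main__":
    data = [float(i) for i in range(1, 101)]
    for rate, deviation in effect_curve(
        data, task_mean, inject_missing, [0.0, 0.1, 0.3, 0.5]
    ).items():
        print(f"noise rate {rate:.1f}: mean deviation {deviation:.3f}")
```

Under these assumptions, the resulting curve indicates how sensitive this particular task is to this particular kind of degradation, which is the sort of task-dependent information that a completeness score alone does not convey.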
“Estimating the extent of the effects of data quality through observations.”
Proceedings of the 37th IEEE International Conference on Data Engineering, ICDE 2021