|Charles Sutton||The University of Edinburgh|
|Timothy Hobson||The Alan Turing Institute|
|James Geddes||The Alan Turing Institute|
Many analyses in data science are not one-off projects, but are repeated over multiple data samples. The authors introduce the data diff problem, which attempts to turn this repeated effort into an opportunity.
Many analyses in data science are not one-off projects, but are repeated over multiple data samples, such as once per month, once per quarter, and so on. For example, if a data scientist performs an analysis in 2017 that saves a significant amount of money, then she will likely be asked to perform the same analysis on data from 2018. But more data analyses mean more effort spent in data wrangling. We introduce the data diff problem, which attempts to turn this problem into an opportunity. When the repeated data samples are compared against each other, inconsistencies may be indicative of underlying issues in data quality. By analogy to text diff, the data diff problem is to find a “patch”, that is, a transformation in a specified domain-specific language, that transforms the data samples so that they are identically distributed. We present a prototype tool for data diff that formalizes the problem as a bipartite matching problem, calibrating its parameters using a bootstrap procedure. The tool is evaluated quantitatively and through a case study on an open government data set.
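To make the bipartite matching view concrete, the following is a minimal sketch and not the prototype tool itself. It assumes, purely for illustration, that the only admissible patch is a reordering of numeric columns, that the cost of pairing two columns is the two-sample Kolmogorov-Smirnov statistic, and that the number of columns is small enough for exhaustive search to stand in for a proper assignment solver. The names `ks_statistic` and `match_columns` and the example data are hypothetical, and the bootstrap calibration step is omitted.

```python
from itertools import permutations


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    gap = 0.0
    for v in sorted(set(xs) | set(ys)):
        # Advance each pointer past all values <= v, so i/len(xs) is the ECDF at v.
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        gap = max(gap, abs(i / len(xs) - j / len(ys)))
    return gap


def match_columns(old, new):
    """Minimum-cost one-to-one matching of the columns of `old` to those of `new`.

    `old` and `new` map column names to lists of numbers and are assumed
    (for this sketch) to have the same number of columns.
    """
    old_names, new_names = list(old), list(new)
    cost = {(a, b): ks_statistic(old[a], new[b])
            for a in old_names for b in new_names}
    # Exhaustive search over assignments; only feasible for a handful of columns.
    best = min(
        permutations(new_names),
        key=lambda perm: sum(cost[a, b] for a, b in zip(old_names, perm)),
    )
    return dict(zip(old_names, best))


# Hypothetical samples: the second has its columns renamed and swapped.
sample_2017 = {"height_cm": [150, 160, 170, 180], "weight_kg": [50, 60, 70, 80]}
sample_2018 = {"col_a": [52, 61, 69, 83], "col_b": [152, 158, 171, 179]}
patch = match_columns(sample_2017, sample_2018)
```

Here `patch` pairs each 2017 column with the 2018 column whose empirical distribution is closest, which is one simple instance of a transformation that brings the two samples closer to being identically distributed.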