August 2, 2017

The hidden costs of poor data quality

The phrase “data quality” is frequently—and often ambiguously—thrown around many data analytics organizations. It can be used as an object of concern, an excuse for a failure, or a goal for future improvement.

We’d all love 100% accuracy, but in the era of moving fast and breaking things, don’t we want to sacrifice a little accuracy in the name of speed? After all, isn’t it often better to make fast decisions with imperfect information and adjust course if necessary at a later point?

There is certainly a trade-off at play here. The optimal level of quality to aim for is likely less than 100%. But it’s probably higher than you think. The speed vs. accuracy trade-off needs to be calculated with all costs considered, not just the most visible and direct ones. For this reason, let’s ignore the obvious cost of low-quality data, which is incorrect decisions and lost opportunities to apply machine learning models due to an unacceptable signal-to-noise ratio. Instead, let’s focus on the lesser-considered costs of poor data quality.

It reduces the standard for derivative works

Broken windows theory is a pop-science concept originating from criminology which states that visible signs of disorder encourage further disorder. When participants in a system are acclimitized to a slightly broken status quo, it becomes normative to produce more slightly broken output.

I suspect this concept applies to data quality as well, in two ways: phsycological and practical. Ambient quality issues degrade the shared standard for quality within a team, and the “definition of done” will slip. For an analysis on a tight timeline, 90% accuracy may be accepted instead of spending the time investigating the root cause of the innacuracy.

From a practical standpoint, it becomes more difficult to implement data quality checks or perform sanity tests when nothing is black or white, but rather shades of gray. For example, if a tracking event should always have a session-ID associated with it, then it becomes easy to detect when a problem arises, with something as simple as COUNT(*) WHERE session_id IS NULL. But when tracking events usually but not always have a session-ID associated with them, you can’t apply simple rules, nor can you identify if a specific data transformation/aggregation/analysis introduces meaningfully more null values. Soon you are in the domain of anomaly detection, which is an endeavour in and of itself.

It takes dramatically longer to reach a final result

Anscombe’s Quartet is a famous example of why it’s not enough to analyze a dataset using descriptive statistic alone. You need to look at the distribution of the underlying data. But what if you do visualize the data during the initial analysis, and now you want to build a simple dashboard for ongoing monitoring? Particularly when working with large datasets, it is much more convenient to apply a server-side aggregate function in SQL such as AVG() than to pull the raw data, filter for outliers, and then perform client-side aggregation in software such as Tableau.

So you draw up a simple query based on AVG() and feed it into a dashboard. This works, until a deployment next month starts sending negative values to the database due to an incorrectly configured timezone on one of the backend server. This goes unnoticed until an analyst a year from now realizes that the average time-to-purchase is being understated by more than 50%. Oops!

Well, what if we write a more “defensive query” in the first place, to prevent such an incident from occuring? For example…

SELECT AVG(GREATEST(0, time_to_purchase))

It is possible to write such queries in response to specific (and known) quality issues, but doing so for multiple types of potential quality problems leads to clunky and slow queries peppered with SELECT DISTINCT statements that not only take longer to execute, but are more difficult to write, read, and modify in the future.

It accumulates multiplicatively, not additively

A common question posed to analysts is “how accurate is this data”? This is deceptively difficult to answer. Non-technical stakeholders often conceive of error as an additive quantity, but in reality the effect is often multiplicative.

For example, let’s say we are trying to calculate the average number of searches per visit on our website. We’ve got two tables for this: visits and searches. To make things interesting, there was a technical bug a few months ago, in which some visitors were assigned a session-ID of undefined.

So we write a query,

SELECT AVG(COUNT(s.search_id))
FROM visits v
JOIN searches s
  ON v.session_id = s.session_id
GROUP BY v.session_id

The problem here is that SQL will inadverently perform a cross-join on these undefined rows, giving you a wildly inaccurate result…

It undermines confidence in the analytics team

Your team has collected and transformed the data, built the model, and performed the analysis. Now it’s time to communicate the results to stakeholders.

But often the data presents inconvenient truthes, which stakeholders may be reluctant to accept. In order to reduce cognitive dissonance, people often engage in motivated reasoning, questioning the quality of the data, and whether we can actually trust the results.

So even if you’ve already paid the inflated costs assocated with reaching a meaningful and accurate insight in the face of poor data quality, you may yet face a more difficult evaganlizing for controversial actions or outcomes based on those results.

© Geoff Ruddock 2020