June 10, 2018

Redshift function of the week: RATIO_TO_REPORT

A common use-case

A very common scenario one comes across while performing data analysis is wanting to compute a basic count of some event—such as visits, searches, or purchases—split by a single dimension—such as country, device, or marketing channel. Quite often this arises as an intermediate need while working towards some other primary task.

Let’s work with a simple example: you’d like to get a rough sense of how many of your company’s orders come from from each country. So you write the following query,

SELECT country, COUNT(*) AS num_orders
FROM orders
ORDER BY 2 DESC

And you get the following result back in your SQL client,

country num_orders
USA 21264505
Canada 6408593
Mexico 2208305

This is kind of difficult to read. You can immediately discern that USA has the most orders, but it’s tough to eyeball proportions here without counting digits and performing some rough mental math. So you refine your query to tell you what you are actually interested in: the relative proportion of orders between countries.

WITH country_totals AS (
  SELECT
      country
    , COUNT(*) AS num_orders
  FROM orders
  GROUP BY 1
  ORDER BY 2 DESC
)
SELECT
    country
  , ROUND(num_orders ::NUMERIC / SUM(num_orders) OVER (), 2) AS pct_orders
FROM country_totals
country num_orders
USA 0.71169602953464
Canada 0.21443048342874
Mexico 0.07395810292746

This is better. It answers your question, and you can go back to your main task. But is it optimal? It took you 12 lines to answer a relatively simple question. Hopefully you didn’t write these from scratch, but even if you pasted a snippet, it’s not particularly easy to read if you or someone else needs to refer back to this query in the future.

Can we do better?

Enter the RATIO_TO_REPORT function

Amazon Redshift provides an off-the-shelf window function called ratio_to_report which basically solves what we are trying to accomplish. You can use it as follows,

SELECT
    country
  , RATIO_TO_REPORT(COUNT(num_orders)) OVER () AS pct_orders
FROM orders
GROUP BY 1
ORDER BY 2 DESC

We can reason through this query as follows: the GROUP BY operation totals orders for each country, and then the RATIO_TO_REPORT function is called on the already-grouped rows, dividing each by their grand total. Running this function gives us the exact same output as the previous query, but with half the lines of code, and a more readable result.

As a final step, we can clean up the output even further by rounding the percentage to a meaningful precision, so that our eyes don’t spend an extra 100ms parsing the fact that it is a fraction when we look at the table.

SELECT
    country
  , ROUND(RATIO_TO_REPORT(COUNT(num_orders)) OVER (), 2) AS pct_orders
FROM orders
GROUP BY 1
ORDER BY 2 DESC
country num_orders
USA 0.71
Canada 0.21
Mexico 0.07

Boom! We’ve arrived at a short query which gives us a clean result that answers our underlying question. This is a trivial scenario, but the sort that one encounters daily, and so taking the optimal approach pays off in long-run efficiency.

Further reading

© Geoff Ruddock 2019