Redshift function of the week: RATIO_TO_REPORT
Jun 10, 2018
A very common scenario one comes across while performing data analysis is wanting to compute a basic count of some event—such as visits, searches, or purchases—split by a single dimension—such as country, device, or marketing channel. Quite often this arises as an intermediate need while working towards some other primary task.
Let’s work with a simple example: you’d like to get a rough sense of how many of your company’s orders come from from each country. So you write the following query,
SELECT country, COUNT(*) AS num_orders FROM orders ORDER BY 2 DESC
And you get the following result back in your SQL client,
This is kind of difficult to read. You can immediately discern that USA has the most orders, but it’s tough to eyeball proportions here without counting digits and performing some rough mental math. So you refine your query to tell you what you are actually interested in: the relative proportion of orders between countries.
WITH country_totals AS ( SELECT country , COUNT(*) AS num_orders FROM orders GROUP BY 1 ORDER BY 2 DESC ) SELECT country , ROUND(num_orders ::NUMERIC / SUM(num_orders) OVER (), 2) AS pct_orders FROM country_totals
This is better. It answers your question, and you can go back to your main task. But is it optimal? It took you 12 lines to answer a relatively simple question. Hopefully you didn’t write these from scratch, but even if you pasted a snippet, it’s not particularly easy to read if you or someone else needs to refer back to this query in the future.
Can we do better?
Enter the RATIO_TO_REPORT function
Amazon Redshift provides an off-the-shelf window function called ratio_to_report which basically solves what we are trying to accomplish. You can use it as follows,
SELECT country , RATIO_TO_REPORT(COUNT(num_orders)) OVER () AS pct_orders FROM orders GROUP BY 1 ORDER BY 2 DESC
We can reason through this query as follows: the
GROUP BY operation totals orders for each country, and then the
RATIO_TO_REPORT function is called on the already-grouped rows, dividing each by their grand total. Running this function gives us the exact same output as the previous query, but with half the lines of code, and a more readable result.
As a final step, we can clean up the output even further by rounding the percentage to a meaningful precision, so that our eyes don’t spend an extra 100ms parsing the fact that it is a fraction when we look at the table.
SELECT country , ROUND(RATIO_TO_REPORT(COUNT(num_orders)) OVER (), 2) AS pct_orders FROM orders GROUP BY 1 ORDER BY 2 DESC
Boom! We’ve arrived at a short query which gives us a clean result that answers our underlying question. This is a trivial scenario, but the sort that one encounters daily, and so taking the optimal approach pays off in long-run efficiency.
RATIO_TO_REPORT Window Function – AWS documentation. Not particularly easy to read, but still a primary resource.
SQL queries don’t start with SELECT – It’s useful to understand the SQL “order of operations” when working with queries which combine both groupby aggregation and window functions.
Calculationg Proportional Values in SQL – Implementation details for different SQL engines.