Geoff Ruddock

Keep your SQL queries DRY with Jinja templating

A usecase for templating your SQL queries

Suppose you have a table raw_events which contains events related to an email marketing campaign. You’d like to see the total number of each event type per day. This is a classic use-case for a pivot table, but let’s suppose you are using an SQL engine such as Redshift / Postgres which does not have a built-in pivot function.

The quick-and-dirty solution here is to manually build the pivot table yourself, using a series of CASE WHEN expressions.

SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

This gets the job done for our toy example, it is not particularly scalable. Suppose our event_type column had 40 possible values, instead of just four. In that case, our solution is sub-optimal on two criteria:

  1. Readability โ€“ With 30x possible values, our query would be ~10x as many lines as before. While it won’t take 10x longer to read, it does impose a cognitive cost to read. This is exacerbated when we’ve got a number of sub-queries in a single file.
  2. Maintainability โ€“ If we add new event_types in the future, this query must be updated to match. This is tedious, and introduces an opportunity for error.

Can we do better?

This is a pretty good scenario to use jinja, a Python templating library which lets us perform basic flow control (loops, conditionals) inside of text templates. It is heavily used among the Flask community, but is also well suited for data analytics with SQL.

I’ll avoid giving a mediocre regurgitation of jinja syntax here, and defer to their own excellent documentation instead. Let’s skip ahead to see what our query would look like using a jinja template.

from jinja2 import Template
sql = """
SELECT
    date_,
    {%- for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        , 
    {%- endif -%}
    {%- endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))
SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

In our toy example, the resulting query is not that much shorter, but it has the benefit of abstracting out a variable events, which contains the list of possible values for the event_type column. In the future, we can extend this query easily by simply appending to the events list.

Whitespace

You may have noticed that I added some unexplained - characters in the blocks above to get a pretty output.

The default output without these characters is a bit ugly.

sql = """
SELECT
    date_,
    {% for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {% if not loop.last %}
        , 
    {% endif %}
    {% endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))
SELECT
    date_,
    
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send
    
        , 
    
    
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver
    
        , 
    
    
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open
    
        , 
    
    
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
    
    
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

Adding a minus sign (-) tells jinja to strip the whitespace before or after a block.

There are four possible positions for the minus sign:

  1. Start of opening block
  2. End of opening block
  3. Start of closing block
  4. End of closing block

Let’s take a look at the effect of adding a minus sign in each position.

Start of opening block

Adding the minus sign to the start of the opening block strips the leading whitespace outside of the for-loop. Basically, it just removes the extra line inhabited by the {%- for event in events %} block itself.

๐Ÿ˜„ This removes the empty line between date_ and the first SUM.

๐Ÿ˜ข But it does not remove the empty lines between each SUM statement.

sql = """
SELECT
    date_,
    {%- for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        , 
    {%- endif -%}
    {% endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))
SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

End of opening block

Adding the minus sign to the end of the opening block strips the leading whitespace within the for-loop.

๐Ÿ˜„ This removes the empty lines between each SUM statement.

๐Ÿ˜ข But it leaves an empty line between the final statement and the SQL outside of the for-loop.

sql = """
SELECT
    date_,
    {% for event in events -%}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        , 
    {%- endif -%}
    {% endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))
SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

Start of closing block

Adding the minus sign to the start of the closing block strips the trailing whitespace within the for-loop.

๐Ÿ˜„ This removes the empty lines between each SUM statement.

๐Ÿ˜ข But it leaves an empty line between the date_ and the first SUM statement.

sql = """
SELECT
    date_,
    {% for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        , 
    {%- endif -%}
    {%- endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))
SELECT
    date_,
    
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

End of closing block

Adding the minus sign to the end of the closing block removes the trailing whitespace outside of the for-loop.

๐Ÿ˜ข This looks the worst. It leaves an empty line between each SUM statement. While it removes the final empty line between the for-loop and the rest of the untemplated SQL, it pulls FROM onto the incorrect level of indentation.

sql = """
SELECT
    date_,
    {% for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        , 
    {%- endif -%}
    {% endfor -%}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))
SELECT
    date_,
    
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_clickFROM raw_events
GROUP BY 1
ORDER BY 1 ASC

The ideal mix

By combining minus signs on the start of the opening block and the start of the ending block, we can tell jinja to strip the first leading empty line, and also the lines between each SUM statement.

sql = """
SELECT
    date_,
    {%- for event in events %}
    SUM(CASE WHEN event_type = '{{event}}' THEN 1 END) AS num_{{event}}
    {%- if not loop.last -%}
        , 
    {%- endif -%}
    {%- endfor %}
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC
"""

print(Template(sql).render(events=['send', 'deliver', 'open', 'click']))
SELECT
    date_,
    SUM(CASE WHEN event_type = 'send' THEN 1 END) AS num_send,
    SUM(CASE WHEN event_type = 'deliver' THEN 1 END) AS num_deliver,
    SUM(CASE WHEN event_type = 'open' THEN 1 END) AS num_open,
    SUM(CASE WHEN event_type = 'click' THEN 1 END) AS num_click
FROM raw_events
GROUP BY 1
ORDER BY 1 ASC

Correlation matrix

Let’s look at a slightly more compliated example. Suppose we used the previous query to make a new table called daily_event_counts. Now we are interested in measuring the pairwise correlation between each type of event.

We can use the CORR() function to calculate each pair, but we need to tell the SQL engine which columns to use for each calculation. This is a good example of where the quick-and-dirty approach fails to scale. We have four types of events, but there are 4ร—4=16 pairwise correlations.

SELECT
    'send' AS x,
    'deliver' AS y,
    CORR(send, deliver) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'send' AS x,
    'open' AS y,
    CORR(send, open) AS corr_
FROM daily_event_counts

In reality, we are only interested in six of these pairwise correlations. The four diagonals will just equal one, and the matrix is symmetric, so half the computations are redundant. For the sake of simplicity, let’s ignore this for now, and proceed to calculate all sixteen.

Nested for-loops with jinja

Nested for-loop are relatively straightforward, but I will point out two changes:

  1. We want to place UNION ALL after all iterations except the final one. Previously we used {%- if not loop.last -%} to check if it was the final iteration. Since we now have a nested loop, we need to keep track of two indices. We can do this by using the block {% set outer_loop = loop %} to assign the outer loop to a new variable outer_loop before it is “replaced” by the inner loop.
  2. We add a minus sign (-) on the end of the outer opening block, to avoid getting an additional empty line between iterations of the outer loop. This gives us a consistent spacing of one empty line between each UNION ALL statement.
from jinja2 import Template

sql = """
{%- for x in cols -%}
{% set outer_loop = loop %}
{%- for y in cols %}
SELECT
    '{{x}}' AS x,
    '{{y}}' AS y,
    CORR({{x}}, {{y}}) AS corr_
FROM daily_event_counts
{% if not (loop.last and outer_loop.last) %}
UNION ALL
{% endif %}
{%- endfor %}
{%- endfor %}
    """

print(Template(sql).render(cols=['send', 'deliver', 'open', 'click']))
SELECT
    'send' AS x,
    'send' AS y,
    CORR(send, send) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'send' AS x,
    'deliver' AS y,
    CORR(send, deliver) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'send' AS x,
    'open' AS y,
    CORR(send, open) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'send' AS x,
    'click' AS y,
    CORR(send, click) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'deliver' AS x,
    'send' AS y,
    CORR(deliver, send) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'deliver' AS x,
    'deliver' AS y,
    CORR(deliver, deliver) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'deliver' AS x,
    'open' AS y,
    CORR(deliver, open) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'deliver' AS x,
    'click' AS y,
    CORR(deliver, click) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'open' AS x,
    'send' AS y,
    CORR(open, send) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'open' AS x,
    'deliver' AS y,
    CORR(open, deliver) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'open' AS x,
    'open' AS y,
    CORR(open, open) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'open' AS x,
    'click' AS y,
    CORR(open, click) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'click' AS x,
    'send' AS y,
    CORR(click, send) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'click' AS x,
    'deliver' AS y,
    CORR(click, deliver) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'click' AS x,
    'open' AS y,
    CORR(click, open) AS corr_
FROM daily_event_counts

UNION ALL

SELECT
    'click' AS x,
    'click' AS y,
    CORR(click, click) AS corr_
FROM daily_event_counts

Further reading


comments powered by Disqus