April 16, 2017

Jupyter Notebooks for Interactive SQL Exploration

I’m always hesitant to tell people that I work as a data scientist: partly because it’s too vague a job description to mean much, but also because it feels hubristic to use the title “scientist” for work that does not necessarily involve the scientific method.

Data is a collection of facts. Data, in general, is not the subject of study. Data about something in particular, such as physical phenomena or the human mind, provide the content of study. To call oneself a “data scientist” makes no sense. One cannot study data in general. One can only study data about something in particular.

There Is No Science of Data

So it’s always nice when I find an opportunity to borrow a concept or practice from actual science and apply it in my day-to-day. One of my favourites is the practice of keeping a lab notebook with commentary and supplementary details around the meandering path taken towards a final result.

Jupyter notebooks and R Markdown are two common tools that make it easy to intermingle code and analysis (as markdown) in a way that allows you to elucidate your thought process along a particular path.

But I have always felt a bit frustrated that there is not a similar tool for SQL. I try to get out of SQL and into python as soon as possible, but sometimes it is inevitable. On occasion, while writing a query to pull a starting dataset for some sort of analysis in pandas, I find myself troubleshooting something like missing or duplicate records in SQL. Usually this involves executing a sequence of simple queries against various tables in the database to narrow down the source of the problem, often using the output of one as input into another query. For example:

  1. Pull a single order-ID which is missing from my dataset
  2. Query the orders table for that order-ID to find the corresponding customer-ID
  3. Query the customers table for that customer-ID to find device data

What I did before

I generally prefer writing SQL queries in my IDE (PyCharm) which provides a number of useful features including auto-completion of column and table names, along with warnings that appear for typos, etc.

Usually I add comments above queries as I go, using the -- comment syntax. At the end of the chain of queries I may copy/paste everything into a .sql file to save somewhere, in case I need to run through that specific chain of troubleshooting steps again.

Enter Jupyter w/ SQL magics

There is a neat jupyter extension called ipython-sql that adds %sql and %%sql magic commands to your jupyter notebooks. Magic commands are special non-python commands, prefixed with % (for a single line) or %% (for a whole cell), which add some sort of additional functionality when run from a notebook cell.

Prefixing a code cell with %%sql lets you execute the SQL code below it against your database, with the result displayed beneath the cell. It even applies syntax highlighting to your SQL, making it more readable.

How to use

The first thing we need to do is install the extension,

!pip install ipython-sql

Next, we need to load the extension and create a connection with your database,

%reload_ext sql  # Use reload_ext instead of load_ext to avoid a warning when re-running the cell.
%config SqlMagic.autopandas = True  # Return a pandas DataFrame instead of an SQL ResultSet.

import os

# Provide the SQLAlchemy connection string template for your database.
redshift_str_template = 'postgresql://{user}:{pwd}@{host}:{port}/{db}'

# Fill in the template with your credentials, stored in environment variables.
connect_str = redshift_str_template.format(
    user=os.environ['REDSHIFT_USERNAME'],
    pwd=os.environ['REDSHIFT_PASSWORD'],
    host=os.environ['REDSHIFT_HOST'],
    port=os.environ['REDSHIFT_PORT'],
    db=os.environ['REDSHIFT_DB'])

# Open a connection to your database
%sql $connect_str

The code above assumes that you are using Amazon Redshift as a database, and that your credentials are stored in environment variables. If this is not the case, you can replace the os.environ[] calls with strings, but be careful not to commit your notebook to a shared repository with plaintext credentials.
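The template-filling step above is plain Python string formatting, so you can sanity-check the shape of the resulting connection URL with dummy values before pointing it at a real database (the credentials below are placeholders, not real ones):

```python
# Same template as above, filled with placeholder credentials to show
# the shape of the resulting SQLAlchemy URL.
redshift_str_template = 'postgresql://{user}:{pwd}@{host}:{port}/{db}'

connect_str = redshift_str_template.format(
    user='alice',
    pwd='s3cret',
    host='example.redshift.amazonaws.com',
    port=5439,
    db='analytics')

print(connect_str)
# → postgresql://alice:s3cret@example.redshift.amazonaws.com:5439/analytics
```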

Now we can run an SQL query in our notebook,

%%sql
SELECT COUNT(*)
FROM sales
WHERE ts > CURRENT_DATE - interval '7 days'

Another cool feature is the ability to save the output to a variable,

%%sql num_sales <<
SELECT COUNT(*)
FROM sales
WHERE ts > CURRENT_DATE - interval '7 days'
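Because autopandas is enabled above, the saved variable is a regular pandas DataFrame, so you can work with it in the next cell like any other. A minimal mock of what num_sales might look like (the value is illustrative, not real query output):

```python
import pandas as pd

# With SqlMagic.autopandas = True, `num_sales` comes back as a DataFrame
# with one column per selected expression. Mocked here with a made-up value:
num_sales = pd.DataFrame({'count': [1234]})

# Pull out the scalar to use in a later query or check.
total = int(num_sales['count'].iloc[0])
```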

It works in reverse too, so you can feed a python variable such as N_DAYS = 7 back into a query by referencing it with a leading : in your SQL.

%%sql num_sales <<
SELECT COUNT(*)
FROM sales
WHERE ts > CURRENT_DATE - :N_DAYS * interval '1 day'

Using these two features together, it is possible to write a notebook which performs a sequence of debugging steps, with each query taking a dynamic value from the previous output. You can then save this notebook, and easily re-run the same troubleshooting steps on fresh data when the problem arises in the future.

Couldn’t I achieve the same thing with Jinja and psycopg2?

Theoretically we could write queries into string variables in a Jupyter notebook and run them using psycopg2 or pandas, but this always felt too clunky to be usable. The above approach almost entirely removes the friction and boilerplate code, while also giving us the benefit of syntax highlighting.
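For comparison, a sketch of that string-based alternative (the table name is the same illustrative one used above; the database calls are commented out since they need a live connection):

```python
# import pandas as pd  # for pd.read_sql, once you have a connection
# import psycopg2      # for the connection itself

# Interpolate the parameter into the query string by hand -- no syntax
# highlighting, and easy to get quoting wrong.
N_DAYS = 7
query = f"""
SELECT COUNT(*)
FROM sales
WHERE ts > CURRENT_DATE - interval '{N_DAYS} days'
"""

# conn = psycopg2.connect(...)   # credentials omitted
# df = pd.read_sql(query, conn)  # same result as the %%sql cell above
```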

Further reading

© Geoff Ruddock 2020