Over the life of any given data analysis project, I typically use a mix of both SQL and pandas code at various points. This notebook contains some reference material and gotchas for fluently translating between the two.
Boolean logicSQL truth tablePandas truth table (object)Pandas truth table (boolean)Templating SQL queriesPython list to SQL query stringEquivalent functionsSplitting stringsCASE WHEN import numpy as np import pandas as pd import sqlite3 Package Version --------- --------- python 3.
A usecase for templating your SQL queries Suppose you have a table raw_events which contains events related to an email marketing campaign. You’d like to see the total number of each event type per day. This is a classic use-case for a pivot table, but let’s suppose you are using an SQL engine such as Redshift / Postgres which does not have a built-in pivot function.
The quick-and-dirty solution here is to manually build the pivot table yourself, using a series of CASE WHEN expressions.
I fielded an interesting request recently from our PR team, who wanted to generate a creative representation of our data based on the direction and distance of trips booked on our platform. Distance a key attribute of interest for a travel business, so it is naturally easy to retrieve this data. However the direction of a trip is something that had not been previously analyzed, and so it was not available off-the-shelf in our data warehouse.
A very common scenario one comes across while performing data analysis is wanting to compute a basic count of some event—such as visits, searches, or purchases—split by a single dimension—such as country, device, or marketing channel. Amazon Redshift provides an off-the-shelf window function called ratio_to_report which basically solves what we are trying to accomplish. Running this function gives us the exact same output as the previous query, but with half the lines of code, and a more readable result.
I’m always hesitant to tell people that I work as a data scientist. Partially because it’s too vague of a job description to mean much, but also partially because it feels hubristic to use the job title “scientist” to describe work which does not necessarily involve the scientific method.
Data is a collection of facts. Data, in general, is not the subject of study. Data about something in particular, such as physical phenomena or the human mind, provide the content of study.