I fielded an interesting request recently from our PR team, who wanted to generate a creative representation of our data based on the direction and distance of trips booked on our platform. Distance a key attribute of interest for a travel business, so it is naturally easy to retrieve this data. However the direction of a trip is something that had not been previously analyzed, and so it was not available off-the-shelf in our data warehouse.
What do we mean by “direction” anyway?
The most intuitive interpretation of direction seemed like compass bearing, so I set out to find a way to convert a pair of spatial coordinates (latitude and longitude) into a variable which represents degrees right of true north. Unfortunately I could not find any suitable built-in functions to deal with spatial data in Redshift.
While it would not be difficult to spin up a jupyter notebook, pull in some data via SQL and run each row throw some function, it would not be an ideal approach. Keeping a small data request like this as a pure SQL query means it is easily reproducable in the future, without worrying about python package versions, anaconda environments, etc. Furthermore, anyone with access to the data warehouse can fetch updated data, rather than only someone comfortable with python.
Enter Python UDFs in Redshift
But all is not lost. Python UDFs to the rescue! Redshift lets you declare user-defined functions that take some scalar inputs, run a chunk of python code and return the output right back into SQL. Instead of declaring your function as a python fuction using def my_func(param)
syntax, you place its contents in the UDF function declaration below.
CREATE OR REPLACE FUNCTION my_func (param_a float,
param b float)
RETURNS float STABLE AS
$$ < python code > $$ LANGUAGE plpythonu;
Trying to remember as little decade-old trigonometry knowledge as possible, I found a working function on this stackexchange question and plugged it into our UDF boilerplate below. The final result looks like this,
CREATE OR REPLACE FUNCTION bearing_between_coordinates (x_lat float,
x_lon float,
y_lat float,
y_lon float) RETURNS float STABLE AS
$$
import math
startLat = math.radians(x_lat)
startLong = math.radians(x_lon)
endLat = math.radians(y_lat)
endLong = math.radians(y_lon)
dLong = endLong - startLong
dPhi = math.log(math.tan(endLat/2.0+math.pi/4.0)/math.tan(startLat/2.0+math.pi/4.0))
if abs(dLong) > math.pi:
if dLong > 0.0:
dLong = -(2.0 * math.pi - dLong)
else:
dLong = (2.0 * math.pi + dLong)
return (math.degrees(math.atan2(dLong, dPhi)) + 360.0) % 360.0;
$$ LANGUAGE plpythonu;
Execute this once in your database console, then you can use it within an existing query, for example,
SELECT bearing_between_coordinates(x_lat, x_lon, y_lat, y_lon) AS bearing
FROM lat_lon_coords
Or you can stitch together trig functions in Redshift
Update: A Python UDF may be overkill. I realized after writing the above that I can replicate the contents of the function itself using built-in trigonometric functions in Redshift. This results in the “almost one-liner” below. I opted to use a CTE to convert inputs to radians rather than embedding in the select to make that behemoth slightly less unreadable.
There is definitely a trade-off on interpretability though. This SQL code does a poor job of projecting intent compared to a defined function. Rather than reading the first line of the function declaration, you need to read all the way through to the final alias bearing_degrees
to understand why we are chaining together a bunch of trig functions anyway.
WITH coords_as_radians AS (
SELECT
RADIANS(x_lat) AS x_lat
, RADIANS(x_lon) AS x_lon
, RADIANS(y_lat) AS y_lat
, RADIANS(y_lon) AS y_lon
FROM raw_coordinates
)
SELECT (DEGREES(ATAN2(SIN(arr_lon-dep_lon)*COS(arr_lat), COS(dep_lat)*SIN(arr_lat)-SIN(dep_lat)*COS(arr_lat)*COS(arr_lon-dep_lon)))+360)::DECIMAL(18, 2) % 360.00 AS bearing_degrees
FROM coords_as_radians
;
comments powered by Disqus