Geoff Ruddock

Tags / python


Over the life of any given data analysis project, I typically use a mix of both SQL and pandas code at various points. This notebook contains some reference material and gotchas for fluently translating between the two. Topics covered: boolean logic (SQL truth table, pandas truth tables for object and boolean dtypes), templating SQL queries, converting a Python list to an SQL query string, and equivalent functions for splitting strings and CASE WHEN.
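As a minimal sketch of the kind of side-by-side translation the notebook covers (the column name x and the toy data are my own), here is the same filter expressed once through sqlite3 and once as a pandas boolean mask:

```python
import pandas as pd
import sqlite3

# Toy data with one nullable column, to compare NULL/NaN handling.
df = pd.DataFrame({"x": [1, 2, None, 4]})

# SQL version: load the frame into an in-memory sqlite3 database and filter there.
con = sqlite3.connect(":memory:")
df.to_sql("t", con, index=False)
sql_result = pd.read_sql("SELECT * FROM t WHERE x > 1", con)

# pandas version: the equivalent boolean mask (NaN rows drop out, just like NULL in SQL).
pandas_result = df[df["x"] > 1]

print(sql_result)
print(pandas_result)
```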
Goal: Suppose you have a complex dataset—one that you don’t fully understand—and you want to understand why one field is unexpectedly NULL in some rows. You could identify potentially relevant dimensions and manually flip through each to see if their values are correlated with the missingness of your data. But this can be tedious. And since you’d effectively be acting like a decision tree, why not try to solve the problem with a decision tree classifier?
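A minimal sketch of that idea, with a made-up dataframe and column names (target_col, country, device) standing in for a real dataset:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: `target_col` happens to be NULL mostly when device == 'ios' (made-up pattern).
df = pd.DataFrame({
    "country": ["DE", "DE", "US", "US", "FR", "FR"] * 50,
    "device":  ["ios", "android", "ios", "web", "android", "ios"] * 50,
    "target_col": [None, 1.0, None, 2.0, 3.0, None] * 50,
})

X = pd.get_dummies(df[["country", "device"]])   # one-hot encode the candidate dimensions
y = df["target_col"].isnull()                   # target: is the field NULL in this row?

# A shallow tree keeps the resulting explanation human-readable.
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))
```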
Topics covered: reading (create an ElementTree object from an XML file, list only child elements, list all elements of any depth, pretty print) and writing (create an element, attach a sub-element, append a sub-element), plus further reading. The examples parse a sample XML file downloaded from https://www.w3schools.com/xml/cd_catalog.xml; its root is a CATALOG element containing CD child elements.
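A cleaned-up version of the snippet flattened into the excerpt above (the hex addresses in the original printed output are run-specific and omitted):

```python
import requests
from xml.etree import ElementTree as et

# Download a sample XML file to work with.
xml_file_contents = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').content
with open('cd_catalog.xml', 'wb') as f:
    f.write(xml_file_contents)

# Create an ElementTree object from the XML file and grab its root element.
tree = et.ElementTree(file='cd_catalog.xml')
root = tree.getroot()          # <Element 'CATALOG'>

# List (only) the first few direct child elements.
print(list(root)[0:5])         # five <Element 'CD'> objects
```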
Emoji to PNG Motivation I put together an emoji search Alfred workflow which uses alfy to filter this JSON file of emoji. There are plenty of existing emoji Alfred workflows around, but I wanted one that allowed me to edit the aliases for individual emoji. The one missing piece was to have the workflow display the emoji itself as the icon for each result. The Alfred Script Filter JSON Format includes an icon field, but it expects the path to an actual icon file on disk.
A use case for templating your SQL queries: suppose you have a table raw_events which contains events related to an email marketing campaign, and you’d like to see the total number of each event type per day. This is a classic use case for a pivot table, but let’s suppose you are using an SQL engine such as Redshift / Postgres which does not have a built-in pivot function. The quick-and-dirty solution here is to manually build the pivot table yourself, using a series of CASE WHEN expressions.
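A minimal sketch of what that templating might look like in Python, assuming hypothetical column names (event_date, event_type) and a made-up set of event types:

```python
# Hypothetical event types we want as pivot columns.
event_types = ['sent', 'open', 'click', 'unsubscribe']

# One SUM(CASE WHEN ...) expression per event type.
case_whens = ',\n    '.join(
    f"SUM(CASE WHEN event_type = '{e}' THEN 1 ELSE 0 END) AS {e}"
    for e in event_types
)

query = f"""
SELECT
    event_date,
    {case_whens}
FROM raw_events
GROUP BY 1
ORDER BY 1
"""
print(query)
```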
Despite being a relatively modern phone, my OnePlus 6T records video using the H.264 codec rather than the newer H.265 HEVC codec. A minute of 1080p video takes up ~150MB of storage, and double that for 60fps mode or 4K. Even though the phone has a decent amount of storage (64GB) it quickly fills up if you record a lot of video. The storage savings from HEVC are pretty astounding. It typically requires 50% less bitrate (and hence storage space) to achieve the same level of quality as H.264.
A few weeks ago while learning about Naive Bayes, I wrote a post about implementing Naive Bayes from scratch with Python. The exercise proved quite helpful for building intuition around the algorithm. So this is a post in the same spirit on the topic of AdaBoost.
While learning about Naive Bayes classifiers, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the sklearn.naive_bayes.MultinomialNB estimator which produces identical results on a sample dataset.
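For reference, the sklearn estimator the from-scratch version is checked against, run here on a made-up word-count matrix:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 4 documents x 3 features, two classes.
X = np.array([[2, 1, 0],
              [3, 0, 1],
              [0, 2, 4],
              [1, 1, 5]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB(alpha=1.0).fit(X, y)
print(clf.predict(X))         # predicted classes
print(clf.predict_proba(X))   # class probabilities a from-scratch version should reproduce
```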
There are a number of good free sources for market data such as Yahoo Finance or Google Finance. It is easy to pull this data into python using something like the yfinance package. But these sources generally only contain data for currently listed stocks.
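For example, pulling daily prices with yfinance (the ticker and date range here are arbitrary):

```python
import yfinance as yf

# Daily OHLCV prices for a currently listed ticker.
prices = yf.download('AAPL', start='2020-01-01', end='2020-12-31')
print(prices.head())
```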
I love jupyter notebooks. As a data scientist, notebooks are probably the most fundamental tool in my daily workflow. They fulfill multiple roles: documenting what I have tried in a lab notebook for the benefit of my future self, and also serving as a self-contained format for the final version of an analysis, which can be committed to our team git repo and then discovered or reproduced later by other members of the team.
TL;DR If you need a single random number (or up to 5) use the built-in random module instead of np.random. An instinct to vectorize An early learning for any aspiring pandas user is to always prefer “vectorized” operations over iteratively looping over individual values in some dataframe. These operations—which include most built-in methods—are compiled into Cython and executed at blazing-fast speeds behind the scenes. It is very often worth the effort of massaging your logic into a slightly less expressive form if you can leverage vectorized functions to avoid the performance hit of for-loops.
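A quick way to see the gap described here for yourself, using timeit to generate single random floats (absolute timings will of course vary by machine):

```python
import timeit

# Generate a single random float 100k times: stdlib `random` vs `np.random`.
t_stdlib = timeit.timeit('random.random()', setup='import random', number=100_000)
t_numpy = timeit.timeit('np.random.rand()', setup='import numpy as np', number=100_000)

print(f'random.random(): {t_stdlib:.4f}s    np.random.rand(): {t_numpy:.4f}s')
```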
Netflix recently posted an article on their tech blog titled Reimagining Experimentation Analysis at Netflix. Most of the post is about their experimentation infrastructure, but their example of a visualization of an experiment result caught my eye. A/B test results are notoriously difficult to visualize in an intuitive (but still correct) way. I’ve searched for best practices before, and the only reasonable template I could find is built for Excel, which doesn’t fit my python workflow.
While coding up a reinforcement learning algorithm in python, I came across a problem I had never considered before… What’s the fastest way to sample from an array while building it? If you’re reading this, you should first question whether you actually need to iteratively build and sample from a python array in the first place. If you can build the array first and then sample a vector from it using np.
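A sketch of the two patterns being contrasted, with a made-up value-generating step standing in for the real algorithm:

```python
import random
import numpy as np

# Pattern 1: iteratively build the list and sample from it on every step.
values, running_samples = [], []
for _ in range(1_000):
    values.append(random.gauss(0, 1))               # stand-in for the value produced this step
    running_samples.append(random.choice(values))   # sample from everything seen so far

# Pattern 2: build the full array first, then draw all samples in one vectorized call.
arr = np.random.normal(0, 1, size=1_000)
vector_samples = np.random.choice(arr, size=1_000)
```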
Multimodal distributions are commonly seen when analyzing composite variables such as insurance claims, where some large proportion are zero, but the non-zero values take on a distribution of their own. Breaking down these sorts of distributions into their component parts allows us to more effectively model each piece and then recombine them at a later stage.
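An illustration of that kind of composite variable via simulation (the zero proportion and lognormal parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# ~70% of claims are exactly zero; the rest follow a lognormal distribution.
is_nonzero = rng.random(n) < 0.3
claims = np.where(is_nonzero, rng.lognormal(mean=7, sigma=1, size=n), 0.0)

# Model the two components separately: P(nonzero) and the distribution of non-zero values.
p_nonzero = is_nonzero.mean()
nonzero_claims = claims[claims > 0]
print(p_nonzero, nonzero_claims.mean())
```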
You don’t need to be a dummy to fall for the ‘Dummy Variable Trap’ while fitting a linear model, especially if you are using default parameters for one-hot encoding in scikit-learn. By default, OneHotEncoder sets the parameter drop=None, which causes it to output $ k $ columns, one per category level. When this output is then used to fit a linear model with an intercept, we end up with perfect multicollinearity, and the model overfits the data using unrealistic coefficients.
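A minimal sketch of the fix, using drop='first' so each feature contributes $ k-1 $ columns and stays identifiable alongside the intercept (toy data):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

cities = np.array([['Berlin'], ['Munich'], ['Hamburg'], ['Berlin'], ['Munich']])
y = np.array([1.0, 2.0, 3.0, 1.2, 2.1])

# drop='first' emits k-1 columns per feature, avoiding perfect multicollinearity
# between the dummy columns and the model's intercept.
enc = OneHotEncoder(drop='first')
X = enc.fit_transform(cities).toarray()

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)
```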
I fielded an interesting request recently from our PR team, who wanted to generate a creative representation of our data based on the direction and distance of trips booked on our platform. Distance is a key attribute of interest for a travel business, so it is naturally easy to retrieve this data. However, the direction of a trip had not previously been analyzed, so it was not available off-the-shelf in our data warehouse.
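Computing the direction of a trip from origin/destination coordinates comes down to the standard initial-bearing formula; a sketch with approximate, made-up city coordinates:

```python
import math

def initial_bearing(lat1, lon1, lat2, lon2):
    """Initial bearing in degrees (0 = north, 90 = east) from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(x, y)) % 360

# e.g. a Berlin -> Munich trip points roughly south-south-west (~195 degrees)
print(initial_bearing(52.52, 13.40, 48.14, 11.58))
```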
A collection of boilerplate code and edge cases collected while using scikit-learn.