Geoff Ruddock

Tags / python


A use case for templating your SQL queries

Suppose you have a table raw_events which contains events related to an email marketing campaign. You’d like to see the total number of each event type per day. This is a classic use case for a pivot table, but let’s suppose you are using an SQL engine such as Redshift or Postgres which does not have a built-in pivot function. The quick-and-dirty solution is to build the pivot table manually, using a series of CASE WHEN expressions.
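A minimal sketch of that idea in Python (not the post's actual code), assuming hypothetical event_type and event_time columns and using a plain f-string in place of a full templating engine:

```python
# Hypothetical event types and column names -- adjust to your schema.
event_types = ["sent", "opened", "clicked"]

case_exprs = ",\n    ".join(
    f"COUNT(CASE WHEN event_type = '{e}' THEN 1 END) AS {e}"
    for e in event_types
)

query = f"""
SELECT
    DATE_TRUNC('day', event_time) AS day,
    {case_exprs}
FROM raw_events
GROUP BY 1
ORDER BY 1
"""
print(query)
```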
Despite being a relatively modern phone, my OnePlus 6T records video using the H.264 codec rather than the newer H.265 HEVC codec. A minute of 1080p video takes up ~150MB of storage, and double that for 60fps mode or 4K. Even though the phone has a decent amount of storage (64GB) it quickly fills up if you record a lot of video. The storage savings from HEVC are pretty astounding. It typically requires 50% less bitrate (and hence storage space) to achieve the same level of quality as H.264.
A few weeks ago while learning about Naive Bayes, I wrote a post about implementing Naive Bayes from scratch with Python. The exercise proved quite helpful for building intuition around the algorithm. So this is a post in the same spirit on the topic of AdaBoost.
While learning about Naive Bayes classifiers, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the sklearn.naive_bayes.MultinomialNB estimator which produces identical results on a sample dataset.
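As a rough illustration of what that boils down to (a sketch, not the notebook's code), the core of a multinomial Naive Bayes fit and predict can be written in a few lines of numpy:

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Return classes, log priors, and Laplace-smoothed log likelihoods."""
    classes = np.unique(y)
    log_priors = np.log([np.mean(y == c) for c in classes])
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihoods = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_priors, log_likelihoods

def predict(X, classes, log_priors, log_likelihoods):
    # Log posterior (up to a constant) = log prior + count-weighted log likelihoods.
    joint = X @ log_likelihoods.T + log_priors
    return classes[np.argmax(joint, axis=1)]

# Tiny made-up demo with word-count features: 2 classes, 3 features.
X = np.array([[2, 1, 0], [3, 0, 0], [0, 1, 3], [0, 2, 2]])
y = np.array([0, 0, 1, 1])
params = fit_multinomial_nb(X, y)
print(predict(np.array([[1, 0, 2]]), *params))  # -> [1]
```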
There are a number of good free sources for market data such as Yahoo Finance or Google Finance. It is easy to pull this data into python using something like the yfinance package. But these sources generally only contain data for currently listed stocks.
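For example, a few lines with yfinance are enough to pull daily prices; the ticker and date range here are arbitrary:

```python
import yfinance as yf

# Download daily OHLCV data for a single ticker over one year.
prices = yf.download("AAPL", start="2015-01-01", end="2015-12-31")
print(prices.head())
```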
I love Jupyter notebooks. As a data scientist, notebooks are probably the most fundamental tool in my daily workflow. They fulfill multiple roles: documenting what I have tried in a lab notebook for the benefit of my future self, and serving as a self-contained format for the final version of an analysis, which can be committed to our team git repo and later discovered or reproduced by other members of the team.
TL;DR: If you need a single random number (or up to 5), use the built-in random module instead of np.random.

An instinct to vectorize

An early learning for any aspiring pandas user is to always prefer “vectorized” operations over iteratively looping over individual values in some dataframe. These operations—which include most built-in methods—are compiled into Cython and executed at blazing-fast speeds behind the scenes. It is very often worth the effort of massaging your logic into a slightly less expressive form if you can leverage vectorized functions to avoid the performance hit of for-loops.
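A quick, illustrative way to see the gap (exact numbers will vary by machine); the last timing shows why np.random still wins once you can vectorize:

```python
import timeit

# Drawing one number at a time: the stdlib random module has far less
# per-call overhead than np.random.
t_std = timeit.timeit("random.random()", setup="import random", number=1_000_000)
t_np = timeit.timeit("np.random.random()", setup="import numpy as np", number=1_000_000)
print(f"random.random():    {t_std:.3f}s")
print(f"np.random.random(): {t_np:.3f}s")

# Vectorized draws are a different story: one call for a million numbers.
t_vec = timeit.timeit("np.random.random(1_000_000)", setup="import numpy as np", number=1)
print(f"np.random.random(1_000_000): {t_vec:.3f}s")
```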
Netflix recently posted an article on their tech blog titled Reimagining Experimentation Analysis at Netflix. Most of the post is about their experimentation infrastructure, but their example of a visualization of an experiment result caught my eye. A/B test results are notoriously difficult to visualize in an intuitive (but still correct) way. I’ve searched for best practices before, and the only reasonable template I could find is built for Excel, which doesn’t fit my python workflow.
While coding up a reinforcement learning algorithm in python, I came across a problem I had never considered before… What’s the fastest way to sample from an array while building it? If you’re reading this, you should first question whether you actually need to iteratively build and sample from a python array in the first place. If you can build the array first and then sample a vector from it using np.
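Roughly, the two situations contrasted below; this is just a sketch of the question, not the post's benchmark:

```python
import random
import numpy as np

rng = np.random.default_rng(0)

# If the array can be built up front, one vectorized call is the easy win.
values = np.arange(100_000)
batch = rng.choice(values, size=1_000)

# If each step genuinely needs a draw from the array built so far,
# the stdlib's random.choice keeps the per-iteration overhead low.
pool, draws = [], []
for i in range(10_000):
    pool.append(i)
    draws.append(random.choice(pool))
```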
Multimodal distributions are commonly seen when analyzing composite variables such as insurance claims, where a large proportion of values are zero, but the non-zero values take on a distribution of their own. Breaking these sorts of distributions down into their component parts allows us to more effectively model each piece and then recombine them at a later stage.
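As an illustration of the idea (with made-up data, not from the post), a zero-inflated lognormal can be split into a zero-probability component and a lognormal body, then recombined for quantities such as the expected value:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up composite variable: ~70% zeros, the rest lognormally distributed.
n = 100_000
claims = np.where(rng.random(n) < 0.7, 0.0,
                  rng.lognormal(mean=7.0, sigma=1.0, size=n))

# Model each component separately...
p_nonzero = np.mean(claims > 0)
log_body = np.log(claims[claims > 0])
mu, sigma = log_body.mean(), log_body.std()

# ...then recombine, e.g. the expected value of the composite variable.
expected = p_nonzero * np.exp(mu + sigma ** 2 / 2)
print(f"P(nonzero)={p_nonzero:.2f}, E[claim]≈{expected:,.0f}")
```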