March 16, 2020

Building a Naive Bayes classifier from scratch with NumPy

Goal: While learning about Naive Bayes classifiers, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the sklearn.naive_bayes.MultinomialNB estimator which produces identical results on a sample dataset. While I generally find scikit-learn documentation very helpful, its source code is a bit trickier to grok, since it optimizes for efficiency, in terms of both computation and maintenance, across a wide family of models. Read more
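To give a flavor of what such a from-scratch implementation involves, here is a minimal sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It is not the notebook's code: it uses plain Python lists instead of NumPy to stay dependency-free, and the function names and toy data are made up for illustration.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_priors, log_likelihoods = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        # Empirical class prior: share of training rows with label c.
        log_priors[c] = math.log(len(rows) / len(X))
        # Per-feature counts within class c, plus the smoothing term alpha.
        counts = [sum(row[j] for row in rows) + alpha for j in range(n_features)]
        total = sum(counts)
        log_likelihoods[c] = [math.log(cnt / total) for cnt in counts]
    return classes, log_priors, log_likelihoods

def predict_one(x, classes, log_priors, log_likelihoods):
    """Return the class with the highest unnormalized log posterior."""
    scores = {
        c: log_priors[c] + sum(n * ll for n, ll in zip(x, log_likelihoods[c]))
        for c in classes
    }
    return max(scores, key=scores.get)

# Hypothetical toy data: rows are word-count vectors, labels are classes.
X = [[2, 1, 0], [3, 0, 0], [0, 1, 3], [0, 0, 2]]
y = ["a", "a", "b", "b"]
model = fit_multinomial_nb(X, y)
pred_first = predict_one([1, 0, 0], *model)
pred_last = predict_one([0, 0, 2], *model)
```

With `alpha=1.0` the smoothed likelihood for feature `j` in class `c` is `(count_cj + alpha) / (count_c + alpha * n_features)`, which is the same form of smoothing that MultinomialNB documents.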

September 16, 2019

Building a hurdle regression estimator in scikit-learn

What are hurdle models? Google explains best: "The hurdle model is a two-part model that specifies one process for zero counts and another process for positive counts. The idea is that positive counts occur once a threshold is crossed, or put another way, a hurdle is cleared." (Getting started with hurdle models [University of Virginia Library]) What are hurdle models useful for? Many statistical learning models, particularly linear models, assume some level of normality in the response variable being predicted. Read more
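The two-part structure can be sketched in a few lines. This is only an illustration under strong simplifying assumptions, not the estimator built in the post: each "process" is reduced to a per-group average rather than a fitted classifier and regressor, and the function names, group labels, and data are hypothetical.

```python
from collections import defaultdict

def fit_hurdle_by_group(groups, y):
    """Per group, estimate P(y > 0) and the mean of the positive values."""
    values = defaultdict(list)
    for g, v in zip(groups, y):
        values[g].append(v)
    params = {}
    for g, vs in values.items():
        positives = [v for v in vs if v > 0]
        p_positive = len(positives) / len(vs)  # part 1: is the hurdle cleared?
        mean_positive = sum(positives) / len(positives) if positives else 0.0  # part 2
        params[g] = (p_positive, mean_positive)
    return params

def predict_hurdle(params, group, threshold=0.5):
    """Predict 0 unless the hurdle is likely cleared, then use the positive mean."""
    p_positive, mean_positive = params[group]
    return mean_positive if p_positive >= threshold else 0.0

# Hypothetical data: purchase counts by channel, with many zeros for "web".
groups = ["web", "web", "web", "web", "store", "store", "store", "store"]
y = [0, 0, 0, 4, 2, 3, 0, 4]
params = fit_hurdle_by_group(groups, y)
web_prediction = predict_hurdle(params, "web")
store_prediction = predict_hurdle(params, "store")
```

The point of the split is that the zero/non-zero decision and the size of the positive values are modeled separately, so each part can later be swapped for a model better suited to its task.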

July 29, 2019

One-hot encoding + linear regression = multi-collinearity

My coefficients are bigger than your coefficients. I was attempting to fit a simple linear regression model the other day with sklearn.linear_model.LinearRegression, but the model was making terribly inaccurate predictions on the test dataset. Upon inspecting the estimated coefficients, I noticed that they were of a crazy magnitude, on the order of billions. For reference, I was predicting a response which was approximately normally distributed with a mean value of 100. Read more
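The link between one-hot encoding and multi-collinearity named in the title can be seen in a small sketch. It assumes the model also fits an intercept (the default for LinearRegression); the `one_hot` helper and the color data are hypothetical and written in plain Python for illustration.

```python
def one_hot(values, drop_first=False):
    """One-hot encode a list of category labels, optionally dropping one level."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # the dropped level becomes the baseline
    return [[1 if v == level else 0 for level in levels] for v in values]

colors = ["red", "green", "blue", "green", "red"]

full = one_hot(colors)                      # one column per level
reduced = one_hot(colors, drop_first=True)  # one column fewer

# With every level encoded, each row sums to exactly 1, so the dummy columns
# add up to the intercept column: a perfect linear dependency.
full_row_sums = [sum(row) for row in full]

# After dropping one level, the row sums are no longer constant, so the
# dummies can no longer reproduce the intercept column.
reduced_row_sums = [sum(row) for row in reduced]
```

When the columns are perfectly dependent like this, the least-squares solution is not unique, so individual coefficients can become arbitrarily large while still cancelling each other out on the training data.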

© Geoff Ruddock 2019