Goal: Suppose you have a complex dataset, one that you don’t fully understand, and you want to work out why one field is unexpectedly NULL in some rows.
You could identify potentially relevant dimensions and manually flip through each to see if their values are correlated with the missingness of your data.
But this can be tedious. And since you’d effectively be acting like a decision tree, why not try to solve the problem with a decision tree classifier?
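A minimal sketch of that idea: label each row by whether the field is NULL, then fit a shallow tree on candidate dimensions and read off the splits. The DataFrame and column names below are made up for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: "target_col" is sometimes NULL, and we want to know why.
df = pd.DataFrame({
    "region": ["us", "eu", "eu", "us", "apac", "eu"],
    "source": ["api", "batch", "api", "batch", "api", "batch"],
    "target_col": [1.0, None, None, 2.0, 3.0, None],
})

# The label is simply "is this field NULL?"
y = df["target_col"].isna()

# One-hot encode the candidate predictor dimensions.
X = pd.get_dummies(df[["region", "source"]])

# A shallow tree keeps the explanation human-readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```

The printed rules point at which dimension values co-occur with the missingness, which is exactly the manual flipping-through the paragraph above describes.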


A few weeks ago while learning about Naive Bayes, I wrote a post about implementing Naive Bayes from scratch with Python. The exercise proved quite helpful for building intuition around the algorithm. So this is a post in the same spirit on the topic of AdaBoost.

While learning about Naive Bayes classifiers, I decided to implement the algorithm from scratch to help solidify my understanding of the math. The goal of this notebook is to implement a simplified, easily interpretable version of the `sklearn.naive_bayes.MultinomialNB` estimator that produces identical results on a sample dataset.
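The core of such a from-scratch version fits in a few lines: smoothed per-class feature log-likelihoods plus class log-priors, with predictions taken as the argmax of the summed log-probabilities. The toy count matrix below is invented for illustration; only the comparison against `MultinomialNB` is the point.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count data: 4 documents, 3 vocabulary terms, 2 classes.
X = np.array([[2, 1, 0], [1, 0, 3], [0, 2, 1], [3, 0, 0]])
y = np.array([0, 1, 0, 1])
alpha = 1.0  # Laplace smoothing

classes = np.unique(y)
# Class log-priors: log(fraction of samples in each class)
log_prior = np.log(np.array([(y == c).mean() for c in classes]))
# Smoothed feature log-likelihoods: log((N_ci + alpha) / (N_c + alpha * n_features))
counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))

# Predict via argmax_c [ log P(c) + sum_i x_i * log theta_{c,i} ]
scores = X @ log_lik.T + log_prior
ours = classes[scores.argmax(axis=1)]

# Sanity check against the sklearn estimator.
sk = MultinomialNB(alpha=alpha).fit(X, y)
print((ours == sk.predict(X)).all())
```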

Multimodal distributions commonly arise when analyzing composite variables such as insurance claims, where some large proportion of values are zero but the non-zero values take on a distribution of their own. Breaking these distributions into their component parts allows us to more effectively model each piece and then recombine them at a later stage.

You don’t need to be a dummy to fall for the ‘Dummy Variable Trap’ while fitting a linear model, especially if you are using the default parameters for one-hot encoding in scikit-learn. By default, `OneHotEncoder` sets `drop=None`, which causes it to output $k$ columns for a feature with $k$ categories. When that encoding is then used to fit a linear model with an intercept, we get perfect multicollinearity, and the model overfits the data with unrealistic coefficients.
A collection of boilerplate code and edge cases collected while using scikit-learn.