While learning about Naive Bayes classifiers, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the sklearn.naive_bayes.MultinomialNB estimator which produces identical results on a sample dataset.
While I generally find the scikit-learn documentation very helpful, its source code is a bit trickier to grok, since it optimizes for both computational and maintenance efficiency across a wide family of models. Our estimator of interest, MultinomialNB, inherits from _BaseDiscreteNB, which itself inherits from _BaseNB, which in turn has multiple inheritance from BaseEstimator and ClassifierMixin.
What is Naive Bayes?
Naive Bayes is a simple generative (probabilistic) classification model based on Bayes’ theorem. The typical example use-case for this algorithm is classifying email messages as spam or “ham” (non-spam) based on the previously observed frequency of words which have appeared in known spam or ham emails in the past.
$$ P(\text{spam} \mid \text{text}) = \frac{P(\text{text} \mid \text{spam}) \, P(\text{spam})}{P(\text{text})} $$
Following typical ML notation, we use $y$ to denote the “class” of our message, where $y=1$ for spam messages and $y=0$ for non-spam messages. We represent our text data as an array $\mathbf{x}$ of length $J$, where the $j^{th}$ entry counts the number of times the $j^{th}$ vocabulary word appears in a particular email, and $J$ is the total number of distinct words seen across all training data. Our model then becomes
$$ P(y|x) = \frac{P(x|y) \, P(y)}{P(x)} $$
Why would we want to use something “naive”?
This classifier is “naive” in the sense that predictive features are assumed to be conditionally independent given their class. Naturally this is a gross simplification of reality: if an email contains the word “sports”, we would expect it to also be more likely to contain related words like “bet” or “odds”. But this assumption allows us to calculate the joint probability simply as the product of marginal likelihoods, without worrying about the correlation structure between different words. This makes the model much easier to fit.
All models are wrong, some are useful. — George Box, statistician
If our model were not “naive”, we would have to calculate the joint likelihood function as a messy product of $J$ separate conditional likelihood functions. With so many parameters to estimate, we would need a large quantity of training data to avoid overfitting.
$$ P(\mathbf{x} \vert y) = P(x_1 \vert y) P(x_2 \vert x_1, y) P(x_3 \vert x_1, x_2, y) \ldots P(x_j \vert x_1, x_2, \ldots, x_{j-1}, y) $$
But if we assume conditional independence, this calculation becomes much simpler. We can now just multiply together the likelihoods of each word $x_j$, conditional only on its class $y$ (whether or not the message is spam), to get the joint likelihood for the entire message.
$$ P(\mathbf{x}|y=c) = \prod_{j=1}^J P(x_j | y=c) $$
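For example, for a message containing only the words “secret” and “offer” (once each), the spam-conditional likelihood factors as
$$ P(\mathbf{x} \mid y=1) = P(\text{secret} \mid y=1) \, P(\text{offer} \mid y=1) $$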
Our toy dataset
The function below generates a test dataset based on Chapter 3.5, Exercise 3.22 from Machine Learning: A Probabilistic Perspective.
import numpy as np
import pandas as pd
from typing import Callable, Tuple
from sklearn.feature_extraction.text import CountVectorizer
def make_spam_dataset(show_X: bool = True) -> Tuple[np.ndarray, np.ndarray, Callable]:
""" Create a small toy dataset for MultinomialNB implementation
Returns:
X: word count matrix
y: indicator of whether or not message is spam
msg_tx_func: a function to transform new test data into word count matrix
"""
vocab = [
'secret', 'offer', 'low', 'price', 'valued', 'customer', 'today',
'dollar', 'million', 'sports', 'is', 'for', 'play', 'healthy', 'pizza'
]
spam = [
'million dollar offer',
'secret offer today',
'secret is secret'
]
not_spam = [
'low price for valued customer',
'play secret sports today',
'sports is healthy',
'low price pizza'
]
all_messages = spam + not_spam
vectorizer = CountVectorizer(vocabulary=vocab)
word_counts = vectorizer.fit_transform(all_messages).toarray()
df = pd.DataFrame(word_counts, columns=vocab)
is_spam = [1] * len(spam) + [0] * len(not_spam)
msg_tx_func = lambda x: vectorizer.transform(x).toarray()
if show_X:
display(df)
return df.to_numpy(), np.array(is_spam), msg_tx_func
X, y, tx_func = make_spam_dataset()
| | secret | offer | low | price | valued | customer | today | dollar | million | sports | is | for | play | healthy | pizza |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
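Before moving on, note that the returned msg_tx_func (captured above as tx_func) converts new messages into the same word count representation, which we will need later for scoring unseen emails. A minimal sketch, using a made-up message:

# transform a new message into a count vector over the same 15-word vocabulary
print(tx_func(['secret sports today']))  # expect 1s in the secret, today and sports columns, 0s elsewhere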
Our model
Our model needs to compute the component “ingredients” of Bayes’ theorem in order to make predictions on future data points. The table below describes each component, and shows the mapping between the math notation above and the variable names in our code below.

| Variable | Math | Description |
|---|---|---|
| prior | $P(y)$ | Our prior belief in the probability of any randomly selected message belonging to a particular class (spam or not-spam). |
| lk_word | $P(x_j \vert y)$ | The likelihood of each word, conditional on message class. We are implicitly using the multinomial distribution here. Intuitively, the word conditional likelihoods are just the normalized frequencies within each message class. |
| lk_message | $P(\mathbf{x} \vert y)$ | The likelihood of an entire message (combination of words present) conditional on the message belonging to a particular class. |
| normalize_term | $P(\mathbf{x})$ | The likelihood of an entire message across all possible classes. |
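The normalizing term follows from the law of total probability, summing the class-conditional likelihoods weighted by their priors:
$$ P(\mathbf{x}) = \sum_{c} P(\mathbf{x} \mid y=c) \, P(y=c) $$
This is exactly how normalize_term is computed in the prediction code below.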
We’ve got a few additional attributes as well:
- The alpha attribute will be added to each word count, to avoid us having zero probabilities for words not seen in our training sample.
- The is_fitted_ attribute is a scikit-learn convention to ensure we don’t accidentally try to make predictions on a model that has not yet been fitted.
class NaiveBayes(object):
    """ DIY binary Naive Bayes classifier based on word count data """
    def __init__(self, alpha=1.0):
        """ Store the smoothing parameter and initialize empty model attributes """
        self.prior = None
        self.word_counts = None
        self.lk_word = None
        self.alpha = alpha
        self.is_fitted_ = False
Fitting the model
Let’s implement our algorithm to handle an arbitrary number of classes, even though our toy example only has two (spam/not-spam). Our fit function needs to do two things:
Calculate prior
This one is easy. We split our input array into one sub-array per class in X_by_class, then count the number of elements in each class to arrive at our prior.
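For our toy dataset of 3 spam and 4 non-spam messages, this works out to
$$ P(y=1) = \frac{3}{7} \approx 0.43, \qquad P(y=0) = \frac{4}{7} \approx 0.57 $$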
Calculate likelihoods
We set word_counts by looping over each of the sub-arrays in X_by_class and taking the column sums within each sub-array. Note that the numpy notation is a bit unintuitive here: .sum(axis=0) means that we collapse the $0^{th}$ axis (rows), leaving only columns. This gives us an array of shape (c, j), which counts the number of times the $j^{th}$ word appears across all emails of class $c$.
Finally, our likelihood function lk_word is simply these word counts divided by the total number of times all words appear in each class. We achieve this by taking row sums with .sum(axis=1).
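As a concrete example, “secret” appears 3 times among the 9 total words in the spam messages and once among the 15 total words in the non-spam messages, so with $\alpha = 1$ smoothing over the 15-word vocabulary we expect
$$ P(\text{secret} \mid y=1) = \frac{3 + 1}{9 + 15} = \frac{1}{6}, \qquad P(\text{secret} \mid y=0) = \frac{1 + 1}{15 + 15} = \frac{1}{15} $$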
from sklearn.utils.validation import check_X_y, check_array
def fit(self, X: np.ndarray, y: np.ndarray):
    """ Fit training data for Naive Bayes classifier """
    # not strictly necessary, but this ensures we have clean input
    X, y = check_X_y(X, y)
    n = X.shape[0]
    # group rows of X by class; a plain list avoids building a ragged numpy array
    X_by_class = [X[y == c] for c in np.unique(y)]
    # prior = fraction of training messages belonging to each class
    self.prior = np.array([len(X_class) / n for X_class in X_by_class])
    # per-class word counts, with alpha added for additive (Laplace) smoothing
    self.word_counts = np.array([sub_arr.sum(axis=0) for sub_arr in X_by_class]) + self.alpha
    # normalize each row so the word likelihoods within a class sum to 1
    self.lk_word = self.word_counts / self.word_counts.sum(axis=1).reshape(-1, 1)
    self.is_fitted_ = True
    return self
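As a quick sanity check on the fitted values, we can call the standalone fit function directly (it has not yet been attached to the class at this point), passing a fresh NaiveBayes instance as self. The expected numbers in the comments come from the worked examples above.

model = fit(NaiveBayes(alpha=1), X, y)
print(model.prior)          # [4/7, 3/7] ≈ [0.571, 0.429]; row order follows np.unique(y) = [0, 1]
print(model.lk_word[1, 0])  # P(secret | spam) = 4/24 ≈ 0.167
print(model.lk_word[0, 0])  # P(secret | not spam) = 2/30 ≈ 0.067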
Predicting new emails
We can now make predictions, either on the same emails we used to train the model, or on entirely new emails never before seen by the model. We’ll do this by first predicting probabilities for each class, then making our final prediction by taking the class with the highest probability.
Recall that our conditional likelihood for an entire message $\mathbf{x}$ is calculated as the product of conditional likelihoods for each word $x_j$ present in the message. Note here that if a word appears twice, its lk_word gets factored twice into our joint likelihood.
$$ P(\mathbf{x}|y=c) = \prod_{j=1}^J P(x_j | y=c) $$
So we loop over each message (row) in our array $X$ and calculate individual conditional likelihoods, then multiply them all together and multiply by our class priors. At the very end, we divide everything by $P(\mathbf{x})$ so that we have valid probabilities.
What about previously unseen words?
Suppose we have a word which has never appeared in training messages labelled as spam. Its conditional likelihood would be zero, which would take our entire joint likelihood to zero as well. This is precisely why we added alpha while calculating word counts in the fit function, so that this situation does not occur.
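For example, “pizza” never appears in a spam training message, so without smoothing any message containing it would have a spam likelihood of exactly zero, no matter how spammy its other words are. With $\alpha = 1$ it instead gets a small but non-zero likelihood:
$$ P(\text{pizza} \mid y=1) = \frac{0 + 1}{9 + 15} = \frac{1}{24} $$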
Probabilistic prediction
def predict_proba(self, X: np.ndarray) -> np.ndarray:
    """ Predict probability of class membership """
    assert self.is_fitted_, 'Model must be fit before predicting'
    X = check_array(X)
    # loop over each observation to calculate conditional probabilities
    class_numerators = np.zeros(shape=(X.shape[0], self.prior.shape[0]))
    for i, x in enumerate(X):
        word_exists = x.astype(bool)
        # likelihood of each word present in the message, raised to its count
        lk_words_present = self.lk_word[:, word_exists] ** x[word_exists]
        # product across words gives the per-class likelihood of the whole message
        lk_message = lk_words_present.prod(axis=1)
        class_numerators[i] = lk_message * self.prior
    # divide by P(x) so that each row is a valid probability distribution
    normalize_term = class_numerators.sum(axis=1).reshape(-1, 1)
    conditional_probas = class_numerators / normalize_term
    assert (np.abs(conditional_probas.sum(axis=1) - 1) < 0.001).all(), 'Rows should sum to 1'
    return conditional_probas
Binary prediction
Our predict_proba function will return probabilities for each class, but in our toy example we really just want a binary outcome: is the message spam or not? Once we’ve done the work to get the class probabilities, it is easy to find the index of the highest-probability class using .argmax(axis=1).
def predict(self, X: np.ndarray) -> np.ndarray:
""" Predict class with highest probability """
return self.predict_proba(X).argmax(axis=1)
Putting it all together
We defined the above logic as standalone functions, so now we need to assign each of them to the relevant method of our NaiveBayes class. This would not be necessary if we defined everything in a single module or notebook cell.
# attach functions defined above to our classifier
# this is not needed if you define the entire class in a single cell
NaiveBayes.fit = fit
NaiveBayes.predict_proba = predict_proba
NaiveBayes.predict = predict
preds = NaiveBayes().fit(X, y).predict(X)
print(f'Accuracy: {(preds == y).mean()}')
Accuracy: 1.0
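We can also score a message the model has never seen by running it through the tx_func transformer from earlier. The example message below is made up for illustration, but given the fitted likelihoods it should land firmly on the spam side:

model = NaiveBayes().fit(X, y)
new_message = ['million dollar offer today']      # hypothetical unseen email
print(model.predict_proba(tx_func(new_message)))  # spam probability should be well above 0.5
print(model.predict(tx_func(new_message)))        # expected output: [1], i.e. spam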
You can find a gist with the code all together here.
Comparing with sklearn
The function below fits our model alongside MultinomialNB and asserts that we have matching values for our priors, likelihoods, and predictions.
from sklearn.naive_bayes import MultinomialNB
def test_against_benchmark():
    """ Check that DIY model matches outputs from scikit-learn estimator """
    X, y, _ = make_spam_dataset(show_X=False)
    bench = MultinomialNB().fit(X, y)
    model = NaiveBayes(alpha=1).fit(X, y)
    # sklearn stores these quantities on the log scale, so exponentiate before comparing
    assert (np.abs(model.prior / np.exp(bench.class_log_prior_) - 1) < 0.001).all()
    print('[✔︎] Identical prior probabilities')
    assert (np.abs(model.lk_word / np.exp(bench.feature_log_prob_) - 1) < 0.001).all()
    print('[✔︎] Identical word likelihoods')
    assert (np.abs(model.predict_proba(X) / bench.predict_proba(X) - 1) < 0.001).all()
    print('[✔︎] Identical predictions')
test_against_benchmark()
[✔︎] Identical prior probabilities
[✔︎] Identical word likelihoods
[✔︎] Identical predictions
Further reading
If you want to learn more about Naive Bayes, here are a few of the resources I found most helpful.
- Notes on Naive Bayes Classifiers for Spam Filtering (Jonathan Lee, University of Washington) is a good entry point, as it provides a relatively succinct description of the typical spam detection example for Naive Bayes.
- The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm (Michael Collins, Columbia) provides a more comprehensive walkthrough of the math behind NB, including derivation of maximum likelihood estimates.
- sklearn.naive_bayes.MultinomialNB (scikit-learn docs) is the example implementation which I tried to reproduce.
- Naive Bayes from Scratch in Python (Kenzo Takahashi) is the best DIY post I’ve seen so far, and the key inspiration for this post.