### Goal

While learning about Naive Bayes classifiers, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the sklearn.naive_bayes.MultinomialNB estimator which produces identical results on a sample dataset.

While I generally find scikit-learn documentation very helpful, its source code is a bit trickier to grok, since it optimizes for efficiency—of both computational and maintenance—across a wide family of models. Our estimator of interest `MultinomialNB`

inherits from `_BaseDiscreteNB`

which itself inherits from `_BaseNB`

which has multiple inheritence from `BaseEstimator`

and `ClassifierMaixin`

.

### What is Naive Bayes?

Naive Bayes is a simple generative (probabilistic) classification model based on Bayes’ theorem. The typical example use-case for this algorithm is classifying email messages as spam or “ham” (non-spam) based on the previously observed frequency of words which have appeared in known spam or ham emails in the past.

$$ P(\text{ spam }|\text{ text }) = \frac{P(\text{ text }|\text{ spam }) \, P(\text{ spam })}{P(\text{ text })} $$

Following typical ML notation, we use $y$ to denote the “class” of our message, where $y=1$ for spam messages and $y=0$ for non-spam messages. We will represent our text data as an array $x$ of length $j$, with each value representing the number of times the $j^{th}$ word appears in a particular email. The value of $j$ represents the collective number of words seen across all training data. Our model then becomes

$$ P(y|x) = \frac{P(x|y) \, P(y)}{P(x)} $$

### Why would we want to use something “naive”?

This classifier is “naive” in the sense that predictive features are assumed to be conditionally independent given their class. Naturally this is gross simplification of reality—if an email contains the word “sports” we would expect it to also be more likely to contain related words like “bet” or “odds”. But this assumption allows us to calculate the joint probability simply as the product of marginal likelihoods, without worrying about the correlation structure between different words. This makes the model *much* easier to fit.

All models are wrong, some are useful. — George Box, statistician

If our model were not “naive”, we would have to calculate the joint likelihood function as a messy product of $j$ separate conditional likelihood functions. With so many parameters to estimate, we would need a large quantity of training data to avoid overfitting.

$$ P(\mathbf{x} \vert y) = P(x_1 \vert y) P(x_2 \vert x_1, y) P(x_3 \vert x_1, x_2, y) \ldots P(x_j \vert x_1, x_2, \ldots, x_{j-1}, y) $$

But if we assume conditional independence, this calculation becomes much more simple. We can now just multiply together the likelihoods of each word $x_j$ conditional only on their class $y$ (whether or not they are spam) to get the joint likelihood for the entire message.

$$ P(\mathbf{x}|y=c) = \prod_{j=1}^J P(x_j | y=c) $$

### Our toy dataset

The function below generates a test dataset based on Chapter 3.5, Exercise 3.22 from *Machine Learning: A Probabilistic Perspective*.

```
import numpy as np
import pandas as pd
from typing import Callable
from sklearn.feature_extraction.text import CountVectorizer
def make_spam_dataset(show_X=True) -> (pd.DataFrame, np.ndarray, Callable):
""" Create a small toy dataset for MultinomialNB implementation
Returns:
X: word count matrix
y: indicator of whether or not message is spam
msg_tx_func: a function to transform new test data into word count matrix
"""
vocab = [
'secret', 'offer', 'low', 'price', 'valued', 'customer', 'today',
'dollar', 'million', 'sports', 'is', 'for', 'play', 'healthy', 'pizza'
]
spam = [
'million dollar offer',
'secret offer today',
'secret is secret'
]
not_spam = [
'low price for valued customer',
'play secret sports today',
'sports is healthy',
'low price pizza'
]
all_messages = spam + not_spam
vectorizer = CountVectorizer(vocabulary=vocab)
word_counts = vectorizer.fit_transform(all_messages).toarray()
df = pd.DataFrame(word_counts, columns=vocab)
is_spam = [1] * len(spam) + [0] * len(not_spam)
msg_tx_func = lambda x: vectorizer.transform(x).toarray()
if show_X:
display(df)
return df.to_numpy(), np.array(is_spam), msg_tx_func
X, y, tx_func = make_spam_dataset()
```

secret | offer | low | price | valued | customer | today | dollar | million | sports | is | for | play | healthy | pizza | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |

1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |

3 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |

5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |

6 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

### Our model

Our model needs to resolve the three component “ingredients” necessary to make predictions on future data points. The table below describes each component, and shows the mapping between math notation above and variable names in our code below.

Variable | Math | Decription |
---|---|---|

`prior` |
$ P(y) $ | Our prior belief in the probability of any randomly selected message belonging to a particular class (spam or not-spam). |

`lk_word` |
$P(x_i \vert y)$ | The likelihood of each word, conditional on message class. We are implicitly using the multinomial distribution here. Intuitively, the word conditional likelihoods are just the normalized frequency within each message class. |

`lk_message` |
$ P(\mathbf{x} \vert y) $ | The likelihood of an entire message (combination of words present) conditional on the message belonging to a particular class. |

`normalize_term` |
$ P(\mathbf{x}) $ | The likelihood of an entire message across all possible classes. |

We’ve got a few additional attributes as well:

- The
`alpha`

attribute will be added to each word count, to avoid us having zero probabilities for words not seen in our training sample. - The
`is_fitted_`

attribute is a scikit-learn convention to ensure we don’t accidentally try to make predictions on a model that has not yet been fitted.

```
class NaiveBayes(object):
""" DIY binary Naive Bayes classifier based on categorical data """
def __init__(self, alpha=1.0):
""" """
self.prior = None
self.word_counts = None
self.lk_word = None
self.alpha = alpha
self.is_fitted_ = False
```

### Fitting the model

Let’s implement our algorithm to handle an arbitrary number of classes, even though our toy example only has two (spam/not-spam). Our fit function needs to do two things:

#### Calculate prior

This one is easy. We split our input array into two sub-arrays in `X_by_class`

, then count the number of elements in each class to arrive at our `prior`

.

#### Calculate likelihoods

We set `word_counts`

by looping over each of the sub-arrays in `X_by_class`

and taking the column sums within each sub-array. Note that the numpy notation is a bit unintutive here, `.sum(axis=0)`

means that we collapse the $0^{th}$ axis (rows) leaving only columns. This gives us an array of shape `(c, j)`

which counts the number of times the $j^{th}$ word appears across all emails of class $c$.

Finally, our likelihood function `lk_word`

is simply these word counts divided by the total number of times all words appear in each class. We achieve this by taking row sum using `.sum(axis=0)`

.

```
from sklearn.utils.validation import check_X_y, check_array
def fit(self, X: np.ndarray, y: np.ndarray):
""" Fit training data for Naive Bayes classifier """
# not strictly necessary, but this ensures we have clean input
X, y = check_X_y(X, y)
n = X.shape[0]
X_by_class = np.array([X[y == c] for c in np.unique(y)])
self.prior = np.array([len(X_class) / n for X_class in X_by_class])
self.word_counts = np.array([sub_arr.sum(axis=0) for sub_arr in X_by_class]) + self.alpha
self.lk_word = self.word_counts / self.word_counts.sum(axis=1).reshape(-1, 1)
self.is_fitted_ = True
return self
```

### Predicting new emails

We can now make predictions, either on the same emails we used to train the model, or on entirely new emails never before seen by the model. We’ll do this by first predicting probabilities for each class, then making our final prediction by taking the class with the highest probability.

Recall that our conditional likelihood for an entire message $\bf{x}$ is calculated as the product of conditional likelihoods for each word $x_j$ present in the message. Note here that if a word appears twice, its `lk_word`

gets factored twice into our joint likelihood.

$$ P(\mathbf{x}|y=c) = \prod_{j=1}^J P(x_j | y=c) $$

So we loop over each message (row) in our array $X$ and calculate individual conditional likelihoods, then multiply them all together and multiply by our clas priors. At the very end, we divide everything by $P(x)$ so that we have valid probabilities.

#### What if about previously unseen words?

Suppose we have a word which has never appeared in training messages labelled as spam. Its conditional likelihood would be zero, which would take our entire joint likelihood to zero as well. This is precisely why we added `alpha`

while calculating word counts in the fit function, so that this situation does not occur.

#### Probabilistic prediction

```
def predict_proba(self, X: np.ndarray) -> np.ndarray:
""" Predict probability of class membership """
assert self.is_fitted_, 'Model must be fit before predicting'
X = check_array(X)
# loop over each observation to calculate conditional probabilities
class_numerators = np.zeros(shape=(X.shape[0], self.prior.shape[0]))
for i, x in enumerate(X):
word_exists = x.astype(bool)
lk_words_present = self.lk_word[:, word_exists] ** x[word_exists]
lk_message = (lk_words_present).prod(axis=1)
class_numerators[i] = lk_message * self.prior
normalize_term = class_numerators.sum(axis=1).reshape(-1, 1)
conditional_probas = class_numerators / normalize_term
assert (conditional_probas.sum(axis=1) - 1 < 0.001).all(), 'Rows should sum to 1'
return conditional_probas
```

#### Binary prediction

Our `predict_proba`

function will return probabilities for each class, but in our toy example we really just want a binary outcome: is the message spam or not? Once we’ve done the work to get the class probabilities, it is easy to find the index of the highest probability class using `np.argmax(axis=1)`

.

```
def predict(self, X: np.ndarray) -> np.ndarray:
""" Predict class with highest probability """
return self.predict_proba(X).argmax(axis=1)
```

### Putting it all together

We defined the above logic as standalone functions, so now we need to assign each of them to the relevant method of our `NaiveBayes`

class. This would not be necessary if we defined everything in a single module or notebook cell.

```
# attach functions defined above to our classifier
# this is not needed if you define the entire class in a single cell
NaiveBayes.fit = fit
NaiveBayes.predict_proba = predict_proba
NaiveBayes.predict = predict
preds = NaiveBayes().fit(X, y).predict(X)
print(f'Accuracy: {(preds == y).mean()}')
```

```
Accuracy: 1.0
```

You can find a gist with the code all together here.

### Comparing with sklearn

The function below fits our model alongside `MultinomialNB`

and asserts that we have similar values for our priors, likelihoods, and predictions.

```
from sklearn.naive_bayes import MultinomialNB
def test_against_benchmark():
""" Check that DIY model matches outputs from scikit-learn estimator """
X, y, _ = make_spam_dataset(show_X=False)
bench = MultinomialNB().fit(X, y)
model = NaiveBayes(alpha=1).fit(X, y)
assert (model.prior / np.exp(bench.class_log_prior_) - 1 < 0.001).all()
print('[✔︎] Identical prior probabilities')
assert (model.lk_word / np.exp(bench.feature_log_prob_) - 1 < 0.001).all()
print('[✔︎] Identical word likelihoods')
assert (model.predict_proba(X) / bench.predict_proba(X) - 1 < 0.001).all()
print('[✔︎] Identical predictions')
test_against_benchmark()
```

```
[✔︎] Identical prior probabilities
[✔︎] Identical word likelihoods
[✔︎] Identical predictions
```

### Further reading

If you want to learn more about Naive Bayes, here are a few of the resources I found most helpful.

- Notes on Naive Bayes Classifiers for Spam Filtering (Jonathan Lee, University of Washington) is a good entry point, as it provides a relatively succinct description of the typical spam detection example for Naive Bayes.
- The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm (Michael Collins, Columbia) provides a more comprehensive walkthrough of the math behind NB, including derivation of maximum likleihood estimates.
- sklearn.naive_bayes.MultinomialNB (scikit-learn docs) is the example implementation which I tried to reproduce.
- Naive Bayes from Scratch in Python (Kenzo Takahashi) is the best DIY post I’ve seen so far, and the key inspiration for this post.