While learning about Naive Bayes classifiers, I decided to implement the algorithm from scratch to help solidify my understanding of the math. So the goal of this notebook is to implement a simplified and easily interpretable version of the sklearn.naive_bayes.MultinomialNB estimator which produces identical results on a sample dataset.
While I generally find the scikit-learn documentation very helpful, its source code is a bit trickier to grok, since it optimizes for both computational and maintenance efficiency across a wide family of models. Our estimator of interest, MultinomialNB, inherits from _BaseDiscreteNB, which itself inherits from _BaseNB, which in turn has multiple inheritance from BaseEstimator and ClassifierMixin.
What is Naive Bayes?
Naive Bayes is a simple generative (probabilistic) classification model based on Bayes’ theorem. The typical example use-case for this algorithm is classifying email messages as spam or “ham” (non-spam) based on the previously observed frequency of words which have appeared in known spam or ham emails in the past.
$$ P(\text{spam} \mid \text{text}) = \frac{P(\text{text} \mid \text{spam}) \, P(\text{spam})}{P(\text{text})} $$
Following typical ML notation, we use $y$ to denote the “class” of our message, where $y=1$ for spam messages and $y=0$ for non-spam messages. We represent our text data as an array $\mathbf{x}$ of length $J$, where the $j^{th}$ entry counts the number of times the $j^{th}$ vocabulary word appears in a particular email, and $J$ is the total number of distinct words seen across all training data. Our model then becomes
$$ P(y|x) = \frac{P(x|y) \, P(y)}{P(x)} $$
Why would we want to use something “naive”?
This classifier is “naive” in the sense that predictive features are assumed to be conditionally independent given their class. Naturally this is a gross simplification of reality: if an email contains the word “sports”, we would expect it to also be more likely to contain related words like “bet” or “odds”. But this assumption allows us to calculate the joint probability simply as the product of marginal likelihoods, without worrying about the correlation structure between different words. This makes the model much easier to fit.
All models are wrong, some are useful. — George Box, statistician
If our model were not “naive”, we would have to calculate the joint likelihood function as a messy product of $J$ separate conditional likelihood functions. With so many parameters to estimate, we would need a large quantity of training data to avoid overfitting.
$$ P(\mathbf{x} \vert y) = P(x_1 \vert y) P(x_2 \vert x_1, y) P(x_3 \vert x_1, x_2, y) \ldots P(x_j \vert x_1, x_2, \ldots, x_{j-1}, y) $$
But if we assume conditional independence, this calculation becomes much simpler. We can now just multiply together the likelihoods of each word $x_j$, conditional only on its class $y$ (whether or not the message is spam), to get the joint likelihood for the entire message.
$$ P(\mathbf{x}|y=c) = \prod_{j=1}^J P(x_j | y=c) $$
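For example, for a message containing only the words “secret” and “offer” (once each), the spam-conditional likelihood factors as
$$ P(\mathbf{x} \mid y=1) = P(\text{secret} \mid y=1) \, P(\text{offer} \mid y=1) $$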
Our toy dataset
The function below generates a test dataset based on Chapter 3.5, Exercise 3.22 from Machine Learning: A Probabilistic Perspective.
import numpy as np
import pandas as pd
from typing import Callable, Tuple
from sklearn.feature_extraction.text import CountVectorizer
def make_spam_dataset(show_X: bool = True) -> Tuple[np.ndarray, np.ndarray, Callable]:
""" Create a small toy dataset for MultinomialNB implementation
Returns:
X: word count matrix
y: indicator of whether or not message is spam
msg_tx_func: a function to transform new test data into word count matrix
"""
vocab = [
'secret', 'offer', 'low', 'price', 'valued', 'customer', 'today',
'dollar', 'million', 'sports', 'is', 'for', 'play', 'healthy', 'pizza'
]
spam = [
'million dollar offer',
'secret offer today',
'secret is secret'
]
not_spam = [
'low price for valued customer',
'play secret sports today',
'sports is healthy',
'low price pizza'
]
all_messages = spam + not_spam
vectorizer = CountVectorizer(vocabulary=vocab)
word_counts = vectorizer.fit_transform(all_messages).toarray()
df = pd.DataFrame(word_counts, columns=vocab)
is_spam = [1] * len(spam) + [0] * len(not_spam)
msg_tx_func = lambda x: vectorizer.transform(x).toarray()
if show_X:
display(df)
return df.to_numpy(), np.array(is_spam), msg_tx_func
X, y, tx_func = make_spam_dataset()
| | secret | offer | low | price | valued | customer | today | dollar | million | sports | is | for | play | healthy | pizza |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
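Before moving on, note that the returned msg_tx_func (captured above as tx_func) converts new messages into the same word count representation, which we will need later for scoring unseen emails. A minimal sketch, using a made-up message:

# transform a new message into a count vector over the same 15-word vocabulary
print(tx_func(['secret sports today']))  # expect 1s in the secret, today and sports columns, 0s elsewhere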
Our model
Our model needs to compute the component “ingredients” of Bayes’ theorem in order to make predictions on future data points. The table below describes each component, and shows the mapping between the math notation above and the variable names in our code below.

| Variable | Math | Description |
|---|---|---|
| prior | $P(y)$ | Our prior belief in the probability of any randomly selected message belonging to a particular class (spam or not-spam). |
| lk_word | $P(x_j \vert y)$ | The likelihood of each word, conditional on message class. We are implicitly using the multinomial distribution here. Intuitively, the word conditional likelihoods are just the normalized frequencies within each message class. |
| lk_message | $P(\mathbf{x} \vert y)$ | The likelihood of an entire message (combination of words present) conditional on the message belonging to a particular class. |
| normalize_term | $P(\mathbf{x})$ | The likelihood of an entire message across all possible classes. |
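The normalizing term follows from the law of total probability, summing the class-conditional likelihoods weighted by their priors:
$$ P(\mathbf{x}) = \sum_{c} P(\mathbf{x} \mid y=c) \, P(y=c) $$
This is exactly how normalize_term is computed in the prediction code below.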
We’ve got a few additional attributes as well:
- The alpha attribute will be added to each word count, to avoid us having zero probabilities for words not seen in our training sample.
- The is_fitted_ attribute is a scikit-learn convention to ensure we don’t accidentally try to make predictions on a model that has not yet been fitted.
class NaiveBayes(object):
    """ DIY binary Naive Bayes classifier based on word count data """
    def __init__(self, alpha=1.0):
        """ Store the smoothing parameter and initialize empty model attributes """
        self.prior = None
        self.word_counts = None
        self.lk_word = None
        self.alpha = alpha
        self.is_fitted_ = False
Fitting the model
Let’s implement our algorithm to handle an arbitrary number of classes, even though our toy example only has two (spam/not-spam). Our fit function needs to do two things:
Calculate prior
This one is easy. We split our input array into one sub-array per class in X_by_class, then count the number of elements in each class to arrive at our prior.
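For our toy dataset of 3 spam and 4 non-spam messages, this works out to
$$ P(y=1) = \frac{3}{7} \approx 0.43, \qquad P(y=0) = \frac{4}{7} \approx 0.57 $$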
Calculate likelihoods
We set word_counts by looping over each of the sub-arrays in X_by_class and taking the column sums within each sub-array. Note that the numpy notation is a bit unintuitive here: .sum(axis=0) means that we collapse the $0^{th}$ axis (rows), leaving only columns. This gives us an array of shape (c, j), which counts the number of times the $j^{th}$ word appears across all emails of class $c$.
Finally, our likelihood function lk_word is simply these word counts divided by the total number of times all words appear in each class. We achieve this by taking row sums with .sum(axis=1).
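As a concrete example, “secret” appears 3 times among the 9 total words in the spam messages and once among the 15 total words in the non-spam messages, so with $\alpha = 1$ smoothing over the 15-word vocabulary we expect
$$ P(\text{secret} \mid y=1) = \frac{3 + 1}{9 + 15} = \frac{1}{6}, \qquad P(\text{secret} \mid y=0) = \frac{1 + 1}{15 + 15} = \frac{1}{15} $$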
from sklearn.utils.validation import check_X_y, check_array
def fit(self, X: np.ndarray, y: np.ndarray):
    """ Fit training data for Naive Bayes classifier """
    # not strictly necessary, but this ensures we have clean input
    X, y = check_X_y(X, y)
    n = X.shape[0]
    # group rows of X by class; a plain list avoids building a ragged numpy array
    X_by_class = [X[y == c] for c in np.unique(y)]
    # prior = fraction of training messages belonging to each class
    self.prior = np.array([len(X_class) / n for X_class in X_by_class])
    # per-class word counts, with alpha added for additive (Laplace) smoothing
    self.word_counts = np.array([sub_arr.sum(axis=0) for sub_arr in X_by_class]) + self.alpha
    # normalize each row so the word likelihoods within a class sum to 1
    self.lk_word = self.word_counts / self.word_counts.sum(axis=1).reshape(-1, 1)
    self.is_fitted_ = True
    return self
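As a quick sanity check on the fitted values, we can call the standalone fit function directly (it has not yet been attached to the class at this point), passing a fresh NaiveBayes instance as self. The expected numbers in the comments come from the worked examples above.

model = fit(NaiveBayes(alpha=1), X, y)
print(model.prior)          # [4/7, 3/7] ≈ [0.571, 0.429]; row order follows np.unique(y) = [0, 1]
print(model.lk_word[1, 0])  # P(secret | spam) = 4/24 ≈ 0.167
print(model.lk_word[0, 0])  # P(secret | not spam) = 2/30 ≈ 0.067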
Predicting new emails
We can now make predictions, either on the same emails we used to train the model, or on entirely new emails never before seen by the model. We’ll do this by first predicting probabilities for each class, then making our final prediction by taking the class with the highest probability.
Recall that our conditional likelihood for an entire message $\mathbf{x}$ is calculated as the product of conditional likelihoods for each word $x_j$ present in the message. Note here that if a word appears twice, its lk_word gets factored twice into our joint likelihood.
$$ P(\mathbf{x}|y=c) = \prod_{j=1}^J P(x_j | y=c) $$
So we loop over each message (row) in our array $X$ and calculate individual conditional likelihoods, then multiply them all together and multiply by our class priors. At the very end, we divide everything by $P(\mathbf{x})$ so that we have valid probabilities.
What about previously unseen words?
Suppose we have a word which has never appeared in training messages labelled as spam. Its conditional likelihood would be zero, which would take our entire joint likelihood to zero as well. This is precisely why we added alpha while calculating word counts in the fit function, so that this situation does not occur.
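For example, “pizza” never appears in a spam training message, so without smoothing any message containing it would have a spam likelihood of exactly zero, no matter how spammy its other words are. With $\alpha = 1$ it instead gets a small but non-zero likelihood:
$$ P(\text{pizza} \mid y=1) = \frac{0 + 1}{9 + 15} = \frac{1}{24} $$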
Probabilistic prediction
def predict_proba(self, X: np.ndarray) -> np.ndarray:
    """ Predict probability of class membership """
    assert self.is_fitted_, 'Model must be fit before predicting'
    X = check_array(X)
    # loop over each observation to calculate conditional probabilities
    class_numerators = np.zeros(shape=(X.shape[0], self.prior.shape[0]))
    for i, x in enumerate(X):
        word_exists = x.astype(bool)
        # likelihood of each word present in the message, raised to its count
        lk_words_present = self.lk_word[:, word_exists] ** x[word_exists]
        # product across words gives the per-class likelihood of the whole message
        lk_message = lk_words_present.prod(axis=1)
        class_numerators[i] = lk_message * self.prior
    # divide by P(x) so that each row is a valid probability distribution
    normalize_term = class_numerators.sum(axis=1).reshape(-1, 1)
    conditional_probas = class_numerators / normalize_term
    assert (np.abs(conditional_probas.sum(axis=1) - 1) < 0.001).all(), 'Rows should sum to 1'
    return conditional_probas
Binary prediction
Our predict_proba function will return probabilities for each class, but in our toy example we really just want a binary outcome: is the message spam or not? Once we’ve done the work to get the class probabilities, it is easy to find the index of the highest-probability class using .argmax(axis=1).
def predict(self, X: np.ndarray) -> np.ndarray:
""" Predict class with highest probability """
return self.predict_proba(X).argmax(axis=1)
Putting it all together
We defined the above logic as standalone functions, so now we need to assign each of them to the relevant method of our NaiveBayes class. This would not be necessary if we defined everything in a single module or notebook cell.
# attach functions defined above to our classifier
# this is not needed if you define the entire class in a single cell
NaiveBayes.fit = fit
NaiveBayes.predict_proba = predict_proba
NaiveBayes.predict = predict
preds = NaiveBayes().fit(X, y).predict(X)
print(f'Accuracy: {(preds == y).mean()}')
Accuracy: 1.0
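We can also score a message the model has never seen by running it through the tx_func transformer from earlier. The example message below is made up for illustration, but given the fitted likelihoods it should land firmly on the spam side:

model = NaiveBayes().fit(X, y)
new_message = ['million dollar offer today']      # hypothetical unseen email
print(model.predict_proba(tx_func(new_message)))  # spam probability should be well above 0.5
print(model.predict(tx_func(new_message)))        # expected output: [1], i.e. spam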
You can find a gist with the code all together here.
Comparing with sklearn
The function below fits our model alongside MultinomialNB and asserts that we have matching values for our priors, likelihoods, and predictions.
from sklearn.naive_bayes import MultinomialNB
def test_against_benchmark():
    """ Check that DIY model matches outputs from scikit-learn estimator """
    X, y, _ = make_spam_dataset(show_X=False)
    bench = MultinomialNB().fit(X, y)
    model = NaiveBayes(alpha=1).fit(X, y)
    # sklearn stores these quantities on the log scale, so exponentiate before comparing
    assert (np.abs(model.prior / np.exp(bench.class_log_prior_) - 1) < 0.001).all()
    print('[✔︎] Identical prior probabilities')
    assert (np.abs(model.lk_word / np.exp(bench.feature_log_prob_) - 1) < 0.001).all()
    print('[✔︎] Identical word likelihoods')
    assert (np.abs(model.predict_proba(X) / bench.predict_proba(X) - 1) < 0.001).all()
    print('[✔︎] Identical predictions')
test_against_benchmark()
[✔︎] Identical prior probabilities
[✔︎] Identical word likelihoods
[✔︎] Identical predictions
Further reading
If you want to learn more about Naive Bayes, here are a few of the resources I found most helpful.
- Notes on Naive Bayes Classifiers for Spam Filtering (Jonathan Lee, University of Washington) is a good entry point, as it provides a relatively succinct description of the typical spam detection example for Naive Bayes.
- The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm (Michael Collins, Columbia) provides a more comprehensive walkthrough of the math behind NB, including derivation of maximum likelihood estimates.
- sklearn.naive_bayes.MultinomialNB (scikit-learn docs) is the example implementation which I tried to reproduce.
- Naive Bayes from Scratch in Python (Kenzo Takahashi) is the best DIY post I’ve seen so far, and the key inspiration for this post.