The Bayes Classifier, an Example with Gaussian Class Conditionals

By David Dobor (draft)

Introduction

This is part 2 of a 5-part tutorial on Bayesian classification. It can be read independently of the other parts, but requires familiarity with basic probability and statistics, especially with the Bayes rule.

The classification problem in general, briefly repeated here for convenience, is as follows. We are given $N$ training objects $\mathbf{X}_1, \dots, \mathbf{X}_N$, each a feature vector of dimension $D$. Each of the $D$ features is either a discrete or a continuous random variable. The objects carry labels $y \in \{1, 2, ..., C\}$ indicating to which of the $C$ classes each object belongs. Our job is to predict the class $y$ of an unseen object $\mathbf{X}_{new}$.

In the following example, the features are assumed to be drawn from Gaussian (normal) distributions. (This differs from the email classification example discussed previously, in which the features were discrete, e.g. the number of times a certain word occurred in a message.)

The rest of this page:

  • Describes the problem
  • Shows how to compute the posterior class distribution of a new observation using the Bayes rule (without the naive Bayes assumption first)
  • Shows how to predict the class label of a new observation
  • Demonstrates how to plot the decision contours
  • Exposes possible flaws of the Bayes classification model
  • Introduces the Naive Bayes simplification and re-computes the results

The MATLAB code that generated all the figures that follow can be found here.

The Problem Setup

We use artificially generated data to illustrate Bayesian classification. All of the 90 observations in this data set have been labeled as belonging to one of the 3 different classes. Each class contains 30 observations. The labels are $y \in \{1, 2, 3\}$. The red circles denote class 1, the green diamonds class 2, and the blue squares class 3. The observations themselves (I'll call them data points sometimes) consist of just 2 features and are represented as points in a plane (in $R^2$). We denote feature vectors by $\mathbf{X} = (X_1, X_2)$. Each feature vector is drawn from a bivariate normal distribution. Given class $y$, the observations are assumed to be drawn from the same bivariate normal distribution with parameters $\mu_y$ and $\Sigma_y$: $P(\mathbf{X} | Y = y) = \mathcal{N}(\mu_y, \Sigma_y)$.
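
For concreteness, here is roughly how such a data set could be generated (the means and covariances below are made up for illustration only; they are not the parameters actually used by gen_data.m, which I keep secret later on):

    % Generate 30 points per class from three bivariate normals
    % NOTE: these parameters are hypothetical, not the ones in gen_data.m
    mu = [1 1; 4 7; 7 3];                      % one made-up mean per class (rows)
    S(:,:,1) = [1 0; 0 1];
    S(:,:,2) = [1 0.2; 0.2 2];
    S(:,:,3) = [2 1.5; 1.5 2];                 % strong correlation between the two features
    X = []; y = [];
    for c = 1:3
      X = [X; mvnrnd(mu(c,:), S(:,:,c), 30)];  % mvnrnd requires the Statistics Toolbox
      y = [y; c*ones(30,1)];
    end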

The plot on the right superimposes iso-probability lines on the data (i.e. the probability density takes the same value at every point along any contour line). Since the data are generated from bivariate normal distributions, these lines are ellipses. (Incidentally, notice that class 3 has relatively high covariance between its features. This is evident by the iso-probability ellipses not being aligned with the coordinate axes; if there were zero correlation between the features, the axes of the ellipses would be parallel to the coordinate axes. We'll return to this point at the end of this tutorial.)

The Goal: Assign a label to a new observation. For example, given a new feature vector, say $\mathbf{X}_{new} = [X_1 = x_1, X_2 = x_2]^T =[4, 7]^T$, determine which of the 3 class labels to assign to it. We'll discuss the decision rule we'll use in order to make this assignment in a minute.

Recall the Bayes formula: $$ P(Y | X_1, X_2) = \frac{P(Y) \cdot P(X_1, X_2 | Y)} {P(X_1, X_2)} $$ Also recall that the terms in this formula are usually referred to as follows: $$posterior=\frac{prior \cdot likelihood} {evidence} $$

Once we compute this posterior distribution of $Y$ for a new data point, the decision rule as to which label to assign to the point is simple: assign to this data point the label with the highest posterior probability. We will see this in a concrete example below.
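
In symbols, the decision rule is $$\hat{y} = \arg\max_{y \in \{1, ..., C\}} P(Y = y | X_1, X_2) = \arg\max_{y} P(Y = y) \cdot P(X_1, X_2 | Y = y),$$ where the second equality holds because the evidence $P(X_1, X_2)$ is the same for every class and so does not affect the maximization.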

Let's begin by computing all the ingredients for the Bayes formula:

The Terms in the Bayes Formula

Prior: With 30 data points belonging to each class and a total of 90 points, we use $P(Y = y) = N_y/N = 30/90 = 1/3$ for the prior. This represents our prior belief about which class a new data point might belong to, before we get to observe it.
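
In code, the priors can be computed directly from the label vector (a small sketch, using the same y variable that is loaded in Step 0 below):

    % Prior for each class: the fraction of training points carrying that label
    labels = unique(y);
    for c = 1:length(labels)
      priors(c) = sum(y == labels(c)) / length(y);  % here each prior is 30/90 = 1/3
    end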

Likelihood: I've generated these data points from 3 different Gaussian distributions, and I know what the parameters $(\mu_y, \Sigma_y)$ are for each of the three $y$s. In most real-world applications, however, the parameters are not known and need to be estimated from the data. So I'm going to keep my parameters a secret and we'll estimate them from the data. (If you'd really like to know, gen_data.m generated the data and thus holds the key to the secret.)

I'll use Maximum Likelihood Estimates (MLE) for the parameters. (We'll consider alternatives to this later.)

The MLEs for the Gaussian distribution are derived in every statistics textbook, so of course I'll skip that. (This is usually done by differentiating the log-likelihood with respect to each parameter, setting the derivative to zero, solving, and checking that the solution is indeed a maximum.) The result is this: $$ \mu_y = \frac{1}{N_y} \sum_{n=1}^{N_y} x_n $$ $$ \Sigma_y = \frac{1}{N_y} \sum_{n=1}^{N_y} (x_n - \mu_y) \cdot (x_n - \mu_y)^T $$

The summations here are only over the $N_y$ data points that came from class $y$ (here, $N_y = 30$ for every class).

Step 0: Estimate the Parameters

As just mentioned, the parameters of the three normal distributions must be estimated. This is itself a machine-learning task, but here it's not complicated. The following script computes the MLEs $(\mu_y, \Sigma_y)$ for each class:

    load data_bayes3  % load the data generated by gen_data.m (features X, labels y)
    labels = unique(y);
    for c = 1:length(labels)
      pos = find(y==labels(c));
      class_mean(c,:) = mean(X(pos,:));     % class mean, a 1x2 row vector
      class_covar(:,:,c) = cov(X(pos,:),1); % 2x2 MLE covariance; cov(.,1) divides by N_y, not N_y-1
    end
  

Step 1: Write a Function that Computes Normal Density

Here we just implement the familiar bivariate normal density formula: $$p(\mathbf{x}|\mu,\Sigma) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right\}$$

The following MATLAB function takes the parameters of a normal distribution together with the point(s) at which to evaluate the density and returns the density at those points. (Multiple points can be passed in as rows of X.)

    function prob = density_norm(X, mu, cov_mat)
    % Evaluates the normal probability density with parameters mu and cov_mat.
    % Rows of X are the points at which the density is evaluated.

    [n,d] = size(X);
    % center the data points around mu
    X = X - ones(n,1)*mu;
    % normalizing constant and the quadratic form in the exponent
    const = 1/((2*pi)^(d/2)*sqrt(det(cov_mat)));
    arg = diag(X*inv(cov_mat)*X');
    prob = const*exp((-.5)*arg);
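
If you have the Statistics Toolbox, a quick way to sanity-check this function is to compare it against the built-in mvnpdf (this check is my suggestion, not part of the code linked above):

    % Compare density_norm against the built-in mvnpdf (Statistics Toolbox)
    mu = [0 0]; S = [2 0.5; 0.5 1];
    pts = [1 1; -1 2];
    density_norm(pts, mu, S)   % should match ...
    mvnpdf(pts, mu, S)         % ... this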

Step 2: Predict

Now we have everything we need to do the prediction, i.e. assign a label to an unseen data point. Suppose that the new point that we need to classify is $[4, 7]^T$. We will compute the posterior probability that this point belongs to class $y$.

First, compute the class-conditional densities by passing this point to the function from Step 1. Do this for each class:

    x = [4 7];
    probs=[];
    for c = 1:3
      probs(c) = density_norm(x, class_mean(c,:), class_covar(:,:,c));
    end

    posteriors = (1/3)*probs               % prior times likelihood for each class (unnormalized posteriors)
    max(posteriors)                        % largest posterior probability
    find(posteriors == max(posteriors))    % index of the largest posterior, i.e. the predicted class
  

Multiplying each of the three class-conditional densities by its prior, 1/3, gives a quantity proportional to the desired posterior probability. (The evidence term $P(X_1, X_2)$ in the denominator is the same for all three classes, so it does not affect which class wins.)
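
If you want posteriors that actually sum to 1, divide by their sum, which is exactly the evidence term. Continuing the code above:

    posteriors = posteriors / sum(posteriors)  % now the three entries sum to 1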

The entries in the following table show the results:

$Y$    $P(\mathbf{X} = [4, 7]^T | Y = y)$    $P(Y = y)$       $P(\mathbf{X} = [4, 7]^T | Y = y) \times P(Y = y)$
1      0.0245                                $\frac{1}{3}$    0.0082
2      0.0695                                $\frac{1}{3}$    0.0232
3      0.0000                                $\frac{1}{3}$    0.0000

Finally, to classify the point, select the largest entry in the last column and choose the corresponding label. Notice that this new point is incredibly unlikely to belong to class 3 (the exact value returned by MATLAB is 1.9686e-21) and almost 3 times as likely to belong to class 2 as to class 1. Therefore:

Answer: assign label $\mathbf{2}$ to point $\mathbf{X} = [4, 7]^T$

Plot the Classification Contours

We can plot the contours for the classification probabilities over a grid of many points in $R^2$ (MATLAB's meshgrid function is very helpful here; see the code). For each class, our model will assign a high probability to the region in the plane where points of that class congregate and low probability to regions far away from the points.
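
Here is a rough sketch of how such a grid evaluation might look, reusing class_mean, class_covar and density_norm from the steps above (the grid limits here are arbitrary, and the linked code may differ in its details):

    % Evaluate the posterior of each class over a grid and draw its contours
    [Xg, Yg] = meshgrid(linspace(-2, 10, 100), linspace(-2, 12, 100));
    grid_pts = [Xg(:) Yg(:)];                 % each grid point as a row
    post = zeros(size(grid_pts,1), 3);
    for c = 1:3
      post(:,c) = (1/3)*density_norm(grid_pts, class_mean(c,:), class_covar(:,:,c));
    end
    post = post ./ repmat(sum(post,2), 1, 3); % normalize across the three classes
    for c = 1:3
      figure; contour(Xg, Yg, reshape(post(:,c), size(Xg)));
      title(sprintf('P(Y = %d | X) contours', c));
    end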

This is what the probability contours look like for each of the classes:

Why Bayes Classification can be Bad for You

Really, it's not that bad: Bayesian classification often works very well in practice, even with the naive assumption (see below), as we'll see later.

However, notice the strange effect in the top left figure above. The area in the middle lower part of the plot is assigned as high a probability as the area closer to the cluster of class 1 points. This is explained by the relative 'steepness' of the class-conditional contours for class 3, compared with the contours for the other classes. The class 3 density decays so fast that the density for class 1 is actually higher to the right of class 3. This is kind of unfortunate, because it doesn't seem sensible to label points in the middle lower region, which are so far away from class 1, as belonging to class 1. It would be much better if the probabilities became less certain as we moved farther away from the data.

Naive Bayes Simplification

In the example just considered, we were able to capture the dependencies between the two features: our covariance matrices had non-zero off-diagonal entries. For example, the data points in class 3 show a strong dependency between the features, which is reflected in the relatively large off-diagonal entries of the covariance matrix for that class. (This dependency is also visible in the plot: notice that the principal axes of the ellipses, which represent the iso-probability levels, are not parallel to the coordinate axes.) In general, fitting a 2-dimensional normal involves estimating 5 parameters: 2 for the mean $\mu$ and 3 for the distinct entries of the covariance matrix (since $\Sigma$ is symmetric, it has 3 possibly distinct entries, not 4). This is perfectly fine for our small example of just 90 points. However, as the number of dimensions grows, a $D$-dimensional normal requires fitting $D + D + \frac{D(D-1)} {2}$ parameters ($D$ for the mean, $D$ for the variances on the diagonal, and $\frac{D(D-1)}{2}$ for the off-diagonal covariances). So if we only had 30 data points and 10 features, we wouldn't be able to reliably fit the resulting 65 parameters.

This problem is overcome in practice by making the naive Bayes assumption: given the class, the features are assumed to be independent, and therefore the $D$-dimensional class-conditional distribution factorizes into a product of $D$ univariate distributions. A univariate normal requires just 2 parameters, $\mu$ and $\sigma^2$, so overall we have only $2D$ parameters per class to fit (20 instead of 65 in the 10-feature example above).

But this means that we are also restricting the shapes of the class-conditional distributions: their principal axes now have to be aligned with the coordinate axes. This is the price exacted by the decrease in the number of parameters: we give up model flexibility and can no longer capture within-class dependencies.

Anyway, here's the likelihood function with the naive Bayes assumption: $$P(X_1, X_2, ..., X_D | Y = y) = \prod_{d = 1}^D P(X_d | Y = y)$$
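
In code, one way to get the naive Bayes fit from the Step 0 script is to keep only the per-feature variances, i.e. to zero out the off-diagonal entries of each covariance matrix (a small sketch; the _nb variable names are mine):

    % Naive Bayes fit: diagonal covariance, i.e. per-feature variances only
    for c = 1:length(labels)
      pos = find(y==labels(c));
      class_mean_nb(c,:)    = mean(X(pos,:));
      class_covar_nb(:,:,c) = diag(var(X(pos,:), 1));  % var(.,1) is the MLE (divides by N_y)
    end

The prediction and plotting code stays exactly the same; just pass class_mean_nb and class_covar_nb to density_norm, since a bivariate normal with a diagonal covariance is precisely the product of two univariate normals.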

And here's everything re-plotted with the naive Bayes assumption. Notice how the principal axes of the ellipses are now parallel to the coordinate axes.
