The Bayes Classifier, an Example with Gaussian Class Conditionals
By David Dobor (draft)
Introduction
This is part 2 of a 5-part tutorial on Bayesian
classification. It can be read independently of the other parts,
but it requires familiarity with basic probability and
statistics, especially with the Bayes rule.
The classification problem in general, briefly repeated here for
convenience, is as follows. We are given $N$ training objects
$\mathbf{X}_1, \ldots, \mathbf{X}_N$, each of which is a feature vector of dimension
$D$. Each of the $D$ features is either a discrete or a continuous
random variable. Each object carries a label $y \in \{1, 2, ..., C\}$
that indicates to which of the $C$ classes the object
belongs. Our job is to predict the class $y$ of an unseen
object $\mathbf{X}_{new}$.
In the following example, the features are assumed to be drawn from
Gaussian (normal) distributions. (This differs from the email
classification example discussed previously, in which the features
were discrete, e.g. the number of times a certain word occurred in a
message.)
The rest of this page:
- Describes the problem
- Shows how to compute the posterior class distribution of a new
observation using the Bayes rule (without the naive Bayes assumption
first)
- Shows how to predict the class label of a new observation
- Demonstrates how to plot the decision contours
- Exposes possible flaws of the Bayes classification model
- Introduces the Naive Bayes simplification and re-computes the
results
The MATLAB code that generated all the figures that follow can be
found here.
The Problem Setup
We use artificially generated data to illustrate Bayesian
classification. All of the 90 observations in this data set have
been labeled as belonging to one of the 3 different classes. Each
class contains 30 observations. The labels are $y \in \{1, 2,
3\}$. The red circles denote class 1, the green diamonds class 2,
and the blue squares class 3. The observations themselves (I'll call
them data points sometimes) consist of just 2 features and are
represented as points in a plane (in $R^2$). We denote feature
vectors by $\mathbf{X} = (X_1, X_2)$. Given its class $y$, each
observation is assumed to be drawn from a class-specific bivariate
normal distribution with parameters $\mu_y$ and $\Sigma_y$:
$P(\mathbf{X} | Y = y) = \mathcal{N}(\mu_y, \Sigma_y)$.
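To make the generative model concrete: data of this kind could be sampled with MATLAB's mvnrnd (Statistics Toolbox). The means and covariances below are placeholders I made up purely for illustration; they are not the ones used by gen_data.m, which stay secret.
% Illustration only: made-up parameters, NOT those used by gen_data.m
mu1 = [0 0]; S1 = [1 0; 0 1];
mu2 = [4 6]; S2 = [1.5 0.3; 0.3 1];
mu3 = [8 2]; S3 = [2 1.5; 1.5 2];   % strong feature correlation, like class 3
X = [mvnrnd(mu1, S1, 30); mvnrnd(mu2, S2, 30); mvnrnd(mu3, S3, 30)];
y = [ones(30,1); 2*ones(30,1); 3*ones(30,1)];   % 30 labeled points per class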
[Figure: left, the 90 labeled data points (red circles: class 1, green diamonds: class 2, blue squares: class 3); right, the same data with iso-probability contours superimposed.]
The plot on the right superimposes iso-probability lines on the
data (i.e. the probability density takes the same value at every
point along any contour line). Since the data are generated from
bivariate normal distributions, these lines are
ellipses. (Incidentally, notice that class 3 has relatively high
covariance between its features. This is evident from the fact that the
iso-probability ellipses are not aligned with the coordinate axes;
if there were zero correlation between the features, the axes of the
ellipses would be parallel to the coordinate axes. We'll return to
this point at the end of this tutorial.)
The Goal: Assign a label to a new observation. For
example, given a new feature vector, say $\mathbf{X}_{new} = [X_1 =
x_1, X_2 = x_2]^T =[4, 7]^T$, determine which of the 3 class labels
to assign to it. We'll discuss the decision rule we'll use
in order to make this assignment in a minute.
Recall the Bayes formula:
$$ P(Y | X_1, X_2) = \frac{P(Y) \cdot P(X_1, X_2 | Y)} {P(X_1, X_2)}
$$
Also recall that the
terms in this formula are usually referred to as follows:
$$\text{posterior} = \frac{\text{prior} \cdot \text{likelihood}}{\text{evidence}}$$
Once we compute this posterior distribution of $Y$ for a new data
point, the decision rule as to which label to assign to the
point is simple: assign to this data point the label with the
highest posterior probability. We will see this in a concrete
example below.
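Written out, this is the maximum a posteriori (MAP) decision rule; since the evidence $P(X_1, X_2)$ is the same for every class, it can be dropped when comparing:
$$\hat{y} = \arg\max_{y \in \{1, \ldots, C\}} P(Y = y | X_1, X_2) = \arg\max_{y \in \{1, \ldots, C\}} P(Y = y) \cdot P(X_1, X_2 | Y = y)$$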
Let's begin by computing all the ingredients for the Bayes formula:
The Terms in the Bayes Formula
Prior: With 30 data points belonging to each class and
a total of 90 points, we use $P(Y = y) = N_y/N = 30/90 = 1/3$ for the
prior. This represents our prior belief about which class a new data
point might belong to, before we get to observe it.
Likelihood: I've generated these data points from 3
different Gaussian distributions, and I know what the parameters
$(\mu_y, \Sigma_y)$ are for each of the three $y$s. However, in many
real-world applications the parameters are not known and need to be
estimated from the data. So I'm going to keep my
parameters a secret and we'll estimate them from the data. (If
you'd really like to know, gen_data.m generated the data
and thus holds the key to the secret.)
I'll use Maximum Likelihood Estimates (MLE) for the
parameters. (We'll consider alternatives to this later.)
The MLEs for the Gaussian distribution are derived in every
statistics textbook, so of course I'll skip that. (This is usually
done by differentiating the log-likelihood with respect to each
parameter, setting it to zero, finding the solution, and showing
that it's indeed a maximum.) The result is this:
$$ \mu_y = \frac{1}{N_y} \sum_{n: y_n = y} \mathbf{x}_n $$ $$ \Sigma_y =
\frac{1}{N_y} \sum_{n: y_n = y} (\mathbf{x}_n - \mu_y) \cdot (\mathbf{x}_n - \mu_y)^T $$
Here $N_y$ is the number of training points in class $y$, and the
summations run only over the data points that came from class $y$.
Step 0: Estimate the Parameters
As just mentioned, the parameters of the three normal
distributions must be estimated. This is itself a machine-learning
task, but here it's not complicated. The following script computes
the MLEs $(\mu_y, \Sigma_y)$ for each class:
load data_bayes3   % load the data generated by gen_data.m
labels = unique(y);
for c = 1:length(labels)
    pos = find(y == labels(c));              % indices of the points in class c
    class_mean(c,:) = mean(X(pos,:));        % 1x2 mean vector for class c
    class_covar(:,:,c) = cov(X(pos,:), 1);   % 2x2 covariance; the flag 1 normalizes by N, i.e. the MLE
end
Step 1: Write a Function that Computes Normal Density
Here we just implement the familiar bivariate normal density
formula:
$$p(\mathbf{x}|\mu,\Sigma) =
\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right\}$$
The following MATLAB function takes the parameters of a normal
distribution together with the point(s) at which to evaluate the
density and returns the density at those points. (Multiple points can
be passed in as rows of X.)
% Implements the normal probability density with parameters mu and cov_mat.
% Rows of X are the points at which the density is evaluated.
function prob = density_norm(X, mu, cov_mat)
[n, d] = size(X);
X = X - ones(n,1)*mu;                           % center the data points
const = 1/((2*pi)^(d/2)*sqrt(det(cov_mat)));    % normalizing constant
arg = diag(X*inv(cov_mat)*X');                  % squared Mahalanobis distance of each point
prob = const*exp((-.5)*arg);
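If the Statistics Toolbox is available, MATLAB's built-in mvnpdf makes a quick sanity check for this function (a sketch, not part of the original code; the test point is the same one classified below):
x_test = [4 7];
density_norm(x_test, class_mean(1,:), class_covar(:,:,1))
mvnpdf(x_test, class_mean(1,:), class_covar(:,:,1))   % should print the same value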
Step 2: Predict
Now we have everything we need to do the prediction, i.e. assign a
label to an unseen data point. Suppose that the new point that we
need to classify is $[4, 7]^T$. We will compute the posterior
probability that this point belongs to class $y$.
First, compute the class-conditional densities by passing this
point to the function from Step 1. Do this for each class:
x = [4 7];
probs = [];
for c = 1:3
    probs(c) = density_norm(x, class_mean(c,:), class_covar(:,:,c));   % likelihood for class c
end
posteriors = (1/3)*probs                % prior times likelihood (proportional to the posterior)
max(posteriors)                         % largest (unnormalized) posterior
find(posteriors == max(posteriors))     % index of the largest posterior, i.e. the predicted class
Multiplying each of the three class-conditional densities by its prior,
1/3, gives a quantity proportional to the desired posterior probability
(the evidence in the denominator is the same for all classes, so it can
be ignored when comparing them).
The entries in the following table show the results:
| $Y$ | $P(\mathbf{X} = [4, 7]^T \mid Y = y)$ | $P(Y = y)$ | $P(\mathbf{X} = [4, 7]^T \mid Y = y) \times P(Y = y)$ |
|---|---|---|---|
| 1 | 0.0245 | $\frac{1}{3}$ | 0.0082 |
| 2 | 0.0695 | $\frac{1}{3}$ | 0.0232 |
| 3 | 0.0000 | $\frac{1}{3}$ | 0.0000 |
Finally, to classify the point, select the largest entry in the
last column and choose the label corresponding to it. Notice that
this new point is incredibly unlikely to belong to class 3 (the exact
value returned by MATLAB is 1.9686e-21) and almost 3 times as
likely to belong to class 2 as to class 1. Therefore:
Answer: assign label $\mathbf{2}$ to point $\mathbf{X} =
[4, 7]^T$
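For picking a label, the unnormalized values in the last column are all we need. If we also want proper posterior probabilities, we can divide by their sum (the evidence); a quick sketch continuing the code above:
joint = (1/3)*probs;        % prior times likelihood for each class
post = joint/sum(joint)     % normalized posteriors, they sum to 1
With the rounded values from the table this gives posteriors of roughly 0.26, 0.74, and essentially 0 for classes 1, 2, and 3; the decision is, of course, unchanged.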
Plot the Classification Contours
We can plot the contours for the classification probabilities over
a grid of many points in $R^2$ (MATLAB's meshgrid function
is very helpful here; see the code and the sketch below). For each class, our model will
assign a high probability to the region in the plane where points of
that class congregate and low probability to regions far away from the
points.
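Here is a minimal sketch of how such a plot can be produced with the pieces we already have (the grid limits and resolution are arbitrary choices; the linked script does this properly for all three classes):
[x1, x2] = meshgrid(linspace(-4, 12, 60), linspace(-4, 12, 60));   % arbitrary grid
grid_pts = [x1(:) x2(:)];
post = zeros(size(grid_pts,1), 3);
for c = 1:3
    post(:,c) = (1/3)*density_norm(grid_pts, class_mean(c,:), class_covar(:,:,c));
end
post = post ./ repmat(sum(post,2), 1, 3);                   % normalize to classification probabilities
figure; contour(x1, x2, reshape(post(:,1), size(x1)))       % contours for class 1
A modest grid size keeps the intermediate matrix inside density_norm small; with a very fine grid it would be better to evaluate the density in chunks.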
This is what the probability contours look like for each of the
classes:
Why Bayes Classification can be Bad for You
Really, it's not that bad: as we'll see later, Bayesian classification
often works very well in practice, even with the naive assumption
introduced below.
However, notice the strange effects in the top left figure
above. The area in the middle lower part of the plot is assigned as
high a probability as the area closer to the cluster of class 1
points. This is explained by the relative 'steepness' of
class-conditional contours for class 3, as compared with the
contours for the other classes. Class 3 density decays so fast that
the density for class 1 is actually higher to the right of class
3. This is kind of unfortunate, because it doesn't seem sensible to
label points in the middle lower region, which are so far away from
class 1, as belonging to class 1. It would be much better if
probabilities became less certain as we moved farther away from the
data.
Naive Bayes Simplification
In the example just considered, we were able to capture the
dependencies between the two features: our covariance matrices had
non-zero off-diagonal entries. For example, the data points in class
3 show a strong dependency between the features, which is reflected by
relatively large off-diagonal entries in the
covariance matrix for this class. (This dependency is also reflected
on the plot: notice that the principal axes of the ellipses, which
represent the iso-probability levels, are not parallel to the
coordinate axes.) In general, fitting a 2-dimensional normal involves
choosing 5 parameters: 2 for $\mu$, the means, and 3 for the distinct
entries in the covariance matrix (since $\Sigma$ is symmetric, we
really have 3 possibly distinct entries, not 4). This is perfectly
fine for our small example of just 90 points. However, as the number
of dimensions starts increasing, the $D$-dimensional normal would
require fitting $D + D + \frac{D(D-1)}{2}$ parameters ($D$ for the
mean, the rest for the covariance matrix). So if we only had 30 data
points and 10 features, we wouldn't be able to reliably fit the
resulting $10 + 10 + 45 = 65$ parameters.
This problem is overcome in practice by making the naive Bayes
assumption: for a given class the features are assumed to
be independent and therefore the $D$-dimensional
class-conditional distributions are factorized into a product of $D$
univariate distributions. A univariate normal requires just 2
parameters, $\mu$ and $\sigma^2$, so overall we only have $2D$
parameters to fit per class, far fewer than before.
But this means that we are also restricting the shapes of the
class-conditional distributions: their iso-probability ellipses must
now be aligned with the coordinate axes. This is the price exacted by
the decrease in the number of parameters: we give up model flexibility
and are unable to capture the within-class dependencies.
Anyway, here's the likelihood function with the naive Bayes
assumption:
$$P(X_1, X_2, ..., X_D | Y = y) = \prod_{d = 1}^D P(X_d | Y = y)$$
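Here is a sketch of how this factorized likelihood can be computed (density_naive, mu, and sig2 are names I'm introducing here, not part of the original code; the per-class estimates can be taken as mean(X(pos,:)) and var(X(pos,:),1), i.e. the diagonal of the earlier covariance MLE):
% Naive Bayes class-conditional density: a product of univariate normals.
% mu and sig2 are 1xD vectors of per-feature means and variances for one class.
function prob = density_naive(X, mu, sig2)
[n, d] = size(X);
prob = ones(n,1);
for j = 1:d
    % univariate normal density for feature j, written out explicitly
    prob = prob .* exp(-(X(:,j) - mu(j)).^2 ./ (2*sig2(j))) ./ sqrt(2*pi*sig2(j));
end
Equivalently, one can keep density_norm and simply zero out the off-diagonal entries of each estimated covariance matrix.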
And here's everything re-plotted with the naive Bayes
assumption. Notice how the principal axes of the ellipses are now
parallel to the coordinate axes.