The Bayes Classifier, an Example with Gaussian Class Conditionals
By David Dobor (draft)
Introduction
This is part 2 of a 5-part tutorial on Bayesian
classification. It can be read independently of the other parts,
but it requires familiarity with basic probability and
statistics, especially with the Bayes rule.
The classification problem in general, briefly repeated here for
convenience, is as follows. We are given $N$ training objects
$\mathbf{X}_1, \ldots, \mathbf{X}_N$, each of which is a feature vector of dimension
$D$. Each of the $D$ features is either a discrete or a continuous
random variable. Each object carries a label $y \in \{1, 2, ..., C\}$
that indicates to which of the $C$ classes the object
belongs. Our job is to predict the class $y$ of an unseen
object $\mathbf{X}_{new}$.
In the following example, the features are assumed to be drawn from
Gaussian (normal) distributions. (This differs from the email
classification example discussed previously, in which the features
were discrete, e.g. the number of times a certain word occurred in a
message.)
The rest of this page:
- Describes the problem
- Shows how to compute the posterior class distribution of a new
observation using the Bayes rule (without the naive Bayes assumption
first)
- Shows how to predict the class label of a new observation
- Demonstrates how to plot the decision contours
- Exposes possible flaws of the Bayes classification model
- Introduces the Naive Bayes simplification and re-computes the
results
The MATLAB code that generated all the figures that follow can be
found here.
The Problem Setup
We use artificially generated data to illustrate Bayesian
classification. All of the 90 observations in this data set have
been labeled as belonging to one of the 3 different classes. Each
class contains 30 observations. The labels are $y \in \{1, 2,
3\}$. The red circles denote class 1, the green diamonds class 2,
and the blue squares class 3. The observations themselves (I'll call
them data points sometimes) consist of just 2 features and are
represented as points in a plane (in $R^2$). We denote feature
vectors by $\mathbf{X} = (X_1, X_2)$. Given its class $y$, each
observation is assumed to be drawn from a class-specific bivariate
normal distribution with parameters $\mu_y$ and $\Sigma_y$:
$P(\mathbf{X} | Y = y) = \mathcal{N}(\mu_y, \Sigma_y)$.
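To make the generative model concrete: data of this kind could be sampled with MATLAB's mvnrnd (Statistics Toolbox). The means and covariances below are placeholders I made up purely for illustration; they are not the ones used by gen_data.m, which stay secret.
% Illustration only: made-up parameters, NOT those used by gen_data.m
mu1 = [0 0]; S1 = [1 0; 0 1];
mu2 = [4 6]; S2 = [1.5 0.3; 0.3 1];
mu3 = [8 2]; S3 = [2 1.5; 1.5 2];   % strong feature correlation, like class 3
X = [mvnrnd(mu1, S1, 30); mvnrnd(mu2, S2, 30); mvnrnd(mu3, S3, 30)];
y = [ones(30,1); 2*ones(30,1); 3*ones(30,1)];   % 30 labeled points per class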
[Figure: left, the 90 labeled data points (red circles: class 1, green diamonds: class 2, blue squares: class 3); right, the same data with iso-probability contours superimposed.]
The plot on the right superimposes iso-probability lines on the
data (i.e. the probability density takes the same value at every
point along any contour line). Since the data are generated from
bivariate normal distributions, these lines are
ellipses. (Incidentally, notice that class 3 has relatively high
covariance between its features. This is evident from the fact that the
iso-probability ellipses are not aligned with the coordinate axes;
if there were zero correlation between the features, the axes of the
ellipses would be parallel to the coordinate axes. We'll return to
this point at the end of this tutorial.)
The Goal: Assign a label to a new observation. For
example, given a new feature vector, say $\mathbf{X}_{new} = [X_1 =
x_1, X_2 = x_2]^T =[4, 7]^T$, determine which of the 3 class labels
to assign to it. We'll discuss the decision rule we'll use
in order to make this assignment in a minute.
Recall the Bayes formula:
$$ P(Y | X_1, X_2) = \frac{P(Y) \cdot P(X_1, X_2 | Y)} {P(X_1, X_2)}
$$
Also recall that the
terms in this formula are usually referred to as follows:
$$\text{posterior} = \frac{\text{prior} \cdot \text{likelihood}}{\text{evidence}}$$
Once we compute this posterior distribution of $Y$ for a new data
point, the decision rule as to which label to assign to the
point is simple: assign to this data point the label with the
highest posterior probability. We will see this in a concrete
example below.
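Written out, this is the maximum a posteriori (MAP) decision rule; since the evidence $P(X_1, X_2)$ is the same for every class, it can be dropped when comparing:
$$\hat{y} = \arg\max_{y \in \{1, \ldots, C\}} P(Y = y | X_1, X_2) = \arg\max_{y \in \{1, \ldots, C\}} P(Y = y) \cdot P(X_1, X_2 | Y = y)$$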
Let's begin by computing all the ingredients for the Bayes formula:
The Terms in the Bayes Formula
Prior: With 30 data points belonging to each class and
a total of 90 points, we use $P(Y = y) = N_y/N = 30/90 = 1/3$ for the
prior. This represents our prior belief about which class a new data
point might belong to, before we get to observe it.
Likelihood: I've generated these data points from 3
different Gaussian distributions, and I know what the parameters
$(\mu_y, \Sigma_y)$ are for each of the three $y$s. However, in many
real-world applications the parameters are not known and need to be
estimated from the data. So I'm going to keep my
parameters a secret and we'll estimate them from the data. (If
you'd really like to know, gen_data.m generated the data
and thus holds the key to the secret.)
I'll use Maximum Likelihood Estimates (MLE) for the
parameters. (We'll consider alternatives to this later.)
The MLEs for the Gaussian distribution are derived in every
statistics textbook, so of course I'll skip that. (This is usually
done by differentiating the log-likelihood with respect to each
parameter, setting it to zero, finding the solution, and showing
that it's indeed a maximum.) The result is this:
$$ \mu_y = \frac{1}{N_y} \sum_{n: y_n = y} \mathbf{x}_n $$ $$ \Sigma_y =
\frac{1}{N_y} \sum_{n: y_n = y} (\mathbf{x}_n - \mu_y) \cdot (\mathbf{x}_n - \mu_y)^T $$
Here $N_y$ is the number of training points in class $y$, and the
summations run only over the data points that came from class $y$.
Step 0: Estimate the Parameters
As just mentioned, the parameters of the three normal
distributions must be estimated. This is itself a machine-learning
task, but here it's not complicated. The following script computes
the MLEs $(\mu_y, \Sigma_y)$ for each class:
load data_bayes3   % load the data generated by gen_data.m
labels = unique(y);
for c = 1:length(labels)
    pos = find(y == labels(c));              % indices of the points in class c
    class_mean(c,:) = mean(X(pos,:));        % 1x2 mean vector for class c
    class_covar(:,:,c) = cov(X(pos,:), 1);   % 2x2 covariance; the flag 1 normalizes by N, i.e. the MLE
end
Step 1: Write a Function that Computes Normal Density
Here we just implement the familiar bivariate normal density
formula:
$$p(\mathbf{x}|\mu,\Sigma) =
\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right\}$$
The following MATLAB function takes the parameters of a normal
distribution together with the point(s) at which to evaluate the
density and returns the density at those points. (Multiple points can
be passed in as rows of X.)
% Implements the normal probability density with parameters mu and cov_mat.
% Rows of X are the points at which the density is evaluated.
function prob = density_norm(X, mu, cov_mat)
[n, d] = size(X);
X = X - ones(n,1)*mu;                           % center the data points
const = 1/((2*pi)^(d/2)*sqrt(det(cov_mat)));    % normalizing constant
arg = diag(X*inv(cov_mat)*X');                  % squared Mahalanobis distance of each point
prob = const*exp((-.5)*arg);
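If the Statistics Toolbox is available, MATLAB's built-in mvnpdf makes a quick sanity check for this function (a sketch, not part of the original code; the test point is the same one classified below):
x_test = [4 7];
density_norm(x_test, class_mean(1,:), class_covar(:,:,1))
mvnpdf(x_test, class_mean(1,:), class_covar(:,:,1))   % should print the same value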
Step 2: Predict
Now we have everything we need to do the prediction, i.e. assign a
label to an unseen data point. Suppose that the new point that we
need to classify is $[4, 7]^T$. We will compute the posterior
probability that this point belongs to class $y$.
First, compute the class-conditional densities by passing this
point to the function from Step 1. Do this for each class:
x = [4 7];
probs = [];
for c = 1:3
    probs(c) = density_norm(x, class_mean(c,:), class_covar(:,:,c));   % likelihood for class c
end
posteriors = (1/3)*probs                % prior times likelihood (proportional to the posterior)
max(posteriors)                         % largest (unnormalized) posterior
find(posteriors == max(posteriors))     % index of the largest posterior, i.e. the predicted class
Multiplying each of the three class-conditional densities by its prior,
1/3, gives a quantity proportional to the desired posterior probability
(the evidence in the denominator is the same for all classes, so it can
be ignored when comparing them).
The entries in the following table show the results:
| $Y$ | $P(\mathbf{X} = [4, 7]^T \mid Y = y)$ | $P(Y = y)$ | $P(\mathbf{X} = [4, 7]^T \mid Y = y) \times P(Y = y)$ |
|---|---|---|---|
| 1 | 0.0245 | $\frac{1}{3}$ | 0.0082 |
| 2 | 0.0695 | $\frac{1}{3}$ | 0.0232 |
| 3 | 0.0000 | $\frac{1}{3}$ | 0.0000 |
Finally, to classify the point, select the largest entry in the
last column and choose the label corresponding to it. Notice that
this new point is incredibly unlikely to belong to class 3 (the exact
value returned by MATLAB is 1.9686e-21) and almost 3 times as
likely to belong to class 2 as to class 1. Therefore:
Answer: assign label $\mathbf{2}$ to point $\mathbf{X} =
[4, 7]^T$
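For picking a label, the unnormalized values in the last column are all we need. If we also want proper posterior probabilities, we can divide by their sum (the evidence); a quick sketch continuing the code above:
joint = (1/3)*probs;        % prior times likelihood for each class
post = joint/sum(joint)     % normalized posteriors, they sum to 1
With the rounded values from the table this gives posteriors of roughly 0.26, 0.74, and essentially 0 for classes 1, 2, and 3; the decision is, of course, unchanged.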
Plot the Classification Contours
We can plot the contours for the classification probabilities over
a grid of many points in $R^2$ (MATLAB's meshgrid function
is very helpful here; see the code and the sketch below). For each class, our model will
assign a high probability to the region in the plane where points of
that class congregate and low probability to regions far away from the
points.
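Here is a minimal sketch of how such a plot can be produced with the pieces we already have (the grid limits and resolution are arbitrary choices; the linked script does this properly for all three classes):
[x1, x2] = meshgrid(linspace(-4, 12, 60), linspace(-4, 12, 60));   % arbitrary grid
grid_pts = [x1(:) x2(:)];
post = zeros(size(grid_pts,1), 3);
for c = 1:3
    post(:,c) = (1/3)*density_norm(grid_pts, class_mean(c,:), class_covar(:,:,c));
end
post = post ./ repmat(sum(post,2), 1, 3);                   % normalize to classification probabilities
figure; contour(x1, x2, reshape(post(:,1), size(x1)))       % contours for class 1
A modest grid size keeps the intermediate matrix inside density_norm small; with a very fine grid it would be better to evaluate the density in chunks.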
This is what the probability contours look like for each of the
classes:
Why Bayes Classification can be Bad for You
Really, it's not that bad: as we'll see later, Bayesian classification
often works very well in practice, even with the naive assumption
introduced below.
However, notice the strange effects in the top left figure
above. The area in the middle lower part of the plot is assigned as
high a probability as the area closer to the cluster of class 1
points. This is explained by the relative 'steepness' of
class-conditional contours for class 3, as compared with the
contours for the other classes. Class 3 density decays so fast that
the density for class 1 is actually higher to the right of class
3. This is kind of unfortunate, because it doesn't seem sensible to
label points in the middle lower region, which are so far away from
class 1, as belonging to class 1. It would be much better if
probabilities became less certain as we moved farther away from the
data.
Naive Bayes Simplification
In the example just considered, we were able to capture the
dependencies between the two features: our covariance matrices had
non-zero off-diagonal entries. For example, the data points in class
3 show a strong dependency between the features, which is reflected by
relatively large off-diagonal entries in the
covariance matrix for this class. (This dependency is also reflected
on the plot: notice that the principal axes of the ellipses, which
represent the iso-probability levels, are not parallel to the
coordinate axes.) In general, fitting a 2-dimensional normal involves
choosing 5 parameters: 2 for $\mu$, the means, and 3 for the distinct
entries in the covariance matrix (since $\Sigma$ is symmetric, we
really have 3 possibly distinct entries, not 4). This is perfectly
fine for our small example of just 90 points. However, as the number
of dimensions starts increasing, the $D$-dimensional normal would
require fitting $D + D + \frac{D(D-1)}{2}$ parameters ($D$ for the
mean, the rest for the covariance matrix). So if we only had 30 data
points and 10 features, we wouldn't be able to reliably fit the
resulting $10 + 10 + 45 = 65$ parameters.
This problem is overcome in practice by making the naive Bayes
assumption: for a given class the features are assumed to
be independent and therefore the $D$-dimensional
class-conditional distributions are factorized into a product of $D$
univariate distributions. A univariate normal requires just 2
parameters, $\mu$ and $\sigma^2$, so overall we only have $2D$
parameters to fit per class, far fewer than before.
But this means that we are also restricting the shapes of the
class-conditional distributions: their iso-probability ellipses must
now be aligned with the coordinate axes. This is the price exacted by
the decrease in the number of parameters: we give up model flexibility
and are unable to capture the within-class dependencies.
Anyway, here's the likelihood function with the naive Bayes
assumption:
$$P(X_1, X_2, ..., X_D | Y = y) = \prod_{d = 1}^D P(X_d | Y = y)$$
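Here is a sketch of how this factorized likelihood can be computed (density_naive, mu, and sig2 are names I'm introducing here, not part of the original code; the per-class estimates can be taken as mean(X(pos,:)) and var(X(pos,:),1), i.e. the diagonal of the earlier covariance MLE):
% Naive Bayes class-conditional density: a product of univariate normals.
% mu and sig2 are 1xD vectors of per-feature means and variances for one class.
function prob = density_naive(X, mu, sig2)
[n, d] = size(X);
prob = ones(n,1);
for j = 1:d
    % univariate normal density for feature j, written out explicitly
    prob = prob .* exp(-(X(:,j) - mu(j)).^2 ./ (2*sig2(j))) ./ sqrt(2*pi*sig2(j));
end
Equivalently, one can keep density_norm and simply zero out the off-diagonal entries of each estimated covariance matrix.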
And here's everything re-plotted with the naive Bayes
assumption. Notice how the principal axes of the ellipses are now
parallel to the coordinate axes.