GPyTorch Classification Tutorial

Introduction

This example shows the simplest form of using an RBF kernel in a VariationalGP module for classification. This basic model is usable when there is not much training data and no advanced techniques are required.

In this example, we’re modeling a square wave with period 1/2 that takes positive values at x=0. We are going to classify the points as either +1 or -1.

Variational inference approximates the intractable true posterior with a simpler distribution (here, a multivariate normal over the latent function values). The approximation is fit by minimizing the KL divergence between the approximate and true posteriors, which yields a fast approximation to the posterior. For a good explanation of variational techniques, sections 4-6 of the following may be useful: https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
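As a brief reminder (the derivation is not spelled out in this notebook), the approximate posterior \(q(f)\) is chosen by maximizing the evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence to the true posterior:

\[
\mathcal{L}(q) = \mathbb{E}_{q(f)}\big[\log p(y \mid f)\big] - \mathrm{KL}\big[q(f) \,\|\, p(f)\big] \le \log p(y).
\]

The training loop below maximizes this bound (by minimizing its negative), which is essentially what gpytorch.mlls.VariationalMarginalLogLikelihood computes.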

In [1]:
import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline

Set up training data

In the next cell, we set up the training data for this example. We’ll be using 10 regularly spaced points on [0, 1] at which we evaluate the function to get the training labels. The labels form a square wave with period 1/2 that is positive at x=0, taking the values +1 and -1.

In [2]:
train_x = torch.linspace(0, 1, 10)
train_y = torch.sign(torch.cos(train_x * (4 * math.pi)))

Setting up the classification model

The next cell demonstrates the simplest way to define a classification Gaussian process model in GPyTorch. If you have already done the GP regression tutorial, you have already seen how GPyTorch model construction differs from other GP packages. In particular, the GP model expects a user to write out a forward method in a way analogous to PyTorch models. This gives the user as much flexibility as possible.

Since exact inference is intractable for GP classification, GPyTorch approximates the classification posterior using variational inference. We believe that variational inference is ideal for a number of reasons. Firstly, variational inference commonly relies on gradient descent techniques, which take full advantage of PyTorch’s autograd. This reduces the amount of code needed to develop complex variational models. Additionally, variational inference can be performed with stochastic gradient descent, which can be extremely scalable for large datasets.

If you are unfamiliar with variational inference, we recommend the following resources:

  • Variational Inference: A Review for Statisticians by David M. Blei, Alp Kucukelbir, Jon D. McAuliffe.
  • Scalable Variational Gaussian Process Classification by James Hensman, Alex Matthews, Zoubin Ghahramani.

The necessary classes

For most variational GP models, you will need to construct the following GPyTorch objects:

  1. A GP Model (gpytorch.models.VariationalGP) - This handles basic variational inference.
  2. A Likelihood (gpytorch.likelihoods.BernoulliLikelihood) - This is a good likelihood for binary classification.
  3. A Mean - This defines the prior mean of the GP.
  • If you don’t know which mean to use, a gpytorch.means.ConstantMean() is a good place to start.
  4. A Kernel - This defines the prior covariance of the GP.
  • If you don’t know which kernel to use, a gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel()) is a good place to start.
  5. A MultivariateNormal Distribution (gpytorch.distributions.MultivariateNormal) - This is the object used to represent multivariate normal distributions.

The GP Model

The VariationalGP model is GPyTorch’s simplest approximate inference model. It approximates the true posterior with a multivariate normal distribution. The model defines all the variational parameters that are needed, and keeps all of this information under the hood.

The components of a user built VariationalGP model in GPyTorch are:

  1. An __init__ method that takes the training data as an input. The __init__ function will also construct a mean module, a kernel module, and whatever other modules might be necessary.
  2. A forward method that takes in some \(n \times d\) data x and returns a MultivariateNormal with the prior mean and covariance evaluated at x. In other words, we return the vector \(\mu(x)\) and the \(n \times n\) matrix \(K_{xx}\) representing the prior mean and covariance matrix of the GP.

(For those who are unfamiliar with GP classification: even though we are performing classification, the GP model still returns a MultivariateNormal. The likelihood transforms this latent Gaussian variable into a Bernoulli variable.)
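Concretely (this detail is not spelled out in the notebook), the BernoulliLikelihood uses a probit link: a latent function value \(f(x)\) is mapped to a class probability through the standard normal CDF \(\Phi\),

\[
p(y = 1 \mid f(x)) = \Phi\big(f(x)\big).
\]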

Here we present a simple classification model, but it is possible to construct more complex models. See the scalable classification examples or the deep kernel learning examples for more.

In [3]:
class GPClassificationModel(gpytorch.models.VariationalGP):
    def __init__(self, train_x):
        super(GPClassificationModel, self).__init__(train_x)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        latent_pred = gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
        return latent_pred

# Initialize model and likelihood
model = GPClassificationModel(train_x)
likelihood = gpytorch.likelihoods.BernoulliLikelihood()

Model modes

Like most PyTorch modules, the VariationalGP model has a .train() and .eval() mode.

  • .train() mode is for optimizing variational parameters and model hyperparameters.
  • .eval() mode is for computing predictions through the model posterior.

Learn the variational parameters (and other hyperparameters)

In the next cell, we optimize the variational parameters of our Gaussian process. In addition, this optimization loop also performs Type-II MLE to train the hyperparameters of the Gaussian process.

The most obvious difference here compared to many other GP implementations is that, as in standard PyTorch, the core training loop is written by the user. In GPyTorch, we make use of the standard PyTorch optimizers from torch.optim, and all trainable parameters of the model should be of type torch.nn.Parameter. The variational parameters are predefined as part of the VariationalGP model.
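To see exactly which parameters will be optimized (this quick inspection is not part of the original notebook), you can print the model's named parameters; the predefined variational parameters show up alongside the mean and kernel hyperparameters:

# Inspect every trainable parameter of the model and its shape
for name, param in model.named_parameters():
    print(name, tuple(param.shape))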

In most cases, the boilerplate code below will work well. It has the same basic components as the standard PyTorch training loop:

  1. Zero all parameter gradients
  2. Call the model and compute the loss
  3. Call backward on the loss to fill in gradients
  4. Take a step on the optimizer

However, defining custom training loops allows for greater flexibility. For example, it is possible to learn the variational parameters and kernel hyperparameters with different learning rates.
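For instance, here is a minimal sketch (not part of the original notebook) of using separate learning rates for the variational parameters and the hyperparameters. It assumes the variational parameters can be identified by the substring 'variational' in their names, which you can check with the inspection snippet above:

# Split the parameters into two groups by name (assumes variational parameters
# contain 'variational' in their names) and give each group its own learning rate
variational_params = [p for name, p in model.named_parameters() if 'variational' in name]
hyperparams = [p for name, p in model.named_parameters() if 'variational' not in name]

optimizer = torch.optim.Adam([
    {'params': variational_params, 'lr': 0.1},  # larger steps for the variational parameters
    {'params': hyperparams, 'lr': 0.01},        # smaller steps for the mean/kernel hyperparameters
])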

In [4]:
# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},
    # BernoulliLikelihood has no parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
# num_data refers to the amount of training data
mll = gpytorch.mlls.VariationalMarginalLogLikelihood(likelihood, model, num_data=len(train_y))

training_iter = 50
for i in range(training_iter):
    # Zero backpropped gradients from previous iteration
    optimizer.zero_grad()
    # Get predictive output
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (
        i + 1, training_iter, loss.item(),
    ))
    optimizer.step()
Iter 1/50 - Loss: 326.033
Iter 2/50 - Loss: 229.205
Iter 3/50 - Loss: 147.791
Iter 4/50 - Loss: 94.322
Iter 5/50 - Loss: 58.356
Iter 6/50 - Loss: 34.048
Iter 7/50 - Loss: 20.180
Iter 8/50 - Loss: 14.053
Iter 9/50 - Loss: 12.395
Iter 10/50 - Loss: 11.749
Iter 11/50 - Loss: 10.924
Iter 12/50 - Loss: 10.087
Iter 13/50 - Loss: 9.328
Iter 14/50 - Loss: 8.994
Iter 15/50 - Loss: 7.611
Iter 16/50 - Loss: 6.660
Iter 17/50 - Loss: 6.201
Iter 18/50 - Loss: 5.975
Iter 19/50 - Loss: 5.992
Iter 20/50 - Loss: 5.758
Iter 21/50 - Loss: 5.404
Iter 22/50 - Loss: 5.043
Iter 23/50 - Loss: 4.632
Iter 24/50 - Loss: 4.990
Iter 25/50 - Loss: 5.021
Iter 26/50 - Loss: 5.250
Iter 27/50 - Loss: 4.832
Iter 28/50 - Loss: 4.951
Iter 29/50 - Loss: 5.426
Iter 30/50 - Loss: 4.512
Iter 31/50 - Loss: 5.000
Iter 32/50 - Loss: 4.938
Iter 33/50 - Loss: 4.912
Iter 34/50 - Loss: 4.636
Iter 35/50 - Loss: 4.414
Iter 36/50 - Loss: 4.946
Iter 37/50 - Loss: 4.769
Iter 38/50 - Loss: 4.891
Iter 39/50 - Loss: 4.702
Iter 40/50 - Loss: 4.603
Iter 41/50 - Loss: 4.739
Iter 42/50 - Loss: 4.913
Iter 43/50 - Loss: 4.942
Iter 44/50 - Loss: 4.547
Iter 45/50 - Loss: 4.944
Iter 46/50 - Loss: 4.412
Iter 47/50 - Loss: 4.744
Iter 48/50 - Loss: 4.679
Iter 49/50 - Loss: 4.549
Iter 50/50 - Loss: 4.697

Make predictions with the model

In the next cell, we make predictions with the model. To do this, we simply put the model and likelihood in eval mode, and call both modules on the test data.

In .eval() mode, when we call model() we get the GP’s latent posterior predictions. These will be MultivariateNormal distributions. But since we are performing binary classification, we want to transform these outputs to classification probabilities using our likelihood.

When we call likelihood(model()), we get a torch.distributions.Bernoulli distribution, which represents our posterior probability that the data points belong to the positive class.

f_preds = model(test_x)              # latent function posterior (a MultivariateNormal)
y_preds = likelihood(model(test_x))  # class probabilities (a Bernoulli)

f_mean = f_preds.mean
f_samples = f_preds.sample(sample_shape=torch.Size((1000,)))

The gpytorch.fast_pred_var context is not needed, but here we are giving a preview of one of our cool features: computing faster predictive distributions using LOVE.

In [5]:
# Go into eval mode
model.eval()
likelihood.eval()

with torch.no_grad(), gpytorch.fast_pred_var():
    # Test x are 101 regularly spaced points in [0, 1]
    test_x = torch.linspace(0, 1, 101)
    # Get classification predictions
    observed_pred = likelihood(model(test_x))

    # Initialize fig and axes for plot
    f, ax = plt.subplots(1, 1, figsize=(4, 3))
    ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
    # The mean of the Bernoulli gives the probability of belonging to the positive class
    # Threshold at 0.5 and map these probabilities to -1/+1 labels to match the training targets
    pred_labels = observed_pred.mean.ge(0.5).float().mul(2).sub(1)
    ax.plot(test_x.numpy(), pred_labels.numpy(), 'b')
    ax.set_ylim([-3, 3])
    ax.legend(['Observed Data', 'Predicted Labels'])
[Figure: training data plotted as black stars and the predicted class labels as a blue line over the test inputs]
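As a quick follow-up (this sanity check is not in the original notebook), we can evaluate the model on the training inputs and compare the thresholded predictions with the training labels:

# Sanity check: training accuracy of the classifier
with torch.no_grad(), gpytorch.fast_pred_var():
    train_pred = likelihood(model(train_x))
    # Threshold the Bernoulli mean at 0.5 and map to -1/+1 labels
    train_labels = train_pred.mean.ge(0.5).float().mul(2).sub(1)
    accuracy = (train_labels == train_y).float().mean().item()
    print('Training accuracy: %.2f' % accuracy)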