GPyTorch Classification Tutorial

Introduction

This example shows the simplest way to use an RBF kernel in an AbstractVariationalGP module for classification. This basic model is suitable when there is little training data and no advanced techniques are required.

In this example, we model a unit square wave with period 1/2 that is positive at x=0. Each point is classified as either +1 or -1.

Variational inference approximates the intractable posterior with a distribution from a simpler family, chosen to be close to the true posterior in KL divergence. A common simplifying assumption (the mean-field assumption) is that the approximating distribution factorizes over the variables, which makes a fast approximation to the posterior possible. For a good explanation of variational techniques, sections 4-6 of the following may be useful: https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
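To make the role of the KL divergence concrete, here is a minimal sketch that evaluates it in closed form for two univariate Gaussians. The function name kl_gaussians and its arguments are illustrative choices and are not part of GPyTorch.

```python
import math


def kl_gaussians(mean_q, std_q, mean_p, std_p):
    """KL(q || p) for univariate Gaussians q = N(mean_q, std_q^2), p = N(mean_p, std_p^2)."""
    return (
        math.log(std_p / std_q)
        + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
        - 0.5
    )


# The divergence is zero when the two distributions coincide
# and grows as the approximation q moves away from p.
same = kl_gaussians(0.0, 1.0, 0.0, 1.0)
shifted = kl_gaussians(0.0, 1.0, 1.0, 1.0)
```

Variational inference searches over the parameters of q (here a mean and a standard deviation) to make a divergence of this kind small.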

[1]:
import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline

Set up training data

In the next cell, we set up the training data for this example. We use 10 regularly spaced points on [0,1] and evaluate the function on them to get the training labels. The labels follow a unit square wave with period 1/2 that is positive at x=0.

[2]:
train_x = torch.linspace(0, 1, 10)
train_y = torch.sign(torch.cos(train_x * (4 * math.pi)))
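As a quick sanity check on the labelling rule, the following plain-Python sketch reproduces the same square wave without torch. The helper square_wave_label is a hypothetical name introduced here. Unlike torch.sign, it maps an exact zero to -1, but none of the 10 training points falls on a zero crossing.

```python
import math


def square_wave_label(x):
    """Label +1 where cos(4*pi*x) is positive and -1 otherwise (period 1/2)."""
    return 1.0 if math.cos(4.0 * math.pi * x) > 0 else -1.0


# Ten regularly spaced points on [0, 1], mirroring torch.linspace(0, 1, 10)
xs = [i / 9 for i in range(10)]
labels = [square_wave_label(x) for x in xs]
# labels == [1, 1, -1, -1, 1, 1, -1, -1, 1, 1]
```

The labels repeat after a shift of 1/2, which is the period of the wave.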

Setting up the classification model

The next cell demonstrates the simplest way to define a classification Gaussian process model in GPyTorch. If you have already done the GP regression tutorial, you have already seen how GPyTorch model construction differs from other GP packages. In particular, the GP model expects the user to write out a forward method in a way analogous to PyTorch models. This gives the user maximum flexibility.

Since exact inference is intractable for GP classification, GPyTorch approximates the classification posterior using variational inference. We believe that variational inference is ideal for a number of reasons. Firstly, variational inference commonly relies on gradient descent techniques, which take full advantage of PyTorch’s autograd. This reduces the amount of code needed to develop complex variational models. Additionally, variational inference can be performed with stochastic gradient descent, which can be extremely scalable for large datasets.

If you are unfamiliar with variational inference, we recommend the following resources:

  • Variational Inference: A Review for Statisticians by David M. Blei, Alp Kucukelbir, Jon D. McAuliffe.
  • Scalable Variational Gaussian Process Classification by James Hensman, Alex Matthews, Zoubin Ghahramani.

The necessary classes

For most variational GP models, you will need to construct the following GPyTorch objects:

  1. A GP Model (gpytorch.models.AbstractVariationalGP) - This handles basic variational inference.
  2. A Variational distribution (gpytorch.variational.VariationalDistribution) - This tells us what form the variational distribution q(u) should take.
  3. A Variational strategy (gpytorch.variational.VariationalStrategy) - This tells us how to transform a distribution q(u) over the inducing point values to a distribution q(f) over the latent function values for some input x.
  4. A Likelihood (gpytorch.likelihoods.BernoulliLikelihood) - This is a good likelihood for binary classification.
  5. A Mean - This defines the prior mean of the GP.
     • If you don’t know which mean to use, a gpytorch.means.ConstantMean() is a good place to start.
  6. A Kernel - This defines the prior covariance of the GP.
     • If you don’t know which kernel to use, a gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel()) is a good place to start.
  7. A MultivariateNormal Distribution (gpytorch.distributions.MultivariateNormal) - This is the object used to represent multivariate normal distributions.

The GP Model

The AbstractVariationalGP model is GPyTorch’s simplest approximate inference model. It approximates the true posterior with a distribution specified by a VariationalDistribution, which is most commonly some form of MultivariateNormal distribution. The model defines all the variational parameters that are needed, and keeps all of this information under the hood.

The components of a user built AbstractVariationalGP model in GPyTorch are:

  1. An __init__ method that constructs a mean module, a kernel module, a variational distribution object and a variational strategy object. This method should also be responsible for constructing whatever other modules might be necessary.
  2. A forward method that takes in some \(n \times d\) data x and returns a MultivariateNormal with the prior mean and covariance evaluated at x. In other words, we return the vector \(\mu(x)\) and the \(n \times n\) matrix \(K_{xx}\) representing the prior mean and covariance matrix of the GP.
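To illustrate what the forward method returns, here is a plain-Python sketch that builds the prior mean vector and the \(n \times n\) covariance matrix for one-dimensional inputs. The helper names, the lengthscale and outputscale arguments, and their values are assumptions made for this illustration. The kernel is written as a scaled RBF, \(k(x, x') = s \exp(-(x - x')^2 / (2 \ell^2))\).

```python
import math


def constant_mean(xs, constant=0.0):
    """Prior mean vector mu(x): the same constant at every input."""
    return [constant for _ in xs]


def rbf_kernel_matrix(xs, lengthscale=1.0, outputscale=1.0):
    """Prior covariance K_xx for 1-D inputs under a scaled RBF kernel."""
    return [
        [
            outputscale * math.exp(-((a - b) ** 2) / (2.0 * lengthscale ** 2))
            for b in xs
        ]
        for a in xs
    ]


xs = [0.0, 0.5, 1.0]
mu = constant_mean(xs)                     # length-n mean vector
K = rbf_kernel_matrix(xs, lengthscale=0.5)  # n x n covariance matrix
```

The diagonal of K equals the output scale, and entries shrink as the inputs move apart, which is what makes nearby points strongly correlated under the prior.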

(For those who are unfamiliar with GP classification: even though we are performing classification, the GP model still returns a MultivariateNormal. The likelihood transforms this latent Gaussian variable into a Bernoulli variable.)
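The transformation from a latent Gaussian to a class probability can be sketched for a single point as follows. This assumes a probit link, that is, \(p(y = +1 \mid f) = \Phi(f)\) with \(\Phi\) the standard normal CDF. Under that assumption the expectation over a Gaussian latent value has a closed form. The function names are hypothetical.

```python
import math


def standard_normal_cdf(z):
    """Phi(z), the CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def positive_class_probability(latent_mean, latent_var):
    """E[Phi(f)] for f ~ N(latent_mean, latent_var), assuming a probit link.

    Uses the identity: integral of Phi(f) N(f; m, v) df = Phi(m / sqrt(1 + v)).
    """
    return standard_normal_cdf(latent_mean / math.sqrt(1.0 + latent_var))


p_uncertain = positive_class_probability(0.0, 1.0)   # latent mean at zero
p_confident = positive_class_probability(2.0, 0.1)   # positive mean, low variance
p_hedged = positive_class_probability(2.0, 10.0)     # same mean, high variance
```

A larger latent variance pulls the probability toward 0.5, so uncertainty in the latent function shows up as less confident class predictions.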

Here we present a simple classification model, but it is possible to construct more complex models. See the scalable classification examples or the deep kernel learning examples for more.

[3]:
from gpytorch.models import AbstractVariationalGP
from gpytorch.variational import CholeskyVariationalDistribution
from gpytorch.variational import VariationalStrategy


class GPClassificationModel(AbstractVariationalGP):
    def __init__(self, train_x):
        variational_distribution = CholeskyVariationalDistribution(train_x.size(0))
        variational_strategy = VariationalStrategy(self, train_x, variational_distribution)
        super(GPClassificationModel, self).__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        latent_pred = gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
        return latent_pred


# Initialize model and likelihood
model = GPClassificationModel(train_x)
likelihood = gpytorch.likelihoods.BernoulliLikelihood()

Model modes

Like most PyTorch modules, the AbstractVariationalGP has a .train() and an .eval() mode.

  • .train() mode is for optimizing variational parameters and model hyperparameters.
  • .eval() mode is for computing predictions through the model posterior.

Learn the variational parameters (and other hyperparameters)

In the next cell, we optimize the variational parameters of our Gaussian process. The same optimization loop also performs Type-II MLE to train the hyperparameters of the Gaussian process.
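For reference, the objective maximized in this kind of loop is typically the evidence lower bound (ELBO). The following is a sketch in standard notation, assuming a likelihood that factorizes over data points. It reuses the \(q(u)\) and \(q(f)\) notation from above, and the symbols \(N\) (number of training points) and \(p(u)\) (prior over inducing values) are introduced here for illustration:

\[
\mathrm{ELBO}(q) = \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\left[\log p(y_i \mid f_i)\right] - \mathrm{KL}\left[q(u) \,\|\, p(u)\right]
\]

The first term rewards fitting the training labels, and the second keeps \(q(u)\) close to the prior. Implementations may rescale this quantity, for example by dividing by \(N\).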

The most obvious difference here compared to many other GP implementations is that, as in standard PyTorch, the core training loop is written by the user. GPyTorch uses the standard PyTorch optimizers from torch.optim, and all trainable parameters of the model should be of type torch.nn.Parameter. The variational parameters are predefined as part of the AbstractVariationalGP model.

In most cases, the boilerplate code below will work well. It has the same basic components as the standard PyTorch training loop:

  1. Zero all parameter gradients
  2. Call the model and compute the loss
  3. Call backward on the loss to fill in gradients
  4. Take a step on the optimizer

However, defining custom training loops allows for greater flexibility. For example, it is possible to learn the variational parameters and kernel hyperparameters with different learning rates.
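As a minimal sketch of separate learning rates, torch.optim optimizers accept a list of parameter groups, each with its own settings. The ToyModel class and its parameter names below are hypothetical stand-ins rather than GPyTorch objects. With a real model, you would select parameters by whatever names that model exposes.

```python
import torch


class ToyModel(torch.nn.Module):
    """Hypothetical stand-in with one 'variational' and one 'hyper' parameter."""

    def __init__(self):
        super().__init__()
        self.variational_mean = torch.nn.Parameter(torch.zeros(3))
        self.raw_lengthscale = torch.nn.Parameter(torch.zeros(1))


toy_model = ToyModel()

# Split the parameters by name into two groups
variational_params = [
    p for name, p in toy_model.named_parameters() if "variational" in name
]
hyper_params = [
    p for name, p in toy_model.named_parameters() if "variational" not in name
]

# Each dict is one parameter group with its own learning rate
optimizer = torch.optim.Adam([
    {"params": variational_params, "lr": 0.1},
    {"params": hyper_params, "lr": 0.01},
])
```

The rest of the training loop is unchanged: a single optimizer.step() updates both groups, each at its own rate.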

[4]:
from gpytorch.mlls.variational_elbo import VariationalELBO

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# Objective for variational GPs - the ELBO, a lower bound on the marginal log likelihood
# num_data refers to the number of training data points
mll = VariationalELBO(likelihood, model, train_y.numel())

training_iter = 50
for i in range(training_iter):
    # Zero backpropped gradients from previous iteration
    optimizer.zero_grad()
    # Get predictive output
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iter, loss.item()))
    optimizer.step()
Iter 1/50 - Loss: 1.225
Iter 2/50 - Loss: 8.492
Iter 3/50 - Loss: 2.531
Iter 4/50 - Loss: 3.006
Iter 5/50 - Loss: 5.014
Iter 6/50 - Loss: 3.859
Iter 7/50 - Loss: 1.783
Iter 8/50 - Loss: 1.525
Iter 9/50 - Loss: 2.158
Iter 10/50 - Loss: 2.525
Iter 11/50 - Loss: 2.080
Iter 12/50 - Loss: 1.602
Iter 13/50 - Loss: 1.520
Iter 14/50 - Loss: 1.704
Iter 15/50 - Loss: 1.773
Iter 16/50 - Loss: 1.597
Iter 17/50 - Loss: 1.438
Iter 18/50 - Loss: 1.574
Iter 19/50 - Loss: 1.795
Iter 20/50 - Loss: 1.737
Iter 21/50 - Loss: 1.847
Iter 22/50 - Loss: 1.789
Iter 23/50 - Loss: 1.505
Iter 24/50 - Loss: 1.369
Iter 25/50 - Loss: 1.503
Iter 26/50 - Loss: 1.363
Iter 27/50 - Loss: 1.322
Iter 28/50 - Loss: 1.330
Iter 29/50 - Loss: 1.378
Iter 30/50 - Loss: 1.343
Iter 31/50 - Loss: 1.416
Iter 32/50 - Loss: 1.467
Iter 33/50 - Loss: 1.441
Iter 34/50 - Loss: 1.425
Iter 35/50 - Loss: 1.327
Iter 36/50 - Loss: 1.498
Iter 37/50 - Loss: 1.393
Iter 38/50 - Loss: 1.208
Iter 39/50 - Loss: 1.429
Iter 40/50 - Loss: 1.361
Iter 41/50 - Loss: 1.435
Iter 42/50 - Loss: 1.287
Iter 43/50 - Loss: 1.673
Iter 44/50 - Loss: 1.601
Iter 45/50 - Loss: 1.275
Iter 46/50 - Loss: 1.321
Iter 47/50 - Loss: 1.750
Iter 48/50 - Loss: 1.487
Iter 49/50 - Loss: 1.195
Iter 50/50 - Loss: 1.430

Make predictions with the model

In the next cell, we make predictions with the model. To do this, we simply put the model and likelihood in eval mode, and call both modules on the test data.

In .eval() mode, calling model() returns the GP’s latent posterior predictions. These are MultivariateNormal distributions. Since we are performing binary classification, we want to transform these outputs to classification probabilities using our likelihood.

When we call likelihood(model()), we get a torch.distributions.Bernoulli distribution, which represents our posterior probability that the data points belong to the positive class.

f_preds = model(test_x)
y_preds = likelihood(model(test_x))

f_mean = f_preds.mean
f_samples = f_preds.sample(sample_shape=torch.Size((1000,)))
[5]:
# Go into eval mode
model.eval()
likelihood.eval()

with torch.no_grad():
    # Test points are regularly spaced by 0.01 on [0, 1], endpoints inclusive
    test_x = torch.linspace(0, 1, 101)
    # Get classification predictions
    observed_pred = likelihood(model(test_x))

    # Initialize fig and axes for plot
    f, ax = plt.subplots(1, 1, figsize=(4, 3))
    ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
    # Get the predicted probabilities of belonging to the positive class
    # Threshold at 0.5 and map these probabilities to -1/+1 labels
    pred_labels = observed_pred.mean.ge(0.5).float().mul(2).sub(1)
    ax.plot(test_x.numpy(), pred_labels.numpy(), 'b')
    ax.set_ylim([-3, 3])
    ax.legend(['Observed Data', 'Predicted Labels'])
[Figure: observed training labels (black stars) and predicted labels (blue line) over [0, 1]]
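The thresholding step in the cell above can also be sketched without torch, together with a simple accuracy check. The probabilities and reference labels below are made-up values for illustration, not outputs of the trained model, and the helper names are hypothetical.

```python
def to_signed_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to +1/-1 labels (>= threshold gives +1)."""
    return [1.0 if p >= threshold else -1.0 for p in probabilities]


def accuracy(predicted, expected):
    """Fraction of positions where the predicted label matches the expected one."""
    matches = sum(1 for a, b in zip(predicted, expected) if a == b)
    return matches / len(expected)


# Hypothetical probabilities and reference labels
probs = [0.9, 0.7, 0.2, 0.4, 0.5]
reference = [1.0, 1.0, -1.0, 1.0, 1.0]

predicted = to_signed_labels(probs)    # [1.0, 1.0, -1.0, -1.0, 1.0]
score = accuracy(predicted, reference)  # 4 of 5 labels match
```

Using >= for the comparison mirrors the .ge(0.5) call in the cell above, so a probability of exactly 0.5 is assigned to the positive class.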