GPyTorch Classification Tutorial

Introduction

This example is the simplest form of using an RBF kernel in an AbstractVariationalGP module for classification. This basic model is usable when there is not much training data and no advanced techniques are required.

In this example, we’re modeling a square wave with period 1/2 that is positive at x=0. We are going to classify each point as belonging to the positive class (label 1) or the negative class (label 0).

Variational inference approximates the true (intractable) posterior with a simpler distribution, chosen by minimizing the KL divergence between the two; a common simplifying assumption is that the approximate posterior factorizes over the latent variables. This yields a fast approximation to the posterior. For a good explanation of variational techniques, sections 4-6 of the following may be useful: https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
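Concretely, minimizing this KL divergence is equivalent to maximizing the evidence lower bound (ELBO). For the inducing-point models used below, the standard objective (following the Hensman et al. reference given later; this is the generic form, not a GPyTorch-specific detail) is

\[
\mathcal{L} = \sum_{i=1}^{n} \mathbb{E}_{q(f_i)}\left[\log p(y_i \mid f_i)\right] - \mathrm{KL}\left(q(u) \,\|\, p(u)\right),
\]

where \(q(u)\) is the variational distribution over the inducing values and \(p(u)\) is the GP prior at the inducing locations. The training loop further down minimizes the negative of this quantity.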

[1]:
import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline

Set up training data

In the next cell, we set up the training data for this example. We’ll be using 10 regularly spaced points on [0,1]. The training labels are obtained by evaluating a square wave with period 1/2 (positive at x=0) at those points and mapping the result to 0/1.

[2]:
# 10 regularly spaced training inputs on [0, 1]
train_x = torch.linspace(0, 1, 10)
# Square wave with period 1/2, mapped from {-1, +1} to {0, 1}
train_y = torch.sign(torch.cos(train_x * (4 * math.pi))).add(1).div(2)
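As a quick sanity check (not part of the original notebook), you can print the labels and confirm that they follow the square wave, alternating in blocks of two between 1 and 0:

print(train_y)
# expected: tensor([1., 1., 0., 0., 1., 1., 0., 0., 1., 1.])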

Setting up the classification model

The next cell demonstrates the simplest way to define a classification Gaussian process model in GPyTorch. If you have already done the GP regression tutorial, you have already seen how GPyTorch model construction differs from other GP packages. In particular, the GP model expects the user to write out a forward method, analogous to how PyTorch models are defined. This gives the user as much flexibility as possible.

Since exact inference is intractable for GP classification, GPyTorch approximates the classification posterior using variational inference. We believe that variational inference is ideal for a number of reasons. Firstly, variational inference commonly relies on gradient descent techniques, which take full advantage of PyTorch’s autograd. This reduces the amount of code needed to develop complex variational models. Additionally, variational inference can be performed with stochastic gradient descent, which can be extremely scalable for large datasets.

If you are unfamiliar with variational inference, we recommend the following resources:

  • Variational Inference: A Review for Statisticians by David M. Blei, Alp Kucukelbir, Jon D. McAuliffe.
  • Scalable Variational Gaussian Process Classification by James Hensman, Alex Matthews, Zoubin Ghahramani.

The necessary classes

For most variational GP models, you will need to construct the following GPyTorch objects:

  1. A GP Model (gpytorch.models.AbstractVariationalGP) - This handles basic variational inference.
  2. A Variational distribution (gpytorch.variational.VariationalDistribution) - This tells us what form the variational distribution q(u) should take.
  3. A Variational strategy (gpytorch.variational.VariationalStrategy) - This tells us how to transform a distribution q(u) over the inducing point values to a distribution q(f) over the latent function values for some input x.
  4. A Likelihood (gpytorch.likelihoods.BernoulliLikelihood) - This is a good likelihood for binary classification.
  5. A Mean - This defines the prior mean of the GP.
  • If you don’t know which mean to use, a gpytorch.means.ConstantMean() is a good place to start.
  6. A Kernel - This defines the prior covariance of the GP.
  • If you don’t know which kernel to use, a gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel()) is a good place to start.
  7. A MultivariateNormal Distribution (gpytorch.distributions.MultivariateNormal) - This is the object used to represent multivariate normal distributions.

The GP Model

The AbstractVariationalGP model is GPyTorch’s simplest approximate inference model. It approximates the true posterior with a distribution specified by a VariationalDistribution, which is most commonly some form of MultivariateNormal distribution. The model defines all of the variational parameters that are needed and keeps this information under the hood.

The components of a user built AbstractVariationalGP model in GPyTorch are:

  1. An __init__ method that constructs a mean module, a kernel module, a variational distribution object and a variational strategy object. This method should also be responsible for constructing whatever other modules might be necessary.
  2. A forward method that takes in some \(n \times d\) data x and returns a MultivariateNormal with the prior mean and covariance evaluated at x. In other words, we return the vector \(\mu(x)\) and the \(n \times n\) matrix \(K_{xx}\) representing the prior mean and covariance matrix of the GP.

(For those who are unfamiliar with GP classification: even though we are performing classification, the GP model still returns a MultivariateNormal. The likelihood transforms this latent Gaussian variable into a Bernoulli variable)
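For intuition, the BernoulliLikelihood uses a probit link: a latent function value f is mapped to a class probability by the standard normal CDF. The snippet below is an illustrative sketch of that mapping only (the actual likelihood integrates the link over the Gaussian posterior on f rather than evaluating it at a single point):

# Probit link: map latent function values to class probabilities
normal = torch.distributions.Normal(0.0, 1.0)
f = torch.tensor([-2.0, 0.0, 2.0])  # example latent function values
probs = normal.cdf(f)               # roughly [0.023, 0.500, 0.977]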

Here we present a simple classification model, but it is possible to construct more complex models. See the scalable classification examples or the deep kernel learning examples for more sophisticated variants.

[3]:
from gpytorch.models import AbstractVariationalGP
from gpytorch.variational import CholeskyVariationalDistribution
from gpytorch.variational import VariationalStrategy


class GPClassificationModel(AbstractVariationalGP):
    def __init__(self, train_x):
        variational_distribution = CholeskyVariationalDistribution(train_x.size(0))
        variational_strategy = VariationalStrategy(self, train_x, variational_distribution)
        super(GPClassificationModel, self).__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        latent_pred = gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
        return latent_pred


# Initialize model and likelihood
model = GPClassificationModel(train_x)
likelihood = gpytorch.likelihoods.BernoulliLikelihood()

Model modes

Like most PyTorch modules, GPyTorch models have a .train() and .eval() mode.

  • .train() mode is for optimizing the variational parameters and model hyperparameters.
  • .eval() mode is for computing predictions through the model posterior.

Learn the variational parameters (and other hyperparameters)

In the next cell, we optimize the variational parameters of our Gaussian process. In addition, this optimization loop also performs Type-II MLE to train the hyperparameters of the Gaussian process.

The most obvious difference here compared to many other GP implementations is that, as in standard PyTorch, the core training loop is written by the user. In GPyTorch, we make use of the standard PyTorch optimizers from torch.optim, and all trainable parameters of the model should be of type torch.nn.Parameter. The variational parameters are predefined as part of the VariationalGP model.
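As a quick inspection (not part of the original tutorial), you can list the registered parameters with the standard PyTorch API; the variational parameters show up alongside the mean and kernel hyperparameters:

# Print every trainable parameter the model registers and its shape
for name, param in model.named_parameters():
    print(name, tuple(param.shape))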

In most cases, the boilerplate code below will work well. It has the same basic components as the standard PyTorch training loop:

  1. Zero all parameter gradients
  2. Call the model and compute the loss
  3. Call backward on the loss to fill in gradients
  4. Take a step on the optimizer

However, defining custom training loops allows for greater flexibility. For example, it is possible to learn the variational parameters and kernel hyperparameters with different learning rates.
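For instance, the different learning rates mentioned above can be set up with ordinary PyTorch parameter groups. The sketch below is an illustration under assumptions, not part of the original tutorial: the name-based filter is a heuristic (check the parameter names printed earlier), and the learning rates are arbitrary.

# Hedged sketch: split variational parameters from mean/kernel hyperparameters
# by filtering parameter names, then assign each group its own learning rate.
variational_params = [p for n, p in model.named_parameters() if 'variational' in n]
other_params = [p for n, p in model.named_parameters() if 'variational' not in n]

two_lr_optimizer = torch.optim.Adam([
    {'params': variational_params, 'lr': 0.1},  # larger steps for q(u)
    {'params': other_params, 'lr': 0.01},       # smaller steps for hyperparameters
])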

[4]:
from gpytorch.mlls.variational_elbo import VariationalELBO

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# "Loss" for GPs - the marginal log likelihood
# num_data refers to the amount of training data
mll = VariationalELBO(likelihood, model, train_y.numel())

training_iter = 50
for i in range(training_iter):
    # Zero backpropped gradients from previous iteration
    optimizer.zero_grad()
    # Get predictive output
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iter, loss.item()))
    optimizer.step()
Iter 1/50 - Loss: 0.905
Iter 2/50 - Loss: 5.393
Iter 3/50 - Loss: 8.450
Iter 4/50 - Loss: 3.792
Iter 5/50 - Loss: 6.634
Iter 6/50 - Loss: 7.030
Iter 7/50 - Loss: 6.352
Iter 8/50 - Loss: 4.972
Iter 9/50 - Loss: 3.931
Iter 10/50 - Loss: 3.196
Iter 11/50 - Loss: 2.832
Iter 12/50 - Loss: 2.654
Iter 13/50 - Loss: 2.448
Iter 14/50 - Loss: 2.242
Iter 15/50 - Loss: 1.954
Iter 16/50 - Loss: 1.772
Iter 17/50 - Loss: 1.581
Iter 18/50 - Loss: 1.569
Iter 19/50 - Loss: 1.478
Iter 20/50 - Loss: 1.451
Iter 21/50 - Loss: 1.443
Iter 22/50 - Loss: 1.443
Iter 23/50 - Loss: 1.443
Iter 24/50 - Loss: 1.435
Iter 25/50 - Loss: 1.420
Iter 26/50 - Loss: 1.398
Iter 27/50 - Loss: 1.372
Iter 28/50 - Loss: 1.344
Iter 29/50 - Loss: 1.313
Iter 30/50 - Loss: 1.280
Iter 31/50 - Loss: 1.246
Iter 32/50 - Loss: 1.213
Iter 33/50 - Loss: 1.182
Iter 34/50 - Loss: 1.153
Iter 35/50 - Loss: 1.126
Iter 36/50 - Loss: 1.103
Iter 37/50 - Loss: 1.085
Iter 38/50 - Loss: 1.069
Iter 39/50 - Loss: 1.055
Iter 40/50 - Loss: 1.041
Iter 41/50 - Loss: 1.025
Iter 42/50 - Loss: 1.010
Iter 43/50 - Loss: 0.993
Iter 44/50 - Loss: 0.976
Iter 45/50 - Loss: 0.958
Iter 46/50 - Loss: 0.943
Iter 47/50 - Loss: 0.928
Iter 48/50 - Loss: 0.915
Iter 49/50 - Loss: 0.903
Iter 50/50 - Loss: 0.892

Make predictions with the model

In the next cell, we make predictions with the model. To do this, we simply put the model and likelihood in eval mode, and call both modules on the test data.

In .eval() mode, when we call model() we get the GP’s latent posterior predictions. These will be MultivariateNormal distributions. Since we are performing binary classification, we want to transform these outputs to classification probabilities using our likelihood.

When we call likelihood(model()), we get a torch.distributions.Bernoulli distribution, which represents our posterior probability that the data points belong to the positive class.

# Latent GP posterior at the test inputs
f_preds = model(test_x)
# Bernoulli predictive distribution after applying the likelihood
y_preds = likelihood(model(test_x))

f_mean = f_preds.mean
f_samples = f_preds.sample(sample_shape=torch.Size((1000,)))
[5]:
# Go into eval mode
model.eval()
likelihood.eval()

with torch.no_grad():
    # Test points: 101 regularly spaced points on [0, 1] (spacing 0.01)
    test_x = torch.linspace(0, 1, 101)
    # Get classification predictions
    observed_pred = likelihood(model(test_x))

    # Initialize fig and axes for plot
    f, ax = plt.subplots(1, 1, figsize=(4, 3))
    ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
    # Get the predicted labels (probabilities of belonging to the positive class)
    # Transform these probabilities to be 0/1 labels
    pred_labels = observed_pred.mean.ge(0.5).float()
    ax.plot(test_x.numpy(), pred_labels.numpy(), 'b')
    ax.set_ylim([-1, 2])
    ax.legend(['Observed Data', 'Predicted Labels'])
[Output figure: observed training data (black stars) and predicted 0/1 labels (blue) over the test grid]
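As a hypothetical follow-up (not part of the original notebook), you can score the hard 0/1 predictions against the true square-wave labels evaluated on the same test grid:

# Compare hard predictions with the true square-wave labels on the test grid
with torch.no_grad():
    true_labels = torch.sign(torch.cos(test_x * (4 * math.pi))).add(1).div(2)
    accuracy = (pred_labels == true_labels).float().mean().item()
    print('Accuracy on the test grid: %.1f%%' % (100 * accuracy))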