GPyTorch Regression Tutorial

Introduction

In this notebook, we demonstrate many of the design features of GPyTorch using the simplest example, training an RBF kernel Gaussian process on a simple function. We’ll be modeling the function

\begin{align*} y &= \sin(2\pi x) + \epsilon \\ \epsilon &\sim \mathcal{N}(0, 0.04) \end{align*}

with 100 training examples, and testing on 51 test examples.

Note: this notebook is not necessarily intended to teach the mathematical background of Gaussian processes, but rather how to train a simple one and make predictions in GPyTorch. For a mathematical treatment, Chapter 2 of Gaussian Processes for Machine Learning provides a very thorough introduction to GP regression (this entire text is highly recommended): http://www.gaussianprocess.org/gpml/chapters/RW2.pdf

In [1]:
import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline
%load_ext autoreload
%autoreload 2

Set up training data

In the next cell, we set up the training data for this example. We’ll be using 100 regularly spaced points on [0,1]; we evaluate the true function at these points and add Gaussian noise to get the training labels.

In [2]:
# Training data is 100 points in [0,1] inclusive regularly spaced
train_x = torch.linspace(0, 1, 100)
# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2

Setting up the model

The next cell demonstrates the most critical features of a user-defined Gaussian process model in GPyTorch. Building a GP model in GPyTorch differs from many existing GP packages in a number of ways.

First, in contrast to many existing GP packages, we do not provide full GP models for the user. Rather, we provide the tools necessary to quickly construct one. This is because we believe that, analogous to building a neural network in standard PyTorch, it is important to have the flexibility to include whatever components are necessary. As can be seen in more complicated examples, such as the CIFAR10 Deep Kernel Learning example, which combines deep learning and Gaussian processes, this gives the user great flexibility in designing custom models.

For most GP regression models, you will need to construct the following GPyTorch objects:

  1. A GP Model (gpytorch.models.ExactGP) - This handles most of the inference.
  2. A Likelihood (gpytorch.likelihoods.GaussianLikelihood) - This is the most common likelihood used for GP regression.
  3. A Mean - This defines the prior mean of the GP.
  • If you don’t know which mean to use, a gpytorch.means.ConstantMean() is a good place to start.
  4. A Kernel - This defines the prior covariance of the GP.
  • If you don’t know which kernel to use, a gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel()) is a good place to start.
  5. A MultivariateNormal Distribution (gpytorch.distributions.MultivariateNormal) - This is the object used to represent multivariate normal distributions.

The components of a user-built (Exact, i.e. non-variational) GP model in GPyTorch are, broadly speaking:

  1. An __init__ method that takes the training data and a likelihood, and constructs whatever objects are necessary for the model’s forward method. This will most commonly include things like a mean module and a kernel module.
  2. A forward method that takes in some \(n \times d\) data x and returns a MultivariateNormal with the prior mean and covariance evaluated at x. In other words, we return the vector \(\mu(x)\) and the \(n \times n\) matrix \(K_{xx}\) representing the prior mean and covariance matrix of the GP.

This specification leaves a large amount of flexibility when defining a model. For example, to compose two kernels via addition, you can either add the kernel modules directly:

self.covar_module = ScaleKernel(RBFKernel() + WhiteNoiseKernel())

Or you can add the kernel outputs in the forward method:

covar_x = self.rbf_kernel_module(x) + self.white_noise_module(x)
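
For instance, a minimal sketch of the second approach is shown below; the class name and the RBF + periodic kernel combination are illustrative choices, not part of this tutorial.

# Illustrative sketch only: compose kernels by adding their outputs in forward
class CompositeKernelGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(CompositeKernelGP, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.rbf_kernel_module = gpytorch.kernels.RBFKernel()
        self.periodic_kernel_module = gpytorch.kernels.PeriodicKernel()

    def forward(self, x):
        mean_x = self.mean_module(x)
        # The sum of two kernel evaluations is itself a valid covariance
        covar_x = self.rbf_kernel_module(x) + self.periodic_kernel_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
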
In [3]:
# We will use the simplest form of GP model, exact inference
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

Model modes

Like most PyTorch modules, the ExactGP has a .train() and .eval() mode.

  • .train() mode is for optimizing model hyperparameters.
  • .eval() mode is for computing predictions through the model posterior.
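
As a minimal illustration of what the two modes return (a sketch, not part of the tutorial flow; the variable names are just for demonstration):

# Illustrative sketch only; the tutorial trains the model before switching to eval mode
model.train()
prior_output = model(train_x)  # prior MultivariateNormal, as used by the marginal log likelihood
model.eval()
with torch.no_grad():
    posterior_output = model(torch.linspace(0, 1, 51))  # posterior MultivariateNormal at test points
model.train()  # switch back before running the training loop below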

Training the model

In the next cell, we use Type-II MLE (maximization of the marginal log likelihood) to train the hyperparameters of the Gaussian process.
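
Concretely, writing \(\sigma^2\) for the noise variance of the GaussianLikelihood, Type-II MLE maximizes the log marginal likelihood of the training targets

\begin{align*} \log p(\mathbf{y} \mid X) = -\frac{1}{2}(\mathbf{y} - \mu(X))^\top (K_{xx} + \sigma^2 I)^{-1} (\mathbf{y} - \mu(X)) - \frac{1}{2}\log\det(K_{xx} + \sigma^2 I) - \frac{n}{2}\log 2\pi \end{align*}

with respect to the mean, kernel, and likelihood hyperparameters; the loss in the cell below is (proportional to) the negative of this quantity.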

The most obvious difference here compared to many other GP implementations is that, as in standard PyTorch, the core training loop is written by the user. In GPyTorch, we make use of the standard PyTorch optimizers from torch.optim, and all trainable parameters of the model should be of type torch.nn.Parameter. Because GP models directly extend torch.nn.Module, calls to methods like model.parameters() or model.named_parameters() function as you might expect coming from PyTorch.
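
For example, a quick way to inspect the trainable hyperparameters is the sketch below (the exact parameter names printed depend on the GPyTorch version):

# List the trainable hyperparameters registered on the model (includes the likelihood's)
for name, param in model.named_parameters():
    print(name, param.shape)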

In most cases, the boilerplate code below will work well. It has the same basic components as the standard PyTorch training loop:

  1. Zero all parameter gradients
  2. Call the model and compute the loss
  3. Call backward on the loss to fill in gradients
  4. Take a step on the optimizer

However, defining custom training loops allows for greater flexibility. For example, it is easy to save the parameters at each step of training, or to use different learning rates for different parameters (which may be useful in deep kernel learning, for instance).
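
A minimal sketch of the latter, using standard PyTorch optimizer parameter groups (the grouping and learning rates here are arbitrary, chosen only for illustration):

# Hypothetical parameter groups with different learning rates
optimizer = torch.optim.Adam([
    {'params': model.mean_module.parameters(), 'lr': 0.1},
    {'params': model.covar_module.parameters(), 'lr': 0.1},
    {'params': model.likelihood.parameters(), 'lr': 0.01},
])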

In [4]:
# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

training_iter = 50
for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   log_lengthscale: %.3f   log_noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.log_lengthscale.item(),
        model.likelihood.log_noise.item()
    ))
    optimizer.step()
Iter 1/50 - Loss: 1.084   log_lengthscale: 0.000   log_noise: 0.000
Iter 2/50 - Loss: 1.043   log_lengthscale: -0.100   log_noise: -0.100
Iter 3/50 - Loss: 1.004   log_lengthscale: -0.196   log_noise: -0.200
Iter 4/50 - Loss: 0.964   log_lengthscale: -0.293   log_noise: -0.300
Iter 5/50 - Loss: 0.922   log_lengthscale: -0.387   log_noise: -0.399
Iter 6/50 - Loss: 0.877   log_lengthscale: -0.479   log_noise: -0.499
Iter 7/50 - Loss: 0.825   log_lengthscale: -0.572   log_noise: -0.598
Iter 8/50 - Loss: 0.767   log_lengthscale: -0.667   log_noise: -0.698
Iter 9/50 - Loss: 0.705   log_lengthscale: -0.762   log_noise: -0.799
Iter 10/50 - Loss: 0.644   log_lengthscale: -0.860   log_noise: -0.899
Iter 11/50 - Loss: 0.590   log_lengthscale: -0.960   log_noise: -1.001
Iter 12/50 - Loss: 0.543   log_lengthscale: -1.058   log_noise: -1.102
Iter 13/50 - Loss: 0.502   log_lengthscale: -1.150   log_noise: -1.204
Iter 14/50 - Loss: 0.462   log_lengthscale: -1.234   log_noise: -1.306
Iter 15/50 - Loss: 0.426   log_lengthscale: -1.303   log_noise: -1.408
Iter 16/50 - Loss: 0.389   log_lengthscale: -1.360   log_noise: -1.509
Iter 17/50 - Loss: 0.360   log_lengthscale: -1.404   log_noise: -1.611
Iter 18/50 - Loss: 0.321   log_lengthscale: -1.432   log_noise: -1.712
Iter 19/50 - Loss: 0.280   log_lengthscale: -1.454   log_noise: -1.812
Iter 20/50 - Loss: 0.250   log_lengthscale: -1.465   log_noise: -1.911
Iter 21/50 - Loss: 0.227   log_lengthscale: -1.469   log_noise: -2.010
Iter 22/50 - Loss: 0.188   log_lengthscale: -1.461   log_noise: -2.108
Iter 23/50 - Loss: 0.158   log_lengthscale: -1.442   log_noise: -2.204
Iter 24/50 - Loss: 0.125   log_lengthscale: -1.411   log_noise: -2.300
Iter 25/50 - Loss: 0.095   log_lengthscale: -1.377   log_noise: -2.393
Iter 26/50 - Loss: 0.070   log_lengthscale: -1.340   log_noise: -2.485
Iter 27/50 - Loss: 0.050   log_lengthscale: -1.298   log_noise: -2.574
Iter 28/50 - Loss: 0.032   log_lengthscale: -1.256   log_noise: -2.662
Iter 29/50 - Loss: 0.014   log_lengthscale: -1.218   log_noise: -2.746
Iter 30/50 - Loss: 0.003   log_lengthscale: -1.182   log_noise: -2.828
Iter 31/50 - Loss: -0.001   log_lengthscale: -1.148   log_noise: -2.906
Iter 32/50 - Loss: -0.008   log_lengthscale: -1.121   log_noise: -2.980
Iter 33/50 - Loss: -0.012   log_lengthscale: -1.102   log_noise: -3.049
Iter 34/50 - Loss: -0.011   log_lengthscale: -1.103   log_noise: -3.114
Iter 35/50 - Loss: -0.014   log_lengthscale: -1.114   log_noise: -3.174
Iter 36/50 - Loss: -0.014   log_lengthscale: -1.138   log_noise: -3.228
Iter 37/50 - Loss: -0.010   log_lengthscale: -1.169   log_noise: -3.275
Iter 38/50 - Loss: -0.011   log_lengthscale: -1.204   log_noise: -3.317
Iter 39/50 - Loss: -0.008   log_lengthscale: -1.239   log_noise: -3.352
Iter 40/50 - Loss: -0.001   log_lengthscale: -1.270   log_noise: -3.380
Iter 41/50 - Loss: -0.005   log_lengthscale: -1.296   log_noise: -3.401
Iter 42/50 - Loss: 0.008   log_lengthscale: -1.317   log_noise: -3.415
Iter 43/50 - Loss: 0.001   log_lengthscale: -1.331   log_noise: -3.422
Iter 44/50 - Loss: 0.009   log_lengthscale: -1.343   log_noise: -3.423
Iter 45/50 - Loss: 0.001   log_lengthscale: -1.350   log_noise: -3.419
Iter 46/50 - Loss: -0.001   log_lengthscale: -1.360   log_noise: -3.410
Iter 47/50 - Loss: 0.007   log_lengthscale: -1.374   log_noise: -3.397
Iter 48/50 - Loss: 0.000   log_lengthscale: -1.388   log_noise: -3.380
Iter 49/50 - Loss: -0.010   log_lengthscale: -1.396   log_noise: -3.359
Iter 50/50 - Loss: -0.008   log_lengthscale: -1.404   log_noise: -3.337

Make predictions with the model

In the next cell, we make predictions with the model. To do this, we simply put the model and likelihood in eval mode, and call both modules on the test data.

Just as a user defined GP model returns a MultivariateNormal containing the prior mean and covariance from forward, a trained GP model in eval mode returns a MultivariateNormal containing the posterior mean and covariance. Thus, getting the predictive mean and variance, and then sampling functions from the GP at the given test points could be accomplished with calls like:

f_preds = model(test_x)
y_preds = likelihood(model(test_x))

f_mean = f_preds.mean
f_var = f_preds.variance
f_covar = f_preds.covariance_matrix
f_samples = f_preds.sample(sample_shape=torch.Size([1000]))
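
For reference, the predictive distribution returned in eval mode is the standard GP posterior (see Chapter 2 of the text linked in the introduction). Writing \(X_*\) for the test inputs, \(K_{*x}\) for the cross-covariance between test and training inputs, \(K_{**}\) for the test covariance, and \(\sigma^2\) for the noise variance:

\begin{align*} \mathbb{E}[\mathbf{f}_*] &= \mu(X_*) + K_{*x}(K_{xx} + \sigma^2 I)^{-1}(\mathbf{y} - \mu(X)) \\ \mathrm{cov}(\mathbf{f}_*) &= K_{**} - K_{*x}(K_{xx} + \sigma^2 I)^{-1}K_{x*} \end{align*}

Calling the likelihood on model(test_x) (y_preds above) adds the observation noise \(\sigma^2 I\) to this covariance.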

The gpytorch.fast_pred_var context is not needed, but here we are previewing one of our cool features: computing faster predictive distributions using LOVE.

In [5]:
# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

# Test points are regularly spaced along [0,1]
# Make predictions by feeding model through likelihood
with torch.no_grad(), gpytorch.fast_pred_var():
    test_x = torch.linspace(0, 1, 51)
    observed_pred = likelihood(model(test_x))
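
Before plotting, it can be useful to sanity-check the predictive distribution numerically; a minimal sketch using the observed_pred object from the cell above:

with torch.no_grad():
    # Posterior predictive mean and standard deviation at the first few test points
    print(observed_pred.mean[:5])
    print(observed_pred.variance.sqrt()[:5])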

Plot the model fit

In the next cell, we plot the mean and confidence region of the Gaussian process model. The confidence_region method is a helper method that returns 2 standard deviations above and below the mean.
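
Equivalently (a sketch, not part of the tutorial code), the same bounds can be computed by hand from the predictive mean and variance, since the region is two standard deviations above and below the mean:

# Manually compute the same bounds as confidence_region()
with torch.no_grad():
    stddev = observed_pred.variance.sqrt()
    lower_manual = observed_pred.mean - 2 * stddev
    upper_manual = observed_pred.mean + 2 * stddev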

In [6]:
with torch.no_grad():
    # Initialize plot
    f, ax = plt.subplots(1, 1, figsize=(4, 3))

    # Get upper and lower confidence bounds
    lower, upper = observed_pred.confidence_region()
    # Plot training data as black stars
    ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
    # Plot predictive means as blue line
    ax.plot(test_x.numpy(), observed_pred.mean.numpy(), 'b')
    # Shade between the lower and upper confidence bounds
    ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
    ax.set_ylim([-3, 3])
    ax.legend(['Observed Data', 'Mean', 'Confidence'])
[Output figure: training data plotted as black stars, predictive mean as a blue line, and the shaded confidence region over [0, 1]]