Exact GP Regression with Multiple GPUs and Kernel Partitioning

In this notebook, we’ll demonstrate training exact GPs on very large datapoints using two key features:

  1. The ability to distribute the kernel matrix across multiple GPUs, thereby using additional parallelism.
  2. Partitioning the kernel in to chunks that are computed on-the-fly when performing each MVM so as to reduce memory usage enough.

We’ll be using the protein dataset, which will have about 37000 training examples. Note that although the techniques in this notebook can be applied to much larger datasets than this, the training time required will depend dramatically on the computational resources you have available: both the number of GPUs available and the amount of memory they have (which determines the partition size) have a dramatic effect on performance.

import math
import torch
import gpytorch
import sys
from matplotlib import pyplot as plt
from LBFGS import FullBatchLBFGS

%matplotlib inline
%load_ext autoreload
%autoreload 2

Downloading Data

We will be using the Protein UCI dataset which contains a total of 40000+ data points. The next cell will download this dataset from a Google drive and load it.

import os
import urllib.request
from scipy.io import loadmat
dataset = 'protein'
if not os.path.isfile(f'{dataset}.mat'):
    print(f'Downloading \'{dataset}\' UCI dataset...')

data = torch.Tensor(loadmat(f'{dataset}.mat')['data'])

Normalization and train/test Splits

In the next cell, we split the data 80/20 as train and test, and do some basic z-score feature normalization.

import numpy as np

N = data.shape[0]
# make train/val/test
n_train = int(0.8 * N)
train_x, train_y = data[:n_train, :-1], data[:n_train, -1]
test_x, test_y = data[n_train:, :-1], data[n_train:, -1]

# normalize features
mean = train_x.mean(dim=-2, keepdim=True)
std = train_x.std(dim=-2, keepdim=True) + 1e-6 # prevent dividing by 0
train_x = (train_x - mean) / std
test_x = (test_x - mean) / std

# normalize labels
mean, std = train_y.mean(),train_y.std()
train_y = (train_y - mean) / std
test_y = (test_y - mean) / std

# make continguous
train_x, train_y = train_x.contiguous(), train_y.contiguous()
test_x, test_y = test_x.contiguous(), test_y.contiguous()

output_device = torch.device('cuda:0')

train_x, train_y = train_x.to(output_device), train_y.to(output_device)
test_x, test_y = test_x.to(output_device), test_y.to(output_device)

How many GPUs do you want to use?

In the next cell, specify the n_devices variable to be the number of GPUs you’d like to use. By default, we will use all devices available to us.

n_devices = torch.cuda.device_count()
print('Planning to run on {} GPUs.'.format(n_devices))
Planning to run on 2 GPUs.

GP Model + Training Code

In the next cell we define our GP model and training code. For this notebook, the only thing different from the Simple GP tutorials is the use of the MultiDeviceKernel to wrap the base covariance module. This allows for the use of multiple GPUs behind the scenes.

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar_module, device_ids=range(n_devices),

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

def train(train_x,
    likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
    model = ExactGPModel(train_x, train_y, likelihood, n_devices).to(output_device)

    optimizer = FullBatchLBFGS(model.parameters(), lr=0.1)
    # "Loss" for GPs - the marginal log likelihood
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

    with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \

        def closure():
            output = model(train_x)
            loss = -mll(output, train_y)
            return loss

        loss = closure()

        for i in range(n_training_iter):
            options = {'closure': closure, 'current_loss': loss, 'max_ls': 10}
            loss, _, _, _, _, _, _, fail = optimizer.step(options)

            print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
                i + 1, n_training_iter, loss.item(),

            if fail:
                print('Convergence reached!')

    print(f"Finished training on {train_x.size(0)} data points using {n_devices} GPUs.")
    return model, likelihood

Automatically determining GPU Settings

In the next cell, we automatically determine a roughly reasonable partition or checkpoint size that will allow us to train without using more memory than the GPUs available have. Not that this is a coarse estimate of the largest possible checkpoint size, and may be off by as much as a factor of 2. A smarter search here could make up to a 2x performance improvement.

import gc

def find_best_gpu_setting(train_x,
    N = train_x.size(0)

    # Find the optimum partition/checkpoint size by decreasing in powers of 2
    # Start with no partitioning (size = 0)
    settings = [0] + [int(n) for n in np.ceil(N / 2**np.arange(1, np.floor(np.log2(N))))]

    for checkpoint_size in settings:
        print('Number of devices: {} -- Kernel partition size: {}'.format(n_devices, checkpoint_size))
            # Try a full forward and backward pass with this setting to check memory usage
            _, _ = train(train_x, train_y,
                         n_devices=n_devices, output_device=output_device,
                         preconditioner_size=preconditioner_size, n_training_iter=1)

            # when successful, break out of for-loop and jump to finally block
        except RuntimeError as e:
            print('RuntimeError: {}'.format(e))
        except AttributeError as e:
            print('AttributeError: {}'.format(e))
            # handle CUDA OOM error
    return checkpoint_size

# Set a large enough preconditioner size to reduce the number of CG iterations run
preconditioner_size = 100
checkpoint_size = find_best_gpu_setting(train_x, train_y,
Number of devices: 2 -- Kernel partition size: 0
/home/jrg365/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py:26: UserWarning:
    There is an imbalance between your GPUs. You may want to exclude GPU 1 which
    has less than 75% of the memory or cores of GPU 0. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 0; 10.73 GiB total capacity; 4.99 GiB already allocated; 2.39 GiB free; 22.42 MiB cached)
Number of devices: 2 -- Kernel partition size: 18292
RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 1; 7.93 GiB total capacity; 6.23 GiB already allocated; 1.19 GiB free; 28.49 MiB cached)
Number of devices: 2 -- Kernel partition size: 9146
Iter 1/1 - Loss: 0.947   lengthscale: 0.583   noise: 0.428
Finished training on 36584 data points using 2 GPUs.


model, likelihood = train(train_x, train_y,
                          n_devices=n_devices, output_device=output_device,
Iter 1/20 - Loss: 1.030   lengthscale: 0.671   noise: 0.570
Iter 2/20 - Loss: 0.995   lengthscale: 0.606   noise: 0.476
Iter 3/20 - Loss: 0.969   lengthscale: 0.534   noise: 0.410
Iter 4/20 - Loss: 0.949   lengthscale: 0.473   noise: 0.360
Iter 5/20 - Loss: 0.935   lengthscale: 0.442   noise: 0.325
Iter 6/20 - Loss: 0.927   lengthscale: 0.418   noise: 0.302
Iter 7/20 - Loss: 0.912   lengthscale: 0.383   noise: 0.267
Iter 8/20 - Loss: 0.907   lengthscale: 0.358   noise: 0.243
Iter 9/20 - Loss: 0.904   lengthscale: 0.342   noise: 0.225
Iter 10/20 - Loss: 0.898   lengthscale: 0.326   noise: 0.212
Iter 11/20 - Loss: 0.897   lengthscale: 0.320   noise: 0.205
Iter 12/20 - Loss: 0.895   lengthscale: 0.318   noise: 0.200
Iter 13/20 - Loss: 0.895   lengthscale: 0.313   noise: 0.193
Iter 14/20 - Loss: 0.896   lengthscale: 0.313   noise: 0.193
Finished training on 36584 data points using 2 GPUs.

Testing: Computing test time caches

# Get into evaluation (predictive posterior) mode

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    latent_pred = model(test_x)

Testing: Computing predictions

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    %time latent_pred = model(test_x)

test_rmse = torch.sqrt(torch.mean(torch.pow(latent_pred.mean - test_y, 2)))
print(f"Test RMSE: {test_rmse.item()}")
CPU times: user 53 ms, sys: 12.3 ms, total: 65.3 ms
Wall time: 63 ms
Test RMSE: 0.5422027707099915
[ ]: