This is a collection of my thoughts on Katherine Bailey’s post Gaussian Processes for Dummies. I’m just learning about Gaussian processes myself, and what follows surely reflects some of my confusion about all this. In the future I’ll revisit these notes with (hopefully) a more informed perspective.
It seems like Bailey’s post outlines a method for inferring the values of a function \(f\) after learning a handful of its points. I’ll re-implement the algorithm described in the post (originally written in Python) in R, then experiment a bit to get a feel for what to expect from it.
library(rethinking) # github.com/rmcelreath/rethinking
library(corrplot)
library(matrixcalc)
We’ll take a set of points on a sine curve, \(f(x) = \sin x\), and use this as our training data. We store the x-values in xtrain and their corresponding sine values in ytrain.
xtrain <- c(-4, -3, -2, -1, 1, 2, 5)
ytrain <- sin(xtrain)
plot(xtrain, ytrain)
This training data will be used for about the first half of these notes.
Starting with a set of points (xtrain, for example), we can impose a set of correlations on those points based on their distances from each other. For our purposes, we want nearer points to be more correlated and farther points to be less correlated. This is what we’d expect from a continuous function: nearby x-values should have nearby y-values.
We’ll let the covariance between points \(x_i\) and \(x_j\) fall off like a Gaussian curve as the distance between the points increases, as in
\[ \operatorname{Cov}(x_i,x_j) = \exp\!\left(-\rho^2 (x_i - x_j)^2\right). \]
We can tweak the parameter \(\rho^2\) (rhosq in the code below) to increase or decrease the distance over which points covary.
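To get a feel for how quickly this covariance decays, here’s a quick check (not part of Bailey’s code) of its value at a few distances, for \(\rho^2 = 1\) and \(\rho^2 = 5\):
# covariance at distances 0.25, 0.5, 1, and 2, for rhosq = 1 and rhosq = 5
sapply(c(1, 5), function(rhosq) round(exp(-rhosq * c(0.25, 0.5, 1, 2)^2), 4))
With \(\rho^2 = 5\) the covariance is essentially gone by a distance of 1, while with \(\rho^2 = 1\) it persists a bit farther.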
Here’s an R function to compute this covariance matrix for two input vectors, with a small adjustment along the diagonal to ensure positive definiteness:
covK <- function(a, b, rhosq) {
  K <- outer(a, b, function(x, y) exp(-rhosq*(x - y)^2))
  # when computing the covariance of a set of points with itself, add a small
  # jitter to the diagonal to keep the matrix positive definite
  if (identical(a, b))
    diag(K) <- diag(K) + 0.01
  return(K)
}
If we compute the Cholesky decomposition of a covariance matrix \(\mathbf K\),
\[ \mathbf K = \mathbf L \mathbf L^T, \]
we end up with a matrix \(\mathbf L\) which takes uncorrelated samples \(\mathbf x\) to correlated samples \(\mathbf L \mathbf x\) (this is matrix multiplication of \(\mathbf L\) and \(\mathbf x\)).
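Here’s a small self-contained illustration of that fact, using a 2×2 covariance matrix I chose arbitrarily. Note that R’s chol() returns the upper-triangular factor, so I transpose it to get the \(\mathbf L\) of the formula above.
K2 <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)  # an example covariance matrix
L2 <- t(chol(K2))                          # lower-triangular factor: K2 = L2 %*% t(L2)
z <- matrix(rnorm(2 * 1e5), nrow = 2)      # columns are uncorrelated standard normal samples
round(cov(t(L2 %*% z)), 2)                 # sample covariance of the correlated samples; close to K2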
Now, our goal is to estimate the function \(f\), and our prior will be that its values are normally distributed with mean 0 and variance 1 and correlated according to \(\mathbf K\). To get a few sample functions from this prior, we generate standard normal noise on the interval we’re interested in, then give it the desired correlation by multiplying it by the appropriate Cholesky factor \(\mathbf L\).
Our “test” points, which are just a grid of \(x\)-values where we want to estimate \(f(x)\), will be 100 equally spaced points ranging from -5 to 5. We compute the covariance matrix corresponding to the distances between these points using covK() and its Cholesky factor using chol(), generate 3 sets of standard normal noise at these points, multiply each by the Cholesky factor, and plot the results.
n <- 100
xtest <- seq(from = -5, to = 5, length.out = n)
rhosq <- 5 # Bailey's parameter choice
# compute correlation matrix and Cholesky factor
K_ss <- covK(xtest, xtest, rhosq)
L_ss <- chol(K_ss)
# generate three samples of standard Gaussian noise
# and correlate each one by multiplying it by L_ss
num_post_samples <- 3
uncorrelated <- matrix(rnorm(num_post_samples*n), nrow = n)
correlated <- L_ss %*% uncorrelated
# plot the three samples
plot(0, xlim = c(-5,5), ylim = c(-3,3), type = "n", xlab = "x", ylab = "y")
col = c("black", "red", "blue")
for (i in 1:num_post_samples)
lines(xtest, correlated[,i], col = col[i])
mtext("Three samples from the prior")
Some of the variable names here have been chosen to match the ones in Bailey’s code.
In the above plot, notice that the resulting curves are no longer pure noise. Their values now covary according to a decaying exponential of their squared distances: if one x-value has a high y-value, nearby x-values will also have high y-values, while x-values farther away don’t feel the same pull upward. With \(\rho^2 = 5\), x-values more than about 1 unit apart are essentially uncorrelated.
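To see the falloff concretely, we can plot one row of K_ss, which gives the covariance between one grid point (the 50th, near \(x = 0\)) and every other grid point:
plot(xtest, K_ss[50, ], type = "l", xlab = "x", ylab = "covariance with xtest[50]")
abline(v = xtest[50] + c(-1, 1), lty = "dotted")  # one unit to either side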
Determining the posterior mean at each value of xtest (in other words, our guess for the value of \(f\) at each test point) and the posterior correlator matrix can be done using matrix algebra. I haven’t yet looked at the reference given by Bailey (Murphy’s Machine Learning: A Probabilistic Perspective) for this calculation, so I can’t say I understand it yet. But I have done my best to accurately implement it in R, based on the Python code given by Bailey.
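For reference, and without any derivation here, the standard result for noise-free GP regression (this is the calculation the Murphy reference covers) is that the posterior over the test points is Gaussian with mean and covariance
\[ \boldsymbol{\mu}_* = \mathbf K_*^{T} \mathbf K^{-1} \mathbf y, \qquad \boldsymbol{\Sigma}_* = \mathbf K_{**} - \mathbf K_*^{T} \mathbf K^{-1} \mathbf K_*, \]
where \(\mathbf K\) is the covariance among the training points, \(\mathbf K_*\) the covariance between training and test points, and \(\mathbf K_{**}\) the covariance among the test points. Bailey’s Cholesky-based steps are a numerically stable way of getting at these quantities without forming \(\mathbf K^{-1}\) explicitly.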
I wrote a function gpfit to calculate the posterior mean, the posterior variance, and the posterior correlator matrix, given a set of training x-values xtrain and their corresponding y-values ytrain, a set of test x-values xtest, and a fit parameter rhosq. The function also returns the smallest eigenvalue of a matrix from an intermediate step of the calculation; this will be used later to differentiate between different possible values of the parameter \(\rho^2\).
# this function requires covK() defined above
gpfit <- function(xtrain, ytrain, xtest, rhosq) {
  # compute the posterior mean
  K <- covK(xtrain, xtrain, rhosq)
  L <- chol(K)                # note: R's chol() returns the upper-triangular factor
  K_s <- covK(xtrain, xtest, rhosq)
  Lk <- solve(L, K_s)
  X <- solve(L, ytrain)
  mu <- t(Lk) %*% X
  # compute the posterior variance
  K_ss <- covK(xtest, xtest, rhosq)
  variance <- diag(K_ss) - colSums(Lk^2)
  # compute the posterior correlator matrix, if it's well-defined
  A <- K_ss - t(Lk) %*% Lk
  mineigen <- min(eigen(A, symmetric = TRUE, only.values = TRUE)$values)
  if (mineigen > 0)           # is A positive definite?
    L_post <- chol(A)
  else
    L_post <- NULL
  return(list(mean = mu, variance = variance, L_post = L_post, mineigen = mineigen))
}
In the following code we use gpfit() to compute the Gaussian process fit, then plot the posterior mean fit curve and shade the 95% posterior density interval for the mean (as determined by the variance returned by gpfit()).
fit <- gpfit(xtrain, ytrain, xtest, rhosq)
PI <- matrix(nrow = 2, ncol = n)
PI[1,] <- fit$mean - 2*sqrt(fit$variance)
PI[2,] <- fit$mean + 2*sqrt(fit$variance)
plot(0, xlim = c(-5, 5), ylim = c(-3, 3), type = "n", xlab = "x", ylab = "y")
shade(PI, xtest)
lines(xtest, fit$mean, col = "red", lwd = 3, lty = "dashed")
points(xtrain, ytrain, pch = 16, col = rangi2, cex = 1.5)
As gpfit() also returns the posterior correlator matrix \(\mathbf L\), we can sample a fit curve from the posterior distribution of curves by generating standard Gaussian noise \(\mathbf x\) and then transforming it via \(\mathbf x_{\text{post}} = \boldsymbol{\mu} + \mathbf L \mathbf x\). In the following code we sample three curves from the posterior and include them in the previous plot.
num_post_samples <- 3
uncorrelated <- matrix(rnorm(num_post_samples*n), nrow = n)
correlated <- fit$L_post %*% uncorrelated
plot(0, xlim = c(-5, 5), ylim = c(-3, 3), type = "n", xlab = "x", ylab = "y")
shade(PI, xtest)
col = c("gray2", "red", rangi2)
for (i in 1:num_post_samples)
lines(xtest, fit$mean + correlated[,i], col = col[i]) # mu + Lx
lines(xtest, fit$mean, col = "red", lwd = 3, lty = "dashed")
points(xtrain, ytrain, pch = 16, col = rangi2, cex = 1.5)
mtext("Three samples from the posterior")
Let’s try to get a better picture of the posterior through sampling. We’ll sample 100,000 curves from the posterior, then plot the first 150 of them behind the posterior mean curve.
num_post_samples <- 1e5
uncorrelated <- matrix(rnorm(num_post_samples*n), nrow = n)
correlated <- fit$L_post %*% uncorrelated
plot(0, xlim = c(-5, 5), ylim = c(-3, 3), type = "n", xlab = "x", ylab = "y")
for (i in 1:150)
lines(xtest, fit$mean + correlated[,i], col = col.alpha("black", 0.15)) # mu + Lx
lines(xtest, fit$mean, col = "red", lwd = 3, lty = "dashed")
points(xtrain, ytrain, pch = 16, col = rangi2, cex = 1.5)
mtext("150 samples from the posterior")
These curves seem to spread out differently than the region that was shaded according to the variance above. Using all 100,000 sampled curves, we’ll compute and plot the empirical 95% intervals they imply.
posterior <- apply(correlated, 2, function(x) fit$mean + x) # mu + Lx
mu.HPDI <- apply(posterior, 1, HPDI, prob = 0.95)
plot(0, xlim = c(-5, 5), ylim = c(-3, 3), type = "n", xlab = "x", ylab = "y")
shade(mu.HPDI, xtest)
lines(xtest, fit$mean, col = "red", lwd = 3, lty = "dashed")
points(xtrain, ytrain, pch = 16, col = rangi2, cex = 1.5)
mtext(paste("95% posterior density interval (", num_post_samples, " samples)", sep = ""))
While there is some similarity to the region defined by the variance, this region isn’t nearly as regular as the earlier one. And this isn’t due to random sampling; we’ve taken enough samples to get a good approximation. You can try running the code above several times (or taking many more samples, if your computer can handle it) to be sure.
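One quick way to quantify the difference is to compare, at each test point, the width of the variance-based interval computed earlier (PI) with the width of the empirical HPDI:
width_var <- PI[2, ] - PI[1, ]             # widths of the intervals from the variance
width_hpdi <- mu.HPDI[2, ] - mu.HPDI[1, ]  # widths of the empirical 95% HPDIs
summary(width_hpdi - width_var)            # how the two widths differ across xtest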
In the above calculations we took \(\rho^2 = 5\) to match the Python code written by Bailey. This parameter controls how “smooth” the interpolating curve is, with a smaller \(\rho^2\) yielding a smoother curve. Try decreasing it to 4 and re-running the code, then to 3 and running it again. But decrease it too far (try \(\rho^2 = 2\)) and you’ll get an error when you try to sample curves from the posterior, even though the posterior mean curve is computed just fine. That’s odd.
The above code doesn’t really illustrate it well, but if you decrease it even further, say to \(\rho^2 = 1.5\), then the posterior “variance” computed by gpfit() becomes negative and the posterior mean fails to interpolate the training points (but maybe that’s a good thing; we’ll explore this in section 2). Strange behavior indeed.
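We can check both symptoms directly with gpfit(), using the values of \(\rho^2\) mentioned above:
# at rhosq = 2 the posterior correlator matrix doesn't exist, so sampling would fail
is.null(gpfit(xtrain, ytrain, xtest, 2)$L_post)
# at rhosq = 1.5 the smallest posterior "variance" should come out negative
min(gpfit(xtrain, ytrain, xtest, 1.5)$variance)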
I suspect the reasons for this are buried in assumptions behind the specific steps in the matrix algebra. It’s possible that the steps could be altered to still give results in these cases, but I haven’t looked at it yet.
Using the variance and mineigen values returned by gpfit(), we can determine the minimum values of the parameter \(\rho^2\) which yield good behavior when sampling from the posterior and when computing the posterior mean curve.
The function minrhosq() below uses a root-finding algorithm to determine the smallest admissible value of \(\rho^2\) for a given training set and test set. We pass it these training and test sets, along with an interval which we know contains the desired value of \(\rho^2\).
By default, minrhosq() finds the smallest value of rhosq with which we can still sample from the posterior using \(\mathbf L\). If we look closely at how the matrix \(\mathbf L\) is computed, we see that it won’t exist when mineigen is negative (because then the matrix used to construct \(\mathbf L\) isn’t positive definite), and it will exist when mineigen is positive. We find the desired value of rhosq by finding where mineigen transitions from positive to negative.
minrhosq <- function(xtrain, ytrain, xtest, interval, forinterp = FALSE) {
  if (!forinterp) {
    # find the smallest value of rhosq which gives positive mineigen
    root <- uniroot(
      function(x) gpfit(xtrain, ytrain, xtest, x)$mineigen,
      interval
    )
  } else {
    # find the smallest value of rhosq which gives positive variance
    root <- uniroot(
      function(x) min(gpfit(xtrain, ytrain, xtest, x)$variance),
      interval
    )
  }
  # $f.root is the value of the function at $root
  # -- if $f.root is negative, return $root + (root precision) as rhosq instead
  if (root$f.root > 0)
    return(root$root)
  else
    return(root$root + root$estim.prec)
}
We know that the minimal \(\rho^2\) for which \(\mathbf L\) exists is somewhere between 2 and 3. Note that mineigen is negative when rhosq is 2 and positive when it’s 3:
gpfit(xtrain, ytrain, xtest, 2)$mineigen
[1] -0.02939852
gpfit(xtrain, ytrain, xtest, 3)$mineigen
[1] 0.01
In the following code we use minrhosq() to find this value of rhosq between 2 and 3, then plug it into gpfit() and plot the posterior mean curve and the 95% posterior density interval.
rhosq <- minrhosq(xtrain, ytrain, xtest, c(2,3))
fit <- gpfit(xtrain, ytrain, xtest, rhosq)
num_post_samples <- 1e5
uncorrelated <- matrix(rnorm(num_post_samples*n), nrow = n)
correlated <- fit$L_post %*% uncorrelated
posterior <- apply(correlated, 2, function(x) fit$mean + x)
mu.HPDI <- apply(posterior, 1, HPDI, prob = 0.95)
plot(0, xlim = c(-5, 5), ylim = c(-3, 3), type = "n", xlab = "x", ylab = "y")
shade(mu.HPDI, xtest)
lines(xtest, fit$mean, col = "red", lwd = 3, lty = "dashed")
points(xtrain, ytrain, pch = 16, col = rangi2, cex = 1.5)
mtext(paste("95% posterior density interval (", num_post_samples, " samples)", sep = ""))
We’re getting some interesting artifacts from the posterior samples. I think these are occurring because we’re right on the border of the posterior correlator \(\mathbf L\) breaking down. If we increase rhosq slightly, these should go away:
fit <- gpfit(xtrain, ytrain, xtest, rhosq + 0.1) # slightly increase rhosq
correlated <- fit$L_post %*% uncorrelated
posterior <- apply(correlated, 2, function(x) fit$mean + x)
mu.HPDI <- apply(posterior, 1, HPDI, prob = 0.95)
plot(0, xlim = c(-5, 5), ylim = c(-3, 3), type = "n", xlab = "x", ylab = "y")
shade(mu.HPDI, xtest)
lines(xtest, fit$mean, col = "red", lwd = 3, lty = "dashed")
points(xtrain, ytrain, pch = 16, col = rangi2, cex = 1.5)
mtext(paste("95% posterior density interval (", num_post_samples, " samples)", sep = ""))
If we decrease rhosq slightly below the value obtained in the last section, we can no longer sample from the posterior. However, we can still compute the posterior mean curve and estimate the posterior density interval using the variance returned by gpfit().
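A quick check of that claim, nudging rhosq just below the threshold found above (the 0.1 offset is arbitrary):
fit_below <- gpfit(xtrain, ytrain, xtest, rhosq - 0.1)
is.null(fit_below$L_post)   # should be TRUE: no posterior correlator matrix, so no sampling
head(fit_below$mean)        # but the posterior mean is still computed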
But there is another minimum now, lower than the previous one, beyond which the posterior mean curve no longer interpolates the training points. We can estimate this minimum by seeing which values of rhosq produce a negative variance:
min(gpfit(xtrain, ytrain, xtest, 1)$variance)
[1] -0.1584486
min(gpfit(xtrain, ytrain, xtest, 2)$variance)
[1] 0.00519122
So, we know that the minimal rhosq in this case is between 1 and 2. Now we call minrhosq(), passing it the interval (1, 2) and the switch forinterp = TRUE.
Unfortunately now we can’t sample from the posterior, because the posterior correlator matrix (at least as it is defined here, using the Cholesky decomposition) doesn’t exist. At least we can still use the variance returned by gpfit() to estimate the posterior density intervals.
rhosq <- minrhosq(xtrain, ytrain, xtest, c(1,2), forinterp = TRUE)
fit <- gpfit(xtrain, ytrain, xtest, rhosq)
PI <- matrix(nrow = 2, ncol = n)
PI[1,] <- fit$mean - 2*sqrt(fit$variance)
PI[2,] <- fit$mean + 2*sqrt(fit$variance)
plot(0, xlim = c(-5, 5), ylim = c(-3, 3), type = "n", xlab = "x", ylab = "y")
shade(PI, xtest)
lines(xtest, fit$mean, col = "red", lwd = 3, lty = "dashed")
points(xtrain, ytrain, pch = 16, col = rangi2, cex = 1.5)
This is the smoothest curve so far!
If we decrease rhosq any further, the posterior mean curve will no longer interpolate the training points. But is this really so bad? We’ll explore this next.
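For a quick preview of that failure mode, here’s the posterior mean curve at a much smaller value of \(\rho^2\) (the choice of 0.5 is arbitrary on my part); if the pattern above holds, it should no longer pass through the training points.
fit_small <- gpfit(xtrain, ytrain, xtest, 0.5)
plot(0, xlim = c(-5, 5), ylim = c(-3, 3), type = "n", xlab = "x", ylab = "y")
lines(xtest, fit_small$mean, col = "red", lwd = 3, lty = "dashed")
points(xtrain, ytrain, pch = 16, col = rangi2, cex = 1.5)
mtext("Posterior mean curve at rhosq = 0.5")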
If we wanted to, we could entertain the idea that our training points have some measurement error in them. In that case interpolating the points would be blatant overfitting of the model.
In this section we’ll explore how to determine the parameter rhosq from the training data itself. We’ll do this by creating a model which learns, from xtrain and ytrain, over what distances the y-values are correlated, according to the Gaussian kernel
\[ \operatorname{Cov}(x_i,x_j) = \exp\!\left(-\rho^2 (x_i - x_j)^2\right) \]
that we used above.
First we compute the matrix of distances between the x-values of our training data and put the training data into a data frame for convenience.
d <- data.frame(x = as.factor(xtrain), y = ytrain)
distances <- as.matrix(dist(xtrain))
colnames(distances) <- d$x
rownames(distances) <- d$x
distances
-4 -3 -2 -1 1 2 5
-4 0 1 2 3 5 6 9
-3 1 0 1 2 4 5 8
-2 2 1 0 1 3 4 7
-1 3 2 1 0 2 3 6
1 5 4 3 2 0 1 4
2 6 5 4 3 1 0 3
5 9 8 7 6 4 3 0
In our model, we’ll suppose that the measurement error on our training points is normally distributed with mean 0 and standard deviation 0.1. We’ll fit the model using map2stan from the rethinking package. A good overview of how this works can be found in McElreath’s book Statistical Rethinking: A Bayesian Course, chapter 13, section 4.
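Written out (as I understand the GPL2 distribution in rethinking), the model below says
\[ y_i \sim \operatorname{Normal}(g_{x_i},\, 0.1), \qquad \mathbf g \sim \operatorname{MVNormal}(\mathbf 0,\, \mathbf K), \qquad K_{ij} = \exp\!\left(-\rho^2 D_{ij}^2\right) + 0.01\,\delta_{ij}, \qquad \rho^2 \sim \operatorname{Gamma}(1.5,\, 1), \]
where \(D_{ij}\) is the distance matrix computed above, the GP variance \(\eta^2\) is fixed to 1, and the 0.01 term is the same diagonal adjustment used in covK().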
m1 <- map2stan(
  alist(
    y ~ dnorm(mu, 0.1),
    mu <- g[x],
    g[x] ~ GPL2(Dmat, 1, rhosq, 0.01),
    rhosq ~ dgamma(1.5, 1)
  ),
  data = list(
    x = d$x,
    y = d$y,
    Dmat = distances
  ),
  warmup = 2e3,
  iter = 1e4,
  control = list(adapt_delta = 0.95)
)
Take a look at the mean value of rhosq estimated by the model. The precis() function also returns some information about the posterior density of rhosq, including its standard deviation and 89% highest posterior density interval.
precis(m1)
To get an idea of the model’s uncertainty about rhosq, let’s take 200 samples from its posterior and plot the corresponding covariance-versus-distance curves. The curve for the posterior mean of rhosq is shown as a thick solid line.
post <- extract.samples(m1)
rhosq <- mean(post$rhosq)
# plot the posterior mean covariance function
curve(
exp(-mean(post$rhosq)*x^2),
from = 0, to = 10,
xlab = "dstance", ylab = "covariance",
ylim = c(0, 1),
lwd = 3
)
# plot 200 functions sampled from posterior
for ( i in 1:200 )
curve(
exp(-post$rhosq[i]*x^2),
col = col.alpha("black", 0.17),
add = TRUE
)
The model is definitely uncertain about the value of rhosq, but let’s see what kind of result the mean estimate gives us.
Here’s a visualization of the correlation matrix between the values of xtrain. Note that each x-value is moderately correlated with the values a distance of 1 away, and there is a minuscule correlation with values a distance of 2 away. Beyond that there is no discernible correlation.
# compute posterior mean covariance among points
K <- covK(xtrain, xtrain, rhosq)
# convert to correlation matrix
Rho <- round(cov2cor(K), 2)
# add row and column names for convenience
colnames(Rho) <- d$x
rownames(Rho) <- colnames(Rho)
corrplot(Rho, method = "circle")
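Up to the small diagonal adjustment, these correlations come straight from the kernel; plugging the fitted rhosq into it gives the covariance at distances 1, 2, and 3:
round(exp(-rhosq * c(1, 2, 3)^2), 4)  # covariance at distances 1, 2, and 3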
Perhaps a better way to see this is by drawing lines between the training points corresponding to the strength of their correlation.
plot(xtrain, ytrain, pch = 16, col = rangi2, cex = 1.5)
# overlay lines shaded by Rho
for (i in 1:length(xtrain))
  for (j in 1:length(xtrain))
    if (i < j)
      lines(
        c(xtrain[i], xtrain[j]),
        c(ytrain[i], ytrain[j]),
        lwd = 2, col = col.alpha("black", Rho[i,j]^2)
      )
So there is some expected correlation between the four points on the left, and between the two points in the middle, just based on their horizontal proximity.
Finally, we’ll plot the inferred posterior mean curve using the mean value of rhosq from the model. The “true” sine curve is drawn as a faint dashed line.
fit <- gpfit(xtrain, ytrain, xtest, rhosq)
plot(0, xlim = c(-5, 5), ylim = c(-3, 3), type = "n", xlab = "x", ylab = "y")
lines(xtest, fit$mean, col = "red", lwd = 3, lty = "dashed")
points(xtrain, ytrain, pch = 16, col = rangi2, cex = 1.5)
curve(sin(x), lty = "dashed", col = col.alpha("black", 0.5), add = TRUE)
To be honest, I’m not sure whether the behavior shown above is intended by the computations that go into creating the posterior mean fit curve. It actually works pretty well in that example, but let’s explore an example where it doesn’t do so well.
Instead of the nice, widely spaced training set we’ve been using up to this point, we’ll use a training set consisting of a random sample of 7 x-values between -5 and 5.
set.seed(5)
xtrain <- runif(7, -5, 5)
ytrain <- sin(xtrain)
plot(xtrain, ytrain)
d <- data.frame(x = as.factor(round(xtrain,2)), y = ytrain)
distances <- as.matrix(dist(xtrain))
colnames(distances) <- d$x
rownames(distances) <- d$x
Now what parameter does the model learn from the data?
m2 <- map2stan(
  m1,
  data = list(
    x = d$x,
    y = d$y,
    Dmat = distances
  ),
  warmup = 2e3,
  iter = 1e4,
  control = list(adapt_delta = 0.95)
)
precis(m2)
The mean rhosq is much larger, so the model only expects points which are very close together to correlate. As before, let’s plot a picture of the uncertainty around the model’s estimate of rhosq.
post <- extract.samples(m2)
rhosq <- mean(post$rhosq)
# plot the posterior mean covariance function
curve(
exp(-rhosq*x^2),
from = 0, to = 10,
xlab = "dstance", ylab = "covariance",
ylim = c(0, 1),
lwd = 3
)
# plot 200 functions sampled from posterior
for ( i in 1:200 )
curve(
exp(-post$rhosq[i]*x^2),
col = col.alpha("black", 0.08),
add = TRUE
)
The model is actually very certain about this value of rhosq. This is because it now has a much wider variety of distances to learn from. Here’s what the correlations between the new training points look like:
# compute posterior mean covariance among points
K <- covK(xtrain, xtrain, rhosq)
# convert to correlation matrix
Rho <- round(cov2cor(K), 2)
# add row and column names for convenience
colnames(Rho) <- d$x
rownames(Rho) <- colnames(Rho)
# plot the training data
plot(xtrain, ytrain, pch = 16, col = rangi2)
# overlay lines shaded by Rho
for (i in 1:length(xtrain))
  for (j in 1:length(xtrain))
    if (i < j)
      lines(
        c(xtrain[i], xtrain[j]),
        c(ytrain[i], ytrain[j]),
        lwd = 2, col = col.alpha("black", Rho[i,j]^2)
      )
Pretty much the only points that are correlated are the two near x=2.
But now something strange happens when we use gpfit() to compute the posterior mean fit curve for this new data.
fit <- gpfit(xtrain, ytrain, xtest, rhosq)
plot(0, xlim = c(-5, 5), ylim = c(-3, 3), type = "n", xlab = "x", ylab = "y")
lines(xtest, fit$mean, col = "red", lwd = 3, lty = "dashed")
points(xtrain, ytrain, pch = 16, col = rangi2, cex = 1.5)
curve(sin(x), lty = "dashed", col = col.alpha("black", 0.5), add = TRUE)
All of the points except the two near x=2 were interpolated, and near those points the posterior mean curve blew up.
I suspect that this kind of thing could be accounted for in the matrix algebra inside gpfit(). I’d like to revisit this example once I understand that calculation a bit better.
Of course, not all is lost. In this section we’ll show that our model behaves fairly well for evenly spaced training data, provided the x-values are far enough apart.
Here our training data will consist of 9 evenly spaced points from -6 to 7.
xtrain <- seq(from = -6, to = 7, length.out = 9)
ytrain <- sin(xtrain)
d <- data.frame(x = as.factor(round(xtrain,2)), y = ytrain)
distances <- as.matrix(dist(xtrain))
colnames(distances) <- d$x
rownames(distances) <- d$x
plot(xtrain, ytrain)
We attempt to learn the Gaussian process parameter from the data:
m3 <- map2stan(
  m1,
  data = list(
    x = d$x,
    y = d$y,
    Dmat = distances
  ),
  warmup = 2e3,
  iter = 1e4,
  control = list(adapt_delta = 0.99)
)
And plot the uncertainty about it, this time choosing the median of the posterior as our value of rhosq (just because):
post <- extract.samples(m3)
( rhosq <- median(post$rhosq) )
[1] 1.031402