---
title: "Exploring numerical normal posterior approximation"
output:
  html_notebook:
    toc: TRUE
---

$$
\DeclareMathOperator{\Normal}{Normal}
\DeclareMathOperator{\Uniform}{Uniform}
\DeclareMathOperator{\Binomial}{Binomial}
$$

The goal of this note is to give some intuition for how a normal approximation for a Bayesian (joint) posterior could be computed numerically. We'll use the normal approximation generated by the `map` function from the `rethinking` package as a baseline to compare against.

```{r}
library(rethinking)    # https://github.com/rmcelreath/rethinking
```

# 1 Estimating a binomial probability

We'll start by considering the simple problem of estimating a binomial probability $p$.

Suppose that out of 9 trials we observed 6 successes, and that our prior for $p$ is

$$
p \sim \Normal(0.25, 0.5).
$$

## 1.1 Analytical solution

The binomial likelihood of seeing 6 successes in 9 trials is $\Binomial(6 \mid 9,p)$, so by Bayes' theorem our posterior for $p$ is

$$
f(p) \propto \Binomial(6 \mid 9,p) \Normal(p \mid 0.25,0.5),
$$

and numerical integration shows that the constant of proportionality is about 1/0.0581627. Here's a plot of this posterior:

```{r}
curve(
    dbinom(6, 9, x)*dnorm(x, 0.25, 0.5)/0.0581627,
    xlim = c(0,1), xlab = "p", ylab = "posterior"
)
```

## 1.2 Posterior estimate from `map`

Let's use `rethinking::map` to get a normal approximation for the posterior of the binomial parameter.

```{r}
m1 <- map(
    alist(
        k ~ dbinom(n, p),
        p ~ dnorm(0.25, 0.5)
    ),
    data = list(n = 9, k = 6),
    start = list(p = 0.5)
)
precis(m1)
```

Here's a plot of the normal approximation from `map`, shown as a dashed curve.

```{r}
# plot the exact posterior
curve(
    dbinom(6, 9, x)*dnorm(x, 0.25, 0.5)/0.0581627,
    xlim = c(0,1), xlab = "p", ylab = "posterior"
)
# plot the normal approximation for the posterior from map()
curve(dnorm(x, coef(m1)["p"], precis(m1)@output$StdDev), lty = 2, add = TRUE)
```

## 1.3 Manually obtaining the normal approximation {#s13}

There are two parts to finding the normal approximation. First we find the maximum of the posterior (which becomes the mean of the normal), then we estimate the curvature at that point (which tells us about the standard deviation of the normal).

There are many ways to tackle the first step. I think `map` uses some kind of adaptive hill-climbing, where the algorithm takes steps toward a peak of the posterior $f(p)$ along paths of steepest ascent, adjusting its step size depending on the steepness at its current position. But we're not going to get that fancy. Instead we'll just compute the posterior on a grid, then perform a brute force search for the highest grid point.

(You might say to yourself: "If you're computing the posterior on a grid, do you really need the normal approximation?" No, we wouldn't need the normal approximation. Computing the posterior on a grid gives us much more information since it's a more accurate picture of the posterior than the normal approximation. However, our goal is to illustrate the basic concepts of the normal approximation, and using a grid is the simplest approach.)

The second step involves some light calculus.

Suppose we've already found the location $p_0$ of the maximum of the posterior. We'll use this $p_0$ as the mean of our normal approximation. Then we just need to find the standard deviation $\sigma$ which makes

$$
f(p) \approx \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-\frac{(p-p_0)^2}{2\sigma^2}\right]
$$

when $p$ is near $p_0$. Taking the logarithm of both sides, we have

$$
\log f(p) \approx -\frac{(p-p_0)^2}{2\sigma^2} + \text{const}. \tag{$*$}
$$

This is what's meant by "quadratic approximation": our goal is to approximate $\log f(p)$ near its maximum by a quadratic polynomial of the form $-(p-\mu)^2/(2\sigma^2) + \text{const}$.

Since $f(p)$ (and hence $\log f(p)$) has a maximum at $p = p_0$, the linear term drops out of the Taylor series for $\log f(p)$ centered at $p = p_0$, and the most significant terms are the constant term and the quadratic term. In other words, if we define $g(p) = \log f(p)$, then

$$
\log f(p) \approx \frac{g''(p_0)}{2} (p - p_0)^2 + \text{const}
$$

when $p$ is near $p_0$. Plugging this into $(*)$, ignoring the constant terms, and solving for $\sigma$, we see that we should choose

$$
\sigma = \sqrt{- \frac{1}{g''(p_0)}}.
$$

That is to say, for our normal approximation we should choose the standard deviation to be the square root of the negative reciprocal of the second derivative of $\log f(p)$ at its maximum.

We will also compute this second derivative from the grid of points we use to estimate $p_0$ in the first step.
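
As a quick sanity check of this recipe, here's a sketch on a case where we know the answer: if the "posterior" is exactly $\Normal(m, s)$, then $g''(p) = -1/s^2$ everywhere, so estimating $g''$ at the peak with a symmetric difference quotient and taking $\sqrt{-1/g''}$ should recover $s$. (The values of `m`, `s`, and `h` below are arbitrary choices for the illustration.)

```{r}
# sanity check: for an exactly normal density with standard deviation s,
# the log-density is quadratic with second derivative -1/s^2, so
# sqrt(-1/g''(p0)) should recover s
m <- 0.6; s <- 0.2; h <- 1e-4
g <- function(p) dnorm(p, m, s, log = TRUE)
gpp <- (g(m - h) - 2*g(m) + g(m + h))/h^2   # discrete second derivative at the peak
sqrt(-1/gpp)    # recovers 0.2 up to rounding error
```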

### 1.3.1 First step: finding the maximum

Recall that the posterior is proportional to $\Binomial(6 \mid 9,p) \Normal(p \mid 0.25,0.5)$. We'll evaluate the *logarithm* of this,

$$
g(p) = \log f(p) = \log \Binomial(6 \mid 9,p) + \log \Normal(p \mid 0.25,0.5) + \text{const},
$$

at a grid of points, then find which point on the grid corresponds to the maximum.

```{r}
# the grid
gridsize <- 1000
p.grid <- seq(from = 0, to = 1, length.out = gridsize)
# evaluate the posterior at the grid points
g.grid <- sapply(p.grid, function(p) dbinom(6, 9, p, log = TRUE) + dnorm(p, 0.25, 0.5, log = TRUE))
# find the point on the grid which maximizes the posterior
index_of_max <- which.max(g.grid)
( p0 <- p.grid[index_of_max] )
```

### 1.3.2 Second step: computing the second derivative

The variable `g.grid` now contains a sequence of values of the log-posterior: $g_1, g_2, \ldots, g_{1000}$. Using these values, we will use the *discrete second derivative* to approximate the true value of $g''(p_0)$.

The discrete second derivative is defined by

$$
g''(p_i) \approx \frac{g_{i-1} - 2g_i + g_{i+1}}{\Delta p^2},
$$

where $\Delta p$ is the distance between the grid points $p_i$. In code:

```{r}
( gpp <- (g.grid[index_of_max - 1] - 2*g.grid[index_of_max] + g.grid[index_of_max + 1])*(gridsize-1)^2 )
```

(Note that, since the points of `p.grid` range from 0 to 1, the distance between them is `1/(gridsize-1)`. So we've replaced "dividing by $\Delta p^2$" in the above formula with "multiplying by `(gridsize-1)^2`".)
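
As a cross-check on this number, we can differentiate by hand (a side calculation, not part of the grid workflow): up to a constant, $g(p) = 6\log p + 3\log(1-p) - (p-0.25)^2/(2 \cdot 0.5^2)$, so $g''(p) = -6/p^2 - 3/(1-p)^2 - 1/0.5^2$. Evaluating this at the grid's maximum:

```{r}
# analytic second derivative of the log posterior:
# g''(p) = -6/p^2 - 3/(1 - p)^2 - 1/0.5^2
p0 <- 0.6276276    # maximum found on the grid above
-6/p0^2 - 3/(1 - p0)^2 - 4    # closely matches the discrete estimate gpp
```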

As a sanity check, it makes sense that $g''(p_0)$ is negative---we're at a local max, after all. Finally we use this value for $\sigma$:

```{r}
( sigma <- sqrt(-1/gpp) )
```

#### What if we didn't have the full grid to work with?

If we hadn't computed the posterior on a grid beforehand, we could always choose an appropriately small step size and compute just the three grid points we need in the formula for the discrete second derivative.
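
For example, here's a sketch of that idea which swaps the brute-force grid search for base R's `optimize` (not necessarily what `map` uses) and then evaluates the log posterior at just three nearby points:

```{r}
# find the mode with a one-dimensional optimizer instead of a grid, then
# use a small step h to get the three values the discrete formula needs
g <- function(p) dbinom(6, 9, p, log = TRUE) + dnorm(p, 0.25, 0.5, log = TRUE)
p0 <- optimize(g, interval = c(0, 1), maximum = TRUE)$maximum
h <- 1e-4    # step size for the discrete second derivative
gpp <- (g(p0 - h) - 2*g(p0) + g(p0 + h))/h^2
c(p0 = p0, sigma = sqrt(-1/gpp))    # essentially the same values the grid gave
```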

### 1.3.3 Results

Here's a plot of the normal approximation we've computed:

```{r}
# plot the exact posterior
curve(
    dbinom(6, 9, x)*dnorm(x, 0.25, 0.5)/0.0581627,
    xlim = c(0,1), xlab = "p", ylab = "posterior"
)
# plot our normal approximation for the posterior
curve(dnorm(x, p0, sigma), lty = 2, col = rangi2, add = TRUE)
```

And here's a table comparing the estimates of the mean and standard deviation from `map` with our estimates:

```{r}
rbind(
    as.data.frame(precis(m1)@output[1:2], row.names = "map's"),
    data.frame(Mean = p0, StdDev = sigma, row.names = "ours")
)
```

The only things we needed to calculate this mean and this standard deviation were the likelihood and the prior. We didn't even need to know the constant of proportionality we'd need to multiply them by to get the exact posterior.
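
This is worth seeing concretely (a quick sketch): dividing by the normalizing constant only shifts the log posterior by a constant, which changes neither the location of the maximum nor the discrete second derivative.

```{r}
# a constant shift of the log posterior cancels exactly in the
# discrete second derivative (and can't move the argmax)
g <- function(p) dbinom(6, 9, p, log = TRUE) + dnorm(p, 0.25, 0.5, log = TRUE)
g_norm <- function(p) g(p) - log(0.0581627)    # the fully normalized version
h <- 1e-3; p <- 0.6
d2 <- function(f) (f(p - h) - 2*f(p) + f(p + h))/h^2
all.equal(d2(g), d2(g_norm), tolerance = 1e-6)    # TRUE
```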

# 2 Estimating the mean and standard deviation for a normal likelihood

Now we consider a problem where we'll estimate two parameters instead of one. This means that we'll be finding a normal approximation for their *joint* posterior.

Suppose that we have data containing 20 observations which we know to be normally distributed with some mean and variance. Our prior for the mean $\mu$ is

$$
\mu \sim \Normal(0,5)
$$

and our prior for the standard deviation $\sigma$ is

$$
\sigma \sim \Uniform(0,2).
$$

Here's our data:

```{r}
true_mu <- 2
true_sd <- 1
set.seed(1)
d <- rnorm(20, true_mu, true_sd)
```

## 2.1 Analytical solution

The joint posterior for $\mu$ and $\sigma$ is

$$
f(\mu,\sigma) \propto \Normal(\mu \mid 0,5) \Uniform(\sigma \mid 0,2) \prod_{i=1}^{20} \Normal(d_i \mid \mu,\sigma),
$$

where $d_1, d_2, \ldots, d_{20}$ are our 20 data points.

We'll skip finding the constant of proportionality and plotting the posterior and move straight to the estimated means, standard deviations, and correlation for the parameters $\mu$ and $\sigma$.

## 2.2 Posterior estimates from `map`

We'll use `map` to find normal approximations to the marginal posteriors for $\mu$ and $\sigma$.

```{r}
m2 <- map(
    alist(
        x ~ dnorm(mu, sigma),
        mu ~ dnorm(0, 5),
        sigma ~ dunif(0,2)
    ),
    data = list(x = d)
)
precis(m2, corr = TRUE)
```

Note that the model estimates `mu` and `sigma` to have a very small negative correlation of `r precis(m2, corr=TRUE)@output[1,6]`.

## 2.3 Manually obtaining the normal approximation

In this case we're finding the normal approximation to the *joint* posterior for $\mu$ and $\sigma$. Finding the maximum of the posterior will basically be just as easy as before, but to estimate the curvature near this maximum we'll need to use some multivariable calculus.

Suppose we've already found the location $(\mu_0, \sigma_0)$ of the maximum of the posterior. We need to find the covariance matrix $\mathbf \Sigma$ which makes

$$
f(\mathbf m) \approx \frac{1}{2\pi \sqrt{\lvert \mathbf\Sigma \rvert}} \exp\!\left[ -\frac{1}{2} (\mathbf m - \mathbf m_0)^{\text{T}} \mathbf\Sigma^{-1} (\mathbf m - \mathbf m_0) \right]
$$

when $\mathbf m$ is near $\mathbf m_0$, where

$$
\mathbf m = \begin{pmatrix} \mu \\ \sigma \end{pmatrix} \qquad \text{and} \qquad \mathbf m_0 = \begin{pmatrix} \mu_0 \\ \sigma_0 \end{pmatrix}.
$$

Taking logarithms, we want

$$
\log f(\mathbf m) \approx -\frac{1}{2} (\mathbf m - \mathbf m_0)^{\text{T}} \mathbf\Sigma^{-1} (\mathbf m - \mathbf m_0) + \text{const}.
$$

Just as before, since $f(\mathbf m)$ (and hence $\log f(\mathbf m)$) has a maximum at $\mathbf m = \mathbf m_0$, the linear terms drop out of the Taylor series for $\log f(\mathbf m)$, and we're left with the approximation

$$
\log f(\mathbf m) \approx \frac{1}{2} (\mathbf m - \mathbf m_0)^{\text{T}} \mathbf H(\mathbf m_0) (\mathbf m - \mathbf m_0) + \text{const},
$$

where $\mathbf H(\mathbf m_0)$ is the Hessian matrix for $g(\mathbf m) = \log f(\mathbf m)$ evaluated at $\mathbf m = \mathbf m_0$. (This *Hessian matrix* is just a matrix of the second derivatives of $\log f(\mathbf m)$.) Comparing this formula to the one above, we see that we should take

$$
\mathbf \Sigma = -\mathbf H(\mathbf m_0)^{-1}.
$$

Compare this with our choice of $\sigma^2 = -1/g''(p_0)$ in [section 1.3](#s13): in one dimension the Hessian is just the $1 \times 1$ matrix $\big( g''(p_0) \big)$.

### 2.3.1 Finding the maximum

There's probably a cleaner way to do this, but we'll use two applications of `which.max` and `max` to find the indices of the maximum of the joint posterior.

```{r}
# the grid
mu.gridsize <- 5000
mu.max <- 5
sigma.gridsize <- 1000
mu.grid <- seq(from = -mu.max, to = mu.max, length.out = mu.gridsize)
sigma.grid <- seq(from = 0, to = 2, length.out = sigma.gridsize)
g.grid <- outer(
    mu.grid, sigma.grid,
    function(m,s) {
        log_post <- dnorm(m, 0, 5, log = TRUE) + dunif(s, 0, 2, log = TRUE)
        for (i in 1:length(d))
            log_post <- log_post + dnorm(d[i], m, s, log = TRUE)
        return(log_post)
    }
)
# find the entries with the maximum of the posterior
rom <- which.max(apply(g.grid, 1, max))    # rom = Row Of Max
com <- which.max(apply(g.grid, 2, max))    # com = Col Of Max
# find corresponding mu and sigma
( mu0 <- mu.grid[rom] )
( sigma0 <- sigma.grid[com] )
```

There's an awkward `for` loop in the above code that you might have noticed. Originally I wrote the call to `outer` like

```{r eval = FALSE}
log_posterior <- function(mu, sigma)
    dnorm(mu, 0, 5, log = TRUE) + dunif(sigma, 0, 2, log = TRUE) + sum(dnorm(d, mu, sigma, log = TRUE))
v_log_posterior <- Vectorize(log_posterior, c("mu", "sigma"))
g.grid <- outer(mu.grid, sigma.grid, v_log_posterior)
```

...but this turned out to be extremely slow. The version with the `for` loop runs much faster.


### 2.3.2 Computing the Hessian

The variable `g.grid` now contains a 2d array of values of the log posterior.

Whereas in the one-dimensional case there was only a single second derivative to worry about, now there are four: $g_{xx}(\mathbf m)$, $g_{xy}(\mathbf m)$, $g_{yx}(\mathbf m)$, and $g_{yy}(\mathbf m)$, where $x$ refers to the first coordinate $\mu$ and $y$ to the second coordinate $\sigma$. But thankfully $g_{xy} = g_{yx}$ if $g$ is nice enough (it is), so really there are only three to worry about.

We still compute them using discrete differences:

$$
g_{xx}(\mu_i, \sigma_j) \approx \frac{g_{i-1,j} - 2 g_{i,j} + g_{i+1,j}}{\Delta\mu^2}, \\
g_{yy}(\mu_i, \sigma_j) \approx \frac{g_{i,j-1} - 2 g_{i,j} + g_{i,j+1}}{\Delta\sigma^2}, \\
g_{xy}(\mu_i, \sigma_j) \approx \frac{g_{i-1,j-1} - g_{i-1,j+1} - g_{i+1,j-1} + g_{i+1,j+1}}{4\Delta \sigma \Delta \mu}.
$$

Then our Hessian is

$$
\mathbf H(\mathbf m_0) = \begin{pmatrix} g_{xx}(\mathbf m_0) & g_{xy}(\mathbf m_0) \\ g_{yx}(\mathbf m_0) & g_{yy}(\mathbf m_0) \end{pmatrix}.
$$

Here's the code:

```{r}
# compute the Hessian
g_xx <- (g.grid[rom-1, com] - 2*g.grid[rom, com] + g.grid[rom+1, com])*((mu.gridsize-1)/(2*mu.max))^2
g_yy <- (g.grid[rom, com-1] - 2*g.grid[rom, com] + g.grid[rom, com+1])*((sigma.gridsize-1)/2)^2
g_xy <- (g.grid[rom-1, com-1] - g.grid[rom-1, com+1] - g.grid[rom+1, com-1] +
             g.grid[rom+1, com+1])*((mu.gridsize-1)*(sigma.gridsize-1)/(16*mu.max))
( hess <- matrix(c(g_xx, g_xy, g_xy, g_yy), ncol = 2) )
# compute covariance matrix
( big_sigma <- -solve(hess) )
```

### 2.3.3 Results

We can read off the marginal standard deviations from the covariance matrix:

```{r}
# standard deviation for mu posterior
( mu_sd <- sqrt(big_sigma[1,1]) )
# standard deviation for sigma posterior
( sigma_sd <- sqrt(big_sigma[2,2]) )
```

And note that we picked up the same slight correlation between $\mu$ and $\sigma$ that `map` did:

```{r}
# estimated correlation between mu and sigma
big_sigma[1,2]/(mu_sd*sigma_sd)
```

Here's a summary comparing all estimates from `map` with our estimates:

```{r}
data.frame(
    mu_Mean = c(coef(m2)["mu"], mu0),
    mu_StdDev = c(precis(m2)@output[1,2], mu_sd),
    sigma_Mean = c(coef(m2)["sigma"], sigma0),
    sigma_StdDev = c(precis(m2)@output[2,2], sigma_sd),
    corr = c(precis(m2, corr=TRUE)@output[1,6], big_sigma[1,2]/(mu_sd*sigma_sd)),
    row.names = c("map's", "ours")
)
```


# 3 More parameters

This brute force approach to finding the maximum of the joint posterior can only take us so far. As the dimensionality of `g.grid` grows with the number of parameters, just constructing `g.grid` would take ages. A more intelligent approach, like picking a starting point and only computing the values of $g$ needed to forge a path of steepest ascent, would be far more efficient. Based on what McElreath hinted at in the book, I think this is something like what `map` does.
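
For comparison, here's a grid-free sketch using base R's `optim`, which climbs to the mode and can return a finite-difference Hessian there (this is my guess at the flavor of what `map` does, not its actual implementation):

```{r}
# same simulated data as in section 2, regenerated so this chunk stands alone
set.seed(1)
d <- rnorm(20, 2, 1)
# optim() minimizes, so we hand it the *negative* log posterior; the Hessian
# it returns is then -H(m0), and the covariance matrix is its plain inverse
neg_log_post <- function(par) {
    mu <- par[1]; sigma <- par[2]
    if (sigma <= 0 || sigma >= 2) return(Inf)    # stay inside the prior's support
    -(dnorm(mu, 0, 5, log = TRUE) + dunif(sigma, 0, 2, log = TRUE) +
          sum(dnorm(d, mu, sigma, log = TRUE)))
}
fit <- optim(c(0, 1), neg_log_post, hessian = TRUE)
fit$par             # mode: close to (mu0, sigma0) from the grid
solve(fit$hessian)  # close to big_sigma from the grid
```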

I wanted to keep things simple in this note, and the brute force approach is as simple as it gets.

----

[Antonio R. Vargas](https://mathstat.dal.ca/~antoniov/)

1 Oct. 2018

Updated 6 Oct. 2018.
