A while back I worked through a couple of problems involving binned data, and it really struck me just how straightforward they were using a Bayesian approach and how natural it was to get the correct uncertainty estimates. I recently stumbled across an old reddit thread that reminded me of these notes I wrote back then.

1 Binned Poisson data

Someone on reddit asked:

“Does Poisson work with completely censored or ordinally coded data? For example, 1 = never, 2 = 1 or 2 units, 3 = 3 or 4 units, 4 = 5+”

Commenters suggested using an ordinal regression model. While that would work to some degree, this clearly isn’t ordinary ordinal data: the categories are bins of counts. If we incorporate the knowledge that the underlying data is Poisson, we will get better uncertainty estimates.

All we do is sum the Poisson probabilities of the counts contained in each bin. If there’s a 1 in the data we use the likelihood

\[\operatorname{Poisson}(0 | \theta).\]

If there’s a 2 in the data we use the likelihood

\[\operatorname{Poisson}(1 | \theta) + \operatorname{Poisson}(2 | \theta).\]

If there’s a 3 we use

\[\operatorname{Poisson}(3 | \theta) + \operatorname{Poisson}(4 | \theta),\]

and if there’s a 4 in the data we use the infinite sum

\[\operatorname{Poisson}(5 | \theta) + \operatorname{Poisson}(6 | \theta) + \operatorname{Poisson}(7 | \theta) + ...,\]

which is just

\[1 - \operatorname{PoissonCDF}(4 | \theta).\]
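We can confirm this identity numerically with an arbitrary theta, say theta = 8, truncating the infinite sum at a large count:

sum(dpois(5:1000, 8))              # truncated version of the infinite sum
ppois(4, 8, lower.tail = FALSE)    # 1 - PoissonCDF(4 | theta)

Both give about 0.900.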

Here’s this likelihood implemented in R:

library(rethinking)

dpois_binned <- function(x, t, log = TRUE) {
    # recycle x and t to a common length
    if (length(x) > length(t))
        t <- rep_len(t, length(x))
    if (length(t) > length(x))
        x <- rep_len(x, length(t))

    ll <- rep(0, length(x))

    for (i in seq_along(x)) {
        if (x[i] == 1)           # bin 1: a count of 0
            ll[i] <- dpois(0, t[i], log = TRUE)
        else if (x[i] == 2)      # bin 2: a count of 1 or 2
            ll[i] <- log_sum_exp(c(dpois(1, t[i], log = TRUE),
                                   dpois(2, t[i], log = TRUE)))
        else if (x[i] == 3)      # bin 3: a count of 3 or 4
            ll[i] <- log_sum_exp(c(dpois(3, t[i], log = TRUE),
                                   dpois(4, t[i], log = TRUE)))
        else                     # bin 4: a count of 5 or more
            ll[i] <- ppois(4, t[i], lower.tail = FALSE, log.p = TRUE)
    }

    if (log == FALSE)
        ll <- exp(ll)

    return(ll)
}

It’s vectorized in x and t, and the sums of the likelihoods are done on the log scale using rethinking::log_sum_exp() for numerical stability.
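As a quick sanity check (with an arbitrary theta of 3), the probabilities of the four bins should sum to one, and bin 2 should match the sum of the individual Poisson probabilities:

sum(dpois_binned(1:4, 3, log = FALSE))   # should be 1
dpois_binned(2, 3, log = FALSE)          # should equal the next line
dpois(1, 3) + dpois(2, 3)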

Now let’s simulate some data and see if we can recover the original Poisson parameter with rethinking::quap(). We’ll also fit a model using the original, unbinned data to compare.

theta <- 8  # true Poisson parameter
df_p <- data.frame(val = rpois(100, theta))
# bin the counts: 0 -> 1, 1 or 2 -> 2, 3 or 4 -> 3, 5 or more -> 4
df_p$cat <- as.integer(cut(df_p$val, c(-Inf, 0, 2, 4, Inf), labels = 1:4))

Here’s a histogram of the binned data df_p$cat:

plot(table(df_p$cat))

There are only a handful of 2s and 3s. Almost the entire dataset was put into bin 4.

fit_p1 <- quap(  # model for binned data
    alist(
        cat ~ dpois_binned(theta),
        log(theta) <- a,
        a ~ dnorm(0, 5)
    ),
    data = df_p
)

fit_p2 <- quap(  # model for original data
    alist(
        val ~ dpois(theta),
        log(theta) <- a,
        a ~ dnorm(0, 5)
    ),
    data = df_p
)

samples1 <- extract.samples(fit_p1)
samples2 <- extract.samples(fit_p2)
theta1 <- exp(samples1$a)
theta2 <- exp(samples2$a)

plot(
    density(theta1),
    main = "Posteriors for theta", xlab = "theta",
    ylim = c(0, 1.4)
)
points(density(theta2), type = "l", lty = 2, col = rangi2)

The black curve is the posterior for theta from the binned data and the dashed blue curve is the posterior from the original, unbinned data. Even though so much of the data was in bin 4, the first model (using the binned data only) was still able to give a decent estimate for theta, whose true value was 8.

I encourage you to play around with this, changing the theta used to generate the data and the number of data points.
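One way to make that easier is a small wrapper that simulates, bins, and fits in one go, returning posterior samples of theta. The function below is just an illustrative sketch along those lines (its name and defaults are mine, not part of rethinking):

sim_and_fit <- function(true_theta, n = 100) {
    df <- data.frame(val = rpois(n, true_theta))
    df$cat <- as.integer(cut(df$val, c(-Inf, 0, 2, 4, Inf), labels = 1:4))
    fit <- quap(
        alist(
            cat ~ dpois_binned(theta),
            log(theta) <- a,
            a ~ dnorm(0, 5)
        ),
        data = df
    )
    exp(extract.samples(fit)$a)  # posterior samples of theta
}

plot(density(sim_and_fit(3, n = 50)), main = "theta = 3, n = 50")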

2 Binned Log-Normal data

This blog post sets the scene:

Hypothetically, say you're given the data in the table below,
and you're asked to find the mean:

Group     Frequency
0 to 25         114
25 to 50         76
50 to 75         58
75 to 100        51
100 to 250      140
250 to 500      107
500 to 1000      77
1000 to 5000    124
5000 or more     42

We’ll follow one suggestion from the post to fit a log-normal distribution to this data.

First we need to implement the likelihood. This is very similar to what we did above, only now we’re dealing with a continuous distribution. So if we encounter the bin “25 to 50”, instead of summing individual probabilities we integrate the log-normal density from 25 to 50, which is the same thing as

\[\operatorname{LogNormalCDF}(50 | \mu, \sigma) - \operatorname{LogNormalCDF}(25 | \mu, \sigma).\]
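If you want to convince yourself of this, you can compare a numerical integration of the density against the difference of CDFs for some arbitrary parameters, say mu = 5 and sigma = 2:

integrate(dlnorm, 25, 50, meanlog = 5, sdlog = 2)$value
plnorm(50, 5, 2) - plnorm(25, 5, 2)

Both give the probability that a LogNormal(5, 2) variable lands between 25 and 50.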

We number the bins 1 (0 to 25) to 9 (5000 or more).

# computes log(exp(a) - exp(b)) in a numerically stable way (assumes a >= b)
log_minus_exp <- function(a, b) {
    abmax <- max(a, b)
    xminus <- exp(a - abmax) - exp(b - abmax)
    abmax + log(xminus)
}

endpoints <- c(25, 50, 75, 100, 250, 500, 1000, 5000)

dlnorm_binned <- function(x, mu, sigma, log = TRUE) {
    # recycle all arguments to a common length
    max_len <- max(length(x), length(mu), length(sigma))
    if (length(x) < max_len)
        x <- rep_len(x, max_len)
    if (length(mu) < max_len)
        mu <- rep_len(mu, max_len)
    if (length(sigma) < max_len)
        sigma <- rep_len(sigma, max_len)

    ll <- rep(0, max_len)

    for (i in 1:max_len) {
        if (x[i] == 1)                                   # first bin: 0 to 25
            ll[i] <- plnorm(25, mu[i], sigma[i], log.p = TRUE)
        else if (1 < x[i] & x[i] <= length(endpoints))   # interior bins
            ll[i] <- log_minus_exp(
                plnorm(endpoints[x[i]], mu[i], sigma[i], log.p = TRUE),
                plnorm(endpoints[x[i]-1], mu[i], sigma[i], log.p = TRUE)
            )
        else                                             # last bin: 5000 or more
            ll[i] <- plnorm(
                endpoints[length(endpoints)],
                mu[i], sigma[i],
                lower.tail = FALSE, log.p = TRUE
            )
    }

    if (log == FALSE)
        ll <- exp(ll)

    return(ll)
}

Again we compute everything on the log scale, except this time we define a new function log_minus_exp() to compute the differences between LogNormalCDFs. It’s basically identical to McElreath’s rethinking::log_sum_exp() except for a minus sign.
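A quick sanity check of log_minus_exp() with made-up numbers:

log_minus_exp(log(0.7), log(0.2))   # should be close to...
log(0.5)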

Now let’s fit the model and estimate the mean of the data.
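To convert posterior samples of mu and sigma into samples of the mean we’ll use the standard formula for the mean of a log-normal distribution,

\[\exp(\mu + \sigma^2 / 2).\]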

# expand the frequency table into one observation per row
df_ln <- data.frame(
    cat = unlist(
        mapply(rep, 1:9, c(114, 76, 58, 51, 140, 107, 77, 124, 42))
    )
)

fit_ln <- quap(
    alist(
        cat ~ dlnorm_binned(mu, sigma),
        log(sigma) <- a,
        mu ~ dnorm(0, 10),
        a ~ dnorm(0, 5)
    ),
    data = df_ln
)

samples <- extract.samples(fit_ln)
sigma <- exp(samples$a)
mean_est <- exp(samples$mu + sigma^2/2)  # formula for log-normal mean

plot(density(mean_est), main = "Posterior for the mean", xlab = "mean")
points(median(mean_est), 0.00005, pch = 16, col = rangi2)
lines(HPDI(mean_est), rep(0.00005, 2), lwd = 2, col = rangi2)

Our estimate for the mean is 1385.3, with an 89% highest posterior density interval (the default for rethinking’s HPDI()) of 1090.7 to 1696.4.

We can also visualize the fit of our distribution against a histogram of the data. I modified the code for “Figure 1” in the blog post and overlaid many plausible log-normal density curves drawn from our posterior:

par(mar = c(4,4,4,2))
group_freq = c(114,76,58,51,140,107,77,124,42) # observed frequencies
cutoffs = c(0,25,50,75,100,250,500,1000,5000,20000) # bin boundaries

plot(
  c(1,max(cutoffs)), c(0,max(1.25*group_freq/diff(cutoffs)/sum(group_freq))),
  type = "n", xlab = "", ylab = "", las=1,
  main = paste0(
    "Density Plot of Observed Relative Frequency vs. ",
    "Log-Normal Distribution,\n",
    "log-scaled in x"
  ),
  log="x"
)

rect(
  xleft = 1 + cutoffs[-length(cutoffs)], xright = 1 + cutoffs[-1],
  ybottom = rep(0,length(cutoffs)-1), ytop = group_freq/diff(cutoffs)/sum(group_freq),
  col="Lightblue"
)

for(i in 1:200)
  curve(
    dlnorm(x, samples$mu[i], sigma[i]),
    col = rgb(0, 0, 0, 0.1),
    n = 301,
    add = TRUE
  )

It’s interesting to generate some log-normal data, put it into these bins, and try to recover the original parameters from both the binned and the unbinned data, like we did above for the Poisson data. Here’s some code for the data generation:

df_ln <- data.frame(val = rlnorm(100, 5.3, 2))
df_ln$cat <- as.integer(cut(df_ln$val, c(-Inf, endpoints, Inf), 1:9))
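And here is one possible sketch of the two fits, mirroring fit_p1 and fit_p2 from the Poisson example (the priors are simply reused from the earlier model, and mu = 5.3, sigma = 2 are the values used in the simulation above):

fit_ln1 <- quap(  # model for binned data
    alist(
        cat ~ dlnorm_binned(mu, sigma),
        log(sigma) <- a,
        mu ~ dnorm(0, 10),
        a ~ dnorm(0, 5)
    ),
    data = df_ln
)

fit_ln2 <- quap(  # model for original data
    alist(
        val ~ dlnorm(mu, sigma),
        log(sigma) <- a,
        mu ~ dnorm(0, 10),
        a ~ dnorm(0, 5)
    ),
    data = df_ln
)

precis(fit_ln1)  # true values: mu = 5.3, a = log(2) ~ 0.69
precis(fit_ln2)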

That’s all for now.


Antonio R. Vargas

20 March 2022