A while back I worked through a couple of problems involving binned data, and it struck me how straightforward they were with a Bayesian approach and how naturally the correct uncertainty estimates fell out. I recently stumbled across an old reddit thread that reminded me of the notes I wrote back then.
Someone on reddit asked:
“Does Poisson work with completely censored or ordinally coded data? For example, 1 = never, 2 = 1 or 2 units, 3 = 3 or 4 units, 4 = 5+”
Commenters suggested an ordinal regression model. That would work to some degree, but this clearly isn’t ordinary ordinal data: we know the underlying counts are Poisson, and incorporating that knowledge gives better uncertainty estimates.
All we do is sum the Poisson likelihoods of the counts that could fall in each bin. If there’s a 1 in the data we use the likelihood
\[\operatorname{Poisson}(0 | \theta).\]
If there’s a 2 in the data we use the likelihood
\[\operatorname{Poisson}(1 | \theta) + \operatorname{Poisson}(2 | \theta).\]
If there’s a 3 we use
\[\operatorname{Poisson}(3 | \theta) + \operatorname{Poisson}(4 | \theta),\]
and if there’s a 4 in the data we use the infinite sum
\[\operatorname{Poisson}(5 | \theta) + \operatorname{Poisson}(6 | \theta) + \operatorname{Poisson}(7 | \theta) + ...,\]
which is just
\[1 - \operatorname{PoissonCDF}(4 | \theta).\]
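Before writing any code, here’s a quick sanity check in R that the truncated tail sum matches the complementary CDF (theta_check is just a throwaway example value):

theta_check <- 8
sum(dpois(5:200, theta_check))             # tail sum, truncated far past any remaining mass
ppois(4, theta_check, lower.tail = FALSE)  # complementary CDF, effectively the same value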
Here’s this likelihood implemented in R:
library(rethinking)
dpois_binned <- function(x, t, log = TRUE) {
  # recycle x and t to a common length so the function is vectorized
  if (length(x) > length(t))
    t <- rep_len(t, length(x))
  if (length(t) > length(x))
    x <- rep_len(x, length(t))

  ll <- rep(0, length(x))

  for (i in 1:length(x)) {
    if (x[i] == 1)
      # bin 1: never, i.e. a count of 0
      ll[i] <- dpois(0, t[i], log = TRUE)
    else if (x[i] == 2)
      # bin 2: a count of 1 or 2
      ll[i] <- log_sum_exp(c(dpois(1, t[i], log = TRUE),
                             dpois(2, t[i], log = TRUE)))
    else if (x[i] == 3)
      # bin 3: a count of 3 or 4
      ll[i] <- log_sum_exp(c(dpois(3, t[i], log = TRUE),
                             dpois(4, t[i], log = TRUE)))
    else
      # bin 4: a count of 5 or more
      ll[i] <- ppois(4, t[i], lower.tail = FALSE, log.p = TRUE)
  }
  if (log == FALSE)
    ll <- exp(ll)
  return(ll)
}
It’s vectorized in x and t, and the sums of the likelihoods are done on the log scale using rethinking::log_sum_exp() for numerical stability.
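In case you haven’t seen it, log_sum_exp() computes log(exp(a) + exp(b) + ...) without underflow by factoring out the largest term. A minimal sketch of the idea (my_log_sum_exp() below is just a stand-in to illustrate, not the package function):

my_log_sum_exp <- function(x) {
  xmax <- max(x)
  xmax + log(sum(exp(x - xmax)))
}
my_log_sum_exp(c(-1000, -1001))  # finite answer, while log(sum(exp(c(-1000, -1001)))) underflows to -Inf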
Now let’s simulate some data and see if we can recover the original Poisson parameter with rethinking::quap(). We’ll also fit a model using the original, unbinned data to compare.
theta <- 8 # true Poisson parameter
df_p <- data.frame(val = rpois(100, theta))
df_p$cat <- as.integer(cut(df_p$val, c(-Inf, c(0, 2, 4), Inf), 1:4))
Here’s a histogram of the binned data df_p$cat:
plot(table(df_p$cat))
There are only a handful of 2s and 3s. Almost the entire dataset was put into bin 4.
fit_p1 <- quap( # model for binned data
  alist(
    cat ~ dpois_binned(theta),
    log(theta) <- a,
    a ~ dnorm(0, 5)
  ),
  data = df_p
)
fit_p2 <- quap( # model for original data
  alist(
    val ~ dpois(theta),
    log(theta) <- a,
    a ~ dnorm(0, 5)
  ),
  data = df_p
)
samples1 <- extract.samples(fit_p1)
samples2 <- extract.samples(fit_p2)
theta1 <- exp(samples1$a)
theta2 <- exp(samples2$a)

plot(
  density(theta1),
  main = "Posteriors for theta", xlab = "theta",
  ylim = c(0, 1.4)
)
points(density(theta2), type = "l", lty = 2, col = rangi2)
The black curve is the posterior for theta from the binned data and the dashed blue curve is the posterior from the original, unbinned data. Even though so much of the data was in bin 4, the first model (using the binned data only) was still able to give a decent estimate for theta, whose true value was 8.
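If you want numbers to go with the plot, a quick comparison straight from the posterior samples:

c(binned = mean(theta1), unbinned = mean(theta2))  # posterior means of theta
HPDI(theta1, prob = 0.95)  # interval from the binned-data model
HPDI(theta2, prob = 0.95)  # interval from the unbinned-data model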
I encourage you to play around with this, changing the theta used to generate the data and the number of data points.
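One way to do that is to wrap the whole simulate-and-refit cycle in a throwaway helper like this (rerun_sim() is just a sketch, reusing dpois_binned() from above):

rerun_sim <- function(theta_true = 8, n = 100) {
  df <- data.frame(val = rpois(n, theta_true))
  df$cat <- as.integer(cut(df$val, c(-Inf, 0, 2, 4, Inf), 1:4))
  fit <- quap(
    alist(
      cat ~ dpois_binned(theta),
      log(theta) <- a,
      a ~ dnorm(0, 5)
    ),
    data = df
  )
  exp(extract.samples(fit)$a)  # posterior samples of theta
}
# e.g. plot(density(rerun_sim(theta_true = 3, n = 50)))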
This blog post sets the scene:
Hypothetically, say you're given the data in the table below,
and you're asked to find the mean:
Group Frequency
0 to 25 114
25 to 50 76
50 to 75 58
75 to 100 51
100 to 250 140
250 to 500 107
500 to 1000 77
1000 to 5000 124
5000 or more 42
We’ll follow one suggestion from the post to fit a log-normal distribution to this data.
First we need to implement the likelihood. This is very similar to what we did above, only now we’re dealing with a continuous distribution. So if we encounter the bin “25 to 50”, instead of summing separate likelihoods we integrate the log-normal density from 25 to 50, which is the same as
\[\operatorname{LogNormalCDF}(50 | \mu, \sigma) - \operatorname{LogNormalCDF}(25 | \mu, \sigma).\]
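As a quick sanity check, the CDF difference agrees with numerically integrating the log-normal density over the bin (mu_check and sigma_check are arbitrary example values):

mu_check <- 5
sigma_check <- 1.5
plnorm(50, mu_check, sigma_check) - plnorm(25, mu_check, sigma_check)
integrate(dlnorm, 25, 50, meanlog = mu_check, sdlog = sigma_check)$value  # same value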
We number the bins 1 (0 to 25) to 9 (5000 or more).
log_minus_exp <- function(a, b) {
  abmax <- max(a, b)
  xminus <- exp(a - abmax) - exp(b - abmax)
  abmax + log(xminus)
}
endpoints <- c(25, 50, 75, 100, 250, 500, 1000, 5000)
dlnorm_binned <- function(x, mu, sigma, log = TRUE) {
  # recycle arguments to a common length so the function is vectorized
  max_len <- max(length(x), length(mu), length(sigma))
  if (length(x) < max_len)
    x <- rep_len(x, max_len)
  if (length(mu) < max_len)
    mu <- rep_len(mu, max_len)
  if (length(sigma) < max_len)
    sigma <- rep_len(sigma, max_len)

  ll <- rep(0, max_len)

  for (i in 1:max_len) {
    if (x[i] == 1)
      # bin 1 (0 to 25): LogNormalCDF at 25
      ll[i] <- plnorm(25, mu[i], sigma[i], log = TRUE)
    else if (1 < x[i] & x[i] <= length(endpoints))
      # bins 2 to 8: difference of CDFs at the bin endpoints
      ll[i] <- log_minus_exp(
        plnorm(endpoints[x[i]], mu[i], sigma[i], log = TRUE),
        plnorm(endpoints[x[i] - 1], mu[i], sigma[i], log = TRUE)
      )
    else
      # bin 9 (5000 or more): upper tail beyond the last endpoint
      ll[i] <- plnorm(
        endpoints[length(endpoints)],
        mu[i], sigma[i],
        lower.tail = FALSE, log = TRUE
      )
  }
  if (log == FALSE)
    ll <- exp(ll)
  return(ll)
}
Again we compute everything on the log scale, except this time we define a new function log_minus_exp() to compute the differences between LogNormalCDFs. It’s basically identical to McElreath’s rethinking::log_sum_exp() except for a minus sign.
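A quick check that it behaves as expected:

log_minus_exp(log(0.7), log(0.2))  # log(0.7 - 0.2) computed on the log scale
log(0.5)                           # same value, about -0.693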
Now let’s fit the model and estimate the mean of the data.
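Recall that the mean of a log-normal distribution with parameters mu and sigma is
\[\exp(\mu + \sigma^2 / 2),\]
which is how we’ll turn posterior samples of mu and sigma into posterior samples of the mean.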
df_ln <- data.frame(
  cat = unlist(
    mapply(rep, 1:9, c(114, 76, 58, 51, 140, 107, 77, 124, 42))
  )
)
fit_ln <- quap(
  alist(
    cat ~ dlnorm_binned(mu, sigma),
    log(sigma) <- a,
    mu ~ dnorm(0, 10),
    a ~ dnorm(0, 5)
  ),
  data = df_ln
)
samples <- extract.samples(fit_ln)
sigma <- exp(samples$a)
mean_est <- exp(samples$mu + sigma^2/2) # formula for log-normal mean

plot(density(mean_est), main = "Posterior for the mean", xlab = "mean")
points(median(mean_est), 0.00005, pch = 16, col = rangi2)
lines(HPDI(mean_est), rep(0.00005, 2), lwd = 2, col = rangi2)
Our estimate for the mean is 1385.3, with a 95% posterior density interval of 1090.7 to 1696.4.
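For reference, summaries like these can be pulled straight from the posterior samples of the mean; note that rethinking::HPDI() defaults to an 89% interval, so the 95% interval needs prob = 0.95:

median(mean_est)             # point estimate for the mean
HPDI(mean_est, prob = 0.95)  # 95% highest posterior density interval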
We can also visualize the fit of our distribution against a histogram of the data. I modified the code for “Figure 1” in the blog post and overlaid 200 plausible log-normal density curves drawn from our posterior fit:
par(mar = c(4,4,4,2))
group_freq = c(114,76,58,51,140,107,77,124,42) # observed frequencies
cutoffs = c(0,25,50,75,100,250,500,1000,5000,20000) # bin boundaries

plot(
  c(1,max(cutoffs)), c(0,max(1.25*group_freq/diff(cutoffs)/sum(group_freq))),
  type = "n", xlab = "", ylab = "", las=1,
  main = paste0(
    "Density Plot of Observed Relative Frequency vs. ",
    "Log-Normal Distribution,\n",
    "log-scaled in x"
  ),
  log = "x"
)
rect(
  xleft = 1 + cutoffs[-length(cutoffs)], xright = 1 + cutoffs[-1],
  ybottom = rep(0,length(cutoffs)-1), ytop = group_freq/diff(cutoffs)/sum(group_freq),
  col = "Lightblue"
)
for(i in 1:200)
  curve(
    dlnorm(x, samples$mu[i], sigma[i]),
    col = rgb(0, 0, 0, 0.1),
    n = 301,
    add = TRUE
  )