This note is an analysis of vehicle crash rates in Oregon’s 36 counties. We’ll incorporate data on vehicle registrations, housing density, and travel time between counties. The analysis will be fully Bayesian, using the R package rethinking to interface with the Stan sampling engine.
We’ll fit two models: first, a multilevel Poisson model estimating each county’s yearly crashes per 100 registered vehicles; second, a model that adds housing density and uses Gaussian process regression over driving times between county seats to account for cross-traffic.
The libraries we’ll need:
library(rethinking)
library(choroplethr)
library(corrplot)
The Oregon Department of Transportation maintains a comprehensive crash database. According to the ODOT Crash Data Disclaimers,
The information contained in these reports is compiled from individual driver and police crash reports submitted to the Oregon Department of Transportation as required in ORS 811.720.
And a readme within the database itself says the following:
As of 01/01/2004, drivers are required to file an Accident and Insurance Report Form with DMV within 72 hours when damage to the driver’s vehicle is over $1,500; damage to any vehicle is over $1,500 and any vehicle is towed from the scene as a result of damage from the accident; if injury or death resulted from the accident; or if damage to any one person’s property other than a vehicle involved in the accident is over $1,500.
For this analysis I downloaded the total number of reported crashes in each county for the years 2013 through 2016.
The database also contains data for 2012 and 2017. However, there’s a big yellow disclaimer at the top stating that the 2017 data isn’t finalized, and the DMV didn’t provide data on vehicle registrations (see below) for the year 2012, so I didn’t include these years in the analysis.
Why would we compute crashes per registered vehicle instead of per capita? Because crashes per capita would count children and adults who don’t drive, and overcount adults who regularly use public transit or carpool. When you’re driving, you’re not really thinking about how many people are on the road, but how many cars are on the road.
The Oregon DMV provides vehicle registration statistics by county.
This data was a bit tricky to access because it’s only provided in PDF format (CSV would have been easier to work with). Worse, the data for the year 2015 is an image!
I extracted/transcribed the total number of registered vehicles in each county for the same timespan as above, 2013 through 2016.
Here’s the combined data, which can be downloaded from my website.
df <- read.csv("Oregon crashes and registrations.csv", skip = 5)
df
For the second model I wanted something that would capture a county’s “density”. Population density seemed like a poor choice for the reasons above, but housing density is intertwined with road design.
The US Census Bureau records data on the number of housing units per square mile of land area, which can be accessed through their American FactFinder web app.
I extracted the housing and land data (as well as the population data) for each county in Oregon from the 2010 census. The result can be downloaded from my website.
county_densities <- read.csv("Oregon county densities.csv", skip = 6)
county_densities
Since we’re going to be making a map of the estimated crash rates, this table also contains the FIPS code for each county from Wikipedia as well as approximate latitude and longitude coordinates for each county seat.
The second model we’ll discuss incorporates “distance” between counties to account for cross-traffic.
There are many ways to measure this distance; for example, we could use miles as-the-crow-flies from the geographic center of each county to the next, or we could use miles along roads. But the geographic centers of counties are sometimes out in the middle of nowhere, and the number of miles we would drive from one county to the next doesn’t take into account slower travel along hilly roads that wind around or faster travel along wide, straight freeways.
If we use the county seat instead of the geographic center, this gives a better approximation of the “center” of driving within a county. And instead of counting the miles between these county seats, average driving time more accurately reflects the effective distance between two counties and the likelihood that cars regularly travel between them.
I used the Open Source Routing Machine API to get approximate driving times between each of the 36 coordinates for the county seats in the table above.
Since these times weren’t always symmetric (it could take slightly longer to go from A to B than to go from B to A), I used the average of the coming and going times. The OSRM API returns a 36-by-36 matrix of times \(M\), so the average was computed by adding this matrix to its transpose and dividing by 2:
\[ \mathbf M_{\text{avg}} = (\mathbf M + \mathbf M^T)/2. \]
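For reference, here’s a minimal sketch of how such a query and the averaging could look in R, assuming the public OSRM demo server and the county-seat coordinates from the densities table loaded above. (The precomputed result is what’s downloaded below.)
library(jsonlite)
# build "lon,lat;lon,lat;..." for the 36 county seats
coords <- paste(
county_densities$County.seat.longitude,
county_densities$County.seat.latitude,
sep = ",", collapse = ";"
)
# OSRM's table service returns pairwise driving durations in seconds
resp <- fromJSON(paste0("http://router.project-osrm.org/table/v1/driving/", coords))
M <- resp$durations / 3600   # convert seconds to hours
M_avg <- (M + t(M)) / 2      # average of the coming and going times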
This matrix of average driving times (in units of hours) can be downloaded from my website.
timesmat <- read.csv("Travel times between Oregon county seats.csv", skip = 4, row.names = 1)
timesmat <- as.matrix(timesmat)
Our first goal is to estimate the yearly number of crashes per 100 registered vehicles in each county. Because raw rates in smaller counties can swing much higher or lower than expected purely from sampling variation, we’ll share some information across the estimates by using a multilevel model.
The only data we’ll use in this model are the numbers of crashes and registrations in each county.
For each county \(n\), we’ll model the number of crashes \(c_{nm}\) in a given year \(m\) as a Poisson random variable dependent on an average county rate \(\theta_n\) with exposure proportional to the number of vehicles registered in that county that year \(r_{nm}\).1 In symbols, the likelihood for our model is
\[ c_{nm} \sim \operatorname{Poisson}\!\left(\frac{r_{nm} \theta_n}{100}\right). \]
Dividing by 100 in this formula makes \(\theta_n\) the average crash rate per 100 registered vehicles in county \(n\). Our goal is to estimate these 36 different \(\theta_n\) parameters.
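As a quick illustration with made-up numbers: a county with 50,000 registered vehicles and a rate of \(\theta = 2\) crashes per 100 vehicles would be expected to see 50,000 × 2 / 100 = 1,000 crashes in a year.
# illustrative only -- these numbers are made up
theta <- 2            # crashes per 100 registered vehicles
registrations <- 5e4  # registered vehicles in a hypothetical county
rpois(3, registrations * theta / 100)  # three simulated years of crash counts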
To facilitate partial pooling of information among these thetas we’ll place an adaptive Gaussian prior on their logarithms,
\[ \log \theta_n \sim \operatorname{Normal}(\mu, \sigma), \]
with weak priors on its hyperparameters \(\mu\) and \(\sigma\):
\[ \begin{align} \mu &\sim \operatorname{Normal}(0, 5), \\ \sigma &\sim \operatorname{Exponential}(1). \end{align} \]
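Equivalently, on the log scale the Poisson mean can be written
\[ \log \lambda_{nm} = \log r_{nm} - \log 100 + \log \theta_n, \]
which is the linear form the fitting code below uses directly.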
If we didn’t load the data before, we could do it now with the command
df <- read.csv("Oregon crashes and registrations.csv", skip = 5)
Here’s the R code to fit this model using the rethinking package. We’ll extract roughly 80,000 Hamiltonian Monte Carlo samples from the joint posterior for our estimates.
fit1 <- map2stan(
alist(
# Poisson likelihood with registered vehicles as the exposure
crashes ~ dpois(lambda),
log(lambda) <- log(registrations) - log(100) + log_rate[county],
# adaptive prior partially pools the county log-rates toward a common mean
log_rate[county] ~ dnorm(mu, sigma),
mu ~ dnorm(0, 5),
sigma ~ dexp(1)
),
data = df,
iter = 2e4,
warmup = 1e3,
chains = 4,
cores = 4
)
precis(fit1, depth = 2)
By looking at the posterior means for the hyperparameters \(\mu\) and \(\sigma\) we see that the collection of log-rates \(\log \theta_n\) is roughly centered around 0 with a standard deviation of about 0.32.
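To put those values back on the rate scale (a quick check using the approximate posterior means quoted above):
# back-transform the approximate hyperparameter values quoted above
exp(0)               # a typical county: about 1 crash per 100 registered vehicles per year
exp(c(-0.32, 0.32))  # a one-standard-deviation spread: roughly 0.73 to 1.38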
Now we extract the samples for the log rates from the model, exponentiate them, then compute means, standard deviations, and 89% highest posterior density intervals for the desired parameters \(\theta_n\).
samples <- extract.samples(fit1)
rates <- exp(samples$log_rate)
rates_mu <- apply(rates, 2, mean)
rates_sd <- apply(rates, 2, sd)
rates_HPDI <- apply(rates, 2, HPDI)
ordered_rates <- data.frame(
mean = round(rates_mu, 3),
sd = round(rates_sd, 3),
HPDI = paste(round(rates_HPDI[1,], 2), "to", round(rates_HPDI[2,], 2))
)
ordered_rates <- ordered_rates[order(rates_mu),]
colnames(ordered_rates) <- c("Bayes-adjusted rate", "standard deviation", "89% probability interval")
rownames(ordered_rates) <- levels(df$county)[order(rates_mu)]
ordered_rates
To visualize these rates we’ll make a map of Oregon divided by county with each county colored according to its estimated rate. This type of map is called a choropleth.
The package choroplethr identifies counties by their FIPS codes, so we need to load the county_densities data now if we didn’t earlier.
county_densities <- read.csv("Oregon county densities.csv", skip = 6)
regions <- data.frame(
value = rates_mu,
region = county_densities$FIPS
)
county_choropleth(
regions,
state_zoom = "oregon",
title = "Yearly crashes per 100 registered vehicles in the period 2013-2016",
legend = "Crash Rate"
) + scale_fill_brewer(name = "Crash Rate", palette = "YlOrRd", drop = FALSE)
One striking feature of this map is that areas with higher density tend to have higher crash rates. That seems reasonable: where cars are forced to interact with each other more often, there are more opportunities for crashes to occur.
Our goal now is to investigate which counties’ crash rates are higher or lower than expected given their housing densities. Or, put another way: if we evened out the housing density across the whole state, where would we then expect to see high and low crash rates?
Let’s take a look at a plot of the log crash rates \(\log \theta_n\) we estimated in our previous model versus the log housing density of each county. We’ll indicate the 89% probability intervals for our estimates with vertical lines.
plot(
apply(samples$log_rate, 2, mean) ~ log(county_densities$Housing.density),
xlab = "log housing density",
ylab = "log crash rate",
ylim = c(-0.95, 0.875)
)
for (i in 1:36)
lines(
rep(log(county_densities$Housing.density)[i], 2),
HPDI(samples$log_rate[,i])
)
There does seem to be a positive relationship between log crash rate and log housing density, so it’s reasonable to control for housing density by adding a term for log density into the model.
Since we don’t have data for the individual years we’re interested in, we’ll use the same 2010 housing densities for each year of our data. We’ll also standardize the log densities so that the intercepts in the model correspond to crash rates at roughly the state’s average log housing density.
df$log_density <- rep(log(county_densities$Housing.density), 4)
df$log_dens_std <- (df$log_density - mean(df$log_density))/sd(df$log_density)
df$log_dens_std <- round(df$log_dens_std, 4)
So if \(d_n\) is the standardized log housing density of the \(n^\text{th}\) county, the likelihood for our new model can be written
\[ \begin{align} c_{nm} &\sim \operatorname{Poisson}(\lambda) \\ \log \lambda &= \log r_{nm} - \log 100 + \log \theta_n + \beta d_n, \end{align} \]
where \(\beta\) is the new parameter to be estimated.
By adding a term for housing density to the model we transfer information related to housing density out of the log crash rates \(\log \theta_n\) into the new parameter \(\beta\). To compensate for this we will allow the log crash rates in nearby counties to share their information by explicitly modeling their correlations.
The basic idea is that counties which are near each other are expected to share frequent cross-traffic. For example, Multnomah county receives a huge number of commuters from nearby counties in Oregon and also from Washington state. By modeling correlations between nearby counties we can adjust for crashes involving out-of-county drivers and get a better picture of where the drivers involved in crashes originate.
We will model these correlations using Gaussian process regression.2 We’ll put an adaptive multivariate Gaussian prior on the vector of log crash rates,
\[ (\log \theta_1, \ldots, \log \theta_{36}) \sim \operatorname{Normal}(\boldsymbol \mu, \mathbf K), \]
where \(\boldsymbol \mu = (\nu, \nu, \ldots, \nu)\) is a vector of identical means and the covariance matrix \(\mathbf K\) is defined by
\[ \mathbf K_{ij} = \eta^2 \exp\!\left(-\rho^2 \mathbf D_{ij}^2\right) + \delta_{ij} \sigma^2, \]
in which \(\mathbf D\) is the matrix of distances between counties (we’re using driving times); \(\delta_{ij}\) is \(0\) if \(i \neq j\) and \(1\) if \(i = j\); and \(\eta^2\), \(\rho^2\), and \(\sigma^2\) are parameters to be estimated. Roughly speaking, \(\eta^2\) sets the maximum covariance between two counties, \(\rho^2\) controls how quickly that covariance decays with driving time, and \(\sigma^2\) is the extra variance each county’s rate has on its own.
If we didn’t load the distance matrix \(\mathbf D\) earlier, we can do so now with the code
timesmat <- read.csv("Travel times between Oregon county seats.csv", skip = 4, row.names = 1)
timesmat <- as.matrix(timesmat)
Gaussian process models can be a bit finicky to fit. If we use weak priors for the parameters \(\eta^2\) and \(\sigma^2\), their posteriors end up with a lot of mass concentrated near 0 but also extremely long tails, both of which can be hard for the sampling engine to deal with.
The strong priors for etasq and sigmasq we use below were arrived at by fitting the model with very loose priors to start, then gradually tightening those priors onto the bulk of the parameters’ posteriors until Stan was able to generate the desired samples with minimal errors. The Gamma(7, 7) prior for rhosq was chosen because I expect counties 1 hour apart to covary a small amount, and the covariance between counties 2 hours apart to be minuscule.
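As a quick sanity check on those choices, here’s a sketch of a prior predictive check: draw etasq and rhosq from their priors and plot the covariance curves they imply, confirming the covariance decays roughly as described. (This assumes the shape/rate parameterization of the gamma priors, matching R’s rgamma defaults.)
# sketch of a prior predictive check for the covariance function
set.seed(1)
etasq_prior <- rgamma(100, 6, 70)
rhosq_prior <- rgamma(100, 7, 7)
plot(NULL, xlim = c(0, 5), ylim = c(0, 0.3),
xlab = "distance (hours driving)", ylab = "prior covariance")
for (i in 1:100)
curve(etasq_prior[i]*exp(-rhosq_prior[i]*x^2), add = TRUE, col = col.alpha("black", 0.2))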
fit2 <- map2stan(
alist(
crashes ~ dpois(lambda),
# exposure offset, common intercept, GP county offsets, and a density slope
log(lambda) <- log(registrations) - log(100) + avg_log_rate + log_rate[county] + b*log_dens_std,
# Gaussian process prior over the county offsets, using the driving-time matrix Dmat
log_rate[county] ~ GPL2(Dmat, etasq, rhosq, sigmasq),
avg_log_rate ~ dnorm(0, 0.25),
b ~ dnorm(0.2, 0.25),
etasq ~ dgamma(6, 70),
rhosq ~ dgamma(7, 7),
sigmasq ~ dgamma(6, 190)
),
data = list(
crashes = df$crashes,
registrations = df$registrations,
county = df$county,
log_dens_std = df$log_dens_std,
Dmat = timesmat
),
start = list(etasq = 0.1, rhosq = 0.9, sigmasq = 0.03, avg_log_rate = -0.05, b = 0.18),
iter = 2e4,
warmup = 3e3,
chains = 4,
cores = 4,
control = list(adapt_delta = 0.999999, max_treedepth = 15)
)
This took a little over an hour to run on my laptop.
precis(fit2, depth = 2)
Let’s take a look at the posterior uncertainty for the covariance curve
\[ f(x) = \eta^2 \exp\!\left(-\rho^2 x^2\right). \]
We’ll plot 300 covariance curves sampled from the posterior, with the median covariance curve in solid black.
samples <- extract.samples(fit2)
# plot the posterior median covariance function
curve(
median(samples$etasq)*exp(-median(samples$rhosq)*x^2),
from=0, to=5, ylim=c(0, 0.17),
xlab="distance (hours driving)", ylab="covariance",
lwd=2
)
# plot 300 functions sampled from posterior
for (i in 1:300)
curve(
samples$etasq[i]*exp(-samples$rhosq[i]*x^2),
col = col.alpha("black", 0.1),
add = TRUE
)
And here are the means, standard deviations, and 89% highest posterior density intervals for these new density-relative crash rates.
# add the common intercept back to each county's GP offset
log_rates <- sapply(1:36, function(i) samples$avg_log_rate + samples$log_rate[,i])
rates <- exp(log_rates)
rates_mu <- apply(rates, 2, mean)
rates_sd <- apply(rates, 2, sd)
rates_HPDI <- apply(rates, 2, HPDI)
ordered_rates <- data.frame(
mean = round(rates_mu, 3),
sd = round(rates_sd, 3),
HPDI = paste(round(rates_HPDI[1,], 2), "to", round(rates_HPDI[2,], 2))
)
ordered_rates <- ordered_rates[order(rates_mu),]
colnames(ordered_rates) <- c(
"density-relative crash rate",
"standard deviation",
"89% probability interval"
)
rownames(ordered_rates) <- levels(df$county)[order(rates_mu)]
ordered_rates
Let’s visualize the inferred correlations between these counties. We’ll plot dots at the latitudes and longitudes of the county seats, scaling the size of the dots proportional to the new crash rates. Then we’ll draw lines between these dots, with darker lines indicating higher correlations.
# compute posterior median covariance among counties
K <- matrix(0, nrow = 36, ncol = 36)
for (i in 1:36) for (j in 1:36)
K[i,j] <- median(samples$etasq)*exp(-median(samples$rhosq)*timesmat[i,j]^2)
diag(K) <- median(samples$etasq) + median(samples$sigmasq)
# convert to correlation matrix
Rho <- cov2cor(K)
# adjust label positions
pos <- rep(3, 36)
pos[match("wheeler", levels(df$county))] <- 1
pos[match("clackamas", levels(df$county))] <- 4
pos[match("multnomah", levels(df$county))] <- 4
pos[match("polk", levels(df$county))] <- 2
pos[match("marion", levels(df$county))] <- 4
pos[match("benton", levels(df$county))] <- 2
pos[match("linn", levels(df$county))] <- 4
pos[match("yamhill", levels(df$county))] <- 2
pos[match("lane", levels(df$county))] <- 4
pos[match("crook", levels(df$county))] <- 4
pos[match("deschutes", levels(df$county))] <- 1
pos[match("baker", levels(df$county))] <- 4
pos[match("jackson", levels(df$county))] <- 4
pos[match("klamath", levels(df$county))] <- 4
pos[match("harney", levels(df$county))] <- 1
# scale point size by density-relative crash rate
psize <- 1.3*sqrt(rates_mu)
lats <- county_densities$County.seat.latitude
longs <- county_densities$County.seat.longitude
plot(
longs, lats,
xlab = "longitude", ylab = "latitude",
ylim = c(min(lats), 46.4),
cex = psize, pch = 16, col = rangi2,
asp = 1
)
text(
longs, lats,
labels = levels(df$county),
cex = 0.7,
pos = pos
)
# overlay lines shaded by correlation
for (i in 1:36) for (j in 1:36)
if ( i < j )
lines(
c(longs[i], longs[j]), c(lats[i], lats[j]),
lwd = 2, col = col.alpha("black", Rho[i,j]^2)
)
Finally, here’s the new choropleth for these density-relative crash rates.
regions <- data.frame(
value = rates_mu,
region = county_densities$FIPS
)
county_choropleth(
regions,
state_zoom = "oregon",
title = "Crash rate relative to housing density in the period 2013-2016",
legend = "Crash Rate"
) + scale_fill_brewer(name = "Crash Rate", palette = "YlOrRd", drop = FALSE)
30 Nov 2018
A similar model was used in the book Bayesian Data Analysis by Gelman et al. to model kidney cancer death rates in counties across the US. See section 2.7 in the 3rd edition.
The book Statistical Rethinking by Richard McElreath has an amazing example of using Gaussian process regression to model toolkit complexity among historic Oceanic societies. See section 13.4 in the 1st edition.