Abstract: The Every Student Succeeds Act (ESSA), enacted in 2015, requires states to provide data “that can be cross-tabulated by, at a minimum, each major racial and ethnic group, gender, English proficiency status, and children with or without disabilities,”1 taking care not to reveal personally identifiable information about any individual student. As state education agencies come into compliance with ESSA, they will be publishing more and more datasets which at least partially suppress or omit data to protect student privacy. In this article we will give an example of how suppressed data can be analyzed from a Bayesian perspective using cross-tabulated data on graduation rates recently released by the Oregon Department of Education.
The Oregon Department of Education tracks each incoming group of high school freshmen—called a cohort—over the course of their time in high school and calculates the percentage of those students who graduate within four years.2 This graduation rate has been increasing steadily since the 2008-2009 school year.3
School year | Four-year cohort graduation rate | Which grad rate definition? |
---|---|---|
2008-09 | 66.2% | pre-2014 |
2009-10 | 66.38% | pre-2014 |
2010-11 | 67.65% | pre-2014 |
2011-12 | 68.44% | pre-2014 |
2012-13 | 68.66% | pre-2014 |
2013-14 | 71.98% | post-2014 |
2014-15 | 73.82% | post-2014 |
2015-16 | 74.83% | post-2014 |
2016-17 | 76.65% | post-2014 |
This data can be found in ODE’s Graduation Reports, which also break down each cohort by school, and then by several subcategories such as gender, race/ethnicity, and status as economically disadvantaged. However, it’s not possible to see, for example, how many Hispanic/Latino students who graduated are economically disadvantaged, which might help explain why they have a higher or lower graduation rate compared to other races/ethnicities. And while the reports do calculate graduation rates for students with/without disabilities, students who are/aren’t English learners, students who are/aren’t homeless, etc., we can’t determine the graduation rates for students in multiple such subgroups or for students in none of them.
On December 31, 2018, ODE released a cross-tabulated dataset for the 2016-17 cohort which contains counts for cohort and graduates from every combination of the following categories:

- Gender
- Race/Ethnicity
- English Learners in High School
- Students with Disabilities
- Homeless
- Economically Disadvantaged
Using this new data we can attempt to disentangle these factors from one another. The questions we will focus on are:
…when all other factors in the above list are equal?
# Libraries and helper functions we'll need.
library(viridisLite)
library(readxl)
library(loo)
library(coda)
library(rstan)
options(mc.cores = 4) # number of cores in your processor
cols <- viridis(10)
turq <- cols[5]
# From the 'rethinking' package (https://github.com/rmcelreath/rethinking)
col.alpha <- function( acol , alpha=0.2 ) {
acol <- col2rgb(acol)
acol <- rgb(acol[1]/255, acol[2]/255, acol[3]/255, alpha)
acol
}
The cross-tabulated dataset can be downloaded from ODE’s School and District Profiles: Accountability Measures page under “Cohort Graduation Rates / Dropout Rates” with the title “Cross-Tabulated 2016-17 Four-year Cohort Graduation Rates”.
To protect the privacy of the students, some of the data in this file has been suppressed.4 If any group contains fewer than 10 students, an asterisk (*) is entered instead of the count for that group. So if fewer than 10 students in a given combination of categories graduated, there will be an * instead of the number of graduates; and if fewer than 10 did not graduate, there will be an * instead of the size of the cohort (otherwise the number of non-graduates could be recovered by subtracting the graduates from the cohort).
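As a toy illustration of this rule, here is a hypothetical helper (our own invention, not ODE's actual procedure) that masks whichever counts would reveal a group of fewer than 10 students:

# Hypothetical sketch of the suppression rule, not ODE's actual code: the
# number of graduates is masked if fewer than 10 students graduated, and the
# cohort size is masked if fewer than 10 students did not graduate.
suppress <- function(cohort, graduates) {
  c(
    cohort    = ifelse(cohort - graduates < 10, "*", as.character(cohort)),
    graduates = ifelse(graduates < 10, "*", as.character(graduates))
  )
}

suppress(250, 212)  # both counts published
suppress(250, 245)  # cohort suppressed: only 5 students did not graduate
suppress(12, 7)     # both suppressed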
Rather than throw these rows away, we will show in section 3 how they can still be used in a Bayesian model—even the rows where both the size of the cohort and the number of graduates are suppressed. In fact we will argue that ignoring those double-suppressed rows can bias the inferred graduation rates toward the extremes of 0% and 100%.
Before loading the spreadsheet into R, a quick look reveals that whenever the number of graduates is suppressed, so is the size of the cohort. This quirk of the dataset will allow us to write a simpler model than if there were rows with unsuppressed cohort sizes but suppressed numbers of graduates.
Now let’s load the spreadsheet. If the cohort size is suppressed but the number of graduates isn’t, we’ll put a 1 for the cohort size. If both the cohort size and the number of graduates are suppressed, we’ll put a 0 for the cohort size. This will let us treat the two cases differently in the analysis.
ctab <- read_xlsx(
"ESSA_STATE_Cohort_Graduation_Cross_Tabulated_1718_SP.xlsx",
sheet = 2,
na = "*"
)
# Drop two columns we won't use in the analysis.
ctab <- ctab[,c(-1, -10)]
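# Sanity check of the observation above: whenever the number of graduates is
# suppressed, the adjusted cohort should be suppressed as well. stopifnot()
# raises an error if any row violates this.
stopifnot(!any(is.na(ctab$`Number of Graduates`) & !is.na(ctab$`Adjusted Cohort`)))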
for (i in seq_len(nrow(ctab))) {
  if (is.na(ctab$`Adjusted Cohort`[i])) {
    if (is.na(ctab$`Number of Graduates`[i])) {
      # Both the cohort size and the number of graduates are suppressed.
      ctab$`Adjusted Cohort`[i] <- 0
      ctab$`Number of Graduates`[i] <- 0
    } else {
      # Only the cohort size is suppressed.
      ctab$`Adjusted Cohort`[i] <- 1
    }
  }
}
ctab
And format it for the models, giving it short column names and using 0 and 1 for the binary categories. We’ll also assign each race/ethnicity a unique ID.
df <- data.frame(
female = as.integer(ifelse(ctab$Gender == "Female", 1, 0)),
race_ethn = as.factor(ctab$`Race/Ethnicity`),
eng_learn = as.integer(ifelse(ctab$`English Learners in High School` == "Y", 1, 0)),
disability = as.integer(ifelse(ctab$`Students with Disabilities` == "Y", 1, 0)),
homeless = as.integer(ifelse(ctab$Homeless == "Y", 1, 0)),
econ_disad = as.integer(ifelse(ctab$`Economically Disadvantaged` == "Y", 1, 0)),
cohort = as.integer(ctab$`Adjusted Cohort`),
graduates = as.integer(ctab$`Number of Graduates`)
)
df$race_ethn_id <- as.integer(df$race_ethn)
df
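Since `as.integer()` on a factor follows the sorted order of its levels, we can see which ID was assigned to which race/ethnicity by printing the levels:

# IDs 1, 2, 3, ... correspond to the levels in this (sorted) order.
levels(df$race_ethn)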
We will model the number of students who graduate as a binomial random variable,
\[ \textrm{graduates}_i \sim \operatorname{Binomial}(\textrm{cohort}_i, \theta). \]
Of course, we can only use this likelihood if we know both the size of the cohort and the number of graduates.
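As a quick illustration with made-up numbers (not a row from the dataset): for a fully observed row, a uniform Beta(1, 1) prior on \(\theta\) combines with this binomial likelihood to give a Beta posterior in closed form.

# Made-up example: a fully observed row with a cohort of 120 students, 95 of
# whom graduated. Under a uniform Beta(1, 1) prior the posterior for theta is
# Beta(95 + 1, 120 - 95 + 1).
curve(
  dbeta(x, 95 + 1, 120 - 95 + 1),
  from = 0, to = 1, lwd = 2, col = turq,
  xlab = "graduation rate theta", ylab = "density",
  main = "Posterior p(theta | graduates=95, cohort=120)"
)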
If we know the number of graduates in some row \(i\) but the cohort size has been suppressed, then all we know is that \(\textrm{cohort}_i\) is between \(\textrm{graduates}_i\) and \(\textrm{graduates}_i + 9\) inclusive, with all possibilities being equally likely.
Suppose we have 16 graduates with an unknown cohort size. Assuming a uniform prior on the graduation rate \(\theta\), there are ten possibilities for its posterior distribution:
plot(
0,
xlim = c(0, 1), ylim = c(0, 1),
type = "n",
xlab = "graduation rate theta", ylab = "density",
main = "Posterior distributions p(theta | graduates=16, cohort=n)",
yaxt = "n"
)
axis(2, 0)
for (i in 16:25)
curve(dbinom(16, i, x), col = cols[i - 15], lwd = 2, n = 301, add = TRUE)
for (i in 1:10) {
lines(c(0, 0.1), rep(1 - 0.05*i, 2), col = cols[i], lwd = 2)
text(0.14, 1 - 0.05*i, labels = paste("n =", i + 15))
}
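Each curve above is only one of the ten equally likely possibilities. One way to use such a row in the analysis is to average these ten binomial likelihoods; the `marginal_lik` helper below is our own sketch of that computation, evaluated on a grid of \(\theta\) values:

# Sketch: unnormalized likelihood of theta for a row with g graduates and a
# suppressed cohort size, averaging the binomial likelihood over the ten
# equally likely cohort sizes n = g, g+1, ..., g+9.
marginal_lik <- function(theta, g) {
  sapply(theta, function(t) mean(dbinom(g, g:(g + 9), t)))
}

# Evaluate on a grid for the example above with 16 graduates.
theta_grid <- seq(0, 1, length.out = 301)
head(marginal_lik(theta_grid, 16))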