Abstract: The Every Student Succeeds Act (ESSA), enacted in 2015, requires states to provide data “that can be cross-tabulated by, at a minimum, each major racial and ethnic group, gender, English proficiency status, and children with or without disabilities,”1 taking care not to reveal personally identifiable information about any individual student. As state education agencies come into compliance with ESSA, they will be publishing more and more datasets which at least partially suppress or omit data to protect student privacy. In this article we will give an example of how suppressed data can be analyzed from a Bayesian perspective using cross-tabulated data on graduation rates recently released by the Oregon Department of Education.

1 Introduction

The Oregon Department of Education tracks each incoming group of high school freshmen—called a cohort—over the course of their time in high school and calculates the percentage of those students who graduate within four years.2 This graduation rate has been increasing steadily since the 2008-2009 school year.3

| School year | Four-year cohort graduation rate | Which grad rate definition? |
|-------------|----------------------------------|-----------------------------|
| 2008-09     | 66.2%                            | pre-2014                    |
| 2009-10     | 66.38%                           | pre-2014                    |
| 2010-11     | 67.65%                           | pre-2014                    |
| 2011-12     | 68.44%                           | pre-2014                    |
| 2012-13     | 68.66%                           | pre-2014                    |
| 2013-14     | 71.98%                           | post-2014                   |
| 2014-15     | 73.82%                           | post-2014                   |
| 2015-16     | 74.83%                           | post-2014                   |
| 2016-17     | 76.65%                           | post-2014                   |

This data can be found in ODE’s Graduation Reports, which also break down each cohort by school, and then by several subcategories such as gender, race/ethnicity, and status as economically disadvantaged. However, it’s not possible to see, for example, how many Hispanic/Latino students who graduated are economically disadvantaged, which might help explain why they have a higher or lower graduation rate compared to other races/ethnicities. And while the reports do calculate graduation rates for students with/without disabilities, students who are/aren’t English learners, students who are/aren’t homeless, etc., we can’t determine the graduation rates for students in multiple such subgroups or for students in none of them.

On December 31, 2018, ODE released a cross-tabulated dataset for the 2016-17 cohort which contains cohort and graduate counts for every combination of the following categories: race/ethnicity, gender, English learner status, disability status, homelessness, and economically disadvantaged status.

Using this new data we can attempt to disentangle these factors from one another. The questions we will focus on are: how do graduation rates differ between genders, across races/ethnicities, and in the presence or absence of each of the other factors, when all other factors in the above list are equal?

2 The data

The cross-tabulated dataset can be downloaded from ODE’s School and District Profiles: Accountability Measures page under “Cohort Graduation Rates / Dropout Rates” with the title “Cross-Tabulated 2016-17 Four-year Cohort Graduation Rates”.

To protect the privacy of the students, some of the data in this file has been suppressed.4 If a count covers fewer than 10 students, an asterisk (*) is entered in its place. So if fewer than 10 students in a group graduated, there is an * instead of the number of graduates; and if fewer than 10 students did not graduate, there is an * instead of the size of the cohort, since publishing it alongside the number of graduates would reveal the small number of non-graduates by subtraction.

Rather than throw these rows away, we will show in section 3 how they can still be used in a Bayesian model—even the rows where both the size of the cohort and the number of graduates are suppressed. In fact we will argue that ignoring those double-suppressed rows can bias the inferred graduation rates toward the extremes of 0% and 100%.

Before loading the spreadsheet into R, take a quick look at it: whenever the number of graduates is suppressed, so is the size of the cohort. This quirk of the dataset will allow us to write a simpler model than if there were rows with unsuppressed cohort sizes but suppressed numbers of graduates.

Now let’s load the spreadsheet. If the cohort size is suppressed but the number of graduates isn’t, we’ll put a 1 for the cohort size. If both the cohort size and the number of graduates are suppressed, we’ll put a 0 for the cohort size. This will let us treat the two cases differently in the analysis.

Then we format the data for the models, giving the columns short names and using 0 and 1 for the binary categories. We’ll also assign each race/ethnicity a unique ID. A sketch of both steps is given below.
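Here is a minimal sketch of these two steps in R. The file name and the column names (Graduates, Cohort, Gender, RaceEthnicity) are assumptions for illustration, not the actual headers in ODE’s spreadsheet.

library(readxl)

raw <- read_excel("cross_tab_grad_rates.xlsx")   # hypothetical file name

# convert a column containing "*" suppression markers to integer counts,
# filling the suppressed entries with `fill`
to_count <- function(x, fill) {
    x <- as.character(x)
    out <- rep.int(fill, length(x))
    out[x != "*"] <- as.integer(x[x != "*"])
    out
}

grad <- data.frame(
    graduates    = to_count(raw$Graduates, fill = 0L),   # 0 is only a placeholder
    cohort       = to_count(raw$Cohort,    fill = 1L),   # 1 = cohort suppressed, graduates known
    female       = as.integer(raw$Gender == "Female"),
    race_ethn_id = as.integer(factor(raw$RaceEthnicity))
    # ...plus 0/1 indicators for English learner, disability, homeless,
    # and economically disadvantaged status
)

# rows where both counts were suppressed get cohort = 0
grad$cohort[raw$Cohort == "*" & raw$Graduates == "*"] <- 0L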

3 The heart of the models

We will model the number of students who graduate as a binomial random variable,

\[ \textrm{graduates}_i \sim \operatorname{Binomial}(\textrm{cohort}_i, \theta). \]

Of course, we can only use this likelihood if we know both the size of the cohort and the number of graduates.

3.1 Partially suppressed rows

If we know the number of graduates in some row \(i\) but the cohort size has been suppressed, then all we know is that \(\textrm{cohort}_i\) is between \(\textrm{graduates}_i\) and \(\textrm{graduates}_i + 9\) inclusive, with all possibilities being equally likely.

Suppose we have 16 graduates with an unknown cohort size. Assuming a uniform prior on the graduation rate \(\theta\), there are ten possibilities for its posterior distribution, one for each possible cohort size from 16 to 25.


To account for all of these possibilities, we average the likelihood over the distribution of cohort sizes. That is, we marginalize \(\textrm{cohort}_i\) out of the likelihood:

\[ p(\textrm{graduates}_i \mid \theta) = \frac{1}{10} \sum_{n = 0}^{9} \operatorname{Binomial}(\textrm{graduates}_i \mid \textrm{graduates}_i + n, \theta). \]

This gives us a new posterior for the graduation rate \(\theta\) which incorporates our uncertainty about the size of the cohort.
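As an illustration (this is a sketch, not the code from the original post), here is how this averaged likelihood and the resulting posterior can be computed on a grid for the example with 16 graduates:

theta <- seq(0, 1, length.out = 1001)

# average the binomial likelihood over the ten possible cohort sizes 16, ..., 25
lik <- rowMeans(sapply(0:9, function(n) dbinom(16, size = 16 + n, prob = theta)))

# under a uniform prior the posterior is proportional to this marginal likelihood
posterior <- lik / (sum(lik) * (theta[2] - theta[1]))

plot(theta, posterior, type = "l",
     xlab = expression(theta), ylab = "posterior density")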

This posterior is much wider than any of the individual posteriors for specific sizes of the cohort. That said, it still contributes information to the model—far more than if the row was ignored completely.

3.2 Completely suppressed rows

It’s easy to assume (like I did when I started this project) that the rows where both the size of the cohort and the number of graduates are suppressed contain no information about the graduation rate. But because of the particular suppression rules that were used, it turns out that if we count up all the ways where fewer than 10 students graduate and fewer than 10 students don’t, there are more ways for roughly half of the students to graduate than there are for nearly all of them to graduate or for nearly all of them not to graduate.

We use the same idea from the last section, marginalizing out both the size of the cohort \(n\) and the number of graduates \(k\) from the likelihood. For each \(k\) in \(0,\ 1,\ \ldots,\ 9\), there are ten options for \(n\): \(k,\ k+1,\ \ldots,\ k+9\), with all of these pairs of \(k\) and \(n\) being equally likely. Averaging over these possibilities, we end up with a distribution for \(\theta\):

\[ p(\theta) = \frac{1}{100} \sum_{k=0}^{9} \sum_{n = k}^{k+9} \operatorname{Binomial}(k \mid n, \theta). \]

If we assume a uniform prior on \(\theta\), we get the following posterior for the graduation rate \(\theta\).
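A sketch of that computation on a grid (again, not the original code):

theta <- seq(0, 1, length.out = 1001)

# likelihood contribution of a completely suppressed row
p_theta <- rep(0, length(theta))
for (k in 0:9) {
    for (n in k:(k + 9)) {
        p_theta <- p_theta + dbinom(k, size = n, prob = theta)
    }
}
p_theta <- p_theta / 100

# normalize to get the posterior density under a flat prior
posterior <- p_theta / (sum(p_theta) * (theta[2] - theta[1]))

plot(theta, posterior, type = "l",
     xlab = expression(theta), ylab = "posterior density")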

There is clearly more posterior probability near the graduation rate 0.5 than near the extreme rates 0 and 1.

Each of the 134 completely suppressed rows in the data contributes a factor like \(p(\theta)\) to the posterior. In the first model we fit below, where we aim to estimate the overall graduation rate for all students, the completely suppressed rows contribute a factor of \(p(\theta)^{134}\) in the posterior, shown below.

Essentially the effect of including this factor is the difference between using \(p(\theta)^{134}\) as a prior for \(\theta\) and using a uniform, flat prior for \(\theta\). Consequently, failing to include the completely suppressed rows will bias the inferred graduation rate away from 0.5 and toward the extreme rates 0 and 1. In other words, if the true graduation rate is higher than 0.5, then by ignoring the completely suppressed rows you will tend to overestimate the rate, and if the true rate is lower than 0.5 you will tend to underestimate the rate.

This bias is most pronounced in models which attempt to differentiate between student groups. To illustrate, we will fit our most complex model below twice, first including every row in the data and then again omitting any rows with suppressions, and compare the estimated parameters from each fit.

3.3 Further reading

For more information about marginalizing discrete parameters out of the likelihood and some additional examples, see the section titled “Latent Discrete Parameters” (section 7) in the Stan User’s Guide.

4 The four models

We will fit four increasingly complex models building on the “heart” described above using the Stan sampling engine.

In the first model we will estimate the overall graduation rate of the cohort. In the second we will compare the graduation rates of the racial-ethnic groups in the data, controlling for gender and the main effects of being an English learner, having a disability, being homeless, and being economically disadvantaged. In the third model we will include interactions among these four burdening factors. After fitting these three models we will carry out an analysis comparing the estimated predictive power of each model using the PSIS-LOO information criterion, and finally we will fit a fourth model incorporating what we have learned from the comparison.

The Stan code for each of these models can be downloaded here. We will make references to it throughout.

4.1 Implementing the likelihood

As described in section 3, we will use a Binomial likelihood in our models, marginalizing out certain variables when they are suppressed in the data.

Below is the Stan implementation of the log-likelihood5 \(\log p(k,n \mid \mu)\), where \(\mu = \operatorname{logit}(\theta)\). If the cohort n is 0 or 1 then we use the appropriate marginalized log-likelihood. Since everything is on the logarithmic scale, summing is done using Stan’s log_sum_exp() function, which is defined on a vector \(\mathbf a = (a_i)\) by

\[ \texttt{log_sum_exp}(\mathbf a) = \log \sum_i \exp a_i. \]

In the case that both cohort size and number of graduates are suppressed, we used a computer algebra system to simplify the function \(p(\theta)\) from section 3.2. Consequently we only need to sum 19 terms instead of the original 100. This leads to a huge speedup in sampling.

// Stan code
real binomial_rsupp_lpmf(int k, int n, real mu) {
    if (n > 1) {
        // nothing suppressed
        return binomial_logit_lpmf(k | n, mu);
    } else if (n == 1) {
        // cohort suppressed, but not graduates
        vector[10] lp;
        for (j in 0:9)
            lp[j+1] = binomial_logit_lpmf(k | k + j, mu);
        return log(0.1) + log_sum_exp(lp);
    } else {
        // both cohort and graduates suppressed
        vector[19] lp = [0,
                         log(19) + mu,
                         2*log(3) + log(19) + 2*mu,
                         log(3) + log(17) + log(19) + 3*mu,
                         2*log(2) + log(3) + log(17) + log(19) + 4*mu,
                         2*log(2) + 2*log(3) + log(17) + log(19) + 5*mu,
                         2*log(2) + log(3) + log(7) + log(17) + log(19) + 6*mu,
                         2*log(2) + log(3) + log(13) + log(17) + log(19) + 7*mu,
                         log(2) + 2*log(3) + log(13) + log(17) + log(19) + 8*mu,
                         log(2) + log(11) + log(13) + log(17) + log(19) + 9*mu,
                         log(2) + 2*log(3) + log(13) + log(17) + log(19) + 10*mu,
                         2*log(2) + log(3) + log(13) + log(17) + log(19) + 11*mu,
                         2*log(2) + log(3) + log(7) + log(17) + log(19) + 12*mu,
                         2*log(2) + 2*log(3) + log(17) + log(19) + 13*mu,
                         2*log(2) + log(3) + log(17) + log(19) + 14*mu,
                         log(3) + log(17) + log(19) + 15*mu,
                         2*log(3) + log(19) + 16*mu,
                         log(19) + 17*mu,
                         18*mu]';
        return log(0.1) - 18*log_sum_exp(0, mu) + log_sum_exp(lp);
    }
}

I factored the coefficients of each term because I figured computation and summation of a few smaller logarithms would be numerically safer than a single logarithm of a larger number, but it probably doesn’t make a difference here.
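As a quick numerical sanity check (not part of the original post), we can confirm in R that the 19-term polynomial above agrees with the direct double sum from section 3.2; the coefficient vector below is just the Stan coefficients written out as integers.

# direct double sum from section 3.2
direct <- function(theta) {
    total <- 0
    for (k in 0:9)
        for (n in k:(k + 9))
            total <- total + dbinom(k, size = n, prob = theta)
    total / 100
}

# coefficients of the simplified polynomial, read off from the Stan code above
coefs <- c(1, 19, 171, 969, 3876, 11628, 27132, 50388, 75582, 92378,
           75582, 50388, 27132, 11628, 3876, 969, 171, 19, 1)

simplified <- function(theta) {
    x <- theta / (1 - theta)   # exp(mu), where mu = logit(theta)
    0.1 * sum(coefs * x^(0:18)) / (1 + x)^18
}

theta_test <- c(0.1, 0.5, 0.9)
cbind(direct     = sapply(theta_test, direct),
      simplified = sapply(theta_test, simplified))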

4.2 Model 1: Recovering the overall graduation rate

The first model we fit will simply estimate the overall graduation rate \(\theta\).

##     mean se_mean    sd  5.5% 94.5% n_eff Rhat
## mu 1.209       0 0.011 1.192 1.227 25399    1

Since \(\mu = \operatorname{logit}(\theta)\), we apply the inverse-logit to mu to recover the graduation rate.
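For example, applying R’s inverse-logit function plogis() to the 5.5% quantile, the mean, and the 94.5% quantile of mu above gives a graduation rate of roughly 77.0%, with an 89% interval of about 76.7% to 77.3%:

plogis(c(1.192, 1.209, 1.227))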

We know from the Graduation Reports (see the introduction) that the overall graduation rate was 76.65%, so this estimate isn’t too far off.

4.3 Model 2: Including racial-ethnic group, gender, and burdening factors

Now we’ll fit a model which includes a separate intercept for each racial-ethnic group, a term for an indicator of whether a row’s group consists of female students, and a term for each of the indicators English learner, disability, homeless, and economically disadvantaged:

\[ \begin{align} \operatorname{logit}(\theta_i) &= a + a_{\textit{re}}[\textrm{race-ethn}_i] + b_{\textit{f}} \times \textrm{female}_i + b_{\textit{el}} \times \textrm{eng_learn}_i + b_{\textit{d}} \times \textrm{disability}_i \\ &\qquad + b_{\textit{h}}\times\textrm{homeless}_i + b_{\textit{ed}}\times\textrm{econ_disad}_i. \end{align} \]

Because our aim is to compare the graduation rates of different races/ethnicities, and because of huge differences in sample sizes between them6, we will place a multilevel structure on the group of intercepts relating to race-ethnicity of the form

\[ a_{\textit{re}}[\textrm{race-ethn}_i] \sim \operatorname{Normal}(0, \sigma), \]

with weakly informative priors on \(a\) and \(\sigma\):

\[ \begin{align} a &\sim \operatorname{Normal}(0, 4), \\ \sigma &\sim \operatorname{Gamma}(1.5, 1). \end{align} \]

All of the other parameters are assigned fixed \(\operatorname{Normal}(0, 2)\) priors.

##          mean se_mean   sd  5.5% 94.5% n_eff Rhat
## a_re[1] -0.65       0 0.24 -1.02 -0.29 14191    1
## a_re[2]  0.81       0 0.24  0.45  1.19 14256    1
## a_re[3] -0.15       0 0.23 -0.51  0.21 13835    1
## a_re[4]  0.05       0 0.23 -0.29  0.41 13283    1
## a_re[5]  0.01       0 0.23 -0.34  0.37 13551    1
## a_re[6]  0.03       0 0.26 -0.37  0.45 17373    1
## a_re[7] -0.07       0 0.23 -0.41  0.28 13172    1
## a        1.82       0 0.23  1.47  2.17 13208    1
## b_f      0.30       0 0.02  0.26  0.34 71826    1
## b_el    -0.77       0 0.06 -0.86 -0.68 71740    1
## b_d     -0.78       0 0.03 -0.83 -0.73 70427    1
## b_h     -0.98       0 0.04 -1.04 -0.92 75740    1
## b_ed    -0.73       0 0.03 -0.78 -0.69 73012    1
## sigma    0.56       0 0.21  0.32  0.93 28051    1

Each of the coefficients \(b_{\textit{el}}\), \(b_{\textit{d}}\), \(b_{\textit{h}}\), and \(b_{\textit{ed}}\) is negative, indicating that students with any of the four burdening factors are expected to have a lower graduation rate, regardless of gender and racial/ethnic background.

The coefficient \(b_{\textit{f}}\) is positive, indicating that female students are expected to have higher graduation rates, regardless of racial/ethnic background or burdening factors.
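As a rough way to read these coefficients, we can exponentiate the posterior means to get approximate multiplicative changes in the odds of graduating (a proper summary would transform the posterior draws, but the point estimates already tell the story):

# approximate odds ratios from the posterior means of model 2;
# e.g. exp(-0.77) is roughly 0.46, i.e. English learners have a little under
# half the graduation odds of otherwise-similar non-English learners
exp(c(b_f = 0.30, b_el = -0.77, b_d = -0.78, b_h = -0.98, b_ed = -0.73))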

4.4 Model 3: Including interactions between burdens

In this third model we will include terms which capture all interactions between the four burdening factors (English learner, with disability, homeless, and economically disadvantaged).

We can express the model succinctly using brms syntax7 as

graduates | trials(cohort) ~ (1 | race_ethn_id) + female + eng_learn*disability*homeless*econ_disad

Since it uses the same predictors, this model can use the same data frame that was created for model 2.

##              mean se_mean   sd  5.5% 94.5%  n_eff Rhat
## a_re[1]     -0.65       0 0.24 -1.03 -0.28  19877    1
## a_re[2]      0.87       0 0.24  0.50  1.25  19993    1
## a_re[3]     -0.13       0 0.24 -0.50  0.24  19230    1
## a_re[4]      0.02       0 0.23 -0.34  0.38  18687    1
## a_re[5]     -0.01       0 0.24 -0.38  0.35  18954    1
## a_re[6]      0.04       0 0.27 -0.37  0.46  23572    1
## a_re[7]     -0.08       0 0.23 -0.44  0.27  18595    1
## a            1.94       0 0.23  1.58  2.30  18727    1
## b_f          0.30       0 0.02  0.27  0.34 114216    1
## b_el        -2.24       0 0.18 -2.54 -1.95  62583    1
## b_d         -1.14       0 0.06 -1.23 -1.04  73179    1
## b_h         -2.66       0 0.15 -2.91 -2.43  65066    1
## b_ed        -0.84       0 0.03 -0.89 -0.79  90826    1
## b_el_d       1.14       0 0.53  0.29  1.94  53121    1
## b_el_h       2.38       0 0.64  1.28  3.25  38150    1
## b_el_ed      1.18       0 0.19  0.87  1.49  62568    1
## b_d_h        1.45       0 0.55  0.54  2.24  49054    1
## b_d_ed       0.31       0 0.07  0.20  0.42  71204    1
## b_h_ed       1.69       0 0.16  1.44  1.94  64731    1
## b_el_d_h    -0.92       0 0.92 -2.33  0.57  42751    1
## b_el_d_ed   -0.04       0 0.54 -0.85  0.84  52668    1
## b_el_h_ed   -1.63       0 0.66 -2.55 -0.52  39064    1
## b_d_h_ed    -1.07       0 0.56 -1.87 -0.15  48878    1
## b_el_d_h_ed  0.03       0 0.95 -1.52  1.49  42984    1
## sigma        0.58       0 0.22  0.33  0.96  37794    1

To illustrate the biasing effect of ignoring the rows in which data has been suppressed, let’s fit the model again using only the rows which are complete.

##              mean se_mean   sd  5.5% 94.5%  n_eff Rhat
## a_re[1]     -0.68    0.00 0.25 -1.07 -0.29  14644    1
## a_re[2]      0.90    0.00 0.26  0.52  1.31  14731    1
## a_re[3]     -0.13    0.00 0.25 -0.52  0.25  14086    1
## a_re[4]      0.02    0.00 0.24 -0.35  0.40  13686    1
## a_re[5]     -0.01    0.00 0.25 -0.39  0.37  13907    1
## a_re[6]      0.02    0.00 0.29 -0.42  0.48  19238    1
## a_re[7]     -0.08    0.00 0.24 -0.45  0.30  13623    1
## a            1.93    0.00 0.25  1.56  2.31  13678    1
## b_f          0.31    0.00 0.02  0.27  0.35 119321    1
## b_el        -2.59    0.00 0.23 -2.95 -2.22  61913    1
## b_d         -1.14    0.00 0.06 -1.24 -1.05  63727    1
## b_h         -2.76    0.00 0.16 -3.01 -2.51  58171    1
## b_ed        -0.84    0.00 0.03 -0.89 -0.80  90050    1
## b_el_d       0.56    0.01 1.42 -1.71  2.84  59710    1
## b_el_h       0.36    0.01 1.42 -1.91  2.62  60601    1
## b_el_ed      1.52    0.00 0.24  1.14  1.90  61724    1
## b_d_h        0.19    0.01 1.41 -2.06  2.45  60017    1
## b_d_ed       0.32    0.00 0.07  0.20  0.43  61711    1
## b_h_ed       1.78    0.00 0.16  1.52  2.04  57795    1
## b_el_d_h    -0.49    0.01 1.42 -2.76  1.78  59119    1
## b_el_d_ed    0.56    0.01 1.42 -1.71  2.83  59873    1
## b_el_h_ed    0.36    0.01 1.41 -1.90  2.63  60704    1
## b_d_h_ed     0.19    0.01 1.41 -2.07  2.43  60137    1
## b_el_d_h_ed -0.47    0.01 1.42 -2.74  1.81  59371    1
## sigma        0.60    0.00 0.23  0.35  1.00  32329    1

Almost every interaction coefficient has a different estimate from the ones fit to the full dataset. Even the main effects of English learner status and homeless status (b_el and b_h) are slightly different.

4.5 Comparing the models

In this section we will use the PSIS-LOO information criterion8 to estimate and compare the predictive power of our three models. While this type of analysis is always important as a check against potentially overfitting the data, it’s especially important in the current setting of working with suppressed data. It will give us an idea of how “surprised” our models would be if they were to somehow learn the true data hidden inside a previously suppressed row, and how well our models are able to predict rows representing large cohorts just based on the patterns in other rows.

To prepare the models for this comparison we computed the log-likelihood of the full dataset in each model (see the generated quantities {} blocks of the Stan programs). This means we can pass our stanfit objects directly to the rstan::loo() function, which uses the loo package to compute the PSIS-LOO estimate for each of our models.
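A sketch of this step, assuming the fitted stanfit objects are named fit1, fit2, and fit3 (with recent versions of the loo package, loo_compare() plays the role of compare()):

library(rstan)
library(loo)

# rstan's loo() method reads the log_lik values from generated quantities
loo1 <- loo(fit1)
loo2 <- loo(fit2)
loo3 <- loo(fit3)

compare(loo1, loo2, loo3)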

##      elpd_diff elpd_loo se_elpd_loo p_loo   se_p_loo looic   se_looic
## loo3     0.0    -653.4     41.7        55.6    12.4   1306.8    83.5 
## loo2  -182.6    -836.0     77.5        87.7    24.9   1672.0   155.0 
## loo1 -2097.8   -2751.2    631.6       182.7    97.2   5502.4  1263.2

Here the model with the smallest looic is expected to have the most predictive power. This comparison indicates that the third model, which included interactions between burdening factors, is moderately better than the second model and much better than the first.

The Pareto shape parameter k, computed as part of PSIS-LOO, serves as a self-diagnostic of the estimate. If k is too high, then PSIS-LOO is not guaranteed to give a good estimate of the expected log pointwise predictive density. This is most often the case when certain data points are disproportionately influential.

For us, these influential data points are usually going to be rows which contain many students. Every such row can have a large impact on the inferred graduation rate.

In the three graphs below, one for each model, the shape parameter k is plotted for every row in the data. If a k is higher than 0.5, the corresponding row number from the data is displayed.

According to Vehtari et al. (2017), when k < 0.5 the PSIS-LOO estimate converges quickly, and good performance has been observed for values of k up to 0.7. But when k is higher than 0.7, Vehtari et al. do not recommend relying on the PSIS-LOO estimates. There are indeed several rows in the data which have k values higher than 0.7, so we should expect the looic estimates above to be inaccurate to some degree.

Let’s see which rows give PSIS-LOO trouble.
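Assuming the loo objects from above, the offending rows can be pulled out with the same helper that the reloo() function below relies on:

# row numbers in the data whose Pareto k exceeds 0.7 for the third model
bad_rows <- pareto_k_ids(loo3, threshold = 0.7)
bad_rows

# inspect those rows (using the grad data frame from the loading sketch above)
grad[bad_rows, ]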

Nearly all of these rows represent large cohorts and consequently have large individual influence on the posteriors. It’s important to correct the PSIS-LOO estimates for these rows to better reflect how well our model captures the patterns in large cohorts just based on seeing the patterns in smaller cohorts.

4.5.1 Replacing unreliable PSIS-LOO estimates with explicit cross-validations

One of the recommendations made by Vehtari et al. (2017) is to perform explicit cross validations for any rows in the data for which the Pareto shape parameter k is higher than 0.7. This has been implemented in the rstanarm and brms packages, but not for generic Stan fits using rstan, so we will need to write it ourselves. But first we need to talk about bias.

In leave-one-out cross validation, the model is refit using all but one row of the data, and a score is given for how well the refit model predicts the left-out row. But since this prediction is conditioned on one less row of data, there is some bias in the given score. Unlike rstanarm or brms, we will include an approximate correction for this bias9 because the rows we will be leaving out carry much more weight than an “average” row in our data. Otherwise we borrow heavily from the reloo() implementations in those packages.

log_sum_exp <- function(x) {
    max_x <- max(x)
    max_x + log(sum(exp(x - max_x)))
}

log_mean_exp <- function(x) {
    log_sum_exp(x) - log(length(x))
}

# For reference, see:
# https://github.com/stan-dev/rstanarm/blob/master/R/loo.R#L655
# https://github.com/paul-buerkner/brms/blob/master/R/loo-helpers.R#L432
reloo <- function(loo_out, model, data, metadata, completedata, k_threshold = 0.7) {
    bad_obs <- pareto_k_ids(loo_out, k_threshold)
    n_bad <- length(bad_obs)
    
    if (!length(bad_obs)) {
        message(
            "All pareto_k estimates below threshold of ",
            k_threshold,
            ". \nReturning loo object."
        )
        return(loo_out)
    }
    
    lls <- vector("list", n_bad)
    message(
        n_bad, " problematic observation(s) found.",
        "\nModel will be refit ", n_bad, " times."
    )
    
    for (j in seq_len(n_bad)) {
        message(
            "\nFitting model ", j, " out of ", n_bad,
            " (leaving out observation ", bad_obs[j], ")"
        )
        omitted <- bad_obs[j]
        
        # remove the bad observation from the data
        data_omitted <- data
        metadata_omitted <- metadata
        for (k in seq_len(length(data)))
            data_omitted[[k]] <- data_omitted[[k]][-omitted]
        metadata_omitted$N <- metadata_omitted$N - 1
        
        # refit the model on the new data
        fit_j <- stan(
            fit = model,
            data = c(data_omitted, metadata_omitted, completedata),
            iter = n_iter,
            warmup = n_warmup
        )
        
        # extract the log-likelihoods for the complete dataset using the
        # leave-one-out posterior
        lls[[j]] <- extract_log_lik(fit_j)
    }
    
    # compute \hat{lpd}_j for each of the held out observations (using log-lik
    # matrix from full posterior, not the leave-one-out posteriors)
    ll_x <- extract_log_lik(model)[, bad_obs, drop = FALSE]
    hat_lpd <- apply(ll_x, 2, log_mean_exp)
    
    # compute elpd_{cloo,j} for each of the held out observations
    log_pp_holdouts <- lapply(lls, function(ll_i) apply(ll_i, 2, log_mean_exp))
    b <- hat_lpd - unlist(lapply(log_pp_holdouts, mean))
    elpd_cloo <- sapply(seq_len(n_bad), function(i) log_pp_holdouts[[i]][bad_obs[i]] + b[i])
    # compute effective number of parameters
    p_loo <- hat_lpd - elpd_cloo
    
    # replace parts of the loo object with these computed quantities
    sel <- c("elpd_loo", "p_loo", "looic")
    loo_out$pointwise[bad_obs, sel] <- cbind(elpd_cloo, p_loo, -2 * elpd_cloo)
    loo_out$estimates[sel, "Estimate"] <- with(loo_out, colSums(pointwise[,sel]))
    loo_out$estimates[sel, "SE"] <- with(loo_out, {
        N <- nrow(pointwise)
        sqrt(N * apply(pointwise[,sel], 2, var))
    })
    loo_out$diagnostics$pareto_k[bad_obs] <- NA
    return(loo_out)
}

Now we can run this for our three fits to get updated estimates of their looic.
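A sketch of this step; the names of the Stan data lists (standata, metadata, completedata) are assumptions standing in for whatever was passed to stan() when the models were first fit:

reloo1 <- reloo(loo1, fit1, data = standata, metadata = metadata, completedata = completedata)
reloo2 <- reloo(loo2, fit2, data = standata, metadata = metadata, completedata = completedata)
reloo3 <- reloo(loo3, fit3, data = standata, metadata = metadata, completedata = completedata)

compare(reloo1, reloo2, reloo3)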

##        elpd_diff elpd_loo se_elpd_loo p_loo   se_p_loo looic   se_looic
## reloo3     0.0    -679.0     48.3        81.2    22.5   1357.9    96.5 
## reloo2  -214.5    -893.5     95.3       145.2    51.1   1787.0   190.7 
## reloo1 -3003.2   -3682.2   1156.4      1113.7   640.7   7364.4  2312.8

The largest difference in these new estimates is a much higher looic for the first model (which just estimated the overall graduation rate). The other two looic estimates have not changed much, but we can be more confident in them now. We can also compute the difference in elpd_loo between the second and third models together with its standard error:

## elpd_diff        se 
##     214.5      73.9

As before, the third model, which incorporates interactions between the burdening factors, is expected to have more predictive power than the second model.

4.5.2 Why is pLOO so high?

In the PSIS-LOO output, the quantity p_loo is an estimate of the effective number of parameters in the model. For a well-specified multilevel model this should be lower than the number of actual parameters. However, each of the three p_loo values above is higher than the number of actual parameters in the corresponding model. This indicates that the models are all misspecified to some degree. Apparently they are encountering a wider variety of graduation rates than the structures of the models would generate.

There are a number of plausible reasons for this. It may be a consequence of the uncertainty surrounding suppressed rows, or it may be something more fundamental. For example, English learner status may affect different races/ethnicities in different ways, perhaps in relative ease of access to different language aids or in a student being more or less isolated from other students who speak the same first language. These models aim to estimate the average graduation rates of the different races/ethnicities after subtracting out the average effects of gender and the four burdening factors across all races/ethnicities, and so could not capture these kinds of differences.

Regardless of what is causing this overdispersion of graduation rates, we should measure it and incorporate it into our posterior uncertainty around any inferred graduation rate.

4.6 Model 4: Accounting for residual overdispersion

The final modification we will make to our models is to estimate a separate intercept for each row in the data. This is not so crazy: we should already expect every row in the data to have a (somewhat) unique graduation rate corresponding to its particular combination of predictors. These intercepts will be assigned an adaptive normal prior, as in the following formula using brms syntax.

graduates | trials(cohort) ~ (1 | obs) + (1 | race_ethn_id) + female + ...terms for burdening factors...

Here the obs variable would have a different level for each row in the data.

The benefit of this is twofold. Not only will we be able to directly measure the residual overdispersion using the standard deviation of the adaptive normal prior, but this will partially pool information among rows in the data, enhancing the estimates for suppressed rows using information from the unsuppressed rows.

##              mean se_mean   sd  5.5% 94.5% n_eff Rhat
## a_re[1]     -0.58       0 0.25 -0.97 -0.20 16721    1
## a_re[2]      0.86       0 0.26  0.48  1.27 16548    1
## a_re[3]     -0.14       0 0.25 -0.52  0.24 16064    1
## a_re[4]     -0.04       0 0.24 -0.40  0.33 15363    1
## a_re[5]     -0.01       0 0.24 -0.38  0.37 15744    1
## a_re[6]      0.02       0 0.28 -0.41  0.46 20789    1
## a_re[7]     -0.04       0 0.24 -0.41  0.32 15131    1
## a            1.70       0 0.25  1.32  2.07 15096    1
## b_f          0.22       0 0.07  0.11  0.33 54645    1
## b_el        -1.89       0 0.22 -2.24 -1.53 55056    1
## b_d         -0.98       0 0.15 -1.22 -0.75 39491    1
## b_h         -2.32       0 0.23 -2.68 -1.95 43523    1
## b_ed        -0.50       0 0.11 -0.68 -0.32 31057    1
## b_el_d       0.96       0 0.55  0.08  1.80 58046    1
## b_el_h       2.04       0 0.63  1.00  2.95 43678    1
## b_el_ed      0.82       0 0.26  0.41  1.23 46932    1
## b_d_h        1.26       0 0.57  0.33  2.12 52068    1
## b_d_ed       0.12       0 0.18 -0.17  0.41 34840    1
## b_h_ed       1.36       0 0.25  0.96  1.77 39233    1
## b_el_d_h    -0.81       0 0.91 -2.23  0.68 49009    1
## b_el_d_ed    0.18       0 0.59 -0.73  1.13 54243    1
## b_el_h_ed   -1.32       0 0.67 -2.30 -0.22 43914    1
## b_d_h_ed    -0.96       0 0.59 -1.85  0.01 49656    1
## b_el_d_h_ed -0.01       0 0.97 -1.58  1.51 49769    1
## sigma        0.56       0 0.22  0.32  0.96 33744    1
## sigma_obs    0.23       0 0.03  0.18  0.29 34354    1

The posterior estimate of the residual standard deviation on the logit scale, sigma_obs, is roughly 0.23.

5 Results

In this section we will analyze and interpret the posterior parameter estimates from our models.

5.1 Gender and graduation rate

First we will compare the graduation rates of male and female students.

By exponentiating the parameter b_f we obtain the proportional change in graduation odds of female students compared to male students. Let’s plot the posterior for exp(b_f), indicating its median with a turquoise dot and its 89% probability density interval with a turquoise line.
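A sketch of the computation behind the plot, assuming the final model’s stanfit object is named fit4; a central 89% interval is used here, which will be close to (but not identical to) the density interval in the plot:

b_f_draws  <- rstan::extract(fit4, pars = "b_f")$b_f
odds_ratio <- exp(b_f_draws)

median(odds_ratio)
quantile(odds_ratio, probs = c(0.055, 0.945))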

The model estimates that females have roughly 25% higher odds of graduation than male students, controlling for the average effects of race/ethnicity and the presence of burdening factors. It assigns 89% probability that this proportional change in odds is between 12% and 38%.

Next we’ll plot the posteriors of the graduation rates themselves. We’ll incorporate the variation among races as well as the additional uncertainty estimated from the residual overdispersion in the data.

How can we interpret these posteriors? Suppose we’re a teacher, and we’re getting a new student. We know the student doesn’t have any of the four burdens, but we don’t know their racial-ethnic background. Before meeting them, should we guess that student probably won’t graduate? Definitely not. Should we guess there’s a 50/50 chance they’ll graduate? That would also be silly, considering the overall graduation rate is over 75%, so they probably have at least a 75% chance to graduate.

What these curves show is the plausibility of each of the different %-chances-to-graduate for this student. The colored dots indicate medians: it’s equally likely that the student’s %-chance-to-graduate lies above or below those points.

Finally we will estimate the posterior for the graduation rate of an unburdened student, taking into account the observed variation across races and between genders. To do this we combine the posteriors plotted above proportionally according to the actual number of male and female students from the Graduation Report.

Just as above, this curve represents the plausibility of different %-chances-to-graduate for an unburdened student of unknown race and unknown gender.

Using this we could, for example, compute the posterior probability that the %-chance-to-graduate for unburdened students is higher than 80% after accounting for variation in race/ethnicity and gender:
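Assuming rate_unburdened is the vector of posterior draws of this combined %-chance-to-graduate, the computation is a one-liner:

mean(rate_unburdened > 0.80)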

## [1] 0.7475

5.2 The effects of the burdening factors

Let’s compare the population effects estimated by the second model and the final model.

##          mean se_mean   sd  5.5% 94.5% n_eff Rhat
## a_re[1] -0.65       0 0.24 -1.02 -0.29 14191    1
## a_re[2]  0.81       0 0.24  0.45  1.19 14256    1
## a_re[3] -0.15       0 0.23 -0.51  0.21 13835    1
## a_re[4]  0.05       0 0.23 -0.29  0.41 13283    1
## a_re[5]  0.01       0 0.23 -0.34  0.37 13551    1
## a_re[6]  0.03       0 0.26 -0.37  0.45 17373    1
## a_re[7] -0.07       0 0.23 -0.41  0.28 13172    1
## a        1.82       0 0.23  1.47  2.17 13208    1
## b_f      0.30       0 0.02  0.26  0.34 71826    1
## b_el    -0.77       0 0.06 -0.86 -0.68 71740    1
## b_d     -0.78       0 0.03 -0.83 -0.73 70427    1
## b_h     -0.98       0 0.04 -1.04 -0.92 75740    1
## b_ed    -0.73       0 0.03 -0.78 -0.69 73012    1
## sigma    0.56       0 0.21  0.32  0.93 28051    1
##              mean se_mean   sd  5.5% 94.5% n_eff Rhat
## a_re[1]     -0.58       0 0.25 -0.97 -0.20 16721    1
## a_re[2]      0.86       0 0.26  0.48  1.27 16548    1
## a_re[3]     -0.14       0 0.25 -0.52  0.24 16064    1
## a_re[4]     -0.04       0 0.24 -0.40  0.33 15363    1
## a_re[5]     -0.01       0 0.24 -0.38  0.37 15744    1
## a_re[6]      0.02       0 0.28 -0.41  0.46 20789    1
## a_re[7]     -0.04       0 0.24 -0.41  0.32 15131    1
## a            1.70       0 0.25  1.32  2.07 15096    1
## b_f          0.22       0 0.07  0.11  0.33 54645    1
## b_el        -1.89       0 0.22 -2.24 -1.53 55056    1
## b_d         -0.98       0 0.15 -1.22 -0.75 39491    1
## b_h         -2.32       0 0.23 -2.68 -1.95 43523    1
## b_ed        -0.50       0 0.11 -0.68 -0.32 31057    1
## b_el_d       0.96       0 0.55  0.08  1.80 58046    1
## b_el_h       2.04       0 0.63  1.00  2.95 43678    1
## b_el_ed      0.82       0 0.26  0.41  1.23 46932    1
## b_d_h        1.26       0 0.57  0.33  2.12 52068    1
## b_d_ed       0.12       0 0.18 -0.17  0.41 34840    1
## b_h_ed       1.36       0 0.25  0.96  1.77 39233    1
## b_el_d_h    -0.81       0 0.91 -2.23  0.68 49009    1
## b_el_d_ed    0.18       0 0.59 -0.73  1.13 54243    1
## b_el_h_ed   -1.32       0 0.67 -2.30 -0.22 43914    1
## b_d_h_ed    -0.96       0 0.59 -1.85  0.01 49656    1
## b_el_d_h_ed -0.01       0 0.97 -1.58  1.51 49769    1
## sigma        0.56       0 0.22  0.32  0.96 33744    1
## sigma_obs    0.23       0 0.03  0.18  0.29 34354    1

The coefficients b_el, b_d, and b_h are much more negative in the final model than in the second model, indicating a substantially lower graduation rate for students who are English language learners, who have disabilities, or who are homeless, once the interactions are accounted for. Further, five of the six coefficients on the first-order interaction terms are almost certainly positive. This suggests that the difference in graduation rates between a student with two burdening factors and a student with one burdening factor is much smaller than the difference between a student with no burdening factors and a student with one burdening factor.

While the coefficient for the economically disadvantaged factor is still negative, it is much smaller in magnitude than the other burdening factor coefficients. This may reflect the benefits of programs available to students who are identified as economically disadvantaged, such as free and reduced price meals10.

5.3 Graduation rates across races/ethnicities

Now we will compare the individual graduation rates of each race/ethnicity, controlling for the average effects of gender and burdening factors.

One thing to keep in mind is that the burdening factors are estimated as averages across all races/ethnicities, and do not reflect whether the graduation rates of certain races/ethnicities are more sensitive to their presence than others. In fact, while we won’t include the details here, preliminary analysis indicates that there is a significant amount of variation in the effects of English learner status and of economically disadvantaged status on the graduation rates of different races/ethnicities. It would be extremely valuable to compare the graduation rates we present in this section with rates estimated by different models which attempt to take these interactions into account.

These graduation rates aren’t more “correct” or “incorrect” than those from another model, but they certainly do not tell the whole story and need to be interpreted within the precise definition of our model.

We will use the graduation rate of the majority group, white students, as a frame of reference for the other graduation rates. By dividing each exponentiated coefficient exp(a_re[i]) by that of white students, exp(a_re[7]), we obtain the proportional change in graduation odds of each group relative to white students.
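A sketch of this computation, again assuming the stanfit object for the final model is named fit4:

a_re_draws <- rstan::extract(fit4, pars = "a_re")$a_re   # matrix of draws, one column per group

# proportional change in odds relative to white students (column 7)
odds_vs_white <- exp(a_re_draws - a_re_draws[, 7])

apply(odds_vs_white, 2, median)
apply(odds_vs_white, 2, quantile, probs = c(0.055, 0.945))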

Except for the groups with the lowest and highest proportional change in odds, the model does not see evidence for much difference between the various races/ethnicities. Note that the 89% probability density intervals for those rows all overlap 1, indicating that the model believes their graduation rates could plausibly be either lower or higher than the graduation rate of white students, all other predictors being equal.

It’s a little easier to see this by plotting the change in odds posteriors themselves. We’ll draw a gray vertical line at 1, indicating the frame of reference representing the graduation rate of white students.

We can calculate the posterior probability that the ranking in this table and plot is the true ranking of graduation rates:
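With the a_re_draws matrix from the sketch above, one way to compute this is to check, draw by draw, whether the ordering of the intercepts matches the ordering of their posterior medians:

median_order <- order(apply(a_re_draws, 2, median))
mean(apply(a_re_draws, 1, function(d) all(order(d) == median_order)))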

## [1] 0.07266667

Indeed, it’s far more likely that this is not the true ranking.

There are two outlying groups, American Indian/Alaska Native students and Asian students. American Indian/Alaska Native students have roughly 60% the odds of graduating that White students have, and Asian students have roughly two and a half times the odds that White students have. The model is more or less certain that these two groups bracket the other races/ethnicities; the posterior probability that the graduation rate for American Indian/Alaska Native students is lower than all of the other groups and that the graduation rate for Asian students is higher than all of the other groups is
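Again with a_re_draws, and recalling that group 1 is American Indian/Alaska Native and group 2 is Asian:

mean(apply(a_re_draws, 1, function(d) which.min(d) == 1 && which.max(d) == 2))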

## [1] 0.9956111

5.4 Reconstructing the original data

In this last section we will use our model to estimate the graduation rate for each student subgroup in the original data. In effect we will “fill in” the graduation rates for the rows which have been suppressed and adjust the graduation rates in rows with small cohorts to compensate for noisy data.

In this table, 0 indicates absence and 1 indicates presence of the characteristic in that column. In the Race/Ethn column, the different numbers correspond to the different races/ethnicities:

1 - American Indian/Alaska Native
2 - Asian
3 - Black/African American
4 - Hispanic/Latino
5 - Multi-Racial
6 - Native Hawaiian/Pacific Islander
7 - White

This table can be downloaded as a .csv file here.

6 Conclusion

Bayesian statistics provides a natural framework for analyzing suppressed data by allowing us to assign probabilities to the possible values that a suppressed entry might have. In this case we encountered suppressed counts, and we averaged over their possible values to determine the likelihood of different proportions that the suppressed entries could have represented. This approach takes into account every last bit of underlying information in the data. If we had instead ignored the suppressed entries, we would have biased our results.

The data we analyzed contains information about the numbers of graduates and the total numbers of students in different subgroups of high school students in Oregon. We estimated the average effects of race/ethnicity, of gender, and of the interactions of four burdening factors: being an English learner, having a disability, being homeless, and being economically disadvantaged.

In its state plan11 under ESSA, Oregon has set the goal of achieving, by 2025, a 90% graduation rate for every student group we discussed. I hope the analysis presented here shows the value of the newly released cross-tabulated data in revealing underlying connections between these groups. These connections will need to be taken into account when developing and implementing statewide programs.


Antonio R. Vargas

20 Jan 2019

Updated 11 Mar 2019


  1. Section 1111(g)(2)(N) of the Elementary and Secondary Education Act of 1965, which is one of the amendments made by ESSA. The quoted text can be found in ESSA.

  2. ODE includes students in this calculation who transferred into the Oregon public school system, and excludes students who transferred out of the public school system or who passed away.

  3. The definition of “graduates” changed between the 2012-13 cohort and the 2013-14 cohort to include students who have received Modified Diplomas as described in ORS 329.451 and students who have satisfied requirements for at least one type of standard diploma recognized by their district but have not graduated. For more information, see page 5 of the Cohort Graduation Rate Policy and Technical Manual 2013-14, available on ODE’s Graduation Reports page.

  4. For more information about the suppression of student data, see the technical brief Statistical Methods for Protecting Personally Identifiable Information in Aggregate Reporting published by the National Center for Education Statistics.

  5. Stan samples from the log-posterior, so we compute the log-likelihood instead of the likelihood.

  6. According to the Graduation Reports, the largest racial-ethnic group, White students, contains 30,011 members, while the smallest racial-ethnic group, Native Hawaiian/Pacific Islander students, contains 337 members.

  7. See Bürkner, P. C., brms: An R Package for Bayesian Multilevel Models Using Stan and the brms repository on Github.

  8. For an introduction to PSIS-LOO and related information criteria see Vehtari, Gelman, and Gabry, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC (2017) [SpringerLink, arXiv].

  9. The bias correction we use is derived from the one given in section 3.8 of Gelman, Hwang, and Vehtari, Understanding predictive information criteria for Bayesian models (2014) [SpringerLink, arXiv].

  10. For information about free and reduced price meals, see the August 21, 2018 news release Free and Reduced Price Meal Income Guidelines Announced for 2018-19 School Year from the ODE 2018 News Releases page. Other programs available to students identified as economically disadvantaged are listed on page 8 of the 2018 Poverty Report published by Oregon’s Chief Education Office.

  11. Oregon’s State Plan Under ESSA can be read here. The graduation rate goal is discussed on page 37.