Accessing Oregon Open Data from R

The RSocrata package makes it easy to pull data from data.oregon.gov directly into R.

Here are the packages we’ll use in the examples:

library(dplyr)
library(stringr)
library(janitor)
library(RSocrata)

Suppose you’ve already searched data.oregon.gov and decided you want the data on new businesses registered last month. Since you already know the data set’s URL, you can pass it to RSocrata::read.socrata() to pull the data into R:

new_businesses <- read.socrata("https://data.oregon.gov/business/New-Businesses-Registered-Last-Month/esjy-u4fc")

new_businesses %>% 
  head(30)
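Socrata portals can throttle requests made without an app token, and read.socrata() accepts an optional app_token argument. The sketch below assumes you have registered a token and stored it in an environment variable. The helper name read_oregon() and the variable name SOCRATA_APP_TOKEN are made up for illustration; neither is part of RSocrata.

```r
library(RSocrata)

# Hypothetical helper: send an app token only when one is available.
# SOCRATA_APP_TOKEN is an assumed environment variable name, not an RSocrata convention.
read_oregon <- function(url, token = Sys.getenv("SOCRATA_APP_TOKEN")) {
  if (nzchar(token)) {
    read.socrata(url, app_token = token)
  } else {
    read.socrata(url)
  }
}

# Not run here because it needs network access:
# new_businesses <- read_oregon("https://data.oregon.gov/business/New-Businesses-Registered-Last-Month/esjy-u4fc")
```

Keeping the token in an environment variable rather than in the script means the script can be shared without exposing it.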

You can also use RSocrata::ls.socrata() to get a list of all available data sets:

data_sets <- ls.socrata("https://data.oregon.gov")

data_sets

This can be helpful if you’re an advanced user and want to run a custom search that is difficult with the web interface.

For example, we could write a search that uses a simple regular expression to find all data sets whose title, description, or keywords contain any of the following tokens:

birth
natal
matern
midwife
women
woman
female

Note that the keyword column is a list column: each row holds a character vector of keywords. So we need to check each vector separately and collapse the result to a single TRUE or FALSE per row.

search_regex <- regex("birth|natal|matern|midwife|women|woman|female", ignore_case = TRUE)

data_sets %>% 
  filter(
    vapply(keyword, function(x) any(str_detect(x, search_regex)), logical(1)) |
    str_detect(title, search_regex) |
    str_detect(description, search_regex)
  ) %>% 
  select(title, everything())
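To see how the list column behaves, here is a minimal sketch on a made-up catalog. The tibble toy_catalog and its contents are invented for illustration and only mimic the title, description, and keyword columns. It uses vapply(), which works like sapply() but insists on exactly one logical value per row.

```r
library(dplyr)
library(stringr)

# Made-up stand-in for the catalog returned by ls.socrata()
toy_catalog <- tibble(
  title = c("Birth Certificates Issued", "Road Closures", "Hospital Infection Report"),
  description = c("Counts by county", "Current closures", "Ward-level counts"),
  keyword = list(c("vital records", "health"), character(0), c("Neonatal", "icu"))
)

search_regex <- regex("birth|natal", ignore_case = TRUE)

# str_detect() returns one value per keyword, so any() collapses them to one per row.
# An empty keyword vector yields any(logical(0)), which is FALSE.
keyword_hit <- vapply(
  toy_catalog$keyword,
  function(x) any(str_detect(x, search_regex)),
  logical(1)
)

matches <- toy_catalog %>%
  filter(
    keyword_hit |
      str_detect(title, search_regex) |
      str_detect(description, search_regex)
  )

matches$title
```

The first row matches on its title, the third only on its keywords, and the second (which has no keywords at all) is dropped.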

It looks like there are several data sets relating to Central Line-Associated Bloodstream Infections (CLABSI) and a grantee list for a bill.

Let’s take a look at one of these.

data_sets %>% 
  filter(title == "2017 Central Line-Associated Bloodstream Infections (CLABSI) Table – Acute Care Hospitals") %>% 
  pull(description)

## [1] "Oregon hospitals report CLABSIs from adult, pediatric, and neonatal intensive care units, and adult and pediatric medical, surgical, and medical/surgical wards as part of Oregon's mandatory healthcare-associated infections reporting program. For information regarding standardized infection ratio baselines, please see https://www.cdc.gov/nhsn/2015rebaseline/index.html."
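If it isn't obvious why this data set matched the search, one option is to extract the matching token from the description. A minimal sketch, re-creating the search pattern and using a shortened copy of the description above:

```r
library(stringr)

search_regex <- regex("birth|natal|matern|midwife|women|woman|female", ignore_case = TRUE)

# Shortened copy of the description printed above
description <- "Oregon hospitals report CLABSIs from adult, pediatric, and neonatal intensive care units"

# str_extract() returns the first piece of text that matches the pattern
matched <- str_extract(description, search_regex)
matched
```

The match comes from "neonatal", which contains the token natal. Because the tokens are matched as substrings, they also catch words such as "prenatal" or "maternity".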

Interesting! We can pull the data set by passing the value in the landingPage column to RSocrata::read.socrata():

clabsi <- 
  data_sets %>% 
  filter(title == "2017 Central Line-Associated Bloodstream Infections (CLABSI) Table – Acute Care Hospitals") %>% 
  pull(landingPage) %>% 
  read.socrata() %>% 
  clean_names()  # simplify the column names

clabsi
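The effect of clean_names() can be sketched on a small made-up table. The column headers below are assumptions in the style of a portal export, not the actual headers of the CLABSI data set.

```r
library(dplyr)
library(janitor)

# Made-up column headers (assumed for illustration, not the real ones)
raw <- tibble(
  `Hospital Name` = "Example Hospital",
  `Observed Infections` = 3,
  `2020 HHS Targets` = "Target Not Met"
)

# clean_names() converts headers to snake_case and prefixes an "x"
# to any name that would otherwise start with a digit
cleaned <- clean_names(raw)
names(cleaned)
```

This is why a column about 2020 targets ends up with a name like x2020_hhs_targets: R names cannot start with a digit unless quoted in backticks, so janitor adds the x prefix.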

Here is a list of all hospitals (and a row for Oregon overall) that did not meet the 2020 HHS target for CLABSIs across all adult/pediatric ICUs and wards combined.

clabsi %>% 
  filter(
    hospital_location == "All Adult/Ped ICUs & M/S/MS Wards Combined*",
    x2020_hhs_targets == "Target Not Met"
  ) %>% 
  select(
    hospital_name,
    observed_infections:x2020_hhs_targets
  )

Antonio R. Vargas

5 March 2023