The RSocrata package makes it easy to pull data from data.oregon.gov directly into R.
Here are the packages we’ll use in the examples:
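The package list itself is not reproduced here. Judging from the functions called in the examples, something like the following should cover them. Loading these four packages individually (rather than, say, a metapackage) is an assumption.

```r
# Assumed package set, inferred from the functions used in the examples
library(RSocrata)  # read.socrata(), ls.socrata()
library(dplyr)     # %>%, filter(), select(), pull()
library(stringr)   # regex(), str_detect()
library(janitor)   # clean_names()
```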
Suppose you’ve already used the search engine on data.oregon.gov and decided you want the data on new businesses registered last month. Since you already know the URL, you can pass it to RSocrata::read.socrata() to pull that data set into R:
new_businesses <- read.socrata("https://data.oregon.gov/business/New-Businesses-Registered-Last-Month/esjy-u4fc")
new_businesses %>%
  head(30)
You can also use RSocrata::ls.socrata() to get a list of all available data sets:
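A minimal sketch of that call, assuming the portal's base URL is what gets passed in. The name data_sets is chosen to match the later examples, and the call needs network access to run.

```r
library(RSocrata)

# List every data set published on the portal.
# Assumption: the base URL of the domain is the argument.
data_sets <- ls.socrata("https://data.oregon.gov")

# Column names available for searching
names(data_sets)
```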
This can be helpful if you’re an advanced user and want to run a custom search that is difficult with the web interface.
For example, we could write a search that uses a simple regular expression to find all data sets whose title, description, or keywords contain any of these tokens: birth, natal, matern, midwife, women, woman, or female.
Note that the keyword column is a list column, where each row is a character vector. So we’ll have to carefully loop through each vector in the list.
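To see why the loop is needed, here is a toy illustration on made-up data. The tibble toy_sets and the pattern toy_regex are hypothetical stand-ins for the real catalog and search pattern. This sketch uses vapply() rather than sapply() so that each row is guaranteed to yield exactly one TRUE or FALSE, even when a keyword vector is empty.

```r
library(dplyr)
library(stringr)

# Made-up stand-in for the catalog; the keyword column is a list column
toy_sets <- tibble(
  title = c("Births by county", "Clinic locations", "Road closures"),
  keyword = list("vital records", c("prenatal", "care"), character(0))
)

toy_regex <- regex("birth|natal", ignore_case = TRUE)

# str_detect() expects a character vector, so apply it to each list element
# and collapse the per-keyword matches into a single logical per row
keyword_hit <- vapply(
  toy_sets$keyword,
  function(x) any(str_detect(x, toy_regex)),
  logical(1)
)

# Keep rows that match on either the keywords or the title
toy_hits <- toy_sets %>%
  filter(keyword_hit | str_detect(title, toy_regex))

toy_hits
```

The first row matches only on its title and the second only on a keyword, so both survive the filter. The third has no keywords at all and is dropped.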
search_regex <- regex("birth|natal|matern|midwife|women|woman|female", ignore_case = TRUE)
data_sets %>%
  filter(
    sapply(keyword, function(x) any(str_detect(x, search_regex))) |
      str_detect(title, search_regex) |
      str_detect(description, search_regex)
  ) %>%
  select(title, everything())
It looks like there are several data sets relating to Central Line-Associated Bloodstream Infections (CLABSI) and a grantee list for a bill.
Let’s take a look at one of these.
data_sets %>%
  filter(title == "2017 Central Line-Associated Bloodstream Infections (CLABSI) Table – Acute Care Hospitals") %>%
  pull(description)
## [1] "Oregon hospitals report CLABSIs from adult, pediatric, and neonatal intensive care units, and adult and pediatric medical, surgical, and medical/surgical wards as part of Oregon's mandatory healthcare-associated infections reporting program. For information regarding standardized infection ratio baselines, please see https://www.cdc.gov/nhsn/2015rebaseline/index.html."
Interesting! We can pull the data set by passing the value in the landingPage column to RSocrata::read.socrata():
clabsi <-
  data_sets %>%
  filter(title == "2017 Central Line-Associated Bloodstream Infections (CLABSI) Table – Acute Care Hospitals") %>%
  pull(landingPage) %>%
  read.socrata() %>%
  clean_names() # simplify the column names
clabsi
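As a rough illustration of what clean_names() does to column names, here is a sketch on a made-up data frame. The raw names are hypothetical, not the portal's actual headers. Names are converted to snake_case, and a name that starts with a digit gets an "x" prefix, which is why a column such as x2020_hhs_targets shows up in the cleaned data.

```r
library(janitor)

# Hypothetical raw column names, made up for illustration
raw <- data.frame(
  "Hospital Name" = "Example Hospital",
  "2020 HHS Targets" = "Target Met",
  check.names = FALSE  # keep the spaces and leading digit as given
)

cleaned <- clean_names(raw)
names(cleaned)
```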
Here is a list of all hospitals (plus a row for Oregon overall) that did not meet the 2020 HHS target for CLABSIs.
clabsi %>%
  filter(
    hospital_location == "All Adult/Ped ICUs & M/S/MS Wards Combined*",
    x2020_hhs_targets == "Target Not Met"
  ) %>%
  select(
    hospital_name,
    observed_infections:x2020_hhs_targets
  )
5 March 2023