By Christian McDonald, Assistant Professor of Practice
School of Journalism and Media, Moody College of Communication
University of Texas at Austin
Hospital names change over time, causing problems for any analysis that groups or aggregates by the PROVIDER_NAME.
This notebook provides a list of THCIC_IDs with first their official name from the Current Facility Contact Information found at Outpatient Data Reporting Requirements. If the case’s THCIC_ID is not listed in the facilities list, then the facilities’ most recent name is used.
This notebook updates records to use their most recent names/ID combinations and creates “full” and “current” provider lists for later notebooks.
library(tidyverse)
library(janitor)
library(readxl)
### production data
path_prod <- "data-processed/ahrq_del_all.rds"
del <- read_rds(path_prod)
del %>% nrow()
## [1] 1457413
### facilities list
facilities_list <- read_excel("resources/FacilityList_20200910.xlsx") %>% clean_names()
Since PROVIDER_NAMES change over time, we want to use the most recent facilites list to get the most current name of providers in our data. If the record’s ID is not on the current list, then we use the most recent PROVIDER_NAME for that THCIC_ID.
The process we are using is this:
The concept of group and slice from the tail by row_number came from this stack overflow article.
providers_full <- del %>%
distinct(THCIC_ID, PROVIDER_NAME, DISCHARGE) %>%
arrange(THCIC_ID, DISCHARGE) %>%
select(THCIC_ID, PROVIDER_NAME) %>%
group_by(THCIC_ID) %>%
slice(tail(row_number(), 1)) %>%
rename(PROVIDER_NAME_RECENT = PROVIDER_NAME) %>%
left_join(facilities_list, by = c("THCIC_ID" = "thcic_id")) %>%
select(THCIC_ID, PROVIDER_NAME_RECENT, facility, city, address) %>%
# Create new provider name and fill from facilities list
mutate(
PROVIDER_NAME_CLEANED = case_when(
is.na(facility) ~ PROVIDER_NAME_RECENT,
TRUE ~ facility
)
) %>%
rename(
PROVIDER_CITY = city,
PROVIDER_ADDRESS = address
) %>%
select(THCIC_ID, PROVIDER_NAME_CLEANED, PROVIDER_CITY, PROVIDER_ADDRESS) %>%
arrange(PROVIDER_NAME_CLEANED)
providers_full %>% nrow()
## [1] 285
providers_full %>% head()
Check for duplicate ids in cleaned list.
providers_full %>%
count(THCIC_ID) %>%
filter(n > 1)
Check for duplicate names in cleaned list.
providers_full %>%
count(PROVIDER_NAME_CLEANED) %>%
filter(n > 1)
There are a number of cases where there is a THCIC_ID
used in our data does not have a match in the most recent facilities list. You can see these by searching the cleaned providers for records with no PROVIDER_CITY
.
providers_full %>%
filter(is.na(PROVIDER_CITY))
x The reasons are explained above in Problems with the data.
Here we update the AHRQ deliveries data to use the best name from our “full” providers list. Any records that were not in the current facilities list will use their most recent name and not have a PROVIDER_CITY or PROVIDER_ADDRESS added.
del_names <- del %>% # or assign back to the same dataframe
left_join(providers_full, by = "THCIC_ID") %>%
rename(
PROVIDER_NAME_ORIG = PROVIDER_NAME,
PROVIDER_NAME = PROVIDER_NAME_CLEANED,
)
del_names_cnt <- del_names %>% nrow()
# scope of records updated
fac_cleaned <- del_names %>%
filter(PROVIDER_NAME != PROVIDER_NAME_ORIG)
fac_cleaned_cnt <- fac_cleaned %>% nrow()
fac_cleaned_pct <- round_half_up((fac_cleaned_cnt / del_names_cnt) * 100,1)
cat(
fac_cleaned_cnt, " of ", del_names_cnt,
" records were updated with new PROVIDER_NAMEs, or ",
fac_cleaned_pct, "%", sep = ""
)
## 196686 of 1457413 records were updated with new PROVIDER_NAMEs, or 13.5%
#peek
fac_cleaned %>%
select(PROVIDER_NAME_ORIG, PROVIDER_NAME, THCIC_ID) %>%
distinct() %>%
arrange(PROVIDER_NAME_ORIG)
When we build data for our charts we will end up dropping records that are not in our current facilities list. Here we get an idea of the scope of that.
This is across all years of data.
del_names %>%
filter(is.na(PROVIDER_CITY)) %>%
select(THCIC_ID, PROVIDER_NAME) %>%
distinct() %>%
arrange(PROVIDER_NAME)
# get 2019 records
del_names_2019 <- del_names %>%
filter(YR == "2019")
# get numb rows
del_names_2019_rows <- del_names_2019 %>% nrow()
# Find the number of rows with missing cities
prov_city_null <- del_names_2019 %>%
filter(is.na(PROVIDER_CITY))
prov_city_null_rows <- prov_city_null %>% nrow()
# Get percentage of data with missing cities
perc <- (prov_city_null_rows / (del_names %>% nrow()) * 100) %>% round_half_up(2)
# percent of records
cat("There are ", prov_city_null_rows, " of ", del_names_2019_rows, " records with missing cities, or ", perc, "%", sep ='')
## There are 1027 of 356983 records with missing cities, or 0.07%
# providers dropped
del_names_2019 %>%
count(PROVIDER_NAME, PROVIDER_CITY) %>%
filter(is.na(PROVIDER_CITY))
Oakbend Medical Center has two THCIC_IDs in the data because they have two locations (1705 Jackson Street and 22003 Southwest Freeway, both in Richmond). It is listed as such in the most current facilities directory.
del_names %>%
distinct(THCIC_ID, PROVIDER_NAME, PROVIDER_CITY, PROVIDER_ADDRESS) %>%
filter(str_detect(PROVIDER_NAME, "Oakbend"))
Memorial Hospital is a popular name, being in three cities.
del_names %>%
distinct(THCIC_ID, PROVIDER_NAME, PROVIDER_CITY, PROVIDER_ADDRESS) %>%
filter(PROVIDER_NAME == "Memorial Hospital")
Here we build a nice, clean list of current facilities to use later for charts.
providers_current <- facilities_list %>%
select(thcic_id, facility, address, city, county, zip) %>%
rename(
THCIC_ID = thcic_id,
PROVIDER_NAME = facility,
PROVIDER_ADDRESS = address,
PROVIDER_CITY = city,
PROVIDER_COUNTY = county,
PROVIDER_ZIP = zip
)
# peek
providers_current %>% head()
Two exports needed. - The cleaned “full” providers list - A “current” providers list - The cleaned deliveries data
# write the file
providers_full %>%
write_rds("data-processed/providers_full.rds")
providers_full %>%
write_csv("exports/providers_full.csv")
providers_current %>%
write_rds("data-processed/providers_current.rds")
providers_current %>%
write_csv("exports/providers_current.csv")
del_names %>%
write_rds("data-processed/ahrq_del_cleaned.rds")
# klaxon to sound completion
beepr::beep(4)