By Christian McDonald, Assistant Professor of Practice
School of Journalism and Media, Moody College of Communication
University of Texas at Austin


Hospital names change over time, causing problems for any analysis that groups or aggregates by the PROVIDER_NAME.

This notebook provides a list of THCIC_IDs with first their official name from the Current Facility Contact Information found at Outpatient Data Reporting Requirements. If the case’s THCIC_ID is not listed in the facilities list, then the facilities’ most recent name is used.

Problems in the data

This notebook updates records to use their most recent names/ID combinations and creates “full” and “current” provider lists for later notebooks.

Setup

library(tidyverse)
library(janitor)
library(readxl)

Importing the data

  • We start with all delivery data per AHRQ standards since it is our largest collected dataset.
  • The facilities list is the most current one published by the Center for Health Statistics on Sept. 10, 2020, around the time they released our most current data, 4q2019.
### production data
path_prod <- "data-processed/ahrq_del_all.rds"

del <- read_rds(path_prod)

del %>% nrow()
## [1] 1457413
### facilities list
facilities_list <- read_excel("resources/FacilityList_20200910.xlsx") %>% clean_names()

Process for creating updated full provider list

Since PROVIDER_NAMES change over time, we want to use the most recent facilites list to get the most current name of providers in our data. If the record’s ID is not on the current list, then we use the most recent PROVIDER_NAME for that THCIC_ID.

The process we are using is this:

  • get distinct id, name, quarter from the delivery data
    • arrange by id, quarter so we can find the most recent version
    • group and slice the most recent name, creating a new col with the recent name
  • join with facilities_list to get the official name
    • In some cases the id is not in the facilities list. In those cases, use case_when to use the “recent” name

The concept of group and slice from the tail by row_number came from this stack overflow article.

providers_full <- del %>% 
  distinct(THCIC_ID, PROVIDER_NAME, DISCHARGE) %>% 
  arrange(THCIC_ID, DISCHARGE) %>% 
  select(THCIC_ID, PROVIDER_NAME) %>% 
  group_by(THCIC_ID) %>% 
  slice(tail(row_number(), 1)) %>% 
  rename(PROVIDER_NAME_RECENT = PROVIDER_NAME) %>% 
  left_join(facilities_list, by = c("THCIC_ID" = "thcic_id")) %>% 
  select(THCIC_ID, PROVIDER_NAME_RECENT, facility, city, address) %>% 
  # Create new provider name and fill from facilities list
  mutate(
    PROVIDER_NAME_CLEANED = case_when(
       is.na(facility) ~ PROVIDER_NAME_RECENT,
       TRUE ~ facility
    )
  ) %>% 
  rename(
    PROVIDER_CITY = city,
    PROVIDER_ADDRESS = address
  ) %>% 
  select(THCIC_ID, PROVIDER_NAME_CLEANED, PROVIDER_CITY, PROVIDER_ADDRESS) %>%
  arrange(PROVIDER_NAME_CLEANED)

providers_full %>% nrow()
## [1] 285
providers_full %>% head()

Check for duplicate ids in cleaned list.

providers_full %>% 
  count(THCIC_ID) %>% 
  filter(n > 1)

Check for duplicate names in cleaned list.

providers_full %>% 
  count(PROVIDER_NAME_CLEANED) %>% 
  filter(n > 1)

Troubleshooting providers list

There are a number of cases where there is a THCIC_ID used in our data does not have a match in the most recent facilities list. You can see these by searching the cleaned providers for records with no PROVIDER_CITY.

providers_full %>% 
  filter(is.na(PROVIDER_CITY))

x The reasons are explained above in Problems with the data.

Update PROVIDER_NAMES from full providers list

Here we update the AHRQ deliveries data to use the best name from our “full” providers list. Any records that were not in the current facilities list will use their most recent name and not have a PROVIDER_CITY or PROVIDER_ADDRESS added.

del_names <- del %>% # or assign back to the same dataframe
  left_join(providers_full, by = "THCIC_ID") %>% 
  rename(
    PROVIDER_NAME_ORIG = PROVIDER_NAME,
    PROVIDER_NAME = PROVIDER_NAME_CLEANED,
    )

del_names_cnt <- del_names %>% nrow()

# scope of records updated
fac_cleaned <- del_names %>% 
  filter(PROVIDER_NAME != PROVIDER_NAME_ORIG)

fac_cleaned_cnt <- fac_cleaned %>% nrow()
fac_cleaned_pct <- round_half_up((fac_cleaned_cnt / del_names_cnt) * 100,1)


cat(
  fac_cleaned_cnt, " of ", del_names_cnt,
  " records were updated with new PROVIDER_NAMEs, or ",
  fac_cleaned_pct, "%", sep = ""
)
## 196686 of 1457413 records were updated with new PROVIDER_NAMEs, or 13.5%
#peek
fac_cleaned %>% 
  select(PROVIDER_NAME_ORIG, PROVIDER_NAME, THCIC_ID) %>% 
  distinct() %>% 
  arrange(PROVIDER_NAME_ORIG)

Scope of records with no PROVIDER_CITY

When we build data for our charts we will end up dropping records that are not in our current facilities list. Here we get an idea of the scope of that.

List facilities with no PROVIDER_CITY

This is across all years of data.

del_names %>% 
  filter(is.na(PROVIDER_CITY)) %>% 
  select(THCIC_ID, PROVIDER_NAME) %>% 
  distinct() %>% 
  arrange(PROVIDER_NAME)

2019 records that will be affected

# get 2019 records
del_names_2019 <- del_names %>% 
  filter(YR == "2019")
# get numb rows
del_names_2019_rows <- del_names_2019 %>% nrow()

# Find the number of rows with missing cities
prov_city_null <- del_names_2019 %>% 
  filter(is.na(PROVIDER_CITY))

prov_city_null_rows <- prov_city_null %>% nrow()

# Get percentage of data with missing cities
perc <- (prov_city_null_rows / (del_names %>% nrow()) * 100) %>% round_half_up(2)

# percent of records
cat("There are ", prov_city_null_rows, " of ", del_names_2019_rows, " records with missing cities, or ", perc, "%", sep ='')
## There are 1027 of 356983 records with missing cities, or 0.07%
# providers dropped
del_names_2019 %>% 
  count(PROVIDER_NAME, PROVIDER_CITY) %>% 
  filter(is.na(PROVIDER_CITY))

Hospital names with more than one ID

Oakbend Medical Center

Oakbend Medical Center has two THCIC_IDs in the data because they have two locations (1705 Jackson Street and 22003 Southwest Freeway, both in Richmond). It is listed as such in the most current facilities directory.

del_names %>% 
  distinct(THCIC_ID, PROVIDER_NAME, PROVIDER_CITY, PROVIDER_ADDRESS) %>% 
  filter(str_detect(PROVIDER_NAME, "Oakbend"))

Memorial Hospital

Memorial Hospital is a popular name, being in three cities.

del_names %>% 
  distinct(THCIC_ID, PROVIDER_NAME, PROVIDER_CITY, PROVIDER_ADDRESS) %>% 
  filter(PROVIDER_NAME == "Memorial Hospital")

Build a current facilities list

Here we build a nice, clean list of current facilities to use later for charts.

providers_current <- facilities_list %>% 
  select(thcic_id, facility, address, city, county, zip) %>% 
  rename(
    THCIC_ID = thcic_id,
    PROVIDER_NAME = facility,
    PROVIDER_ADDRESS = address,
    PROVIDER_CITY = city,
    PROVIDER_COUNTY = county,
    PROVIDER_ZIP = zip
  )

# peek
providers_current %>% head()

Exports and writes

Two exports needed. - The cleaned “full” providers list - A “current” providers list - The cleaned deliveries data

# write the file
providers_full %>%
  write_rds("data-processed/providers_full.rds")

providers_full %>%
  write_csv("exports/providers_full.csv")

providers_current %>%
  write_rds("data-processed/providers_current.rds")

providers_current %>%
  write_csv("exports/providers_current.csv")

del_names %>% 
  write_rds("data-processed/ahrq_del_cleaned.rds")

# klaxon to sound completion
beepr::beep(4)