By Christian McDonald, Assistant Professor of Practice
School of Journalism and Media, Moody College of Communication
University of Texas at Austin


The purpose of this notebook is to process multiple quarters of THCIC in-patient public use data files into a single data file of all births. This requires importing and applying several filtering options.

We are using AHRQ’s IQI 33 Primary Cesarean Delivery Rate, Uncomplicated method to define a birth, but do not apply further filtering for complications. We do apply some data filtering/cleaning for missing data as outlined in IQI 33.

The output is all data-processed/ahrq_del_all.rds, which we can then analyzed in future files.

Of note, there is another notebook 01-process-test that uses the same filtering criteria but prints examples at various stages for manual checking.

Also, there is another notebook 00-process-lists where various AHRQ lists of ICD-10 and other codes are defined separately. Those values are written out to the procedures-lists folder as .rds and .csv files and then imported into this notebook and others. See that notebook to inspect the lists. It must be run before this notebook as outputs are used here.

library(fs)
library(tidyverse)

Data sources

Test data

I have a set of test data that grabs the first 10,000 rows from the first quarter of 2016-2019. This is used to test the processing without the load of the production data, which is considerable.

# set test_flag = TRUE to use test data
test_flag <- F
data_dir_test <- "data-test"
tsv_files_test <- dir_ls(data_dir_test, recurse = TRUE, regexp = "test_base1")
tsv_files_test
## data-test/test_base1_1q2016.txt data-test/test_base1_1q2017.txt 
## data-test/test_base1_1q2018.txt data-test/test_base1_1q2019.txt

Production data

Part of this process is to loop through all the matching files in the data-raw folder to process them. This sets up those directories.

# set up production data
data_dir <- "data-raw"
tsv_files <- dir_ls(data_dir, recurse = TRUE, regexp = "PUDF_base1_")

# peek at the captured file list
tsv_files
## data-raw/PUDF 1Q2016 tab-delimited/PUDF_base1_1q2016_tab.txt
## data-raw/PUDF 1Q2017 tab-delimited/PUDF_base1_1q2017_tab.txt
## data-raw/PUDF 1Q2018 tab-delimited/PUDF_base1_1q2018_tab.txt
## data-raw/PUDF 1Q2019 tab-delimited/PUDF_base1_1q2019_tab.txt
## data-raw/PUDF 2Q2016 tab-delimited/PUDF_base1_2q2016_tab.txt
## data-raw/PUDF 2Q2017 tab-delimited/PUDF_base1_2q2017_tab.txt
## data-raw/PUDF 2Q2018 tab-delimited/PUDF_base1_2q2018_tab.txt
## data-raw/PUDF 2Q2019 tab-delimited/PUDF_base1_2q2019_tab.txt
## data-raw/PUDF 3Q2016 tab-delimited/PUDF_base1_3q2016_tab.txt
## data-raw/PUDF 3Q2017 tab-delimited/PUDF_base1_3q2017_tab.txt
## data-raw/PUDF 3Q2018 tab-delimited/PUDF_base1_3q2018_tab.txt
## data-raw/PUDF 3Q2019 tab-delimited/PUDF_base1_3q2019_tab.txt
## data-raw/PUDF 4Q2016 tab-delimited/PUDF_base1_4q2016_tab.txt
## data-raw/PUDF 4Q2017 tab-delimited/PUDF_base1_4q2017_tab.txt
## data-raw/PUDF 4Q2018 tab-delimited/PUDF_base1_4q2018_tab.txt
## data-raw/PUDF 4Q2019 tab-delimited/PUDF_base1_4q2019_tab.txt

Function to filter for deliveries

For processing this data, I build functions to apply different AHRQ filtering to each file before binding them into a single tibble. In some cases the functions filter multiple columns for multiple variables.

The functions are set up first, then used later in the processing loop.

Filter for birth deliveries

In this case, we are looking at all the diagnostic columns for values in the delocmd_list list, which comes from “DELOCMD” in the AHRQ IQI 33 reference. It uses the 2020 definitions.

delocmd_list <- read_rds("procedures-lists/ahrq_delocmd.rds") %>% .$delocmd

filter_del <- function(.data) {
  .data %>%
    filter_at(
      vars(
        matches("_DIAG"),
        -starts_with("POA")
      ),
      any_vars(
        . %in% delocmd_list
      )
    )
}

Filter out missing data

Again, from AHRQ’s IQI 33 definition:

“with missing gender (SEX=missing), age (AGE=missing), quarter (DQTR=missing), year (YEAR=missing) or principal diagnosis (DX1=missing).”

In base1, the fields are SEX_CODE, PAT_AGE, DISCHARGE for both quarter and year, and PRINC_DIAG_CODE.

filter_clean <- function(.data) {
  .data %>% 
      filter(
        SEX_CODE == "F",
        PAT_AGE != "`",
        RACE != "`",
        !is.na(DISCHARGE),
        !is.na(PRINC_DIAG_CODE)
      )
}

Child-bearing age

Researchers at the Office of Health Affairs-Population Health, The University of Texas System work with the THCIC file daily and they suggest filtering deliveries to women of normal child-bearing age, 15-49.

The codes for the ages 15-49 include “05” through “12”. For HIV or drug patients it includes “23” (18-44 yrs).

This function filters for those values.

age_list <- read_rds("procedures-lists/utoha_age.rds") %>% .$age

filter_age <- function(.data) {
  .data %>% 
    filter(PAT_AGE %in% age_list)
}

Import and process

This is the workhorse process to loop through the files and combine the data. It brings in each file and then applies the filtering functions above and then appends the results to the growing tibble.

At this time, the analysis utilizes only one (PUDF_base1) of several files in the release for each quarter. The raw data is never changed.

Of note:

Import warnings are supressed. This hides import errors like Missing column names filled in: 'X167' caused by the extra tabs in the raw files.

# warnings supressed.

# create the tibble
data <- tibble()

# set the files list
if(test_flag == T) files <- tsv_files_test else files <- tsv_files

for (file in files) {
  c <- read_tsv(
    file,
    col_types = cols(
      .default = col_character(),
      X168 = col_skip(),
      X167 = col_skip()
    )
  ) %>% 
  mutate_at(
    vars(contains("_CHARGES")), as.numeric
  ) %>%
  filter_del() %>% 
  filter_clean %>% 
  filter_age
  data <- bind_rows(data, c)
}

data %>% nrow()

The result of this function is data, which is our combined and filtered data.

Add year column

In the resulting file we add a YR column for convenience to use in later analysis.

data <- data %>% 
  mutate(
    YR = substr(DISCHARGE, 1, 4)
  )

data %>% 
  count(YR)

Remove other years

Because of a reporting lag, there are years in the original data that we are not using for our analysis. At some point in 2015 there was a switch from ICD-9 to ICD-10 coding, so going earlier would require some conversions. Not impossible, but not in scope at this time to ease complication.

We are using full years from 2016-2018 and a partial year 2019 through the 3rd quarter release.

data <- data %>% 
  filter(YR %in% c("2016", "2017", "2018", "2019"))

Write file

We write a single file of all the deliveries.

if(test_flag == T) write_path <- "data-test/ahrq_del_all_loop_test.rds" else write_path <- "data-processed/ahrq_del_all.rds"

data %>% write_rds(write_path)

data %>% nrow()
## [1] 1457413
# A klaxon to indicate the processing is complete
beepr::beep(4)