By Christian McDonald, Assistant Professor of Practice
School of Journalism and Media, Moody College of Communication
University of Texas at Austin


The purpose of this notebook is to process test data from multiple quarters of THCIC in-patient public use data files into a single data file of deliveries without complications. This requires importing and applying several filtering options.

The methods in this notebook are used in 01-process-ahrq-del-loop to process multiple years of data. This notebook cannot support that amount of data, but it was used to check the steps of the filtering process for the main script. Please see 01-process-ahrq-del-loop for additional details.

There is another notebook, 00-process-lists, where various AHRQ lists of ICD-10 and other codes are defined separately. Those values are written out to the procedures-lists folder as .rds and .csv files and then imported into this notebook and others. See that notebook to inspect the lists.

library(fs)
library(tidyverse)

Set up import

We search through the data folder to build a list of files to import into this notebook. The test data was created using the first 10,000 rows from one quarter in each of four years, 2016-2019.

# set up test data
test_data_dir <- "data-test"
test_tsv_files <- dir_ls(test_data_dir, recurse = TRUE, regexp = "test_base1")
test_tsv_files
## data-test/test_base1_1q2016.txt data-test/test_base1_1q2017.txt 
## data-test/test_base1_1q2018.txt data-test/test_base1_1q2019.txt
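
As a rough illustration of how the `regexp` argument narrows the file list, the sketch below builds a scratch folder and searches it. The folder and the `test_charges_1q2016.txt` file are hypothetical stand-ins, not part of the THCIC release.

```r
library(fs)

# hypothetical scratch folder standing in for data-test
tmp_dir <- path(tempdir(), "data-test-demo")
dir_create(path(tmp_dir, "2016"))

# two files that should match and one that should not
file_create(path(tmp_dir, "2016", "test_base1_1q2016.txt"))
file_create(path(tmp_dir, "test_base1_1q2017.txt"))
file_create(path(tmp_dir, "test_charges_1q2016.txt"))

# recurse = TRUE also searches subfolders; regexp is matched against the path
found <- dir_ls(tmp_dir, recurse = TRUE, regexp = "test_base1")
path_file(found)
```

Only paths that contain `test_base1` are returned, including the one in the subfolder.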

Import the base1 files

At this time, our analysis utilizes only one (PUDF_base1) of several files in the release for each quarter.

Of note:

# warnings are suppressed, so check problems()
# add/remove test_ as necessary
base1 <- test_tsv_files %>%
  map_dfr(
    read_tsv,
    col_types = cols(
      .default = col_character(),
      X168 = col_skip(),
      X167 = col_skip()
    )
  ) %>%
  mutate_at(
    vars(contains("_CHARGES")), as.numeric
  )

# number of rows
base1 %>% nrow()

# klaxon for import complete
# beepr::beep(3)
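
Because everything is imported as character and then converted, it can help to count how many charge values failed to convert. The sketch below uses `across()`, the newer dplyr spelling of `mutate_at()`, on a toy tibble. The column names and values are made up for illustration.

```r
library(dplyr)
library(tibble)

# toy data; column names and values are hypothetical
toy <- tibble(
  RECORD_ID = c("1", "2", "3"),
  TOTAL_CHARGES = c("100.50", "2000", "n/a"),
  TOTAL_NON_COV_CHARGES = c("0", "15.25", "3")
)

# convert every column with "_CHARGES" in its name to numeric;
# values that cannot be parsed become NA
toy_num <- toy %>%
  mutate(across(contains("_CHARGES"), ~ suppressWarnings(as.numeric(.x))))

# count the NA values per converted column
na_counts <- toy_num %>%
  summarise(across(contains("_CHARGES"), ~ sum(is.na(.x))))

na_counts
```

A non-zero count points to values worth inspecting in the raw file.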

Filtering for deliveries

Filtering multiple columns, multiple conditions

The logic here looks through a number of columns for a number of ICD codes.

In this case, we are looking at all columns with “DIAG” in their name for values in the delocmd_list, which comes from "DELOCMD*" in our IQI 33 reference. See 00-process-lists for details.

Then we import the DELOCMD list and filter for it.

delocmd_list <- read_rds("procedures-lists/ahrq_delocmd.rds") %>% .$delocmd

del <- base1 %>% 
  filter_at(
    vars(
      matches("_DIAG"),
      -starts_with("POA")
    ),
    any_vars(
      . %in% delocmd_list
    )
  )

del %>% nrow()
## [1] 2749
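
To show the "any diagnosis column matches the list" logic on something small, here is a sketch on toy data using `if_any()`, the newer dplyr alternative to `filter_at()` with `any_vars()`. The codes and the one-item list are placeholders and are not taken from the AHRQ DELOCMD list.

```r
library(dplyr)
library(tibble)

# toy records; the POA value in row 3 is contrived to show the exclusion
toy <- tibble(
  PRINC_DIAG_CODE = c("O80", "J189", "J189", "K359"),
  OTH_DIAG_CODE_1 = c("Z370", "O80", NA, "E119"),
  POA_OTH_DIAG_CODE_1 = c("Y", "Y", "O80", "N")
)

# placeholder list standing in for delocmd_list
toy_list <- c("O80")

# keep rows where any _DIAG column, other than the POA columns, is in the list
toy_del <- toy %>%
  filter(
    if_any(matches("_DIAG") & !starts_with("POA"), ~ .x %in% toy_list)
  )

toy_del
```

Rows 1 and 2 are kept. Row 3 is dropped because its only match sits in a POA column, which the selection excludes.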

We peek here at the resulting frame to eyeball codes.

del %>% 
  select(
    matches("_DIAG"),
    -starts_with("POA")
  ) %>% head(10)

Exclusions from the deliveries

Some later notebooks need to exclude cases with complications such as abnormal presentation, fetal death, or multiple gestation. Those exclusions will be handled in those notebooks as needed.

Here we only filter out missing or bad data.

Filter out blank cells per Appendix A

“with missing gender (SEX=missing), age (AGE=missing), quarter (DQTR=missing), year (YEAR=missing) or principal diagnosis (DX1=missing).”

In base1, the fields are SEX_CODE, PAT_AGE, DISCHARGE for both quarter and year, and PRINC_DIAG_CODE.

del_cln <- del %>% 
  filter(
    SEX_CODE == "F",
    PAT_AGE != "`",
    RACE != "`",
    !is.na(DISCHARGE),
    !is.na(PRINC_DIAG_CODE)
  )

del_cln %>% nrow()
## [1] 2731
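
To see which condition is responsible for dropped rows, one option is to tally each test separately before filtering. This is a minimal sketch on toy data. It follows the filter above in treating a backtick as the marker for a missing or suppressed value, and the rows are invented.

```r
library(dplyr)
library(tibble)

# toy records; values are invented for illustration
toy <- tibble(
  SEX_CODE = c("F", "F", NA, "M", "F"),
  PAT_AGE = c("08", "`", "09", "10", "07"),
  DISCHARGE = c("2016Q1", "2016Q1", "2016Q1", "2016Q1", NA)
)

# how many rows fail each condition on its own
tally <- toy %>%
  summarise(
    not_female = sum(is.na(SEX_CODE) | SEX_CODE != "F"),
    age_suppressed = sum(is.na(PAT_AGE) | PAT_AGE == "`"),
    no_discharge = sum(is.na(DISCHARGE))
  )

# the combined filter; filter() also drops rows where a comparison is NA
toy_cln <- toy %>%
  filter(SEX_CODE == "F", PAT_AGE != "`", !is.na(DISCHARGE))

tally
nrow(toy_cln)
```

Note that `filter()` silently drops the row with a missing SEX_CODE, since `NA == "F"` evaluates to `NA` rather than `TRUE`.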

Child-bearing age

Researchers at the Office of Health Affairs-Population Health, The University of Texas System work with the THCIC file daily, and they suggest filtering deliveries to women of normal child-bearing age.

Here is how those ages break down in the cleaned file:

del_cln %>% 
  count(PAT_AGE)

The PAT_AGE codes that cover ages 15-49 are 05 through 12. For HIV or drug patients, the list also includes code 23 (18-44 yrs). We import those codes from procedures-lists.

Here we will filter for those values.

age_list <- read_rds("procedures-lists/utoha_age.rds") %>% .$age

del_cln_age <- del_cln %>% 
  filter(PAT_AGE %in% age_list)

del_cln_age %>% nrow()
## [1] 2726

Peeking at records outside the child-bearing age list to make sure there are none.

# set up not in
`%ni%` <- Negate(`%in%`)

del_cln_age %>% 
  filter(PAT_AGE %ni% age_list) %>% 
  select(PAT_AGE) %>% 
  count(PAT_AGE)
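
A related check is to run the same negated filter against the frame from before the age filter, which shows which age codes were removed. The sketch below uses toy data. Its `toy_age_list` is rebuilt from the description above (codes 05 through 12 plus 23) and is only a stand-in for the list stored in utoha_age.rds.

```r
library(dplyr)
library(tibble)

# set up not in
`%ni%` <- Negate(`%in%`)

# stand-in for age_list, rebuilt from the description in the text
toy_age_list <- c(sprintf("%02d", 5:12), "23")

# toy frame standing in for the pre-filter data
toy <- tibble(PAT_AGE = c("04", "05", "12", "13", "23"))

# age codes that the child-bearing age filter would remove
toy_removed <- toy %>%
  filter(PAT_AGE %ni% toy_age_list) %>%
  count(PAT_AGE)

toy_removed
```

Here codes 04 and 13 fall outside the list, so they are the ones reported.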

Add convenience columns for dates

del_cln_age_yr <- del_cln_age %>% 
  mutate(
    YR = substr(DISCHARGE, 1, 4)
  )
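
Since DISCHARGE holds both the year and the quarter, a quarter column could be derived the same way. This is only a sketch: it assumes DISCHARGE values look like "2016Q1", which should be confirmed against the data, and the QTR column is hypothetical and not used elsewhere in this notebook.

```r
library(dplyr)
library(tibble)

# toy values, assuming a "YYYYQn" layout for DISCHARGE
toy <- tibble(DISCHARGE = c("2016Q1", "2019Q2"))

toy_dates <- toy %>%
  mutate(
    YR = substr(DISCHARGE, 1, 4),  # first four characters: the year
    QTR = substr(DISCHARGE, 6, 6)  # sixth character: the quarter (assumed layout)
  )

toy_dates
```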

Remove other years

Because of a reporting lag, there are years in the original data that we are not using for our analysis. At some point in 2015 there was a switch from ICD-9 to ICD-10 coding, so going earlier would require some conversions. That is not impossible, but it is out of scope for now to keep the analysis simpler.

We are using full years from 2016-2018 and a partial year 2019 through the 2nd quarter release. This is subject to change as new data is released.

del_cln_age_yr <- del_cln_age_yr %>% 
  filter(YR %in% c("2016", "2017", "2018", "2019"))

Write file

del_cln_age_yr %>% nrow()
## [1] 2726
del_cln_age_yr %>% write_rds("data-test/ahrq_del_all_single_test.rds")

beepr::beep(4)