The purpose of this notebook is to process test data from multiple quarters of THCIC in-patient public use data files into a single data file of Total number of vaginal deliveries during the reporting time period, with Excluded Populations removed. as defined by Leapfrog’s 2020 Leapfrog Hospital Survey for Episiotomy (page 118). I also apply some data cleaning to remove records that have blank fields or mothers outside typical child-bearing age.

The methods used here are also used in 01-process-lf-epi-loop to process multiple years of data. This notebook cannot support that amount of data, but was used to check various steps in the filtering process for the main script. Please see the 01-process-lf-epi-loop for more additional details.

There is another notebook 00-process-lists where various lists of ICD-10 and other codes are defined separately. Those values are written out to the procedures-lists folder as .rds and .csv files and then imported into this notebook and others. See that notebook to inspect the lists.

library(fs)
library(tidyverse)

Set up import

We search through the data folder to build a list files to import into this notebook. The test data was created using the first 10,000 rows from one quarter of four years, 2016-2019.

# set up test data
test_data_dir <- "data-test"
test_tsv_files <- dir_ls(test_data_dir, recurse = TRUE, regexp = "test_base1")
test_tsv_files
## data-test/test_base1_1q2016.txt data-test/test_base1_1q2017.txt 
## data-test/test_base1_1q2018.txt data-test/test_base1_1q2019.txt

Import the base1 files

At this time, our analysis utilizes only one (PUDF_base1) of several files in the release for each quarter.

Of note:

# warnings are suppressed, so check problems()
# add/remove test_ as necessary
base1 <- test_tsv_files %>%
  map_dfr(
    read_tsv,
    col_types = cols(
      .default = col_character(),
      X168 = col_skip(),
      X167 = col_skip()
    )
  ) %>% 
  mutate_at(
    vars(contains("_CHARGES")), as.numeric
  )


# number of rows
base1 %>% nrow()

# klaxon for import complete
# beepr::beep(3)

There are fewer than 40k because of header rows.

Set up lists

These are various lists we need for processing, so I’ll pull them all in here:

# import the lists
diag_cols <- read_rds("procedures-lists/cols_diag.rds") %>% .$diag
vag_msdrg_list <- read_rds("procedures-lists/lf_vag_msdrg.rds") %>% .$vag_msdrg
vag_aprdrg_list <- read_rds("procedures-lists/lf_vag_aprdrg.rds") %>% .$vag_aprdrg
vag_excl_list <- read_rds("procedures-lists/lf_vag_excl.rds") %>% .$vag_excl
age_list <- read_rds("procedures-lists/utoha_age.rds") %>% .$age

# peek at them
diag_cols
##  [1] "ADMITTING_DIAGNOSIS" "PRINC_DIAG_CODE"     "OTH_DIAG_CODE_1"    
##  [4] "OTH_DIAG_CODE_2"     "OTH_DIAG_CODE_3"     "OTH_DIAG_CODE_4"    
##  [7] "OTH_DIAG_CODE_5"     "OTH_DIAG_CODE_6"     "OTH_DIAG_CODE_7"    
## [10] "OTH_DIAG_CODE_8"     "OTH_DIAG_CODE_9"     "OTH_DIAG_CODE_10"   
## [13] "OTH_DIAG_CODE_11"    "OTH_DIAG_CODE_12"    "OTH_DIAG_CODE_13"   
## [16] "OTH_DIAG_CODE_14"    "OTH_DIAG_CODE_15"    "OTH_DIAG_CODE_16"   
## [19] "OTH_DIAG_CODE_17"    "OTH_DIAG_CODE_18"    "OTH_DIAG_CODE_19"   
## [22] "OTH_DIAG_CODE_20"    "OTH_DIAG_CODE_21"    "OTH_DIAG_CODE_22"   
## [25] "OTH_DIAG_CODE_23"    "OTH_DIAG_CODE_24"
vag_msdrg_list
## [1] "768" "796" "797" "798" "805" "806" "807"
vag_aprdrg_list
## [1] "541" "542" "560"
vag_excl_list
## [1] "O660"
age_list
## [1] "05" "06" "07" "08" "09" "10" "11" "12" "23"

Creating the denominator dataset

Defined by Leapfrom as: The total number of vaginal deliveries during the reporting time period, with Excluded Populations removed.

Filter for vaginal deliveries

del_vag <- base1 %>% 
  filter(
    MS_DRG %in% vag_msdrg_list | APR_DRG %in% vag_aprdrg_list
  )

Exclusions from the deliveries

From Leapfrog:

Exclude any cases with the following ICD-10-CM diagnostic code in a primary or secondary field: - O66.0: Obstructed labor due to shoulder dystocia.

The exclusion dataset is also defined in processed-lists, even though it is a single field.

del_vag_excl <- del_vag %>% 
  filter_at(
    vars(all_of(diag_cols)),
    all_vars(
      !(. %in% vag_excl_list)
    )
  ) 

del_vag_excl %>% nrow()
## [1] 1529

Filter out blank cells

Here I am using some cleaning concepts that are outlined by AHRQ that would seem logical to include with this analysis as well. They are defined as: “with missing gender (SEX=missing), age (AGE=missing), quarter (DQTR=missing), year (YEAR=missing) or principal diagnosis (DX1=missing).”

In base1, the fields are SEX_CODE, PAT_AGE, DISCHARGE for both quarter and year, and PRINC_DIAG_CODE.

del_vag_cln <- del_vag_excl %>% 
  filter(
    SEX_CODE == "F",
    PAT_AGE != "`",
    RACE != "`",
    !is.na(DISCHARGE),
    !is.na(PRINC_DIAG_CODE)
  )

del_vag_cln %>% nrow()
## [1] 1524

Child-bearing age

Researchers at the Office of Health Affairs-Population Health, The University of Texas System work with the THCIC file daily and they suggest to filter deliveries to women of normal child-bearing age. The codes for the ages 15-49 include 05-12. For HIV or drug patients it includes 23 (18-44 yrs). I import those from procedures-lists.

Here we will filter for those values.

del_vag_cln_age <- del_vag_cln %>% 
  filter(PAT_AGE %in% age_list)

del_vag_cln_age %>% nrow()
## [1] 1520

Add convenience columns for dates

del_vag_cln_age <- del_vag_cln_age %>% 
  mutate(
    YR = substr(DISCHARGE, 1, 4)
  )

Remove other years

Because of a reporting lag, there are years in the original data that we are not using for our analysis. At some point in 2015 there was a switch from ICD-9 to ICD-10 coding, so going eariler would require some conversions. Not impossible, but not in scope at this time to ease complication.

We are using full years from 2016-2018 and a partial year 2019 through the 2nd quarter release.

del_vag_cln_age_yr <- del_vag_cln_age %>% 
  filter(YR %in% c("2016", "2017", "2018", "2019"))

Write file

del_vag_cln_age_yr %>% nrow()
## [1] 1520
del_vag_cln_age_yr %>% write_rds("data-test/lf_del_vag_single_test.rds")

beepr::beep(4)