The purpose of this notebook is to process test data from multiple quarters of THCIC in-patient public use data files into a single data file of Total number of vaginal deliveries during the reporting time period, with Excluded Populations removed. as defined by Leapfrog’s 2020 Leapfrog Hospital Survey for Episiotomy (page 118). I also apply some data cleaning to remove records that have blank fields or mothers outside typical child-bearing age.
The methods used here are also used in 01-process-lf-epi-loop
to process multiple years of data. This notebook cannot support that amount of data, but was used to check various steps in the filtering process for the main script. Please see the 01-process-lf-epi-loop
for more additional details.
There is another notebook 00-process-lists
where various lists of ICD-10 and other codes are defined separately. Those values are written out to the procedures-lists
folder as .rds and .csv files and then imported into this notebook and others. See that notebook to inspect the lists.
library(fs)
library(tidyverse)
We search through the data
folder to build a list files to import into this notebook. The test data was created using the first 10,000 rows from one quarter of four years, 2016-2019.
# set up test data
test_data_dir <- "data-test"
test_tsv_files <- dir_ls(test_data_dir, recurse = TRUE, regexp = "test_base1")
test_tsv_files
## data-test/test_base1_1q2016.txt data-test/test_base1_1q2017.txt
## data-test/test_base1_1q2018.txt data-test/test_base1_1q2019.txt
At this time, our analysis utilizes only one (PUDF_base1) of several files in the release for each quarter.
Of note:
col_skip()
. The EMERGENCY_DEPT_FLAG
col was introduced in 2017, so we have to remove two differnet “last columns”.# warnings are suppressed, so check problems()
# add/remove test_ as necessary
base1 <- test_tsv_files %>%
map_dfr(
read_tsv,
col_types = cols(
.default = col_character(),
X168 = col_skip(),
X167 = col_skip()
)
) %>%
mutate_at(
vars(contains("_CHARGES")), as.numeric
)
# number of rows
base1 %>% nrow()
# klaxon for import complete
# beepr::beep(3)
There are fewer than 40k because of header rows.
These are various lists we need for processing, so I’ll pull them all in here:
# import the lists
diag_cols <- read_rds("procedures-lists/cols_diag.rds") %>% .$diag
vag_msdrg_list <- read_rds("procedures-lists/lf_vag_msdrg.rds") %>% .$vag_msdrg
vag_aprdrg_list <- read_rds("procedures-lists/lf_vag_aprdrg.rds") %>% .$vag_aprdrg
vag_excl_list <- read_rds("procedures-lists/lf_vag_excl.rds") %>% .$vag_excl
age_list <- read_rds("procedures-lists/utoha_age.rds") %>% .$age
# peek at them
diag_cols
## [1] "ADMITTING_DIAGNOSIS" "PRINC_DIAG_CODE" "OTH_DIAG_CODE_1"
## [4] "OTH_DIAG_CODE_2" "OTH_DIAG_CODE_3" "OTH_DIAG_CODE_4"
## [7] "OTH_DIAG_CODE_5" "OTH_DIAG_CODE_6" "OTH_DIAG_CODE_7"
## [10] "OTH_DIAG_CODE_8" "OTH_DIAG_CODE_9" "OTH_DIAG_CODE_10"
## [13] "OTH_DIAG_CODE_11" "OTH_DIAG_CODE_12" "OTH_DIAG_CODE_13"
## [16] "OTH_DIAG_CODE_14" "OTH_DIAG_CODE_15" "OTH_DIAG_CODE_16"
## [19] "OTH_DIAG_CODE_17" "OTH_DIAG_CODE_18" "OTH_DIAG_CODE_19"
## [22] "OTH_DIAG_CODE_20" "OTH_DIAG_CODE_21" "OTH_DIAG_CODE_22"
## [25] "OTH_DIAG_CODE_23" "OTH_DIAG_CODE_24"
vag_msdrg_list
## [1] "768" "796" "797" "798" "805" "806" "807"
vag_aprdrg_list
## [1] "541" "542" "560"
vag_excl_list
## [1] "O660"
age_list
## [1] "05" "06" "07" "08" "09" "10" "11" "12" "23"
Defined by Leapfrom as: The total number of vaginal deliveries during the reporting time period, with Excluded Populations removed.
del_vag <- base1 %>%
filter(
MS_DRG %in% vag_msdrg_list | APR_DRG %in% vag_aprdrg_list
)
From Leapfrog:
Exclude any cases with the following ICD-10-CM diagnostic code in a primary or secondary field: - O66.0: Obstructed labor due to shoulder dystocia.
The exclusion dataset is also defined in processed-lists, even though it is a single field.
del_vag_excl <- del_vag %>%
filter_at(
vars(all_of(diag_cols)),
all_vars(
!(. %in% vag_excl_list)
)
)
del_vag_excl %>% nrow()
## [1] 1529
Here I am using some cleaning concepts that are outlined by AHRQ that would seem logical to include with this analysis as well. They are defined as: “with missing gender (SEX=missing), age (AGE=missing), quarter (DQTR=missing), year (YEAR=missing) or principal diagnosis (DX1=missing).”
In base1, the fields are SEX_CODE
, PAT_AGE
, DISCHARGE
for both quarter and year, and PRINC_DIAG_CODE
.
del_vag_cln <- del_vag_excl %>%
filter(
SEX_CODE == "F",
PAT_AGE != "`",
RACE != "`",
!is.na(DISCHARGE),
!is.na(PRINC_DIAG_CODE)
)
del_vag_cln %>% nrow()
## [1] 1524
Researchers at the Office of Health Affairs-Population Health, The University of Texas System work with the THCIC file daily and they suggest to filter deliveries to women of normal child-bearing age. The codes for the ages 15-49 include 05-12. For HIV or drug patients it includes 23 (18-44 yrs). I import those from procedures-lists
.
Here we will filter for those values.
del_vag_cln_age <- del_vag_cln %>%
filter(PAT_AGE %in% age_list)
del_vag_cln_age %>% nrow()
## [1] 1520
del_vag_cln_age <- del_vag_cln_age %>%
mutate(
YR = substr(DISCHARGE, 1, 4)
)
Because of a reporting lag, there are years in the original data that we are not using for our analysis. At some point in 2015 there was a switch from ICD-9 to ICD-10 coding, so going eariler would require some conversions. Not impossible, but not in scope at this time to ease complication.
We are using full years from 2016-2018 and a partial year 2019 through the 2nd quarter release.
del_vag_cln_age_yr <- del_vag_cln_age %>%
filter(YR %in% c("2016", "2017", "2018", "2019"))
del_vag_cln_age_yr %>% nrow()
## [1] 1520
del_vag_cln_age_yr %>% write_rds("data-test/lf_del_vag_single_test.rds")
beepr::beep(4)