By Christian McDonald, Assistant Professor of Practice
School of Journalism and Media, Moody College of Communication
University of Texas at Austin
The purpose of this notebook is to process test data from multiple quarters of THCIC in-patient public use data files into a single data file of deliveries without complications. This requires importing and applying several filtering options.
The methods in this notebook are used in 01-process-ahrq-del-loop to process multiple years of data. This notebook cannot support that amount of data, but it was used to check various steps in the filtering process for the main script. Please see 01-process-ahrq-del-loop for additional details.
There is another notebook, 00-process-lists, where various AHRQ lists of ICD-10 and other codes are defined separately. Those values are written out to the procedures-lists folder as .rds and .csv files and then imported into this notebook and others. See that notebook to inspect the lists.
library(fs)
library(tidyverse)
We search through the data-test folder to build a list of files to import into this notebook. The test data was created using the first 10,000 rows from one quarter in each of four years, 2016-2019.
# set up test data
test_data_dir <- "data-test"
test_tsv_files <- dir_ls(test_data_dir, recurse = TRUE, regexp = "test_base1")
test_tsv_files
## data-test/test_base1_1q2016.txt data-test/test_base1_1q2017.txt
## data-test/test_base1_1q2018.txt data-test/test_base1_1q2019.txt
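As a minimal sketch of how the file search behaves, the example below builds a throwaway folder with empty placeholder files and runs the same kind of dir_ls() call against it. The folder name and the third file name are hypothetical and exist only for illustration.

```r
library(fs)

# Hypothetical temp folder standing in for data-test
demo_dir <- path(tempdir(), "data-test-demo")
dir_create(demo_dir)

# Empty placeholder files; only the file names matter for this sketch
demo_names <- c(
  "test_base1_1q2016.txt",
  "test_base1_1q2017.txt",
  "test_other_1q2016.txt"  # hypothetical name that should NOT match
)
file_create(path(demo_dir, demo_names))

# Keep only the paths that match the regular expression
demo_files <- dir_ls(demo_dir, recurse = TRUE, regexp = "test_base1")
path_file(demo_files)
```

Only the two files with "test_base1" in the name come back, which is why the other files in each quarterly release are left out of the import.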
At this time, our analysis utilizes only one (PUDF_base1) of several files in the release for each quarter.
Of note: the unneeded last column is dropped with col_skip(). The EMERGENCY_DEPT_FLAG column was introduced in 2017, so we have to remove two different “last columns”.
# warnings are suppressed, so check problems()
# add/remove test_ as necessary
base1 <- test_tsv_files %>%
  map_dfr(
    read_tsv,
    col_types = cols(
      .default = col_character(),
      X168 = col_skip(),
      X167 = col_skip()
    )
  ) %>%
  mutate_at(
    vars(contains("_CHARGES")), as.numeric
  )
# number of rows
base1 %>% nrow()
# klaxon for import complete
# beepr::beep(3)
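The import reads every column as character and then converts the charges columns to numeric with mutate_at(). In dplyr 1.0 and later, the scoped verbs are superseded by across(), so the same conversion could be sketched as follows. The toy frame and its column values are made up for illustration.

```r
library(dplyr)
library(tibble)

# Toy frame: everything arrives as character, as in the import above
demo_base <- tibble(
  RECORD_ID = c("1", "2"),
  TOTAL_CHARGES = c("1500.25", "980"),
  TOTAL_NON_COV_CHARGES = c("0", "12.5")
)

# Convert only the columns whose names contain "_CHARGES"
demo_num <- demo_base %>%
  mutate(across(contains("_CHARGES"), as.numeric))

glimpse(demo_num)
```

The identifier column stays character while both charges columns become doubles.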
The logic here looks through a number of columns for a number of ICD codes.
In this case, we are looking at all columns with “DIAG” in the name (excluding the POA columns) for values in the delocmd_list, which comes from "DELOCMD*" in our IQI 33 reference. See 00-process-lists for details.
Then we import the DELOCMD list and filter for it.
delocmd_list <- read_rds("procedures-lists/ahrq_delocmd.rds") %>% .$delocmd
del <- base1 %>%
  filter_at(
    vars(
      matches("_DIAG"),
      -starts_with("POA")
    ),
    any_vars(
      . %in% delocmd_list
    )
  )
del %>% nrow()
## [1] 2749
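To make the "any diagnosis column matches" logic easier to check, here is a small sketch on toy data. It uses if_any(), which is available in dplyr 1.0.4 and later, as a stand-in for filter_at() with any_vars(). The codes in demo_codes are hypothetical placeholders, not real DELOCMD values.

```r
library(dplyr)
library(tibble)

# Hypothetical placeholder codes standing in for delocmd_list
demo_codes <- c("AAA1", "BBB2")

demo_visits <- tribble(
  ~PRINC_DIAG_CODE, ~OTH_DIAG_CODE_1, ~POA_OTH_DIAG_CODE_1,
  "AAA1",           "ZZZ9",           "Y",    # match in principal diagnosis
  "ZZZ9",           "BBB2",           "N",    # match in other diagnosis
  "ZZZ9",           "ZZZ9",           "AAA1", # match only in a POA column
  "ZZZ9",           NA,               "Y"     # no match at all
)

# Keep a row if ANY non-POA diagnosis column holds a listed code
demo_del <- demo_visits %>%
  filter(
    if_any(
      c(matches("_DIAG"), -starts_with("POA")),
      ~ .x %in% demo_codes
    )
  )

nrow(demo_del)
```

The third row is dropped even though a listed code appears in it, because that code sits in a POA column that the selection excludes.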
We peek here at the resulting frame to eyeball codes.
del %>%
  select(
    matches("_DIAG"),
    -starts_with("POA")
  ) %>%
  head(10)
Some further notebooks need to exclude cases with complications such as abnormal presentation, fetal death, or multiple gestation. Those exclusions will be handled in those notebooks as needed.
Here we only filter out missing or bad data.
“with missing gender (SEX=missing), age (AGE=missing), quarter (DQTR=missing), year (YEAR=missing) or principal diagnosis (DX1=missing).”
In base1, the fields are SEX_CODE, PAT_AGE, DISCHARGE for both quarter and year, and PRINC_DIAG_CODE.
del_cln <- del %>%
  filter(
    SEX_CODE == "F",
    PAT_AGE != "`",
    RACE != "`",
    !is.na(DISCHARGE),
    !is.na(PRINC_DIAG_CODE)
  )
del_cln %>% nrow()
## [1] 2731
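One detail worth seeing on toy data is that filter() drops a row whenever a condition evaluates to NA, so comparisons like SEX_CODE == "F" and PAT_AGE != "`" also remove records where those fields are missing. The sketch below uses made-up rows and placeholder diagnosis codes to show which records survive.

```r
library(dplyr)
library(tibble)

demo_del <- tribble(
  ~SEX_CODE, ~PAT_AGE, ~DISCHARGE, ~PRINC_DIAG_CODE,
  "F",       "08",     "2016Q1",   "AAA1",  # complete record, kept
  "F",       "`",      "2016Q1",   "AAA1",  # backtick age, dropped
  NA,        "09",     "2016Q1",   "AAA1",  # NA == "F" is NA, dropped
  "F",       NA,       "2017Q1",   "AAA1",  # NA != "`" is NA, dropped
  "F",       "10",     NA,         "AAA1"   # missing discharge, dropped
)

demo_cln <- demo_del %>%
  filter(
    SEX_CODE == "F",
    PAT_AGE != "`",
    !is.na(DISCHARGE),
    !is.na(PRINC_DIAG_CODE)
  )

nrow(demo_cln)
```

Only the first row remains, so the explicit is.na() checks are needed only for fields that have no other comparison applied to them.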
Researchers at the Office of Health Affairs-Population Health, The University of Texas System work with the THCIC file daily, and they suggest filtering deliveries to women of normal child-bearing age.
We’ll look here at how those ages break down in the cleaned file:
del_cln %>%
count(PAT_AGE)
The PAT_AGE codes for ages 15-49 are 05 through 12. For HIV or drug patients, the list also includes code 23 (18-44 yrs). I import those codes from the procedures-lists folder.
Here we will filter for those values.
age_list <- read_rds("procedures-lists/utoha_age.rds") %>% .$age
del_cln_age <- del_cln %>%
filter(PAT_AGE %in% age_list)
del_cln_age %>% nrow()
## [1] 2726
Peeking at records outside the child-bearing age list to make sure there are none.
# set up not in
`%ni%` <- Negate(`%in%`)
del_cln_age %>%
  filter(PAT_AGE %ni% age_list) %>%
  select(PAT_AGE) %>%
  count(PAT_AGE)
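A complementary check is to run the same "not in" filter against the frame from before the age filter, which shows which age codes were removed and how many records each one accounted for. This is only a sketch on toy data: demo_age_list assumes the list holds codes 05 through 12 plus 23, as described in the text, and the actual contents of the .rds file may differ.

```r
library(dplyr)
library(tibble)

# set up not in
`%ni%` <- Negate(`%in%`)

# Assumed age list: codes 05-12 plus 23, per the description above
demo_age_list <- c(sprintf("%02d", 5:12), "23")

# Toy stand-in for the frame BEFORE the age filter
demo_cln <- tibble(PAT_AGE = c("04", "07", "12", "13", "23", "07"))

# Count the records the age filter would remove, by age code
demo_dropped <- demo_cln %>%
  filter(PAT_AGE %ni% demo_age_list) %>%
  count(PAT_AGE)

demo_dropped
```

In the real data, the counts in this table should add up to the difference between the row counts before and after the age filter.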
del_cln_age_yr <- del_cln_age %>%
  mutate(
    YR = substr(DISCHARGE, 1, 4)
  )
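The YR column relies on the year occupying the first four characters of DISCHARGE. As a small sketch, assuming the values look like "2016Q1" (a format assumed here for illustration, not confirmed by this notebook), the year and the quarter could be pulled apart and the format checked like this:

```r
# Assumed DISCHARGE format: four-digit year, "Q", quarter number
demo_discharge <- c("2016Q1", "2017Q1", "2019Q2")

# Confirm every value follows the assumed pattern before slicing
demo_format_ok <- all(grepl("^[0-9]{4}Q[1-4]$", demo_discharge))

# Year is characters 1-4; quarter is character 6
demo_yr <- substr(demo_discharge, 1, 4)
demo_qtr <- substr(demo_discharge, 6, 6)

data.frame(DISCHARGE = demo_discharge, YR = demo_yr, QTR = demo_qtr)
```

Checking the pattern first guards against a silent mis-slice if a release ever changes how the field is written.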
Because of a reporting lag, there are years in the original data that we are not using for our analysis. At some point in 2015 there was a switch from ICD-9 to ICD-10 coding, so going earlier would require some conversions. That is not impossible, but it is out of scope at this time to keep the analysis simpler.
We are using full years from 2016-2018 and a partial year 2019 through the 2nd quarter release. This is subject to change as new data is released.
del_cln_age_yr <- del_cln_age_yr %>%
  filter(YR %in% c("2016", "2017", "2018", "2019"))
del_cln_age_yr %>% nrow()
## [1] 2726
del_cln_age_yr %>% write_rds("data-test/ahrq_del_all_single_test.rds")
beepr::beep(4)