---
title: "Clean PUDF"
author: "Christian McDonald"
---

## Goals

This notebook is designed to filter the PUDF data to get just maternal cases across the state. We then filter a version just Harris County. We facility info and do some other cleanup.

We use standards from two different but related groups in our analysis. For Severe Maternal Morbidity we use Healthcare Cost and Utilization Project (HCUP) standard, while other standards are Agency for Healthcare Research and Quality (AHRQ) Inpatient Quality Indicators.

::: callout-important

### Raw data is not saved

There are rules of engagement of this data, including that no raw data can be published. See [TDSHS's page](https://www.dshs.texas.gov/center-health-statistics/texas-health-care-information-collection/download-and-purchase-data/texas-inpatient-public-use-data-file-pudf) for more information.

:::

## Setup

```{r}
#| label: setup
#| message: false
#| warning: false

library(tidyverse)
library(janitor)
library(fs)
```

## Data sources

### Test flag

Because this project uses such a large amount of data, I created a "test" set of data by pulling the first 10,000 rows from each file and saving it in a special directory. The script for that is in `scripts/copy_test_data.sh`.

Here I set a test flag here that limits what data we are considering. Use `T` when you want to run fast with just test data, and `F` when you want to run against all the PUDF files. It takes about 10 minutes to process all the PUDF files.

```{r}
test_flag <- T
```


### Set paths

Here we set the source directories for the test and full data.

```{r}
#| label: set-paths

# Where IQI Indicator files are stored
dir_data_test <- "../data-original/pudf-data-test/"
dir_data_prod <- "../data-original/pudf-data-files/"
```

### Get list of files

We are building a list of files paths so we can loop over them later to import and process them.

```{r}
files_list <- dir_ls(
  if(test_flag == T) dir_data_test else dir_data_prod,
  recurse = T,
  # regexp = "_BASE|_base"
  regexp = "BASE_DATA_1.*\\.txt|base1.*\\.txt$"
)

files_list
```

## Filtering functions

Because the entire PUDF data is too big to hold into memory, we apply some filters as we process our files. The functions are set up first, then used later in the processing loop.

We use a series files that include ICD-10 codes that we processed in the [Stored Lists](00-stored-lists.qmd) notebook.

### Filter for births

We start with the following the [clinical coding definitions](https://datatools.ahrq.gov/hcup-fast-stats/?tab=special-emphasis&dash=92#accordion-92) from HCUP's Severe Maternal Morbidity analysis to build our list of deliveries.

The HCUP definition is larger and inclusive of the AHRQ standard of "DELOCMD" cases. Once we have our records, we mark those that meet the more narrow AHRQ "DELOCMD" definition.

First we import our necessary ICD-10 code lists.

```{r}
# ahrq delivery diagnosis codes
delocmd_list <- read_rds("../data-published/technical-specs/delocmd.rds") |> pull(delocmd)
# gap list that adds hcup delivery diagnosis codes
hcup_deldiag_gap_list <- read_rds("../data-published/technical-specs/hcup_deldiag.rds") |> pull(deldiag)
# combine above to get full hcup list
hcup_deldiag_combo_list <- c(delocmd_list, hcup_deldiag_gap_list)
# hcup MS-DRG codes
hcup_deldrg_list <- read_rds("../data-published/technical-specs/hcup_deldrg.rds") |> pull(deldrg)
# hcup procedure codes
hcup_delproc <- read_rds("../data-published/technical-specs/hcup_delproc.rds") |> pull(delproc)
```

Now we build a function that will look through each incoming file and filter it to include any delivery condition outlined by HCUP.

```{r}
filter_del <- function(.data) {
  .data |> 
    filter(
      # Any hcup diagnostic codes  
      if_any(matches("_DIAG") & -starts_with("POA"),
             ~ .x %in% hcup_deldiag_combo_list)
      # or hcup MS_DRG codes or any hcup procedure codes
      | MS_DRG %in% hcup_deldrg_list
      | if_any(matches("_SURG_PROC") & -contains("DAY"),
              ~ (.x %in% hcup_delproc))
    )
}
```

### Filter missing data

This filter results patients that are female (since we are dealing with deliveries). It also removes records where we might not have the discharge quarter/year, age or race.

```{r}
filter_clean <- function(.data) {
  .data |> 
    filter(
      SEX_CODE == "F",
      PAT_AGE != "`",
      RACE != "`",
      !is.na(DISCHARGE),
      !is.na(PRINC_DIAG_CODE)
    )
}
```


### Child-bearing age

In 2020 I spoke with researchers at the [Office of Health Affairs-Population Health](https://dellmed.utexas.edu/units/department-of-population-health) at Dell Medical Center who work with this data and they suggested filtering deliveries to women of normal child-bearing age, 15-49.

> We may not filter the age so we can see how many patients fall outside that range.

```{r}
age_list <- read_rds("../data-published/technical-specs/age.rds") |> pull(age)

filter_age <- function(.data) {
  .data |>  
    filter(PAT_AGE %in% age_list)
}
```

### Harris County

Finds our facilities in Harris County.

```{r}
filter_harris = function(.data) {
  .data |> 
    filter(FAC_CNTY == "Harris")
}
```


## Mutating functions


### Define delivery standard

> MIGHT NEED UPDATE

The definition for live births is broader for HCUP, we will following the [clinical coding definitions](https://datatools.ahrq.gov/hcup-fast-stats/?tab=special-emphasis&dash=92#accordion-92) from HCUP's Severe Maternal Morbidity analysis to find those records. The AHRQ records are inclusive within HCUP, so we'll flag AHRQ records so we can filter to them.

```{r}
define_standard <- 
  function(.data) {
    .data |>
    mutate(
      DEL_STD = if_else(if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ .x %in% delocmd_list),
        T, F)
      )    
  }
```

- A `TRUE` value means a record _ONLY_ fits the more narrow AHRQ delivery definition
- A `FALSE` value means a record fits _BOTH_ the AHRQ and HCUP delivery definitions

### Convert charges to numeric

To avoid sniffing errors, we force all columns to be text upon import. Here we convert some monetary variables back to a number.

```{r}
conv_charges <-
  function(.data) {
    .data |> mutate(across(contains("_CHARGES"), as.numeric))
  }
```


### Add year

Convenience variable for later analysis.

```{r}
conv_year <- function(.data) {
  .data |> mutate(YR = substr(DISCHARGE, 1, 4), .after = DISCHARGE)
}
```

### Add facility info

We want to understand the geographic location of each facility, so we add that information based on a facilities list provided by <thcichelp@dshs.texas.gov>.  While working on this, I compared the `PROVIDER` variable in our Harris data with `facility` in the facilities list and found there were only a handful of discrepancies and they were minor name variations, so I dropped `PROVIDER` in favor of `FAC_NAME`.

```{r}
facilities_list <-
  read_rds("../data-published/facilities.rds") |> 
  select(thcic_id, facility, submission_address, city, county) |> 
  rename(
    THCIC_ID = thcic_id,
    FAC_NAME = facility,
    FAC_ADDR = submission_address,
    FAC_CITY = city,
    FAC_CNTY = county
  )

add_facility_info <- function(.data) {
  .data |> 
    left_join(facilities_list, by = join_by(THCIC_ID)) |> 
    select(!PROVIDER_NAME)
}
```


### Add Grouper file

In 2022, the file structure changed and some fields were moved into a "GROUPER" file, and we need some of those values from the `MS_DRG` field in that data. We check the year of the incoming file and then decide if and how to join with their grouper file.

```{r}
add_grouper <- function(df, file_path) {
    
  file_year <- basename(file_path) |> str_extract("(\\d)q(\\d{4})", group = 2) |> as.numeric()

  # If year 2022, we join with the grouper file
  if (file_year == 2022) {
    grouper_path <- file_path |> str_replace("_base1_", "_grouper_")
    orig_grouper <- read_tsv(grouper_path,
      col_types = cols(
        .default = col_character()
      )) |>
    select(!starts_with("..."))
    
    data_join <- df |> left_join(orig_grouper, by = join_by(RECORD_ID))
  } 

  # In 2023, the file name structure changed, but idea is same
  else if (file_year >= 2023) {
    grouper_path <- file_path |> str_replace("_BASE_DATA_1_", "_GROUPER_")
    orig_grouper <- read_tsv(grouper_path,
      col_types = cols(
        .default = col_character()
      )) |>
    select(!starts_with("..."))
    
    data_join <- df |> left_join(orig_grouper, by = join_by(RECORD_ID))
  }

  # This keeps pre-2022 files as they are
  else {
    data_join <- df
  }

  return(data_join)
}
```

## Test with single file

Here we can test our various functions based on a single file. Makes for easier troubleshooting.

```{r}
#| warning: false
#| message: false
#| eval: false

testy_file_path <- "../data-original/pudf-data-test/PUDF 1Q2023 tab-delimited/IP_PUDF_BASE_DATA_1_1q2023_tab.txt"

testy <-
  read_tsv(testy_file_path,
           col_types = cols(.default = col_character())) |> 
    select(!starts_with("...")) |> 
    add_grouper(testy_file_path) |>
    filter_del() |>
    filter_clean() |>
    define_standard() |>
    conv_charges() |>
    conv_year() |>
    add_facility_info()

# testy |> names()
```


## Multi-file process

Here we process each quarterly file individually and export it so we can make some decisions about processing along the way.

In 2022, the "grouper" file was introduced, and we need to join with that file for some of our analysis. Then the file structure changed in 2023, so we account for that.

```{r}
#| label: process-multi
#| warning: false
#| message: false

dir_output_test <- "../data-processed/multi-test/"
dir_output_pudf <- "../data-processed/multi-pudf/"
dir_output <- if(test_flag == T) dir_output_test else dir_output_pudf
dir_create(dir_output)

process_files <- function(file_path) {
  
  # Imports the individual file
  data <- read_tsv(
    file_path,
    col_types = cols(
      .default = col_character()
    )
  )
  
  # Checks yr and adds grouper file if necessary
  data_join <- data |> add_grouper(file_path)
  
  # Other filtering and mutating
  data_mod <-
    data_join |>
    select(!starts_with("...")) |> 
    filter_del() |>
    filter_clean() |> 
    # filter_age()|> 
    define_standard() |>
    conv_charges() |> 
    conv_year() |>
    add_facility_info()

  # configures name to write file
  file_name <- basename(file_path) |> str_replace(".txt", ".rds")
  
  # saves the file
  write_rds(data_mod, file.path(dir_output, file_name))
}

## This runs the files list through the process function
walk(files_list, process_files)
```

The result is we end up with one file for each quarter saved in the `data-processed` folder, which is not committed in git or Github.

### Assemble the files

Now we bind all those processed files together into a single file of deliveries.

```{r}
#| label: multi-bind

rds_dir <- dir_ls(
  if(test_flag == T) dir_output_test else dir_output_pudf,
  recurse = T,
  regexp = "*.rds$"
)

rds_dir

done <- rds_dir |>
  map(read_rds) |>
  list_rbind()

done |> nrow()
```

## Export Maternal records

Here we write the file to `data-processed.` Again, not committed to git.

```{r}
dir_write_test <- "../data-processed/test/01-maternal-cases.rds"
dir_write_pudf <- "../data-processed/pudf/01-maternal-cases.rds"

done |> write_rds(if(test_flag == T) dir_write_test else dir_write_pudf)
```

## Export Harris records

Here we filter the Texas data to just facilities in Harris county and export those.

```{r}
dir_write_harris_test <- "../data-processed/test/01-maternal-harris.rds"
dir_write_harris_pudf <- "../data-processed/pudf/01-maternal-harris.rds"

done |> 
  filter(FAC_CNTY == "Harris") |> 
  write_rds(if(test_flag == T) dir_write_harris_test else dir_write_harris_pudf)
```


## Done

A klaxon to indicate the processing is complete, as it can take a while.

```{r}
beepr::beep(4)
```