---
title: "Categorization"
format: html
author: "Teresa Do"
---

## Goal of this notebook

This notebook is designed to flag the discharge data for various procedures, like Cesarean delivery, that are later used in analysis notebooks. We'll flag the data in this notebook and then write it out to an RDS file to be imported in the other notebooks.

## Definition Notes

We will not report rates "per 1,000 deliveries" as most of the AHRQ IQI indicators do. Instead, we use the definitions as-is, without multiplying by 1,000.

We will not be excluding cases with a missing MDC because we aren't able to determine whether the user indicated that MDC is provided.

Cases with a missing sex, age, quarter, year or principal diagnosis are already filtered out in the [Cleaning notebook](01-clean-pudf.qmd).

## Setup

> Start with these for now.

```{r}
#| label: setup
#| message: false
#| warning: false

library(tidyverse)
library(janitor)
```

## Import maternal data

We previously found all of the maternal cases in the [Clean PUDF notebook](01-clean-pudf.qmd). That notebook filtered for all deliveries in Harris County identified by any listed ICD-10-CM diagnosis code for deliveries.

```{r}
#| label: all-deliveries-import
#| message: false
#| warning: false

deliveries <- read_rds("../data-processed/pudf/01-maternal-cases.rds")

deliveries |> nrow()
```

## Import IQI indicator and grouping lists

The lists we'll use to flag our data for the respective procedures were compiled in the [Stored Lists notebook](00-stored-lists.qmd). Longer descriptions of those lists can also be found there.

```{r}
#| label: import-lists
#| message: false
#| warning: false

# procedure codes identifying Cesarean delivery (PRCSECP)
prcsecp_list <- read_rds("../data-published/technical-specs/prcsecp.rds") |>
  pull(prcsecp)

# hysterotomy procedure codes (PRCSE2P)
prcse2p_list <- read_rds("../data-published/technical-specs/ahrq_prcse2p.rds") |>
  pull(prcse2p)

# diagnosis codes for abnormal presentation, preterm delivery, fetal death, multiple gestation or breech presentation (PRCSECD)
prcsecd_list <- read_rds("../data-published/technical-specs/prcsecd.rds") |>
  pull(presecd)

# diagnosis codes assigned to MDC 15 Newborns & Other Neonates with Conditions... (MDC15PRINDX)
mdc15prindx_list <- read_rds("../data-published/technical-specs/mdc15prindx.rds") |>
  pull(mdc15prindx)

# diagnosis codes assigned to previous Cesarean delivery (PRVBACD)
prvbacd_list <- read_rds("../data-published/technical-specs/prvbacd.rds") |>
  pull(prvbacd)

# procedure codes assigned to vaginal deliveries (VAGDELP)
vagdelp_list <- read_rds("../data-published/technical-specs/vagdelp.rds") |> pull(vagdelp)

# diagnosis and procedures codes assigned to smm
smm_nonseries_list <- read_rds("../data-published/technical-specs/smm_nonseries.rds") |> pull(nonseries)
smm_proc_list <- read_rds("../data-published/technical-specs/smm_proc.rds") |> pull(smm_proc)
smm_excluded_list <- read_rds("../data-published/technical-specs/smm_exclude.rds") |> pull(smm_exclude)
smm_series_list <- read_rds("../data-published/technical-specs/smm_series.rds") |> pull(smm_series)
smm_nonexclusion_list <- read_rds("../data-published/technical-specs/smm_nonexclusion.rds") |> pull(smm_nonexclusion)
```

## Race recategorization

Throughout multiple analysis notebooks, we will be breaking down the rates by race as well as by hospital. The Texas Inpatient discharge data has this demographic information under two columns: `RACE` and `ETHNICITY`.

The `RACE` field categorizes individuals with values like so:

* `1`: American Indian
* `2`: Asian or Pacific Islander
* `3`: Black
* `4`: White
* `5`: Other (if a hospital has fewer than ten patients of one race)

We need to change one thing about this categorization: we also want to identify Hispanic individuals, who are marked under the `ETHNICITY` field like so:

* `1`: Hispanic Origin
* `2`: Not of Hispanic Origin

*NOTE*: There are fewer American Indian cases than any other `RACE` category (think hundreds vs. thousands).

So we will create a new column, `MOD_RACE`, that also reflects whether someone was marked as Hispanic, instead of overwriting `RACE`. Because the `ETHNICITY` check comes first, anyone marked Hispanic is categorized as Hispanic regardless of their `RACE` value.

```{r}
#| label: race-recategorization
#| message: false
#| warning: false

deliveries_race <- deliveries |>
  mutate(
    MOD_RACE = case_when(
      ETHNICITY == "1" ~ "Hispanic",
      RACE == "1" ~ "American Indian",
      RACE == "2" ~ "Asian or Pacific Islander",
      RACE == "3" ~ "Black",
      RACE == "4" ~ "White",
      .default = "Other"
    )
  )

deliveries_race
```
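To make the order of the checks concrete, here is a minimal sketch on made-up rows (the `toy_race` data is hypothetical and only uses a subset of the `RACE` values). It shows that a row marked Hispanic keeps that label even when `RACE` has a value, and that a missing `RACE` falls through to "Other".

```r
library(dplyr)

# Hypothetical toy rows; codes are stored as character, as in the chunk above
toy_race <- tibble(
  RACE = c("3", "4", "5", NA),
  ETHNICITY = c("2", "1", "1", "2")
)

toy_race_mod <- toy_race |>
  mutate(
    MOD_RACE = case_when(
      # checked first, so it wins over any RACE value
      ETHNICITY == "1" ~ "Hispanic",
      RACE == "3" ~ "Black",
      RACE == "4" ~ "White",
      # anything unmatched, including a missing RACE, lands here
      .default = "Other"
    )
  )

toy_race_mod
```

If we ever wanted a missing `RACE` kept apart from "Other", it would need its own `is.na(RACE)` branch before the default.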

## Medicaid vs. Other Payment Methods

One measure we're interested in is the percentage of people that pay for certain procedures with or without Medicaid. Medicaid is identified in the data as `MC`. We will consider the column `FIRST_PAYMENT_SRC` to be able to determine whether a case was paid with Medicaid or without.

```{r}
#| label: medicaid
#| message: false
#| warning: false

deliveries_race |> count(FIRST_PAYMENT_SRC)

deliveries_race_mc <- deliveries_race |>
  mutate(
    MC = if_else(FIRST_PAYMENT_SRC == "MC", T, F),
    MC = if_else(is.na(MC), F, MC)
  )

deliveries_race_mc |> count(MC)
```
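The second `if_else()` above is there because comparing a missing payment source to `"MC"` gives `NA`, not `FALSE`. A small sketch on made-up values (the `"12"` payment code is a placeholder, not a real PUDF value) shows this, along with `coalesce()` as one possible shorter way to get the same result:

```r
library(dplyr)

# Hypothetical payment sources: Medicaid, some other payer, and missing
toy_pay <- tibble(FIRST_PAYMENT_SRC = c("MC", "12", NA))

toy_pay_mc <- toy_pay |>
  mutate(
    # the raw comparison leaves the missing row as NA
    MC_RAW = FIRST_PAYMENT_SRC == "MC",
    # coalesce() swaps that NA for FALSE in one step
    MC = coalesce(MC_RAW, FALSE)
  )

toy_pay_mc
```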

## AHRQ Definition of Deliveries

Within this analysis, we are working with two standards: AHRQ's IQI indicators and HCUP. The two define a delivery differently, in particular by using different ICD-10 and MS-DRG codes.

We will save the AHRQ definition in a separate data frame for all of the definitions that use the AHRQ IQI Indicator definitions.

```{r}
#| label: ahrq-defined-deliveries

ahrq_deliveries <- deliveries_race_mc |>
  filter(DEL_STD == TRUE)

# honestly just going to save the resulting hcup definition in its own data
# frame so that the naming convention is more standard

hcup_deliveries <- deliveries_race_mc

ahrq_deliveries |> count(DEL_STD)
```

## Cesarean deliveries

In the [Cesarean Delivery Rate notebook](03-analysis-csec.qmd), we are comparing the Cesarean delivery rate for Texas, Harris County and individual hospitals in Harris County. In order to do so, we need to flag what cases in our maternal data are actually Cesarean deliveries.

We use the AHRQ definition of the Cesarean delivery rate to determine what cases are Cesarean deliveries. This is defined in [IQI 21 Cesarean Delivery Rate, Uncomplicated](https://qualityindicators.ahrq.gov/Downloads/Modules/IQI/V2025/TechSpecs/IQI_21_Cesarean_Delivery_Rate_Uncomplicated.pdf) as:

"Cesarean deliveries without a hysterotomy procedure per 1,000 deliveries. Excludes deliveries with complications (abnormal presentation, preterm delivery, fetal death, multiple gestation, or breech presentation)."

### Numerator and Denominator

The indicator uses the following numerator and denominator definitions to calculate the rate:

*Numerator*: Discharges, among cases meeting the inclusion and exclusion rules for the denominator, with any listed ICD-10-PCS procedure codes for Cesarean delivery (PRCSECP) and without any listed ICD-10-PCS procedure codes for hysterotomy (PRCSE2P).

*Denominator*: All deliveries, identified by any listed ICD-10-CM diagnosis code for outcome of delivery (DELOCMD).

Exclude:

* with any listed ICD-10-CM diagnosis code for abnormal presentation, preterm delivery, fetal death, multiple gestation, or breech presentation (Appendix A: PRCSECD)
* with a principal ICD-10-CM diagnosis code assigned to MDC 15 Newborns & Other Neonates with Conditions Originating in Perinatal Period (Appendix B: MDC15PRINDX)
* with an ungroupable DRG (DRG=999)
* with missing gender (SEX = missing), age (AGE = missing), quarter (DQTR = missing), year (YEAR = missing) or principal diagnosis (DX1 = missing)
* with missing MDC (MDC = missing) when user indicates that MDC is provided

### Denominator

Right now, we have a set of all delivery data, but for the denominator as specified by IQI 21, we need to exclude the cases listed in the AHRQ definition above.

#### PRCSECD

Exclude discharges with "any listed ICD-10-CM diagnosis code for abnormal presentation, preterm delivery, fetal death, multiple gestation or breech presentation."

```{r}
#| label: exclude-PRCSECD
#| message: false
#| warning: false

ahrq_deliveries |> nrow()

deliveries_abn <- ahrq_deliveries |>
  filter(
    if_all(
      matches("_DIAG") & -starts_with("POA"),
      ~ !(.x %in% prcsecd_list)
    )
  )

deliveries_abn |> nrow()
deliveries_abn |> slice_sample(n = 10)
```
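As a rough illustration of how the `if_all()` exclusion behaves, here is a sketch on a made-up table. The column names, the `"ABN1"` code and `toy_exclusion_list` are all hypothetical stand-ins; only the filter pattern mirrors the chunk above.

```r
library(dplyr)

# Hypothetical records with two diagnosis columns and one POA column
toy_dx <- tibble(
  RECORD_ID = c("a", "b", "c"),
  PRINC_DIAG_CODE = c("DEL1", "ABN1", "DEL1"),
  OTH_DIAG_CODE_1 = c("OUT1", "OUT1", NA),
  POA_PRINC_DIAG_CODE = c("Y", "Y", "Y")
)

# Placeholder for a list like prcsecd_list
toy_exclusion_list <- c("ABN1")

toy_kept <- toy_dx |>
  filter(
    # keep a row only if none of its diagnosis columns is on the list;
    # NA %in% list is FALSE, so empty diagnosis slots do not drop a row
    if_all(
      matches("_DIAG") & -starts_with("POA"),
      ~ !(.x %in% toy_exclusion_list)
    )
  )

toy_kept
```

Record "b" is dropped because one of its diagnosis columns is on the list, while record "c" survives even though its second diagnosis slot is empty.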

#### MDC15PRINDX

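Exclude any discharges with a principal ICD-10-CM diagnosis code assigned to MDC 15 Newborns & Other Neonates with Conditions Originating in Perinatal Period. Note that the filter below checks every diagnosis column rather than only the principal diagnosis, so it is stricter than the AHRQ rule.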

```{r}
#| label: exclude-MDC15PRINDX
#| message: false
#| warning: false

deliveries_abn |> nrow()

deliveries_abn_new <- deliveries_abn |>
  filter(
    if_all(
      matches("_DIAG") & -starts_with("POA"),
      ~ !(.x %in% mdc15prindx_list)
    )
  )

deliveries_abn_new |> nrow()
deliveries_abn_new |> slice_sample(n = 10)
```

#### Ungroupable DRG

DRG stands for Diagnosis Related Group.

We need to remove any discharges that are ungroupable (meaning that they don't fall under any diagnosis grouping), which is signified by the `MS_DRG` or `APR_DRG` field being 999. After 2022, `FROZEN_MS_DRG` and `FROZEN_APR_DRG` are used.

>>> REWORK TO CHANGE THIS TO LOOK AT ALL DRG COLUMNS AND THEN CASE_WHEN IF THEY ARE EMPTY OR 999

```{r}
#| label: MS_DRG
#| message: false
#| warning: false

deliveries_uncmp <- deliveries_abn_new |>
  # there aren't any APR_DRG with values "999" but this should filter anyways
  filter(
    (APR_DRG != "999") |
    is.na(APR_DRG)
  ) |>
  # filters for MS_DRG but keeps all of the na values as well
  filter(
    MS_DRG != "999" |
    is.na(MS_DRG)
  ) |>
  # 2022 and after has the values in FROZEN_APR_DRG and FROZEN_MS_DRG
  # manually checked and there aren't any cases with these fields present so
  # shouldn't remove any cases from previous filters
  # each column must be either NA or not "999", and both checks must pass
  filter(
    is.na(FROZEN_MS_DRG) | FROZEN_MS_DRG != "999",
    is.na(FROZEN_APR_DRG) | FROZEN_APR_DRG != "999"
  )

deliveries_uncmp |> nrow()
deliveries_uncmp |> slice_sample(n = 10)
```
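The `is.na()` guards matter because `filter()` drops rows where the condition is `NA`, and any comparison against a missing DRG is `NA`. Here is a small sketch with made-up DRG values (the `"807"` and `"560"` codes are placeholders for illustration):

```r
library(dplyr)

# Hypothetical DRG values, including one fully missing row
toy_drg <- tibble(
  RECORD_ID = c("a", "b", "c", "d"),
  FROZEN_MS_DRG = c("807", "999", NA, "807"),
  FROZEN_APR_DRG = c("560", "560", NA, "999")
)

# Without a guard, the NA row disappears along with the 999 row
toy_no_guard <- toy_drg |>
  filter(FROZEN_MS_DRG != "999")

# Guard each column on its own; separate filter() arguments are combined with AND
toy_guarded <- toy_drg |>
  filter(
    is.na(FROZEN_MS_DRG) | FROZEN_MS_DRG != "999",
    is.na(FROZEN_APR_DRG) | FROZEN_APR_DRG != "999"
  )

toy_no_guard$RECORD_ID
toy_guarded$RECORD_ID
```

In this sketch the unguarded filter loses record "c" (all DRG fields missing), while the guarded version keeps it and removes both records that carry a 999 in either column.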

This completes our "denominator" set of data which is saved in `deliveries_uncmp`.

### Numerator

Since the numerator only counts Cesarean deliveries that meet the inclusion and exclusion rules for the denominator, we can work off the denominator set.

We will determine the number of Cesarean deliveries using flags to find which deliveries are identified as a Cesarean delivery in the `PRCSECP` list but without any listed procedures for a hysterotomy in `PRCSE2P`.

#### Flag PRCSECP

Instead of filtering like we did for the denominator, we will just flag the varying procedures since we don't want to lose that data while plotting.

First, flag for Cesarean deliveries.

```{r}
#| label: flag-PRCSECP
#| message: false
#| warning: false

deliveries_uncmp_csec <- deliveries_uncmp |>
  mutate(
    PRCSECP = case_when(
      if_any(
        matches("_SURG_PROC") & -contains("DAY"),
      ~ (.x %in% prcsecp_list)
      ) ~ TRUE,
      .default = FALSE
    )
  )

deliveries_uncmp_csec |> count(PRCSECP)
deliveries_uncmp_csec |> slice_sample(n = 10)
```

#### Flag PRCSE2P

Now, flag hysterotomies. This follows the same approach as the filtering we did for delivery diagnosis codes in the `DELOCMD` list, as seen in the [Stored Lists notebook](00-stored-lists.qmd).

```{r}
#| label: flag-PRCSE2P
#| message: false
#| warning: false

deliveries_uncmp_csec_hyst <- deliveries_uncmp_csec |>
  mutate(
    PRCSE2P = case_when(
      if_any(
        matches("_SURG_PROC") & -contains("DAY"),
      ~ (.x %in% prcse2p_list)
      ) ~ TRUE,
      .default = FALSE
    )
  )

deliveries_uncmp_csec_hyst |> count(PRCSE2P)
deliveries_uncmp_csec_hyst |> filter(PRCSE2P == TRUE)
```

#### Create the Cesarean indicator column

Create the `CSEC` column and set it to TRUE only if `PRCSECP` is TRUE and `PRCSE2P` is FALSE. The count should match the number of cases marked TRUE in the `PRCSECP` column, minus any that also have a hysterotomy flag.

>>> EXPORT CHUNK

```{r}
#| label: csec-column
#| message: false
#| warning: false

tx_deliveries_csec <- deliveries_uncmp_csec_hyst |>
  mutate(
    CSEC = case_when(
      (PRCSECP == T & PRCSE2P == F) ~ TRUE,
      .default = FALSE
    )
  )

tx_deliveries_csec |> count(CSEC)
```
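Putting the two flags and the combined indicator side by side on a toy table may help. Everything here is hypothetical (column names, the `"CSEC1"`, `"HYST1"` and `"VAG1"` codes, and both code lists). The sketch also uses `if_any()` directly, which already returns TRUE or FALSE, as an alternative to wrapping it in `case_when()`.

```r
library(dplyr)

# Hypothetical procedure columns plus a DAY column that should be ignored
toy_proc <- tibble(
  RECORD_ID = c("a", "b", "c"),
  PRINC_SURG_PROC_CODE = c("CSEC1", "CSEC1", "VAG1"),
  OTH_SURG_PROC_CODE_1 = c(NA, "HYST1", NA),
  PRINC_SURG_PROC_DAY = c("0", "0", "0")
)

# Placeholders for prcsecp_list and prcse2p_list
toy_csec_codes <- c("CSEC1")
toy_hyst_codes <- c("HYST1")

toy_flagged <- toy_proc |>
  mutate(
    # TRUE if any procedure column holds a Cesarean code
    PRCSECP = if_any(
      matches("_SURG_PROC") & -contains("DAY"),
      ~ .x %in% toy_csec_codes
    ),
    # TRUE if any procedure column holds a hysterotomy code
    PRCSE2P = if_any(
      matches("_SURG_PROC") & -contains("DAY"),
      ~ .x %in% toy_hyst_codes
    ),
    # Cesarean without hysterotomy
    CSEC = PRCSECP & !PRCSE2P
  )

toy_flagged |> select(RECORD_ID, PRCSECP, PRCSE2P, CSEC)
```

Record "b" has a Cesarean code but also a hysterotomy code, so it is flagged in `PRCSECP` and not in `CSEC`.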

This completes our "numerator" set of data. This is saved in `tx_deliveries_csec`.

### Medicaid Categorization

One additional thing we are looking at for each deliverable is Medicaid. We want to know if Medicaid as the first payment affects a maternal health outcome. We will add the categorization for Cesarean deliveries now.

```{r}
#| label: csec-mc-categorization
#| message: false
#| warning: false

tx_deliveries_csec_mc <- tx_deliveries_csec |>
  mutate(
      CSEC_MC_CATEGORY = case_when(
        (CSEC == T & MC == T) ~ "CSEC_MC",
        (CSEC == T & MC == F) ~ "CSEC_NONMC",
        (CSEC == F & MC == T) ~ "NONCSEC_MC",
        (CSEC == F & MC == F) ~ "NONCSEC_NONMC"
      )
  )

tx_deliveries_csec_mc |> count(CSEC_MC_CATEGORY)
```
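The same four-way categorization comes up again for the other measures in this notebook, so one option would be a small helper. This is only a sketch: `add_mc_category()` is a hypothetical function that does not exist anywhere in this project, and it assumes the data has a logical `MC` column like the one created earlier.

```r
library(dplyr)

# Hypothetical helper: builds a "<label>_MC_CATEGORY" column from a logical
# flag column and the logical MC column
add_mc_category <- function(data, flag, label) {
  data |>
    mutate(
      "{label}_MC_CATEGORY" := case_when(
        {{ flag }} & MC ~ paste0(label, "_MC"),
        {{ flag }} & !MC ~ paste0(label, "_NONMC"),
        !{{ flag }} & MC ~ paste0("NON", label, "_MC"),
        !{{ flag }} & !MC ~ paste0("NON", label, "_NONMC")
      )
    )
}

# Toy rows covering all four combinations
toy_flags <- tibble(
  CSEC = c(TRUE, TRUE, FALSE, FALSE),
  MC = c(TRUE, FALSE, TRUE, FALSE)
)

toy_categorized <- toy_flags |>
  add_mc_category(CSEC, "CSEC")

toy_categorized
```

The helper produces the same labels as the `case_when()` above, and a row with a missing flag or missing `MC` would still come out as `NA`.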

### Export

We need to export this data so we can use it in our [Cesarean analysis notebook](03-analysis-csec.qmd).

```{r}
#| label: csec-export
#| message: false
#| warning: false

tx_deliveries_csec_mc |>
  write_rds("../data-processed/cesarean.rds")
```


## Primary Cesarean deliveries

In the [Primary Cesarean Delivery Rate notebook](03-analysis-pcsec.qmd), we are comparing the Primary Cesarean delivery rate for Texas, Harris County and individual hospitals in Harris County.

We use the AHRQ definition of the Primary Cesarean delivery rate to determine what cases are Primary Cesarean deliveries. This is defined in [IQI 33 Primary Cesarean Delivery Rate, Uncomplicated](https://qualityindicators.ahrq.gov/Downloads/Modules/IQI/V2025/TechSpecs/IQI_33_Primary_Cesarean_Delivery_Rate_Uncomplicated.pdf) as:

"First-time Cesarean deliveries without a hysterotomy procedure per 1,000 deliveries. Excludes deliveries with complications (abnormal presentation, preterm delivery, fetal death, multiple gestation or breech presentation)."

### Numerator and Denominator

This indicator uses the following numerator and denominator to calculate the rate:

*Numerator*: Discharges, meeting the inclusion and exclusion rules for the denominator, with any listed ICD-10-PCS procedure code for Cesarean delivery (PRCSECP) and without any listed ICD-10-PCS procedure code for hysterotomy (PRCSE2P).

*Denominator*: All deliveries, identified by any listed ICD-10-CM diagnosis code for outcome of delivery (DELOCMD) but excluding:

* with any listed ICD-10-CM diagnosis code for abnormal presentation, preterm delivery, fetal death, multiple gestation, or breech presentation. (Appendix A: PRCSECD)
* with any listed ICD-10-CM diagnosis code for previous Cesarean delivery (PRVBACD)
* with a principal ICD-10-CM diagnosis code assigned to MDC 15 Newborns & Other Neonates with Conditions Originating in Perinatal Period (Appendix B: MDC15PRINDX)
* with an ungroupable DRG (DRG=999)
* with missing sex, age, quarter, year or principal diagnosis
* with missing MDC when user indicates that MDC is provided

### Denominator

#### Exclude PRVBACD

Most of the work for the numerator and denominator was completed in the Cesarean delivery categorization. The only difference between the Cesarean and Primary Cesarean definitions is that Primary Cesarean also excludes "any listed ICD-10-CM diagnosis code for previous Cesarean delivery (PRVBACD)."

Therefore, we will take the existing `deliveries_uncmp_csec_hyst` data frame and exclude any cases where a PRVBACD diagnosis code can be found in the diagnoses columns.

```{r}
#| label: exclude-prvbacd
#| message: false
#| warning: false

deliveries_uncmp_csec_hyst |> nrow()

tx_deliveries_uncmp_pcsec <- deliveries_uncmp_csec_hyst |>
  filter(
    if_all(
      matches("_DIAG") & -starts_with("POA"),
      ~ !(.x %in% prvbacd_list)
    )
  )

tx_deliveries_uncmp_pcsec |> nrow()
```

### Numerator

#### Create the Primary Cesarean indicator column

Similar to Cesarean deliveries, we need to indicate what is a Primary Cesarean delivery and what is not. We already have the ICD-10-PCS procedure codes for Cesarean delivery (PRCSECP) flagged and the procedure codes for hysterotomy (PRCSE2P) flagged.

We will create a column `PCSEC` that is set true only if PRCSECP is TRUE and PRCSE2P is FALSE.

```{r}
#| label: pcsec-column
#| message: false
#| warning: false

tx_deliveries_pcsec <- tx_deliveries_uncmp_pcsec |>
  mutate(
    PCSEC = case_when(
      (PRCSECP == T & PRCSE2P == F) ~ TRUE,
      .default = FALSE
    )
  )

tx_deliveries_pcsec |> count(PCSEC)
```

### Medicaid Categorization

Add Medicaid for Primary Cesarean deliveries.

```{r}
#| label: pcsec-mc-categorization
#| message: false
#| warning: false

tx_deliveries_pcsec_mc <- tx_deliveries_pcsec |>
  mutate(
     PCSEC_MC_CATEGORY = case_when(
        (PCSEC == T & MC == T) ~ "PCSEC_MC",
        (PCSEC == T & MC == F) ~ "PCSEC_NONMC",
        (PCSEC == F & MC == T) ~ "NONPCSEC_MC",
        (PCSEC == F & MC == F) ~ "NONPCSEC_NONMC"
    )
  )

tx_deliveries_pcsec_mc |> count(PCSEC_MC_CATEGORY)
```

### Export

This concludes our data set for Primary Cesarean deliveries. We will export this as an RDS file so that we can complete our analysis in the [Primary Cesarean Delivery Rate notebook](03-analysis-pcsec.qmd).

```{r}
#| label: export-pcsec
#| message: false
#| warning: false

tx_deliveries_pcsec_mc |> write_rds("../data-processed/pcsec.rds")
```

## Vaginal Birth After Cesarean (VBAC) Deliveries

In the [Vaginal Birth After Cesarean (VBAC) Delivery Rate notebook](03-analysis-vbac.qmd), we are comparing the VBAC delivery rate for Texas, Harris County and individual hospitals in Harris County.

We use the AHRQ definition of the VBAC delivery rate to determine what cases are VBAC deliveries. This is defined in [IQI 22 Vaginal Birth After Cesarean (VBAC) Delivery Rate, Uncomplicated](https://qualityindicators.ahrq.gov/Downloads/Modules/IQI/V2025/TechSpecs/IQI_22_Vaginal_Birth_After_Cesarean_(VBAC)_Delivery_Rate_Uncomplicated.pdf) as:

"Vaginal births per 1,000 deliveries by patients with previous Cesarean deliveries. Excludes deliveries with complications (abnormal presentation, preterm delivery, fetal death, multiple gestation, or breech presentation)."

### Numerator and Denominator

This indicator uses the following numerator and denominator to calculate the rate:

*Numerator*: Number of vaginal deliveries among discharges meeting the inclusion and exclusion rules for the denominator. Vaginal deliveries are identified by any listed ICD-10-PCS procedure code for vaginal delivery (VAGDELP).

*Denominator*: Discharges with an ICD-10-CM diagnosis code for birth delivery outcome (DELOCMD) AND with any listed ICD-10-CM diagnosis code for previous Cesarean delivery (PRVBACD). The denominator excludes any discharges:

* with any listed ICD-10-CM diagnosis code for abnormal presentation, preterm delivery, fetal death, multiple gestation, or breech presentation. (Appendix A: PRCSECD)
* with a principal ICD-10-CM diagnosis code assigned to MDC 15 Newborns & Other Neonates with Conditions Originating in Perinatal Period (Appendix B: MDC15PRINDX)
* with an ungroupable DRG (DRG=999)
* with missing sex, age, quarter, year or principal diagnosis
* with missing MDC when user indicates that MDC is provided.

### Denominator: PRVBACD

While the exclusions match many of those used in the other IQI indicators, there is one change to the denominator. It should still only consider deliveries identified by the DELOCMD list, but it should keep only those deliveries that also have a diagnosis code for a previous Cesarean delivery (PRVBACD).

We'll filter the cases in the `deliveries_uncmp` data frame, which already has all of the previous exclusions applied, so that only cases with previous Cesarean deliveries remain.

```{r}
#| label: VBAC-include-PRVBACD
#| message: false
#| warning: false

deliveries_uncmp |> nrow()

deliveries_uncmp_vbac <- deliveries_uncmp |>
  filter(
    # if any diagnosis column has a prvbacd code, then keep it.
    if_any(
      matches("_DIAG") & -starts_with("POA"),
      ~ (.x %in% prvbacd_list)
    )
  )

deliveries_uncmp_vbac |> nrow()
```
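This `if_any()` filter is the mirror image of the `if_all()` exclusion used for Primary Cesarean deliveries. A sketch on made-up rows (the column names and the `"PREVCS"` code are hypothetical placeholders) shows that the two filters split the same table into complementary pieces:

```r
library(dplyr)

# Hypothetical records; "PREVCS" stands in for a code on prvbacd_list
toy_prior <- tibble(
  RECORD_ID = c("a", "b", "c"),
  PRINC_DIAG_CODE = c("PREVCS", "DEL1", "DEL1"),
  OTH_DIAG_CODE_1 = c("OUT1", "PREVCS", NA)
)
toy_prior_codes <- c("PREVCS")

# VBAC-style denominator: keep rows where any diagnosis is on the list
toy_with_prior <- toy_prior |>
  filter(if_any(matches("_DIAG"), ~ .x %in% toy_prior_codes))

# Primary-Cesarean-style denominator: keep rows where no diagnosis is on the list
toy_without_prior <- toy_prior |>
  filter(if_all(matches("_DIAG"), ~ !(.x %in% toy_prior_codes)))

toy_with_prior$RECORD_ID
toy_without_prior$RECORD_ID
```

Since every row lands in exactly one of the two results, their row counts add up to the original count, which can serve as a quick sanity check.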

This concludes the denominator. This is stored in the `deliveries_uncmp_vbac` data frame.

### Numerator

#### Flag VAGDELP

For the numerator, we must flag all of the cases in the `deliveries_uncmp_vbac` data frame that were vaginal deliveries. Vaginal deliveries are identified by any listed ICD-10-PCS procedure code for vaginal delivery (VAGDELP).

```{r}
#| label: vbac-flag-vagdelp
#| message: false
#| warning: false

tx_deliveries_vbac <- deliveries_uncmp_vbac |>
  mutate(
    VBAC = case_when(
      if_any(
        matches("_SURG_PROC") & -contains("DAY"),
      ~ (.x %in% vagdelp_list)
      ) ~ TRUE,
      .default = FALSE
    )
  )

tx_deliveries_vbac |> count(VBAC)
```

This concludes the numerator for Vaginal Births After Cesarean.

### Medicaid Categorization

Add Medicaid Categorization for Vaginal Birth After Cesarean.

```{r}
#| label: vbac-mc-categorization
#| message: false
#| warning: false

tx_deliveries_vbac_mc <- tx_deliveries_vbac |>
  mutate(
     VBAC_MC_CATEGORY = case_when(
        (VBAC == T & MC == T) ~ "VBAC_MC",
        (VBAC == T & MC == F) ~ "VBAC_NONMC",
        (VBAC == F & MC == T) ~ "NONVBAC_MC",
        (VBAC == F & MC == F) ~ "NONVBAC_NONMC"
      )
  )

tx_deliveries_vbac_mc |> count(VBAC_MC_CATEGORY)
```

### Export

```{r}
#| label: vbac-export
#| message: false
#| warning: false

tx_deliveries_vbac_mc |> write_rds("../data-processed/vbac.rds")
```

## Vaginal delivery rate

In the [Vaginal Delivery Rate Analysis notebook](03-analysis-vaginal.qmd), we will not be using a specific indicator provided by AHRQ, but we will be using the lists that they provided in the [Vaginal Births After Cesarean](https://qualityindicators.ahrq.gov/Downloads/Modules/IQI/V2025/TechSpecs/IQI_22_Vaginal_Birth_After_Cesarean_(VBAC)_Delivery_Rate_Uncomplicated.pdf) indicator. Specifically, we will be using the `VAGDELP` list to identify vaginal deliveries.

This is the definition we will use for vaginal deliveries:

"Vaginal births identified by `VAGDELP` out of deliveries identified by `DELOCMD`. Excludes deliveries with complications (abnormal presentation, preterm delivery, fetal death, multiple gestation, or breech presentation)."

### Flag VAGDELP

The main difference between this data set and the one created for Vaginal Births After Cesarean is that we now include all uncomplicated deliveries, previously saved in `deliveries_uncmp`, and just flag the vaginal deliveries identified by the `VAGDELP` list.

```{r}
#| label: vagdelp
#| message: false
#| warning: false

tx_deliveries_vaginal <- deliveries_uncmp |>
  mutate(
    VAGDEL = case_when(
      if_any(
        matches("_SURG_PROC") & -contains("DAY"),
        ~ (.x %in% vagdelp_list)
        ) ~ TRUE,
        .default = FALSE
      )
  )

tx_deliveries_vaginal |> count(VAGDEL)
```

### Medicaid Categorization

Add Medicaid categorization for vaginal delivery cases.

```{r}
#| label: vaginal-mc-categorization
#| message: false
#| warning: false

tx_deliveries_vaginal_mc <- tx_deliveries_vaginal |>
  mutate(
     VAGDEL_MC_CATEGORY = case_when(
        (VAGDEL == T & MC == T) ~ "VAGDEL_MC",
        (VAGDEL == T & MC == F) ~ "VAGDEL_NONMC",
        (VAGDEL == F & MC == T) ~ "NONVAGDEL_MC",
        (VAGDEL == F & MC == F) ~ "NONVAGDEL_NONMC"
      )
  )

tx_deliveries_vaginal_mc |> count(VAGDEL_MC_CATEGORY)
```

### Export

Export this data for use in the [Vaginal Delivery Rate Analysis notebook](03-analysis-vaginal.qmd).

```{r}
#| label: export-vaginal
#| message: false
#| warning: false

tx_deliveries_vaginal_mc |> write_rds("../data-processed/vagdel.rds")
```

## Severe Maternal Morbidity

In the [Severe Maternal Morbidity Rate notebook](03-analysis-smm.qmd), we will be using the AHRQ HCUP definition for identifying Severe Maternal Morbidity. The definition is provided at the [Healthcare Cost and Utilization Project (HCUP) Fast Stats](https://datatools.ahrq.gov/hcup-fast-stats/?tab=special-emphasis&dash=92#accordion-92) under "Clinical Coding Definitions."

Right now, the data available to me are maternal cases identified by the `DELOCMD` list in AHRQ's IQI indicators. The HCUP definition uses a slightly different denominator but also includes the `DELOCMD` list, also known as the Z37 series. Hopefully we can include the additional diagnosis, MS-DRG and procedure codes in a final version of the analysis.

Other than that, there are a ton of codes to identify SMM for a variety of conditions. We created lists of the codes outside of the "series" of codes present in the numerator list in the [Stored Lists notebook](00-stored-lists.qmd) called `smm_nonseries_list` and `smm_proc_list`.

We need to flag everything that can be considered an SMM which includes all of these series:

* A40 series: Streptococcal sepsis
* A41 series: Other sepsis
* D57 series: Sickle-cell disorders (except D571, D5720, D573, D5740, D5742, D5744, D5780)
* G45 series: Transient cerebral ischemic attacks and related syndromes
* G46 series: Vascular syndromes of brain in cerebrovascular diseases
* I21 series: Acute myocardial infarction
* I22 series: Subsequent ST elevation (STEMI) and non-ST elevation (NSTEMI) myocardial infarction
* I26 series: Pulmonary embolism (excluding I2603, I2695)
* I46 series: Cardiac arrest
* I50 series: Heart failure (excluding those related to chronic heart failure such as I5022, I5032, I5042, I50812)
* I60-I68 series: Cerebrovascular diseases (excluding I6302 and I6303)
* I71 series: Aortic aneurysm and dissection
* J96 series: Respiratory failure, not elsewhere classified (except J961, chronic respiratory failure)
* N17 series: Acute kidney failure
* O15 series: Eclampsia
* O291 series: Cardiac complications of anesthesia during pregnancy
* O292 series: Central nervous system complications of anesthesia during pregnancy
* O450 series: Premature separation of placenta with coagulation defect (except codes related to first trimester like O45001, O45011, O45021, O45091)
* O460 series: Antepartum hemorrhage with coagulation defect (except codes related to the first trimester like O46001, O46011, O46021, O46091)
* O88 series: Obstetric embolism (except codes related to the first trimester like O88011, O88111, O88211, O88311, O88811)
* R57 series: Shock, not elsewhere classified
* T811 series: Postprocedural shock (only include codes indicating initial encounter, like T8110XA, T8111XA, T8112XA, T8119XA)

and then the `smm_nonseries_list` and `smm_proc_list`.

> want to do this all in a case_when mutating a column to signify whether a case is SMM or not
> need to look at regex expressions, specifying that the code starts with a specific combination of characters

We'll do this within a big `case_when` statement to specify which cases are SMM and which are not.

### Combine lists for diagnostic codes

SMM flagging can be divided into flagging the diagnostic codes and flagging the procedural codes. While we have a nice list of procedural codes in `smm_proc_list`, we had to treat the diagnostic codes differently depending on whether they were series or non-series codes. Series codes got some added regex syntax in order to grab every code that falls under a series and to deal with the exclusions present in the clinical coding definition. Non-series codes (or series with only a couple of included codes) could be pulled directly. We made a couple of these lists for testing purposes only, until we landed on a method that we were happy with.

```{r}
#| label: combine-diagnosis-lists
#| message: false
#| warning: false

# collapse = "|" joins the elements within each list with the regex OR operator
# sep = "|" joins the collapsed series and non-series strings with the regex OR operator

smm_series_collapsed <- str_c(smm_series_list, collapse = "|")
smm_nonseries_collapsed <- str_c(smm_nonseries_list, collapse = "|")

# need this list for the alternative method
smm_series_nonexclusions <- str_c(smm_nonexclusion_list, collapse = "|")

smm_diag_list <- str_c(smm_series_collapsed, smm_nonseries_collapsed, sep = "|")

smm_diag_list
smm_series_nonexclusions
smm_series_collapsed
smm_nonseries_collapsed
```
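To see what `collapse` and `sep` each contribute, here is a small sketch with made-up patterns. The `toy_series` and `toy_nonseries` vectors are hypothetical and much shorter than the stored lists.

```r
library(stringr)

# Hypothetical patterns: two "series" prefixes and two exact codes
toy_series <- c("^A40", "^A41")
toy_nonseries <- c("^D65$", "^R578$")

# collapse turns each vector into a single string
toy_series_collapsed <- str_c(toy_series, collapse = "|")
toy_nonseries_collapsed <- str_c(toy_nonseries, collapse = "|")

# sep joins the two collapsed strings into one pattern
toy_pattern <- str_c(toy_series_collapsed, toy_nonseries_collapsed, sep = "|")
toy_pattern

# the combined pattern matches a series code and an exact code, but not an unrelated one
toy_matches <- str_detect(c("A4001", "D65", "O80"), toy_pattern)
toy_matches
```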

### Checking lookahead regular expression

In order to verify that we are collecting the cases we need with regard to exclusions, we will test what the negative lookahead expression does using the D57 cases.

```{r}
#| label: d57-test
#| message: false
#| warning: false

# grabbing all d57 cases
d57_cases <- hcup_deliveries |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(D57)"))
    )
  )

d57_cases

# each time we look at the excluded cases, we will grab some example records so that we can
# see whether the negative lookahead case works to exclude the case in our filter.
# looking at all of the cases that are D571, one of the excluded cases
d57_cases |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("D571"))
    )
  )

# looking at all of the cases that are D5720, another excluded case
d57_cases |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("D5720"))
    )
  )

# looking at all of the cases that are D573, third excluded case
d57_cases |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("D573"))
    )
  )

# Checking the found cases of D571 against cases filtered with the negative lookahead expression
# also making sure that D5720 case stays
d57_cases |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(D57(?!1))"))
    )
  ) |>
  # if we filter again for D571 cases, we see that there are 29 cases but those cases
  # also include other D57 codes that are also excluded.
  # In the final version of the regex code, we exclude EVERYTHING that should be excluded at the same time
  # This should not present a problem in the final code.
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
      ~ (.x |> str_detect("D571"))
    )
  ) |> filter(
    RECORD_ID == "120230747212" | RECORD_ID == "120231704082"
  )

# tests with the exclusions to make sure that all three are excluded with the OR operator
d57_cases |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(D57(?!1|20|3))"))
    )
  ) |>
  filter(
    RECORD_ID == "120230747212" | RECORD_ID == "120231704082" | RECORD_ID == "120230084349"
  )

# sanity check to make sure it was in there.
d57_cases |>
  filter(
    RECORD_ID == "120230747212" | RECORD_ID == "120231704082" | RECORD_ID == "120230084349"
  )
```
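The same lookahead can also be checked on individual code strings, without any discharge records. This sketch uses a hand-written vector of codes (`toy_codes` is made up for illustration) and the three-exclusion pattern from the chunk above:

```r
library(stringr)

# Three excluded codes, two D57 codes that should stay, and one unrelated code
toy_codes <- c("D571", "D5720", "D573", "D5700", "D57219", "A4101")

# D57 followed by anything except 1, 20 or 3
d57_pattern <- "^(D57(?!1|20|3))"

toy_hits <- str_detect(toy_codes, d57_pattern)

# name the result so each code sits next to its TRUE/FALSE
setNames(toy_hits, toy_codes)
```

Only `D5700` and `D57219` match. Keep in mind this works per code: a record that has both an excluded and an included D57 code in different diagnosis columns will still be picked up by `if_any()`.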

### Checking series without exclusions

We also use a regex expression to flag series without exclusions. We'll do the same kind of tests to verify that it does find cases with series included. We'll test with the first three series without exclusions: A40, A41 and G45.

```{r}
#| label: checking-series-without-exclusions
#| message: false
#| warning: false

# first grab all of the cases for A40, A41 and G45
# I will grab them individually and then combine them for comparison because I first want to see how many cases there are for each
# 93 a40 cases
a40_cases <- hcup_deliveries |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
      # this pattern detects strings that start with A40. We know this to work naturally
      # we just want to test that the combined regex expression works.
        ~ (.x |> str_detect("^(A40)"))
    )
  )

# 2603 a41 cases
a41_cases <- hcup_deliveries |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
      # this pattern detects strings that start with A41. We know this to work naturally
      # we just want to test that the combined regex expression works.
        ~ (.x |> str_detect("^(A41)"))
    )
  )

# 47 g45 cases
g45_cases <- hcup_deliveries |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
      # This pattern detects strings that start with G45. We know a single-series pattern works;
      # we just want to test that the combined regex works.
      ~ (.x |> str_detect("^(G45)"))
    )
  )

nonexclusion <- a40_cases |>
  bind_rows(a41_cases) |>
  bind_rows(g45_cases) |>
  # remove duplicate rows that might include multiple of these codes in one row
  unique()

# sanity check: the A40-only pattern should not return records that carry only A41 or G45 codes
nonexclusion |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
      ~ (.x |> str_detect("^(A40)"))
    )
  ) |>
  filter(
    RECORD_ID == "120230086454" | RECORD_ID == "120230661816"
  )

# includes our A41 case
nonexclusion |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
      ~ (.x |> str_detect("^(A40)|^(A41)"))
    )
  ) |>
  filter(
    RECORD_ID == "120230086454" | RECORD_ID == "120230661816"
  )

# should include both cases for a41 and g45
nonexclusion |>
  filter(
    if_any(
      matches("_DIAG") & -starts_with("POA"),
      ~ (.x |> str_detect("^(A40)|^(A41)|^(G45)"))
    )
  ) |>
  filter(
    RECORD_ID == "120230086454" | RECORD_ID == "120230661816"
  )

g45_cases
```
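
The same check can be reproduced on a small hand-made table. This is a standalone sketch under stated assumptions: the column names `PRINC_DIAG_CODE`, `OTH_DIAG_CODE_1` and `POA_PRINC_DIAG_CODE` and every value in `toy` are hypothetical stand-ins chosen to fit the `matches("_DIAG")` selection, not the real data.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy records; column names are illustrative stand-ins
toy <- tibble(
  RECORD_ID = c("1", "2", "3", "4"),
  PRINC_DIAG_CODE = c("A409", "O80", "O80", "O80"),
  OTH_DIAG_CODE_1 = c(NA, "A4189", "G459", "Z370"),
  POA_PRINC_DIAG_CODE = c("Y", "Y", "Y", "Y")
)

# Alternation of the three series without exclusions
series_pattern <- "^(A40)|^(A41)|^(G45)"

flagged <- toy |>
  filter(
    if_any(
      # diagnosis columns only, skipping the present-on-admission flags
      matches("_DIAG") & !starts_with("POA"),
      ~ str_detect(.x, series_pattern)
    )
  )

flagged$RECORD_ID
```

Record 1 is kept even though its second diagnosis is missing, because `if_any()` only needs one column to match. Record 4 carries none of the three series and is dropped.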

### SMM flagging

Here is the list of the processing steps for SMM flagging:

1. Mark rows with matching procedure codes as TRUE first, because there are fewer of them and they are easier to understand.
2. Mark rows with series codes that have exclusions.
3. Mark the remaining series codes without exclusions.
4. Mark non-series codes.

> TODO: Manually go through the exclusion codes to double-check them.

```{r}
#| label: smm-flagging
#| message: false
#| warning: false

deliveries_smm <- hcup_deliveries |>
  mutate(
    SMM = case_when(
      if_any(
        matches("_SURG_PROC") & -contains("DAY"),
        ~ (.x %in% smm_proc_list)
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(D57(?!1|20|3|40|42|44|80))"))
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(I26(?!03|95))"))
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(I50(?!22|32|42|812))"))
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(I63(?!02|03))"))
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(J96(?!1))"))
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(O291(?!11|21|91))"))
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(O292(?!11|91))"))
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(O450(?!01|11|21|91))"))
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(O460(?!01|11|21|91))"))
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect("^(O88(?!011|111|211|311|811))"))
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x |> str_detect(smm_series_nonexclusions))
      ) ~ TRUE,
      if_any(
        matches("_DIAG") & -starts_with("POA"),
        ~ (.x %in% smm_nonseries_list)
      ) ~ TRUE,
      .default = FALSE
    )
  )

deliveries_smm |> count(SMM)
```
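
Because every series-with-exclusions branch above uses the same column selection, the patterns could in principle be joined into one alternation and tested together. The sketch below is one possible way to do that, not the approach the notebook takes; it copies three of the patterns from the chunk above, and `smm_series_exclusions` and the `codes` vector are hypothetical names and values introduced here for illustration.

```r
library(stringr)

# Three of the lookahead patterns from the flagging chunk
exclusion_patterns <- c(
  "^(D57(?!1|20|3|40|42|44|80))",
  "^(I26(?!03|95))",
  "^(J96(?!1))"
)

# Hypothetical name: one combined pattern joined with the OR operator
smm_series_exclusions <- str_c(exclusion_patterns, collapse = "|")

# Hypothetical toy codes for illustration only
codes <- c("I2699", "I2603", "J9600", "J9610", "D5740", "D5741", "O80")

combined_hits <- str_detect(codes, smm_series_exclusions)

data.frame(code = codes, flagged = combined_hits)
```

Keeping the patterns in a vector also makes the manual review of the exclusion codes easier, since each series sits on its own line.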

### Medicaid categorization

Add Medicaid categorizations for Severe Maternal Morbidity.

```{r}
#| label: smm-mc-categorization
#| message: false
#| warning: false

deliveries_smm_mc <- deliveries_smm |>
  mutate(
    SMM_MC_CATEGORY = case_when(
      (SMM == TRUE & MC == TRUE) ~ "SMM_MC",
      (SMM == TRUE & MC == FALSE) ~ "SMM_NONMC",
      (SMM == FALSE & MC == TRUE) ~ "NONSMM_MC",
      (SMM == FALSE & MC == FALSE) ~ "NONSMM_NONMC"
    )
  )
  
deliveries_smm_mc |> count(SMM_MC_CATEGORY)
```
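
The four-way categorization can be checked on a toy table. This standalone sketch assumes `MC` is a logical column created earlier in the notebook; the `toy` values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical toy flags; the last row has a missing Medicaid flag
toy <- tibble(
  SMM = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  MC = c(TRUE, FALSE, TRUE, FALSE, NA)
)

toy_categorized <- toy |>
  mutate(
    SMM_MC_CATEGORY = case_when(
      SMM & MC ~ "SMM_MC",
      SMM & !MC ~ "SMM_NONMC",
      !SMM & MC ~ "NONSMM_MC",
      !SMM & !MC ~ "NONSMM_NONMC"
    )
  )

toy_categorized |> count(SMM_MC_CATEGORY)
```

With no `.default`, a row whose `MC` is missing matches none of the four conditions and gets `NA` as its category, so an `NA` row in the count would point to missing Medicaid flags upstream.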

### Export

Now that we've flagged everything that can be considered SMM, we can export this data to be analyzed like our other conditions.

```{r}
#| label: export-smm
#| message: false
#| warning: false

deliveries_smm_mc |> write_rds("../data-processed/smm.rds")
```
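
A quick round trip can confirm that the flag columns keep their types after export. This is a standalone sketch that writes a hypothetical toy table to a temporary file rather than to the project's data folder.

```r
library(readr)
library(tibble)

# Hypothetical toy table standing in for the flagged deliveries
toy <- tibble(
  RECORD_ID = c("1", "2"),
  SMM = c(TRUE, FALSE),
  SMM_MC_CATEGORY = c("SMM_MC", "NONSMM_NONMC")
)

# Write to a temporary file and read it back
tmp <- tempfile(fileext = ".rds")
write_rds(toy, tmp)
reloaded <- read_rds(tmp)

# The logical flag and the character ID should come back unchanged
is.logical(reloaded$SMM)
is.character(reloaded$RECORD_ID)
```

RDS keeps R's column types, which is why the analysis notebooks can read the file without re-declaring them.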