14  Using case_when

In the Military Surplus Cleaning chapter we created a control_type variable, creating a category to designate if our items have to be returned to the Department of Defense for disposal. Refer back to that chapter about why were need to do this.

I skipped the detailed explanation of the code there because it is a more advanced tactic and would mire the flow of the lesson. In this appendix, I go into the details for those who might be interested.

You can also learn more about case_when() and other categorization techniques in the Jedr Categorization Tutorial.

14.1 Catching up with the data

I’ll start here with the Military Surplus data after the shipment totals have been calculated.

Rows: 99,052
Columns: 12
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "1005-01-587-7175", "1240-01-411-1265", "2320-01-371…
$ item_name         <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "TRUCK,UTILITY", "BAL…
$ quantity          <dbl> 10, 9, 1, 10, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui                <chr> "Each", "Each", "Each", "Kit", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 371.00, 62627.00, 17264.71, 62627.00, 65800…
$ demil_code        <chr> "D", "D", "C", "D", "C", "C", "Q", "D", "C", "D", "D…
$ demil_ic          <dbl> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ship_date         <dttm> 2016-09-19, 2016-09-14, 2016-09-29, 2018-01-30, 201…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…
$ total_value       <dbl> 16260.0, 3339.0, 62627.0, 172647.1, 62627.0, 658000.…

14.1.1 Categorization logic with case_when()

We will use the mutate() function to create a new column called control_type. We’ve used mutate before, but this time we will fill in values in the new column based on other data inside each row. case_when() allows us to create a test (or a number of tests) and then mark the new value based on the answer. Once new data has been written the function evaluates the next row, so we write the most specific rules first.

I usually approach this by thinking of the logic first, then writing some code, then testing it. Sometimes my logic is faulty and I have to try again, which is why we test the results. Know this could go on for many cycles.

Here is the basic logic:

  • We want to create a new column to denote if the item is controlled. In that column we want it to be TRUE when an item is controlled, and FALSE when it is not.
  • We know that items with “AIRPLANE” are always controlled, no matter their demil designations.
  • Otherwise we know that items that have a demil_code of “A”, OR a demil_code of “Q” AND a demil_id of “6”, are non-controlled.
  • Everything else is controlled.

I’ve noted this logic in a specific order for a reason: It’s the order that we write the logic in the code based on how the function case_when() works. This process is powerful and can get complicated depending on the logic needed. This example is perhaps more complicated than I like to use when explaining this concept, but this is real data and we need this, so here we go.

Here is the code and annotations.

1leso_control <- leso_total |>
2  mutate(
    control_type = case_when(
3      str_detect(item_name, "AIRPLANE") ~ TRUE,
4      (demil_code == "A" | (demil_code == "Q" & demil_ic == 6)) ~ FALSE,
5      TRUE ~ TRUE
    )
  )

6leso_control |> glimpse()
1
Our first line creates a new tibble leso_control and fills it with the result of the rest of our expression. We start with the leso_total tibble.
2
We mutate the data and start with the name of new column: control_type. We are filling that column with the result of the case_when() function for each row. Within the case_when() we are making the determination if the item is controlled or not. The left side of the ~ is the test, and the right side of ~ is what we enter into the column if the test passes. But we have more than one test:
3
The first test is we use the str_detect() function to look inside the item_name field looking for the term “AIRPLANE”. If the test finds the term, then the control_type field gets a value of TRUE and we move to the next row. If not, it moves to the next rule to see if that is a match. (We could fill this column with any text or number we want, but we are using TRUE and FALSE because that is the most basic kind of data to keep. If the item is controlled, set the value is TRUE. If not, it should be set to FALSE.)
4
Our second rule has two complex tests and we want to mark the row FALSE if either are true. (Remember, this is based on what the DLA told me: items with A or Q6 are non-controlled.) Our case_when() logic first looks for the value “A” in the demil_code field. If it is yes, then it marks the row FALSE. If no it goes to the next part: Is there a “Q” in the demil_code field AND a “6” in the demil_ic field? Both “Q” and “6” have to be there to get marked as FALSE. If both fail, then we move to the next test.
5
The last test is our catch-all. If none of the other rules apply, then mark this row as TRUE, which means it is controlled. So our default in the end is to mark everything TRUE if any of the other rules don’t mark it first.
6
Lastly we glimpse at the data just so we can see the column was created.
Rows: 99,052
Columns: 13
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "1005-01-587-7175", "1240-01-411-1265", "2320-01-371…
$ item_name         <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "TRUCK,UTILITY", "BAL…
$ quantity          <dbl> 10, 9, 1, 10, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui                <chr> "Each", "Each", "Each", "Kit", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 371.00, 62627.00, 17264.71, 62627.00, 65800…
$ demil_code        <chr> "D", "D", "C", "D", "C", "C", "Q", "D", "C", "D", "D…
$ demil_ic          <dbl> 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ship_date         <dttm> 2016-09-19, 2016-09-14, 2016-09-29, 2018-01-30, 201…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…
$ total_value       <dbl> 16260.0, 3339.0, 62627.0, 172647.1, 62627.0, 658000.…
$ control_type      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…

As I said, there was a lot of trial and error to figure that out, but I’ll show some tests here to show that we did what we were intending.

This shows airplanes are marked as controlled with TRUE.

# showing the results and some columns that determined them
leso_control |> 
  select(item_name, demil_code, demil_ic, control_type) |> 
  filter(str_detect(item_name, "AIRPLANE"))

This shows how many items are marked TRUE vs FALSE for each demil_code and demil_ic combination. I used it to check that most A records were FALSE, along with Q6.

leso_control |> 
1  count(demil_code, demil_ic, control_type, name = "cnt") |>
2  pivot_wider(names_from = control_type, values_from = cnt)
1
This is the count shortcut using three variables. I name counted column cnt. It’s hard to visualize that result here but it counts how many times each unique combination of demil_code, demil_ic and control_type.
2
Here I use pivot_wider() to show the TRUE and FALSE counts on the same row. Just makes it easier to see.

We’re done with this extended explanation.