14  Categorization

14.1 Overview

Sometimes we have data where a variable has too many unique values to group/sum properly. It could be because we have a more specific need, or because the data is “dirty” with misspellings and minor differences which mean the same thing.

I want to outline several ways to deal with this, with growing complexity (and flexibility). We’ll end with an example from the military surplus data.

All of these solutions will use mutate() because you are changing or creating data. But the methods I’m considering are:

  • if_else() that requires only one test. We’ll make it TRUE based on a certain circumstance, and FALSE if not. I’ll also show how you could use values other than T/F.
  • case_match() is where you are looking inside a single column and changing one value to another, but we can do this with any number of values. They just all have to be 1-to-1. If this, then that.
  • case_when() allows you to perform a series of tests on multiple variables, setting new values based on the results.

I’ll use the Starwars characters dataset for my examples.

14.2 Setup

We need our basic libraries:

library(tidyverse)
library(janitor)

The starwars character data is included with Tidyverse, but I’m going to use a simplified version without the series data about movies, etc.

starwars <- starwars |> select(name:species)

starwars

14.3 if_else(): Creating a flag variable

I actually have a good example of this in the Denied Cleaning chapter when we make the audit benchmark column, but I’ll include a JedR version here.

Within our starwars data, we have a variable called species that has a number of values.

starwars |> count(species)

Most of the characters are “Human”, but let’s say I want to do a series of analysis based on whether a species is “Human” vs all the other options. I can create a flag variable (True/False) based on whether that. So, I want TRUE if “Human” and FALSE if not.

The if_else() function is perfect for this.

sw_human <- starwars |> 
1  mutate(
2    human = if_else(species == "Human", TRUE, FALSE),
    .after = species
  )

# selecting specific variables so we can see them easily
sw_human |> select(name, species, human)
1
I’m using mutate() to creat the new variable.
2
if_else() takes three arguments. The first is the test, the second is the value to insert if it is true (and i’m using an actual TRUE value here) and the third is to insert if it is false (and I’m using FALSE)

So now I can easily count how many characters are Human vs Not Human.

sw_human |> 
  count(human, name = "cnt_human")

I don’t have to use real TRUE and FALSE values here. I can insert anything.

starwars |> 
  mutate(
    human_text = if_else(species == "Human", "Human", "Not Human")
  ) |> 
  select(name, species, human_text)

14.4 case_match(): Recategorize values

But what if the test isn’t so simple … that it isn’t either one or the other.

Using case_match() we can make 1v1 switches for some values within a variable, and then choose what to do with the rest of them en masse.

One thing about case_match() … we are only affecting values in a single variable. It’s good for cleaning those, but not very flexible beyond that because we can’t consider other variable in our tests. It’s just “change this into that.”

Let’s say I want to update “Yoda’s species” to “Yoda”, keep “Human” as such and then make everything else “Other”.

sw_species_simple <- starwars |> 
  mutate(
1    new_species = case_match(
2      species,
3      "Yoda's species" ~ "Yoda",
4      "Human" ~ "Human",
5      .default = "Other"
    )
  )

# selecting specific variables to see the results
sw_species_simple |> select(name, species, new_species)
1
We set the name of the new variable first, then set the value to the result of the case_match() function. I could replace the same variable, but then I wouldn’t be able to inspect the changes.
2
case_match()works on the values from single variable, so you have to define which column you are working with. We are using the species variable here.
3
Here we change the “Yoda’s species” value to just “Yoda”.
4
Here we set “Human” as itself so we can preserve it. Otherwise it would also be changed to “Other”
5
Here we set what all the other values we have not specified should be changed to. We set them to “Other.”

Using this method, those species values that were NA are also changed to “Other” since they didn’t fit the other two roles.

If we count on the new variable, this is what we get.

sw_species_simple |> 
  count(new_species, name = "cnt_species_simple")

An option that can make this real useful is I can choose to only a couple of values in the variable and then leave the others as-is. I’m going to do this all at once to save time and just show the result.

Here I change “Yoda’s species” to just “Yoda” and all the NAs to “Other”, but leave the rest as it was, using the existing species values.

starwars |> 
  mutate(
    newer_species = case_match(
      species,
1      "Yoda's species" ~ "Yoda",
2      NA ~ "Other",
3      .default = species
    )
  ) |> 
4  slice(10:20) |>
5  select(name, species, newer_species)
1
We are changing “Yoda’s species” to “Yoda”.
2
We set all the NA values to “Other”.
3
We set the remain rows to be their original species values.
4
I’m using slice() here just to show you the row that includes “Yoda” so you can see it. It’s just a display thing.
5
Also a display thing: I’m selecting the relevant variables so we can see them.

14.5 case_when(): More power

If we need more logic, we can use case_when() to consider tests in any column to affect values in a single one.

This example is a bit contrived, but hopefully you can follow. We want Luke Skywalker and the Lars family to be classified as “Lars Farmers”, along with the famous droids who worked there.

We’ll start with this: Let’s create a new variable call lars_farm that defines the following:

  • Anyone Lars family and Luke Skywalker will be “Lars Farmers”.
  • We’ll also add R2D2 and C-3PO to the same “Lars Farmers” group.
  • Everyone else will get their original homeworld.
sw_lars <- starwars |> 
  mutate(
1    lars_farm = case_when(
2      str_detect(name, "Lars|Luke Skywalker") ~ "Lars Farmers",
3      name %in% c("R2-D2", "C-3PO") ~ "Lars Farmers",
4      .default = homeworld
    )
  )

# select relevant variables
sw_lars |> select(name, homeworld, lars_farm)
1
One difference between case_match() and case_when() is we can consider any test from any column. Here we create the new variable lars_farm to and set it to use the results from the case_when() function.
2
Here we use use str_detect() to look for “Luke Skywalker” and any name that includes “Lars”. We set those to “Lars Farmers”.
3
Here we add the two droids by specifically looking for those names and also set it to “Lars Farmers”.
4
We set everyone else to use their homeworld value.

14.6 Military surplus: control_type

In the Military Surplus Cleaning chapter we created a control_type variable, creating a category to designate if our items have to be returned to the Department of Defense for disposal. Refer back to that chapter about why were need to do this.

I skipped the detailed explanation of the code there because it is a more advanced tactic and would mire the flow of the lesson. In this appendix, I go into the details for those who might be interested.

14.6.1 Catching up with the data

I’ll start here with the Military Surplus data after the shipment totals have been calculated.

leso_total <- read_rds("data-processed/rwdir-leso-total.rds")

leso_total |> head()
leso_total |> glimpse()
Rows: 85,337
Columns: 12
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "1005-01-587-7175", "1240-01-411-1265", "2540-01-565…
$ item_name         <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "BALLISTIC BLANKET KI…
$ quantity          <dbl> 10, 9, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui                <chr> "Each", "Each", "Kit", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 342.00, 18906.60, 658000.00, 10000.00, 6262…
$ demil_code        <chr> "D", "D", "D", "C", "Q", "C", "C", "D", "D", "D", "D…
$ demil_ic          <dbl> 1, 0, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3…
$ ship_date         <dttm> 2016-09-19, 2016-09-14, 2018-01-30, 2016-11-09, 201…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…
$ total_value       <dbl> 16260.00, 3078.00, 189066.00, 658000.00, 10000.00, 6…

14.6.2 Categorization logic with case_when()

We will use the mutate() function to create a new column called control_type. We’ve used mutate before, but this time we will fill in values in the new column based on other data inside each row. case_when() allows us to create a test (or a number of tests) and then mark the new value based on the answer. Once new data has been written the function evaluates the next row, so we write the most specific rules first.

I usually approach this by thinking of the logic first, then writing some code, then testing it. Sometimes my logic is faulty and I have to try again, which is why we test the results. Know this could go on for many cycles.

Here is the basic logic:

  • We want to create a new column to denote if the item is controlled. In that column we want it to be TRUE when an item is controlled, and FALSE when it is not.
  • We know that items with “AIRPLANE” are always controlled, no matter their demil designations.
  • Otherwise we know that items that have a demil_code of “A”, OR a demil_code of “Q” AND a demil_id of “6”, are non-controlled.
  • Everything else is controlled.

I’ve noted this logic in a specific order for a reason: It’s the order that we write the logic in the code based on how the function case_when() works. This process is powerful and can get complicated depending on the logic needed. This example is perhaps more complicated than I like to use when explaining this concept, but this is real data and we need this, so here we go.

Here is the code and annotations.

1leso_control <- leso_total |>
2  mutate(
    control_type = case_when(
3      str_detect(item_name, "AIRPLANE") ~ TRUE,
4      (demil_code == "A" | (demil_code == "Q" & demil_ic == 6)) ~ FALSE,
5      .default = TRUE
    )
  )

6leso_control |> glimpse()
1
Our first line creates a new tibble leso_control and fills it with the result of the rest of our expression. We start with the leso_total tibble.
2
We mutate the data and start with the name of new column: control_type. We are filling that column with the result of the case_when() function for each row. Within the case_when() we are making the determination if the item is controlled or not. The left side of the ~ is the test, and the right side of ~ is what we enter into the column if the test passes. But we have more than one test:
3
The first test is we use the str_detect() function to look inside the item_name field looking for the term “AIRPLANE”. If the test finds the term, then the control_type field gets a value of TRUE and we move to the next row. If not, it moves to the next rule to see if that is a match. (We could fill this column with any text or number we want, but we are using TRUE and FALSE because that is the most basic kind of data to keep. If the item is controlled, set the value is TRUE. If not, it should be set to FALSE.)
4
Our second rule has two complex tests and we want to mark the row FALSE if either are true. (Remember, this is based on what the DLA told me: items with A or Q6 are non-controlled.) Our case_when() logic first looks for the value “A” in the demil_code field. If it is yes, then it marks the row FALSE. If no it goes to the next part: Is there a “Q” in the demil_code field AND a “6” in the demil_ic field? Both “Q” and “6” have to be there to get marked as FALSE. If both fail, then we move to the next test.
5
The last test is our catch-all. If none of the other rules apply, then we set the default value as TRUE, which means it is controlled. So our default in the end is to mark everything TRUE if any of the other rules don’t mark it first.
6
Lastly we glimpse at the data just so we can see the column was created.
Rows: 85,337
Columns: 13
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "1005-01-587-7175", "1240-01-411-1265", "2540-01-565…
$ item_name         <chr> "MOUNT,RIFLE", "SIGHT,REFLEX", "BALLISTIC BLANKET KI…
$ quantity          <dbl> 10, 9, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ ui                <chr> "Each", "Each", "Kit", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 1626.00, 342.00, 18906.60, 658000.00, 10000.00, 6262…
$ demil_code        <chr> "D", "D", "D", "C", "Q", "C", "C", "D", "D", "D", "D…
$ demil_ic          <dbl> 1, 0, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3…
$ ship_date         <dttm> 2016-09-19, 2016-09-14, 2018-01-30, 2016-11-09, 201…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…
$ total_value       <dbl> 16260.00, 3078.00, 189066.00, 658000.00, 10000.00, 6…
$ control_type      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…

As I said, there was a lot of trial and error to figure that out, but I’ll show some tests here to show that we did what we were intending.

This shows airplanes are marked as controlled with TRUE.

# showing the results and some columns that determined them
leso_control |> 
  select(item_name, demil_code, demil_ic, control_type) |> 
  filter(str_detect(item_name, "AIRPLANE"))

This shows how many items are marked TRUE vs FALSE for each demil_code and demil_ic combination. I used it to check that most A records were FALSE, along with Q6.

leso_control |> 
1  count(demil_code, demil_ic, control_type, name = "cnt") |>
2  pivot_wider(names_from = control_type, values_from = cnt)
1
This is the count shortcut using three variables. I name counted column cnt. It’s hard to visualize that result here but it counts how many times each unique combination of demil_code, demil_ic and control_type.
2
Here I use pivot_wider() to show the TRUE and FALSE counts on the same row. Just makes it easier to see.

We’re done with this extended explanation.