15  Categorization

15.1 Overview

Sometimes we have data where a variable has too many unique values to group/sum properly. It could be because we have a more specific need, or because the data is “dirty” with misspellings and minor differences which mean the same thing.

I want to outline several ways to deal with this, with growing complexity (and flexibility). I’ll use the Starwars characters dataset for my examples. We’ll end with an example from the military surplus data.

Some of these functions are new as of 2026 as Posit constantly enhances their tools for data science.

All of these solutions will use mutate() because you are changing or creating data. But the methods I’m considering are:

  • The if_else() function sets a value based on a test. We’ll make it TRUE based on a certain circumstance, and FALSE if not. I’ll also show how you could use values other than T/F.
  • Two methods to change or create a column based on matching values from within a specific column. We’ll use replace_values() when we are partially updating an existing column. We’ll use recode_values() when we need defaults other than the original value.
  • When we need more complex rules that match based on conditions, we’ll tap into replace_when() and case_when().

Some of these solutions are covered in a Tidyverse vignette Recoding columns and replacing values.

15.2 Setup

We need our basic libraries:

library(tidyverse)
library(janitor)

The starwars character data is included with Tidyverse, but I’m going to use a simplified version without the series data about movies, etc.

starwars <- starwars |> select(name:species)

starwars

15.3 if_else(): Creating a flag variable

We use if_else() when we want a set value for when a condition is true or false. You supply a test (like does x == 3) and then supply a value to use when that condition is TRUE, and then when it is FALSE.

I actually have a good example of this in the Denied Cleaning chapter when we make the audit benchmark column, but I’ll include a JedR version here.

Within our starwars data, we have a variable called species that has a number of values.

starwars |> count(species, sort = TRUE)

Most of the characters are “Human”, but let’s say I want to do a series of analysis based on whether a species is “Human” vs all the other options. I can create a flag variable (True/False) based on whether that. So, I want TRUE if “Human” and FALSE if not.

The if_else() function is perfect for this.

sw_human <- 
  starwars |> 
1  mutate(
2    human = if_else(species == "Human", TRUE, FALSE)
  )

# selecting specific variables so we can see them easily
sw_human |> select(name, species, human)
1
I’m using mutate() to create the new variable.
2
if_else() takes three arguments. The first is the test, the second is the value to insert if it is true (and I’m using an actual TRUE value here) and the third is to insert if it is false (and I’m using FALSE).

So now I can easily count how many characters are Human vs Not Human.

sw_human |> 
  count(human, name = "cnt_human")

I don’t have to use real TRUE and FALSE values here. I can insert anything.

starwars |> 
  mutate(
    human_text = if_else(species == "Human", "Human", "Not Human")
  ) |> 
  select(name, species, human_text)

15.4 Changes based on matching values

These are solutions for when we want to look at values within a single column, and then make decisions based on those values. (While you might create a new column so you can compare the replacement values, both the “old” and “new” columns will have the same datatypes.)

15.4.1 replace_values(): Change just some values

Use this when you just want to change a few values in an existing column, while leaving the rest alone.

In our example we want to change values in species in just two cases, but retain the rest.

  • We want to rename “Yoda’s species” to just “Yoda”.
  • If the species is NA we want to change it to “Unknown”.
  • The rest should just keep their original value

This example below will only change data when specified conditions are met based on one specific target variable (species in this case). Otherwise it will use the value already in species.

sw_species_replaceval <-
  starwars |> 
  mutate(
1    species_new = replace_values(
2      species,
3      "Yoda's species" ~ "Yoda",
4      NA ~ "Unknown"
    )
  )

# selecting specific variables to see the results
sw_species_replaceval |> select(name, species, species_new)
1
Here we are creating the new column species_new so we can compare it to the original. We will push into that variable the result of the replace_values() function.
2
Here we specify which variable we are peeking into for our matches, i.e., our target column.
3
We look for “Yoda’s species” on the left side of the ~. (This is sometimes referred to as LHS for left-hand-side). To the right of the ~ we set our new values. (Sometimes called RHS.)
4
Here we are setting our NA values (on the left) to “Unknown” (on the right).

No other values are changes so all the other rows are filled in with their original species values. If you wanted those to be something else you would use recode_values().

15.4.2 recode_values(): When you need a default

When we need just a little more flexibility, we can use recode_values() which allows us to define a default. Our goal here is a little different.

  • We want to rename “Yoda’s species” to just “Yoda”.
  • We want to keep “Human” as such, but …
  • We want all other species to be “Other”.
sw_species_recodeval <- 
  starwars |> 
  mutate(
1    species_new = recode_values(
2      species,
3      "Yoda's species" ~ "Yoda",
4      "Human" ~ "Human",
5      default = "Other"
    )
  )

sw_species_recodeval |> 
  select(name, species, species_new)
1
Here we set the new variable and set it to the result of recode_values().
2
The target column we are looking into for our values to match on.
3
Again we set “Yoda’s species” to just “Yoda”.
4
We set “Human” to still be “Human” because otherwise it would get caught up in the default.
5
Here we set everything else to be “Other”.

15.4.3 More on recode …

There are some more complex uses for recode_values() that we won’t detail, but you can read more in the Recoding columns and replacing values vignette.

  • You can recode values to a different datatype, like starting with ranking values like 1 to 5 and turning them into a text ranking of “Strongly Disagree” to “Strongly Agree” with appropriate terms between them.
  • You can create or use lookups table with the “from” and “to” values.

15.5 Changes based on conditions

When you want change or create a column based on conditions, we have some more powerful and complex functions available.

15.5.1 case_when(): More power

If we need more logic when creating or changing data, we can use case_when() to consider tests in any column to affect values in a single one.

This example is a bit contrived, but hopefully you can follow. We want Skywalker and the Lars family members to be classified as “Lars Farmers”, along with the famous droids who worked there.

Let’s create a new variable call lars_farm that defines the following:

  • Anyone with the name Lars or Skywalker will be “Lars Farmers”.
  • We’ll also add R2D2 and C-3PO to the same “Lars Farmers” group.
  • Everyone else will get their original homeworld.
sw_lars <-
  starwars |> 
  mutate(
1    lars_farm = case_when(
2      str_detect(name, "Lars|Skywalker") ~ "Lars Farmers",
3      name %in% c("R2-D2", "C-3PO") ~ "Lars Farmers",
4      .default = homeworld
    )
  )

# select relevant variables
sw_lars |> select(name, homeworld, lars_farm)
1
Here we create the new variable lars_farm to and set it to use the results from the case_when() function.
2
Here we use use str_detect() to look for “Skywalker” (to catch Luke, Anakin and Shmi) and any name that includes “Lars”. We set those to “Lars Farmers”.
3
Here we add the two “droids we are looking for” by building a vector of their names and looking for names there. We also set those to “Lars Farmers”.
4
We use “.default =” set everyone else to use their homeworld value.

You might notice that we used “.default” with a period here and only “default” without the period when we used replace_values() above. Yep, they are inconsistent and it sucks. From what I can tell, the not-dot is the newer standard but the authors didn’t want to change the existing usage in case_when(), which predates that decision.

15.5.2 replace_when(): Complex changes to existing data

This is one of the newer functions. It seems to be a more strict version of case_when() and I don’t see myself using it often, if at all.

If you are updating values within the same column, but need more complexity than replace_values() then you can use replace_when() with broader rules to follow. This allows you to create rules beyond matching values, but forces the result to be the same datatype as the original column, which case_when() does not. In other words, the right-hand values cannot change to a number since species started as text. There is no “default” argument as data keeps its original value if not specified to change.

sw_replwhen <- 
  starwars |> 
1  mutate(species_new = replace_when(
2    species,
    species == "Human" & hair_color == "blond" ~ "Blond Human"
  ))

sw_replwhen |> 
  select(name, hair_color, species, species_new)
1
We set species to be overwritten by the result of the replace_when() function.
2
We specify which

15.6 Military surplus: control_type

Here we return to case_when() to describe in detail how we created the control_type variable in the Military Surplus Cleaning chapter. I skipped the detailed explanation of the code there because it would mire the flow of the lesson. Here we go into the details for those who might be interested.

15.6.1 Catching up with the data

I’ll start here with the Military Surplus data after the shipment totals have been calculated.

leso_total <- read_rds("data-processed/rwdir-leso-total.rds")

leso_total |> head()
leso_total |> glimpse()
Rows: 79,761
Columns: 12
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "1240-01-411-1265", "2540-01-565-4700", "1005-01-587…
$ item_name         <chr> "SIGHT,REFLEX", "BALLISTIC BLANKET KIT", "MOUNT,RIFL…
$ quantity          <dbl> 9, 10, 10, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 6, 1, 1, 1,…
$ ui                <chr> "Each", "Kit", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 339.00, 16568.15, 1626.00, 10000.00, 62627.00, 65800…
$ demil_code        <chr> "Q", "D", "D", "Q", "C", "C", "C", "D", "Q", "Q", "D…
$ demil_ic          <dbl> 3, 1, 1, 3, 1, 1, 1, 1, 3, 3, 1, 1, 1, 3, 1, 1, 1, 3…
$ ship_date         <dttm> 2016-09-14, 2018-01-30, 2016-09-19, 2017-03-28, 201…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…
$ total_value       <dbl> 3051.00, 165681.50, 16260.00, 10000.00, 62627.00, 65…

15.6.2 Categorization logic with case_when()

We will use the mutate() function to create a new column called control_type. We’ve used mutate before, but this time we will fill in values in the new column based on other data inside each row. case_when() allows us to create a test (or a number of tests) and then mark the new value based on the answer. Once new data has been written the function evaluates the next row, so we write the most specific rules first.

I usually approach this by thinking of the logic first, then writing some code, then testing it. Sometimes my logic is faulty and I have to try again, which is why I always test the results. Know this could go on for many cycles.

Here is the basic logic:

  • We want to create a new column to denote if the item is controlled. In that column we want it to be TRUE when an item is controlled, and FALSE when it is not.
  • We know that items with “AIRPLANE” are always controlled, no matter their demil designations.
  • Otherwise we know that items that have a demil_code of “A”, OR a demil_code of “Q” AND a demil_id of “6”, are non-controlled.
  • Everything else is controlled.

I’ve noted this logic in a specific order for a reason: It’s the order that we write the logic in the code based on how the function case_when() works. This process is powerful and can get complicated depending on the logic needed.

Here is the code and annotations.

leso_control <-
1  leso_total |>
2  mutate(
    control_type = case_when(
3      str_detect(item_name, "AIRPLANE") ~ TRUE,
4      (demil_code == "A" | (demil_code == "Q" & demil_ic == 6)) ~ FALSE,
5      .default = TRUE
    )
  )

6leso_control |> glimpse()
1
Our first line creates a new tibble leso_control and fills it with the result of the rest of our expression. We start with the leso_total tibble.
2
We mutate the data and start with the name of new column: control_type. We are filling that column with the result of the case_when() function for each row. Within the case_when() we are making the determination if the item is controlled or not. The left side of the ~ is the test, and the right side of ~ is what we enter into the column if the test passes. But we have more than one test:
3
The first test is we use the str_detect() function to look inside the item_name field looking for the term “AIRPLANE”. If the test finds the term, then the control_type field gets a value of TRUE and we move to the next row. If not, it moves to the next rule to see if that is a match. (We could fill this column with any text or number we want, but we are using TRUE and FALSE because that is the most basic kind of data to keep. If the item is controlled, set the value is TRUE. If not, it should be set to FALSE.)
4
Our second rule has two complex tests and we want to mark the row FALSE if either are true. (Remember, this is based on what the DLA told me: items with A or Q6 are non-controlled.) Our case_when() logic first looks for the value “A” in the demil_code field. If it is yes, then it marks the row FALSE. If no it goes to the next part: Is there a “Q” in the demil_code field AND a “6” in the demil_ic field? Both “Q” and “6” have to be there to get marked as FALSE. If both fail, then we move to the next test.
5
The last test is our catch-all. If none of the other rules apply, then we set the default value as TRUE, which means it is controlled. So our default in the end is to mark everything TRUE if any of the other rules don’t mark it first.
6
Lastly we glimpse at the data just so we can see the column was created.
Rows: 79,761
Columns: 13
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "1240-01-411-1265", "2540-01-565-4700", "1005-01-587…
$ item_name         <chr> "SIGHT,REFLEX", "BALLISTIC BLANKET KIT", "MOUNT,RIFL…
$ quantity          <dbl> 9, 10, 10, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 6, 1, 1, 1,…
$ ui                <chr> "Each", "Kit", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 339.00, 16568.15, 1626.00, 10000.00, 62627.00, 65800…
$ demil_code        <chr> "Q", "D", "D", "Q", "C", "C", "C", "D", "Q", "Q", "D…
$ demil_ic          <dbl> 3, 1, 1, 3, 1, 1, 1, 1, 3, 3, 1, 1, 1, 3, 1, 1, 1, 3…
$ ship_date         <dttm> 2016-09-14, 2018-01-30, 2016-09-19, 2017-03-28, 201…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…
$ total_value       <dbl> 3051.00, 165681.50, 16260.00, 10000.00, 62627.00, 65…
$ control_type      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…

As I said, there was a lot of trial and error to figure that out, but I’ll show some tests here to show that we did what we were intending.

This shows airplanes are marked as controlled with TRUE.

# showing the results and some columns that determined them
leso_control |> 
  select(item_name, demil_code, demil_ic, control_type) |> 
  filter(str_detect(item_name, "AIRPLANE"))

This shows how many items are marked TRUE vs FALSE for each demil_code and demil_ic combination. I used it to check that most A records were FALSE, along with Q6.

leso_control |> 
1  count(demil_code, demil_ic, control_type, name = "cnt") |>
2  pivot_wider(names_from = control_type, values_from = cnt)
1
This is the count shortcut using three variables. I name counted column cnt. It’s hard to visualize that result here but it counts how many times each unique combination of demil_code, demil_ic and control_type.
2
Here I use pivot_wider() to show the TRUE and FALSE counts on the same row. Just makes it easier to see.

We’re done with this extended explanation.