library(tidyverse)
library(janitor)15 Categorization
15.1 Overview
Sometimes we have data where a variable has too many unique values to group/sum properly. It could be because we have a more specific need, or because the data is “dirty” with misspellings and minor differences which mean the same thing.
I want to outline several ways to deal with this, with growing complexity (and flexibility). I’ll use the Starwars characters dataset for my examples. We’ll end with an example from the military surplus data.
Some of these functions are new as of 2026 as Posit constantly enhances their tools for data science.
All of these solutions will use mutate() because you are changing or creating data. But the methods I’m considering are:
- The
if_else()function sets a value based on a test. We’ll make itTRUEbased on a certain circumstance, andFALSEif not. I’ll also show how you could use values other than T/F. - Two methods to change or create a column based on matching values from within a specific column. We’ll use
replace_values()when we are partially updating an existing column. We’ll userecode_values()when we need defaults other than the original value. - When we need more complex rules that match based on conditions, we’ll tap into
replace_when()andcase_when().
Some of these solutions are covered in a Tidyverse vignette Recoding columns and replacing values.
15.2 Setup
We need our basic libraries:
The starwars character data is included with Tidyverse, but I’m going to use a simplified version without the series data about movies, etc.
starwars <- starwars |> select(name:species)
starwars15.3 if_else(): Creating a flag variable
We use if_else() when we want a set value for when a condition is true or false. You supply a test (like does x == 3) and then supply a value to use when that condition is TRUE, and then when it is FALSE.
I actually have a good example of this in the Denied Cleaning chapter when we make the audit benchmark column, but I’ll include a JedR version here.
Within our starwars data, we have a variable called species that has a number of values.
starwars |> count(species, sort = TRUE)Most of the characters are “Human”, but let’s say I want to do a series of analysis based on whether a species is “Human” vs all the other options. I can create a flag variable (True/False) based on whether that. So, I want TRUE if “Human” and FALSE if not.
The if_else() function is perfect for this.
- 1
-
I’m using
mutate()to create the new variable. - 2
-
if_else()takes three arguments. The first is the test, the second is the value to insert if it is true (and I’m using an actualTRUEvalue here) and the third is to insert if it is false (and I’m usingFALSE).
So now I can easily count how many characters are Human vs Not Human.
sw_human |>
count(human, name = "cnt_human")I don’t have to use real TRUE and FALSE values here. I can insert anything.
starwars |>
mutate(
human_text = if_else(species == "Human", "Human", "Not Human")
) |>
select(name, species, human_text)15.4 Changes based on matching values
These are solutions for when we want to look at values within a single column, and then make decisions based on those values. (While you might create a new column so you can compare the replacement values, both the “old” and “new” columns will have the same datatypes.)
15.4.1 replace_values(): Change just some values
Use this when you just want to change a few values in an existing column, while leaving the rest alone.
In our example we want to change values in species in just two cases, but retain the rest.
- We want to rename “Yoda’s species” to just “Yoda”.
- If the species is
NAwe want to change it to “Unknown”. - The rest should just keep their original value
This example below will only change data when specified conditions are met based on one specific target variable (species in this case). Otherwise it will use the value already in species.
- 1
-
Here we are creating the new column
species_newso we can compare it to the original. We will push into that variable the result of thereplace_values()function. - 2
- Here we specify which variable we are peeking into for our matches, i.e., our target column.
- 3
-
We look for “Yoda’s species” on the left side of the
~. (This is sometimes referred to as LHS for left-hand-side). To the right of the~we set our new values. (Sometimes called RHS.) - 4
-
Here we are setting our
NAvalues (on the left) to “Unknown” (on the right).
No other values are changes so all the other rows are filled in with their original species values. If you wanted those to be something else you would use recode_values().
15.4.2 recode_values(): When you need a default
When we need just a little more flexibility, we can use recode_values() which allows us to define a default. Our goal here is a little different.
- We want to rename “Yoda’s species” to just “Yoda”.
- We want to keep “Human” as such, but …
- We want all other species to be “Other”.
- 1
-
Here we set the new variable and set it to the result of
recode_values(). - 2
- The target column we are looking into for our values to match on.
- 3
- Again we set “Yoda’s species” to just “Yoda”.
- 4
- We set “Human” to still be “Human” because otherwise it would get caught up in the default.
- 5
- Here we set everything else to be “Other”.
15.4.3 More on recode …
There are some more complex uses for recode_values() that we won’t detail, but you can read more in the Recoding columns and replacing values vignette.
- You can recode values to a different datatype, like starting with ranking values like
1to5and turning them into a text ranking of “Strongly Disagree” to “Strongly Agree” with appropriate terms between them. - You can create or use lookups table with the “from” and “to” values.
15.5 Changes based on conditions
When you want change or create a column based on conditions, we have some more powerful and complex functions available.
15.5.1 case_when(): More power
If we need more logic when creating or changing data, we can use case_when() to consider tests in any column to affect values in a single one.
This example is a bit contrived, but hopefully you can follow. We want Skywalker and the Lars family members to be classified as “Lars Farmers”, along with the famous droids who worked there.
Let’s create a new variable call lars_farm that defines the following:
- Anyone with the name Lars or Skywalker will be “Lars Farmers”.
- We’ll also add R2D2 and C-3PO to the same “Lars Farmers” group.
- Everyone else will get their original homeworld.
- 1
-
Here we create the new variable
lars_farmto and set it to use the results from thecase_when()function. - 2
-
Here we use use
str_detect()to look for “Skywalker” (to catch Luke, Anakin and Shmi) and any name that includes “Lars”. We set those to “Lars Farmers”. - 3
- Here we add the two “droids we are looking for” by building a vector of their names and looking for names there. We also set those to “Lars Farmers”.
- 4
-
We use “.default =” set everyone else to use their
homeworldvalue.
You might notice that we used “.default” with a period here and only “default” without the period when we used replace_values() above. Yep, they are inconsistent and it sucks. From what I can tell, the not-dot is the newer standard but the authors didn’t want to change the existing usage in case_when(), which predates that decision.
15.5.2 replace_when(): Complex changes to existing data
This is one of the newer functions. It seems to be a more strict version of case_when() and I don’t see myself using it often, if at all.
If you are updating values within the same column, but need more complexity than replace_values() then you can use replace_when() with broader rules to follow. This allows you to create rules beyond matching values, but forces the result to be the same datatype as the original column, which case_when() does not. In other words, the right-hand values cannot change to a number since species started as text. There is no “default” argument as data keeps its original value if not specified to change.
15.6 Military surplus: control_type
Here we return to case_when() to describe in detail how we created the control_type variable in the Military Surplus Cleaning chapter. I skipped the detailed explanation of the code there because it would mire the flow of the lesson. Here we go into the details for those who might be interested.
15.6.1 Catching up with the data
I’ll start here with the Military Surplus data after the shipment totals have been calculated.
leso_total <- read_rds("data-processed/rwdir-leso-total.rds")
leso_total |> head()leso_total |> glimpse()Rows: 79,761
Columns: 12
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "1240-01-411-1265", "2540-01-565-4700", "1005-01-587…
$ item_name <chr> "SIGHT,REFLEX", "BALLISTIC BLANKET KIT", "MOUNT,RIFL…
$ quantity <dbl> 9, 10, 10, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 6, 1, 1, 1,…
$ ui <chr> "Each", "Kit", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 339.00, 16568.15, 1626.00, 10000.00, 62627.00, 65800…
$ demil_code <chr> "Q", "D", "D", "Q", "C", "C", "C", "D", "Q", "Q", "D…
$ demil_ic <dbl> 3, 1, 1, 3, 1, 1, 1, 1, 3, 3, 1, 1, 1, 3, 1, 1, 1, 3…
$ ship_date <dttm> 2016-09-14, 2018-01-30, 2016-09-19, 2017-03-28, 201…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
$ total_value <dbl> 3051.00, 165681.50, 16260.00, 10000.00, 62627.00, 65…
15.6.2 Categorization logic with case_when()
We will use the mutate() function to create a new column called control_type. We’ve used mutate before, but this time we will fill in values in the new column based on other data inside each row. case_when() allows us to create a test (or a number of tests) and then mark the new value based on the answer. Once new data has been written the function evaluates the next row, so we write the most specific rules first.
I usually approach this by thinking of the logic first, then writing some code, then testing it. Sometimes my logic is faulty and I have to try again, which is why I always test the results. Know this could go on for many cycles.
Here is the basic logic:
- We want to create a new column to denote if the item is controlled. In that column we want it to be TRUE when an item is controlled, and FALSE when it is not.
- We know that items with “AIRPLANE” are always controlled, no matter their demil designations.
- Otherwise we know that items that have a
demil_codeof “A”, OR ademil_codeof “Q” AND ademil_idof “6”, are non-controlled. - Everything else is controlled.
I’ve noted this logic in a specific order for a reason: It’s the order that we write the logic in the code based on how the function case_when() works. This process is powerful and can get complicated depending on the logic needed.
Here is the code and annotations.
- 1
-
Our first line creates a new tibble
leso_controland fills it with the result of the rest of our expression. We start with theleso_totaltibble. - 2
-
We mutate the data and start with the name of new column:
control_type. We are filling that column with the result of thecase_when()function for each row. Within thecase_when()we are making the determination if the item is controlled or not. The left side of the~is the test, and the right side of~is what we enter into the column if the test passes. But we have more than one test: - 3
-
The first test is we use the
str_detect()function to look inside theitem_namefield looking for the term “AIRPLANE”. If the test finds the term, then thecontrol_typefield gets a value ofTRUEand we move to the next row. If not, it moves to the next rule to see if that is a match. (We could fill this column with any text or number we want, but we are usingTRUEandFALSEbecause that is the most basic kind of data to keep. If the item is controlled, set the value is TRUE. If not, it should be set to FALSE.) - 4
-
Our second rule has two complex tests and we want to mark the row FALSE if either are true. (Remember, this is based on what the DLA told me: items with A or Q6 are non-controlled.) Our
case_when()logic first looks for the value “A” in thedemil_codefield. If it is yes, then it marks the row FALSE. If no it goes to the next part: Is there a “Q” in thedemil_codefield AND a “6” in thedemil_icfield? Both “Q” and “6” have to be there to get marked as FALSE. If both fail, then we move to the next test. - 5
-
The last test is our catch-all. If none of the other rules apply, then we set the default value as
TRUE, which means it is controlled. So our default in the end is to mark everything TRUE if any of the other rules don’t mark it first. - 6
- Lastly we glimpse at the data just so we can see the column was created.
Rows: 79,761
Columns: 13
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "1240-01-411-1265", "2540-01-565-4700", "1005-01-587…
$ item_name <chr> "SIGHT,REFLEX", "BALLISTIC BLANKET KIT", "MOUNT,RIFL…
$ quantity <dbl> 9, 10, 10, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 6, 1, 1, 1,…
$ ui <chr> "Each", "Kit", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 339.00, 16568.15, 1626.00, 10000.00, 62627.00, 65800…
$ demil_code <chr> "Q", "D", "D", "Q", "C", "C", "C", "D", "Q", "Q", "D…
$ demil_ic <dbl> 3, 1, 1, 3, 1, 1, 1, 1, 3, 3, 1, 1, 1, 3, 1, 1, 1, 3…
$ ship_date <dttm> 2016-09-14, 2018-01-30, 2016-09-19, 2017-03-28, 201…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
$ total_value <dbl> 3051.00, 165681.50, 16260.00, 10000.00, 62627.00, 65…
$ control_type <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
As I said, there was a lot of trial and error to figure that out, but I’ll show some tests here to show that we did what we were intending.
This shows airplanes are marked as controlled with TRUE.
# showing the results and some columns that determined them
leso_control |>
select(item_name, demil_code, demil_ic, control_type) |>
filter(str_detect(item_name, "AIRPLANE"))This shows how many items are marked TRUE vs FALSE for each demil_code and demil_ic combination. I used it to check that most A records were FALSE, along with Q6.
- 1
-
This is the count shortcut using three variables. I name counted column
cnt. It’s hard to visualize that result here but it counts how many times each unique combination ofdemil_code,demil_icandcontrol_type. - 2
- Here I use pivot_wider() to show the TRUE and FALSE counts on the same row. Just makes it easier to see.
We’re done with this extended explanation.