Categorization
Tutorial by: Christian McDonald, JedR master
How to make categories from data
Welcome Padawan. May your quest to become a JedR Master be fruitful.
“Do or do not. There is no try.”
Sometimes as journalists (or data scientists) we need to create new categories for our data. An example might be where we have a column of data that has too many different values to plot effectively. We might be able to re-categorize those values to fewer individual choices, maybe combining less important values into a generic term like “other”.
Some real-worldish examples:
- You want to show race/ethnicity breakdowns in a chart or story about Hispanic representation. There are eight different values and some are quite small. You might create a column to use the values Asian, Black, Hispanic and White with all others changed to “Other”.
- Perhaps you are doing a story on crime and want to highlight violent crimes vs property vs other crimes. You might create a new column that categorizes many individual charges into these buckets.
We’re taking records and putting them into piles or groups based on one or more values.
Goals of this training session
- To apply
case_match()
within amutate()
to create data categories. It’s a one-to-one match, for the most part. - To apply
case_when()
within amutate()
to create data categories. This allows more complex logic.
Our scenario
Here at the Galactic News Hub, we have a dataset of all the sources we connect with on a regular basis: The starwars
data set that comes with the tidyverse
library.
This data includes a variable called species
, and our editor wants a report breaking down the species of our source list based on three broad categories: human, droid, and other. Our editor is hoping we can later cross reference these categories against gender and other variables within the source list.
To accomplish this, it would be helpful to have a “species_category” variable using these three values so we can group and aggregate with other columns.
Let’s take a look at the species column using count()
.
You can see that Human and Droid values dominate the data. The next most frequent values is NA
. Maybe we can convince our editor to make that category Other/Unknown.
Using case_match()
One of the easier methods to categorize data is to use case_match()
. The logic is straighforward: For this value, substitute that value.
JedR Mind Trick: When we categorize values like this, we want to create a new column intead of overwriting an existing one. This way you can check your work. You can always remove or rename columns later.
To make clear our goal here for the Galactic Hub News: We want to create a new column called species_category
that has three values: Human, Droid, Other/Unknown. And they need to be based on the existing species
variable.
How case_match works
While it is possible to recatagorize our values into the same column, a good JedR will instead use mutate()
and case_when()
to create a new column and then fill it with the results.
|>
data 1mutate(
2new_column = case_when(
3
existing_column,4"Old Value 1 from existing_column" ~ "New Value 1",
"Old Value 2 from existing_column" ~ "New Value 2",
5.default = "Value for everything else"
) )
- 1
-
This is standard usage for the
mutate()
function … you create the new column and then set it to a value. In this case, we are setting the value to the result of thecase_when()
function. - 2
- We are putting our categorized values into a new column.
- 3
-
Within
case_when()
, we have to specify the column we are looking into when we search for thse values. - 4
-
Then we have a series of statements separated with
~
. If we find what is on the left-hand side, then we change it to what is on the right-hand side. - 5
-
Lastly we have a “default” to handle everything else we haven’t specified. In some cases you might make this
.default = existing_column
to keep the existing value for those rows.
The order you put these is important. Once a match has been found and updated, that value won’t be changed again.
So, with our starwars
data, a partial solution might look like this:
At the end here we are using select()
and slide to just focus our result to just to focus on our columns and rows of interest. (The first 20 rows are all humans and droids.)
Note we’ve only changed two of the values we need to change, though. (You can see Toydarian and Dug have not been changed.) This method is very handy if you have just a few things to change.
In our case here, we could list all 36 changes, but … Use the Force, Luke.
Using .default
case_match()
changes exactly what you specify and leaves anything you don’t specify as NA
. If you want the non-matches to be something else, you can specify them with .default =
.
So, we’ll tackle this in a different way. We can set Human to “Human” (i.e, keep it) and Droid to “Mechanical”, but then everything else to “Other/Unknown”.
NOTE that the specific matches use ~
to separate what to look for and what to change it to, but the .default
designation uses =
.
Check your results
JedR Mind Trick: Even with these few rows, it’s hard to see all the changes made in your data. It is sometimes helpful to do a
count()
on your changes to check them. Just be sure to count the already-changed data, not the result you are saving into the new object.
Here we save the recategorized data into a new object called sw_species
and then run a count()
on our two columns to check the results.
Padawan practice: case_match
If you look at the gender
column in the data, you’ll see there are two values: masculine and feminine, but there are four records that don’t have a gender
value are listed as NA
.
For this quest, you will use case_match()
to fill in those blank values with unknown.
Note in the code you are creating a new dataframe and then filling it with the mutated starwars
data.
Within the case_match section you’ll need to note what variable you are working from, what to do if that value is NA, and set the remainder using a default.
At the end it prints the results.
gender_reveal <- starwars |>
mutate(
gender_clean = case_match(
gender,
NA ~ "unknown",
.default = gender
))
gender_reveal |>
select(name, gender, gender_clean)
<- starwars |>
gender_reveal mutate(
gender_clean = case_match(
gender,NA ~ "unknown",
.default = gender
))
|>
gender_reveal select(name, gender, gender_clean)
Using case_when()
The case_match()
function works great if you have one-to-one changes that don’t require logic. But when you have more complex needs, we need a more powerful use of the Force.
With the case_when()
function, we can set new values based on some kind of logic instead of one-to-one matching.
|>
data 1mutate(
new_column = case_when(
2~ "New value 1",
logic_test1 3~ "New value 2",
logic_test2 4.default = "Default value"
) )
- 1
- Again, we start with a mutate() and put our results into a new column.
- 2
-
But now with
case_when()
, instead of looking at values in a single column, we can use more complex logic. The test is on the left side of the~
, and on the right is a new value designation. - 3
- Each rule is run in order. A value is set only once, so if a value is set in one rule, it will NOT be reset by subsequent rules. You must write your rules from the most specific to the most general.
- 4
-
We still have a
.default =
to handles all values not already changed. If left out, values not specified would be setNA
.
Here we will use case_when()
and its logic statements to do the same thing we did above with case_match()
, setting Human and Droid, but making the rest “Other/Unknown”.
Again, we’ve used the select()
and slice()
just to preview some specific rows in the result.
A more complex example
Unlike case_match()
where you are drawing from values in one specific column, the logic tests used with case_when()
can be just about anything as long as they evaluate to TRUE
or FALSE
.
While this is a silly example, here we create a column called size_compare
that notes if a character is “Larger” or “Smaller” than an average male human, unless they are a droid, in which case they are labeled “Mechanical”.
Note we started with the rule for “Droid” because that is the most specific thing we needed to change. When then set the “larger” value, and the rest got “smaller.”
Padawan practice: case_when
Create a column called main_planet
using the homeworld
column. For each character with a homeworld of “Tatooine” or “Naboo”, label it as TRUE
and otherwise label as FALSE
, which will create a logical datatype column.
mainchars <- starwars |>
mutate(
main_planet_origin = case_when(
str_detect(homeworld, "Tatooine|Naboo") ~ TRUE,
.default = FALSE
))
mainchars |>
select(name, homeworld, main_planet_origin)
<- starwars |>
mainchars mutate(
main_planet_origin = case_when(
str_detect(homeworld, "Tatooine|Naboo") ~ TRUE,
.default = FALSE
))
|>
mainchars select(name, homeworld, main_planet_origin)
Page through your results to check your answers.
That’s it for now.
-30-