Categorization

Tutorial by: Christian McDonald, JedR master

How to make categories from data

Welcome Padawan. May your quest to become a JedR Master be fruitful.

“Do or do not. There is no try.”

Sometimes as journalists (or data scientists) we need to create new categories for our data. An example might be where we have a column of data that has too many different values to plot effectively. We might be able to re-categorize those values to fewer individual choices, maybe combining less important values into a generic term like “other”.

Some real-worldish examples:

You want to show race/ethnicity breakdowns in a chart or story about Hispanic representation. There are eight different values and some are quite small. You might create a column to use the values Asian, Black, Hispanic and White with all others changed to “Other”.
Perhaps you are doing a story on crime and want to highlight violent crimes vs property vs other crimes. You might create a new column that categorizes many individual charges into these buckets.

We’re taking records and putting them into piles or groups based on one or more values.

Goals of this training session

To apply case_match() within a mutate() to create data categories. It’s a one-to-one match, for the most part.
To apply case_when() within a mutate() to create data categories. This allows more complex logic.

Our scenario

Here at the Galactic News Hub, we have a dataset of all the sources we connect with on a regular basis: the starwars data set that comes with the tidyverse library (it is part of the dplyr package).

This data includes a variable called species, and our editor wants a report breaking down the species of our source list based on three broad categories: human, droid, and other. Our editor is hoping we can later cross reference these categories against gender and other variables within the source list.

To accomplish this, it would be helpful to have a “species_category” variable using these three values so we can group and aggregate with other columns.

Let’s take a look at the species column using count().
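A minimal sketch of that count, assuming the dplyr package is installed (starwars ships with it). The object name species_counts and the sort = TRUE argument are additions for illustration:

```r
library(dplyr)

# Count how many sources fall under each species, most frequent first.
species_counts <- starwars |>
  count(species, sort = TRUE)

species_counts
```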

You can see that Human and Droid values dominate the data. The next most frequent value is NA. Maybe we can convince our editor to make that category Other/Unknown.

Using `case_match()`

One of the easier methods to categorize data is to use case_match(). The logic is straightforward: For this value, substitute that value.

JedR Mind Trick: When we categorize values like this, we want to create a new column instead of overwriting an existing one. This way you can check your work. You can always remove or rename columns later.

To make our goal clear for the Galactic News Hub: We want to create a new column called species_category that has three values: Human, Droid, Other/Unknown. And they need to be based on the existing species variable.

How case_match works

While it is possible to recategorize our values into the same column, a good JedR will instead use mutate() and case_match() to create a new column and then fill it with the results.

data |> 
  mutate(                                                  # 1
    new_column = case_match(                               # 2
      existing_column,                                     # 3
      "Old Value 1 from existing_column" ~ "New Value 1",  # 4
      "Old Value 2 from existing_column" ~ "New Value 2",
      .default = "Value for everything else"               # 5
    )
  )

1: This is standard usage for the mutate() function … you create the new column and then set it to a value. In this case, we are setting the value to the result of the case_match() function.
2: We are putting our categorized values into a new column.
3: Within case_match(), we have to specify the column we are looking into when we search for these values.
4: Then we have a series of statements separated with ~. If we find what is on the left-hand side, then we change it to what is on the right-hand side.
5: Lastly we have a “default” to handle everything else we haven’t specified. In some cases you might make this .default = existing_column to keep the existing value for those rows.

The order you put these is important. Once a match has been found and updated, that value won’t be changed again.
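To see why order matters, here is a small sketch on a made-up vector (the values and labels are hypothetical, not from the starwars data). The value "a" appears in two rules, and only the first one applies:

```r
library(dplyr)

# Hypothetical toy values, not from the starwars data.
codes <- c("a", "b", "c")

relabelled <- case_match(
  codes,
  "a" ~ "first rule",
  c("a", "b") ~ "second rule",  # "a" was already matched by the rule above
  .default = "no rule"
)

relabelled
```

Here "a" gets "first rule", "b" gets "second rule", and "c" falls through to the default.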

So, with our starwars data, a partial solution might look like this:
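A sketch of such a partial solution, assuming dplyr 1.1.0 or later for case_match(). The object name sw_partial, the "Mechanical" label, and the row range passed to slice() are assumptions made for illustration, so the rows you see may differ from the ones discussed here:

```r
library(dplyr)

sw_partial <- starwars |>
  mutate(
    species_category = case_match(
      species,
      "Human" ~ "Human",
      "Droid" ~ "Mechanical",
      .default = species  # keep the existing value for every other row
    )
  ) |>
  select(name, species, species_category) |>
  slice(21:40)  # assumed row range, chosen only for this sketch

sw_partial
```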

At the end here we are using select() and slice() to focus our result on just the columns and rows of interest. (The first 20 rows are mostly humans and droids.)

Note we’ve only changed two of the values we need to change, though. (You can see Toydarian and Dug have not been changed.) This method is very handy if you have just a few things to change.

In our case here, we could list all 36 changes, but … Use the Force, Luke.

Using .default

case_match() changes exactly what you specify and leaves anything you don’t specify as NA. If you want the non-matches to be something else, you can specify them with .default =.
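A quick way to see that behavior is to leave .default out on a small made-up vector (the sample values below are hypothetical):

```r
library(dplyr)

# Hypothetical sample values for illustration.
sample_species <- c("Human", "Droid", "Wookiee", NA)

# No .default here, so anything not listed comes back as NA.
no_default <- case_match(
  sample_species,
  "Human" ~ "Human",
  "Droid" ~ "Droid"
)

no_default
```

Both "Wookiee" and the original NA end up as NA in the result.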

So, we’ll tackle this in a different way. We can set Human to “Human” (i.e., keep it) and Droid to “Mechanical”, but then everything else to “Other/Unknown”.

NOTE that the specific matches use ~ to separate what to look for and what to change it to, but the .default designation uses =.
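Put together, that approach might look like the sketch below. It assumes dplyr 1.1.0 or later, uses the labels described just above, and stores the result in sw_default, a name made up for this example:

```r
library(dplyr)

sw_default <- starwars |>
  mutate(
    species_category = case_match(
      species,
      "Human" ~ "Human",           # specific matches use ~
      "Droid" ~ "Mechanical",
      .default = "Other/Unknown"   # the catch-all uses =
    )
  ) |>
  select(name, species, species_category)

sw_default
```

Rows where species is NA also land in “Other/Unknown”, because they match none of the listed values.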

Check your results

JedR Mind Trick: Even with these few rows, it’s hard to see all the changes made in your data. It is sometimes helpful to do a count() on your changes to check them. Just be sure to count the saved, already-changed data as a separate step, rather than piping count() into the code that creates the new object. Otherwise you would save the counts instead of the data.

Here we save the recategorized data into a new object called sw_species and then run a count() on our two columns to check the results.
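As a sketch of that check (assuming dplyr 1.1.0 or later, with the same labels as described above; species_check is a hypothetical name for the count result):

```r
library(dplyr)

# Step 1: save the recategorized data into its own object.
sw_species <- starwars |>
  mutate(
    species_category = case_match(
      species,
      "Human" ~ "Human",
      "Droid" ~ "Mechanical",
      .default = "Other/Unknown"
    )
  )

# Step 2: count the saved data on both columns to compare old and new values.
species_check <- sw_species |>
  count(species_category, species)

species_check
```

Each row of the count pairs an original species with the category it was assigned, so a misfiled value stands out quickly.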

Padawan practice: case_match

If you look at the gender column in the data, you’ll see there are two values: masculine and feminine, but four records don’t have a gender value and are listed as NA.

For this quest, you will use case_match() to fill in those blank values with unknown.

Note in the code you are creating a new dataframe and then filling it with the mutated starwars data.

Within the case_match section you’ll need to note what variable you are working from, what to do if that value is NA, and set the remainder using a default.

At the end it prints the results.

Solution 1

gender_reveal <- starwars |> 
  mutate(
    gender_clean = case_match(
      gender,
      NA ~ "unknown",
      .default = gender
    ))

gender_reveal |> 
  select(name, gender, gender_clean)

Using `case_when()`

The case_match() function works great if you have one-to-one changes that don’t require logic. But when we have more complex needs, we need a more powerful use of the Force.

With the case_when() function, we can set new values based on some kind of logic instead of one-to-one matching.

data |> 
  mutate(                              # 1
    new_column = case_when(
      logic_test1 ~ "New value 1",     # 2
      logic_test2 ~ "New value 2",     # 3
      .default = "Default value"       # 4
    )
  )

1: Again, we start with a mutate() and put our results into a new column.
2: But now with case_when(), instead of looking at values in a single column, we can use more complex logic. The test is on the left side of the ~, and on the right is a new value designation.
3: Each rule is run in order. A value is set only once, so if a value is set in one rule, it will NOT be reset by subsequent rules. You must write your rules from the most specific to the most general.
4: We still have a .default = to handle all values not already changed. If left out, values not specified would be set to NA.
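The “most specific first” rule is easier to see on a tiny made-up vector (the numbers and labels here are hypothetical):

```r
library(dplyr)

# Hypothetical heights in centimeters, including one missing value.
heights <- c(66, 150, 202, NA)

height_label <- case_when(
  heights > 200 ~ "very tall",  # most specific rule first
  heights > 100 ~ "tall",       # 202 never reaches this rule
  .default = "short or unknown"
)

height_label
```

If the two rules were swapped, 202 would be labeled "tall" and the "very tall" rule would never apply. The NA height matches neither test, so it takes the default.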

Here we will use case_when() and its logic statements to do the same thing we did above with case_match(), setting Human and Droid, but making the rest “Other/Unknown”.
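A sketch of that version, assuming dplyr 1.1.0 or later for the .default argument. The labels follow the earlier case_match() description, while the object name sw_when and the row range in slice() are assumptions for illustration:

```r
library(dplyr)

sw_when <- starwars |>
  mutate(
    species_category = case_when(
      species == "Human" ~ "Human",
      species == "Droid" ~ "Mechanical",
      .default = "Other/Unknown"  # also catches rows where species is NA
    )
  ) |>
  select(name, species, species_category) |>
  slice(21:40)  # assumed row range, chosen only for this sketch

sw_when
```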

Again, we’ve used the select() and slice() just to preview some specific rows in the result.

A more complex example

Unlike case_match() where you are drawing from values in one specific column, the logic tests used with case_when() can be just about anything as long as they evaluate to TRUE or FALSE.

While this is a silly example, here we create a column called size_compare that notes if a character is “Larger” or “Smaller” than an average male human, unless they are a droid, in which case they are labeled “Mechanical”.
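One way to sketch this, under two assumptions that are not spelled out above: “size” is measured by the height column, and the benchmark is the mean height of human males computed from the data rather than a fixed number. The names avg_male_human_height and sw_size are made up for this example:

```r
library(dplyr)

# Assumption: the benchmark is the mean height of male humans in the data.
avg_male_human_height <- starwars |>
  filter(species == "Human", sex == "male") |>
  summarise(avg = mean(height, na.rm = TRUE)) |>
  pull(avg)

sw_size <- starwars |>
  mutate(
    size_compare = case_when(
      species == "Droid" ~ "Mechanical",               # most specific rule first
      height > avg_male_human_height ~ "Larger",
      .default = "Smaller"                              # everything else, including missing heights
    )
  ) |>
  select(name, species, height, size_compare)

sw_size
```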

Note we started with the rule for “Droid” because that is the most specific thing we needed to change. We then set the “Larger” value, and the rest got “Smaller.”

Padawan practice: case_when

Create a column called main_planet using the homeworld column. For each character with a homeworld of “Tatooine” or “Naboo”, label it as TRUE and otherwise label as FALSE, which will create a logical datatype column.
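One possible solution, offered as a sketch rather than the official answer. It assumes dplyr 1.1.0 or later, and the object name sw_planets is made up for this example:

```r
library(dplyr)

sw_planets <- starwars |>
  mutate(
    main_planet = case_when(
      homeworld %in% c("Tatooine", "Naboo") ~ TRUE,
      .default = FALSE  # all other homeworlds, including missing ones
    )
  )

sw_planets |>
  select(name, homeworld, main_planet)
```

Because both outcomes are logical values, main_planet is created as a logical column.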