13 A counting shortcut

We count stuff in data science (and journalism) all the time. In our Billboard project we used summarize() and the n() function to count rows because we needed to understand the concepts of group_by(), summarize() and arrange(). That approach is perfectly valid, and it is the best way to explain and understand grouping and summarizing.

But in the interest of full disclosure, know that dplyr has a shortcut to group, count and arrange rows of data. The count() function takes the columns you want to group by and then runs summarize() with n() for you. We’ll demonstrate it with the Billboard data. You can create a new notebook to try this, or just use this chapter for reference.
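
To make that equivalence concrete, here is a minimal sketch on a tiny made-up table. The `toy` data and its values are invented for illustration and are not the Billboard data. Both pipelines should produce the same counts.

```r
library(dplyr)

# Toy stand-in for the Billboard data; these rows are invented for illustration.
toy <- tibble(
  performer = c("A", "B", "A", "C", "A", "B")
)

# The long way: group, summarize, arrange.
long_way <- toy |>
  group_by(performer) |>
  summarize(n = n()) |>
  arrange(desc(n))

# The shortcut: count() groups and counts, and sort = TRUE arranges
# the result with the largest count first.
short_way <- toy |>
  count(performer, sort = TRUE)

long_way
short_way
```

The shortcut saves typing, but the long form is what is happening underneath, which is why it is worth learning first.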

We didn’t start with count() because it can’t do anything other than count rows, and we needed to build our GSA (group_by, summarize, arrange) knowledge to summarize in all kinds of ways, like sum(), mean() or slice().
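
As a rough illustration of what the longer form allows, the sketch below computes a count alongside other summaries in a single summarize() call. The `toy` table and the `current_rank` column are assumptions made up for this example; your Billboard data may use different column names.

```r
library(dplyr)

# Invented rows for illustration; current_rank is a hypothetical column.
toy <- tibble(
  performer = c("A", "A", "B", "B", "B"),
  current_rank = c(10, 20, 1, 2, 3)
)

# One summarize() can hold several summaries at once,
# which the plain count(performer) shortcut does not offer.
result <- toy |>
  group_by(performer) |>
  summarize(
    appearances = n(),
    best_rank = min(current_rank),
    avg_rank = mean(current_rank)
  )

result
```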

13.1 Setup and import

We don’t normally put the library setup and the data import together, but we’re just setting up a quick demonstration.

library(tidyverse)
hot100 <- read_rds("data-processed/01-hot100.rds")

13.2 Basic count

We’re going to rework our first quest of the Billboard analysis:

Which performer had the most appearances on the Hot 100 chart at any position?

Our logic is that we want to count the number of rows for each performer. We do this by adding the variables we want to group by as arguments to count():

hot100 |> 
  count(performer)

13.2.1 Sort the results

If we want the highest count at the top (and we almost always do), we can add an argument: sort = TRUE.

hot100 |> 
  count(performer, sort = TRUE)

13.2.2 Name the new column

Notice the column of counts is called n. We can rename it with another argument, name =, giving it the name we want in quotes.

hot100 |> 
  count(performer, sort = TRUE, name = "appearances")

13.2.3 Filter results as normal

To cut off the results, we just filter as we normally would.

hot100 |> 
  count(performer, sort = TRUE, name = "appearances") |> 
  filter(appearances > 650)

So the code above does the same thing we did in our first Billboard quest, but with fewer lines.

13.3 Grouping on multiple variables

We can group on multiple variables by adding them as arguments to count(). We’ll show this with the second quest:

Which song (title & performer) has been on the charts the most?

hot100 |> 
  count(
    title,
    performer,
    sort = TRUE,
    name = "appearances") |> 
  filter(appearances >= 70)
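
Why count on both title and performer? Different performers can release songs with the same title, and counting on title alone would lump them together. Here is a small sketch with invented rows (the `toy` table is not real chart data) that shows the difference.

```r
library(dplyr)

# Invented rows: two different performers each have a song called "Hello".
toy <- tibble(
  title = c("Hello", "Hello", "Hello", "Stay", "Stay"),
  performer = c("X", "X", "Y", "Z", "Z")
)

# Counting on title alone merges both "Hello" songs into one row.
by_title <- toy |>
  count(title, sort = TRUE, name = "appearances")

# Counting on title and performer keeps each song separate.
by_song <- toy |>
  count(title, performer, sort = TRUE, name = "appearances")

by_title
by_song
```

In the first result "Hello" gets a single combined count, while in the second each title and performer pair gets its own row.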