library(tidyverse)
hot100 <- read_rds("data-processed/01-hot100.rds")
13 A counting shortcut
We count stuff in data science (and journalism) all the time. In our Billboard project we used summarize() and the n() function to count rows because we needed to understand the concepts of group_by(), summarize() and arrange(). That approach is perfectly valid, and it is the best way to explain and understand grouping and summarizing.
But in the interest of full disclosure, know that dplyr has a shortcut to group, count and arrange rows of data. The count() function takes the columns you want to group by and then does the summarize() on n() for you. We’ll demonstrate it with the Billboard data. You can create a new notebook to try this, or just use this for reference.
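To make the shortcut concrete, here is a minimal sketch comparing the long way with count(). It uses a made-up tibble (toy_chart, with placeholder performer names) instead of the real Billboard data, so the values are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for hot100: one row per chart appearance
toy_chart <- tibble(
  performer = c("Performer A", "Performer B", "Performer A",
                "Performer B", "Performer B", "Performer C")
)

# The long way: group the rows, then count them with n()
long_way <- toy_chart |>
  group_by(performer) |>
  summarize(n = n())

# The shortcut: count() does the grouping and summarizing in one step
short_way <- toy_chart |>
  count(performer)

long_way
short_way
```

Both results should hold one row per performer with the count in a column called n.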
13.1 Setup and import
We don’t normally put setup and import together in one chunk, but we’re just setting up a quick demonstration.
13.2 Basic count
We’re going to rework the first quest of our Billboard analysis:
Which performer had the most appearances on the Hot 100 chart at any position?
Our logic is we want to count the number of rows based on each performer. We do this by adding the variables we want to group as arguments to count():
hot100 |>
  count(performer)
13.2.1 Sort the results
If we want the highest counted row at the top (and we almost always do) then we can add an argument: sort = TRUE.
hot100 |>
  count(performer, sort = TRUE)
13.2.2 Name the new column
Notice the column of counts is called n. We can rename that with another argument, name =, and give it the name we want in quotes.
hot100 |>
  count(performer, sort = TRUE, name = "appearances")
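As a rough comparison, the sort and name arguments stand in for the arrange() step and the column naming we would otherwise write ourselves. The sketch below again assumes a made-up tibble (toy_chart) with placeholder names, not the real chart data.

```r
library(dplyr)

# Hypothetical stand-in for hot100: one row per chart appearance
toy_chart <- tibble(
  performer = c("Performer A", "Performer B", "Performer A",
                "Performer B", "Performer B", "Performer C")
)

# The long way: name the column in summarize(), then sort descending
long_way <- toy_chart |>
  group_by(performer) |>
  summarize(appearances = n()) |>
  arrange(desc(appearances))

# The shortcut: name = sets the column name, sort = TRUE puts the largest first
short_way <- toy_chart |>
  count(performer, sort = TRUE, name = "appearances")

long_way
short_way
```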
13.2.3 Filter results as normal
To cut off the results, we just filter as we normally would.
hot100 |>
  count(performer, sort = TRUE, name = "appearances") |>
  filter(appearances > 650)
So the code above does the same thing we did in our first Billboard quest, but with fewer lines.
13.3 Grouping on multiple variables
We can group on multiple variables by adding them as additional arguments to count(). We’ll show this with the second quest:
Which song (title & performer) has been on the charts the most?
hot100 |>
  count(
    title,
    performer,
    sort = TRUE,
    name = "appearances"
  ) |>
  filter(appearances >= 70)
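To see why both columns matter, consider a small sketch with a made-up tibble (toy_chart, placeholder titles and artists) in which two different artists charted a song with the same title. Counting on title alone lumps them together; counting on title and performer keeps them apart.

```r
library(dplyr)

# Hypothetical stand-in for hot100: "Song A" was charted by two different artists
toy_chart <- tibble(
  title     = c("Song A", "Song A", "Song A", "Song A", "Song B", "Song B"),
  performer = c("Artist X", "Artist X", "Artist X", "Artist Y", "Artist Z", "Artist Z")
)

# Counting on title alone merges both versions of "Song A"
by_title <- toy_chart |>
  count(title)

# Counting on title and performer treats each version as its own song
by_song <- toy_chart |>
  count(title, performer, sort = TRUE, name = "appearances")

by_title
by_song
```

One small convenience: because toy_chart is not grouped going in, the result of count() is not grouped either, so there is nothing to ungroup before filtering or arranging further.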