library(tidyverse)
library(lubridate)
12 Grouping by dates
This lesson was pulled out of the Billboard project because it was getting really long, but you’ll need this skill when you are working on your mastery assignments.
It is not uncommon in data journalism to count or sum records by year based on a date within your data. Or even by other date parts like month or week. There are some really nifty features within the lubridate package that make this pretty easy.
We’ll run through some of those scenarios here using the Billboard Hot 100 data we used in Chapters 3 & 4. If you want to follow along, you can create a new Quarto Document in your Billboard project. Or you can just use this for reference.
12.1 Setting up
We need to set up our notebook with libraries and data before we can talk specifics. We need to load both the tidyverse and lubridate libraries. Lubridate is installed with the tidyverse package, but you have to load it separately.
And we need our cleaned Billboard Hot 100 data. We’ll glimpse the rows and check out date range so we can remember what we have.
<- read_rds("data-processed/01-hot100.rds")
hot100
|> glimpse() hot100
Rows: 344,300
Columns: 7
$ chart_date <date> 1958-08-04, 1958-08-04, 1958-08-04, 1958-08-04, 1958-08…
$ current_rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ title <chr> "Poor Little Fool", "Patricia", "Splish Splash", "Hard H…
$ performer <chr> "Ricky Nelson", "Perez Prado And His Orchestra", "Bobby …
$ previous_rank <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ peak_rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ wks_on_chart <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$chart_date |> summary() hot100
Min. 1st Qu. Median Mean 3rd Qu. Max.
"1958-08-04" "1975-02-01" "1991-08-03" "1991-08-02" "2008-02-02" "2024-07-27"
12.2 Plucking date parts
If you look at the lubridate cheetsheet under “GET AND SET DATE COMPONENTS” you’ll see functions to pluck out parts of a date, like year()
.
If we have a date, like perhaps Taylor Swift’s birthday, we can pluck the year from it.
year("1989-12-13")
[1] 1989
12.3 Grouping by a date part on the fly
Let’s show how this might be useful through an example question:
Which performer has the most appearances on the chart in a given year?
The logic works like this:
- Group all the records by
performer
AND the year of thechart_date
- Summarize and count the rows
|>
hot100 group_by(
1year(chart_date),
performer|>
) summarize(appearances = n()) |>
arrange(desc(appearances))
- 1
-
This is where we add the year to the group_by, plucking it from the
chart_date
with theyear()
function.
It looks like Morgan Wallen had the most appearances in 2023. There is definitely some kinda story here.
Anyway, notice how the year column name is kinda shite? We would not be able to easily reference that variable later, so we should rename that AS we make the group:
|>
hot100 group_by(
1yr = year(chart_date),
performer|>
) summarize(appearances = n()) |>
arrange(desc(appearances))
- 1
-
I added the
yr =
part here.
It is a good practice to rename any grouping variable made from a function like that. FWIW, it would’ve worked if I called the new column year
, but I named it yr
so I’m less likely to confuse it with the function year()
. It’s a personal preference what to name the new column.
12.4 Making reusable date parts
If you think you’ll use a date parts more than once, then it makes sense to create a new columns and save them. You might make several date parts, but we’ll start with only one.
I usually go back to my cleaning notebook to add these once I recognize I need them, and then rerun everything.
I’ve created a random sample of data with only the chart_date
and title
columns just so it is easier to see what we are doing. You would normally work with the whole data frame! Here is our sample:
<- hot100 |>
hot100_sample slice_sample(n = 6) |>
select(chart_date, title)
hot100_sample
12.4.1 Let’s make a year
Here’s how we do it:
|>
hot100_sample 1mutate(
yr = year(chart_date)
)
- 1
- We use mutate to create a new column. This is where it starts.
- We name the new column
yr
, and then set the value to theyear()
ofchart_date
.
12.4.2 The magical month
We can also pluck out the month of the date, which is pretty useful if you want to measure seasonality within a year, like hot days of summer, etc. The default month()
function pulls the month as a number.
|>
hot100_sample mutate(
mo = month(chart_date)
)
But there are some options within month()
to give us month NAMES that are ordered as factors instead of alphabetical.
|>
hot100_sample mutate(
mo_label = month(chart_date, label = TRUE),
mo_long = month(chart_date, label = TRUE, abbr = FALSE)
|>
) arrange(mo_label)
Note the datatype <ord>
under the column mo_label
and mo_long
. That means this is an “ordered factor” and that when sorted by those labels it will list in MONTH order instead of alphabetical order, which is quite useful.
12.4.3 Floor dates
Sometimes you want to count the number of records within a month and year, like all the songs in January 2000, then February 2000, etc. One way to do that is to create a floor_date, which gives you the “lowest” date within a certain unit like year or month. It’s easiest to show with our sample data:
|>
hot100_sample mutate(
fl_month = floor_date(chart_date, unit = "month"),
fl_year = floor_date(chart_date, unit = "year")
)
You can see the resulting new columns are real dates, but they are normalized:
- The
fl_month
gives you the first day of the month for thatchart_date
. - The
fl_year
gives you the first day of the year for thatchart_date
.
Let’s put this to use with an example. I’ll create a fl_month
on the fly to find Recent appearances by Taylor Swift. I’ll also do the year()
on the fly in my filter.
<- hot100 |>
swift_month filter(
== "Taylor Swift",
performer >= "2020-07-01"
chart_date |>
) group_by(
fl_month = floor_date(chart_date, unit = "month")
|>
) summarize(appearances = n())
swift_month
And chart it:
<- swift_month |>
swift_month_plot ggplot(aes(x = fl_month, y = appearances)) +
geom_col() +
geom_text(aes(label = appearances), vjust = -.3) +
scale_x_date(date_labels="%b %Y",date_breaks ="1 month") +
guides(x = guide_axis(angle = 45)) +
labs(
x = "Month and Year",
y = "Number of appearances",
title = "Swifts\' \"Midnights\" drives most hits",
subtitle = str_wrap("While her most recent new album drove the most appearances on the Billboard Hot 100 within a month, each \"Taylor\'s version\" album since 'Folklore' has also generated hits.")
)
ggsave("figures/swift_month_plot.png")
Saving 7 x 5 in image
Can you guess when she released her albums?