12  Grouping by dates

Note

This lesson was pulled out of the Billboard project because it was getting really long, but you’ll need this skill when you are working on your mastery assignments.

It is not uncommon in data journalism to count or sum records by year based on a date within your data. Or even by other date parts like month or week. There are some really nifty features within the lubridate package that make this pretty easy.

We’ll run through some of those scenarios here using the Billboard Hot 100 data we used in Chapters 3 & 4. If you want to follow along, you can create a new Quarto Document in your Billboard project. Or you can just use this for reference.

12.1 Setting up

We need to set up our notebook with libraries and data before we can talk specifics. We need to load both the tidyverse and lubridate libraries. Lubridate is installed with the tidyverse package, but you have to load it separately.

library(tidyverse)
library(lubridate)

And we need our cleaned Billboard Hot 100 data. We’ll glimpse the rows and check out date range so we can remember what we have.

hot100 <- read_rds("data-processed/01-hot100.rds")

hot100 |> glimpse()
Rows: 341,300
Columns: 7
$ chart_date    <date> 1958-08-04, 1958-08-04, 1958-08-04, 1958-08-04, 1958-08…
$ current_rank  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ title         <chr> "Poor Little Fool", "Patricia", "Splish Splash", "Hard H…
$ performer     <chr> "Ricky Nelson", "Perez Prado And His Orchestra", "Bobby …
$ previous_rank <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ peak_rank     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ wks_on_chart  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
hot100$chart_date |> summary()
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"1958-08-04" "1974-12-14" "1991-04-20" "1991-04-19" "2007-08-25" "2023-12-30" 

12.2 Plucking date parts

If you look at the lubridate cheetsheet under “GET AND SET DATE COMPONENTS” you’ll see functions to pluck out parts of a date, like year().

If we have a date, like perhaps Taylor Swift’s birthday, we can pluck the year from it.

year("1989-12-13")
[1] 1989

12.3 Grouping by a date part on the fly

Let’s show how this might be useful through an example question:

Which performer has the most appearances on the chart in a given year?

The logic works like this:

  • Group all the records by performer AND the year of the chart_date
  • Summarize and count the rows
hot100 |> 
  group_by(
1    year(chart_date),
    performer
  ) |> 
  summarize(appearances = n()) |> 
  arrange(desc(appearances))
1
This is where we add the year to the group_by, plucking it from the chart_date with the year() function.

It looks like Morgan Wallen has the most appearances and 2023 is only half over as of this running. When I ran this code last year The Beatles topped the list. There is definitely some kinda story here.

Anyway, notice how the year column name is kinda shite? We would not be able to easily reference that variable later, so we should rename that AS we make the group:

hot100 |> 
  group_by(
1    yr = year(chart_date),
    performer
  ) |> 
  summarize(appearances = n()) |> 
  arrange(desc(appearances))
1
Added the “yr =” here

It is a good practice to rename any grouping variable made from a function like that. FWIW, it would’ve worked if I called the new column year, but I named it yr so I’m less likely to confuse it with the function year(). It’s a personal preference what to name the new column.

12.4 Making reusable date parts

If you think you’ll use a date parts more than once, then it makes sense to create a new columns and save them. You might make several date parts, but we’ll start with only one.

Tip

I usually go back to my cleaning notebook to add these once I recognize I need them, and then rerun everything.

I’ve created a random sample of data with only the chart_date and title columns just so it is easier to see what we are doing. You would normally work with the whole data frame! Here is our sample:

hot100_sample <- hot100 |>
  slice_sample(n = 6) |>
  select(chart_date, title)

hot100_sample

12.4.1 Let’s make a year

Here’s how we do it:

hot100_sample |> 
1  mutate(
    yr = year(chart_date)
  )
1
We use mutate to create a new column. This is where it starts.
  • We name the new column yr, and then set the value to the year() of chart_date.

12.4.2 The magical month

We can also pluck out the month of the date, which is pretty useful if you want to measure seasonality within a year, like hot days of summer, etc. The default month() function pulls the month as a number.

hot100_sample |> 
  mutate(
    mo = month(chart_date)
  )

But there are some options within month() to give us month NAMES that are ordered as factors instead of alphabetical.

hot100_sample |> 
  mutate(
    mo_label = month(chart_date, label = TRUE),
    mo_long = month(chart_date, label = TRUE, abbr = FALSE)
  ) |> 
  arrange(mo_label)

Note the datatype <ord> under the column mo_label and mo_long. That means this is an “ordered factor” and that when sorted by those labels it will list in MONTH order instead of alphabetical order, which is quite useful.

12.4.3 Floor dates

Sometimes you want to count the number of records within a month and year, like all the songs in January 2000, then February 2000, etc. One way to do that is to create a floor_date, which gives you the “lowest” date within a certain unit like year or month. It’s easiest to show with our sample data:

hot100_sample |> 
  mutate(
    fl_month = floor_date(chart_date, unit = "month"),
    fl_year = floor_date(chart_date, unit = "year")
  )

You can see the resulting new columns are real dates, but they are normalized:

  • The fl_month gives you the first day of the month for that chart_date.
  • The fl_year gives you the first day of the year for that chart_date.

Let’s put this to use with an example. I’ll create a fl_month on the fly to find Recent appearances by Taylor Swift. I’ll also do the year() on the fly in my filter.

swift_month <- hot100 |> 
  filter(
    performer == "Taylor Swift",
    chart_date >= "2020-07-01"
  ) |> 
  group_by(
    fl_month = floor_date(chart_date, unit = "month")
  ) |> 
  summarize(appearances = n())

swift_month

And chart it:

swift_month_plot <- swift_month |> 
  ggplot(aes(x = fl_month, y = appearances)) +
  geom_col() +
  geom_text(aes(label = appearances), vjust = -.3) +
  scale_x_date(date_labels="%b %Y",date_breaks  ="1 month") +
  guides(x =  guide_axis(angle = 45)) +
  labs(
    x = "Month and Year",
    y = "Number of appearances",
    title = "Swifts\' \"Midnights\" drives most hits",
    subtitle = str_wrap("While her most recent new album drove the most appearances on the Billboard Hot 100 within a month, each \"Taylor\'s version\"  album since 'Folklore' has also generated hits.")
  )

ggsave("figures/swift_month_plot.png")
Saving 7 x 5 in image

Can you guess when she released her albums?