4  Billboard Analysis

This chapter continues the Billboard Hot 100 project. In the previous chapter we downloaded, imported and cleaned our data. We’ll be working in the same project, but in a new document.

4.1 Learning goals for this chapter

  • To use the group_by/summarize/arrange combination to count rows of data
  • To filter our data in two ways: to focus data before summarizing, and to logically cut summarized lists.
  • We’ll also cover some more complex filters using and/or logic
  • Introduce the shortcut count() function

4.2 The questions we’ll answer

Now that we have the Billboard Hot 100 charts data in our project it’s time to find the answers to the following questions:

  • Which performer had the most appearances on the Hot 100 chart at any position?
  • Which song (title & performer) has been on the charts the most?
  • Which song (title & performer) was No. 1 for the most number of weeks?
  • Which performer had the most songs reach No. 1?
  • Which performer had the most songs reach No. 1 in the most recent five years?
  • Which performer has had the most Top 10 hits overall?

What are your guesses for the questions above? NO PEEKING!

In each case we’ll talk over the logic of finding the answer and the code to accomplish it.

Before we can get into the analysis, we want to set up a new notebook to separate our cleaning from our analysis.

Warning

The data outputs in this book might differ from what you get since the source data is updated every week. This is especially true in videos, gifs and takeaways.

This chapter won’t have data past 2024.

4.3 Setting up an analysis notebook

At the end of the last notebook we exported our clean data as an .rds file. We’ll now create a new Quarto notebook and import that data. It will be much easier this time.

  1. If you don’t already have it open, go ahead and open your Billboard project.
  2. If your Cleaning notebook is still open, go ahead and close it.
  3. Use the + menu to start a new Quarto Document.
  4. Set the title as “Analysis”.
  5. Save the file as 02-analysis.qmd in your project folder.
  6. Check your Environment tab (top right) and make sure the environment is empty. We don’t want to have any leftover data. If there is, then go under the Run menu and choose Restart R and Clear Output.

Because we set our _quarto.yml file to update navigation with any new Quarto file, this page will automatically show up in the sidebar when we render the site.

4.3.1 Add your goals, setup

Since we are starting a new notebook, we need to set up a few things. First up we want to list our goals.

  1. Add a headline and text describing the goals of this notebook. You are exploring the Billboard Hot 100 charts data.
  2. Go ahead and copy all the questions we outlined above into your notebook.
  3. Format those questions as a nice list. Start each line with a - or * followed by a space. There should be a blank line above and below the entire LIST but not between the items. List items should be on sequential lines … and it is the only markdown item like that.
  4. Now add a another headline (two hashes) called Setup.
  5. Add a chunk, add the option #| label: setup and add in the tidyverse and janitor libraries.
  6. Run the chunk to load the libraries.
Try to write the code on your own first
library(tidyverse)
library(janitor)
Important

Do you see the grey triangle above with the term Try to write the code on your own first next to it? That is code that I’ve written but hidden to give you a chance to write it yourself before you see it. If you click on the little triangle, it will turn and reveal the code!

4.3.2 Import the cleaned data

We need to import the data that we cleaned and exported in the last notebook. It’s just a couple of lines of code and you could write them all out and run it, but here is where I tell you for the first of a 1,000 times:

WRITE ONE LINE OF CODE. RUN IT. CHECK THE RESULTS. REPEAT.

Yes, sometimes for the sake of brevity I will give you multiple lines of code at once, but to understand what is actually going on you should add and run that code one line at a time.

We’ll painstakingly walk through that process here to belabor the point. Here are our goals for this bit:

  • Document that we are importing clean data
  • Import our cleaned data
  • Fill an R object with that data
  • Glimpse the R Object so we can check it

I want you to:

  1. Start a Markdown section with a headline noting you are importing the cleaned data. Add any other text notes you might need for yourself.
  2. Add a new code chunk.
  3. Inside your code chunk, TYPE IN this line of code:
read_rds("data-processed/01-hot100.rds")

This read_rds() function is just like the read_csv() function we used in our last notebook to read in data from our computer, except we are reading a different kind of file format: .rds. The argument (in quotes) is the path where our file is on our computer.

What did you expect will happen if you run the code?

Now run that line of code.

It should print the results of the data in your notebook. Did it?

If you got an error, you should read that error output for some clues about what happened. When using a read_ function problems are usually about one of two things: It can’t find the function (load the library tidyverse!) or it doesn’t understand the path to the file (that you have the folder structure wrong or something misspelled).

Note

CATCHING COMMON MISTAKES: In the appendices of this book there is a chapter on Troubleshooting that is worth bookmarking and reading.

Our next goal is to take all that data and put it into a new R object.

  1. Edit your one line of code to assign the data to an object hot100. Remember we will add the new object FIRST (our bucket) and then use the <- operator to put the data inside it (the water).
  2. Run it!

It should look like this:

hot100 <- read_rds("data-processed/01-hot100.rds")

What happened when you ran the code? Did you get an error? Did you get a result printed to the screen?

OK, that last one was a trick question. When you save data into an object it does NOT print it to your notebook, but it does save that object into your “Environment”, which means it is in memory and can be called and run at any time. You should see hot100 listed in your Environment tab at the top right of RStudio.

Now, before I started this quest, I knew I wanted that data inside the hot100 object and could’ve written it like that from the beginning. But I didn’t because most “read” problems are in the path, so I wanted to first make sure that function was written correctly. Also, when I put the data into the object I can’t see it until I print it out again, so I do it one piece at a time. You should, too.

So next we’ll print the data out to our screen. We do that by just by calling the object.

  1. Edit your code chunk to add a blank line then the object and run it, like this:
hot100 <- read_rds("data-processed/01-hot100.rds")

hot100

This should print out the data to your screen so you can see the first couple of columns and the first 10 rows of data.

But what we really need is to see all the column names, the data types and an example of the data. We can get that with our glimpse() function.

  1. Edit the last line of the chunk to add the pipe and the glimpse() function
  2. Run it.

Your code (and result) should look like this now:

hot100 <- read_rds("data-processed/01-hot100.rds")

hot100 |> glimpse()
Rows: 346,600
Columns: 7
$ chart_date    <date> 1958-08-04, 1958-08-04, 1958-08-04, 1958-08-04, 1958-08…
$ current_rank  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ title         <chr> "Poor Little Fool", "Patricia", "Splish Splash", "Hard H…
$ performer     <chr> "Ricky Nelson", "Perez Prado And His Orchestra", "Bobby …
$ previous_rank <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ peak_rank     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ wks_on_chart  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Again, I could’ve written all that code at once before running it, as I’ve written code like that for many years. I still write and run code one line at a time, and you should, too. It is the easiest way to make sure your code is correct and find problems. So, for the second of a 1,000 times:

WRITE ONE LINE OF CODE. RUN IT. CHECK THE RESULTS. REPEAT.

Dealing with pipes

If you are looking at my code and writing it into your notebook, you have to run each line of code BEFORE you add the pipe |> at the end of that line. A pipe |> must have a function directly following it on the same or next (or same) line to work.

4.4 Introducing dplyr

One of the packages within the tidyverse is dplyr. Dplyr allows us to transform our data frames in ways that let us explore the data and prepare it for visualizing. It’s the R equivalent of common Excel functions like sort, filter and pivoting. There is a cheatsheet on the dplyr that you might find useful.

Images courtesy Hadley Wickham.

We’ve used select(), mutate() and arrange() already, but we’ll introduce more dplyr functions in this chapter.

While we go over each of these dplyr functions below to find the answers to our questions, there is a really good resource from David Keys of that explains each function with videos, animated gifs and example code.

4.5 Most appearances

Our first question: Which performer had the most appearances on the Hot 100 chart at any position?

Let’s work through the logic of what we need to do before I explain exactly how to do it.

Each row in the data is one song ranked on the chart. It includes a field with the name of the “performer” so we know who recorded it.

So, to figure out how many times a performer is in the data, we need to count the number of rows with the same performer.

We’ll use the tidyverse’s version of Group and Aggregate to get this answer. It is actually two different functions within dplyr that often work together: group_by() and summarize()

4.5.1 Group & Aggregate

Before we dive into the code, let’s review this video about “Group and Aggregate” to get a handle on the concept.

We’ll dive deep into this next.

4.5.2 Summarize

We’ll start with summarize() first because it can stand alone.

The summarize() function computes summary tables describing your data. We’re usually finding a single number to describe a bunch of data, like the “average” of numbers in column.

In our case we want a “summary” about the number of times a specific performer appears in data, hence we use summarize().

Tip

THEY BE THE ZAME: summarize() and summarise() are the same function, as R supports both the American and UK spelling of summarize. They work the same and I don’t care which you use.

Here is an example of summarize(). The data is shown on the left of the video and the code is at the right. The animation courtesy of Andrew Heiss1 gives you an visual representation of how it works.

The example above is giving us three summaries:

  • It creates a variable/column called “avg” that gets the mean() (or average) of all the values in the x column. (So adding those numbers together and then dividing by the number of rows.)
  • It creates a “total” variable using sum() to add together all those numbers in x, yielding “45”.
  • It creates a variable called “n” that counts the number of rows of the data with n().

Much like the mutate() function we used earlier, within summarize() we list the name of the new column first, then assign to it the function and column we want to accomplish using =. So we create “avg” then fill it with the mean(x).

Again, in our case (as we work toward finding the performer with most appearances) we want to summarize the number of rows, and there is a function for that: n(). (Think “number of observations”.) Every row in the data is an appearance … we just need to count how many rows have each performer.

4.5.3 Summarize appearances

So lets do this with our Hot100 data. We’ll use summarize to count all the rows in our data. Let’s write the code and run it, then I’ll explain:

  1. Set up a new section with a Markdown headline, text and explain you are looking for most appearances.
  2. Add a named code chunk and add the following:
1hot100 |>
2  summarize(appearances = n())
1
We start with the tibble first and then pipe into summarize().
2
Within the function, we define our summary: We name the new column “appearances” because that is a descriptive column name for our result, then we set that new column to count the number of rows.

Basically we are summarizing the total number of rows in the data. When this book was last updated there were 346500 rows.

But I bet you’re thinking: “Yo Professor, you said we want to count the number of times a performer has appeared, right?”

This is where we bring in a close friend to summarize() … the group_by() function.

4.5.4 Group by

The group_by() function pre-sorts data into groups so that whatever function follows is applied within each of the groups. That means group_by() always has something following it, and that something is usually summarize().

Let’s look at this with our example data from above. Here we group all the data together based on values in cat1, then perform all the summarize functions:

The grouping part of this happens behind the scenes and we don’t really see it. If you run group_by() by itself you won’t see any change, but it does happen.

Here is an example when we group by the second column cat2:

4.5.5 Group by performer

So, back to our Hot 100 data and our question Which performer had the most appearances on the Hot 100 chart at any position?

If we group_by() performer before we summarize and count our rows, it will put all of the rows with “Aerosmith” together, then all the “Beyonce” rows together, etc. and then it will count the rows within those groups … first those for Aerosmith and then those for Beyonce.

  1. Modify your code block to add the group_by line before summarize, then run it:
hot100 |>
1  group_by(performer) |>
  summarize(appearances = n())
1
Add the group_by() line before the summarize.

What we get in return is a summarized table that shows all 10,000+ different performers that have been on the charts, and the number of rows in which they appear in the data.

That’s great, but who had the most?

4.5.6 Arrange the results

Remember in our import notebook when we sorted all the rows by the oldest chart date? We’ll use the same arrange() function here, but we’ll change the result to descending order, because journalists almost always want to know the most of something.

  1. Edit your chunk to add the pipe and arrange function below and run it, then I’ll explain.
hot100 |>
  group_by(performer) |> 
  summarize(appearances = n()) |> 
1  arrange(desc(appearances))
1
We added the arrange() function and fed it the variable “appearances”, but we also wrapped that inside a desc() function to reverse the sort (descending order) putting the highest number ar the top. If we used just “appearances” then it would list the smallest values first.
Note

You’ll sometimes see me write this differently, piping the column into the desc() function like this: arrange(appearances |> desc()).

4.5.7 Get the top of the list

We’ve printed 10,000+ rows of data into our notebook when we really only wanted the Top 10 or so. You might think it doesn’t matter, but your knitted HTML file will store all that data and can make it a big file (as in hard drive space), so I try to avoid that when I can.

We can use the head() command again to get the first 10 result. It will give us six rows unless we specify a different number. There is a corresponding tail() function to get the last rows.

  1. Edit your code to pipe the result into head() function set to 10 rows.
  2. Review all the annotations below!
1hot100 |>
2  group_by(performer) |>
3  summarize(appearances = n()) |>
4  arrange(appearances |> desc()) |>
5  head(10)
1
We start with the hot100 data AND THEN
2
We group by the data by performer AND THEN
3
We summarize it by counting the number of rows in each group, calling the new column “appearances” AND THEN
4
We arrange the result by appearances in descending order AND THEN
5
We display just the first 10 rows

Since we have our answer here and we’re not using the result later, we don’t need to create a new object or anything. Printing it to our notebook is sufficient.

So, Taylor Swift … is that who you guessed? A little history here, when I started using this data back in 2018, Elton John topped the list. Elton John has been around a long time, but Swift’s popularity plus changes in how Billboard counts plays in the modern era (like streaming) has rocketed her to the top. (Sorry, Rocket Man). And it doesn’t hurt that she is re-releasing her entire catalog (Taylor’s version)!

Two important notes
  • The list we’ve created here is based on unique performer names, and as such considers collaborations separately. For instance, Drake is second on the list but those are only songs he performed alone and not the many, many collaborations he has done with other performers. Songs by “Drake” are counted separately than “Drake featuring Future” and even “Future featuring Drake”. You’ll need to make this clear when you write about this later. (If you include collaborations, Drake has way more appearances than Taylor.)
  • Using head() to “cut” this list is not the best method because there could be a tie between the 10th and 11th record. In the future we’ll learn filter() to do this better. But head() is pretty useful when you just need to see the top of your data.

4.5.8 Data Takeaways: Describing your learnings

We’ve learned something here from the data and we should note it for our future selves in something we call a “data takeaway,” which is a prose sentence explaining our finding. You’ll do this for each “answer” you find in your data.

  1. Add the following text after your code chunk to explain the findings
Data Takeaway: Taylor Swift has more solo appearances on the Billboard Hot 100 than any other artist in history with 1,600 appearances through December 28, 2024. This does not include collaborations with other artists.

Writing these sentences help you understand what you’ve found in your data. When you get toward the end of your project, you’ll collect these takeaways in your index.qmd file.

4.5.9 Render your notebook

Now that you have your first quest answer, let’s celebrate by rendering your notebook so you can see all your pretty work.

  1. In the toolbar of your notebook, click the Render button or use the Cmd-Shift-k keyboard command.

This will give you the website view of your page, and you should have the page listed in your site navigation if you did the setup properly at the beginning of this chapter.

4.6 Song with most appearances

Our next quest is this: Which song (title & performer) has been on the charts the most?

This is very similar to our quest to find the performer with the most appearances, but we have to consider both title and performer together because different artists can perform songs of the same name. For example, Adele’s song Hold On entered the Hot 100 at 49 in December 2021, but at least 18 other performers have also had a song titled Hold On on the Hot 100.

Let’s first talk through the logic of what we want to do and how it differed from our first quest:

  • We want to count rows where BOTH the title and performer are the same. We can do this by putting both values in our group_by() function, instead of just one thing. This will put all rows with both “Rush” as a performer AND Tom Sawyer as a title into the same group. Rows with “Rush” and Red Barchetta will be considered in a different group.
  • Then we want to summarize() to count the number of rows in each group.
  • Once we have a summary table, we’ll sort it by appearances in descending order to put the highest value on the top.

Let’s write down our quest and code.

  1. Start a new section (headline, text describing goal and a new code chunk.)
  2. Add the code below ONE LINE AT A TIME and run it and then I’ll outline it below.

WRITE ONE LINE OF CODE. RUN IT. CHECK IT. REPEAT

Remember, you write and run the line BEFORE you add the pipe |>! The annotations explain as if we were writing an English sentence.

1hot100 |>
2  group_by(performer, title) |>
3  summarize(appearances = n()) |>
4  arrange(desc(appearances))
1
Start with the hot100 data AND THEN …
2
Group the data by both performer and title AND THEN ..
3
Summarize the data by first naming our new column “appearances” and setting that to the result of the n() function that counts the number of rows AND THEN …
4
Arrange the data in descending order by the appearances column.

We will often use group_by(), summarize() and arrange() together, which is why I’ll refer to this as the GSA trio. They are like three close friends that always want to hang out together.

So, what was your guess or this question? A little bit of history for that answer … Glass Animals’ Heat Waves in 2022 overcame The Weeknd’s Blinding Lights, which had overcome Imagine Dragon’s Radioactive some time in 2021.

Note

When you run the code above you might see a warning summarise() has grouped output by ‘performer’. You can override using the.groups argument.” THIS IS NOT A PROBLEM. It is just R letting you know that the output of this remains grouped by the first item. Explaining it would break your brain right now. Don’t worry about it. This is not the droid you are looking for.

4.6.1 Introducing filter()

I showed you head() in the previous quest and that was useful to show just a few records at the top of a list, but it does so indiscriminately. Note there are some songs here with the same number of weeks on the charts. If we used head() we might split that tie, leaving us with an incomplete answer. A better strategy is to cut off the list at a logical place using filter(). Let’s dive into this new function:

Filtering is one of those Basic Data Journalism Functions:

The dplyr function filter() reduces the number of rows in our data based on one or more criteria within the data.

The syntax works like this:

# this is psuedo code. don't include it
data |> 
  filter(variable comparison value)

# example
hot100 |> 
  filter(performer == "Taylor Swift")

The filter() function typically works in this order:

  • What is the variable (or column) you are searching in.
  • What is the comparison you want to do. Equal to? Greater than?
  • What is the observation (or value in the data) you are looking for?

Note the two equals signs == in our Taylor Swift example above. It is important to use two of them when you are asking if a value is “true” or “equal to”, as a single = typically means you are assigning a value to something.

4.6.1.1 Comparisons: Logical tests

There are a number of these logical tests for the comparison:

Operator Definition
x < y Less than
x > y Greater than
x == y Equal to
x <= y Less than or equal to
x >= y Greater than or equal to
x != y Not equal to
x %in% c(y,z) In a group
is.na(x) Is NA (missing values)
!is.na(x) Is not NA

Where you apply a filter matters. If we want to consider only certain data when we are grouping/summarizing then we filter BEFORE the GSA. If we filter after the GSA, we affect only the results of the summarize function, which is what we want to do here.

4.6.1.2 Filter to a logical cutoff

In this case, I want you to use filter after the GSA actions to include only results with 65 or more appearances.

  1. Edit your current chunk to add a filter as noted in the example below. I’ll explain it after.
hot100 |>
  group_by(performer, title) |>
  summarize(appearances = n()) |>
  arrange(appearances |> desc()) |> 
1  filter(appearances >= 65)
1
filter() is the function. The first argument in the function is the column we are looking at – in our case the appearances column, which was created in the summarize line. We then provide a comparison operator >= to find values “greater than or equal to” 65.

The value we use to cut the list of – 65 in this case – is arbitrary. We just want to choose a value that makes common sense. 70 seemed too high because that would yield 7 songs, and 60 was too low because there are lots of ties in the lower 60s.

4.6.2 Data Takeaway: Song appearances

OK, time to write about what you’ve found.

  1. Add this data takeaway after the code chunk above.
Data Takeaway: The song "Heat Waves" by Glass Animals has appeared 91 times on the Billboard Hot 100 chart, more than any other song through December 28, 2024. "Blinding Lights" by The Weeknd is second with 90 appearances.

I’m providing these takeaways in this chapter as examples. You’ll be asked to write your own in future projects.

4.6.3 Render your second quest

This would be a good time to again Render your notebook so you can see what the output looks like. Look through that output, checking your headlines and such. Clean up any problems you might see.

4.7 Song the longest at No. 1

We introduced filter() in the last quest to limit the summary. For this quest you’ll need to filter the data before the GSA trio.

Let’s review the quest: Which song (title & performer) was No. 1 for the most number of weeks?

While this quest is very similar to the one above, it really helps to think about the logic of what you need and then build the query one line at a time to make each line works.

Let’s talk through the logic:

  • We are starting with our hot100 data.
  • Do we want to consider all the data? In this case, no: We only want titles that have a current_rank of 1. This means we will filter before any summarizing.
  • Do we need to put our data into buckets before we summarize it? In the end we want to count the number of rows with the same performer and title combinations. This means we need to group_by both performer and title.
  • How should we summarize to get our answer? Since we are counting rows, we need use n() as our summarize function, which counts the number or rows in each group.

So let’s step through this with code:

  1. Create a section with a headline, text and code chunk.
  2. Start with the hot100 data and then pipe into filter().
  3. Within the filter, set the current_rank to be == to 1.
  4. Run the result and check it.
hot100 |> 
  filter(current_rank == 1)

The result should show only titles with a 1 for current_rank.

The rest of our logic is just like our last quest. We need to group by the title and performer and then summarize using n() to count the rows.

  1. Edit your existing chunk to add the group_by and summarize functions. Name your new column appearances and set it to count the rows with n().
Try this on your own before you peek for the answer
hot100 |>
  filter(current_rank == 1) |> 
  group_by(performer, title) |>
  summarize(appearances = n())

Look at your results to make sure you have the three columns you expect: performer, title and appearances.

Note

While WRITING AND RUNNING ONE LINE AT A TIME is still “the Way”, when you run a line with group_by() it won’t usually show different results without its buddy summarize(). So I usually write those two together, or I write the summarize() line first to make sure it works, then edit in the group_by() line to split the data before the summary.

Our result above doesn’t quite get us where we want because it lists the results alphabetically by the performer. You need to arrange the data to show us the most appearances at the top.

  1. Edit your chunk to add the arrange() function to sort by appearances in desc() order. This is just like our last quest.
Maybe check your last quest to see how you used arrange, then try it before checking the answer here
hot100 |>
  filter(current_rank == 1) |> 
  group_by(performer, title) |>
  summarize(appearances = n()) |>
  arrange(appearances |> desc())

You have your answer now but we are listing more than 1,000 rows. Let’s cut this off at a logical place like we did in our last quest.

  1. Use filter() to cut your summary off at appearances of 15 or greater.
You’ve done this before … try it on your own!
hot100 |>
  filter(current_rank == 1) |> 
  group_by(performer, title) |>
  summarize(appearances = n()) |>
  arrange(appearances |> desc()) |> 
  filter(appearances >= 15)

Now you have the answers to the song (or songs) with the most weeks at No. 1 with a logical cutoff. If you add to the data later, that logic will still hold and not cut off arbitrarily at a certain number of records. For instance, Mariah Carey’s All I Want for Christmas could eventually top this list because that song reaches No. 1 for a week or two every holiday season.

4.7.1 Data Takeaway: Longest at No. 1

  1. Add this data takeaway to your notebook:
Data Takeaway: Both Lil Nas X's "Old Town Road" and Shaboozey's "A Bar Song (Tipsy)" have spent the most time at the top of the Billboard Hot 100 through December 28, 2024 with 19 appearances.

4.8 Performer with most No. 1 singles

Our next quest is this: Which performer had the most titles reach No. 1?

This sounds similar to our last quest, but there is a distinct difference. (That’s a dad joke that will reveal itself here in a bit.)

Again, let’s think through the logic of what we have to do to get our answer:

  • Do we need to consider all our data? No, we need to consider only No. 1 songs (so, filter!). But a song can be No. 1 for more than one week and we don’t want to count them more than once. We need each No. 1 title/performer combination only once. Another way to think of this is we need distinct combinations of title/performer. (We’ll introduce a new function to find this.)
  • Do we need our data organized into groups before we summarize? Well, once we have those distinct songs, we do need to consider the songs by performer.
  • How should we summarize to get our one answer? Once we have grouped all the performer’s No. 1 songs together we count the number of times they are on the list.

Let’s start by getting the No. 1 songs. You did this in the last quest.

  1. Create a new section with a headline, text and code chunk.
  2. Start with the hot100 data and filter it so you only have current_rank of 1.
hot100 |> 
  filter(current_rank == 1)

Now look at the result. Note how “Poor Little Fool” shows up more than once? Other songs do as well. If we counted rows by performer now, that would tell us the number of weeks they’ve had No. 1 songs, not how many different songs have made No. 1.

4.8.1 Using distinct()

The next challenge in our logic is to show only unique performer/title combinations. We do this with distinct().

We feed the distinct() function with the variables we want to consider together, in our case the perfomer and title. All other columns are dropped since including them would mess up their distinctness.

  1. Edit your chunk to add the distinct() function to your code chunk.
hot100 |> 
  filter(current_rank == 1) |> 
  distinct(title, performer)

Now we have a list of just No. 1 songs!

4.8.2 Summarize the performers

Now that we have our list of No. 1 songs, we can group and summarize the “number” of times a performer is in the list.

We’ll again use the group_by/summarize/arrange combination for this, but we are only grouping by performer since that is the value we are counting.

  1. Edit your chunk to add a group_by on performer and then a summarize() to count the rows. Name the new column no1_hits. Run it.
  2. After you are sure the group_by/summarize runs, add an arrange() to show the no1_hits in descending order.
You’ve done this before! Give it a go before checking the code here.
hot100 |> 
  filter(current_rank == 1) |>
  distinct(title, performer) |>
  group_by(performer) |>
  summarize(no1_hits = n()) |>
  arrange(no1_hits |> desc())

4.8.3 Filter for a good cutoff

Like we did earlier, use a filter() after your arrange to cut the list off at a logical place.

  1. Edit your chunk to filter the summary to show performers with 10 or more No. 1 hits.
You can do this. Really.
hot100 |> 
  filter(current_rank == 1) |>
  distinct(title, performer) |>
  group_by(performer) |>
  summarize(no1_hits = n()) |>
  arrange(no1_hits |> desc()) |> 
  filter(no1_hits >= 10)

So, The Beatles. Was that your guess? Look closely at that list … does anyone have any chance of topping them?

4.8.4 The perils of collaborations

If you are a student of music history, you might be scratching your head about now, thinking the Beatles have 20 No. 1 hits! You would be right, there is something up here.

The song Get Back by “The Beatles with Billy Preston” reached No. 1 on the Hot 100 on May 10, 1969. But that song does not show up with our other Beatles songs because those are listed as just “The Beatles”. This is the same issue I noted earlier about “Drake” and “Drake featuring Future” or “Future featuring Drake” or “Drake & Future”, as each is considered separate performers. They aren’t counted together.

How do we get around this? Well, we write accurately. We note our results consider collaborations separately. Yes, it’s true we could miss some things. It’s a good idea to double check your work, perhaps by searching the data for all distinct song titles that include “Beatles” in the performer field.

If the performer data were more consistent we might be able to extract each band/person, but it isn’t. Which of these would you consider separate? “Supremes & Temptations”, “Romeo & Juliet Soundtrack”, “Crosby, Stills & Nash”, “Delaney & Bonnie & Friends/Eric Clapton.” It’s a mess.

4.8.5 Data Takeaway: Most No. 1 hits

So, that makes this takeaway a bit more nuanced. Add it to your notebook:

Data Takeaway: The Beatles have the most No. 1 hits on the Billboard Hot 100 charts with 19, but they also have a 20th top song -- "Get Back" -- that is a collaboration with Billy Preston. Most other artists with double-digit top hits aren't making new music, with the exception of Mariah Carey and Taylor Swift.

4.9 Most No. 1 hits in last five years

Which performer had the most songs reach No. 1 in the most recent five years?

Let’s talk through the logic. This is very similar to the No. 1 hits above but with two differences:

  • In addition to filtering for No. 1 songs, we also want to filter for charts since 2020
  • We might need to adjust our last filter for a better “break point”.

There are a number of ways we could write a filter for the date, but we’ll do so in a way that gets all the rows on or after Jan. 1, 2020.

hot100 |> 
  filter(chart_date >= "2020-01-01") |> 
  head() # added just to shorten our result

But since we need this filter before our group, we can do this within the same filter function where we get the number one songs.

  1. Create a new section (headline, text, chunk).
  2. Build (from scratch, one line at a time) the same filter, group_by, summarize and arrange as above, but leave out the cut-off filter at the end. Make sure it runs.
  3. Edit your filter to put a comma after current_rank == 1 and then add this filter: chart_date >= "2019-01-01". Run the code.
  4. Build a new cut-off filter at the end and keep only rows with more than 1 top_hits.
No, really. Try it on your own first
hot100 |> 
  filter(
    current_rank == 1,
1    chart_date >= "2020-01-01"
  ) |> 
  distinct(title, performer) |> 
  group_by(performer) |> 
  summarize(top_hits = n()) |> 
  arrange(top_hits |> desc()) |> 
2  filter(top_hits > 1)
1
This is where we add the new filter
2
This is the new cutoff since the earlier value wouldn’t work

Now you know who has the most No. 1 hits since 2020 (as of this writing). “Are you not entertained?”

4.9.1 Data Takeaway: No. 1 in past five years

OK, here is the takeaway to add …

Data Takeaway: Taylor Swift not only has the most appearances of all time in the Hot 100, she also has the most No. 1 hits in the past five years with six.

4.10 Complex filters

You can combine filters in different ways to really target which rows to keep. For these I want you to play around a bit.

  1. Copy each of the examples below into your notebook, but change the title and/or performer to someone you like.
  2. As you do so, make headlines and notes in Markdown that describe what the different filters do.

4.10.1 Multiple truths

If you want to filter data for when two or more things are true, you can write two equations and combine with &. Only rows where both sides prove true are returned.

This shows when Poor Little Fool was No. 1, but not any other position.

hot100 |> 
  filter(title == "Poor Little Fool" & current_rank == 1) |> 
  select(chart_date:performer)

That select() line is capturing just the first four columns for us.

4.10.2 Either is true

If you want an or filter, then you write two equations with a | between them.

This shows songs by Adam Sandler OR Alabama Shakes.

hot100 |> 
  filter(performer == "Adam Sandler" | performer == "Alabama Shakes") |> 
  select(chart_date:performer)
Tip

The | is found as the Shift of the \ key above Return on your keyboard. That | character is also sometimes called a “pipe”, which gets confusing in R with |>.)

4.10.3 Mixing criteria

If you have multiple criteria, you separate them with a comma ,. Note I’ve also added returns to make the code more readable and added distinct to songs only once. Sometimes you might need parenthesis to control the order of operations, like with math.

This gives us rows with either Taylor Swift or Drake, but only those that reached No. 1 (and no collaborations.)

hot100 |> 
  filter(
1    (performer == "Taylor Swift" | performer == "Drake"),
    current_rank == 1
) |> 
  distinct(current_rank, title, performer)
1
In this case the filter for performer of Taylor or Drake will be done together because of the parenthesis, then the current rank filter is added.

4.10.4 Search within a string

And if you want to search for text within the data, you can use str_detect() to look for specific characters within a value to filter rows. str_detect() needs two arguments: 1) What column to search in, and 2) what to search for “in quotes”. I also use distinct() here to show only unique title/performer combinations.

This show songs where “2 Chainz” was among the performers.

hot100 |> 
  filter(str_detect(performer, "2 Chainz")) |> 
  distinct(performer, title)

You might be tempted to try “Drake” above, but there are so many other names that include Drake (like “Charlie Drake” and “Pete Drake”) that it’s too messy.

There are other variations and functions to help with filtering, but this is enough for now.

4.11 On your own

I want you to do two things on your own using the skills we’ve covered above.

4.11.1 Most Top 10 hits

Which performer had the most Top 10 hits overall?

The logic is very similar to the “Most No. 1 hits” quest you did before, but you need to adjust your filter to find songs within position 1 through 10. Don’t over think it, but do recognize that the “top” of the charts are smaller numbers, not larger ones.

  1. Make a new section with a Markdown headline
  2. Describe what you are doing in text
  3. Do it using the group_by/summarize method
  4. Filter to cut off at a logical number or rows. (i.e., don’t stop at a tie)
  5. Write a Data Takeaway describing your finding

4.11.2 Find something you want to know

Now I want you to find something else on your own. It doesn’t matter what it is. Just find something about a performer or song you like.

  1. Start a new section with a Markdown headline
  2. Use Markdown text to declare what you are looking for
  3. Find it!
  4. After your code, explain what functions you used and why (like what did they do for you)
  5. Write a Data Takeaway describing your finding

4.12 Summarizing by year

There is one more important concept I’d like you to learn and that you will need later, but this is already a long chapter and you have learned a lot of new stuff.

If you still have room in your brain, go through the Grouping by dates chapter, which uses data from this project.

If your brain is mush, that’s cool. Just turn this project in. But know you’ll need to go through that chapter later when you are working on your mastery project.

4.13 Update your index summary

We have one, last major thing to do here before we finish, and that is to update our index.qmd file with a summary of what you’ve found.

  1. Open your index.qmd file.
  2. After the text that is here, add the lede, source graf and data takeaways noted below.
  3. Read through the text and many necesarry updates based on your newer data!
  4. Then add your own data takeaways at the end.
## Summary

Taylor Swift has had more solo appearances on the Billboard Hot 100 chart than any other artist, according to an analysis of chart data from Billboard Magazine. Her 1,600 solo appearances tops Drake, though he has more when considering collaborations with other artists.

The chart has been published weekly since August 1958, and this analysis includes data through December 2024*. The chart data was pulled from an [archive managed by Christian McDonald](https://github.com/utdata/rwd-billboard-data), which is updated each week.

> *You should update this value with that last chart date in your data.

Other findings from the data analysis include:

- Data Takeaway: The song "Heat Wave" by Glass Animals has appeared 91 times on the Hot 100, more than any other song through 2023. The Weeknd's "Blinding Lights" has the next most appearances with 90.
- The songs "Old Town Road" by Lil Nas X Featuring Billy Ray Cyrus and "A Bar Song (Tipsy)" by Shaboozey have spent more time at the top of the Hot 100 than any other songs in history at 19 weeks.
- The Beatles have the most No. 1 hits with 19, but they also have a 20th top song -- "Get Back" -- that is a collaboration with Billy Preston. Most other artists with double-digit top hits aren't making new music, with the exception of Mariah Carey and Taylor Swift.
- Taylor Swift not only has the most appearances of all time in the Hot 100, she also has the most No. 1 hits in the past five years with six.
- ADD YOUR TOP 10 HITS
- ADD YOUR OWN FINDING

4.14 Turn in your project

  1. Make sure everything runs and Renders properly.
  2. Publish your changes to Quarto Pub and include the link to your project in your index notebook so you can bask in your glory.
  3. Zip your project folder. (Or export to zip if you are using posit.cloud).
  4. Upload to the Canvas assignment.
Important

To be clear, it is your zipped project I am grading. The Quarto Pub link is for convenience.

4.15 Review of what we’ve learned

We introduced a number of new functions in this lesson, most of them from tidyverse’s dplyr package. Mostly we filtered and summarized our data. Here are the functions we introduced in this chapter, many with links to documentation:

  • filter() returns only rows that meet logical criteria you specify.
  • str_detect() lets you search for a pattern within a string. Often used with filter().
  • summarize() builds a summary table about your data. You can count rows n() or do math on numerical values, like mean(). In the next chapter we will summarize with math functions.
  • group_by() is often used with summarize() to put data into groups before building a summary table based on the groups.
  • distinct() returns rows based on unique values in columns you specify. i.e., it can deduplicate data. It is NOT the same as select().

4.16 Soundtrack for this assignment

This lesson was constructed while listening to The Bright Light Social Hour. They’ve not had a song on the Hot 100, at least not yet.

hot100 |> 
  filter(str_detect(performer, "Bright Light"))

  1. Heiss, Andrew. 2024. “Visualizing {Dplyr}’s Mutate(), Summarize(), Group_by(), and Ungroup() with Animations.” April 4, 2024. https://doi.org/10.59350/d2sz4-w4e25.↩︎