Try to write the code on your own first
library(tidyverse)
library(janitor)
This chapter continues the Billboard Hot 100 project. In the previous chapter we downloaded, imported and cleaned our data. We’ll be working in the same project, but in a new document.
count()
functionNow that we have the Billboard Hot 100 charts data in our project it’s time to find the answers to the following questions:
What are your guesses for the questions above? NO PEEKING!
In each case we’ll talk over the logic of finding the answer and the code to accomplish it.
Before we can get into the analysis, we want to set up a new notebook to separate our cleaning from our analysis.
The data outputs in this book might differ from what you get since the source data is updated every week. This is especially true in videos, gifs and takeaways.
This chapter won’t have data past 2024.
At the end of the last notebook we exported our clean data as an .rds
file. We’ll now create a new Quarto notebook and import that data. It will be much easier this time.
+
menu to start a new Quarto Document.02-analysis.qmd
in your project folder.Because we set our _quarto.yml
file to update navigation with any new Quarto file, this page will automatically show up in the sidebar when we render the site.
Since we are starting a new notebook, we need to set up a few things. First up we want to list our goals.
-
or *
followed by a space. There should be a blank line above and below the entire LIST but not between the items. List items should be on sequential lines … and it is the only markdown item like that.#| label: setup
and add in the tidyverse and janitor libraries.library(tidyverse)
library(janitor)
Do you see the grey triangle above with the term Try to write the code on your own first next to it? That is code that I’ve written but hidden to give you a chance to write it yourself before you see it. If you click on the little triangle, it will turn and reveal the code!
We need to import the data that we cleaned and exported in the last notebook. It’s just a couple of lines of code and you could write them all out and run it, but here is where I tell you for the first of a 1,000 times:
WRITE ONE LINE OF CODE. RUN IT. CHECK THE RESULTS. REPEAT.
Yes, sometimes for the sake of brevity I will give you multiple lines of code at once, but to understand what is actually going on you should add and run that code one line at a time.
We’ll painstakingly walk through that process here to belabor the point. Here are our goals for this bit:
I want you to:
read_rds("data-processed/01-hot100.rds")
This read_rds()
function is just like the read_csv()
function we used in our last notebook to read in data from our computer, except we are reading a different kind of file format: .rds. The argument (in quotes) is the path where our file is on our computer.
What did you expect will happen if you run the code?
Now run that line of code.
It should print the results of the data in your notebook. Did it?
If you got an error, you should read that error output for some clues about what happened. When using a read_
function problems are usually about one of two things: It can’t find the function (load the library tidyverse!) or it doesn’t understand the path to the file (that you have the folder structure wrong or something misspelled).
CATCHING COMMON MISTAKES: In the appendices of this book there is a chapter on Troubleshooting that is worth bookmarking and reading.
Our next goal is to take all that data and put it into a new R object.
hot100
. Remember we will add the new object FIRST (our bucket) and then use the <-
operator to put the data inside it (the water).It should look like this:
<- read_rds("data-processed/01-hot100.rds") hot100
What happened when you ran the code? Did you get an error? Did you get a result printed to the screen?
OK, that last one was a trick question. When you save data into an object it does NOT print it to your notebook, but it does save that object into your “Environment”, which means it is in memory and can be called and run at any time. You should see hot100
listed in your Environment tab at the top right of RStudio.
Now, before I started this quest, I knew I wanted that data inside the hot100
object and could’ve written it like that from the beginning. But I didn’t because most “read” problems are in the path, so I wanted to first make sure that function was written correctly. Also, when I put the data into the object I can’t see it until I print it out again, so I do it one piece at a time. You should, too.
So next we’ll print the data out to our screen. We do that by just by calling the object.
<- read_rds("data-processed/01-hot100.rds")
hot100
hot100
This should print out the data to your screen so you can see the first couple of columns and the first 10 rows of data.
But what we really need is to see all the column names, the data types and an example of the data. We can get that with our glimpse()
function.
glimpse()
functionYour code (and result) should look like this now:
<- read_rds("data-processed/01-hot100.rds")
hot100
|> glimpse() hot100
Rows: 346,600
Columns: 7
$ chart_date <date> 1958-08-04, 1958-08-04, 1958-08-04, 1958-08-04, 1958-08…
$ current_rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ title <chr> "Poor Little Fool", "Patricia", "Splish Splash", "Hard H…
$ performer <chr> "Ricky Nelson", "Perez Prado And His Orchestra", "Bobby …
$ previous_rank <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ peak_rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ wks_on_chart <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
Again, I could’ve written all that code at once before running it, as I’ve written code like that for many years. I still write and run code one line at a time, and you should, too. It is the easiest way to make sure your code is correct and find problems. So, for the second of a 1,000 times:
WRITE ONE LINE OF CODE. RUN IT. CHECK THE RESULTS. REPEAT.
If you are looking at my code and writing it into your notebook, you have to run each line of code BEFORE you add the pipe |>
at the end of that line. A pipe |>
must have a function directly following it on the same or next (or same) line to work.
One of the packages within the tidyverse is dplyr. Dplyr allows us to transform our data frames in ways that let us explore the data and prepare it for visualizing. It’s the R equivalent of common Excel functions like sort, filter and pivoting. There is a cheatsheet on the dplyr that you might find useful.
We’ve used select()
, mutate()
and arrange()
already, but we’ll introduce more dplyr functions in this chapter.
While we go over each of these dplyr functions below to find the answers to our questions, there is a really good resource from David Keys of that explains each function with videos, animated gifs and example code.
Our first question: Which performer had the most appearances on the Hot 100 chart at any position?
Let’s work through the logic of what we need to do before I explain exactly how to do it.
Each row in the data is one song ranked on the chart. It includes a field with the name of the “performer” so we know who recorded it.
So, to figure out how many times a performer is in the data, we need to count the number of rows with the same performer.
We’ll use the tidyverse’s version of Group and Aggregate to get this answer. It is actually two different functions within dplyr that often work together: group_by()
and summarize()
Before we dive into the code, let’s review this video about “Group and Aggregate” to get a handle on the concept.
We’ll dive deep into this next.
We’ll start with summarize()
first because it can stand alone.
The summarize()
function computes summary tables describing your data. We’re usually finding a single number to describe a bunch of data, like the “average” of numbers in column.
In our case we want a “summary” about the number of times a specific performer appears in data, hence we use summarize()
.
THEY BE THE ZAME: summarize()
and summarise()
are the same function, as R supports both the American and UK spelling of summarize. They work the same and I don’t care which you use.
Here is an example of summarize()
. The data is shown on the left of the video and the code is at the right. The animation courtesy of Andrew Heiss1 gives you an visual representation of how it works.
The example above is giving us three summaries:
mean()
(or average) of all the values in the x
column. (So adding those numbers together and then dividing by the number of rows.)sum()
to add together all those numbers in x
, yielding “45”.n()
.Much like the mutate()
function we used earlier, within summarize()
we list the name of the new column first, then assign to it the function and column we want to accomplish using =
. So we create “avg” then fill it with the mean(x)
.
Again, in our case (as we work toward finding the performer with most appearances) we want to summarize the number of rows, and there is a function for that: n()
. (Think “number of observations”.) Every row in the data is an appearance … we just need to count how many rows have each performer.
So lets do this with our Hot100 data. We’ll use summarize to count all the rows in our data. Let’s write the code and run it, then I’ll explain:
summarize()
.
Basically we are summarizing the total number of rows in the data. When this book was last updated there were 346500 rows.
But I bet you’re thinking: “Yo Professor, you said we want to count the number of times a performer has appeared, right?”
This is where we bring in a close friend to summarize()
… the group_by()
function.
The group_by()
function pre-sorts data into groups so that whatever function follows is applied within each of the groups. That means group_by()
always has something following it, and that something is usually summarize()
.
Let’s look at this with our example data from above. Here we group all the data together based on values in cat1
, then perform all the summarize functions:
The grouping part of this happens behind the scenes and we don’t really see it. If you run group_by()
by itself you won’t see any change, but it does happen.
Here is an example when we group by the second column cat2
:
So, back to our Hot 100 data and our question Which performer had the most appearances on the Hot 100 chart at any position?
If we group_by()
performer before we summarize and count our rows, it will put all of the rows with “Aerosmith” together, then all the “Beyonce” rows together, etc. and then it will count the rows within those groups … first those for Aerosmith and then those for Beyonce.
|>
hot100 1group_by(performer) |>
summarize(appearances = n())
What we get in return is a summarized table that shows all 10,000+ different performers that have been on the charts, and the number of rows in which they appear in the data.
That’s great, but who had the most?
Remember in our import notebook when we sorted all the rows by the oldest chart date? We’ll use the same arrange()
function here, but we’ll change the result to descending order, because journalists almost always want to know the most of something.
|>
hot100 group_by(performer) |>
summarize(appearances = n()) |>
1arrange(desc(appearances))
arrange()
function and fed it the variable “appearances”, but we also wrapped that inside a desc()
function to reverse the sort (descending order) putting the highest number ar the top. If we used just “appearances” then it would list the smallest values first.
You’ll sometimes see me write this differently, piping the column into the desc()
function like this: arrange(appearances |> desc())
.
We’ve printed 10,000+ rows of data into our notebook when we really only wanted the Top 10 or so. You might think it doesn’t matter, but your knitted HTML file will store all that data and can make it a big file (as in hard drive space), so I try to avoid that when I can.
We can use the head()
command again to get the first 10 result. It will give us six rows unless we specify a different number. There is a corresponding tail()
function to get the last rows.
head()
function set to 10 rows.1|>
hot100 2group_by(performer) |>
3summarize(appearances = n()) |>
4arrange(appearances |> desc()) |>
5head(10)
Since we have our answer here and we’re not using the result later, we don’t need to create a new object or anything. Printing it to our notebook is sufficient.
So, Taylor Swift … is that who you guessed? A little history here, when I started using this data back in 2018, Elton John topped the list. Elton John has been around a long time, but Swift’s popularity plus changes in how Billboard counts plays in the modern era (like streaming) has rocketed her to the top. (Sorry, Rocket Man). And it doesn’t hurt that she is re-releasing her entire catalog (Taylor’s version)!
performer
names, and as such considers collaborations separately. For instance, Drake is second on the list but those are only songs he performed alone and not the many, many collaborations he has done with other performers. Songs by “Drake” are counted separately than “Drake featuring Future” and even “Future featuring Drake”. You’ll need to make this clear when you write about this later. (If you include collaborations, Drake has way more appearances than Taylor.)head()
to “cut” this list is not the best method because there could be a tie between the 10th and 11th record. In the future we’ll learn filter()
to do this better. But head()
is pretty useful when you just need to see the top of your data.We’ve learned something here from the data and we should note it for our future selves in something we call a “data takeaway,” which is a prose sentence explaining our finding. You’ll do this for each “answer” you find in your data.
Data Takeaway: Taylor Swift has more solo appearances on the Billboard Hot 100 than any other artist in history with 1,600 appearances through December 28, 2024. This does not include collaborations with other artists.
Writing these sentences help you understand what you’ve found in your data. When you get toward the end of your project, you’ll collect these takeaways in your index.qmd
file.
Now that you have your first quest answer, let’s celebrate by rendering your notebook so you can see all your pretty work.
This will give you the website view of your page, and you should have the page listed in your site navigation if you did the setup properly at the beginning of this chapter.
Our next quest is this: Which song (title & performer) has been on the charts the most?
This is very similar to our quest to find the performer with the most appearances, but we have to consider both title
and performer
together because different artists can perform songs of the same name. For example, Adele’s song Hold On entered the Hot 100 at 49 in December 2021, but at least 18 other performers have also had a song titled Hold On on the Hot 100.
Let’s first talk through the logic of what we want to do and how it differed from our first quest:
title
and performer
are the same. We can do this by putting both values in our group_by()
function, instead of just one thing. This will put all rows with both “Rush” as a performer AND Tom Sawyer as a title into the same group. Rows with “Rush” and Red Barchetta will be considered in a different group.summarize()
to count the number of rows in each group.Let’s write down our quest and code.
WRITE ONE LINE OF CODE. RUN IT. CHECK IT. REPEAT
Remember, you write and run the line BEFORE you add the pipe |>
! The annotations explain as if we were writing an English sentence.
1|>
hot100 2group_by(performer, title) |>
3summarize(appearances = n()) |>
4arrange(desc(appearances))
hot100
data AND THEN …
performer
and title
AND THEN ..
n()
function that counts the number of rows AND THEN …
appearances
column.
We will often use group_by()
, summarize()
and arrange()
together, which is why I’ll refer to this as the GSA trio. They are like three close friends that always want to hang out together.
So, what was your guess or this question? A little bit of history for that answer … Glass Animals’ Heat Waves in 2022 overcame The Weeknd’s Blinding Lights, which had overcome Imagine Dragon’s Radioactive some time in 2021.
When you run the code above you might see a warning “summarise()
has grouped output by ‘performer’. You can override using the.groups
argument.” THIS IS NOT A PROBLEM. It is just R letting you know that the output of this remains grouped by the first item. Explaining it would break your brain right now. Don’t worry about it. This is not the droid you are looking for.
I showed you head()
in the previous quest and that was useful to show just a few records at the top of a list, but it does so indiscriminately. Note there are some songs here with the same number of weeks on the charts. If we used head()
we might split that tie, leaving us with an incomplete answer. A better strategy is to cut off the list at a logical place using filter()
. Let’s dive into this new function:
Filtering is one of those Basic Data Journalism Functions:
The dplyr function filter()
reduces the number of rows in our data based on one or more criteria within the data.
The syntax works like this:
# this is psuedo code. don't include it
|>
data filter(variable comparison value)
# example
|>
hot100 filter(performer == "Taylor Swift")
The filter()
function typically works in this order:
Note the two equals signs ==
in our Taylor Swift example above. It is important to use two of them when you are asking if a value is “true” or “equal to”, as a single =
typically means you are assigning a value to something.
There are a number of these logical tests for the comparison:
Operator | Definition |
---|---|
x < y | Less than |
x > y | Greater than |
x == y | Equal to |
x <= y | Less than or equal to |
x >= y | Greater than or equal to |
x != y | Not equal to |
x %in% c(y,z) | In a group |
is.na(x) | Is NA (missing values) |
!is.na(x) | Is not NA |
Where you apply a filter matters. If we want to consider only certain data when we are grouping/summarizing then we filter BEFORE the GSA. If we filter after the GSA, we affect only the results of the summarize function, which is what we want to do here.
In this case, I want you to use filter after the GSA actions to include only results with 65 or more appearances.
|>
hot100 group_by(performer, title) |>
summarize(appearances = n()) |>
arrange(appearances |> desc()) |>
1filter(appearances >= 65)
filter()
is the function. The first argument in the function is the column we are looking at – in our case the appearances
column, which was created in the summarize line. We then provide a comparison operator >=
to find values “greater than or equal to” 65
.
The value we use to cut the list of – 65
in this case – is arbitrary. We just want to choose a value that makes common sense. 70
seemed too high because that would yield 7 songs, and 60
was too low because there are lots of ties in the lower 60s.
OK, time to write about what you’ve found.
Data Takeaway: The song "Heat Waves" by Glass Animals has appeared 91 times on the Billboard Hot 100 chart, more than any other song through December 28, 2024. "Blinding Lights" by The Weeknd is second with 90 appearances.
I’m providing these takeaways in this chapter as examples. You’ll be asked to write your own in future projects.
This would be a good time to again Render your notebook so you can see what the output looks like. Look through that output, checking your headlines and such. Clean up any problems you might see.
We introduced filter()
in the last quest to limit the summary. For this quest you’ll need to filter the data before the GSA trio.
Let’s review the quest: Which song (title & performer) was No. 1 for the most number of weeks?
While this quest is very similar to the one above, it really helps to think about the logic of what you need and then build the query one line at a time to make each line works.
Let’s talk through the logic:
hot100
data.current_rank
of 1. This means we will filter before any summarizing.group_by
both performer
and title
.n()
as our summarize function, which counts the number or rows in each group.So let’s step through this with code:
hot100
data and then pipe into filter()
.current_rank
to be ==
to 1
.|>
hot100 filter(current_rank == 1)
The result should show only titles with a 1
for current_rank
.
The rest of our logic is just like our last quest. We need to group by the title
and performer
and then summarize
using n()
to count the rows.
group_by
and summarize
functions. Name your new column appearances
and set it to count the rows with n()
.|>
hot100 filter(current_rank == 1) |>
group_by(performer, title) |>
summarize(appearances = n())
Look at your results to make sure you have the three columns you expect: performer, title and appearances.
While WRITING AND RUNNING ONE LINE AT A TIME is still “the Way”, when you run a line with group_by()
it won’t usually show different results without its buddy summarize()
. So I usually write those two together, or I write the summarize()
line first to make sure it works, then edit in the group_by()
line to split the data before the summary.
Our result above doesn’t quite get us where we want because it lists the results alphabetically by the performer. You need to arrange the data to show us the most appearances at the top.
arrange()
function to sort by appearances
in desc()
order. This is just like our last quest.|>
hot100 filter(current_rank == 1) |>
group_by(performer, title) |>
summarize(appearances = n()) |>
arrange(appearances |> desc())
You have your answer now but we are listing more than 1,000 rows. Let’s cut this off at a logical place like we did in our last quest.
filter()
to cut your summary off at appearances
of 15 or greater.|>
hot100 filter(current_rank == 1) |>
group_by(performer, title) |>
summarize(appearances = n()) |>
arrange(appearances |> desc()) |>
filter(appearances >= 15)
Now you have the answers to the song (or songs) with the most weeks at No. 1 with a logical cutoff. If you add to the data later, that logic will still hold and not cut off arbitrarily at a certain number of records. For instance, Mariah Carey’s All I Want for Christmas could eventually top this list because that song reaches No. 1 for a week or two every holiday season.
Data Takeaway: Both Lil Nas X's "Old Town Road" and Shaboozey's "A Bar Song (Tipsy)" have spent the most time at the top of the Billboard Hot 100 through December 28, 2024 with 19 appearances.
Our next quest is this: Which performer had the most titles reach No. 1?
This sounds similar to our last quest, but there is a distinct difference. (That’s a dad joke that will reveal itself here in a bit.)
Again, let’s think through the logic of what we have to do to get our answer:
Let’s start by getting the No. 1 songs. You did this in the last quest.
hot100
data and filter it so you only have current_rank
of 1.|>
hot100 filter(current_rank == 1)
Now look at the result. Note how “Poor Little Fool” shows up more than once? Other songs do as well. If we counted rows by performer
now, that would tell us the number of weeks they’ve had No. 1 songs, not how many different songs have made No. 1.
The next challenge in our logic is to show only unique performer/title combinations. We do this with distinct()
.
We feed the distinct()
function with the variables we want to consider together, in our case the perfomer
and title
. All other columns are dropped since including them would mess up their distinctness.
distinct()
function to your code chunk.|>
hot100 filter(current_rank == 1) |>
distinct(title, performer)
Now we have a list of just No. 1 songs!
Now that we have our list of No. 1 songs, we can group and summarize the “number” of times a performer is in the list.
We’ll again use the group_by/summarize/arrange combination for this, but we are only grouping by performer
since that is the value we are counting.
performer
and then a summarize()
to count the rows. Name the new column no1_hits
. Run it.arrange()
to show the no1_hits
in descending order.|>
hot100 filter(current_rank == 1) |>
distinct(title, performer) |>
group_by(performer) |>
summarize(no1_hits = n()) |>
arrange(no1_hits |> desc())
Like we did earlier, use a filter()
after your arrange to cut the list off at a logical place.
10
or more No. 1 hits.|>
hot100 filter(current_rank == 1) |>
distinct(title, performer) |>
group_by(performer) |>
summarize(no1_hits = n()) |>
arrange(no1_hits |> desc()) |>
filter(no1_hits >= 10)
So, The Beatles. Was that your guess? Look closely at that list … does anyone have any chance of topping them?
If you are a student of music history, you might be scratching your head about now, thinking the Beatles have 20 No. 1 hits! You would be right, there is something up here.
The song Get Back by “The Beatles with Billy Preston” reached No. 1 on the Hot 100 on May 10, 1969. But that song does not show up with our other Beatles songs because those are listed as just “The Beatles”. This is the same issue I noted earlier about “Drake” and “Drake featuring Future” or “Future featuring Drake” or “Drake & Future”, as each is considered separate performers. They aren’t counted together.
How do we get around this? Well, we write accurately. We note our results consider collaborations separately. Yes, it’s true we could miss some things. It’s a good idea to double check your work, perhaps by searching the data for all distinct song titles that include “Beatles” in the performer field.
If the performer
data were more consistent we might be able to extract each band/person, but it isn’t. Which of these would you consider separate? “Supremes & Temptations”, “Romeo & Juliet Soundtrack”, “Crosby, Stills & Nash”, “Delaney & Bonnie & Friends/Eric Clapton.” It’s a mess.
So, that makes this takeaway a bit more nuanced. Add it to your notebook:
Data Takeaway: The Beatles have the most No. 1 hits on the Billboard Hot 100 charts with 19, but they also have a 20th top song -- "Get Back" -- that is a collaboration with Billy Preston. Most other artists with double-digit top hits aren't making new music, with the exception of Mariah Carey and Taylor Swift.
Which performer had the most songs reach No. 1 in the most recent five years?
Let’s talk through the logic. This is very similar to the No. 1 hits above but with two differences:
There are a number of ways we could write a filter for the date, but we’ll do so in a way that gets all the rows on or after Jan. 1, 2020.
|>
hot100 filter(chart_date >= "2020-01-01") |>
head() # added just to shorten our result
But since we need this filter before our group, we can do this within the same filter function where we get the number one songs.
current_rank == 1
and then add this filter: chart_date >= "2019-01-01"
. Run the code.top_hits
.Now you know who has the most No. 1 hits since 2020 (as of this writing). “Are you not entertained?”
OK, here is the takeaway to add …
Data Takeaway: Taylor Swift not only has the most appearances of all time in the Hot 100, she also has the most No. 1 hits in the past five years with six.
You can combine filters in different ways to really target which rows to keep. For these I want you to play around a bit.
If you want to filter data for when two or more things are true, you can write two equations and combine with &
. Only rows where both sides prove true are returned.
This shows when Poor Little Fool was No. 1, but not any other position.
|>
hot100 filter(title == "Poor Little Fool" & current_rank == 1) |>
select(chart_date:performer)
That select()
line is capturing just the first four columns for us.
If you want an or filter, then you write two equations with a |
between them.
This shows songs by Adam Sandler OR Alabama Shakes.
|>
hot100 filter(performer == "Adam Sandler" | performer == "Alabama Shakes") |>
select(chart_date:performer)
The |
is found as the Shift of the \
key above Return on your keyboard. That |
character is also sometimes called a “pipe”, which gets confusing in R with |>
.)
If you have multiple criteria, you separate them with a comma ,
. Note I’ve also added returns to make the code more readable and added distinct to songs only once. Sometimes you might need parenthesis to control the order of operations, like with math.
This gives us rows with either Taylor Swift or Drake, but only those that reached No. 1 (and no collaborations.)
|>
hot100 filter(
1== "Taylor Swift" | performer == "Drake"),
(performer == 1
current_rank |>
) distinct(current_rank, title, performer)
And if you want to search for text within the data, you can use str_detect()
to look for specific characters within a value to filter rows. str_detect()
needs two arguments: 1) What column to search in, and 2) what to search for “in quotes”. I also use distinct()
here to show only unique title/performer combinations.
This show songs where “2 Chainz” was among the performers.
|>
hot100 filter(str_detect(performer, "2 Chainz")) |>
distinct(performer, title)
You might be tempted to try “Drake” above, but there are so many other names that include Drake (like “Charlie Drake” and “Pete Drake”) that it’s too messy.
There are other variations and functions to help with filtering, but this is enough for now.
I want you to do two things on your own using the skills we’ve covered above.
Which performer had the most Top 10 hits overall?
The logic is very similar to the “Most No. 1 hits” quest you did before, but you need to adjust your filter to find songs within position 1 through 10. Don’t over think it, but do recognize that the “top” of the charts are smaller numbers, not larger ones.
Now I want you to find something else on your own. It doesn’t matter what it is. Just find something about a performer or song you like.
There is one more important concept I’d like you to learn and that you will need later, but this is already a long chapter and you have learned a lot of new stuff.
If you still have room in your brain, go through the Grouping by dates chapter, which uses data from this project.
If your brain is mush, that’s cool. Just turn this project in. But know you’ll need to go through that chapter later when you are working on your mastery project.
We have one, last major thing to do here before we finish, and that is to update our index.qmd
file with a summary of what you’ve found.
index.qmd
file.## Summary
Taylor Swift has had more solo appearances on the Billboard Hot 100 chart than any other artist, according to an analysis of chart data from Billboard Magazine. Her 1,600 solo appearances tops Drake, though he has more when considering collaborations with other artists.
[archive managed by Christian McDonald](https://github.com/utdata/rwd-billboard-data), which is updated each week.
The chart has been published weekly since August 1958, and this analysis includes data through December 2024*. The chart data was pulled from an
> *You should update this value with that last chart date in your data.
Other findings from the data analysis include:
- Data Takeaway: The song "Heat Wave" by Glass Animals has appeared 91 times on the Hot 100, more than any other song through 2023. The Weeknd's "Blinding Lights" has the next most appearances with 90.
- The songs "Old Town Road" by Lil Nas X Featuring Billy Ray Cyrus and "A Bar Song (Tipsy)" by Shaboozey have spent more time at the top of the Hot 100 than any other songs in history at 19 weeks.
- The Beatles have the most No. 1 hits with 19, but they also have a 20th top song -- "Get Back" -- that is a collaboration with Billy Preston. Most other artists with double-digit top hits aren't making new music, with the exception of Mariah Carey and Taylor Swift.
- Taylor Swift not only has the most appearances of all time in the Hot 100, she also has the most No. 1 hits in the past five years with six.
- ADD YOUR TOP 10 HITS
- ADD YOUR OWN FINDING
To be clear, it is your zipped project I am grading. The Quarto Pub link is for convenience.
We introduced a number of new functions in this lesson, most of them from tidyverse’s dplyr package. Mostly we filtered and summarized our data. Here are the functions we introduced in this chapter, many with links to documentation:
filter()
returns only rows that meet logical criteria you specify.str_detect()
lets you search for a pattern within a string. Often used with filter()
.summarize()
builds a summary table about your data. You can count rows n()
or do math on numerical values, like mean()
. In the next chapter we will summarize with math functions.group_by()
is often used with summarize()
to put data into groups before building a summary table based on the groups.distinct()
returns rows based on unique values in columns you specify. i.e., it can deduplicate data. It is NOT the same as select()
.This lesson was constructed while listening to The Bright Light Social Hour. They’ve not had a song on the Hot 100, at least not yet.
|>
hot100 filter(str_detect(performer, "Bright Light"))
Heiss, Andrew. 2024. “Visualizing {Dplyr}’s Mutate(), Summarize(), Group_by(), and Ungroup() with Animations.” April 4, 2024. https://doi.org/10.59350/d2sz4-w4e25.↩︎