library(tidyverse)
library(janitor)
Welcome to the tidyverse
In this notebook we’ll learn how tidyverse functions help us suss through data.
Along the way, we’ll explore the names given to babies since 1880, based on data from the Social Security Administration.
Our questions
Some questions we’ll answer along the way:
- What is the most given name of all time?
- What are the most given names by sex?
- What are the most given names in each year?
- What are the top 5 highest shares for a name within a year for each sex?
- How many different names have been given since 1880s.
Setup
R is an open source language where the community can build and share new code through “packages”. Packages are basically collections of pre-written code that is stored where an R user can download it to their computer and use it.
There is a public-benefit corporation called Posit that has written a lot of R code for data science that has become popular with journalists because the code is all written to work together in a similar fashion … the tidyverse way. The most used packages are collected into an uber package called “tidyverse”, which has already been installed on our posit.cloud virtual machines. However, for each notebook you have to declare which packages you use.
Knowing which packages to install and use come with experience, but I always start with the tidyverse and janitor packages because they include our most-used functions.
Here we have a code chunk you should have near the top of every notebook. In addition to loading our two libraries there are some execution options that start with #|
. Those options really are “optional” so we’ll skip over them for now, but you should run the code chunk.
You may get lots of output from your setup chunk and some the text may be colored red. That doesn’t mean that it is broken, it’s just information.
Functions are the thing
In the introduction we mentioned that functions are the action verbs of programming. In fact, learning the functions and how they work IS programming. In R, it is the construction of these functions that make up our code.
At their heart, functions are just pre-written code to accomplish something. Some things to know:
- A function name usually describes what it does:
arrange()
sorts things in the order you specify.library()
loads a specific library.mean()
gets an average of multiple numbers. - A function has parenthesis written directly after it, without a space.
- Inside the parenthesis we’ll often add “arguments” specific to the function. These are options to make the function behave a specific way. Well-written functions will have smart defaults so you only have to add arguments to change the default behavior.
- With tidyverse functions, the first argument is usually the data you are using.
- To make our code more easy to read, we use pipes (
|>
or%>%
) to pass the results of one function into the first argument of the next function.
Here is some fake code to show kinda how a pipe works.
|>
john_lennon sleep(action = "wake up") |>
fall(direction = "out", location = "bed") |>
comb(method = "drag", direction = "across", body_part = "head")
OK, maybe there was too much Dad in that joke.
Importing data
The best way to learn tidyverse functions is to use them, so we’ll start using exploring our baby names data while we talk about these things.
- In the code below, we come across something that happens often in R: We have to create an object/variable before we will it with data. Think of this like a bucket of water … you have to have a bucket BEFORE you can fill it with something.
babynames_raw
is our empty bucket. - The
<-
part of the code is the assignment operator that fills our bucket with data. It “moves” the results of the code on the right into the object on the left. - The
read_rds()
function is a function that reads in a file in a special format that R can read quickly. We give it the path to the file as an argument. There is a file saved with this notebook calledbabynames.rds
and it is in ourdata
folder. - The last line just prints out the
babynames_raw
data so we can see it in our notebook.
- Try using the keyboard commands to run this code. Put your cursor somewhere inside the code chunk and use Ctrl + Shift + Enter on Windows or Cmd + Shift + Return on Mac.
<- read_rds("data/babynames.rds")
babynames_raw
babynames_raw
OYO: Import
This first “On your own” we’ll do together so I can show you some features of RStudio.
We want to import a file called applicants.rds
. It is stored in the same data/
directory that the babynames file is.
We always want to build code one line at a time, so run the code after we add each part.
- In the blank code chunk below …
- Start typing in the
read_rds()
function. You’ll see that type-assist will help you find it. When you get the right one, you can hit tab or enter to finish. - Next we’ll supply the first argument, which is the path to the file, in quotes. The path we need is
data/applicants.rds
. RUN the code chunk to see it works. - Next we’ll add the new object to put all this in, which we’ll call
applicants
. We need the<-
assignment operator, too. - Lastly on a new line we repeat
applicants
to print it out to the notebook. - Run it all again to make sure it works.
OK, thanks for doing that. We’ll use the applicants data later.
Exploring data
Returning to our babynames, let’s peek at the data.
Print out your data
You can print out your data just by naming it in a code chunk and running that line.
This will give you lots of information about your data, including how many rows and columns. It shows you the column names (also called variables) and the data type within those variables.
- Run the code chunk below so we can explore the data.
babynames_raw
Most of these column names are named what that mean, all accept n
. That is shorthand for “number of observations” i.e., the number of times that name was given in that year. We’re going to change that in a minute.
Glimpse your data
Sometimes our data has so many variables it is hard to see them all at once. I often use a function called glimpse()
, which lists each columns, it’s data type, etc.
- Run this code chunk to peek at your data in a differen way.
|> glimpse() babynames_raw
Rows: 2,117,219
Columns: 5
$ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880,…
$ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
$ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida",…
$ n <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258,…
$ prop <dbl> 0.0724, 0.0267, 0.0205, 0.0199, 0.0179, 0.0162, 0.0151, 0.0145, 0…
Get summary statistics
Another neat function is summary()
which can give you some basic statistics about your data. This is good to see the highest and lowest values within specific variables.
- Yep, run this one, too.
|> summary() babynames_raw
year sex name n
Min. :1880 Length:2117219 Length:2117219 Min. : 5.0
1st Qu.:1956 Class :character Class :character 1st Qu.: 7.0
Median :1989 Mode :character Mode :character Median : 12.0
Mean :1979 Mean : 174.1
3rd Qu.:2007 3rd Qu.: 32.0
Max. :2023 Max. :99693.0
prop
Min. :0.0000000
1st Qu.:0.0000000
Median :0.0000000
Mean :0.0001221
3rd Qu.:0.0000000
Max. :0.0815000
Head and Tail
The head()
and tail()
functions just show you a select number of lines from the top or bottom of the file.
From now on, if you see a code chunk like this without any other directions to add to it, just run it so you can see what it does. It might be important for later chunks to work.
|> head(5) babynames_raw
OYO: Try tail
Try tail()
on your own. It works the same way as head().
- Using the empty code chunk below. Start with
babynames
and pipe intotail()
. There is a keyboard shortcut for the pipe: Ctrl + Shift + M on Windows or Cmd + Shift + M on a Mac. - Try different number values inside
tail()
and see what it does.
Cleaning data
Data is almost never perfect. In this case we have a column name “n” that isn’t very descriptive that could cause confusion later because in data science the term “n” typically stands for “number of observations”. While it is true our n
column is the number of times a name appears each year, let’s rename it to something more descriptive.
During this process, we will create a new R object called babynames
from the original babynames_raw
. Again, like we did above we have the create the thing before you can fill it, so that is why we see the new object first, then the <-
to fill the object with the results of the code on the right.
On the right side we are taking our raw data and piping it into the rename()
function. That function also wants to know the new things first but a variable like this gets assigned with a single =
sign, so times_given = n
is renaming the n column.
<- babynames_raw |> rename(times_given = n)
babynames
|> head() babynames
Cleaning data is often the most time consuming part of data science. We are usually making sure that variables are the correct data types, are well-named and don’t have any obvious problems. Here we just did the one thing to change a column name.
Focusing data
Sometimes we want to look at specific things in our data. We might want to look at specific columns/variables, or find particular rows.
Arrange to sort
We use the arrange()
function to sort data based on variables we specify. The default is to sort from smallest to largest, so we have to specify desc()
if we want it the other direction.
|>
babynames arrange(times_given)
|>
babynames arrange(desc(times_given))
The code chunk above actually has TWO lines of code so it prints out data TWICE to your notebook into different tabs. Click on the each tab to see and compare the results.
In the second line, we are nesting some functions. Even with only two functions, you can see that reading from the middle of a line of code can be confusing. This is why the pipe |>
is so important so we can move the result of one function into another.
Filter rows
Sometimes we want to find specific rows in our data. This next chunk of code reads like this:
- Take the
babynames
data, and then … - Arrange the result in descending order by the column
times_given
, and then … - Use the
filter()
function to look inside theyear
column and find only rows with the value “2023”.
|>
babynames arrange(desc(times_given)) |>
filter(year == 2023)
This gives us the most popular names in 2023! There were 31,682 different names given to boys and girls, although some names might be listed for both sexes.
Select columns
If you want to choose specific columns of data, we can use select()
.
Let’s take what we have above and add select to choose to look at just two of the variables.
|>
babynames arrange(desc(times_given)) |>
filter(year == 2023) |>
select(name, times_given) # selects specific columns
Another way to do it is to choose which columns NOT to use by adding a -
before the name.
|>
babynames arrange(desc(times_given)) |>
filter(year == 2023) |>
select(-prop) # removes the prop column
OYO: Top names for your YOB
Find the top names in the year you were born. Remember, it really helps to run your code after writing each line, before you add the pipe to add the next part.
- In the blank code block below, start with
babynames
and run the chunk. - Pipe into a filter to find your birth year. You can use Cmd + Shift + M (Mac) or Ctrl + Shift + M (Windows) to write the pipe.
- Pipe into another filter to find your sex.
- Arrange to find the most given.
Summarizing data
When we are looking at 2 million rows, its hard to see the forest through the trees. Sometimes we want to summarize a bunch of numbers into a single one so we can describe the group. Like finding an average of something, or using “median home sale price” to describe the value of a bunch home sales.
The function summarize()
does just this. Feed it data, the name of a column and the method you want to use to summarize, and it will do the math.
Adding with sum
Within our babynames data, we have the number of times a name was given to a person. We could total all those up to find out how many names were given through time.
|>
babynames summarize(
total_names_given = sum(times_given)
)
Inside the summarize, we had to give it our “summarizing” function, which is sum()
in this case, which will add together all the values in the column we give it, which is times_given
in our case.
Grouping before summarizing
We can organize our data into groups before we summarize the numbers. This is very common in data journalism … to put our data into different piles based on values within a column, before we count or add it.
Most given name
OK, but what is the most given name of all time? How can we add together how many times a person has been given each name in the list?
Let’s show an example.
|>
babynames group_by(name) |>
summarize(total_given = sum(times_given)) |>
arrange(desc(total_given))
Most given name by sex
What are the most given name by sex?
We can group by more than one value before we summarize. Here let’s find the most given names including sex, and then find just the female names.
- We group by both
name
andsex
. - We total how many times each name has been given for that sex.
- We sort the data to put the highest
total_given
values at the top. - We filter to find just the female names.
|>
babynames group_by(name, sex) |>
summarize(total_given = sum(times_given)) |>
arrange(desc(total_given)) |>
filter(sex == "F")
`summarise()` has grouped output by 'name'. You can override using the
`.groups` argument.
So, according to this data, more then 70,000 different names have been given to girls. The name use most often was Mary, with 4.1 million.
Most given each year
While grouping is often used with summarize, any function following group_by()
will be applied with groupings. Let’s show this using slice_max()
, which filters rows based on the highest value given.
In this code chunk, we take babynames and then find the three highest values of times_given
.
|>
babynames slice_max(times_given, n = 3)
But if we group our data by year and sex first, we get the top three names for EACH year and sex.
|>
babynames group_by(year, sex) |>
slice_max(times_given, n = 3)
OYO: Most popular names by sex
The prop
value in our data is the “proportion” or percentage of people that name was given to within its year. So in 1880, 7.2% of females were given the name Mary.
Use what you’ve learned so far today to find this:
What are the top 5 highest shares for a name within a year for each sex.
Grouping and counting
Sometimes we just want to count the number of times a value appears in a column. We can use summarize to do this, too, but instead of using sum()
to add together values, we use the function n()
to count the number of rows that have each value.
Total names given
We’ll use this method to find how many different names have been given through all time.
Here we will group by name
and then count how many rows have had that name. This gives us a unique row for each name, so the total number of rows in the end is our total number of names, about 103,500 of them.
|>
babynames group_by(name) |>
summarize(yrs_used = n()) |>
arrange(desc(yrs_used))
Couple of things to take from this result above.
- At 103,564 rows, that is now many different names have been used.
- The
yrs_used
column is the number of times that name appeared. The names listed at the top with 288 are names given to both boys and girls every year of our data.
Render your page
Click on the Render button at the top of the document to see an HTML version of your document. All the code on the page has to work for this to happen, so if you get an error look at the message in the console and try to figure out what might be wrong. If you are in a workshop, raise your hand for some help.
Depending on time, we might try publishing our website to quartopub.com.