Welcome to the tidyverse

In this notebook we’ll learn how tidyverse functions help us suss through data.

Along the way, we’ll explore the names given to babies since 1880, based on data from the Social Security Administration.

Our questions

Some questions we’ll answer along the way:

What is the most given name of all time?
What are the most given names by sex?
What are the most given names in each year?
What are the top 5 highest shares for a name within a year for each sex?
How many different names have been given since 1880s.

Setup

R is an open source language where the community can build and share new code through “packages”. Packages are basically collections of pre-written code that is stored where an R user can download it to their computer and use it.

There is a public-benefit corporation called Posit that has written a lot of R code for data science that has become popular with journalists because the code is all written to work together in a similar fashion … the tidyverse way. The most used packages are collected into an uber package called “tidyverse”, which has already been installed on our posit.cloud virtual machines. However, for each notebook you have to declare which packages you use.

Knowing which packages to install and use come with experience, but I always start with the tidyverse and janitor packages because they include our most-used functions.

Here we have a code chunk you should have near the top of every notebook. In addition to loading our two libraries there are some execution options that start with #|. Those options really are “optional” so we’ll skip over them for now, but you should run the code chunk.

library(tidyverse)
library(janitor)

You may get lots of output from your setup chunk and some the text may be colored red. That doesn’t mean that it is broken, it’s just information.

Functions are the thing

In the introduction we mentioned that functions are the action verbs of programming. In fact, learning the functions and how they work IS programming. In R, it is the construction of these functions that make up our code.

At their heart, functions are just pre-written code to accomplish something. Some things to know:

A function name usually describes what it does: arrange() sorts things in the order you specify. library() loads a specific library. mean() gets an average of multiple numbers.
A function has parenthesis written directly after it, without a space.
Inside the parenthesis we’ll often add “arguments” specific to the function. These are options to make the function behave a specific way. Well-written functions will have smart defaults so you only have to add arguments to change the default behavior.
With tidyverse functions, the first argument is usually the data you are using.
To make our code more easy to read, we use pipes (|> or %>%) to pass the results of one function into the first argument of the next function.

Here is some fake code to show kinda how a pipe works.

john_lennon |> 
  sleep(action = "wake up") |> 
  fall(direction = "out", location = "bed") |> 
  comb(method = "drag", direction = "across", body_part = "head")

OK, maybe there was too much Dad in that joke.

Importing data

The best way to learn tidyverse functions is to use them, so we’ll start using exploring our baby names data while we talk about these things.

In the code below, we come across something that happens often in R: We have to create an object/variable before we will it with data. Think of this like a bucket of water … you have to have a bucket BEFORE you can fill it with something. babynames_raw is our empty bucket.
The <- part of the code is the assignment operator that fills our bucket with data. It “moves” the results of the code on the right into the object on the left.
The read_rds() function is a function that reads in a file in a special format that R can read quickly. We give it the path to the file as an argument. There is a file saved with this notebook called babynames.rds and it is in our data folder.
The last line just prints out the babynames_raw data so we can see it in our notebook.

Try using the keyboard commands to run this code. Put your cursor somewhere inside the code chunk and use Ctrl + Shift + Enter on Windows or Cmd + Shift + Return on Mac.

babynames_raw <- read_rds("data/babynames.rds")

babynames_raw

OYO: Import

This first “On your own” we’ll do together so I can show you some features of RStudio.

We want to import a file called applicants.rds. It is stored in the same data/ directory that the babynames file is.

We always want to build code one line at a time, so run the code after we add each part.

In the blank code chunk below …
Start typing in the read_rds() function. You’ll see that type-assist will help you find it. When you get the right one, you can hit tab or enter to finish.
Next we’ll supply the first argument, which is the path to the file, in quotes. The path we need is data/applicants.rds. RUN the code chunk to see it works.
Next we’ll add the new object to put all this in, which we’ll call applicants. We need the <- assignment operator, too.
Lastly on a new line we repeat applicants to print it out to the notebook.
Run it all again to make sure it works.

OK, thanks for doing that. We’ll use the applicants data later.

Exploring data

Returning to our babynames, let’s peek at the data.

Print out your data

You can print out your data just by naming it in a code chunk and running that line.

This will give you lots of information about your data, including how many rows and columns. It shows you the column names (also called variables) and the data type within those variables.

Run the code chunk below so we can explore the data.

babynames_raw

Most of these column names are named what that mean, all accept n. That is shorthand for “number of observations” i.e., the number of times that name was given in that year. We’re going to change that in a minute.

Glimpse your data

Sometimes our data has so many variables it is hard to see them all at once. I often use a function called glimpse(), which lists each columns, it’s data type, etc.

Run this code chunk to peek at your data in a differen way.

babynames_raw |> glimpse()

Rows: 2,117,219
Columns: 5
$ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880,…
$ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
$ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida",…
$ n    <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258,…
$ prop <dbl> 0.0724, 0.0267, 0.0205, 0.0199, 0.0179, 0.0162, 0.0151, 0.0145, 0…

Get summary statistics

Another neat function is summary() which can give you some basic statistics about your data. This is good to see the highest and lowest values within specific variables.

Yep, run this one, too.

babynames_raw |> summary()

      year          sex                name                 n          
 Min.   :1880   Length:2117219     Length:2117219     Min.   :    5.0  
 1st Qu.:1956   Class :character   Class :character   1st Qu.:    7.0  
 Median :1989   Mode  :character   Mode  :character   Median :   12.0  
 Mean   :1979                                         Mean   :  174.1  
 3rd Qu.:2007                                         3rd Qu.:   32.0  
 Max.   :2023                                         Max.   :99693.0  
      prop          
 Min.   :0.0000000  
 1st Qu.:0.0000000  
 Median :0.0000000  
 Mean   :0.0001221  
 3rd Qu.:0.0000000  
 Max.   :0.0815000

Head and Tail

The head() and tail() functions just show you a select number of lines from the top or bottom of the file.

From now on, if you see a code chunk like this without any other directions to add to it, just run it so you can see what it does. It might be important for later chunks to work.

babynames_raw |> head(5)

OYO: Try tail

Try tail() on your own. It works the same way as head().

Using the empty code chunk below. Start with babynames and pipe into tail(). There is a keyboard shortcut for the pipe: Ctrl + Shift + M on Windows or Cmd + Shift + M on a Mac.
Try different number values inside tail() and see what it does.

Cleaning data

Data is almost never perfect. In this case we have a column name “n” that isn’t very descriptive that could cause confusion later because in data science the term “n” typically stands for “number of observations”. While it is true our n column is the number of times a name appears each year, let’s rename it to something more descriptive.

During this process, we will create a new R object called babynames from the original babynames_raw. Again, like we did above we have the create the thing before you can fill it, so that is why we see the new object first, then the <- to fill the object with the results of the code on the right.

On the right side we are taking our raw data and piping it into the rename() function. That function also wants to know the new things first but a variable like this gets assigned with a single = sign, so times_given = n is renaming the n column.

babynames <- babynames_raw |> rename(times_given = n)

babynames |> head()

Cleaning data is often the most time consuming part of data science. We are usually making sure that variables are the correct data types, are well-named and don’t have any obvious problems. Here we just did the one thing to change a column name.

Focusing data

Sometimes we want to look at specific things in our data. We might want to look at specific columns/variables, or find particular rows.

Arrange to sort

We use the arrange() function to sort data based on variables we specify. The default is to sort from smallest to largest, so we have to specify desc() if we want it the other direction.

babynames |> 
  arrange(times_given)

babynames |> 
  arrange(desc(times_given))

The code chunk above actually has TWO lines of code so it prints out data TWICE to your notebook into different tabs. Click on the each tab to see and compare the results.

In the second line, we are nesting some functions. Even with only two functions, you can see that reading from the middle of a line of code can be confusing. This is why the pipe |> is so important so we can move the result of one function into another.

Filter rows

Sometimes we want to find specific rows in our data. This next chunk of code reads like this:

Take the babynames data, and then …
Arrange the result in descending order by the column times_given, and then …
Use the filter() function to look inside the year column and find only rows with the value “2023”.

babynames |> 
  arrange(desc(times_given)) |> 
  filter(year == 2023)

This gives us the most popular names in 2023! There were 31,682 different names given to boys and girls, although some names might be listed for both sexes.

Select columns

If you want to choose specific columns of data, we can use select().

Let’s take what we have above and add select to choose to look at just two of the variables.

babynames |> 
  arrange(desc(times_given)) |> 
  filter(year == 2023) |> 
  select(name, times_given) # selects specific columns

Another way to do it is to choose which columns NOT to use by adding a - before the name.

babynames |> 
  arrange(desc(times_given)) |> 
  filter(year == 2023) |> 
  select(-prop) # removes the prop column

OYO: Top names for your YOB

Find the top names in the year you were born. Remember, it really helps to run your code after writing each line, before you add the pipe to add the next part.

In the blank code block below, start with babynames and run the chunk.
Pipe into a filter to find your birth year. You can use Cmd + Shift + M (Mac) or Ctrl + Shift + M (Windows) to write the pipe.
Pipe into another filter to find your sex.
Arrange to find the most given.

Summarizing data

When we are looking at 2 million rows, its hard to see the forest through the trees. Sometimes we want to summarize a bunch of numbers into a single one so we can describe the group. Like finding an average of something, or using “median home sale price” to describe the value of a bunch home sales.

The function summarize() does just this. Feed it data, the name of a column and the method you want to use to summarize, and it will do the math.

Adding with sum

Within our babynames data, we have the number of times a name was given to a person. We could total all those up to find out how many names were given through time.

babynames |> 
  summarize(
    total_names_given = sum(times_given)
  )

Inside the summarize, we had to give it our “summarizing” function, which is sum() in this case, which will add together all the values in the column we give it, which is times_given in our case.

Grouping before summarizing

We can organize our data into groups before we summarize the numbers. This is very common in data journalism … to put our data into different piles based on values within a column, before we count or add it.

Most given name

OK, but what is the most given name of all time? How can we add together how many times a person has been given each name in the list?

Let’s show an example.

babynames |> 
  group_by(name) |> 
  summarize(total_given = sum(times_given)) |> 
  arrange(desc(total_given))

Most given name by sex

What are the most given name by sex?

We can group by more than one value before we summarize. Here let’s find the most given names including sex, and then find just the female names.

We group by both name and sex.
We total how many times each name has been given for that sex.
We sort the data to put the highest total_given values at the top.
We filter to find just the female names.

babynames |> 
  group_by(name, sex) |> 
  summarize(total_given = sum(times_given)) |> 
  arrange(desc(total_given)) |> 
  filter(sex == "F")

`summarise()` has grouped output by 'name'. You can override using the
`.groups` argument.

So, according to this data, more then 70,000 different names have been given to girls. The name use most often was Mary, with 4.1 million.

Most given each year

While grouping is often used with summarize, any function following group_by() will be applied with groupings. Let’s show this using slice_max(), which filters rows based on the highest value given.

In this code chunk, we take babynames and then find the three highest values of times_given.

babynames |> 
  slice_max(times_given, n = 3)

But if we group our data by year and sex first, we get the top three names for EACH year and sex.

babynames |> 
  group_by(year, sex) |> 
  slice_max(times_given, n = 3)

OYO: Most popular names by sex

The prop value in our data is the “proportion” or percentage of people that name was given to within its year. So in 1880, 7.2% of females were given the name Mary.

Use what you’ve learned so far today to find this:

What are the top 5 highest shares for a name within a year for each sex.

Grouping and counting

Sometimes we just want to count the number of times a value appears in a column. We can use summarize to do this, too, but instead of using sum() to add together values, we use the function n() to count the number of rows that have each value.

Total names given

We’ll use this method to find how many different names have been given through all time.

Here we will group by name and then count how many rows have had that name. This gives us a unique row for each name, so the total number of rows in the end is our total number of names, about 103,500 of them.

babynames |> 
  group_by(name) |> 
  summarize(yrs_used = n()) |> 
  arrange(desc(yrs_used))

Couple of things to take from this result above.

At 103,564 rows, that is now many different names have been used.
The yrs_used column is the number of times that name appeared. The names listed at the top with 288 are names given to both boys and girls every year of our data.

Render your page

Click on the Render button at the top of the document to see an HTML version of your document. All the code on the page has to work for this to happen, so if you get an error look at the message in the console and try to figure out what might be wrong. If you are in a workshop, raise your hand for some help.

Depending on time, we might try publishing our website to quartopub.com.