7  Intro to ggplot

This week, we’ll move on from summaries to talking about data visualizations (“data viz”), an essential skill for any data journalist. While there are a lot of different ways to make figures and graphs, this chapter will focus on one popular R package for data viz: ggplot.

7.1 Learning goals for this chapter

In this chapter, we’ll learn the basics of data visualization using the Grammar of Graphics principles. We’ll start with some smaller datasets to give you a sense of how the code works. And, in the next chapter, you’ll apply this to a dataset we have already used in the class.

Specifically our goals are:

  • To learn about the Grammar of Graphics
  • To make scatterplots
  • To make bar charts

7.2 Introduction to ggplot

ggplot2 is the data visualization library within Hadley Wickham’s tidyverse. It is a beast of a package because it supports a whole variety of different types of data visualizations, from bar charts and line charts to fancy choropleth maps and animated figures.

Even though the package is called ggplot2, the function to make graphs is just ggplot(). So, for simplicity, we’ll just call everything ggplot.

The ggplot package relies on a concept called the Grammar of Graphics, hence the gg in ggplot. The basic logic of the Grammar of Graphics is that any graph you could ever want to build will need similar things: a data set, some information about the scales of your variables, and the type of figure or graph that you want to create. These various things can be “layered” on top of each other to create a visually pleasing plot.

Folks who have used Adobe creative programs (e.g., Photoshop, Illustrator, etc.) can think about it like laying an image: each layer in your image should do something to change the image. Likewise, each layer in a ggplot figure will add to the overall graph.

7.2.1 What ggplot is best for

When producing figures and graphs in R, ggplot is the absolute best approach because you’ll see the results right in your notebook. These basic data visualizations are an essential skill for any data journalist: it helps you find important things in your data that you may ultimately report on. So, ggplot is important for any R-based data journalism project.

That being said, there are less complicated ways of creating publishable graphics. Tools like Datawrapper and Flourish can produce equally beautiful graphics without the code. So why learn ggplot? Because, (1) ggplot is super useful when you’re just learning about the data and (2) to get good enough in ggplot to make publishable graphics, you have to practice, practice, practice. Yes, ggplot is a big package with lots of nuance. But the more you take the time to learn it, the more you will master it.

7.2.2 The Grammar of Graphics

As noted above, the gg in ggplot stands for “Grammar of Graphics,” which is a fancy way of saying we’ll build our charts layer by layer. There are three main things you need to make a plot:

  • data: You have to tell ggplot the name of the object with your data.
  • mapping: Defines how variables in your dataset are mapped to visual properties (aesthetics) of your plot. i.e., which columns are on the x or y axis. We use the function aes() to describe these aesthetics.
  • geometries: This how you describe the shape of your visualization, whether it’s lines, bars, points, or something else. We do this through different geom_() functions.

In addition to these three things, there are lots of helper layers we’ll learn about along the way, including:

  • themes (“theme”): Allow you to define things like font, background color, and other things to “pretty up” the data viz.
  • coord_flip: A special layer for flipping the chart.
  • scales: Transforms data to make the plot more readable.
  • labels (“labs”): For making titles and labels.
  • facets: For graphing many elements of the same data set in the same space (one dataset, multiple figures).

This all may seem complicated now, but it’ll make sense once we start putting together these layers together, one at a time. After all, the best way to learn any R package is to do it.

7.3 Start a new project

  1. Get into RStudio and make sure you don’t have any other files or projects open.
  2. Create a new Quarto Website project, name it yourname-ggplot and save it in your rwd folder.
  3. (No need for a folder structure, we’ll do this all in one file.)
  4. We’ll use our index.qmd file for everything for this project, so update the title as “ggplot practice”, remove the boilerplate and create a setup section that loads library(tidyverse) and library(janitor), like we do with every notebook.

The ggplot package is a part of tidyverse, so we don’t need to do anything special there.

7.3.1 Install palmerpenguins

For this lesson, we will install a new R package, which will give us access to some data to work through some visualization techniques. I’d like you to meet the Palmer Penguins …

Artwork by @allison_horst

The palmerpenguins package includes scientific measurements of penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica. The package and the awesome artwork is maintained by Allison Horst. It’s a great data set for exploration & visualization so we’ll use it to learn about ggplot.

Let’s install and load it.

  1. In your Console, run install.packages("palmerpenguins").
  2. In your setup section, add the library and rerun the chunk: library(palmerpenguins)

Remember you only have to install a package once on your computer, but you should load the library at the top of each notebook that uses it. (i.e., don’t leave the install.packages() part in your notebook.)

7.4 The layers of ggplot


Note

Much of this first plot explanation comes from Hadley Wickham’s R for Data Science, with edits to fit the lesson here.

Before we dive into ggplot, let’s peek at our data.

  1. Start a new section “First plot”.
  2. Add some text that you are studying pretty penguins.
  3. Add a code chunk like below and run it to see what the penguins dataset looks like.
penguins

As noted above, the data are measurements from a number or penguins, things like species, sex, bill length, etc.

Among the variables in penguins are:

  • flipper_length_mm: length of a penguin’s flipper, in millimeters.
  • body_mass_g: body mass of a penguin, in grams.
  • species: a penguin’s species (Adelie, Chinstrap, or Gentoo).

With these variables, we’ll explore the relationship between flipper lengths and body masses of these penguins, taking into consideration the species of the penguin.

7.4.1 Our goal chart

Our goal is to recreate this chart:

Figure 7.1: Our goal graphic

7.4.2 1st: the data

When working with the ggplot2 package, you’ll start nearly every figure with the ggplot() function. In the ggplot() function, you’ll tell R what data you’re using, and the coordinate system you want to build based on the data.

The first thing you’ll want to do is tell ggplot the dataset you want to use (in this case, penguins). Let’s do that now.

  1. Edit your first plot section,
  2. Make a new code chunk and add the code below.
ggplot(penguins)

This tells us… absolutely nothing! But that’s not surprising: you haven’t even told ggplot what variables you want to focus on or the way you want to visualize the data. To do that, you’ll need a mapping argument.

7.4.3 2nd: Map the data

We use the aes() (short for “aesthetic”) to describe which data to plot and where. This is considered a mapping argument, because you use this argument to tell ggplot how you want to map your data.

We use aes() to indicate the variable to plot horizontally (the x axis) and vertically (the y axis). So your aes() argument will look something like this: aes(x = some_variable, y = another_variable).

In our case, set our flipper_length_mm (the flipper length) for x and body_mass_g for y:

  1. Edit your code to add the following line of code. Note I’ve rearranged the indenting, too.
ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
)

We are getting closer! The plot knows what data we are using and the “range” of that data, so it has added tick marks and values along our x and y axes.

7.4.4 3rd: geometries

Now that we know which data to apply, we have to make a choice on how we want to show the shape of that data on our plot, and we do that through geometries (or “geoms” as we call them.) We’ll talk about different ones in a minute, but we’ll start with plotting our data as points.

This is our first added layer on our chart, and therefore introduction of the + option within ggplot. We use + to add on layers of information in our grammar of graphics. It’s sort of like the pipe |> in the way it works, but we use + when adding ggplot layers.

  1. Edit your chunk to add the + and the geom_point() functions as noted below.

(I really recommend you type the additions so you can see how RStudio helps fill the code.)

ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
1) +
2  geom_point()
1
Don’t forget the + here
2
The geom_point() plotted dots based on the values set in our aes() function.

OK, now we are getting somewhere. We can see each penguin plotted on the chart, and we can generally see that as the birds get heavier (higher on vertical axis) their flippers are also longer (to the right on the horizontal axis.)

Note

We do get this warning: Warning: Removed 2 rows containing missing values geom_point(). This is telling us there are two rows of data that don’t have one of the two values, so they were dropped. Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. As journalists, we just need to make sure we know why something is missing and note if it is important.

7.4.4.1 Other geometries

There are many geoms, but here are a few common ones:

  • geom_point() adds dots onto the grid based on the data. We used them above.
  • geom_line() adds lines between data points on the grid. Basically a line chart.
  • geom_col() and geom_bars() adds bars to the grid based on values in the data. A bar chart. We’ll use geom_col() later in this lesson but you can read about the difference between the two in a later chapter.
  • geom_text() adds labels based on values in the data.

7.4.5 Chart code review

Let’s review the different parts of this code. Note I’ve rewritten it a bit here to put some arguments on their own line so it is more readable.

1ggplot(
2  penguins,
3  aes(x = flipper_length_mm, y = body_mass_g)
) +
4  geom_point()
1
ggplot() is the function we use to make a chart.
2
Inside of ggplot() the first argument is the data, in this case the penguins data.
3
The second argument is the aes() function, where we apply our aesthetics. Aesthetics describe how we will paint our data onto the plot. Inside of this function, we set values for the x and y axis so we know WHERE to paint. There are some other aesthetic values we can paint with data, and we will.
4
Lastly we add on a layer (with + on the previous line) to add geom_point() to paint “points” on top of the plot based on the set aesthetics. (It’s like pointillism.)
Important

You might see this same code written different ways. Let’s talk about why.

Verbose arguments

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

The above version adds data = and mapping = to the first two arguments. It is the official descriptive way to do this, and how Hadley Wickham describes it in R for Data Science. But, since the first argument is assumed to be the data we often don’t include it unless it unless we are trying to differentiate it with different data used later. Same goes for mapping = … we don’t specify it unless necessary.

Piping into ggplot

penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

In this case we start with the data AND THEN pipes it into ggplot() as the first argument. Prof. McDonald often does it this way. The scoundrel.

7.4.6 Global vs local aesthetics

Our initial plot appears to show that flipper length is related to the weight of the penguin, but it’s always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship. For example, does this relationship exist for all three species?

Let’s color our points based on the species to see.

  1. Edit your plot and add the color mapping to the aes() function, like below. NOTE I’ve also rearranged some of the code to make it more readable.
ggplot(
  penguins,
  aes(x = flipper_length_mm,
      y = body_mass_g,
1      color = species)
) +
  geom_point()
1
This is the line where you are adding color.

When a categorical variable like the species is mapped to an aesthetic, ggplot will assign a color to each unique value and add a legend so you can tell them apart.

Let’s add another layer: a smooth curve displaying the relationship between body mass and flipper length. Since this is a new geometric object representing our data, we will add a new geom as a layer on top of our point geom: geom_smooth(). And we will specify that we want to draw the line of best fit based on a linear model with method = "lm".

  1. Edit your code to add on the last layer shown here:
ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g,
    color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

We’ve added the lines, but it doesn’t look like our “goal” graphic we started with in Section 7.4.1, which only has one line instead of three separate lines for each of the penguin species.

When we set the x, y and color aesthetics in our code, we set them at the global level. All those characteristics apply to every added layer. However, each geom function can have it’s own aesthetics, setting them at the local level, applying to only that layer.

Let’s show this by moving our color aesthetic from the global level to apply just at the point level.

  1. Edit your code to move the color = species bit from the global aesthetics to a local aesthetic within geom_point(), as shown below.
ggplot(
  penguins,
  aes(x = flipper_length_mm,
1      y = body_mass_g)
) +
2  geom_point(aes(color = species)) +
  geom_smooth(method = "lm")
1
This is the line where we remove the color = species function.
2
This is where we add it back, creating an aes() function inside geom_point() so it only applies to that geom.

OK, this is looking pretty good, but it is not a great idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences. We can improve readability of this chart of we also map species based on the shape aesthetic.

  1. Edit your code to add shape = species aesthetic to the geom_point() function, as shown below.
ggplot(
  penguins,
  aes(x = flipper_length_mm,
      y = body_mass_g)
) +
1  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm")
1
We add the shape = species argument here inside the aes() function.

7.4.7 Labels

In addition to geoms, you can adjust and add labels (text layers) to our plot.

Labels (or labs, since we use the labs() function for them) are a series of text-based items we can layer onto our plots like titles, bylines and axis names.

Like geom_point() above, we’ll use the + at the end of each line before we add another layer.

Let’s add a labels layer so we can describe our chart to our readers. Note there are a number of arguments available in labs(), even more than those listed here.

  1. Edit your code to add the labs() layer indicated below.
ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
1  labs(
2    title = "Body mass and flipper length",
3    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
4    x = "Flipper length (mm)", y = "Body mass (g)",
5    color = "Species", shape = "Species"
  )
1
We start with the labs() function and open it up to write many arguments inside.
2
title gives us a headline.
3
subtitle is a readout under the headline. We sometimes need to wrap the text in a str_wrap() function to keep it from running out of the chart.
4
The x axis already had a label on our chart by default, but it was the not-very-print-friendly name of the variable. Fine for us because we know what it is, but here we are replacing that with text more appropriate for a reader. We do the same with the y label.
5
The color and shape arguments are similar in that we are replacing the existing label above the legend, resetting those to capitalize “Species”. We need to change them both so they are the same. (You might experiment by removing the shape argument and re-running it to see what happens.)

7.4.8 An aside: Titles and subtitles

Our example chart here is a scientific figure designed to show the relationship between the length of flippers to the body size of penguins, and that perhaps the relationship holds true among different species. It is not a good example of a journalistic chart.

We want our chart titles to be MORE THAN A LABEL … they should communicate something and further the story we are trying to tell with the plot. They should have a verb! If we were telling a story here, we might try: Flippers longer on heavier penguins.

Our subtitles should provide the context necessary to understand what we are looking at and why. The combination of the title and subtitle should stand alone to tell enough of this story that we don’t have to go elsewhere (like story) to understand it. A journalistic model might be something like this: Dr. Kristen Gorman studied three species of penguins at the Palmer Long-Term Ecological Research station in Antarctica between 2007 and 2009, recording basic biographical measurements for nearly 350 animals.

7.4.9 Themes

You can change just about anything on a ggplot chart if you know the function and arguments to describe it.

Themes control the display of all non-data elements of the plot. You can override all settings with a complete theme like theme_bw(), or choose to tweak individual settings by using theme(), like theme(plot.title = element_text(size = rel(2))), which makes a plot title bigger.

In addition to the built-in complete themes, there are many others from the R community.

Let’s show some here.

  1. Edit your code chunk to add on the last + theme_minimal() function, one of my favorites themes.
ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)", y = "Body mass (g)",
    color = "Species", shape = "Species"
  ) +
  theme_minimal()

This changes the background color and grid lines to make the data pop a little better. There are other themes that create more radical change, some in their own packages. We’ll play more later.

OK, you’ve made your first ggplot chart! Are you ready to make another?

7.5 Let’s build a bar chart

In our first week of class, we sent out a survey where you told us your favorite Disney Princess and favorite flavor of ice cream. Let’s now play around with some of this data.

For this lesson, we’re not going to create a different notebook or download the data to our computer. Instead, we’re going to save the data directly into a tibble.

  1. Start a new section: Importing class data
  2. In the text, note that we are importing the chart data.
  3. Add the code below to get the data.
1class <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQfwR6DBW5Qv6O5aEBFJl4V8itnlDxFEc1e_-fOAtBMDxXx1GeEGb8o5VSgi33oTYqeFhVCevGGbG5y/pub?gid=0&single=true&output=csv") |>
2  clean_names()

3class
1
We create a new object and then fill it by reading the data directly from the web.
2
We lowercase the variable names using clean_names() from the janitor package.
3
Print the new object to peek at it.

So, now, you should have the data in your environment.

And with this data, I want to build a chart like this:

7.5.1 Prep princess data

While there are ways for ggplot to calculate values from your data on the fly, I prefer to first build a table of the values I want, and then I will plot it on a chart. It’s helpful to think of these steps as separate so you have a good workflow (clean the data, prepare the data in a table form, and then plot the data).

Today, our goal will be to make a bar chart, sometimes known as a column chart. This bar chart will show the number of votes for each princess from the data. So, we need to count the number of rows for each value … our typical group_by/summarize/arrange (GSA) process.

For this lesson, I’ll use the count() shortcut, which I write about in Counting Shortcuts. Next, I’ll save the summarized data into a new dataframe called princess_data. Follow along in your notebook:

  1. Add a section: Princess data
  2. Add text that you are creating a data frame to plot.
  3. Add the code below to create that data.
1princess_data <- class |>
2  count(princess, name = "votes", sort = TRUE)
  
3princess_data
1
Start with a new object and start filling it with class data.
2
Here we use the count() function to count the number of rows based on the princess column. The argument name = just names the new columns something other than n, and the sort = argument sorts the data in descending order based on the counted column.
3
Print it out so we can see it.

At this point, y’all should be plenty familiar with these summary functions, and the output should be easy to interpret: we’re just counting the number of rows for each princess.

Now that we have our table data, let’s actually plot it.

7.5.2 Build our plot with geom_col

Like in the previous lesson, we’ll start our plot by creating the first layer: the ggplot() function, which takes the data as its first argument and the aes() mapping layer as its second argument.

  1. Add some text noting that you’ll now plot.
  2. Add the following code chunk, which is the first layer
ggplot(
  princess_data,
1  aes(x = votes, y = princess)
)
1
sets our x and y axes values.

You’ll see the grid and x/y axis labels of the data, but no geometries are applied yet, so you won’t see any data. But remember, we’re adding these all in layers.

7.5.3 Add the geom_col layer

Now it is time to add our columns. To do this, we’ll use geom_col(). Similar to geom_point(), geom_col() adds a geometric layer that tells R how to display the data (in this case, with columns as opposed to points). Let’s write this code now.

  1. Edit the plot code to add the ggplot pipe + and on the next line add geom_col().
ggplot(
  princess_data,
  aes(x = votes, y = princess)
1) +
2  geom_col()
1
Don’t forget to add the + at the end of the previous line
2
The geom_col() function adds the bars based on the global aesthetics we’ve already set.

Our two-layer chart is getting somewhere now. We’re able to see the data in the plot, the order of the bars is alphabetical instead of in vote order. We can fix it.

7.5.4 Reorder the bars

The bars on our chart are in alphabetical order of the x axis (and reversed thanks to our flip.) We want to order the values based on the votes in the data.

Note

Complication alert: Categorical data can have factors, which are like an internal ordering system. Some categories, like months in a year, have an “order” that is not alphabetical. We don’t have that here, but know it is a thing.

We can reorder our categorical values in a plot by editing the x values in our aes() using reorder(). (There is a tidyverse function called fct_reorder() that works the same way for factors.)

reorder() takes two arguments: The column to reorder, and the column to base that reorder on. It can happen in two different ways, and I’ll be honest and say I don’t know which is easier to comprehend.

  • y = reorder(princess, votes) says “we shall set y as reordered values of princess based on the order of votes. OR …
  • y = princess |> reorder(votes) says “set the y axis as princess and then reorder by votes.

They both work. Even though I’m a fan of the tidyverse |> construct, I’m going with the first version.

  1. Edit the first line of your chunk to reorder the bars.
ggplot(
  princess_data,
1  aes(x = votes, y = reorder(princess, votes))
) + 
  geom_col()
1
This is the line where reorder() is added.

So now the bars are organized in vote size. But can we make it more clear about the number of votes for each princess? Let’s learn how to add this information.

7.5.5 Adding a geom_text layer

Now, we’re really starting to take advantage of the grammar of graphics by including more than one geometric layer. Specifically, we’ll be using geom_text() to add some information to our bar charts.

As we mentioned previously, geom layers can take individual aesthetics (that build on top of the global aesthetics you put in the first layer). When using geom_text(), we’ll include some local aesthetics using the aes() argument, to tell ggplot the label we’d like to add to the plot.

  1. Edit your plot chunk to add the + and geom_text() layer on the end of your code.
  2. Set the aesthetics of the geom_text() function to plot the labels onto the chart based on the number of votes: aes(label = votes), as noted below.
ggplot(princess_data,
       aes(x = votes, y = reorder(princess, votes))
) + 
  geom_col() +
1  geom_text(aes(label = votes))
1
This plots a text layer onto the chart based on both the position and values in votes.

Well that did… something. We’ve successfully added the numbers to this plot, but it’s not very pretty. First, the number is plotted at the end of the bar, making it harder to read. So we’ll want to horizontally adjust this by shifting the numbers a bit to the left. Second, black text is really hard to read against a dark grey background. So we’ll change the text of the number to white.

We can make both of these edits directly in the geom_text layer.

  1. Edit the last line of your plot chunk to add two new arguments.
  2. The first argument you will add is hjust, which moves the text left. (hjust stands for horizontal justification. vjust, or vertical justification, would move it up and down).
  3. The second argument you will add is color, which tells ggplot what the color of your text should be.

As a reminder, you should always separate your arguments within a function using commas (,).

ggplot(
  princess_data,
  aes(x = votes, y = reorder(princess, votes))
) + 
  geom_col() +
1  geom_text(aes(label = votes), hjust = 2, color = "white")
1
Note that both hjust and color are NOT inside the aes() function because we are not using the data to control them.

Great! But we’re still not done. Even though we’ve added labels to each bar chart, we still haven’t added a title, and the titles of our x and y axes are not great. So let’s work on those now.

7.5.6 Add some titles and more labels

Now that we have a chart, with some information displayed in bars, let’s add to this by giving the chart some labels. We’ll do this by adding a layer of labels to our chart using the the labs() function. We can add and change a number of things with labs(), including creating a title, and changing the x and y axis titles.

  1. Edit the last line of your plot chunk to add the ggplot pipe + and labs() in the next line.
  2. Add a title using the title = argument
  3. Add a subtitle using the subtitle = argument. This is a great place to put information about your data (like when it was collected).
  4. Add a caption using the caption = argument. Put your byline here!
  5. Change the x and y axes titles using x = and y =.
ggplot(
  princess_data,
  aes(x = votes, y = reorder(princess, votes))
) + 
  geom_col() +
  geom_text(aes(label = votes), hjust = 2, color = "white") + 
  labs(
1    title = "Pick of the princesses",
2    subtitle = str_wrap("Students in Reporting with Data each voted for their favorite Disney Princess. Some complained that Princess Leia was not an option."),
3    caption = "By Jo Lukito",
    x = "Number of votes",
    y = "Princess choices"
  )
1
A title to draw you in or communicate your goal.
2
The subtitle is the place to explain what is needed to understand the chart. In this case we wrap the text in a str_wrap() function so the words don’t run off the chart.
3
A caption is a good place to add your byline.

There you go! You’ve made a chart showing how our classes rated Disney Princesses.

7.6 On your own: Ice cream!

Now it is time for you to put these skills to work:

  1. Build a chart about the favorite ice creams from RWD classes.

Some things to consider:

  • You need a new section, etc.
  • You’re starting with the same class data
  • You need to prepare the data based on ice_cream (which is the name of a variable in your class data frame). You can use group_by/summarize (with n()) and arrange to get your data, or try the count() shortcut I used above.
  • Build the column chart with the ice cream values along the Y axis, but make sure they are ordered to the most popular flavor is at the top.

It’s essentially the same process we used for the princess chart, but using ice_cream variable. That said, I really recommend you write all the code from scratch, ONE LINE AT A TIME, so you can soak in what each line does.

7.7 What we’ve learned

There is a ton, really.

  • ggplot2 (which is really the ggplot() function) is the charting library for the tidyverse. This whole lesson was about it.
  • We also covered reorder(), which can reorder a variable based on the values in a different variable.

Check out the Resources chapter for more references, tutorials, etc.