This lesson uses COVID case data from the New York Times and Kyle Walker’s tidycensus package to make a map of COVID deaths per popluation for Texas counties. See [Reporting with Data in R] for more information.

As of this writing, the RWDIR chapter for this is here.

Setup

library(tidyverse)
library(janitor)
library(tidycensus)

Import covid data

The data is found here.

covid <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv") %>% 
  clean_names()
## Rows: 1849105 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): county, state, fips
## dbl  (2): cases, deaths
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
covid %>% head(10)

Get most recent Texas numbers.

There is an “Unknown” county, which we remove.

tx_covid <- covid %>% 
  filter(state == "Texas") %>% 
  group_by(county) %>% 
  slice_max(date) %>% 
  select(-state) %>% 
  filter(county != "Unknown")

tx_covid

Get population data

First we have to figure out the variable we want. We are using 2020 Decennial data from the redistricting file. This is the table from data.census.gov that has the population info.

Download the variables from the “pl” table:

vpl <- load_variables(2020, "pl", cache = T)
# view is commented out
# View(vpl)

The total population variable is: P1_001N

Test the variable against the portal:

get_decennial(
  year = 2020,
  variables = c(total_pop = "P1_001N"),
  geography = "state",
  state = "AL"
)
## Getting data from the 2020 decennial Census
## Using the PL 94-171 Redistricting Data summary file
## Note: 2020 decennial Census data use differential privacy, a technique that
## introduces errors into data to preserve respondent confidentiality.
## ℹ Small counts should be interpreted with caution.
## ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
## This message is displayed once per session.

Now let’s get Texas by county. The output is suppressed.

tx_pop <- get_decennial(
  year = 2020,
  variables = c(total_pop = "P1_001N"),
  geography = "county",
  state = "TX",
  geometry = TRUE # adds shapes
) %>% clean_names()

Join the data

We use an inner join to bring the covid data into the population data.

You have to start with the population data for the geometry to work.

tx_joined <- tx_pop %>% 
  # this is the join line
  inner_join(tx_covid, by = c("geoid" = "fips"))

tx_joined %>% glimpse()
## Rows: 254
## Columns: 9
## $ geoid    <chr> "48005", "48225", "48373", "48473", "48441", "48055", "48355"…
## $ name     <chr> "Angelina County, Texas", "Houston County, Texas", "Polk Coun…
## $ variable <chr> "total_pop", "total_pop", "total_pop", "total_pop", "total_po…
## $ value    <dbl> 86395, 22066, 50123, 56794, 143208, 45883, 353178, 31040, 201…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-95.00488 3..., MULTIPOLYGON (((…
## $ date     <date> 2021-10-24, 2021-10-24, 2021-10-24, 2021-10-24, 2021-10-24, …
## $ county   <chr> "Angelina", "Houston", "Polk", "Waller", "Taylor", "Caldwell"…
## $ cases    <dbl> 13665, 2847, 5900, 6461, 22777, 8376, 61305, 4308, 4736, 898,…
## $ deaths   <dbl> 397, 80, 198, 79, 542, 150, 1245, 126, 110, 28, 250, 156, 318…

Clean up the columns

Removing columns and renaming value.

tx_selected <- tx_joined %>% 
  select(-c(date, name, variable)) %>% 
  rename(
    total_pop = value
  )

tx_selected %>% glimpse()
## Rows: 254
## Columns: 6
## $ geoid     <chr> "48005", "48225", "48373", "48473", "48441", "48055", "48355…
## $ total_pop <dbl> 86395, 22066, 50123, 56794, 143208, 45883, 353178, 31040, 20…
## $ geometry  <MULTIPOLYGON [°]> MULTIPOLYGON (((-95.00488 3..., MULTIPOLYGON ((…
## $ county    <chr> "Angelina", "Houston", "Polk", "Waller", "Taylor", "Caldwell…
## $ cases     <dbl> 13665, 2847, 5900, 6461, 22777, 8376, 61305, 4308, 4736, 898…
## $ deaths    <dbl> 397, 80, 198, 79, 542, 150, 1245, 126, 110, 28, 250, 156, 31…

Creates our rate column

Creating cases/deaths per 1,000 population.

tx_rates <- tx_selected %>% 
  mutate(
    cases_per_pop = (cases / (total_pop / 1000)) %>% round(),
    deaths_per_pop = (deaths / (total_pop / 1000)) %>% round(),
  )

tx_rates %>% glimpse()
## Rows: 254
## Columns: 8
## $ geoid          <chr> "48005", "48225", "48373", "48473", "48441", "48055", "…
## $ total_pop      <dbl> 86395, 22066, 50123, 56794, 143208, 45883, 353178, 3104…
## $ geometry       <MULTIPOLYGON [°]> MULTIPOLYGON (((-95.00488 3..., MULTIPOLYG…
## $ county         <chr> "Angelina", "Houston", "Polk", "Waller", "Taylor", "Cal…
## $ cases          <dbl> 13665, 2847, 5900, 6461, 22777, 8376, 61305, 4308, 4736…
## $ deaths         <dbl> 397, 80, 198, 79, 542, 150, 1245, 126, 110, 28, 250, 15…
## $ cases_per_pop  <dbl> 158, 129, 118, 114, 159, 183, 174, 139, 235, 146, 222, …
## $ deaths_per_pop <dbl> 5, 4, 4, 1, 4, 3, 4, 4, 5, 5, 5, 4, 3, 3, 2, 2, 4, 5, 5…

Highest rates

Counties with highest case rates

tx_rates %>% 
  arrange(cases_per_pop %>% desc()) %>% 
  select(county, total_pop, cases, cases_per_pop,) %>% 
  head(10)

Counties with highest deaths rates

tx_rates %>% 
  arrange(deaths_per_pop %>% desc()) %>% 
  select(county, total_pop, deaths, deaths_per_pop) %>% 
  head(10)

Map the rates

Cases per 1000

ggplot(tx_rates) +
  geom_sf(aes(fill = cases_per_pop), color = "white", size = .2) +
  theme_void() +
  labs(
    title = "COVID cases per 1000 in Texas counties",
    caption = "Source: NYTimes, Census Bureau/2020 Decennial"
  ) +
  scale_fill_distiller(
    palette = "Greens",
    direction = 1,
    name = "Cases per 1000"
  )

Deaths per 1000

ggplot(tx_rates) +
  geom_sf(aes(fill = deaths_per_pop), color = "white", size = .1) +
  theme_void() +
  labs(
    title = "COVID deaths per 1,000 people in Texas counties",
    caption = "Source: NYTimes, Census Bureau/2020 Decennial"
  ) +
  scale_fill_distiller(
    palette = "Blues",
    direction = 1,
    name = "Deaths per\n1,000 population\n"
  )

Failed join

You have to start with the tibble that has the geometry. This one tries to start with the covid data and the result doesn’t work. The error is:

Error: stat_sf requires the following missing aesthetics: geometry Run `rlang::last_error()` to see where the error occurred.

The code is commented since it doesn’t work.

# tx_joined_flop <- tx_covid %>%
#   inner_join(tx_pop, by = c("fips" = "geoid"))
# 
# tx_joined_flop %>%
#   ggplot() +
#   geom_sf()