This lesson uses COVID case data from the New York Times and Kyle Walker’s tidycensus package to make a map of COVID deaths per popluation for Texas counties. See [Reporting with Data in R] for more information.
As of this writing, the RWDIR chapter for this is here.
library(tidyverse)
library(janitor)
library(tidycensus)
The data is found here.
covid <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv") %>%
clean_names()
## Rows: 1849105 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): county, state, fips
## dbl (2): cases, deaths
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
covid %>% head(10)
Get most recent Texas numbers.
There is an “Unknown” county, which we remove.
tx_covid <- covid %>%
filter(state == "Texas") %>%
group_by(county) %>%
slice_max(date) %>%
select(-state) %>%
filter(county != "Unknown")
tx_covid
First we have to figure out the variable we want. We are using 2020 Decennial data from the redistricting file. This is the table from data.census.gov that has the population info.
Download the variables from the “pl” table:
vpl <- load_variables(2020, "pl", cache = T)
# view is commented out
# View(vpl)
The total population variable is: P1_001N
Test the variable against the portal:
get_decennial(
year = 2020,
variables = c(total_pop = "P1_001N"),
geography = "state",
state = "AL"
)
## Getting data from the 2020 decennial Census
## Using the PL 94-171 Redistricting Data summary file
## Note: 2020 decennial Census data use differential privacy, a technique that
## introduces errors into data to preserve respondent confidentiality.
## ℹ Small counts should be interpreted with caution.
## ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
## This message is displayed once per session.
Now let’s get Texas by county. The output is suppressed.
tx_pop <- get_decennial(
year = 2020,
variables = c(total_pop = "P1_001N"),
geography = "county",
state = "TX",
geometry = TRUE # adds shapes
) %>% clean_names()
We use an inner join to bring the covid data into the population data.
You have to start with the population data for the geometry to work.
tx_joined <- tx_pop %>%
# this is the join line
inner_join(tx_covid, by = c("geoid" = "fips"))
tx_joined %>% glimpse()
## Rows: 254
## Columns: 9
## $ geoid <chr> "48005", "48225", "48373", "48473", "48441", "48055", "48355"…
## $ name <chr> "Angelina County, Texas", "Houston County, Texas", "Polk Coun…
## $ variable <chr> "total_pop", "total_pop", "total_pop", "total_pop", "total_po…
## $ value <dbl> 86395, 22066, 50123, 56794, 143208, 45883, 353178, 31040, 201…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-95.00488 3..., MULTIPOLYGON (((…
## $ date <date> 2021-10-24, 2021-10-24, 2021-10-24, 2021-10-24, 2021-10-24, …
## $ county <chr> "Angelina", "Houston", "Polk", "Waller", "Taylor", "Caldwell"…
## $ cases <dbl> 13665, 2847, 5900, 6461, 22777, 8376, 61305, 4308, 4736, 898,…
## $ deaths <dbl> 397, 80, 198, 79, 542, 150, 1245, 126, 110, 28, 250, 156, 318…
Removing columns and renaming value.
tx_selected <- tx_joined %>%
select(-c(date, name, variable)) %>%
rename(
total_pop = value
)
tx_selected %>% glimpse()
## Rows: 254
## Columns: 6
## $ geoid <chr> "48005", "48225", "48373", "48473", "48441", "48055", "48355…
## $ total_pop <dbl> 86395, 22066, 50123, 56794, 143208, 45883, 353178, 31040, 20…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-95.00488 3..., MULTIPOLYGON ((…
## $ county <chr> "Angelina", "Houston", "Polk", "Waller", "Taylor", "Caldwell…
## $ cases <dbl> 13665, 2847, 5900, 6461, 22777, 8376, 61305, 4308, 4736, 898…
## $ deaths <dbl> 397, 80, 198, 79, 542, 150, 1245, 126, 110, 28, 250, 156, 31…
Creating cases/deaths per 1,000 population.
tx_rates <- tx_selected %>%
mutate(
cases_per_pop = (cases / (total_pop / 1000)) %>% round(),
deaths_per_pop = (deaths / (total_pop / 1000)) %>% round(),
)
tx_rates %>% glimpse()
## Rows: 254
## Columns: 8
## $ geoid <chr> "48005", "48225", "48373", "48473", "48441", "48055", "…
## $ total_pop <dbl> 86395, 22066, 50123, 56794, 143208, 45883, 353178, 3104…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-95.00488 3..., MULTIPOLYG…
## $ county <chr> "Angelina", "Houston", "Polk", "Waller", "Taylor", "Cal…
## $ cases <dbl> 13665, 2847, 5900, 6461, 22777, 8376, 61305, 4308, 4736…
## $ deaths <dbl> 397, 80, 198, 79, 542, 150, 1245, 126, 110, 28, 250, 15…
## $ cases_per_pop <dbl> 158, 129, 118, 114, 159, 183, 174, 139, 235, 146, 222, …
## $ deaths_per_pop <dbl> 5, 4, 4, 1, 4, 3, 4, 4, 5, 5, 5, 4, 3, 3, 2, 2, 4, 5, 5…
tx_rates %>%
arrange(cases_per_pop %>% desc()) %>%
select(county, total_pop, cases, cases_per_pop,) %>%
head(10)
tx_rates %>%
arrange(deaths_per_pop %>% desc()) %>%
select(county, total_pop, deaths, deaths_per_pop) %>%
head(10)
ggplot(tx_rates) +
geom_sf(aes(fill = cases_per_pop), color = "white", size = .2) +
theme_void() +
labs(
title = "COVID cases per 1000 in Texas counties",
caption = "Source: NYTimes, Census Bureau/2020 Decennial"
) +
scale_fill_distiller(
palette = "Greens",
direction = 1,
name = "Cases per 1000"
)
ggplot(tx_rates) +
geom_sf(aes(fill = deaths_per_pop), color = "white", size = .1) +
theme_void() +
labs(
title = "COVID deaths per 1,000 people in Texas counties",
caption = "Source: NYTimes, Census Bureau/2020 Decennial"
) +
scale_fill_distiller(
palette = "Blues",
direction = 1,
name = "Deaths per\n1,000 population\n"
)
You have to start with the tibble that has the geometry. This one tries to start with the covid data and the result doesn’t work. The error is:
Error: stat_sf requires the following missing aesthetics: geometry Run `rlang::last_error()` to see where the error occurred.
The code is commented since it doesn’t work.
# tx_joined_flop <- tx_covid %>%
# inner_join(tx_pop, by = c("fips" = "geoid"))
#
# tx_joined_flop %>%
# ggplot() +
# geom_sf()