Heisman voting table
This gets the Heisman Voting table for each year in a list of years. While I could scrape and build the data all at once, I chose to save each scraped file to disk first so the scraping doesn't have to be rerun.
Setup
library(tidyverse)
library(janitor)
library(rvest)
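One setup step worth noting (an assumption on my part, since the post doesn't show it): the output folder has to exist before the scrapes can write to it. A minimal sketch using base R:
# Assumption: data-raw/heisman may not exist yet; create it once before scraping
if (!dir.exists("data-raw/heisman")) {
  dir.create("data-raw/heisman", recursive = TRUE)
}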
Figure out the scrape
Figuring out the scrape with one URL.
url <- "https://www.sports-reference.com/cfb/awards/heisman-1935.html"
h_test <- read_html(url) |> html_node("#heisman") |> html_table()
h_test
h_test |> write_csv("data-raw/heisman/h_test.csv")
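Since janitor is loaded, one optional step (my suggestion, not part of the original workflow) is to standardize the scraped column headers before writing the file. A sketch:
# Sketch: janitor::clean_names() lower-cases and snake-cases the headers
h_test |> clean_names()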
Make scraping a function
Creates a function to scrape a page with this table, with some time between scrapes so we don't blast the server. It saves the files into data-raw/heisman.
scrape_heisman <- function(yr) {
  # build the url
  url <- paste0("https://www.sports-reference.com/cfb/awards/heisman-", yr, ".html")
  # Wait 2 seconds
  Sys.sleep(2)
  # get the table
  table <- read_html(url) |> html_node("#heisman") |> html_table()
  # create an export file path
  export_url <- paste0("data-raw/heisman/hv-", yr, ".csv")
  # export the table
  table |> write_csv(export_url)
}
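Requests can occasionally fail, so if you want a long run to survive a bad year, one hedged option is to wrap the function with purrr's possibly() (loaded with the tidyverse). scrape_safely is a hypothetical name for illustration:
# Sketch: a wrapped scraper that returns NULL on failure instead of
# stopping the whole run; scrape_safely is a hypothetical name
scrape_safely <- possibly(scrape_heisman, otherwise = NULL)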
Scrape all the years in a list, doing only four years here.
yrs <- 2020:2023
# Creates a loop to get those files
for (i in yrs) {
  scrape_heisman(i)
}
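For what it's worth, the same loop can be written with purrr's walk(), which calls a function on each element purely for its side effects. A one-line sketch of the equivalent:
# Sketch: the purrr equivalent of the for loop above
walk(yrs, scrape_heisman)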
Combine the files
Makes a list of all the files whose names end with a digit followed by .csv, which skips the h_test.csv file from earlier.
files_list <- list.files(
  "data-raw/heisman",
  pattern = "\\d\\.csv$",
  full.names = TRUE
)
files_list
[1] "data-raw/heisman/hv-2020.csv" "data-raw/heisman/hv-2021.csv"
[3] "data-raw/heisman/hv-2022.csv" "data-raw/heisman/hv-2023.csv"
Takes that list and maps over them, applying read_csv while preserving the name of the file the data came from.
heisman_raw <- files_list |>
  set_names(basename) |>
  map(read_csv) |>
  list_rbind(names_to = "source")
heisman_raw
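As an aside, newer versions of readr (2.0+) can read the whole vector of files in one call, recording where each row came from via the id argument. A sketch of that alternative; note the path caveat in the comment:
# Alternative sketch: read all files at once; id = "source" stores the
# full file path (not just the basename), so the str_sub() positions in
# the next step would need adjusting
heisman_alt <- read_csv(files_list, id = "source")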
Pull the year
Uses str_sub() to pull the year from the source column based on its position.
heisman_year <- heisman_raw |>
  mutate(year = str_sub(source, 4, 7), .after = source)
heisman_year
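If the file-name pattern ever changes, a position-independent alternative is to extract the four digits with a regular expression and convert them to a number. A sketch (heisman_year2 is just an illustrative name):
# Sketch: grab the four-digit year wherever it sits in the file name,
# then convert it from character to integer
heisman_year2 <- heisman_raw |>
  mutate(year = as.integer(str_extract(source, "\\d{4}")), .after = source)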
Now you can drop the source column.
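A one-liner sketch of that cleanup (heisman_clean is an illustrative name):
# Sketch: drop source now that the year is captured
heisman_clean <- heisman_year |> select(-source)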