Heisman voting table
This gets the Heisman Voting table for each year in a list of years. While I could scrape and build the data all at once, I chose to save each scraped file to disk first so the scraping doesn't have to be rerun.
Setup
library(tidyverse)
library(janitor)
library(rvest)
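The scraped files get saved into data-raw/heisman, so that folder has to exist before anything is written. A minimal sketch to create it (this dir.create() call is my addition, assuming the project root is the working directory):
# Create the output folder if it doesn't already exist.
# recursive = TRUE also creates data-raw if needed.
dir.create("data-raw/heisman", recursive = TRUE, showWarnings = FALSE)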
Figure out the scrape
First, figure out the scrape using a single URL.
url <- "https://www.sports-reference.com/cfb/awards/heisman-1935.html"
h_test <- read_html(url) |> html_node("#heisman") |> html_table()
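Before saving anything, it can help to peek at what came back. A quick optional check (my addition, not part of the saved output; clean_names() comes from the janitor package loaded in Setup):
# Peek at the parsed table with standardized column names
h_test |> janitor::clean_names() |> glimpse()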
h_test
h_test |> write_csv("data-raw/heisman/h_test.csv")
Make scraping a function
Creates a function that scrapes the table from the page for a given year, pausing between scrapes so we don't blast the server. It saves each file into data-raw/heisman.
scrape_heisman <- function(yr) {
  # build the url for the given year
  url <- paste0("https://www.sports-reference.com/cfb/awards/heisman-", yr, ".html")
  # wait 2 seconds so we don't hammer the server
  Sys.sleep(2)
  # get the table
  table <- read_html(url) |> html_node("#heisman") |> html_table()
  # build the export file path
  export_path <- paste0("data-raw/heisman/hv-", yr, ".csv")
  # export the table
  table |> write_csv(export_path)
}
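A quick optional smoke test with a single year, using 1935 since that's the same URL tested above (my addition):
# Scrape one year to confirm the function works end to end
scrape_heisman(1935)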
Scrape all the years in a list. Doing only four years here.
yrs <- c(2020:2023)
# Loop over those years, scraping and saving each one
for (i in yrs) {
  scrape_heisman(i)
}
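Since purrr comes along with the tidyverse, the loop could also be written as a walk(), which calls the function once per year and discards the return values. An equivalent alternative, if you prefer it:
# walk() applies scrape_heisman() to each year in yrs
walk(yrs, scrape_heisman)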
Combine the files
Makes a list of all the files that end with a digit followed by .csv, which skips the h_test.csv file. (The dot is escaped in the pattern so it matches a literal period.)
files_list <- list.files(
  "data-raw/heisman",
  pattern = "\\d\\.csv$",
  full.names = TRUE
)
files_list
[1] "data-raw/heisman/hv-2020.csv" "data-raw/heisman/hv-2021.csv"
[3] "data-raw/heisman/hv-2022.csv" "data-raw/heisman/hv-2023.csv"
Takes that list and maps over it, applying read_csv() to each file while preserving the name of the file each row came from.
heisman_raw <- files_list |>
  set_names(basename) |>
  map(read_csv) |>
  list_rbind(names_to = "source")
heisman_raw
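As an aside, readr 2.0 and later can read and combine in a single call by handing read_csv() the whole vector of paths along with an id column. A sketch (my addition); one caveat is that source would then hold the full path instead of just the file name, so the character positions used below would shift:
# readr 2.0+ reads multiple files at once; id records each row's file
heisman_alt <- read_csv(files_list, id = "source")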
Pull the year
Uses str_sub() to pull the year from the source column based on its position. In a file name like hv-2020.csv, the year occupies characters 4 through 7.
heisman_year <- heisman_raw |>
  mutate(year = str_sub(source, 4, 7), .after = source)
heisman_year
Now you can drop the source column.
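A sketch of that last bit of cleanup (the heisman object name and the integer conversion of year are my choices, not from the original):
# Drop the bookkeeping column and store year as a number
heisman <- heisman_year |>
  mutate(year = as.integer(year)) |>
  select(-source)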