Longhorn Soccer Stats

The goal here is to scrape individual and goalkeeper stats from the University of Texas women’s soccer team web pages. The data is stored in tables on the “Box Score” page for each game. The data is then cleaned and exported as RDS files.

Those files will have to be imported and combined, and then cleaned up before analysis.

Of note

This creates two files for each game (player stats and goalkeeper stats) and exports them as rds files. The idea is those files could be combined in a later script and further cleaning done.

This could be refactored to pull all the URLs from a schedule page and then run through a Github action to collect the data each week.

What we want

A list of the information we are trying to gather from each page.

  • The date of the game
  • The location of the game
  • The visiting team
  • The home team
  • Individual stats for visitors
  • Goalkeeping stats for visitors
  • Individual stats for home team
  • Goalkeeping stats for home team

Some pages to test

Setup

library(tidyverse)
library(janitor)
library(rvest)

Scrape stats function

Once I figured out how to do all this, I turned the process into a function that can be fed a list of URLs to make files for each match.

This combines all the processing worked out below and turns it into a function so we can feed a list of URLs to make all the files.

scrape_stats <- function(file_url) {
  
  # A pause to avoid hammering servers
  Sys.sleep(2)
  
  # reads in URL
  game_stats_raw <- read_html(file_url)
  
  # pulls game details from description list
  m_details <- game_stats_raw |> html_nodes("dd") |> html_text()
  
  # creates variables for game details
  m_date <- m_details[[1]] |> mdy()
  m_site <- m_details[[3]]
  m_stadium <- m_details[[4]]
  
  # pulls who is visitor vs home
  subheads <- game_stats_raw |> 
    html_nodes("#individual-stats") |> 
    html_nodes(".sub-heading") |>
    html_text()

  # UPDATES START HERE

  # Separates subheads into home/away team_score
  v_team_score <- subheads[[1]]
  h_team_score <- subheads[[2]]

  # Removes ranks and scores to get team name
  v_team <- v_team_score[[1]] |> str_remove(" \\d+$") |> str_remove("^#\\d+ ")
  h_team <- h_team_score[[1]] |> str_remove(" \\d+$") |> str_remove("^#\\d+ ")

  # extracts score from team_score
  v_score <- v_team_score |> str_extract("\\d+$")
  h_score <- h_team_score |> str_extract("\\d+$")

  # UPDATES END HERE
  
  # A function to handle some player stats cleaning
  clean_indi_stats <- function(df) {
    df |> mutate(
    start = case_when(
      player == "Starters" ~ "Starter",
      player == "Substitutes" ~ "Substitute",
      player == "Totals" ~ "Total",
      .default = NA
    ), .after = pos,
  ) |> fill(start) |> 
    filter(!player %in% c("Starters", "Substitutes", "Totals")) |> 
    mutate(
      date = m_date,
      site = m_site,
    )
  }
  
  # creates list of individual stats tables
  game_stats_tables <- game_stats_raw |> html_elements("section#individual-stats") |> minimal_html() |> html_table()
  
  # gets each table type depending on visitor/home
  v_pl_raw <- game_stats_tables[[1]] |> clean_names()
  v_gk_raw <- game_stats_tables[[2]] |> clean_names()
  h_pl_raw <- game_stats_tables[[3]] |> clean_names()
  h_gk_raw <- game_stats_tables[[4]] |> clean_names()
  
  # adds team names to individual stats
  v_pl <- v_pl_raw |> 
    clean_indi_stats() |> 
    mutate(team = v_team, vs = h_team)
  
  h_pl <- h_pl_raw |> 
    clean_indi_stats() |> 
    mutate(team = h_team, vs = v_team)
  
  # combines visitor/home data into one table
  pl <- bind_rows(v_pl, h_pl)
  
  # adds game details and team names to goalkeeping stats: visitors
  v_gk <- v_gk_raw |>
    filter(!goalie %in% c("Goalkeeping", "Totals")) |> 
    mutate(team = v_team, vs = h_team, date = m_date, site = m_site)
  
  # adds game details and team names to goalkeeping stats: home
  h_gk <- h_gk_raw |>
    filter(!goalie %in% c("Goalkeeping", "Totals")) |> 
    mutate(team = h_team, vs = v_team, date = m_date, site = m_site)

  # combines goalkeeper stats
  gk <- bind_rows(v_gk, h_gk)

  # preps file names for export based on matchup
  file_name_prefix <- paste(m_date, v_team, h_team, sep = "_") |> str_replace_all(" ", "_") |> str_replace_all("/", "-")
  
  pl_export_path <- paste("data-raw/soccer/", "pl_", file_name_prefix, ".rds", sep = "")
  gk_export_path <- paste("data-raw/soccer/", "gk_", file_name_prefix, ".rds", sep = "")

  # exports files
  pl |> write_rds(pl_export_path)
  gk |> write_rds(gk_export_path)
}

Testing the scrape

This concept could be refactored to use a schedule page to get all the URLs to the box score pages, but I don’t have that in me right now.

I first build a list of URLs to test the function. I’m building a couple of versions for convenience and to test some specific things.

A single page

url_one <- "https://texaslonghorns.com/sports/womens-soccer/stats/2024/texas-am/boxscore/16315"

This is a small list

url_list_short <- c(
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/houston/boxscore/16304",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/iowa/boxscore/16307",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2015/north-carolina/boxscore/8819",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/texas-am/boxscore/16315"
)

This is a longer list of the entire 2024 season:

url_list_2024 <- c(
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/houston/boxscore/16304",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/cal-state-bakersfield/boxscore/16305",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/smu/boxscore/16306",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/iowa/boxscore/16307",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/central-michigan/boxscore/16308",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/san-diego-state/boxscore/16309",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/long-beach-state/boxscore/16310",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/houston-christian/boxscore/16311",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/alabama/boxscore/16312",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/mississippi-state/boxscore/16313",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/ole-miss/boxscore/16314",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/texas-am/boxscore/16315",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/florida/boxscore/16316",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/oklahoma/boxscore/16317",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/lsu/boxscore/16318",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/arkansas/boxscore/16319",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/missouri/boxscore/16320",
  "https://texaslonghorns.com/sports/womens-soccer/stats/2024/georgia/boxscore/16321"
)

Process all the files

This uses map to process all the files.

url_one |> map(scrape_stats)
[[1]]
# A tibble: 2 × 10
  position number goalie        minutes ga    saves team  vs    date       site 
  <chr>    <chr>  <chr>         <chr>   <chr> <chr> <chr> <chr> <date>     <chr>
1 GK       0      Fuller, Sydn… 90:00   2     4     Texa… Texas 2024-09-29 Aust…
2 GK       1      Justus, Mia   90:00   0     5     Texas Texa… 2024-09-29 Aust…

Working out details

This is the code I used to work out the process function above.

Get the raw HTML

game_stats_raw <-  read_html("https://texaslonghorns.com/sports/womens-soccer/stats/2015/north-carolina/boxscore/8819")

Match details

We pull these from a description list on the main “Box Score” page.

m_details <- game_stats_raw |> html_nodes("dd") |> html_text()

m_date <- m_details[[1]] |> mdy()
m_site <- m_details[[3]]
m_stadium <- m_details[[4]]

Teams and scores

We get these from the subheads on the “Individual” page. I had to target these pretty specifically by id and class to get the correct tables.

subheads <- game_stats_raw |> html_nodes("#individual-stats") |> html_nodes(".sub-heading") |> html_text()

v_team_score <- subheads[[1]]
h_team_score <- subheads[[2]]

v_team <- v_team_score[[1]] |> str_remove(" \\d+$") |> str_remove("^#\\d+ ")
h_team <- h_team_score[[1]] |> str_remove(" \\d+$") |> str_remove("^#\\d+ ")

v_score <- v_team_score |> str_extract("\\d+$")
h_score <- h_team_score |> str_extract("\\d+$")

paste("Visiting team was", v_team, "and their score was", v_score)
[1] "Visiting team was North Carolina and their score was 2"
paste("Home team was", h_team, "and their score was", h_score)
[1] "Home team was Texas and their score was 0"

Individual stats

There are four separate tables we need that are displayed on the “Individual” tab.

Get the tables

game_stats_tables <- game_stats_raw |>
  html_elements("section#individual-stats")|>
  minimal_html() |> html_table()

game_stats_tables
[[1]]
# A tibble: 23 × 8
   Pos      `#`      Player                        SH    SOG   G     A     MIN  
   <chr>    <chr>    <chr>                         <chr> <chr> <chr> <chr> <chr>
 1 Starters Starters "Starters"                    Star… Star… Star… Star… Star…
 2 D        5        "5 \r\n                     … 0     0     0     0     90   
 3 F        6        "6 \r\n                     … 5     1     0     1     61   
 4 M        10       "10 \r\n                    … 4     1     1     0     61   
 5 M        11       "11 \r\n                    … 0     0     0     0     66   
 6 F        12       "12 \r\n                    … 4     2     0     0     62   
 7 M        15       "15 \r\n                    … 0     0     0     0     66   
 8 D        16       "16 \r\n                    … 1     0     0     0     90   
 9 M        21       "21 \r\n                    … 1     1     1     0     66   
10 GK       23       "23 \r\n                    … 0     0     0     0     45   
# ℹ 13 more rows

[[2]]
# A tibble: 4 × 6
  Position      `#`           Goalie           Minutes       GA          Saves  
  <chr>         <chr>         <chr>            <chr>         <chr>       <chr>  
1 "Goalkeeping" "Goalkeeping" Goalkeeping      "Goalkeeping" Goalkeeping Goalke…
2 "GK"          "1"           Bryane Heaberlin "45:00"       0           2      
3 "GK"          "23"          Lindsey Harris   "45:00"       0           0      
4 ""            ""            Totals           ""            0           2      

[[3]]
# A tibble: 23 × 8
   Pos      `#`      Player                        SH    SOG   G     A     MIN  
   <chr>    <chr>    <chr>                         <chr> <chr> <chr> <chr> <chr>
 1 Starters Starters "Starters"                    Star… Star… Star… Star… Star…
 2 GK       1        "1 \r\n                     … 1     1     0     0     90   
 3 M        4        "4 \r\n                     … 1     1     0     0     90   
 4 M        5        "5 \r\n                     … 0     0     0     0     85   
 5 D        12       "12 \r\n                    … 0     0     0     0     90   
 6 D        13       "13 \r\n                    … 0     0     0     0     90   
 7 D        14       "14 \r\n                    … 0     0     0     0     90   
 8 F        17       "17 \r\n                    … 2     0     0     0     66   
 9 F        20       "20 \r\n                    … 1     0     0     0     89   
10 F        23       "23 \r\n                    … 1     0     0     0     68   
# ℹ 13 more rows

[[4]]
# A tibble: 3 × 6
  Position      `#`           Goalie      Minutes       GA          Saves      
  <chr>         <chr>         <chr>       <chr>         <chr>       <chr>      
1 "Goalkeeping" "Goalkeeping" Goalkeeping "Goalkeeping" Goalkeeping Goalkeeping
2 "GK"          "1"           Smith, Abby "90:00"       2           7          
3 ""            ""            Totals      ""            2           7          

Saves specific tables we need

v_pl_raw <- game_stats_tables[[1]] |> clean_names()
v_gk_raw <- game_stats_tables[[2]] |> clean_names()
h_pl_raw <- game_stats_tables[[3]] |> clean_names()
h_gk_raw <- game_stats_tables[[4]] |> clean_names()

Clean individual stats

A function to clean stats …

  • Adds a column to indicate starters vs subs
  • Removes totals and other headers
  # A function to handle some player stats cleaning
  clean_indi_stats <- function(df) {
    df |> mutate(
    start = case_when(
      player == "Starters" ~ "Starter",
      player == "Substitutes" ~ "Substitute",
      player == "Totals" ~ "Total",
      .default = NA
    ), .after = pos,
  ) |> fill(start) |> 
    filter(!player %in% c("Starters", "Substitutes", "Totals")) |> 
    mutate(
      date = m_date,
      site = m_site,
    )
  }

This takes the individual stats and runs them through the cleaning function above and then adds home and away team values.

v_pl <- v_pl_raw |> 
  clean_indi_stats() |> 
  mutate(team = v_team, vs = h_team)

v_pl
h_pl <- h_pl_raw |> 
  clean_indi_stats() |> 
  mutate(team = h_team, vs = v_team)

h_pl

Combine individual stats

Combines the visitor/home two player stats

pl <- bind_rows(v_pl, h_pl)

pl

Clean goalkeeper stats

This is the same process as above, but for goalkeepers. It’s a little less complicated.

v_gk <- v_gk_raw |>
  filter(!goalie %in% c("Goalkeeping", "Totals")) |> 
  mutate(team = v_team, vs = h_team, date = m_date, site = m_site)

h_gk <- h_gk_raw |>
  filter(!goalie %in% c("Goalkeeping", "Totals")) |> 
  mutate(team = h_team, vs = v_team, date = m_date, site = m_site)

v_gk
h_gk

Combine goalkeepers

gk <- bind_rows(v_gk, h_gk)

gk

Writing the exports

The actual exports are commented out so we don’t overwrite what came from above.

file_name_prefix <- paste(m_date, v_team, h_team, sep = "_") |> str_replace_all(" ", "_") |> str_replace_all("/", "-")

pl_export_path <- paste("data-raw/soccer/", "pl_", file_name_prefix, ".rds", sep = "")
gk_export_path <- paste("data-raw/soccer/", "gk_", file_name_prefix, ".rds", sep = "")

pl_export_path
[1] "data-raw/soccer/pl_2015-08-28_North_Carolina_Texas.rds"
gk_export_path
[1] "data-raw/soccer/gk_2015-08-28_North_Carolina_Texas.rds"
## Commented so it doesn't write anything out
# pl |> write_rds(pl_export_path)
# gk |> write_rds(gk_export_path)