Clean rosters

This notebook is to clean the main roster listed on the left part of the PDF.

See the home page about using Pinpoint to get to this point.

Setup

library(tidyverse)
library(janitor)

Import

This pulls in the first file that was exported from Pinpoint.

roster_raw <- read_csv("data-raw/pinpoint/Roster.csv") |> clean_names()

Rows: 929 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): File Name, NAME, ROSTER DESIGNATION, CURRENT STATUS, CONTRACT THRU,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Roster cleaning

Make short club name

I first need to build a club name of some type, which I’m doing from the file name. I’m basing the short code based on another dataset I have mls-salaries.

I also move around the new column and remove the Pinpoint link variable no one can use anyway.

add_short_name <- roster_raw |> 
  mutate(
    club_short = str_extract(file_name, "^\\w*") |> str_to_upper()
  ) |>
  relocate(club_short) |> 
  select(!c(file_name, validation_link))

add_short_name |> glimpse()

Rows: 929
Columns: 6
$ club_short         <chr> "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "A…
$ name               <chr> "Luis Abram", "Thiago Almada", "Josh Cohen", "Giorg…
$ roster_designation <chr> "TAM Player", "Young Designated Player", NA, "Desig…
$ current_status     <chr> NA, NA, NA, NA, NA, NA, NA, "Unavailable - On Loan"…
$ contract_thru      <chr> "2025", "2026", "2025", "2025", "2027", "2024", "20…
$ option_years       <chr> "2026", NA, "2026", "2026", "2028", "2025", "2025",…

Make a club name

add_long_name <- add_short_name |> 
  mutate(
    club = case_match(
      club_short,
      "ATL" ~ "Atlanta United",
      "ATX" ~ "Austin FC",
      "CHI" ~ "Chicago Fire FC",
      "CIN" ~ "FC Cincinnati",
      "CLB" ~ "Columbus Crew",
      "CLT" ~ "Charlotte FC",
      "COL" ~ "Colorado Rapids",
      "DAL" ~ "FC Dallas",
      "DC" ~ "D.C. United",
      "HOU" ~ "Houston Dynamo FC",
      "LA" ~ "LA Galaxy",
      "LAFC" ~ "Los Angeles Football Club",
      "MIA" ~ "Inter Miami FC",
      "MIN" ~ "Minnesota United FC",
      "MTL" ~ "CF Montréal",
      "NE" ~ "New England Revolution",
      "NSH" ~ "Nashville FC",
      "NY" ~ "New York Red Bulls",
      "NYC" ~ "New York City FC",
      "ORL" ~ "Orlando City DC",
      "PHI" ~ "Philadelphia Union",
      "POR" ~ "Portland Timbers",
      "RSL" ~ "Real Salt Lake",
      "SEA" ~ "Seatle Sounders FC",
      "SJ" ~ "San Jose Earthquakes",
      "SKC" ~ "Sporting Kansas City",
      "STL" ~ "St. Louis City SC",
      "TOR" ~ "Toronto FC",
      "VAN" ~ "Vancouver Whitecaps FC",
      .default = NA
    )
  ) |> 
  relocate(club, .after = club_short)

add_long_name

Make roster designation

The roster is really split into different categories with an overline header. That’s bad data management so I create a roster_type variable to say what category they are in.

I have to do a lot of work to get the player categories right. You’d have to run the code line by line to really see it, but I annotated the lines below.

roster_subtype <- add_long_name |> 
  mutate(
1    roster_type = case_when(
2      is.na(lag(club_short)) ~ "SENIOR ROSTER",
3      lag(club_short) != club_short ~ "SENIOR ROSTER",
4      str_detect(name, "[A-Z]{3}") ~ name,
      .default = NA
    )
  ) |>
5  fill(roster_type) |>
6  relocate(roster_type, .after = club) |>
7  filter(!str_detect(name, "[A-Z]{3}"))

roster_subtype

1: Make a roster_type variable
2: I use lag() to basically find the first empty row which is the top row and set roster_type to “SENIOR ROSTER”. This is just for the first row of the data. There are probably better ways to do this.
3: Here I find when a new row has a different team using lag(), and if so set that row to SENIOR ROSTER. This is a typical use of lag.
4: If the name variable has three capital letters in a row then it is one of the categories (always in all caps) so I use name value in roster_type. This is telling me when a new roster category is starting.
5: This fill() takes those roster categories and fills them down the blank rows until it hits another existing category. This gives every play their category.
6: I move the new column to closer to the front of the tibble.
7: I remove the category headers since we don’t need those anymore.

Supplemental spot name

Here we rename SUPPLEMENTAL SPOT 31 to just SUPPLEMENTAL SPOT. It was inconsistent and I feel like the “31” needs explanation anyway. (No, I dunno what it means yet, but it is the only number used.)

roster_supp <- roster_subtype |> 
  mutate(
    roster_type = case_match(
      roster_type,
      "SUPPLEMENTAL SPOT 31" ~ "SUPPLEMENTAL SPOT",
      .default = roster_type
    )
  )

roster_supp |> count(roster_type)

Stray pipe in value

@philwest noted there is a stray | character in one of the fields. I fix that here.

roster_desig_fix <- roster_supp |> 
  mutate(
    roster_designation = str_remove(roster_designation, "\\|") |> str_trim()
  )

Make a list of teams

When I later work on the “other” player designations, I need to see if I have all the teams represented in my files. I create a list of those teams here so I can use it later.

clubs <- roster_desig_fix |> 
  distinct(club_short)

clubs

Export the files

Saving our the csv file for posterity. The rds file is the same data but in R’s native format. Also save out the clubs for later.

roster_desig_fix |> write_csv("data-out/rosters.csv")
roster_desig_fix |> write_rds("data-processed/rosters.rds")
clubs |> write_rds("data-processed/clubs.rds")