library(tidyverse)
library(janitor)
Clean rosters
This notebook is to clean the main roster listed on the left part of the PDF.
See the home page about using Pinpoint to get to this point.
Setup
Import
This pulls in the first file that was exported from Pinpoint.
<- read_csv("data-raw/pinpoint/Roster.csv") |> clean_names() roster_raw
Rows: 929 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): File Name, NAME, ROSTER DESIGNATION, CURRENT STATUS, CONTRACT THRU,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Roster cleaning
Make short club name
I first need to build a club name of some type, which I’m doing from the file name. I’m basing the short code based on another dataset I have mls-salaries.
I also move around the new column and remove the Pinpoint link variable no one can use anyway.
<- roster_raw |>
add_short_name mutate(
club_short = str_extract(file_name, "^\\w*") |> str_to_upper()
|>
) relocate(club_short) |>
select(!c(file_name, validation_link))
|> glimpse() add_short_name
Rows: 929
Columns: 6
$ club_short <chr> "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "A…
$ name <chr> "Luis Abram", "Thiago Almada", "Josh Cohen", "Giorg…
$ roster_designation <chr> "TAM Player", "Young Designated Player", NA, "Desig…
$ current_status <chr> NA, NA, NA, NA, NA, NA, NA, "Unavailable - On Loan"…
$ contract_thru <chr> "2025", "2026", "2025", "2025", "2027", "2024", "20…
$ option_years <chr> "2026", NA, "2026", "2026", "2028", "2025", "2025",…
Make a club name
<- add_short_name |>
add_long_name mutate(
club = case_match(
club_short,"ATL" ~ "Atlanta United",
"ATX" ~ "Austin FC",
"CHI" ~ "Chicago Fire FC",
"CIN" ~ "FC Cincinnati",
"CLB" ~ "Columbus Crew",
"CLT" ~ "Charlotte FC",
"COL" ~ "Colorado Rapids",
"DAL" ~ "FC Dallas",
"DC" ~ "D.C. United",
"HOU" ~ "Houston Dynamo FC",
"LA" ~ "LA Galaxy",
"LAFC" ~ "Los Angeles Football Club",
"MIA" ~ "Inter Miami FC",
"MIN" ~ "Minnesota United FC",
"MTL" ~ "CF Montréal",
"NE" ~ "New England Revolution",
"NSH" ~ "Nashville FC",
"NY" ~ "New York Red Bulls",
"NYC" ~ "New York City FC",
"ORL" ~ "Orlando City DC",
"PHI" ~ "Philadelphia Union",
"POR" ~ "Portland Timbers",
"RSL" ~ "Real Salt Lake",
"SEA" ~ "Seatle Sounders FC",
"SJ" ~ "San Jose Earthquakes",
"SKC" ~ "Sporting Kansas City",
"STL" ~ "St. Louis City SC",
"TOR" ~ "Toronto FC",
"VAN" ~ "Vancouver Whitecaps FC",
.default = NA
)|>
) relocate(club, .after = club_short)
add_long_name
Make roster designation
The roster is really split into different categories with an overline header. That’s bad data management so I create a roster_type
variable to say what category they are in.
I have to do a lot of work to get the player categories right. You’d have to run the code line by line to really see it, but I annotated the lines below.
<- add_long_name |>
roster_subtype mutate(
1roster_type = case_when(
2is.na(lag(club_short)) ~ "SENIOR ROSTER",
3lag(club_short) != club_short ~ "SENIOR ROSTER",
4str_detect(name, "[A-Z]{3}") ~ name,
.default = NA
)|>
) 5fill(roster_type) |>
6relocate(roster_type, .after = club) |>
7filter(!str_detect(name, "[A-Z]{3}"))
roster_subtype
- 1
-
Make a
roster_type
variable - 2
-
I use
lag()
to basically find the first empty row which is the top row and setroster_type
to “SENIOR ROSTER”. This is just for the first row of the data. There are probably better ways to do this. - 3
-
Here I find when a new row has a different team using
lag()
, and if so set that row to SENIOR ROSTER. This is a typical use of lag. - 4
-
If the name variable has three capital letters in a row then it is one of the categories (always in all caps) so I use
name
value inroster_type
. This is telling me when a new roster category is starting. - 5
-
This
fill()
takes those roster categories and fills them down the blank rows until it hits another existing category. This gives every play their category. - 6
- I move the new column to closer to the front of the tibble.
- 7
- I remove the category headers since we don’t need those anymore.
Supplemental spot name
Here we rename SUPPLEMENTAL SPOT 31 to just SUPPLEMENTAL SPOT. It was inconsistent and I feel like the “31” needs explanation anyway. (No, I dunno what it means yet, but it is the only number used.)
<- roster_subtype |>
roster_supp mutate(
roster_type = case_match(
roster_type,"SUPPLEMENTAL SPOT 31" ~ "SUPPLEMENTAL SPOT",
.default = roster_type
)
)
|> count(roster_type) roster_supp
Stray pipe in value
@philwest noted there is a stray | character in one of the fields. I fix that here.
<- roster_supp |>
roster_desig_fix mutate(
roster_designation = str_remove(roster_designation, "\\|") |> str_trim()
)
Make a list of teams
When I later work on the “other” player designations, I need to see if I have all the teams represented in my files. I create a list of those teams here so I can use it later.
<- roster_desig_fix |>
clubs distinct(club_short)
clubs
Export the files
Saving our the csv file for posterity. The rds file is the same data but in R’s native format. Also save out the clubs for later.
|> write_csv("data-out/rosters.csv")
roster_desig_fix |> write_rds("data-processed/rosters.rds")
roster_desig_fix |> write_rds("data-processed/clubs.rds") clubs