Quality checks

Because we are extracting this data from PDFs, I used this notebook to troubleshoot the results and make sure the cleaned data is complete.

Fixed issues

  • In the Cleaning roster notebook, roster_type == "SUPPLEMENTAL SPOT 31" was changed to just "SUPPLEMENTAL SPOT" for consistency. The roster has the "31" in all cases, but it seems odd to keep it here. I can change it back later if needed.
  • In others, some players were initially missing. I ended up piecing everything together in Cleaning other.
  • Also in others, I had to rework how players were assigned their types and notes, because in the original list a player can appear more than once with different designations. I had to collapse those repeated listings.
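A minimal sketch of what collapsing repeated listings can look like. This is not the code from Cleaning other; the listings tibble and its player names are hypothetical, and only the type_dp and type_int column names are borrowed from the others data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: "Player A" is listed twice with different designations.
listings <- tribble(
  ~club_short, ~name,      ~type_dp, ~type_int,
  "ATL",       "Player A", TRUE,     FALSE,
  "ATL",       "Player A", FALSE,    TRUE,
  "ATL",       "Player B", FALSE,    TRUE
)

# Collapse to one row per club and player.
# A flag ends up TRUE if any of that player's listings set it.
collapsed <- listings |>
  group_by(club_short, name) |>
  summarise(across(starts_with("type_"), any), .groups = "drop")

collapsed
```

Using any() means a designation is never lost when rows are merged, which is the behavior you would want if each listing only adds information about a player.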

There aren’t any known issues at the moment, but I’m keeping this notebook around in case new ones come up.

Setup

library(tidyverse)
library(janitor)

Import the rosters

rosters <- read_rds("data-processed/rosters.rds")

rosters |> glimpse()
Rows: 868
Columns: 8
$ club_short         <chr> "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "A…
$ club               <chr> "Atlanta United", "Atlanta United", "Atlanta United…
$ roster_type        <chr> "SENIOR ROSTER", "SENIOR ROSTER", "SENIOR ROSTER", …
$ name               <chr> "Luis Abram", "Thiago Almada", "Josh Cohen", "Giorg…
$ roster_designation <chr> "TAM Player", "Young Designated Player", NA, "Desig…
$ current_status     <chr> NA, NA, NA, NA, NA, NA, NA, "Unavailable - On Loan"…
$ contract_thru      <chr> "2025", "2026", "2025", "2025", "2027", "2024", "20…
$ option_years       <chr> "2026", NA, "2026", "2026", "2028", "2025", "2025",…

Do I have all the teams?

rosters |> 
  count(club_short)

There are 29 teams in MLS as of May 2, 2024, so the count above should return 29 rows.
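The team count can also be checked without reading the table by eye. Here is a rough sketch, assuming a hypothetical helper called check_team_count() (it is not part of this project) and a toy tibble in place of the real rosters data.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: error out if the number of distinct clubs is not what we expect.
check_team_count <- function(df, expected = 29) {
  found <- n_distinct(df$club_short)
  if (found != expected) {
    stop("Expected ", expected, " clubs but found ", found, call. = FALSE)
  }
  invisible(found)
}

# Toy data with three clubs, so the expected count here is 3, not 29.
toy <- tibble(club_short = c("ATL", "ATL", "ATX", "PHI"))
check_team_count(toy, expected = 3)
```

With the real data the call would presumably be check_team_count(rosters), which relies on the default of 29 and would need updating if the league adds teams.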

Let’s spot check some teams.

rosters |> filter(club_short == "PHI")

Others

We look at the others file here.

others <- read_rds("data-processed/others.rds")

others |> glimpse()
Rows: 337
Columns: 11
Groups: club_short [29]
$ club_short    <chr> "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", …
$ name          <chr> "Aiden McFadden", "Bartosz Slisz", "Edwin Mosquera", "Er…
$ type_dp       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F…
$ type_u22      <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FA…
$ type_int      <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE…
$ type_inj      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ type_una      <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FAL…
$ notes_young   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ notes_unavail <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, F…
$ notes_notam   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F…
$ notes_can     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …

Missing teams

At one point we were missing teams. Here we make sure there are 29.

others |> 
  count(club_short)

Checking season-ending list

I added this check because injured players were missing at one point. Here I confirm that players flagged with type_inj are present.

others |> 
  filter(type_inj == TRUE)
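One way to take this further is to look for flagged players who have no matching roster row. The sketch below uses hypothetical toy tibbles (rosters_toy, others_toy) and assumes club_short and name are a workable join key. Whether an unmatched name is a real problem depends on how the source lists are built, so treat the result as something to review rather than an error.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the rosters and others data.
rosters_toy <- tibble(
  club_short = c("ATL", "ATL"),
  name       = c("Player A", "Player B")
)
others_toy <- tibble(
  club_short = c("ATL", "ATL"),
  name       = c("Player A", "Player C"),
  type_inj   = c(TRUE, TRUE)
)

# Keep injured players, then keep only those with no match in the roster.
unmatched <- others_toy |>
  filter(type_inj) |>
  anti_join(rosters_toy, by = c("club_short", "name"))

unmatched
```

In this toy example only "Player C" comes back, since that name does not appear in rosters_toy.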

Profiles check

The last check covers the profiles data, which carries both the roster columns and the type and notes flags.

profiles <- read_rds("data-processed/profiles.rds")

profiles |> filter(club_short == "CLB")

Example for index page

profiles |> filter(club_short == "ATX") |> 
  head(2) |> glimpse()
Rows: 2
Columns: 17
$ club_short         <chr> "ATX", "ATX"
$ club               <chr> "Austin FC", "Austin FC"
$ roster_type        <chr> "SENIOR ROSTER", "SENIOR ROSTER"
$ name               <chr> "Guilherme Biro", "Julio Cascante"
$ roster_designation <chr> NA, "TAM Player"
$ current_status     <chr> NA, NA
$ contract_thru      <chr> "2026", "2025"
$ option_years       <chr> "2027", "2026"
$ type_dp            <lgl> FALSE, FALSE
$ type_u22           <lgl> FALSE, FALSE
$ type_int           <lgl> TRUE, FALSE
$ type_inj           <lgl> FALSE, FALSE
$ type_una           <lgl> FALSE, FALSE
$ notes_young        <lgl> FALSE, FALSE
$ notes_unavail      <lgl> FALSE, FALSE
$ notes_notam        <lgl> FALSE, FALSE
$ notes_can          <lgl> FALSE, FALSE

How many teams

profiles |> 
  count(club_short)

Stray header?

If you sort by roster designation, you’ll see a stray | next to a U22 designated player at the top. Otherwise the sort shows DPs first, in a list you can then sort secondarily by team, which is very helpful.

profiles |> filter(type_u22 == TRUE) |> 
  filter(name == "Maximiliano David Ayala")
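To find every row with a stray pipe instead of looking up one player by name, something like the following could work. It is only a sketch: the toy tibble and its designation values are made up, and it assumes the stray | ends up in roster_designation.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data; the designation strings are invented for illustration.
toy <- tibble(
  name = c("Player A", "Player B", "Player C"),
  roster_designation = c("| U22 Initiative", "Designated Player", NA)
)

# fixed("|") matches a literal pipe instead of treating it as regex alternation.
# Rows with NA designations are dropped by filter().
stray <- toy |>
  filter(str_detect(roster_designation, fixed("|")))

stray
```

Wrapping the pattern in fixed() matters here, because a bare "|" is the regex "or" operator and would not be treated as a literal character.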