MLS Player Rosters
I found out on May 2, 2024 from the Austin Chronicle’s Eric Goodman that MLS published roster information for all MLS players.
I downloaded the rosters in hopes of extracting the data. I cleaned it up and the files are available as CSVs below. The code is here.
As of May 3, 2024 this is a third pass and I’m sure it could benefit from more Quality Checks. I’ve found and fixed many problems and I wouldn’t be surprised if I missed things.
Known issues
There are no known issues at this time. If you see any problems add an issue in Github or ping me on X/Twitter @crit.
That said I do have some to-do lists:
- Turn the roster_type into a factor?
Example
Below is an image from one team plucked from the combined file.
Published file
- profiles.csv is the result of this cleaning effort, where there is a single row for each player with flags for their roster player designations and notes.
This is a glimpse of the first two records from Austin FC so you can see all the column names, data types and example values.
$ club_short <chr> "ATX", "ATX"
$ club <chr> "Austin FC", "Austin FC"
$ roster_type <chr> "SENIOR ROSTER", "SENIOR ROSTER"
$ name <chr> "Guilherme Biro", "Julio Cascante"
$ roster_designation <chr> NA, "TAM Player"
$ current_status <chr> NA, NA
$ contract_thru <chr> "2026", "2025"
$ option_years <chr> "2027", "2026"
$ type_dp <lgl> FALSE, FALSE
$ type_u22 <lgl> FALSE, FALSE
$ type_int <lgl> TRUE, FALSE
$ type_inj <lgl> FALSE, FALSE
$ type_una <lgl> FALSE, FALSE
$ notes_young <lgl> FALSE, FALSE
$ notes_unavail <lgl> FALSE, FALSE
$ notes_notam <lgl> FALSE, FALSE
$ notes_can <lgl> FALSE, FALSE
The profiles file was compiled from bits and pieces with lots of cleaning along the way. See “How this was done” below.
The roster types have been added as individual logical fields because a player can hold more than one type or designation. More about those …
Player types
If true, the player fits the player type designation listed on the right side of the original profiles.
type_dp
: DESIGNATED PLAYERtype_u22
: U22 INITIATIVE PLAYERtype_int
: INTERNATIONAL SLOTtype_inj
: SEASON-ENDING INJURYtype_una
: UNAVAILABLE PLAYER
Notes
If true, the player fits the notes parameter.
notes_young
(Young DP) Indicates a Young Designated Player. (This wasn’t really explained on the profiles).notes_unavail
*Indicates player is currently unavailable, and club may receive roster/international spot relief, but not Salary Budget relief unless otherwise determined pursuant to the loan agreement.notes_notam
^Player cannot be converted from a Designated Player to a non-Designated Player by using Targeted Allocation Money.notes_can
+In addition to the International Roster Slots, each Canadian Club is permitted to designate up to three International Players who have been under contract with MLS and registered with one or more Canadian clubs for at least one year who will not count toward the club’s International Roster Slots.- Off-roster Homegrown players can appear in MLS matches via a Short-Term Agreement.
How this was done
Pinpoint extract
I thought Google Pinpoint might be the easiest way to extract the data. Pinpoint wants individual files instead of one file with a bunch of pages, so I manually split the single document into individual files for each team. Once in Pinpoint I used Extract Structured Data tool with “table” processing to create two raw csv files:
data-raw/pinpoint/Rosters.csv
has the main roster that is on the left side of the page.data-raw/pinpoint/Other.csv
has all the other player designations listed on the right side of the page.
In the case of this “other” data I had to do more than one pass with Pinpoint to catch all the teams so there are multiple “raw” files. They are all in the repo in the data-raw folder.
Cleaning
All other cleaning was done in R in Quarto notebooks:
- Clean Roster: Cleans the rosters section
- Clean Other: Cleans the other designations
- Join files: Joins and cleans the two into the profiles
- Quality checks: Used as I was checking things
The cleaned csv files are exported into the data-out/
folder in the repo. There are also some .rds files in data-processed/