Clean other designations

The “Other” player type designations did not come out of pinpoint as cleanly as the main rosters. I have to pull them together based on different extractions. There have been other cleaning challenges to work through.

This is the second file I tackled, so there is more explanation of some techniques in Clean roster.

Setup

library(tidyverse)
library(janitor)

Import

I pull in different versions of these files that did better with different teams to make sure I have them all. Also the clubs file to compare with.

clubs <-  read_rds("data-processed/clubs.rds")
others_01 <- read_csv("data-raw/pinpoint/Other.csv") |> clean_names()

Rows: 761 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): File Name, NO., NAME, Validation Link

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

others_02 <- read_csv("data-raw/pinpoint/Other_2.csv") |> clean_names()

Rows: 663 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): File Name, NO., NAME, Validation Link

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Make short club name function

When I pulled out the individual files from the main PDF I named them based on their club. Here I use that name to build a short club name. Because I have to do this with several files, I made a function to do it.

club_name <- function(tib) {
  tib |> 
  mutate(
    club_short = str_extract(file_name, "^\\w*") |> str_to_upper()
  ) |>
  relocate(club_short) |> 
  select(!c(file_name, validation_link))
}

So I do this for our files …

others_club_01 <- club_name(others_01)
others_club_02 <- club_name(others_02)

Check the teams

I know from working with this that I’m missing teams in my pinpoint exports. I work through fixing that here.

The idea is:

compare with with the full list to find difference
search other files to see if those different teams are there

Get this list of missing files from the first set

1missing_clubs <- clubs |>
2  anti_join(others_club_01 |> select(club_short) |> distinct())

3miss_club_list <- missing_clubs |> pull(club_short)

miss_club_list

1: Take my list of clubs and compare them to the distinct() clubs available.
2: I use anti_join to find the teams that are not in the first file by comparing to the distinct names in our data.
3: I save that list of missing teams so I can use it later.

[1] "LAFC" "NE"   "VAN"

I think doublecheck that those teams really are missing. It should be zero rows.

others_club_01 |> 
  filter(club_short %in% miss_club_list)

Check to see if missing clubs in new files

Here I take the second file and see if the missing teams are in it.

others_club_02 |> 
  filter(club_short %in% miss_club_list) |> select(club_short) |> distinct()

And indeed they are so we can use just these two files to build one good file.

Combine missing rows to first file

Here I filter the second file to have just our missing teams, then bind them to our original data so we have a full set.

others_combo <- others_club_02 |> 
  filter(club_short %in% miss_club_list) |> 
  bind_rows(others_club_01)

# test that I have them all
others_combo |> 
  distinct(club_short) |> arrange(club_short)

Adding player_type variable

This sets the player category. This is fully explained in roster cleaning.

others_type <- others_combo |> 
  filter(name != "NAME") |>
  mutate(
    player_type = case_when(
      is.na(lag(club_short)) ~ "INTERNATIONAL SLOT",
      lag(club_short) != club_short ~ "INTERNATIONAL SLOT",
      str_detect(name, "[A-Z]{3}") ~ name |> str_trim(),
      .default = NA
    )
  ) |>
  fill(player_type) |> 
  filter(str_detect(no, "^\\d")) |>
  relocate(player_type, .after = club_short)


others_type

Build a clean name

And now I clean the names to remove the character that signifies the notes. I print them out to check them. We do this here because it might help us later.

others_name_fixes <- others_type |> 
  mutate(
    name_clean = str_remove_all(name, "\\(Young DP\\)|\\^|\\*|\\+") |> str_trim()
  ) |>
  relocate(name_clean, .before = name)

others_name_fixes

Set player type variables

We need to add a t/f value for each of the player designation types.

Let’s first see what they are:

others_name_fixes |> 
  count(player_type)

Then lets add variables based on them.

others_types <- others_name_fixes |> 
  mutate(
    type_dp = if_else(player_type == "DESIGNATED PLAYERS", T, F),
    type_u22 = if_else(player_type == "INITIATIVE PLAYERS", T, F),
    type_int = if_else(player_type == "INTERNATIONAL SLOT", T, F),
    type_inj = if_else(player_type == "SEASON-ENDING INJURY LIST", T, F),
    type_una = if_else(player_type == "UNAVAILABLE PLAYERS", T, F),
  )

others_types

Add notes variables

There are several note designations that players can have in these lists. This section creates categories based on those notes. See the index to learn more about what the notes mean.

others_notes <- others_types |> 
  mutate(
    notes_young = if_else(str_detect(name, "Young DP"), TRUE, FALSE),
    notes_unavail = if_else(str_detect(name, "\\*"), TRUE, FALSE),
    notes_notam = if_else(str_detect(name, "\\^"), TRUE, FALSE),
    notes_can =  if_else(str_detect(name, "\\+"), TRUE, FALSE),
  )

others_notes

others_notes |> 
  filter(name_clean == "Thiago Almada")

Compress the records

The challenge here is I need to combine rows to where there is one row for each player, but has all the correct “TRUE” values for all the notes and types.

Let’s first show the challenge. Note how there is more than one row for Giorgios Giakoumakis and not all the type_ and notes_ variables are consistent.

others_notes |> 
  select(!c(player_type, no, name)) |> 
  arrange(club_short, name_clean) |> 
  head(10)

And now we fix it … we first git rid of some columns we don’t need, then use summarize_if(is.logical, any) to collapse the rows to keep all the true values. H/T to Thomas Cason for this tip.

others_compressed <- others_notes |> 
  select(!c(player_type, no, name)) |> 
  group_by(club_short, name_clean) |>  summarize_if(is.logical, any)

others_compressed

Tighten columns

Actually I need to do is rename a column.

others_clean <- others_compressed |> 
  rename(name = name_clean)

others_clean |> glimpse()

Rows: 337
Columns: 11
Groups: club_short [29]
$ club_short    <chr> "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", …
$ name          <chr> "Aiden McFadden", "Bartosz Slisz", "Edwin Mosquera", "Er…
$ type_dp       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F…
$ type_u22      <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FA…
$ type_int      <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE…
$ type_inj      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ type_una      <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FAL…
$ notes_young   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ notes_unavail <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, F…
$ notes_notam   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F…
$ notes_can     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …

Export

Export csv and rds versions of the files.

others_clean |> write_csv("data-out/others.csv")
others_clean |> write_rds("data-processed/others.rds")