library(tidyverse)
library(janitor)
Clean other designations
The “Other” player type designations did not come out of pinpoint as cleanly as the main rosters. I have to pull them together based on different extractions. There have been other cleaning challenges to work through.
This is the second file I tackled, so there is more explanation of some techniques in Clean roster.
Setup
Import
I pull in different versions of these files that did better with different teams to make sure I have them all. Also the clubs file to compare with.
<- read_rds("data-processed/clubs.rds")
clubs <- read_csv("data-raw/pinpoint/Other.csv") |> clean_names() others_01
Rows: 761 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): File Name, NO., NAME, Validation Link
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
<- read_csv("data-raw/pinpoint/Other_2.csv") |> clean_names() others_02
Rows: 663 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): File Name, NO., NAME, Validation Link
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Make short club name function
When I pulled out the individual files from the main PDF I named them based on their club. Here I use that name to build a short club name. Because I have to do this with several files, I made a function to do it.
<- function(tib) {
club_name |>
tib mutate(
club_short = str_extract(file_name, "^\\w*") |> str_to_upper()
|>
) relocate(club_short) |>
select(!c(file_name, validation_link))
}
So I do this for our files …
<- club_name(others_01)
others_club_01 <- club_name(others_02) others_club_02
Check the teams
I know from working with this that I’m missing teams in my pinpoint exports. I work through fixing that here.
The idea is:
- compare with with the full list to find difference
- search other files to see if those different teams are there
Get this list of missing files from the first set
1<- clubs |>
missing_clubs 2anti_join(others_club_01 |> select(club_short) |> distinct())
3<- missing_clubs |> pull(club_short)
miss_club_list
miss_club_list
- 1
- Take my list of clubs and compare them to the distinct() clubs available.
- 2
-
I use
anti_join
to find the teams that are not in the first file by comparing to the distinct names in our data. - 3
- I save that list of missing teams so I can use it later.
[1] "LAFC" "NE" "VAN"
I think doublecheck that those teams really are missing. It should be zero rows.
|>
others_club_01 filter(club_short %in% miss_club_list)
Check to see if missing clubs in new files
Here I take the second file and see if the missing teams are in it.
|>
others_club_02 filter(club_short %in% miss_club_list) |> select(club_short) |> distinct()
And indeed they are so we can use just these two files to build one good file.
Combine missing rows to first file
Here I filter the second file to have just our missing teams, then bind them to our original data so we have a full set.
<- others_club_02 |>
others_combo filter(club_short %in% miss_club_list) |>
bind_rows(others_club_01)
# test that I have them all
|>
others_combo distinct(club_short) |> arrange(club_short)
Adding player_type variable
This sets the player category. This is fully explained in roster cleaning.
<- others_combo |>
others_type filter(name != "NAME") |>
mutate(
player_type = case_when(
is.na(lag(club_short)) ~ "INTERNATIONAL SLOT",
lag(club_short) != club_short ~ "INTERNATIONAL SLOT",
str_detect(name, "[A-Z]{3}") ~ name |> str_trim(),
.default = NA
)|>
) fill(player_type) |>
filter(str_detect(no, "^\\d")) |>
relocate(player_type, .after = club_short)
others_type
Build a clean name
And now I clean the names to remove the character that signifies the notes. I print them out to check them. We do this here because it might help us later.
<- others_type |>
others_name_fixes mutate(
name_clean = str_remove_all(name, "\\(Young DP\\)|\\^|\\*|\\+") |> str_trim()
|>
) relocate(name_clean, .before = name)
others_name_fixes
Set player type variables
We need to add a t/f value for each of the player designation types.
Let’s first see what they are:
|>
others_name_fixes count(player_type)
Then lets add variables based on them.
<- others_name_fixes |>
others_types mutate(
type_dp = if_else(player_type == "DESIGNATED PLAYERS", T, F),
type_u22 = if_else(player_type == "INITIATIVE PLAYERS", T, F),
type_int = if_else(player_type == "INTERNATIONAL SLOT", T, F),
type_inj = if_else(player_type == "SEASON-ENDING INJURY LIST", T, F),
type_una = if_else(player_type == "UNAVAILABLE PLAYERS", T, F),
)
others_types
Add notes variables
There are several note designations that players can have in these lists. This section creates categories based on those notes. See the index to learn more about what the notes mean.
<- others_types |>
others_notes mutate(
notes_young = if_else(str_detect(name, "Young DP"), TRUE, FALSE),
notes_unavail = if_else(str_detect(name, "\\*"), TRUE, FALSE),
notes_notam = if_else(str_detect(name, "\\^"), TRUE, FALSE),
notes_can = if_else(str_detect(name, "\\+"), TRUE, FALSE),
)
others_notes
|>
others_notes filter(name_clean == "Thiago Almada")
Compress the records
The challenge here is I need to combine rows to where there is one row for each player, but has all the correct “TRUE” values for all the notes and types.
Let’s first show the challenge. Note how there is more than one row for Giorgios Giakoumakis and not all the type_
and notes_
variables are consistent.
|>
others_notes select(!c(player_type, no, name)) |>
arrange(club_short, name_clean) |>
head(10)
And now we fix it … we first git rid of some columns we don’t need, then use summarize_if(is.logical, any)
to collapse the rows to keep all the true values. H/T to Thomas Cason for this tip.
<- others_notes |>
others_compressed select(!c(player_type, no, name)) |>
group_by(club_short, name_clean) |> summarize_if(is.logical, any)
others_compressed
Tighten columns
Actually I need to do is rename a column.
<- others_compressed |>
others_clean rename(name = name_clean)
|> glimpse() others_clean
Rows: 337
Columns: 11
Groups: club_short [29]
$ club_short <chr> "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", …
$ name <chr> "Aiden McFadden", "Bartosz Slisz", "Edwin Mosquera", "Er…
$ type_dp <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F…
$ type_u22 <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FA…
$ type_int <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE…
$ type_inj <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ type_una <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FAL…
$ notes_young <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ notes_unavail <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, F…
$ notes_notam <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F…
$ notes_can <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
Export
Export csv and rds versions of the files.
|> write_csv("data-out/others.csv")
others_clean |> write_rds("data-processed/others.rds") others_clean