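The setup chunk is collapsed in the rendered notebook. Here is a minimal sketch of the packages the code below relies on; the actual chunk may load more:

# Sketch of the setup chunk, based on the functions used throughout this notebook
library(tidyverse) # readr, dplyr, purrr, stringr, tibble
library(lubridate) # mdy(), mdy_hms(), hour()
library(janitor)   # clean_names()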
# download.file("https://docs.google.com/spreadsheets/d/e/2PACX-1vRQBIKas7s5Wwe4bd9tMlAheCsjqeFU2JMZfk6h8Hc_QH6Dx02SEZmP6ieny13hCyZQosmsRirDEu9P/pub?output=csv", "data-original/outdoor-jail-logs-urls.csv")This notebook compiles data collected from one week of outdoor temperature logs collected by the Texas Department of Criminal Justice. The dates considered were July 24, 2023 through July 31, 2023. The logs are hand-written and data could not be accurately pulled from them using AI, so a group of Media Innovaiton Group fellows transcribed the week of logs for each prison we had records for, XXX in total. Those records were acquired through a public information request by Lauren McGaughey of the Texas Newsroom public radio collaborative.
We created a Google Spreadsheet file for each prison, recording the temperatures for each hour as best we could decipher them. Here we download a tracking spreadsheet, which is then used to download all of the individual log sheets and combine them into a single file. We then prepare that file for analysis.
This sheet is in MIG Data > Jail Logs > Outdoor Logs Data. It is called Outdoor Logs Tracking Sheet.
Options are set to eval: false to avoid re-downloading. Set to true to update.
# download.file("https://docs.google.com/spreadsheets/d/e/2PACX-1vRQBIKas7s5Wwe4bd9tMlAheCsjqeFU2JMZfk6h8Hc_QH6Dx02SEZmP6ieny13hCyZQosmsRirDEu9P/pub?output=csv", "data-original/outdoor-jail-logs-urls.csv")Read in the tracking sheet url
log_urls <- read_csv("data-original/outdoor-jail-logs-urls.csv") |> clean_names()
log_urls |> glimpse()

Rows: 101
Columns: 7
$ unit_name <chr> "Bell", "Byrd", "Diboll", "Duncan", "Ellis", "Estelle", "Fer…
$ unit_code <chr> "CV", "DU", "DO", "N6", "E1", "E2", "FE", "GG", "GR", "NF", …
$ region <chr> "I", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I", "I", …
$ slug <chr> NA, "r1-byrd", NA, "r1-duncan", "r1-ellis", "r1-estelle", "r…
$ url <chr> NA, "https://docs.google.com/spreadsheets/d/e/2PACX-1vROqTyA…
$ notes_gid <dbl> NA, 1133970569, NA, 535687266, 1770803795, 1551327169, 96070…
$ notes <chr> NA, NA, "closely linked to Duncan, pre-release", NA, NA, NA,…
In some cases we had the URLs formatted incorrectly, pointing at a pubhtml page instead of a CSV export. The result should be 0 rows.
log_urls |>
filter(str_detect(url, "pubhtml"))This code below was courtesy of ChatGPT, based on the following prompt: “I’m using R. I have a data frame that has a list of urls and slugs, which are short file names. Please write me a loop or other code that will download all the files at those urls, with their names as the slug. These are all csv files.” There were some edits.
Options are set to eval: false to avoid re-downloading. Set to true to update.
# set our urls to df
df <- log_urls
# Download loop
for (i in seq_len(nrow(df))) {
  url <- df$url[i]
  slug <- df$slug[i]
  file_name <- paste0("data-original/2023_logs/", slug, ".csv")

  if (is.na(url)) {
    # message(paste("No URL for slug:", slug))
    next
  } else {
    tryCatch({
      download.file(url, destfile = file_name, mode = "wb")
    }, error = function(e) {
      message(paste("Failed to download:", url, "-", e$message, "\n"))
    })
  }
}

Make a list of files:
logs_list <- list.files(
  "data-original/2023_logs",
  pattern = ".csv",
  full.names = TRUE
)

logs_list

 [1] "data-original/2023_logs/r1-byrd.csv"
[2] "data-original/2023_logs/r1-duncan.csv"
[3] "data-original/2023_logs/r1-ellis.csv"
[4] "data-original/2023_logs/r1-estelle.csv"
[5] "data-original/2023_logs/r1-ferguson.csv"
[6] "data-original/2023_logs/r1-goodman.csv"
[7] "data-original/2023_logs/r1-goree.csv"
[8] "data-original/2023_logs/r1-holliday.csv"
[9] "data-original/2023_logs/r1-huntsville.csv"
[10] "data-original/2023_logs/r1-lewis.csv"
[11] "data-original/2023_logs/r1-polunsky.csv"
[12] "data-original/2023_logs/r1-wainwright.csv"
[13] "data-original/2023_logs/r1-wynne.csv"
[14] "data-original/2023_logs/r2-beto.csv"
[15] "data-original/2023_logs/r2-boyd.csv"
[16] "data-original/2023_logs/r2-choice-moore.csv"
[17] "data-original/2023_logs/r2-cole.csv"
[18] "data-original/2023_logs/r2-hutchins.csv"
[19] "data-original/2023_logs/r2-johnston.csv"
[20] "data-original/2023_logs/r2-michael.csv"
[21] "data-original/2023_logs/r2-powledge.csv"
[22] "data-original/2023_logs/r2-skyview.csv"
[23] "data-original/2023_logs/r2-telford.csv"
[24] "data-original/2023_logs/r3-clemens.csv"
[25] "data-original/2023_logs/r3-gist.csv"
[26] "data-original/2023_logs/r3-henley.csv"
[27] "data-original/2023_logs/r3-hightower.csv"
[28] "data-original/2023_logs/r3-hospital-galveston.csv"
[29] "data-original/2023_logs/r3-jester-3.csv"
[30] "data-original/2023_logs/r3-kegans.csv"
[31] "data-original/2023_logs/r3-leblanc.csv"
[32] "data-original/2023_logs/r3-lychner.csv"
[33] "data-original/2023_logs/r3-memorial.csv"
[34] "data-original/2023_logs/r3-plane.csv"
[35] "data-original/2023_logs/r3-ramsey.csv"
[36] "data-original/2023_logs/r3-scott.csv"
[37] "data-original/2023_logs/r3-stiles.csv"
[38] "data-original/2023_logs/r3-stringfellow.csv"
[39] "data-original/2023_logs/r3-terrell.csv"
[40] "data-original/2023_logs/r3-vance.csv"
[41] "data-original/2023_logs/r3-young.csv"
[42] "data-original/2023_logs/r4-briscoe.csv"
[43] "data-original/2023_logs/r4-connally.csv"
[44] "data-original/2023_logs/r4-cotulla.csv"
[45] "data-original/2023_logs/r4-dominguez.csv"
[46] "data-original/2023_logs/r4-fort-stockton.csv"
[47] "data-original/2023_logs/r4-garza-west.csv"
[48] "data-original/2023_logs/r4-glossbrenner.csv"
[49] "data-original/2023_logs/r4-lopez.csv"
[50] "data-original/2023_logs/r4-lynaugh.csv"
[51] "data-original/2023_logs/r4-mcconnell.csv"
[52] "data-original/2023_logs/r4-sanchez.csv"
[53] "data-original/2023_logs/r4-stevenson.csv"
[54] "data-original/2023_logs/r4-torres.csv"
[55] "data-original/2023_logs/r5-allred.csv"
[56] "data-original/2023_logs/r5-dalhart.csv"
[57] "data-original/2023_logs/r5-daniel.csv"
[58] "data-original/2023_logs/r5-formby.csv"
[59] "data-original/2023_logs/r5-jordan.csv"
[60] "data-original/2023_logs/r5-mechler.csv"
[61] "data-original/2023_logs/r5-montford.csv"
[62] "data-original/2023_logs/r5-roach.csv"
[63] "data-original/2023_logs/r5-smith.csv"
[64] "data-original/2023_logs/r5-wallace.csv"
[65] "data-original/2023_logs/r5-wheeler.csv"
[66] "data-original/2023_logs/r6-crain.csv"
[67] "data-original/2023_logs/r6-halbert.csv"
[68] "data-original/2023_logs/r6-hamilton.csv"
[69] "data-original/2023_logs/r6-havins.csv"
[70] "data-original/2023_logs/r6-hilltop.csv"
[71] "data-original/2023_logs/r6-hobby.csv"
[72] "data-original/2023_logs/r6-hughes.csv"
[73] "data-original/2023_logs/r6-luther.csv"
[74] "data-original/2023_logs/r6-marlin.csv"
[75] "data-original/2023_logs/r6-middleton.csv"
[76] "data-original/2023_logs/r6-o-daniel.csv"
[77] "data-original/2023_logs/r6-pack.csv"
[78] "data-original/2023_logs/r6-robertson.csv"
[79] "data-original/2023_logs/r6-san-saba.csv"
[80] "data-original/2023_logs/r6-sayle.csv"
[81] "data-original/2023_logs/r6-travis-county.csv"
[82] "data-original/2023_logs/r6-woodman.csv"
A little ChatGPT help here to find a way to count the number of lines in each file. There should be 193 in each one. In some cases our spreadsheets included blank rows, which we fixed. There is still a blank column, but we take care of that later.
This takes the list of files above and counts the lines. We then filter the list for any that aren’t 193 lines long. The result should be 0 rows.
count_lines <- function(file) {
  length(readLines(file, warn = FALSE))
}

line_counts <- sapply(logs_list, count_lines)
line_counts_data <- tibble(file = basename(names(line_counts)), lines = line_counts)

line_counts_data |> filter(lines != 193)

Sometimes we needed to check a specific file being downloaded.
# download.file(
#   "https://docs.google.com/spreadsheets/d/e/2PACX-1vRPdRSPyFwnYPgrLkJktmtg2hy4BmWl0eILzwO7KLalYxjPF2xYMZj4kSTuo3u7wwUckskou4-Dm4zF/pub?gid=0&single=true&output=csv",
#   "data-original/2023_logs/r1-estelle.csv"
# )

Now we read all of the files and combine them. This again uses the logs_list from above.
logs_raw <- logs_list |> # set_names(basename) |>
  map(
    read_csv,
    col_types = cols(.default = col_character())
  ) |>
  list_rbind() |>
  clean_names()

New names:
New names:
New names:
• `` -> `...12`

logs_raw |> glimpse()

Rows: 15,744
Columns: 12
$ unit <chr> "Byrd", "Byrd", "Byrd", "Byrd", "Byrd", "Byrd", "Byrd", "Byrd"…
$ date <chr> "7/24/2023", "7/24/2023", "7/24/2023", "7/24/2023", "7/24/2023…
$ rec <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",…
$ time <chr> "12:30 a.m.", "1:30 a.m.", "2:30 a.m.", "3:30 a.m.", "4:30 a.m…
$ temp <chr> "83", "82", "81", "81", "80", "79", "79", "80", "83", "86", "9…
$ humid <chr> "54", "62", "67", "71", "74", "76", "N/A", "79", "72", "69", "…
$ wind <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ hi_wc <chr> "65", "68", "69", "71", "71", "71", "N/A", "84", "86", "95", "…
$ hi_wc_n <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ person <chr> "E. Johnson", "E. Johnson", "E. Johnson", "E. Johnson", "E. Jo…
$ notes <chr> NA, NA, NA, NA, "hi_wc corrected", "humid corrected", NA, NA, …
$ x12 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
logs_raw |> filter(!is.na(x12))

The x12 column is the blank column noted above, so we drop it.

logs_tight <- logs_raw |> select(-x12)
logs_tight |> glimpse()

Rows: 15,744
Columns: 11
$ unit <chr> "Byrd", "Byrd", "Byrd", "Byrd", "Byrd", "Byrd", "Byrd", "Byrd"…
$ date <chr> "7/24/2023", "7/24/2023", "7/24/2023", "7/24/2023", "7/24/2023…
$ rec <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",…
$ time <chr> "12:30 a.m.", "1:30 a.m.", "2:30 a.m.", "3:30 a.m.", "4:30 a.m…
$ temp <chr> "83", "82", "81", "81", "80", "79", "79", "80", "83", "86", "9…
$ humid <chr> "54", "62", "67", "71", "74", "76", "N/A", "79", "72", "69", "…
$ wind <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ hi_wc <chr> "65", "68", "69", "71", "71", "71", "N/A", "84", "86", "95", "…
$ hi_wc_n <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ person <chr> "E. Johnson", "E. Johnson", "E. Johnson", "E. Johnson", "E. Jo…
$ notes <chr> NA, NA, NA, NA, "hi_wc corrected", "humid corrected", NA, NA, …
Two decisions in the formatting of our log necessitate this: we transcribed the time as it was written on the logs (e.g., "12:30 a.m."), so it is text rather than a real time, and we numbered the rec column from 1 to 24, thinking we needed to label each reading, but the weather station data will be labeled 0 to 23 because it is the hour of the day. Here we create a real datetime, a date and a proper hour.
Also, we remove artifact columns and any blank columns by specifying only the columns we do need.
logs_dates <- logs_tight |>
  mutate(
    # convert the transcribed text times to a 24-hour time string
    new_time = case_when(
      time == "12:30 a.m." ~ "00:30:00",
      time == "12:30 p.m." ~ "12:30:00",
      str_sub(time, -4, -1) == "a.m." ~ paste0(str_extract(time, "^(\\d+)"), ":30:00"),
      str_sub(time, -4, -1) == "p.m." ~ paste0(str_extract(time, "^(\\d+)") |> as.numeric() + 12, ":30:00"),
      .default = NA
    )
  ) |>
  mutate(
    # build a real datetime, a real date and the hour of the day
    datetime = mdy_hms(paste(date, new_time)),
    date = mdy(date),
    hour = hour(datetime),
    .after = time
  ) |>
  select(
    unit,
    date,
    rec,
    datetime,
    hour,
    temp,
    humid,
    wind,
    hi_wc,
    hi_wc_n,
    person,
    notes
  )
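As a sanity check on the a.m./p.m. conversion, here is a minimal sketch that runs the same case_when() logic on a handful of example time strings in the format used on the logs (these specific values are illustrative, not pulled from the data):

# Sketch: spot check the text-to-time conversion on example values
tibble(time = c("12:30 a.m.", "1:30 a.m.", "12:30 p.m.", "1:30 p.m.", "11:30 p.m.")) |>
  mutate(
    new_time = case_when(
      time == "12:30 a.m." ~ "00:30:00",
      time == "12:30 p.m." ~ "12:30:00",
      str_sub(time, -4, -1) == "a.m." ~ paste0(str_extract(time, "^(\\d+)"), ":30:00"),
      str_sub(time, -4, -1) == "p.m." ~ paste0(str_extract(time, "^(\\d+)") |> as.numeric() + 12, ":30:00")
    )
  )
# Expect 00:30:00, 1:30:00, 12:30:00, 13:30:00 and 23:30:00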
logs_dates |> slice_sample(n = 10)

Our various readings come in as text. We convert them to numbers.
logs_numbs <- logs_dates |>
  mutate(
    across(c(temp, humid, wind, hi_wc, hi_wc_n), parse_number)
  )

Warning: There were 5 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(c(temp, humid, wind, hi_wc, hi_wc_n), parse_number)`.
Caused by warning:
! 3 parsing failures.
row col expected actual
2451 -- a number N/A
2476 -- a number N/A
2478 -- a number N/A
ℹ Run `dplyr::last_dplyr_warnings()` to see the 4 remaining warnings.
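The parsing failures come from readings transcribed as the literal text "N/A", as the warning shows; they become NA. A quick sketch to see which units have those entries:

# Sketch: find readings transcribed as "N/A" in the pre-conversion (character) data
logs_dates |>
  filter(if_any(c(temp, humid, wind, hi_wc, hi_wc_n), \(x) x == "N/A")) |>
  count(unit)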
logs_numbs |> slice_sample(n = 10)

logs_numbs |> filter(str_detect(unit, "Moore"))

In some cases the name on the outdoor log sheet did not match our tracking sheet, which means we have trouble later when we need to match that information.
logs_fixes <- logs_numbs |>
  mutate(
    unit = case_match(
      unit,
      "Travis State Jail" ~ "Travis County",
      "Choice Moore" ~ "Moore, C.",
      "Jester 3" ~ "Jester III",
      "Lane Murray" ~ "Murray",
      "Wallace Pack" ~ "Pack",
      "Mt. View" ~ "Woodman",
      .default = unit
    )
  )
logs_fixes |> filter(str_detect(unit, "Jordan"))It’s helpful later to have the region of the prison as part of this dataset, so I’m going to add it here from the unit info.
regions <- read_rds("data-processed/01-unit-info-cleaned.rds") |>
  select(unit_name, region)

logs_regions <- logs_fixes |>
  left_join(regions, by = join_by(unit == unit_name)) |>
  relocate(region, .after = unit)
logs_regions |> filter(str_detect(unit, "Jordan"))

logs_regions |> write_rds("data-processed/01-outdoor-cleaned.rds")

When we transcribed the logs, we also included a notes sheet where we added any overall relevant or interesting information.
We will be following the same steps as above, except that we have to piece together the download URL for each notes tab from our tracking sheet.
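For example, the Byrd unit’s published URL ends in pub?output=csv and its notes_gid in the tracking sheet is 1133970569, so its notes URL inserts gid=1133970569 before the CSV output parameter. A minimal sketch with a shortened, hypothetical URL stem:

# Sketch of the notes URL assembly, using a hypothetical shortened URL stem
example_url <- "https://docs.google.com/spreadsheets/d/e/EXAMPLE/pub?output=csv"
example_gid <- 1133970569 # Byrd's notes_gid from the tracking sheet
pub_url <- str_extract(example_url, "(.*pub\\?)", group = 1)
paste0(pub_url, "gid=", example_gid, "&output=csv")
# "https://docs.google.com/spreadsheets/d/e/EXAMPLE/pub?gid=1133970569&output=csv"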
notes_urls <- log_urls |>
  # filter(region == "II") |> # filters to region for testing
  select(slug, url, notes_gid) |>
  mutate(
    pub_url = str_extract(url, "(.*pub\\?)", group = 1),
    notes_url = paste0(pub_url, "gid=", notes_gid, "&output=csv")
  ) |>
  filter(!is.na(url)) |>
  select(slug, notes_url)

notes_urls

Then we download all the files. eval: false is set here so files won’t be re-downloaded.
# set our notes urls to notes_df
notes_df <- notes_urls

# Download loop
for (i in seq_len(nrow(notes_df))) {
  url <- notes_df$notes_url[i]
  slug <- notes_df$slug[i]
  file_name <- paste0("data-original/2023_notes/", slug, ".csv")

  if (is.na(url)) {
    # message(paste("No URL for slug:", slug))
    next
  } else {
    tryCatch({
      download.file(url, destfile = file_name, mode = "wb")
    }, error = function(e) {
      message(paste("Failed to download:", url, "-", e$message, "\n"))
    })
  }
}

Make a list of file paths to combine.
notes_list <- list.files(
  "data-original/2023_notes",
  pattern = ".csv",
  full.names = TRUE
)

notes_list

 [1] "data-original/2023_notes/r1-byrd.csv"
[2] "data-original/2023_notes/r1-duncan.csv"
[3] "data-original/2023_notes/r1-ellis.csv"
[4] "data-original/2023_notes/r1-estelle.csv"
[5] "data-original/2023_notes/r1-ferguson.csv"
[6] "data-original/2023_notes/r1-goodman.csv"
[7] "data-original/2023_notes/r1-goree.csv"
[8] "data-original/2023_notes/r1-holliday.csv"
[9] "data-original/2023_notes/r1-huntsville.csv"
[10] "data-original/2023_notes/r1-lewis.csv"
[11] "data-original/2023_notes/r1-polunsky.csv"
[12] "data-original/2023_notes/r1-wainwright.csv"
[13] "data-original/2023_notes/r1-wynne.csv"
[14] "data-original/2023_notes/r2-beto.csv"
[15] "data-original/2023_notes/r2-boyd.csv"
[16] "data-original/2023_notes/r2-choice-moore.csv"
[17] "data-original/2023_notes/r2-cole.csv"
[18] "data-original/2023_notes/r2-hutchins.csv"
[19] "data-original/2023_notes/r2-johnston.csv"
[20] "data-original/2023_notes/r2-michael.csv"
[21] "data-original/2023_notes/r2-powledge.csv"
[22] "data-original/2023_notes/r2-telford.csv"
[23] "data-original/2023_notes/r3-clemens.csv"
[24] "data-original/2023_notes/r3-gist.csv"
[25] "data-original/2023_notes/r3-henley.csv"
[26] "data-original/2023_notes/r3-hightower.csv"
[27] "data-original/2023_notes/r3-hospital-galveston.csv"
[28] "data-original/2023_notes/r3-jester-3.csv"
[29] "data-original/2023_notes/r3-kegans.csv"
[30] "data-original/2023_notes/r3-leblanc.csv"
[31] "data-original/2023_notes/r3-lychner.csv"
[32] "data-original/2023_notes/r3-memorial.csv"
[33] "data-original/2023_notes/r3-plane.csv"
[34] "data-original/2023_notes/r3-ramsey.csv"
[35] "data-original/2023_notes/r3-scott.csv"
[36] "data-original/2023_notes/r3-stiles.csv"
[37] "data-original/2023_notes/r3-stringfellow.csv"
[38] "data-original/2023_notes/r3-terrell.csv"
[39] "data-original/2023_notes/r3-vance.csv"
[40] "data-original/2023_notes/r3-young.csv"
[41] "data-original/2023_notes/r4-briscoe.csv"
[42] "data-original/2023_notes/r4-connally.csv"
[43] "data-original/2023_notes/r4-cotulla.csv"
[44] "data-original/2023_notes/r4-dominguez.csv"
[45] "data-original/2023_notes/r4-fort-stockton.csv"
[46] "data-original/2023_notes/r4-garza-west.csv"
[47] "data-original/2023_notes/r4-glossbrenner.csv"
[48] "data-original/2023_notes/r4-lopez.csv"
[49] "data-original/2023_notes/r4-lynaugh.csv"
[50] "data-original/2023_notes/r4-mcconnell.csv"
[51] "data-original/2023_notes/r4-sanchez.csv"
[52] "data-original/2023_notes/r4-stevenson.csv"
[53] "data-original/2023_notes/r4-torres.csv"
[54] "data-original/2023_notes/r5-dalhart.csv"
[55] "data-original/2023_notes/r5-daniel.csv"
[56] "data-original/2023_notes/r5-formby.csv"
[57] "data-original/2023_notes/r5-jordan.csv"
[58] "data-original/2023_notes/r5-mechler.csv"
[59] "data-original/2023_notes/r5-montford.csv"
[60] "data-original/2023_notes/r5-roach.csv"
[61] "data-original/2023_notes/r5-smith.csv"
[62] "data-original/2023_notes/r5-wallace.csv"
[63] "data-original/2023_notes/r6-crain.csv"
[64] "data-original/2023_notes/r6-halbert.csv"
[65] "data-original/2023_notes/r6-hamilton.csv"
[66] "data-original/2023_notes/r6-havins.csv"
[67] "data-original/2023_notes/r6-hilltop.csv"
[68] "data-original/2023_notes/r6-hobby.csv"
[69] "data-original/2023_notes/r6-hughes.csv"
[70] "data-original/2023_notes/r6-luther.csv"
[71] "data-original/2023_notes/r6-marlin.csv"
[72] "data-original/2023_notes/r6-middleton.csv"
[73] "data-original/2023_notes/r6-o-daniel.csv"
[74] "data-original/2023_notes/r6-pack.csv"
[75] "data-original/2023_notes/r6-robertson.csv"
[76] "data-original/2023_notes/r6-san-saba.csv"
[77] "data-original/2023_notes/r6-sayle.csv"
[78] "data-original/2023_notes/r6-travis-county.csv"
[79] "data-original/2023_notes/r6-woodman.csv"
Now we put those files together into a single tibble.
notes_raw <- notes_list |>
  set_names(basename) |>
  map(
    read_csv,
    col_types = cols(.default = col_character()),
    col_names = c("notes")
  ) |>
  list_rbind(names_to = "unit") |>
  clean_names()

notes_raw |> glimpse()

Rows: 268
Columns: 4
$ unit <chr> "r1-byrd.csv", "r1-byrd.csv", "r1-duncan.csv", "r1-duncan.csv", …
$ notes <chr> "Name we code as Komomzy is a difficult signature to comprehend"…
$ x2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ x3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
We have a little cleanup here for the unit and region, which we pull out of the file name. We drop the x columns, which are empty.
notes_cleaned <- notes_raw |>
  filter(!is.na(notes)) |>
  mutate(
    region = str_extract(unit, pattern = "^(.*?)-", group = 1),
    unit = str_extract(unit, pattern = "^.*?-(.*)\\.csv$", group = 1)
  ) |>
  select(unit, region, notes)
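As a spot check on the file-name parsing, here is a sketch using a few slugs from the list above:

# Sketch: check the region/unit extraction on a few example slugs
test_slugs <- c("r1-byrd.csv", "r2-choice-moore.csv", "r6-o-daniel.csv")
tibble(
  file = test_slugs,
  region = str_extract(test_slugs, "^(.*?)-", group = 1),
  unit = str_extract(test_slugs, "^.*?-(.*)\\.csv$", group = 1)
)
# Expect regions r1, r2, r6 and units byrd, choice-moore, o-daniel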
notes_cleaned |> head()

We’ll export this as both a CSV and an RDS because we may just put these into a spreadsheet for review.
notes_cleaned |> write_rds("data-processed/01-outdoor-notes.rds")
notes_cleaned |> write_csv("data-processed/01-outdoor-notes.csv")
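Later notebooks can read these processed files back in; a minimal sketch, assuming the same project structure:

# Sketch: read the processed files back in a downstream notebook
read_rds("data-processed/01-outdoor-cleaned.rds") |> glimpse()
read_rds("data-processed/01-outdoor-notes.rds") |> glimpse()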