Outdoor log reliability

Author

Media Innovation Group

July 2025 questions

  • Can we find a NWS for Jordan that is close but has readings? Are there other cases like this were we don’t have readings, but can find an alternative station? Let’s try using virtualcrossing.

Reliability questions

  • How do log temperature readings compare to the nearest weather station?
    • diff in temp
    • diff in heat index

Setup, Import & Combine

Expand this to see code
library(tidyverse)
library(janitor)
library(scales)
library(DT)

We’re bringing in:

  • Our coded and cleaned outdoor log data
  • Hourly weather readings
  • Prison unit information
Expand this to see code
logs_all <- read_rds("data-processed/01-outdoor-cleaned.rds") # come from 01-outdoor-cleaning
hourly <- read_rds("data-processed/01-station-hourly-protocols.rds")
units <- read_rds("data-processed/01-unit-info-cleaned.rds")

Clip log data

We have eight days of data, but we can clip it to a “week” as the last seven seven days of July, 2023. As of July 2025, we are not clipping.

Expand this to see code
logs <- logs_all
# |> filter(date != "2023-07-24")

logs |> count(date) |> adorn_totals() |> tibble()

Combining files

Here we bring in the unit info and hourly logs.

Expand this to see code
# getting station from units
logs_expanded <- logs |> 
  left_join(units, by = join_by(unit == unit_name, region)) |> 
  # removing some unneeded cols
  select(!c(unit_code, type, county, nws))

# joining to get weather info
logs_nws <- logs_expanded |> 
  left_join(hourly, by = join_by(nws_id == station_id, date == date, hour == hr))

logs_expanded |> filter(str_detect(unit, "Jordan"))

Context on matches

Here we find the percentage of records with no NWS data.

Expand this to see code
match_checks <- logs_nws |> 
  mutate(match_null = if_else(is.na(name), T, F))

match_checks |> 
  tabyl(match_null) |> 
  adorn_pct_formatting()

About 15 percent of our hourly logs don’t have a station match. Here TRUE values means there is missing station data.

Let’s take a look at which ones are at issue:

Expand this to see code
match_checks |> 
  tabyl(unit, match_null) |> 
  adorn_percentages() |> 
  adorn_pct_formatting() |> 
  adorn_ns() |> 
  tibble()

In some cases we didn’t have a prison close enough. In other cases – like Jordan – we had a station, but there were no recordings for the time period.

Here is a filtered list to more easily see units that have no NWS readings:

Expand this to see code
no_compare <- match_checks |> 
  tabyl(unit, match_null) |> 
  filter(`FALSE` == 0) |> 
  tibble()

no_compare

We’re going to remove these units going forward, leaving us with 82 units.

Expand this to see code
no_compare_units <- no_compare |> pull(unit)

logs_nws_compare <- logs_nws |> 
  filter(!unit %in% no_compare_units)

logs_nws_compare |> count(unit) |> nrow()

TA: Missing weather stations

We don’t have a matching weather station for 11 of the 82 unites. Using visualcrossing we might be able to do a better job finding weather stations, and also to keep a better record of how far away these stations are from the units. We will do this, but later.

ALSO: In May25 work we removed stations with no matching weather station because we were looking at reliability. But if we want instead to look at protocols based on internal records, we shouldn’t remove them.

Reliability answers

We must remember these weather stations may be up to 40 miles away from the unit.

  • How do log temperature readings compare to the nearest weather station?
    • diff in temp
    • diff in heat index

Here we find the difference in the prison recorded temp and heat index compared to the nearest station, if we have one. In some cases the diffs are NA because we don’t have a nearby station, or one of the calculating numbers is missing for whatever reason.

We only do this heat index calculation if one of the temperatures if above 80 degrees, because heat indexes get wonky when it is below 80.

Expand this to see code
logs_diffs <- logs_nws_compare |> 
  mutate(
    tmp_diff = temp - tmp,
    hi_diff = case_when(
      temp >= 80 | tmp >= 80 ~ hi_wc - hi,
      .default = NA
    )
  )

logs_diffs |> slice_sample(n = 20)

Temperature differences

Average temp difference

Here we look across the dataset at the diff_tmp (unit temperature - weather staiton temperaure).

Expand this to see code
logs_diffs |> 
  summarise(
    max_tmp_diff = max(tmp_diff, na.rm = T),
    avg_tmp_diff = (mean(tmp_diff, na.rm = T) * 100) |> round(1),
    med_tmp_diff = median(tmp_diff, na.rm = T)
  )

Distribution temp difference

Let’s do a quick plot of these to see how the are distributed. i.e., how many rows are within x degrees of the weather station temperature. Note: there are 248 blank rows where we could not determine the difference because of a missing value.

Expand this to see code
ggplot(logs_diffs, aes(x = tmp_diff)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
  labs(
    title = "Compare temp draft",
    subtitle = str_wrap("Each bar is how many unit records are within x degrees of the closest weather station."))
Warning: Removed 250 rows containing non-finite outside the scale range
(`stat_bin()`).

I can see that there are a just a couple of outliers Let’s consider removing those and see the spread mo betta.

There are only five records (out of 11,928) where the temp is 20 or more degrees off.

Expand this to see code
# the outliers
logs_diffs |> 
  filter(abs(tmp_diff) >= 20)
Expand this to see code
# chart it
logs_diffs |> 
  filter(abs(tmp_diff) < 20) |> 
  ggplot() +
  aes(x = tmp_diff) +
  geom_histogram(binwidth = 2, fill = "lightblue", color = "black") +
  scale_x_continuous(breaks = seq(-20, 20, 2)) +
  theme(panel.grid.minor.x = element_blank()) +
  labs(
    title = "Most temps within about 4 degrees",
    subtitle = str_wrap("Each bar is how many unit records are within x degrees of the closest weather station. We've removed any rows that are more the 20 degrees off (of which there are five).")
  )

How many rows negative vs positive?

Expand this to see code
logs_diffs |> filter(tmp_diff <= 0) |> nrow() # unit is lower
[1] 7053
Expand this to see code
logs_diffs |> filter(tmp_diff > 0) |> nrow() # unit is higher
[1] 6329

Standard deviation temp difference

This might be too technical or unneeded, but let’s look at this using the standard deviation. The standard deviation describes how much the values in the dataset typically vary from the mean (average). Like how whacked are they.

Expand this to see code
# The standard deviation of tmp_diff
# logs_diffs |> summarise(td_sd = sd(tmp_diff, na.rm = T))
tmp_diff_sd <- 
  logs_diffs |>
  filter(abs(tmp_diff) < 20) |>
  pull(tmp_diff) |> sd(na.rm = T)

tmp_diff_sd
[1] 4.131502
Expand this to see code
# percent outside the sd
logs_diffs |> 
  filter(abs(tmp_diff) < 20) |> 
  mutate(
    td_sd_out = case_when(
      abs(tmp_diff) > tmp_diff_sd ~ T,
      .default = F
    )
  ) |> 
  tabyl(td_sd_out) |> 
  adorn_pct_formatting()

Here about 1/4 of the rows are outside the standard deviation.

Percent temp within range

This might make more sense to a human: What percentage of the overall records fall within 2 degrees of the nearest weather station. How about within 4 degrees?

We are NOT removing outliers here.

The TRUE value here is the percentage of records within 2 or 4 degrees, respectively. To be clear, within 4 degrees also includes those within 2 degrees.

Expand this to see code
tmp_diff_2_4 <- logs_diffs |>
  # filter(abs(tmp_diff) < 20) |> 
  filter(!is.na(tmp_diff)) |> 
  mutate(in_2d = abs(tmp_diff) <= 2,
         in_4d = abs(tmp_diff) <= 4)

tmp_diff_2_4 |> 
  tabyl(in_2d) |> 
  adorn_pct_formatting()
Expand this to see code
tmp_diff_2_4 |> 
  tabyl(in_4d) |> 
  adorn_pct_formatting()

Now that seems easier to understand than standard deviation.

Within range degrees by unit

Let’s see how this differs by unit.

in_2d_prc is the percentage of records for that unit within 2 degrees the closest weather station. in_4d_prc is the same for within 4 degrees.

Expand this to see code
tmp_diff_2_4 |> 
  group_by(unit) |> 
  summarize(
    cnt = n(),
    in_2d_true = sum(in_2d == T),
    in_2d_prc = ((in_2d_true / cnt) * 100) |> round(1),
    in_4d_true = sum(in_4d == T),
    in_4d_prc = ((in_4d_true / cnt) * 100) |> round(1)
  ) |> 
  # removes the cnt true rows from display
  select(unit, ends_with("prc")) |> 
  arrange(in_2d_prc)

Sorted above based on lowest percentage within 2 degrees.

Now, these numbers could really depend on how close the weather station is to the unit. For instance, I’m not happy with the choice of a station at Pearland Regional Airport for the Ramsey unit (maybe 30 miles?). The Angleton/Brazoria airport is closer (13 miles), as is Houston Southwest Airport (14 miles).

We will check all the weather stations and add distance between unit and weather station at a future date.

Heat index

HI diff distribution

Let’s do the same for heat index. We’ll skip the standard deviations and just look at the distribution and percentages.

There are about 50 records outside a 20 degree difference that we show here, but remove for the plot.

Expand this to see code
logs_diffs |> 
  filter(abs(hi_diff) >= 20)
Expand this to see code
logs_diffs |> 
  filter(abs(hi_diff) < 20) |>
  ggplot() +
  aes(x = hi_diff) +
  geom_histogram(binwidth = 2, fill = "lightblue", color = "black") +
  scale_x_continuous(breaks = seq(-20, 20, 2)) +
  theme(panel.grid.minor.x = element_blank())

Within HI range

Here we find the percentage of heat index records within 2 or 4 degrees. We are NOT removing outliers here.

TRUE means within 2 or 4 degrees, respectively.

Expand this to see code
hi_diff_2_4 <- logs_diffs |>
  filter(!is.na(hi_diff)) |> 
  mutate(in_2d = abs(hi_diff) <= 2,
         in_4d = abs(hi_diff) <= 4)

hi_diff_2_4 |> 
  tabyl(in_2d) |> 
  adorn_pct_formatting()
Expand this to see code
hi_diff_2_4 |> 
  tabyl(in_4d) |> 
  adorn_pct_formatting()

Within HI range by unit

And now to do that by unit. We have more cases where we don’t have the heat index for both the unit and the weather station, so we include the count of records we are comparing here.

Expand this to see code
hi_diff_2_4 |> 
  group_by(unit) |> 
  summarize(
    cnt = n(),
    in_2d_true = sum(in_2d == T),
    in_2d_prc = ((in_2d_true / cnt) * 100) |> round(1),
    in_4d_true = sum(in_4d == T),
    in_4d_prc = ((in_4d_true / cnt) * 100) |> round(1)
  ) |> 
  # removes the cnt true rows from display
  select(unit, cnt, ends_with("prc")) |> 
  arrange(in_2d_prc)