Rows: 95,056
Columns: 12
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "2540-01-565-4700", "1385-01-574-4707", "1005-01-587…
$ item_name <chr> "BALLISTIC BLANKET KIT", "UNMANNED VEHICLE,GROUND", …
$ quantity <dbl> 10, 1, 10, 9, 1, 1, 1, 18, 1, 2, 3, 10, 1, 1, 2, 1, …
$ ui <chr> "Kit", "Each", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 16854.24, 10000.00, 1626.00, 365.00, 62627.00, 65800…
$ demil_code <chr> "D", "Q", "D", "D", "C", "C", "C", "Q", "Q", "Q", "Q…
$ demil_ic <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3…
$ ship_date <dttm> 2018-01-30 00:00:00, 2017-03-28 00:00:00, 2016-09-1…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
$ total_value <dbl> 168542.40, 10000.00, 16260.00, 3285.00, 62627.00, 65…
14 Using case_when
In the Military Surplus Cleaning chapter we created a control_type
variable, creating a category to designate if our items have to be returned to the Department of Defense for disposal. Refer back to that chapter about why were need to do this.
I skipped the detailed explanation of the code there because it is a more advanced tactic and would mire the flow of the lesson. In this appendix, I go into the details for those who might be interested.
You can also learn more about case_when()
and other categorization techniques in the Jedr Categorization Tutorial.
14.1 Catching up with the data
I’ll start here with the Military Surplus data after the shipment totals have been calculated.
14.1.1 Categorization logic with case_when()
We will use the mutate()
function to create a new column called control_type
. We’ve used mutate before, but this time we will fill in values in the new column based on other data inside each row. case_when()
allows us to create a test (or a number of tests) and then mark the new value based on the answer. Once new data has been written the function evaluates the next row, so we write the most specific rules first.
I usually approach this by thinking of the logic first, then writing some code, then testing it. Sometimes my logic is faulty and I have to try again, which is why we test the results. Know this could go on for many cycles.
Here is the basic logic:
- We want to create a new column to denote if the item is controlled. In that column we want it to be TRUE when an item is controlled, and FALSE when it is not.
- We know that items with “AIRPLANE” are always controlled, no matter their demil designations.
- Otherwise we know that items that have a
demil_code
of “A”, OR ademil_code
of “Q” AND ademil_id
of “6”, are non-controlled. - Everything else is controlled.
I’ve noted this logic in a specific order for a reason: It’s the order that we write the logic in the code based on how the function case_when()
works. This process is powerful and can get complicated depending on the logic needed. This example is perhaps more complicated than I like to use when explaining this concept, but this is real data and we need this, so here we go.
Here is the code and annotations.
1<- leso_total |>
leso_control 2mutate(
control_type = case_when(
3str_detect(item_name, "AIRPLANE") ~ TRUE,
4== "A" | (demil_code == "Q" & demil_ic == 6)) ~ FALSE,
(demil_code 5.default = TRUE
)
)
6|> glimpse() leso_control
- 1
-
Our first line creates a new tibble
leso_control
and fills it with the result of the rest of our expression. We start with theleso_total
tibble. - 2
-
We mutate the data and start with the name of new column:
control_type
. We are filling that column with the result of thecase_when()
function for each row. Within thecase_when()
we are making the determination if the item is controlled or not. The left side of the~
is the test, and the right side of~
is what we enter into the column if the test passes. But we have more than one test: - 3
-
The first test is we use the
str_detect()
function to look inside theitem_name
field looking for the term “AIRPLANE”. If the test finds the term, then thecontrol_type
field gets a value ofTRUE
and we move to the next row. If not, it moves to the next rule to see if that is a match. (We could fill this column with any text or number we want, but we are usingTRUE
andFALSE
because that is the most basic kind of data to keep. If the item is controlled, set the value is TRUE. If not, it should be set to FALSE.) - 4
-
Our second rule has two complex tests and we want to mark the row FALSE if either are true. (Remember, this is based on what the DLA told me: items with A or Q6 are non-controlled.) Our
case_when()
logic first looks for the value “A” in thedemil_code
field. If it is yes, then it marks the row FALSE. If no it goes to the next part: Is there a “Q” in thedemil_code
field AND a “6” in thedemil_ic
field? Both “Q” and “6” have to be there to get marked as FALSE. If both fail, then we move to the next test. - 5
-
The last test is our catch-all. If none of the other rules apply, then we set the default value as
TRUE
, which means it is controlled. So our default in the end is to mark everything TRUE if any of the other rules don’t mark it first. - 6
- Lastly we glimpse at the data just so we can see the column was created.
Rows: 95,056
Columns: 13
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "2540-01-565-4700", "1385-01-574-4707", "1005-01-587…
$ item_name <chr> "BALLISTIC BLANKET KIT", "UNMANNED VEHICLE,GROUND", …
$ quantity <dbl> 10, 1, 10, 9, 1, 1, 1, 18, 1, 2, 3, 10, 1, 1, 2, 1, …
$ ui <chr> "Kit", "Each", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 16854.24, 10000.00, 1626.00, 365.00, 62627.00, 65800…
$ demil_code <chr> "D", "Q", "D", "D", "C", "C", "C", "Q", "Q", "Q", "Q…
$ demil_ic <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3…
$ ship_date <dttm> 2018-01-30 00:00:00, 2017-03-28 00:00:00, 2016-09-1…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
$ total_value <dbl> 168542.40, 10000.00, 16260.00, 3285.00, 62627.00, 65…
$ control_type <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
As I said, there was a lot of trial and error to figure that out, but I’ll show some tests here to show that we did what we were intending.
This shows airplanes are marked as controlled with TRUE
.
# showing the results and some columns that determined them
|>
leso_control select(item_name, demil_code, demil_ic, control_type) |>
filter(str_detect(item_name, "AIRPLANE"))
This shows how many items are marked TRUE vs FALSE for each demil_code
and demil_ic
combination. I used it to check that most A records were FALSE, along with Q6.
|>
leso_control 1count(demil_code, demil_ic, control_type, name = "cnt") |>
2pivot_wider(names_from = control_type, values_from = cnt)
- 1
-
This is the count shortcut using three variables. I name counted column
cnt
. It’s hard to visualize that result here but it counts how many times each unique combination ofdemil_code
,demil_ic
andcontrol_type
. - 2
- Here I use pivot_wider() to show the TRUE and FALSE counts on the same row. Just makes it easier to see.
We’re done with this extended explanation.