Importing multiple files at once and creating your own function

Section 1: Importing Multiple Files

First, we are going to look at how to import multiple files at once using the map() function. The Taylor Swift song data is divided into one CSV file for each album, and we want to bring it all into one object.

Set up

Importing our needed libraries. We use tidyverse for almost everything.

1 #| label: setup
2 #| message: false
3 #| warning: false

library(tidyverse)

1. #| label: sets the chunk label. Using setup specifically will run this code chunk first every time you render or run the page.
2. #| message: false hides the messages that usually appear when you load libraries.
3. Sometimes code produces a warning in R that doesn't actually affect the output. #| warning: false hides those warnings. We will talk about this more later.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
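
Those "masked" messages mean two loaded packages export a function with the same name; the package loaded last wins when you call the bare name. As a minimal sketch (not from the original lesson), you can always reach a specific package's version with the :: operator:

```r
# after library(dplyr), a bare filter() call means dplyr::filter(),
# but the masked stats version is still reachable with ::
f <- stats::filter   # the time-series filter from base R's stats package
is.function(f)
```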

Creating a list of file names

1 taylor_files_list <- list.files(
2   "data-processed-taylor",
3   pattern = ".csv",
4   full.names = TRUE
    # Try changing full.names to FALSE and see what happens to the output
  )

taylor_files_list

1. list.files() creates a character vector of the names of the files in a directory.
2. The first argument is the path to find the files. In this case, we want to go into the directory (folder) called data-processed-taylor.
3. pattern is an optional argument that takes a regular expression defining which specific files you want. In this case we want all files whose names contain .csv.
4. full.names = TRUE returns the full path rather than just the file name. Let's change this and see what happens…
 [1] "data-processed-taylor/1989 (Deluxe).csv"                                                                      
 [2] "data-processed-taylor/1989 (Taylor's Version) [Deluxe].csv"                                                   
 [3] "data-processed-taylor/1989 (Taylor's Version).csv"                                                            
 [4] "data-processed-taylor/1989.csv"                                                                               
 [5] "data-processed-taylor/evermore (deluxe version).csv"                                                          
 [6] "data-processed-taylor/evermore.csv"                                                                           
 [7] "data-processed-taylor/Fearless (International Version).csv"                                                   
 [8] "data-processed-taylor/Fearless (Platinum Edition).csv"                                                        
 [9] "data-processed-taylor/Fearless (Taylor's Version).csv"                                                        
[10] "data-processed-taylor/folklore (deluxe version).csv"                                                          
[11] "data-processed-taylor/folklore- the long pond studio sessions (from the Disney+ special) [deluxe edition].csv"
[12] "data-processed-taylor/folklore.csv"                                                                           
[13] "data-processed-taylor/Live From Clear Channel Stripped 2008.csv"                                              
[14] "data-processed-taylor/Lover.csv"                                                                              
[15] "data-processed-taylor/Midnights (3am Edition).csv"                                                            
[16] "data-processed-taylor/Midnights (The Til Dawn Edition).csv"                                                   
[17] "data-processed-taylor/Midnights.csv"                                                                          
[18] "data-processed-taylor/Red (Deluxe Edition).csv"                                                               
[19] "data-processed-taylor/Red (Taylor's Version).csv"                                                             
[20] "data-processed-taylor/Red.csv"                                                                                
[21] "data-processed-taylor/reputation Stadium Tour Surprise Song Playlist.csv"                                     
[22] "data-processed-taylor/reputation.csv"                                                                         
[23] "data-processed-taylor/Speak Now (Deluxe Package).csv"                                                         
[24] "data-processed-taylor/Speak Now (Taylor's Version).csv"                                                       
[25] "data-processed-taylor/Speak Now World Tour Live.csv"                                                          
[26] "data-processed-taylor/Speak Now.csv"                                                                          
[27] "data-processed-taylor/Taylor Swift (Deluxe Edition).csv"                                                      
[28] "data-processed-taylor/THE TORTURED POETS DEPARTMENT- THE ANTHOLOGY.csv"                                       
[29] "data-processed-taylor/THE TORTURED POETS DEPARTMENT.csv"                                                      

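A note on that pattern argument: it is a regular expression, where . matches any single character, so the stricter pattern "\\.csv$" guarantees the name ends in .csv. A minimal base-R sketch (these file names are made up for illustration):

```r
files <- c("Red.csv", "notes.txt", "data-processed-taylor/Lover.csv")

# "\\.csv$" anchors the match to the end of the name
grepl("\\.csv$", files)
#> TRUE FALSE TRUE

# full.names = FALSE returns just the file name, i.e. basename() of the path
basename("data-processed-taylor/Lover.csv")
#> "Lover.csv"
```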
Reading in and combining the files

6 taylor_songs <- taylor_files_list |>  #set_names(basename) |>
1   map(
2     read_csv,
3     col_types = cols(album = col_character())
4   ) |> list_rbind() |>
5   clean_names()

taylor_songs

1. First we create a new object called taylor_songs, which is where all our data will end up. Then we take our list of file names, taylor_files_list, and pipe it into the map() function. map() applies a function (in this case read_csv()) to every item in a list and returns a list as a result.
2. Our first argument is the function we want to apply, read_csv(), which returns a data frame for each CSV.
3. After the comma is col_types =, an argument passed through to read_csv(). It can set any or all of the column types as you import the data. What happens when you comment out the col_types = line? Taylor's album 1989 is read as a number while the other album names are read as strings, so we get an error and can't combine the data frames when we use list_rbind().
4. Once map() has created a list of data frames from read_csv(), we want to combine them all into the single object we named taylor_songs, so we use list_rbind().
5. Then, just to make everything look nice and uniform, we use clean_names() to standardize the column names.
6. Let's add set_names(basename) by uncommenting it, and add names_to = "source" inside list_rbind(). Look at the output: it creates a column that shows which file each row came from. This can be really helpful when the data doesn't have a column with, say, the date or year, but the file name does; you can then extract it from that column.
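
To see what set_names(basename) plus names_to = "source" does without re-reading all the CSVs, here is a toy sketch on two made-up tibbles standing in for albums:

```r
library(tidyverse)

# set_names(basename) turns full paths into short names on the list
paths <- c("data-processed-taylor/Red.csv", "data-processed-taylor/Lover.csv")
names(set_names(paths, basename))
#> "Red.csv" "Lover.csv"

# two tiny stand-ins for the per-album data frames read_csv() would return
albums <- list(
  "Red.csv"   = tibble(track = c("22", "All Too Well")),
  "Lover.csv" = tibble(track = "Cruel Summer")
)

# names_to copies each list element's name into a new "source" column
albums |> list_rbind(names_to = "source")
```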

OYO: Importing multiple files at once - Power Outages

Now, try this on your own with the power outage data.

Instructions: The files you want to import are all in a folder called "data-processed-power". We have already given you the object names to put your code in, to help with consistency and prevent repetition across files. Uncomment each of the lines and fill in code to read in the CSVs.

power_files_list <- list.files(
  "data-processed-power",
  pattern = ".csv",
  full.names = TRUE
)

power_files_list
 [1] "data-processed-power/Alabama.csv"       
 [2] "data-processed-power/Arizona.csv"       
 [3] "data-processed-power/Arkansas.csv"      
 [4] "data-processed-power/California.csv"    
 [5] "data-processed-power/Colorado.csv"      
 [6] "data-processed-power/Connecticut.csv"   
 [7] "data-processed-power/Delaware.csv"      
 [8] "data-processed-power/Florida.csv"       
 [9] "data-processed-power/Georgia.csv"       
[10] "data-processed-power/Idaho.csv"         
[11] "data-processed-power/Illinois.csv"      
[12] "data-processed-power/Indiana.csv"       
[13] "data-processed-power/Iowa.csv"          
[14] "data-processed-power/Kansas.csv"        
[15] "data-processed-power/Kentucky.csv"      
[16] "data-processed-power/Louisiana.csv"     
[17] "data-processed-power/Maine.csv"         
[18] "data-processed-power/Maryland.csv"      
[19] "data-processed-power/Massachusetts.csv" 
[20] "data-processed-power/Michigan.csv"      
[21] "data-processed-power/Minnesota.csv"     
[22] "data-processed-power/Mississippi.csv"   
[23] "data-processed-power/Missouri.csv"      
[24] "data-processed-power/Montana.csv"       
[25] "data-processed-power/Nebraska.csv"      
[26] "data-processed-power/Nevada.csv"        
[27] "data-processed-power/New Hampshire.csv" 
[28] "data-processed-power/New Jersey.csv"    
[29] "data-processed-power/New Mexico.csv"    
[30] "data-processed-power/New York.csv"      
[31] "data-processed-power/North Carolina.csv"
[32] "data-processed-power/North Dakota.csv"  
[33] "data-processed-power/Ohio.csv"          
[34] "data-processed-power/Oklahoma.csv"      
[35] "data-processed-power/Oregon.csv"        
[36] "data-processed-power/Pennsylvania.csv"  
[37] "data-processed-power/Rhode Island.csv"  
[38] "data-processed-power/South Carolina.csv"
[39] "data-processed-power/South Dakota.csv"  
[40] "data-processed-power/Tennessee.csv"     
[41] "data-processed-power/Texas.csv"         
[42] "data-processed-power/Utah.csv"          
[43] "data-processed-power/Vermont.csv"       
[44] "data-processed-power/Virginia.csv"      
[45] "data-processed-power/Washington.csv"    
[46] "data-processed-power/West Virginia.csv" 
[47] "data-processed-power/Wisconsin.csv"     
[48] "data-processed-power/Wyoming.csv"       
power_outages <-  power_files_list |> 
  map(
    read_csv
  ) |> list_rbind() |> 
  clean_names()


power_outages |> glimpse()
Rows: 526,165
Columns: 14
$ state_event          <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ datetime_event_began <dttm> 2015-01-07 17:00:00, 2015-02-16 21:00:00, 2015-0…
$ datetime_restoration <dttm> 2015-01-08 08:35:00, 2015-02-18 14:00:00, 2015-0…
$ event_type           <chr> "Severe Weather - Winter", "Severe Weather - Wint…
$ fips                 <dbl> 1039, 1003, 1085, 1003, 1003, 1085, 1001, 1021, 1…
$ state                <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ county               <chr> "Covington", "Baldwin", "Lowndes", "Baldwin", "Ba…
$ start_time           <dttm> 2015-01-08 05:00:00, 2015-02-19 22:15:00, 2015-0…
$ duration             <dbl> 2.00, 1.00, 1.25, 1.00, 0.75, 1.25, 2.25, 2.50, 1…
$ end_time             <dttm> 2015-01-08 07:00:00, 2015-02-19 23:15:00, 2015-0…
$ min_customers        <dbl> 723, 368, 405, 368, 312, 405, 381, 289, 203, 352,…
$ max_customers        <dbl> 723, 567, 414, 567, 315, 414, 1607, 540, 224, 352…
$ mean_customers       <dbl> 723.0000, 417.7500, 410.4000, 417.7500, 313.3333,…
$ event_category       <chr> "Weather", "Weather", "Weather", "Weather", "Weat…

Section 2: Creating your own function

We want to create a function that creates a graph we can change over and over again.

Look at the data

First, let’s take a look at the Taylor Swift song data…

taylor_songs |> head(50)

There are a lot of columns that describe different aspects of the music, but two that are obvious and stand out are popularity and song length. Is there any correlation between the popularity of Taylor's songs and their length? Let's graph it to find out. The data creator has told us that popularity is on a scale of 1-100.
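
Alongside a graph, cor() puts a single number on a linear relationship (from -1 to 1). A toy sketch on made-up numbers, not the real dataset:

```r
# made-up durations and popularity scores, only to show how to read cor()
duration_min <- c(3.2, 3.9, 4.5, 5.1, 10.2)
popularity   <- c(80, 75, 70, 66, 60)

cor(duration_min, popularity)  # negative: in this toy data, longer songs score lower
```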

Create a plot

ggplot(taylor_songs, aes( 
  x = duration_min,
  y = popularity)) + 
  geom_point() 

For our Taylor Swift fans, can you tell which song is all the way on the right?

What if we want to add labels and scale the axes differently? And what if we wanted to make the same graph for multiple albums? That would require re-coding a lot of things we've already typed out (or a lot of copying and pasting). Rather than rewriting the same code over and over again, we can make a function that does it for us.

Make our own function

We are going to call our function graphing_taylor and add some extra details to the graph.

1 graphing_taylor <- function(song_data) {
    ggplot(song_data, aes(x = duration_min, y = popularity)) +
      geom_point() +
      labs(x = "Length of song in minutes", y = "Popularity") +
      scale_x_continuous(n.breaks = 10) +
      scale_y_continuous(limits = c(0, 100))
  }

1. Here we use function() with one argument, song_data, then define the function body inside the curly braces {}. This is very similar to how functions are defined in JavaScript.
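
The function() + {} pattern works for any reusable step, not just plots. A minimal base-R sketch with a hypothetical helper name (not part of the lesson's code):

```r
# hypothetical helper: convert a song length in minutes to seconds
minutes_to_seconds <- function(minutes) {
  minutes * 60
}

minutes_to_seconds(10)  # 600 seconds for a famous ten-minute song
```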

Now that we’ve built our function we can apply it to whatever album(s) we want. So let’s look at all the versions of the Red album first.

red <- taylor_songs |> filter(album |> str_detect("Red"))

red

Now we can apply our graphing function to the red object.

graphing_taylor(red)

And again we see our “All Too Well (10 minute version)” as a big outlier.

Now try it on your own.

OYO: Creating your own function

Let’s look at the power outage data.

#power_outages |> head(50)

We want to see what the top 10 causes of power outages are for each state and how many outages there have been for each category. To do this for Texas, we would filter for our state, group_by the event_category, count the number of occurrences for each category, arrange it in descending order and then take the top ten.

# filtered_data <- power_outages |> filter(state_event |> str_detect("Texas")) |>
#   group_by (event_category) |>
#   summarize(total_events = n()) |>
#   arrange(total_events |> desc()) |>
#   head(10)
# 
# filtered_data

Now we want to create a bar chart.

# ggplot(filtered_data,
#          aes(x = total_events,
#              y = reorder(event_category, total_events))) +
#          geom_col() +
#          labs(x = "Occurrences", y = "Outage Cause")  +
#         scale_x_continuous(labels = comma)
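
That scale_x_continuous(labels = comma) call uses the scales package we loaded earlier; comma() adds thousands separators so large axis labels stay readable. A quick sketch:

```r
library(scales)

# comma() formats a number as a character string with thousands separators
comma(526165)
#> "526,165"
```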

Now, your turn. We want to put these two steps into a single function so we can do it all at once for any state we want. Try it below and test if it matches the graph output above for Texas. You can copy and paste the filtering and graphing code but will need to adjust a few things for the function to work.

filter_graph_function <- function (state_string) {
  
}

Test it with Texas.

#filter_graph_function("Texas")

Try it with your home state or another state if you’re from Texas too.

#filter_graph_function(YOUR STATE HERE)