Appendix B — R Functions

An opinionated list of the most common data wrangling functions. It leans heavily into the Tidyverse.

B.1 Import/Export

  • read_csv() imports data from a CSV file. (It handles data types better than the base R read.csv().) Use write_csv() when you need to export as CSV. Example: read_csv("path/to/file.csv").
  • write_rds() saves a data frame as an .rds R data file, which preserves all the data types. Use read_rds() to import that file again. Example: read_rds("path/to/file.rds").
  • readxl is a package we didn’t use, but it has read_excel() that allows you to import from an Excel file, including specified sheets and cell ranges.
  • clean_names() from the janitor package standardizes column names.
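
As a minimal sketch of how the import/export functions above fit together, the following writes a small data frame to temporary files and reads it back. The songs data frame and its column names are made up for illustration, and the tempfile() paths stand in for real file paths.

```r
library(readr)
library(janitor)

# Hypothetical data with messy column names (illustration only)
songs <- data.frame(
  "Song Title" = c("Hello", "Yesterday"),
  "Peak Position" = c(1, 3),
  check.names = FALSE
)

# Temporary files stand in for real paths like "path/to/file.csv"
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Export to CSV, then import it again
write_csv(songs, csv_path)
songs_csv <- read_csv(csv_path, show_col_types = FALSE)

# Standardize the column names: "Song Title" becomes song_title
songs_clean <- clean_names(songs_csv)

# Save as .rds (keeps the data types), then read it back
write_rds(songs_clean, rds_path)
songs_rds <- read_rds(rds_path)

names(songs_rds)
```

The show_col_types = FALSE argument only quiets the column-type message that read_csv() prints; it can be left out.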

B.2 Data manipulation

  • select() to select columns. Example: select(col01, col02) to keep those columns or select(!c(col01, col02)) to remove them.
  • rename() to rename a column. Example: rename(new_name = old_name).
  • filter() to filter rows of data. Example: filter(column_name == "value").
  • distinct() will filter rows down to the unique values of the columns given.
  • arrange() sorts data based on values in a column. Use desc() to reverse the order. Example: arrange(col_name %>% desc())
  • mutate() changes an existing column or creates a new one. Example: mutate(new_col = (col01 / col02)).
  • round_half_up() from the janitor package is the rounding function we’ll typically use in journalism, often within a mutate() function. (The base R round() function rounds halves to the nearest even number, so 12.5 and 11.5 are both rounded to 12.)
  • if_else(), case_match() and case_when() are all functions that can be used with mutate() to create new categorizations with your data.
  • pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns. Example: pivot_longer(cols = 3:5, names_to = "new_key_col_name", values_to = "new_val_col_name") will take the third through the fifth columns and turn each value into a new row of data. It will put them into two columns: The first column will have the name you give it in names_to and contain the old column name that corresponds to each value pivoted. The second column will have the name of whatever you set in values_to and will contain all the values from each of the columns.
  • pivot_wider() is the opposite of pivot_longer(). Example: pivot_wider(names_from = col_of_key_values, values_from = col_with_values).
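
The manipulation functions above are usually chained together with the pipe. Here is a rough sketch using a made-up sales table; the table, its column names and the new column names are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)
library(janitor)

# Hypothetical data: units sold per store in two quarters
sales <- tibble(
  store = c("A", "B", "C"),
  region = c("North", "South", "North"),
  q1 = c(10, 25, 5),
  q2 = c(20, 25, 15)
)

# filter rows, select columns, create new columns, then sort
north <- sales %>%
  filter(region == "North") %>%
  select(store, q1, q2) %>%
  mutate(
    q2_share = round_half_up(q2 / (q1 + q2) * 100, 1),
    q1_size = if_else(q1 >= 10, "bigger", "smaller")
  ) %>%
  arrange(desc(q2_share))

# Base round() rounds halves to even; round_half_up() rounds them up
base_rounded <- round(12.5)
half_up_rounded <- round_half_up(12.5)

# Lengthen: the third and fourth columns become quarter/units pairs
sales_long <- sales %>%
  pivot_longer(cols = 3:4, names_to = "quarter", values_to = "units")

# Widen: turn the quarter/units pairs back into one column per quarter
sales_wide <- sales_long %>%
  pivot_wider(names_from = quarter, values_from = units)
```

In this sketch sales_long has one row per store and quarter, and sales_wide returns to the original one-row-per-store shape.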

B.3 Aggregation

  • group_by() and summarize() often come together. When you use group_by(), every function after it is broken down by that grouping. We often add arrange() to these, calling this our GSA functions. Example: group_by(song, artist) %>% summarize(weeks = n(), top_chart_position = min(peak_position)). To break or remove groupings, use ungroup().
  • count() is a shortcut for GSA that counts the number of rows based on the variable groups you feed it.
  • tabyl() from the janitor package is another shortcut for GSA, but it also calculates percentages.
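
To make the GSA pattern and its shortcuts concrete, here is a small sketch on a made-up chart table. The charts data is invented for illustration; the column names follow the group_by() example above.

```r
library(dplyr)
library(janitor)

# Hypothetical data: one row per week a song appeared on a chart
charts <- tibble(
  song = c("Hello", "Hello", "Hello", "Yesterday", "Yesterday"),
  artist = c("Adele", "Adele", "Adele", "The Beatles", "The Beatles"),
  peak_position = c(3, 1, 2, 5, 4)
)

# Group, summarize, arrange (GSA), then remove the grouping
gsa <- charts %>%
  group_by(song, artist) %>%
  summarize(weeks = n(), top_chart_position = min(peak_position)) %>%
  ungroup() %>%
  arrange(desc(weeks))

# count() gets the same row counts in one step, in a column named n
artist_counts <- charts %>% count(artist)

# tabyl() adds a percent column (stored as a proportion) next to n
artist_tabyl <- charts %>% tabyl(artist)
```

summarize() may print a message about how the result is grouped; that message is informational, and ungroup() clears whatever grouping remains.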

B.4 Math

These are the functions often used within summarize():

  • n() to count the number of rows. n_distinct() counts the unique values.
  • sum() to add things together.
  • mean() to get an average.
  • median() to get the median.
  • min() to get the smallest value. max() for the largest.
  • +, -, *, / are math operators similar to a calculator.
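
As an illustration, the math functions above can all sit inside a single summarize() call. The scores table and the result column names here are made up for the example.

```r
library(dplyr)

# Hypothetical data: points scored by two teams
scores <- tibble(
  team = c("a", "a", "b", "b", "b"),
  points = c(10, 20, 5, 5, 20)
)

stats <- scores %>%
  summarize(
    rows = n(),                         # number of rows
    teams = n_distinct(team),           # number of unique teams
    total = sum(points),                # all points added together
    average = mean(points),             # total divided by rows
    middle = median(points),            # middle value when sorted
    smallest = min(points),
    largest = max(points),
    spread = max(points) - min(points)  # math operators work here too
  )

stats
```

Adding group_by(team) before summarize() would produce the same statistics once per team instead of once for the whole table.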